CN114785567A

CN114785567A - Traffic identification method, device, equipment and medium

Info

Publication number: CN114785567A
Application number: CN202210348479.2A
Authority: CN
Inventors: 王萌; 刘文懋; 顾杜娟; 吴铁军; 赵光远
Original assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Current assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Priority date: 2022-04-01
Filing date: 2022-04-01
Publication date: 2022-07-22

Abstract

The application discloses a traffic identification method, apparatus, device and medium for accurately detecting the TLS-encrypted communication traffic of malware. Because a traffic identification model is trained in advance, any acquired traffic feature to be processed can be identified by the model to determine whether the corresponding TLS-encrypted traffic is malicious, that is, to obtain the identification result of the traffic feature to be processed. The TLS-encrypted communication traffic of malware is thereby detected accurately, and the efficiency of identifying TLS-encrypted traffic is improved.

Description

Traffic identification method, device, equipment and medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a method, an apparatus, a device, and a medium for traffic identification.

Background

In recent years, as awareness of network security and data protection has grown, encryption has spread rapidly across the internet. Transport Layer Security (TLS), the standard protocol for encrypting data packets, is now used by major websites to protect users' messages, transactions, and credentials. However, more and more malware also uses TLS encryption to hide its malicious communication, such as receiving instructions or sending sensitive data collected from infected hosts, in order to bypass traditional detection devices or detection platforms. Identifying the TLS-encrypted traffic of malware is therefore necessary.

Disclosure of Invention

The embodiment of the application provides a traffic identification method, a traffic identification device, traffic identification equipment and a traffic identification medium, which are used for accurately identifying TLS encrypted communication traffic of malicious software.

The embodiment of the application provides a traffic identification method, which comprises the following steps:

acquiring traffic features to be processed;

determining an identification result of the traffic features to be processed through a pre-trained traffic identification model, wherein the identification result comprises whether the Transport Layer Security (TLS) encrypted traffic corresponding to the traffic features to be processed is malicious traffic.

The embodiment of the application provides a traffic identification device, which includes:

an acquisition unit, configured to acquire traffic features to be processed;

a processing unit, configured to determine an identification result of the traffic features to be processed through a pre-trained traffic identification model, wherein the identification result comprises whether the Transport Layer Security (TLS) encrypted traffic corresponding to the traffic features to be processed is malicious traffic.

An embodiment of the present application provides an electronic device, where the electronic device at least includes a processor and a memory, and the processor is configured to implement the steps of the traffic identification method as described above when executing a computer program stored in the memory.

The embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the traffic identification method as described above.

An embodiment of the present application provides a computer program product, including: computer program code for causing a computer to perform the steps of the traffic identification method as described above, when said computer program code is run on a computer.

Because the traffic identification model is trained in advance, any acquired traffic feature to be processed can be identified by the model to determine whether the corresponding TLS-encrypted traffic is malicious, that is, to obtain the identification result of the traffic feature to be processed. The TLS-encrypted communication traffic of malware is thereby detected accurately, and the efficiency of identifying TLS-encrypted traffic is improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a traffic identification process provided in the present application;

Fig. 2 is a schematic flow chart of a specific traffic identification process provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a traffic identification device according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will now be described in further detail with reference to the accompanying drawings, wherein like reference numerals refer to like elements throughout. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method, or computer program product. Thus, the present application may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.

For convenience of understanding, some concepts related to the embodiments of the present application are explained below:

Five-tuple information: source IP address, source port, protocol, destination IP address, and destination port.

Bidirectional network flow: the set of data carried, within a certain period of time, by all network packets that share the same five-tuple information. The source and destination IP addresses, and the source and destination port numbers, may be interchanged, so that packets travelling in both directions are marked as one bidirectional network flow.

Client JA3 fingerprint: the MD5 hash of fields in the Client Hello message, such as the cipher suites and extensions.
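As an illustration of the JA3 definition above, the following is a minimal sketch rather than the applicant's implementation. It assumes the Client Hello has already been parsed into decimal field values; the function name `ja3_fingerprint` and its parameters are hypothetical. It follows the publicly documented JA3 layout (version, cipher suites, extensions, elliptic curves, point formats, joined with commas and hyphens) and, for brevity, does not filter GREASE values as a full JA3 implementation would.

```python
import hashlib


def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for already-parsed Client Hello fields.

    All arguments are decimal integers (or lists of them) taken from the
    Client Hello; parsing the handshake itself is outside this sketch.
    """
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    digest = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, digest


# Example with made-up field values (771 is the decimal form of 0x0303).
ja3_string, ja3_hash = ja3_fingerprint(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
print(ja3_string)
print(ja3_hash)
```

Two clients that offer the same values in the same order yield the same fingerprint, which is why fingerprint matching alone can be evaded by changing the handshake configuration.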

In the related art, TLS encrypted communication traffic may be identified by inspecting its payload or by matching its JA3 fingerprint. However, once the traffic is encrypted, it is difficult for these methods to accurately identify the TLS encrypted communication traffic of malware, so their identification accuracy is low. A method for identifying malicious traffic more accurately is therefore needed.

To solve the above problems, the present application provides a traffic identification method, apparatus, device, and medium. Because the traffic identification model is trained in advance, any acquired traffic feature to be processed can be identified by the model to determine whether the corresponding TLS-encrypted traffic is malicious, that is, to obtain the identification result of the traffic feature to be processed. The TLS-encrypted communication traffic of malware is thereby detected accurately, and the efficiency of identifying TLS-encrypted traffic is improved.

For ease of understanding, the application scenario of the present application has been introduced above. The scenarios described in the embodiments are intended to explain the technical solutions of the embodiments more clearly and do not limit the technical solutions provided in the embodiments of the present application.

Example 1:

Fig. 1 is a schematic diagram of a traffic identification process provided in the present application. The process includes:

S101: Acquire the traffic features to be processed.

The traffic identification method provided by the embodiment of the application is applied to electronic equipment (for convenience of description, referred to as traffic identification equipment), and the traffic identification equipment may be intelligent equipment such as a mobile terminal and a computer, or may be a server, such as an application server and a cloud server.

In an actual application scenario, after the electronic device performing traffic identification obtains the traffic features to be processed, it may process them with the traffic identification method provided by the application to determine whether the traffic corresponding to those features is malicious traffic.

The electronic device performing traffic identification acquires the traffic features to be processed in at least one of the following ways:

Case one: when a worker needs to perform traffic identification on at least one flow, the worker may input a service processing request to a smart device. After receiving the service processing request for the at least one flow, the smart device may send the traffic features to be processed corresponding to the at least one flow to the electronic device performing traffic identification.

The electronic device performing traffic identification may be the same as or different from the smart device.

Case two: the electronic device performing traffic identification may obtain the traffic features to be processed by running a preconfigured traffic collection tool, for example the network sniffer Wireshark.

In one example, acquiring the traffic features to be processed includes:

acquiring TLS encrypted traffic data packets;

aggregating the TLS encrypted traffic in the TLS encrypted traffic data packets in units of bidirectional network flows to obtain at least one bidirectional network flow;

for each of the at least one bidirectional network flow, performing feature extraction on the bidirectional network flow to acquire the feature data it contains, and determining the traffic features to be processed of the bidirectional network flow based on each item of feature data.

In the present application, the electronic device may obtain TLS encrypted traffic data packets. For example, it may obtain them by running a preconfigured traffic collection tool, receive them directly from another device, or obtain packets configured by a worker.

The electronic device then aggregates the TLS encrypted traffic in the acquired data packets in units of bidirectional network flows, obtaining at least one bidirectional network flow. For example, within a preset time period, TLS encrypted traffic packet 1 sent by device A to device B belongs to the same bidirectional network flow as TLS encrypted traffic packet 2 sent by device B to device A. Exchanging the source and destination IP addresses, and the source and destination port numbers, in the five-tuple information of packet 1 yields exchanged five-tuple information that is identical to the five-tuple information of packet 2; that is, the two packets carry the same five-tuple information up to the exchange of source and destination.

After the TLS encrypted traffic of any bidirectional network flow is obtained, feature extraction may be performed on the flow to obtain the feature data it contains. The feature data may be obtained through a feature extraction algorithm, taken from preset field positions in the bidirectional network flow, or obtained through a preset traffic analysis tool such as Cisco's Joy.

After the feature data of the bidirectional network flow is obtained, the traffic features to be processed of the flow may be determined based on each item of feature data. For example, the feature data may be ordered according to a preset arrangement rule to obtain a feature sequence of the bidirectional network flow, and the feature sequence is taken as the traffic features to be processed.
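The aggregation into bidirectional network flows can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as claimed: the `Packet` record and the function names are hypothetical, packets are assumed to be already parsed, and the preset time period is not modelled (all packets passed in are treated as falling within one window). The key idea is to order the two endpoints so that both directions map to the same key.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical parsed-packet record; field names are assumptions.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: float
    payload: bytes


def flow_key(pkt):
    """Direction-independent key: sort the two (ip, port) endpoints."""
    first, second = sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)])
    return (pkt.protocol, first, second)


def aggregate_flows(packets):
    """Group packets into bidirectional flows, each ordered by timestamp."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.timestamp)
    return dict(flows)


packets = [
    Packet("10.0.0.1", 50000, "10.0.0.2", 443, "TCP", 0.00, b"hello"),
    Packet("10.0.0.2", 443, "10.0.0.1", 50000, "TCP", 0.05, b"reply"),
    Packet("10.0.0.3", 50001, "10.0.0.2", 443, "TCP", 0.10, b"other"),
]
flows = aggregate_flows(packets)
print(len(flows))  # two flows: A<->B and C<->B
```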

In a possible implementation manner, acquiring the TLS encrypted traffic data packets includes:

acquiring a traffic data packet in a network;

and if the traffic data packet contains TLS protocol traffic, determining that the traffic data packet is a TLS encrypted traffic data packet.

In the present application, the electronic device may obtain traffic data packets from the network. For example, it may obtain them by running a preconfigured traffic collection tool, receive them directly from another device, or obtain packets configured by a worker. It then determines whether each traffic data packet is a TLS encrypted traffic data packet according to whether the packet contains TLS protocol traffic.

In an example, because TLS encrypted traffic generally includes a preset TLS field, the electronic device may parse an obtained traffic data packet and check whether the parsed packet includes the preset TLS field, thereby determining whether the packet contains TLS protocol traffic. If the parsed packet contains the preset TLS field, the packet is determined to contain TLS protocol traffic; otherwise, it is determined not to contain TLS protocol traffic. Illustratively, the electronic device may parse the obtained traffic data packet by running a traffic parsing tool, such as Cisco's Joy.
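The text does not say which preset TLS field is checked. As one possible illustration only, the sketch below assumes the check is made against the 5-byte TLS record header at the start of a TCP payload (content type, protocol version, length). The function name `looks_like_tls` is hypothetical, and a heuristic like this can produce false positives on non-TLS data.

```python
# TLS record content types: change_cipher_spec, alert, handshake, application_data.
TLS_CONTENT_TYPES = {20, 21, 22, 23}
# Upper bound used here for a record body: 2**14 bytes plus ciphertext expansion.
MAX_RECORD_LENGTH = 2**14 + 2048


def looks_like_tls(tcp_payload: bytes) -> bool:
    """Heuristically test whether a TCP payload starts with a TLS record header."""
    if len(tcp_payload) < 5:
        return False
    content_type = tcp_payload[0]
    major, minor = tcp_payload[1], tcp_payload[2]
    length = int.from_bytes(tcp_payload[3:5], "big")
    return (
        content_type in TLS_CONTENT_TYPES
        and major == 3          # SSL 3.0 and all TLS versions use major version 3
        and minor <= 4
        and length <= MAX_RECORD_LENGTH
    )


handshake_record = bytes([22, 3, 1, 0, 5]) + b"\x01\x00\x00\x01\x00"
print(looks_like_tls(handshake_record))
print(looks_like_tls(b"GET / HTTP/1.1\r\n"))
```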

In a possible implementation manner, before determining whether the obtained traffic data packets contain TLS protocol traffic, the electronic device may capture a large number of traffic data packets with Wireshark and store them in pcap format, which facilitates subsequent processing of the packets.

S102: Determine an identification result of the traffic features to be processed through a pre-trained traffic identification model, wherein the identification result comprises whether the Transport Layer Security (TLS) encrypted traffic corresponding to the traffic features to be processed is malicious traffic.

In the application, a traffic identification model is trained in advance, so that the acquired traffic characteristics to be processed can be identified through the traffic identification model.

In a specific implementation, after the traffic features to be processed are obtained based on the above embodiment, they may be input into the pre-trained traffic identification model. The model processes the input traffic features and outputs their identification result, which comprises whether the TLS encrypted traffic corresponding to the traffic features to be processed is malicious traffic.

In a practical application scenario, for a given piece of malware (denoted as source malware), an attacker may use various morphing techniques and automatic malicious code generation tools to generate a large number of encrypted malware variants of the same family type as the source malware in order to evade traffic detection. Identifying, during traffic identification, the malware family to which malicious traffic belongs therefore facilitates the subsequent response to an attack event and improves network security. Here, the malware family characterizes the family type of the malware, and the family type can be understood as a category of malware.

In the present application, when the pre-trained traffic identification model determines that the traffic corresponding to the traffic features to be processed is malicious, the model can also output the malware family to which that TLS encrypted traffic belongs. That is, the identification result of the traffic features to be processed also includes the malware family to which the corresponding TLS encrypted traffic belongs, which further improves network security and facilitates the subsequent response to an attack event.

Example 2:

To avoid detection, encrypted malware variants emerge endlessly, but the malicious traffic generated during code execution by the encrypted variants of the same source malware shows a certain similarity in many features, such as byte distribution features, TLS handshake features, and certificate features. Therefore, to identify traffic more accurately, on the basis of the above embodiment the present application makes full use of the differences between malicious and benign TLS encrypted traffic, and of the similarity among malicious traffic generated by malware of the same family type, to extract discriminative feature data from the acquired bidirectional network flows. The feature data comprises one or more of the following:

Firstly, meta-features help not only to distinguish malicious traffic from benign traffic but also to distinguish different malware families. For example, the entropy of malicious traffic may be higher than that of benign traffic, the duration of malicious traffic tends to differ from that of benign traffic, and the byte distribution differs between malicious and benign traffic and between different malware families. Therefore, meta-features may be included in the feature data. The meta-features comprise one or more of: the number of incoming packets, the number of outgoing packets, the number of incoming bytes, the number of outgoing bytes, the mean of the byte distribution, the variance of the byte distribution, the entropy, the total entropy, and the duration of the flow.
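A few of these meta-features can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, entropy is taken to be the Shannon entropy of the payload bytes, and total entropy is assumed to be that entropy multiplied by the number of payload bytes, which the text does not state. The mean and variance of the byte distribution are omitted for brevity.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy, in bits per byte, of a byte string."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def meta_features(inbound, outbound, start_time, end_time):
    """Compute a subset of the meta-features for one bidirectional flow.

    inbound / outbound are lists of payloads (bytes), one per packet.
    """
    all_bytes = b"".join(inbound) + b"".join(outbound)
    entropy = byte_entropy(all_bytes)
    return {
        "packets_in": len(inbound),
        "packets_out": len(outbound),
        "bytes_in": sum(len(p) for p in inbound),
        "bytes_out": sum(len(p) for p in outbound),
        "entropy": entropy,
        # Assumption: total entropy = entropy x number of payload bytes.
        "total_entropy": entropy * len(all_bytes),
        "duration": end_time - start_time,
    }


print(meta_features([b"ab"], [b"ab", b"ab"], 1.0, 3.5))
```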

Secondly, the packet length features of malicious traffic generally differ from those of benign traffic, and the packet length features of malicious traffic belonging to different family types also differ from one another. Therefore, in the present application, the packet length feature may be included in the feature data. The packet length feature may be characterized by a transition probability matrix of the payload lengths of the first 100 packets of the traffic.

For example, a 10 × 10 matrix is preset. Since the payload of a packet does not exceed 1500 bytes, every packet length is divided by 150 and discretized into the range 0 to 9. For any element of the matrix, its column represents the discretized length of the current packet and its row represents the discretized length of the preceding packet, and its value records how often a packet of the row's length is followed by a packet of the column's length. Each element is then divided by the sum of all elements in its row, so that the processed element represents a transition probability of the payload length.
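The construction of the packet length matrix can be sketched as follows. The function name and parameters are hypothetical, and two details are assumptions not stated in the text: a length of exactly 1500 bytes is clamped into bin 9, and a row with no observed transitions is left as all zeros.

```python
def length_transition_matrix(lengths, bin_size=150, n_bins=10, max_packets=100):
    """Row-normalized transition matrix over discretized payload lengths.

    Row = bin of the preceding packet, column = bin of the current packet.
    """
    # Discretize the first max_packets lengths; clamp 1500 // 150 == 10 into bin 9.
    bins = [min(length // bin_size, n_bins - 1) for length in lengths[:max_packets]]

    counts = [[0] * n_bins for _ in range(n_bins)]
    for prev_bin, cur_bin in zip(bins, bins[1:]):
        counts[prev_bin][cur_bin] += 1

    matrix = []
    for row in counts:
        row_sum = sum(row)
        matrix.append([c / row_sum if row_sum else 0.0 for c in row])
    return matrix


# Lengths 100, 200, 100, 1500 fall into bins 0, 1, 0, 9.
m = length_transition_matrix([100, 200, 100, 1500])
print(m[0][1], m[0][9], m[1][0])
```

Flattening the 10 × 10 matrix row by row gives a fixed-length vector of 100 values regardless of how many packets the flow contains.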

Thirdly, the packet interval features of malicious traffic generally differ from those of benign traffic, and the packet interval features of malicious traffic belonging to different family types also differ from one another. Therefore, in the present application, the packet interval feature may be included in the feature data. The packet interval feature may be characterized by a transition probability matrix of the time intervals between the first 100 packets of the traffic.

For example, a 10 × 10 matrix is preset, and the arrival time interval of every data packet is divided by 50 and discretized into the range 0 to 9. For any element of the matrix, its column represents the discretized arrival time interval of the current data packet and its row represents the discretized arrival time interval of the preceding data packet, and its value records how often an interval of the row's size is followed by an interval of the column's size. Each element is then divided by the sum of all elements in its row, so that the processed element represents a transition probability of the time interval.
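The packet interval matrix can be sketched in the same way. The text gives the divisor 50 without a unit; the sketch below assumes timestamps in milliseconds, so that each bin is 50 ms wide, and clamps any interval of 500 ms or more into bin 9. The function name and parameters are hypothetical.

```python
def interval_transition_matrix(timestamps_ms, bin_size=50, n_bins=10, max_packets=100):
    """Row-normalized transition matrix over discretized inter-arrival times.

    timestamps_ms: packet arrival times in milliseconds (unit is an assumption).
    Row = bin of the preceding interval, column = bin of the current interval.
    """
    ts = timestamps_ms[:max_packets]
    intervals = [later - earlier for earlier, later in zip(ts, ts[1:])]
    bins = [min(int(iv // bin_size), n_bins - 1) for iv in intervals]

    counts = [[0] * n_bins for _ in range(n_bins)]
    for prev_bin, cur_bin in zip(bins, bins[1:]):
        counts[prev_bin][cur_bin] += 1

    matrix = []
    for row in counts:
        row_sum = sum(row)
        matrix.append([c / row_sum if row_sum else 0.0 for c in row])
    return matrix


# Intervals 10, 60, 10, 920 ms fall into bins 0, 1, 0, 9.
m = interval_transition_matrix([0, 10, 70, 80, 1000])
print(m[0][1], m[0][9], m[1][0])
```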

Fourthly, the byte distribution probability features of malicious traffic generally differ from those of benign traffic, and the byte distribution probability features of malicious traffic belonging to different family types also differ from one another. Therefore, in the present application, the byte distribution probability feature may be included in the feature data. The byte distribution probability feature characterizes the probability that each particular byte value appears in the payloads of the traffic's data packets.

For example, since each byte in the payload of a packet takes a value from 0 to 255, a 256-dimensional row vector may be preset. Each time a byte value occurs in the data of the traffic's data packets, 1 is added to the position corresponding to that byte value in the row vector. Each element of the row vector is then divided by the sum of all elements of the row vector, so that the processed element represents the probability of occurrence of the corresponding byte value.
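The byte distribution vector described above can be sketched as follows; the function name is hypothetical, and returning an all-zero vector for a flow without payload bytes is an assumption.

```python
def byte_distribution(payloads):
    """Return a 256-element list of byte-value probabilities over all payloads."""
    counts = [0] * 256
    for payload in payloads:
        for byte_value in payload:  # iterating over bytes yields ints in 0..255
            counts[byte_value] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


dist = byte_distribution([b"\x00\x00", b"\xff\x01"])
print(dist[0], dist[1], dist[255])
```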

Fifthly, malware of different family types uses TLS differently from software in a normal environment, so TLS handshake features not only help distinguish malicious traffic from benign traffic but also help identify the malware family to which malicious traffic belongs. For example, the cipher suites offered by the client and the client key length may differ between malicious and benign traffic: benign traffic typically uses the newer parameters in the TLS library, whereas malicious traffic typically uses older and less secure TLS parameters. In addition, some malware authors give the TLS handshake a specific configuration and specific values when writing the malware, so that malicious traffic belonging to different malware families differs in these parameters. Therefore, the feature data may include TLS handshake features. The TLS handshake features include one or more of: the cipher suites offered by the client, the number of cipher suites offered by the client, the extensions offered by the client, the number of extensions offered by the client, the client key length, the cipher suite selected by the server, the extensions selected by the server, the number of extensions selected by the server, the elliptic curves supported in the offered extensions, the number of such elliptic curves, the elliptic curve point formats in the offered extensions, the number of such point formats, the TLS version offered by the client, and the TLS version selected by the server.

It should be noted that features such as counts and versions can be expressed directly as numerical values or numeric codes, while features such as the offered cipher suites can be expressed by one-hot encoding.
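The encoding of the offered cipher suites can be illustrated as follows. This is a sketch under assumptions: the vocabulary `KNOWN_SUITES` is a small hypothetical example rather than a list from the application, suites outside the vocabulary are simply ignored, and the count of offered suites is appended as a plain numerical feature. Because a client offers several suites at once, the resulting vector has one position set per offered suite.

```python
# Hypothetical vocabulary of cipher suite code points, chosen only for illustration.
KNOWN_SUITES = [
    0x1301,  # TLS_AES_128_GCM_SHA256
    0x1302,  # TLS_AES_256_GCM_SHA384
    0xC02F,  # TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    0x000A,  # TLS_RSA_WITH_3DES_EDE_CBC_SHA
    0x0005,  # TLS_RSA_WITH_RC4_128_SHA
]


def encode_offered_suites(offered, vocabulary=KNOWN_SUITES):
    """Encode offered cipher suites as 0/1 indicators plus the offered count."""
    position = {suite: i for i, suite in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for suite in offered:
        if suite in position:
            vector[position[suite]] = 1
    return vector + [len(offered)]


print(encode_offered_suites([0x1301, 0xC02F]))
print(encode_offered_suites([0x0005, 0x000A, 0x9999]))
```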

Sixthly, certificate features are very helpful for distinguishing malicious traffic from benign traffic and for identifying the malware family to which malicious traffic belongs. For example, the certificates used by much malware are self-signed, the number of SANs (Subject Alternative Names) in the certificates of malicious traffic differs from that in the certificates of benign traffic, and the number of SANs is distributed differently among the certificates of malicious traffic belonging to different malware families. In addition, the certificate validity period reflects differences in parameter preference between malicious and benign traffic, as well as between different malware families. Therefore, certificate features may be included in the feature data. The certificate features include one or more of: the number of days for which the certificate is valid, the number of SANs in the certificate, the number of certificates, and whether the certificate is self-signed.
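The four certificate features can be sketched from already-parsed certificate fields. The `CertInfo` record and function name are hypothetical, and certificate parsing is outside the sketch. Treating a certificate as self-signed when its issuer equals its subject is a simplifying assumption; a stricter check would also verify the signature against the certificate's own public key.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CertInfo:
    # Hypothetical container for fields parsed from an X.509 certificate.
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime
    san: list = field(default_factory=list)


def certificate_features(chain):
    """Derive certificate features from a chain whose first entry is the leaf."""
    if not chain:
        return {"validity_days": 0, "san_count": 0, "cert_count": 0, "self_signed": 0}
    leaf = chain[0]
    return {
        "validity_days": (leaf.not_after - leaf.not_before).days,
        "san_count": len(leaf.san),
        "cert_count": len(chain),
        # Simplification: issuer == subject is used as the self-signed test.
        "self_signed": int(leaf.subject == leaf.issuer),
    }


leaf = CertInfo("CN=example.test", "CN=example.test",
                datetime(2022, 1, 1), datetime(2023, 1, 1), ["example.test"])
print(certificate_features([leaf]))
```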

By combining feature data of these multiple dimensions, the characteristics of the traffic can be fully described, so that the differences between benign and malicious traffic and between different malware families are accurately characterized. The TLS handshake features and certificate features are highly discriminative and greatly benefit the traffic identification model, but they can be set in advance and are easy to change. The meta-features, packet length features, packet interval features and byte distribution probability features, by contrast, are relatively difficult for malware to control directly, which enhances the robustness of the traffic identification model.

Example 3:

To identify traffic accurately, on the basis of the above embodiments, the traffic identification model in the present application is obtained as follows:

obtaining any sample traffic feature, wherein the sample traffic feature corresponds to a first identification result;

acquiring a second identification result of the sample traffic feature through an original traffic identification model;

and training the original traffic identification model based on the first identification result and the second identification result to obtain a trained traffic identification model.

In the present application, to train the traffic identification model, sample traffic features are collected in advance and the model is trained on them. Each sample traffic feature corresponds to an identification result (denoted as the first identification result). The first identification result may include whether the TLS encrypted traffic corresponding to the sample traffic feature is malicious traffic. Optionally, when the TLS encrypted traffic corresponding to a sample traffic feature is malicious traffic, the first identification result may further include the malware family to which that TLS encrypted traffic belongs.

The electronic device used for training the traffic recognition model may be the same as or different from the electronic device used for traffic recognition.

It should be noted that the sample traffic features are acquired in a manner similar to the traffic features to be processed. Illustratively, the electronic device used for training the traffic identification model may capture traffic data packets in pcap format with Wireshark. Whether the traffic in a traffic data packet is malicious is then determined according to the source of the packet, which gives the first identification result of the sample traffic feature corresponding to that traffic.

For example, traffic data packets collected from legitimate software contain benign traffic, which may be labelled with the identifier "0"; traffic data packets collected while malware of particular concern runs in an enterprise-internal sandbox contain malicious traffic, which may be labelled with the identifier "1". After the traffic data packets and the labels of the traffic they contain are acquired, the TLS encrypted traffic data packets are determined from all acquired packets. The TLS encrypted traffic in these packets is then aggregated in units of bidirectional network flows to obtain at least one bidirectional network flow; for each bidirectional network flow, feature extraction is performed to obtain the feature data it contains; and the sample traffic features of the bidirectional network flow are determined based on each item of feature data.

Traffic data packets collected while malware of particular concern runs in the enterprise-internal sandbox contain a small amount of background traffic, for example traffic generated by the sandbox itself, and this background traffic can be filtered out. For example, collected traffic data packets whose IP addresses match whitelisted IP addresses are filtered out.
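The whitelist filtering of background traffic can be sketched as follows. The packet representation (a dictionary with `src_ip` and `dst_ip` keys) and the function name are hypothetical, and dropping a packet when either endpoint is whitelisted is an assumption, since the text does not say which address is compared.

```python
def filter_background(packets, whitelist_ips):
    """Drop packets whose source or destination IP is on the whitelist."""
    whitelist = set(whitelist_ips)
    return [
        pkt for pkt in packets
        if pkt["src_ip"] not in whitelist and pkt["dst_ip"] not in whitelist
    ]


captured = [
    {"src_ip": "192.168.56.10", "dst_ip": "203.0.113.5"},    # sample traffic
    {"src_ip": "192.168.56.10", "dst_ip": "192.168.56.1"},   # sandbox housekeeping
    {"src_ip": "192.168.56.1", "dst_ip": "192.168.56.10"},   # sandbox housekeeping
]
kept = filter_background(captured, ["192.168.56.1"])
print(len(kept))
```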

It should be noted that the traffic identification model does not process the five-tuple information of the traffic (for both the traffic features to be processed and the sample traffic features). The five-tuple information is used only to identify the traffic input to the traffic identification model and to represent a bidirectional network flow.

In a specific implementation, any sample traffic feature is acquired and input into the original traffic recognition model, which outputs a recognition result for that sample traffic feature (denoted the second recognition result). A loss value is then determined from the second recognition result and the corresponding first recognition result, and the original traffic recognition model is trained according to the loss value so as to adjust the value of each of its parameters.

The above operations are performed for each acquired sample traffic feature, and training of the traffic recognition model is complete when a preset convergence condition is met.

The convergence condition may be that the sum of the loss values corresponding to the sample traffic features in the current iteration is smaller than a preset loss threshold, or that the sum of the determined loss values keeps decreasing and has leveled off, or that the number of iterations used to train the original traffic recognition model has reached a set maximum number of iterations, and so on. The condition can be set flexibly in a specific implementation and is not particularly limited herein.
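The three convergence conditions above can be combined into a single check, as in the following sketch. The function name, the default thresholds, and the way "leveled off" is measured (several consecutive small, non-negative drops) are illustrative assumptions rather than values from the source.

```python
def has_converged(loss_history, loss_threshold=0.01, max_iterations=100,
                  patience=3, min_delta=1e-4):
    """Check the stopping conditions on per-iteration summed losses.

    loss_history[i] is the sum of the sample losses in iteration i.
    All default values are illustrative assumptions.
    """
    if not loss_history:
        return False
    # Condition 1: the summed loss of the current iteration is below the threshold.
    if loss_history[-1] < loss_threshold:
        return True
    # Condition 3: the maximum number of iterations has been reached.
    if len(loss_history) >= max_iterations:
        return True
    # Condition 2: the loss is still decreasing but has flattened out, i.e. each
    # of the last `patience` drops is non-negative and smaller than min_delta.
    if len(loss_history) > patience:
        recent = loss_history[-(patience + 1):]
        drops = [recent[i] - recent[i + 1] for i in range(patience)]
        return all(0 <= d < min_delta for d in drops)
    return False
```

A training loop would append the summed loss after each iteration and stop as soon as this function returns True.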

The traffic recognition model is generally trained offline: the electronic device that performs model training trains the original traffic recognition model in advance on the sample traffic features to obtain the trained traffic recognition model.

The traffic recognition model trained as described in this embodiment is stored in the electronic device for subsequent traffic recognition.

The traffic recognition model may be built on a gradient boosting framework (LightGBM), a recurrent convolutional network, a convolutional neural network, or the like. The choice can be made flexibly according to actual requirements in a specific implementation and is not specifically limited herein.

As a possible implementation, when training the traffic recognition model, the sample traffic features may be divided into training samples and test samples; the original traffic recognition model is trained on the training samples, and the reliability of the trained traffic recognition model is then verified on the test samples. Illustratively, the traffic recognition model is trained using ten-fold cross-validation.
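As a rough illustration of ten-fold cross-validation, the sketch below splits sample indices into ten folds and uses each fold once as the test set. The function name, the shuffling, and the seed are assumptions; the source does not describe how the folds are formed.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx
```

Each pair would drive one round of training on `train_idx` and evaluation on `test_idx`, with the evaluation indexes averaged over the ten rounds.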

Illustratively, the evaluation indexes of the traffic recognition model are as follows:

(1) Precision: the proportion of sample traffic features correctly predicted as malicious traffic among all sample traffic features predicted as malicious traffic.

(2) Recall: the proportion of sample traffic features correctly predicted as malicious traffic among all sample traffic features that are actually malicious traffic.

(3) F1 value: 2 × precision × recall / (precision + recall).
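The three evaluation indexes can be computed from true and predicted labels as in this minimal sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Zero denominators map to 0.0 (an assumed convention for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For per-family results, the same function could be called once per family label with `positive` set to that label.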

The most recent portion of the data was held out from the collected benign traffic and from the malicious traffic generated by malware of 4 family types running in the enterprise's internal sandbox. This held-out data, 2.24 GB in total, was used to test and evaluate the traffic recognition model. The overall precision is 0.984 and the overall recall is 0.944; the per-class evaluation results are shown in the following table.

Type of data	Precision	Recall	F1 value
Benign traffic	1.0	1.0	1.0
Family type 1	0.974	0.968	0.971
Family type 2	1.0	0.976	0.988
Family type 3	1.0	0.8	0.889
Family type 4	0.944	0.978	0.961
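The source does not state how the overall figures were obtained. As a consistency check, the sketch below takes the unweighted mean of the first two numeric columns of the table over the five classes, which reproduces the reported 0.984 and 0.944, and recomputes one F1 value from its row. The unweighted (macro) averaging is an inference from the numbers, not a statement from the source.

```python
# (first numeric column, recall) per class, copied from the table above.
results = {
    "Benign traffic": (1.0, 1.0),
    "Family type 1": (0.974, 0.968),
    "Family type 2": (1.0, 0.976),
    "Family type 3": (1.0, 0.8),
    "Family type 4": (0.944, 0.978),
}


def f1_score(p, r):
    """Harmonic mean of the two per-class values."""
    return 2 * p * r / (p + r)


# Unweighted mean over the five classes (assumed averaging scheme).
macro_first = sum(p for p, _ in results.values()) / len(results)
macro_recall = sum(r for _, r in results.values()) / len(results)

print(round(macro_first, 3), round(macro_recall, 3))
print({name: round(f1_score(p, r), 3) for name, (p, r) in results.items()})
```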

In particular, compared with the training samples, the test samples additionally include malicious traffic from encrypted malware variants of the family types. The results show that the traffic recognition model still recognizes the malicious traffic generated by these variants well.

Example 4:

The traffic identification method provided by the present application is described in detail below through a specific embodiment. Fig. 2 is a schematic diagram of a specific traffic identification process provided by an embodiment of the present application. The process includes:

S201: Collect sample traffic data packets.

S202: Preprocess the sample traffic data packets.

Preprocessing the sample traffic data packets includes: storing the acquired sample traffic data packets in pcap format; determining, according to the source of each sample traffic data packet, the first identification result of the sample traffic feature corresponding to the traffic in that packet; and filtering out background traffic from the collected sample traffic data packets.

The specific implementation of this preprocessing has been described in the above embodiments and is not repeated here.

S203: Acquire sample traffic features based on the preprocessed sample traffic data packets.

Illustratively, TLS encrypted traffic data packets are obtained from all acquired sample traffic data packets, and the TLS encrypted traffic in each TLS encrypted traffic data packet is then aggregated in units of bidirectional network flows to obtain at least one bidirectional network flow. For each of the at least one bidirectional network flow, feature extraction is performed to obtain the feature data contained in that flow, and the sample traffic feature of the flow is determined based on each item of feature data.

S204: Train the original traffic identification model based on the acquired sample traffic features to obtain the trained traffic identification model.

The trained traffic identification model is obtained as described in the above embodiments and is then stored.

S205: Acquire traffic data packets in the network.

S206: Preprocess the acquired traffic data packets.

Illustratively, preprocessing the acquired traffic data packets includes: obtaining the TLS encrypted traffic data packets (denoted target data packets) from the acquired traffic data packets; and aggregating the TLS encrypted traffic in each target data packet in units of bidirectional network flows to obtain at least one bidirectional network flow.
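The source does not say how a packet is recognized as carrying TLS traffic. One plausible heuristic, sketched below as an assumption, inspects the 5-byte TLS record header at the start of a TCP payload: a content type of 20 to 23, a major version of 3 with a minor version of at most 4, and a record length within the protocol limit.

```python
# change_cipher_spec, alert, handshake, application_data
TLS_CONTENT_TYPES = {20, 21, 22, 23}


def looks_like_tls_record(payload: bytes) -> bool:
    """Heuristically test whether a TCP payload starts with a TLS record header."""
    if len(payload) < 5:
        return False
    content_type, major, minor = payload[0], payload[1], payload[2]
    length = int.from_bytes(payload[3:5], "big")
    return (
        content_type in TLS_CONTENT_TYPES
        and major == 3
        and minor <= 4
        # Upper bound on an encrypted record's length in TLS 1.2.
        and length <= 2**14 + 2048
    )
```

This check only looks at the first record of a payload, so it may misclassify payloads that begin mid-record; a protocol dissector would be more reliable where one is available.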

S207: Perform feature extraction on the preprocessed traffic data to obtain the traffic features to be processed.

Illustratively, obtaining the traffic features to be processed includes: for each of the at least one bidirectional network flow obtained in S206, performing feature extraction on the bidirectional network flow to obtain the feature data contained in that flow; and determining the traffic feature to be processed of the bidirectional network flow based on each item of acquired feature data.
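As an illustration of feature extraction on one bidirectional network flow, the sketch below derives a few packet length, packet interval, and byte distribution values. The input layout (a list of timestamp and payload pairs) and the specific statistics are assumptions; the source names the feature categories but not how each is computed.

```python
from statistics import mean


def extract_features(flow):
    """Compute simple features for one flow.

    flow: list of (timestamp_seconds, payload_bytes) tuples in arrival order.
    """
    lengths = [len(payload) for _, payload in flow]
    times = [ts for ts, _ in flow]
    intervals = [b - a for a, b in zip(times, times[1:])]

    # Byte distribution: relative frequency of each byte value 0..255.
    counts = [0] * 256
    for _, payload in flow:
        for byte in payload:
            counts[byte] += 1
    total = sum(counts)
    byte_dist = [c / total for c in counts] if total else [0.0] * 256

    return {
        "packet_count": len(flow),
        "mean_length": mean(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
        "mean_interval": mean(intervals) if intervals else 0.0,
        "byte_distribution": byte_dist,
    }
```

The resulting dictionary could be flattened into a fixed-length vector before being passed to the model; TLS handshake and certificate features would need a protocol parser and are left out here.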

S208: Determine the recognition result of the traffic features to be processed through a pre-trained traffic recognition model.

The identification result indicates whether the TLS encrypted traffic corresponding to the traffic feature to be processed is malicious traffic and, when it is malicious traffic, the malware family to which that TLS encrypted traffic belongs.
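A result that carries both the malicious flag and the malware family can be represented as in the sketch below. The label scheme (0 for benign, 1 to 4 for families) and all names are hypothetical; the source specifies only what information the identification result contains.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label scheme: 0 is benign, 1..4 index malware families.
FAMILY_NAMES = {1: "Family type 1", 2: "Family type 2",
                3: "Family type 3", 4: "Family type 4"}


@dataclass
class RecognitionResult:
    is_malicious: bool
    family: Optional[str] = None  # set only when is_malicious is True


def to_result(label: int) -> RecognitionResult:
    """Map a predicted class label to a recognition result."""
    if label == 0:
        return RecognitionResult(False)
    return RecognitionResult(True, FAMILY_NAMES.get(label, "unknown"))
```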

It should be noted that the specific implementation process of the above steps has been described in the above embodiments, and repeated parts are not described again.

Example 5:

The present application provides a traffic identification device. Fig. 3 is a schematic structural diagram of the traffic identification device provided in the present application. The device includes:

an obtaining unit 31, configured to obtain a traffic feature to be processed;

a processing unit 32, configured to determine, through a pre-trained traffic recognition model, a recognition result of the traffic feature to be processed, wherein the recognition result includes whether the Transport Layer Security (TLS) encrypted traffic corresponding to the traffic feature to be processed is malicious traffic.

In some possible embodiments, the obtaining unit 31 is specifically configured to obtain a TLS encrypted traffic packet; aggregating the TLS encrypted traffic in each TLS encrypted traffic data packet by taking a bidirectional network flow as a unit to obtain at least one bidirectional network flow; performing feature extraction on the bidirectional network flow aiming at the at least one bidirectional network flow to acquire feature data contained in the bidirectional network flow; and determining the to-be-processed traffic characteristics of the bidirectional network flow based on each characteristic data.

In some possible embodiments, the obtaining unit 31 is specifically configured to obtain a traffic data packet in a network and, if the traffic data packet contains TLS protocol traffic, determine that the traffic data packet is a TLS encrypted traffic data packet.

In some possible embodiments, the apparatus further comprises a training unit;

the training unit is used for obtaining the traffic recognition model in the following way:

training the original traffic recognition model based on the first recognition result and the second recognition result to obtain a trained traffic recognition model.

Example 6:

On the basis of the foregoing embodiments, an embodiment of the present application further provides an electronic device. Fig. 4 is a schematic structural diagram of the electronic device, which includes a processor 41, a communication interface 42, a memory 43 and a communication bus 44, wherein the processor 41, the communication interface 42 and the memory 43 communicate with one another through the communication bus 44;

the memory 43 has stored therein a computer program which, when executed by the processor 41, causes the processor 41 to perform the steps of:

acquiring a traffic feature to be processed;

determining the recognition result of the traffic feature to be processed through a pre-trained traffic recognition model, wherein the recognition result includes whether the Transport Layer Security (TLS) encrypted traffic corresponding to the traffic feature to be processed is malicious traffic.

Because the principle of solving the problem of the electronic device is similar to that of the traffic identification method, the implementation of the electronic device may refer to the implementation of the method, and repeated parts are not described again.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.

The communication interface 42 is used for communication between the above-described electronic apparatus and other apparatuses.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.

Due to the fact that the traffic identification model is trained in advance, any acquired traffic feature to be processed can be identified through the traffic identification model, whether the TLS encrypted traffic corresponding to the traffic feature to be processed is malicious traffic or not is determined, namely the identification result of the traffic feature to be processed is acquired, the TLS encrypted communication traffic of malicious software is accurately detected, and the TLS encrypted traffic identification efficiency is improved.

Example 7:

On the basis of the foregoing embodiments, the present application further provides a computer-readable storage medium storing a computer program executable by a processor; when the program runs on the processor, the processor is caused to execute the following steps:

acquiring flow characteristics to be processed;

Because the principle of the above traffic identification device for solving the problem is similar to the traffic identification method, the implementation of the above traffic identification device may refer to the implementation of the method, and repeated details are not repeated.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A traffic identification method, characterized in that the method comprises:

acquiring flow characteristics to be processed;

2. The method of claim 1, wherein the obtaining the pending flow characteristics comprises:

acquiring a TLS encrypted flow data packet;

aggregating the TLS encrypted traffic in each TLS encrypted traffic data packet by taking bidirectional network flow as a unit to obtain at least one bidirectional network flow;

for the at least one bidirectional network flow, extracting the characteristics of the bidirectional network flow to obtain characteristic data contained in the bidirectional network flow; and determining the to-be-processed flow characteristics of the bidirectional network flow based on each characteristic data.

3. The method as claimed in claim 2, wherein said obtaining TLS encrypted traffic packets comprises:

acquiring a flow data packet in a network;

4. The method of claim 2, wherein the feature data comprises one or more of: meta-feature, packet length feature, packet interval feature, byte distribution probability feature, TLS handshake feature, certificate feature.

5. The method according to claim 1, wherein the identification result further includes a malware family to which TLS encrypted traffic belongs when TLS encrypted traffic corresponding to the to-be-processed traffic feature is the malicious traffic; wherein the malware family characterizes a family type of malware.

6. The method according to claim 1 or 5, characterized in that the traffic recognition model is obtained by:

7. A flow identification device, the device comprising:

the acquiring unit is used for acquiring the flow characteristics to be processed;

8. An electronic device, characterized in that the electronic device comprises at least a processor and a memory, the processor being configured to implement the steps of the traffic identification method according to any of claims 1-6 when executing a computer program stored in the memory.

9. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the traffic identification method according to any one of claims 1 to 6.

10. A computer program product, the computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the steps of the traffic identification method as claimed in any of the claims 1-6.