CN109495513B

CN109495513B - Unsupervised encrypted malicious traffic detection method, unsupervised encrypted malicious traffic detection device, unsupervised encrypted malicious traffic detection equipment and unsupervised encrypted malicious traffic detection medium

Info

Publication number: CN109495513B
Application number: CN201811635919.2A
Authority: CN
Inventors: 江斌
Original assignee: Jike Xin'an Beijing Technology Co ltd
Current assignee: Geek Xin'an (Chengdu) Technology Co.,Ltd.
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2021-06-01
Anticipated expiration: 2038-12-29
Also published as: CN109495513A

Abstract

The embodiments of the disclosure provide an unsupervised encrypted malicious traffic detection method, device, equipment and medium. The method comprises the following steps: acquiring a required data feature set based on network traffic; establishing a bipartite graph between clients and servers by using the acquired data feature set; performing primary clustering on the client and server nodes by a graph segmentation method; vectorizing the client and server nodes in the larger connected subgraphs of the primary clustering; re-clustering the vectorized data by using the DBScan algorithm; and judging malicious traffic and nodes by using the result of the re-clustering. The method uses a graph-based unsupervised learning model and can detect encrypted traffic directly, without prior knowledge and without a labeled sample set. Segmenting the graph yields families of different types, large families are split into small ones, and malicious traffic is then detected and identified from the traffic features of each family. The method is simple and easy to operate.

Description

Unsupervised encrypted malicious traffic detection method, unsupervised encrypted malicious traffic detection device, unsupervised encrypted malicious traffic detection equipment and unsupervised encrypted malicious traffic detection medium

Technical Field

The disclosure relates to the technical field of traffic data detection, in particular to an unsupervised encrypted malicious traffic detection method and device, electronic equipment and a storage medium.

Background

Network communication is an information application used by almost all businesses and individuals today. As enterprises and individual users attach increasing importance to information security, encryption is used in more and more network communication scenarios. That is, encryption ensures that the communication content cannot be read by anyone on the network other than the two communicating parties.

Meanwhile, when various malicious programs such as network trojans, worms and the like communicate with the control end, in order to avoid the identification of network detection equipment, encrypted flow communication is often adopted. This causes the problem that normal encrypted traffic and malicious encrypted traffic cannot be distinguished, and brings great challenges to network security detection.

At present, encrypted malicious traffic is mainly detected by supervised machine learning. A detection model is trained on malicious encrypted traffic and normal encrypted traffic, and the model is then used to judge whether a given encrypted flow is malicious.

The main problems of the existing scheme are as follows:

(1) model training depends on a large number of black samples, and the detection model obtained by training is likely to be inaccurate due to insufficient number of samples;

(2) depending on expert knowledge analysis and flow characteristic extraction, if the expert experience is unreliable, the final classification result may have a larger problem;

(3) the detection capability for new attack samples is poor due to the fact that the method is based on previous experience knowledge;

(4) the attacker can easily bypass the detection according to the feature set, namely once the attacker finds the feature set used for detection, the features can be avoided through certain technical means.

Therefore, how to effectively separate malicious traffic has become a technical problem to be solved urgently.

Disclosure of Invention

The present disclosure is directed to an unsupervised encrypted malicious traffic detection method, an unsupervised encrypted malicious traffic detection device, an electronic device, and a storage medium, which can quickly detect malicious encrypted traffic in traffic information.

In a first aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection method, including the following steps:

step S101: acquiring a required data feature set based on network flow;

step S102, establishing a bipartite graph between a client and a server by utilizing the collected data feature set;

step S103, performing primary clustering on the client and the server node by a graph segmentation method;

step S104, vectorizing the client and the server node of the larger connected subgraph in the primary clustering;

step S105, re-clustering the vectorized data by using a DBScan algorithm;

step S106, judging malicious traffic and nodes by using the clustering result after re-clustering.

Optionally, the data feature set includes:

a client encryption suite, a TLS extension supported by the client, and a server certificate.

Optionally, the establishing a bipartite graph between the client and the server by using the collected data feature set includes:

randomly selecting any client node, connecting the client node with its associated server node, and forming an edge between the client node and the server node;

traversing all client nodes and server nodes, wherein all the client nodes and the server nodes in the data feature set form corresponding connection relations;

and establishing a bipartite graph by using the connection relation formed by the client node and the server node.

Optionally, the performing primary clustering on the client or the server node by using the graph cut method includes:

performing sub-graph clustering on the bipartite graph;

and assigning subgraphs that have no association with one another to different clusters, thereby carrying out primary clustering.

Optionally, performing vectorization processing on the client and the server node of the larger connected subgraph in the primary clustering includes:

selecting the connected subgraphs with more nodes in the primary clustering;

starting from any node in the subgraph, repeatedly selecting a connected node at random as the next node, to form a sequence of length t;

and for each node in the sequence, using the skip-gram method to learn the feature representation of the node from the nodes around it, thereby reducing the representation of each node from a high-dimensional OneHot encoding to a node feature vector.

Optionally, the DBScan algorithm includes:

calculating the distance between each node;

determining the node similarity based on the distance;

nodes with similarity are grouped into one class.

Optionally, the determining malicious traffic and nodes by using the clustering result after re-clustering includes:

judging malicious traffic by using the characteristics of the server nodes and/or the client nodes in each cluster after re-clustering;

if the characteristics of most of the server nodes and/or client nodes in a cluster are irregular, judging the cluster to be a malicious cluster;

and recovering the correspondence between all the client nodes and server nodes in the malicious cluster, the recovered correspondence being the malicious traffic to be detected.

In a second aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection apparatus, including:

the data acquisition unit is used for acquiring a required data feature set based on network flow;

the construction unit is used for establishing a bipartite graph between the client and the server by utilizing the acquired data feature set;

the primary clustering unit is used for carrying out primary clustering on the client and the server node by a graph segmentation method;

the vectorization unit is used for vectorizing the client and the server node of the larger connected subgraph in the primary clustering;

a secondary clustering unit for re-clustering the vectorized data by using a DBScan algorithm;

and the judging unit is used for judging the malicious flow and the node by using the clustering result after the clustering is performed again.

In a third aspect, the present disclosure provides an electronic device comprising a processor and a memory, wherein the memory stores computer program instructions executable by the processor, and the processor implements the method steps of any one of the first aspect when executing the computer program instructions.

In a fourth aspect, the present disclosure provides a computer readable storage medium storing computer program instructions which, when invoked and executed by a processor, implement the method steps of any of the first aspects.

Compared with the prior art, the beneficial effects of the embodiment of the disclosure are that:

the method uses a graph-based unsupervised learning model and can detect encrypted traffic directly, without prior knowledge and without a labeled sample set; segmenting the graph yields families of different types, large families are split into small ones, and malicious traffic is then detected and identified from the traffic features.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained according to the drawings without creative efforts for those skilled in the art.

Fig. 1 is a schematic flow chart of an unsupervised encryption malicious traffic detection method according to an embodiment of the present disclosure;

fig. 2 is a schematic diagram of a bipartite graph of an unsupervised encryption malicious traffic detection method according to an embodiment of the present disclosure;

fig. 3 is a schematic structural diagram of an unsupervised encrypted malicious traffic detection apparatus according to an embodiment of the present disclosure;

fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in the presently disclosed embodiments and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two, but does not exclude the presence of at least one.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

It should be understood that although the terms first, second, third, etc. may be used to describe technical names in embodiments of the present disclosure, the technical names should not be limited to the terms. These terms are only used to distinguish between technical names. For example, a first check signature may also be referred to as a second check signature, and similarly, a second check signature may also be referred to as a first check signature, without departing from the scope of embodiments of the present disclosure.

The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.

In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.

Referring to fig. 1, in a first aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection method, including the following steps:

step S101: acquiring a required data feature set based on network flow;

optionally, the data feature set includes:

As shown in fig. 2, the client metadata features extracted from the traffic are drawn as nodes on the left side of the graph, and the certificate metadata features are drawn as nodes on the right side. Any single flow can then be represented as an edge from the node formed by its client features to its certificate node. There are no edges between client metadata nodes and no edges between digital certificate nodes, so the nodes fall into two types and form a bipartite graph.
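The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the disclosure: the function name `build_bipartite_graph`, the "c:" and "s:" prefixes, and the sample flow values are assumptions introduced here, and each flow is assumed to have already been reduced to a (client fingerprint, certificate identifier) pair.

```python
from collections import defaultdict


def build_bipartite_graph(flows):
    """Build adjacency sets for a client/certificate bipartite graph.

    Each flow is a (client_fingerprint, certificate_id) pair. The prefixes
    "c:" and "s:" (an assumption of this sketch) keep the two node types in
    separate namespaces so a client can never collide with a certificate.
    """
    adjacency = defaultdict(set)
    for client_fp, cert_id in flows:
        client_node = "c:" + client_fp
        server_node = "s:" + cert_id
        # One flow contributes one undirected edge between the two sides.
        adjacency[client_node].add(server_node)
        adjacency[server_node].add(client_node)
    return dict(adjacency)


# Hypothetical flows: (cipher-suite/extension fingerprint, certificate hash).
flows = [
    ("fp_a", "cert_1"),
    ("fp_a", "cert_2"),
    ("fp_b", "cert_2"),
    ("fp_c", "cert_3"),
]
graph = build_bipartite_graph(flows)
```

Because edges are only ever added between a "c:" node and an "s:" node, the resulting structure is bipartite by construction.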

Specifically, optionally, the establishing a bipartite graph between the client and the server by using the collected data feature set includes:

The graph segmentation method relies on the fact that the bipartite graph formed in step S102 contains multiple subgraphs with no edges between them and no nodes in common. Each such subgraph forms a discrete family, so the nodes can be divided into different groups on this basis, which completes the primary clustering.

Specifically, the primary clustering of the client or the server node by the graph cut method includes:

performing sub-graph clustering on the bipartite graph;
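One common way to find subgraphs that share no edges is a breadth-first search for connected components. The sketch below assumes that interpretation of the graph segmentation step; the function name `connected_components` and the sample adjacency data are hypothetical.

```python
from collections import deque


def connected_components(adjacency):
    """Split a graph (node -> set of neighbours) into connected components.

    Components are returned largest first, so the "larger connected
    subgraphs" can be taken from the front of the list.
    """
    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    clusters.sort(key=len, reverse=True)
    return clusters


# Hypothetical bipartite graph with two disconnected families.
adjacency = {
    "c:fp_a": {"s:cert_1", "s:cert_2"},
    "c:fp_b": {"s:cert_2"},
    "c:fp_c": {"s:cert_3"},
    "s:cert_1": {"c:fp_a"},
    "s:cert_2": {"c:fp_a", "c:fp_b"},
    "s:cert_3": {"c:fp_c"},
}
clusters = connected_components(adjacency)
```

Here the four-node family around `fp_a` and `fp_b` would be a candidate for the later vectorization step, while the two-node family is already small.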

Step S104, vectorizing the client and the server node in the larger connected subgraph in the primary clustering;

selecting the connected subgraphs with more nodes in the primary clustering;

Specifically, the vectorization process requires two specific steps:

the first step is to establish a node sequence by using random walk, and the specific method is as follows:

starting from any node in the graph, a connected node is selected at random as the next node, and this is repeated until the sequence reaches the defined length t; because the graph is bipartite, client nodes and server nodes alternate in the resulting sequence;

each node in the graph is used as the starting node for the above procedure, so if there are Q nodes, Q sequences of length t are generated.
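The random-walk step can be sketched as follows. The function name `random_walks`, the fixed seed, and the example subgraph are assumptions made for illustration; the text above only specifies uniform random selection of the next connected node and one walk of length t per starting node.

```python
import random


def random_walks(adjacency, walk_length, seed=0):
    """Generate one random walk of `walk_length` nodes from every node."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    walks = []
    for start in sorted(adjacency):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = sorted(adjacency[walk[-1]])
            if not neighbors:  # isolated node: the walk cannot continue
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


# Hypothetical connected subgraph taken from the primary clustering.
adjacency = {
    "c:fp_a": {"s:cert_1", "s:cert_2"},
    "c:fp_b": {"s:cert_2"},
    "s:cert_1": {"c:fp_a"},
    "s:cert_2": {"c:fp_a", "c:fp_b"},
}
walks = random_walks(adjacency, walk_length=5)
```

With Q = 4 nodes and t = 5 this yields four sequences of five nodes each, and every sequence alternates between "c:" and "s:" nodes.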

The second step is to obtain the feature vector of each node by using a skip-gram method for the nodes in the sequences, wherein the specific method is as follows:

The input is a series of nodes, each represented by OneHot encoding: for a node whose index is n, the initial vector is (0, 0, ..., 0, 1, 0, ..., 0, 0), i.e. the n-th position is 1 and all other positions are 0. The output is a lower-dimensional feature vector P for the node, whose length p is typically much smaller than the length of the OneHot vector.

The OneHot feature is reduced to P as follows. For each node X, its contexts Y (1 to k) in the different sequences are collected. For each (X, Y) pair, a neural network is trained with the back-propagation algorithm: the input is the OneHot vector of X, the training label is the OneHot vector of Y, the hidden layer has p units, and the trained parameters form a matrix W of size m × p (m is the total number of nodes). Since only one position in X has the value 1, only one row of W is updated during back propagation, and that row is the low-dimensional feature P of X obtained by training. A low-dimensional feature vector P can be obtained for every X in this way.
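A minimal skip-gram sketch in plain Python is given below. It assumes a full softmax output layer, a symmetric context window, and plain stochastic gradient descent; the function name `train_skipgram` and the parameters `window`, `epochs`, and `lr` are hypothetical choices that the text above does not specify. Multiplying a OneHot input by W simply selects one row of W, so the code indexes that row directly.

```python
import math
import random


def train_skipgram(walks, dim, window=2, epochs=50, lr=0.05, seed=0):
    """Learn a `dim`-dimensional vector per node from random-walk sequences."""
    rng = random.Random(seed)
    nodes = sorted({n for walk in walks for n in walk})
    index = {n: i for i, n in enumerate(nodes)}
    m = len(nodes)
    # W is the m x p input matrix; row i is the embedding of node i.
    W = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(m)]
    # out holds the output-layer weights, one row per candidate context node.
    out = [[0.0] * dim for _ in range(m)]
    # Collect (X, Y) pairs: each node with its neighbours inside the window.
    pairs = []
    for walk in walks:
        for i, x in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    pairs.append((index[x], index[walk[j]]))
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x, y in pairs:
            h = W[x]  # hidden layer = selected row of W
            scores = [sum(h[d] * out[j][d] for d in range(dim)) for j in range(m)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # Softmax probability minus the OneHot label of Y.
            errors = [e / total for e in exps]
            errors[y] -= 1.0
            grad_h = [sum(errors[j] * out[j][d] for j in range(m)) for d in range(dim)]
            for j in range(m):
                for d in range(dim):
                    out[j][d] -= lr * errors[j] * h[d]
            # Only row x of W changes for this training pair.
            for d in range(dim):
                h[d] -= lr * grad_h[d]
    return {n: W[index[n]] for n in nodes}


# Hypothetical walks over a small client/certificate subgraph.
walks = [
    ["c:fp_a", "s:cert_1", "c:fp_a", "s:cert_2"],
    ["c:fp_b", "s:cert_2", "c:fp_a", "s:cert_1"],
    ["s:cert_2", "c:fp_b", "s:cert_2", "c:fp_a"],
]
vectors = train_skipgram(walks, dim=2)
```

The full softmax costs O(m) per training pair, which is acceptable for a sketch but would likely be replaced by negative sampling or hierarchical softmax on large subgraphs.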

The basic idea of the DBScan algorithm is to calculate the distances between nodes, determine node similarity from those distances, and group similar nodes into one class, thereby completing the data clustering.

Optionally, the DBScan algorithm includes:

calculating the distance between each node;

determining the node similarity based on the distance;

nodes with similarity are grouped into one class.
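The three steps above describe the algorithm only loosely. The sketch below follows the standard density-based formulation of DBSCAN, in which two parameters, here called `eps` (neighbourhood radius) and `min_samples` (minimum neighbourhood size, counting the point itself), decide what counts as "similar". Those parameter names, the Euclidean distance, and the sample points are assumptions of this sketch rather than details taken from the disclosure.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def region(i):
        # All points (including i itself) within distance eps of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical 2-D node vectors: two dense groups and one outlier.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
labels = dbscan(points, eps=0.5, min_samples=2)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, this approach does not need the number of clusters in advance, which fits an unsupervised setting where the number of traffic families is unknown.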

After clustering, each cluster contains server certificate nodes and client encryption suite fingerprint nodes. The domain names in the CN field and SAN field of a certificate node can be used to determine whether the server holding the certificate is a regular website. If most of the certificates in a cluster are not certificates of regular websites, the cluster can be considered a malicious cluster. Recovering the correspondence between all the clients and servers in that cluster yields the malicious traffic to be detected.
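The judgment step can be sketched as a majority vote over the certificates in each cluster. The disclosure does not say how a domain name is recognised as belonging to a regular website, so the `is_regular` predicate below is a placeholder allowlist lookup; it, the function name `flag_malicious_flows`, the reading of "most" as a strict majority, and all sample data are assumptions of this sketch.

```python
def flag_malicious_flows(clusters, cert_domains, flows, is_regular):
    """Return the flows that belong to clusters judged malicious.

    clusters     : list of sets of node names (clients and certificates)
    cert_domains : certificate node -> domain names from its CN/SAN fields
    flows        : list of (client_node, certificate_node) pairs
    is_regular   : callable deciding whether a domain is a regular website
    """
    malicious = []
    for cluster in clusters:
        certs = [n for n in cluster if n in cert_domains]
        if not certs:
            continue
        # A certificate is irregular when none of its domains looks regular.
        irregular = sum(
            1 for c in certs if not any(is_regular(d) for d in cert_domains[c])
        )
        if irregular * 2 > len(certs):  # strict majority (assumed threshold)
            # Recover the client/server correspondences inside the cluster.
            malicious.extend(
                (c, s) for c, s in flows if c in cluster and s in cluster
            )
    return malicious


# Hypothetical data for illustration only.
KNOWN_SITES = {"example.com", "example.org"}
cert_domains = {
    "cert_1": ["example.com"],
    "cert_2": ["xkqzjw.test"],
    "cert_3": ["qpwvbn.test"],
}
clusters = [{"fp_a", "cert_1"}, {"fp_b", "fp_c", "cert_2", "cert_3"}]
flows = [
    ("fp_a", "cert_1"),
    ("fp_b", "cert_2"),
    ("fp_c", "cert_3"),
    ("fp_b", "cert_3"),
]
suspicious = flag_malicious_flows(
    clusters, cert_domains, flows, lambda d: d in KNOWN_SITES
)
```

In this example the first cluster is kept as normal because its only certificate names a known site, while all three flows of the second cluster are returned.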

Example 2

As shown in fig. 3, in a second aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection apparatus, including: a data acquisition unit 301, a construction unit 302, a primary clustering unit 303, a vectorization unit 304, a secondary clustering unit 305, and a determination unit 306, specifically,

a data acquisition unit 301, configured to acquire a required data feature set based on network traffic;

the building unit 302 is configured to build a bipartite graph between the client and the server by using the collected data feature set;

a primary clustering unit 303, configured to perform primary clustering on the client and the server node by using a graph segmentation method;

a vectorization unit 304, configured to perform vectorization processing on the client and the server node of the larger connected subgraph in the primary cluster;

a secondary clustering unit 305, configured to re-cluster the vectorized data by using a DBScan algorithm;

and a determining unit 306, configured to determine malicious traffic and nodes by using the clustering result after re-clustering.

Example 3

The present disclosure provides a computer readable storage medium storing computer program instructions which, when invoked and executed by a processor, implement the method steps of any of the first aspects.

Example 4

As shown in fig. 4, the present disclosure provides an electronic device comprising a processor and a memory, wherein the memory stores computer program instructions executable by the processor, and the processor implements the method steps of any one of the first aspect when executing the computer program instructions.

Referring now to FIG. 4, a block diagram of an electronic device 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 401.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two internet protocol addresses; sending a node evaluation request comprising the at least two internet protocol addresses to node evaluation equipment, wherein the node evaluation equipment selects the internet protocol addresses from the at least two internet protocol addresses and returns the internet protocol addresses; receiving an internet protocol address returned by the node evaluation equipment; wherein the obtained internet protocol address indicates an edge node in the content distribution network.

Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receiving a node evaluation request comprising at least two internet protocol addresses; selecting an internet protocol address from the at least two internet protocol addresses; returning the selected internet protocol address; wherein the received internet protocol address indicates an edge node in the content distribution network.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".

Claims

1. An unsupervised encryption malicious traffic detection method is characterized by comprising the following steps:

step S101: acquiring a required data feature set based on network flow; the set of data features includes: a client encryption suite, TLS extensions supported by the client, and a server certificate;

step S103, performing primary clustering on the client and the server node by a graph segmentation method; the primary clustering of the client or the server node by the graph segmentation method comprises the following steps: performing sub-graph clustering on the bipartite graph; dividing subgraphs that have no association with one another into different clusters, thereby carrying out primary clustering;

step S104, vectorizing the client and the server node of the larger connected subgraph in the primary clustering; the vectorization processing of the client and the server node of the larger connected subgraph in the primary clustering comprises the following steps: selecting the connected subgraphs with more nodes in the primary clustering; starting from any node in the subgraph, randomly selecting a node as a next node according to the connection relation to form a sequence with the length of t; for each node in the sequence, using a skip-gram method to learn the feature representation of the node from the other nodes around it, and reducing the representation of each node from multidimensional OneHot coding into a node feature vector;

2. The method of claim 1, wherein the building a bipartite graph between a client and a server using the collected data feature sets comprises:

3. The method of claim 2, wherein the DBScan algorithm comprises:

calculating the distance between each node;

determining the node similarity based on the distance;

nodes with similarity are grouped into one class.

4. The method according to claim 3, wherein the determining malicious traffic and nodes by using the re-clustered clustering result comprises:

5. An unsupervised encrypted malicious traffic detection device, comprising:

the data acquisition unit is used for acquiring a required data feature set based on network flow; the set of data features includes: a client encryption suite, TLS extensions supported by the client, and a server certificate;

the primary clustering unit is used for carrying out primary clustering on the client and the server node by a graph segmentation method; the primary clustering of the client or the server node by the graph segmentation method comprises the following steps: performing sub-graph clustering on the bipartite graph; dividing each subgraph completely without incidence relation into different clusters, thereby carrying out primary clustering;

the vectorization unit is used for vectorizing the client and the server node of the larger connected subgraph in the primary clustering; the vectorization processing of the client and the server node of the larger connected subgraph in the primary clustering comprises the following steps: selecting the connected subgraphs with more nodes in the primary clustering; starting from any node in the subgraph, randomly selecting a node as a next node according to the connection relation to form a sequence with the length of t; for each node in the sequence, using a skip-gram method to learn the feature representation of the node from the other nodes around it, and reducing the representation of each node from multidimensional OneHot coding into a node feature vector;

6. An electronic device comprising a processor and a memory, the memory storing computer program instructions executable by the processor, the processor implementing the method steps of any of claims 1-4 when executing the computer program instructions.

7. A computer-readable storage medium, characterized in that computer program instructions are stored which, when called and executed by a processor, implement the method steps of any of claims 1-4.