CN109495513A - Unsupervised encrypted malicious traffic detection method, apparatus, device and medium

Info

Publication number: CN109495513A
Application number: CN201811635919.2A
Authority: CN
Inventors: 江斌
Original assignee: Geek Xin'an (Beijing) Technology Co Ltd
Current assignee: Geek Xin'an (Chengdu) Technology Co.,Ltd.
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-03-19
Anticipated expiration: 2038-12-29
Also published as: CN109495513B

Abstract

Embodiments of the present disclosure provide an unsupervised method, apparatus, device and medium for detecting encrypted malicious traffic. The method includes the following steps: collecting the required data feature set from network traffic; building a bipartite graph between clients and servers from the collected data feature set; performing an initial clustering of the client and server nodes by a graph cut method; vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering; clustering the vectorized data again with the DBScan algorithm; and determining malicious traffic and nodes from the result of the second clustering. The disclosure uses a graph-based unsupervised learning model, so encrypted traffic can be inspected directly without prior knowledge or a labeled sample set. Partitioning the graph yields families of different types and breaks large families into small ones, after which malicious traffic is identified by examining the traffic features of each family. The method is simple and easy to carry out.

Description

Unsupervised encrypted malicious traffic detection method, apparatus, device and medium

Technical field

The present disclosure relates to the technical field of traffic data detection, and in particular to an unsupervised encrypted malicious traffic detection method, apparatus, electronic device and storage medium.

Background

Network communication is an information application that nearly every enterprise and individual now relies on. As enterprise and individual users pay increasing attention to information security, encryption is used in more and more network communication scenarios. Encryption ensures that the content of a communication cannot be read by any other user on the network apart from the two communicating parties.

At the same time, malicious programs such as network Trojans and worms often also use encrypted traffic when communicating with their control end, in order to evade identification by network detection devices. As a result, normal encrypted traffic and malicious encrypted traffic are hard to tell apart, which poses a major challenge for network security detection.

At present, encrypted malicious traffic is mainly detected with supervised machine learning. A detection model is trained on malicious encrypted traffic and normal encrypted traffic, and the model is then used to decide whether a given encrypted flow is malicious.

The main problems with existing schemes are:

(1) Model training depends on a large number of malicious ("black") samples; an insufficient sample size is likely to make the trained detection model inaccurate;

(2) Traffic features are extracted through expert analysis; if the expert knowledge is unreliable, the final classification results may be seriously flawed;

(3) Because the approach is based on prior experience, it detects new attack patterns poorly;

(4) The feature set is easy for an attacker to bypass: once the attacker discovers which feature set is used for detection, these features can be evaded by technical means.

Therefore, how to separate out malicious traffic effectively has become a technical problem in urgent need of a solution.

Summary of the invention

The purpose of the present disclosure is to provide an unsupervised encrypted malicious traffic detection method, apparatus, electronic device and storage medium that can quickly detect malicious encrypted traffic in traffic information.

In a first aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection method, including the following steps:

Step S101: collecting the required data feature set from network traffic;

Step S102: building a bipartite graph between clients and servers from the collected data feature set;

Step S103: performing an initial clustering of the client and server nodes by a graph cut method;

Step S104: vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

Step S105: clustering the vectorized data again with the DBScan algorithm;

Step S106: determining malicious traffic and nodes from the result of the second clustering.

Optionally, the data feature set includes:

the client cipher suites, the TLS extensions supported by the client, and the server certificate.
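As an illustration of how the three features listed above might be held per flow, here is a minimal Python sketch. The class name `FlowFeatures`, its field names, the `client_key` helper and the numeric values are all assumptions made for this example; the disclosure does not specify a data layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FlowFeatures:
    """Features of one TLS flow (hypothetical layout, not taken from the disclosure)."""
    cipher_suites: Tuple[int, ...]   # cipher suite IDs offered by the client
    tls_extensions: Tuple[int, ...]  # extension type IDs offered by the client
    server_certificate: str          # assumed certificate identifier, e.g. a fingerprint

    def client_key(self) -> str:
        """Join the client-side features into one hashable node label."""
        suites = "-".join(str(s) for s in self.cipher_suites)
        extensions = "-".join(str(e) for e in self.tls_extensions)
        return f"{suites}|{extensions}"


# Illustrative values only.
flow = FlowFeatures(
    cipher_suites=(4865, 4866, 49195),
    tls_extensions=(0, 10, 11, 13),
    server_certificate="cert-1",
)
```

Making the dataclass frozen keeps each record hashable, so identical client feature combinations collapse onto the same graph node.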

Optionally, building the bipartite graph between clients and servers from the collected data feature set includes:

selecting any client node at random and connecting it to its associated server node, thereby forming an edge between the client node and the server node;

traversing all client nodes and server nodes until every client node and server node in the data feature set has its corresponding connection relationship;

building the bipartite graph from the connection relationships formed between the client nodes and the server nodes.

Optionally, performing the initial clustering of the client or server nodes by the graph cut method includes:

clustering the bipartite graph into subgraphs;

assigning subgraphs that have no association at all with one another to different clusters, thereby completing the initial clustering.

Optionally, vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering includes:

selecting the connected subgraphs with more nodes from the initial clustering;

starting from any node of the subgraph, randomly selecting a node as the next node according to the link relationships, thereby forming a sequence of length t;

for each node in the sequence, using the skip-gram method to learn the node's own feature representation from the other nodes around it, reducing each node's representation from a high-dimensional OneHot encoding to a node feature vector.

Optionally, the DBScan algorithm includes:

calculating the distance between nodes;

estimating node similarity based on the distance;

grouping similar nodes into one class.

Optionally, determining malicious traffic and nodes from the result of the second clustering includes:

judging malicious traffic from the features of the server nodes and/or client nodes in each cluster obtained by the second clustering;

if the features of most server nodes and/or client nodes in a cluster are illegitimate features, determining that the cluster is a malicious cluster;

restoring the correspondences that exist between all client nodes and server nodes in the malicious cluster, which yields the malicious traffic to be detected.

In a second aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection apparatus, including:

a data acquisition unit, configured to collect the required data feature set from network traffic;

a construction unit, configured to build a bipartite graph between clients and servers from the collected data feature set;

an initial clustering unit, configured to perform an initial clustering of the client and server nodes by a graph cut method;

a vectorization unit, configured to vectorize the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

a re-clustering unit, configured to cluster the vectorized data again with the DBScan algorithm;

a judging unit, configured to determine malicious traffic and nodes from the result of the second clustering.

In a third aspect, the present disclosure provides an electronic device including a processor and a memory, the memory storing computer program instructions executable by the processor; when the processor executes the computer program instructions, the method steps of any implementation of the first aspect are carried out.

In a fourth aspect, the present disclosure provides a computer-readable storage medium storing computer program instructions which, when called and executed by a processor, carry out the method steps of any implementation of the first aspect.

Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects:

The disclosure uses a graph-based unsupervised learning model, so encrypted traffic can be inspected directly without prior knowledge or a labeled sample set. Partitioning the graph yields families of different types and breaks large families into small ones, after which malicious traffic is identified by examining the traffic features of each family. The method is simple and easy to carry out, and can detect encrypted malicious traffic efficiently.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of the unsupervised encrypted malicious traffic detection method provided by an embodiment of the present disclosure;

Fig. 2 is a schematic diagram of the bipartite graph used in the unsupervised encrypted malicious traffic detection method provided by an embodiment of the present disclosure;

Fig. 3 is a structural diagram of the unsupervised encrypted malicious traffic detection apparatus provided by an embodiment of the present disclosure;

Fig. 4 is a structural diagram of the electronic device provided by an embodiment of the present disclosure.

Detailed description of the embodiments

To make the purposes, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the disclosure without creative effort fall within the scope of protection of the disclosure.

The terms used in the embodiments of the present disclosure serve only to describe particular embodiments and are not intended to limit the disclosure. The singular forms "a", "said" and "the" used in the embodiments and in the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. "A plurality of" generally means at least two, but does not exclude the case of at least one.

It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

It should be understood that although the terms first, second, third and so on may be used in the embodiments of the present disclosure to describe technical names, those technical names should not be limited by these terms. The terms are only used to distinguish technical names from one another. For example, without departing from the scope of the embodiments of the present disclosure, a first signature verification may also be called a second signature verification and, similarly, a second signature verification may also be called a first signature verification.

Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".

It should also be noted that the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a product or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the product or system. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the product or system that includes it.

In addition, the order of steps in the method embodiments below is only an example and not a strict limitation.

Referring to Fig. 1, in a first aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection method, including the following steps:

Optionally, the data feature set includes:

As shown in Fig. 2, the client metadata features extracted from the traffic form the nodes on the left of the figure and the certificate metadata features form the nodes on the right, so any flow can be expressed as an edge from a node made up of client features to a certificate node. Since there are no edges between client metadata nodes and no edges between digital certificate nodes, the nodes fall into two classes and form a bipartite graph.
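The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `build_bipartite_graph`, the `("client", ...)` / `("cert", ...)` node tags and the sample flows are assumptions introduced for the example.

```python
from collections import defaultdict


def build_bipartite_graph(flows):
    """Build an undirected client/certificate graph from (client_feature, cert_feature) pairs.

    Each flow contributes one edge. Nodes are tagged with their side so that a
    client label and a certificate label with the same text never collide.
    """
    adjacency = defaultdict(set)
    for client_feature, cert_feature in flows:
        client_node = ("client", client_feature)
        cert_node = ("cert", cert_feature)
        adjacency[client_node].add(cert_node)
        adjacency[cert_node].add(client_node)
    return dict(adjacency)


# Hypothetical flows: (client feature label, server certificate identifier).
flows = [
    ("suiteA|extA", "cert-1"),
    ("suiteA|extA", "cert-2"),
    ("suiteB|extB", "cert-2"),
    ("suiteC|extC", "cert-3"),
]
graph = build_bipartite_graph(flows)
```

Because every edge joins a `client` node to a `cert` node, the result is bipartite by construction.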

Specifically and optionally, building the bipartite graph between clients and servers from the collected data feature set includes:

Here, the graph cut method means that the bipartite graph formed in step S102 may contain multiple subgraphs, that is, subgraphs with no edges between them and no nodes in common, which form discrete families. The nodes can be divided into different groups on this basis, which completes the initial clustering.

Specifically, performing the initial clustering of the client or server nodes by the graph cut method includes:

clustering the bipartite graph into subgraphs;
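Splitting a graph into subgraphs that share no edges amounts to finding its connected components. The sketch below does this with a breadth-first search; reading the graph cut step as plain connected-component search is an interpretation of the description, and the function name and sample graph are hypothetical.

```python
from collections import deque


def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical graph: clients c1..c3, servers s1..s3.
adjacency = {
    "c1": {"s1", "s2"},
    "c2": {"s2"},
    "c3": {"s3"},
    "s1": {"c1"},
    "s2": {"c1", "c2"},
    "s3": {"c3"},
}
clusters = connected_components(adjacency)
largest = max(clusters, key=len)  # candidate for the later vectorization step
```

Each returned set is one initial cluster; only the larger ones would go on to the vectorization step.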

Step S104: vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

Specifically, the vectorization involves two steps:

The first step is to build node sequences by random walk, as follows:

Starting from any node in the graph, a node is randomly selected as the next node according to the link relationships, forming a sequence. With the sequence length defined as t, this yields a sequence of length t in which client nodes and server nodes alternate;

Every node in the graph is used as a starting node for the above step, so if there are Q nodes, Q sequences of length t are generated.
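The random-walk step can be sketched as follows. The function name `random_walks`, the fixed seed and the sample graph are assumptions for illustration; the text only fixes the walk length t and the rule that every node serves as a start node once.

```python
import random


def random_walks(adjacency, walk_length, seed=0):
    """Generate one random walk of `walk_length` nodes starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = sorted(adjacency[walk[-1]])
            if not neighbors:  # isolated node: the walk cannot continue
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


# Hypothetical connected bipartite subgraph (clients c*, servers s*).
adjacency = {
    "c1": {"s1", "s2"},
    "c2": {"s2"},
    "s1": {"c1"},
    "s2": {"c1", "c2"},
}
walks = random_walks(adjacency, walk_length=5)
```

With Q = 4 nodes this produces 4 walks of length t = 5, and because the graph is bipartite each walk alternates between client and server nodes.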

The second step is to apply the skip-gram method to the nodes in these sequences to obtain a feature vector for each node, as follows:

The input is a series of node sequences in which each node is represented by a OneHot encoding: for a node N with rank n, the corresponding initial vector is (0, 0, 0, 0, 0, ..., 1, ..., 0, 0, 0, 0), that is, a vector whose n-th position is 1 and whose other positions are all 0. The output is a lower-dimensional feature vector P for the node, of length p; in general p is far smaller than the length of the OneHot vector.

The OneHot features are reduced to P as follows. For each node X, its different contexts Y (1 to k) in the different sequences can be obtained. For each pair (X, Y), a neural network is trained with the back-propagation algorithm: the input of the network is the OneHot vector x, the training label is the OneHot vector y, the hidden layer is P, and the trained parameter is an m*p matrix W (m is the total number of nodes). Since only one position in x has the value 1, back-propagation only updates the parameter values of one row of W, and the parameters of that row are exactly the trained low-dimensional feature P of X. In this way a low-dimensional feature vector P can be obtained for every X.
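The training loop described above can be sketched in pure Python. Several details here are assumptions, because the text does not specify them: a full softmax output layer with cross-entropy loss, a context window of 2, the learning rate, the epoch count, and the names `train_skipgram`, `w_in` and `w_out`. Here `w_in` plays the role of the m*p matrix W.

```python
import math
import random


def train_skipgram(walks, dim, window=2, epochs=50, lr=0.05, seed=0):
    """Learn a `dim`-dimensional vector per node from node sequences (full-softmax skip-gram)."""
    rng = random.Random(seed)
    nodes = sorted({n for walk in walks for n in walk})
    index = {n: i for i, n in enumerate(nodes)}
    m = len(nodes)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(m)]   # the m*p matrix W
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(m)]  # one row per output node
    # (X, Y) training pairs: each node with every neighbor inside the window.
    pairs = [(index[walk[i]], index[walk[j]])
             for walk in walks
             for i in range(len(walk))
             for j in range(max(0, i - window), min(len(walk), i + window + 1))
             if i != j]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x, y in pairs:
            hidden = w_in[x]  # a OneHot input selects exactly one row of W
            scores = [sum(h * o for h, o in zip(hidden, w_out[k])) for k in range(m)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            errors = [e / total for e in exps]
            errors[y] -= 1.0  # softmax output minus the OneHot label
            grad_hidden = [sum(errors[k] * w_out[k][d] for k in range(m)) for d in range(dim)]
            for k in range(m):
                for d in range(dim):
                    w_out[k][d] -= lr * errors[k] * hidden[d]
            for d in range(dim):
                hidden[d] -= lr * grad_hidden[d]  # only row x of W changes
    return {n: w_in[index[n]] for n in nodes}


# Hypothetical node sequences, e.g. produced by random walks.
walks = [
    ["c1", "s1", "c2", "s1"],
    ["c2", "s1", "c1", "s2"],
    ["s2", "c1", "s1", "c2"],
]
vectors = train_skipgram(walks, dim=2)
```

The full softmax costs O(m) per training pair, which is acceptable for a sketch; larger graphs would normally need negative sampling or a library implementation.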

The basic idea of the DBScan algorithm is to calculate the distance between nodes, estimate node similarity from the distance, and group similar nodes into one class, thereby completing the data clustering.

Optionally, the DBScan algorithm includes:

calculating the distance between nodes;

estimating node similarity based on the distance;

grouping similar nodes into one class.
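For reference, a compact density-based clustering routine in the spirit of DBSCAN is sketched below. The Euclidean metric, the parameter names `eps` and `min_samples` and the sample points are assumptions; the text only says that nodes are grouped by distance-based similarity.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors_of(i):
        # Indices of all points within `eps` of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = neighbors_of(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors_of(j)
            if len(j_neighbors) >= min_samples:  # j is a core point: keep expanding
                frontier.extend(j_neighbors)
    return labels


# Hypothetical 2-D node vectors: two dense groups and one outlier.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.5, min_samples=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The neighbor search here is quadratic in the number of points, which keeps the sketch short; an indexed search would be needed at scale.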

After clustering, each cluster may contain server certificate nodes and client cipher suite nodes. Using the domain names in the CN and SAN fields of a certificate node, it can be judged whether the certificate's server is a legitimate website. If most of the certificates in a cluster belong to illegitimate websites, the cluster can be considered a malicious cluster. Restoring the correspondences that exist between all clients and servers in this cluster then yields the malicious traffic to be detected.
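The majority-vote judgement can be sketched as below. How a domain is recognized as belonging to a legitimate website is not specified in the text; the allowlist `known_good`, the "more than half" threshold, the function names and the sample data are all assumptions made for this example.

```python
def is_legitimate(domain, known_good):
    """True if `domain` equals, or is a subdomain of, an entry in `known_good`."""
    domain = domain.lower().rstrip(".")
    return any(domain == good or domain.endswith("." + good) for good in known_good)


def is_malicious_cluster(cert_domains, known_good):
    """Majority vote: True when more than half of the certificates have no legitimate domain.

    `cert_domains` maps a certificate id to the domain names from its CN and SAN fields.
    """
    if not cert_domains:
        return False
    suspicious = sum(
        1 for domains in cert_domains.values()
        if not any(is_legitimate(d, known_good) for d in domains)
    )
    return suspicious > len(cert_domains) / 2


def restore_flows(flows, cluster_clients, cluster_certs):
    """Return the original (client, certificate) flows whose endpoints both lie in the cluster."""
    return [(c, s) for c, s in flows if c in cluster_clients and s in cluster_certs]


# Hypothetical data for one cluster.
known_good = {"example.com", "example.org"}
cluster = {
    "cert-1": ["a1b2c3.invalid"],
    "cert-2": ["x9y8z7.invalid", "q5w6.invalid"],
    "cert-3": ["www.example.com"],
}
flows = [("client-A", "cert-1"), ("client-A", "cert-2"),
         ("client-B", "cert-3"), ("client-C", "cert-9")]
verdict = is_malicious_cluster(cluster, known_good)
flagged = restore_flows(flows, {"client-A", "client-B"}, set(cluster))
```

In this example two of the three certificates carry no allowlisted domain, so the cluster is judged malicious and its three flows are returned.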

Embodiment 2

As shown in Fig. 3, in a second aspect, the present disclosure provides an unsupervised encrypted malicious traffic detection apparatus, including a data acquisition unit 301, a construction unit 302, an initial clustering unit 303, a vectorization unit 304, a re-clustering unit 305 and a judging unit 306. Specifically:

the data acquisition unit 301 is configured to collect the required data feature set from network traffic;

the construction unit 302 is configured to build a bipartite graph between clients and servers from the collected data feature set;

the initial clustering unit 303 is configured to perform an initial clustering of the client and server nodes by a graph cut method;

the vectorization unit 304 is configured to vectorize the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

the re-clustering unit 305 is configured to cluster the vectorized data again with the DBScan algorithm;

the judging unit 306 is configured to determine malicious traffic and nodes from the result of the second clustering.

Embodiment 3

The present disclosure provides a computer-readable storage medium storing computer program instructions which, when called and executed by a processor, carry out the method steps of any implementation of the first aspect.

Embodiment 4

As shown in Fig. 4, the present disclosure provides an electronic device including a processor and a memory, the memory storing computer program instructions executable by the processor; when the processor executes the computer program instructions, the method steps of any implementation of the first aspect are carried out.

Referring now to Fig. 4, a structural diagram of an electronic device 400 suitable for implementing the embodiments of the present disclosure is shown. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players) and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.

As shown in Fig. 4, the electronic device 400 may include a processing unit (such as a central processing unit or a graphics processor) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores the various programs and data needed for the operation of the electronic device 400. The processing unit 401, the ROM 402 and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

In general, the following devices can be connected to the I/O interface 405: input devices 406 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer and gyroscope; output devices 407 such as a liquid crystal display (LCD), loudspeaker and vibrator; storage devices 408 such as magnetic tape and hard disk; and a communication device 409. The communication device 409 allows the electronic device 400 to communicate with other devices, wirelessly or by wire, to exchange data. Although Fig. 4 shows an electronic device 400 with various devices, it should be understood that not all of the devices shown need to be implemented or present; more or fewer devices may alternatively be implemented or present.

In particular, according to embodiments of the present disclosure, the process described above with reference to the flow diagram may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 409, installed from the storage device 408, or installed from the ROM 402. When the computer program is executed by the processing unit 401, the functions defined in the method of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.

The computer-readable medium may be included in the electronic device described above, or it may exist on its own without being fitted into the electronic device.

The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain at least two internet protocol addresses; send to a node evaluation device a node evaluation request including the at least two internet protocol addresses, wherein the node evaluation device chooses an internet protocol address from the at least two internet protocol addresses and returns it; and receive the internet protocol address returned by the node evaluation device; wherein the obtained internet protocol addresses indicate edge nodes in a content distribution network.

Alternatively, the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive a node evaluation request including at least two internet protocol addresses; choose an internet protocol address from the at least two internet protocol addresses; and return the chosen internet protocol address; wherein the received internet protocol addresses indicate edge nodes in a content distribution network.

Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the internet using an internet service provider).

The flow diagrams and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each box in a flow diagram or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the boxes may occur in a different order from that shown in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flow diagrams, and any combination of such boxes, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a unit does not limit the unit itself; for example, the first acquiring unit may also be described as "a unit that obtains at least two internet protocol addresses".

Claims

1. An unsupervised encrypted malicious traffic detection method, characterized by comprising the following steps:

Step S104: vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

2. The method according to claim 1, characterized in that the data feature set includes:

3. The method according to claim 1, characterized in that building the bipartite graph between clients and servers from the collected data feature set includes:

selecting any client node at random and connecting it to its associated server node, thereby forming an edge between the client node and the server node;

traversing all client nodes and server nodes until every client node and server node in the data feature set has its corresponding connection relationship;

4. The method according to claim 3, characterized in that performing the initial clustering of the client or server nodes by the graph cut method includes:

clustering the bipartite graph into subgraphs;

5. The method according to claim 4, characterized in that vectorizing the client and server nodes in the larger connected subgraphs obtained from the initial clustering includes:

starting from any node of the subgraph, randomly selecting a node as the next node according to the link relationships, thereby forming a sequence of length t;

for each node in the sequence, using the skip-gram method to learn the node's own feature representation from the other nodes around it, reducing each node's representation from a high-dimensional OneHot encoding to a node feature vector.

6. The method according to claim 5, characterized in that the DBScan algorithm includes:

calculating the distance between nodes;

estimating node similarity based on the distance;

grouping similar nodes into one class.

7. The method according to claim 6, characterized in that determining malicious traffic and nodes from the result of the second clustering includes:

judging malicious traffic from the features of the server nodes and/or client nodes in each cluster obtained by the second clustering;

if the features of most server nodes and/or client nodes in a cluster are illegitimate features, determining that the cluster is a malicious cluster;

restoring the correspondences that exist between all client nodes and server nodes in the malicious cluster, which yields the malicious traffic to be detected.

8. An unsupervised encrypted malicious traffic detection apparatus, characterized by comprising:

a vectorization unit, configured to vectorize the client and server nodes in the larger connected subgraphs obtained from the initial clustering;

9. An electronic device, characterized by comprising a processor and a memory, the memory storing computer program instructions executable by the processor; when the processor executes the computer program instructions, the method steps of any one of claims 1-7 are carried out.

10. A computer-readable storage medium, characterized by storing computer program instructions which, when called and executed by a processor, carry out the method steps of any one of claims 1-7.