CN111818049A

CN111818049A - Botnet flow detection method and system based on Markov model

Info

Publication number: CN111818049A
Application number: CN202010651260.0A
Authority: CN
Inventors: 毕建宇; 迟永梅
Original assignee: Baomu Technology Tianjin Co ltd
Current assignee: Baomu Technology Tianjin Co ltd
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2020-10-23
Anticipated expiration: 2040-07-08
Also published as: CN111818049B

Abstract

The invention provides a botnet traffic detection method based on a Markov model, comprising: acquiring and storing flow data; performing cluster analysis on the acquired flow data sharing the same quadruple of source IP, destination port and protocol to generate a state chain corresponding to the quadruple; performing similarity detection on the state chain generated for the quadruple using a pre-trained Markov model; and storing the detection result and the flow data in a database. Using first-order Markov chain theory, the invention sorts the flow data clustered by source IP, destination port and protocol by flow start time, builds a Markov chain model, and compares it with pre-trained Markov chain models of typical botnet traffic. Similarity is judged by whether the ratio of the probability indexes generated by the two Markov chains is less than or equal to a set threshold, so that botnet traffic in an unknown network that closely resembles the known traffic is detected.

Description

Botnet flow detection method and system based on Markov model

Technical Field

The invention belongs to the field of network security, and particularly relates to a botnet flow detection method and system based on a Markov model.

Background

At present, botnet traffic is generally detected with machine learning algorithms, such as LSTM, EM clustering and decision trees. These algorithms share the following problems:

1. The data samples required for training and testing are difficult to collect: in a real network environment, traffic sent by zombie hosts makes up only a small proportion of the total, so the positive and negative samples available for training are heavily imbalanced.

2. It is difficult to find suitable features of botnet data and extract them as the basis for training and detection, because identifying those features requires statistical analysis of many types of botnet traffic, and such traffic makes up only a very small proportion of the data in a real environment.

3. Different types of botnets use different protocols and produce traffic with different characteristics, so it is difficult to find a single method that effectively detects all botnet traffic.

Disclosure of Invention

In view of the above, the present invention aims to provide a botnet traffic detection method based on a Markov model, so as to solve the above problems in the prior art.

To achieve this purpose, the technical solution of the invention is realized as follows:

A botnet traffic detection method based on a Markov model comprises the following steps:

(1) acquiring and storing flow data;

(2) performing cluster analysis on the acquired flow data sharing the same quadruple of source IP, destination port and protocol, to generate a state chain corresponding to the quadruple;

(3) performing similarity detection on the state chain generated for the quadruple using a pre-trained Markov model;

(4) storing the detection result and the flow data in a database.

Further, in step (1), bidirectional flows in NetFlow form are used as input data; the data are stored in a queue after being input and then taken out one by one.
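The queue-based intake of step (1) can be pictured with a minimal Python sketch. The function names `enqueue_flows` and `drain_flows` are hypothetical; the text only states that flows are queued and then taken out one by one.

```python
import queue


def enqueue_flows(flow_queue, lines):
    """Put every non-empty netflow line into the queue in arrival order."""
    for line in lines:
        line = line.strip()
        if line:
            flow_queue.put(line)


def drain_flows(flow_queue):
    """Yield queued flows one at a time until the queue is empty."""
    while True:
        try:
            yield flow_queue.get_nowait()
        except queue.Empty:
            return
```

A first-in, first-out queue keeps the flows in arrival order, which matters later because the state chain depends on the order of flow start times.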

Further, step (2) specifically comprises:

determining whether the input data is the first flow; if so, adding 5 minutes to the start time of the flow to generate a time window, aggregating input flow data with the same quadruple of source IP, destination port and protocol into the same class, extracting the size, duration and periodicity features of the flows, and generating, through a custom flow-state lookup table, a flow state chain sorted in descending order of flow start time;

if the flow is not the first flow, a time window already exists, and it is determined whether the flow falls within that time window;

if the new flow falls within the existing time window, it is added to the corresponding quadruple set;

if the flow does not fall within the time window, it is determined whether the quadruple corresponding to the flow holds too much flow data; if the amount exceeds a set threshold, all existing flow data are cleared, the time window is updated, and the flow is put into the corresponding quadruple set.
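The window logic above can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the class name `FlowAggregator` and the limit `MAX_FLOWS_PER_TUPLE` are hypothetical, and the sketch assumes that the window is advanced whenever a flow falls outside it, while stored flows are cleared only when the size threshold is exceeded.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # window width stated in the text
MAX_FLOWS_PER_TUPLE = 1000      # hypothetical size threshold


class FlowAggregator:
    """Groups flows by quadruple key inside a 5-minute time window."""

    def __init__(self):
        self.window_end = None            # None until the first flow arrives
        self.flows = defaultdict(list)    # quadruple key -> list of flows

    def add(self, key, start_time, flow):
        """Store one flow and return how many flows its quadruple now holds."""
        if self.window_end is None:
            # First flow: the window ends at start time + 5 minutes.
            self.window_end = start_time + WINDOW
        elif start_time > self.window_end:
            # Outside the window: clear stored data if this quadruple grew
            # too large, then move the window forward (assumed behaviour).
            if len(self.flows[key]) > MAX_FLOWS_PER_TUPLE:
                self.flows.clear()
            self.window_end = start_time + WINDOW
        self.flows[key].append(flow)
        return len(self.flows[key])
```

The key is treated as opaque here, so any tuple that identifies a connection class can be passed in.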

Further, step (2) further comprises:

determining whether the quadruple matches a whitelist rule, where each whitelist rule consists of a source IP, a destination port and a protocol, and the matching method is specifically as follows:

checking whether the source IP, destination port and protocol of the quadruple are the same as those of any existing whitelist rule; if they are, the corresponding quadruple is marked as a normal connection, and all flows corresponding to it are normal traffic;

if no whitelist rule matches the quadruple, proceeding to the next determination.
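Whitelist matching as described is an exact comparison of three fields. A minimal Python sketch, in which the `WhitelistRule` type and the example rule are hypothetical and the protocol comparison is assumed to be case-insensitive:

```python
from typing import NamedTuple


class WhitelistRule(NamedTuple):
    src_ip: str
    dst_port: int
    proto: str   # stored in lower case


def is_whitelisted(src_ip, dst_port, proto, rules):
    """Return True when the three fields equal any one whitelist rule exactly."""
    return WhitelistRule(src_ip, dst_port, proto.lower()) in set(rules)
```

A quadruple for which `is_whitelisted` returns True would be marked as a normal connection and skipped by the later checks.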

Further, step (2) further comprises:

determining whether the length of the state chain corresponding to the quadruple reaches a detection threshold, that is, whether the number of flows with the same source IP, destination port and protocol meets the detection requirement.

Further, step (3) specifically comprises the following steps:

(a) truncating the state chain of a known botnet traffic model learned in advance to the same length as the state chain of the quadruple to be detected;

(b) generating the initial vector and the probability matrix of the corresponding first-order Markov model from the truncated state chain of the known botnet traffic model;

(c) according to first-order Markov chain theory, calculating, under the probability matrix generated in step (b), the probability index of generating the state chain of the known botnet traffic model and the probability index of generating the state chain of the quadruple to be detected;

(d) if the ratio of the two probability indexes obtained in step (c) is less than or equal to a set similarity threshold, determining that the quadruple to be detected belongs to the botnet, and marking the connection corresponding to the quadruple as an abnormal connection;

(e) traversing the state chain of each known botnet traffic model and repeating steps (a) to (d); the botnet assigned to the quadruple to be detected is the one whose probability index ratio is closest to 1, and if no ratio is less than or equal to the similarity threshold, marking the quadruple to be detected as an unknown connection.

Further, the similarity threshold is set to 1.1.
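Steps (a) to (d) can be written compactly as follows. This is a sketch under assumptions: the text does not define the "probability index", so it is taken here to be the log-likelihood of a chain under the first-order model, and the ratio is arranged so that it is at least 1. The symbols $m$ (truncated known chain), $q$ (chain to be detected), $n$ (length of the chain $s$), $N_{ij}$ (number of transitions from state $i$ to state $j$ in $m$), $\pi$ (initial vector) and $P$ (probability matrix) are introduced for illustration only.

```latex
\begin{align}
\pi_i &= \begin{cases} 1 & \text{if } i = m_1 \\ 0 & \text{otherwise} \end{cases},
&
P_{ij} &= \frac{N_{ij}}{\sum_{k} N_{ik}}, \\
\ell(s) &= \log \pi_{s_1} + \sum_{t=2}^{n} \log P_{s_{t-1} s_t},
&
r &= \frac{\max\bigl(|\ell(m)|,\, |\ell(q)|\bigr)}{\min\bigl(|\ell(m)|,\, |\ell(q)|\bigr)}, \\
&&
&\text{match if } r \le 1.1 .
\end{align}
```

Under this reading, $r = 1$ means the chain to be detected is exactly as likely under the model as the known chain itself, and a threshold of 1.1 tolerates a deviation of up to 10% in log-likelihood.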

The invention also provides a botnet traffic detection device based on a Markov model, comprising:

a data acquisition unit, used for acquiring and storing flow data;

a cluster analysis unit, used for performing cluster analysis on the acquired flow data sharing the same quadruple of source IP, destination port and protocol, to generate a state chain corresponding to the quadruple;

a model training unit, used for performing similarity detection on the state chain generated for the quadruple using a pre-trained Markov model;

a result storage unit, used for storing the detection result and the flow data in the database.

The invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the Markov model-based botnet traffic detection method described above.

The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the Markov model-based botnet traffic detection method described above.

Compared with the prior art, the botnet traffic detection method and system based on the Markov model have the following advantages:

(1) by learning the traffic of known botnets, the method can identify similar traffic in an unknown network, regardless of the protocol type of the traffic;

(2) the detection data are bidirectional NetFlow data, so the method does not rely on the results of other intrusion detection tools (such as Snort) as input;

(3) only representative botnet traffic needs to be provided for modeling and analysis, and no training on massive sample data is required, which avoids the problem that normal traffic vastly outnumbers botnet traffic and a suitable data set for training is hard to find;

(4) the invention analyzes the traffic characteristics of botnets and extracts the duration, size and periodicity features of each flow for detection; verification shows that it can effectively detect botnet traffic in an unknown environment.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic diagram of a method for detecting botnet traffic based on a Markov model according to an embodiment of the present invention;

FIG. 2 is a control interface of a botnet C2 server in an embodiment of the present invention, with the botnet host list data clearly visible;

FIG. 3 is a topology diagram of a network used in verifying validity in an embodiment of the present invention;

FIG. 4 shows the test command and output result used for validity verification in an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in FIG. 1, the invention provides a botnet traffic detection method based on a Markov model. The algorithm of the invention is written in Python; it receives and inspects flow data input in real time and determines whether the data belong to botnet traffic. The algorithm flow is described in detail as follows:

1. The algorithm takes bidirectional flows in NetFlow form as input data; the data are stored in a queue after being input and then taken out one by one.

2. It is determined whether the input data is the first flow. If so, a time window is generated by adding 5 minutes to the start time of the flow, input flow data with the same quadruple of source IP, destination port and protocol are aggregated into the same class, the size, duration and periodicity features of the flows are extracted, and a flow state chain sorted in descending order of flow start time is generated through a custom flow-state lookup table. Each state chain corresponds to one quadruple, and the state chain grows longer as the number of flows in the quadruple increases.

If the flow is not the first flow, a time window already exists, and it must be determined whether the flow falls within it. If so, the new flow is added to the corresponding quadruple set. If not, it is determined whether the quadruple corresponding to the flow holds too much flow data; if the amount exceeds a set threshold, all existing flow data are cleared, the time window is updated, and the flow is put into the corresponding quadruple set.

3. It is determined whether the quadruple matches a whitelist rule, where each whitelist rule consists of a source IP, a destination port and a protocol.

Specifically, the whitelist matching method checks whether the source IP, destination port and protocol of the quadruple are the same as those of any existing whitelist rule. If they are, the corresponding quadruple is marked as a normal connection and all of its flows are normal traffic; if no whitelist rule matches the quadruple, the next determination is made.

This condition reduces the false alarm rate of the detection algorithm.

4. It is determined whether the length of the state chain corresponding to the quadruple reaches a detection threshold, that is, whether the number of flows with the same source IP, destination port and protocol meets the detection requirement.

This filtering condition also reduces the false alarm rate of the algorithm.

5. Similarity detection is performed on the state chain generated for the quadruple using a pre-trained Markov model, specifically as follows:

a) the state chain of a known botnet traffic model learned in advance is truncated to the same length as the state chain of the quadruple to be detected;

b) the initial vector and the probability matrix of the corresponding first-order Markov model are generated from the truncated state chain of the known botnet traffic model;

c) according to first-order Markov chain theory, the probability index of generating the state chain of the known botnet traffic model and the probability index of generating the state chain of the quadruple to be detected are each calculated under the probability matrix generated in step b);

d) if the ratio of the two probability indexes obtained in step c) is less than or equal to a set similarity threshold, which is set to 1.1, the quadruple to be detected is determined to belong to the botnet, and the connection corresponding to the quadruple is marked as an abnormal connection;

e) the state chain of each known botnet traffic model is traversed and steps a) to d) are repeated. The botnet assigned to the quadruple to be detected is the one whose probability index ratio is closest to 1. If no probability index ratio is less than or equal to the similarity threshold, the quadruple to be detected is marked as an unknown connection.

6. The detection result and the flow data are stored in a database.
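Step 6 could be sketched with Python's built-in `sqlite3` module. The text does not name the database or its schema, so the table name `detections`, its columns, and the helper names below are hypothetical, as is the inclusion of a destination IP column.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the result store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detections ("
        "src_ip TEXT, dst_ip TEXT, dst_port INTEGER, proto TEXT, "
        "label TEXT, model TEXT, state_chain TEXT)"
    )
    return conn


def save_detection(conn, quadruple, label, model, state_chain):
    """Insert one detection result together with the chain that produced it."""
    src_ip, dst_ip, dst_port, proto = quadruple
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?)",
            (src_ip, dst_ip, dst_port, proto, label, model, state_chain),
        )
```

Parameterized queries are used so that field values taken from network traffic are never interpolated into the SQL text.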

Using first-order Markov chain theory, the invention sorts the flow data clustered by source IP, destination port and protocol by flow start time, builds a Markov chain model, and compares it with pre-trained Markov chain models of typical botnet traffic. Similarity is judged by whether the ratio of the probability indexes generated by the two Markov chains is less than or equal to a set threshold, so that botnet traffic in an unknown network that closely resembles the known traffic is detected.
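The comparison described above can be sketched in Python. This is an illustration under stated assumptions rather than the patented code: the "probability index" is taken to be the log-probability of a chain, unseen transitions receive a small hypothetical floor probability `FLOOR`, the ratio is arranged to be at least 1, and all function names are invented for the example. State chains are assumed to be non-empty strings with one character per state.

```python
import math
from collections import Counter, defaultdict

SIMILARITY_THRESHOLD = 1.1   # value given in the text
FLOOR = 1e-6                 # hypothetical probability for unseen states/transitions


def train(chain):
    """Build the initial vector and transition matrix of a first-order chain."""
    init = {chain[0]: 1.0}
    counts = defaultdict(Counter)
    for prev, cur in zip(chain, chain[1:]):
        counts[prev][cur] += 1
    matrix = {
        prev: {cur: n / sum(row.values()) for cur, n in row.items()}
        for prev, row in counts.items()
    }
    return init, matrix


def log_prob(chain, init, matrix):
    """Log-probability of generating `chain` under (init, matrix)."""
    total = math.log(init.get(chain[0], FLOOR))
    for prev, cur in zip(chain, chain[1:]):
        total += math.log(matrix.get(prev, {}).get(cur, FLOOR))
    return total


def similarity_ratio(known_chain, test_chain):
    """Ratio of the two probability indexes, arranged to be >= 1."""
    known = known_chain[:len(test_chain)]       # truncate the known chain
    init, matrix = train(known)                 # model from the known chain
    a = abs(log_prob(known, init, matrix))
    b = abs(log_prob(test_chain, init, matrix))
    lo, hi = min(a, b), max(a, b)
    if hi == 0:
        return 1.0          # both chains are generated with probability 1
    if lo == 0:
        return math.inf
    return hi / lo


def classify(models, test_chain):
    """Return (best model name or None, ratio) over all known models."""
    best = min(models, key=lambda name: similarity_ratio(models[name], test_chain))
    ratio = similarity_ratio(models[best], test_chain)
    return (best, ratio) if ratio <= SIMILARITY_THRESHOLD else (None, ratio)
```

Here `models` maps a model name to its stored state chain; a result of `None` corresponds to marking the quadruple as an unknown connection.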

The invention also provides, in the field of botnet traffic detection, a method of converting network traffic data into bidirectional NetFlow data and then performing cluster analysis on flow data with the same source IP, destination port and protocol.

With this detection algorithm, if typical traffic between a botnet and its C2 server can be obtained, then after that traffic is analyzed and modeled, any known botnet traffic can be effectively detected.

In a closed local area network, we built two botnets to verify the effectiveness of the algorithm: Windows-based botnets built with Plasma RAT and DarkComet RAT, respectively. By capturing and analyzing the traffic exchanged between the zombie hosts and the C2 server, the corresponding Markov models were established, and real-time traffic detection of both botnets was successfully achieved. The specific implementation process is as follows:

Taking the botnet built with DarkComet RAT as an example, a simple botnet consisting of a zombie host and a C2 server is first set up in a virtual machine environment. As shown in FIG. 2, the list of controlled hosts is visible on the C2 server side, and the C2 server controls one zombie host. Once the zombie host has connected to the C2 server, it sends keep-alive packets. A Markov chain model is built by capturing the typical traffic of this keep-alive exchange, and the model can then be used to inspect all traffic in a given network, thereby discovering similar botnet traffic in that network and locating the zombie hosts and the C2 server.

The following are specific implementation steps:

1. Capturing zombie host communication traffic and modeling

While the zombie host remains connected to the C2 server, the botnet's communication traffic is captured on the C2 server side with the Wireshark tool and saved as a pcap file. The pcap file is converted to netflow form with the argus tools, using the following Linux commands:

argus -F argus.conf -r darkComet.pcap -w darkComet.biargus

ra -F ra.conf -n -Z b -r darkComet.biargus > darkComet.binetflow

The extracted data are stored in the file darkComet.binetflow. Sample netflow data are as follows:

StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes

2019/12/28 13:43:03.918146,3598.816406,tcp,192.168.80.190,49711,->,192.168.80.134,1604,SPA_SPA,0,0,364,25318

2019/12/28 14:43:22.807230,3584.171631,tcp,192.168.80.190,49711,->,192.168.80.134,1604,PA_A,0,0,356,24208

2019/12/28 15:43:26.970967,3580.367676,tcp,192.168.80.190,49711,->,192.168.80.134,1604,PA_A,0,0,356,24208

2019/12/28 16:43:27.300319,3599.892578,tcp,192.168.80.190,49711,->,192.168.80.134,1604,PA_A,0,0,357,24284

2019/12/28 17:43:27.394986,3599.155273,tcp,192.168.80.190,49711,->,192.168.80.134,1604,PA_A,0,0,357,24268

2019/12/28 18:43:46.524923,3599.338867,tcp,192.168.80.190,49711,->,192.168.80.134,1604,PA_A,0,0,358,24344

...

As shown in the header, the fields of each record are separated by commas and are, in order: the start time, duration, protocol, source IP address, source port, flow direction, destination address, destination port, protocol state, source type of service, destination type of service, total packet count and total byte count of the netflow flow.
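Records in this layout can be read with Python's standard `csv` module. The sketch below assumes each record occupies a single line, with the date and time separated by one space, and that the first line is the header shown above. The `Flow` class and its `quadruple` property are hypothetical; in particular, including the destination IP in the grouping key is an assumption, since the text lists only source IP, destination port and protocol.

```python
import csv
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Flow:
    start: datetime
    duration: float
    proto: str
    src_ip: str
    dst_ip: str
    dst_port: str
    total_bytes: int

    @property
    def quadruple(self):
        # Assumed grouping key: (source IP, destination IP, destination port, protocol).
        return (self.src_ip, self.dst_ip, self.dst_port, self.proto)


def parse_flows(lines):
    """Parse comma-separated netflow lines whose first line is the header."""
    for row in csv.DictReader(lines):
        yield Flow(
            start=datetime.strptime(row["StartTime"].strip(), "%Y/%m/%d %H:%M:%S.%f"),
            duration=float(row["Dur"]),
            proto=row["Proto"].strip(),
            src_ip=row["SrcAddr"].strip(),
            dst_ip=row["DstAddr"].strip(),
            dst_port=row["Dport"].strip(),
            total_bytes=int(row["TotBytes"]),
        )
```

The destination port is kept as a string because port fields in flow exports are not guaranteed to be numeric for every protocol.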

From these flow data, the method's custom flow-state lookup table generates a flow state chain sorted in descending order of flow start time, with the following result:

990i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0i0

This state chain represents the typical communication traffic model of a botnet built with DarkComet RAT whose zombie host is connected to the C2 server and idle, and it is stored as a known botnet traffic model, namely:

From-Botnet-TCP-PLASMARAT1.5.DARKCOMET5.3-Windows.idle-1
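The patent does not disclose its flow-state lookup table, so the following Python sketch only illustrates the general idea of encoding size, duration and periodicity as symbols. Every threshold, letter and symbol here is hypothetical and is not expected to reproduce the chain shown above; flows are processed from earliest to latest, which is also an assumption.

```python
# Hypothetical thresholds and symbols; the patent's own table is not disclosed.
SIZE_EDGES = (250, 1100)          # bytes: small / medium / large
DURATION_EDGES = (0.1, 10.0)      # seconds: short / medium / long
LETTERS = ("abc", "def", "ghi")   # row = size bucket, column = duration bucket


def bucket(value, edges):
    """Index of the first edge that value does not exceed (len(edges) if none)."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def periodicity_symbol(gap_prev, gap_cur, tolerance=0.05):
    """'0' when consecutive start-time gaps match within tolerance, else '.'."""
    if gap_prev is None or gap_cur is None or gap_prev == 0:
        return ""   # not enough history to judge periodicity
    return "0" if abs(gap_cur - gap_prev) / gap_prev <= tolerance else "."


def state_chain(flows):
    """flows: iterable of (start_seconds, duration_seconds, total_bytes)."""
    chain, prev_start, prev_gap = [], None, None
    for start, duration, size in sorted(flows):
        gap = None if prev_start is None else start - prev_start
        chain.append(periodicity_symbol(prev_gap, gap))
        chain.append(LETTERS[bucket(size, SIZE_EDGES)][bucket(duration, DURATION_EDGES)])
        prev_start, prev_gap = start, gap
    return "".join(chain)
```

With this encoding, a long-lived keep-alive connection that repeats at a steady interval collapses into a short repeating pattern, which is the property that makes a first-order Markov model a reasonable fit.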

2. Validity verification: in an unknown network environment, the model extracted in step 1 is used for traffic detection

The detection environment simulates a general office network containing a number of normal hosts, among which two zombie hosts and one C2 server are deployed. The IP addresses of the zombie hosts are 192.168.80.190 and 192.168.80.191, and the address of the C2 server is 192.168.80.134. All traffic sent by hosts in the network is collected and exported to the botnet traffic detection device through a switch mirror port. The argus tools process the received mirrored traffic into netflow form (as described in step 1) with the following Linux commands:

argus -F ./argus.conf

ra -f ./ra.conf -n -Z b -S 127.0.0.1:561 -L -1 - tcp or udp

The resulting netflow data are transmitted to a detection program containing the pre-trained botnet state chain models, which analyzes and inspects the received flow data. The network topology is shown in FIG. 3.

The execution result of the detection command is shown in FIG. 4.

As can be seen from the detection results in FIG. 4, by analyzing the incoming netflow data the detection application successfully detected the communication traffic between the zombie hosts and the C2 server in the network: traffic sent from 192.168.80.190 and 192.168.80.191, with destination IP 192.168.80.134, destination port 1604 and protocol TCP. All zombie hosts of this botnet type present in the network (2 in total) were successfully detected.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A botnet traffic detection method based on a Markov model, characterized in that the method comprises the following steps:

(1) acquiring and storing flow data;

(4) storing the detection result and the flow data in a database.

2. The Markov model-based botnet traffic detection method of claim 1, wherein: in step (1), bidirectional flows in NetFlow form are used as input data, and the data are stored in a queue after being input and then taken out one by one.

3. The Markov model-based botnet traffic detection method of claim 1, wherein: step (2) specifically comprises:

4. The Markov model-based botnet traffic detection method of claim 1, wherein: step (2) further comprises

checking whether the source IP, the destination port and the protocol of the quadruple are the same as those of any existing whitelist rule; if they are, marking the corresponding quadruple as a normal connection, all of whose flows are normal traffic;

5. The Markov model-based botnet traffic detection method of claim 1, wherein: step (2) further comprises

6. The Markov model-based botnet traffic detection method of claim 1, wherein: step (3) specifically comprises the following steps:

7. The Markov model-based botnet traffic detection method of claim 6, wherein: the similarity threshold is set to 1.1.

8. A botnet traffic detection device based on a Markov model, characterized by comprising:

a data acquisition unit, used for acquiring and storing flow data;

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the Markov model-based botnet traffic detection method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the Markov model-based botnet traffic detection method according to any one of claims 1 to 7.