CN108494746B

CN108494746B - Method and system for detecting abnormal flow of network port

Info

Publication number: CN108494746B
Application number: CN201810187959.9A
Authority: CN
Inventors: 李明哲; 涂波; 刘丙双; 戴帅夫; 张建宇; 李少华; 闻博; 梅锋; 李莉; 蒋志鹏; 周模; 冯婷婷; 尚秋里; 张洛什; 李传海; 方喆君; 孙中豪
Original assignee: Chang'an Communication Technology Co ltd; National Computer Network and Information Security Management Center
Current assignee: Chang'an Communication Technology Co ltd; National Computer Network and Information Security Management Center
Priority date: 2018-03-07
Filing date: 2018-03-07
Publication date: 2020-08-25
Anticipated expiration: 2038-03-07
Also published as: CN108494746A

Abstract

The invention discloses a method and a system for detecting abnormal traffic on network ports. The method comprises the following steps: 1) reading the communication session log stream from the target data platform, grouping and aggregating it by source port number and by destination port number, and then computing traffic index data for each port to form a traffic sequence for that port; 2) forming an input vector for each port from its traffic sequence and feeding the input vector into an LSTM network to obtain a predicted traffic value for the port at time t; comparing the predicted value at time t with the observed value; if the deviation between the two exceeds the set condition, determining that the port's traffic is abnormal; 3) for a port with abnormal traffic, qualifying the anomaly according to all recent traffic logs of the port and preset rules, and thereby determining the traffic anomaly event of the port; if the preset rules cannot qualify the anomaly, feeding the extracted traffic logs into a trained machine learning model to classify the anomaly and identify the traffic anomaly event of the port.

Description

Method and system for detecting abnormal flow of network port

Technical Field

The invention relates to the fields of big data, network security, deep learning and the like, in particular to a method and a system for detecting network port flow abnormity.

Background

Today's Internet faces many security threats. For example, distributed denial-of-service (DDoS) attacks cause severe losses to the websites and devices of organizations. In a DDoS attack, multiple computers are combined into an attack platform by means of client/server technology and launch attacks on one or more targets, multiplying the power of the denial-of-service attack.

DDoS attacks are often launched by botnets. A botnet is a controllable network consisting of hosts infected with bot programs. An attacker sends instructions to the zombie hosts through a command and control (C&C) channel in order to carry out network attacks and crimes such as information theft and denial-of-service attacks. Since botnets emerged in the late 1990s, their architecture and form have evolved from simple centralized C&C to P2P-based distributed C&C, and the domain names they use have evolved from fixed domain names to automatically generated ones (Domain Generation Algorithm, DGA).

To address the threat of botnets, governments, enterprises and scientific research institutions are joining forces to detect and combat botnets. An important means of botnet detection is to collect and analyze Internet traffic and discover abnormal features, thereby screening out and pinpointing botnet members and then taking countermeasures. Effective analysis of Internet traffic not only helps to discover botnet members, but can also detect other malicious network traffic, such as malicious domain name requests, malicious file propagation, malicious link access, and DDoS attacks.

The discovery of network traffic anomalies relies on predicting normal traffic fluctuations. If the traffic fluctuation on every network port is observed across a wide area network, the traffic as a whole can be viewed as a 65,536-dimensional time series. The inventors have observed that such sequences exhibit periodicity at multiple frequencies as well as occasional fluctuations. Malicious network events often cause significant traffic fluctuations. To model such complex network traffic time series, a Recurrent Neural Network (RNN) can be considered.

The Long Short-Term Memory (LSTM) neural network is a special type of RNN that can learn long-term dependencies. LSTM was proposed by Hochreiter & Schmidhuber (1997) and subsequently improved and popularized by Alex Graves. LSTM has been widely and successfully applied to speech recognition, speech synthesis, connected handwriting recognition, time series prediction, image caption generation, end-to-end machine translation, and other fields. Through deliberate design, LSTM avoids the vanishing and exploding gradients that long-term dependencies cause during neural network training, and can capture how features of the data relate across earlier and later positions in a sequence.

Disclosure of Invention

The invention provides a method and a system for detecting network port traffic anomalies, named cPortMon, which discovers network port traffic anomalies in real time. cPortMon is a subsystem of the cNeTS network security monitoring and analysis system. The cNeTS system deploys network traffic acquisition probes at a large number of backbone network ingress and egress points and stores the collected traffic in a big data platform for basic data.

The technical scheme of the invention is as follows:

a method for detecting abnormal flow of a network port comprises the following steps:

1) reading the log flow of the communication session in the target data platform, grouping and summarizing according to the source port number and the destination port number, and then counting the flow index data of each port to form a flow sequence of the corresponding port;

2) forming an input vector of each port according to the flow sequence of each port, and inputting the input vector into an LSTM network to obtain a flow predicted value of the port at the moment t; comparing the flow predicted value of the port moment t with the observed value of the port moment t; if the deviation of the two is greater than the set condition, determining that the flow of the port is abnormal;

3) for a port with abnormal flow, the threat finding module extracts all recent flow logs of the port from the target data platform, qualitatively determines the flow abnormality of the port according to the extracted flow logs and a preset rule, and judges a flow abnormality event of the port; and if the flow abnormity of the port cannot be qualitatively determined according to a preset rule, inputting the extracted flow log into a trained machine learning model to classify the flow abnormity of the port, and identifying the flow abnormity event of the port.

Further, the set condition is as follows:

wherein o(t) is the observed value of the port at time t, and p(t) is the predicted value of the port at time t; o(τ) is the observed value of the port at time τ, p(τ) is the predicted value of the port at time τ, T is the observation period length, m is a natural number, and k₁ and k₂ are scaling factors.

Further, k₁ and k₂ both take the value 2.

Further, the method by which the threat discovery module determines whether the traffic anomaly event of the port is a botnet event includes: when the extracted traffic logs show a single host initiating single-SYN-packet connections to the same port on a large number of hosts, that traffic source is determined to be a scanning source; and when scanning sources targeting the same network port appear on a large scale, the traffic anomaly event is determined to be an active botnet event.

Further, the method for determining the botnet machines in the botnet events comprises the following steps: determining a host which is regarded as a scanning source by the threat finding module as a zombie machine, and determining a host which periodically or quasi-periodically sends a request to the same unknown domain name as the zombie machine; the unknown domain name is a domain name outside a set known domain name list.

Further, the method for determining whether a traffic source periodically or quasi-periodically initiates a request to the same unknown domain name includes: acquiring a request resolution domain name set of the flow source, and filtering out known domain names; the following is then performed for each domain name remaining in the request resolution domain name set:

61) if the number N_d of resolution request events issued by the flow source for the current domain name d is below a threshold k₅, ignoring all resolution request events for the domain name d and finishing the processing of the domain name d; otherwise, going to step 62);

62) clustering all of the intervals between the flow source's resolution request events for the domain name d with the DBSCAN algorithm, so that resolution request events with the same interval value fall into one class; if a clustering result C satisfies |C| > k₆·N_d, determining that the flow source periodically initiates requests to the domain name d, taking the mean value u of the resolution request interval values in the clustering result C as the domain name request period, and going to step 64); otherwise, going to step 63); k₆ takes a value of 0.9 to 0.98;

63) if a plurality of clustering results C_i, i = 1, 2, ..., N_c, appear, denoting the mean of the resolution request interval values in each clustering result as u_i and taking u_min = min u_i; if every u_i is approximately a multiple of u_min, determining that the flow source quasi-periodically initiates requests to the domain name d with period u_min; otherwise, finishing the processing of the domain name d;

64) determining that the flow source periodically or quasi-periodically initiates requests to the domain name d, the domain name d being a master control domain name.

A network port flow abnormity detection system, characterized by comprising a flow acquisition module, a flow analysis module, a flow prediction module, an abnormity judgment module and a threat discovery module; wherein:

the flow acquisition module is used for acquiring the communication session log flow in the target data platform;

the flow analysis module is used for grouping and summarizing the collected flow of the communication session logs according to the source port number and the destination port number, then counting the flow index data of each port and forming a flow sequence of the corresponding port;

the flow prediction module is used for forming an input vector of each port according to the flow sequence of each port, and inputting the input vector into the LSTM network to obtain a flow prediction value of the port at the moment t;

the anomaly judgment module is used for comparing the flow predicted value at the port moment t with the observed value at the port moment t, and if the deviation between the two is greater than a set condition, determining that the flow of the port is abnormal;

the threat finding module is used for extracting all recent flow logs of the port from the target data platform for the port with abnormal flow, and qualitatively determining the abnormal flow of the port according to the extracted flow logs and a preset rule so as to judge the abnormal flow event of the port; and if the flow abnormity of the port cannot be qualitatively determined according to a preset rule, inputting the extracted flow log into a trained machine learning model to classify the flow abnormity of the port, and identifying the flow abnormity event of the port.

The structure of cPortMon is shown in FIG. 1. The real-time analysis program of cPortMon reads the communication session log stream from the big data platform, groups and aggregates it by source port number and by destination port number, and computes traffic indexes such as the number of connections and the number of bytes over a preset period length T to form a traffic sequence. It is recommended that T be set to 1 hour.
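The grouping step can be illustrated with a minimal sketch. The record layout (`ts`, `src_port`, `dst_port`, `bytes`) and the function name `aggregate_port_traffic` are assumptions made for this example; the actual log schema of the platform is not described in the text.

```python
from collections import defaultdict

PERIOD_SECONDS = 3600  # T = 1 hour, the period length recommended in the text


def aggregate_port_traffic(sessions, period=PERIOD_SECONDS):
    """Aggregate session logs per (direction, port) and per period.

    `sessions` is an iterable of dicts with hypothetical keys:
    'ts' (epoch seconds), 'src_port', 'dst_port', 'bytes'.
    Returns {(direction, port): {period_start: {'connections': n, 'bytes': b}}}.
    """
    series = defaultdict(lambda: defaultdict(lambda: {"connections": 0, "bytes": 0}))
    for s in sessions:
        # Align the timestamp to the start of its period.
        period_start = (s["ts"] // period) * period
        # Each session counts once for its source port and once for its destination port.
        for direction, port in (("src", s["src_port"]), ("dst", s["dst_port"])):
            cell = series[(direction, port)][period_start]
            cell["connections"] += 1
            cell["bytes"] += s["bytes"]
    return series


sessions = [
    {"ts": 10, "src_port": 50000, "dst_port": 23, "bytes": 60},
    {"ts": 20, "src_port": 50001, "dst_port": 23, "bytes": 60},
    {"ts": 3700, "src_port": 50002, "dst_port": 23, "bytes": 120},
]
series = aggregate_port_traffic(sessions)
```

Keeping source-port and destination-port series separate mirrors the statement that traffic is aggregated per port in each direction.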

The traffic sequence of each port in each direction is observed and recorded over the long term to form an observation sequence used for long-term online training of the LSTM network:

o = o(t), t = 0, T, 2T, ...

Based on historical observations, the LSTM generates a sequence of predicted values for the traffic of each port starting from time mT:

p = p(t) = p(o(t-T), o(t-2T), ...), t = mT, (m+1)T, ...
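As an illustration of how each prediction depends on the preceding observations, the sketch below builds sliding-window input vectors from one port's observation sequence. It shows only the windowing; the LSTM itself is omitted. The function name `make_windows` and the use of exactly m past values per input are assumptions made for this example.

```python
def make_windows(observations, m):
    """Return (inputs, targets) for one port.

    Each input holds the m observations o(t-mT), ..., o(t-T) that precede
    its target o(t), so the first target is the observation at time mT.
    """
    inputs, targets = [], []
    for i in range(m, len(observations)):
        inputs.append(observations[i - m:i])
        targets.append(observations[i])
    return inputs, targets


hourly_connections = [5, 6, 7, 8, 9, 10]
X, y = make_windows(hourly_connections, m=3)
```

With T set to 1 hour, a window spanning more than one week would correspond to m greater than 168.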

It is recommended that mT exceed one week. If the observed value o(t) of a port at time t is far greater than the corresponding predicted value p(t), a port traffic anomaly event is alarmed:

The thresholds k₁ and k₂ control the anomaly trigger condition; a value of 2 is recommended for both.

The formula triggers an anomaly under two conditions, which tolerates the prediction error of the LSTM network to a certain extent and reduces misjudgments caused by prediction error.
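The exact inequality is not reproduced in this text, so the following is only a hypothetical two-part check in the same spirit, not the patented condition. It assumes that an alarm requires both (a) o(t) > k₁·p(t) and (b) o(t) - p(t) > k₂ times the mean absolute prediction error over the previous m periods. The function name, the choice of mean absolute error, and the requirement that both parts hold are all assumptions.

```python
def is_anomalous(obs_t, pred_t, past_obs, past_pred, k1=2.0, k2=2.0):
    """Hypothetical two-part trigger (an assumed form, see the lead-in).

    past_obs / past_pred hold o(tau) and p(tau) for the previous m periods.
    """
    if not past_obs or len(past_obs) != len(past_pred):
        raise ValueError("past_obs and past_pred must be non-empty and equally long")
    # Typical size of the predictor's recent error on this port.
    mean_abs_err = sum(abs(o - p) for o, p in zip(past_obs, past_pred)) / len(past_obs)
    exceeds_ratio = obs_t > k1 * pred_t                  # assumed part (a)
    exceeds_error = (obs_t - pred_t) > k2 * mean_abs_err  # assumed part (b)
    return exceeds_ratio and exceeds_error


spike = is_anomalous(300, 100, [100, 110, 90], [105, 100, 95])
mild = is_anomalous(150, 100, [100, 110, 90], [105, 100, 95])
noisy_port = is_anomalous(30, 10, [10, 50], [40, 10])
```

Under these assumptions, a port whose predictions have recently been poor (the `noisy_port` case) does not alarm even though the observation is three times the prediction, which is one way a prediction error could be tolerated.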

Other modules of cNeTS also play an important role. The overall cNeTS architecture is shown in FIG. 2. cNeTS adopts a Mon-Mine framework for module design. Mon modules monitor various network entities quickly and in real time based on raw data, and Mine modules perform deep mining based on the anomalies found by the Mon modules. Besides cPortMon, other Mon modules such as cHostMon, cNameMon and cLinkMon monitor network traffic from the perspective of IP, domain name and URL respectively, and discover the corresponding traffic anomaly events in real time. After a Mon module discovers an abnormal event, the relevant data is submitted to the threat discovery module, which further verifies whether a malicious network event exists and qualifies it. According to the classification of the malicious event, the event's data is distributed to one of several mining and tracing modules for further information mining. For example, a botnet event is assigned to cBotMine, a malicious file propagation event is assigned to cMalMon, and a DDoS attack event is assigned to cDoSMon. The threat intelligence library provides intelligence support for the threat discovery module, and the various Mine modules feed intelligence back to the threat intelligence library. In addition, the threat intelligence library supports importing intelligence from external data sources.

When cPortMon finds that port traffic is abnormal, the threat discovery module extracts all recent traffic logs of the relevant ports from the data management platform in order to qualify the event. The qualification strategy is to first submit the logs to preset rules for judgment; if the preset rules cannot reach a judgment, the logs are submitted to a trained machine learning model (a random decision forest) for classification. The threat discovery module architecture is shown in FIG. 3.
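The rules-first, model-second strategy can be sketched as a small dispatcher. Everything named here (`qualify_event`, `many_syn_rule`, `stub_classifier`, the `tcp_flags` field and the 0.9 ratio) is hypothetical; in particular the classifier is a stand-in callable, not a trained random forest.

```python
def qualify_event(flow_logs, rules, classifier):
    """Apply preset rules in order; fall back to the classifier if none decides.

    Each rule returns a label, or None when it cannot decide.
    Returns (label, decided_by).
    """
    for rule in rules:
        label = rule(flow_logs)
        if label is not None:
            return label, "rule"
    return classifier(flow_logs), "model"


def many_syn_rule(flow_logs):
    """Hypothetical rule: label as 'scan' when over 90% of logs are bare SYNs."""
    syn_only = sum(1 for log in flow_logs if log["tcp_flags"] == "S")
    return "scan" if flow_logs and syn_only / len(flow_logs) > 0.9 else None


def stub_classifier(flow_logs):
    """Stand-in for a trained model; always returns the same label."""
    return "unknown"


scan_logs = [{"tcp_flags": "S"}] * 10
mixed_logs = [{"tcp_flags": "S"}, {"tcp_flags": "SA"}, {"tcp_flags": "A"}]
by_rule = qualify_event(scan_logs, [many_syn_rule], stub_classifier)
by_model = qualify_event(mixed_logs, [many_syn_rule], stub_classifier)
```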

For botnets, the present invention proposes the following rules:

when it appears that a single host initiates a single SYN packet connection to the same port of a large number of hosts, the source of the traffic (i.e., the single host initiating a single SYN packet connection to the same port of a large number of hosts) can be directly considered to be the source of the scan.

When a scan source for the same network port occurs that exceeds a set threshold, an active botnet propagation event is identified.

If the threat intelligence library holds records of vulnerabilities related to the port and of known botnets exploiting those vulnerabilities, the botnet propagation event can be mapped to a specific known botnet.
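The first two rules can be sketched as follows. The log fields (`src_ip`, `dst_ip`, `dst_port`, `tcp_flags`, `packets`) and both threshold parameters are assumptions for illustration; the text does not give concrete threshold values.

```python
from collections import defaultdict


def find_scan_sources(flow_logs, target_threshold=100):
    """Return {port: set of source IPs} for hosts that sent a single SYN packet
    to the same port on more than `target_threshold` distinct hosts."""
    targets = defaultdict(set)  # (src_ip, dst_port) -> distinct destination IPs
    for log in flow_logs:
        if log["tcp_flags"] == "S" and log["packets"] == 1:
            targets[(log["src_ip"], log["dst_port"])].add(log["dst_ip"])
    sources = defaultdict(set)
    for (src_ip, port), dst_ips in targets.items():
        if len(dst_ips) > target_threshold:
            sources[port].add(src_ip)
    return dict(sources)


def active_botnet_ports(scan_sources, source_threshold=10):
    """Ports whose number of scan sources exceeds the set threshold."""
    return sorted(p for p, srcs in scan_sources.items() if len(srcs) > source_threshold)


# Three hosts each probe port 23 on five hosts; one host has a normal session on port 80.
logs = [
    {"src_ip": f"10.0.0.{s}", "dst_ip": f"192.0.2.{d}", "dst_port": 23,
     "tcp_flags": "S", "packets": 1}
    for s in range(3) for d in range(5)
]
logs.append({"src_ip": "10.0.0.9", "dst_ip": "192.0.2.1", "dst_port": 80,
             "tcp_flags": "SA", "packets": 12})
scan = find_scan_sources(logs, target_threshold=3)
ports = active_botnet_ports(scan, source_threshold=2)
```

The toy thresholds are deliberately tiny so the example fits in a few lines; on backbone traffic they would presumably be far larger.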

cBotMine is responsible for further analysis of botnet-related traffic anomalies. Its tasks include Bot detection, Bot profiling and master control tracing, as shown in FIG. 4.

The Bot detection function extracts the list of zombie machines from the abnormal port traffic, mainly according to the following two criteria:

a host that is recognized by the threat discovery module as a scanning source;

hosts that initiate requests for the same unknown domain name periodically or quasi-periodically.

The periodicity discovery algorithm operates on the sequence of time intervals between resolution requests for the same domain name. The set of domain names that the flow source requests to resolve is obtained, known domain names are filtered out using the Alexa top 10000 list, and the following operations are performed on each remaining domain name:

1. If the number N_d of resolution request events for the current domain name d is below the threshold k₅, all resolution request events for this domain name are ignored and the calculation exits. Otherwise, go to step 2. The suggested value of k₅ is between 50 and 100.

2. Cluster all of the intervals using the DBSCAN algorithm, with the distance tolerance set to 1 minute. If most of the interval values are grouped into one class C such that |C| > k₆·N_d, the requests are deemed periodic, the class mean u is taken as the domain name request period, and the algorithm goes to step 4. Otherwise, go to step 3. The suggested value of k₆ is between 0.9 and 0.98.

3. If multiple distinct classes C_i, i = 1, 2, ..., N_c, are present, denote the means of these classes as u_i and take u_min = min u_i. If every u_i is approximately a multiple of u_min, that is, every u_i satisfies one of the following two conditions:

⌈u_i/u_min⌉ - u_i/u_min < ε

then the requests are also deemed periodic with period u_min, and the algorithm goes to step 4. If the condition for periodicity is not met, the calculation exits. The suggested value of ε is below 0.1.

4. The current domain name is determined to be a master control domain name and is submitted to the threat intelligence library. In addition, whether the domain name is a DGA domain name can be detected, which is outside the scope of this discussion.
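The four steps above can be sketched in a few lines. Two simplifications are assumptions of this example: a one-dimensional gap-based clustering (`cluster_1d`) stands in for DBSCAN, and the "approximately a multiple" test is implemented as "the ratio u_i/u_min lies within ε of the nearest integer". The function names and the use of seconds as the time unit are also hypothetical.

```python
def cluster_1d(values, tol):
    """Group sorted values so that neighbours within a cluster differ by at most tol.

    A simple stand-in for DBSCAN on one-dimensional data (an assumption).
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def request_period(timestamps, k5=50, k6=0.9, tol=60.0, eps=0.1):
    """Return the estimated request period in seconds, or None if not periodic."""
    n_d = len(timestamps)
    if n_d < k5:                      # step 1: too few events, ignore the domain
        return None
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    clusters = cluster_1d(intervals, tol)
    largest = max(clusters, key=len)
    if len(largest) > k6 * n_d:       # step 2: one dominant class, |C| > k6 * N_d
        return sum(largest) / len(largest)
    means = [sum(c) / len(c) for c in clusters]
    u_min = min(means)                # step 3: several classes
    if u_min <= 0:
        return None
    for u in means:
        ratio = u / u_min
        if abs(ratio - round(ratio)) >= eps:
            return None               # some class mean is not near a multiple of u_min
    return u_min


strict = [i * 300 for i in range(100)]   # one request every 5 minutes
skipping = [0]                           # same beacon, every third request delayed
for i in range(99):
    skipping.append(skipping[-1] + (600 if i % 3 == 0 else 300))
irregular = [0]                          # intervals of 300 s and 450 s
for i in range(99):
    irregular.append(irregular[-1] + (450 if i % 2 == 0 else 300))
```

Here `strict` is caught by step 2, `skipping` by step 3 (class means 300 s and 600 s), and `irregular` is rejected because 450/300 = 1.5 is not close to an integer.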

The Bot profiling module performs the following calculations on the Bots found by the Bot detection module:

1. Determining the onset time. The occurrence times of the abnormal behaviors are analyzed, and the earliest is taken as the onset time.

2. Profiling the healthy state. For the period before the onset time, characteristics of the host such as the communication attribution distribution, protocol type distribution, local known port frequency distribution and remote known port frequency distribution are computed.

3. Tracing the infection source. Assuming that the infection time and the onset time are close, with no latency period, all events within a 5-minute time window before the onset time are extracted. For a port-intrusion botnet, the scan events and successful connection events on the corresponding port are examined, and the sources of successful connections are determined to be infection sources. For other types of botnets, non-conforming sessions (those that do not fit the host's normal behavior) are detected within the time window, and their remote IPs are taken as suspected scan sources.

4. Tracing the victimizing behavior. For a port-intrusion botnet, the pattern of its outbound scanning is mined, successful port scan events are found, and the targets are added to the list of new victims.

5. The lists of suspected infection sources and victims are fed back to the threat intelligence library.
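Steps 1 and 3 for a port-intrusion botnet can be sketched as below. The event fields (`ts`, `remote_ip`, `local_port`, `connected`) and the function names are assumptions for illustration; only the 5-minute window comes from the text.

```python
WINDOW_SECONDS = 300  # the 5-minute window given in the text


def onset_time(anomaly_times):
    """Step 1: the earliest abnormal behaviour marks the onset time."""
    return min(anomaly_times)


def events_before_onset(events, onset, window=WINDOW_SECONDS):
    """All events in the window of length `window` that ends at the onset time."""
    return [e for e in events if onset - window <= e["ts"] < onset]


def suspected_infection_sources(events, onset, port, window=WINDOW_SECONDS):
    """Step 3 (port-intrusion case): remote IPs that successfully connected
    to `port` on the Bot shortly before the onset time."""
    recent = events_before_onset(events, onset, window)
    return sorted({e["remote_ip"] for e in recent
                   if e["local_port"] == port and e["connected"]})


events = [
    {"ts": 100, "remote_ip": "198.51.100.7", "local_port": 23, "connected": True},
    {"ts": 900, "remote_ip": "198.51.100.8", "local_port": 23, "connected": False},
    {"ts": 950, "remote_ip": "198.51.100.9", "local_port": 23, "connected": True},
]
onset = onset_time([1000, 1500, 1200])
sources = suspected_infection_sources(events, onset, port=23)
```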

The master control tracing module computes communication relationships based on the known Bots and attempts to discover the upper-level master control. The following criteria are used:

For a master control domain name found by the Bot detection module, its resolved value is deemed the master control IP address. If the resolved value of the master control domain name cannot be found through active resolution or passive monitoring, the connection addresses that the Bot initiates after the resolution request are observed.

If multiple Bots communicate multiple times with the same unknown port Pc of the same host Hc, Hc:Pc is determined to be the master control address.

If a known Bot periodically connects to the same unknown port Pc of the same host Hc, Hc:Pc is determined to be the master control address. The periodicity discovery algorithm reuses the corresponding implementation in the Bot detection module.
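The "multiple Bots, multiple contacts" criterion can be sketched as a counting pass over connection records. The tuple layout `(src_ip, dst_ip, dst_port)`, the set of known ports, and the two minimum counts are assumptions for this example, since the text does not quantify "multiple".

```python
from collections import defaultdict


def candidate_masters(connections, known_bots, known_ports, min_bots=2, min_contacts=2):
    """Return sorted (Hc, Pc) pairs contacted at least `min_contacts` times
    by at least `min_bots` distinct known Bots on a port outside `known_ports`."""
    contacts = defaultdict(lambda: defaultdict(int))  # (Hc, Pc) -> bot -> contact count
    for src_ip, dst_ip, dst_port in connections:
        if src_ip in known_bots and dst_port not in known_ports:
            contacts[(dst_ip, dst_port)][src_ip] += 1
    masters = []
    for endpoint, per_bot in contacts.items():
        repeat_bots = [bot for bot, count in per_bot.items() if count >= min_contacts]
        if len(repeat_bots) >= min_bots:
            masters.append(endpoint)
    return sorted(masters)


bots = {"10.0.0.1", "10.0.0.2"}
connections = (
    [("10.0.0.1", "203.0.113.5", 6667)] * 2     # both Bots contact the same endpoint twice
    + [("10.0.0.2", "203.0.113.5", 6667)] * 2
    + [("10.0.0.1", "203.0.113.9", 80)] * 3     # known port, ignored
    + [("10.0.0.1", "203.0.113.7", 9999)] * 2   # only one Bot, not enough
)
masters = candidate_masters(connections, bots, known_ports={80, 443})
```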

Compared with the prior art, the invention has the following positive effects:

Most research on abnormal traffic mining targets enterprise intranet environments, whereas the present invention targets the wide area network environment. Only a few organizations, such as large operators and CERTs, can analyze and monitor traffic in a wide area network environment. In such a scenario, the statistical characteristics of the traffic tend to be pronounced and incidental events are smoothed out, which facilitates the discovery of abnormal events.

The invention has many advantages:

1. The invention does not rely on signature matching to discover new threats, and can therefore discover unknown threats.

2. The invention analyzes traffic by passive observation, does not interfere with the Internet itself, and is invisible to botnets.

3. The method of the invention can easily be scaled out to larger clusters to monitor larger volumes of traffic.

4. cNeTS adopts a framework that separates Mon from Mine, and each module has clear responsibilities, which benefits engineering development, maintenance and upgrading. Mon modules are suited to simple, fast stream processing tasks, and Mine modules are suited to complex offline mining tasks, which both ensures the throughput of the system and supports the implementation of complex algorithms.

Drawings

FIG. 1 is a schematic diagram of a cPortMon module;

FIG. 2 is a diagram of the overall architecture of the cNeTS;

FIG. 3 is a flow diagram of a threat discovery module execution;

FIG. 4 is a schematic diagram of the cBotMine module.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail.

The traffic collection module of cNeTS is implemented on high-performance servers fitted with multiple 10GbE network cards and runs the DPDK framework to achieve high-speed traffic collection. Network traffic is exported from backbone routers and brought in by mirroring and splitting. The collection module summarizes traffic into NetFlow summary format and outputs it to the cPortMon and cHostMon modules; for DNS response packets it derives summary fields such as the domain name, source and destination IP, and timestamp and outputs them to the cNameMon module; and for HTTP request packets it derives summary information such as the URL and source and destination IP and outputs it to the cLinkMon module.

The summarized traffic is delivered to each Mon module through Apache Kafka. Each Mon module can process the traffic data in real time using Spark Streaming, and can also store a copy of the Kafka output on the Hadoop platform for later access through Hive when running offline mining tasks.

The real-time stream processing program of cPortMon accumulates the traffic of each port in each period and stores the summary result when the period ends. The offline mining program of cPortMon reads the summary result of each period. Excluding the traffic summaries of the most recent 24 periods, it constructs a time series for the traffic of each port and uses the TensorFlow framework to predict the traffic of the most recent 24 periods. The predicted traffic is compared with the observed traffic values of the most recent 24 periods; if the deviation is too large, an abnormal event is declared and submitted to the threat discovery module for processing.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A method for detecting abnormal flow of a network port comprises the following steps:

2) forming an input vector of each port according to the flow sequence of each port, and inputting the input vector into an LSTM network to obtain a flow predicted value of the port at the moment t; comparing the flow predicted value of the port moment t with the observed value of the port moment t; if the deviation of the two is greater than the set condition, determining that the flow of the port is abnormal; wherein the setting conditions are as follows:

wherein o(t) is the observed value of the port at time t, and p(t) is the predicted value of the port at time t; o(τ) is the observed value of the port at time τ, p(τ) is the predicted value of the port at time τ, T is the observation period length, m is a natural number, and k₁ and k₂ are proportionality coefficients;

2. The method of claim 1, wherein k₁ and k₂ both take the value 2.

3. The method of claim 1, wherein the threat discovery module determines whether the traffic anomaly event of the port is a botnet event by: when a single host initiates single SYN packet connection aiming at the same ports of a large number of hosts in the extracted flow log, the flow source is determined as a scanning source; and when the scanning source aiming at the same network port exceeds a set threshold, determining that the abnormal flow event is an active botnet event.

4. The method of claim 3, wherein the method of determining zombie machines in a zombie network event comprises: determining a host which is regarded as a scanning source by the threat finding module as a zombie machine, and determining a host which periodically or quasi-periodically sends a request to the same unknown domain name as the zombie machine; the unknown domain name is a domain name outside a set known domain name list.

5. The method of claim 4, wherein the method of determining whether a traffic source periodically or quasi-periodically makes requests for the same unknown domain name comprises: acquiring a request resolution domain name set of the flow source, and filtering out known domain names; the following is then performed for each domain name remaining in the request resolution domain name set:

6. A network port flow abnormity detection system, characterized by comprising a flow acquisition module, a flow analysis module, a flow prediction module, an abnormity judgment module and a threat discovery module; wherein:

an abnormality judgment module for comparing the predicted flow value of the port at time t with the observed value of the port at time t, and determining that the flow of the port is abnormal if the deviation between the two is greater than a set condition; wherein the set condition is as follows:

7. The system of claim 6, wherein the threat discovery module determines whether the traffic anomaly event for the port is a botnet event by: when a single host initiates single SYN packet connection aiming at the same ports of a large number of hosts in the extracted flow log, the flow source is determined as a scanning source; and when the scanning source aiming at the same network port exceeds a set threshold, determining that the abnormal flow event is an active botnet event.

8. The system of claim 7, further comprising a Bot detection module for determining bots in a botnet event by: determining a host which is regarded as a scanning source by the threat finding module as a zombie machine, and determining a host which periodically or quasi-periodically sends a request to the same unknown domain name as the zombie machine; the unknown domain name is a domain name outside a set known domain name list.