TW201818285A

TW201818285A - FedMR-based botnet joint detection method that detects suspicious traffic and suspicious IPs before a botnet launches an attack, solves the low detection rate of single-region analysis, and achieves cross-regional joint security defense

Info

Publication number: TW201818285A
Application number: TW105135438A
Authority: TW
Inventors: 謝錫堃; 張志標; 王俊又; 徐晟旼
Original assignee: 國立成功大學
Priority date: 2016-11-02
Filing date: 2016-11-02
Publication date: 2018-05-16
Also published as: TWI596498B

Abstract

A FedMR-based botnet joint detection method uses an unsupervised machine learning algorithm to provide a general-purpose P2P botnet detection mechanism that does not require feature measurements for specific P2P botnets. The mechanism identifies large volumes of botnet traffic with similar behavior, so that both existing P2P botnets and new P2P botnets that emerge in the future can be marked. It does not inspect packet contents, which preserves data privacy and avoids the problems posed by packet encryption. It can also detect traces of communication between members of a P2P botnet during the latent stage, so that suspicious traffic and suspicious IPs are detected before the botnet launches an attack. In addition, P2P botnet communication may not be significant within a single region. The invention therefore uses the Fed-MR collaborative computing framework to perform cross-regional joint analysis, which solves the prior-art problem of low detection rates in a single region and achieves cross-regional joint security defense. In the future, the method can be applied to academic networks, network providers, and other autonomous systems to detect malicious network behavior, prevent botnet attacks, and strengthen network security.

Description

FedMR-based botnet joint detection method

The present invention relates to a FedMR-based cross-regional botnet joint detection method, and in particular to a method that uses an unsupervised machine learning algorithm to find large volumes of botnet traffic with similar behavior, so that both the P2P botnets that exist today and new P2P botnets that emerge in the future can be marked.

As hacking techniques have advanced, botnets have evolved from a centralized architecture to a decentralized P2P architecture, which makes them harder to detect and track. Current P2P botnet detection methods mainly analyze a single type of botnet, such as Waledac, Storm-like, Nugache-like, Sality, or ZeroAccess. They analyze that botnet's behavior and patterns in large volumes of network logs, mine its feature thresholds (Feature Threshold), and use those thresholds to detect malicious behavior in future network traffic. Each version of a P2P botnet usually needs its own set of thresholds, which causes two problems. First, as a P2P botnet's malware is updated and evolves, the feature thresholds of the new version may change and become harder to discern, so the botnet's traffic must be collected and analyzed again to establish new thresholds. Second, malware multiplies constantly, so a P2P botnet detection system would have to record the thresholds of every strain. It would end up like traditional virus signature (Virus Signature) detection of executable files, requiring a feature database that records an endless list of thresholds in order to detect all P2P botnet activity, which is impractical in an era of massive data.

Recent studies have begun to develop general-purpose methods that detect several P2P botnets at once. To reach this goal of general-purpose P2P botnet detection, they find the common features shared by different P2P botnets and then separate the malicious traffic by clustering or classification. However, as with the first problem above, once botnet behavior changes (for example, a malware version update, or a change of communication channel such as from IRC to HTTP), the same problem recurs: the user must re-collect network traffic logs, re-mine the common features of the P2P botnets, and recalibrate the thresholds of those features before new P2P botnets can be detected effectively; otherwise the misclassification rate of the whole system rises.

Earlier signature-based detection methods, such as Chinese patent cases CN 201510643971 and CN 105282152A and US patent US 8677487 B2, mostly rely on predefined rules and raise an alert only when a rule is matched. They cannot mark or filter unknown malware, so they apply only to known botnets and cannot identify new malicious botnets. Conventional methods therefore fail to meet users' practical needs.

The main object of the present invention is to overcome the above problems of the prior art and to provide a FedMR-based botnet joint detection method that uses an unsupervised machine learning algorithm to offer a general-purpose P2P botnet detection mechanism that does not target any specific P2P botnet. The method finds large volumes of botnet traffic with similar behavior, so that both the P2P botnets that exist today and new P2P botnets that emerge in the future can be marked.

A secondary object of the present invention is to provide a method that finds P2P botnet communication without measuring the features of individual P2P botnets in advance, that is, a general-purpose FedMR-based botnet joint detection method.

Another object of the present invention is to provide a FedMR-based botnet joint detection method that does not inspect packet contents, which preserves data privacy and avoids the problems posed by packet encryption.

A further object of the present invention is to provide a FedMR-based botnet joint detection method that detects the small amount of communication between members of a P2P botnet during the latent stage, so that suspicious traffic and suspicious IPs are detected before the botnet launches an attack.

Yet another object of the present invention is to provide a FedMR-based botnet joint detection method that performs cross-regional joint analysis through the Fed-MR collaborative computing framework. Combining the network traffic logs of multiple regions increases the total amount of information analyzed, solves the low detection rate of single-region analysis, and achieves cross-regional joint security defense.

Yet another object of the present invention is to provide a FedMR-based botnet joint detection method that can in the future be applied in autonomous systems such as academic networks and Internet service providers (ISPs) to detect malicious network behavior, prevent botnet attacks, and strengthen network security.

To achieve the above objects, the present invention is a FedMR-based botnet joint detection method comprising at least the following steps:

Traffic Extraction step: Several regional clouds (Region Clouds) each hold their own network traffic (NetFlow) log data. The logs are in NetFlow format, and each record is a unidirectional (uni-direction) network flow (Flow). Flows between the same two endpoints whose source IP (Src IP), source port (Src_port), destination IP (Dst IP), and destination port (Dst_port) mirror one another are merged into a single Session. Flows are merged according to a timeout: if the interval between any two unidirectional Flows is within a predefined range, they are merged and their statistics are accumulated into the Session. A feature vector (Feature Vector) is then built from all the information in the Session.

Filter step: This step comprises two sub-steps, Preprocessing Filtering and P2P Traffic Filtering. Preprocessing Filtering removes Sessions that match a predefined whitelist; a Session is removed if either of its IPs (source or destination) is on the whitelist. P2P Traffic Filtering then evaluates the loss rate of each Session, and only Sessions whose loss rate exceeds a preset threshold are kept for analysis. Removing whitelisted Sessions and low-loss-rate Sessions at this stage effectively reduces the amount of data to be analyzed.

Grouping step: This step has three levels: Level 1 Grouping, Level 2 Grouping, and Level 3 Grouping. Level 1 Grouping clusters Sessions with the same behavior within the same Src-Dst IP pair, where Sessions are considered to have the same behavior if the distance between their feature vectors is within a given range. If the number of Sessions with the same behavior exceeds a threshold, the L1 traffic group (L1 Group) they form is kept. Level 2 Grouping clusters the L1 Groups kept by the previous level once more, this time examining Sessions from the same Src IP to different Dst IPs, and clusters Sessions with similar feature vectors into an L2 Group. Level 3 Grouping extends this further: it analyzes the L2 Groups produced by Level 2 Grouping, clusters L2 Groups with similar features, and outputs L3 Groups. At every level, similarity is determined with a vector distance formula, and items whose distance is within a feature threshold are judged similar.

Group Distributor step: The L3 Groups produced by each regional cloud are distributed to the Group Aggregators of the other regional clouds.

Group Aggregator step: The Group Aggregators of the regional clouds consolidate the groups into one Complete Group List, which is distributed to every regional cloud.

Group Similarity Measure step: A relationship graph (Relationship Graph) is built from the Complete Group List. Each regional cloud compares each of its own Groups with the Groups in the Complete Group List one by one. A Group is not compared with the entry that carries its own Group ID; for every other Group, the distance is computed. If the computed distance falls within a range (Distance_threshold), an edge is established between the two nodes and recorded in the node's Adjacency List.

Graph Constructor step: At the Top Cloud, the Adjacency Lists of the regional clouds are consolidated into a Complete Adjacency List, which is a complete description of the relationship graph.

Ranking and Association step: A ranking algorithm, such as SimRank or PageRank, is run on the nodes of the relationship graph, where each node represents a Group. The algorithm assigns each node a score, and nodes whose scores fall within the same range (Range) are treated as one Component. This yields a number of Main Components, which are the sets of Groups with highly similar network behavior.

Suspicious IP Collector step: The Groups (nodes) in each Main Component are collected and returned to the regional clouds. Each regional cloud uses the Group IDs to recover a Suspicious IP List that marks the suspect IPs. The recovered IPs comprise two sets, Src IPs and Dst IPs.

The Traffic Extraction, Filter, and Grouping steps each run independently in the regional clouds to produce the first-stage L3 Group data. The Group Distributor, Group Aggregator, Group Similarity Measure, and Graph Constructor steps run on FedMR (Federated MapReduce), which splits a MapReduce job into two parts, one executed in the regional clouds and the other in the Top Cloud, so that MapReduce jobs can run across clouds without modifying the program code.

In the above embodiment of the present invention, the feature vector is built for a Session from the basic data of its Flows and represents the activity statistics of that Session. By collecting logs of different botnets and performing training analysis with feature selection (Feature Selection), 14 features that effectively detect botnets were obtained: srcToDst_NumOfPkts, srcToDst_NumOfBytes, srcToDst_Byte_Max, srcToDst_Byte_Min, srcToDst_Byte_Mean, dstToSrc_NumOfBytes, dstToSrc_Byte_Max, dstToSrc_Byte_Min, dstToSrc_Byte_Mean, total_NumOfBytes, total_Byte_Max, total_Byte_Mean, total_Byte_STD, and total_BytesTransferRatio. Respectively, these represent: the number of packets from Src IP to Dst IP; the amount of data from Src IP to Dst IP; the maximum, minimum, and mean packet size from Src IP to Dst IP; the amount of data from Dst IP to Src IP; the maximum, minimum, and mean packet size from Dst IP to Src IP; the total amount of data in the Flow; the maximum data size in the Flow; the minimum data size in the Flow; the standard deviation of the data size in the Flow; and the transfer ratio of the Flow, that is, the ratio of the amounts of data in the two directions.

In the above embodiment of the present invention, the 14 features were obtained from an information gain ranking, but in practice the analysis is not limited to these 14 features; any useful feature may be included.

In the above embodiment of the present invention, the predefined range in the Traffic Extraction step is a Transmission Control Protocol (TCP) timeout of 21 seconds or a User Datagram Protocol (UDP) timeout of 22 seconds. The range may be adjusted for the communication protocol and network environment and is not limited to these values.

In the above embodiment of the present invention, the whitelist is set by the user and may contain Domain Name System (DNS) servers, well-known IPs, and intranet-class IPs. The whitelist can be updated at any time, and IPs can be added as network services change. IPs are not limited to IPv4 or IPv6, so the method can also be applied to future IP types.

In the above embodiment of the present invention, the Grouping step clusters according to the similarity of the feature vectors. Similarity is computed with the Euclidean distance or with any other spatial measure that can determine the distance between two data points. The embodiment uses a DBScan-like clustering algorithm: starting from one point, it scans nodes until all nodes have been scanned or no nodes remain within the predefined range. Any other effective clustering algorithm may be substituted.

In the above embodiment of the present invention, Level 1 to Level 3 of the Grouping step follow the same algorithm and differ only in the objects they operate on. The step gathers Sessions with similar behavior into the same Group. Level 1 Grouping filters Groups by size, and Level 2 and Level 3 likewise have their own size thresholds that decide whether a Group is kept.

In the above embodiment of the present invention, the ranking algorithm in the Ranking and Association step is a modified SimRank, implemented as a MapReduce version that can be computed in parallel, but any algorithm capable of ranking may be used.

In the above embodiment of the present invention, the Suspicious IP Collector step can run independently in each regional cloud or solely on the Top Cloud, depending on the user's data privacy requirements.

In the above embodiment of the present invention, each regional cloud first executes the Traffic Extraction, Filter, and Grouping steps, which consolidate the network traffic log data and merge unidirectional Flows into bidirectional (bi-directional) Flows. The bidirectional Flows are further grouped into individual Sessions, and the Sessions are in turn grouped into independent Groups.

In the above embodiment of the present invention, the Group Aggregator step produces the Complete Group List against which the Group Similarity Measure step compares, so that the regional clouds can build the relationship graph in parallel and independently.

In the above embodiment of the present invention, the Ranking and Association step and the Suspicious IP Collector step are each executed independently in the Top Cloud.
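The Traffic Extraction step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Flow` and `Session` classes, the `merge_flows` and `feature_vector` functions, and the choice to compute statistics over per-flow byte counts are all assumptions made for illustration. Only the 21-second TCP and 22-second UDP timeouts and the feature names come from the text, and the direction of the transfer ratio (Src to Dst over Dst to Src) is likewise assumed.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

# Timeouts in seconds, taken from the embodiment; other values are allowed.
TIMEOUT = {"TCP": 21.0, "UDP": 22.0}

@dataclass
class Flow:                      # hypothetical record for one unidirectional flow
    src: tuple                   # (ip, port)
    dst: tuple                   # (ip, port)
    proto: str                   # "TCP" or "UDP"
    start: float                 # seconds
    end: float
    pkts: int
    nbytes: int

@dataclass
class Session:                   # hypothetical container for merged flows
    initiator: tuple             # source endpoint of the first flow
    last_seen: float
    fwd: list = field(default_factory=list)   # byte counts, Src -> Dst
    rev: list = field(default_factory=list)   # byte counts, Dst -> Src
    fwd_pkts: int = 0

def merge_flows(flows):
    """Merge unidirectional flows between the same endpoints into Sessions."""
    sessions, latest = [], {}
    for f in sorted(flows, key=lambda x: x.start):
        # The key ignores direction, so A->B and B->A share one Session.
        key = (f.proto, *sorted([f.src, f.dst]))
        s = latest.get(key)
        if s is None or f.start - s.last_seen > TIMEOUT[f.proto]:
            s = Session(initiator=f.src, last_seen=f.end)
            latest[key] = s
            sessions.append(s)
        if f.src == s.initiator:
            s.fwd.append(f.nbytes)
            s.fwd_pkts += f.pkts
        else:
            s.rev.append(f.nbytes)
        s.last_seen = max(s.last_seen, f.end)
    return sessions

def feature_vector(s):
    """Compute a subset of the 14 features (assumed per-flow statistics)."""
    sizes = s.fwd + s.rev
    fwd_b, rev_b = sum(s.fwd), sum(s.rev)
    return {
        "srcToDst_NumOfPkts": s.fwd_pkts,
        "srcToDst_NumOfBytes": fwd_b,
        "dstToSrc_NumOfBytes": rev_b,
        "total_NumOfBytes": fwd_b + rev_b,
        "total_Byte_Max": max(sizes),
        "total_Byte_Mean": mean(sizes),
        "total_Byte_STD": pstdev(sizes),
        # Assumed direction of the ratio; undefined when nothing came back.
        "total_BytesTransferRatio": fwd_b / rev_b if rev_b else None,
    }
```

In this sketch, a reply flow that starts within the timeout of the previous flow joins the existing Session, while a later flow between the same endpoints opens a new one. A real deployment would compute the remaining features and take its timeouts from configuration.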

請參閱『第１圖』所示，係本發明基於FedMR之殭屍網路聯偵流程示意圖。如圖所示：本發明係一種基於FedMR之殭屍網路聯偵方法，係提供數個區域雲（Region Cloud ）１聯合，一同偵測殭屍網路（Botnet）之活動，克服網路流量（NetFlow）日誌（Log）過小，導致無法判斷是否有惡意程式活動之情況；並依循非監督式之機器學習（machine learning）演算法設計概念，建構一個可以自我調適，透過網路流量日誌，挖掘惡意程式活動之方法。該方法至少包含下列步驟：流量擷取（Traffic Extraction）步驟s101：數個區域雲１分別持有個別之網路流量日誌之資料，日誌之格式為NetFlow，讀取網路流量連線（Flow），因為NetFlow Flow都是單一方向性（uni-direction），合併來源IP（Src IP）、來源通訊埠（Src_port）、目的地IP（Dst IP）、及目的地Port（Dst_port）互異之Flow成為單一Session，Flow之合併會依據逾時（Timeout）時間做合併，假設任一兩個單一方向性之Flow其之間間隔差距在預先定義之範圍內，則合併並累計相關統計值至Session裡面，並統計Session內所有資訊建立特徵向量值（Feature Vector）。於一實施例中，該預先定義之範圍係將傳輸控制協定（Transmission Control Protocol, TCP）逾時設定為21秒或使用者資料報協定（User Datagram Protocol, UDP）逾時設定為22秒之內。本發明係根據Flow之基本資料建立特徵向量，此特徵向量係表示一Session之活動統計向量，透過收集不同殭屍網路之日誌，利用特徵選取（Feature Selection）做訓練分析，得到可有效偵測殭屍網路之14個特徵值，如表一所示。表一挑選上述14個特徵係從資訊增益（information gain）排名所得到之結果。本發明實驗部份採用這14個特徵做為可行性證明，但不限定只能使用該14個特徵，其他特徵亦可。過濾（Filter）步驟s102：包含前置過濾（Preprocessing Filtering）與P2P流量過濾（P2P Traffic Filtering）兩個子步驟，該前置過濾係將各式依據預先定義之白名單過濾，過濾到白名單內之Session，Session內有任一IP在白名單內就會被過濾，接著以該P2P流量過濾判斷Session之遺失率（loss rate），假設遺失率大於一預設門檻值才會納入要分析之對象，其原因為殭屍網路之節點通常不一定常駐存在，所以通訊上面會產生許多失敗之連線，透過過濾階段剃除白名單之Session與遺失率低之Session，可有效降低要分析之資料量。其中，該白名單係由使用者設定，通常為領域名稱系統伺服器（Domain Name System Server, DNS Server）、已知IP（Well-Known IP）與內聯網IP（intranet-class IP）。群聚（Grouping）步驟s103：分為三階段（Level），分別為Level 1 Grouping、Level 2 Grouping、及Level 3 Grouping，該Level 1 Grouping判斷群聚同一組Src-Dst IP之相同行為之Session，如果相同行為之Session數量超過一門檻值，就保留在該Session所形成之L1流量群（L1 Group），該Level 2 Grouping則針對上個階段所留下來之L1 Group再群聚一次，並以同一Src IP對不同Dst IP之Session做判斷，群聚特徵向量相近之Session形成一個L2 Group，該Level 3 Grouping則是更進一步擴充，分析該Level 2 Grouping所產生之L2 Group，群聚特徵相近之L2 Group，最後做輸出L3 Group。其中，群聚係根據特徵向量之相似度決定，相似度之公式可為任意之空間量測公式，本發明驗證之部分使用歐氏距離（Euclidean Distance）做示範。而群聚之演算法係採用DBScan-Like之演算法，以某一點為起點開始掃描節點，直到所有節點都被掃描完成，或是在預先定義之範圍內已經沒有任何節點。Level 1至Level 3之演算法流程均相同，僅計算對象不同，本步驟目的係彙整行為相似的Session至同一個Group，僅有Level 1 Grouping會判斷Group之大小作過濾，Level 2與Level 3也有各自之門檻值判斷大小，決定是否保留Group。群分配（Group Distributor）步驟s104：係依據各區域雲１產生出來之L3 
Groups分散給其他區域雲１之群聚集（Group Aggregator）。群聚集步驟s105：將各區域雲１之群聚集最後彙整成為一個完整流量群列表（Complete Group List），該完整流量群列表會被用於建立關聯圖（Relationship Graph）（見步驟s106～s107），目的為產生一個可以比較之列表，讓各區域雲１在建圖時可以平行獨立執行。其中，每一Group內有一組特徵向量（請參考上述特徵向量之部分）。群相似性量測（Group Similarity Measure）步驟s106：係依據各區域雲１產生出來之完整流量群列表建立關聯圖，每個區域雲１會把自己擁有之Group與完整流量群列表內之Group逐一比較，除了群ID（Group ID）與自己相同之Group不比較外，其餘的均會比較距離，計算出來之距離如果落在一範圍值（Distance_threshold）內，則表示兩點之間會建立連線，一併紀錄至該點之鄰居列表（Adjacency List）當中。建立群關聯圖（Graph Constructor）步驟s107：當所有之步驟s107都執行完畢後，本步驟s107係於上層雲（Top Cloud）２彙整各區域雲１之鄰居列表成為一完整鄰居列表（Complete Adjacency List），此完整鄰居列表即為一個關聯圖之完整描述評分與耦合（Ranking and Association）步驟s108：係對於關聯圖中之節點（node）執行一評分演算法，本發明驗證時係使用改良之SimRank（可平行運算之MapReduce之版本），透過SimRank標記各節點之分數，節點代表Group，分數在一範圍（Range）內的節點可以視為同一元素（Component），如此可以獲得許多的主要元素（Main Component），這些主要元素就是擁有高度相似網路行為之Group集合。收集可疑IP（Suspicious IP Collector）步驟s109：係彙整各主要元素內之Group（即節點），傳回給各區域雲１，各區域雲１透過Group編號還原成一可疑IP列表（Suspicious IP List）之形式，標記出有嫌疑之IP，而還原之IP會包含Src IP與Dst IP兩個集合。本步驟s109可獨立在各區域雲１中執行，也可單獨在該上層雲２執行，端視使用者對於資料之隱私程度。如是，藉由上述揭露之流程構成一全新之基於FedMR之殭屍網路聯偵方法。當運用時，本方法假設有多個雲構成區域雲，如第１圖所示，共有三個區域雲１，每個區域雲１分別持有個別之網路流量日誌，日誌之格式為Netflow；在執行協同偵測殭屍網路時，各區域雲１先執行流量擷取步驟s101、過濾步驟s102及群聚步驟s103，統整Netflow日誌之資訊，合併單一方向性之Flow成為雙向性（bi-directional）之Flow；該雙向性之Flow會再進一步的Grouping成個別之Session，該個別之Session建立好後會在更進一步的Grouping成為獨立之Group。建立好之Group，透過群分配步驟s104與群聚集步驟s105合併成一個完整流量群列表，這份完整流量群列表會散布至各區域雲１。當各區域雲１都有完整流量群列表之後，再執行群相似性量測步驟s106，建立各自之鄰居列表，最後在上層雲２由建立群關聯圖步驟s107彙整成為一個完整鄰居列表，這個完整鄰居列表即代表一個關聯圖。該完整之關聯圖再交由評分與耦合步驟s108，分析找出關聯圖中高度關聯之節點（Group），讓節點構成一個主要元素，這些主要元素就是本發明所得出之擁有相似網路行為之Group；在這些Group之IP就極有可能有殭屍網路之活動出現，最後透過收集可疑IP步驟s109彙整成一個可疑IP列表。關聯圖之表現形式本發明採用鄰居列表之方式呈現，這樣的表示利於分析儲存，每一行代表一個節點與跟其連接之相臨點還有該點與其他點之距離。在執行階段部分，步驟s101至步驟s103透過流量擷取、過濾、群聚獲得第一階段Group資料，上述三個步驟都是獨立在區域雲１中執行。建立關聯圖過程之步驟s104至步驟s107則是在FedMR（Federated MapRedcue）３運行，並可拆解MapReduce成為兩部分，一部份放在區域雲１執行，另一部分放在上層雲２執行，達成在不修改程式碼之情況下，可以跨雲執行MapReduce工作。收集可疑IP過程之步驟s108至步驟s109皆獨立在該上層雲２中執行。 
Netflow之資料會被轉換成特徵向量，特徵向量之內容可以隨意調整，本發明在驗證系統可行性時，定義了14個特徵向量作為標示一個Flow活動行為之標的。相似之判斷公式主要使用歐氏距離之公式，但不限定只用在此公式，任何可以判斷兩個資料維度距離之公式皆可以替代。以下以實際之網路流量日誌實驗本方法之可行性，並利用VirusTotal之服務驗證偵測出來之IP是否為有嫌疑之IP，如表二、表三所示。表二表三驗證IP門檻值（Verified IP Threshold）係用於確認一元素是否有惡意行為，1表示只要有一IP位於VirusTotal當中，就算有惡意行為，以此類推。藉此，本方法透過行為分析可辦別出不同之殭屍網路，此方法不僅適用已知之殭屍網路，對於新型之惡意殭屍網路仍能夠辨識出來，不同於傳統Signature-Based之偵測方法，對於混合之殭屍網路，亦可有效地辨別出中毒之IP。該聯偵方法可分為兩個執行階段： 1. 首先考慮到殭屍網路之週期性活動特性，分析並群聚具有週期性行為之通訊流量。 2. 考慮同一類型之P2P 殭屍網路成員之間之行為相似度。該相似度大致包含兩點特性：(1)通訊特徵相近；以及(2)通訊鄰近點（neighbors）之重複性（使用simrank演算法）。綜上所述，本發明係一種全新之基於FedMR之殭屍網路聯偵方法，可有效改善習用之種種缺點，採用非監督式機器學習（machine learning）之演算法，以不針對特定P2P殭屍網路之前提下進行特徵量測，提供一套通用型之P2P 殭屍網路偵測機制，可找出大量相似行為之殭屍網路流量，包含當前存在之各種P2P殭屍網路以及未來產生之新型P2P 殭屍網路均可標記出來，且不需要對封包內容進行分析，確保資料隱私以及避免封包加密技術之問題，並能在潛伏階段偵測P2P殭屍網路之成員之間之微量通訊行為，可在殭屍網路發動攻擊前就將有嫌疑之流量及可疑IP偵測出來。此外P2P殭屍網路之通訊，在單一區域（domain）未必顯著，因此本發明將透過Fed-MR協同式運算框架進行跨區域的聯合分析，解決以往單一區域低偵測率之問題，並達到跨區域資安聯防之目標。此方法未來可應用在學術網路、網路提供商（ISP）等自治系統（Autonomous system）中，用於偵測惡意網路行為並預防殭屍網路攻擊，加強網路安全保護，進而使本發明之産生能更進步、更實用、更符合使用者之所須，確已符合發明專利申請之要件，爰依法提出專利申請。惟以上所述者，僅為本發明之較佳實施例而已，當不能以此限定本發明實施之範圍；故，凡依本發明申請專利範圍及發明說明書內容所作之簡單的等效變化與修飾，皆應仍屬本發明專利涵蓋之範圍內。Please refer to "Figure 1", which is a schematic diagram of the FedMR-based botnet joint detection process of the present invention. As shown in the figure: The present invention is a botnet joint detection method based on FedMR, which provides a number of Region Clouds1 to jointly detect the activities of botnets and overcome network traffic (NetFlow ) The log is too small, which makes it impossible to determine whether there is malicious program activity; and follows the unsupervised machine learning algorithm design concept to build a self-adjusting, mining malicious program through network traffic logs Method of activity. 
The method includes at least the following steps: Traffic Extraction step s101: several regional clouds 1 each hold data of individual network traffic logs, the format of the logs is NetFlow, and read network flow connections (Flow) Because NetFlow Flow is uni-direction, combining the source IP (Src IP), source communication port (Src_port), destination IP (Dst IP), and destination port (Dst_port) into different flows becomes The merge of a single session and flow will be merged according to the timeout time. Assuming that the gap between any two single-directional flows is within a predefined range, the relevant statistical values are merged and accumulated into the session. And statistics all information in the Session to establish a feature vector value (Feature Vector). In one embodiment, the pre-defined range is that the Transmission Control Protocol (TCP) timeout is set to 21 seconds or the User Datagram Protocol (UDP) timeout is set to 22 seconds. . The present invention establishes a feature vector based on the basic data of Flow. This feature vector represents a statistical activity vector of a session. By collecting logs of different botnets and using feature selection for training analysis, it is possible to effectively detect zombies. The 14 characteristic values of the network are shown in Table 1. Table I The above 14 features are selected from the results obtained from the information gain ranking. The experimental part of the present invention uses these 14 features as a feasibility proof, but it is not limited to only using these 14 features, and other features are also possible. Filtering step s102: Contains two sub-steps of preprocessing filtering and P2P traffic filtering. This pre-filtering is based on the pre-defined white list filtering and filtering into the white list. Session, any IP in the session will be filtered in the white list, and then use the P2P traffic filtering to determine the loss rate of the session. 
Assuming that the loss rate is greater than a preset threshold, it will be included in the object to be analyzed. The reason is that botnet nodes do not always exist, so there will be many failed connections on the communication. Through the filtering stage, the whitelisted sessions and sessions with low loss rates can be shaved off, which can effectively reduce the amount of data to be analyzed. . The white list is set by a user, and is usually a Domain Name System Server (DNS Server), a Well-Known IP, and an intranet-class IP. Grouping step s103: It is divided into three stages (Level 1), namely Level 1 Grouping, Level 2 Grouping, and Level 3 Grouping. The Level 1 Grouping judges the sessions of the same behavior of grouping the same group of Src-Dst IP. If the number of sessions of the same behavior exceeds a threshold, the L1 traffic group (L1 Group) formed in the session is retained. The Level 2 Grouping is clustered again for the L1 Group left over from the previous stage and uses the same The Src IP judges the sessions of different Dst IPs. The sessions with similar clustering feature vectors form an L2 Group. The Level 3 Grouping is further expanded. Analyzing the L2 Groups generated by the Level 2 Grouping, the L2 Groups with similar clustering characteristics are analyzed. Group, and finally output L3 Group. Among them, the clustering system is determined according to the similarity of the feature vectors, and the formula of the similarity can be any spatial measurement formula. The part verified by the present invention uses Euclidean Distance as a demonstration. The clustering algorithm uses the DBScan-Like algorithm to start scanning nodes from a certain point until all nodes have been scanned or there are no nodes within a predefined range. The algorithm flow of Level 1 to Level 3 is the same, only the calculation objects are different. The purpose of this step is to aggregate sessions with similar behavior to the same Group. 
In the grouping step, Level 1 Grouping is not the only stage that filters by group size: Level 2 and Level 3 also have their own thresholds on group size that decide whether a Group is kept.

Group Distributor step s104: The L3 Groups generated by each regional cloud 1 are distributed to the group aggregators of the other regional clouds 1.

Group Aggregator step s105: The groups of the regional clouds 1 are aggregated into a complete traffic group list (Complete Group List), which is used to establish the association graph (see steps s106 to s107). The purpose is to produce a list against which groups can be compared, so that each regional cloud 1 can build its part of the graph in parallel and independently. Each group has a set of feature vectors (see the description of feature vectors above).

Group Similarity Measure step s106: The association graph is established from the complete traffic group list. Each regional cloud 1 compares each Group it owns with the Groups in the complete traffic group list one by one. A Group is not compared with itself (the same Group ID); for every other Group the distance is computed. If the computed distance falls within a range value (Distance_threshold), a connection is established between the two nodes and recorded in the neighbor list (Adjacency List) of that node.

Graph Constructor step s107: After step s106 has been performed in all regional clouds 1, this step s107 consolidates, in the upper cloud (Top Cloud) 2, the neighbor lists of the regional clouds 1 into a complete neighbor list (Complete Adjacency List), which is a complete description of an association graph.

Ranking and Association step s108: A scoring algorithm is run on the nodes of the association graph. The verification of the present invention uses an improved SimRank (a MapReduce version that can run in parallel). SimRank marks the score of each node, where a node represents a Group, and nodes whose scores fall within a range (Range) are regarded as the same element (Component). In this way, a number of main elements (Main Components) are obtained; these main elements are sets of Groups with highly similar network behavior.

Suspicious IP Collector step s109: The Groups (i.e., nodes) in the main elements are aggregated and returned to the regional clouds 1. Each regional cloud 1 restores them, by group number, to a suspicious IP list in which the suspected IPs are marked; the restored IPs include both Src IPs and Dst IPs. Depending on how private the user's data is, this step s109 can be performed independently in each regional cloud 1 or in the upper cloud 2.

The process disclosed above thus constitutes a new FedMR-based botnet joint detection method.

When applied, this method assumes that multiple clouds constitute the regional clouds. As shown in Figure 1, there are three regional clouds 1, each of which holds its own network traffic logs in NetFlow format. When performing collaborative botnet detection, each regional cloud 1 first performs the traffic extraction step s101, the filtering step s102, and the grouping step s103 to unify the information in the NetFlow logs and merge unidirectional Flows into bidirectional (bi-directional) Flows; the bidirectional Flows are then gathered into individual sessions, and the sessions are further clustered into independent Groups. The Groups are merged into a complete traffic group list through the group distributor step s104 and the group aggregator step s105, and this complete traffic group list is distributed to each regional cloud 1.
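The scoring idea of step s108 can be illustrated with plain SimRank on a small undirected association graph. This sketch is not the improved MapReduce version used in the verification; it is the textbook iteration, run on a single machine. The decay factor of 0.8, the iteration count, and the reading of "nodes within a range" as "pairs of nodes whose SimRank similarity reaches a threshold are joined into one element" are assumptions made for illustration.

```python
from itertools import product


def simrank(adj, decay=0.8, iterations=10):
    """Plain SimRank on an undirected graph given as {node: [neighbors]}.

    Returns a dict mapping (a, b) to the similarity of nodes a and b.
    """
    nodes = sorted(adj)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
            elif not adj[a] or not adj[b]:
                new[(a, b)] = 0.0  # isolated nodes are similar to nothing
            else:
                total = sum(sim[(i, j)] for i in adj[a] for j in adj[b])
                new[(a, b)] = decay * total / (len(adj[a]) * len(adj[b]))
        sim = new
    return sim


def components(nodes, sim, threshold):
    """Join nodes whose pairwise similarity reaches `threshold` (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a in nodes:
        for b in nodes:
            if a < b and sim[(a, b)] >= threshold:
                parent[find(a)] = find(b)

    grouped = {}
    for n in nodes:
        grouped.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in grouped.values())
```

In this reading, the elements that contain more than one Group would play the role of the main elements, since they collect Groups that share communication neighbors.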
After each regional cloud 1 has the complete traffic group list, it performs the group similarity measure step s106 to establish its own neighbor list. The upper cloud 2 then aggregates these into a complete neighbor list in the graph constructor step s107; this complete neighbor list represents an association graph. The complete association graph is passed to the ranking and association step s108, which analyzes it to find highly related nodes (Groups), and these nodes constitute main elements. The main elements are the Groups with similar network behavior found by the present invention, and the IPs of these Groups are very likely to have botnet activity. Finally, the suspicious IP collector step s109 compiles a list of suspicious IPs.

Representation of the association graph: The present invention represents the association graph as a neighbor list, a representation that is convenient for analysis and storage. Each row represents a node, the nodes adjacent to it, and the distance between the node and each adjacent node.

In the execution phase, steps s101 to s103 obtain the first-stage Group data through traffic extraction, filtering, and grouping; these three steps are performed independently in each regional cloud 1. Steps s104 to s107, which establish the association graph, run in FedMR (Federated MapReduce) 3, which splits a MapReduce job into two parts, one executed in the regional clouds 1 and the other in the upper cloud 2, so that MapReduce jobs can run across clouds without modifying their code. Steps s108 to s109, which collect suspicious IPs, are performed independently in the upper cloud 2.

The NetFlow data is converted into feature vectors, and the content of the feature vectors can be adjusted freely.
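The neighbor-list representation and the division of work between steps s106 and s107 can be sketched as two plain functions, one standing in for the part that runs in each regional cloud and one for the part that runs in the upper cloud. This is an illustration only and does not use any FedMR or Hadoop API: the function names, the dictionary layout, and the simplification of giving each Group a single feature vector are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def regional_neighbor_list(own_groups, complete_groups, distance_threshold):
    """Step s106 in one regional cloud.

    Compares every Group owned by this region with every Group in the
    complete traffic group list, skipping the Group itself, and records
    the neighbors whose distance falls within `distance_threshold`.
    Both arguments map a Group ID to its feature vector.
    """
    neighbors = {}
    for gid, vec in own_groups.items():
        neighbors[gid] = {}
        for other_id, other_vec in complete_groups.items():
            if other_id == gid:
                continue  # a Group is not compared with itself
            d = euclidean(vec, other_vec)
            if d <= distance_threshold:
                neighbors[gid][other_id] = d
    return neighbors


def complete_neighbor_list(partials):
    """Step s107 in the upper cloud: merge the regional neighbor lists."""
    complete = {}
    for partial in partials:
        for gid, near in partial.items():
            complete.setdefault(gid, {}).update(near)
    return complete
```

Each entry of the merged dictionary corresponds to one row of the neighbor list: a node, its adjacent nodes, and the distance to each of them. Because every region only needs its own Groups and the shared complete list, the regional part can run in parallel, which is the property the group aggregator step s105 is meant to provide.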
When verifying the feasibility of the system, the present invention defines 14 features as the description of a Flow's activity. The similarity judgment mainly uses the Euclidean distance formula but is not limited to it; any formula that can judge the distance between two data points can be substituted.

The feasibility of this method is tested below with actual network traffic logs, and VirusTotal's service is used to verify whether the detected IPs are suspicious, as shown in Tables 2 and 3.

Table 2

Table 3

The Verified IP Threshold is used to confirm whether an element has malicious behavior. A value of 1 means that the element is confirmed as malicious as long as one of its IPs is reported by VirusTotal as having malicious behavior, and so on. In this way, this method can identify different botnets through behavior analysis. It is suitable not only for known botnets but can also identify new types of malicious botnets, which distinguishes it from traditional signature-based detection methods. For mixed botnets, it can also effectively identify infected IPs.

The joint detection method can be divided into two execution phases:

1. First, considering the periodic activity of a botnet, communication traffic with periodic behavior is analyzed and clustered.
2. Second, the similar behavior among members of the same type of P2P botnet is considered. The similarity broadly covers two characteristics: (1) similar communication characteristics; and (2) repeated communication neighbors (using the SimRank algorithm).

To sum up, the present invention is a new FedMR-based botnet joint detection method that effectively remedies the shortcomings of the prior art. It uses an unsupervised machine learning algorithm, does not require prior feature measurement of specific P2P botnets, and provides a general-purpose P2P botnet detection mechanism that finds large amounts of botnet traffic with similar behavior, so that both existing P2P botnets and new P2P botnets that arise in the future can be marked. No analysis of packet content is needed, which preserves data privacy and avoids the problem of packet encryption. The method can detect the faint communication between members of a P2P botnet during the incubation stage, so that suspicious traffic and suspicious IPs are detected before the botnet launches an attack. In addition, the communication of a P2P botnet may not be conspicuous within a single domain; the present invention therefore performs cross-region joint analysis through the Fed-MR collaborative computing framework, solving the low detection rate of earlier single-region approaches and achieving cross-regional joint security defense. In the future, this method can be applied to autonomous systems such as academic networks and network providers (ISPs) to detect malicious network behavior, prevent botnet attacks, and strengthen network security protection. The invention is thereby more advanced, more practical, and better suited to users' needs; it meets the requirements for an invention patent application, and the application is filed according to law.

However, the above are only preferred embodiments of the present invention and do not limit the scope of its implementation; any simple equivalent changes and modifications made in accordance with the scope of the patent application and the content of the specification of the present invention shall still fall within the scope covered by the patent of the present invention.

1‧‧‧regional cloud

2‧‧‧upper cloud

3‧‧‧FedMR

s101～s109‧‧‧steps

FIG. 1 is a schematic diagram of the FedMR-based joint botnet detection process of the present invention.

Claims

A FedMR-based botnet joint detection method, comprising at least the following steps:

a Traffic Extraction step: several regional clouds (Region Clouds) each hold their own network traffic log data (NetFlow Log Data) in NetFlow format, each record being a unidirectional network flow connection (Flow); the different Flows that belong to the same combination of source IP (Src IP), source port (Src_port), destination IP (Dst IP), and destination port (Dst_port) are merged into a single session; Flows are merged according to a timeout, such that if the gap between any two unidirectional Flows is within a predefined range, their statistical values are merged and accumulated into the session, and all information in the session is summarized to establish a feature vector (Feature Vector);

a Filtering (Filter) step: comprising two sub-steps, preprocessing filtering and P2P traffic filtering (P2P Traffic Filtering); the preprocessing filtering removes sessions according to various predefined white lists, a session being filtered out if any IP in the session is on a white list; the P2P traffic filtering then determines the loss rate of each session, and a session whose loss rate is greater than a preset threshold is included among the objects to be analyzed; by removing whitelisted sessions and sessions with low loss rates, the filtering stage effectively reduces the amount of data to be analyzed;

a Grouping step: divided into three stages, Level 1 Grouping, Level 2 Grouping, and Level 3 Grouping; the Level 1 Grouping clusters the sessions with the same behavior within the same Src-Dst IP pair and, if the number of sessions with the same behavior exceeds a threshold, retains the L1 traffic group (L1 Group) formed by those sessions; the Level 2 Grouping clusters the L1 Groups retained from the previous stage again, examining the sessions of the same Src IP across different Dst IPs, and sessions with similar feature vectors form an L2 Group; the Level 3 Grouping extends the analysis further by clustering the L2 Groups produced by the Level 2 Grouping that have similar characteristics, and finally outputs L3 Groups;

a Group Distributor step: the L3 Groups generated by the regional clouds are distributed to the group aggregators of the other regional clouds;

a Group Aggregator step: the groups of the regional clouds are aggregated into a complete traffic group list (Complete Group List), and the complete traffic group list is distributed to the regional clouds;

a Group Similarity Measure step: an association graph is established from the complete traffic group list; each regional cloud compares the Groups it owns with the Groups in the complete traffic group list; a Group is not compared with itself (the same Group ID), and for every other Group the distance is computed; if the computed distance falls within a range value (Distance_threshold), a connection is established between the two nodes and recorded in the neighbor list (Adjacency List) of that node;

a Graph Constructor step: in an upper cloud (Top Cloud), the neighbor lists of the regional clouds are consolidated into a complete neighbor list (Complete Adjacency List), which is a complete description of an association graph;

a Ranking and Association step: a scoring algorithm is executed on the nodes of the association graph, each node representing a Group, to mark the score of each node; nodes whose scores fall within a range are regarded as the same element (Component), whereby a number of main elements (Main Components) are obtained, the main elements being sets of Groups with highly similar network behavior; and

a Suspicious IP Collector step: the Groups (i.e., nodes) in each main element are aggregated and returned to the regional clouds, and each regional cloud restores them, by group number, to a suspicious IP list in which the suspected IPs are marked, the restored IPs including both Src IPs and Dst IPs;

wherein the traffic extraction step, the filtering step, and the grouping step are all performed independently in the regional clouds to obtain the first-stage L3 Group data, and the group distributor step, the group aggregator step, the group similarity measure step, and the graph constructor step run in FedMR (Federated MapReduce), which splits a MapReduce job into two parts, one executed in the regional clouds and the other in the upper cloud, so that MapReduce jobs can be performed across clouds without modifying their code.

The FedMR-based botnet joint detection method according to claim 1, wherein the feature vector is established for a session from the basic data of its Flows and represents the activity statistics of the session; by collecting logs from different botnets and applying feature selection (Feature Selection) for training analysis, 14 feature values that can effectively detect botnets are obtained, including srcToDst_NumOfPkts, srcToDst_NumOfBytes, srcToDst_Byte_Max, srcToDst_Byte_Min, srcToDst_Byte, dstToSrc_NumOfBytes, dstToSrc_Byte_Max, dstToSrc_Byte_Min, dstToSrc_Byte_Mean, total_NumOfBytes, total_Byte_Max, total_Byte_Mean, total_Byte_STD, and total_BytesTransferRatio; these describe the data sent from the Src IP to the Dst IP, including the maximum, minimum, and average number of bytes per packet in that direction; the number of data bytes sent from the Dst IP to the Src IP, together with the maximum, minimum, and average number of bytes per packet in that direction; the total number of data bytes of the Flow, its maximum and minimum, and its standard deviation; and the ratio of the number of data bytes transmitted in the two directions of the Flow.

The FedMR-based botnet joint detection method according to claim 2, wherein the 14 feature values are obtained from the information gain ranking, but the method is not limited to these 14 feature values; any feature that effectively separates botnet traffic can be used.

The FedMR-based botnet joint detection method according to claim 1, wherein, in the traffic extraction step, the predefined range is a Transmission Control Protocol (TCP) timeout of 21 seconds or a User Datagram Protocol (UDP) timeout of 22 seconds, but is not limited to these two timeout values and can be adjusted according to the application.

The FedMR-based botnet joint detection method according to claim 1, wherein the white list is set by the user and may contain Domain Name System servers (DNS Servers), well-known IPs (Well-Known IPs), intranet IPs, or any form of public IP in the future.

The FedMR-based botnet joint detection method according to claim 1, wherein, in the grouping step, clustering is determined by the similarity of the feature vectors; the similarity formula is the Euclidean distance (Euclidean Distance) or any spatial measurement formula that can determine the distance between two data points; and the clustering algorithm is a DBScan-like algorithm that starts scanning nodes from a given point until all nodes have been scanned or no nodes remain within a predefined range, and may also be replaced by any effective clustering algorithm.

The FedMR-based botnet joint detection method according to claim 1, wherein, in the grouping step, the algorithm flow of Level 1 to Level 3 is the same and only the objects being clustered differ; the step gathers sessions with similar behavior into the same Group; and Level 1 Grouping is not the only stage that filters by Group size, as Level 2 and Level 3 also have their own thresholds on Group size that decide whether a Group is kept.

The FedMR-based botnet joint detection method according to claim 1, wherein the scoring algorithm in the ranking and association step is an improved SimRank, which is a MapReduce version that can run in parallel, or any alternative algorithm that can perform scoring on an association graph.

The FedMR-based botnet joint detection method according to claim 1, wherein the suspicious IP collector step can be performed independently in each regional cloud or in the upper cloud, depending on the privacy of the data.

The FedMR-based botnet joint detection method according to claim 1, wherein each regional cloud first executes the traffic extraction step, the filtering step, and the grouping step to unify the network traffic log data and merge unidirectional Flows into bidirectional (bi-directional) Flows; the bidirectional Flows are gathered into individual sessions, and the sessions are further clustered into independent Groups.

The FedMR-based botnet joint detection method according to claim 1, wherein the group aggregator step generates a complete traffic group list against which Groups can be compared in the group similarity measure step, so that the regional clouds can establish the association graph independently and in parallel.

The FedMR-based botnet joint detection method according to claim 1, wherein the ranking and association step and the suspicious IP collector step are performed independently in the upper cloud.