CN107370752A

CN107370752A - An efficient remote-control Trojan detection method

Info

Publication number: CN107370752A
Application number: CN201710719001.5A
Authority: CN
Inventors: 姜伟; 吴贤达; 庄俊玺; 潘邵芹; 田原
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2017-11-21
Anticipated expiration: 2037-08-21
Also published as: CN107370752B

Abstract

The invention discloses an efficient remote-control Trojan (RAT) detection method that determines whether a remote-control Trojan is present in a network from network behavior features. The method can be applied to real network traffic, with a false-positive rate close to 0. It comprises four stages. Stage one: traffic collection. Stage two: behavior feature extraction. Stage three: implementation of the method, which combines SMOTE oversampling with XGBoost classification. The SMOTE oversampling algorithm addresses the classification of imbalanced data sets at the data level. XGBoost, a recent high-accuracy classification algorithm from machine learning, is applied to Trojan detection for the first time; it achieves high accuracy and addresses the imbalance problem at the algorithm level. Stage four: optimization and evaluation of the method. The method focuses on discovering regularities by mining mixed network traffic; it is suited to identifying known Trojans and can also detect unknown remote-control Trojans.

Description

An efficient remote-control Trojan detection method

Technical field

The invention belongs to the field of information technology and relates in particular to an efficient remote-control Trojan detection method. The method accurately detects known remote-control Trojans in mixed traffic and can also identify unknown ones, which is of great significance for safeguarding network security and reducing losses to the state, enterprises, and individuals.

Background technology

In recent years, attackers have continually used remote-control Trojans for remote control and information theft, posing a serious threat to network security and causing severe impact and heavy losses to the state, enterprises, and individuals. A remote-control Trojan consists of two parts: a controlling end (client) and a controlled end (server). Typically, an attacker uses spear phishing and social engineering to find a machine that can be infected, and then uses the standard TCP/IP or UDP protocols to establish real-time communication between the controlling end and the controlled end. The attacker sends control instructions from the controlling end; on receiving an instruction, the controlled end performs the corresponding action on the victim host and returns the result to the controlling end over the network. Unlike traditional security threats such as viruses and ordinary Trojans, these Trojans are full-featured, are commonly used for data theft and privacy snooping in APT attacks, are stealthy and persistent, and are highly harmful. More than 30 well-known remote-control Trojans exist, such as Grey Pigeon, Gh0st, PcShare, Nuclear, DarkComet, Xtreme RAT, Glacial Epoch, and PlugX, and unknown variants of these known Trojans quietly compromise our privacy on the network. More seriously, once a host system is compromised, the intruder can use that host to distribute the remote-control Trojan to other vulnerable computers and build a botnet. Current intrusion detection systems are designed mainly for the various security problems within a local area network and may overlook the particular characteristics of remote-control Trojans, so such Trojans are likely to bypass their detection mechanisms. How to detect and further guard against remote-control Trojans quickly and effectively has become a major challenge in the security field.

Depending on the environment in which detection takes place, remote-control Trojan detection can be divided into host-based detection, network-based detection, and detection that fuses host and network features. As Trojans mutate faster and faster, the efficiency of detection based on host behavior features drops sharply, whereas detection based on network behavior is better suited to detecting new, unknown threats in the network. Host-based detection and fused detection both require extracting behavior features on the host. To make the method more portable, we do not extract behavior features from the host; we focus only on selecting effective network features and pairing them with a suitable detection algorithm to produce an efficient remote-control Trojan detection method. In recent years, most researchers have applied machine learning to Trojan detection based on communication behavior, but most existing methods have a high false-positive rate and are not suited to detecting special-purpose remote-control Trojans.

Dan Jiang et al. aimed to detect remote-control Trojans in the early stage of their communication, extracting seven network features from network- and transport-layer packets whose inter-packet interval is less than 1 s. Detection was implemented with the random forest algorithm. Although the experiment achieved high accuracy, its false-positive rate was high, the samples consisted only of remote-control Trojan samples and 10 normal-application samples, and the method is not applicable to mixed traffic.

Li Wei et al. analyzed the communication characteristics of Trojans in detail and selected 7 network behavior features, including periodic DNS, the uplink-to-downlink byte ratio, the uplink-to-downlink packet ratio, and the proportion of small packets. They chose the KNN and C4.5 algorithms, but this approach likewise suffers from a high false-positive rate.

Shicong Li et al. detected Trojans with a clustering algorithm that selects network-layer and IP-layer features and handles mixed traffic, but the features it selects are not necessarily effective for detecting remote-control Trojans. The detection method proposed here outperforms that method in both accuracy and false-positive rate.

The basic concepts of the invention are explained below.

Stream: The collected traffic is filtered to retain TCP traffic, and different "streams" are extracted according to distinct [source IP address, destination IP address] pairs. In this method, each stream k is the segment of traffic that begins with the three-way handshake initiated by a packet whose flag is "SYN" and ends when the time threshold T (T = 300 s) is reached; this segment of traffic is denoted F_k (k = 1, 2, 3, ...).

Session: Sessions are formed by reassembling and filtering a stream. Each stream is decomposed into 1 to n communication "sessions", each identified by a distinct [source IP address, source port, destination IP address, destination port] tuple.

Periodic stream: The time interval between every two adjacent packets is defined as t, and T_Linternal stores the intervals between all packets in the stream, written T_Linternal = {t₀, t₁, t₂, ..., t_N-1}. The sum of all elements of T_Linternal is the total time, denoted SUMT. Every stream that falls within the range T is called a "periodic stream", and the set of all time intervals of a periodic stream is T_Linternal.
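The three definitions above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the patent's implementation: the `Packet` record and its field names are hypothetical, both directions of an IP pair are assumed to belong to the same stream, and parsing packets out of a capture file is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; field names are illustrative."""
    ts: float          # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    size: int          # packet length
    flags: frozenset   # TCP flags, e.g. frozenset({"SYN"})


def stream_key(p):
    # Assumption: both directions of an IP pair belong to one stream.
    return tuple(sorted((p.src_ip, p.dst_ip)))


def session_key(p):
    # Direction-independent 4-tuple: sort the two (ip, port) endpoints.
    return tuple(sorted(((p.src_ip, p.src_port), (p.dst_ip, p.dst_port))))


def group_streams(packets):
    """Group time-ordered packets by [source IP, destination IP]."""
    streams = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):
        streams[stream_key(p)].append(p)
    return dict(streams)


def split_sessions(stream):
    """Decompose one stream into sessions keyed by the 4-tuple."""
    sessions = defaultdict(list)
    for p in stream:
        sessions[session_key(p)].append(p)
    return dict(sessions)


def inter_arrival(stream):
    """Return (T_Linternal, SUMT): adjacent-packet gaps and their sum."""
    ts = [p.ts for p in stream]
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return gaps, sum(gaps)
```

With this grouping, one stream between two hosts can hold several sessions, and `inter_arrival` returns the interval set T_Linternal together with its sum SUMT.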

Summary of the invention

The technical problem addressed by the invention is the detection of remote-control Trojans. It provides an effective detection method that accurately detects both known and unknown remote-control Trojans. The method comprises:

1. Collect network communication packets and extract different "streams" according to distinct [source IP address, destination IP address] pairs;

2. For each captured stream, analyze the segment of traffic that begins with the three-way handshake initiated by a "SYN" packet and ends when the time threshold T (T = 300 s) is reached, and extract the following features:

f₀: the number of all packets in the periodic stream whose flags are [FIN, ACK] or [RST, ACK];

f₁: the number of sessions contained in the periodic stream;

f₂: the variance of the sequence formed by all uplink packets of the longest session in the periodic stream;

f₃: the sign of the difference between the average size of uplink packets flagged [PUSH, ACK] and the average size of downlink packets flagged [PUSH, ACK] in the periodic stream: 1 if the difference is greater than 0, 0 if it equals 0, and -1 if it is less than 0. Let P_bup be the total bytes and C_bup the number of uplink [PUSH, ACK] packets within time T, and let P_bdown be the total bytes and C_bdown the number of downlink [PUSH, ACK] packets. Then:

$$f_3 = \operatorname{sgn}\left(\frac{P_{bup}}{C_{bup}} - \frac{P_{bdown}}{C_{bdown}}\right)$$

f₄: the average number of bytes sent per second in the downlink direction of the periodic stream. Let P_down be the total bytes of all downlink packets within time T, and let T_down be the total time taken by downlink packets within time T, obtained from T_Linternal. Then:

$$f_4 = \frac{P_{down}}{T_{down}}$$

f₅: the average uplink bytes per second divided by the average downlink bytes per second in the periodic stream. Let T_up be the total time taken by uplink packets within time T, obtained from T_Linternal, and let P_up be the total bytes of all uplink packets within time T. Then:

$$f_5 = \frac{P_{up} / T_{up}}{P_{down} / T_{down}}$$

f₆: the number of packets in the periodic stream whose size is greater than 90;

f₇: the number of downlink packets sent per second in the periodic stream, i.e., the total number of downlink packets divided by the time taken by the downlink packets. Let C_down be the number of all downlink packets within time T. Then:

$$f_7 = \frac{C_{down}}{T_{down}}$$

3. Label each captured stream: remote-control Trojan traffic is labeled 1 and normal traffic is labeled 0. Store the labels and the corresponding 8 behavior features in a database to form the training set;

4. Combine the SMOTE sampling algorithm with the XGBoost classification algorithm so that the classification of the imbalanced data set is addressed at both the data level and the algorithm level. The training set is first processed by the SMOTE sampling algorithm to obtain a new synthesized training set; XGBoost is then used to perform classification learning on the new synthesized training set, yielding the original classifier.

5. Use grid search to systematically traverse combinations of classifier parameters, determine the optimal parameters by cross-validation, and then use these parameters to optimize the original classifier over the whole training set.

6. Take real network traffic as the detection target and analyze the detection results of the method.
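Features f₃, f₄, f₅ and f₇ from step 2 are ratios of per-stream totals, so they can be illustrated with a few lines of Python. The sketch below is an assumption-laden illustration rather than the patent's code: the function name `rate_features` is hypothetical, and returning 0 when a denominator is zero is a choice made here because the source does not say how empty directions are handled.

```python
def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rate_features(p_bup, c_bup, p_bdown, c_bdown,
                  p_up, t_up, p_down, t_down, c_down):
    """Compute f3, f4, f5 and f7 from per-stream totals.

    p_bup / c_bup     : bytes / count of uplink [PUSH, ACK] packets
    p_bdown / c_bdown : bytes / count of downlink [PUSH, ACK] packets
    p_up, t_up        : uplink total bytes and total sending time (s)
    p_down, t_down    : downlink total bytes and total sending time (s)
    c_down            : number of downlink packets
    """
    def ratio(num, den):
        # Assumption: a zero denominator yields 0.0 instead of an error.
        return num / den if den else 0.0

    f3 = sign(ratio(p_bup, c_bup) - ratio(p_bdown, c_bdown))
    f4 = ratio(p_down, t_down)                # downlink bytes per second
    f5 = ratio(ratio(p_up, t_up), f4)         # uplink rate / downlink rate
    f7 = ratio(c_down, t_down)                # downlink packets per second
    return {"f3": f3, "f4": f4, "f5": f5, "f7": f7}


# Made-up totals for one stream: uplink [PUSH, ACK] packets average
# 300 bytes, downlink ones 50 bytes, so f3 is 1.
example = rate_features(3000, 10, 1000, 20, 6000, 20.0, 2000, 40.0, 80)
print(example)  # {'f3': 1, 'f4': 50.0, 'f5': 6.0, 'f7': 2.0}
```

In this made-up example the uplink direction carries larger pushes and a higher byte rate than the downlink direction, which gives f₃ = 1 and f₅ = 6.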

The beneficial effects of the invention are as follows. The invention selects features based on the size, number, flags, and timing of network packets, and thereby realizes an effective remote-control Trojan detection method. The main contribution of the method is the combination of SMOTE oversampling with XGBoost classification, which addresses the classification of imbalanced data sets at both the data level and the algorithm level. The resulting method is not limited to detecting host-side traffic; it can also be used to detect whether known or unknown remote-control Trojans are present at key network nodes.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method.

Embodiment

S1. The generation of this remote-control Trojan detection method mainly involves the following four modules: the traffic collection module, the behavior feature extraction module, the classifier creation module, and the classifier optimization and evaluation module.

S2. The traffic collection module is responsible for collecting the data sets required to create and test the method;

S21. Traffic collection: Using NetAnalyzer and Wireshark, the communication traffic of seven computers (two of them implanted with Trojan programs) was captured in a controlled environment. The traffic falls into three categories: first, the communication traffic of 24 remote-control Trojan samples collected from China and abroad; second, the communication traffic of 10 known normal applications; third, mixed network traffic. In total, 291.17 hours of communication traffic were collected and stored in .pcap format.

S22. Traffic filtering: TCP traffic is selected from the saved .pcap files, and different "streams" are extracted according to distinct [source IP address, destination IP address] pairs.

S23. Reassembled communication traffic must satisfy the following two conditions: (1) the stream is the segment that begins with the three-way handshake initiated by a "SYN" packet and ends when the time threshold T (T = 300 s) is reached, and each stream may consist of 1 to N communication "sessions" with distinct [source IP address, source port, destination IP address, destination port] tuples; (2) the duration of the whole stream is greater than 1 s, that is, streams that end in less than 1 s are not considered;
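The two conditions in S23 amount to cutting a time window out of each stream. A minimal Python sketch is given below under stated assumptions: packets are represented as hypothetical `(timestamp, flags)` tuples already sorted by time, only the first SYN-initiated window of a stream is taken, and a packet exactly at the threshold is kept; the source does not settle these details.

```python
T_WINDOW = 300.0     # time threshold T from the text, in seconds
MIN_DURATION = 1.0   # streams lasting 1 s or less are discarded


def cut_stream(packets, window=T_WINDOW, min_duration=MIN_DURATION):
    """Return the packets of one periodic stream, or None if it is rejected.

    packets: list of (timestamp, flags) tuples sorted by timestamp,
             where flags is a set such as {"SYN"} or {"PUSH", "ACK"}.
    """
    # Condition (1): start at the first pure SYN (start of the handshake).
    start = next(
        (i for i, (_, flags) in enumerate(packets) if flags == {"SYN"}),
        None,
    )
    if start is None:
        return None

    t0 = packets[start][0]
    # Keep everything up to the time threshold T after the SYN.
    segment = [p for p in packets[start:] if p[0] - t0 <= window]

    # Condition (2): the whole segment must last longer than 1 s.
    duration = segment[-1][0] - t0
    return segment if duration > min_duration else None
```

A stream whose last packet arrives 0.5 s after the SYN is dropped by the second condition, while packets arriving more than 300 s after the SYN are simply left out of the segment.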

S3. The behavior feature extraction module is responsible for analyzing the differences between remote-control Trojan traffic and normal network traffic and for finding network communication features that are effective for this kind of detection. Each periodic stream after processing is denoted F_k (k = 1, 2, 3, ...). The behavior feature extraction module comprises the following steps:

S31. Count the total number of packets in the bidirectional stream whose flags are [FIN, ACK] or [RST, ACK], denoted f₀;

S32. Count the number of packets in F_k whose size is greater than 90 (this count is feature f₆). The periodic stream is filtered and reassembled, decomposing it into 1 to n communication "sessions" with distinct [source IP address, source port, destination IP address, destination port] tuples; the longest session is denoted M_s, and the number of sessions is counted and denoted f₁. All uplink packets in M_s form a new sequence, and the variance of this sequence is denoted f₂;

S33. In the periodic stream F_k, subtract the average size of downlink packets flagged [PUSH, ACK] from the average size of uplink packets flagged [PUSH, ACK], and denote the result f₃: it is assigned 1 if the difference is greater than 0, 0 if it equals 0, and -1 if it is less than 0. Let P_bup be the total bytes and C_bup the number of uplink [PUSH, ACK] packets within time T, and let P_bdown be the total bytes and C_bdown the number of downlink [PUSH, ACK] packets. Then:

$$f_3 = \operatorname{sgn}\left(\frac{P_{bup}}{C_{bup}} - \frac{P_{bdown}}{C_{bdown}}\right)$$

S35. Calculate the average number of bytes sent per second in the downlink direction of the periodic stream, denoted f₄. Let P_down be the total bytes of all downlink packets within time T, and let T_down be the total time taken by downlink packets within time T, obtained from T_Linternal. Then:

$$f_4 = \frac{P_{down}}{T_{down}}$$

S36. Calculate the average uplink bytes per second divided by the average downlink bytes per second in the periodic stream, denoted f₅. Let T_up be the total time taken by uplink packets within time T, obtained from T_Linternal, and let P_up be the total bytes of all uplink packets within time T. Then:

$$f_5 = \frac{P_{up} / T_{up}}{P_{down} / T_{down}}$$

S37. Calculate the number of downlink packets sent per second in the periodic stream, denoted f₇, i.e., the total number of downlink packets divided by the time taken by the downlink packets. Let C_down be the number of all downlink packets within time T. Then:

$$f_7 = \frac{C_{down}}{T_{down}}$$
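The count-based features f₀, f₁, f₂ and f₆ can be sketched in Python as follows. Several points are assumptions made for illustration because the source leaves them open: the packet dictionaries and their keys are hypothetical, the "longest" session is taken to be the one with the most packets, the variance is the population variance of uplink packet sizes, and the size threshold of 90 is read as bytes.

```python
from collections import defaultdict
from statistics import pvariance

# Flag combinations that close or reset a TCP connection.
CLOSE_FLAGS = ({"FIN", "ACK"}, {"RST", "ACK"})


def count_features(packets, size_threshold=90):
    """Compute f0, f1, f2 and f6 for one periodic stream.

    packets: list of dicts with the hypothetical keys
             'session' (hashable 4-tuple id), 'direction' ('up' or 'down'),
             'size' (packet length) and 'flags' (set of TCP flag names).
    """
    packets = list(packets)

    # f0: packets flagged [FIN, ACK] or [RST, ACK].
    f0 = sum(1 for p in packets if set(p["flags"]) in CLOSE_FLAGS)

    # f1: number of distinct sessions in the stream.
    sessions = defaultdict(list)
    for p in packets:
        sessions[p["session"]].append(p)
    f1 = len(sessions)

    # f2: variance of uplink packet sizes in the longest session
    # (assumption: longest = most packets; population variance).
    longest = max(sessions.values(), key=len) if sessions else []
    up_sizes = [p["size"] for p in longest if p["direction"] == "up"]
    f2 = pvariance(up_sizes) if up_sizes else 0.0

    # f6: packets larger than the threshold.
    f6 = sum(1 for p in packets if p["size"] > size_threshold)

    return {"f0": f0, "f1": f1, "f2": f2, "f6": f6}
```

If the longest session is instead defined by duration, only the `key` passed to `max` needs to change.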

S4. The classifier creation module uses the XGBoost algorithm to perform classification learning on the training set newly synthesized by the SMOTE algorithm and generates an original classifier;

S41. Label each captured stream: remote-control Trojan traffic is labeled 1 and normal traffic is labeled 0. Store the labels and the corresponding 8 behavior features in a database to form the training set for the method. The training set T1 consists of 1862 streams obtained by screening and filtering the 291.17 hours of traffic from the seven machines; 119 of these streams were produced by remote-control Trojans. 70% of T1 is used for training and is denoted TR1, and the remaining 30% is used for testing and is denoted TE1. TR1 is then processed with the SMOTE algorithm; the ratio of normal streams to remote-control Trojan streams in the initial TR1 is 1214:89;

S42. To address the class imbalance in the training samples, the SMOTE sampling algorithm is applied. After SMOTE processes the original data set, the ratio of normal streams to remote-control Trojan streams in the newly synthesized sample is 1214:1246. The newly synthesized training set is denoted Tsynthesis. In addition, test set TE2 consists of 1342 streams obtained by screening and filtering a total of 145.83 hours of traffic from another five machines; 86 of these streams were produced by remote-control Trojans.

S43. Classification training is performed with the XGBoost algorithm. To effectively avoid overfitting and underfitting, K-fold cross-validation is carried out. K-fold cross-validation divides the original data into K groups; each subset serves once as the validation set while the remaining K-1 subsets serve as the training set, which yields K models. The average classification accuracy of these K models on their validation sets is used as the performance indicator of the classifier under K-fold cross-validation. In this work, K is set to 6. A detection model is finally generated;
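The oversampling step in S42 can be illustrated with a simplified, standard-library-only version of SMOTE. This is a sketch of the general technique, not the authors' implementation, and the feature values below are random placeholders. The setting of 13 synthetic samples per minority sample is an inference: it is one setting that turns 89 minority streams into the reported 1246 (89 + 89 × 13), but the source does not state the oversampling rate or the neighbour count.

```python
import math
import random


def smote(minority, n_per_sample, k=5, seed=0):
    """Return synthetic minority samples (simplified SMOTE).

    For every minority sample, pick one of its k nearest minority
    neighbours at random and create a point at a random position on
    the line segment between the two. Repeats n_per_sample times.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [y for j, y in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda y: math.dist(x, y))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position between x and its neighbour
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Placeholder data: 89 Trojan streams with 8 features each.
rng = random.Random(1)
minority = [[rng.random() for _ in range(8)] for _ in range(89)]
synthetic = smote(minority, n_per_sample=13)
print(len(minority) + len(synthetic))  # 89 + 89 * 13 = 1246
```

Because every synthetic point lies between two real minority samples, the oversampled class stays inside the region already occupied by the Trojan streams instead of duplicating individual rows.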

S5. The classifier optimization and evaluation module selects the important parameters of the original classifier and evaluates the detection performance of the optimal classifier.

S51. Grid search is used to systematically traverse multiple parameter combinations, and the optimal parameters are determined by cross-validation; these parameters are then used to optimize the method over the whole training set. The parameters determined are: number of estimators 72, minimum sum of sample weights in a leaf node 1, maximum tree depth 6, row subsampling ratio per tree 0.9, column subsampling ratio per tree 0.8, and minimum loss reduction required to split a node 0.2.

S52. Substitute the optimal parameters into the original classifier to generate the optimal detection classifier.

S53. Feed the test set into the detection classifier, which judges each item in the test set: if remote-control Trojan communication is present, the output for the corresponding traffic is 1; otherwise it is 0. The test results show that the classifier generated by the above method effectively detects all remote-control Trojan communication in the test set (as shown in Table 3), with a false-positive rate of almost 0.
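The tuned values in S51 map naturally onto parameter names of XGBoost's scikit-learn interface, and the grid search can be sketched as below. The mapping is an interpretation of the description, the candidate values in `PARAM_GRID` are hypothetical because the source does not list the values that were searched, and `build_search` assumes that the `xgboost` and `scikit-learn` packages are installed.

```python
from itertools import product

# Tuned values reported in S51, expressed with XGBoost's scikit-learn
# parameter names (the name mapping is an interpretation).
BEST_PARAMS = {
    "n_estimators": 72,       # number of estimators
    "min_child_weight": 1,    # minimum sum of sample weights in a leaf
    "max_depth": 6,           # maximum tree depth
    "subsample": 0.9,         # row subsampling ratio per tree
    "colsample_bytree": 0.8,  # column subsampling ratio per tree
    "gamma": 0.2,             # minimum loss reduction to split a node
}

# Hypothetical search grid; the source does not give the candidates.
PARAM_GRID = {
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 3],
    "gamma": [0.0, 0.2],
}


def iter_grid(grid):
    """Yield every parameter combination in the grid, one dict each."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def build_search(cv=6):
    """Build a 6-fold grid search over PARAM_GRID (needs xgboost, sklearn)."""
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Parameters outside the grid are fixed at the reported values.
    fixed = {k: v for k, v in BEST_PARAMS.items() if k not in PARAM_GRID}
    return GridSearchCV(XGBClassifier(**fixed), PARAM_GRID,
                        cv=cv, scoring="accuracy")


print(sum(1 for _ in iter_grid(PARAM_GRID)))  # 3 * 2 * 2 = 12
```

Calling `build_search().fit(X, y)` on the synthesized training set would evaluate each of the 12 combinations with 6-fold cross-validation and expose the winner as `best_params_`.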

Table 1. Trojans selected for the model and the corresponding number of versions

RAT sample	Number of versions	RAT sample	Number of versions
Nuclear	3	Gh0st	2
Bandook	1	Upper emerging remote control	1
Great white shark	1	DarkComet	2
Grey pigeon	1	remote	1
Bozok	1	Taidoor	1
CyberGate RAT	1	PoisonIvy	2
Pandora RAT	1	SpyNet	1
Comet Rat	1	Kong Juyuan is controlled	1
Star RAT	1	Xtreme RAT	2
Pcshare	1	njRAT	3
VanToM RAT	1	Plugx	2
X RAT	1	HAKOPS RAT	1

Table 2. Comparison of cross-validation results for four detection methods

Table 3. Comparison of test results for three methods

Claims

1. A remote-control Trojan detection method, characterized in that generating the method involves four main modules; module one, module two, module three, and module four denote, respectively, the traffic collection module, the behavior feature extraction module, the classifier creation module, and the classifier optimization and evaluation module. The traffic collection module is responsible for collecting the data sets required to create and test the classifier: it selects communication traffic based on the transport-layer TCP protocol and divides the traffic according to [source IP address, destination IP address] to obtain multiple streams. The behavior feature extraction module is responsible for analyzing the differences between remote-control Trojan traffic and normal network traffic and for finding network communication features suited to this kind of detection. The classifier creation module generates the original classifier from the generated training set. The classifier optimization and evaluation module tunes the parameters of the generated classifier to optimize the original classifier and obtain a new classifier, and then uses the new classifier to classify the test set and evaluate the detection results;

Module one is implemented as follows: for each stream k obtained after division, the segment of traffic that begins with the three-way handshake initiated by a "SYN" packet and ends when the time threshold T is reached is analyzed; this segment is denoted the periodic stream F_k (k = 1, 2, 3, ...); the time threshold T is a preset value;

The behavior feature extraction module extracts features from each periodic stream F_k (k = 1, 2, 3, ...) produced by module one according to the following steps:

Step 1: Count the total number of packets in the bidirectional stream whose flags are [FIN, ACK] or [RST, ACK], denoted f₀;

Step 2: Filter and reassemble the periodic stream F_k, decomposing it into 1 to n communication "sessions" with distinct [source IP address, source port, destination IP address, destination port] tuples; denote the longest session M_s, and count the number of sessions, denoted f₁; form a new sequence from all uplink packets in M_s and calculate the variance of this sequence, denoted f₂;

Step 3: In the periodic stream F_k, subtract the average size of downlink packets flagged [PUSH, ACK] from the average size of uplink packets flagged [PUSH, ACK], and denote the result f₃: it is assigned 1 if the difference is greater than 0, 0 if it equals 0, and -1 if it is less than 0; let P_bup be the total bytes and C_bup the number of uplink [PUSH, ACK] packets within time T, and let P_bdown be the total bytes and C_bdown the number of downlink [PUSH, ACK] packets; then:

$$f_3 = \operatorname{sgn}\left(\frac{P_{bup}}{C_{bup}} - \frac{P_{bdown}}{C_{bdown}}\right)$$

Step 4: Calculate the average number of bytes sent per second in the downlink direction of the periodic stream, denoted f₄; let P_down be the total bytes of all downlink packets within time T, and let T_down be the total time taken by downlink packets within time T, obtained from T_Linternal; then:

$$f_4 = \frac{P_{down}}{T_{down}}$$

Step 5: Calculate the average uplink bytes per second divided by the average downlink bytes per second in the periodic stream, denoted f₅; let T_up be the total time taken by uplink packets within time T, obtained from T_Linternal, and let P_up be the total bytes of all uplink packets within time T; then:

$$f_5 = \frac{P_{up} / T_{up}}{P_{down} / T_{down}}$$

Step 6: Count the number of packets in the periodic stream whose size is greater than 90, denoted f₆;

Step 7: Calculate the number of downlink packets sent per second in the periodic stream, denoted f₇, i.e., the total number of downlink packets divided by the time taken by the downlink packets; let C_down be the number of all downlink packets within time T; then:

$$f_7 = \frac{C_{down}}{T_{down}}$$

Module three is implemented as follows:

Label each captured stream: remote-control Trojan traffic is labeled 1 and normal traffic is labeled 0; store the labels and the corresponding 8 behavior feature values in a database as the training set for the method; to address the class imbalance in the training samples, apply the SMOTE sampling algorithm to generate a new synthesized training set; perform classification learning on the new synthesized training set with the XGBoost algorithm to generate an original classifier.

2. The method according to claim 1, characterized in that module four is implemented as follows:

Step 1: Use grid search to systematically traverse multiple parameter combinations, determine the optimal parameters by cross-validation, and then use these parameters to optimize the original classifier over the whole training set; the parameters determined are: number of estimators 72, minimum sum of sample weights in a leaf node 1, maximum tree depth 6, row subsampling ratio per tree 0.9, column subsampling ratio per tree 0.8, and minimum loss reduction required to split a node 0.2;

Step 2: Substitute the optimal parameters obtained in step 1 into the generated original classifier to produce the optimal classifier;

Step 3: Process the test samples with module one and module two; after the behavior features are extracted, the resulting test set of data to be detected is fed into the optimal classifier, which judges each item in the test set and outputs the result: if remote-control Trojan communication is present, the output for the corresponding traffic is 1; otherwise it is 0.