Fast Trojan detection method based on heartbeat behavior analysis
Technical field
The present invention relates to a Trojan detection method based on communication behavior analysis, and in particular to a fast Trojan detection method based on heartbeat behavior analysis.
Background technology
Most current information-theft attacks are carried out with Trojans, whose most prominent characteristic is that their behavior is highly covert. After a Trojan has been implanted on a target computer, the Trojan's control end must communicate with the controlled end, either to issue control commands to it or to have it return the information it has obtained. The covertness of this communication largely determines the Trojan's ability to survive. Network covert channel technology, which has emerged in recent years, embeds communication data in normal network protocols for transmission and suits the needs of Trojan communication very well. Communicating over network covert channels has become the main mode of Trojan communication: attackers often set up covert channels over common protocols such as HTTP and HTTPS to control compromised hosts remotely and steal information. The rapid development of Trojan communication techniques poses a serious threat to national security and stability. How to detect the network communication behavior of Trojans effectively has therefore become an important theoretical and technical problem in the field of information security.
At present there are many Trojan detection methods based on communication behavior. They mainly focus on detecting the interactive operations between the attacker and the controlled end; no method aimed at Trojan heartbeat behavior has yet appeared. All of these methods have certain shortcomings and lack good generality.
Borders et al. construct a variety of filters from features such as the time interval between HTTP requests, request packet size, header format, bandwidth usage, and request regularity, and use them to detect Trojan communication. A Trojan can, however, bypass these filters by modifying a few communication details. For example, a Trojan only needs to keep its request packet size below a certain threshold to defeat the request-size filter.
Pack et al. proposed detecting HTTP covert channels using behavior profiles of data flows. A behavior profile is based on a large number of metrics, such as average packet size, the ratio of small to large packets, changes in packet patterns, the total number of packets sent and received, and connection duration. If the observed characteristics of a flow deviate from the behavior profile of normal HTTP traffic, the flow is very likely an HTTP covert channel. The method mainly targets HTTP tunnels and has poor generality.
Tumoian et al. train an Elman network on the sequence of TCP ISNs (initial sequence numbers) produced by a normal protocol, and then compare the actual ISN with the ISN predicted by the neural network; when the difference between the actual and predicted values exceeds a preset threshold, a covert channel is considered to exist. The authors used this method to detect the NUSHU covert channel. The method can only detect specific Trojan communication, however, and likewise lacks generality.
Zhang and Paxson use packet inter-arrival times and packet sizes to describe a Trojan communication interaction model for detecting malicious programs such as Trojans and backdoors. The model characterizes Trojan communication as follows: (1) the inter-arrival times of adjacent packets follow a Pareto distribution; (2) because Trojan communication involves command interaction, small packets should account for a certain proportion of the traffic. In practice, however, a Trojan can use different algorithms to make the inter-arrival times satisfy various distributions, and inter-arrival times are also strongly affected by network topology, so using them to describe behavior has drawbacks. In addition, short commands can be hidden inside larger HTML pages, so relying on the proportion of small packets does not achieve effective detection.
The basic concepts involved in the present invention are explained below.
Trojan heartbeat: to indicate that it is still alive, a Trojan establishes and maintains a session between its client and server until the Trojan program at either end is closed or the network connection is broken. The session is maintained by sending packets to the peer. Because most such packets are sent at timed intervals, in a manner and with a purpose similar to an animal's heartbeat, they are called "heartbeat packets".
Heartbeat interval: there is a time interval between two adjacent heartbeat processes, called the "heartbeat interval". Depending on whether the heartbeat interval is constant, Trojan heartbeats fall into two types: (1) fixed-interval heartbeat, in which the heartbeat interval is constant; (2) variable-interval heartbeat. Because a fixed-interval heartbeat is plainly regular, it can hardly withstand statistical analysis. Attackers therefore often use various algorithms to randomize the heartbeat interval so that it no longer has obvious statistical features and resists detection. A fixed-interval heartbeat can be regarded as a trivial special case of a variable-interval heartbeat.
Heartbeat process: each time a Trojan sends a heartbeat packet, the controlled-end and control-end programs may also send some other packets to each other to acknowledge receipt. A heartbeat packet together with the group of acknowledgement packets that accompany it is called a "heartbeat process".
Trojan communication process: Trojan communication can be divided into two phases, the connection-maintenance (no-operation) phase and the operation phase. After a Trojan has been implanted on the target system, the attacker operates it only during limited periods (the Trojan communication is then in the operation phase); for most of the remaining time the Trojan is idle. The phase in which some Trojans, while idle, keep up their connection with the attacker is called the connection-maintenance phase.
Four-tuple: {source IP address, source port, destination IP address, destination port} is called a four-tuple.
Equivalent four-tuples: if the four-tuples $\{a_1, b_1, c_1, d_1\}$ and $\{a_2, b_2, c_2, d_2\}$ satisfy $a_1 = c_2$, $b_1 = d_2$, $c_1 = a_2$ and $d_1 = b_2$, then $\{a_1, b_1, c_1, d_1\}$ and $\{a_2, b_2, c_2, d_2\}$ are called equivalent four-tuples.
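As an illustration of the definition, the following minimal Python sketch checks whether two four-tuples are equivalent and orients a four-tuple so that both directions of a session share one key. The names `FourTuple`, `is_equivalent`, `session_key` and the `monitored_ips` parameter are hypothetical and chosen only for this example; the addresses are made-up sample values.

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def is_equivalent(p: FourTuple, q: FourTuple) -> bool:
    """Return True when q is p with source and destination swapped."""
    return (p.src_ip == q.dst_ip and p.src_port == q.dst_port
            and p.dst_ip == q.src_ip and p.dst_port == q.src_port)


def session_key(p: FourTuple, monitored_ips: frozenset) -> FourTuple:
    """Orient a four-tuple so that the monitored host is the source.

    `monitored_ips` is an assumed input: the addresses of the monitored hosts.
    Both directions of one session then map to the same key.
    """
    if p.src_ip in monitored_ips:
        return p
    return FourTuple(p.dst_ip, p.dst_port, p.src_ip, p.src_port)


outbound = FourTuple("192.168.1.10", 50000, "203.0.113.5", 443)
inbound = FourTuple("203.0.113.5", 443, "192.168.1.10", 50000)
monitored = frozenset({"192.168.1.10"})

print(is_equivalent(outbound, inbound))  # True
print(session_key(inbound, monitored) == session_key(outbound, monitored))  # True
```

Orienting every packet toward the monitored host is one simple way to make a single key identify the traffic in both directions.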
Summary of the invention
The object of the invention is to detect Trojan communication in a network effectively by analyzing Trojan heartbeat behavior, so that the link between the controlled end and the attacker can be cut in time and information theft can be effectively prevented. Specifically, a fast Trojan detection method based on heartbeat behavior analysis is provided.
Technical scheme: a fast Trojan detection method based on heartbeat behavior analysis, which detects suspected Trojans by analyzing whether the "heartbeat interval" between adjacent heartbeat processes is regular and whether the ratio of packets received to packets sent by the controlled end is equal across heartbeat processes.
To make heartbeat features easy to extract, the network data must be organized into session linked lists. The efficiency of building the session linked lists directly affects the efficiency of heartbeat feature extraction, so an algorithm for building them quickly is proposed.
The captured network data is organized by network session, with the IP address and port of the monitored host taken as the source IP address and source port. Packets are grouped into sessions by equivalent four-tuple, that is, each session is uniquely identified by its equivalent four-tuples (so each session linked list contains the traffic in both directions), and a session linked list is chosen as the data structure for storing a session. A linked list is chosen because network communication is dynamic: the packets in a session keep growing as communication proceeds, so the data structure that stores the session must change dynamically as well. While the session linked lists are being built, the position corresponding to each packet must be found from the equivalent four-tuple of the list nodes, and the packet inserted there. How sessions are recorded and how fast they can be looked up therefore directly affect session reassembly efficiency.
Sessions can be stored in a multidimensional array or in multi-level linked lists. A multidimensional array offers high storage efficiency, convenient lookup, and fast access, but its storage must be allocated in advance and its size cannot be changed once it is created, which easily wastes space; moreover, the number of network sessions is not fixed, so space cannot be allocated for them in advance. A linked list can grow or shrink dynamically and needs no preallocated space, but lookup is slow.
The present invention reassembles sessions using an array-linked-list structure that combines a hash table with multi-level linked lists. An array linked list is a data structure that combines an array with linked lists; it improves lookup efficiency considerably at the cost of a small amount of extra storage. The link order of the array linked list can be set according to the characteristics of each element of the equivalent four-tuple: the element whose value range is moderate and whose sessions are most evenly distributed is made the first level, and the remaining levels are ordered in turn, so as to obtain higher session reassembly efficiency. The detailed analysis is as follows:
Let the number of sessions be $S$. If all sessions are stored in a traditional singly linked list, the list must be searched sequentially each time a packet is received, and a sequential search takes $S/2$ comparisons on average.
When sessions are organized as an array linked list, let the array have $n$ indices and let $\alpha_i$ be the number of session linked lists attached to the $i$-th index, so that $\sum_{i=1}^{n} \alpha_i = S$. The probability that a received packet falls under the $i$-th index is then
$$p_i = \frac{\alpha_i}{S}$$
Therefore the average time complexity of a lookup is:
$$\bar{C} = \sum_{i=1}^{n} p_i \cdot \frac{\alpha_i}{2} = \frac{1}{2S} \sum_{i=1}^{n} \alpha_i^2$$
By the theorem "the root mean square is greater than or equal to the arithmetic mean":
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} \alpha_i^2} \ge \frac{1}{n} \sum_{i=1}^{n} \alpha_i = \frac{S}{n}$$
Squaring both sides of the inequality gives:
$$\frac{1}{n} \sum_{i=1}^{n} \alpha_i^2 \ge \frac{S^2}{n^2}, \quad \text{hence} \quad \bar{C} = \frac{1}{2S} \sum_{i=1}^{n} \alpha_i^2 \ge \frac{S}{2n}$$
Equality holds if and only if $\alpha_1 = \alpha_2 = \ldots = \alpha_n$, where $\sum_{i=1}^{n} \alpha_i = S$; that is, when $\alpha_i = S/n$, $\bar{C}$ reaches its minimum $S/(2n)$.
It follows that when the linked-list nodes are distributed evenly over the array indices, the time complexity of looking up a packet is minimal and lower than that of a singly linked list. When building the session linked lists, the order of the elements should therefore be chosen according to the value range of each element of the equivalent four-tuple and the distribution of the corresponding sessions.
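The effect of bucket balance on lookup cost can be checked numerically. The sketch below assumes, as the analysis does, that a packet falls under index i with probability α_i/S and that searching a list of α_i sessions sequentially costs α_i/2 comparisons on average. The function name `average_lookup_cost` and the sample bucket sizes are illustrative only.

```python
def average_lookup_cost(bucket_sizes):
    """Expected comparisons per lookup: sum(alpha_i^2) / (2 * S)."""
    total = sum(bucket_sizes)
    return sum(a * a for a in bucket_sizes) / (2 * total)


single_list = [100]        # one list holding all S = 100 sessions
even = [25, 25, 25, 25]    # n = 4 indices, sessions spread evenly
skewed = [85, 5, 5, 5]     # n = 4 indices, most sessions under one index

print(average_lookup_cost(single_list))  # 50.0  (S / 2)
print(average_lookup_cost(even))         # 12.5  (S / (2n))
print(average_lookup_cost(skewed))       # 36.5
```

With the same number of indices, the skewed layout costs almost three times as much as the even one, which is why the first-level key should be the element whose sessions are most evenly distributed.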
The value range of each element of the equivalent four-tuple and the distribution of the corresponding sessions are as follows:
(1) Source IP address: usually the IP address of an intranet host. Its value range is 10.0.0.0~10.255.255.255, 172.16.0.0~172.31.255.255, and 192.168.0.0~192.168.255.255. Compared with the IP space of the Internet, the source IP address space is small and the corresponding sessions are evenly distributed.
(2) Source port: under the RFC protocol specifications, the source port number is generally any value between 1024 and 65535. The value space of the source port is fairly large, and the corresponding sessions are unevenly distributed.
(3) Destination IP address: the value range of the destination IP address is the whole IPv4 address space. The value space is huge, and the corresponding sessions are unevenly distributed.
(4) Destination port: the destination port is generally the port assigned to the protocol, mainly in the range 1~1023. Because current network traffic is dominated by protocols such as HTTP and HTTPS, the destination port of most traffic is 80, 443, 8080, and the like, so the corresponding sessions are very unevenly distributed.
In summary, the source IP address has a small value range and an even distribution, and the corresponding sessions are also fairly evenly distributed, so it is suitable as the first level of the array linked list. Taking a Class C local area network as the monitored target, the array linked list is built as follows: because the last byte of the source IP address is the most evenly distributed, it is used as the hash value of the source IP address to build the hash table, and the source IP address is made the first level of the array linked list. Likewise, the source port, destination IP address, and destination port are made the second, third, and fourth levels respectively. The session linked list structure based on the array linked list is shown in Fig. 1.
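The four-level layout can be sketched as follows. This is a minimal illustration, not the structure of Fig. 1: it assumes IPv4 dotted-quad strings, uses a fixed array of 256 buckets indexed by the last byte of the source IP address as the first level, and substitutes Python dictionaries and lists for the linked lists of the second to fourth levels. The class name `SessionTable` and its methods are hypothetical.

```python
class SessionTable:
    """Array of 256 buckets keyed by the last byte of the source IP,
    then nested levels for source port, destination IP, destination port."""

    def __init__(self):
        # First level: fixed-size array, one bucket per last-byte value.
        self.buckets = [dict() for _ in range(256)]

    @staticmethod
    def _index(src_ip):
        # Hash value = last byte of the dotted-quad source address.
        return int(src_ip.rsplit(".", 1)[1])

    def add_packet(self, src_ip, src_port, dst_ip, dst_port, timestamp, outbound):
        """Append one packet record to its session, creating levels as needed."""
        level = self.buckets[self._index(src_ip)]
        for key in (src_ip, src_port, dst_ip):
            level = level.setdefault(key, {})
        level.setdefault(dst_port, []).append((timestamp, outbound))

    def session(self, src_ip, src_port, dst_ip, dst_port):
        """Return the packet records of a session, or an empty list."""
        level = self.buckets[self._index(src_ip)]
        for key in (src_ip, src_port, dst_ip):
            level = level.get(key, {})
        return level.get(dst_port, [])


table = SessionTable()
table.add_packet("192.168.1.10", 50000, "203.0.113.5", 443, 0.0, True)
table.add_packet("192.168.1.10", 50000, "203.0.113.5", 443, 0.1, False)
print(len(table.session("192.168.1.10", 50000, "203.0.113.5", 443)))  # 2
```

Packets are assumed to have been oriented beforehand so that the monitored host is always the source; the `outbound` flag then records the direction of each packet within the session.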
Trojan communication behavior can be detected by detecting "heartbeat packets". Heartbeat packets show clear statistical regularity, and their features are extracted using traditional statistical analysis combined with time-frequency analysis. In line with the characteristics of Trojan communication, it is first determined whether the session exhibits "heartbeat-like behavior", and the "heartbeat interval" of that behavior is calculated.
Let the timestamps (unit: seconds) of the packets arriving in the session be
$$T = \{t_1, t_2, \ldots, t_n\}$$
and take
$$Z = \{z_i \mid z_i = t_{i+1} - t_i,\ i = 1, 2, \ldots, n-1\}$$
Suppose the minimum "heartbeat interval" of a Trojan heartbeat is $\min\{\Delta t\}$ (for example, $\min\{\Delta t\} = 5$). Then, for $z_i \in Z$, if $z_i \ge \min\{\Delta t\}$, "heartbeat-like behavior" is considered to exist in the session; otherwise it is not. Given the error-control mechanisms of TCP/IP, heartbeat-like behavior does not usually appear in a normal session.
Provided the session exhibits "heartbeat-like behavior", the method extracts two statistical features of the session to detect the communication behavior of a Trojan in the connection-maintenance phase.
(1) The ratio of packets received to packets sent is equal across "heartbeat processes".
Because the heartbeat behavior of most Trojans is self-similar during communication, the ratio of packets received to packets sent by the sender (or the receiver) is the same in each "heartbeat process".
For $m$ consecutive time periods of length $t$ ($t > \min\{\Delta t\}$), calculate the ratio $\beta_i$ of packets received to packets sent in each period, $i = 1, 2, \ldots, m$. If $\{\beta_1, \beta_2, \ldots, \beta_m\}$ does not contain at least $m'$ equal values, it is judged that there is no suspected Trojan heartbeat behavior in the session. Typical values are $m = 10$, $t = 900$, and $m' = 5$.
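The ratio test can be sketched as below. Packet records are assumed to be `(timestamp, outbound)` pairs; ratios are kept as exact fractions so that equality is well defined, and periods in which nothing was sent are skipped. Those choices, the function names, and the sample values are assumptions of this sketch.

```python
from collections import Counter
from fractions import Fraction


def window_ratios(packets, start, window=900, count=10):
    """Received/sent packet ratio for `count` consecutive periods of `window` seconds."""
    ratios = []
    for k in range(count):
        lo, hi = start + k * window, start + (k + 1) * window
        sent = sum(1 for ts, outbound in packets if lo <= ts < hi and outbound)
        received = sum(1 for ts, outbound in packets if lo <= ts < hi and not outbound)
        # Assumption: a period with no sent packets has no defined ratio.
        ratios.append(Fraction(received, sent) if sent else None)
    return ratios


def ratios_suggest_heartbeat(ratios, min_equal=5):
    """True when at least `min_equal` of the ratios share the same value."""
    counts = Counter(r for r in ratios if r is not None)
    return bool(counts) and max(counts.values()) >= min_equal


# Made-up sample: one sent and one received packet in each 10-second period.
packets = [(0.0, True), (0.1, False), (10.0, True), (10.1, False)]
print(window_ratios(packets, start=0, window=10, count=2) == [1, 1])  # True

sample = [Fraction(1, 2)] * 5 + [Fraction(1, 3), Fraction(2, 3), None,
                                 Fraction(3, 4), Fraction(1, 1)]
print(ratios_suggest_heartbeat(sample))  # True
```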
(2) The stationarity of the "heartbeat interval" is less than a threshold.
A Trojan's "heartbeat interval" is not necessarily constant. To evade statistical analysis, some Trojans use specially designed algorithms to generate variable heartbeat intervals, the aim being to mask a constant heartbeat process with changing intervals so that the Trojan heartbeat appears to follow no pattern.
Time-frequency analysis is used to determine whether a network traffic flow contains a Trojan heartbeat pattern. When the packet time intervals of normal network communication are transformed to the frequency domain, the corresponding mid- and high-frequency coefficients are all fairly large, which shows that these intervals behave like a non-stationary signal. This is consistent with the randomness that human operation introduces into normal network communication. Trojan heartbeat intervals are the opposite: because the heartbeat follows a pattern, they behave like a relatively stationary signal. For Trojans that use a fixed-interval heartbeat, the pattern is so pronounced that the mid- and high-frequency coefficients of the signal are almost 0. Trojans that use a variable-interval heartbeat employ various disguises but still fail to imitate random behavior, so although their signal fluctuates somewhat, it is steadier than that of normal network communication. The features extracted by time-frequency analysis are therefore effective not only against Trojans with a fixed heartbeat interval but also against Trojans with a variable heartbeat interval that introduces pseudo-random fluctuation.
Let the sampled time intervals (unit: seconds) of a one-way session data flow be
$$X = \{x_1, x_2, \ldots, x_n\}$$
where $X$ is the set of sampled packet time intervals, $x_i$ is the $i$-th sample, and $n$ is the sample size. Let
$$Y = \{y_1, y_2, \ldots, y_n\}$$
be the feature vector obtained by applying the discrete Fourier transform (DFT) to $X$, where $y_i$ is the $i$-th coefficient after the DFT.
The stationarity of the "heartbeat interval" is defined as:
[formula missing from the source text]
where Stability is the stationarity of the "heartbeat interval" and $\omega$ is a threshold (usually $\omega = 15$). When Stability is less than or equal to $\omega$, the "heartbeat interval" is stationary.
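The frequency-domain contrast described above can be illustrated with a plain DFT. This sketch does not implement the Stability statistic, whose formula is not given here; it only shows that a constant interval sequence has essentially no energy outside the zero-frequency coefficient, while an irregular sequence does. The function name and both sample sequences are made up for illustration.

```python
import cmath


def dft_magnitudes(samples):
    """Magnitudes |y_k| of the discrete Fourier transform of `samples`."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(samples)))
        for k in range(n)
    ]


fixed = [30.0] * 8                                       # fixed 30-second heartbeat
irregular = [0.2, 41.0, 3.5, 0.1, 17.0, 0.4, 60.0, 1.2]  # human-driven traffic

# Largest coefficient other than the zero-frequency (k = 0) term.
fixed_non_dc = max(dft_magnitudes(fixed)[1:])
irregular_non_dc = max(dft_magnitudes(irregular)[1:])

print(fixed_non_dc < 1e-9)     # True: only rounding error remains
print(irregular_non_dc > 1.0)  # True: clear energy at non-zero frequencies
```

The direct summation costs O(n²) operations; it is used here only to keep the example free of external dependencies.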
Because the computational complexity of the DFT is relatively high, a statistic with lower computational complexity for detecting Trojan heartbeat behavior can also be constructed, based on the essence of a two-level Haar wavelet decomposition. Again let $X$ denote the set of sampled packet time intervals of a one-way data flow, and let $W = \{w_i\}$ be the feature vector after the transform, with $w_i$ given by
[formula missing from the source text]
where $w_i$ is equivalent to the second difference of the original data. The stationarity of the "heartbeat interval" is then defined as:
[formula missing from the source text]
where Stability is the stationarity of the "heartbeat interval" and $\omega$ is a threshold (usually $\omega = 0.005$). If the stationarity of the "heartbeat interval" is greater than $\omega$, it is judged that there is no suspected Trojan heartbeat behavior in the session.
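As a rough illustration of the second-difference idea, the sketch below computes the plain second difference of an interval sequence. Any scaling factor from the Haar decomposition and the Stability statistic built on top of it are not reproduced; the function name and the sample values are assumptions.

```python
def second_difference(samples):
    """Plain second difference: x[i+2] - 2*x[i+1] + x[i] for each valid i."""
    return [samples[i + 2] - 2 * samples[i + 1] + samples[i]
            for i in range(len(samples) - 2)]


print(second_difference([30, 30, 30, 30, 30]))  # [0, 0, 0]
print(second_difference([1, 4, 9, 16]))         # [2, 2]
print(second_difference([2, 41, 3, 60, 1]))     # [-77, 95, -116]
```

A fixed heartbeat interval gives second differences of zero, while irregular intervals give large values. The computation takes one pass over the data, compared with the higher cost of a DFT.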
Beneficial effects: the present invention proposes a fast Trojan detection method based on heartbeat behavior analysis. The method targets the fact that, in the connection-maintenance phase, a Trojan uses heartbeat packets to keep the connection between its client and server alive. Combining traditional statistical analysis with time-frequency analysis, it analyzes the differences between Trojan communication behavior in this phase and normal network communication behavior, identifies the essential distinctions between the two, and extracts behavioral features. The extracted features are then used to detect Trojans.
The proposed method can detect Trojan heartbeat behavior in the connection-maintenance phase and, combined with an existing domain-name whitelist filtering mechanism, can detect Trojans in a network. This helps cut the link between the Trojan program and the attacker in time and effectively prevents Trojan information theft.
Description of drawings
Fig. 1 is a diagram of the session linked list structure;
Fig. 2 is a schematic diagram of the "heartbeat process" and "heartbeat interval" between the Trojan's controlled end and control end in the connection-maintenance phase;
Fig. 3 shows sample DFT transforms of Trojan "heartbeat intervals" and of the packet time intervals of normal network communication.
Embodiment
Embodiment one: the fast Trojan detection method based on heartbeat behavior analysis is as follows:
The captured network data is organized by network session, with the IP address and port of the monitored host taken as the source IP address and source port. Packets are grouped into sessions by equivalent four-tuple, that is, each session is uniquely identified by its equivalent four-tuples (so each session linked list contains the traffic in both directions), and a session linked list is chosen as the data structure for storing a session. A linked list is chosen because network communication is dynamic: the packets in a session keep growing as communication proceeds, so the data structure that stores the session must change dynamically as well. While the session linked lists are being built, the position corresponding to each packet must be found from the equivalent four-tuple of the list nodes, and the packet inserted there. How sessions are recorded and how fast they can be looked up therefore directly affect session reassembly efficiency.
Sessions can be stored in a multidimensional array or in multi-level linked lists. A multidimensional array offers high storage efficiency, convenient lookup, and fast access, but its storage must be allocated in advance and its size cannot be changed once it is created, which easily wastes space; moreover, the number of network sessions is not fixed, so space cannot be allocated for them in advance. A linked list can grow or shrink dynamically and needs no preallocated space, but lookup is slow.
The present invention reassembles sessions using an array-linked-list structure that combines a hash table with multi-level linked lists. An array linked list is a data structure that combines an array with linked lists; it improves lookup efficiency considerably at the cost of a small amount of extra storage. The link order of the array linked list can be set according to the characteristics of each element of the equivalent four-tuple: the element whose value range is moderate and whose sessions are most evenly distributed is made the first level, and the remaining levels are ordered in turn, so as to obtain higher session reassembly efficiency. The detailed analysis is as follows:
Let the number of sessions be $S$. If all sessions are stored in a traditional singly linked list, the list must be searched sequentially each time a packet is received, and a sequential search takes $S/2$ comparisons on average. When sessions are organized as an array linked list, let the array have $n$ indices and let $\alpha_i$ be the number of session linked lists attached to the $i$-th index, so that $\sum_{i=1}^{n} \alpha_i = S$. The probability that a received packet falls under the $i$-th index is then
$$p_i = \frac{\alpha_i}{S}$$
Therefore the average time complexity of a lookup is:
$$\bar{C} = \sum_{i=1}^{n} p_i \cdot \frac{\alpha_i}{2} = \frac{1}{2S} \sum_{i=1}^{n} \alpha_i^2$$
By the theorem "the root mean square is greater than or equal to the arithmetic mean":
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} \alpha_i^2} \ge \frac{1}{n} \sum_{i=1}^{n} \alpha_i = \frac{S}{n}$$
Squaring both sides of the inequality gives:
$$\frac{1}{n} \sum_{i=1}^{n} \alpha_i^2 \ge \frac{S^2}{n^2}, \quad \text{hence} \quad \bar{C} = \frac{1}{2S} \sum_{i=1}^{n} \alpha_i^2 \ge \frac{S}{2n}$$
Equality holds if and only if $\alpha_1 = \alpha_2 = \ldots = \alpha_n$, where $\sum_{i=1}^{n} \alpha_i = S$; that is, when $\alpha_i = S/n$, $\bar{C}$ reaches its minimum $S/(2n)$.
It follows that when the linked-list nodes are distributed evenly over the array indices, the time complexity of looking up a packet is minimal and lower than that of a singly linked list. When building the session linked lists, the order of the elements should therefore be chosen according to the value range of each element of the equivalent four-tuple and the distribution of the corresponding sessions. The value range of each element of the equivalent four-tuple and the distribution of the corresponding sessions are as follows:
(1) Source IP address: usually the IP address of an intranet host. Its value range is 10.0.0.0~10.255.255.255, 172.16.0.0~172.31.255.255, and 192.168.0.0~192.168.255.255. Compared with the IP space of the Internet, the source IP address space is small and the corresponding sessions are evenly distributed.
(2) Source port: under the RFC protocol specifications, the source port number is generally any value between 1024 and 65535. The value space of the source port is fairly large, and the corresponding sessions are unevenly distributed.
(3) Destination IP address: the value range of the destination IP address is the whole IPv4 address space. The value space is huge, and the corresponding sessions are unevenly distributed.
(4) Destination port: the destination port is generally the port assigned to the protocol, mainly in the range 1~1023. Because current network traffic is dominated by protocols such as HTTP and HTTPS, the destination port of most traffic is 80, 443, 8080, and the like, so the corresponding sessions are very unevenly distributed.
In summary, the source IP address has a small value range and an even distribution, and the corresponding sessions are also fairly evenly distributed, so it is suitable as the first level of the array linked list. Taking a Class C local area network as the monitored target, the array linked list is built as follows: because the last byte of the source IP address is the most evenly distributed, it is used as the hash value of the source IP address to build the hash table, and the source IP address is made the first level of the array linked list. Likewise, the source port, destination IP address, and destination port are made the second, third, and fourth levels respectively. The session linked list structure based on the array linked list is shown in Fig. 1.
Trojan communication behavior can be detected by detecting "heartbeat packets". Heartbeat packets show clear statistical regularity, and their features are extracted using traditional statistical analysis combined with time-frequency analysis. In line with the characteristics of Trojan communication, it is first determined whether the session exhibits "heartbeat-like behavior", and the "heartbeat interval" of that behavior is calculated.
Let the timestamps (unit: seconds) of the packets arriving in the session be
$$T = \{t_1, t_2, \ldots, t_n\}$$
and take
$$Z = \{z_i \mid z_i = t_{i+1} - t_i,\ i = 1, 2, \ldots, n-1\}$$
Suppose the minimum "heartbeat interval" of a Trojan heartbeat is $\min\{\Delta t\}$ (for example, $\min\{\Delta t\} = 5$). Then, for $z_i \in Z$, if $z_i \ge \min\{\Delta t\}$, "heartbeat-like behavior" is considered to exist in the session; otherwise it is not. Given the error-control mechanisms of TCP/IP, heartbeat-like behavior does not usually appear in a normal session.
Provided the session exhibits "heartbeat-like behavior", the method extracts two statistical features of the session to detect the communication behavior of a Trojan in the connection-maintenance phase.
(1) The ratio of packets received to packets sent is equal across "heartbeat processes".
Because the heartbeat behavior of most Trojans is self-similar during communication, the ratio of packets received to packets sent by the sender (or the receiver) is the same in each "heartbeat process".
For $m$ consecutive time periods of length $t$ ($t > \min\{\Delta t\}$), calculate the ratio $\beta_i$ of packets received to packets sent in each period, $i = 1, 2, \ldots, m$. If $\{\beta_1, \beta_2, \ldots, \beta_m\}$ does not contain at least $m'$ equal values, it is judged that there is no suspected Trojan heartbeat behavior in the session. Typical values are $m = 10$, $t = 900$, and $m' = 5$.
(2) The stationarity of the "heartbeat interval" is less than a threshold.
A Trojan's "heartbeat interval" is not necessarily constant. To evade statistical analysis, some Trojans use specially designed algorithms to generate variable heartbeat intervals, the aim being to mask a constant heartbeat process with changing intervals so that the Trojan heartbeat appears to follow no pattern.
Time-frequency analysis is used to determine whether a network traffic flow contains a Trojan heartbeat pattern. When the packet time intervals of normal network communication are transformed to the frequency domain, the corresponding mid- and high-frequency coefficients are all fairly large, which shows that these intervals behave like a non-stationary signal. This is consistent with the randomness that human operation introduces into normal network communication. Trojan heartbeat intervals are the opposite: because the heartbeat follows a pattern, they behave like a relatively stationary signal. For Trojans that use a fixed-interval heartbeat, the pattern is so pronounced that the mid- and high-frequency coefficients of the signal are almost 0. Trojans that use a variable-interval heartbeat employ various disguises but still fail to imitate random behavior, so although their signal fluctuates somewhat, it is steadier than that of normal network communication. The features extracted by time-frequency analysis are therefore effective not only against Trojans with a fixed heartbeat interval but also against Trojans with a variable heartbeat interval that introduces pseudo-random fluctuation.
Let the sampled time intervals (unit: seconds) of a one-way session data flow be
$$X = \{x_1, x_2, \ldots, x_n\}$$
where $X$ is the set of sampled packet time intervals, $x_i$ is the $i$-th sample, and $n$ is the sample size. Let
$$Y = \{y_1, y_2, \ldots, y_n\}$$
be the feature vector obtained by applying the discrete Fourier transform (DFT) to $X$, where $y_i$ is the $i$-th coefficient after the DFT.
The stationarity of the "heartbeat interval" is defined as:
[formula missing from the source text]
where Stability is the stationarity of the "heartbeat interval" and $\omega$ is a threshold (usually $\omega = 15$). When Stability is less than or equal to $\omega$, the "heartbeat interval" is stationary.
Embodiment two: the parts of this embodiment that are the same as Embodiment one are not repeated. The difference is as follows: because the computational complexity of the DFT is relatively high, a statistic with lower computational complexity for detecting Trojan heartbeat behavior can also be constructed, based on the essence of a two-level Haar wavelet decomposition. Again let $X$ denote the set of sampled packet time intervals of a one-way data flow, and let $W = \{w_i\}$ be the feature vector after the transform, with $w_i$ given by
[formula missing from the source text]
where $w_i$ is equivalent to the second difference of the original data. The stationarity of the "heartbeat interval" is then defined as:
[formula missing from the source text]
where Stability is the stationarity of the "heartbeat interval" and $\omega$ is a threshold (usually $\omega = 0.005$). If the stationarity of the "heartbeat interval" is greater than $\omega$, it is judged that there is no suspected Trojan heartbeat behavior in the session.