Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide an instant messaging worm detection method.
The present invention adopts the following technical scheme:
An instant messaging worm detection method for a communication server comprises the following steps:
1) Learning phase: the behavioral characteristics of worms on the network are analyzed from data of worm infections on the network, the behavioral data of normal users are analyzed by means of characteristic functions, and the results are stored in a database;
2) Detection phase: the detection module receives new data passing through the gateway, compares it using the simple Mahalanobis distance against the characteristic-function values stored in the database in step 1), and thereby judges whether the new data is infected by a worm.
Further, the simple Mahalanobis distance is computed by a formula in which $d(x, y)$ is the simple Mahalanobis distance, $m$ is the number of characteristic functions, $x_i$ is the $i$-th feature value of the new data, $y_i$ is the $i$-th feature value of the learning-phase data, $\bar{y}_i$ is the $i$-th mean feature value of the learning phase, $x$ is the feature vector of the new data, $y$ is the mean feature vector of the learning phase, and $\sigma_i^2$ is the variance of the $i$-th feature value. The simple Mahalanobis distance of the new data is calculated, and $\{X_n, n = 1, 2, 3, \ldots\}$ denotes the sequence of simple Mahalanobis distances, where $n$ indexes the time interval. The larger the simple Mahalanobis distance, the greater the probability of worm infection.
Further, a non-parametric CUSUM is adopted to make the detection algorithm insensitive to site access patterns. First, without loss of any statistical property, the sequence $\{X_n, n = 1, 2, 3, \ldots\}$ is transformed into another random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ such that the negative values of $Z_n$ do not accumulate over time. $Z_n$ is defined as follows:
$$Z_n = X_n - \beta \qquad (11)$$
The parameter $\beta$ is a constant for a specific network condition; it helps produce a random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ with negative values. The recursion is as follows:
$$y_n = (y_{n-1} + Z_n)^{+}, \qquad y_0 = 0 \qquad (12)$$
where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise. The larger $y_n$ is, the stronger the attack; $y_n$ is the test statistic and represents the accumulated positive value of $X_n$;
The decision function is expressed as:
$$d_n(y_n) = \begin{cases} 1, & y_n > N \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$
where $N$ is the worm detection threshold and $d_n(y_n)$ is the decision at time $n$: if the test statistic $y_n$ is greater than $N$, $d_n(y_n)$ is 1, meaning that an attack is occurring; otherwise $d_n(y_n)$ is 0, meaning normal operation.
Further, in order to calculate the simple Mahalanobis distance, incremental learning is used to update the statistics and keep them correct. Let $E_i$ be a feature value of the $i$-th sample, and maintain three variables $(E, \omega, n)$, where $n$ is the length of the sample history. When a new sample is observed, the triple is updated as in formulas (7), (8) and (9):
$$n = n + 1 \qquad (9)$$
The sample variance is calculated as in formula (10):
Further, the characteristic functions are as follows.
Characteristic function URL():
Here U is the set of URLs sent by the user;
Characteristic function FileReq():
Here A is the set of file sizes sent by the user;
Characteristic function IPAddr():
IPAddr() = number of distinct IP addresses (3).
The present invention has the following advantages and beneficial effects:
First, in the learning phase, the present invention uses characteristic functions to distinguish the behavior of ordinary users from that of instant messaging worms. Network worms are then detected by means of the simple Mahalanobis distance. To make the detection mechanism insensitive to site access patterns, a non-parametric CUSUM is adopted, and an alarm is generated when the distance of new data exceeds the allowed distance set by the algorithm. Data collected from a university instant messaging server demonstrate the validity of the inventive method.
A device implementing the present invention was installed at the gateway on a 1 GHz Pentium III machine. For every 10-second interval in the data set, the CPU time required for data processing was recorded. In 99% of the samples, 10 seconds of packets could be processed in less than 2 seconds of CPU time. In addition, the maximum time required to process any ten-second sample was less than four seconds of CPU time. For all samples, the service rate exceeded the arrival rate of the traffic. This shows that the inventive method can keep up in real time with 10-second bursts of traffic from a large network.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments:
An instant messaging worm detection method for a communication server, in which the detection device that carries out the method is installed at the gateway of the communication server and inspects the data passing through the gateway, comprises the following steps:
Step 1) Learning phase: the behavioral characteristics of worms on the network are analyzed from data of worm infections on the network, and the results are stored in a database;
A typical user uses an instant messaging system for work or entertainment, exchanging news of daily life with other people. This seems unremarkable, but it reveals an important characteristic: over a given period, a user is likely to communicate with only a few people. An instant messaging worm, by contrast, spreads as widely as possible, usually by transmitting the worm code itself or the URL of a website hosting the worm file. The behavior of an instant messaging worm can therefore be distinguished from normal behavior. After the worm code is loaded, the IM worm sends messages containing a malicious URL to many different users, so the rate at which this URL is sent can be expected to increase. The function Count(x) is defined as the number of different users to whom a user sends the same value x. For example, if a user sends www.google.com to four different friends in the contact list, then Count(www.google.com) equals four. To capture this feature, the characteristic function URL() is defined as in formula (1).
Here U is the set of URLs sent by the user.
Another, more common infection characteristic is that the files sent by a victim are all identical in size and content. In fact, these files are the instant messaging worm itself. To describe this feature, the file-transfer-request characteristic function FileReq() is defined as in formula (2).
Here A is the set of file sizes sent by the user.
Over a given period, a user communicates with several friends. When using MSN, users can choose which friend or friends in the contact list to talk to. A worm, however, attempts to propagate as fast as possible, so it is likely to contact a large number of friends in the contact list, which departs from normal usage behavior. In the contact list, one IP address can represent one friend; the characteristic function IPAddr() is defined as in formula (3) to describe this characteristic.
IPAddr() = number of distinct IP addresses (3)
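The three characteristic functions can be illustrated with a minimal Python sketch. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes that URL() and FileReq() take the largest Count value over the set U of URLs and the set A of file sizes, respectively; this is an assumption, not the stated definition. The message record layout (`to_ip`, `url`, `file_size`) and the function names are hypothetical.

```python
from collections import defaultdict

def count(messages, key):
    """Count(x): for each value x of `key`, the number of distinct recipients."""
    recipients = defaultdict(set)
    for msg in messages:
        value = msg.get(key)
        if value is not None:
            recipients[value].add(msg["to_ip"])
    return {value: len(ips) for value, ips in recipients.items()}

def url_feature(messages):
    # Assumption: URL() is the largest Count(u) over the set U of URLs sent.
    return max(count(messages, "url").values(), default=0)

def filereq_feature(messages):
    # Assumption: FileReq() is the largest Count(a) over the set A of file sizes sent.
    return max(count(messages, "file_size").values(), default=0)

def ipaddr_feature(messages):
    # Formula (3): number of distinct IP addresses contacted.
    return len({msg["to_ip"] for msg in messages})

# One observation window for a single user (hypothetical data).
window = [
    {"to_ip": "10.0.0.1", "url": "www.google.com"},
    {"to_ip": "10.0.0.2", "url": "www.google.com"},
    {"to_ip": "10.0.0.3", "url": "www.google.com"},
    {"to_ip": "10.0.0.4", "url": "www.google.com"},
    {"to_ip": "10.0.0.1", "file_size": 2048},
]
features = (url_feature(window), filereq_feature(window), ipaddr_feature(window))
print(features)
```

In this window the same URL reaches four distinct addresses, matching the Count(www.google.com) example in the text, while the single file transfer reaches only one.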
Step 2) The detection module receives new data passing through the gateway, compares it using the simple Mahalanobis distance against the characteristic-function values obtained in step 1), and thereby judges whether the new data is infected by a worm.
The simple Mahalanobis distance is computed by a formula in which $d(x, y)$ is the simple Mahalanobis distance, $m$ is the number of feature values of the characteristic functions, $x_i$ is the $i$-th feature value of the new data, $y_i$ is the $i$-th feature value of the training-phase data, $\bar{y}_i$ is the $i$-th mean feature value of the training phase, $x$ is the feature vector of the new data, $y$ is the mean feature vector of the training phase, and $\sigma_i^2$ is the variance of the $i$-th feature value. The simple Mahalanobis distance of the new data is calculated; the larger it is, the greater the probability of worm infection. The sequence of simple Mahalanobis distances is denoted $\{X_n, n = 1, 2, 3, \ldots\}$, where $n$ indexes the time interval.
The Mahalanobis distance is the most commonly used multivariate anomaly statistic. Essentially, the formula describes whether a new sample is abnormal with respect to the historically learned data. Here, the distance between newly observed data and the data obtained in the learning phase is calculated. The larger the distance, the more likely the sample is anomalous.
The Mahalanobis distance is defined as in formula (4):
$$d^2(x, y) = (x - y)^{T} C^{-1} (x - y) \qquad (4)$$
Here $x$ and $y$ are two feature vectors, and each element of a vector is a variable. $x$ is the newly observed feature vector, and $y$ is the mean feature vector computed in the learning phase. $C^{-1}$ is the inverse of the covariance matrix $C$, with $C_{ij} = \mathrm{Cov}(y_i, y_j)$, where $y_i$ and $y_j$ are the $i$-th and $j$-th feature values of the learning-phase feature vector.
The Mahalanobis distance provides a useful way to measure the current deviation from the baseline. If the features are assumed to be statistically independent, the covariance matrix $C$ becomes a diagonal matrix whose diagonal elements are the variances of the individual feature values. The simple Mahalanobis distance is therefore given by formula (5):
Here m is set to 3 (because three feature values are selected).
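The reduction to the diagonal case can be sketched as follows. This is offered as an illustration of the independence assumption rather than as a reproduction of formula (5), whose exact form is not shown in this text and may differ (for example, by using absolute deviations scaled by the standard deviation). It takes the standard Mahalanobis quadratic form and substitutes $C = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$, where $y$ is the learning-phase mean feature vector:

```latex
\begin{aligned}
(x - y)^{T} C^{-1} (x - y)
  &= \sum_{i=1}^{m} \sum_{j=1}^{m} (x_i - y_i)\,[C^{-1}]_{ij}\,(x_j - y_j) \\
  &= \sum_{i=1}^{m} \frac{(x_i - y_i)^2}{\sigma_i^2},
  \qquad
  [C^{-1}]_{ij} =
  \begin{cases}
    1/\sigma_i^2, & i = j \\
    0, & i \neq j
  \end{cases}
\end{aligned}
```

With $m = 3$, only three per-feature means and variances need to be stored, and no matrix inversion is required at detection time.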
When communicating with friends through an instant messaging system, a user who is busy with study or work does not necessarily use it all the time. A characteristic function value may therefore be lower than the corresponding mean, but this does not mean that it is abnormal. Such a deviation should therefore not be counted in the Mahalanobis distance. Formula (6) is accordingly used to calculate the simple Mahalanobis distance.
where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise.
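The idea that values below the learned mean should not contribute to the distance can be sketched in Python. Formula (6) is not reproduced in this text, so the sketch assumes a one-sided form in which only the positive part of each deviation is squared and divided by the variance; the function name and the small `eps` guard against zero variance are assumptions added for illustration. The means and variances are those reported in Table 1.

```python
def simple_mahalanobis(x, mean, var, eps=1e-9):
    """One-sided simple Mahalanobis distance (assumed form, see lead-in).

    Only feature values above the learned mean contribute; values at or
    below the mean are treated as normal and add nothing.
    """
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        excess = max(xi - mi, 0.0)          # positive part of the deviation
        total += excess * excess / (vi + eps)
    return total

# Learned statistics for URL(), FileReq(), IPAddr() from Table 1.
MEAN = [1.333312, 1.271003, 2.600212]
VAR = [0.420157, 0.236540, 0.737141]

quiet = simple_mahalanobis([0, 0, 1], MEAN, VAR)    # user mostly idle
burst = simple_mahalanobis([10, 0, 10], MEAN, VAR)  # URL sent to many contacts
print(quiet, burst)
```

An idle user scores zero even though every feature is below its mean, while a burst of identical URLs to many addresses produces a large distance.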
In order to calculate the simple Mahalanobis distance, incremental learning is used to update the statistics and keep them correct. Let $E_i$ be a feature value of the $i$-th sample, and maintain three variables $(E, \omega, n)$, where $n$ is the length of the sample history. When a new sample is observed, the triple is updated as in formulas (7), (8) and (9):
$$n = n + 1 \qquad (9)$$
In formulas (7), (8) and (9), the left-hand side of the equals sign is the value after the new sample has been included, and the right-hand side uses the values for the previous sample history length.
The sample variance is calculated as in formula (10):
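One common way to maintain such a triple incrementally is sketched below in Python. Formulas (7), (8) and (10) are not reproduced in this text, so the sketch assumes that E is the running mean of the feature value and ω is the running mean of its square, with the variance obtained as ω minus E squared; this interpretation, the class name, and the method names are assumptions.

```python
class RunningStats:
    """Incrementally maintained (E, omega, n) for one feature (assumed meaning)."""

    def __init__(self):
        self.E = 0.0      # running mean of the feature value
        self.omega = 0.0  # running mean of the squared feature value
        self.n = 0        # length of the sample history

    def update(self, e):
        # The right-hand sides use the values for the previous history length.
        self.E = (self.E * self.n + e) / (self.n + 1)
        self.omega = (self.omega * self.n + e * e) / (self.n + 1)
        self.n = self.n + 1

    def variance(self):
        # Population variance; clamped at 0 to absorb rounding error.
        if self.n == 0:
            return 0.0
        return max(self.omega - self.E * self.E, 0.0)

stats = RunningStats()
for value in [1, 2, 3, 4]:
    stats.update(value)
print(stats.E, stats.variance(), stats.n)
```

Because only three numbers per feature are stored, the statistics can be refreshed on each new observation without keeping the sample history.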
To make the detection mechanism insensitive to site access patterns, a non-parametric cumulative sum (CUSUM) method is adopted. First, without loss of any statistical property, the sequence $\{X_n, n = 1, 2, 3, \ldots\}$ is transformed into another random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ such that the negative values of $Z_n$ do not accumulate over time. $Z_n$ is defined as follows:
$$Z_n = X_n - \beta \qquad (11)$$
The parameter $\beta$ is a constant for a specific network condition; it helps produce a random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ with negative values. The recursion is as follows:
$$y_n = (y_{n-1} + Z_n)^{+}, \qquad y_0 = 0 \qquad (12)$$
where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise. The larger $y_n$ is, the stronger the attack; $y_n$ is the test statistic and represents the accumulated positive value of $X_n$;
The decision function is expressed as:
$$d_n(y_n) = \begin{cases} 1, & y_n > N \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$
where $N$ is the worm detection threshold and $d_n(y_n)$ is the decision at time $n$: if the test statistic $y_n$ is greater than $N$, $d_n(y_n)$ is 1, meaning that an attack is occurring; otherwise $d_n(y_n)$ is 0, meaning normal operation.
In the present invention, β is taken as 3.
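The CUSUM recursion and threshold decision described above can be sketched in Python as follows. The value β = 3 comes from the text; the threshold value N = 5 and the sample distance sequence are hypothetical and chosen only for illustration.

```python
def cusum_detect(distances, beta=3.0, threshold=5.0):
    """Non-parametric CUSUM over a sequence of distances X_n.

    Returns (decisions, statistics), where statistics[n] is y_n and
    decisions[n] is 1 if y_n exceeds the threshold N, else 0.
    """
    y = 0.0                      # y_0 = 0
    decisions, statistics = [], []
    for x in distances:
        z = x - beta             # Z_n = X_n - beta
        y = max(y + z, 0.0)      # y_n = (y_{n-1} + Z_n)^+
        statistics.append(y)
        decisions.append(1 if y > threshold else 0)
    return decisions, statistics

# Hypothetical distance sequence: quiet traffic, then a burst.
decisions, statistics = cusum_detect([1, 2, 1, 10, 12, 1])
print(statistics)
print(decisions)
```

During quiet intervals the statistic stays at zero because each Z_n is negative and is clipped, so small fluctuations never build up; once the distance jumps, the statistic crosses the threshold in the same interval.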
Embodiment
The inventive method was verified in a simulated environment. A data set covering 521 users was collected from the communication server of a university (the instant messaging service is available only on campus), and the data were divided into two parts, one for learning and one for classification and detection. 80% of the data were used as training data; the remaining 20% were mixed with IM worm attack data and used to detect the IM worm, with the IM worm data mixed in at random. In addition, every 5 minutes the simulated instant messaging worm sent a website link in a text message, or a file, to the online friends in the contact list.
For normal traffic:
Owing to being busy with work or demanding research, users do not contact the friends in their contact list all the time, particularly at midnight. Therefore, the statistics are computed over the periods in which the corresponding characteristic function value is greater than zero. The results are shown in Table 1:
Table 1

| Characteristic | μ | σ² |
|---|---|---|
| URL() | 1.333312 | 0.420157 |
| FileReq() | 1.271003 | 0.236540 |
| IPAddr() | 2.600212 | 0.737141 |
When ordinary users use the IM service, only a few file transfer requests and URLs appear in their text messages. In most cases, users communicate with each other by plain text messages. The results also show that the means of URL() and FileReq() are 1.333312 and 1.271003, with corresponding variances of 0.420157 and 0.236540. This means that, although users do send URLs or file transfer requests in text messages, they usually send the same URL or file to only one or two different friends. The mean and variance of IPAddr() are 2.600212 and 0.737141.
Worm detection after adding instant messaging worm traffic:
As shown in Fig. 1, the simulated IM worm propagates by sending a URL in text messages. Panel (a) shows how the characteristic functions vary without instant messaging worm traffic: the value of URL() is no greater than 1, and the value of IPAddr() ranges from 0 to 3. However, as panel (b) shows, after the IM worm is introduced the values of URL() and IPAddr() change abruptly and approach a peak of 10. The value of FileReq() does not change. The IM worm can therefore be detected within one time unit after the outbreak.
Fig. 2 shows the simulated IM worm propagating by sending files. Panel (a) shows that, without IM worm traffic, the value of FileReq() is no greater than 1 and the value of IPAddr() ranges from 0 to 3. After the IM worm is introduced, however, the FileReq() and IPAddr() values differ from their normal values: they rise above 7 and reach a peak of 15. The value of URL() remains 0 throughout. Panel (b) therefore shows that, after the IM worm is introduced, the method detects it within one time unit after the outbreak.
The same test was repeated 100 times. The results were similar, and no negative values appeared.
A device implementing the present invention was installed at the gateway on a 1 GHz Pentium III machine. For every 10-second interval in the data set, the CPU time required for data processing was recorded. In 99% of the samples, 10 seconds of packets could be processed in less than 2 seconds of CPU time. In addition, the maximum time required to process any ten-second sample was less than four seconds of CPU time. For all samples, the service rate exceeded the arrival rate of the traffic. This shows that the inventive method can keep up in real time with 10-second bursts of traffic from a large network.