CN103490992A

CN103490992A - Instant messaging worm detection method

Info

Publication number: CN103490992A
Application number: CN201310470865.XA
Authority: CN
Inventors: 郭薇; 周翰逊; 张国栋; 贾大宇
Original assignee: Shenyang Aerospace University
Current assignee: Shanghai Taiyu Information Technology Co ltd; Shenzhen Pengbo Information Technology Co ltd
Priority date: 2013-10-10
Filing date: 2013-10-10
Publication date: 2014-01-01
Anticipated expiration: 2033-10-10
Also published as: CN103490992B

Abstract

The invention relates to the field of information security technology, and in particular to an instant messaging worm detection method. The method comprises two stages. First, in a learning stage, feature functions are used to distinguish the behavior of ordinary users from the behavior of instant messaging worms. Second, in a detection stage, a simplified Mahalanobis distance is used to compute the similarity between current network traffic and the learned data. To make the detection mechanism insensitive to website access patterns, a non-parametric CUSUM algorithm is applied, and an alarm is generated when the distance of new network traffic exceeds the allowed distance set by the algorithm.

Description

Instant messaging worm detection method

Technical field

The present invention relates to the field of information security technology, and specifically to a method for detecting instant messaging worms.

Background art

Instant messaging (IM) is a very popular form of real-time communication, with a very large user base across the Internet. Many popular systems, such as MSN Messenger (Windows Messenger in Windows XP), Yahoo! Messenger (YIM), AOL Instant Messenger (AIM), and Tencent QQ, have changed the way we communicate with friends, acquaintances, and business colleagues. However, vulnerabilities in instant messaging clients pose a serious security challenge.

Instant messaging worms spread widely in instant messaging networks by exploiting vulnerabilities in IM clients and protocols, and are a security problem created by instant messaging services. When an instant messaging worm runs, it usually resides in the instant messaging client and tries to send itself to all friends of the infected user. Some worms use social engineering to send messages that lure the recipient into accepting and running a copy of the worm. Some IM worms can even exchange messages with recipients and analyze their replies. Known IM worms include Chock, SoFunny, and JS Menger.

IM worms differ from conventional scanning worms and e-mail worms. Although researchers have worked hard to understand and contain the spread of scanning worms and e-mail worms, those results do not transfer well to IM worms because the infection mechanisms differ. M. Williamson et al. applied throttling to instant messaging to slow worm propagation. However, that method may delay legitimate communication and restrict IM users too much, for example by limiting the number of new contacts allowed per day.

Summary of the invention

In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide an instant messaging worm detection method.

The present invention adopts the following technical solution:

An instant messaging worm detection method for a communication server comprises the following steps:

1) Learning stage: the behavioral features of worms on the network are analyzed from the data of worm-infected hosts, the behavioral data of normal users are analyzed by means of feature functions, and the results are stored in a database;

2) Detection stage: a detection module receives new data passing through the gateway, uses the simplified Mahalanobis distance to compare it with the feature function values stored in the database in step 1), and then judges whether the new data are infected by a worm.

Further, the simplified Mahalanobis distance is calculated as:

$$d(x, \bar{y}) = \sum_{i=0}^{m-1} \frac{\left((x_i - \bar{y}_i)^{+}\right)^{2}}{\sigma_i^{2}} \tag{6}$$

where $d(x, \bar{y})$ is the simplified Mahalanobis distance, $m$ is the number of feature functions, $x_i$ is the $i$-th feature value of the new data, $y_i$ is the $i$-th feature value of the learning-stage data, $\bar{y}_i$ is the mean of the $i$-th feature value in the learning stage, $x$ is the feature vector of the new data, $\bar{y}$ is the mean feature vector of the learning stage, and $\sigma_i^{2}$ is the variance of the $i$-th feature value. The simplified Mahalanobis distance $d(x, \bar{y})$ of the new data is calculated, and $\{X_n, n = 1, 2, 3, \ldots\}$ denotes the sequence of simplified Mahalanobis distances, where $n$ indexes the time interval. The larger the simplified Mahalanobis distance, the higher the probability of worm infection.

Further, a non-parametric CUSUM is used to make the detection algorithm insensitive to website access patterns. First, without loss of any statistical feature, $\{X_n, n = 1, 2, 3, \ldots\}$ is transformed into another random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ so that the negative values of $Z_n$ do not accumulate over time. $Z_n$ is defined as:

$$Z_n = X_n - \beta \tag{11}$$

The parameter $\beta$ is a constant for a given network condition; it helps to produce a random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ with negative values. The recurrence is:

$$y_n = (y_{n-1} + Z_n)^{+}, \qquad y_0 = 0 \tag{12}$$

where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise. $y_n$ is the test statistic and represents the accumulated positive deviation of $X_n$ above $\beta$; the larger $y_n$ is, the stronger the attack. Equivalently,

$$y_n = S_n - \min_{1 \le k \le n} S_k \tag{13}$$

where $S_k = \sum_{i=1}^{k} Z_i$, with initial value $S_0 = 0$.

The decision function is:

$$d_N(y_n) = \begin{cases} 0, & y_n \le N; \\ 1, & y_n > N. \end{cases} \tag{14}$$

where $N$ is the worm detection threshold and $d_N(y_n)$ is the decision at time $n$: if the test statistic $y_n$ is greater than $N$, $d_N(y_n)$ is 1, indicating that an attack has occurred; otherwise $d_N(y_n)$ is 0, indicating normal operation.

Further, to calculate the simplified Mahalanobis distance, incremental learning is used to update the statistics and keep them correct. Let $e_i$ be the feature value of the $i$-th sample, and maintain three variables $(E, \omega, n)$, where $E$ is the running mean, $\omega$ is the running sum of squares, and $n$ is the number of historical samples. When a new sample is observed, the triple is updated according to formulas (7), (8) and (9):

$$E = E + \frac{e_{n+1} - E}{n + 1} \tag{7}$$

$$\omega = \omega + e_{n+1}^{2} \tag{8}$$

$$n = n + 1 \tag{9}$$

The sample variance is calculated according to formula (10):

$$\sigma^{2} = \frac{\omega - n E^{2}}{n - 1} \tag{10}$$

Further, the feature functions are as follows.

Feature function URL():

$$\mathrm{URL}() = \begin{cases} \max\limits_{\forall\, \mathit{url} \in U} \{\mathrm{Count}(\mathit{url})\}, & U \ne \emptyset \\ 0, & U = \emptyset \end{cases} \tag{1}$$

where $U$ is the set of URLs sent by the user;

Feature function FileReq():

$$\mathrm{FileReq}() = \begin{cases} \max\limits_{\forall\, a \in A} \{\mathrm{Count}(a)\}, & A \ne \emptyset \\ 0, & A = \emptyset \end{cases} \tag{2}$$

where $A$ is the set of file sizes sent by the user;

Feature function IPAddr():

$$\mathrm{IPAddr}() = \text{number of distinct IP addresses} \tag{3}$$

The present invention has the following advantages and beneficial effects:

In the learning stage, the invention first uses feature functions to distinguish the behavior of ordinary users from the behavior of instant messaging worms. Network worms are then detected by means of the simplified Mahalanobis distance. To make the detection mechanism insensitive to website access patterns, a non-parametric CUSUM is used, and an alarm is generated when the distance of new data exceeds the allowed distance set by the algorithm. Data collected from a university instant messaging server demonstrate the effectiveness of the method.

A device implementing the invention is deployed at the gateway, on a machine based on a 1 GHz Pentium III. For every 10 seconds of the data set, the CPU time required to process that portion of the data was recorded. For 99% of the samples, the packets of a 10-second interval were processed in less than 2 seconds of CPU time. In addition, the maximum time required to process any 10-second sample was less than 4 seconds of CPU time. For all samples, the service rate exceeded the traffic arrival rate. This shows that the method can keep up in real time with continuous 10-second traffic bursts from a large network.

Brief description of the drawings

Fig. 1 shows a simulated IM worm propagating by sending URLs in text messages: (a) shows the changes in the feature functions, and (b) shows the changes in the test statistic after the IM worm is introduced;

Fig. 2 shows a simulated IM worm propagating by sending files: (a) shows the changes in the feature functions, and (b) shows the changes in the test statistic $y_n$ after the IM worm is introduced.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and examples:

An instant messaging worm detection method for a communication server, in which the detection device that carries out the method is installed on the gateway of the communication server and inspects the data passing through the gateway, comprises the following steps:

Step 1) Learning stage: the behavioral features of worms on the network are analyzed from the data of worm-infected hosts, the behavioral data of normal users are analyzed by means of feature functions, and the results are stored in a database;

A typical user uses an instant messaging system for work or entertainment, exchanging news of daily life with other people. This seems unremarkable, but it reveals an important characteristic: within a given period, a user is likely to communicate with only a few people. By contrast, an instant messaging worm spreads as widely as possible, usually by sending the worm code itself or the URL of a website that hosts the worm code or file. The behavior of an instant messaging worm can therefore be distinguished from normal behavior. After the worm code is loaded, the IM worm sends a message containing a malicious URL to many different users, so the rate at which that URL is sent can be expected to increase. Define the function Count(x) as the number of different users to whom one user sends the same value x. For example, if a user sends www.google.com to four different friends in the contact list, then Count(www.google.com) equals four. To capture this feature, the feature function URL() is defined as in formula (1).

$$\mathrm{URL}() = \begin{cases} \max\limits_{\forall\, \mathit{url} \in U} \{\mathrm{Count}(\mathit{url})\}, & U \ne \emptyset \\ 0, & U = \emptyset \end{cases} \tag{1}$$

Here $U$ is the set of URLs sent by the user.

Another common infection characteristic is that the files sent by victims are all identical in size and content. In fact, these files are the instant messaging worm itself. To describe this feature, the feature function for file transfer requests is defined as in formula (2).

$$\mathrm{FileReq}() = \begin{cases} \max\limits_{\forall\, a \in A} \{\mathrm{Count}(a)\}, & A \ne \emptyset \\ 0, & A = \emptyset \end{cases} \tag{2}$$

Here $A$ is the set of file sizes sent by the user.

A user communicates with a number of friends over a period of time. When using MSN, users can choose which friend or friends in the contact list to communicate with. A worm, however, tries to spread as fast as possible, so it may contact a large number of friends in the contact list, which deviates from the usage behavior of normal users. Each friend in the contact list can be represented by an IP address, and the feature function IPAddr() is defined as in formula (3) to describe this characteristic.

$$\mathrm{IPAddr}() = \text{number of distinct IP addresses} \tag{3}$$
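The three feature functions in formulas (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (pairs of a sent value and its recipient, collected for one user over one observation window) and the names `count_max`, `url_feature`, `filereq_feature` and `ipaddr_feature` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def count_max(events):
    """Maximum of Count(v) over all sent values v; 0 for an empty set.

    events: iterable of (value, recipient) pairs for one user in one window.
    Count(v) is the number of distinct recipients that were sent the value v.
    """
    recipients = defaultdict(set)
    for value, recipient in events:
        recipients[value].add(recipient)
    return max((len(r) for r in recipients.values()), default=0)


def url_feature(url_events):
    """URL() as in formula (1): events are (url, recipient) pairs."""
    return count_max(url_events)


def filereq_feature(file_events):
    """FileReq() as in formula (2): events are (file_size, recipient) pairs."""
    return count_max(file_events)


def ipaddr_feature(peer_ips):
    """IPAddr() as in formula (3): number of distinct peer IP addresses."""
    return len(set(peer_ips))


# Hypothetical window: one URL sent to four friends, another to one friend.
urls = [("www.google.com", f) for f in ("f1", "f2", "f3", "f4")]
urls.append(("example.org", "f1"))
files = [(2048, "f1"), (2048, "f2"), (512, "f3")]
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]

print(url_feature(urls), filereq_feature(files), ipaddr_feature(ips))  # 4 2 2
```

The first value reproduces the Count(www.google.com) = 4 example from the text; an empty window yields 0 for URL() and FileReq(), matching the second branch of formulas (1) and (2).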

Step 2) Detection stage: the detection module receives new data passing through the gateway, uses the simplified Mahalanobis distance to compare it with the feature function values obtained in step 1), and then judges whether the new data are infected by a worm.

The simplified Mahalanobis distance is calculated as:

$$d(x, \bar{y}) = \sum_{i=0}^{m-1} \frac{\left((x_i - \bar{y}_i)^{+}\right)^{2}}{\sigma_i^{2}} \tag{6}$$

where $d(x, \bar{y})$ is the simplified Mahalanobis distance, $m$ is the number of feature values produced by the feature functions, $x_i$ is the $i$-th feature value of the new data, $y_i$ is the $i$-th feature value of the training-stage data, $\bar{y}_i$ is the mean of the $i$-th feature value in the training stage, $x$ is the feature vector of the new data, $\bar{y}$ is the mean feature vector of the training stage, and $\sigma_i^{2}$ is the variance of the $i$-th feature value. The simplified Mahalanobis distance $d(x, \bar{y})$ of the new data is calculated; the larger it is, the higher the probability of worm infection. $\{X_n, n = 1, 2, 3, \ldots\}$ denotes the sequence of simplified Mahalanobis distances, where $n$ indexes the time interval.

The Mahalanobis distance is the most commonly used multivariate anomaly statistic. In essence, the formula describes whether a new sample is abnormal relative to the historically learned data. Here, the distance between the newly observed data and the data obtained in the learning stage is computed. The larger the distance, the more likely the sample is anomalous.

The Mahalanobis distance is defined as in formula (4):

$$D(x, \bar{y}) = (x - \bar{y})^{T} C^{-1} (x - \bar{y}) \tag{4}$$

Here $x$ and $\bar{y}$ are two feature vectors whose elements are variables: $x$ is the newly observed feature vector, and $\bar{y}$ is the mean feature vector computed in the learning stage. $C^{-1}$ is the inverse of the covariance matrix $C$, with $C_{ij} = \mathrm{Cov}(y_i, y_j)$, where $y_i$ and $y_j$ are the $i$-th and $j$-th feature values of the learning-stage feature vector.

The Mahalanobis distance provides a useful way to measure the current deviation from the baseline. Assuming the features are statistically independent, the covariance matrix $C$ becomes a diagonal matrix whose diagonal elements are the variances of the individual feature values. The simplified Mahalanobis distance is then given by formula (5):

$$d(x, \bar{y}) = \sum_{i=0}^{m-1} \frac{(x_i - \bar{y}_i)^{2}}{\sigma_i^{2}} \tag{5}$$

Here $m$ is set to 3 (because there are three feature values).

A user contacting friends through an instant messaging system does not necessarily use it all the time, for example when busy with study or other activities. A feature function value may therefore be lower than the corresponding mean without being abnormal, and such a downward deviation should not contribute to the Mahalanobis distance. Formula (6) is therefore used to calculate the simplified Mahalanobis distance:

$$d(x, \bar{y}) = \sum_{i=0}^{m-1} \frac{\left((x_i - \bar{y}_i)^{+}\right)^{2}}{\sigma_i^{2}} \tag{6}$$

where $(x_i - \bar{y}_i)^{+}$ equals $x_i - \bar{y}_i$ when $x_i - \bar{y}_i > 0$, and 0 otherwise.
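As a minimal sketch of formula (6), the one-sided distance can be computed as below. The function name `simplified_mahalanobis` and the example means and variances are illustrative assumptions, not values from the patent; the only behavior taken from the text is that deviations below the learned mean contribute nothing.

```python
def simplified_mahalanobis(x, mean, var):
    """One-sided simplified Mahalanobis distance, following formula (6).

    x:    feature vector of the new observation
    mean: per-feature means from the learning stage
    var:  per-feature variances from the learning stage (all > 0)
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        excess = max(xi - mi, 0.0)  # positive part: values below the mean are ignored
        total += excess * excess / vi
    return total


# Hypothetical learned statistics for (URL(), FileReq(), IPAddr()).
mean = (1.0, 1.0, 2.0)
var = (0.5, 0.25, 1.0)

print(simplified_mahalanobis((3.0, 0.0, 4.0), mean, var))  # 12.0
print(simplified_mahalanobis((0.0, 0.0, 0.0), mean, var))  # 0.0
```

In the first call the second feature lies below its mean and adds nothing, so the distance is 2²/0.5 + 2²/1.0 = 12.0. In the second call every feature is below its mean, so an idle user produces a distance of 0 rather than a spurious deviation.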

To calculate the simplified Mahalanobis distance, incremental learning is used to update the statistics and keep them correct. Let $e_i$ be the feature value of the $i$-th sample, and maintain three variables $(E, \omega, n)$, where $E$ is the running mean, $\omega$ is the running sum of squares, and $n$ is the number of historical samples. When a new sample is observed, the triple is updated according to formulas (7), (8) and (9):

$$E = E + \frac{e_{n+1} - E}{n + 1} \tag{7}$$

$$\omega = \omega + e_{n+1}^{2} \tag{8}$$

$$n = n + 1 \tag{9}$$

In (7), (8) and (9), the left-hand side of each equation is the value after the new sample has been included, and the right-hand side uses the values from the previous historical samples.

The sample variance is calculated according to formula (10):

$$\sigma^{2} = \frac{\omega - n E^{2}}{n - 1} \tag{10}$$
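The incremental update in formulas (7) to (10) can be sketched as a small class. The class name `RunningStats` and the sample values are assumptions made for illustration; the update rules themselves follow the formulas directly and are checked here against Python's `statistics` module.

```python
import math
import statistics


class RunningStats:
    """Keeps the triple (E, omega, n) for one feature."""

    def __init__(self):
        self.mean = 0.0    # E: running mean
        self.sum_sq = 0.0  # omega: running sum of squared samples
        self.n = 0         # number of samples seen so far

    def update(self, e):
        self.mean += (e - self.mean) / (self.n + 1)  # formula (7)
        self.sum_sq += e * e                         # formula (8)
        self.n += 1                                  # formula (9)

    def variance(self):
        if self.n < 2:
            raise ValueError("at least two samples are needed")
        return (self.sum_sq - self.n * self.mean ** 2) / (self.n - 1)  # formula (10)


samples = [1, 2, 2, 3, 1, 4]  # hypothetical feature values
stats = RunningStats()
for e in samples:
    stats.update(e)

print(math.isclose(stats.mean, statistics.mean(samples)))            # True
print(math.isclose(stats.variance(), statistics.variance(samples)))  # True
```

One caveat, stated as a general numerical observation rather than something the patent discusses: the sum-of-squares form in formula (10) can lose precision when the mean is large relative to the spread, in which case a Welford-style update is a common alternative.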

To make the detection mechanism insensitive to website access patterns, a non-parametric cumulative sum (CUSUM) method is used. First, without loss of any statistical feature, $\{X_n, n = 1, 2, 3, \ldots\}$ is transformed into another random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ so that the negative values of $Z_n$ do not accumulate over time. $Z_n$ is defined as:

$$Z_n = X_n - \beta \tag{11}$$

The parameter $\beta$ is a constant for a given network condition; it helps to produce a random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ with negative values. The recurrence is:

$$y_n = (y_{n-1} + Z_n)^{+}, \qquad y_0 = 0 \tag{12}$$

where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise. $y_n$ is the test statistic and represents the accumulated positive deviation of $X_n$ above $\beta$; the larger $y_n$ is, the stronger the attack. Equivalently,

$$y_n = S_n - \min_{1 \le k \le n} S_k \tag{13}$$

where $S_k = \sum_{i=1}^{k} Z_i$, with initial value $S_0 = 0$.

The decision function is:

$$d_N(y_n) = \begin{cases} 0, & y_n \le N; \\ 1, & y_n > N. \end{cases} \tag{14}$$

In the present invention, $\beta$ is set to 3.
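The recurrence (12) and the decision function (14) can be sketched as follows. The value β = 3 is taken from the text; the threshold N = 10, the distance sequence, and the function names are hypothetical. The second function evaluates the closed form (13) with the minimum taken over k = 0..n, that is, including S₀ = 0; this is an assumption made here so that the two forms agree exactly from the first sample onward.

```python
from itertools import accumulate


def cusum(distances, beta=3.0):
    """Test statistics y_1..y_n from recurrence (12), with Z_n = X_n - beta."""
    y = 0.0  # y_0 = 0
    stats = []
    for x in distances:
        y = max(y + (x - beta), 0.0)  # positive part keeps y_n non-negative
        stats.append(y)
    return stats


def cusum_closed_form(distances, beta=3.0):
    """Same statistics via y_n = S_n - min_k S_k, minimum over k = 0..n."""
    s = [0.0] + list(accumulate(x - beta for x in distances))  # S_0..S_n
    return [s[n] - min(s[: n + 1]) for n in range(1, len(s))]


def decide(y, threshold):
    """Decision function (14): 1 signals an attack, 0 normal operation."""
    return 1 if y > threshold else 0


# Hypothetical distance sequence: three quiet intervals, then a burst.
distances = [0.5, 1.0, 0.0, 12.0, 15.0, 2.0]
y = cusum(distances)
alarms = [decide(v, threshold=10.0) for v in y]

print(y)       # [0.0, 0.0, 0.0, 9.0, 21.0, 20.0]
print(alarms)  # [0, 0, 0, 0, 1, 1]
```

During the quiet intervals Z_n is negative and y_n stays at 0, so low distances never build up credit that would delay a later alarm. Once the burst begins, y_n grows by X_n − β per interval and crosses the threshold at the fifth sample.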

Example

The method of the invention was verified in a simulated environment. A data set of 521 users was collected from the communication server of a university (the instant messaging service is available only on campus) and divided into two parts, one for learning and one for classification and detection. 80% of the data were used as training data; the remaining 20% were mixed with IM worm attack data and used to detect the IM worm, with the worm data mixed in at random. In addition, every 5 minutes the simulated instant messaging worm sent a website link in a text message, or sent a file, to the friends online in the contact list.

For normal traffic:

Because they are busy with work or demanding research, users do not contact the friends in their contact lists all the time, especially at midnight. Therefore, the statistics were taken when the corresponding feature function values were greater than zero. The results are shown in Table 1:

Table 1

Feature function	Mean μ	Variance σ²
URL()	1.333312	0.420157
FileReq()	1.271003	0.236540
IPAddr()	2.600212	0.737141

When ordinary users use the IM service, text messages contain only a few file transfer requests and URLs. In most cases, users communicate with each other through plain text messages. The results show that the means of URL() and FileReq() are 1.333312 and 1.271003, with corresponding variances of 0.420157 and 0.236540. This means that although users do send URLs or file transfer requests in text messages, they usually send the same URL or file to only one or two different friends. The mean and variance of IPAddr() are 2.600212 and 0.737141.
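To give a feel for the scale of these statistics, the sketch below plugs the Table 1 means and variances into the per-feature terms of formula (6). The two observation windows are hypothetical and are not taken from the patent's data set; `contributions` is an illustrative helper name.

```python
# Learning-stage statistics copied from Table 1: (mean, variance) per feature.
TABLE_1 = {
    "URL": (1.333312, 0.420157),
    "FileReq": (1.271003, 0.236540),
    "IPAddr": (2.600212, 0.737141),
}


def contributions(window):
    """Per-feature terms of formula (6) for one observation window."""
    terms = {}
    for name, (mean, var) in TABLE_1.items():
        excess = max(window[name] - mean, 0.0)  # only upward deviations count
        terms[name] = excess ** 2 / var
    return terms


# Hypothetical windows: an ordinary user and a worm-like burst of URL messages.
quiet_window = {"URL": 1, "FileReq": 1, "IPAddr": 3}
worm_like_window = {"URL": 10, "FileReq": 0, "IPAddr": 10}

for label, window in (("quiet", quiet_window), ("worm-like", worm_like_window)):
    terms = contributions(window)
    total = sum(terms.values())
    print(label, {k: round(v, 3) for k, v in terms.items()}, "total:", round(total, 3))
```

Because the learned variances are small, a window in which the same URL reaches ten contacts yields a distance far above β = 3, while a window close to the learned means stays well below it.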

Worm detection after instant messaging worm traffic is added:

As shown in Fig. 1, the simulated IM worm propagates by sending URLs in text messages. Panel (a) shows the changes in the feature functions. Without instant messaging worm traffic, the value of URL() is no greater than 1 and the value of IPAddr() ranges from 0 to 3. After the IM worm is introduced, however, panel (b) shows that the values of URL() and IPAddr() change abruptly and approach a peak of 10. The value of FileReq() does not change. The IM worm can therefore be detected within one unit of time after the outbreak.

Fig. 2 shows the simulated IM worm propagating by sending files. Panel (a) shows that, before IM worm traffic is added, the value of FileReq() is no greater than 1 and the value of IPAddr() ranges from 0 to 3. After the IM worm is introduced, the values of FileReq() and IPAddr() differ from their normal values: they rise above 7 and reach a peak of 15. The value of URL() remains 0 throughout. Panel (b) therefore shows that, after the IM worm is introduced, the method detects it within one unit of time after the outbreak.

The same test was repeated 100 times. The results were similar, and no false negatives occurred.

The device implementing the invention was deployed at the gateway, on a machine based on a 1 GHz Pentium III. For every 10 seconds of the data set, the CPU time required to process that portion of the data was recorded. For 99% of the samples, the packets of a 10-second interval were processed in less than 2 seconds of CPU time. In addition, the maximum time required to process any 10-second sample was less than 4 seconds of CPU time. For all samples, the service rate exceeded the traffic arrival rate. This shows that the method can keep up in real time with continuous 10-second traffic bursts from a large network.

Claims

1. An instant messaging worm detection method for a communication server, characterized in that it comprises the following steps:

2) a detection module is configured in the gateway; in the detection stage, the detection module receives new data passing through the gateway, uses the simplified Mahalanobis distance to compare it with the feature function values learned into the database in step 1), and then judges whether the new data are infected by a worm.

2. The instant messaging worm detection method according to claim 1, characterized in that

the simplified Mahalanobis distance is calculated as:

$$d(x, \bar{y}) = \sum_{i=0}^{m-1} \frac{\left((x_i - \bar{y}_i)^{+}\right)^{2}}{\sigma_i^{2}} \tag{6}$$

where $\bar{y}_i$ is the mean of the $i$-th feature value in the learning stage, $x$ is the feature vector of the new data, $\bar{y}$ is the mean feature vector of the learning stage, and $\sigma_i^{2}$ is the variance of the $i$-th feature value; the simplified Mahalanobis distance $d(x, \bar{y})$ of the new data is calculated.

3. The instant messaging worm detection method according to claim 2, characterized in that a non-parametric CUSUM is used to make the detection algorithm insensitive to website access patterns: first, without loss of any statistical feature, $\{X_n, n = 1, 2, 3, \ldots\}$ is transformed into another random sequence $\{Z_n, n = 1, 2, 3, \ldots\}$ so that the negative values of $Z_n$ do not accumulate over time, $Z_n$ being defined as:

$$Z_n = X_n - \beta \tag{11}$$

$$y_n = (y_{n-1} + Z_n)^{+}, \qquad y_0 = 0 \tag{12}$$

where $(y_{n-1} + Z_n)^{+}$ equals $y_{n-1} + Z_n$ when $y_{n-1} + Z_n > 0$, and 0 otherwise; $y_n$ is the test statistic and represents the accumulated positive deviation of $X_n$ above $\beta$, and the larger $y_n$ is, the stronger the attack;

$$y_n = S_n - \min_{1 \le k \le n} S_k \tag{13}$$

where $S_k = \sum_{i=1}^{k} Z_i$, with initial value $S_0 = 0$;

the decision function is:

$$d_N(y_n) = \begin{cases} 0, & y_n \le N; \\ 1, & y_n > N. \end{cases} \tag{14}$$

4. The instant messaging worm detection method according to claim 2, characterized in that, to calculate the simplified Mahalanobis distance, incremental learning is used to update the statistics and keep them correct: let $e_i$ be the feature value of the $i$-th sample, and maintain three variables $(E, \omega, n)$, where $n$ is the number of historical samples; when a new sample is observed, the triple is updated according to formulas (7), (8) and (9):

$$E = E + \frac{e_{n+1} - E}{n + 1} \tag{7}$$

$$\omega = \omega + e_{n+1}^{2} \tag{8}$$

$$n = n + 1 \tag{9}$$

the sample variance is calculated according to formula (10):

$$\sigma^{2} = \frac{\omega - n E^{2}}{n - 1} \tag{10}$$

5. The instant messaging worm detection method according to claim 1, characterized in that the feature functions are as follows: feature function URL():

$$\mathrm{URL}() = \begin{cases} \max\limits_{\forall\, \mathit{url} \in U} \{\mathrm{Count}(\mathit{url})\}, & U \ne \emptyset \\ 0, & U = \emptyset \end{cases} \tag{1}$$

where $U$ is the set of URLs sent by the user;

feature function FileReq():

$$\mathrm{FileReq}() = \begin{cases} \max\limits_{\forall\, a \in A} \{\mathrm{Count}(a)\}, & A \ne \emptyset \\ 0, & A = \emptyset \end{cases} \tag{2}$$

where $A$ is the set of file sizes sent by the user;

feature function IPAddr():

$$\mathrm{IPAddr}() = \text{number of distinct IP addresses} \tag{3}$$