Specific Embodiments
The invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is a schematic diagram of the internal process of the offline anomaly detection modeling module of the present invention, which comprises: 1. pre-processing and grouping; 2. time-based segmentation; 3. descriptive statistics; 4. descriptive statistical analysis; 5. possible reconfiguration. Double circles denote the inputs and outputs of offline security anomaly detection. The original input is the alarms from security devices (for example, firewalls, intrusion detection devices, and routers). The final output is a guide for selecting a security anomaly detection algorithm. Grey boxes are parameters entered by the security analyst; different parameter values adapt the process to different application scenarios and analysis purposes. The application scenario determines the number of alarms needed for the analysis (for example, one year of alarms), the network topology (for example, nodes and subnets), and the number of nodes (the more hosts and network devices there are, the larger the alarm volume).
In 1. pre-processing and grouping, how alarms are grouped depends mainly on the network topology and on the security analyst's purpose; for example, it may only be necessary to monitor a particular subnet or a particular class of alarm. If the alarms are generated by different security devices, alarm attribute normalization and a preliminary alarm correlation analysis are required.
In 2. time-based segmentation, the alarm time series is computed and segmented by time (for example, each day is divided into daytime and night).
In 3. descriptive statistics, descriptive statistics of the distribution and of the temporal dependency of each alarm time series are extracted. The distribution is characterized by its central tendency (mean, median) and its dispersion (variance, quartiles, coefficient of variation). The stability of these distribution statistics can also be assessed. An alarm time series exhibits temporal dependency if it shows a trend, periodicity, or seasonality, or is otherwise predictable. Temporal dependency can therefore be expressed as the predictability and/or periodicity of the alarm time series.
In 4. descriptive statistical analysis, the extracted descriptive statistics are analyzed to infer the applicability and effectiveness of anomaly detection algorithms.
In 5. possible reconfiguration, a reconfiguration of the alarm time series may be suggested to the security analyst in order to build a more effective security anomaly detection algorithm. For example, if the alarm volume depends on working hours, descriptive statistics can be extracted separately for different time periods (for example, daytime and night). The thresholds for time-dependent anomaly detection can then be determined.
Further, the alarms received by the 1. pre-processing and grouping module can be of any kind, for example, raw alarms reported by security devices, hyper-alarms, or meta-alarms. Without loss of generality, the present invention mainly considers raw alarms.
Pre-processing consists of normalizing the alarm information, eliminating duplicate alarms, and so on. Alarm grouping is achieved by setting an initial grouping parameter. The grouping method depends on the security analyst's objective, for example:
(1) alarm source: the source address of the alarm;
(2) alarm type: either an ordinary alarm type or a hyper-alarm type.
(1) Alarm source: an alarm source can be internal or external. Internal alarms mainly reflect working behavior and user behavior, whereas external alarms are mainly variable and noisy. Internal alarms can be grouped at a finer granularity based on the network topology and the purpose of the analysis. For example, the security analyst can group by network and firewall policy, such as by subnet, by organizational department, or by wired versus wireless access.
(2) Alarm type: different alarm types reveal different behaviors; considering all the alarms of a group together could interfere with security anomaly detection. For example, a single alarm type that generates a very large number of alarms may mask the other alarm types.
The output of 1. pre-processing and grouping is N alarm groups, $G_1, G_2, \ldots, G_N$. For example, consider the alarms generated over 5 months by the IT network of an enterprise; the alarms can be classified according to the criteria defined above:
Alarm source: alarms from wired devices, alarms from WiFi devices, external alarms;
Alarm type: Trojan, etc.
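The pre-processing and grouping step can be sketched as follows. This is only an illustration: the alarm record fields (`time`, `source`, `type`) and the function name `group_alarms` are assumptions, not part of the invention as described.

```python
from collections import defaultdict

def group_alarms(alarms):
    """Drop exact duplicate alarms, then group by (source, type).

    Each alarm is a dict with hypothetical keys 'time', 'source', 'type'.
    """
    seen = set()
    groups = defaultdict(list)
    for alarm in alarms:
        key = (alarm["time"], alarm["source"], alarm["type"])
        if key in seen:  # duplicate alarm: skip it
            continue
        seen.add(key)
        groups[(alarm["source"], alarm["type"])].append(alarm)
    return dict(groups)

# Hypothetical sample: one duplicated wired Trojan alarm.
sample = [
    {"time": 1, "source": "wired", "type": "trojan"},
    {"time": 1, "source": "wired", "type": "trojan"},
    {"time": 2, "source": "wifi", "type": "trojan"},
    {"time": 3, "source": "wired", "type": "trojan"},
]
groups = group_alarms(sample)
```

Each key of `groups` then corresponds to one alarm group such as (wired, Trojan).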
Wired and WiFi alarms are monitored separately because most internal employees' PC clients and all servers use wired connections, whereas most internal users of laptops and smartphones (including guests) connect wirelessly. In addition, in most networks WiFi devices are restricted by policy, so that certain PCs (or notebooks) can access only Web and mail applications. For these reasons, the security alarm analysis system expects alarms from wired hosts and from wireless hosts to show different historical behaviors.
The selection of alarm types is related to the number of alarms of each type. Fig. 2 gives the percentage of alarms generated for each alarm type (types accounting for less than 1% are ignored). Fig. 2 shows that 80% of the generated alarms are of the Trojan type. This result is plausible, because the enterprise does not directly monitor most of its host devices. The process of Fig. 1 applies to every alarm group regardless of its alarm volume; however, automated analysis is most useful for groups containing a large number of alarms. Therefore, the following steps mainly consider the three most active alarm groups: wired Trojan, wireless Trojan, and external Trojan.
Further, the input of 2. time-based segmentation is the alarm groups $G_1, G_2, \ldots, G_N$. Before descriptive statistics are extracted, three operations are performed: computation of the alarm time series, marking of inactive alarm series, and time-based segmentation.
For each alarm group $G_i$, computing the alarm time series $x_i$ requires two input parameters:
(1) the time window w, which determines the amount of alarm data to be analyzed;
(2) the time granularity g, the smallest time unit over which alarms are counted (for example, a daily, hourly, or per-minute alarm time series).
These parameters are entered by the security analyst according to the scenario and the analysis objective. For example, if the objective is to find on which days anomalies occur, or to obtain situational awareness of alarms, the time granularity may be set to one day (so that $x_i$ is the daily alarm count) and the time window w to 6 months or more. On the other hand, if the objective is to assess whether daytime and night have different alarm distributions, the time granularity may be set to one hour or less and the time window w to 1 month or more. In security analysis scenarios, an excessively fine granularity g (for example, seconds) should be avoided.
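As an illustration of how the time window w and the time granularity g turn raw alarms into a time series, the following minimal sketch counts alarms per bin. The function name, the use of integer timestamps in seconds, and the `start` argument are assumptions made for the example.

```python
from collections import Counter

def alarm_time_series(timestamps, start, w, g):
    """Count alarms in consecutive bins of width g over [start, start + w)."""
    n_bins = w // g
    end = start + n_bins * g
    counts = Counter((t - start) // g for t in timestamps if start <= t < end)
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical alarm timestamps in seconds; g = 1 hour, w = 4 hours.
series = alarm_time_series(
    [10, 20, 3700, 7300, 7400, 7500], start=0, w=4 * 3600, g=3600
)
```

Here `series` holds one alarm count per hour of the window; alarms outside the window are ignored.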
Next, 2. time-based segmentation assesses whether $x_i$ is active within the time window w. The main purpose of this step is to remove inactive time series from further analysis. The criterion for checking whether an alarm time series is active is as follows: if alarms were generated in 50% or more of the time intervals, the series is active, i.e., median($x_i$) > 0. Other criteria and thresholds for filtering inactive alarm series can be used, depending on the analysis objective and on the enterprise IT system.
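The activity criterion reduces to a single comparison on the median. A minimal sketch (the function name `is_active` is an assumption):

```python
from statistics import median

def is_active(series):
    """A series is active when its median alarm count is above zero."""
    return median(series) > 0

active = is_active([0, 0, 3, 5, 1])    # alarms in most intervals
inactive = is_active([0, 0, 0, 7, 2])  # alarms in fewer than half
```

A series with bursts in only a few intervals is filtered out even if those bursts are large, because the median ignores their magnitude.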
After the alarm time series $x_i$ has been computed, if it is active it is further segmented according to an input time aggregation parameter, which is defined as a set of time intervals (for example, daytime, night); the alarm time series $x_i$ is divided into M subsequences $x_{i,j}$, j ∈ {1, 2, ..., M}. On the other hand, if the security analyst has no particular expectation about the temporal behavior of the alarms, a fine-grained time aggregation (for example, a division by hour) can be defined for all alarm groups. This is possible because 5. possible reconfiguration can automatically suggest a coarser time aggregation by analyzing the descriptive statistics extracted in 3. descriptive statistics.
The output of 2. time-based segmentation is the M subsequences $x_{i,1}, \ldots, x_{i,M}$ together with the full series $x_i$; that is, M+1 alarm series are output for each alarm group.
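The segmentation into M subsequences can be sketched as below for an hourly series. The labelling function `day_or_night` and its 08:00 to 20:00 daytime boundary are assumptions chosen for the example; the source does not fix these hours.

```python
def segment(series, label_of):
    """Split a series into subsequences keyed by label_of(time_index)."""
    parts = {}
    for t, count in enumerate(series):
        parts.setdefault(label_of(t), []).append(count)
    return parts

def day_or_night(t):
    # Assumption: index 0 is midnight and 08:00-19:59 counts as daytime.
    return "day" if 8 <= t % 24 < 20 else "night"

hourly = list(range(48))  # two days of hypothetical hourly counts
parts = segment(hourly, day_or_night)
```

With M = 2 labels this yields two subsequences; the full series `hourly` is kept alongside them, giving the M+1 series passed to the next step.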
Returning to the example above, the focus is on the 3 most active alarm groups: wired Trojan, WiFi Trojan, and external Trojan. The time window w is 5 months and the time granularity g is 1 hour. This granularity makes it possible to examine temporal behavior across different periods of the day. Fig. 3 shows the hourly alarm time series for wired Trojan, WiFi Trojan, and external Trojan. The X axis represents time (hours), and the Y axis is the number of alarms reported (0 to 800 alarms/hour). Because the medians of these three alarm series are greater than zero (median($x_i$) > 0, i = 1, 2, 3), they are all active. Fig. 3 shows that WiFi Trojan is the most active, wired Trojan the next, and the external Trojan series the least active.
Further, the input of 3. descriptive statistics is the series $x_i$ and its M subsequences. This module extracts 3 groups of descriptive statistics, relating to the random distribution, the temporal dependency, and the stability.
Random distribution: a distribution has 2 main attributes, central tendency and dispersion. For highly dynamic application scenarios, the following statistics are examined; they can be shown intuitively with box plots:
(1) the median m, which represents the central tendency of the data;
(2) the interquartile range iqr, which represents the dispersion around the central tendency.
To capture the influence of outliers on the dispersion of the data, the coefficient of variation $c_v = \sigma/\mu$ is examined, where $\mu$ and $\sigma^2$ are respectively the mean and the variance of the distribution of the alarm series. A large $c_v$ indicates that the alarm series is dispersed and/or contains outliers; a small $c_v$ indicates a convergent (tightly concentrated) distribution.
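The three distribution statistics can be computed with the Python standard library. The sketch below is illustrative only: the function name `describe`, the use of the population standard deviation, and the "inclusive" quartile method are assumptions, since the source does not specify an estimator.

```python
from statistics import mean, median, pstdev, quantiles

def describe(series):
    """Return median m, interquartile range iqr, and coefficient of variation cv."""
    q1, _, q3 = quantiles(series, n=4, method="inclusive")
    mu = mean(series)
    cv = pstdev(series) / mu if mu else float("inf")
    return {"m": median(series), "iqr": q3 - q1, "cv": cv}

stats = describe([1, 2, 3, 4, 5])
```

Because the coefficient of variation uses the mean and standard deviation, a few extreme values raise it sharply while m and iqr barely move, which is why it is examined in addition to them.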
Furthermore, for the most active alarm series (wired Trojan, WiFi Trojan, external Trojan), consider the time aggregation {working day (daytime), working day (night), holiday (daytime), holiday (night)}. Fig. 4 gives the box plots for this time aggregation, where the X axis represents the time segment (daytime, night) and the Y axis represents the number of alarms per time unit (for example, the number of alarms reported per hour). Each box plot shows the following statistical attributes: lower quartile (q1), median, upper quartile (q3), interquartile range (iqr = q3 - q1), lower whisker (q1 - 1.5·iqr), and upper whisker (q3 + 1.5·iqr). All values above the upper whisker or below the lower whisker are considered outliers.
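A box-plot outlier check can be sketched as follows, assuming the conventional whisker positions at 1.5·iqr beyond the quartiles. The function name `box_outliers` and the factor `k` are assumptions.

```python
from statistics import quantiles

def box_outliers(series, k=1.5):
    """Return the values lying beyond the box-plot whiskers."""
    q1, _, q3 = quantiles(series, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # lower and upper whisker
    return [x for x in series if x < low or x > high]

outliers = box_outliers([1, 2, 3, 4, 5, 100])
```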
Fig. 5 gives the value of the coefficient of variation for each time aggregation. This statistic is useful for capturing the variability of the data.
Fig. 4 shows that during the daytime on working days, most alarms are generated by WiFi Trojan. On the other hand, WiFi Trojan alarms decrease during the daytime on holidays, and there are essentially no alarms at night on holidays. Fig. 5 shows that during the daytime on working days the coefficient of variation of WiFi Trojan alarms is low, whereas it is high for the other groups, which indicates that those alarm series contain noise and/or outliers.
In all four time aggregations of Fig. 4 (a) and (d), wired Trojan alarms show similar central tendency (m) and dispersion (iqr), slightly higher during the daytime on working days. However, on working days, both in the daytime and at night, there are high outliers. These outliers are almost an order of magnitude higher than the central tendency, and Fig. 5 shows that the coefficient of variation is correspondingly high.
External Trojan alarms, on the other hand, are distributed almost equally between daytime and night, and are slightly lower during the daytime on working days, which may be related to attacks originating from different time zones. The dispersion of external Trojan alarms is low, and in all time aggregations the coefficient of variation is close to 1.5. This suggests that the external Trojan alarm series is independent of the detection time and can be merged into a single time aggregation (with no distinction between working day and holiday, or between daytime and night).
Temporal dependency: descriptive statistics related to temporal dependency are useful for regression-based anomaly detection. An alarm series exhibits temporal dependency if it shows a trend, periodicity, or seasonality. A trend is a general systematic component; over a sufficiently long time range, a time series may also display periodic or seasonal patterns.
To extract descriptive statistics of temporal dependency, time series analysis techniques based on filtering and autocorrelation are used. Filtering reduces the noise of a time series; such noise may hide trends and temporal patterns that are useful for modeling anomaly detection. A simple filtering technique is used here. This choice matters, because more advanced filtering techniques can change the properties of the data. For this reason, the present invention uses a simple moving average (SMA) filter with a centered window of radius r hours. For clarity, let $x$ be an alarm time series and $x_t$ the number of alarms at time t (for example, if the time granularity g equals 1 day, $x_t$ is the number of alarms on day t). SMA filtering generates a new series SMA(t) in which each value $x_t$ is replaced by the average of $x_t$ and its 2r neighbours, that is:
$$\mathrm{SMA}(t) = \frac{1}{2r+1} \sum_{j=-r}^{r} x_{t+j}$$
where $x_t$ is the number of alarms at time t and 2r+1 is the size of the moving average window. A radius of r = 1 is proposed for light smoothing and a radius of r = 5 for stronger, more gradual smoothing.
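A centered simple moving average of radius r can be sketched in a few lines. How the window is handled at the two ends of the series is not stated in the source; truncating it to the available values, as done here, is an assumption.

```python
def sma(series, r):
    """Centered moving average of radius r (window of 2r + 1 values).

    Assumption: near the edges the window is truncated to the values
    that exist, so the output has the same length as the input.
    """
    smoothed = []
    for t in range(len(series)):
        window = series[max(0, t - r): t + r + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

weak = sma([0, 3, 6, 3, 0], r=1)
```

Calling `sma(series, 1)` and `sma(series, 5)` gives the two filtered configurations, and the unfiltered series is the third.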
After filtering, the following autocorrelation function (ACF) is calculated:
$$\mathrm{ACF}(k) = \frac{E\left[(x_t - \mu)(x_{t+k} - \mu)\right]}{\sigma^2}$$
where k is the autocorrelation lag, $x_t$ is the alarm time series, E is the mathematical expectation operator, and $\mu$ and $\sigma^2$ are the mean and variance of the series. When the autocorrelation is high and decays slowly, future values are correlated with historical values; the opposite holds when the autocorrelation between two values tends to zero. If ACF(k) is at or above a threshold (0.3 in the example of Fig. 6), the time series is considered predictable, with sufficient prediction accuracy up to lag k. When this condition is met, regression-based anomaly detection algorithms can be used effectively.
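The autocorrelation and the predictable interval can be estimated as sketched below. The sample estimator is the usual one; the rule used in `predictable_lag` (the largest k for which the ACF stays at or above the threshold at every lag from 1 to k) is an assumed reading of the criterion, and both function names are hypothetical.

```python
from statistics import mean

def acf(series, k):
    """Sample autocorrelation of the series at lag k."""
    mu = mean(series)
    total = sum((x - mu) ** 2 for x in series)
    if total == 0:  # constant series: no correlation structure
        return 0.0
    cov = sum(
        (series[t] - mu) * (series[t + k] - mu)
        for t in range(len(series) - k)
    )
    return cov / total

def predictable_lag(series, max_lag, threshold=0.3):
    """Largest k such that acf stays >= threshold for all lags 1..k."""
    k = 0
    while k < max_lag and acf(series, k + 1) >= threshold:
        k += 1
    return k

periodic = [0, 1, 0, -1] * 6  # hypothetical series with period 4
ramp = list(range(50))        # hypothetical series with a pure trend
```

For the periodic series the ACF peaks again at lag 4, which is how a main period T would show up; for the ramp the ACF decays slowly, so the predictable interval is long.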
Unlike the descriptive statistics of the random distribution, the temporal dependency statistics are extracted only from the entire alarm time series $x_i$, because the autocorrelation function needs a time-contiguous alarm series in order to identify predictability, trend, and periodicity.
In particular, the present invention extracts the following descriptive statistics of temporal dependency:
(1) the value of k, the predictable interval;
(2) the main period T of the time series (if any).
There may be several periods (for example, 24 hours, 7 days) or no period at all (in which case T = 0). Note also that each statistic can be extracted whether or not the alarm series has been filtered. That is, there are 3 configurations (no SMA filtering, weak SMA filtering, strong SMA filtering), and correspondingly 3 pairs of values (k, T) are extracted.
Fig. 6 gives the ACF values for wired Trojan, WiFi Trojan, and external Trojan. The X axis represents the lag k (hours) and the Y axis the ACF value. Vertical dashed lines mark multiples of 24 hours, and the horizontal dashed line marks the threshold of 0.3 used to decide whether an alarm series is predictable. Results are given for three configurations: no filtering, r = 1, and r = 5.
Fig. 6 (a) shows that wired Trojan alarms have a weak 24-hour period, which is slightly enhanced by SMA filtering but remains below the 0.3 threshold (therefore the period T = 0). Filtering slightly improves the predictable interval k, especially for r = 5, but the alarm series remains only weakly predictable. WiFi Trojan alarms, on the other hand, display a strong 24-hour period that is evident even without filtering. This means that the same value is most likely to recur at the same hour every 24 hours. The ACF of the external Trojan alarm series shows a trend component, which is reinforced by filtering, so that the predictable interval k exceeds 24 hours for r = 5.
Stability of the descriptive statistics: to characterize the stability of the distribution statistics of each alarm time series, the median (m) and the interquartile range (iqr) are considered. In the invention, w is defined as the time window over which the alarm time series is analyzed. What is examined is how the distribution statistics evolve within the time window w. For this purpose, two parameters are considered: the size s of a sliding window (for example, 1 month) and the time shift Δ (for example, 1 week), where Δ ≤ s ≤ w. By assigning different values to these parameters, the security analyst can assess the stability of the descriptive statistics over different periods. This information is also useful for deciding how often the anomaly detection parameters should be re-evaluated. The present invention calculates the median and the interquartile range over the time interval [0, s], then over [Δ, s+Δ], then over [2Δ, s+2Δ], and so on, until the entire time window w is covered. This process yields the descriptive statistics m and iqr as functions of the time shift.
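The sliding-window evaluation can be sketched as follows, with the window size and the shift expressed in numbers of samples. The function name `rolling_stats` and the "inclusive" quartile method are assumptions.

```python
from statistics import median, quantiles

def rolling_stats(series, s, shift):
    """Median and iqr over windows [0, s), [shift, s + shift), ..."""
    results = []
    for start in range(0, len(series) - s + 1, shift):
        window = series[start:start + s]
        q1, _, q3 = quantiles(window, n=4, method="inclusive")
        results.append((median(window), q3 - q1))
    return results

# Hypothetical steadily rising series: the median drifts, the iqr does not.
evolution = rolling_stats(list(range(10)), s=4, shift=2)
```

With hourly data, s = 1 month and a shift of 1 week would correspond to roughly 720 and 168 samples.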
Fig. 7 gives these descriptive statistics for the alarm data set. The X axis represents the time shift, and the Y axis the values of m and iqr (alarms/hour). In this example, w = 5 months, s = 1 month, and the time shift is 1 week. For example, X = 0 gives m and iqr for the first month, X = 1 gives m and iqr for the one-month window shifted by 1 week, and so on. This makes it possible to assess how the descriptive statistics evolve on a weekly basis.
Fig. 7 shows that the daytime statistics of wired Trojan are unstable in the initial period and stable afterwards. WiFi Trojan, on the other hand, generates almost no alarms at night but increases steeply in the daytime. External Trojan is stable throughout the whole period.
An automatic criterion is given here for checking whether the descriptive statistics of the alarm distribution are stable. Let d be a descriptive statistic (for example, iqr) and $d_t$ the value of d at time shift t (for example, $d_5$). To assess the stability of d, a popular dispersion measure is used: the median absolute deviation (MAD). In particular, for each descriptive statistic d, a stability index $S_d$ is calculated by the following formula:
$$S_d = \frac{\mathrm{MAD}(d)}{m(d)}, \qquad \mathrm{MAD}(d) = \mathrm{median}_t\left(\left|d_t - m(d)\right|\right)$$
where the denominator m(d) = median($d_t$) is a normalization factor needed to compare descriptive statistics on different scales. A small $S_d$ (close to zero) indicates that the descriptive statistic d is stable, and vice versa. In particular, a time series is stable when its central tendency and dispersion satisfy the following relation:
$$0 \le \max\left(S_m, S_{iqr}\right) \le \tau$$
where τ is the stability threshold, which the security analyst can adjust according to the IT network environment. In the application scenario of this invention, τ = 0.2 was found heuristically to be a sufficient threshold for automatically distinguishing stable from unstable descriptive statistics. The relation above takes the maximum of the stability indices of m and iqr, because a large variation in just one descriptive statistic is enough to consider the distribution unstable. In Fig. 8, the daytime wired Trojan and wireless Trojan alarms are unstable, while the stability indices of the other four distributions are below the threshold.
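The stability check can be sketched as the median absolute deviation of a statistic across time shifts, normalized by its median, with the larger of the two indices compared against the threshold of 0.2 mentioned above. The function names are assumptions.

```python
from statistics import median

def stability_index(values):
    """Median absolute deviation of the values divided by their median."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return mad / m if m else float("inf")

def is_stable(m_values, iqr_values, tau=0.2):
    """Stable only if both the median and the iqr vary little over time."""
    worst = max(stability_index(m_values), stability_index(iqr_values))
    return worst <= tau

steady = is_stable([10, 10, 11, 9, 10], [5, 5, 5, 6, 5])
drifting = is_stable([10, 20, 30, 40, 50], [5, 5, 5, 5, 5])
```

In the second call the iqr is perfectly constant, yet the drifting median alone is enough to mark the distribution as unstable.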
Distribution-based anomaly detection methods: an alarm series can be modeled by a parametric or non-parametric distribution (for example, a Gaussian or a Gamma distribution), and anomalous events are those that fall in low-probability regions of the stochastic model or that deviate widely from the distribution. These algorithms are useful only when the series has no temporal dependency, so that regression-based methods are not applicable. An alarm series can be modeled by a distribution only if its central tendency and dispersion remain stable. If m and iqr are both stable, a distribution-based anomaly detection method can be used. Distribution-based algorithms can be parametric or non-parametric.
Parametric techniques are useful only when there is some evidence or knowledge about the distribution of the alarm series. For example, if the median is stable and the data are concentrated within the interquartile region, the alarm series can be modeled by a Gaussian distribution, although further analysis is needed, for example a chi-square test. Other common parametric distributions are the Gamma distribution and long-tailed distributions. More complex distributions can be approximated by mixtures of distributions, for example a mixture of Gaussians (MoG).
Non-parametric techniques are useful when there is no prior knowledge about the distribution of the alarm series. Common examples are histogram-based techniques and kernel-function-based techniques (for example, Parzen window estimation).
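As an illustration of a kernel-based technique, the sketch below estimates a density with a Gaussian Parzen window and could be used to flag alarm counts that fall in low-density regions. The bandwidth `h` and the function name are assumptions; the source does not specify a kernel.

```python
from math import exp, pi, sqrt

def parzen_density(x, sample, h):
    """Gaussian Parzen window density estimate at x with bandwidth h."""
    norm = len(sample) * h * sqrt(2 * pi)
    return sum(exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm

history = [9, 10, 11]  # hypothetical alarm counts per hour
typical = parzen_density(10, history, h=1.0)
unusual = parzen_density(30, history, h=1.0)
```

An observation would be treated as anomalous when its estimated density falls below a cut-off chosen by the analyst.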
In addition, if m is unstable while iqr is stable, a CUSUM-like method that uses the median as its descriptive statistic is effective for anomaly detection.
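A CUSUM-like detector accumulates how far each observation exceeds a reference level and signals once the accumulated excess passes a threshold. The one-sided form below is only a sketch: the `slack` and `threshold` parameters are hypothetical tuning values, and `target` would be the median of a reference window.

```python
def cusum_alarms(series, target, slack, threshold):
    """One-sided (upper) CUSUM; returns the indices where it signals."""
    cumulative, flagged = 0.0, []
    for t, x in enumerate(series):
        # Accumulate only the excess above target + slack, never below zero.
        cumulative = max(0.0, cumulative + x - target - slack)
        if cumulative > threshold:
            flagged.append(t)
            cumulative = 0.0  # restart after signalling
    return flagged

flags = cusum_alarms(
    [10, 11, 9, 10, 20, 20, 20, 10], target=10, slack=1, threshold=15
)
```

A single high value does not trigger the detector here; a sustained shift does, which suits a series whose central tendency moves while its dispersion stays stable.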
Note that the external Trojan (daytime, night), wireless Trojan (night), and wired Trojan (night) alarms can be modeled by a distribution, whereas the wireless Trojan (daytime) alarms keep increasing in both mean and variance, which makes such an alarm series more complex to model by a distribution. Wired Trojan is unstable only in the initial period and stable thereafter; that is, once the initial unstable period has passed, distribution-based methods are effective for wired Trojan.
In the decision flow chart shown in Fig. 9, the first step evaluates the convergence index: if the alarm series is not convergent but also shows no temporal dependency, a distribution-based method is still effective for anomaly detection.
Fig. 10 is a schematic diagram of the distribution-based information security anomaly detection of the present invention, which includes a real-time alarm module, a history alarm module, an offline anomaly detection modeling module, an online anomaly detection module, and a knowledge base.
The real-time alarm module receives, in real time, the alarms reported by the various security devices over protocols such as SNMP and syslog, and forwards them to both the history alarm module and the distribution-based online anomaly detection module.
The history alarm module can serve as a backup of the alarm time series, or provide alarm data to the offline anomaly detection modeling module.
The offline anomaly detection modeling module models the alarm time series and provides a guide for choosing among threshold-based, regression-based, and random-distribution-based anomaly detection methods. For the distribution-based method, it decides whether to select distribution-based information security anomaly detection by calculating, in real time, the median m, the interquartile range iqr, the predictable interval k, the period T, and the stability of m and iqr.
The online anomaly detection module uses the distribution-based method to detect, online and in real time, anomalies in the alarm time series reported by the real-time alarm module, and reports the detection results to the relevant display module or to the security analyst for further processing.
The knowledge base stores the various statistical parameters, the anomaly detection methods and their application scenarios, and so on.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its scope of practice; all equivalent changes and modifications made in accordance with the invention are covered by the patent scope of the invention.