CN107294993A

CN107294993A - A WEB abnormal traffic monitoring method based on ensemble learning

Info

Publication number: CN107294993A
Application number: CN201710543858.6A
Authority: CN
Inventors: 李智星; 沈柯; 于洪; 张冠群; 代南瑶; 胡聪; 胡峰; 王进; 雷大江; 欧阳卫华
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2017-10-24
Anticipated expiration: 2037-07-05
Also published as: CN107294993B

Abstract

The present invention claims a WEB abnormal traffic monitoring method based on ensemble learning, comprising five processes: data preprocessing, feature engineering construction, data set reconstruction, model building and fusion, and model testing. Data preprocessing extracts the effective information from URL data. Feature engineering extracts and constructs URL features using statistical methods such as information entropy and mutual information. Once feature engineering is complete, the data set is adjusted for each access property and fed into four machine learning algorithms, including XGBoost and LightGBM, for supervised learning. After the learners are constructed, they are integrated with the Bagging framework. Data sets are then re-drawn from the original data set for classification prediction, the label is decided by majority vote, and the model accuracy is checked. When the model is in use, a URL is input into the model, each of its five sub-models outputs its own label probability, and the label with the highest probability is given as the final label.

Description

A WEB abnormal traffic monitoring method based on ensemble learning

Technical field

The invention belongs to the field of machine learning and specifically involves a variety of statistical algorithms and machine learning algorithms. The method adopts a new feature extraction scheme and combines statistics with machine learning algorithms in a novel way to monitor abnormal WEB traffic.

Background art

1. Network security problems in the information age

In today's era of information explosion, both the scale of computer networks and the number of Internet users have reached unprecedented levels, and network security problems have become correspondingly more prominent. As the main means of resisting network attacks, abnormal traffic monitoring urgently needs further research, development and upgrading. After more than 20 years of development, traffic monitoring research has evolved into several branches, but in practical applications the results are not fully satisfactory. The difficulties are concentrated mainly in the following aspects:

1) monitoring violation behavior patterns in real time with fixed rules leads to an excessively high false alarm rate;

2) with feature matching, the feature library has to be updated manually, and unknown attack patterns cannot be detected;

3) the huge number of rules severely degrades detection performance, and the rule base becomes difficult to maintain;

4) an abnormal traffic detection system with a blocking function will block normal communication when it misclassifies that communication;

5) when the data storage capacity of the monitoring system is a bottleneck, the system is vulnerable to denial-of-service attacks, and communication will be blocked.

Because abnormal traffic detection systems have the above problems, current research on such systems is concentrated mainly in three directions: feature matching, rule-based reasoning and machine learning.

2. Machine learning

In recent years, machine learning methods have been applied more and more to the design of abnormal traffic detection algorithms. They need little manual intervention, which solves the manual maintenance problem of updating the feature library and rule base in feature matching and greatly increases the degree of automation. Their strong adaptability to different input data breaks the deadlock of the high false alarm rate of rule-based reasoning, and they can achieve higher accuracy in the face of unknown attacks.

However, a single machine learning method cannot solve the problem perfectly. Statistical methods assume that all events are produced by a statistical model; this ignores the risk that the distribution model fixed in advance in a parametric method may not match the real data, so the results can deviate greatly from what is expected. In addition, systems built on statistical models mostly work offline and cannot meet the requirement of real-time monitoring, so achieving high accuracy requires very efficient performance. It is also extremely difficult to determine thresholds for statistical methods: a threshold that is too high or too low raises the false negative rate.

Machine learning algorithms, by seamlessly combining prior and posterior knowledge, can overcome the shortcoming that the framework is not intuitive enough. However, simple classification and clustering algorithms can overfit because of noisy data, sampling errors, too many modeling variables and similar problems, and so cannot achieve a good monitoring effect. Moreover, the accuracy of a model relies on certain assumptions that are embodied in the target system; when the behavior pattern of the network runs counter to those assumptions, accuracy drops significantly.

Summary of the invention

The present invention aims to solve the above problems of the prior art. It proposes a WEB abnormal traffic monitoring method based on ensemble learning that effectively improves the accuracy of existing machine learning methods in monitoring abnormal traffic. The technical solution is as follows:

A WEB abnormal traffic monitoring method based on ensemble learning, comprising the following steps:

1) data preprocessing: obtain uniform resource locator (URL) records, cut and separate the URL records, and extract the effective information;

2) feature engineering construction: use statistical methods to extract features separately from the URLs of common command attacks, database attacks, cross-site scripting attacks, local file inclusion attacks and normal network access;

3) data set reconstruction: for each of the five access properties, arrange the whole data set according to that property's features, and adjust the labels to that access property and "other";

4) model building: for the data sets corresponding to the five access properties, carry out supervised learning on the data with four machine learning algorithms, XGBoost (extreme gradient boosting), LightGBM (light gradient boosting machine), RF (random forest) and LR (logistic regression), and integrate the learners with the bagging framework to obtain a separate identification model for each of the five access properties;

5) model testing: test on the part of the data set reserved in advance in step 4) and check the model accuracy.

Further, the extraction of the effective URL information in step 1) comprises, for an unprocessed URL: first removing the invalid data after "#"; cutting the remaining fragment at "?"; separating out the file path fragment and dividing it by "/" and "="; dividing the query part by "&" and "="; and passing the parameters and values obtained by the division into a processing function for regular-expression matching.

Further, the processing function replaces numbers and date and time values, replaces garbled symbols with "$0", changes strings of lowercase letters with length less than 10 to "s", changes strings that start with "Ox" and have length greater than 2 to "Ox1234", and collapses multiple spaces into one space. The fragments remaining after processing are the URL information fragments that the model needs.

Further, the feature engineering construction of step 2) specifically comprises: length of the URL parameter value, where the anomaly value P of the length is calculated using the Chebyshev inequality from statistics together with the mean and variance of the length; character distribution, where the anomaly value α of the character distribution is calculated using the chi-square test from statistics; enumeration type, where it is calculated whether an input attribute value falls outside the enumerated type; and keyword extraction, which finds the URL features common to the same access property: after all URL data are scanned, the frequency of every string of physically adjacent characters is recorded, strings with too low a frequency are screened out, and mutual information is calculated for the remaining strings.

Further, for the length anomaly value of the URL parameter value, the anomaly value P of the length can be calculated using the Chebyshev inequality from statistics together with the mean and variance of the length. The calculation formula includes:

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

where X is the length of the URL parameter value, μ is the mean length, σ² is the variance of the length, and k is the number of standard deviations;

Further, calculating the anomaly value α of the character distribution using the chi-square test from statistics specifically comprises: for the strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in ICD; then $ICD_i = \frac{1}{n}\sum_{j=1}^{n} CD(s_j)_i$, where i = 1, 2, …, n; that is, the i-th probability value in ICD is the mean of the i-th probability values of all sample distributions in the sample set;

Further, for the enumeration type, it is calculated whether an input attribute value falls outside the enumerated type. Functions f and g are defined: f is a linearly increasing function and g(x) is the sample function; as the training samples are input in order, g increases by 1 when a new sample is encountered and otherwise decreases by 1:

$$f(x) = x$$

$$g(x) = \begin{cases} g(x-1) + 1, & \text{if the } x\text{-th sample is new} \\ g(x-1) - 1, & \text{otherwise} \end{cases}$$

After all samples have been learned, the correlation coefficient ρ of the resulting functions f and g can be defined by the following formula:

$$\rho = \frac{\mathrm{Covar}(f, g)}{\sqrt{\mathrm{Var}(f)\,\mathrm{Var}(g)}}$$

where Var(f) and Var(g) are the variances of f and g respectively, and Covar(f, g) is the covariance of f and g.

Further, the mutual information used in keyword extraction reflects how tightly the characters inside a string are bound together. Its calculation formula is as follows:

where P(s₁s₂s₃) denotes the probability that the string s₁s₂s₃ occurs, and P(s₁s₂) and P(s₂s₃) have analogous meanings.

Further, the method also comprises calculating how rich the left and right neighboring characters of a string are. This richness can be obtained using information entropy, $H = -\sum_i P(i)\log P(i)$, where P(i) denotes the probability that neighboring character i of the string occurs.

Further, Bagging is an ensemble learning framework that draws sub-samples from the training set to form the sub-training set required by each base model and combines the prediction results of all base models to produce the final prediction. On the basis of the learners, data sets are re-drawn from the original data set for classification prediction, the label is decided by majority vote, and the model accuracy is checked at the same time.

The advantages and beneficial effects of the present invention are as follows:

The present invention uses statistical methods to slice URLs and extract features, ensuring the completeness and reliability of feature extraction. It also integrates several machine learning algorithms, including the highly accurate XGBoost (extreme gradient boosting) and RF (random forest), which ensures high accuracy when the model monitors traffic anomalies. During monitoring, the accessed URL is input into the five models for prediction to identify whether it is a known anomaly, and unknown anomalies can also be identified.
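The monitoring step can be illustrated with a short sketch. It follows the rule stated in the abstract (each of the five sub-models outputs a probability for its own access property, and the label with the highest probability is taken as the final label). The function name `final_label` and the representation of each sub-model as a plain callable returning a probability are assumptions made for illustration only.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumption: each sub-model is a callable that maps a feature vector to the
# probability that the URL has that sub-model's access property.
SubModel = Callable[[Sequence[float]], float]


def final_label(features: Sequence[float], models: Dict[str, SubModel]) -> Tuple[str, float]:
    """Return the access property whose sub-model reports the highest probability."""
    scores = {label: model(features) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For example, with stand-in sub-models that return fixed probabilities, `final_label` returns the label of whichever stand-in reports the largest value.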

Brief description of the drawings

Fig. 1 is the overall flow chart of the method according to the preferred embodiment of the present invention;

Fig. 2 is an example of cutting and extracting a URL in this method;

Fig. 3 is a diagram of the bagging framework integration process of this method;

Fig. 4 is the flow chart of abnormal traffic monitoring under this model.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical solution by which the present invention solves the above technical problems is as follows:

The present invention proposes a model for abnormal traffic monitoring. Fig. 1 shows the flow chart of the whole model. The data set is preprocessed by splitting at symbols such as "?" and "=" and extracting the effective information in each URL, which improves processing efficiency. Fig. 2 is an example of URL cutting. The processed data then undergo feature extraction using statistical methods such as mutual information and information entropy. After feature engineering is complete, a data set with different features is constructed for each access property, and the labels are changed to two classes: the current access property and "other". At the same time, part of the data is extracted as a test set. Machine learning is carried out separately on the five reconstructed data sets. Four machine learning algorithms, eXtreme Gradient Boosting, Light Gradient Boosting Machine, Random Forest and Logistic Regression, are introduced for supervised learning on the data sets, and the learners are integrated through the Bagging framework to obtain mutually independent identification models for the different access properties. Fig. 3 shows the bagging framework integration process. The reserved test set is fed into each identification model for testing, and the model accuracy is checked.

The key processes of the improved abnormal traffic monitoring model are: URL information extraction, feature engineering construction, training of the multi-algorithm learners, and bagging framework integration.

1. URL information extraction

To improve the processing efficiency of the model, extracting the effective information from each URL is essential. For an unprocessed URL:

1) first remove the invalid data after "#";

2) cut the remaining fragment at "?";

3) separate out the file path fragment and divide it by "/" and "=";

4) divide the query part by "&" and "=";

The parameters and values obtained in 3) and 4) are each passed into a processing function for regular-expression matching. The processing function replaces numbers and date and time values, replaces garbled symbols with "$0", changes strings of lowercase letters with length less than 10 to "s", changes strings that start with "Ox" and have length greater than 2 to "Ox1234", and collapses multiple spaces into one space. The fragments remaining after processing are the URL information fragments that the model needs.
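The cutting and normalization steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, "Ox" is assumed to denote a hexadecimal prefix (both "0x" and "Ox" are matched), digit runs are assumed to be replaced by "0", and any run of characters that is neither alphanumeric nor whitespace is assumed to count as a garbled symbol. Date and time handling is not shown separately.

```python
import re
from typing import List


def normalize_token(token: str) -> str:
    """Apply the regular-expression replacements to one parameter or value."""
    token = re.sub(r"\s+", " ", token)               # collapse multiple spaces
    if re.fullmatch(r"[0O]x[0-9a-fA-F]+", token):    # assumption: hex-like string, length > 2
        return "Ox1234"
    if re.fullmatch(r"[a-z]{1,9}", token):           # lowercase string shorter than 10
        return "s"
    token = re.sub(r"\d+", "0", token)               # assumption: numbers become "0"
    return re.sub(r"[^\w\s]+", "$0", token)          # assumption: other symbols are "garbled"


def extract_fragments(url: str) -> List[str]:
    """Cut a raw URL into normalized information fragments."""
    url = url.split("#", 1)[0]                       # 1) drop the data after "#"
    path, _, query = url.partition("?")              # 2) cut at "?"
    raw = re.split(r"[/=]", path)                    # 3) divide the path by "/" and "="
    raw += re.split(r"[&=]", query)                  # 4) divide the query by "&" and "="
    return [normalize_token(t) for t in raw if t]
```

Under these assumptions, `extract_fragments("/index.php?id=12&name=admin#top")` yields `["index$0php", "s", "0", "s", "s"]`, which shows how concrete values are abstracted away while the structure of the request is kept.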

2. Feature engineering construction

As is well known, the construction of feature engineering strongly affects the validity and accuracy of a model.

1) Length of the URL parameter value: the anomaly value P of the length can be calculated using the Chebyshev inequality from statistics together with the mean and variance of the length,

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

where X is the length of the URL parameter value, μ is the mean length, σ² is the variance of the length, and k is the number of standard deviations;
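A minimal sketch of how this bound could be turned into a score is given below. It assumes that k is taken as the deviation of the observed length l from the mean, measured in standard deviations, so that the bound becomes σ²/(l − μ)², capped at 1. The function name and the capping are illustrative choices and are not stated in the original text.

```python
from statistics import mean, pvariance
from typing import Sequence


def length_anomaly(train_lengths: Sequence[int], length: int) -> float:
    """Chebyshev upper bound on seeing a deviation at least as large as this one.

    Values close to 0 indicate an unusually long or short parameter value.
    """
    mu = mean(train_lengths)
    var = pvariance(train_lengths, mu)       # population variance of the training lengths
    deviation = abs(length - mu)
    if deviation == 0:
        return 1.0                           # no deviation: the bound is trivial
    return min(1.0, var / deviation ** 2)    # 1 / k^2 with k = deviation / sigma
```

With training lengths 4, 6, 5 and 5 (mean 5, variance 0.5), a parameter value of length 15 receives a bound of 0.005, while a value of length 5 receives 1.0.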

2) Character distribution: the anomaly value α of the character distribution is calculated using the chi-square test from statistics. For the strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in ICD; then $ICD_i = \frac{1}{n}\sum_{j=1}^{n} CD(s_j)_i$, where i = 1, 2, …, n. That is, the i-th probability value in ICD is the mean of the i-th probability values of all sample distributions in the sample set;
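The following sketch illustrates one way to compute CD(s), ICD and a chi-square statistic. Several details are assumptions, since the text does not specify them: CD(s) is taken to be the relative character frequencies sorted in descending order, the distribution is truncated to a fixed number of bins with the last bin collecting the remainder, and the function returns the raw chi-square statistic rather than a p-value.

```python
from collections import Counter
from typing import List, Sequence

BINS = 6  # assumption: number of ranked frequency bins kept per string


def char_distribution(s: str, bins: int = BINS) -> List[float]:
    """CD(s): relative character frequencies in descending order, padded to `bins`."""
    counts = sorted(Counter(s).values(), reverse=True)
    freqs = [c / len(s) for c in counts] if s else []
    head, tail = freqs[:bins - 1], freqs[bins - 1:]
    padded = head + [0.0] * (bins - 1 - len(head))
    return padded + [float(sum(tail))]       # last bin holds the remaining mass


def idealized_distribution(samples: Sequence[str], bins: int = BINS) -> List[float]:
    """ICD: element-wise mean of CD(s) over all training strings."""
    cds = [char_distribution(s, bins) for s in samples]
    return [sum(cd[i] for cd in cds) / len(cds) for i in range(bins)]


def chi_square_statistic(s: str, icd: Sequence[float]) -> float:
    """Pearson chi-square statistic of the observed counts against ICD."""
    n = len(s)
    if n == 0:
        return 0.0
    cd = char_distribution(s, len(icd))
    return sum((n * o - n * e) ** 2 / (n * e) for o, e in zip(cd, icd) if e > 0)
```

A larger statistic means the string's character distribution departs further from the one learned from training data. Converting the statistic into the anomaly value α would require a chi-square distribution table or library, which is omitted here.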

3) Enumeration type: it is very common for the legal input of an attribute value to belong to an enumeration type. For example, the legal parameters of the "gender" attribute are "{male, female}", and any input outside these two cases should be treated as abnormal. Functions f and g are defined: f is a linearly increasing function, and as the training samples are input in order, g increases by 1 when a new sample is encountered and otherwise decreases by 1.

$$f(x) = x$$

$$g(x) = \begin{cases} g(x-1) + 1, & \text{if the } x\text{-th sample is new} \\ g(x-1) - 1, & \text{otherwise} \end{cases}$$

After all samples have been learned, the correlation coefficient ρ of f and g is

$$\rho = \frac{\mathrm{Covar}(f, g)}{\sqrt{\mathrm{Var}(f)\,\mathrm{Var}(g)}}$$

where Var(f) and Var(g) are the variances of f and g respectively, and Covar(f, g) is the covariance of f and g;
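The construction of f, g and ρ can be sketched in a few lines. The function name is hypothetical, g is assumed to start from 0 before the first sample, and ρ is returned as 0.0 when either variance is zero. Reading a negative ρ as evidence that the attribute is an enumeration is a common interpretation of this statistic and is stated here as an assumption, since the text does not give the decision rule.

```python
from math import sqrt
from statistics import mean
from typing import Hashable, List, Sequence


def enumeration_correlation(values: Sequence[Hashable]) -> float:
    """Correlation between f(x) = x and the new-value counter g(x)."""
    seen = set()
    f: List[int] = []
    g: List[int] = []
    level = 0                                 # assumption: g starts from 0
    for x, value in enumerate(values, start=1):
        level += 1 if value not in seen else -1
        seen.add(value)
        f.append(x)
        g.append(level)
    mf, mg = mean(f), mean(g)
    cov = mean([(a - mf) * (b - mg) for a, b in zip(f, g)])
    var_f = mean([(a - mf) ** 2 for a in f])
    var_g = mean([(b - mg) ** 2 for b in g])
    if var_f == 0 or var_g == 0:
        return 0.0
    return cov / sqrt(var_f * var_g)
```

A stream that keeps repeating a small set of values, such as alternating "male" and "female", drives g downward and gives a negative ρ, whereas a stream of all-distinct values gives ρ = 1.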

4) Keyword extraction: to find the URL features common to the same access property, extracting keywords from URLs of the same access type is particularly important. After all URL data are scanned, the frequency of every string of physically adjacent characters is recorded. Strings with too low a frequency are screened out, and mutual information is calculated for the remaining strings. Mutual information reflects how tightly the characters inside a string are bound together, and its calculation formula is as follows:

In addition, it is necessary to calculate how rich the left and right neighboring characters of a string are. The richer the left and right neighboring characters, the more flexibly the string is used in the data set and the more likely it is to be a keyword of this kind of URL. This richness can be obtained using information entropy, $H = -\sum_i P(i)\log P(i)$, where P(i) denotes the probability that neighboring character i of the string occurs.
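The frequency recording and the neighbor-richness measure can be sketched as below. The sketch assumes fixed-length candidate strings of n characters and base-2 logarithms, neither of which is specified in the text, and the function names are illustrative. The mutual information step is not included.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def entropy(counts: Counter) -> float:
    """Information entropy (base 2, an assumption) of a neighbor-character histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def neighbor_stats(fragments: Iterable[str], n: int = 3) -> Dict[str, Tuple[int, float, float]]:
    """Map each n-character string to (frequency, left entropy, right entropy)."""
    freq: Counter = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in fragments:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1      # character just before the string
            if i + n < len(text):
                right[gram][text[i + n]] += 1     # character just after the string
    return {g: (c, entropy(left[g]), entropy(right[g])) for g, c in freq.items()}
```

Strings whose frequency is high and whose left and right entropies are both high appear in many different contexts, which matches the text's criterion for keyword candidates.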

3. Training of the multi-algorithm learners

Before training, the data need a small change. The URL features of each access property are extended to the whole data set, forming five different data sets. The original labels are changed at the same time: only the label of that access property is kept, and the labels of URL data with the remaining access properties are all replaced with "other".
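The relabeling described above amounts to building one binary ("this property" versus "other") data set per access property. A minimal sketch follows; the label strings in `ACCESS_PROPERTIES` and the function name are hypothetical, and the per-property feature columns are omitted so that only the label change is shown.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label names for the five access properties named in the text.
ACCESS_PROPERTIES = ["command", "database", "xss", "file_inclusion", "normal"]


def rebuild_datasets(
    urls: Sequence[str], labels: Sequence[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Build five data sets, each keeping one label and mapping the rest to "other"."""
    return {
        prop: [(url, lab if lab == prop else "other") for url, lab in zip(urls, labels)]
        for prop in ACCESS_PROPERTIES
    }
```

Each resulting data set contains every URL, so each of the five learners sees the full data but only has to separate its own access property from everything else.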

As for the algorithms chosen, XGBoost, LightGBM, RF and LR were found in testing to be the machine learning algorithms with the highest accuracy and the best fit to the problem.

1) XGBoost: XGBoost is an algorithm optimized on the basis of boosting algorithms such as AdaBoost and GBDT. It can be used for linear classification, in which case it can be regarded as a linear regression algorithm with L1 and L2 regularization. Compared with traditional GBDT it adds a regularization term and is therefore much better at preventing overfitting. In terms of distributed computation, XGBoost sorts the features of each dimension on one machine and stores them in Block structures, so the computation of multiple features can be distributed across different machines and the final results aggregated. This gives XGBoost the ability to compute in a distributed manner. Because feature values are ultimately used only for sorting, the values themselves have little effect on XGBoost model learning; and because each computation simply selects the feature with the largest gradient reduction, the feature correlation selection problem is also solved;

2) LightGBM: LightGBM is a framework implementing the GBDT algorithm. It supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy and distributed support, and it can process massive data quickly.

3) Random Forest: Random Forest is particularly suitable for multi-class classification problems. Training and prediction are fast, and it performs well on the data set; it is highly tolerant of faults in the training data; it can handle very high-dimensional data without feature selection, that is, it can handle thousands of variables without deleting any, which is very useful for the large number of features produced by keyword extraction; it can generate an internal unbiased estimate of the generalization error during classification; it can detect interactions between features and the importance of features during training; and it is resistant to overfitting;

4) Logistic Regression: the idea of logistic regression is to divide the data set into two parts with a hyperplane. The two parts lie on either side of the hyperplane and belong to two different classes, which exactly suits the data produced by relabeling the URL data set for each access property. Fig. 4 is a schematic diagram of the binary classification principle of Logistic Regression. In addition, its classification requires very little computation, is fast, uses very little storage, and makes it easy to observe the sample probability scores.

4. Bagging framework integration

Bagging is an ensemble learning framework that draws sub-samples from the training set to form the sub-training set required by each base model and combines the prediction results of all base models to produce the final prediction. On the basis of the learners, data sets are re-drawn from the original data set for classification prediction, the label is decided by majority vote, and the model accuracy is checked at the same time. Because the expectation of the overall model under this framework is approximately the expectation of the base models, the bias of the overall model is approximately the bias of the base models, while the variance of the overall model decreases as the number of base models grows. The ability to prevent overfitting is therefore strengthened, and model accuracy can be improved significantly. Table 1 compares the experimental accuracy of each machine learning algorithm with the accuracy after Bagging integration;

Table 1: Model accuracy comparison
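The bootstrap sampling and majority vote described in this section can be sketched in plain Python. To keep the example self-contained, a single-feature threshold classifier (`train_stump`) stands in for the XGBoost, LightGBM, RF and LR learners; that stand-in, the function names and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]
Model = Callable[[Sequence[float]], str]


def train_stump(data: List[Sample]) -> Model:
    """Hypothetical base learner: the one-feature threshold with the fewest training errors."""
    majority = Counter(y for _, y in data).most_common(1)[0][0]
    best = (len(data) + 1, 0, 0.0, majority, majority)
    for j in range(len(data[0][0])):
        for t in sorted({x[j] for x, _ in data}):
            low = [y for x, y in data if x[j] <= t]
            high = [y for x, y in data if x[j] > t]
            low_label = Counter(low).most_common(1)[0][0]
            high_label = Counter(high).most_common(1)[0][0] if high else low_label
            errors = sum(y != low_label for y in low) + sum(y != high_label for y in high)
            if errors < best[0]:
                best = (errors, j, t, low_label, high_label)
    _, j, t, low_label, high_label = best
    return lambda x: low_label if x[j] <= t else high_label


def bagging_fit(data: List[Sample], n_models: int = 5, seed: int = 0) -> List[Model]:
    """Train each base model on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]


def bagging_predict(models: Sequence[Model], x: Sequence[float]) -> str:
    """Decide the label by majority vote over the base models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]
```

Because every base model sees a different resampled data set, their individual errors are only partly correlated, and averaging their votes is what reduces the variance of the combined prediction.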

The above embodiments should be understood as merely illustrating the present invention and not limiting its scope of protection. After reading the contents recorded in the present invention, a skilled person can make various changes or modifications to it, and these equivalent changes and modifications likewise fall within the scope of the claims of the present invention.

Claims

1. A WEB abnormal traffic monitoring method based on ensemble learning, characterized by comprising the following steps:

1) data preprocessing: obtaining uniform resource locator (URL) records, cutting and separating the URL records, and extracting the effective information;

2) feature engineering construction: using statistical methods to extract features separately from the URLs of common command attacks, database attacks, cross-site scripting attacks, local file inclusion attacks and normal network access;

3) data set reconstruction: for each of the five access properties, arranging the whole data set according to that property's features, and adjusting the labels to that access property and "other";

4) model building: for the data sets corresponding to the five access properties, carrying out supervised learning on the data with four machine learning algorithms, XGBoost (extreme gradient boosting), LightGBM (light gradient boosting machine), RF (random forest) and LR (logistic regression), and integrating the learners with the bagging framework to obtain a separate identification model for each of the five access properties;

5) model testing: testing on the part of the data set reserved in advance in step 4) and checking the model accuracy.

2. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 1, characterized in that the extraction of the effective URL information in step 1) comprises, for an unprocessed URL: first removing the invalid data after "#"; cutting the remaining fragment at "?"; separating out the file path fragment and dividing it by "/" and "="; dividing the query part by "&" and "="; and passing the parameters and values obtained by the division into a processing function for regular-expression matching.

3. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 2, characterized in that the processing function replaces numbers and date and time values, replaces garbled symbols with "$0", changes strings of lowercase letters with length less than 10 to "s", changes strings that start with "Ox" and have length greater than 2 to "Ox1234", and collapses multiple spaces into one space, the fragments remaining after processing being the URL information fragments that the model needs.

4. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 2, characterized in that the feature engineering construction of step 2) specifically comprises: length of the URL parameter value, where the anomaly value P of the length is calculated using the Chebyshev inequality from statistics together with the mean and variance of the length; character distribution, where the anomaly value α of the character distribution is calculated using the chi-square test from statistics; enumeration type, where the enumeration-type anomaly of an input attribute value is calculated; and keyword extraction, which finds the URL features common to the same access property: after all URL data are scanned, the frequency of every string of physically adjacent characters is recorded, strings with too low a frequency are screened out, and mutual information is calculated for the remaining strings.

5. The real-time network traffic anomaly monitoring system based on big data according to claim 4, characterized in that, for the length anomaly value of the URL parameter value, the anomaly value P of the length can be calculated using the Chebyshev inequality from statistics together with the mean and variance of the length, the calculation formula including:

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

where X is the length of the URL parameter value, μ is the mean length, σ² is the variance of the length, and k is the number of standard deviations.

6. The real-time network traffic anomaly monitoring system based on big data according to claim 4, characterized in that calculating the anomaly value α of the character distribution using the chi-square test from statistics specifically comprises: for the strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in ICD; then $ICD_i = \frac{1}{n}\sum_{j=1}^{n} CD(s_j)_i$, where i = 1, 2, …, n; that is, the i-th probability value in ICD is the mean of the i-th probability values of all sample distributions in the sample set;

7. The real-time network traffic anomaly monitoring system based on big data according to claim 4, characterized in that, for the enumeration type, the enumeration-type anomaly of an input attribute value is calculated; functions f and g are defined, where f is a linearly increasing function and g(x) is the sample function; as the training samples are input in order, g increases by 1 when a new sample is encountered and otherwise decreases by 1:

$$f(x) = x$$

$$g(x) = \begin{cases} g(x-1) + 1, & \text{if the } x\text{-th sample is new} \\ g(x-1) - 1, & \text{otherwise} \end{cases}$$

8. The real-time network traffic anomaly monitoring system based on big data according to claim 4, characterized in that the mutual information used in keyword extraction reflects how tightly the characters inside a string are bound together, and its calculation formula is as follows:

where P(s₁s₂s₃) denotes the probability that the string s₁s₂s₃ occurs, P(s₁s₂) denotes the probability that the string s₁s₂ occurs, and P(s₂s₃) denotes the probability that the string s₂s₃ occurs.

9. The real-time network traffic anomaly monitoring system based on big data according to claim 4, characterized by further comprising the step of calculating how rich the left and right neighboring characters of a string are, where this richness can be obtained using information entropy, $H = -\sum_i P(i)\log P(i)$, where P(i) denotes the probability that neighboring character i of the string occurs.

10. The real-time network traffic anomaly monitoring system based on big data according to any one of claims 1-9, characterized in that Bagging is an ensemble learning framework that draws sub-samples from the training set to form the sub-training set required by each base model and combines the prediction results of all base models to produce the final prediction; on the basis of the learners, data sets are re-drawn from the original data set for classification prediction, the label is decided by majority vote, and the model accuracy is checked at the same time.