CN107294993B

CN107294993B - WEB abnormal traffic monitoring method based on ensemble learning

Info

Publication number: CN107294993B
Application number: CN201710543858.6A
Authority: CN
Inventors: 李智星; 沈柯; 于洪; 张冠群; 代南瑶; 胡聪; 胡峰; 王进; 雷大江; 欧阳卫华
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2021-02-09
Anticipated expiration: 2037-07-05
Also published as: CN107294993A

Abstract

The invention claims a WEB abnormal traffic monitoring method based on ensemble learning, comprising five processes: data preprocessing, feature engineering construction, data set reconstruction, model establishment and fusion, and model testing. Data preprocessing extracts the effective information from the URL data. Feature engineering construction extracts and constructs URL features using statistical methods such as information entropy and mutual information. After the feature engineering is constructed, the data set is adjusted according to the different access properties, and the adjusted data sets are input into four machine learning algorithms, including XGBoost and LightGBM, for supervised learning. After the learners are constructed, they are integrated using a Bagging framework. A data set is then reselected from the original data set for classification prediction, the label is decided by majority voting, and the accuracy of the model is checked. When the model is in use, a URL is input into the model, the five sub-models each output their own label probability, and the label with the highest probability is output as the final label.

Description

WEB abnormal traffic monitoring method based on ensemble learning

Technical Field

The invention belongs to the technical field of machine learning, and particularly relates to various statistical algorithms and machine learning algorithms.

Background

1. Network security problem in information era

Today, the scale of computer networks and the number of Internet users have reached unprecedented levels, and with them the problem of network security has become increasingly prominent. Because abnormal traffic monitoring is the most important means of resisting network attacks, its development and upgrading are urgent. After more than twenty years of development, research on traffic monitoring has evolved into many branches; in practical application, however, the results are not satisfactory, and the difficulties are mainly concentrated in the following aspects:

1) monitoring illegal behavior patterns in real time with fixed rules leads to an excessively high false alarm rate;

2) when feature matching is applied, the feature library must be updated manually, and unknown attack patterns cannot be detected;

3) a huge number of rules greatly degrades the detection performance of the system, and the rule base becomes difficult to maintain;

4) when an abnormal traffic detection system with a blocking function misjudges normal communication behavior, normal communication is blocked;

5) when the data storage capacity of the monitoring system becomes a bottleneck, the system is vulnerable to denial-of-service attacks, and communication is blocked.

Given these problems with abnormal traffic detection systems, current research is mainly focused on three directions: feature matching, rule reasoning and machine learning.

2. Machine learning

In recent years, machine learning methods have been increasingly applied to the algorithm design of abnormal traffic detection. They solve the problems of feature-library updating in feature matching and of manual rule-base maintenance without much manual intervention, greatly improving the degree of automation; they adapt well to different input data, break the high-false-alarm-rate impasse of rule reasoning, and can achieve higher accuracy in the face of unknown attacks.

However, a single machine learning method does not solve the problem perfectly. Statistical methods assume that all events are generated by a statistical model, and ignore the risk that the distribution model set in advance in a parametric method may not match the real data, so that the result deviates greatly from expectation. In addition, most systems built on statistical models work offline and cannot meet the requirement of real-time monitoring, so achieving high accuracy demands very high performance. It is also very difficult to determine the threshold in a statistical method, and the false alarm rate can increase when the threshold is set too high or too low.

Although machine learning algorithms can seamlessly combine prior and posterior knowledge and overcome the defect that the framework is not intuitive enough, problems in simple classification and clustering algorithms such as noisy data, incorrect sampling methods and too many modeling variables can cause overfitting, so a good monitoring effect cannot be achieved. The accuracy of the model depends on certain assumptions that reflect the behavior patterns of the target system and the network; violating these assumptions greatly reduces accuracy.

Disclosure of Invention

The present invention is directed to solving the above problems of the prior art. It provides a WEB abnormal traffic monitoring method based on ensemble learning that effectively improves the accuracy of the original machine learning methods in abnormal traffic monitoring. The technical scheme of the invention is as follows:

A WEB abnormal traffic monitoring method based on ensemble learning comprises the following steps:

1) data preprocessing: acquiring uniform resource locator (URL) records, cutting and separating them, and extracting the effective information;

2) feature engineering construction: using statistical methods, respectively extracting the features of the URLs of common command attacks, database attacks, cross-site scripting attacks, local file inclusion attacks and normal network access;

3) data set reconstruction: for each of the five access properties, the total data set is organized according to the features of that property, and the labels are adjusted to two classes, namely that access property and other access properties;

4) model establishment: for the data sets corresponding to the five access properties, four machine learning algorithms, XGBoost (extreme gradient boosting), LightGBM (light gradient boosting machine), RF (random forest) and LR (logistic regression), are each applied to supervised learning on the data, and the learners are integrated with a Bagging framework to obtain a recognition model for each of the five access properties;

5) model testing: testing with the part of the data set reserved in advance in step 4) and checking the accuracy of the model.

Further, extracting the valid URL information in step 1) comprises the following steps: for an unprocessed URL, first removing the invalid data after "#"; cutting the remaining segment at "?"; dividing file path segments by '/' and '; dividing the query part by "&" and "&"; and passing each parameter and value obtained by the division to a processing function for regular-expression matching.

Further, the processing function replaces numbers with dates and times, the career is replaced by "$0", a character string of fewer than 10 lower-case letters is changed to "s", a character string beginning with "0x" and longer than 2 characters is changed to "0x1234", and multiple consecutive spaces are reduced to one space; the processed fragments are the URL information fragments required by the model.

Further, the feature engineering construction of step 2) specifically comprises: length of the URL parameter value, where the length anomaly value P is calculated using the Chebyshev inequality in statistics together with the mean and variance of the length; character distribution, where the anomaly value α of the character distribution is calculated using the chi-square test in statistics; enumerated types, where it is determined whether the input of an attribute value belongs to an anomalous enumeration case; and keyword extraction, where the common features of URLs with the same access property are sought: after all URL data have been scanned, the frequencies of all physically adjacent character strings are recorded, character strings with too low a frequency are screened out, and mutual information is calculated for the remaining character strings.

Further, the length anomaly value of the URL parameter value is calculated using the Chebyshev inequality in statistics and the mean and variance of the length, and the calculation formula includes:

where X represents the length of the URL parameter value, μ is the length mean, σ² is the length variance, and k represents the number of standard deviations;
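For reference, the standard textbook form of Chebyshev's inequality, written with the symbols defined above, is given here. That the formula of the invention uses exactly this arrangement is an assumption.

$$P\left(|X-\mu| \ge k\sigma\right) \le \frac{1}{k^{2}}$$

Under this reading, a parameter value whose length lies many standard deviations from the mean receives a small bound, and that bound can serve as the length anomaly value P.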

Further, calculating the anomaly value α of the character distribution using the chi-square test in statistics comprises: for the character strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in the ICD, then

$$ICD_i = \frac{1}{n}\sum_{j=1}^{n} CD(s_j)_i$$

where i = 1, 2, …, n; that is, the i-th probability value in the ICD is the mean of the i-th probability values of all samples in the sample set;

Further, for enumerated types, it is determined whether the input of an attribute value belongs to an anomalous enumeration case; functions f and g are defined, where f is a linearly increasing function and g(x) is a sample function: as training samples are input in sequence, g is increased by 1 when a new sample value is encountered and decreased by 1 otherwise,

f(x)＝x

the correlation coefficient ρ of the functions f and g, obtained when learning of all samples is completed, can be defined by the following formula:

where Var(f) and Var(g) are the variances of the functions f and g, respectively, and Covar(f, g) is the covariance of the functions f and g.
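With the quantities just named, the usual definition of a correlation coefficient is the following. It is given as the standard form and is assumed, not confirmed, to match the formula of the invention.

$$\rho = \frac{\mathrm{Covar}(f,g)}{\sqrt{\mathrm{Var}(f)\,\mathrm{Var}(g)}}$$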

Further, in keyword extraction, the mutual information indicates whether the internal combination of a character string is tight, and the calculation formula is as follows:

wherein P(s₁s₂s₃) represents the probability of occurrence of the character string s₁s₂s₃, and P(s₁s₂) and P(s₂s₃) have analogous meanings.

Further, the method also comprises a step of calculating the richness of the left and right neighbouring characters of the character string, which can be obtained using the information entropy

where p(i) represents the probability of occurrence of the neighbouring character i of the string.
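The standard Shannon form of information entropy over the neighbouring characters, using p(i) as defined above, is shown here for reference. The symbol H and the choice of logarithm base are assumptions.

$$H = -\sum_{i} p(i)\,\log p(i)$$

H is zero when a string always has the same neighbour and grows as the neighbours become more varied.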

Further, Bagging is an ensemble learning framework that sub-samples the training set to form the sub-training set required by each base model and combines the prediction results of all base models to generate the final prediction result; on the basis of the learners, a data set is reselected from the original data set for classification prediction, the label is finalized by majority voting, and the accuracy of the model is checked at the same time.
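The sub-sampling and majority-voting steps described above can be sketched in a few lines. This is a minimal illustration only: the base learner `train_stump` is a hypothetical one-feature threshold classifier standing in for the XGBoost, LightGBM, RF and LR learners of the method, and all function names are assumptions.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw a sub-training set of the same size, sampling with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner (assumption): threshold midway between the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:              # only one class was drawn
        const = 1 if pos else 0
        return lambda x: const
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x >= threshold else 0
    return lambda x: 1 if x < threshold else 0

def bagging_fit(data, n_models, seed=0):
    """Train one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Finalize the label by majority vote over all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Using an odd number of base models avoids ties in a two-class vote.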

The invention has the following advantages and beneficial effects:

The invention uses statistical methods to slice URLs and extract features, ensuring the integrity and reliability of feature extraction. Meanwhile, several machine learning algorithms, including the highly accurate XGBoost (extreme gradient boosting) and RF (random forest), are integrated, which guarantees high accuracy of the models in traffic anomaly monitoring. During monitoring, a visited URL is input into the five models for prediction to determine whether it is a known anomaly, and unknown anomalies can be identified as well.
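The monitoring step, in which each of the five models scores the URL and the most probable label wins, can be sketched as follows. The representation of a model as a callable returning a probability, and the names `final_label` and `models`, are assumptions made for illustration.

```python
def final_label(url_features, models):
    """Return the access type whose model gives the highest probability.

    `models` maps an access-type name to a callable that returns the
    probability that the URL has that access type (hypothetical interface).
    """
    scores = {name: model(url_features) for name, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```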

Drawings

FIG. 1 is a flowchart of the overall method in a preferred embodiment of the present invention;

FIG. 2 is a diagram illustrating an example of URL segmentation and extraction in the method;

FIG. 3 is a diagram of the Bagging framework integration process in the method;

FIG. 4 is a flowchart of abnormal traffic monitoring with the model.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical scheme for solving the technical problems is as follows:

The invention provides a model for abnormal traffic monitoring. FIG. 1 shows the flow chart of the entire model. The data set is first preprocessed: the URL is split on symbols such as "&", and the effective information in the URL is extracted, which improves processing efficiency. FIG. 2 is an example of URL cutting. Features are then extracted from the processed data by statistical methods such as mutual information and information entropy. After the feature engineering is constructed, data sets with different features are built for the different access properties, and the labels are replaced so that only two classes remain: the current access property and others. At the same time, part of the data is held out as a test set. Machine learning is then performed on each of the five reconstructed data sets. Four machine learning algorithms, eXtreme Gradient Boosting, Light Gradient Boosting Machine, Random Forest and Logistic Regression, are introduced for supervised learning on the data sets, and the learners are integrated through a Bagging framework to obtain mutually independent recognition models for the different access properties. FIG. 3 shows the Bagging framework integration process. Finally, the reserved test sets are fed into the respective recognition models for testing, and the accuracy of the models is checked.

The important processes of the improved abnormal traffic monitoring model comprise: URL information extraction, feature engineering construction, training of the multi-algorithm learners, and Bagging framework integration.

First, information extraction of the URL

In order to improve the processing efficiency of the model, extracting the effective information of the URL is important. For an unprocessed URL:

1) first, the invalid data after "#" is removed;

2) the remaining segment is cut at "?";

3) file path segments are divided by '/' and ';

4) the query part is divided by "&" and "&";

The parameters and values obtained by the division in 3) and 4) are each passed to a processing function for regular-expression matching. The processing function replaces the number with the date and time, the career is changed to "$0", a character string of fewer than 10 lower-case letters is changed to "s", a character string beginning with "0x" and longer than 2 characters is changed to "0x1234", and multiple consecutive spaces are reduced to one space. The processed segment is the URL information segment required by the model.
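The splitting and generalization steps above might be sketched as follows. Several details are assumptions, because the delimiters and the first replacement rule are not fully legible in the text: the path is split only on "/", the query only on "&" with "=" separating name and value, digit runs are mapped to "$0", and no date or time rule is implemented. The function names are illustrative.

```python
import re

def split_url(url):
    """Split a raw URL into path segments and (name, value) query pairs."""
    url = url.split("#", 1)[0]                    # 1) drop everything after "#"
    path, _, query = url.partition("?")           # 2) cut the remainder at "?"
    segments = [p for p in path.split("/") if p]  # 3) path pieces (only "/" used here)
    pairs = []
    for item in query.split("&"):                 # 4) query pieces (only "&" used here)
        if item:
            name, _, value = item.partition("=")  # assumption: name=value pairs
            pairs.append((name, value))
    return segments, pairs

# A single left-to-right pass, so a rewritten token is never rewritten again.
_TOKEN = re.compile(r"0x[0-9a-fA-F]+|[a-z]+|[0-9]+|\s+")

def _rewrite(match):
    text = match.group(0)
    if text.startswith("0x"):
        return "0x1234"                 # hex literal longer than 2 characters
    if text.isdigit():
        return "$0"                     # assumption: digit runs become "$0"
    if text.isspace():
        return " "                      # collapse runs of whitespace
    return "s" if len(text) < 10 else text   # short lower-case strings

def normalize(token):
    """Generalize one path segment, parameter name or parameter value."""
    return _TOKEN.sub(_rewrite, token)
```

For example, `split_url("/shop/item.php?id=12&name=alice#top")` yields the path pieces `shop` and `item.php` and the pairs `id=12` and `name=alice`, each of which can then be passed through `normalize`.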

Second, construction of feature engineering

It is known that the construction of feature engineering seriously affects the effectiveness and accuracy of the model.

1) Length of the URL parameter value: the length anomaly value P can be calculated using the Chebyshev inequality in statistics together with the mean and variance of the length,

where μ is the length mean, σ² is the length variance, and k represents the number of standard deviations;
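A minimal sketch of a length score built on Chebyshev's inequality is given here. It assumes that the anomaly value is the Chebyshev upper bound σ²/(len − μ)², capped at 1, which equals 1/k² when the deviation is expressed as k standard deviations. The function names are hypothetical.

```python
from statistics import mean, pvariance

def fit_length_model(values):
    """Learn the mean and variance of parameter-value lengths from training data."""
    lengths = [len(v) for v in values]
    return mean(lengths), pvariance(lengths)

def length_anomaly_bound(value, mu, var):
    """Chebyshev bound on seeing a deviation at least as large as this one.

    Values near 1 are unremarkable; values near 0 indicate an unusual length.
    """
    deviation = abs(len(value) - mu)
    if deviation == 0:
        return 1.0
    return min(1.0, var / deviation ** 2)
```

With training lengths 2, 4, 4 and 6 (mean 4, variance 2), a value of length 10 receives the bound 2/36, while a value of length 4 receives 1.0.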

2) Character distribution: the anomaly value α of the character distribution is calculated using the chi-square test in statistics. For the character strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in the ICD, then

$$ICD_i = \frac{1}{n}\sum_{j=1}^{n} CD(s_j)_i$$

where i = 1, 2, …, n. That is, the i-th probability value in the ICD is the mean of the i-th probability values of all samples in the sample set;
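The character-distribution feature can be illustrated with the sketch below. It assumes that CD(s) is the list of relative byte frequencies of s sorted in descending order and padded to 256 entries, and it stops at the chi-square statistic; turning the statistic into the anomaly value α would additionally need the chi-square distribution, which is not shown. All names are hypothetical.

```python
from collections import Counter

SIZE = 256  # one slot per possible byte value

def char_distribution(s):
    """CD(s): relative byte frequencies, sorted descending, padded with zeros."""
    data = s.encode("utf-8")
    counts = sorted(Counter(data).values(), reverse=True)
    total = len(data) or 1
    cd = [c / total for c in counts]
    return cd + [0.0] * (SIZE - len(cd))

def idealized_distribution(samples):
    """ICD: the i-th entry is the mean of the i-th entries over all samples."""
    cds = [char_distribution(s) for s in samples]
    return [sum(cd[i] for cd in cds) / len(cds) for i in range(SIZE)]

def chi_square_statistic(s, icd):
    """Sum of (observed - expected)^2 / expected over slots with expected > 0."""
    length = len(s.encode("utf-8"))
    stat = 0.0
    for observed_p, expected_p in zip(char_distribution(s), icd):
        if expected_p > 0:   # slots the training data never used are skipped
            observed, expected = observed_p * length, expected_p * length
            stat += (observed - expected) ** 2 / expected
    return stat
```

A string whose character frequencies match the training profile scores 0, while a string dominated by one repeated character scores higher.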

3) Enumerated types: it is very common for the legal input of an attribute value to belong to an enumerated type; for example, the legal values of the "gender" attribute are {male, female}, and any input that is not one of these two cases should be treated as anomalous. Functions f and g are defined, where f is a linearly increasing function and g is a sample function: as training samples are input in sequence, g is increased by 1 when a new sample value is encountered and decreased by 1 otherwise.

f(x)＝x

The correlation coefficient ρ of f and g is obtained once all samples have been learned, where Var(f) and Var(g) are the variances of the functions f and g, respectively, and Covar(f, g) is the covariance of the functions f and g;
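The functions f and g and their correlation can be computed as in the following sketch. It assumes g starts from 0 and that ρ is the ordinary correlation coefficient Covar(f, g)/√(Var(f)·Var(g)). Reading a negative ρ as a sign of an enumerated attribute is a common interpretation, not something stated in the text.

```python
from math import sqrt
from statistics import mean

def enumeration_correlation(values):
    """Correlation between f(x) = x and the running new-value counter g."""
    seen = set()
    f, g = [], []
    level = 0
    for x, value in enumerate(values, start=1):
        level += 1 if value not in seen else -1   # +1 for a new value, else -1
        seen.add(value)
        f.append(x)
        g.append(level)
    mean_f, mean_g = mean(f), mean(g)
    covar = mean([(a - mean_f) * (b - mean_g) for a, b in zip(f, g)])
    var_f = mean([(a - mean_f) ** 2 for a in f])
    var_g = mean([(b - mean_g) ** 2 for b in g])
    if var_f == 0 or var_g == 0:
        return 0.0          # correlation is undefined for a constant sequence
    return covar / sqrt(var_f * var_g)
```

When every training value is new, g grows with f and ρ is 1; when a few values keep repeating, g falls and ρ turns negative.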

4) Keyword extraction: in order to find the common features of URLs with the same access property, it is important to extract keywords from the URLs of the same access type. After all URL data have been scanned, the frequencies of all physically adjacent character strings are recorded. Character strings with too low a frequency are screened out, and mutual information is then calculated for the remaining strings. The mutual information indicates whether the internal combination of a character string is tight, and the calculation formula is as follows:

In addition, the richness of the left and right neighbouring characters of the character string must also be calculated: the richer the left and right neighbours, the more flexibly the string is used in the data set and the more likely it is to be a keyword of that kind of URL. The richness of the left and right neighbouring characters can be obtained using the information entropy.
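The two keyword statistics can be sketched as follows. The cohesion score is an assumed form of the mutual information: the log-ratio of the probability of a string to the product of the probabilities of its two parts, minimized over all split points. The neighbour richness is the Shannon entropy of the characters seen immediately to the left and right of the string. All function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(texts, max_n):
    """Frequency of every substring of length 1..max_n."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def length_totals(counts):
    """Total number of n-gram occurrences for each length n."""
    totals = Counter()
    for gram, c in counts.items():
        totals[len(gram)] += c
    return totals

def cohesion(word, counts, totals):
    """Assumed mutual-information score; word must occur and have length >= 2."""
    def p(gram):
        return counts[gram] / totals[len(gram)]
    return min(math.log(p(word) / (p(word[:i]) * p(word[i:])))
               for i in range(1, len(word)))

def entropy(counter):
    """Shannon entropy (natural log) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in counter.values())

def neighbour_entropy(word, texts):
    """Entropy of the characters adjacent to each occurrence of word."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(word)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(word, start + 1)
    return entropy(left), entropy(right)
```

A candidate with a high cohesion score and high entropy on both sides behaves like a self-contained unit that appears in many contexts.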

Third, training of the multi-algorithm learners

Before training, the data must be adjusted slightly. The URL features of each access property are extended to the entire data set, forming five different data sets. At the same time, the original labels are changed: only the label of the current access property is kept, and the labels of the URL data of all other access properties are replaced by "other".
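The relabelling step can be illustrated with a short sketch. The label names in `ACCESS_TYPES` are hypothetical, and the sketch keeps one shared feature set for all five data sets; the per-property feature columns described above are omitted for brevity.

```python
# Hypothetical label names for the five access properties.
ACCESS_TYPES = ["command", "database", "xss", "lfi", "normal"]

def rebuild_datasets(records, access_types=ACCESS_TYPES):
    """Build one two-class data set per access property.

    `records` is a list of (features, label) pairs; in each rebuilt data set
    the target label is kept and every other label becomes "other".
    """
    return {
        target: [(features, label if label == target else "other")
                 for features, label in records]
        for target in access_types
    }
```

Each of the five resulting data sets can then be passed to the learners independently.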

The XGBoost, LightGBM, RF and LR algorithms selected here were found in testing to be the machine learning algorithms with the highest accuracy and the best fit to the problem.

1) XGBoost: XGBoost is an optimized algorithm built on boosting algorithms such as AdaBoost and GBDT. It can also be used for linear classification, in which case it can be regarded as a linear regression algorithm with L1 and L2 regularization. Compared with traditional GBDT, a regularization term is added to prevent overfitting. For distributed computation, XGBoost sorts each feature dimension on one machine and stores it in a Block structure, so the computation of multiple features can be distributed across different machines and the results aggregated at the end; XGBoost is thus capable of distributed computing. Because feature values are used only for sorting, abnormal feature values have little influence on the learning of the XGBoost model. Each step selects only the feature that gives the largest gradient reduction, which addresses the problem of selecting among correlated features;

2) LightGBM: LightGBM is a framework that implements the GBDT algorithm. It supports efficient parallel training, offers higher speed, lower memory consumption, better accuracy and better distributed support, and can process massive data quickly;

3) Random Forest: Random Forest is particularly suitable for multi-class problems, trains and predicts quickly, and performs well on the data set. It is tolerant of errors in the training data. It can handle very high-dimensional data without feature selection, that is, thousands of variables can be processed without deleting any, and it works well on the large number of features produced by keyword extraction. An internal unbiased estimate of the generalization error can be produced during classification. Interactions between features and the importance of each feature can be detected during training. It is not prone to overfitting;

4) Logistic Regression: the idea of Logistic Regression is to divide the data set into two parts with a hyperplane; the two parts lie on either side of the hyperplane and belong to two different categories, which exactly matches the two-class labels of the URL data set for each access property. FIG. 4 is a schematic diagram of the binary classification principle of Logistic Regression. In addition, the method has a small computational cost in classification, high speed and very low storage requirements, and makes it easy to inspect the probability scores of the samples.

Fourth, Bagging framework integration

Bagging is an ensemble learning framework that sub-samples the training set to form the sub-training set required by each base model and combines the prediction results of all base models to generate the final prediction result. On the basis of the learners, a data set is reselected from the original data set for classification prediction, the label is finalized by majority voting, and the model accuracy is checked at the same time. Because the expectation of the ensemble model is approximately that of the base models, the bias of the ensemble is similar to the bias of the base models, while the variance of the ensemble decreases as the number of base models increases; this improves resistance to overfitting and markedly improves model accuracy. Table 1 compares the experimental accuracy of each machine learning algorithm and of the Bagging ensemble;

TABLE 1 Model accuracy comparison

The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims

1. A WEB abnormal traffic monitoring method based on ensemble learning is characterized by comprising the following steps:

4) model establishment: for the data sets corresponding to the five access properties, four machine learning algorithms, XGBoost (extreme gradient boosting), LightGBM (light gradient boosting machine), RF (random forest) and LR (logistic regression), are each applied to supervised learning on the data, and the learners are integrated with a Bagging framework to obtain a recognition model for each of the five access properties;

5) model testing: testing with the part of the data set reserved in advance in step 4) and checking the accuracy of the model.

2. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 1, wherein extracting the valid URL information in step 1) comprises the following steps: for an unprocessed URL, first removing the invalid data after "#"; cutting the remaining segment at "?"; dividing file path segments by '/' and '; dividing the query part by "&" and "&"; and passing each parameter and value obtained by the division to a processing function for regular-expression matching.

3. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 2, wherein the processing function replaces numbers with dates and times, the career is replaced with "$0", a character string of fewer than 10 lower-case letters is changed to "s", a character string beginning with "0x" and longer than 2 characters is changed to "0x1234", multiple consecutive spaces are reduced to one space, and the processed fragments are the URL information fragments required by the model.

4. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 2, wherein the feature engineering construction of step 2) specifically comprises: length of the URL parameter value, where the length anomaly value P is calculated using the Chebyshev inequality in statistics together with the mean and variance of the length; character distribution, where the anomaly value α of the character distribution is calculated using the chi-square test in statistics; enumerated types, where it is determined whether the input of an attribute value belongs to an anomalous enumeration case; and keyword extraction, where the common features of URLs with the same access property are sought: after all URL data have been scanned, the frequencies of all physically adjacent character strings are recorded, character strings with too low a frequency are screened out, and mutual information is calculated for the remaining character strings.

5. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 4, wherein the length anomaly value of the URL parameter value is calculated using the Chebyshev inequality in statistics and the mean and variance of the length, and the calculation formula comprises:

wherein X is the length of the URL parameter value, μ is the length mean, σ² is the length variance, and k represents the number of standard deviations;

6. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 4, wherein calculating the anomaly value α of the character distribution using the chi-square test in statistics specifically comprises: for the character strings {s₁, s₂, …, s_n}, CD(s)_i denotes the i-th probability value in CD(s) and ICD_i denotes the i-th probability value in the ICD, then

7. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 4, wherein, for enumerated types, it is determined whether the input of an attribute value belongs to an anomalous enumeration case; functions f and g are defined, where f is a linearly increasing function and g(x) is a sample function, and as training samples are input in sequence, g is increased by 1 when a new sample value is encountered and decreased by 1 otherwise;

f(x)＝x

8. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 4, wherein, in keyword extraction, the mutual information indicates whether the internal combination of a character string is tight, and the calculation formula is as follows:

wherein P(s₁s₂s₃) represents the probability of occurrence of the character string s₁s₂s₃, P(s₁s₂) represents the probability of occurrence of the character string s₁s₂, P(s₁) represents the probability of occurrence of the character string s₁, P(s₃) represents the probability of occurrence of the character string s₃, and P(s₂s₃) represents the probability of occurrence of the character string s₂s₃.

9. The WEB abnormal traffic monitoring method based on ensemble learning according to claim 4, further comprising a step of calculating the richness of the left and right neighbouring characters of the character string, wherein the richness of the left and right neighbouring characters can be obtained using the information entropy

10. The WEB abnormal traffic monitoring method based on ensemble learning according to one of claims 1 to 9, wherein Bagging is an ensemble learning framework that sub-samples the training set to form the sub-training set required by each base model and combines the prediction results of all base models to generate the final prediction result; on the basis of the learners, a data set is reselected from the original data set for classification prediction, the label is finalized by majority voting, and the model accuracy is checked at the same time.