CN111901330A - Ensemble learning model construction method, ensemble learning model identification device, server and medium - Google Patents


Info

Publication number
CN111901330A
CN111901330A
Authority
CN
China
Prior art keywords
model, hyper-parameter, learning model, meta-model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010725443.2A
Other languages
Chinese (zh)
Inventor
高甲
王庆龙
亓一航
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010725443.2A priority Critical patent/CN111901330A/en
Publication of CN111901330A publication Critical patent/CN111901330A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

Embodiments of the invention relate to the technical field of network security and disclose an ensemble learning model construction method, an unbalanced intrusion behavior identification method and device, a server and a medium. The construction method comprises the following steps: selecting m deep neural network (DNN) models as weak learners of the ensemble learning model, each DNN model being trained on the network data to be detected, so that the m models output m prediction results; splicing the m prediction results together as the input of a meta-model; and taking the prediction result of the meta-model as the final prediction result of the ensemble learning model. Embodiments of the invention can improve the detection effect of machine learning on unbalanced intrusion behavior.

Description

Ensemble learning model construction method, ensemble learning model identification device, server and medium
Technical Field
The invention relates to the technical field of network security, in particular to an ensemble learning model construction method, an ensemble learning model identification device, a server and a medium.
Background
An intrusion detection system is one of the common means of network security protection and compensates for the shortcomings of traditional protection equipment such as firewalls, vulnerability scanning and access control. As network applications multiply, the network environment grows increasingly complex and security incidents occur frequently, so intrusion detection systems have become one of the main protection measures in the network security field. With the continuous development of machine learning technology and its demonstrated effectiveness in learning data features, more and more network security researchers design and adopt machine learning models to solve security problems in the network environment, obtaining better detection results than traditional approaches.
Machine learning techniques mainly learn from existing intrusion data and use the learned experience to predict and evaluate unknown data, thereby identifying intrusion behavior; the principal tasks are classification, regression and clustering. Machine learning techniques for identifying network intrusion behavior can be divided into traditional machine learning, deep learning and ensemble learning. Most machine learning algorithms can solve classification problems, and intrusion detection is mainly the problem of classifying network behavior or system state as normal or abnormal.
Intrusion detection models based on machine learning rely mainly on classification algorithms, and the high-dimensional features and the imbalance of intrusion data sets greatly affect the training time and detection effect of traditional classification algorithms. Intrusion data in real network environments are mostly unbalanced: the number of samples of one or more classes is far higher than that of the others, yet the classes with fewer samples are more important than the others and carry greater reference value for identifying intrusion behavior.
When current machine learning methods process unbalanced intrusion behavior data, because of the disparity in sample counts between majority and minority classes, and because most classification algorithms take maximizing overall classification accuracy as their objective, traditional classification models are biased toward the majority classes and neglect the minority classes, so the detection effect on minority-class intrusions is poor.
Disclosure of Invention
In view of this, embodiments of the present invention provide an ensemble learning model construction method, an ensemble learning model identification device, a server, and a medium, and aim to improve detection effects on unbalanced intrusion behaviors.
In order to solve the above technical problem, an embodiment of the present invention provides a method for constructing an ensemble learning model, where the ensemble learning model is used to identify an unbalanced intrusion behavior, and the method includes:
selecting m deep neural network DNN models as weak learners of the ensemble learning model; each DNN model is respectively trained on the network data to be detected, and the m models output m prediction results;
splicing the m prediction results to be used as the input of a meta-model;
and taking the prediction result of the meta-model as the final prediction result of the ensemble learning model.
The embodiment of the invention also provides an unbalanced intrusion behavior identification method, which comprises the following steps:
acquiring network data to be detected;
and inputting the network data to be detected into the ensemble learning model for training to obtain a prediction result.
The embodiment of the present invention further provides an ensemble learning model construction apparatus, where the ensemble learning model is used to identify unbalanced intrusion behavior, and the apparatus includes:
the selection module is used for selecting m deep neural network DNN models as weak learners of the ensemble learning model; each DNN model is respectively trained on the network data to be detected, and the m models output m prediction results, wherein m is a positive integer;
the splicing module is used for splicing the m prediction results and taking the spliced results as the input of the meta-model;
and the meta-calculation module is used for taking the prediction result of the meta-model as the final prediction result of the ensemble learning model.
An embodiment of the present invention further provides a server, including: a memory storing a computer program and a processor running the computer program to implement the method as described above.
Embodiments of the present invention also provide a storage medium for storing a computer-readable program for causing a computer to perform the method as described above.
Compared with the prior art, embodiments of the invention adopt the Stacking method, taking a plurality of DNN models as weak learners and combining them into a strong learner through a meta-model; since DNNs express data features well, the combined model achieves a better detection effect.
As an embodiment, the method further comprises optimizing the hyper-parameters of the meta-model through the following steps:
acquiring a hyper-parameter combination to be optimized;
initializing a group of candidate values with preset intervals for each hyper-parameter in the hyper-parameter combination to obtain a plurality of groups of hyper-parameter candidate values of the hyper-parameter combination;
calculating and obtaining the maximum model accuracy corresponding to each group of hyper-parameter candidate values;
selecting, in the neighborhood of each hyper-parameter candidate value in the group of hyper-parameter candidate values corresponding to the maximum model accuracy, a group of candidate values with reduced spacing, to obtain a plurality of groups of hyper-parameter candidate values with reduced spacing;
repeating the above step of reducing the hyper-parameter candidate value spacing and searching for the maximum model accuracy until the searched model accuracy meets a preset condition;
and taking the candidate values corresponding to the maximum model accuracy when the preset stop condition is met as the hyper-parameter values of the hyper-parameters in the combination to be optimized. This yields an algorithm that is less time-consuming and easier to implement, while giving the model higher accuracy.
As an embodiment, the selecting a group of candidate values with reduced spacing in the neighborhood thereof includes:
reducing the spacing by a preset multiple to obtain a group of candidate values after the spacing is reduced.
As an embodiment, the preset stop condition is:
the currently obtained maximum model accuracy is less than or equal to the historical maximum model accuracy.
As one embodiment, the hyper-parameters comprise a resampling ratio.
As an embodiment, the meta-model is LightGBM. Using LightGBM as the meta-model gives the whole ensemble learning model a faster training speed, lower memory consumption and higher accuracy.
Drawings
FIG. 1 is a flowchart of a method for building an ensemble learning model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an ensemble learning model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating hyper-parameter optimization of an ensemble learning model according to an embodiment of the present invention;
FIG. 4 is a flow chart of an unbalanced intrusion behavior recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an ensemble learning model building apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an unbalanced intrusion behavior recognition apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth to aid understanding of the embodiments; the technical solution claimed in the present invention can nevertheless be implemented without some of these details, and with various changes and modifications based on the following embodiments.
The embodiment of the invention relates to a construction method of an ensemble learning model, where the ensemble learning model is suitable for identifying unbalanced intrusion behavior. Referring to fig. 1, the method includes the following steps:
Step 101: selecting m deep neural network DNN models as weak learners of the ensemble learning model. Each DNN model is respectively trained on the network data to be detected, the m models output m prediction results, and m is a positive integer.
Step 102: and splicing the m prediction results to be used as the input of the meta-model.
Step 103: and taking the prediction result of the meta-model as the final prediction result of the ensemble learning model.
In this embodiment, the ensemble learning model is constructed in the Stacking manner. Fig. 2 is a schematic structural diagram of the ensemble learning model constructed by the construction method of this embodiment. The model includes two training layers (level 0 and level 1). The level-0 data set is the training data of the original data set and is used to train each DNN sub-model in level 0; a specific value of the number m of sub-models can be obtained through experiments. Each DNN sub-model outputs one prediction result, so the m sub-models output m prediction results {p1, p2, …, pm}. The training data of level 1 is derived from the level-0 predictions {p1, p2, …, pm} and is used for meta-model training: the meta-model takes {p1, p2, …, pm} as input for training and produces an output value as the final prediction result Pf. Optionally, in this embodiment, the meta-model may adopt Microsoft's open-source LightGBM model. LightGBM is a framework implementing the GBDT (Gradient Boosting Decision Tree) algorithm and features a faster training speed, lower memory consumption and higher accuracy.
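The two-level Stacking structure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: scikit-learn's MLPClassifier stands in for each DNN weak learner, GradientBoostingClassifier stands in for the LightGBM meta-model, and the synthetic data set, sizes and parameters are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative imbalanced binary data set (stand-in for intrusion data).
X, y = make_classification(n_samples=600, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

m = 3  # number of weak learners in level 0
level0 = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=i)
          for i in range(m)]

# Level 0: each weak learner is trained and produces one prediction p_i.
for model in level0:
    model.fit(X_train, y_train)

# Splice the m predicted probabilities {p1, ..., pm} as the meta-model input.
meta_train = np.column_stack([mdl.predict_proba(X_train)[:, 1] for mdl in level0])
meta_test = np.column_stack([mdl.predict_proba(X_test)[:, 1] for mdl in level0])

# Level 1: the meta-model (LightGBM in the patent) yields the final result Pf.
meta_model = GradientBoostingClassifier(random_state=0)
meta_model.fit(meta_train, y_train)
p_final = meta_model.predict(meta_test)
print("meta input shape:", meta_train.shape)
```

Here the meta-model sees one column per weak learner, one row per sample, which is exactly the spliced {p1, p2, …, pm} described above.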
The meta-model used in this embodiment is the LightGBM model, which has a large number of hyper-parameters, such as the resampling ratio, that affect the training effect of the model. Hyper-parameters are generally set manually, and a large amount of tuning and retraining is needed before the model obtains a good prediction effect. Existing hyper-parameter tuning algorithms include heuristics such as genetic algorithms and particle swarm optimization, which are often complex to run and prone to falling into local optima. After the ensemble learning model is constructed, the hyper-parameters of the meta-model can be optimized through the following steps.
Optionally, in this embodiment, referring to fig. 3, the optimizing the hyper-parameters of the meta-model includes the following steps:
step 301: and acquiring the hyper-parameter combination to be optimized.
Step 302: and initializing a group of candidate values with preset intervals for each hyper-parameter in the hyper-parameter combination to obtain a plurality of groups of hyper-parameter candidate values of the hyper-parameter combination.
Step 303: and calculating and obtaining the maximum model accuracy corresponding to each group of hyperparameter candidate values.
Step 304: and determining whether a preset stop condition is met, if not, executing the step 305, and if so, executing the step 306.
Step 305: and selecting a group of candidate values with reduced intervals in the field of each hyper-parameter candidate value in a group of hyper-parameter candidate values corresponding to the maximum model accuracy to obtain a plurality of groups of hyper-parameter candidate values with reduced intervals.
Step 306: and taking the candidate value corresponding to the maximum model accuracy rate when the preset stopping condition is met as the hyper-parameter value of each hyper-parameter in the hyper-parameter combination to be optimized of the meta-model.
The above hyper-parameter optimization process is exemplified as follows:
Let the hyper-parameter combination be P = {A, B, …}, where A and B respectively represent hyper-parameters to be searched, with candidate values {a1, a2, …} and {b1, b2, …}. The hyper-parameter search steps are as follows:
(1) Initialize a group of candidate values with a preset spacing for each hyper-parameter to be optimized, ((a1, a2, …), (b1, b2, …), …), to obtain a plurality of groups of candidate values for the hyper-parameter combination to be optimized; the initial spacing takes a relatively large value. Assuming the combination to be optimized contains M hyper-parameters and each hyper-parameter is initialized with N candidate values, M × N groups of hyper-parameter candidate values to be traversed are obtained.
(2) Input each group of hyper-parameter candidate values from step (1), {(a1, b1, …), (a2, b2, …), …}, into the meta-model to calculate the model accuracy, obtain the accuracy corresponding to each of the M × N groups, find the maximum model accuracy among them, and record the hyper-parameter candidate combination corresponding to that maximum.
(3) Take the hyper-parameter candidate combination corresponding to the maximum model accuracy obtained in step (2) and re-divide, in the neighborhood of each hyper-parameter candidate value, a smaller candidate spacing, so that the current spacing is smaller than that of the previous layer, obtaining for each hyper-parameter a new group of candidate values with reduced spacing. Optionally, the spacing may be reduced by a preset multiple: the previous layer's candidate spacing may be α times the reduced spacing, and a specific value of α can be determined from experimental results.
(4) Repeat steps (2) and (3) on each group of reduced candidate values until the calculated model accuracy is less than or equal to the historical maximum model accuracy, i.e. the accuracy no longer improves. It will be appreciated that the maximum model accuracy reaching a preset threshold may also serve as the condition for stopping the search; the search stop condition for the maximum model accuracy (i.e. the preset stop condition) is not particularly limited here.
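The coarse-to-fine search of steps (1) to (4) can be sketched as follows. A toy accuracy function stands in for actually training the meta-model; the parameter ranges, the value α = 4 and the function itself are illustrative assumptions, not the patent's values.

```python
import itertools

def model_accuracy(params):
    # Toy stand-in for training the meta-model and measuring its accuracy;
    # it peaks at a = 0.3, b = 7 (purely illustrative).
    a, b = params
    return 1.0 - (a - 0.3) ** 2 - 0.01 * (b - 7) ** 2

def grid(center, spacing, n=5):
    # n candidate values spaced `spacing` apart, centered on `center`.
    return [center + spacing * (i - n // 2) for i in range(n)]

def multilayer_grid_search(centers, spacings, alpha=4.0, max_layers=10):
    best_params, best_acc = None, float("-inf")
    for _ in range(max_layers):
        candidates = [grid(c, s) for c, s in zip(centers, spacings)]
        # Step (2): evaluate every combination of candidate values.
        layer_best = max(itertools.product(*candidates), key=model_accuracy)
        layer_acc = model_accuracy(layer_best)
        if layer_acc <= best_acc:  # step (4): accuracy no longer improves
            break
        best_params, best_acc = layer_best, layer_acc
        # Step (3): shrink the spacing by alpha around the best combination.
        centers = layer_best
        spacings = [s / alpha for s in spacings]
    return best_params, best_acc

params, acc = multilayer_grid_search(centers=[0.0, 0.0], spacings=[0.5, 5.0])
print(params, round(acc, 4))
```

Because each layer keeps the best combination of the previous layer as a grid center, the searched accuracy can only stay equal or rise, so the stop test in step (4) terminates the search once a finer grid finds nothing better.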
By searching automatically via the multi-layer grid-search optimization strategy, this method obtains hyper-parameter candidate values that yield higher model accuracy. It avoids the lengthy manual tuning caused by the meta-model's many hyper-parameters and, compared with heuristic search, is simpler to implement while effectively improving the detection effect of the ensemble learning model.
The method for constructing the ensemble learning model according to this embodiment may further include testing the constructed model. When testing the model, or when performing detection with it, the obtained test data samples or the intrusion data of the network environment to be detected may be processed as follows:
when testing the model, the data sample for testing is very important to the detection effect of the model. In this embodiment, public data or manually acquired network user behavior data may be used as the data sample for testing, and since these data sets may have the problem of data loss or irregular format, the raw data needs to be cleaned and processed to meet the training sample requirements of the model. The present embodiment processes the data set using normalization (i.e., normalization) and one-hot encoding (one-hot encoding). Normalizing the data can limit the features within a certain range, so that the adverse effect of singular sample data on model training is eliminated, and performing one-hot encoding on the samples in the data set can encode discontinuous types of features (namely discrete types of features) in the data set into types which can be processed by the model.
In the ensemble learning process of this embodiment, resampling is adopted to improve the detection effect on unbalanced data. Resampling under-samples the majority classes, discarding part of the data to make them comparable to the minority classes, and over-samples the minority classes, reusing part of the data to make them comparable to the majority classes. In addition, the resampling ratio is adjusted as a hyper-parameter; because this ratio has a significant influence on the learning effect of the ensemble, tuning it can effectively improve that effect.
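A simple resampling sketch along these lines; the exact parameterization of the resampling ratio used here is an assumption for illustration, not the patent's definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(X, y, ratio=0.5):
    # `ratio` interpolates the per-class target count between the minority
    # count (0.0) and the majority count (1.0); treating the ratio this way
    # is an illustrative assumption standing in for the patent's
    # resampling-ratio hyper-parameter.
    classes, counts = np.unique(y, return_counts=True)
    n_min, n_max = counts.min(), counts.max()
    target = int(n_min + ratio * (n_max - n_min))
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Undersample large classes (no replacement, discarding data) and
        # oversample small classes (with replacement, reusing data).
        replace = len(c_idx) < target
        idx.append(rng.choice(c_idx, size=target, replace=replace))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # 8 majority vs 2 minority samples
Xr, yr = resample(X, y, ratio=0.5)
print(np.bincount(yr))
```

With ratio = 0.5 the 8-vs-2 split is rebalanced to 5 samples per class, making the resampling ratio a single tunable knob as the text describes.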
The ensemble learning model constructed by the embodiment of the invention comprises DNN models and a LightGBM model, so the detection effect of machine learning on unbalanced intrusion behavior can be improved. In addition, the embodiment automatically optimizes the hyper-parameters of the meta-model through the multi-layer grid optimization strategy, which not only saves time but also improves the detection effect.
An embodiment of the present invention further provides an unbalanced intrusion behavior identification method, as shown in fig. 4, where the method includes the following steps:
step 401: and acquiring network data to be detected.
In step 401, the normalization, one-hot encoding and resampling described in the foregoing embodiment may be applied to the obtained network data to be detected, which is not repeated here.
Step 402: and inputting the network data to be detected into the ensemble learning model according to the embodiment for training to obtain a prediction result.
In the embodiment of the invention, each DNN model in the ensemble learning model is trained on the network data to be detected to obtain a prediction result, and the meta-model takes the DNN models' prediction results as input for training to obtain the final prediction result. Because the DNN models can better express data features and the LightGBM model obtains prediction results more quickly and accurately, unbalanced network intrusion behavior can be identified more accurately.
An embodiment of the invention relates to an ensemble learning model construction apparatus, where the ensemble learning model is used for identifying unbalanced intrusion behavior. As shown in fig. 5, the construction apparatus 500 of this embodiment includes:
a selecting module 501, configured to select m deep neural network DNN models as weak learners of the ensemble learning model. Each DNN model is used for training network data to be detected and outputting m prediction results, and m is a positive integer.
And a splicing module 502, configured to splice the m prediction results and use the spliced m prediction results as an input of the meta-model.
And the meta-calculation module 503 is configured to use the prediction result of the meta-model as the final prediction result of the ensemble learning model.
Optionally, the constructing apparatus 500 of the present embodiment further includes: a hyper-parameter optimization module (not shown).
The hyper-parameter optimization module comprises:
the obtaining sub-module is used for obtaining the hyper-parameter combination to be optimized;
the initialization sub-module is used for initializing a group of candidate values with preset intervals for each hyper-parameter in the hyper-parameter combination to obtain a plurality of groups of hyper-parameter candidate values of the hyper-parameter combination;
the calculation submodule is used for calculating and obtaining the maximum model accuracy corresponding to each group of hyperparameter candidate values;
a candidate value updating submodule, configured to select, in the neighborhood of each hyper-parameter candidate value in the group of hyper-parameter candidate values corresponding to the maximum model accuracy, a group of candidate values with reduced spacing, to obtain a plurality of groups of hyper-parameter candidate values with reduced spacing;
a cycle sub-module for repeating the above steps of reducing the hyper-parameter candidate value to search for the maximum model accuracy until the searched model accuracy satisfies a preset condition;
and the setting submodule is used for taking the candidate value corresponding to the maximum model accuracy when the preset stopping condition is met as the hyper-parameter value of each hyper-parameter in the hyper-parameter combination to be optimized of the meta-model.
The ensemble learning model constructed by the construction apparatus of the embodiment of the invention comprises DNN (deep neural network) models and a LightGBM model, so the detection effect of machine learning on unbalanced intrusion behavior can be improved. In addition, the embodiment automatically optimizes the hyper-parameters of the meta-model through the multi-layer grid optimization strategy, which not only saves time but also improves the detection effect.
An embodiment of the present invention relates to an unbalanced intrusion behavior recognition apparatus. As shown in fig. 6, the recognition apparatus 600 of the embodiment includes:
the data obtaining module 601 is configured to obtain network data to be detected.
The prediction module 602 is configured to input the network data to be detected into the ensemble learning model according to the foregoing embodiment for training, and then obtain a prediction result.
In the embodiment of the invention, each DNN model in the ensemble learning model is trained on the network data to be detected to obtain a prediction result, and the meta-model takes the DNN models' prediction results as input for training to obtain the final prediction result. Because the DNN models can better express data features and the LightGBM model obtains prediction results more quickly and accurately, unbalanced network intrusion behavior can be identified more accurately.
One embodiment of the invention relates to a server. As shown in fig. 7, the server includes: a memory 702 and a processor 701;
the memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 to implement the ensemble learning model construction method or the unbalanced intrusion behavior recognition method in the foregoing embodiments.
The server includes one or more processors 701 and a memory 702, and one processor 701 is taken as an example in fig. 7. The processor 701 and the memory 702 may be connected by a bus or by other means, and fig. 7 illustrates an example of a bus connection. Memory 702, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 701 executes various functional applications and data processing of the device by executing nonvolatile software programs, instructions, and modules stored in the memory 702, that is, implements the ensemble learning model construction method or the unbalanced intrusion behavior recognition method described above.
The memory 702 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701, which may be connected to an external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 702 and, when executed by the one or more processors 701, perform the ensemble learning model construction method or the unbalanced intrusion behavior recognition method of any of the method embodiments described above.
The above-mentioned device can execute the method provided by the embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method, and reference may be made to the method provided by the embodiment of the present invention for technical details that are not described in detail in the embodiment.
An embodiment of the present invention also relates to a non-volatile storage medium for storing a computer-readable program for causing a computer to perform some or all of the above method embodiments.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware; the program is stored in a storage medium and includes several instructions to enable a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method for constructing an ensemble learning model, wherein the ensemble learning model is used for identifying unbalanced intrusion behaviors, and the method comprises the following steps:
selecting m deep neural network (DNN) models as weak learners of the ensemble learning model, where each DNN model is trained on network data to be detected and outputs a prediction result, the m DNN models yielding m prediction results in total, and m is a positive integer;
concatenating the m prediction results to serve as the input of a meta-model;
and taking the prediction result of the meta-model as the final prediction result of the ensemble learning model.
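As a non-limiting illustration (not part of the disclosure), the stacking arrangement of claim 1 can be sketched as follows; the names `stack_predictions` and `DummyDNN` are hypothetical stand-ins for the trained DNN weak learners, and the concatenated matrix would be handed to the meta-model:

```python
import numpy as np

def stack_predictions(base_models, X):
    """Concatenate the per-class probability outputs of the m base
    learners into one feature matrix for the meta-model."""
    preds = [model.predict_proba(X) for model in base_models]  # m arrays of shape (n, 2)
    return np.hstack(preds)                                    # shape (n, 2 * m)

class DummyDNN:
    """Hypothetical stand-in for a trained DNN weak learner."""
    def __init__(self, bias):
        self.bias = bias

    def predict_proba(self, X):
        # toy binary classifier: sigmoid over the feature sum plus a bias
        p = 1.0 / (1.0 + np.exp(-(X.sum(axis=1) + self.bias)))
        return np.column_stack([1.0 - p, p])

rng = np.random.default_rng(0)
X = rng.random((5, 3))                             # 5 samples of network data to be detected
models = [DummyDNN(b) for b in (-1.0, 0.0, 1.0)]   # m = 3 weak learners
meta_input = stack_predictions(models, X)          # fed to the meta-model
print(meta_input.shape)                            # (5, 6)
```

Each weak learner contributes one block of columns, so the meta-model sees all m predictions side by side rather than a single averaged vote.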
2. The construction method according to claim 1, further comprising optimizing hyper-parameters of the meta-model by:
acquiring a hyper-parameter combination to be optimized;
initializing, for each hyper-parameter in the hyper-parameter combination, a group of candidate values with a preset interval, to obtain multiple groups of hyper-parameter candidate values for the hyper-parameter combination;
calculating the model accuracy corresponding to each group of hyper-parameter candidate values and obtaining the maximum model accuracy;
in the group of hyper-parameter candidate values corresponding to the maximum model accuracy, selecting, within the neighborhood of each hyper-parameter candidate value, a group of candidate values with a reduced interval, to obtain multiple groups of hyper-parameter candidate values with a reduced interval;
repeating the steps of reducing the interval of the hyper-parameter candidate values and searching for the maximum model accuracy until the searched model accuracy meets a preset stop condition;
and taking the candidate values corresponding to the maximum model accuracy when the preset stop condition is met as the hyper-parameter values of the hyper-parameters in the hyper-parameter combination to be optimized of the meta-model.
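A minimal sketch of the coarse-to-fine search described in claim 2, for a single hyper-parameter and a hypothetical accuracy curve; the names `coarse_to_fine_search` and `acc` are illustrative and not part of the disclosure:

```python
import numpy as np

def coarse_to_fine_search(score_fn, center, interval, n_points=5, shrink=0.5, max_rounds=50):
    """Grid search for one hyper-parameter that repeatedly re-centres on the
    best candidate, shrinks the interval by a preset multiple, and stops when
    the newly found maximum accuracy no longer beats the historical maximum
    (the stop condition of claim 4)."""
    best_x, best_score = center, float("-inf")
    for _ in range(max_rounds):
        offsets = np.arange(-(n_points // 2), n_points // 2 + 1)
        candidates = best_x + interval * offsets      # neighborhood of the current best
        scores = [score_fn(x) for x in candidates]
        round_score = max(scores)
        if round_score <= best_score:                 # stop condition reached
            break
        best_x = candidates[int(np.argmax(scores))]
        best_score = round_score
        interval *= shrink                            # reduce the interval
    return best_x, best_score

# hypothetical "model accuracy" curve with its optimum at x = 2.0
acc = lambda x: 1.0 - (x - 2.0) ** 2
best_x, best_acc = coarse_to_fine_search(acc, center=0.0, interval=1.0)
print(best_x, best_acc)  # 2.0 1.0
```

In the claimed method the score function would be the accuracy of the meta-model retrained with each candidate hyper-parameter group, and the search would run over all hyper-parameters in the combination rather than one.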
3. The construction method according to claim 2, wherein the selecting a group of candidate values with a reduced interval in the neighborhood comprises:
reducing the interval by a preset multiple to obtain a group of candidate values after the interval is reduced.
4. The construction method according to claim 2, wherein the preset stop condition is:
the currently obtained maximum model accuracy is less than or equal to the historical maximum model accuracy.
5. The construction method according to claim 2, wherein the hyper-parameters comprise a resampling scale.
6. The construction method according to any one of claims 1 to 5, wherein the meta-model is a LightGBM model.
7. An unbalanced intrusion behavior recognition method, comprising:
acquiring network data to be detected;
inputting the network data to be detected into the ensemble learning model according to any one of claims 1 to 6, and obtaining a prediction result.
8. An ensemble learning model building apparatus for identifying unbalanced intrusion behavior, comprising:
the selection module is configured to select m deep neural network (DNN) models as weak learners of the ensemble learning model, where each DNN model is trained on network data to be detected and outputs a prediction result, the m DNN models yielding m prediction results in total, and m is a positive integer;
the concatenation module is configured to concatenate the m prediction results and use the concatenation as the input of the meta-model;
and the meta-calculation module is configured to take the prediction result of the meta-model as the final prediction result of the ensemble learning model.
9. A server, comprising: a memory storing a computer program and a processor running the computer program to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer-readable program for causing a computer to perform the method of any one of claims 1 to 7.
CN202010725443.2A 2020-07-24 2020-07-24 Ensemble learning model construction method, ensemble learning model identification device, server and medium Pending CN111901330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010725443.2A CN111901330A (en) 2020-07-24 2020-07-24 Ensemble learning model construction method, ensemble learning model identification device, server and medium

Publications (1)

Publication Number Publication Date
CN111901330A true CN111901330A (en) 2020-11-06

Family

ID=73190008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010725443.2A Pending CN111901330A (en) 2020-07-24 2020-07-24 Ensemble learning model construction method, ensemble learning model identification device, server and medium

Country Status (1)

Country Link
CN (1) CN111901330A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884079A (en) * 2021-03-30 2021-06-01 河南大学 Method for estimating near-surface nitrogen dioxide concentration based on Stacking integrated model
CN115664775A (en) * 2022-10-20 2023-01-31 齐齐哈尔大学 GS-DNN model-based wireless sensor network intrusion detection method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413494A * 2019-06-19 2019-11-05 Zhejiang University of Technology A LightGBM fault diagnosis method based on improved Bayesian optimization
CN111199343A * 2019-12-24 2020-05-26 Shanghai University Multi-model fusion method for mining abnormal data in tobacco market supervision

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wang Xin et al., "Fault time series prediction based on LSTM recurrent neural networks", Journal of Beijing University of Aeronautics and Astronautics *
Xu Ning et al., "Research on an improved LSTM deformation prediction model", Journal of Jiangxi University of Science and Technology *
Ma Guohui et al., "Ionospheric total electron content forecasting based on a deep learning GRU model", Geomatics & Spatial Information Technology *

Similar Documents

Publication Publication Date Title
CN111406267A (en) Neural architecture search using performance-predictive neural networks
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
CN110046706B (en) Model generation method and device and server
US11650968B2 (en) Systems and methods for predictive early stopping in neural network training
JP2019207685A (en) Method, device and system for estimating causal relation between observation variables
CN111127364B (en) Image data enhancement strategy selection method and face recognition image data enhancement method
CN111950810B (en) Multi-variable time sequence prediction method and equipment based on self-evolution pre-training
CN110135505B (en) Image classification method and device, computer equipment and computer readable storage medium
CN112200296B (en) Network model quantization method and device, storage medium and electronic equipment
CN111901330A (en) Ensemble learning model construction method, ensemble learning model identification device, server and medium
CN111327655A (en) Multi-tenant container resource quota prediction method and device and electronic equipment
Shyam et al. Competitive analysis of the top gradient boosting machine learning algorithms
CN113986674A (en) Method and device for detecting abnormity of time sequence data and electronic equipment
CN113159441A (en) Prediction method and device for implementation condition of banking business project
CN113095511A (en) Method and device for judging in-place operation of automatic master station
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
US11449578B2 (en) Method for inspecting a neural network
CN114139636B (en) Abnormal operation processing method and device
JP6659618B2 (en) Analysis apparatus, analysis method and analysis program
CN111026661B (en) Comprehensive testing method and system for software usability
JP7424373B2 (en) Analytical equipment, analytical methods and analytical programs
CN110458383B (en) Method and device for realizing demand processing servitization, computer equipment and storage medium
CN113609948A (en) Method, device and equipment for detecting video time sequence action
JP6577515B2 (en) Analysis apparatus, analysis method, and analysis program
CN110766338A (en) DPOS (distributed data processing) bifurcation prediction model method based on artificial intelligence and EOS (Ethernet over Ethernet) and IO (input/output) of block chain technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106