CN112395168A

CN112395168A - Stacking-based edge side service behavior identification method

Info

Publication number: CN112395168A
Application number: CN202011373177.8A
Authority: CN
Inventors: 刘贤达; 王昆昆; 赵剑明; 陈春雨; 张厦千; 王天宇; 张博文
Original assignee: Shenyang Institute of Automation of CAS; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Shenyang Institute of Automation of CAS; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-02-23

Abstract

The invention provides a method for recognizing edge side service behaviors based on Stacking integration. The method comprises the following steps: acquiring edge side behavior characteristics, performing labeling definition on edge side behaviors according to the edge side behavior characteristics, acquiring an edge side behavior characteristic database, and constructing an edge side behavior identification model based on a PCA and Stacking integrated frame. The PCA algorithm can perform feature engineering processing on a high-dimensional edge side behavior feature database to obtain excellent data required by the model algorithm. The Stacking integration algorithm further processes the database through the base model to obtain a new data set for training of a secondary learner, so that the identification accuracy can be greatly improved, and the problem of overfitting is avoided. The method can model the data behavior on the edge side and efficiently identify the behavior action on the edge side.

Description

Stacking-based edge side service behavior identification method

Technical Field

The invention belongs to the field of industrial control network information security, and particularly relates to a Stacking integration-based edge side service behavior identification method.

Background

At present, research on edge side behavior detection based on an industry characteristic knowledge base is lacking both at home and abroad, and the existing edge side equipment behavior detection and modeling methods are based on machine learning algorithms. Detection based on a support vector machine has a high cost in data storage space and time, and its recognition of multiple action states is poor; behavior state detection based on the naive Bayes method has lower equipment action recognition accuracy when the number of features is large and the features are strongly correlated; and the prediction result of detection based on a decision tree algorithm is unstable and has a large variance, which can cause equipment action recognition errors.

Disclosure of Invention

The invention aims to provide an edge-side-oriented service behavior identification method. The method tracks and detects the output data of edge side equipment, judges whether the equipment within the cloud range has failed, and judges the fault alarm mechanism.

The technical scheme adopted by the invention for realizing the purpose is as follows:

a method for identifying edge service behaviors based on Stacking comprises the following steps:

acquiring behavior data of edge side equipment, and preprocessing the behavior data of the edge side equipment;

performing feature selection on the preprocessed edge side equipment behavior data by using an integrated rule tree model;

and constructing an integrated learning edge side behavior recognition model by using a Stacking framework, taking the edge side equipment behavior data after feature selection as model input, and obtaining the action behavior state of the edge side equipment at the current moment according to a model prediction result through model training.

The edge side device behavior data includes the characteristic dimensions of time, device state, and action state.

The preprocessing of the behavior data of the edge side equipment is to use a PCA dimension reduction method to reduce the dimension of the behavior data of the edge side equipment, and specifically comprises the following steps:

normalizing the equipment behavior data on the edge side;

calculating a covariance matrix among features in the normalized edge side equipment behavior data;

calculating the eigenvalues and eigenvectors of the covariance matrix of the normalized edge side equipment behavior data;

arranging the eigenvalues from large to small, selecting the k largest eigenvalues, and forming the set of the k corresponding eigenvectors, onto which the data is projected to obtain the reduced, k-dimensional edge side equipment behavior data.

The method for constructing the integrated learning edge side equipment behavior recognition model by using the Stacking framework specifically comprises the following steps:

dividing the behavior data of the edge side equipment into a training set and a testing set, and training the base models on the training set by K-fold cross validation;

training the base learner by using the training set, taking the output predicted value as one feature of a new sample, carrying out K-fold cross validation to obtain K features, and taking all the obtained features as the input of a secondary learner for further training; the test set is passed through the trained base learner to generate a new test set for the prediction of the secondary learner;

selecting a random forest algorithm, an Adaboost algorithm and a K-nearest neighbor algorithm (KNN algorithm) as a base learner; using a linear regression algorithm as a secondary learner, and integrating the base learner model and the secondary learner model by a Stacking method; the training set is trained by three base learners, new data characteristics are output to serve as the input of a secondary learner, and the new training set obtains a final prediction result through secondary learning verification.

The random forest algorithm specifically comprises the following steps:

self-sampling is carried out from the training set or the testing set through a bootstrap method to obtain a new training set or testing set, and a decision tree is constructed according to the new training set or testing set;

establishing a characteristic random subset: when the decision tree is subjected to node splitting, randomly extracting a plurality of characteristics from all the characteristics to form a characteristic random subset, and searching the characteristics meeting the set requirement in the subset to establish the decision tree;

and performing majority voting on the prediction results of the plurality of constructed decision trees, and obtaining a final prediction result, namely the action state of the industrial control equipment at the moment, wherein minority obeys majority.

The Adaboost algorithm specifically comprises the following steps:

initializing sample weights in a training set, wherein if m samples are set, the weight of each sample is 1/m;

training the base learners in a loop: if a sample meets the set classification condition, its weight is reduced when constructing the next training set; if a sample does not meet the set classification condition, its weight is increased; the training set with the updated weights is used for training the next base learner;

judging whether the accuracy of the base learner reaches 50%, if the accuracy is lower than 50%, discarding the base learner, otherwise, keeping the base learner;

repeating the training process until the number of base learners reaches a preset specified value;

and combining the trained base learners into a strong learner.

A Stacking-based edge business behavior identification system comprises:

the data acquisition module is used for acquiring behavior data of the edge side equipment;

the feature selection module is used for selecting features of the behavior data of the edge side equipment by using an integrated rule tree model;

and the model training module is used for constructing an integrated learning edge side behavior recognition model by using a Stacking framework, taking the edge side equipment behavior data after feature selection as model input, and obtaining the action behavior state of the edge side equipment at the current moment according to the model prediction result through model training.

The model training module comprises a base learner and a secondary learner, wherein a random forest algorithm, an Adaboost algorithm and a K-nearest neighbor (KNN) algorithm are used as the base learner, a linear regression algorithm is used as the secondary learner, and the base learner model and the secondary learner model are integrated by a Stacking method; the training set is trained by three base learners, new data characteristics are output to serve as the input of a secondary learner, and the new training set obtains a final prediction result through secondary learning verification.

The edge side device behavior data in the data acquisition module includes the characteristic dimensions of time, device state, and action state.

The feature selection module is to:

normalizing the equipment behavior data on the edge side;

The model training module is configured to:

The invention has the following beneficial effects and advantages:

1. Aiming at the behavior recognition problem of edge side equipment services, the invention provides an edge side equipment service behavior recognition method based on an ensemble learning framework. The method uses PCA together with learning algorithms such as an integrated rule tree, random forest, Adaboost, KNN and logistic regression to model the normal behavior of the edge side equipment, and collects edge side equipment data monitored in real time to predict the degree to which the system state deviates from the normal state, which serves as a system situation element. The method can correctly predict the safety situation of the system and quickly recognize edge side equipment behavior, so that industrial control network security personnel can better implement safety protection measures according to the behavior state of the edge side equipment. Reliable decision information is provided for system management personnel, so that dangerous events can be judged and prevented in time.

Drawings

FIG. 1 is a schematic diagram illustrating a principle of using PCA to perform dimensionality reduction on edge-side device data in the present invention;

FIG. 2 is a schematic diagram of a process of using a random forest classification center according to the present invention;

FIG. 3 is a schematic diagram of a Stacking integration framework adopted in the present invention;

FIG. 4 is a schematic flow diagram of an edge side device behavior identification method based on Stacking integration according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

The invention aims to provide a behavior identification method for edge side equipment. The method tracks and detects the output data of edge side equipment, judges whether the edge side equipment has failed, and judges the fault alarm mechanism. Using a supervised learning algorithm, the normal behavior of the edge side equipment is modeled through a PCA and Stacking integrated framework. With the normal behavior as a reference, the degree to which the state of the edge side equipment deviates from normal working conditions at each moment is obtained as a safety situation element, and the safety situation elements are fused over the time dimension to obtain the current situation of the system. The detected data behavior characteristics are compared with a behavior characteristic library to identify the current behavior state of the edge side equipment, where the action behavior states include the initial state, normal operation, fault occurrence and the like.

The research result of the invention integrates several algorithms through a stacking integrated learning framework, the newly generated learning algorithm has higher accuracy and better generalization performance, and the service behavior identification accuracy of the edge side equipment is higher. The method has good effect on identifying various action states of the equipment.

The invention mainly aims at the identification of the service behavior of the edge side equipment to develop the research of a detection method based on a behavior characteristic knowledge base, establishes a production control behavior model through principal component description and expert experience knowledge, and realizes an efficient behavior modeling and identification model of the edge side equipment by combining the established behavior characteristic knowledge base through the research based on an equipment behavior characteristic extraction algorithm. Performing feature engineering processing on the feature data of the edge side equipment, cleaning and reducing dimensions of the data by adopting a PCA (principal component analysis) method, performing feature selection by adopting a regular tree model, establishing an edge side equipment service behavior recognition model for the data subjected to the feature engineering selection by adopting a Stacking integrated framework, and gradually optimizing model parameters by adopting a grid search cross validation mode to establish the edge side equipment service behavior recognition method based on Stacking integration.

The technical scheme adopted by the invention for solving the technical problems is as follows:

1. The edge side equipment data has redundancy, complementarity and relevance, and its high dimensionality increases the running time of the algorithm. The edge side equipment data is therefore preprocessed by a dimensionality reduction method that extracts the main characteristics from the multi-dimensional sensing data: missing values of the data are visualized, noise reduction is performed, and the PCA (principal component analysis) method is used to merge similar characteristics, reducing the number of characteristics and the data dimensionality. PCA minimizes the projection error and expresses the data in a new set of dimensions, which helps prevent overfitting. As shown in FIG. 1, the PCA dimensionality reduction steps are as follows:

1a) Carry out mean normalization on the continuous original data so that the data in every dimension has the same order of magnitude.

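1b) Solve the covariance matrix of the features, each entry of which is the covariance of a pair of features:

$$\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$

wherein X is a characteristic value, Y is a predicted value, i is the ith dimension, n is the total number of dimensions, and $\bar{X}$ and $\bar{Y}$ are the means of X and Y.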

1c) Solve the eigenvalues and eigenvectors according to the SVD:

$$(A^{T}A)\,\nu_i = \lambda_i \nu_i$$

$$\delta_i = \sqrt{\lambda_i}, \qquad \mu_i = \frac{1}{\delta_i} A \nu_i$$

wherein A is the feature matrix, ν is an eigenvector, λ is an eigenvalue, δ is a singular value, and μ is a left singular vector.

1d) Arrange the eigenvectors in descending order of eigenvalue.

1e) Select the K components with the highest variance, that is, the K eigenvectors with the largest eigenvalues.
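The PCA steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the patented implementation: the function names, the toy data, and the use of power iteration with deflation in place of an SVD routine are all assumptions made to keep the example self-contained.

```python
import math

def mean_normalize(rows):
    """Step 1a: centre every column and scale it to unit standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # `or 1.0` guards against constant columns (zero standard deviation).
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance_matrix(rows):
    """Step 1b: sample covariance of every pair of already centred columns."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpairs(cov, k, iters=500):
    """Steps 1c to 1e: the k largest eigenpairs via power iteration with deflation.

    Assumption: power iteration stands in for the SVD named in the text.
    """
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start vector
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(d) for j in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(d):
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def pca_reduce(rows, k):
    """Project the normalized data onto the k leading eigenvectors."""
    z = mean_normalize(rows)
    pairs = top_eigenpairs(covariance_matrix(z), k)
    reduced = [[sum(r[j] * vec[j] for j in range(len(r))) for _, vec in pairs]
               for r in z]
    return reduced, [lam for lam, _ in pairs]

# Hypothetical device readings: the second column is roughly twice the first.
data = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.9],
    [4.0, 8.1, 0.3],
    [5.0, 9.8, 0.7],
]
reduced, eigenvalues = pca_reduce(data, k=2)
```

Because the first two columns are almost perfectly correlated, two components retain nearly all of the variance of the three normalized features, which is the redundancy-removal effect the text describes.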

2. Feature selection is performed on the compressed edge side equipment data by using an integrated rule tree model. The rule tree model searches for suitable split points and divides the target data into more numerous, smaller groups with stronger homogeneity. Selecting a split point involves both choosing among all the features of the data and choosing the split position within a single feature; purity is measured by the Gini coefficient or by cross entropy. The rule tree model generates many different tree models by repeatedly using different features and randomly sampled, previously unused samples, which ensures the generalization of the result. A feature importance index is then generated by taking the average change in the Gini coefficient produced by each feature as a split point across the different rule tree models. Once enough tree models have been generated, the rule-tree-based method is strongly robust to noise in the process data.
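The Gini-based importance described above can be illustrated with a small sketch. It is a simplification under stated assumptions: instead of averaging the Gini change over many randomly built trees, it scores each feature by the Gini decrease of its single best split, and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a group: 0 for a pure group, 0.5 for a balanced binary one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(values, labels, threshold):
    """Drop in impurity when the samples are split at `threshold` on one feature."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

def best_split(values, labels):
    """Best (gini_decrease, threshold) over midpoints between adjacent values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((gini_decrease(values, labels, t), t) for t in thresholds),
               default=(0.0, None))

def rank_features(rows, labels):
    """Feature indices ordered from most to least important."""
    scores = {j: best_split([r[j] for r in rows], labels)[0]
              for j in range(len(rows[0]))}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: feature 0 separates the two states, feature 1 is noise.
rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.8, 2.0], [0.9, 5.5], [1.0, 1.5]]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(rows, labels)
```

Here feature 0 admits a split that makes both groups pure, so it ranks first; a feature whose best split barely changes the impurity would be a candidate for removal.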

3. An integrated learning edge side equipment behavior recognition model is constructed by using a Stacking framework. The idea of the Stacking algorithm is to train the base learners on the initial training set and to use the base learners to generate a new sample set for training the secondary learner that combines them. The base learners of the Stacking ensemble are chosen from different types of algorithms so that the correlation between them is low. The training samples are generally handled by cross validation or by the leave-one-out method. The data set is first divided into a training set and a test set; typically 1/5 of the samples are selected for the test set and the remaining 4/5 are used for the training set. The base models are trained and used for prediction on the training set by cross validation, the prediction result of each base model is used as one feature of the training data of the secondary learner, and the label of each new sample is still the label of the original sample. After all base models are trained, a new data set is generated for training the secondary learner. Each base model also predicts on the same test set, and the average of the prediction results is used as a feature of the new test samples, on which the secondary learner predicts. A multi-response linear regression algorithm (MLR) is typically chosen as the secondary learner.

As shown in fig. 3, the Stacking algorithm steps are as follows:

3a) The data set is divided into a training set and a testing set, wherein 1/5 of the samples are generally selected as the testing set and the remaining 4/5 as the training set. The base models are trained on the training set by K-fold cross validation, generally 5-fold: in each round four folds are used for training and the remaining fold for validation.

3b) The base models are trained on the training set, and the predicted value is used as one feature of a new sample; 5 features can be obtained through 5-fold cross validation. The original label of the sample is still used as the sample label. Passing the training set through the base models in this way generates a new training set. The test set is likewise passed through the trained base models to generate a new test set for prediction by the secondary learner.

3c) A random forest algorithm, an Adaboost algorithm and a KNN algorithm are selected as the base learners, and a linear regression algorithm is used as the secondary learner.

3d) The integrated learning framework predicts on the data.
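The construction of the secondary learner's inputs in steps 3a) and 3b) can be sketched as follows. This is an illustrative reading, not the patent's code: it assumes one meta-feature column per base learner, filled with out-of-fold predictions, and test-set columns averaged over the K fold models. The two tiny base learners are stand-ins for the random forest, Adaboost and KNN learners named in the text, and all names and data are hypothetical.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

class OneNearestNeighbour:
    """Stand-in base learner: label of the closest training sample."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        return [self.y[min(range(len(self.X)), key=lambda i: sq_dist(self.X[i], q))]
                for q in X]

class NearestCentroid:
    """Stand-in base learner: label of the closest class mean."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
        return self
    def predict(self, X):
        return [min(self.centroids, key=lambda l: sq_dist(self.centroids[l], q))
                for q in X]

def stacking_features(learner_classes, X_train, y_train, X_test, k=5):
    """Build the new training set (out-of-fold) and new test set (fold average)."""
    n = len(X_train)
    new_train = [[None] * len(learner_classes) for _ in range(n)]
    new_test = [[0.0] * len(learner_classes) for _ in range(len(X_test))]
    for col, cls in enumerate(learner_classes):
        for fold in k_fold_indices(n, k):
            held_out = set(fold)
            fit_idx = [i for i in range(n) if i not in held_out]
            model = cls().fit([X_train[i] for i in fit_idx],
                              [y_train[i] for i in fit_idx])
            # Each training sample is predicted by a model that never saw it.
            for i, pred in zip(fold, model.predict([X_train[i] for i in fold])):
                new_train[i][col] = pred
            # Test predictions are averaged over the k fold models.
            for row, pred in enumerate(model.predict(X_test)):
                new_test[row][col] += pred / k
    return new_train, new_test

# Hypothetical two-state data; the classes alternate so every fold holds both.
X_train = [[0.0, 0.1], [1.0, 0.9], [0.2, 0.0], [0.9, 1.1], [0.1, 0.3],
           [1.1, 1.0], [0.3, 0.2], [0.8, 0.8], [0.0, 0.4], [1.2, 0.9]]
y_train = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X_test = [[0.1, 0.1], [1.0, 1.0]]
new_train, new_test = stacking_features(
    [OneNearestNeighbour, NearestCentroid], X_train, y_train, X_test, k=5)
```

The secondary learner would then be fitted on `new_train` with the original labels `y_train` and applied to `new_test`. Using out-of-fold predictions keeps the secondary learner from seeing predictions that a base learner made on its own training samples.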

The AdaBoost algorithm steps are as follows:

1) Given a training set, initialize the data sample weights: with m samples, the weight of each sample is 1/m;

2) Train the base learners in a loop. If a sample has been classified correctly, its weight is reduced when constructing the next training set; if a sample is misclassified, its weight is increased. The training set with the updated weights is used to train the next base learner;

3) Judge whether the accuracy of the base learner reaches 50%; if the accuracy is lower than 50%, discard the base learner;

4) Repeat the training process until the number of base learners reaches a preset value T;

5) Combine the base learners obtained by training into a strong learner. After training is finished, a base learner with high classification accuracy is given a larger weight so that it plays a larger role in the final classification function, and a base learner with low classification accuracy is given a smaller weight so that it plays a smaller role.

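Wherein the linear combination of the base learners is:

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (1)$$

where $H(x)$ is the linear combination of the base models, $\alpha_t$ is the weight of the t-th base learner, and $h_t(x)$ is the t-th base classifier.

The end result of the iteration is to minimize the exponential loss function:

$$\ell_{\exp}(H \mid D) = \mathbb{E}_{x \sim D}\left[e^{-f(x)H(x)}\right] \qquad (2)$$

where $\ell_{\exp}$ is the mathematical expectation of the exponential loss, $D$ is the sample distribution, $f(x) \in \{-1, +1\}$ is the true label, and $e$ is the base of the natural exponential.

$H(x)$ minimizes the loss function when the partial derivative of the loss with respect to $H(x)$ is 0:

$$\frac{\partial \ell_{\exp}(H \mid D)}{\partial H(x)} = -e^{-H(x)}P(f(x)=1 \mid x) + e^{H(x)}P(f(x)=-1 \mid x) = 0 \qquad (3)$$

Solving for $H(x)$ gives:

$$H(x) = \frac{1}{2}\ln\frac{P(f(x)=1 \mid x)}{P(f(x)=-1 \mid x)} \qquad (4)$$

so that

$$\operatorname{sign}(H(x)) = \arg\max_{y \in \{-1, 1\}} P(f(x)=y \mid x) \qquad (5)$$

where $P$ is the conditional probability of the label given $x$. $\operatorname{sign}(H(x))$ therefore attains the Bayes optimal error rate.

At this point the loss function is minimal, the classification error rate is also minimal, and the integrated model effect is optimal.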
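The AdaBoost procedure can be sketched with decision stumps as base learners. This is a minimal sketch under assumptions: labels are taken to be -1 and +1, the learner weight is the usual $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ (which the text does not state explicitly), and training stops when a stump is no better than chance instead of discarding it and continuing, since a deterministic stump would be rebuilt identically from the same weights. All names and the toy data are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Weighted-error-minimizing stump: (error, feature, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[j] <= t else -polarity for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, x):
    _, j, t, polarity = stump
    return polarity if x[j] <= t else -polarity

def adaboost_fit(X, y, T=10):
    """Return a list of (alpha, stump) pairs forming the strong learner."""
    m = len(X)
    w = [1.0 / m] * m                      # step 1: uniform sample weights
    ensemble = []
    for _ in range(T):                     # step 4: at most T base learners
        stump = train_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:                     # step 3: no better than chance, stop
            break
        err = max(err, 1e-10)              # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Step 2: misclassified samples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    """Step 5: sign of the weighted vote H(x) = sum_t alpha_t * h_t(x)."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data that no single stump can classify correctly.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, T=3)
predictions = [adaboost_predict(ensemble, x) for x in X]
```

On this toy set each individual stump misclassifies at least two samples, yet the weighted combination of three stumps classifies all six correctly, which illustrates how reweighting lets weak learners complement each other.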

As shown in FIG. 2, the random forest algorithm process has the following steps:

1) Self-sampling is carried out from the sample set D by the bootstrap method to obtain a new sample set D', and a decision tree is constructed from the new sample set.

2) A random feature subset is established: when a decision tree node is split, K features are randomly drawn from all features to form a random feature subset, and the optimal feature within the subset is used to build the tree.

3) Majority voting is performed on the prediction results of the constructed trees to obtain the final prediction result.

The random forest algorithm is simple, works well on multi-class tasks, and is a parallel ensemble algorithm based on decision trees, so it runs efficiently. Its variance is small, and its generalization performance and scalability are good. The random forest can rank the importance of each feature and is insensitive to partially missing feature values. However, when the noise is relatively large it is prone to overfitting.
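The three random forest steps can be sketched as below. For brevity the sketch grows depth-1 trees (stumps), so the random feature subset is drawn once per tree, which coincides with drawing it at the tree's only split node; a full implementation would redraw it at every node. The function names, parameter names and toy data are assumptions for illustration only.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def fit_stump(X, y, feature_ids):
    """Best single split among `feature_ids`: (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback when no split improves purity: always predict the majority label.
    best = (feature_ids[0], float("inf"), majority, majority)
    best_score = gini(y)
    for j in feature_ids:
        values = sorted({x[j] for x in X})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [lab for x, lab in zip(X, y) if x[j] <= t]
            right = [lab for x, lab in zip(X, y) if x[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_score = score
                best = (j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def fit_forest(X, y, n_trees=25, max_features=2, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1: bootstrap sample D'
        feats = rng.sample(range(d), max_features)   # step 2: random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, x):
    """Step 3: majority vote over the trees."""
    votes = Counter(left if x[j] <= t else right for j, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical readings: label 0 is normal operation, label 1 is abnormal.
X = [[0.1, 0.2, 0.0], [0.2, 0.1, 0.3], [0.0, 0.3, 0.1], [0.3, 0.0, 0.2],
     [0.2, 0.2, 0.2], [0.9, 1.0, 0.8], [1.0, 0.8, 1.1], [0.8, 1.1, 0.9],
     [1.1, 0.9, 1.0], [1.0, 1.0, 1.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
forest = fit_forest(X, y)
fitted = [predict_forest(forest, x) for x in X]
```

Each tree sees a different resampled data set and a different pair of features, so individual trees may disagree, and the vote in `predict_forest` is what stabilizes the final prediction.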

4. Each piece of equipment data detected in real time undergoes the same data preprocessing, feature dimension reduction and feature selection, and the model is continuously tuned by cross validation. The main parameters are max_features, max_depth and min_weight_fraction_leaf. The model effect is optimized by continuously adjusting these parameters, and the model then predicts the Y value: a Y value of 0 indicates normal operation of the equipment, while a Y value of 1 indicates abnormal behavior that may lead to a fault, in which case an alarm prompt is sent to the system.

1. PCA dimensionality reduction

The edge side equipment data has redundancy, complementarity and relevance, and some characteristics contain a large number of missing values, so the data is not suitable for direct analysis. PCA is therefore used to reduce the dimension of the data. PCA is a statistical dimensionality reduction technique that removes a large number of redundant features from the data. The behavior characteristic data of the edge side equipment is compressed from the original n dimensions to m dimensions, thereby improving the generalization capability of the model. The method mainly comprises calculating the covariance matrix among the features, solving its eigenvalues and eigenvectors, and sorting the eigenvalues from large to small. The number of retained dimensions is selected as needed.

Firstly, mean normalization is carried out on the continuous data and the covariance matrix of the features is obtained; the eigenvalues and eigenvectors are then obtained according to the SVD, the eigenvalues are sorted from large to small, and the m eigenvectors with the largest eigenvalues are selected as the sample mapping matrix.

2. Integrated tree rule feature selection

Feature selection is performed on the compressed data by using an integrated rule tree model. The rule tree model searches for suitable split points and divides the target data into more numerous, smaller groups with stronger homogeneity. Selecting a split point involves both choosing among all the features of the data and choosing the split position within a single feature; purity is measured by the Gini coefficient or by cross entropy. The rule tree model generates many different tree models by repeatedly using different features and randomly sampled, previously unused samples, which ensures the generalization of the result. A feature importance index is then generated by taking the average change in the Gini coefficient produced by each feature as a split point across the different rule tree models. Once enough tree models have been generated, the rule-tree-based method is strongly robust to noise in the data.

3. Stacking integrated framework construction

The idea of the Stacking algorithm is to train the base learners on the initial training set and to use the base learners to generate a new sample set for training the secondary learner that combines them. The base learners of the Stacking ensemble are chosen from different types of algorithms so that the correlation between them is low. The training samples are generally handled by cross validation or by the leave-one-out method. The data set is first divided into a training set and a test set; typically 1/5 of the samples are selected for the test set and the remaining 4/5 are used for the training set. The base models are trained and used for prediction on the training set by cross validation, the prediction result of each base model is used as one feature of the training data of the secondary learner, and the label of each new sample is still the label of the original sample. After all base models are trained, a new data set is generated for training the secondary learner. Each base model also predicts on the same test set, and the average of the prediction results is used as a feature of the new test samples, on which the secondary learner predicts. A multi-response linear regression algorithm (MLR) is typically chosen as the secondary learner.

The Stacking algorithm steps are as follows:

1) The data set is divided into a training set and a testing set, wherein 1/5 of the samples are generally selected as the testing set and the remaining 4/5 as the training set. The base models are trained on the training set by K-fold cross validation, generally 5-fold: in each round four folds are used for training and the remaining fold for validation.

2) The base models are trained on the training set, and the predicted value is used as one feature of a new sample; 5 features can be obtained through 5-fold cross validation. The original label of the sample is still used as the sample label. Passing the training set through the base models in this way generates a new training set. The test set is likewise passed through the trained base models to generate a new test set for prediction by the secondary learner.

3) A random forest algorithm, an Adaboost algorithm and a KNN algorithm are selected as the base learners, and a linear regression algorithm is used as the secondary learner.

4) The integrated learning framework predicts on the data.

4. The industrial control equipment behavior modeling and identification method, as shown in FIG. 4, comprises the following specific steps:

4a) Take an industrial control equipment data set covering a period of normal system operation from a historical database, and label the data according to knowledge of the characteristic behavior of the equipment;

4b) Compress and reduce the dimension of the data by using the PCA method, filtering out redundant information and retaining the data that carries the main information;

4c) Predict on the feature-transformed data by using a random forest method, wherein each rule tree represents a working condition and the proportion of votes for each class in the integrated model forms the prediction probability vector. In each iteration, the algorithm finds an optimal classifier based on the current sample weights: samples misclassified in the k-th iteration are assigned higher weights in the (k+1)-th iteration, the weights of correctly classified samples are reduced in the next iteration, and the weight of each classifier is adjusted iteration by iteration until the optimal model is obtained.

4d) Perform dimension reduction on each piece of data detected in real time and predict with the Stacking integrated equipment behavior recognition model to obtain a predicted value. The equipment is defined to have five behavior states, 0, 1, 2, 3 and 4, and the behavior state of the industrial control equipment at that moment is judged according to the predicted value.

The edge side equipment data set is high-dimensional, with information redundancy among many dimensions and missing characteristic values, and cannot be used directly for classification analysis. Constant dimensions are therefore deleted from the data and the values are normalized, the data dimension is then reduced by the PCA (principal component analysis) method, and the parameters of the base learner models are adjusted to optimize the ensemble learning framework.

According to the edge side behavior identification method of the ensemble learning framework algorithm, the state behavior of the equipment at the moment can be effectively judged according to the prediction result, and network security personnel can be helped to better implement a security protection means.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to method flow diagrams according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. A method for identifying edge service behaviors based on Stacking is characterized by comprising the following steps:

acquiring behavior data of edge side equipment;

performing feature selection on the edge side equipment behavior data by using an integrated rule tree model;

2. The method for identifying edge service behavior based on Stacking according to claim 1, wherein the edge side device behavior data includes: time, device state, characteristic dimensions of action state.

3. The method for identifying edge service behaviors based on Stacking according to claim 1, further comprising, after acquiring edge-side device behavior data:

normalizing the equipment behavior data on the edge side;

4. The method for recognizing the edge service behavior based on Stacking according to claim 1, wherein the building of the integrated learning edge-side device behavior recognition model by using the Stacking framework specifically comprises:

5. The method for identifying edge service behaviors based on Stacking according to claim 4, wherein the random forest algorithm specifically comprises:

6. The method for identifying edge service behaviors based on Stacking according to claim 4, wherein the Adaboost algorithm is specifically as follows:

and combining the trained base learners into a strong learner.

7. The system for identifying the edge service behavior based on the Stacking according to claim 1, comprising:

8. The system according to claim 7, wherein the edge device behavior data in the data obtaining module includes: time, device state, characteristic dimensions of action state.

9. The system according to claim 7, wherein the feature selection module is configured to:

normalizing the equipment behavior data on the edge side;

10. The system according to claim 7, wherein the model training module is configured to: