CN112491820B

CN112491820B - Abnormity detection method, device and equipment

Info

Publication number: CN112491820B
Application number: CN202011260938.9A
Authority: CN
Inventors: 周义
Original assignee: New H3C Technologies Co Ltd
Current assignee: New H3C Information Technologies Co Ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2022-07-29
Anticipated expiration: 2040-11-12
Also published as: CN112491820A

Abstract

The application provides an anomaly detection method, an anomaly detection device and anomaly detection equipment, wherein the method comprises the following steps: acquiring a target training data set, wherein the target training data set comprises a plurality of training data; inputting the target training data set to a first machine learning model, so that the first machine learning model determines a score value corresponding to each data feature based on a plurality of training data; sorting all the data features based on the score value corresponding to each data feature, selecting M data features with high score values based on sorting results, and determining a first target feature vector based on the M data features; inputting the first target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector to obtain a trained second machine learning model; and detecting whether the network data is abnormal data or not based on the trained second machine learning model. According to the technical scheme, the accuracy of detecting whether abnormal data exist is improved, and a good abnormal detection effect is obtained.

Description

Abnormity detection method, device and equipment

Technical Field

The present application relates to the field of communications technologies, and in particular, to a method, an apparatus, and a device for detecting an anomaly.

Background

With the continuous development of network technology and the growing number of services carried, networks have become an indispensable part of people's life and work. Meanwhile, the number of attacks in the network keeps increasing, and network attack activities of all kinds, such as port scanning attacks, worm virus attacks, DDoS (Distributed Denial of Service) attacks, and ransomware attacks, are becoming more rampant. These attacks degrade network performance, interfere with normal network behavior, and can even cause network interruption or paralysis. Therefore, abnormal traffic (i.e., traffic generated by an attack) in the network needs to be detected in time and controlled.

To detect abnormal traffic in the network, a traffic threshold may be configured empirically. At any time, if the traffic value in the network is greater than the traffic threshold, abnormal traffic is considered to exist; if the traffic value is not greater than the traffic threshold, abnormal traffic is considered not to exist. However, a large traffic value does not necessarily mean that abnormal traffic is present, since the network may simply be busy. Determining whether abnormal traffic exists only by comparing the traffic value with the traffic threshold therefore introduces a relatively large error, and the detection result is inaccurate.
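The limitation of the threshold rule can be illustrated with a minimal sketch. The threshold, the sample values, and the function name below are hypothetical and chosen only for illustration.

```python
def exceeds_threshold(traffic_value: float, traffic_threshold: float) -> bool:
    """Threshold rule: report abnormal traffic when the value exceeds the threshold."""
    return traffic_value > traffic_threshold


TRAFFIC_THRESHOLD = 900.0  # assumed, empirically configured value

# Hypothetical samples: (traffic value, whether an attack was actually in progress).
samples = [(120.0, False), (950.0, True), (980.0, False)]

# Samples flagged by the rule although the network was merely busy.
false_alarms = [
    value
    for value, is_attack in samples
    if exceeds_threshold(value, TRAFFIC_THRESHOLD) and not is_attack
]
```

Here the sample with value 980.0 is flagged even though no attack took place, which is the kind of error described above.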

Disclosure of Invention

The application provides an anomaly detection method, which comprises the following steps:

acquiring a target training data set, wherein the target training data set comprises a plurality of training data, and each training data comprises a plurality of data characteristics;

inputting the target training data set to a first machine learning model to cause the first machine learning model to determine a score value corresponding to each data feature based on the plurality of training data;

sorting all data features based on a score value corresponding to each data feature, selecting M data features with high score values based on sorting results, and determining a first target feature vector based on the M data features, wherein M is a positive integer;

inputting the first target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector to obtain a trained second machine learning model;

detecting whether network data is abnormal data based on the trained second machine learning model.

The application provides an anomaly detection device, the device includes:

an acquisition module configured to acquire a target training data set, where the target training data set includes a plurality of training data, and each training data includes a plurality of data features;

a processing module configured to input the target training data set to a first machine learning model to cause the first machine learning model to determine a score value corresponding to each data feature based on the plurality of training data;

a determining module configured to sort all the data features based on the score value corresponding to each data feature, select M data features with high score values based on a sorting result, and determine a first target feature vector based on the M data features, where M is a positive integer;

the processing module is further configured to input the first target feature vector to a second machine learning model, and train the second machine learning model through the first target feature vector to obtain a trained second machine learning model;

a detection module configured to detect whether network data is abnormal data based on the trained second machine learning model.

The application provides an abnormality detection apparatus, including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor;

the processor is configured to execute machine executable instructions to perform the steps of:

Based on the above technical solution, in the embodiments of the application, the second machine learning model can be used to detect whether network data is abnormal data; that is, network anomalies are detected by a machine learning method, which improves the accuracy of detecting whether abnormal data exists and reduces erroneous detection results such as false positives and false negatives. The M data features with high score values are selected from the plurality of data features by using the first machine learning model (such as a random forest regression network model); that is, the first machine learning model is used to screen the best data features, which greatly reduces the amount of feature computation and avoids the machine performance cost of processing a large number of data features. A better anomaly detection effect is achieved by using the second machine learning model (such as an XGBoost network model).

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in the present application, and those skilled in the art can obtain other drawings from them.

FIG. 1 is a flow chart of an anomaly detection method in one embodiment of the present application;

FIG. 2 is a flow chart of an anomaly detection method in one embodiment of the present application;

FIG. 3 is a flow chart of an anomaly detection method in one embodiment of the present application;

FIGS. 4A-4E are schematic diagrams of the test effect in one embodiment of the present application;

FIG. 5 is a structural diagram of an anomaly detection device according to an embodiment of the present application;

FIG. 6 is a structural diagram of an anomaly detection apparatus according to an embodiment of the present application.

Detailed Description

The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Moreover, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".

The embodiment of the application provides an anomaly detection method, which can realize network anomaly detection based on a machine learning method, namely, detect whether network data (such as network traffic) is anomalous data (such as abnormal traffic) based on the machine learning method. The method may be applied to an abnormality detection apparatus, which may be any type of apparatus, such as a router, a switch, a server, a notebook Computer, a mainframe, a PC (Personal Computer), and the like, and the type of the abnormality detection apparatus is not limited.

Referring to FIG. 1, which is a flow chart of an anomaly detection method, the method may include:

Step 101, obtaining a target training data set, which may include a plurality of training data, each of which may include a plurality of data features. In addition, each training data may further include first label information of the training data, where the first label information is used to indicate that the training data is normal data or abnormal data. For example, if the first label information is the first value, it indicates that the training data is normal data, and if the first label information is the second value, it indicates that the training data is abnormal data.

For example, the first value and the second value may be configured empirically: the first value may be 0 and the second value 1, or the first value may be 1 and the second value 0. These are only examples.

In one possible embodiment, the target training data set may be obtained as follows:

Mode one: obtain an original training data set, which may include training data of N attack types, where N may be a positive integer, and determine the original training data set as the target training data set.

For example, the original training data set is a data set used for training a network model (for the network model, see the following embodiments); for ease of distinction, it is referred to as the original training data set. The manner of obtaining the original training data set is not limited. For example, the original training data set may be an intrusion detection evaluation data set, i.e., a CICIDS data set, such as CICIDS2018 or CICIDS2017.

The original training data set may include training data of N attack types, for example, training data of a port scanning attack type, training data of a worm virus attack type, training data of a DDoS attack type, training data of a ransomware attack type, and the like; the attack type is not limited.

For the training data of each attack type, the training data may include a plurality of data features and first label information. For example, the training data may be positive sample training data or negative sample training data. If the training data is positive sample training data, the first label information indicates that the training data is normal data (i.e., is not attack data of the attack type); if the training data is negative sample training data, the first label information indicates that the training data is abnormal data (i.e., is attack data of the attack type).

The data characteristics are characteristics capable of reflecting training data, and may be any type of characteristics, such as an uplink traffic characteristic, a downlink traffic characteristic, a source IP address characteristic, a destination IP address characteristic, a protocol type characteristic, a timestamp characteristic, an operating system type characteristic, and the like, and the type of the data characteristics is not limited.

Each training data may include a plurality of data features, and the data features in different training data may be completely the same, may also be partially the same, may also be completely different, and is not limited thereto.

Illustratively, after determining the original training data set as a target training data set, the target training data set is the same as the original training data set, the target training data set includes training data of N attack types, and the training data includes a plurality of data features and first label information for the training data of each attack type.

In the following embodiments, the target training data set is denoted as a target training data set a, and the target training data set a includes a plurality of training data of attack type 1 and a plurality of training data of attack type 2. The plurality of training data of attack type 1 includes positive sample training data and negative sample training data, and the number of the positive sample training data may be greater than the number of the negative sample training data, for example, the positive sample training data is 70% in proportion, and the negative sample training data is 30% in proportion. The number of positive sample training data may also be less than or equal to the number of negative sample training data. Similarly, the plurality of training data of attack type 2 also includes positive sample training data and negative sample training data.

Mode two: obtain an original training data set, which may include training data of N attack types, where N may be a positive integer, and obtain N target training data sets corresponding to the N attack types. The attack types correspond to the target training data sets one to one, and each target training data set includes only the training data of its corresponding attack type.

For a description of the original training data set, refer to mode one; it is not repeated here.

After the original training data set is obtained, assuming that the original training data set includes training data of attack type 1 and training data of attack type 2, a target training data set b1 corresponding to attack type 1 may be obtained, and a target training data set b2 corresponding to attack type 2 may be obtained. The target training data set b1 includes a plurality of training data (such as positive sample training data and negative sample training data) of attack type 1, but does not include training data of attack type 2. The target training data set b2 includes a plurality of training data (such as positive sample training data and negative sample training data) of attack type 2, but does not include training data of attack type 1.

For each of the training data for attack type 1 in the target training data set b1, the training data may include a plurality of data features and first tag information. For each of the attack type 2 training data in the target training data set b2, the training data may include a plurality of data features and first label information.

Mode three: obtain an original training data set, which may include training data of N attack types, determine the original training data set as a target training data set, and also obtain N target training data sets corresponding to the N attack types, where the attack types correspond to the target training data sets one to one and each target training data set includes only the training data of its corresponding attack type. That is, the original training data set and the N target training data sets are each determined as a final target training data set.

For example, after the original training data set is obtained, assuming that the original training data set includes training data of attack type 1 and training data of attack type 2, a target training data set a (the same as the original training data set), a target training data set b1 corresponding to attack type 1, and a target training data set b2 corresponding to attack type 2 are obtained. In the target training data set a, a plurality of training data of attack type 1 and a plurality of training data of attack type 2 are included, in the target training data set b1, a plurality of training data of attack type 1 are included, and in the target training data set b2, a plurality of training data of attack type 2 are included.

In summary, the target training data set can be obtained in any of three modes: the first mode yields the target training data set a; the second mode yields the target training data set b1 and the target training data set b2; and the third mode yields the target training data set a, the target training data set b1, and the target training data set b2.
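The three modes can be sketched as follows. This is only an illustrative sketch: the record layout (keys `attack_type`, `features`, `label`), the function name `build_target_sets`, and the label convention (0 for normal, 1 for abnormal) are assumptions, not part of the described method.

```python
from collections import defaultdict


def build_target_sets(original, mode):
    """Build target training data sets from an original training data set.

    `original` is a list of records; each record is a dict with the
    hypothetical keys 'attack_type', 'features' and 'label'.
    """
    by_type = defaultdict(list)
    for record in original:
        by_type[record["attack_type"]].append(record)

    # Set "a" is the original data set itself; sets "b1", "b2", ... each
    # hold only the training data of one attack type.
    whole = {"a": list(original)}
    per_type = {f"b{i}": by_type[t] for i, t in enumerate(sorted(by_type), start=1)}

    if mode == 1:
        return whole
    if mode == 2:
        return per_type
    if mode == 3:
        return {**whole, **per_type}
    raise ValueError("mode must be 1, 2 or 3")


original = [
    {"attack_type": 1, "features": {"T1": 0.2, "T2": 5}, "label": 0},
    {"attack_type": 1, "features": {"T1": 0.9, "T2": 7}, "label": 1},
    {"attack_type": 2, "features": {"T1": 0.4, "T2": 1}, "label": 1},
]
sets_mode3 = build_target_sets(original, 3)
```

With two attack types in the original data set, mode three returns the sets a, b1 and b2, matching the example above.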

Step 102, inputting the target training data set into the first machine learning model, so that the first machine learning model determines a score value corresponding to each data feature based on the plurality of training data.

For example, the first machine learning model is a machine learning model capable of scoring the data features, and the type of the first machine learning model is not limited as long as the data features can be scored.

For example, after obtaining a target training data set, the target training data set may be input to the first machine learning model, the target training data set including a plurality of training data, each training data including a plurality of data features and first label information. After obtaining the target training data set, the first machine learning model may determine a score value corresponding to each data feature based on a plurality of training data in the target training data set.

For example, assuming that the target training data set includes 1000 pieces of training data and each piece of training data includes data features T1 to T100, 1000 pieces of 101-dimensional training data (100 dimensions of data features plus 1 dimension of first label information) are input to the first machine learning model. The first machine learning model processes the 1000 pieces of 101-dimensional training data to obtain a score value corresponding to each of the data features T1 to T100. The processing procedure of the first machine learning model is not limited, as long as the score value corresponding to each data feature can be obtained.

In one possible implementation, the first machine learning model may include, but is not limited to, a random forest regression network model, i.e., a network model obtained by a random forest regression algorithm. Of course, the random forest regression network model is only an example, and the type of the first machine learning model is not limited.

For example, the present embodiment determines the score value corresponding to each data feature by using a random forest regression network model. The input data of the random forest regression network model is the plurality of training data in the target training data set, where each training data includes a plurality of data features and first label information. The output data of the random forest regression network model is a score value corresponding to each data feature. For example, after 1000 pieces of 101-dimensional training data are input to the random forest regression network model, the model may process the 1000 pieces of 101-dimensional training data to obtain a score value corresponding to each of the data features T1 to T100. The processing procedure of the random forest regression network model is not limited.
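Since the processing procedure of the scoring model is not limited, the following sketch does not reproduce a random forest. It uses a much simpler stand-in: for each data feature, the largest reduction in label variance that a single threshold split on that feature can achieve. Variance reduction is one criterion that regression trees commonly use, but the function names, the data, and the choice of this criterion are all assumptions made for illustration.

```python
from statistics import pvariance


def variance_reduction_score(values, labels):
    """Best reduction in label variance from one threshold split on a feature."""
    n = len(labels)
    total = pvariance(labels)
    best = 0.0
    # Try every distinct value except the largest as a split threshold,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        best = max(best, total - weighted)
    return best


def score_features(rows, labels, feature_names):
    """Return a score value for each data feature (stand-in for the first model)."""
    return {
        name: variance_reduction_score([row[i] for row in rows], labels)
        for i, name in enumerate(feature_names)
    }


# Hypothetical training data: two data features per row, label 0 = normal, 1 = abnormal.
rows = [[1, 5], [2, 7], [8, 6], [9, 5]]
labels = [0, 0, 1, 1]
scores = score_features(rows, labels, ["T1", "T2"])
```

In this toy data, T1 separates the two labels cleanly and therefore receives a higher score than T2.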

Step 103, sorting all the data features based on the score value corresponding to each data feature, selecting M data features with high score values based on the sorting result (i.e., selecting M data features with high score values from all the data features), and determining a first target feature vector based on the M data features, where M is a positive integer.

For example, the first target feature vector may include M data features and second tag information of the first target feature vector, where the second tag information is used to indicate that the first target feature vector is a feature vector of normal data or a feature vector of abnormal data. For example, if the second tag information is the first value, it indicates that the first target feature vector is a feature vector of normal data, and if the second tag information is the second value, it indicates that the first target feature vector is a feature vector of abnormal data. The first value and the second value are configured empirically, such that the first value is 0 and the second value is 1, or the first value is 1 and the second value is 0.

For example, since the first machine learning model can output a score value corresponding to each data feature, the first target feature vector may be determined based on M data features having high score values. For example, all data features are sorted in the order of the score values from high to low, based on the sorting result, M data features ranked in the top (i.e., M data features with high score values) are selected, and based on the selected M data features, the first target feature vector is determined. Or sorting all the data features according to the sequence of the score values from low to high, selecting M data features (namely M data features with high score values) sorted later based on the sorting result, and determining a first target feature vector based on the selected M data features.

For example, taking the value of M as 4 as an example, the first machine learning model can output a score value corresponding to each of the data features T1-T100, and the first target feature vector may include data features T50, T88, T91, and T92 with high score values.
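The sorting and selection in step 103 can be sketched as below. The score values are made up for illustration, and the helper names are hypothetical; only the feature names T50, T88, T91 and T92 and the value M = 4 come from the example above.

```python
def select_top_features(scores, m):
    """Sort data features by score value, high to low, and keep the first M."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:m]]


def build_feature_vector(record, selected, label):
    """Keep only the selected data features of one record and attach its label."""
    return {"features": [record[name] for name in selected], "label": label}


# Hypothetical score values output by the first machine learning model.
scores = {"T1": 0.02, "T50": 0.31, "T88": 0.27, "T91": 0.18, "T92": 0.12, "T100": 0.01}
selected = select_top_features(scores, 4)

# Hypothetical record holding a value for every scored data feature.
record = {"T1": 3, "T50": 40, "T88": 7, "T91": 0, "T92": 12, "T100": 5}
vector = build_feature_vector(record, selected, label=1)
```

Sorting from low to high and taking the last M entries selects the same set of data features.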

For example, after obtaining the first target feature vector, second tag information of the first target feature vector may also be determined based on the M data features included in the first target feature vector. For example, if the data features included in the first target feature vector are data features of normal data, the second tag information indicates that the first target feature vector is a feature vector of normal data; for example, the second tag information is the first value. If the data features included in the first target feature vector are data features of abnormal data, the second tag information indicates that the first target feature vector is a feature vector of abnormal data; for example, the second tag information is the second value.

In a possible implementation manner, for the first manner, the target training data set a is input to the first machine learning model, and based on step 102 and step 103, a target feature vector K1 corresponding to the target training data set a may be obtained, and the target feature vector K1 may include M1 data features. For the second mode, the target training data set b1 is input to the first machine learning model, so that the target feature vector K2 corresponding to the target training data set b1 can be obtained, and the target feature vector K2 may include M2 data features. Inputting the target training data set b2 into the first machine learning model, a target feature vector K3 corresponding to the target training data set b2 may be obtained, and the target feature vector K3 may include M3 data features. For the third mode, the target training data set a is input to the first machine learning model to obtain the target feature vector K1, the target training data set b1 is input to the first machine learning model to obtain the target feature vector K2, and the target training data set b2 is input to the first machine learning model to obtain the target feature vector K3.

Illustratively, the values of M1, M2 and M3 can be empirically configured, and M1, M2 and M3 are not limited. For example, M2 and M3 may be the same or different, and M2 and M3 are the same for example. M1 and M2 may be the same or different, for example M1 and M2 are different, e.g., M1 is greater than M2.

Step 104, inputting the first target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector to obtain a trained second machine learning model. For example, the trained second machine learning model may be used to detect whether the network data is abnormal data.

For example, the second machine learning model is a machine learning model capable of detecting whether the network data is abnormal data, and this is not limited as long as it is capable of detecting whether the network data is abnormal data.

After obtaining the first target feature vector, the first target feature vector may be input to the second machine learning model, and the first target feature vector may include the M data features and the second label information. After the second machine learning model obtains the first target feature vector, the network parameters of the second machine learning model can be trained based on the M data features and the second label information, and the training process is not limited.

For example, for the first mode described above, the target feature vector K1 may be input to the second machine learning model, and the target feature vector K1 includes M1 data features and 1 second tag information. The second machine learning model is trained based on the target feature vector K1 to obtain a trained second machine learning model, and the training process of the second machine learning model is not limited. For the second mode, the target feature vector K2 and the target feature vector K3 may be input to the second machine learning model, and the second machine learning model is trained based on the target feature vector K2 and the target feature vector K3 to obtain a trained second machine learning model. For the third mode, the target feature vector K1, the target feature vector K2, and the target feature vector K3 may be input to the second machine learning model, and the second machine learning model is trained based on the target feature vector K1, the target feature vector K2, and the target feature vector K3 to obtain a trained second machine learning model.

For example, the trained second machine learning model can fit a mapping relationship between a feature vector and a label value, where the label value indicates whether an abnormality exists: if the label value is the first value, it may indicate that there is no abnormality, and if the label value is the second value, it may indicate that there is an abnormality.

On this basis, if the label value corresponding to the feature vector corresponding to the network data is the first value, it is detected that the network data is normal data, that is, is not abnormal data. And if the label value corresponding to the feature vector corresponding to the network data is a second value, detecting that the network data is abnormal data. In summary, based on the trained second machine learning model, it can be detected whether the network data is abnormal data.

In one possible implementation, the second machine learning model may include, but is not limited to, an XGBoost (eXtreme Gradient Boosting) network model, i.e., a network model obtained by the XGBoost algorithm. Of course, the XGBoost network model is only an example, and the type of the second machine learning model is not limited; for example, a GBDT (Gradient Boosting Decision Tree) model may also be used.

For example, the XGBoost network model is used in the embodiment to detect whether the network data is abnormal data, that is, the input data of the XGBoost network model is a first target feature vector, the first target feature vector includes M data features and second tag information, the XGBoost network model is trained by using the first target feature vector, the training process is not limited, the trained XGBoost network model can fit a mapping relationship between the feature vector and a tag value, and the tag value is used to indicate that there is an abnormality or no abnormality, so that the XGBoost network model can detect whether the network data is abnormal data.
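The training process is not limited, and the sketch below is not XGBoost. It is a deliberately simplified gradient boosting stand-in (one-split decision stumps fitted to squared-error residuals) that shows how a model can fit a mapping from feature vectors to a label value. All function names, the data, the number of rounds, the learning rate, and the 0.5 decision cut-off are assumptions for illustration; the label convention assumed is first value 0 (normal) and second value 1 (abnormal).

```python
def fit_stump(rows, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows})[:-1]:
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            error = sum((v - lmean) ** 2 for v in left) + sum((v - rmean) ** 2 for v in right)
            if best is None or error < best[0]:
                best = (error, j, threshold, lmean, rmean)
    if best is None:
        raise ValueError("no feature varies, so no split is possible")
    return best[1:]


def stump_predict(stump, row):
    j, threshold, lmean, rmean = stump
    return lmean if row[j] <= threshold else rmean


def train_boosted_stumps(rows, labels, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(labels) / len(labels)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(labels, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate


def predict_label(model, row, first_value=0, second_value=1):
    """Map a feature vector to a label value (second value means abnormal)."""
    base, stumps, learning_rate = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return second_value if score >= 0.5 else first_value


# Hypothetical feature vectors with M = 2 selected data features each.
rows = [[10, 1], [12, 2], [11, 1], [90, 40], [95, 38], [88, 45]]
labels = [0, 0, 0, 1, 1, 1]
model = train_boosted_stumps(rows, labels)
```

A call such as `predict_label(model, [92, 41])` then returns the label value for the feature vector of newly observed network data.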

In one possible implementation, there may be M data features for each of the N target training data sets. On the basis, a second target feature vector can be determined based on all data features corresponding to the N target training data sets, and the second target feature vector is input to the second machine learning model, so as to train the second machine learning model. That is, the first target feature vector and the second target feature vector are input to the second machine learning model together, and the second machine learning model is trained by the first target feature vector and the second target feature vector.

For example, after performing a deduplication operation (i.e., removing duplicate data features) on all data features (i.e., N × M data features) corresponding to the N target training data sets, assuming that P data features remain after the deduplication operation, a second target feature vector may be determined based on the P data features, that is, the second target feature vector may include the P data features and second label information of the second target feature vector.

For example, for the second and third modes, a target training data set b1 and a target training data set b2 may be obtained, where the target training data set b1 corresponds to M2 data features, and the target training data set b2 corresponds to M3 data features. And performing deduplication operation on the M2 data features and the M3 data features to obtain P data features, determining a target feature vector K4 based on the P data features, inputting the target feature vector K4 to the second machine learning model, and training the second machine learning model based on the target feature vector K4 to obtain a trained second machine learning model. In summary, the target feature vector K1, the target feature vector K2, the target feature vector K3, and the target feature vector K4 may be input to the second machine learning model, and the second machine learning model is trained based on these target feature vectors to obtain a trained second machine learning model.
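The deduplication operation can be sketched as follows. The feature names in the two lists are hypothetical, and the helper name is an assumption; the sketch only shows how overlapping data features selected for different target training data sets collapse into P distinct data features.

```python
from itertools import chain


def deduplicate_features(*feature_lists):
    """Merge several lists of selected data features, dropping duplicates.

    dict.fromkeys keeps the first occurrence of each name, so the order in
    which the data features were first seen is preserved.
    """
    return list(dict.fromkeys(chain.from_iterable(feature_lists)))


# Hypothetical selections for target training data sets b1 and b2.
m2_features = ["T50", "T88", "T91", "T92"]
m3_features = ["T88", "T12", "T50", "T7"]

p_features = deduplicate_features(m2_features, m3_features)
```

In this example M2 = M3 = 4 and two data features overlap, so P = 6 data features remain for the target feature vector K4.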

And 105, detecting whether the network data is abnormal data or not based on the trained second machine learning model.

In one possible embodiment, steps 101-104 are schematic diagrams of a training process in which a trained second machine learning model is obtained. Since the trained second machine learning model is used to detect whether the network data is abnormal data, the trained second machine learning model may be used to detect whether the network data is abnormal data, and the following description is provided with reference to a specific embodiment to describe a detection process of the network data, which is shown in fig. 2 and is a flowchart of an abnormality detection method, where the method may include:

Step 201, obtaining network data, wherein the network data comprises a plurality of network features.

Illustratively, during operation of the device, network data may be acquired, and the network data may be abnormal data (i.e., attack data) or normal data. The network data includes a plurality of network features. A network feature is in essence a data feature; for ease of distinction, the data features of network data are referred to here as network features, such as an uplink traffic feature, a downlink traffic feature, a source IP address feature, a destination IP address feature, a protocol type feature, a timestamp feature, and an operating system type feature. The type of the network features is not limited.

Step 202, selecting at least one target network feature from the plurality of network features.

For example, a target network feature matching the feature type of the M data features may be selected from the plurality of network features, and the feature type of the target network feature is the same as any one of the feature types of the M data features.

Referring to the above embodiment, in the first mode, M1 data features can be obtained, and the feature types of the M1 data features are recorded, such as $X_{11}, X_{12}, \ldots, X_{1M_1}$. Based on this, in step 202, it is determined whether a network feature corresponding to feature type $X_{11}$ exists among the plurality of network features of the network data, and if so, that network feature is taken as a target network feature. For example, if feature type $X_{11}$ represents the uplink traffic feature, the uplink traffic feature is selected from the plurality of network features as a target network feature. Then, it is determined whether a network feature corresponding to feature type $X_{12}$ exists, and so on. In summary, the network features corresponding to $X_{11}, X_{12}, \ldots, X_{1M_1}$ can be selected from the plurality of network features of the network data as target network features.

For the second mode, M2 data features and M3 data features can be obtained, and the feature types of the M2 data features and the M3 data features are recorded, such as $X_{21}, X_{22}, \ldots, X_{2M_2}$ and $X_{31}, X_{32}, \ldots, X_{3M_3}$. Based on this, in step 202, the network features corresponding to $X_{21}, X_{22}, \ldots, X_{2M_2}$ and $X_{31}, X_{32}, \ldots, X_{3M_3}$ can be selected from the plurality of network features of the network data as target network features.

For the third mode, M1 data features, M2 data features and M3 data features can be obtained, and the feature types of the M1 data features, the M2 data features and the M3 data features are recorded, such as $X_{11}, X_{12}, \ldots, X_{1M_1}$, $X_{21}, X_{22}, \ldots, X_{2M_2}$ and $X_{31}, X_{32}, \ldots, X_{3M_3}$. Based on this, in step 202, the network features corresponding to $X_{11}, X_{12}, \ldots, X_{1M_1}$, $X_{21}, X_{22}, \ldots, X_{2M_2}$ and $X_{31}, X_{32}, \ldots, X_{3M_3}$ can be selected from the plurality of network features of the network data as target network features. Alternatively, only the network features corresponding to $X_{11}, X_{12}, \ldots, X_{1M_1}$ may be selected from the plurality of network features of the network data as target network features.

The feature types recorded above are the feature types after deduplication, i.e., no feature type is recorded more than once.
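The selection in step 202 can be sketched as a lookup of each recorded feature type among the network features of the network data. This is a sketch under stated assumptions: network data is represented as a dictionary keyed by feature type, and the function name `select_target_features` and all feature names and values are hypothetical.

```python
from typing import Any, Dict, List


def select_target_features(
    network_data: Dict[str, Any], recorded_types: List[str]
) -> Dict[str, Any]:
    """Keep only the network features whose type was recorded during training."""
    selected: Dict[str, Any] = {}
    for feature_type in recorded_types:
        # A recorded type that is absent from the network data is skipped.
        if feature_type in network_data:
            selected[feature_type] = network_data[feature_type]
    return selected


# Hypothetical network data with four network features.
network_data = {
    "uplink_traffic": 1200,
    "downlink_traffic": 300,
    "src_ip": "10.0.0.5",
    "os_type": "linux",
}
# Hypothetical recorded (deduplicated) feature types.
recorded_types = ["uplink_traffic", "protocol_type", "src_ip"]

target_features = select_target_features(network_data, recorded_types)
print(target_features)
```

Here `protocol_type` is recorded but not present in the network data, so only the uplink traffic feature and the source IP address feature become target network features.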

Step 203, determining a detection feature vector based on at least one target network feature, wherein the detection feature vector may include the at least one target network feature, that is, include all target network features.

Step 204, inputting the detection feature vector to the trained second machine learning model, and determining, by the second machine learning model, a label value corresponding to the detection feature vector, where the label value indicates whether the detection feature vector is a feature vector of normal data or a feature vector of abnormal data.

Step 205, detecting whether the network data is abnormal data based on the label value.

For example, the trained second machine learning model can fit a mapping relationship between a feature vector and a label value, where the label value indicates whether an abnormality exists: a label value equal to a first value indicates that an abnormality exists, and a label value equal to a second value indicates that no abnormality exists. Based on this, after the detection feature vector is input to the trained second machine learning model, the second machine learning model may process the detection feature vector to obtain the label value corresponding to the detection feature vector.

Further, if the label value is the first value, the detection feature vector is a feature vector of abnormal data, and therefore the network data is detected to be abnormal data based on the label value. If the label value is the second value, the detection feature vector is a feature vector of normal data, and therefore the network data is detected to be normal data based on the label value, that is, the network data is not abnormal data.
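Steps 203 to 205 can be sketched as follows. This is only an illustration: the concrete label values (1 as the first value indicating an abnormality, 0 as the second value) are assumptions, and `threshold_model` is a hypothetical stand-in for the trained second machine learning model, not the model described in this application.

```python
from typing import Callable, Dict, List, Sequence

ABNORMAL_LABEL = 1  # assumed "first value": an abnormality exists
NORMAL_LABEL = 0    # assumed "second value": no abnormality exists


def build_detection_vector(
    target_features: Dict[str, float], feature_order: Sequence[str]
) -> List[float]:
    """Step 203: arrange all target network features in a fixed order."""
    return [target_features[name] for name in feature_order]


def is_abnormal(vector: List[float], predict: Callable[[List[float]], int]) -> bool:
    """Steps 204-205: obtain the label value and interpret it."""
    label = predict(vector)
    if label not in (ABNORMAL_LABEL, NORMAL_LABEL):
        raise ValueError(f"unexpected label value: {label}")
    return label == ABNORMAL_LABEL


def threshold_model(vector: List[float]) -> int:
    """Hypothetical stand-in for the trained model: flags high uplink traffic."""
    return ABNORMAL_LABEL if vector[0] > 1000.0 else NORMAL_LABEL


order = ["uplink_traffic", "downlink_traffic"]
vector = build_detection_vector(
    {"downlink_traffic": 300.0, "uplink_traffic": 1200.0}, order
)
print(vector, is_abnormal(vector, threshold_model))
```

Passing the model as a callable keeps the interpretation of the label value separate from whichever model produces it.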

Illustratively, the above execution sequence is only an example given for convenience of description. In practical applications, the execution sequence of the steps may be changed, and the execution sequence is not limited. Moreover, in other embodiments, the steps of the respective methods do not have to be performed in the order shown and described herein, and the methods may include more or fewer steps than those described herein. A single step described in this specification may be broken down into multiple steps for description in other embodiments, and multiple steps described in this specification may be combined into a single step in other embodiments.

The above technical solution of the embodiment of the present application is described below with reference to specific application scenarios.

Referring to fig. 3, which is a flowchart of an anomaly detection method, the method may include:

Step 301, obtaining an original training data set (e.g., CICIDS2018), where the original training data set includes training data of N attack types, such as training data of attack type 1 and training data of attack type 2.

Step 302, the original training data set is determined as a target training data set a.

Step 303, inputting the target training data set a to a random forest regression network model, determining the score value corresponding to each data feature based on the random forest regression network model, and selecting M1 data features with high score values from the multiple data features. For example, if the value of M1 is 12, the 12 data features with the highest score values are selected.

Step 304, determining a target feature vector K1 based on the M1 data features, wherein the target feature vector K1 comprises M1 data features, and inputting the target feature vector K1 to the XGBoost network model.

Step 305, obtaining a target training data set b1 corresponding to the attack type 1 based on the original training data set, and obtaining a target training data set b2 corresponding to the attack type 2, wherein the target training data set b1 comprises training data of the attack type 1, and the target training data set b2 comprises a plurality of training data of the attack type 2.

For the original training data set, the training data of each attack type, such as a DDoS attack, a DoS attack, or a botnet attack, can be separated out, and the training data of each attack type corresponds to one target training data set. The target training data set corresponding to each attack type may include positive sample training data and negative sample training data, for example, 70% positive sample training data and 30% negative sample training data.
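The separation of the original training data set into one target training data set per attack type can be sketched as below. Several points are assumptions made for illustration: records are `(features, label)` pairs, the label `"benign"` marks normal traffic, the two sample classes are taken to be attack records and benign records, and the 70%/30% ratio mentioned above is not enforced.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, float], str]  # (data features, label), hypothetical layout
BENIGN = "benign"  # hypothetical label for normal traffic


def split_by_attack_type(records: List[Record]) -> Dict[str, List[Record]]:
    """Build one target training data set per attack type.

    Each set holds the records of that attack type plus all benign records.
    """
    benign = [record for record in records if record[1] == BENIGN]
    attacks: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        if record[1] != BENIGN:
            attacks[record[1]].append(record)
    return {attack: rows + benign for attack, rows in attacks.items()}


original = [
    ({"uplink_traffic": 9000.0}, "ddos"),
    ({"uplink_traffic": 120.0}, "benign"),
    ({"uplink_traffic": 4000.0}, "dos"),
    ({"uplink_traffic": 8700.0}, "ddos"),
]
target_sets = split_by_attack_type(original)
print(sorted(target_sets))
print(len(target_sets["ddos"]), len(target_sets["dos"]))
```

A real pipeline would additionally subsample each class to reach the desired proportion of positive and negative samples.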

Step 306, inputting the target training data set b1 to a random forest regression network model, determining the score value corresponding to each data feature based on the random forest regression network model, and selecting M2 data features with high score values from the multiple data features; for example, if the value of M2 is 4, the 4 data features with the highest score values are selected. Likewise, inputting the target training data set b2 to a random forest regression network model, determining the score value corresponding to each data feature based on the random forest regression network model, and selecting M3 data features with high score values from the multiple data features; for example, if the value of M3 is 4, the 4 data features with the highest score values are selected.

Step 307, determining a target feature vector K2 based on the M2 data features, and inputting the target feature vector K2 to the XGBoost network model; and determining a target feature vector K3 based on the M3 data features, and inputting the target feature vector K3 to the XGBoost network model.

Illustratively, the training data includes many types of data features, and not all of them are effective for every attack type. If all data features were input to the XGBoost network model, the running time of the algorithm would increase greatly and substantial computing power would be required. Therefore, the random forest regression network model can be used to score the data features, the 4 data features with the highest score values are screened out for each attack type, and only these 4 data features are input to the XGBoost network model.
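The screening described above amounts to sorting the data features by score value and keeping the top M. A minimal sketch follows, assuming the score values are already available as a dictionary; the score values, feature names, and the function name `select_top_features` are all hypothetical, and in practice the scores would come from the first machine learning model.

```python
from typing import Dict, List


def select_top_features(scores: Dict[str, float], m: int) -> List[str]:
    """Sort feature types by score value, highest first, and keep the top m."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:m]


# Hypothetical score values for one attack type.
scores = {
    "uplink_traffic": 0.31,
    "timestamp": 0.05,
    "protocol_type": 0.22,
    "src_ip": 0.12,
    "dst_ip": 0.30,
}

top_features = select_top_features(scores, m=3)
print(top_features)
```

If fewer than M data features are scored, the sketch simply returns all of them in ranked order.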

Step 308, performing a deduplication operation on the obtained M2 data features and M3 data features to obtain P data features, determining a target feature vector K4 based on the P data features, and inputting the target feature vector K4 to the XGBoost network model.

Step 309, training the XGBoost network model based on the target feature vector K1, the target feature vector K2, the target feature vector K3, and the target feature vector K4 to obtain a trained XGBoost network model, where the trained XGBoost network model can implement anomaly detection of the network device.

Based on the above technical solution, in the embodiment of the present application, the random forest regression network model is used to select the data features with high score values from the multiple data features. Screening out the optimal data features avoids using all data features, which greatly reduces the amount of feature computation and avoids the machine performance consumption caused by processing a large number of data features. The XGBoost network model is used to obtain a good anomaly detection effect. For example, after the trained XGBoost network model is obtained, the effect of the XGBoost network model is tested, and the test results are as follows:

In detection mode 1, for each attack type, the XGBoost network model is run 10 times and the average accuracy is calculated. In detection mode 2, for the original training data set, the random forest regression network model is used to score all data features, the 4 data features with the highest score values are selected for each attack type, and the selected data features are pooled and deduplicated to obtain a final set of 16 data features; the XGBoost network model is then run 10 times, giving an average accuracy of 98%, as shown in fig. 4A. In detection mode 3, for the original training data set, the random forest regression network model is used to score all data features to obtain the 12 data features with the highest score values; the XGBoost network model is then run 10 times, giving an average accuracy of 97%, as shown in fig. 4B.

The results of the three detection modes show that, whether applied to each attack type or to the original training data set, the XGBoost network model outperforms other network models (for example, a random forest network model used as the second machine learning model), as shown in fig. 4C, fig. 4D and fig. 4E. Fig. 4C shows the comparison between the XGBoost network model and the random forest network model in detection mode 1, fig. 4D shows the comparison in detection mode 2, and fig. 4E shows the comparison in detection mode 3.

Based on the same application concept as the method, an embodiment of the present application provides an abnormality detection device. Fig. 5 is a schematic structural diagram of the abnormality detection device, which may include:

an obtaining module 51, configured to obtain a target training data set, where the target training data set includes a plurality of training data, and each training data includes a plurality of data features; a processing module 52 for inputting the target training data set to a first machine learning model, such that the first machine learning model determines a score value corresponding to each data feature based on the plurality of training data; a determining module 53, configured to sort all data features based on a score value corresponding to each data feature, select M data features with a high score value based on a sorting result, and determine a first target feature vector based on the M data features, where M is a positive integer; the processing module 52 is further configured to input the first target feature vector to a second machine learning model, and train the second machine learning model through the first target feature vector to obtain a trained second machine learning model; a detecting module 54, configured to detect whether the network data is abnormal data based on the trained second machine learning model.

For example, the obtaining module 51 is specifically configured to, when obtaining the target training data set:

acquiring an original training data set, wherein the original training data set comprises training data of N attack types;

acquiring N target training data sets corresponding to the N attack types, wherein N is a positive integer; wherein each target training data set only comprises training data of an attack type corresponding to the target training data set; or, determining the original training data set as a target training data set; or, determining the original training data set and the N target training data sets as final target training data sets.

Illustratively, each of the N target training data sets corresponds to M data features; the determining module 53 is further configured to determine a second target feature vector based on all data features corresponding to the N target training data sets; when inputting the first target feature vector to the second machine learning model and training the second machine learning model through the first target feature vector, the processing module 52 is specifically configured to: input the first target feature vector and the second target feature vector to the second machine learning model, and train the second machine learning model through the first target feature vector and the second target feature vector.

Illustratively, the detection module 54 is specifically configured to:

obtaining network data, the network data comprising a plurality of network characteristics;

selecting at least one target network feature from the plurality of network features;

determining a detection feature vector based on the at least one target network feature;

inputting the detection characteristic vector to a trained second machine learning model, and determining a label value corresponding to the detection characteristic vector by the second machine learning model, wherein the label value is used for indicating that the detection characteristic vector is a characteristic vector of normal data or a characteristic vector of abnormal data;

detecting whether the network data is abnormal data based on the label value.

Illustratively, the detecting module 54 is specifically configured to, when selecting at least one target network feature from the plurality of network features: and selecting a target network feature matched with the feature types of the M data features from the plurality of network features.

Based on the same application concept as the method, the embodiment of the present application provides an abnormality detection apparatus, as shown in fig. 6, the abnormality detection apparatus includes: a processor 61 and a machine-readable storage medium 62, the machine-readable storage medium 62 storing machine-executable instructions executable by the processor 61; the processor 61 is configured to execute machine executable instructions to perform the following steps:

Based on the same application concept as the method, embodiments of the present application further provide a machine-readable storage medium, where several computer instructions are stored, and when the computer instructions are executed by a processor, the anomaly detection method disclosed in the above example of the present application can be implemented.

The machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. An anomaly detection method, characterized in that it comprises:

Sorting all data features based on the score value corresponding to each data feature, selecting M data features with high score values based on sorting results, and determining a first target feature vector based on the M data features, wherein M is a positive integer;

detecting whether network data is abnormal data or not based on the trained second machine learning model;

wherein obtaining a target training data set comprises:

acquiring N target training data sets corresponding to the N attack types, wherein N is a positive integer; wherein each target training data set only comprises training data of an attack type corresponding to the target training data set; or, determining the original training data set as a target training data set; or, determining the original training data set and the N target training data sets as final target training data sets;

Each target training data set in the N target training data sets corresponds to M data features; before the inputting the first target feature vector to the second machine learning model, the method further comprises: determining a second target feature vector based on all data features corresponding to the N target training data sets;

the inputting the first target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector includes: and inputting the first target feature vector and the second target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector and the second target feature vector.

2. The method of claim 1, wherein the detecting whether network data is anomalous data based on the trained second machine learning model comprises:

detecting whether the network data is abnormal data based on the label value.

3. The method of claim 2,

the selecting at least one target network feature from the plurality of network features comprises: and selecting a target network feature matched with the feature types of the M data features from the plurality of network features.

4. The method according to any one of claims 1 to 3,

the first machine learning model comprises a random forest regression network model;

the second machine learning model comprises an XGBoost network model.

5. An abnormality detection apparatus, characterized in that the apparatus comprises:

the detection module is used for detecting whether the network data are abnormal data or not based on the trained second machine learning model;

the acquisition module is specifically configured to, when acquiring a target training data set:

Each target training data set in the N target training data sets corresponds to M data features; the determining module is further configured to determine a second target feature vector based on all data features corresponding to the N target training data sets; the processing module inputs the first target feature vector to a second machine learning model, and when the second machine learning model is trained through the first target feature vector, the processing module is specifically configured to: and inputting the first target feature vector and the second target feature vector to a second machine learning model, and training the second machine learning model through the first target feature vector and the second target feature vector.

6. The apparatus of claim 5, wherein the detection module is specifically configured to:

And detecting whether the network data is abnormal data or not based on the label value.

7. The apparatus of claim 6, wherein the detection module is specifically configured to, when selecting the at least one target network feature from the plurality of network features: and selecting a target network feature matched with the feature types of the M data features from the plurality of network features.

8. An abnormality detection apparatus characterized by comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor;

detecting whether the network data is abnormal data or not based on the trained second machine learning model;

wherein obtaining a target training data set comprises:

each target training data set in the N target training data sets corresponds to M data features; before the inputting the first target feature vector to the second machine learning model, the method further includes: determining a second target feature vector based on all data features corresponding to the N target training data sets;