CN116260642A

CN116260642A - Knowledge distillation space-time neural network-based lightweight Internet of things malicious traffic identification method

Info

Publication number: CN116260642A
Application number: CN202310168774.4A
Authority: CN
Inventors: 徐小龙; 朱士洲; 夏飞; 赵娟; 李定成
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-02-27
Filing date: 2023-02-27
Publication date: 2023-06-13

Abstract

The invention discloses a lightweight Internet of Things malicious traffic identification method based on a knowledge distillation space-time neural network. The method comprises: acquiring the Internet of Things malicious traffic to be identified; acquiring a pre-trained teacher network model; obtaining a pre-trained student network model based on knowledge distillation; and inputting the malicious traffic to be identified into the pre-trained student network model to obtain a category prediction vector and a corresponding prediction label. The teacher network model is trained in advance through the following steps: acquiring a training set comprising Internet of Things malicious traffic samples of different categories, a formatted sample vector set, and the real labels of the samples; inputting the formatted sample vector set into the constructed teacher network model to obtain a multi-classification prediction sequence; and obtaining and storing soft labels according to the prediction vector values in the multi-classification prediction sequence. In tests on several real Internet of Things malicious traffic datasets, the method outperforms machine-learning-based identification methods.

Description

Knowledge distillation space-time neural network-based lightweight Internet of things malicious traffic identification method

Technical Field

The invention relates to a knowledge distillation space-time neural network-based lightweight internet of things malicious flow identification method, and belongs to the technical field of internet of things information and communication.

Background

With the rapid development of Internet of Things information and communication technology, Internet of Things applications now span many fields, including power, industry and transportation. Because the Internet of Things connects devices through the network, a large amount of traffic data is generated when devices exchange data with external resources. The growth of traffic data is accompanied by more numerous and more diverse network attacks; once a device is compromised by malicious traffic, the result can be not only paralysis of the equipment but also economic loss for customers. Malicious traffic identification is an important means of network intrusion detection. It is essentially a classification problem whose aim is to identify malicious attacks hidden in network traffic promptly and accurately. A body of research on Internet of Things malicious traffic identification already exists, and many researchers apply neural networks to detection and identification in order to improve model performance. Traffic classification techniques for traditional networks fall largely into payload-inspection, statistics-based, machine-learning-based and deep-learning-based methods. Payload-based methods cannot identify encrypted traffic. Statistics-based methods classify traffic quickly but do not achieve high identification accuracy. Machine-learning-based and deep-learning-based methods have become the main methods in network traffic classification. However, most of the models proposed so far suffer from complex neural networks, huge numbers of parameters, high complexity and long computation time.
The Internet of Things environment is characterized by weak networks and narrow bandwidth, and edge nodes in wireless sensor networks have little computing power and memory, so ordinary network node devices cannot bear the memory and computation costs of running such models. This greatly hinders the deployment of deep-learning-based methods in real environments. For malicious traffic detection and identification at the edge nodes of an Internet of Things wireless sensor network, it is therefore very important to design a model with few parameters, low complexity and high performance.

In recent years, deep learning has been widely studied for Internet of Things malicious traffic classification and has achieved better classification results than other traffic classification methods, including traditional machine learning. On the one hand, however, the complexity of the Internet of Things makes malicious traffic more complex and diverse, so traffic data contains a large amount of invalid, redundant information such as protocol header information, and the widespread use of traffic encryption makes malicious traffic harder to identify; both factors limit the accuracy of deep-learning-based multi-class identification of Internet of Things traffic. On the other hand, most existing deep-learning-based traffic classification models are built from complex neural networks, so they have high complexity, huge numbers of parameters and large model size, and consume large amounts of computing resources, storage and time. As a result, many resource-limited devices in the Internet of Things edge network cannot deploy or run such models. To reduce model complexity and size, many researchers preprocess the dataset samples, for example with principal component analysis (PCA) or channel cleaning, but the resulting reduction in parameters and model size is limited, and processing still consumes considerable resources.

In summary, the research on classification and identification of malicious traffic of the internet of things still has the following defects and disadvantages in the current work:

1) At present, most malicious traffic is combined with encryption technology, so payload-based methods find it harder to extract the characteristics of malicious traffic, and their classification and identification performance is low.

2) Methods based on machine learning and statistics mostly rely on feature extraction and usually require manually designed features; they still have difficulty identifying abnormal and encrypted traffic, and they need more time to collect traffic data, so their real-time performance is low.

3) Deep-learning-based traffic identification methods have high model complexity, large numbers of parameters and complex neural networks, whereas edge nodes in Internet of Things wireless sensor networks have little memory and weak computing power, so these methods cannot be deployed on resource-limited Internet of Things devices.

Disclosure of Invention

To address the above defects of the prior art, the invention provides a lightweight Internet of Things malicious traffic identification method based on a knowledge distillation space-time neural network. By combining a preprocessing method with the characteristics of the model, the method effectively extracts spatial features within the whole data stream and time-sequence features among data packets, and in tests on several real Internet of Things malicious traffic datasets it performs better than machine-learning-based identification methods.

The technical scheme adopted for solving the technical problems is as follows: the invention provides a knowledge distillation space-time neural network-based lightweight internet of things malicious flow identification method, which comprises the following steps:

acquiring malicious traffic of the Internet of things to be identified;

acquiring a pre-trained teacher network model;

obtaining a pre-trained student network model based on knowledge distillation;

and inputting the malicious traffic of the Internet of things to be identified into a pre-trained student network model to obtain a category prediction vector and a corresponding prediction label.

Further, training the teacher network model in advance is achieved through the following steps:

acquiring a training set, wherein the training set comprises different categories of malicious flow samples of the Internet of things, a formatted sample vector set and real labels of the malicious flow samples of the Internet of things;

inputting the formatted sample vector set into a constructed teacher network model to obtain a multi-classification prediction sequence;

and obtaining and storing the soft labels according to the prediction vector values in the multi-classification prediction sequence.

Further, the pre-training of the target student network model is realized by the following steps:

Inputting the formatted sample vector set into a constructed student network model for training to obtain a multi-classification prediction sequence generated by the student model;

extracting a category prediction vector in the multi-category prediction sequence, wherein the category prediction vector represents a final prediction value corresponding to each category to which the malicious traffic sample of the Internet of things belongs;

and obtaining the corresponding category and the predictive label of the corresponding category according to the maximum value in the final predictive value.

Further, pre-training the target student network model includes:

acquiring a soft label generated by a pre-trained teacher model;

generating a soft label and a hard label in the training process of the student model;

performing attention loss calculation on a soft label generated by the student model through the self-adaptive temperature function and a soft label generated by the teacher model;

performing cross entropy loss calculation on the hard tag and the real tag generated by the student model;

the attention loss and the cross entropy loss together form the loss of the student network model, and the weight is updated;

and if the loss of the loss function is converged to a certain value, stopping training to obtain a final target student network model.

Further, the training set is obtained by the following steps:

carrying out stream splitting on the malicious traffic samples of the Internet of things according to the same five-tuple to obtain a plurality of stream sample files;

Clearing an invalid stream sample file with the size of 0KB to obtain a residual stream sample file;

selecting a plurality of continuous data packets from the residual stream sample file;

filtering and shielding IP addresses of the selected data packets;

and carrying out vectorization and standardization processing on the selected data packet to obtain a formatted sample vector.
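The first two steps above (splitting by five-tuple and clearing empty flows) can be sketched as follows. This is a minimal illustration only: the `Packet` type, its field names, and the rule "a flow with zero payload bytes counts as a 0 KB file" are assumptions introduced here and are not taken from the patent.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet; field names are illustrative only."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    payload: bytes


def split_flows(packets):
    """Group packets that share the same five-tuple, preserving arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)
    return dict(flows)


def drop_empty_flows(flows):
    """Discard flows that carry no bytes (assumed analogue of a 0 KB flow file)."""
    return {
        key: pkts
        for key, pkts in flows.items()
        if sum(len(p.payload) for p in pkts) > 0
    }


packets = [
    Packet("10.0.0.1", 1234, "10.0.0.2", 80, "TCP", b"GET /"),
    Packet("10.0.0.3", 5555, "10.0.0.2", 23, "TCP", b""),
    Packet("10.0.0.1", 1234, "10.0.0.2", 80, "TCP", b"HTTP/1.1"),
]
flows = drop_empty_flows(split_flows(packets))
```

In this toy input the second packet forms a flow of its own with no payload, so only the flow from 10.0.0.1 survives.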

Further, the target student network model is a lightweight neural network model based on a knowledge distillation space-time neural network;

the student network model consists of two one-dimensional depthwise separable convolutional neural network layers, one stacked bidirectional LSTM (BiLSTM) neural network layer and a fully connected neural network, connected in sequence;

the student network model utilizes knowledge distillation to realize knowledge migration from the teacher model to the student model through the self-adaptive temperature function T and the combined loss function kd_loss;

the target student network model extracts spatial features and temporal features.

Further, the selected data packet is filtered, which is realized by the following steps:

reading the data packets in order from the start of the flow according to the (n, m) specification, skipping packets whose data field is empty, until n packets with non-empty data fields have been read; if fewer than n such packets are available, padding with all-zero packets;

retaining m bytes for each of the n data packets, truncating packets longer than m bytes and zero-padding packets shorter than m bytes;

converting the selected data packets into a two-dimensional vector of (n, m);

performing vector normalization on the resulting (n, m) two-dimensional vector;
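The (n, m) formatting described above can be sketched in a few lines. The function name `format_flow` is hypothetical, and dividing each byte by 255 is an assumed form of normalization; the patent does not state which normalization it uses.

```python
def format_flow(payloads, n, m):
    """Turn a flow's packet payloads into an n x m matrix of floats in [0, 1]."""
    # Skip packets whose data field is empty, then keep at most n packets.
    kept = [p for p in payloads if len(p) > 0][:n]
    rows = []
    for payload in kept:
        row = list(payload[:m])            # truncate to m bytes
        row += [0] * (m - len(row))        # zero-pad short packets
        rows.append([b / 255.0 for b in row])  # assumed normalization: byte / 255
    # Pad with all-zero packets if fewer than n non-empty packets were found.
    rows += [[0.0] * m for _ in range(n - len(rows))]
    return rows


sample = format_flow([b"\xff\x00\x80", b"", b"\x01\x02\x03\x04\x05"], n=3, m=4)
```

Here the empty packet is skipped, the 3-byte packet is padded to 4 bytes, the 5-byte packet is cut to 4 bytes, and a third all-zero row is appended to reach n = 3.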

further, obtaining the training set includes:

converting the formatted sample vector into a picture form for storage to obtain a formatted sample vector set;

dividing the formatted sample vector set into a training set and a testing set;

each formatted sample vector is a two-dimensional vector of single precision floating point numbers in the form of n x m, and each data packet is a one-dimensional vector in the form of 1 x m.

Further, pre-training the target student network model includes:

treating the formatted sample vector as an n-step time-ordered sequence and inputting it into the target network model, each step being a 1×m vector;

the input formatted sample vector is accessed into one-dimensional depth separable convolution, and the spatial characteristics of the data packet are captured;

taking the space features generated in the data packet as the input of time sequence feature extraction, and extracting by using a bi-directional stacking BiLSTM method;

Outputting a category prediction vector by the target student network model;

wherein the probability value of the student model for the ith sample at temperature T is expressed as:

$$p_i^{T} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ is the logit of the ith sample output by the student model;

the expression of the adaptive temperature function T is:

$$T(\text{accuracy}) = \alpha - (\beta \cdot \text{accuracy})^{\theta}$$

where α is a hyperparameter that determines the range over which T varies, θ is an exponent parameter that determines how fast T changes, and β is a scale parameter that determines the amplitude of the change. accuracy is the model accuracy, and T gradually decreases as the accuracy increases.
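To illustrate how the adaptive temperature softens or sharpens the student's output, the sketch below pairs T(accuracy) with the standard temperature-scaled softmax used in knowledge distillation. The values α = 5, β = 4 and θ = 1 are assumptions chosen only so that T stays positive over accuracy in [0, 1]; the patent does not give concrete values.

```python
import math


def adaptive_temperature(accuracy, alpha=5.0, beta=4.0, theta=1.0):
    """T(accuracy) = alpha - (beta * accuracy) ** theta (parameter values assumed)."""
    return alpha - (beta * accuracy) ** theta


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
early = softmax_with_temperature(logits, adaptive_temperature(0.2))  # T is about 4.2
late = softmax_with_temperature(logits, adaptive_temperature(0.9))   # T is about 1.4
```

With these assumed parameters, a low-accuracy student gets a high temperature and a flatter soft label, while a high-accuracy student gets a lower temperature and a sharper one.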

Further, pre-training the target student network model includes:

calculating the probability value of each category and the loss of a real label of a malicious flow sample of the Internet of things by using a combined loss function kd_loss, wherein the kd_loss consists of two parts of attention loss and cross entropy loss, and the expression is as follows:

kd_loss＝0.5*attention_loss+0.5*ce_loss

wherein, the attention_loss is attention loss, and the ce_loss is cross entropy loss;

attention_loss measures the loss between the soft label generated by the student model and the soft label generated by the teacher model, and ce_loss measures the loss between the hard label generated by the student model and the real label;

the attention_loss expression is:

[expression not reproduced in the source text]

where the expression is defined in terms of the soft label generated by the teacher model and the soft label generated by the student model, a is a proportion parameter, γ is an exponent parameter, and the logarithm is the natural logarithm (base e);

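the ce_loss expression is:

$$\text{ce\_loss} = -\sum_i t_i \log p_i$$

where $t_i$ is the real label, $p_i$ is the corresponding hard-label probability output by the student model, and the logarithm is the natural logarithm (base e);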

and obtaining the optimal network parameters according to the loss calculated by kd_loss and the weight of the target student network model.
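The equal weighting in kd_loss can be sketched as below. Note the assumptions: the exact attention_loss formula is not available in this text, so `soft_target_loss` (plain cross entropy against the teacher's soft label) is a stand-in and not the patent's attention loss; it is passed in as a parameter so that another soft-label loss can be substituted. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over class logits, optionally scaled by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(student_logits, true_index):
    """Cross entropy between the student's hard prediction (T = 1) and a one-hot label."""
    return -math.log(softmax(student_logits)[true_index])


def soft_target_loss(student_soft, teacher_soft):
    """Stand-in for attention_loss: cross entropy against the teacher's soft label."""
    return -sum(t * math.log(s) for s, t in zip(student_soft, teacher_soft))


def kd_loss(student_logits, teacher_soft, true_index, temperature,
            soft_loss=soft_target_loss):
    """0.5 * soft-label loss + 0.5 * cross-entropy loss."""
    student_soft = softmax(student_logits, temperature)
    return (0.5 * soft_loss(student_soft, teacher_soft)
            + 0.5 * ce_loss(student_logits, true_index))


teacher_soft = softmax([3.0, 1.0, 0.2], temperature=2.0)
good = kd_loss([5.0, 0.0, 0.0], teacher_soft, true_index=0, temperature=2.0)
bad = kd_loss([0.0, 5.0, 0.0], teacher_soft, true_index=0, temperature=2.0)
```

A student whose logits agree with both the teacher and the real label (`good`) receives a smaller combined loss than one that favours the wrong class (`bad`).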

The beneficial effects are that:

1. The training mode provided by the invention, combined with the preprocessing method and the characteristics of the model, effectively extracts the spatial features within the whole data stream and the time-sequence features among data packets, and in tests on several real Internet of Things malicious traffic datasets it performs better than machine-learning-based identification methods.

2. The invention makes the model lightweight by using depthwise separable convolution and a BiLSTM neural network, and preserves its accuracy in classifying and identifying malicious traffic through knowledge distillation. A teacher model is introduced and the proposed model serves as the student model. During distillation, the proposed adaptive temperature function dynamically changes the information carried by the soft labels, and together with the proposed combined loss function it migrates the knowledge of the teacher model to the target student network model, so that the student model is lightweight while its accuracy reaches the level of the teacher model.

3. Tests on several real Internet of Things malicious traffic datasets show that the training strategy, the classification and identification mode, and the multi-class malicious traffic model of the proposed method can be effectively deployed on resource-limited Internet of Things devices, cope with newly emerging malicious traffic types, and maintain high identification accuracy.

Drawings

Fig. 1 is a schematic flow chart of a first embodiment of the present invention.

Fig. 2 is a schematic diagram of a target network model structure according to a second embodiment of the present invention.

FIG. 3 is a flow chart of knowledge migration based on knowledge distillation according to a second embodiment of the invention.

Fig. 4 is a flowchart illustrating a process of filtering and masking an IP address according to a second embodiment of the present invention.

Fig. 5 is a schematic diagram illustrating vectorization and formatting of two sample data according to an embodiment of the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings.

Example 1

Payload-based traffic classification extracts unencrypted data features from data packets, whereas statistics-based traffic classification extracts statistical features such as packet length, arrival time and flow length. Traditional machine-learning traffic classification generally requires manually designed feature extraction, which is very time-consuming.

Deep-learning-based traffic classification methods use different models with different model inputs, so their preprocessing also differs: some select data from all layers, others only application-layer data. Preprocessing usually includes truncation and zero-padding so that every sample fed to the model has a fixed length. The methods differ in how the fixed-length sample is built. Some take the first n bytes of the flow directly, truncating longer flows and padding shorter ones with 0, and then store the result as a square picture. Others select by data packet: they take the first n packets of the flow and cut each packet to a fixed length of m bytes to form one sample. In both cases the final model outputs a classification prediction value or classification probability for the whole sample. Fixed-length samples of this kind can be trained on more easily and efficiently than samples of varying length.

As shown in fig. 1, the invention provides a method for identifying malicious traffic of a lightweight internet of things based on a knowledge distillation space-time neural network, which comprises the following steps:

acquiring malicious traffic of the Internet of things to be identified;

acquiring a pre-trained teacher network model;

Obtaining a pre-trained student network model based on knowledge distillation;

In this embodiment, the teacher network model and the target student network model are trained in advance, the training set is acquired, and the selected data packets are filtered, masked and vectorized through the same steps as set out in the Disclosure of Invention above. The target student network model is the lightweight neural network model based on the knowledge distillation space-time neural network, and it uses the same adaptive temperature function T(accuracy) = α - (β * accuracy)^θ and the same combined loss function kd_loss = 0.5 * attention_loss + 0.5 * ce_loss.

Testing whether the target student network model is qualified by using the test set comprises the following steps:

inputting the test set into a pre-trained target network model;

outputting a prediction vector of the test set, a corresponding category and a prediction label of the corresponding category through the target network model;

calculating the prediction accuracy of the test set;

if the prediction accuracy is higher than the set threshold, the target student network model can be judged to be qualified.
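The qualification test above reduces to an argmax over each prediction vector followed by an accuracy comparison. A minimal sketch, with hypothetical function names and an assumed threshold of 0.7:

```python
def predict_label(prediction_vector):
    """Index of the largest element, i.e. the predicted category."""
    return max(range(len(prediction_vector)), key=prediction_vector.__getitem__)


def is_qualified(prediction_vectors, true_labels, threshold):
    """Return (accuracy, qualified) for a test set of prediction vectors."""
    correct = sum(
        predict_label(vec) == label
        for vec, label in zip(prediction_vectors, true_labels)
    )
    accuracy = correct / len(true_labels)
    return accuracy, accuracy > threshold


vectors = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
labels = [1, 0, 2, 0]
accuracy, qualified = is_qualified(vectors, labels, threshold=0.7)
```

Three of the four toy samples are classified correctly, so the accuracy is 0.75 and the model passes the assumed 0.7 threshold.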

The performance of a deep-learning-based Internet of Things traffic classification method depends heavily on the classification model's ability to extract and recognize the spatial and time-sequence features of the samples in the traffic. If only short sample information is selected, the model may be hard to train because the sample contains too little effective information. To train the model well, samples must be processed in the preprocessing stage into a format suitable for training, and data of a certain length is selected so that sample information is not lost and the model can obtain features from sufficient effective information. In addition, to address complex neural networks, large numbers of parameters, large model size and high complexity, methods such as principal component analysis and channel cleaning are commonly applied to the samples to reduce part of the parameter count and model size, but they still incur a certain cost.

In this embodiment, preprocessing, sample training and classification, and knowledge-distillation-based lightweight processing are designed together to build a model that meets these conditions and to reduce the influence of the above problems as far as possible, so that the model can be deployed on resource-limited devices. First, samples are preprocessed into the format required for training. Features are then extracted through the characteristics of the model itself and the training mode: depthwise separable convolution extracts spatial features and BiLSTM extracts time-sequence features. The feature knowledge of the teacher model is migrated by knowledge distillation, and the loss is calculated with the combined loss function. Finally, the prediction vector of the sample is obtained; each element of the prediction vector is the predicted value for one category, and the category with the largest predicted value is the final predicted label. This greatly reduces the complexity, size and parameter count of the model.

FIG. 1 shows the main steps of the preprocessing, training, testing and lightweighting methods of the lightweight Internet of Things malicious traffic identification method based on the knowledge distillation space-time neural network. Samples are preprocessed in the same way for training and testing: the original traffic file is split into flows by five-tuple, invalid flow files are filtered out, and each data flow file is formatted to the preset dimensions to form a sample vector; the samples are then divided into a training set and a test set. The classification method inputs the serialized samples into the neural network model, extracts spatial and time-sequence features, and outputs the prediction vector of each sample through the fully connected layer. During training, the student model generates a soft label and a hard label. The loss between the soft label generated by the student model and the soft label generated by the teacher model is calculated with the attention loss function, which migrates the feature knowledge of the teacher model to the student model; the loss between the hard label generated by the student model and the real label is calculated with the cross-entropy loss function; and the two losses together form the loss of the student model. The category corresponding to the largest element of the prediction vector is the prediction label. Finally, the model weights are updated with the back-propagation algorithm.

The lightweight model based on the knowledge distillation space-time neural network in this embodiment is a malicious traffic classification model with the following characteristics. For a serialized model input of a given dimension, multidimensional features are obtained through spatial and time-sequence feature extraction. During training, the model generates a soft label and a hard label; the loss between its soft label and that of the teacher model and the loss between its hard label and the real label are calculated, and the two parts together form the loss of the model. A typical model meeting these requirements is shown in Fig. 2. The target student network model consists of a one-dimensional depthwise separable convolutional neural network, a BiLSTM neural network, a fully connected layer and knowledge migration based on knowledge distillation, with the following structure and functions:

The two one-dimensional depthwise separable convolutional layers capture the spatial features of a sample. A convolutional layer usually contains several convolution kernels, and because the kernels have different parameters, each produces a different channel of the new feature map when the one-dimensional convolution is applied to the input sample. For example, for a serialized input x and a convolution kernel t:

$$x = [x_{1:k},\ x_{k+1:2k},\ \ldots,\ x_{n-k+1:n}]$$

the one-dimensional convolution operates as follows:

$$a_i^t = f(w \cdot x_{i:i+k-1} + b)$$

where f is a nonlinear activation function, w is the weight of the convolution kernel applied to the sliding window $x_{i:i+k-1}$ over x, b is the bias, and $a_i^t$ is the feature generated by convolution kernel t on the corresponding sliding window. Because the height of the one-dimensional convolution kernel is not set to 1, the generated feature map contains fewer sequence steps than the input, which reduces the number of parameter operations.
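The sliding-window operation can be illustrated with a single kernel in plain Python. This is a sketch under assumptions: ReLU stands in for the unspecified activation f, the stride equals the kernel width k (matching the non-overlapping segments in the definition of x), and the weights and bias are arbitrary illustrative numbers.

```python
def conv1d(x, weights, bias, stride):
    """Slide one kernel over sequence x and apply ReLU to each window's response."""
    k = len(weights)
    features = []
    for start in range(0, len(x) - k + 1, stride):
        window = x[start:start + k]
        response = sum(w * v for w, v in zip(weights, window)) + bias
        features.append(max(0.0, response))   # ReLU stands in for f (assumption)
    return features


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = conv1d(x, weights=[1.0, 0.5], bias=-2.0, stride=2)
```

Six input steps yield three output features, which shows why a kernel wider than 1 produces a shorter feature map than its input.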

The LSTM neural network layer captures the temporal features among the data packets in a sample. The channels of the input sample are treated as time steps, and the candidate value of the current cell state is calculated as:

$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$$

where $x_t \in \mathbb{R}^m$ is the vector on any channel t of the input sample, with the same dimension m as the input sample; $h_{t-1} \in \mathbb{R}^s$ is the hidden-layer output of the previous time step, whose dimension is determined by the hidden-layer dimension parameter s of the stacked bidirectional LSTM unit; $\tilde{c}_t$ is the intermediate output of the current layer; and $W_{cx}$, $W_{ch}$ and $b_c$ are the weights and the bias, respectively.
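The candidate cell state can be computed directly from its definition, tanh(W_cx x_t + W_ch h_{t-1} + b_c). The sketch below uses plain lists instead of a tensor library; the dimensions (m = 3, s = 2) and all numeric values are illustrative assumptions, and the function name is hypothetical.

```python
import math


def candidate_state(x_t, h_prev, W_cx, W_ch, b_c):
    """Candidate cell state: tanh(W_cx x_t + W_ch h_prev + b_c), one row per hidden unit."""
    out = []
    for row_x, row_h, bias in zip(W_cx, W_ch, b_c):
        pre = (sum(w * v for w, v in zip(row_x, x_t))
               + sum(w * v for w, v in zip(row_h, h_prev))
               + bias)
        out.append(math.tanh(pre))
    return out


# m = 3 input features, s = 2 hidden units; all numbers are illustrative.
x_t = [0.5, -1.0, 0.25]
h_prev = [0.0, 0.0]
W_cx = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # shape s x m
W_ch = [[0.1, 0.2], [0.3, 0.4]]             # shape s x s
b_c = [0.0, 1.0]
c_tilde = candidate_state(x_t, h_prev, W_cx, W_ch, b_c)
```

The output has s elements, one per hidden unit, each bounded in (-1, 1) by tanh.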

The fully connected neural network layer converts the output feature map into a predicted value for each category in the lightweight model based on the knowledge distillation space-time neural network. The feature map generated by the one-dimensional depthwise separable convolution and the BiLSTM neural network serves as the input of the fully connected layer, so that the features extracted by the convolution kernels are carried through to prediction and integrated into the prediction vector of the whole sample.

Knowledge migration based on knowledge distillation works as follows. During training, the target student model generates a hard label and, through the adaptive temperature function, a soft label. A pre-trained teacher model is introduced, and the loss between the soft label of the student model and the soft label of the teacher model is calculated with an attention loss function to migrate knowledge from the teacher model to the student model. The loss between the hard label of the student model and the real label is calculated with a cross-entropy loss function. The attention loss and the cross-entropy loss together form the loss of the student model and are used to update the weights, which improves the accuracy of the lightweight model built from one-dimensional depthwise separable convolution and the BiLSTM neural network.

The invention achieves a lightweight model by exploiting the neural network's ability to extract spatial and time-sequence features together with the fact that one-dimensional depthwise separable convolution greatly reduces model complexity, parameter count and size, so that the model can be deployed on resource-limited Internet of Things devices. In addition, knowledge distillation is introduced to transfer the knowledge of a high-accuracy teacher model to the target network model, i.e. the lightweight model, which further improves its identification accuracy.
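The parameter saving from depthwise separable convolution can be checked with simple arithmetic. The sketch below compares a standard one-dimensional convolution with a depthwise convolution followed by a 1x1 pointwise convolution; the channel counts and kernel size are illustrative assumptions, not values from the patent.

```python
def standard_conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


def separable_conv1d_params(in_channels, out_channels, kernel_size):
    """Depthwise (one kernel per input channel) followed by a 1x1 pointwise conv."""
    depthwise = in_channels * kernel_size + in_channels
    pointwise = in_channels * out_channels + out_channels
    return depthwise + pointwise


standard = standard_conv1d_params(64, 128, 5)    # 41088
separable = separable_conv1d_params(64, 128, 5)  # 8704
```

For this assumed layer shape the separable form needs roughly a fifth of the parameters of the standard convolution.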

Example two

Figs. 2 to 5 are schematic diagrams of a second embodiment of the invention. This embodiment gives a detailed verification description of the lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network, specifically as follows:

Preprocessing in this embodiment comprises five stages: flow segmentation, filtering, vectorization, normalization, and IDX compressed storage. The original traffic file is first segmented into flows, and invalid flow files are deleted. During vectorization, invalid packets are filtered out and invalid data is masked. The samples are then normalized so that they are formatted to the model's input shape, and the sample set is stored. Finally, the stored samples are converted into images and saved in compressed IDX format. The specific steps are as follows:

Step 1: Segment the original traffic file into flows according to its distribution, as shown in Fig. 5. The specific process is as follows:

Step 1-1: Split the traffic into flows by five-tuple (source IP address, source port, destination IP address, destination port, and transport-layer protocol); packets with the same five-tuple belong to the same flow.

Step 1-2: Check each segmented flow sample file to determine whether it is a valid flow file; if not, discard it.

Step 1-3: Divide the flow files by category and label them.

Step 2: Read the packets from a given flow file in order and check each one, as shown in Fig. 5. The specific process is as follows:

Step 2-1: Check whether the packet just read is a valid IP packet; if not, discard it.

Step 2-2: Check whether the data field of the packet is empty; if it is, discard the packet.

Step 2-3: Set the source and destination IP addresses in the packet to 0, so that the model cannot classify by IP address, which would distort its real performance.

Step 3: Repeat Step 2 until n valid packets have been obtained. If the flow sample file contains more than n valid packets, truncate to the first n; if it contains fewer than n, pad with zeros. Here n is the number of valid packets kept per flow.

Step 4: Convert each sample into an n × m two-dimensional vector, so that all flow samples have the same dimensions and can be fed to the model. In this method, n = 10 and m = 500. These values are chosen to retain enough feature information to preserve model performance while reducing training time and keeping the model lightweight. Here n and m are the sample dimensions: n × m denotes a two-dimensional array of n rows and m columns, and m is also the packet length.

The specific implementation process is as follows:

For each packet, every 8 bits are converted into an integer between 0 and 255, and the first m integers are kept: a packet longer than m is truncated, and a packet shorter than m is padded with zeros, as shown in Fig. 5.

The formatted sample is thus an n × m two-dimensional vector in which each element is a decimal integer between 0 and 255.

To prevent gradient explosion during training, the two-dimensional sample vector is normalized. Because every element is an integer from 0 to 255, each element is simply divided by 255, which maps the values into the range 0 to 1.

Step 5: Convert the formatted and normalized sample set into images, convert the images into an IDX file, and store the file in compressed form.
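The flow segmentation, address masking, truncation, padding and normalization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the packet representation (a dict with the hypothetical keys `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `raw` and `payload`), the function names, and the assumption that `raw` begins at an IPv4 header are all introduced here for illustration. The validity check of Step 2-1 and the IDX storage of Step 5 are omitted.

```python
from collections import defaultdict


def split_flows(packets):
    """Group packets sharing a five-tuple into one flow (Step 1-1)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)


def mask_ipv4_addresses(raw):
    """Zero the source and destination addresses (Step 2-3).

    Assumes raw starts at an IPv4 header, where the two 4-byte
    addresses occupy bytes 12 to 19.
    """
    if len(raw) < 20:
        return raw
    return raw[:12] + bytes(8) + raw[20:]


def vectorize_flow(flow, n=10, m=500):
    """Build an n x m matrix of floats in [0, 1] from one flow (Steps 2-4)."""
    rows = []
    for pkt in flow:
        if not pkt["payload"]:            # Step 2-2: empty data field
            continue
        raw = mask_ipv4_addresses(pkt["raw"])
        row = list(raw[:m])               # one integer 0..255 per byte, cut at m
        row += [0] * (m - len(row))       # zero-pad short packets
        rows.append([b / 255 for b in row])
        if len(rows) == n:                # keep only the first n valid packets
            break
    rows += [[0.0] * m for _ in range(n - len(rows))]  # zero-pad short flows
    return rows
```

Calling `vectorize_flow` on each value returned by `split_flows` yields one fixed-size sample per flow, which is the shape the model input requires.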

The multi-classification method comprises a preprocessing stage, a classification stage, and a knowledge transfer stage based on knowledge distillation. The original traffic is segmented into flows, each flow file is formatted to the model's input dimensions, and a cost matrix is created. The classification model predicts the class of the current sample from this input. During training, following the idea of knowledge distillation, the student model generates a soft label and a hard label: the loss between the student's soft label and the teacher model's soft label is computed with the attention loss function, and the loss between the hard label and the real label is computed with the cross-entropy loss function. The attention loss and the cross-entropy loss together form the overall loss of the student model.

The specific operation steps are as follows:

Step 1) Segment the original traffic file into flows by five-tuple and discard invalid flow files. Read the packets in each flow file in order, filtering out invalid packets, until the number of valid packets reaches n.

Step 2) Convert each packet into decimal integers and fix its length at m, truncating longer packets and zero-padding shorter ones, so that each sample is an n × m two-dimensional vector. Then normalize the sample to facilitate model training.

Step 3) Feed the sample vector into the target model as n steps of length m for feature extraction, that is, multi-dimensional extraction of spatial and temporal features. The model outputs a prediction vector for the current input sample, and the class with the largest predicted value is the predicted class.

Step 4) During training, the student model generates a soft label and a hard label. The soft label is produced with the adaptive temperature function and, following the idea of knowledge distillation, is compared with the teacher model's soft label through the attention loss, which transfers the teacher model's feature knowledge to the student model. The loss between the hard label and the real label is computed with the cross-entropy loss function.

Step 5) Finally, the combined loss formed by the attention loss and the cross-entropy loss is used as the overall loss of the student model, and the model weights are updated by backpropagation.

The principle of the lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network is as follows:

Most current Internet of Things malicious traffic identification models are highly complex: they combine several neural networks and have a large number of parameters and a large size. Such models cannot be deployed on edge-network devices in the Internet of Things, which have limited storage and computing resources, so these edge devices remain highly vulnerable to attack and the resulting losses. In addition, insufficient feature extraction during training yields classifiers with poor performance, and the model struggles to extract stable features from samples.

Let a formatted sample $x \in \mathbb{R}^{n \times m}$ be an n × m two-dimensional vector:

$$x = [x_1, x_2, \ldots, x_n]$$

$$x_i = [k_1, k_2, \ldots, k_m], \quad i \in [1, n]$$

where $x_i$ is the i-th item of data in the sample, that is, the i-th packet in the flow. There are n packets in total, each a one-dimensional vector of length m, so the whole sample x is an n × m two-dimensional vector, with $x = x_{1:n}$ and $x_i = k_{1:m}$.

Referring to Figs. 2, 3 and 4, the core idea of the method is as follows. A two-dimensional vector sample of dimension n × m is input into the model for training, and multi-dimensional features are extracted with a one-dimensional depthwise separable convolution and a BiLSTM network to make the extracted features robust. These features cover both the spatial features of the whole sample and the temporal features between steps.

For spatial feature extraction, the sample is passed through the convolution

$$a_i^t = f(w * h_{i:i+k-1} + b)$$

Because the height of the one-dimensional convolution kernel is not 1, the feature map has fewer sequence steps than the input. Because the convolution is depthwise separable, each step also uses a 1 × 1 convolution, which greatly reduces the number of parameter operations.

For temporal feature extraction, the convolved feature map is input into the BiLSTM network and processed in both the forward and backward directions. The candidate value of the current cell state is computed as

$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$$

Each input step produces a new feature vector $s = [y_1, y_2, \ldots, y_m]$. The feature vectors from all steps are combined into a new feature map, and a fully connected layer finally produces the prediction vector p for the whole sample, where

$$p = [p_1, p_2, \ldots, p_t]$$

In the final prediction vector, each element is the model's confidence that the current sample belongs to the corresponding class. Every class has a value, and the class with the largest value is the model's predicted class for the current sample.
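The spatial feature extraction step can be illustrated with a minimal pure-Python sketch of a one-dimensional depthwise separable convolution. It is not the patented implementation: the function name, the list-of-lists layout (steps × channels), the choice of ReLU as the activation f, and the toy weights are assumptions made for illustration. The sketch shows that a kernel of length k greater than 1 shortens the sequence to n − k + 1 steps, and that channels are mixed by a 1 × 1 convolution.

```python
def depthwise_separable_conv1d(x, depthwise, pointwise, bias):
    """Hypothetical helper, not taken from the patent.

    x         : n steps, each a list of c_in channel values
    depthwise : c_in kernels, each of length k (one per channel)
    pointwise : c_out rows of c_in weights (the 1 x 1 convolution)
    bias      : c_out bias values
    """
    n, c_in, k = len(x), len(x[0]), len(depthwise[0])
    out = []
    for i in range(n - k + 1):  # window i .. i+k-1; output has n - k + 1 steps
        # Depthwise stage: each channel is filtered by its own kernel.
        dw = [sum(depthwise[c][j] * x[i + j][c] for j in range(k))
              for c in range(c_in)]
        # Pointwise stage: a 1 x 1 convolution mixes channels, then ReLU (assumed f).
        out.append([max(0.0, sum(w[c] * dw[c] for c in range(c_in)) + b)
                    for w, b in zip(pointwise, bias)])
    return out


# Toy input: n = 4 steps, 2 channels, kernel length k = 2, 3 output channels.
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
feature_map = depthwise_separable_conv1d(
    x,
    depthwise=[[1, 1], [1, -1]],
    pointwise=[[1, 0], [0, 1], [1, 1]],
    bias=[0, 0, 0],
)
print(len(feature_map))  # 3 steps, one fewer than the 4 input steps
```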

The invention extracts spatial and temporal features with a depthwise separable convolutional neural network and a BiLSTM network using only 32 neural units, which yields an extremely lightweight student network model. This reduction, however, inevitably costs some recognition accuracy.
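To make the parameter saving concrete, the following sketch compares the parameter count of a standard one-dimensional convolution with that of a depthwise separable one. The layer sizes are assumptions chosen for illustration (m = 500 input channels, 32 filters, kernel length 3, depth multiplier 1); the source does not state the kernel length or the exact layer configuration.

```python
def conv1d_params(c_in, c_out, k):
    """Weights plus biases of a standard 1D convolution."""
    return c_in * c_out * k + c_out


def separable_conv1d_params(c_in, c_out, k):
    """Weights plus biases of a 1D depthwise separable convolution."""
    depthwise = c_in * k       # one length-k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution across channels
    return depthwise + pointwise + c_out


# Assumed sizes: 500 input channels, 32 filters, kernel length 3.
standard = conv1d_params(500, 32, 3)
separable = separable_conv1d_params(500, 32, 3)
print(standard, separable)  # 48032 17532
```

Under these assumed sizes the separable layer needs roughly a third of the parameters, and the gap widens as the kernel length grows.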

Knowledge distillation is therefore introduced. A teacher model with very high accuracy is first trained and saved. During its own training, the student model then dynamically generates soft labels with the proposed adaptive temperature function. The loss between the soft labels generated by the student model and those generated by the teacher model is computed with the attention loss function, which transfers the teacher model's feature knowledge to the student model. The student model also generates a hard label, and the loss between the hard label and the real label is computed with the cross-entropy loss function. Finally, the attention loss and the cross-entropy loss together form the loss kd_loss used in training the student model, and the weights are updated accordingly. The adaptive temperature function is as follows:

$$T(\mathrm{accuracy}) = \alpha - (\beta \cdot \mathrm{accuracy})^{\theta}$$

Here α is a hyperparameter that determines the range over which T varies, θ is an exponent parameter that determines how quickly T changes, and β is a scale parameter that determines the amplitude of the change. accuracy is the model's accuracy. T decreases gradually as accuracy increases: early in training a larger T is needed so that the soft labels carry more information, and as accuracy improves T must be reduced so that the soft labels emphasize the most prominent feature information. Suppose the current sample belongs to class A. Early in training, accuracy is still low, and the soft label generated by the student model should carry more feature information so that the model quickly acquires the feature knowledge of class A samples; the temperature T therefore needs to be large. Once accuracy reaches a certain level and the model can no longer gain additional knowledge of class A samples, the soft label for the sample needs to highlight the sample's own information more strongly, so the temperature T needs to be small.
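The behaviour of the adaptive temperature can be illustrated with a short sketch. The parameter values (α = 5, β = 4, θ = 1) are assumptions chosen only so that T falls from 5 to 1 as accuracy rises from 0 to 1; the source does not give concrete values. The soft label is computed here with the temperature-scaled softmax that is standard in knowledge distillation, which is also an assumption because the corresponding expression is not reproduced in the text.

```python
import math


def adaptive_temperature(accuracy, alpha=5.0, beta=4.0, theta=1.0):
    """T(accuracy) = alpha - (beta * accuracy) ** theta; defaults are assumed."""
    return alpha - (beta * accuracy) ** theta


def soft_label(logits, temperature):
    """Temperature-scaled softmax (assumed standard distillation form)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]                      # toy student logits
for acc in (0.0, 0.5, 1.0):
    t = adaptive_temperature(acc)
    print(acc, t, [round(p, 3) for p in soft_label(logits, t)])
```

With these assumed values the soft label is flatter at low accuracy (large T) and more sharply peaked at high accuracy (small T), which is the behaviour the paragraph above describes.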

The overall loss of the student model is then computed with the combined loss function kd_loss, which consists of two parts, the attention loss and the cross-entropy loss:

kd_loss = 0.5 * attention_loss + 0.5 * ce_loss

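The attention_loss expression is:

[formula not reproduced in the source text]

where the terms of the expression are the soft label generated by the teacher model and the soft label generated by the student model, α is a proportion parameter, γ is an exponent parameter, and the logarithm is base e.

The ce_loss expression is:

[formula not reproduced in the source text]

where $t_i$ is the real label and the logarithm is base e.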

The model weights are then updated by backpropagation according to the loss computed by kd_loss to obtain the optimal network parameters, with higher recognition performance on small-sample classes.
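The equal weighting of the two loss terms can be sketched as follows. Because the attention_loss expression is not reproduced in the text, the sketch takes it as an already computed number rather than guessing its form. The cross-entropy is assumed to be the standard form, the negative sum of $t_i \log p_i$ with natural logarithm, which matches the description of $t_i$ as the real label; the function names and the small `eps` guard are illustrative additions.

```python
import math


def cross_entropy(true_label, probs, eps=1e-12):
    """Standard cross-entropy with natural log (assumed form of ce_loss).

    true_label is a one-hot list; probs is the student's hard-label
    probability vector. eps guards against log(0).
    """
    return -sum(t * math.log(p + eps) for t, p in zip(true_label, probs))


def kd_loss(attention_loss, ce_loss):
    """Combined loss: 0.5 * attention_loss + 0.5 * ce_loss."""
    return 0.5 * attention_loss + 0.5 * ce_loss


ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
# 0.4 stands in for an attention loss computed elsewhere (placeholder value).
print(kd_loss(0.4, ce))
```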

The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network comprises a preprocessing stage, a training stage, and a knowledge transfer stage based on knowledge distillation. In the preprocessing stage, the original traffic is segmented, filtered and processed into formatted samples that the model can be trained on, and the samples are converted into an IDX file for storage. In the training stage, a sample is input into the model, and after multi-dimensional extraction of spatial and temporal features a fully connected layer outputs the prediction vector for the current sample. In the knowledge transfer stage, the student model generates a soft label through the proposed adaptive temperature function and additionally generates a hard label. The loss between the student's soft label and the teacher model's soft label is computed with the attention loss, which transfers knowledge from the teacher model to the student model, and the loss between the hard label and the real label is computed with the cross-entropy loss function. The attention loss and the cross-entropy loss together form the overall loss kd_loss of the student model. The model weights are then updated by backpropagation according to this loss, making the model more sensitive in identifying and classifying samples. The specific operation steps are as follows:

Step 1: Obtain a number of original abnormal and encrypted traffic files, segment them into flows, attach the corresponding class labels, and remove invalid flow files.

Step 2: Read the packets from each flow file in order, keeping valid packets and filtering out invalid ones, until the number of valid packets reaches n; if there are fewer than n, pad with zeros.

Step 3: For each valid packet in the processed flow, convert every 8 bits into a decimal integer between 0 and 255. The packet length is fixed at 500 bytes: longer packets are truncated, and shorter packets are padded with zeros.

Step 4: Normalize the resulting n × m two-dimensional sample vectors so that every element lies between 0 and 1.

Step 5: Input the samples into the model for training, extract multi-dimensional spatial and temporal features, and output the prediction vector of each sample through the fully connected layer.

Step 6: During training, the student model generates a soft label and a hard label. The attention loss is computed between the student model's soft label and the teacher model's soft label, which transfers the teacher model's feature knowledge to the student model. The cross-entropy loss is computed between the student model's hard label and the real label.

Step 7: Combine the attention loss and the cross-entropy loss into the overall loss kd_loss of the student model, and update the model weights by backpropagation to raise the model's recognition rate.

Step 8: Repeat Steps 5, 6 and 7 until the model converges, which completes the multi-classification recognition training based on space-time cost.

The present invention is described in terms of flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims

1. A lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network, characterized by comprising the following steps:

acquiring malicious traffic of the Internet of things to be identified;

acquiring a pre-trained teacher network model;

obtaining a pre-trained student network model based on knowledge distillation;

inputting malicious traffic of the Internet of things to be identified into a pre-trained student network model to obtain a category prediction vector and a corresponding prediction label;

Pre-training the student network model includes:

outputting a category prediction vector by the target student network model;

wherein the probability value of the student model for the i-th sample at temperature T is given by the expression:

[formula not reproduced in the source text]

where $z_i$ is the logit of the i-th sample output by the student model;

the expression of the adaptive temperature function T is:

$$T(\mathrm{accuracy}) = \alpha - (\beta \cdot \mathrm{accuracy})^{\theta}$$

where α is a hyperparameter that determines the range over which T varies, θ is an exponent parameter that determines how quickly T changes, β is a scale parameter that determines the amplitude of the change, accuracy is the accuracy, and T decreases gradually as accuracy increases.

2. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 1, wherein

pre-training the teacher network model comprises:

3. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 1, wherein

pre-training the student network model comprises:

4. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 1 or 3, wherein

pre-training the student network model comprises:

acquiring a soft label generated by a pre-trained teacher model;

the attention loss and the cross entropy loss together form the loss of the student network model, and the weight is updated to obtain the final student network model.

5. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 2 or 3, wherein the training set is obtained by the following steps:

filtering the selected data packets and masking their IP addresses;

6. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 1, wherein

the target student network model is a lightweight neural network model based on a knowledge distillation spatiotemporal neural network;

7. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 5, wherein filtering the selected data packets comprises:

converting each data packet into a two-dimensional vector of (n, m);

normalizing the resulting two-dimensional vector of (n, m).

8. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 2 or 3, wherein obtaining the training set comprises:

dividing the formatted sample vector set into a training set and a testing set;

9. The lightweight Internet of Things malicious traffic identification method based on a knowledge distillation spatiotemporal neural network according to claim 1, wherein pre-training the student network model further comprises:

kd_loss = 0.5 * attention_loss + 0.5 * ce_loss

the attention_loss expression is:

[formula not reproduced in the source text]

where the terms of the expression are the soft label generated by the teacher model and the soft label generated by the student model, α is a proportion parameter, γ is an exponent parameter, and the logarithm is base e;

the ce_loss expression is:

[formula not reproduced in the source text]

where $t_i$ is the real label, and the logarithm is base e;