CN117336071A - Internet of things equipment safety protection method and device based on distributed AI - Google Patents


Info

Publication number: CN117336071A
Authority: CN (China)
Prior art keywords: data, expressed, model, error, training
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202311343739.8A
Other languages: Chinese (zh)
Inventor: 金城
Original and current assignee: Jiangsu Xinchao Tiancheng Intelligent Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Jiangsu Xinchao Tiancheng Intelligent Technology Co ltd

Classifications

    • H04L63/0218 — Network security; architectural arrangements; distributed architectures, e.g. distributed firewalls
    • G06F18/213 — Pattern recognition; feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F18/24 — Pattern recognition; classification techniques
    • G06N3/0455 — Neural networks; auto-encoder networks; encoder-decoder networks
    • G06N3/048 — Neural networks; activation functions
    • G06N3/098 — Learning methods; distributed learning, e.g. federated learning
    • H04L63/0263 — Network security; filtering policies; rule management
    • H04L63/1441 — Detecting or protecting against malicious traffic; countermeasures against malicious traffic
    • H04L63/205 — Managing network security; negotiation or determination of the network security mechanisms to be used, e.g. between client and server or between peers
    • H04L9/40 — Cryptographic mechanisms or cryptographic arrangements; network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of the Internet of things, and in particular to a distributed-AI-based safety protection method and device for Internet of things equipment, comprising the following specific steps: S1: training a model on a global data set through a federated learning distributed framework; S2: performing model inference with the trained model at the central server side to form a risk intrusion classification for the Internet of things equipment; S3: identifying malicious attacks through the risk intrusion classification scheme and adjusting the defense strategy in real time. By using federated learning, the invention can carry out model training without directly sharing data, effectively protecting the privacy of user data, which is difficult to achieve in traditional schemes; it also adopts the ESAAE algorithm for data expansion, effectively utilizing limited data resources and improving the data utilization rate.

Description

Internet of things equipment safety protection method and device based on distributed AI
Technical Field
The invention relates to the technical field of the Internet of things, in particular to a safety protection method and a safety protection device for Internet of things equipment based on a distributed AI.
Background
With the rapid development of the Internet of things, more and more intelligent devices are connected to the network, bringing great convenience to people's lives. However, Internet of things devices carry many potential safety hazards, such as device vulnerabilities, data leakage and malicious attacks, which make them a main target for attackers. Ensuring the safety of Internet of things equipment is therefore essential.
In traditional Internet of things security protection schemes, a centralized model training method is generally adopted: all data are collected on a central server for training. Although simple, this method has the following problems:
1. Data privacy. Because all data are uploaded to the central server, this method easily causes data privacy disclosure; the privacy and security of the data cannot be effectively protected.
2. Large data transmission overhead. Uploading all data to the central server incurs a large transmission overhead; especially when the number of Internet of things devices is large, massive data uploads can cause network congestion and degrade network performance.
3. Limited model effect. Traditional schemes often adopt a simple model training method, so the model effect may be limited; because the data volume is huge and the sources are complex, centralized training may not fully mine the latent information in the data, resulting in poor model results.
4. Lack of dynamic defense capability. Traditional schemes generally adopt a static defense strategy and lack real-time response and dynamic defense against risk intrusion; when an intrusion occurs, an effective response cannot be made quickly, reducing device safety.
5. Limited data expansion and feature extraction. Traditional schemes are generally weak in data expansion and feature extraction; especially when the data volume of the devices is insufficient or the data distribution is uneven, the existing data cannot be fully utilized, so the training effect is not ideal.
6. A single model training and optimization method. Traditional schemes often rely on a single training and optimization method, which may not fully mine the latent information in the data and thus limits model performance.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a distributed-AI-based safety protection method and device for Internet of things equipment; the specific technical scheme is as follows:
the Internet of things equipment safety protection method based on distributed AI comprises the following specific steps:
S1: training a model on a global data set through a federated learning distributed framework;
S2: performing model inference with the trained model at the central server side to form a risk intrusion classification for the Internet of things equipment;
S3: identifying malicious attacks through the risk intrusion classification scheme and adjusting the defense strategy in real time.
As a further technical scheme of the invention, the federated learning distributed framework comprises a central server and a plurality of user sides;
the method for training the model on the global data set through the federated learning distributed framework comprises the following steps:
S11: data acquisition is carried out through the Internet of things, and each user side marks the acquired data;
S12: the data are vectorized, converting each data attribute into a numerical value;
S13: data preprocessing is carried out on the vectorized data;
S14: the preprocessed data are expanded through the ESAAE algorithm;
S15: features are extracted from the expanded training data through a differential motion optimization algorithm;
S16: a classifier is trained with the feature-extracted training data through the RSNN algorithm;
S17: each user side uploads the parameters of the model trained on its local data to the central server side, and it is judged whether the iteration budget has been reached; if yes, training stops; if not, the method proceeds to S18;
S18: the received parameters are aggregated to form a global parameter model, the global parameters are sent back to the clients, and it is judged whether the global model has converged; if yes, training stops; if not, the method returns to S17.
As a further technical scheme of the present invention, the preprocessing method of the data in S13 is as follows:
first, the numerical data are normalized, wherein the normalization formula is:
x̂_ij = (x_ij − min_j) / (max_j − min_j)
wherein x_ij is the j-th numerical attribute of the i-th data point in the vectorized data set D, 1 ≤ i ≤ n, 1 ≤ j ≤ m, n is the size of the data set, m is the number of numerical attributes, min_j and max_j are the minimum and maximum values of the j-th numerical attribute, and x̂_ij is the normalized value;
next, the missing values in the data are processed using the following formula:
x_ij = μ_j if x_ij is missing
wherein μ_j is the average or median of the j-th attribute.
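The two preprocessing steps of S13 — min-max normalization followed by imputation with μ_j — can be sketched as follows (a generic illustration using the column mean for μ_j; the description also allows the median):

```python
import numpy as np

def preprocess(D):
    """Min-max normalize each numerical column, then fill NaNs with the column mean."""
    D = D.astype(float).copy()
    col_min = np.nanmin(D, axis=0)
    col_max = np.nanmax(D, axis=0)
    D = (D - col_min) / (col_max - col_min)   # x_hat = (x - min_j) / (max_j - min_j)
    col_mean = np.nanmean(D, axis=0)          # mu_j over the observed values
    idx = np.where(np.isnan(D))
    D[idx] = np.take(col_mean, idx[1])        # impute each missing entry with its mu_j
    return D

D = np.array([[1.0, 10.0],
              [3.0, np.nan],
              [5.0, 30.0]])
P = preprocess(D)  # each column scaled to [0, 1], NaN replaced by the column mean
```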
As a further technical solution of the present invention, the ESAAE algorithm in S14 includes:
a self-encoder: the self-encoder is the core component of the ESAAE and is used for compressing and reconstructing data; its structure comprises an encoder and a decoder, wherein the encoder maps the input data x to a hidden layer h, and the decoder maps the hidden layer h back to the reconstructed input data x';
a discriminator: the discriminator is used for distinguishing real data from generated data and comprises a plurality of hidden layers and an output layer, wherein the output of the discriminator is a probability value representing the probability that the input data are real;
a generator: the generator is used for generating new data samples; it takes the hidden-layer activation state of the self-encoder as input and generates new data through a series of deconvolution operations.
The ESAAE algorithm flow is as follows:
S141: let W_in be the input weight matrix and W_res the internal weight matrix of the network; the input matrix has size N×M, wherein N is the number of network nodes and M is the input dimension, and the residual matrix has size N×N. The parameters of the ESAAE, including the weights and biases of the self-encoder and the discriminator, are initialized;
S142: the input data x are encoded into the hidden layer h by the encoder part, and the activation state of the hidden layer is updated by the following formula:
h_{t+1} = (1 − α)·h_t + tanh(W_in·x_t + W_res·h_t)
wherein α is the update rate, W_in and W_res are respectively the input weight matrix and the residual weight matrix, h_{t+1} is the activation state of the residual layer at the (t+1)-th iteration, and h_t is the activation state of the residual layer at the t-th iteration;
then the hidden layer h is decoded back to the reconstructed data x' by the decoding part of the self-encoder, which can be expressed as:
x' = σ(W_dec·h + b_dec)
wherein σ(·) is the Sigmoid activation function, W_dec is the decoding weight matrix, and b_dec is the decoding bias;
calculation ofLoss function of self-encoderDefine the loss function of the custom encoder>The goal of the self-encoder is to minimize reconstruction errors, i.e. the difference between the original data x and the decoded data x'; expressed in terms of Mean Square Error (MSE):
where n is the number of samples;
using the generator portion, taking as input the hidden layer activation state of the self-encoder and generating new data by a series of deconvolution operations; the goal of the generator is to generate dummy data that is similar to the real data distribution;
the loss function of the generator consists of two parts: reconstruction loss and fight loss; wherein the reconstruction loss is The countering loss is defined as:
wherein h is the hidden layer output of the self-encoder, G (h) is the generator generates a new sample using the hidden layer output of the self-encoder, D () is a discriminant function, and h-p (h) represent that h obeys the distribution p (h);
classifying the real data and the generated data using a discriminator section; the arbiter accepts the real data and the generated data as inputs and outputs a probability value, the output of the arbiter is calculated using the following formula:
wherein pa is the probability value, W, output by the discriminator disc Is a discriminant weight matrix, b disc Is the arbiter bias;
calculating a loss function of a discriminantDefine the loss function of the arbiter>The aim of the discriminator is to distinguish between real data and generated data; expressed in terms of a cross entropy loss function (CE):
wherein y is a real label, and p is the prediction probability of the discriminator;
using gradient descent methods, while optimizing the loss function of ESAAELoss function->Can be expressed as:
wherein, alpha, beta and lambda are all preset super parameters;
in each iteration, the self-encoder and the arbiter counter-propagate the loss function and update the parameters at the same time;
s143: and (S142) iterating the step until the iteration times reach the maximum iteration times preset by people, stopping iterating, and adding the sample generated by the generator into the preprocessed data set of the current user side to serve as an expanded training data set.
As a further technical solution of the present application, the differential motion optimization algorithm in S15 includes:
Each weight w[i] in the neural network is optimized with an initial velocity v[i] and an acceleration a_cs[i]. The acceleration is calculated from the difference between the error of the current network output and the previous error, and can be expressed as:
a_cs[i] = ka·(error_current − error_previous)
wherein ka is a hyperparameter representing the gain of the acceleration, error_current is the error of the current network, and error_previous is the error of the network at the previous iteration; the error of the current network, error_current, is computed over the N training samples from the network output ŷ_k and the true label y_k of each k-th sample, wherein N is the number of samples.
Further, a kinetic energy correction factor is calculated. For each weight w[i] and bias b[i], its kinetic energy K[i] is calculated, which can be expressed as:
K[i] = 0.5·m[i]·v[i]²
wherein m[i] is the scale of the weight or bias and v[i] is the update rate. A function f(K[i]) involving e, the base of the natural logarithm, is defined to calculate the kinetic energy correction factor.
Further, the acceleration is adjusted using the kinetic energy correction factor, which can be expressed as:
a[i] = a_cs[i]·f(K[i])
Further, the speed and the weights are updated. For each weight w[i], the velocity v[i] is first updated using the adjusted acceleration a[i], which can be expressed as:
v[i]_new = [ v[i]_old·h(adpt[i])·f(IG) + a[i]·dt ]·μ
wherein dt is a small time interval that is continuously adjusted by the dynamic learning rate; h(adpt[i]) is a parameter adjustment factor that uses the historical update information of a parameter to optimize its adjustment process, improving the performance and convergence speed of the algorithm; f(IG) is an information gain adjustment factor that likewise accounts for the parameter's update history, making the algorithm more flexible and efficient; and μ is an adaptive inertia factor.
The adaptive inertia factor μ is adaptively adjusted in each iteration as a function of its previous value and the current gradient magnitude, wherein μ_t is the adaptive inertia factor of the t-th iteration, μ_{t−1} is the adaptive inertia factor of step t−1, β_3 is a hyperparameter, and |g_t| is the absolute value of the gradient at step t.
Further, the parameter adjustment factor is calculated as follows. The parameter adaptability is computed first: it is defined as the mean of the parameter's update rate over a time window, and a sliding window is used to calculate the adaptability of each weight and bias. Let v[i]_{t−1}, v[i]_{t−2}, …, v[i]_{t−n} be the speed of the weight w[i] or bias b[i] in the last n updates; its adaptability adpt[i] is calculated as:
adpt[i] = (1/n)·Σ_{j=1}^{n} v[i]_{t−j}
wherein n is the size of the sliding window.
Adaptive parameter adjustment then follows: the update speed of the weights and biases is adjusted according to the parameter adaptability, and the function h(adpt[i]), with hyperparameter β representing the adaptive weight, is defined to calculate the parameter adjustment factor from adpt[i].
Further, the information gain adjustment factor f(IG) is calculated as follows. The information gain is computed first: it is defined as a measure of the difference between the current weight or bias value and its historical value, obtained by calculating the change in entropy. Let the weight or bias at times t and t−1 be w_t and w_{t−1}, respectively; their information gain IG is calculated as:
IG = −p(w_t)·log(p(w_t)) + p(w_{t−1})·log(p(w_{t−1}))
wherein p(w) is the probability distribution of the weight or bias value w.
Adaptive information gain adjustment then follows: the update speed of the weights and biases is adjusted according to the information gain, and the function f(IG), with hyperparameter γ representing the weight of the information gain, is defined to calculate the information gain adjustment factor.
Further, dt is adjusted as follows. The error curvature is defined as the second derivative of the error with respect to time and is estimated from the errors at three consecutive time points. Let error_previous, error_current and error_next be the last, current and next errors, respectively; the estimated error curvature is:
curvature = (error_next − 2·error_current + error_previous) / dt²
wherein dt is the time interval.
The learning rate is then adjusted adaptively according to the error curvature: the function g(curvature), with hyperparameter α representing the weight of the curvature, is defined to calculate a learning rate adjustment factor, and the learning rate is updated as:
dt_new = dt_old·g(curvature)
Further, the new velocity v[i]_new is used to update the weight w[i], which can be expressed as:
w[i]_new = w[i]_old + v[i]_new·dt
wherein dt is a hyperparameter that can be regarded as a learning rate, and ka is the gain of the acceleration, controlling the speed of parameter updating.
Further, the bias is updated similarly to the weights, using acceleration and velocity to update b[i]. For each bias, the acceleration is first calculated:
a_b[i] = a_cs[i]·f(K[i])
the velocity is then updated:
v_b[i]_new = v_b[i]_old + a_b[i]·dt
and finally the bias is updated:
b[i]_new = b[i]_old + v_b[i]_new·dt
In each training iteration, the weights and biases are updated according to the kinetic energy correction factors and the acceleration is dynamically adjusted, which better simulates the motion of a physical object and makes the algorithm more flexible and efficient.
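As a rough illustration of one differential-motion update — acceleration from the error difference, kinetic-energy damping, then velocity and weight updates — the sketch below assumes f(K) = e^(−K) for the kinetic energy correction factor and omits the adaptive factors h(adpt), f(IG) and μ, whose exact forms the description leaves open:

```python
import math

def motion_step(w, v, err_cur, err_prev, m=1.0, ka=0.5, dt=0.1):
    """One differential-motion update for a single weight w with velocity v."""
    a_cs = ka * (err_cur - err_prev)   # acceleration from the error difference
    K = 0.5 * m * v ** 2               # kinetic energy of the parameter
    f_K = math.exp(-K)                 # assumed kinetic-energy correction factor
    a = a_cs * f_K                     # adjusted acceleration
    v_new = v + a * dt                 # velocity update (adaptive factors omitted)
    w_new = w + v_new * dt             # weight update
    return w_new, v_new

w, v = 0.0, 0.0
errors = [1.0, 0.8, 0.5, 0.45, 0.44]   # an illustrative decreasing error trace
for prev, cur in zip(errors, errors[1:]):
    w, v = motion_step(w, v, cur, prev)
```

A steadily decreasing error gives a consistently signed acceleration, so the velocity (and hence the weight) drifts in one direction while the kinetic-energy factor gradually damps further acceleration.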
As a further technical solution of the present application, the RSNN algorithm in S16 includes:
First, a multi-layer impulse (spiking) neural network is constructed, consisting of an input layer, a hidden layer and an output layer; the neurons of the hidden layer all employ a pulsed activation function.
Impulse neurons are used as the basic units of the network. The input u(t) of an impulse neuron is an attention-weighted sum of the outputs of the previous layer, wherein u(t) is the input of the neuron, w_i are the weights of the neuron, x_i(t) is the output of the i-th neuron of the previous layer, b is the bias, N is the dimension of the input, and A is an attention weight matrix composed of the attention weights.
Specifically, the attention weight A_ij is calculated by the function fc(·) as A_ij = fc(s(Fx_i, Fx_j)), wherein s(Fx_i, Fx_j) is a function computing the similarity of the two features Fx_i and Fx_j; s(Fx_i, Fx_j) is defined in the following form:
s(Fx_i, Fx_j) = θ^T·[Fx_i; Fx_j]
wherein θ is a parameter vector to be learned and [Fx_i; Fx_j] is the concatenation of the two features.
Further, the output y(t) of each neuron is a pulse function, which can take the form:
y(t) = 1 if u(t) ≥ θ, and y(t) = 0 otherwise
wherein θ is the threshold of the neuron.
To evaluate the performance of the model, a first loss function L_origin is defined as the cross-entropy:
L_origin = −Σ_{i=1}^{M} t_i·log(p_i)
wherein t_i is the real label, p_i is the predictive probability of the model, and M is the number of categories.
Further, an entropy function H(·) is defined to measure the uneven distribution of the attention weights, which can be expressed as:
H(A) = −Σ_{i,j} A_ij·log(A_ij)
The entropy term is added to the loss function, and the parameters are updated with:
L = L_origin + α_d × H(A)
wherein L_origin is the first loss function and α_d is a manually preset hyperparameter.
Further, for optimization on the Riemannian manifold, the Riemannian gradient must be calculated. At a point w on the manifold M, the Riemannian gradient grad_w L is the unique vector satisfying:
g_w(grad_w L, v) = DL(w)[v] for all v ∈ T_w M
wherein g_w is the metric at the point w on the manifold M, T_w M is the tangent space of the manifold M at w, and DL(w)[v] is the derivative of the function L at w along the direction v.
Further, an inner product is defined on the Riemannian manifold to calculate the Riemannian gradient and the Riemannian Hessian matrix, and gradient descent is performed on the manifold. The weight update formula is:
w_{t+1} = Exp_{w_t}(−η·grad_{w_t} L)
wherein Exp_{w_t} is the exponential map starting from the point w_t, grad_{w_t} L is the Riemannian gradient at w_t, and η is the learning rate.
Based on this, the following steps are repeated until the stopping condition is satisfied:
Forward propagation: the input data are fed into the neural network and the output of each layer is calculated;
Loss calculation: the loss is calculated using the loss function defined above;
Back propagation: the gradient of each layer's weights is calculated from the loss function;
Weight update: the weights are updated by gradient descent on the Riemannian manifold.
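The pulse activation at the heart of the RSNN — a neuron fires only when its weighted input reaches the threshold θ — can be sketched as follows (the attention matrix A and the Riemannian update are omitted; the layer sizes are illustrative):

```python
import numpy as np

def impulse_layer(x, W, b, theta=0.5):
    """Threshold spiking: y = 1 where the weighted input u = Wx + b reaches theta."""
    u = W @ x + b                      # neuron input (attention weighting omitted)
    return (u >= theta).astype(float)  # pulse output: fire (1) or stay silent (0)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))  # 4 hidden impulse neurons over a 3-dimensional input
b = np.zeros(4)
x = np.array([1.0, -0.5, 0.25])
y = impulse_layer(x, W, b)   # a binary spike pattern
```

Because the output is binary, downstream layers see a sparse spike pattern rather than a graded activation, which is what distinguishes the impulse network from a conventional one.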
As a further technical solution of the present application, whether the global model in S18 has converged is judged as follows:
a number of training rounds can be set before training begins; once that number of rounds is reached, model training stops;
alternatively, when the decrease of the training and validation loss values stops or becomes insignificant, the model can be considered converged and model training stops.
Let the classifier model be fe(·; θ_e), wherein θ_e are the parameters of the model. Given a sample x_e to be classified, the model fe(·; θ_e) classifies it to obtain the classification result y_pred, which can be expressed as:
y_pred = fe(x_e; θ_e)
Further, the F1-score is used to evaluate the classification result; the F1-score is the harmonic mean of Precision and Recall, which can be expressed as:
F1 = 2·Precision·Recall / (Precision + Recall)
wherein:
Precision = TP / (TP + FP), Recall = TP / (TP + FN)
and TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
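A small worked example of the F1-score computation (the counts are illustrative, not taken from the description):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from TP/FP/FN counts."""
    precision = tp / (tp + fp)   # TP / (TP + FP)
    recall = tp / (tp + fn)      # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 intrusions correctly flagged, 2 false alarms, 2 missed intrusions
score = f1_score(tp=8, fp=2, fn=2)   # precision = 0.8, recall = 0.8 -> F1 = 0.8
```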
The Internet of things equipment safety protection device based on distributed AI, applying the method of any one of the above technical schemes, comprises:
a data acquisition module: used for collecting data in the Internet of things;
an execution module: used for training the model on the global data set;
a data processing module: used for processing the collected data, identifying malicious attacks and adjusting the defense strategy in real time.
The beneficial effects of the invention are as follows:
1. Data privacy protection: by using federated learning, the invention can carry out model training without directly sharing data, effectively protecting the privacy of user data, which is difficult to achieve in traditional schemes.
2. Improved data utilization: the invention adopts the ESAAE algorithm for data expansion, effectively utilizing limited data resources and improving the data utilization rate.
3. Improved model accuracy and robustness: through feature extraction and classifier training optimization, the method improves the accuracy and robustness of the model, identifying and defending against risk intrusion more effectively.
4. Dynamic defense: the invention can automatically trigger a dynamic defense mechanism and respond to risk intrusion in real time, which is more flexible and effective than a traditional static defense strategy.
5. Targeted protection: the method can classify and analyze intrusion behavior and adopt targeted protective measures, defending against malicious attacks more accurately and effectively.
6. Real-time response: intrusion behavior is analyzed through the risk intrusion classification method, and responses such as adjusting security policies or tracing attack sources can be made automatically or manually, improving the real-time response capability to risk intrusion.
Drawings
FIG. 1 illustrates a schematic structural diagram of a federal learning distributed framework;
FIG. 2 shows a flow diagram of a model trained on a global dataset;
fig. 3 shows a schematic flow diagram of a security protection apparatus for an internet of things device based on a distributed AI.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments.
The application provides a distributed-AI-based safety protection method for Internet of things equipment. It adopts a federated learning distributed framework for model training, composed of a central server and a plurality of user sides; this framework allows each user participating in the Internet of things equipment safety protection task to train the model on a global data set without sharing data. The specific steps are as follows:
s1: training a model on a global data set by a training method of the model through a federal learning distributed framework;
S11: data acquisition is carried out through the Internet of things, and each user terminal marks the acquired data.
The data formats of the user terminals are stored in a CSV format, each row of records corresponds to one data point, and each column represents one data attribute; the data attributes of each user side include: device type, device manufacturer, operating system version, software version, CPU model, memory size, network traffic, port status, login failure times, and device log; each data attribute is represented by an appropriate value or string.
Each user side marks the acquired data, mainly with a label indicating whether risk intrusion exists. The label has two possible values: 0 represents no risk intrusion and 1 represents risk intrusion.
For example: the data attribute values of the data points of the Internet of things equipment are as follows:
device type: camera; device manufacturer: vendor A; operating system version: OS_v1; software version: SW_v2; CPU model: CPU_1; memory size: 4GB; network traffic: 1000KBps; port state: open; number of login failures: 3; device log: "error occurred during login". Then, the vector of this data point is expressed as:
X = ["camera", "vendor A", "OS_v1", "SW_v2", "CPU_1", "4GB", "1000KBps", "open", "3", "error occurred during login"]
And the data is marked as: Y = 1 (indicating that risk intrusion exists)
Where X represents a feature vector of the data point and Y represents a marker of the data point.
And S12, carrying out vectorization processing on the data, and converting each data attribute into a digital value.
Wherein, the non-numerical attributes such as the device type, the device manufacturer, the operating system version, the software version, the CPU model and the like are vectorized through One-hot encoding (One-hot encoding).
One-hot encoding is a method of converting a classification variable into a binary vector. For example, for a device type attribute, three device types are set: "camera" is denoted as [1, 0]; "temperature sensor" is denoted as [0,1,0]; the "router" is denoted as [0, 1].
For the numerical attributes such as memory size, network traffic and the like, the numerical attributes are directly used as elements of vectors.
For port state attributes such as "open" and "closed," binary encoding may be used, i.e., "open" is denoted as 1 and "closed" as 0.
For the login failure time attribute, the login failure time attribute is directly used as an element of the vector.
For the device log attribute, Word Embedding technology is used to convert words in the log into vectors; for example, words such as "error" and "occurred" may be converted into vectors using a pre-trained word embedding model. Specifically, let the vector of word w be denoted e_w, where w belongs to the word set V in the device log; the formula for converting the word w into a vector is:
e_w = Embedding(w)
wherein Embedding is a pre-trained Word Embedding model, and the invention adopts a Word2Vec model.
Merging all attribute vectors into one large vector;
Specifically, let there be M attributes, with the vector of the i-th attribute denoted a_i, where 1 ≤ i ≤ M. The final vector representation of the data point is given by:
x_vec = Cont[a_1, a_2, …, a_M]
where Cont[·] denotes the vector concatenation operation and x_vec is the vectorized data.
In this way, all attributes of the data points are converted into one large vector, which can be directly used for subsequent machine learning model training and prediction.
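The S12 vectorization rules above can be sketched as follows. This is a minimal illustration under assumed attribute values: one-hot encoding for categorical fields, raw numbers for numeric fields, and binary encoding for port state; word embeddings for the device log are omitted for brevity, and the vocabulary and field names are assumptions, not part of the patent.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

DEVICE_TYPES = ["camera", "temperature sensor", "router"]  # assumed vocabulary

def vectorize(point):
    """Concatenate all attribute vectors into one large vector (Cont[...])."""
    vec = []
    vec += one_hot(point["device_type"], DEVICE_TYPES)       # one-hot categorical
    vec.append(float(point["memory_gb"]))                    # numeric, used directly
    vec.append(float(point["traffic_kbps"]))
    vec.append(1.0 if point["port_state"] == "open" else 0.0)  # binary port state
    vec.append(float(point["login_failures"]))
    return vec

x = vectorize({"device_type": "camera", "memory_gb": 4, "traffic_kbps": 1000,
               "port_state": "open", "login_failures": 3})
# x == [1.0, 0.0, 0.0, 4.0, 1000.0, 1.0, 3.0]
```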
S13, carrying out data preprocessing on the vectorized data.
First, normalization processing is performed on the numerical data. Normalization converts the data to a standard range, which helps to improve the training efficiency and convergence speed of the model. Let the j-th numerical attribute of the i-th data point in the vectorized data set D be x_ij, where 1 ≤ j ≤ m, n is the size of the data set, and m is the number of numerical attributes. The normalization formula is:

x̃_ij = (x_ij − min_j) / (max_j − min_j)

where min_j and max_j are the minimum and maximum values of the j-th numerical attribute, and x̃_ij is the normalized value.
Further, missing values in the data are processed, since missing values may affect model training and prediction. Let the j-th attribute of the i-th data point in data set D be x_ij; a missing value is filled using the following formula:

x_ij = μ_j  (if x_ij is missing)

where μ_j is the average or median of the j-th attribute.
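The S13 preprocessing can be sketched as below: mean imputation of missing values followed by min-max normalization per numeric column. The function name and the use of `None` to mark missing values are assumptions for illustration.

```python
def preprocess(columns):
    """columns: list of numeric columns (None = missing).
    Returns imputed, min-max-normalized columns."""
    out = []
    for col in columns:
        present = [v for v in col if v is not None]
        mu = sum(present) / len(present)            # mean used for imputation
        filled = [mu if v is None else v for v in col]
        lo, hi = min(filled), max(filled)
        rng = (hi - lo) or 1.0                      # guard against constant columns
        out.append([(v - lo) / rng for v in filled])
    return out

cols = preprocess([[1.0, None, 3.0]])
# mean of present values = 2.0 -> [1.0, 2.0, 3.0] -> normalized [0.0, 0.5, 1.0]
```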
S14, expanding the preprocessed data through an ESAAE algorithm.
The data volume at each user side is often insufficient to train a highly robust artificial intelligence model. Therefore, the invention proposes data expansion based on the Echo State Adversarial Autoencoder (ESAAE) algorithm.
The ESAAE algorithm provided by the invention comprises a self-encoder and a discriminator.
Self-encoder: the self-encoder is a core component of the ESAAE for compression and reconstruction of data. The structure comprises an encoder and a decoder. The encoder maps the input data x to the hidden layer h. The decoder maps the hidden layer h back to the reconstructed input data x'.
A discriminator: the discriminator is used for distinguishing real data from generated data. Which comprises a plurality of hidden layers and an output layer. The output of the arbiter is a probability value representing the probability that the input data is real data.
A generator: the generator is for generating new data samples. It takes as input the hidden layer activation state of the self-encoder and generates new data through a series of deconvolution operations.
The algorithm flow of ESAAE is as follows:
S141: let W_in be the input weight matrix and W_res the internal (or residual) weight matrix of the network. The size of the input matrix is N×M, where N is the number of nodes of the network and M is the dimension of the input; the size of the residual matrix is N×N. The parameters of the ESAAE are initialized, including the weights and biases of the self-encoder and the discriminator.
S142: by encoding the input data x into the hidden layer h from the encoder section, the activation state of the hidden layer is updated by the following formula:
h_{t+1} = (1 − α) h_t + tanh(W_in x_t + W_res h_t)
where α is the update rate, and W_in and W_res are respectively the input weight matrix and the residual weight matrix. h_{t+1} is the activation state of the residual layer at the (t+1)-th iteration, and h_t is the activation state of the residual layer at the t-th iteration.
Then, the hidden layer h is decoded back to the reconstructed data x' by the decoding part of the self-encoder, which can be expressed as:
x' = σ(W_dec h + b_dec)
where σ(·) is the Sigmoid activation function, W_dec is the decoding weight matrix, and b_dec is the decoding bias.
A loss function L_AE of the self-encoder is then calculated. The goal of the self-encoder is to minimize the reconstruction error, i.e. the difference between the original data x and the decoded data x', expressed as the mean square error (MSE):

L_AE = (1/n) Σ_{i=1}^{n} (x_i − x'_i)²

where n is the number of samples.
Using the generator part, the hidden layer activation state of the self-encoder is taken as input and new data is generated by a series of deconvolution operations. The goal of the generator is to generate dummy data that is similar to the real data distribution.
The loss function of the generator consists of two parts: a reconstruction loss and an adversarial loss. The reconstruction loss is L_AE, and the adversarial loss is defined as:

L_adv = −E_{h∼p(h)}[log D(G(h))]

where h is the hidden layer output of the self-encoder, G(h) is the new sample generated by the generator from that hidden layer output, D(·) is the discriminator function, and h∼p(h) denotes that h obeys the distribution p(h).
The real data and the generated data are classified using the discriminator. The discriminator accepts the real data and the generated data as inputs and outputs a probability value, calculated using the following formula:

pa = σ(W_disc x + b_disc)

where pa is the probability value output by the discriminator for input data x, W_disc is the discriminator weight matrix, and b_disc is the discriminator bias.
A loss function L_D of the discriminator is then calculated. The goal of the discriminator is to distinguish between the real data and the generated data, expressed with the cross entropy (CE) loss function:

L_D = −[y log(p) + (1 − y) log(1 − p)]

where y is the true label and p is the predicted probability of the discriminator.
Using the gradient descent method, the loss functions of the ESAAE are optimized simultaneously; the total loss can be expressed as:

L = α L_AE + β L_adv + λ L_D

where α, β and λ are all preset hyper-parameters.
In each iteration, both the self-encoder and the discriminator back-propagate the loss function while updating the parameters. The present invention uses the stochastic gradient descent (SGD) method to optimize the loss function of the ESAAE.
S143: data expansion was performed using trained ESAAE. Namely, the iteration step S142 is looped until the iteration number reaches the maximum preset iteration number, the iteration is stopped, and the sample generated by the generator is added into the preprocessed data set of the current user side to be used as the training data set after expansion.
And S15, extracting the characteristics of the expanded training data.
The invention provides a neural network based on differential motion optimization, inspired by the motion of objects in nature. The differential motion optimization algorithm optimizes the parameters of the neural network based on the principle of uniformly accelerated linear motion of an object on a plane: the weights w and biases b of the neural network are regarded as objects, and the parameters are optimized according to their dynamic changes. The optimization algorithm has two key parameters: velocity v and acceleration a.
Specifically, each weight w[i] in the neural network is optimized with an initial velocity v[i] and an acceleration a_cs[i]. a_cs[i] is calculated from the difference between the error of the current network output and the previous error, and can be expressed as:
a_cs[i] = ka * (error_current − error_previous)
where ka is a hyper-parameter representing the gain of the acceleration, error_current is the error of the current network, and error_previous is the error of the network at the last iteration.
The error of the current network, error_current, can be expressed as:

error_current = (1/N) Σ_{k=1}^{N} (ŷ_k − y_k)²

where N is the number of samples, ŷ_k is the network output of the k-th sample, and y_k is the true label of the k-th sample.
Further, a kinetic energy correction factor is calculated. For each weight w [ i ] and bias b [ i ], its kinetic energy K [ i ] is calculated, which can be expressed as:
K[i] = 0.5 * m[i] * v[i]²
where m[i] is the scale of the weight or bias and v[i] is the update rate. A function f(K[i]) is defined to calculate the kinetic energy correction factor, where e is the base of the natural logarithm.
Further, the acceleration is adjusted. Using the kinetic energy correction factor to adjust acceleration can be expressed as:
a[i]=a cs [i]*f(K[i])
Further, the speed and weight are updated. For each weight w[i], the velocity v[i] is first updated using the adjusted acceleration a[i], which can be expressed as:
v[i]_new = [v[i]_old * h(adpt[i]) * f(IG) + a[i] * dt] * μ
where dt is a small time interval that is continuously adjusted by the dynamic learning rate. h(adpt[i]) is a parameter adjustment factor, which uses the historical update information of the parameters to optimize the parameter adjustment process, thereby improving the performance and convergence speed of the algorithm. f(IG) is an information gain adjustment factor that takes the historical updates of the parameters into account, making the algorithm more flexible and efficient. μ is an adaptive inertia factor.
The adaptive inertia factor μ is adaptively adjusted in each iteration, where μ_t is the adaptive inertia factor of the t-th iteration, μ_{t−1} is that of the (t−1)-th step, β_3 is a hyper-parameter, and |g_t| is the absolute value of the gradient at step t.
Further, the calculation manner of the parameter adjustment factor can be expressed as:
1. Parameter adaptability is calculated. Parameter adaptability is defined as the mean value of the update rate of the parameter over a certain time window; the adaptability of each weight and bias is calculated using a sliding window. Let v[i]_{t−1}, v[i]_{t−2}, …, v[i]_{t−n} be the velocities of weight w[i] or bias b[i] in the last n updates; its adaptability adpt[i] is calculated as:

adpt[i] = (1/n) Σ_{k=1}^{n} v[i]_{t−k}

where n is the size of the sliding window.
2. And (5) self-adaptive parameter adjustment. The update speed of the weights and the bias is adjusted according to the parameter adaptability, and a function h (adpt [ i ]) is defined to calculate a parameter adjustment factor:
where β is a superparameter representing the adaptive weight.
Further, the information gain adjustment factor f (IG) may be calculated as:
1. The information gain is calculated. The information gain is defined as a measure of the difference between the current weight or bias value and its historical value, computed via the change in entropy. Let the weight or bias at times t and t−1 be w_t and w_{t−1} respectively; their information gain IG can be calculated as follows:
IG = −p(w_t) log(p(w_t)) + p(w_{t−1}) log(p(w_{t−1}))
where p(w) is the probability distribution of the weight or bias value w.
2. And (5) adjusting the gain of the adaptive information. The update speed of the weights and offsets is adjusted according to the information gain. Defining a function f (IG) to calculate the information gain adjustment factor:
where γ is a superparameter representing the weight of the information gain.
Further, the dt adjustment mode can be expressed as:
1. Error curvature is defined as the second derivative of the error with respect to time, estimated from the errors at three consecutive time points. Let error_previous, error_current and error_next be the last, current and next errors respectively; the estimated error curvature is:

curvature = (error_next − 2 * error_current + error_previous) / dt²

where dt is the time interval.
2. And (5) adjusting the self-adaptive learning rate. The learning rate is adjusted according to the error curvature, and a function g (cure) is defined to calculate a learning rate adjustment factor:
where α is a hyper-parameter representing the weight of the curvature.
3. Using the learning rate adjustment factor to update the learning rate dt can be expressed as:
dt new =dt old *g(curvature)
Further, the new velocity v[i]_new is used to update the weight w[i], which can be expressed as:
w[i]_new = w[i]_old + v[i]_new * dt
where dt is a hyper-parameter that can be regarded as a learning rate, and ka is the gain of the acceleration, controlling the speed at which the parameters are updated.
Further, for updating the bias, similar to the update process of the weights, using acceleration and velocity to update the bias b [ i ], can be expressed as:
for each bias, the acceleration is first calculated, which can be expressed as:
a b [i]=a cs [i]*f(K[i])
further, the update speed may be expressed as:
v b [i] new =v b [i] old +a b [i]*dt
further, the update bias may be expressed as:
b[i] new =b[i] old +v b [i] new *dt
in each training iteration, the weight and the bias are updated according to the kinetic energy correction factors, and the acceleration is dynamically adjusted, so that the motion of an object can be better simulated, and the algorithm is more flexible and efficient.
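One differential-motion update for a single weight can be sketched as below, following a_cs = ka * (error_current − error_previous), a = a_cs * f(K), v_new = [v_old * h(adpt) * f(IG) + a * dt] * μ, and w_new = w_old + v_new * dt. The correction factors f(K), h(adpt) and f(IG) are passed in as plain numbers (here defaulting to 1.0), since the text defines them only as named functions; their exact forms are an assumption left out of this sketch.

```python
def dm_step(w, v, err_cur, err_prev, ka=0.1, dt=0.01, mu=0.9,
            f_k=1.0, h_adpt=1.0, f_ig=1.0):
    """One differential-motion optimization step for one weight.
    Returns (new_weight, new_velocity)."""
    a_cs = ka * (err_cur - err_prev)          # acceleration from error difference
    a = a_cs * f_k                            # kinetic-energy-corrected acceleration
    v_new = (v * h_adpt * f_ig + a * dt) * mu # velocity update with inertia factor
    return w + v_new * dt, v_new

w, v = dm_step(w=1.0, v=0.0, err_cur=0.5, err_prev=0.7)
# a_cs = 0.1 * (0.5 - 0.7) = -0.02; the error decreased, so the velocity
# and weight move slightly in the negative direction.
```

Because the error decreased (err_cur < err_prev), the acceleration is negative and the weight is nudged downward; an increasing error would push it the other way.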
S16: and training the classifier by using the training data after the feature extraction.
The invention adopts an impulse (spiking) neural network algorithm based on the Riemann manifold (RSNN) to train the classifier.
The impulse neural network of the Riemann manifold utilizes an optimization method on the Riemann manifold to update the weights.
Specifically, first, a multi-layer impulse neural network is constructed, which is composed of an input layer, a hidden layer and an output layer. Neurons of the hidden layer all employ a pulsed activation function.
Using impulse neurons as the basic unit in the network, the input-output relationship of one impulse neuron can be expressed in terms of u(t), where u(t) is the input of the neuron, w_i is the weight of the neuron, x_i(t) is the output of the previous-layer neuron, b is the bias, and N is the dimension of the input. A is an attention weight matrix, composed of attention weights.
Specifically, the attention weight A_ij is calculated by the function fc(·), where s(Fx_i, Fx_j) is a function computing the similarity of the two features Fx_i and Fx_j. s(Fx_i, Fx_j) is defined in the following form:
s(Fx_i, Fx_j) = θ^T [Fx_i; Fx_j]
where θ is a parameter vector to be learned and [Fx_i; Fx_j] is the concatenation of the two features.
Further, the output y(t) of each neuron is a pulse function, which may take the form:

y(t) = 1 if u(t) ≥ θ, otherwise y(t) = 0

where θ is the threshold of the neuron.
To evaluate the performance of the model, a first loss function L_origin is defined, which can be expressed as:

L_origin = −Σ_{i=1}^{M} t_i log(p_i)

where t_i is the real label, p_i is the predictive probability of the model, and M is the number of categories.
Further, an entropy function H(·) is defined to measure the uneven distribution of the attention weights:

H(A) = −Σ_{i,j} A_ij log(A_ij)

The entropy function term is added to the loss function, and the parameters are updated:
L = L_origin + α_d × H(A)
where L_origin is the first loss function and α_d is a manually preset hyper-parameter.
Further, for optimization on the Riemann manifold, the Riemann gradient must be calculated. At a point w on the manifold M, the Riemann gradient grad_w L is the unique vector satisfying:

g_w(grad_w L, v) = DL(w)[v]  for all v ∈ T_w M

where g_w is the metric at point w on manifold M, T_w M is the tangent space of manifold M at point w, and DL(w)[v] is the derivative of the function L in the direction v at point w.
Further, an inner product is defined on the Riemann manifold to calculate the Riemann gradient and the Riemann Hessian matrix, and gradient descent is performed on the Riemann manifold. The weight update formula is:

w_{t+1} = Exp_{w_t}(−η grad_{w_t} L)

where Exp_{w_t} is the exponential map starting from the point w_t, grad_{w_t} L is the Riemann gradient at w_t, and η is the learning rate.
Based on this, the following steps are repeatedly performed until the stop condition is satisfied:
1. forward propagation: the input data is fed into the neural network and the output of each layer is calculated.
2. Calculating loss: the loss is calculated using a defined loss function.
3. Back propagation: the gradient of each layer weight is calculated from the loss function.
4. Updating weights: the weights are updated based on the gradient descent of the Riemann manifold.
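One Riemannian gradient-descent step can be illustrated concretely on the unit sphere (an assumed example manifold; the patent does not fix a particular M). The Euclidean gradient is projected onto the tangent space at w, and the sphere's exponential map retracts the step back onto the manifold, so the update never leaves the constraint surface.

```python
import numpy as np

def sphere_step(w, euclid_grad, eta=0.1):
    """One Riemannian gradient step on the unit sphere.
    Projects the gradient to the tangent space at w, then applies the
    exponential map Exp_w(v) = cos|v| w + sin|v| v/|v|."""
    g = euclid_grad - np.dot(euclid_grad, w) * w   # tangent-space projection
    step = -eta * g                                # descent direction, scaled by eta
    norm = np.linalg.norm(step)
    if norm < 1e-12:
        return w                                   # zero tangent step: stay put
    return np.cos(norm) * w + np.sin(norm) * (step / norm)

w = np.array([1.0, 0.0, 0.0])
w = sphere_step(w, euclid_grad=np.array([0.0, 1.0, 0.0]))
# the update stays on the manifold: ||w|| == 1 (up to rounding)
```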
S17: after the model is trained by each user side through the local data, the model is uploaded to a central server side, and federation fusion of the models is carried out.
In the federal learning distributed framework provided by the invention, a federal average algorithm is adopted to carry out cooperative training of the model. The federal averaging algorithm is an aggregation algorithm in federal learning that implements co-training of local models through global iterations.
Specifically, in each global iteration, assume the number of user sides is n, the total number of samples is D, the number of samples of user side k is D_k, and R^d is the d-dimensional real space. The objective function to be optimized can be expressed as:

min_{ω∈R^d} f(ω),  f(ω) = (1/D) Σ_{j=1}^{D} f_j(ω)

wherein:
f_j(ω) = l(u_j, v_j; ω)
where f(ω) is the loss function of the federal learning model, f_j(ω) is the prediction loss of the model parameters ω on the j-th sample data (u_j, v_j), and l(·) is the loss function.
For user side k, its loss function F_k(ω) is defined as:

F_k(ω) = (1/D_k) Σ_{j∈p_k} f_j(ω)

where p_k is the index set of the data of the k-th user side. Thus, the loss function f(ω) of the federal learning model can be expressed as:

f(ω) = Σ_{k=1}^{n} (D_k / D) F_k(ω)
Let the gradient at user side k be g_k = ∇F_k(ω_t). With learning rate η, the latest global model parameters ω_{t+1} obtained by updating after the t-th iteration are:

ω_{t+1} = ω_t − η Σ_{k=1}^{n} (D_k / D) g_k

where ω_t and ω_{t+1} are the global model parameters obtained after the (t−1)-th and t-th iteration updates respectively.
The local model parameter update of user side k can be expressed as:

ω_{t+1,k} = ω_t − η g_k

where ω_{t,k} and ω_{t+1,k} are the local model parameters of user side k obtained after the (t−1)-th and t-th iteration updates respectively.
Therefore, when federal learning is performed, the optimization objective of the user side is determined first, and then the loss function of each user side and the loss function of the federal learning model. Further, after the gradient g_k of user side k is obtained and the local model parameters ω_{t+1,k} of the t-th iteration update are computed, the global model parameter update can be expressed as:

ω_{t+1} = Σ_{k=1}^{n} (D_k / D) ω_{t+1,k}

After the t-th iteration, the local model parameters of user side k are set to ω_{t+1,k} = ω_{t+1}, and the iterative computation continues until training is completed.
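The server-side aggregation step of S17 can be sketched as a sample-weighted average of the local parameter vectors, in the spirit of federated averaging. The client parameter values and sample counts below are assumptions for illustration; local training is not shown.

```python
def fed_avg(local_params, sample_counts):
    """Sample-weighted average of client parameters.
    local_params: list of per-client parameter lists (same length each);
    sample_counts: D_k for each client."""
    total = sum(sample_counts)                     # D = sum of D_k
    dim = len(local_params[0])
    return [sum(p[j] * d for p, d in zip(local_params, sample_counts)) / total
            for j in range(dim)]

# two clients: client 0 holds 1 sample, client 1 holds 3 samples
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], sample_counts=[1, 3])
# weighted averages: (1*1 + 3*3)/4 = 2.5 and (2*1 + 4*3)/4 = 3.5
```

The client with more samples pulls the global parameters proportionally closer to its own local model, which is the D_k/D weighting in the formulas above.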
In the federal learning framework of the present invention, the criteria for model cessation training include reaching a preset number of training rounds or model convergence.
For the preset number of training rounds: a number of rounds can be set before training starts, and once that number is reached, model training stops.
In addition, when the decrease of the training and validation loss function values stops or becomes insignificant, or the F1-score of the model reaches a preset value, the model may be considered to have converged and model training stops.
Specifically, let the classifier model be fe(·; θ_e), where θ_e are the parameters of the model. Given a sample x_e to be classified, the model fe(·; θ_e) classifies it to obtain the classification result y_pred, which can be expressed as:
y_pred = fe(x_e; θ_e)
Further, the F1-score is used to evaluate the classification result. The F1-score is the harmonic mean of Precision and Recall, which can be expressed as:

F1 = 2 × Precision × Recall / (Precision + Recall)

wherein:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
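The F1 evaluation used as the convergence criterion can be sketched directly from these definitions; the example labels are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall over binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f1_score([1, 1, 0, 1], [1, 0, 0, 1])
# TP=2, FP=0, FN=1 -> precision=1.0, recall=2/3, F1=0.8
```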
S2, model reasoning is carried out through a trained model of the central server side so as to form risk intrusion classification of the Internet of things equipment;
s3, identifying malicious attacks through a risk intrusion classification scheme, and adjusting a defense strategy in real time.
Specifically, when a risk intrusion is identified, dynamic defense mechanisms are automatically triggered, such as restricting traffic from malicious sources, rejecting requests from malicious sources, and so forth.
Further, the intrusion type is identified according to the risk intrusion classification method, and targeted protection measures are adopted. For example, for the identified DDoS attack, measures such as traffic cleaning, firewall configuration, etc. are taken to protect.
Further, the intrusion behavior is analyzed by the risk intrusion classification method, for example, information such as attack characteristics, attack sources and the like is extracted. Based on the analysis results, the response is performed automatically or manually, such as adjusting security policies, tracking attack sources, etc.
Example two
FIG. 3 shows a schematic flow diagram of the distributed-AI-based security protection apparatus for internet of things equipment. The invention also provides a security protection apparatus for internet of things equipment based on distributed AI, which comprises:
and a data acquisition module: the method is used for collecting data in the Internet of things;
the execution module: training a model on the global dataset;
and a data processing module: the method is used for processing the entered data, identifying malicious attacks and adjusting the defending strategy in real time.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting.

Claims (9)

1. The internet of things equipment safety protection method based on the distributed AI is characterized by comprising the following specific steps:
s1: training a model on a global data set by a training method of the model through a federal learning distributed framework;
s2, model reasoning is carried out through a trained model of the central server side so as to form risk intrusion classification of the Internet of things equipment;
S3, identifying malicious attacks through a risk intrusion classification scheme, and adjusting a defense strategy in real time.
2. The distributed AI-based security protection method of an internet of things device of claim 1, wherein the federal learning distributed framework comprises a central server and a plurality of users;
the method for training the model on the global data set by the federal learning distributed framework comprises the following steps:
s11: data acquisition is carried out through the Internet of things, and each user terminal marks the acquired data;
s12: carrying out vectorization processing on the data, and converting each data attribute into a digital value;
s13: carrying out data preprocessing on the vectorized data;
s14: expanding the preprocessed data through an ESAAE algorithm;
s15: extracting features of the expanded training data through a differential motion optimization algorithm;
s16: training a classifier by utilizing the training data after feature extraction through an RSNN algorithm;
s17: uploading parameters of the model after the user terminals are trained by using the local data to a central server terminal, and judging whether the model is iterated; if yes, stopping training, if not, entering S18;
s18: and aggregating the received parameters to form a global parameter model, updating the global parameter to the client, judging whether the global model is converged, if so, stopping training, and if not, returning to the step S17.
3. The internet of things device security protection method based on the distributed AI of claim 2, wherein: the preprocessing method of the data in the S13 comprises the following steps:
firstly, carrying out normalization processing on numerical data, wherein the normalization formula is:

x̃_ij = (x_ij − min_j) / (max_j − min_j)

wherein the j-th numerical attribute of the i-th data point in the vectorized data set D is x_ij, 1 ≤ j ≤ m, n is the size of the data set, m is the number of numerical attributes, min_j and max_j are the minimum and maximum values of the j-th numerical attribute, and x̃_ij is the normalized value;
next, the missing values in the data are processed using the following formula:

x_ij = μ_j  (if x_ij is missing)

wherein μ_j is the average or median of the j-th attribute.
4. The internet of things device security protection method based on the distributed AI of claim 2, wherein: the ESAAE algorithm in S14 includes:
self-encoder: the self-encoder is a core component of the ESAAE and is used for compressing and reconstructing data; the structure comprises an encoder and a decoder, wherein the encoder maps input data x to a hidden layer h, and the decoder maps the hidden layer h back to the reconstructed input data x';
A discriminator: the discriminator is used for distinguishing real data from generated data and comprises a plurality of hidden layers and an output layer, wherein the output of the discriminator is a probability value which represents the probability that the input data is the real data;
A generator: the generator is for generating new data samples that take as input the hidden layer activation state from the encoder and generate new data through a series of deconvolution operations;
the ESAAE algorithm flow is as follows:
S141: let W_in be the input weight matrix and W_res the internal weight matrix of the network; the size of the input matrix is N×M, where N is the number of nodes of the network and M is the input dimension, and the size of the residual matrix is N×N; the parameters of the ESAAE, including the weights and biases of the self-encoder and the discriminator, are initialized;
s142: by encoding the input data x into the hidden layer h from the encoder section, the activation state of the hidden layer is updated by the following formula:
h_{t+1} = (1 − α) h_t + tanh(W_in x_t + W_res h_t)
where α is the update rate, and W_in and W_res are respectively the input weight matrix and the residual weight matrix; h_{t+1} is the activation state of the residual layer at the (t+1)-th iteration, and h_t is the activation state of the residual layer at the t-th iteration;
then, the hidden layer h is decoded back to the reconstructed data x' by the decoding part of the self-encoder, which can be expressed as:
x' = σ(W_dec h + b_dec)
where σ(·) is the Sigmoid activation function, W_dec is the decoding weight matrix, and b_dec is the decoding bias;
a loss function L_AE of the self-encoder is calculated; the goal of the self-encoder is to minimize the reconstruction error, i.e. the difference between the original data x and the decoded data x', expressed as the mean square error (MSE):

L_AE = (1/n) Σ_{i=1}^{n} (x_i − x'_i)²

where n is the number of samples;
using the generator portion, taking as input the hidden layer activation state of the self-encoder and generating new data by a series of deconvolution operations; the goal of the generator is to generate dummy data that is similar to the real data distribution;
the loss function of the generator consists of two parts: a reconstruction loss and an adversarial loss, wherein the reconstruction loss is L_AE and the adversarial loss is defined as:

L_adv = −E_{h∼p(h)}[log D(G(h))]

wherein h is the hidden layer output of the self-encoder, G(h) is the new sample generated by the generator using the hidden layer output of the self-encoder, D(·) is the discriminator function, and h∼p(h) denotes that h obeys the distribution p(h);
the real data and the generated data are classified using the discriminator; the discriminator accepts the real data and the generated data as inputs and outputs a probability value, calculated using the following formula:

pa = σ(W_disc x + b_disc)

wherein pa is the probability value output by the discriminator for input data x, W_disc is the discriminator weight matrix, and b_disc is the discriminator bias;
a loss function L_D of the discriminator is calculated; the aim of the discriminator is to distinguish between real data and generated data, expressed with the cross entropy (CE) loss function:

L_D = −[y log(p) + (1 − y) log(1 − p)]

wherein y is the real label and p is the prediction probability of the discriminator;
using the gradient descent method, the loss functions of the ESAAE are optimized simultaneously; the total loss can be expressed as:

L = α L_AE + β L_adv + λ L_D

wherein α, β and λ are all preset hyper-parameters;
in each iteration, the self-encoder and the arbiter counter-propagate the loss function and update the parameters at the same time;
S143: step S142 is iterated until the number of iterations reaches the preset maximum number of iterations; iteration then stops, and the samples generated by the generator are added to the preprocessed data set of the current user side to serve as the expanded training data set.
5. The internet of things device security protection method based on the distributed AI of claim 2, wherein: the differential motion optimization algorithm in S15 includes:
each weight w[i] in the neural network is optimized with an initial velocity v[i] and an acceleration a_cs[i]; a_cs[i] is calculated from the difference between the error of the current network output and the previous error, and can be expressed as:
a_cs[i] = ka * (error_current − error_previous)
where ka is a hyper-parameter representing the gain of the acceleration; error_current is the error of the current network and error_previous is the error of the network at the last iteration;
the error of the current network, error_current, can be expressed as:

error_current = (1/N) Σ_{k=1}^{N} (ŷ_k − y_k)²

where N is the number of samples, ŷ_k is the network output of the k-th sample, and y_k is the true label of the k-th sample;
further, calculating a kinetic energy correction factor; for each weight w[i] and bias b[i], its kinetic energy K[i] is calculated, which can be expressed as:
K[i] = 0.5 * m[i] * v[i]^2
wherein m[i] is the scale of the weight or bias and v[i] is its update velocity; a function f(K[i]) is defined to calculate the kinetic energy correction factor, which can be expressed as:
wherein e is the base of the natural logarithm;
further, adjusting the acceleration; using the kinetic energy correction factor to adjust acceleration can be expressed as:
a[i] = a_cs[i] * f(K[i])
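The kinetic energy correction can be sketched as follows; the exact form of f is not reproduced in the text, so an exponential decay f(K) = e^{-K} is ASSUMED here (a bounded, monotone damping consistent with the mention of the natural-logarithm base e):

```python
import numpy as np

def kinetic_energy(m, v):
    # K[i] = 0.5 * m[i] * v[i]^2
    return 0.5 * m * v ** 2

def kinetic_correction(K):
    # ASSUMED form: f(K) = e^{-K}, so fast-moving parameters get damped.
    return np.exp(-K)

def adjusted_acceleration(a_cs, m, v):
    # a[i] = a_cs[i] * f(K[i])
    return a_cs * kinetic_correction(kinetic_energy(m, v))
```

A parameter with zero kinetic energy passes its acceleration through unchanged (f(0) = 1).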
further, updating the velocity and the weight; for each weight w[i], the velocity v[i] is first updated using the adjusted acceleration a[i], which can be expressed as:
v[i]_new = [v[i]_old * h(adpt[i]) * f(IG) + a[i] * dt] * μ
wherein dt is a small time interval that is continuously adjusted by the dynamic learning rate; h(adpt[i]) is a parameter adjustment factor, which uses the historical update information of the parameter to optimize the adjustment process, improving the performance and convergence rate of the algorithm; f(IG) is an information gain adjustment factor, which also takes the historical update information of the parameter into account, making the algorithm more flexible and efficient; μ is an adaptive inertia factor;
the adaptive inertia factor μ is adjusted in each iteration; the adaptive adjustment can be expressed as:
wherein μ_t is the adaptive inertia factor at the t-th iteration, μ_{t-1} is the adaptive inertia factor at step t-1, β_3 is a hyper-parameter, and |g_t| is the absolute value of the gradient at step t;
further, the parameter adjustment factor is calculated as follows:
calculating parameter adaptability; the parameter adaptability is defined as the average update velocity of the parameter over a time window, computed for each weight and bias with a sliding window; let v[i]_{t-1}, v[i]_{t-2}, ..., v[i]_{t-n} be the velocities of weight w[i] or bias b[i] over the last n updates; its adaptability adpt[i] is calculated as:
adpt[i] = (1/n) * (v[i]_{t-1} + v[i]_{t-2} + ... + v[i]_{t-n})
wherein n is the size of the sliding window;
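The sliding-window adaptability, defined as the mean of the last n update velocities, can be sketched as:

```python
def adaptability(velocity_history, n):
    # adpt[i]: mean of the parameter's last n update velocities.
    # velocity_history is ordered oldest-to-newest; take the trailing window.
    window = list(velocity_history)[-n:]
    return sum(window) / len(window)
```

For example, with history [0.1, 0.2, 0.3, 0.4] and n = 2, the adaptability is (0.3 + 0.4) / 2 = 0.35.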
adaptive parameter adjustment; the update velocity of the weights and biases is adjusted according to the parameter adaptability, and a function h(adpt[i]) is defined to calculate the parameter adjustment factor:
wherein β is a hyper-parameter representing the adaptive weight;
further, the information gain adjustment factor f(IG) can be calculated as follows:
calculating the information gain; the information gain is defined as a measure of the difference between the current weight or bias value and its historical value, computed as a change of entropy; let the weight or bias at times t and t-1 be w_t and w_{t-1} respectively; their information gain IG is calculated as:
IG = -p(w_t) · log(p(w_t)) + p(w_{t-1}) · log(p(w_{t-1}))
wherein p(w) is the probability distribution of the weight or bias value w;
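The information gain formula can be sketched directly; in practice the probabilities p(w_t) and p(w_{t-1}) would be estimated, e.g. from a histogram over the parameter's history (an assumption, since the text does not say how p(w) is obtained):

```python
import math

def information_gain(p_t, p_t_minus_1):
    # IG = -p(w_t)*log(p(w_t)) + p(w_{t-1})*log(p(w_{t-1})):
    # the change in the entropy contribution of the parameter's value.
    return -p_t * math.log(p_t) + p_t_minus_1 * math.log(p_t_minus_1)
```

When the two probabilities are equal the two terms cancel and the gain is zero.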
adaptive information gain adjustment; the update velocity of the weights and biases is adjusted according to the information gain, and a function f(IG) is defined to calculate the information gain adjustment factor:
wherein γ is a hyper-parameter representing the weight of the information gain;
further, the adjustment of dt can be expressed as:
defining the error curvature as the second derivative of the error with respect to time; the error curvature is estimated from the errors at three consecutive time points; let error_previous, error_current and error_next be the previous, current and next errors respectively; the estimated error curvature is:
curvature = (error_next - 2 * error_current + error_previous) / dt^2
wherein dt is the time interval;
adaptive learning rate adjustment; the learning rate is adjusted according to the error curvature, and a function g(curvature) is defined to calculate the learning rate adjustment factor:
wherein α is a hyper-parameter representing the weight of the curvature;
updating the learning rate dt with the learning rate adjustment factor can be expressed as:
dt_new = dt_old * g(curvature)
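The curvature-based step adjustment can be sketched as follows; the exact form of g is not reproduced in the text, so g(c) = 1 / (1 + α·|c|) is ASSUMED here (it shrinks the step where the error surface bends sharply and leaves it unchanged where the error is locally linear):

```python
def error_curvature(e_prev, e_cur, e_next, dt):
    # Second finite difference over three consecutive errors:
    # curvature ~= (e_next - 2*e_cur + e_prev) / dt^2
    return (e_next - 2.0 * e_cur + e_prev) / (dt * dt)

def adjusted_dt(dt_old, curvature, alpha=0.5):
    # ASSUMED form: g(c) = 1 / (1 + alpha*|c|); dt_new = dt_old * g(curvature).
    g = 1.0 / (1.0 + alpha * abs(curvature))
    return dt_old * g
```

Zero curvature leaves dt unchanged; large curvature shrinks it toward zero.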
further, the new velocity v[i]_new is used to update the weight w[i], which can be expressed as:
w[i]_new = w[i]_old + v[i]_new * dt
wherein dt is a hyper-parameter that can be regarded as a learning rate, and ka is the acceleration gain that controls the speed of parameter updating;
further, the bias is updated similarly to the weights, using acceleration and velocity to update b[i]:
for each bias, the acceleration is first calculated, which can be expressed as:
a_b[i] = a_cs[i] * f(K[i])
further, the velocity update can be expressed as:
v_b[i]_new = v_b[i]_old + a_b[i] * dt
further, the bias update can be expressed as:
b[i]_new = b[i]_old + v_b[i]_new * dt
in each training iteration, the weights and biases are updated according to the kinetic energy correction factor and the acceleration is adjusted dynamically, which better simulates the motion of a physical object and makes the algorithm more flexible and efficient.
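One full differential-motion update for a single parameter can be sketched as follows (the adjustment factors h(adpt[i]) and f(IG) default to 1.0 as neutral placeholders, since only their roles, not their closed forms, are fixed by the text):

```python
def update_parameter(w_old, v_old, a_adj, dt, mu=0.9, h_adpt=1.0, f_ig=1.0):
    """One velocity/position step for a weight (biases follow the same
    pattern): v_new = [v_old*h*f + a*dt]*mu, then w_new = w_old + v_new*dt."""
    v_new = (v_old * h_adpt * f_ig + a_adj * dt) * mu
    w_new = w_old + v_new * dt
    return w_new, v_new
```

With mu = 1 and zero initial velocity, an acceleration of 1 and dt = 0.1 give v_new = 0.1 and move the weight by 0.01.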
6. The internet of things device security protection method based on the distributed AI of claim 2, wherein: the RSNN algorithm in S16 includes:
firstly, constructing a multi-layer spiking neural network consisting of an input layer, a hidden layer and an output layer; the neurons of the hidden layer use pulse-type activation functions;
using spiking neurons as the basic unit of the network, the input-output relationship of one spiking neuron can be expressed as:
wherein u(t) is the input of the neuron, w_i is the weight of the neuron, x_i(t) is the output of the previous-layer neuron, b is the bias, and N is the dimension of the input; A is an attention weight matrix composed of attention weights;
specifically, the attention weight A_ij is calculated by the function fc(), which can be expressed as:
wherein s(Fx_i, Fx_j) is a function that calculates the similarity of the two features Fx_i and Fx_j; s(Fx_i, Fx_j) is defined in the following form:
s(Fx_i, Fx_j) = θ^T [Fx_i; Fx_j]
wherein θ is a parameter vector to be learned and [Fx_i; Fx_j] is the concatenation of the two features;
further, the output y(t) of each neuron is a pulse function, which may take the form:
y(t) = 1 if u(t) ≥ θ, otherwise y(t) = 0
wherein θ is the threshold of the neuron;
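A single attention-weighted spiking neuron can be sketched as follows (treating the attention weights as a per-input vector is an assumption made for a single neuron; the text defines A as a matrix of pairwise weights):

```python
import numpy as np

def spiking_neuron(x, w, A, b, theta=0.5):
    """Membrane input u folds in attention weights, synaptic weights and
    bias; the output is a binary spike against threshold theta."""
    u = float(np.sum(np.asarray(A) * np.asarray(w) * np.asarray(x)) + b)
    return 1.0 if u >= theta else 0.0
```

With inputs [1, 1], weights [0.5, 0.5], unit attention and zero bias, u = 1.0 exceeds a threshold of 0.5 and the neuron fires.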
to evaluate the performance of the model, a first loss function L_origin is defined, which can be expressed as:
L_origin = -Σ_{i=1}^{M} t_i · log(p_i)
wherein t_i is the real label, p_i is the predictive probability of the model, and M is the number of categories;
further, an entropy function H() is defined to measure the uneven distribution of the attention weights, which can be expressed as:
H(A) = -Σ_{i,j} A_ij · log(A_ij)
the entropy term is added to the loss function and the parameters are updated:
L = L_origin + α_d × H(A)
wherein L_origin is the first loss function and α_d is a manually preset hyper-parameter;
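The entropy-regularized loss L = L_origin + α_d·H(A) can be sketched as follows (a cross-entropy reading of L_origin is assumed, matching the label/probability symbols above):

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    # H(A) = -sum_ij A_ij * log(A_ij); clip to avoid log(0).
    A = np.clip(np.asarray(A, dtype=float), eps, 1.0)
    return float(-np.sum(A * np.log(A)))

def total_loss(t, p, A, alpha_d=0.01, eps=1e-12):
    # L = L_origin + alpha_d * H(A), with L_origin the cross-entropy
    # over M categories: -sum_i t_i * log(p_i).
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    l_origin = float(-np.sum(np.asarray(t, dtype=float) * np.log(p)))
    return l_origin + alpha_d * attention_entropy(A)
```

A one-hot attention matrix has zero entropy, so the regularizer vanishes and only the cross-entropy remains.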
further, optimization on a Riemannian manifold requires computing the Riemannian gradient; at a point w on the manifold M, the Riemannian gradient grad_w L is the unique vector satisfying:
g_w(grad_w L, v) = DL(w)[v] for all v ∈ T_w M
wherein g_w is the metric at the point w on the manifold M, T_w M is the tangent space of the manifold M at w, and DL(w)[v] is the derivative of the function L at w along the direction v;
further, an inner product on the Riemannian manifold is defined to compute the Riemannian gradient and the Riemannian Hessian matrix, and gradient descent is performed on the manifold; the weight update formula is:
w_{t+1} = Exp_{w_t}(-η · grad_{w_t} L)
wherein Exp_{w_t} is the exponential map starting from the point w_t, grad_{w_t} L is the Riemannian gradient at w_t, and η is the learning rate;
based on this, the following steps are repeatedly performed until the stop condition is satisfied:
forward propagation: sending the input data into a neural network, and calculating the output of each layer;
calculating loss: calculating a loss using the defined loss function;
back propagation: calculating the gradient of each layer of weight according to the loss function;
updating weights: the weights are updated by gradient descent on the Riemannian manifold.
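The manifold update can be made concrete on the unit sphere, an assumed example manifold (the patent does not specify which manifold is used): the Riemannian gradient is the Euclidean gradient projected onto the tangent space, and the exponential map follows a great circle.

```python
import numpy as np

def riemannian_step_sphere(w, egrad, eta):
    """One step of w_{t+1} = Exp_{w_t}(-eta * grad L) on the unit sphere."""
    rgrad = egrad - np.dot(egrad, w) * w          # project onto tangent space T_w M
    step = -eta * rgrad
    norm = np.linalg.norm(step)
    if norm < 1e-12:
        return w                                   # zero tangent step: stay put
    # Exponential map on the sphere: Exp_w(v) = cos(|v|) w + sin(|v|) v/|v|
    return np.cos(norm) * w + np.sin(norm) * (step / norm)
```

The iterate stays exactly on the manifold: the result always has unit norm, and a gradient parallel to w (no tangent component) leaves w unchanged.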
7. The internet of things device security protection method based on the distributed AI of claim 2, wherein the method for determining whether the global model converges in S18 is:
before training begins, a number of training rounds can be set; once this number of rounds is reached, model training stops.
8. The internet of things device security protection method based on the distributed AI of claim 2, wherein the method for determining whether the global model converges in S18 is:
when the training and validation loss values stop decreasing or change only slightly, the model can be considered converged and model training is stopped;
let the classifier model be fe(·; θ_e), wherein θ_e are the parameters of the model; given a sample x_e to be classified, the model fe(·; θ_e) classifies it to obtain the classification result y_pred, which can be expressed as:
y_pred = fe(x_e; θ_e)
further, the classification result is evaluated with the F1-score, the harmonic mean of precision and recall, which can be expressed as:
F1 = 2 · Precision · Recall / (Precision + Recall)
wherein:
Precision = TP / (TP + FP), Recall = TP / (TP + FN)
wherein TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
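The F1 evaluation can be sketched directly from the confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    # Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1 = 2PR/(P+R).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision = recall = 0.8, hence F1 = 0.8.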
9. The internet of things equipment safety protection device based on the distributed AI, applied to the internet of things equipment safety protection method based on the distributed AI of any one of claims 1 to 8, characterized by comprising:
a data acquisition module: used for collecting data in the internet of things;
an execution module: used for training a model on the global data set;
a data processing module: used for processing incoming data, identifying malicious attacks and adjusting the defense strategy in real time.
CN202311343739.8A 2023-10-17 2023-10-17 Internet of things equipment safety protection method and device based on distributed AI Pending CN117336071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311343739.8A CN117336071A (en) 2023-10-17 2023-10-17 Internet of things equipment safety protection method and device based on distributed AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311343739.8A CN117336071A (en) 2023-10-17 2023-10-17 Internet of things equipment safety protection method and device based on distributed AI

Publications (1)

Publication Number Publication Date
CN117336071A true CN117336071A (en) 2024-01-02

Family

ID=89292902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311343739.8A Pending CN117336071A (en) 2023-10-17 2023-10-17 Internet of things equipment safety protection method and device based on distributed AI

Country Status (1)

Country Link
CN (1) CN117336071A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133146A (en) * 2024-05-10 2024-06-04 国网江西省电力有限公司南昌供电分公司 Artificial intelligence-based risk intrusion recognition method for Internet of things


Similar Documents

Publication Publication Date Title
CN112398779B (en) Network traffic data analysis method and system
US11537898B2 (en) Generative structure-property inverse computational co-design of materials
CN111783442A (en) Intrusion detection method, device, server and storage medium
CN110460605B (en) Abnormal network flow detection method based on automatic coding
CN116635866A (en) Method and system for mining minority class data samples to train a neural network
Zhang et al. Artificial intelligence and its applications
CN111600919B (en) Method and device for constructing intelligent network application protection system model
CN107292166B (en) Intrusion detection method based on CFA algorithm and BP neural network
CN117336071A (en) Internet of things equipment safety protection method and device based on distributed AI
Wang et al. Res-TranBiLSTM: An intelligent approach for intrusion detection in the Internet of Things
CN109492816B (en) Coal and gas outburst dynamic prediction method based on hybrid intelligence
CN116781346A (en) Convolution two-way long-term and short-term memory network intrusion detection method based on data enhancement
CN115277189B (en) Unsupervised intrusion flow detection and identification method based on generation type countermeasure network
CN111431849A (en) Network intrusion detection method and device
CN116996272A (en) Network security situation prediction method based on improved sparrow search algorithm
CN114863226A (en) Network physical system intrusion detection method
CN114565106A (en) Defense method for federal learning poisoning attack based on isolated forest
CN115758337A (en) Back door real-time monitoring method based on timing diagram convolutional network, electronic equipment and medium
CN116796257A (en) System and method for improving robustness of a pre-training system with randomization and sample rejection
Gurung et al. Decentralized quantum federated learning for metaverse: Analysis, design and implementation
Anuar et al. Hybrid artificial neural network with artificial bee colony algorithm for crime classification
CN117375983A (en) Power grid false data injection identification method based on improved CNN-LSTM
CN115438753B (en) Method for measuring security of federal learning protocol data based on generation
Wang et al. An efficient intrusion detection model combined bidirectional gated recurrent units with attention mechanism
CN110071845B (en) Method and device for classifying unknown applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination