CN112115469A

CN112115469A - Edge intelligent moving target defense method based on Bayes-Stackelberg game

Info

Publication number: CN112115469A
Application number: CN202010966915.3A
Authority: CN
Inventors: 钱亚冠; 关晓惠; 王滨; 陶祥兴; 周武杰; 云本胜; 陈晓霞; 李蔚; 楼琼; 吴淑慧
Original assignee: Zhejiang Lover Health Science and Technology Development Co Ltd; Zhejiang University of Water Resources and Electric Power
Current assignee: Zhejiang Lover Health Science and Technology Development Co Ltd; Zhejiang University of Water Resources and Electric Power
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2020-12-22
Anticipated expiration: 2040-09-15
Also published as: CN112115469B

Abstract

The invention discloses an edge intelligence moving target defense method based on a Bayes-Stackelberg game, and provides a dynamic defense mechanism called edge intelligence moving target defense (EI-MTD). First, member models that are small in scale and suitable for deployment on edge nodes are obtained from a complex teacher model in a cloud data center through differential knowledge distillation. Then, a Bayes-Stackelberg game strategy is used to schedule the member models dynamically, so that an attacker cannot determine which target model executes the classification task. The defense mechanism effectively prevents an attacker from selecting an optimal surrogate model for crafting adversarial examples, thereby blocking black-box attacks. Experiments on the ILSVRC2012 image data set show that the EI-MTD provided by the invention can effectively protect edge intelligence against malicious black-box attacks.

Description

Edge intelligent moving target defense method based on Bayes-Stackelberg game

Technical Field

The invention relates to a security technology of edge intelligent computation, and provides an edge intelligent moving target defense method based on a Bayes-Stackelberg game.

Background

Artificial intelligence based on deep learning has been successfully applied in various fields, from facial recognition, natural language processing, to computer vision. With the vigorous development of intelligent technology, people's life has changed greatly, and people increasingly rely on convenient services provided by intelligent life and hope to enjoy intelligent services anytime and anywhere. Over the past few years, the theory of edge computing has moved towards applications, and various applications have been developed to improve our lives. The maturation of deep learning techniques and edge computing systems and the growing demand for intelligent life has facilitated the development and implementation of Edge Intelligence (EI). Current EI implementations are based on deep learning models, i.e., Deep Neural Networks (DNNs), which are deployed to devices at the edge of the network (e.g., smart cameras of a surveillance system) to achieve real-time performance for applications such as object recognition and anomaly detection.

Currently, the security of edge intelligence is a concern. Past work has mostly focused on data privacy in edge intelligence, and has not paid enough attention to adversarial example attacks. Existing work has shown that DNNs are extremely vulnerable to adversarial examples. An adversarial example is an input image to which a carefully designed small perturbation has been added in order to fool the deep neural network. Adversarial examples have a special property called transferability: an adversarial example generated for one particular model can often fool another model as well. Transferability is higher between models with similar architecture, low model capacity and high test accuracy. In theory, this property lets an attacker craft adversarial examples against a target model on a local surrogate model without knowing anything about the target model, which is called a black-box attack. In practice, an attacker can find a surrogate model that is closer to the target model by repeatedly querying the target model, so that a higher attack success rate is obtained and the effect approaches that of a white-box attack.

Because computing, storage and other resources are limited on edge nodes (including edge devices and edge servers), model compression is considered an effective way to reduce the size of the model. However, the robustness of a model is positively correlated with its size, so a compressed model on an edge node is more vulnerable to adversarial examples. In addition, most currently proposed defenses against adversarial examples require abundant GPU computing resources and are not suitable for edge nodes. Limited resources therefore restrict the application of edge intelligence in sensitive areas.

We summarize the security challenges facing edge intelligence as follows: (1) how to prevent attackers from finding the optimal surrogate model; (2) how to reduce the transferability of adversarial examples without compromising accuracy on normal samples; (3) how to defend against adversarial examples on resource-limited edge nodes.

Disclosure of Invention

The invention provides an edge intelligence moving target defense method to solve the above problems. For the first challenge, we change the static target model into a dynamic target model by scheduling the classification service randomly. Since attackers do not know which model actually serves them, they cannot estimate which candidate surrogate model is close to the target model. For the second challenge, we increase the difference between the models deployed on the edge nodes. We use the gradient of the loss function as the basis of the difference metric, since current attacks mainly use the gradient to craft adversarial examples. For the third challenge, we use transfer learning in the cloud data center to distill knowledge from a powerful large-capacity teacher model into small-capacity student models. The benefit of this approach is that classification knowledge and robustness are transferred while the size of the model is compressed.

The present invention integrates these solutions into a single defense framework, referred to as edge intelligence with moving target defense (EI-MTD). EI-MTD is constructed by the following steps: (1) a robust teacher model is obtained through adversarial training using the powerful GPUs of a cloud data center; (2) the robust knowledge of the teacher model is transferred to the student models through differential knowledge distillation, which also yields diversity; and (3) the student models are switched using a Bayes-Stackelberg game strategy to balance accuracy and security.

The invention realizes the purpose through the following technical scheme:

The invention provides an EI-MTD system comprising three key technologies: adversarial training, differential knowledge distillation, and dynamic scheduling of the service model. Adversarial training is used to obtain a robust teacher model in the cloud data center. Transfer learning is then used to extract robust knowledge from the teacher model and apply it to small-scale student models suited to limited resources. At the same time, a differential regularization term is added to obtain diversity among the distilled models, which effectively suppresses the transferability of adversarial examples. These student models, also referred to as member models in the moving target setting, are then used in a dynamic scheduling scheme that serves users. Thanks to the diversity obtained from differential knowledge distillation, dynamic scheduling makes it hard for attackers to find the best surrogate model, as shown on the right side of FIG. 1.

The invention comprises the following steps:

S1: Adversarial training of the teacher model. Suppose that the cloud data center holds a training data set of labelled samples $(x_i, y_i)$ and a teacher model $F_t(\theta_t)$. A 101-layer ResNet-101 network is used as the teacher model. Adversarial training is carried out in the cloud data center with FGSM (fast gradient sign method) adversarial examples, combined with the "FAST" adversarial training method to accelerate the process. Prior work has shown that adversarial training gives larger-capacity networks greater robustness.

S2: Differential knowledge distillation of the student models. First, the soft label $\tilde{y}_i$ of each sample $x_i$ is obtained from the teacher model $F_t(\theta_t)$ at a suitable distillation temperature $T$, and a new training data set of pairs $(x_i, \tilde{y}_i)$ is created. To obtain diversity among the student models, a new loss function with the regularization term $CS_{coherence}$ is defined:

$$L = \frac{1}{K}\sum_{i=1}^{K} T^2 J_i + \lambda \cdot CS_{coherence}$$

where $J_i$ is the loss of the $i$-th student model on the soft labels and $K$ is the number of student models. All student models $F_s(\theta^{(1)}), \ldots, F_s(\theta^{(K)})$ are trained simultaneously to minimize the common loss function $L$. Note that in the present invention the student model, the member model and the target model refer to the same object; it is called the student model in knowledge distillation and the member model in dynamic scheduling.

S3: Dynamic service scheduling of the member models. After differential knowledge distillation, the student models are deployed to the edge nodes, one model per node. Here, the edge nodes include edge devices and edge servers. One edge server is designated as the service scheduling controller, and all member models and the nodes hosting them are registered with it. When a user (possibly an attacker) submits an image for classification through an edge device (e.g., a smartphone), the edge device first uploads the service request to the scheduling controller rather than processing it directly on its local model. The scheduling controller then selects one edge node through a Bayes-Stackelberg game to execute the classification task. The whole process takes place transparently, so the attacker cannot know which edge node finally provides the service.

Further, as described in step S3, the diversity of the models plays a key role in the effectiveness of dynamic scheduling. Because adversarial attacks use the gradient with respect to the input as the perturbation direction, gradient alignment is taken as the measure of diversity.

Suppose there are two member models $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)}) \in \Omega$ and a surrogate model $F_a \in U$ selected by the attacker, and let $\nabla_x J_1$ and $\nabla_x J_2$ denote the gradients of the loss functions of $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)})$ with respect to the sample $x$. If the angle between $\nabla_x J_1$ and $\nabla_x J_2$ is small enough, an adversarial example $x_{adv}$ that causes $F_s(\theta^{(1)})$ to misclassify can also cause $F_s(\theta^{(2)})$ to misclassify. The difference between $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)})$ can therefore be measured by the angle between $\nabla_x J_1$ and $\nabla_x J_2$. We use the cosine similarity (CS) to express the degree of alignment of $\nabla_x J_1$ and $\nabla_x J_2$:

$$CS(\nabla_x J_1, \nabla_x J_2) = \frac{\langle \nabla_x J_1, \nabla_x J_2 \rangle}{\|\nabla_x J_1\|_2 \, \|\nabla_x J_2\|_2}$$

where $\langle \nabla_x J_1, \nabla_x J_2 \rangle$ is the inner product of $\nabla_x J_1$ and $\nabla_x J_2$. If $CS(\nabla_x J_1, \nabla_x J_2) = -1$, the two gradients point in opposite directions, meaning that an $x_{adv}$ that causes $F_s(\theta^{(1)})$ to misclassify cannot cause $F_s(\theta^{(2)})$ to misclassify.
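The alignment measure can be illustrated with a minimal sketch in plain Python. The gradient vectors below are hypothetical toy values, not gradients of real member models, and the function name `cosine_similarity` is chosen here for illustration only.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero gradient")
    return dot / (norm1 * norm2)

# Hypothetical input gradients of member models at the same sample x.
grad_model_1 = [0.5, -1.0, 2.0]
grad_aligned = [1.0, -2.0, 4.0]    # same direction as grad_model_1: CS close to +1
grad_opposed = [-0.5, 1.0, -2.0]   # opposite direction: CS close to -1

print(cosine_similarity(grad_model_1, grad_aligned))
print(cosine_similarity(grad_model_1, grad_opposed))
```

A value near +1 indicates that a perturbation crafted against one model is likely to transfer to the other, while a value near -1 indicates the opposite.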

Further, the cosine similarity is applied to the training process of the student models in step S2 to obtain a set of member models with greater diversity. Since the cosine similarity is calculated from two gradients, it is generalized to K models by defining the maximum pairwise cosine similarity as the EI-MTD diversity measure:

$$CS_{coherence} = \max_{a, b \in \{1, \ldots, K\},\, a \neq b} CS(\nabla_x J_a, \nabla_x J_b)$$

where $J_a$ and $J_b$ denote the loss functions of the student models $F_s(\theta^{(a)})$ and $F_s(\theta^{(b)})$, $\theta^{(a)}$ and $\theta^{(b)}$ denote the parameters of these student models, and $\tilde{y}$ is the soft label that $x$ obtains from the teacher model. Because $CS_{coherence}$ is a non-smooth function, gradient descent optimization cannot be applied to it directly, so the LogSumExp function is used to approximate $CS_{coherence}$:

$$CS_{coherence} \approx \log \sum_{1 \le a < b \le K} \exp\big(CS(\nabla_x J_a, \nabla_x J_b)\big)$$

The student models are distilled from the teacher model in the cloud data center, and diversity among the student models must be ensured during distillation. The regularization term $CS_{coherence}$ is therefore added to the knowledge distillation process, which defines a new distillation loss function:

$$L = \frac{1}{K}\sum_{i=1}^{K} T^2 J_i + \lambda \cdot CS_{coherence}$$

where $J_i$ is the loss of the $i$-th student model on the soft labels and λ is the regularization coefficient that controls the importance of $CS_{coherence}$ during training. So that the student models fully learn the adversarial knowledge of the teacher model, β is set to 1, i.e. only the soft labels are used to train the student models. The differential knowledge distillation procedure is given as Algorithm 1.
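As an illustration of the diversity term, the following sketch computes the maximum pairwise cosine similarity, its LogSumExp smoothing, and a combined loss value. It is a toy calculation on hypothetical gradient vectors and per-model loss values; the sum over pairs a < b and the function names are assumptions made for this example, and no automatic differentiation or training loop is shown.

```python
import math
from itertools import combinations

def cosine_similarity(g1, g2):
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

def pairwise_similarities(gradients):
    """Cosine similarity for every unordered pair of member-model gradients."""
    return [cosine_similarity(ga, gb) for ga, gb in combinations(gradients, 2)]

def cs_coherence_max(gradients):
    """Non-smooth diversity measure: the largest pairwise similarity."""
    return max(pairwise_similarities(gradients))

def cs_coherence_lse(gradients):
    """Smooth LogSumExp surrogate, computed in a numerically stable way."""
    sims = pairwise_similarities(gradients)
    m = max(sims)
    return m + math.log(sum(math.exp(s - m) for s in sims))

def distillation_loss(per_model_losses, gradients, temperature, lam):
    """Average of T^2 * J_i over the K models plus lam * smoothed coherence."""
    k = len(per_model_losses)
    soft_term = (temperature ** 2) * sum(per_model_losses) / k
    return soft_term + lam * cs_coherence_lse(gradients)

# Hypothetical gradients of K = 3 student models at one sample.
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(cs_coherence_max(grads), cs_coherence_lse(grads))
print(distillation_loss([1.0, 2.0, 3.0], grads, temperature=2.0, lam=0.5))
```

The LogSumExp value is always at least the maximum and exceeds it by at most the logarithm of the number of pairs, which is why it can stand in for the maximum during gradient-based training.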

Further, after the student models are obtained by differential knowledge distillation in step S2, the member models are deployed to the edge nodes. When an edge device receives an image, it does not perform classification on its own model, but forwards the image to the scheduling controller. The scheduling controller selects one of the registered service models according to the scheduling policy, specifically:

In an adversarial environment, both the defender and the attacker want to maximize their payoff through some strategy, which is a typical game problem. In the present invention, a Bayes-Stackelberg game is used to model the scheduling strategy. The defender's strategy is to select a suitable classification service model, and the attacker's strategy is to select an optimal surrogate model for generating adversarial examples. The present invention represents the Bayes-Stackelberg game as a seven-tuple

$$G = (L, F^{(c)}, S_L, S_F^{(c)}, R_L^{(c)}, R_F^{(c)}, P^{(c)}), \quad c \in \{1, 2\}$$

where:

- $L$ is the defender, and $S_L$ is the set of student models obtained after differential distillation, $S_L = \{F_s(\theta^{(1)}), \ldots, F_s(\theta^{(K)})\}$;
- the follower $F$ has two types, the legitimate user $F^{(1)}$ and the attacker $F^{(2)}$;
- the action space $S_F^{(1)}$ of the legitimate user $F^{(1)}$ contains only one action, namely requesting the service with a legitimate sample;
- the action space $S_F^{(2)}$ of the attacker $F^{(2)}$ is the set of surrogate models it can select;
- the payoff $R_L^{(1)}$ of the defender $L$ and the payoff $R_F^{(1)}$ of the legitimate user $F^{(1)}$ are defined as the classification accuracy of the member model on natural images;
- the payoff $R_L^{(2)}$ of the defender $L$ is the classification accuracy of the member model on adversarial examples;
- the payoff $R_F^{(2)}$ of the attacker $F^{(2)}$ is defined as the success rate of the adversarial example attack;
- $P^{(1)}$ is the probability that a legitimate user $F^{(1)}$ occurs, and $P^{(2)}$ is the probability that an attacker $F^{(2)}$ occurs.

The model scheduling problem based on the Bayes-Stackelberg game is converted into the following mixed integer quadratic programming problem (MIQP):

$$\max_{s, q, v} \sum_{c \in \{1,2\}} \sum_{n \in [S_L]} \sum_{m \in [S_F^{(c)}]} P^{(c)} R_L^{(c)}(n, m)\, s_n\, q_m^{(c)}$$

$$\text{s.t.} \quad \sum_{n \in [S_L]} s_n = 1, \qquad \sum_{m \in [S_F^{(c)}]} q_m^{(c)} = 1$$

$$0 \le v^{(c)} - \sum_{n \in [S_L]} R_F^{(c)}(n, m)\, s_n \le (1 - q_m^{(c)})\, N$$

$$0 \le s_n \le 1, \qquad q_m^{(c)} \in \{0, 1\}, \qquad v^{(c)} \in \mathbb{R}$$

where $N$ is a large positive number, $P^{(1)} = 1 - \alpha$ and $P^{(2)} = \alpha$. Solving the problem yields the scheduling strategy of the member models, $s = (p_1, p_2, \ldots, p_K)$, where $p_i$ (the component $s_i$ above) is the probability that the member model $F_s(\theta^{(i)})$ is selected. $q^{(c)}$ is the optimal response strategy of user $F^{(c)}$, whose payoff is $v^{(c)}$. The above problem can be solved using the DOBSS algorithm.
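Once a mixed strategy has been obtained, the scheduling controller only needs to draw one member model per request according to the selection probabilities. The following minimal sketch shows that sampling step; the strategy values and the function name `schedule_member_model` are hypothetical, and the MIQP itself is assumed to have been solved elsewhere.

```python
import random

def schedule_member_model(strategy, rng=random):
    """Draw the index of the member model that serves one request.

    strategy[i] is the probability p_i that member model i is selected.
    """
    if any(p < 0 for p in strategy) or abs(sum(strategy) - 1.0) > 1e-9:
        raise ValueError("strategy must be a probability distribution")
    return rng.choices(range(len(strategy)), weights=strategy, k=1)[0]

# Hypothetical mixed strategy over K = 3 member models.
strategy = [0.5, 0.3, 0.2]
rng = random.Random(0)  # fixed seed so the demonstration is repeatable

counts = [0] * len(strategy)
for _ in range(10000):
    counts[schedule_member_model(strategy, rng)] += 1

# Empirical selection frequencies approach the strategy probabilities.
print([c / 10000 for c in counts])
```

Because the draw is made independently for every request, an observer who queries the service repeatedly sees answers from a mixture of models rather than from a single fixed target.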

The invention is the first to study a defense against adversarial example attacks for edge intelligence systems. Sailik et al. were the first to propose defending against adversarial examples with a moving target (MTDeep). Our approach differs from theirs in two ways: first, they do not consider the application scenario of edge intelligence and target only applications on cloud platforms; second, they do not consider the differences between member models, so the final defense effect is not pronounced. The HRS (Hierarchical Random Switching) network proposed by Wang et al. places several parallel network modules inside the network and switches among them randomly during forward propagation, whereas our method switches the whole network randomly, and the switching strategies also differ. Abhishek et al. analyzed the impact of the attacker's bounded rationality on the performance of an MTD based on the Stackelberg game; the results showed that an MTD game framework designed for rational attackers is sufficient to defend against boundedly rational attackers, and the method of the present invention therefore also assumes that attackers are rational.

Song et al. propose fMTD, which detects and defends against adversarial examples based on the observation that the adversarial examples of different models are distributed differently. fMTD retrains a base model with different adversarial examples to obtain a group of fork models, detects adversarial examples from the consistency of the outputs of the fork models, and classifies adversarial examples correctly by voting. The moving target aspect of that method lies mainly in the fact that, after the models are deployed, adversarial examples can still be generated dynamically for adversarial training, so that the fork models of the system change over time. Unlike the method of the present invention, that method still requires forward inference on multiple models and cannot be deployed on resource-limited edge devices.

The invention has the beneficial effects that:

The invention is an edge intelligence moving target defense method based on a Bayes-Stackelberg game, which has the following advantages over the prior art:

(1) The invention is the first to propose a defense against adversarial attacks on edge intelligence systems. The EI-MTD provided by the invention fits well with the inference architecture of edge devices and edge servers, in which a deep learning model performs inference independently on each edge node, which makes dynamic scheduling possible. The dynamic scheduling mechanism is completely transparent to the user and does not reduce the classification accuracy.

(2) To prevent transferability, the present invention proposes differential knowledge distillation to increase the diversity of the member models on the edge nodes. Unlike the knowledge distillation of a single model, the present invention distills multiple student models simultaneously with a common loss function. In addition, the method compresses the scale of the models at the same time, and can thus overcome the resource limitations of edge nodes.

(3) An EI simulation platform was built from GPU servers, PCs and Raspberry Pi devices to test EI-MTD. The experiments used the real image data set ILSVRC2012. The experimental results indicate that EI-MTD can defend against 80% of the adversarial examples generated by M-DI²-FGSM.

Drawings

FIG. 1 shows a static target model and a dynamic target model.

In the figure: on the left is a typical static, device-based service architecture under attack. The attacker attacks node K with an adversarial example of a "cat", knowing that it is the model on node K that performs the classification. On the right is the dynamically scheduled target model scheme. Although the attacker tries to attack node K, it does not know which model actually executes the classification.

FIG. 2 shows the framework of EI-MTD.

In the figure: the black line represents the deployment of the member models to the edge nodes, and the red line represents the process by which EI-MTD classifies adversarial examples.

FIG. 3 shows the top-1 and top-5 accuracy of the teacher model during adversarial training. After each training epoch, the teacher model is tested on two data sets, one containing clean samples and the other containing PGD adversarial examples.

FIG. 4 shows the accuracy of the training model and of the differential knowledge distillation models.

FIG. 5 shows that EI-MTD has higher accuracy than a single member model under different attacker occurrence probabilities. Note that the member models are themselves somewhat robust because they are distilled from the teacher model.

FIG. 6 shows the EI-MTD accuracy at different distillation temperatures T.

FIG. 7 shows the values of the differential immunity γ at different distillation temperatures T.

FIG. 8 shows the effect of the differential immunity γ on EI-MTD.

FIG. 9 shows the EI-MTD accuracy at different regularization coefficients λ.

FIG. 10 shows the values of the differential immunity γ at different regularization coefficients λ.

FIG. 11 shows the effect of the differential immunity γ on EI-MTD at a temperature T of 10.

FIG. 12 shows heat maps of the EI-MTD differential immunity γ and accuracy for different combinations of the temperature T and the regularization coefficient λ. The left column shows the differential immunity γ, and the right column shows the classification accuracy. Panels (a), (b), (c) and (d) correspond to different methods for generating adversarial examples: FGSM, PGD, MI-FGSM and M-DI²-FGSM, respectively.

Detailed Description

The invention is further illustrated by the following specific examples:

1. Preliminaries

1.1 Deep neural networks and adversarial examples

A deep neural network (DNN) can generally be represented by a mapping function $F(X, \theta): \mathbb{R}^d \to \mathbb{R}^L$, where $X \in \mathbb{R}^d$ is the input sample variable, θ denotes the parameters of the DNN, and L is the number of classes predicted by the DNN. DNNs with a Softmax output layer are used here, where the Softmax function is defined as:

$$\mathrm{Softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{L} \exp(z_j)}, \quad i = 1, \ldots, L$$

The DNN can then be written as $F(x) = \mathrm{Softmax}(z)$, where z is the output vector of the last hidden layer of the DNN. Given an input sample $x \in X$, the label predicted by the DNN is $y' = \arg\max_{i \in \{1, \ldots, L\}} F(x)_i$, and the probability value $F(x)_{y'}$ is called the confidence score of the prediction. The goal of training a DNN is to make the difference between its prediction $y'$ and the true label $y$ ever smaller. The loss of an input-label pair $(x, y)$ is denoted $J(x, y, \theta)$, and the training objective used here is the cross-entropy loss, defined as $J(x, y, \theta) = -\mathbf{1}_y \cdot \log(\mathrm{Softmax}(z(x, \theta)))$, where $\mathbf{1}_y$ is the one-hot encoding of the true label and the logarithm of a vector is taken element by element.

An adversarial example is obtained by adding to the input sample x a perturbation r that cannot be perceived by the human eye, so that a model with good generalization ability misclassifies it. Specifically, $x_{adv} = x + r$ subject to $\|r\|_p \le \epsilon$, such that the prediction of the DNN satisfies $\arg\max_i F(x_{adv})_i \neq y$ or $\arg\max_i F(x_{adv})_i = t$, where t is a class specified by the attacker. The $\ell_\infty$ norm is used here to measure the size of the perturbation r, i.e. $\|r\|_\infty \le \epsilon$.
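The Softmax output, the predicted label and the cross-entropy loss described above can be written out in a few lines of plain Python. This is a minimal sketch on a hypothetical logit vector; the function names are chosen for this example only.

```python
import math

def softmax(z):
    """Softmax of a list of logits, shifted by the maximum for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(z):
    """Index of the class with the highest Softmax probability."""
    probs = softmax(z)
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(z, y):
    """Cross-entropy loss for true class index y (one-hot label times log-probabilities)."""
    return -math.log(softmax(z)[y])

# Hypothetical logits of a 3-class model for one input sample.
logits = [2.0, 1.0, 0.1]
print(softmax(logits))
print(predict(logits))
print(cross_entropy(logits, 0), cross_entropy(logits, 2))
```

The loss is small when the true class already has a high probability and grows as that probability shrinks, which is the quantity the attacks in the next subsection try to increase.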

1.2 gradient-based attacks

When the model information of a DNN is known, a white-box attack can construct an adversarial example under the constraint $\|x_{adv} - x\|_p \le \epsilon$ by solving the optimization problem

$$\max_{x_{adv}} J(x_{adv}, y, \theta)$$

This section mainly introduces attack methods that generate adversarial examples by gradient-based optimization.

FGSM (fast gradient sign method): the first method, proposed by Goodfellow et al., for generating adversarial examples from model gradient information. It obtains an adversarial example $x_{adv}$ by maximizing the loss function $J(x, y, \theta)$: the perturbation r is sought in the direction in which the loss changes fastest with respect to x, i.e. $x_{adv} = x + r = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y; \theta))$, where $\mathrm{sign}(\cdot)$ is the sign function, $\nabla_x J(x, y; \theta)$ is the gradient of the loss function with respect to the input x, and $\|r\|_\infty \le \epsilon$.
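The FGSM step can be illustrated on a model whose input gradient is available in closed form. The sketch below assumes a toy binary logistic classifier with labels in {-1, +1}, not the DNNs of the invention; the weights, sample and ε are hypothetical values.

```python
import math

def loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def loss_gradient_wrt_input(w, x, y):
    """Closed-form gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))  # equals -y * sigmoid(-margin)
    return [scale * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, x, y, epsilon):
    """Single step of size epsilon along the sign of the input gradient."""
    grad = loss_gradient_wrt_input(w, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical model weights, clean sample and perturbation budget.
w = [1.0, -2.0, 0.5]
x = [0.2, -0.4, 1.0]
y = 1
epsilon = 0.1

x_adv = fgsm(w, x, y, epsilon)
print(x_adv)
print(loss(w, x, y), loss(w, x_adv, y))  # the loss rises after the perturbation
```

Every coordinate moves by exactly ε, so the perturbation sits on the boundary of the $\ell_\infty$ ball while the loss increases.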

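PGD (projected gradient descent): Madry et al. extend FGSM to an iterative method for finding the adversarial perturbation, i.e.

$$x_{adv}^{t+1} = \mathrm{clip}_{x, \epsilon}\big(x_{adv}^{t} + \alpha \cdot \mathrm{sign}(\nabla_x J(x_{adv}^{t}, y; \theta))\big)$$

where t is the iteration index, the step size is $\alpha = \epsilon / T$, T is the total number of iterations, and $\mathrm{clip}_{x, \epsilon}(\cdot)$ clips the perturbation to within the constraint ε.

MI-FGSM: Dong et al. replace the iterative part of PGD with momentum iterations, which stabilize the update direction and help it escape poor local maxima. The momentum iteration method is expressed as:

$$x_{adv}^{t+1} = \mathrm{clip}_{x, \epsilon}\big(x_{adv}^{t} + \alpha \cdot \mathrm{sign}(g_{t+1})\big)$$

where

$$g_{t+1} = u \cdot g_t + \frac{\nabla_x J(x_{adv}^{t}, y; \theta)}{\|\nabla_x J(x_{adv}^{t}, y; \theta)\|_1}$$

and u is the decay factor of the momentum term.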
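The iterative attacks can be sketched in the same spirit. The example below assumes a toy binary logistic classifier with a closed-form input gradient (a stand-in for a DNN), a step size of ε divided by the number of steps, and clipping to the ε-ball around the clean sample; with a decay factor of 0 the loop reduces to a plain sign-gradient iteration of the PGD kind. All numeric values are hypothetical.

```python
import math

def loss(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def loss_gradient_wrt_input(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def momentum_iterative_attack(w, x, y, epsilon, steps, decay):
    """Iterative sign-gradient attack with an accumulated, L1-normalised gradient."""
    alpha = epsilon / steps          # per-step size
    x_adv = list(x)
    g = [0.0] * len(x)               # accumulated (momentum) gradient
    for _ in range(steps):
        grad = loss_gradient_wrt_input(w, x_adv, y)
        l1 = sum(abs(v) for v in grad)
        if l1 > 0:
            grad = [v / l1 for v in grad]
        g = [decay * gi + v for gi, v in zip(g, grad)]
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Clip back into the epsilon-ball around the clean sample.
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon) for xa, xi in zip(x_adv, x)]
    return x_adv

w, x, y, epsilon = [1.0, -2.0, 0.5], [0.2, -0.4, 1.0], 1, 0.1
x_mi = momentum_iterative_attack(w, x, y, epsilon, steps=10, decay=1.0)
x_pgd = momentum_iterative_attack(w, x, y, epsilon, steps=10, decay=0.0)
print(loss(w, x, y), loss(w, x_mi, y), loss(w, x_pgd, y))
```

On this linear toy model both variants reach the same perturbation; the momentum term matters on non-linear models, where the gradient direction changes between iterations.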

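M-DI²-FGSM: to improve the black-box attack success rate of the multi-step iterative methods, Xie et al. propose applying a random input transformation to the sample at each iteration, specifically:

$$g_{t+1} = u \cdot g_t + \frac{\nabla_x J(\mathcal{T}(x_{adv}^{t}; p), y; \theta)}{\|\nabla_x J(\mathcal{T}(x_{adv}^{t}; p), y; \theta)\|_1}$$

where p denotes the transformation probability and the random transformation function is

$$\mathcal{T}(x_{adv}^{t}; p) = \begin{cases} \mathcal{T}(x_{adv}^{t}) & \text{with probability } p \\ x_{adv}^{t} & \text{with probability } 1 - p \end{cases}$$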

1.3 confrontational training

Adversarial training is a method of training DNNs that improves their robustness. It was first proposed by Goodfellow et al., who attributed the ability of adversarial examples to fool DNNs to a lack of training data; to defend against adversarial examples, they therefore proposed generating a large number of adversarial examples with FGSM and retraining the DNN with these examples, together with their correct labels, as part of the training data. Madry et al. formulate adversarial training as the following robust optimization problem:

$$\min_{\theta} \mathbb{E}_{(x, y) \sim D} \Big[ \max_{\|r\|_\infty \le \epsilon} J(x + r, y, \theta) \Big]$$

They propose solving the inner maximization problem with PGD. The generated adversarial examples are then used for training to solve the outer minimization problem. However, this adversarial training method has a gradient computation complexity of O(MN) for a single batch, where M is the data volume and N is the number of PGD iterations, which is N times the O(M) complexity of standard training.

1.4 knowledge distillation

Knowledge distillation was first proposed by Hinton, who observed that the prediction vector of a model contains structured information between classes, which can be used to remove part of the redundancy of a neural network and thereby compress the network structure. Specifically, for a trained teacher model $F_t(\theta)$ whose logits layer output is $Z = (Z_1(x), \ldots, Z_L(x))$, the softmax function is redefined as:

$$F_t(x; T)_i = \frac{\exp(Z_i(x)/T)}{\sum_{j=1}^{L} \exp(Z_j(x)/T)}$$

where T is the temperature parameter. The output $\tilde{y} = F_t(x; T)$ is called the soft label, and the original label y of the sample x is called the hard label. Training with both soft and hard labels can train student models better than training with hard labels only. The student model is trained to minimize the knowledge distillation loss:

$$J_{KD} = (1 - \beta)\, J(x, y, \theta) + \beta\, T^2 J(x, \tilde{y}, \theta)$$

where $\tilde{y}$ is the soft label generated by the teacher model, and β is the weight that balances the hard-label loss and the soft-label loss during the training of the student model.
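The effect of the temperature on the soft labels, and the way β mixes the two loss terms, can be illustrated with a short sketch. It assumes that β weights the soft-label term (so that β = 1 uses soft labels only) and that the soft-label term is scaled by T²; the logits and function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by a temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, hard_label, temperature, beta):
    """(1 - beta) * hard-label cross-entropy + beta * T^2 * soft-label cross-entropy."""
    soft_label = softmax_with_temperature(teacher_logits, temperature)
    student_soft = softmax_with_temperature(student_logits, temperature)
    student_hard = softmax_with_temperature(student_logits, 1.0)
    soft_term = -sum(p * math.log(q) for p, q in zip(soft_label, student_soft))
    hard_term = -math.log(student_hard[hard_label])
    return (1.0 - beta) * hard_term + beta * temperature ** 2 * soft_term

# Hypothetical teacher and student logits for one sample with 3 classes.
teacher_logits = [5.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 0.0]

print(softmax_with_temperature(teacher_logits, 1.0))   # sharp distribution
print(softmax_with_temperature(teacher_logits, 10.0))  # softer distribution
print(kd_loss(student_logits, teacher_logits, hard_label=0, temperature=10.0, beta=1.0))
```

A higher temperature flattens the teacher's distribution, which exposes the relative similarity between classes that the student is meant to learn.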

1.5 Bayes-Stackelberg game

The Stackelberg game is a non-cooperative, sequential decision game whose participants (players) include a leader L, who acts first, and a follower F, who acts afterwards. We represent the Stackelberg game by a six-tuple $G = (L, F, S_L, S_F, R_L, R_F)$, where $S_L$ is the action space of the leader, $S_F$ is the action space of the follower, $R_L$ is the payoff function of the leader, and $R_F$ is the payoff function of the follower. A payoff function is defined over combinations of actions, $R_i: [S_L] \times [S_F] \to \mathbb{R}$, where $i \in \{L, F\}$ and $[S_i]$ denotes the index set of an action space. A pure strategy selects exactly one action, while a mixed strategy selects each action with a probability $0 \le p \le 1$. In the Stackelberg game the leader first commits to a mixed strategy s, and the follower F, given the leader's strategy, optimizes its own payoff and responds with a pure strategy q. The leader's problem is finally solved as a mixed integer quadratic programming problem (MIQP):

$$\max_{s, q, v} \sum_{n \in [S_L]} \sum_{m \in [S_F]} R_L(n, m)\, s_n\, q_m$$

$$\text{s.t.} \quad \sum_{n \in [S_L]} s_n = 1, \qquad \sum_{m \in [S_F]} q_m = 1$$

$$0 \le v - \sum_{n \in [S_L]} R_F(n, m)\, s_n \le (1 - q_m)\, N$$

$$0 \le s_n \le 1, \qquad q_m \in \{0, 1\}, \qquad v \in \mathbb{R}$$

Here N is a large positive number. The optimal objective value is the optimal payoff of the leader, s is the optimal mixed strategy of the leader, q is the optimal response strategy of the follower, and v is the payoff of the follower.

In the field of information security, it is common to assume that the leader is the defender and the follower is the attacker. The attacker may be of several types, so the Stackelberg game is extended to the case of multiple follower types, referred to as the Bayes-Stackelberg game and denoted

$$G = (L, F^{(c)}, S_L, S_F^{(c)}, R_L^{(c)}, R_F^{(c)}, p^{(c)}), \quad c \in \{1, \ldots, C\}$$

That is, the followers comprise C types; each type of follower $F^{(c)}$ has its own strategy set $S_F^{(c)}$ and payoff function $R_F^{(c)}$, and $p^{(c)}$ is the probability that a follower of type $F^{(c)}$ occurs. In this game the leader does not know the type of the follower it faces, but knows the probability distribution $p^{(c)}$ over the types, so this is a Stackelberg game of incomplete information. The MIQP problem of the leader of the Bayes-Stackelberg game is finally solved:

$$\max_{s, q, v} \sum_{c=1}^{C} \sum_{n \in [S_L]} \sum_{m \in [S_F^{(c)}]} p^{(c)} R_L^{(c)}(n, m)\, s_n\, q_m^{(c)}$$

$$\text{s.t.} \quad \sum_{n \in [S_L]} s_n = 1, \qquad \sum_{m \in [S_F^{(c)}]} q_m^{(c)} = 1$$

$$0 \le v^{(c)} - \sum_{n \in [S_L]} R_F^{(c)}(n, m)\, s_n \le (1 - q_m^{(c)})\, N$$

$$0 \le s_n \le 1, \qquad q_m^{(c)} \in \{0, 1\}, \qquad v^{(c)} \in \mathbb{R}$$

The optimal objective value is the optimal payoff of the leader, s is the optimal mixed strategy of the leader, $q^{(c)}$ is the optimal response strategy of the follower $F^{(c)}$, and $v^{(c)}$ is the payoff of that follower.
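To make the leader's objective concrete, the sketch below evaluates the leader's expected payoff for a given mixed strategy when each follower type plays a best response, and then searches a coarse grid of strategies for a game with two leader actions. This brute-force grid search is a stand-in for illustration and is not the DOBSS algorithm; ties are broken by the lowest action index. All payoff values are hypothetical.

```python
def follower_best_response(s, r_f):
    """Best pure action of one follower type; r_f[n][m] is its payoff."""
    actions = range(len(r_f[0]))
    expected = [sum(s[n] * r_f[n][m] for n in range(len(s))) for m in actions]
    best = max(actions, key=expected.__getitem__)
    return best, expected[best]

def leader_expected_payoff(s, r_l, r_f, p):
    """Leader payoff averaged over follower types that each best-respond to s."""
    total = 0.0
    for c in range(len(p)):
        q, _ = follower_best_response(s, r_f[c])
        total += p[c] * sum(s[n] * r_l[c][n][q] for n in range(len(s)))
    return total

def grid_search_two_actions(r_l, r_f, p, steps=100):
    """Scan mixed strategies (t, 1 - t) and keep the best one for the leader."""
    best_s, best_value = None, float("-inf")
    for i in range(steps + 1):
        s = [i / steps, 1.0 - i / steps]
        value = leader_expected_payoff(s, r_l, r_f, p)
        if value > best_value:
            best_s, best_value = s, value
    return best_s, best_value

# Type 0: legitimate user with one action; payoff is clean accuracy per model.
# Type 1: attacker choosing between two surrogate models.
r_l = [[[0.90], [0.80]],
       [[0.30, 0.70], [0.70, 0.30]]]   # defender: accuracy
r_f = [[[0.90], [0.80]],
       [[0.70, 0.30], [0.30, 0.70]]]   # attacker: attack success rate
p = [0.5, 0.5]                         # each follower type occurs half of the time

best_s, best_value = grid_search_two_actions(r_l, r_f, p)
print(best_s, best_value)
print(leader_expected_payoff([1.0, 0.0], r_l, r_f, p),
      leader_expected_payoff([0.0, 1.0], r_l, r_f, p))
```

In this toy game the mixed strategy gives the leader a higher expected payoff than committing to either model alone, which is the motivation for randomized switching.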

2. Defense method

The invention provides an edge intelligence moving target defense framework comprising three key technologies: adversarial training, differential knowledge distillation, and dynamic model scheduling. First, adversarial training is used in the cloud data center to obtain a robust teacher model. Second, transfer learning is used to extract robust knowledge from the teacher model and apply it to small-scale student models suited to limited resources; unlike Hinton's knowledge distillation, a differential regularization term is added to increase the diversity among the student models and effectively reduce the transferability of adversarial examples. These student models, also called member models, are then dynamically scheduled. Thanks to the diversity obtained, dynamic scheduling makes it harder for an attacker to find the optimal surrogate model, as shown on the right side of FIG. 1.

Adversarial training of the teacher model. Suppose we have a training data set of labelled samples $(x_i, y_i)$ and a teacher model. Prior work has shown that adversarial training makes a larger-capacity network more robust. We therefore choose a 101-layer network, ResNet-101, as the teacher model. Adversarial training is then performed in the cloud data center, combined with the "FAST" adversarial training method to speed up the process.

Differential knowledge distillation of the student models. The soft labels of the training set at a suitable distillation temperature are first obtained from the teacher model, and a new training data set is then created. The essence of knowledge distillation is to train the student models with the soft labels of the teacher model. To obtain diversity among the student models, we define a new loss function with a regularization term, and train all student models simultaneously to minimize this common loss function. Note that in the present invention the student model, the member model and the target model refer to the same object, which is named according to the context.

Dynamic service scheduling of the member models. After differential knowledge distillation, the student models are deployed to the edge nodes. Note that there is only one student model on each edge node, where the edge nodes include edge devices and edge servers. One edge server is designated as the scheduling controller, and all student models, i.e. member models, are registered with it. When a user (possibly an attacker) submits an image through an edge device (e.g., a smartphone) to request the classification service, the edge device first uploads the service request to the scheduling controller rather than processing it directly on its local model. The scheduling controller selects an edge node, or more precisely the model on it, to perform the classification. The attacker therefore cannot know which edge node will ultimately provide the service. The scheduling controller chooses the target model through the Bayes-Stackelberg game.

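3.1 Measure of diversity

As described above, the diversity of the models plays an important role in the effectiveness of dynamic scheduling. How to measure diversity properly is therefore an important issue. Inspired by the fact that adversarial attacks use the gradient with respect to the input as the perturbation direction, we use gradient alignment as the diversity measure.

Suppose there are two member models $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)}) \in \Omega$ and a surrogate model $F_a \in U$ selected by the attacker, and let $\nabla_x J_1$ and $\nabla_x J_2$ denote the gradients of the loss functions of $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)})$ with respect to the sample $x$. If the angle between $\nabla_x J_1$ and $\nabla_x J_2$ is small enough, an adversarial example $x_{adv}$ that causes $F_s(\theta^{(1)})$ to misclassify can also cause $F_s(\theta^{(2)})$ to misclassify. The difference between $F_s(\theta^{(1)})$ and $F_s(\theta^{(2)})$ can therefore be measured by the angle between $\nabla_x J_1$ and $\nabla_x J_2$. We use the cosine similarity (CS) to express the degree of alignment of $\nabla_x J_1$ and $\nabla_x J_2$:

$$CS(\nabla_x J_1, \nabla_x J_2) = \frac{\langle \nabla_x J_1, \nabla_x J_2 \rangle}{\|\nabla_x J_1\|_2 \, \|\nabla_x J_2\|_2}$$

where $\langle \nabla_x J_1, \nabla_x J_2 \rangle$ is the inner product of $\nabla_x J_1$ and $\nabla_x J_2$. If $CS(\nabla_x J_1, \nabla_x J_2) = -1$, the two gradients point in opposite directions, meaning that an $x_{adv}$ that causes $F_s(\theta^{(1)})$ to misclassify cannot cause $F_s(\theta^{(2)})$ to misclassify.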

3.2 differential knowledge distillation

This section further applies cosine similarity to the training process of the member models to obtain member models with greater diversity. Since cosine similarity is computed between two gradients while EI-MTD includes K models, the maximum pairwise cosine similarity is defined as the EI-MTD diversity measure:

$$CS_{coherence} = \max_{a, b \in \{1, \dots, K\},\ a \neq b} CS(\nabla_x J_a, \nabla_x J_b)$$

where $J_a$ and $J_b$ are the loss functions of member models $F_s^{(a)}$ and $F_s^{(b)}$, $\theta^{(a)}$ and $\theta^{(b)}$ are the parameters of $F_s^{(a)}$ and $F_s^{(b)}$, and the label of $x$ is the soft label obtained from the teacher model. Because $CS_{coherence}$ is a non-smooth function, first-order optimization methods such as gradient descent cannot be applied to it directly, so the LogSumExp function is used as a smooth approximation of $CS_{coherence}$:

$$CS_{coherence} \approx \log \sum_{1 \le a < b \le K} \exp\bigl(CS(\nabla_x J_a, \nabla_x J_b)\bigr)$$
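To make the smoothing step concrete, the sketch below compares the maximum pairwise cosine similarity with its LogSumExp approximation. It is an illustration under assumptions: the gradients are hypothetical flat lists, and the helper names are not taken from the invention.

```python
import math
from itertools import combinations

def cosine_similarity(g1, g2):
    """Cosine similarity between two nonzero gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    return dot / (math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2)))

def pairwise_cs(grads):
    """Cosine similarity of every unordered pair of member-model gradients."""
    return [cosine_similarity(a, b) for a, b in combinations(grads, 2)]

def cs_coherence_max(grads):
    """Non-smooth diversity measure: the largest pairwise cosine similarity."""
    return max(pairwise_cs(grads))

def cs_coherence_lse(grads):
    """Smooth LogSumExp approximation, computed in a numerically stable way."""
    values = pairwise_cs(grads)
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical input gradients of K = 3 member models on one sample.
grads = [[1.0, 0.0, 0.5], [0.8, 0.3, -0.2], [-0.4, 1.0, 0.1]]
hard = cs_coherence_max(grads)
soft = cs_coherence_lse(grads)
```

LogSumExp always lies between the true maximum and the maximum plus the logarithm of the number of pairs, so minimizing it pushes the largest pairwise similarity down while remaining differentiable.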

A smaller $CS_{coherence}$ means greater diversity among the member models. Note that the member models are distilled from the teacher model in the cloud data center, and diversity among them must be ensured at the same time. A regularization term is therefore added to the knowledge distillation process, and the new distillation loss function is defined as:

$$L = \frac{1}{K} \sum_{i=1}^{K} T^2 J_i + \lambda \cdot CS_{coherence}$$

where $J_i$ is the loss function of the i-th student model, T is the distillation temperature, and λ is the regularization coefficient, which controls the importance of $CS_{coherence}$ during training. To allow the student models to fully learn the adversarial knowledge of the teacher model, β is set to 1; that is, the student models are trained using only the soft-labeled examples. Algorithm 1 for differential knowledge distillation is shown below.
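The combined loss can be sketched for a single sample as follows. This is a sketch under assumptions: each J_i is taken to be the cross-entropy between the teacher's temperature-softened output and the student's temperature-softened output (the usual distillation choice, not confirmed by the text), the value of CS_coherence is supplied by the caller, and all names and logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, soft_label, T):
    """Assumed J_i: cross-entropy between the soft label and the student output."""
    probs = softmax_with_temperature(student_logits, T)
    return -sum(y * math.log(p) for y, p in zip(soft_label, probs))

def distillation_loss(student_logits_list, teacher_logits, cs_coherence, T, lam):
    """L = (1/K) * sum_i T^2 * J_i + lam * CS_coherence for one sample."""
    soft_label = softmax_with_temperature(teacher_logits, T)
    K = len(student_logits_list)
    mean_term = sum(
        T ** 2 * soft_cross_entropy(z, soft_label, T) for z in student_logits_list
    ) / K
    return mean_term + lam * cs_coherence

# Hypothetical logits for a 3-class problem and K = 2 student models.
teacher = [4.0, 1.0, -2.0]
students = [[3.0, 0.5, -1.0], [2.5, 1.5, -0.5]]
loss_diverse = distillation_loss(students, teacher, cs_coherence=0.0, T=10.0, lam=0.3)
loss_aligned = distillation_loss(students, teacher, cs_coherence=0.5, T=10.0, lam=0.3)
```

With the soft-label term fixed, the loss grows linearly in CS_coherence with slope λ, which is how λ trades imitation of the teacher against diversity among the students.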

3.3 Model scheduling policy

After the student models are obtained by the differential knowledge distillation of Section 3.2, these member models are deployed to the edge nodes, as shown in FIG. 2. When an edge device receives an image, it does not classify the image on its own model but forwards it to the scheduling controller. The scheduling controller selects one of the registered service models according to the scheduling policy, which this section describes in detail.

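In the Bayes-Stackelberg game, the attacker's actions are to select different surrogate models, and the payoffs considered are those of the defender L and of the followers, including the legitimate user $F^{(1)}$. The scheduling strategy is obtained from an optimization problem whose constraints include

$$0 \le s_n \le 1, \qquad v^{(c)} \in \mathbb{R},$$

where $P^{(1)} = 1 - \alpha$ and $P^{(2)} = \alpha$; $s = (p_1, p_2, \dots, p_K)$ is the scheduling strategy of the member models obtained by solving the problem, with $p_i$ the probability that member model $F_s(\theta^{(i)})$ is selected; $q^{(c)}$ is the optimal response strategy of end user $F^{(c)}$, and $v^{(c)}$ is that end user's payoff.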
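As a reading aid, the program solved for the scheduling strategy can be sketched in the notation of this section. The block below is the standard DOBSS mixed-integer quadratic program with its variables renamed to $s$, $q^{(c)}$ and $v^{(c)}$. The payoff symbols $R^{L,(c)}$ and $R^{F,(c)}$, the action indices $n$ and $m$, and the large constant $M$ are assumptions introduced here and are not notation taken from the invention.

```latex
\begin{aligned}
\max_{s,\,q,\,v}\quad
  & \sum_{n=1}^{K} \sum_{c \in \{1,2\}} \sum_{m}
    P^{(c)} \, R^{L,(c)}_{n,m} \, s_n \, q^{(c)}_m \\
\text{s.t.}\quad
  & \sum_{n=1}^{K} s_n = 1, \\
  & \sum_{m} q^{(c)}_m = 1, \\
  & 0 \le v^{(c)} - \sum_{n=1}^{K} R^{F,(c)}_{n,m} \, s_n
      \le \bigl(1 - q^{(c)}_m\bigr) M, \\
  & 0 \le s_n \le 1, \qquad q^{(c)}_m \in \{0, 1\}, \qquad v^{(c)} \in \mathbb{R}.
\end{aligned}
```

In this form, the third constraint forces each follower type to place its weight on an action that is a best response to $s$, with $v^{(c)}$ equal to the value of that best response.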

The MIQP is an NP-hard problem, and the invention solves it with DOBSS (Decomposed Optimal Bayesian Stackelberg Solver). DOBSS has three key advantages over other solution methods. First, it allows the Bayes-Stackelberg game to be expressed compactly, without converting it to normal form through the Harsanyi transformation. Second, it only needs to solve one mixed-integer linear program rather than a set of linear programs, which further speeds up the solution. Finally, it directly searches for the optimal leader strategy rather than a Nash equilibrium, so it can find a high-payoff Stackelberg equilibrium strategy that exploits the leader's first-mover advantage. Given the leader's optimal strategy s, the edge server schedules the models deployed on the edge devices according to this strategy to serve the users' requests.
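The leader-follower reasoning can be illustrated on a toy game. The sketch below does not implement DOBSS: it replaces the mixed-integer program with a plain grid search over mixed strategies for two member models, which is only practical at this size. All payoff values, the prior α = 0.5, and the function names are hypothetical.

```python
def expected(s, payoff, m):
    """Expected payoff of follower action m when the leader plays mixed strategy s."""
    return sum(s[n] * payoff[n][m] for n in range(len(s)))

def follower_best_response(s, R_L, R_F):
    """Best follower action against s; ties are broken in the leader's favor."""
    actions = range(len(R_F[0]))
    best = max(expected(s, R_F, m) for m in actions)
    ties = [m for m in actions if abs(expected(s, R_F, m) - best) < 1e-9]
    return max(ties, key=lambda m: expected(s, R_L, m))

def leader_value(s, follower_types):
    """Leader's expected payoff over follower types given as (prior, R_L, R_F)."""
    total = 0.0
    for prior, R_L, R_F in follower_types:
        m = follower_best_response(s, R_L, R_F)
        total += prior * expected(s, R_L, m)
    return total

def grid_search(follower_types, steps=100):
    """Search strategies s = (p, 1 - p) for two member models on a uniform grid."""
    candidates = [(k / steps, 1.0 - k / steps) for k in range(steps + 1)]
    return max(candidates, key=lambda s: leader_value(s, follower_types))

# Hypothetical payoffs in percent. Rows: member models; columns: follower actions.
ALPHA = 0.5
USER = (1 - ALPHA, [[70.0], [60.0]], [[70.0], [60.0]])
ATTACKER = (ALPHA, [[30.0, 60.0], [55.0, 35.0]], [[70.0, 40.0], [45.0, 65.0]])
TYPES = [USER, ATTACKER]

best_s = grid_search(TYPES)
best_value = leader_value(best_s, TYPES)
```

In this toy game the best mixed strategy beats both pure strategies, because randomizing leaves the attacker without a surrogate model that works well against every scheduled model.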

4. Experiments

4.1 Experimental setup

In the experimental verification of the invention, a GPU cluster, a PC, and Raspberry Pi boards are used to simulate the cloud data center, the edge server, and the edge devices, respectively.

Cloud computing center: In the embodiment of the invention, a Sugon X745-G30 server simulates the cloud computing center. Its operating system is Ubuntu 16.04.6 LTS, it has four NVIDIA GeForce RTX 2080 Ti GPUs, and it uses Python 3.7.3 and PyTorch 1.2. Adversarial training of the teacher model and differential knowledge distillation of the student models are performed on this server.

Edge server: A HUAWEI MateBook 14 (2020) laptop emulates the edge server. It runs 64-bit Windows 10 on an Intel Core i5-10210U CPU at 2.11 GHz with 16 GB of RAM. The DOBSS algorithm is implemented with Python 3.6 and PuLP 2.1.

Edge devices: We use a set of six Raspberry Pi 3 Model B+ boards as edge devices. Each board has a Broadcom BCM2837B0 processor with a 64-bit quad-core ARM Cortex-A53 CPU and 1 GB of LPDDR2 SDRAM. In addition to the student models, each edge device runs a test program we developed that sends images to the edge server at arbitrary times to simulate image classification requests.

Teacher model: The teacher model is ResNet-101, which has 101 layers and 33 residual blocks. It was trained on the GPU server with 1.2 million clean images and their corresponding adversarial examples, which were generated by the FGSM method.

Student/member models: Several currently mainstream lightweight architectures, namely MobileNet V2, ShuffleNet V2, and SqueezeNet, are adopted as student/member models. From these three architectures, six models are obtained with different hyperparameters: MobileNet V2-1.0, MobileNet V2-0.75, ShuffleNet V2-0.5, ShuffleNet V2-1.0, SqueezeNet-1.0, and SqueezeNet-1.1.

Surrogate models: To simulate the attacker's strategy, five surrogate models were selected: MobileNet V2-1.0, ShuffleNet V2-1.0, SqueezeNet-1.0, ResNet-18, and VGG-13. The first three share their architectures with member models, which approximates a white-box attack. Given a pre-trained surrogate model, we generate adversarial examples with FGSM, PGD, MI-FGSM, and M-DI²-FGSM.

Data set: The experiments of this embodiment use the ILSVRC2012 data set, which contains 1,000 classes, with 1.2 million images as the training set and 150,000 images as the test set. Each image is 224 × 224 pixels with three color channels. It is currently a benchmark data set in the field of image classification.

4.2 Adversarial training and differential knowledge distillation

Accuracy of the teacher model: To ensure good knowledge transfer from the teacher model to the student models, the teacher model itself must be sufficiently accurate and robust. We adversarially trained the teacher model F_t with 1.2 million clean images and their corresponding adversarial examples. During training, we used PGD to generate 10,000 adversarial examples from the test set and tested the teacher model F_t on them at each training epoch. For PGD, the perturbation size is ε = 5, the step size is ε/5, and the number of iterations is 20. FIG. 3 shows the effect of adversarial training on the teacher model F_t: its accuracy improves gradually as training proceeds. On clean examples, the teacher model starts with a top-1 accuracy of 11.83% and a top-5 accuracy of 15.31%; after 15 epochs of adversarial training, the top-1 accuracy rises to 64.03% and the top-5 accuracy to 82.8%. Similarly, on PGD adversarial examples the top-1 accuracy rises from 3.37% to 52.35% and the top-5 accuracy from 13.55% to 73.71%. In conclusion, adversarial training gives the teacher model high accuracy and robustness, which is the basis for the student models to perform well.
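The role of the three PGD parameters can be illustrated on a toy problem. The sketch below assumes an L-infinity PGD attack with a step size of one fifth of the perturbation size, and it attacks a hypothetical two-feature logistic model rather than the ResNet-101 teacher; the weights, the sample, and all function names are illustrative only, and clipping to a valid pixel range is omitted.

```python
import math

def score(x, w, b):
    """Linear score of the toy model; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss_gradient(x, y, w, b):
    """Gradient with respect to x of J = log(1 + exp(-y * score(x))), y in {-1, +1}."""
    coeff = -y / (1.0 + math.exp(y * score(x, w, b)))
    return [coeff * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd(x, y, w, b, eps, step, iterations):
    """Iterated signed-gradient ascent on the loss, projected onto the eps-ball."""
    x_adv = list(x)
    for _ in range(iterations):
        grad = loss_gradient(x_adv, y, w, b)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the L-infinity ball of radius eps around the clean x.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

EPS = 5.0
w, b = [0.2, -0.1], 0.0   # hypothetical model parameters
x, y = [12.0, 20.0], 1    # hypothetical clean sample with true label +1
x_adv = pgd(x, y, w, b, eps=EPS, step=EPS / 5, iterations=20)
```

With these values the clean sample has a positive score and the adversarial sample a negative one, while every coordinate stays within ε of its original value.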

Accuracy of student/member models: We evaluated two groups of models, obtained by normal training and by differential distillation respectively, each containing six models. FIG. 4 shows the top-1 and top-5 accuracy of both groups on clean examples and on adversarial examples. For example, the normally trained ShuffleNet V2-1.0 model has a top-1 accuracy of 6.12% and a top-5 accuracy of 20.49% on adversarial examples. In contrast, the same model obtained by differential distillation achieves 39.15% top-1 accuracy and 67.43% top-5 accuracy, with only a slight decrease in accuracy on clean examples. These results indicate that the student models distilled from the robust teacher model defend better against adversarial examples while having lower model capacity, which means that they are suitable for the edge intelligence computing environment.

4.3 Payoff matrix for moving target defense

The payoff matrix of the game gives the payoffs of the participants under different strategies. Each element of the payoff matrix is a pair (a, b), where a is the classification accuracy under attack by adversarial examples and b is the attack success rate. We obtain a by testing the member models on the ILSVRC2012 test set, and obtain b by testing the member models on adversarial examples generated from the test set on the surrogate models. For legitimate users, the payoff is the accuracy of the classifier. Table 2 shows the payoff matrix between the defender and legitimate users in the EI-MTD framework. Tables 3, 4, 5 and 6 give the payoff matrices between the defender and the attackers (PGD, FGSM, MI-FGSM, and M-DI²-FGSM). For example, the entry (56.73, 43.27) in Table 3 means that when the attacker generates adversarial examples with PGD on the surrogate model ResNet-18 and attacks the classification model MobileNet V2-1.0, the defender's payoff is the 56.73% classification accuracy on the adversarial examples and the attacker's payoff is the 43.27% attack success rate.
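The construction of one attacker-versus-defender payoff entry can be sketched as follows, assuming the attack success rate is taken as 100% minus the classification accuracy under attack, as the table captions state. The function names are hypothetical; the only number used is the Table 3 entry quoted above.

```python
def payoff_entry(accuracy_under_attack):
    """Return (defender payoff, attacker payoff) in percent for one matrix cell."""
    defender = round(accuracy_under_attack, 2)
    attacker = round(100.0 - accuracy_under_attack, 2)
    return (defender, attacker)

def payoff_matrix(accuracy_table):
    """Build the matrix of pairs from a table of accuracies under attack.

    accuracy_table[n][m] is the accuracy (%) of member model n on adversarial
    examples generated on surrogate model m.
    """
    return [[payoff_entry(acc) for acc in row] for row in accuracy_table]

# Entry quoted in the text: MobileNet V2-1.0 attacked with PGD via ResNet-18.
entry = payoff_entry(56.73)
```

Because the two payoffs in each cell sum to 100, this part of the game is constant-sum, so the defender gains exactly what the attacker loses.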

Table 3: Game payoffs of the defender and the PGD attacker. The attacker's payoff is the attack success rate (%), and the payoff of the defender (the image classification system) is the classification accuracy under attack (1 − attack success rate).

Table 4: Game payoffs of the defender and the FGSM attacker. The attacker's payoff is the attack success rate (%), and the defender's payoff is the classification accuracy under attack (1 − attack success rate).

Table 5: Game payoffs of the defender and the MI-FGSM attacker. The attacker's payoff is the attack success rate (%), and the defender's payoff is the classification accuracy under attack (1 − attack success rate).

Table 6: Game payoffs of the defender and the M-DI²-FGSM attacker. The attacker's payoff is the attack success rate (%), and the defender's payoff is the classification accuracy under attack (1 − attack success rate).

4.4 EI-MTD effectiveness

Given the payoff matrices in Section 4.3, we can obtain the probability vector for selecting member models by solving the game. Since the defender's optimal strategy depends on the probability α that an attacker appears, we verified the effectiveness of EI-MTD under different values of α, compared with a single member model without dynamic scheduling. These member models are themselves somewhat robust because they are distilled from a robust teacher model. The results are shown in FIG. 5, where (a), (b), (c) and (d) correspond to FGSM, PGD, MI-FGSM and M-DI²-FGSM, respectively. Next, taking PGD (FIG. 5(b)) as an example, we discuss the effectiveness of the EI-MTD defense according to the probability α that an adversary appears:

(1) Assume that all users are legitimate, i.e., all requests are clean samples, so that α = 0. EI-MTD then selects the member model MobileNet V2-1.0, which has the highest classification accuracy on clean samples. This is equivalent to EI-MTD using the pure strategy MobileNet V2-1.0 without model switching.

(2) Assume that all users are attackers, i.e., all requests are adversarial examples, so that α = 1. The optimal scheduling strategy solved by EI-MTD is s = (0.13, 0.15, 0.16, 0.12, 0.14, 0.3), and EI-MTD randomly selects a member model according to the probability vector s. Under this strategy, the expected classification accuracy of EI-MTD is 64.57% on clean samples and 40.86% on adversarial examples, whereas the accuracy of any single DNN on adversarial examples is below 32%. EI-MTD therefore provides a more effective defense.

(3) In practice, legitimate users and attackers appear with certain prior probabilities. Let the probability that an attacker appears be α (0 < α < 1), so that a legitimate user appears with probability 1 − α. In the experiment, α takes the values 0.1, 0.2, …, 0.9 to simulate the possible situations. FIG. 5 shows that as α increases, i.e., as the proportion of adversarial examples among the requests increases, the accuracy of all models tends to decrease because the adversarial examples do more damage. Nevertheless, the classification accuracy of EI-MTD remains higher than that of any single member model.
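How the scheduling controller could draw a member model from the probability vector s of case (2) is sketched below. The text does not state which probability belongs to which model, so the pairing with the model list is an assumption made only for illustration, as are the function name and the fixed random seed.

```python
import random

# Assumed order: the six member models as listed in the experimental setup.
MEMBER_MODELS = [
    "MobileNet V2-1.0", "MobileNet V2-0.75", "ShuffleNet V2-0.5",
    "ShuffleNet V2-1.0", "SqueezeNet-1.0", "SqueezeNet-1.1",
]
# Optimal strategy reported for alpha = 1.
S = [0.13, 0.15, 0.16, 0.12, 0.14, 0.30]

def select_member_model(rng):
    """Draw one member model for the next request according to strategy S."""
    return rng.choices(MEMBER_MODELS, weights=S, k=1)[0]

rng = random.Random(0)  # seeded only to make the illustration repeatable
draws = [select_member_model(rng) for _ in range(20000)]
share_last = draws.count(MEMBER_MODELS[-1]) / len(draws)
```

Over many requests the share served by each model approaches its probability in s, while the model serving any individual request stays unpredictable to the attacker.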

We also analyzed the FGSM, MI-FGSM, and M-DI²-FGSM attacks, and the classification accuracy of EI-MTD is again higher than that of any single member model. In particular, for M-DI²-FGSM, the attack with the strongest black-box capability, EI-MTD raises the accuracy from the 15.77% of SqueezeNet-1.0, the single member model with the weakest defense, to 41.09%. The proposed EI-MTD method therefore improves the robustness of the whole image classification system.

4.5 Transferability of EI-MTD

The transferability of adversarial examples can be measured by the transfer rate, i.e., the ratio of the number of adversarial examples that transfer to the target model to the total number of adversarial examples constructed on the original model. In percentage terms, the transfer rate equals 100% minus the classification accuracy of the target model. FIG. 5 shows that the transfer rate of adversarial examples on EI-MTD is lower than on the individual member models. For example, in FIG. 5(d) the transfer rate on EI-MTD is 100% − 41.09% = 58.91%, while the transfer rate on the member model MobileNet V2-1.0 is 100% − 28.74% = 71.26%. The transfer rates on the other member models are likewise higher than on EI-MTD, which indicates that EI-MTD reduces the transferability of adversarial examples.

4.6 Effect of T and λ on EI-MTD

To analyze in depth the effect of differential knowledge distillation on EI-MTD, we further examine two important parameters, T and λ, which denote the distillation temperature and the regularization coefficient. Sailik et al. propose differential immunity as a measure of MTD effectiveness, on the view that for an ideal MTD a given attack behaves differently on different model configurations. They therefore define differential immunity in terms of the attack success rate:

where F_a ∈ U denotes the surrogate model chosen by the attacker to generate adversarial examples, F_s ∈ Ω denotes the target model selected by the defender for the classification service, and ASR(F_a, F_s) denotes the attack success rate on the target model of adversarial examples generated on the surrogate model. A larger γ indicates better MTD performance. In this section, we use the differential immunity γ to investigate the effect of T and λ on EI-MTD.
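One way to compute a differential-immunity score from a table of attack success rates is sketched below. The exact formula is not reproduced in the text, so the form used here is an assumption: for each surrogate model, the spread between the highest and lowest attack success rate across member models is normalized by the highest, and the minimum over surrogate models is taken. This form is merely consistent with the later remark that a smaller maximum attack success rate raises γ. The function name and the numbers are hypothetical.

```python
def differential_immunity(asr):
    """Assumed form: min over surrogates of (max ASR - min ASR) / max ASR.

    asr[a][s] is the attack success rate, in [0, 1], of adversarial examples
    generated on surrogate model a and evaluated on member model s.
    """
    ratios = []
    for row in asr:
        highest, lowest = max(row), min(row)
        if highest <= 0.0:
            raise ValueError("each surrogate model needs a positive attack success rate")
        ratios.append((highest - lowest) / highest)
    return min(ratios)

# Hypothetical attack success rates: 2 surrogate models x 2 member models.
gamma_diverse = differential_immunity([[0.8, 0.4], [0.6, 0.3]])
gamma_identical = differential_immunity([[0.5, 0.5], [0.7, 0.7]])
```

Under this assumed form, identical member models give γ = 0, and γ grows as some member model resists an attack that succeeds on another.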

Influence of T: For ease of analysis, λ is fixed at 0.3 and all requests are assumed to be adversarial examples. FIG. 6 shows the relationship between the accuracy of EI-MTD and the distillation temperature T. As T increases, the classification accuracy of EI-MTD increases correspondingly for adversarial examples generated by FGSM, PGD, MI-FGSM, and M-DI²-FGSM.

The differential immunity γ is easily computed from the classification accuracy of all member models. FIG. 7 shows the differential immunity γ as a function of the distillation temperature T. γ increases with T, which means that a higher distillation temperature expands the diversity of the member models. The reason is that a higher temperature brings the decision boundaries of the member models closer to that of the robust teacher model, which reduces max_{F_s} ASR(F_a, F_s). However, the increase in γ levels off after T reaches 12, indicating that the distillation temperature may no longer be the main factor affecting the differences among member models at that point. This result is a good demonstration of the effectiveness of EI-MTD.

Based on the above observations, we further analyzed the correlation between the accuracy of EI-MTD and the differential immunity γ. FIG. 8 shows experimentally the classification accuracy of EI-MTD at different values of γ. The results show that increasing γ improves the performance of EI-MTD, which again confirms the idea of Section 3.1 that the diversity of the member models determines the effectiveness of EI-MTD. For example, when γ is 0.15 the accuracy of EI-MTD is only 27.34%, whereas when γ is raised to 0.38 by increasing the temperature T to 20, the accuracy reaches 47.86%. The role of the distillation temperature T can thus be summarized as follows: (1) increasing T increases the differential immunity γ; (2) increasing γ improves the accuracy of EI-MTD; therefore (3) increasing T increases the effectiveness of EI-MTD.

Influence of λ: λ is the regularization coefficient, which controls the importance of CS_coherence during training. To analyze the effect of λ on EI-MTD performance, we fix the distillation temperature at T = 10. As shown in FIG. 9, increasing λ improves the accuracy of EI-MTD considerably, but this result alone does not reveal the underlying relationship between them. We therefore first show in FIG. 10 how the regularization coefficient λ affects the differential immunity γ. In particular, reducing λ to 0 means that all member models are the same, i.e., EI-MTD performs no effective dynamic scheduling, and the accuracy on adversarial examples generated by PGD is only 27.34%. In contrast, if λ is increased to 1, EI-MTD reaches an accuracy of 55.68%. In fact, increasing λ raises the importance of member model diversity in the differential distillation process, which correspondingly increases the differential immunity, and a larger γ in turn improves the accuracy of EI-MTD: FIG. 11 shows that increasing γ increases the accuracy of EI-MTD at temperature T = 10. We therefore summarize the above analysis as follows: (1) a larger λ increases the diversity of the member models and thus the differential immunity γ; (2) a higher γ ensures higher accuracy; therefore (3) a larger λ helps improve the accuracy of EI-MTD.

Optimal combination of T and λ: Although we analyzed the effects of T and λ separately, the effect of their combination on EI-MTD accuracy is not yet clear. The heat map in FIG. 12 shows the differential immunity γ and the accuracy of EI-MTD under different combinations of T and λ. It can be seen that T and λ do not cancel each other's effect, because increasing either of them increases γ. In this experiment, T = 18 and λ = 0.9 achieve the best performance, and larger values do not appear to have a further significant effect.

The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An edge intelligent moving target defense method based on a Bayes-Stackelberg game, characterized by comprising the following steps:

S1: Adversarial training of the teacher model: the cloud data center holds an existing training data set and a teacher model F_t(θ_t); a 101-layer ResNet-101 neural network is adopted as the teacher model and is adversarially trained in the cloud data center using FGSM adversarial examples, and the process is accelerated by combining the 'FAST' adversarial training method;

S2: Differential knowledge distillation of the student models: first, the soft label of each sample x_i is obtained from the teacher model F_t(θ_t) at a suitable distillation temperature T, and a new training data set is created; a new loss function with the regularization term CS_coherence is defined as $L = \sum_{i=1}^{K} T^2 J_i / K + \lambda \cdot CS_{coherence}$, and all student models are trained simultaneously to minimize the common loss function L;

S3: Dynamic service scheduling of the member models: after differential knowledge distillation, the student models are deployed to edge nodes, one model per node; here, the edge nodes include edge devices and edge servers; one edge server is designated as the service scheduling controller, and all member models and the nodes on which they are located are registered with the scheduling controller; when a user inputs an image through an edge device to request a classification service, the edge device first uploads the service request to the scheduling controller, and the scheduling controller then selects one edge node through the Bayes-Stackelberg game to execute the classification task.

2. The Bayes-Stackelberg game-based edge intelligent moving target defense method according to claim 1, characterized in that: gradient alignment is used as the diversity measure in step S3;

let there be two member models $F_s^{(1)}$ and $F_s^{(2)} \in \Omega$ and a surrogate model $F_a \in U$ selected by the attacker, and let $\nabla_x J_1$ and $\nabla_x J_2$ respectively denote the gradients of the loss functions of $F_s^{(1)}$ and $F_s^{(2)}$ with respect to the sample $x$; if the angle between $\nabla_x J_1$ and $\nabla_x J_2$ is sufficiently small, an $x_{adv}$ misclassified by $F_s^{(1)}$ is also misclassified by $F_s^{(2)}$, so the difference between $F_s^{(1)}$ and $F_s^{(2)}$ is related to the angle between $\nabla_x J_1$ and $\nabla_x J_2$; cosine similarity (CS) is used to represent the degree of alignment of $\nabla_x J_1$ and $\nabla_x J_2$:

$$CS(\nabla_x J_1, \nabla_x J_2) = \frac{\langle \nabla_x J_1, \nabla_x J_2 \rangle}{\|\nabla_x J_1\|_2 \, \|\nabla_x J_2\|_2}$$

wherein $\langle \nabla_x J_1, \nabla_x J_2 \rangle$ is the inner product of $\nabla_x J_1$ and $\nabla_x J_2$; if $CS(\nabla_x J_1, \nabla_x J_2) = -1$, then the gradients $\nabla_x J_1$ and $\nabla_x J_2$ point in opposite directions, meaning that an $x_{adv}$ misclassified by $F_s^{(1)}$ cannot make $F_s^{(2)}$ misclassify.

3. The Bayes-Stackelberg game-based edge intelligent moving target defense method according to claim 1, characterized in that: in step S2, cosine similarity is further applied to the training process of the student models to obtain a set of member models with greater diversity; since cosine similarity is computed between two gradients, to generalize it to K models the maximum pairwise cosine similarity is defined as the EI-MTD diversity measure:

$$CS_{coherence} = \max_{a, b \in \{1, \dots, K\},\ a \neq b} CS(\nabla_x J_a, \nabla_x J_b)$$

wherein $J_a$ and $J_b$ respectively denote the loss functions of student models $F_s^{(a)}$ and $F_s^{(b)}$, $\theta^{(a)}$ and $\theta^{(b)}$ respectively denote the parameters of student models $F_s^{(a)}$ and $F_s^{(b)}$, and the label of $x$ is the soft label obtained from the teacher model; because $CS_{coherence}$ is a non-smooth function to which gradient descent optimization cannot be applied directly, the LogSumExp function is further used to approximate $CS_{coherence}$:

$$CS_{coherence} \approx \log \sum_{1 \le a < b \le K} \exp\bigl(CS(\nabla_x J_a, \nabla_x J_b)\bigr)$$

the student models are distilled from the teacher model of the cloud data center, and the diversity among the student models must be ensured during distillation, so the regularization term $CS_{coherence}$ is added to the knowledge distillation process and a new distillation loss function is defined:

$$L = \frac{1}{K} \sum_{i=1}^{K} T^2 J_i + \lambda \cdot CS_{coherence}$$

wherein λ is the regularization coefficient, which controls the importance of $CS_{coherence}$ during training; to enable the student models to fully learn the adversarial knowledge of the teacher model, β is set to 1, i.e., the student models are trained using only the soft-labeled examples; differential knowledge distillation algorithm 1 is as follows:

4. The Bayes-Stackelberg game-based edge intelligent moving target defense method according to claim 1, characterized in that: in step S3, after the student models are obtained through differential knowledge distillation, they are deployed to the edge nodes as member models; when an edge device receives an image, it does not classify the image on its own model but forwards the image to the scheduling controller; the scheduling controller selects a registered service model according to the scheduling policy, specifically:

the Bayes-Stackelberg game is represented as a seven-tuple, in which the attacker's actions are to select different surrogate models and the payoffs considered are those of the defender L and of the followers, including the legitimate user $F^{(1)}$; the scheduling policy is obtained from an optimization problem whose constraints include

$$0 \le s_n \le 1, \qquad v^{(c)} \in \mathbb{R},$$

wherein $P^{(1)} = 1 - \alpha$ and $P^{(2)} = \alpha$; $s = (p_1, p_2, \dots, p_K)$ is the scheduling strategy of the member models obtained by solving the problem, with $p_i$ the probability that member model $F_s(\theta^{(i)})$ is selected; $q^{(c)}$ is the optimal response strategy of user $F^{(c)}$, and the payoff of that user is $v^{(c)}$; the problem is solved using the DOBSS algorithm.