CN112766496B - Deep learning model safety guarantee compression method and device based on reinforcement learning - Google Patents

Deep learning model safety guarantee compression method and device based on reinforcement learning

Info

Publication number
CN112766496B
CN112766496B (application CN202110119514.9A)
Authority
CN
China
Prior art keywords
deep learning
learning model
value
network
pruning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110119514.9A
Other languages
Chinese (zh)
Other versions
CN112766496A (en)
Inventor
陈晋音
王珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110119514.9A priority Critical patent/CN112766496B/en
Publication of CN112766496A publication Critical patent/CN112766496A/en
Application granted granted Critical
Publication of CN112766496B publication Critical patent/CN112766496B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning model safety guarantee compression method and device based on reinforcement learning, comprising the following steps: (1) modeling the deep learning model as a graph network; (2) extracting the embedded vectors of the graph network with a GCN; (3) taking the current embedded vector of each node of the graph network as the reinforcement learning environment state, predicting an action value from the environment state by reinforcement learning, and pruning the embedded vector of each node according to the action value until the embedded vectors of all nodes have been pruned, realizing one round of compression of the deep learning model; (4) calculating the error rate and security from the predictions of the compressed model on sample data; (5) calculating, by reinforcement learning, the reward value of the round of compression from the error rate and security; (6) repeating steps (3) to (5) until the iteration ends based on the reward value, realizing compression of the deep learning model.

Description

Deep learning model safety guarantee compression method and device based on reinforcement learning
Technical Field
The invention relates to the field of deep learning, in particular to a deep learning model safety guarantee compression method and device based on reinforcement learning.
Background
The rapid development of deep learning has enabled it to reach, and even exceed, human-level performance on tasks such as image classification, speech recognition, and text classification. These advances have led to wide application and deployment of deep learning in the medical field: image recognition, wearable devices, computer-aided diagnosis and treatment, and rehabilitation aids are currently important application areas of medical health. Despite their excellent performance, deep models carry a large number of weights that consume considerable storage and memory bandwidth, which makes deploying deep models on medical devices difficult. Therefore, compressing models to reduce resource consumption is critical to deploying them at the edge.
Deep model compression methods can be broadly divided into network pruning, network quantization, low-rank approximation, knowledge distillation, and compact network design. The main idea of network pruning is to remove relatively unimportant weights from the weight matrix and then fine-tune the network to restore its accuracy. Network quantization reduces model size by reducing the number of bits used to store each parameter. Low-rank approximation treats the weight matrix of the original network as a full-rank matrix and approximates it with several low-rank matrices to achieve simplification. Knowledge distillation applies transfer learning, using the output of a pre-trained complex model as a supervisory signal to train another, simpler network. Compact network design selects a small, compact network at the model-construction stage.
Network pruning is the most common of these compression methods. Its core is determining a compression policy for each layer, because different layers exhibit different redundancy. This typically requires manually set heuristics and domain expertise to explore the design space of the whole model, trading off model size, speed, and accuracy. AMC uses reinforcement learning to sample the design space automatically and improves the compression quality of the model. DMCP models channel pruning as a Markov process and proposes a differentiable channel pruning method that can be optimized directly through the standard task loss and gradient descent with a budget regularization. Both methods can greatly compress a model with no or only slight loss of accuracy. However, neither considers the security of the compressed model, and deep models are vulnerable to adversarial examples. Furthermore, modern deep model architectures contain complex connection patterns, and the layer-by-layer pruning in the above methods does not account for the dependencies between layers.
In conclusion, automatically learning a compression strategy that reduces the parameter count of the compressed model while maintaining high accuracy and strong security has important theoretical and practical significance for deploying deep models on medical devices.
Disclosure of Invention
In view of the above, the invention provides a deep learning model safety guarantee compression method and device based on reinforcement learning, which realize compression of the deep learning model while maintaining its accuracy and security.
In order to achieve the above object, the present invention provides the following technical solutions:
a deep learning model safety guarantee compression method based on reinforcement learning comprises the following steps:
(1) Each layer of the deep learning model for identifying tasks or classifying tasks is regarded as a node, the connection relation between layers is regarded as a continuous edge, and the deep learning model is modeled into a graph network;
(2) Extracting an embedded vector of the graph network by adopting a graph convolution neural network;
(3) Taking the current embedded vector of each node of the graph network as the environment state of reinforcement learning, predicting an action value based on the environment state by reinforcement learning, and trimming the embedded vector of each node according to the action value until the embedded vectors of all nodes are trimmed, so that one round of compression of a deep learning model is realized;
(4) Calculating the error rate and the safety of the compressed deep learning model according to the prediction result of the compressed deep learning model on the sample data for identifying tasks or classifying tasks;
(5) Calculating a return value of a round of deep learning model compression by reinforcement learning according to the error rate and the safety of the compressed deep learning model;
(6) And (3) repeating the steps (3) to (5) until the iteration is ended based on the return value, so as to realize compression of the deep learning model.
Preferably, in step (1), when modeling the deep learning model as a graph network, the feature vector of the current layer is formed from the input sample dimensions, the convolution kernel dimensions and sliding stride, the computation of the current layer, the total computation already reduced over all preceding layers, the total computation remaining over all subsequent layers, and the pruning strategy adopted at the previous layer;
when a connection relation exists between two layers, the edge between them is set to 1, otherwise to 0, thereby building the adjacency matrix of the graph network.
Preferably, in step (2), a feature matrix formed from the node feature vectors and the adjacency matrix of the graph network are used as the inputs of the graph convolutional network, and a graph convolutional network of at least 2 layers is used to extract the embedded vector of each node, the embedded vector having the same dimension as the feature vector.
Preferably, in step (3), pruning the embedded vector of each node according to the action value comprises:
first, determining the pruning strategy according to the action value: when the action value lies in [0, 1/3), a maximum-response (magnitude-based) channel selection pruning strategy is adopted; when the action value lies in [1/3, 2/3), a greedy-search pruning strategy is adopted; and when the action value lies in [2/3, 1], a pruning strategy that selects the channels to prune according to the loss of the feature map is adopted;
then, determining the pruning sparsity from the action value, so that the sparsity corresponding to each pruning strategy lies in [0, 1];
and finally, pruning the embedded vector of the current node with the adopted pruning strategy and corresponding sparsity, realizing compression of the layer of the deep learning model corresponding to that node.
Preferably, determining the pruning sparsity from the pruning strategy comprises: taking 3 times the difference between the computed action value and the minimum of the action-value range corresponding to the pruning strategy as the sparsity.
Preferably, in the reinforcement learning process, after an action network predicts the current action value from the current environment state, a critic network computes an evaluation value from the current environment state, the current action, and the next environment state corresponding to the next node; a loss function is constructed from the evaluation value and a baseline reward value to update the parameters of the critic network, which in turn updates the parameters of the action network used to predict the next action value from the next environment state;
wherein the Loss function Loss is:
Loss = (1/N) · Σ_i ( y_i − Q(s_i, a_i | θ^Q) )²
y_i = r_i − b + γ · Q( s_{i+1}, u(s_{i+1}) | θ^Q )
wherein Q(s_i, a_i | θ^Q) is the evaluation of the critic network with parameters θ^Q for the environment state s_i and action value a_i of the i-th sample, N is the number of transition samples selected from the experience pool for updating the critic network parameters, b is the baseline reward value, an exponential moving average of previous reward values, r_i is the reward recorded in the i-th sample, equal to the episode reward R divided by the number of steps T in the episode, γ is the discount factor, and Q(s_{i+1}, u(s_{i+1}) | θ^Q) is the critic's evaluation, under parameters θ^Q, of the environment state s_{i+1} and the action value u(s_{i+1}) generated from that state.
Preferably, in step (4), calculating the error rate and security of the compressed deep learning model comprises:
inputting sample data of the recognition or classification task into the compressed deep learning model to obtain prediction results, determining the number of correctly classified samples from the prediction results, calculating the prediction accuracy of the compressed model as the ratio of the number of correctly classified samples to the total number of samples, and then calculating the error rate of the compressed model from the prediction accuracy;
the CLEVER score computed from the sample data of the recognition or classification task with the compressed deep learning model is adopted as the security.
Preferably, in step (5), the calculation formula of the reward value R is:
R = −Error·log(FLOPs) + λu
wherein Error is the error rate, FLOPs is the total computation of the compressed deep learning model, λ is a hyper-parameter, and u is the security index.
The deep learning model safety guarantee compression device based on reinforcement learning comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above deep learning model safety guarantee compression method based on reinforcement learning.
Compared with the prior art, the invention has the following beneficial effects:
a deep learning model safety guarantee compression method and device based on reinforcement learning uses a graph network mode to model a model structure, so that the relation between layers can be fully considered, rather than only a structure with only sequential connection between layers is considered, and the method and device are more universal. The compression strategy and the sparsity of each layer can be automatically selected by using the reinforcement learning framework, manual setting is not needed, and in addition, the parameters of the compressed model can be reduced and the safety can be maintained as much as possible while the accuracy of the compressed model is maintained by setting the return value R.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a reinforcement learning-based deep learning model security assurance compression method provided by an embodiment of the invention;
FIG. 2 is a diagram of a reinforcement learning process provided by an embodiment of the present invention;
FIG. 3 is a process diagram of reinforcement learning model compression provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
In order to achieve model compression while maintaining accuracy and security, the embodiment provides a deep learning model safety guarantee compression method and device based on reinforcement learning, which uses a graph convolutional network (GCN) to model the relationships among the layers of the deep learning model, extracts the embedded vector of each layer as the reinforcement learning state, and then prunes each layer with a pruning strategy chosen by reinforcement learning.
Fig. 1 is a flowchart of a deep learning model security guarantee compression method based on reinforcement learning according to an embodiment of the present invention. As shown in fig. 1, the deep learning model security guarantee compression method based on reinforcement learning provided by the embodiment includes the following steps:
and step 1, modeling the compressed deep learning model structure by using a graph network mode.
In an embodiment, each layer of the deep learning model for the recognition or classification task is regarded as a node, the connection relations between layers are regarded as edges, and the deep learning model is modeled as a graph network.
The key elements of the graph network are the nodes, the node attributes, and the node connections. In the modeling process, each layer of the deep learning model is a node whose feature vector consists of 11 features; with t the index of the layer, the feature vector of each node is (t, n, c, h, w, stride, k, FLOPs[t], reduced, rest, a_{t−1}), where n×c×k×k are the dimensions of the convolution kernel, c×h×w are the dimensions of the input samples, stride is the sliding stride of the convolution kernel, FLOPs[t] is the computation of layer t, reduced is the total computation already reduced in the preceding layers, rest is the total computation remaining in the subsequent layers, and a_{t−1} is the action taken at the previous layer; these feature values are scaled to [0, 1]. Such features distinguish the individual convolutional layers well. The node connections are represented by an adjacency matrix A, whose element A_{i,j} is determined by the relationship between nodes i and j: if a connection exists between two layers, the corresponding element is set to 1. A typical deep learning model is connected layer by layer, but residual blocks contain connections that span multiple layers, and DenseNet contains many cross-layer connections; the corresponding elements must also be set to 1 when constructing the adjacency matrix.
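As an illustration of step 1, the following is a minimal sketch, not the patent's own code, of how the node feature matrix and adjacency matrix could be assembled; the helper names (layer_feature, build_graph) and the use of numpy are assumptions made here for clarity:

```python
# Minimal sketch of step 1 (assumed helpers, not from the patent).
import numpy as np

def layer_feature(t, n, c, h, w, stride, k, flops, reduced, rest, prev_action):
    """The 11-dimensional node feature (t, n, c, h, w, stride, k,
    FLOPs[t], reduced, rest, a_{t-1}) described above."""
    return np.array([t, n, c, h, w, stride, k, flops, reduced, rest, prev_action],
                    dtype=np.float32)

def build_graph(layers, edges):
    """layers: list of per-layer 11-tuples; edges: (i, j) pairs, including
    residual / DenseNet cross-layer connections."""
    X = np.stack([layer_feature(*layer) for layer in layers])
    X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-8)  # scale each feature to [0, 1]
    A = np.zeros((len(layers), len(layers)), dtype=np.float32)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0  # connected layers get edge value 1
    return X, A
```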
Step 2: extracting the embedded vectors of the graph network with a graph convolutional network.
In an embodiment, a 2-layer graph convolutional network is used to extract the embedded vectors of the graph network, and the propagation rule of each layer is:

H^(l+1) = σ( D̃^(−1/2) · Ã · D̃^(−1/2) · H^(l) · W^(l) )

wherein Ã = A + I_N, i.e. the adjacency matrix A with the identity matrix I_N added; D̃ is the degree matrix of Ã, i.e. D̃_ii = Σ_j Ã_ij; H^(l) is the activation matrix of layer l, with H^(0) being the feature matrix X composed of the node feature vectors x; W^(l) is the parameter matrix of each layer; and σ is the sigmoid activation function, which maps its inputs to [0, 1]. No class labels are used to train the parameters W; a good aggregation of network information is obtained with randomly initialized parameters. The output embedded vectors are expressed as (y_1, y_2, …, y_l) = G(x_1, x_2, …, x_l), where G denotes the whole GCN model and l the number of nodes; the dimension of the output embedded vector is chosen to be the same as that of the input feature vector.
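A minimal sketch of this 2-layer propagation rule follows, assuming the numpy feature matrix X and adjacency matrix A built above; as in the text, the weights are simply randomly initialized rather than trained:

```python
# Minimal sketch of the 2-layer GCN embedding extraction (assumed setup).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn_embed(X, A, hidden=11, out_dim=11, seed=0):
    rng = np.random.default_rng(seed)
    A_tilde = A + np.eye(A.shape[0])            # add self-loops: A~ = A + I_N
    d = A_tilde.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # normalized adjacency
    W0 = rng.standard_normal((X.shape[1], hidden))
    W1 = rng.standard_normal((hidden, out_dim)) # output dim = input dim (11)
    H1 = sigmoid(A_hat @ X @ W0)                # layer-1 propagation
    return sigmoid(A_hat @ H1 @ W1)             # layer 2: the node embeddings
```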
Step 3: pruning the deep learning model layer by layer with the reinforcement learning algorithm DDPG until the last layer.
In the embodiment, the current embedded vector of each node of the graph network is used as the reinforcement learning environment state, an action value is predicted from the environment state by reinforcement learning, and the embedded vector of each node is pruned according to the action value until the embedded vectors of all nodes have been pruned, realizing one round of compression of the deep learning model. The specific process is as follows:
(a) The current embedded vector of each node of the graph network is used as the environment state of reinforcement learning.
In an embodiment, each layer L_t of the deep learning model, i.e. the current embedded vector of the corresponding graph node, serves as the reinforcement learning environment state s_t. Note that the embedded vectors of all layers cannot be used as states at once, because pruning affects the feature vectors of the subsequent layers, and the changed feature vectors must be fed to the GCN to regenerate the embeddings. That is, since the embodiment prunes the layer structure of the deep learning model layer by layer, after pruning of the current layer is completed, the feature vector of the next layer is input to the GCN to generate its embedded vector.
(b) Predicting an action value from the environment state with reinforcement learning, and pruning the embedded vector of each node according to the action value.
The DDPG agent receives the current environment state s_t from the environment, and its actor network then outputs a value in [0, 1] as the action value a_t. The action value a_t encodes the selected pruning strategy and its sparsity; once a pruning strategy is determined, the specific pruning pattern follows. Three channel pruning methods are available: the first is maximum-response selection, which determines the pruned channels by weight magnitude; the second is greedy search, which selects the channel combination that best preserves the layer's output; the third selects the pruned channels so as to minimize the loss of the feature map. Since each pruning strategy corresponds to a sparsity in [0, 1], the action range of the DDPG output is divided into three equal parts, [0, 1/3), [1/3, 2/3) and [2/3, 1], and a mapping is established for each. After the current layer has been compressed with the corresponding strategy and sparsity, the agent moves to the next layer L_{t+1} and receives state s_{t+1}.
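A minimal sketch of this action decoding, under the assumption that the three strategy ranges are the equal thirds given above and that the sparsity is 3 times the offset into the range (as stated later for the preferred embodiment); the strategy labels are illustrative names, not the patent's identifiers:

```python
# Decode an action value in [0, 1] into (pruning strategy, sparsity).
def decode_action(a):
    """Sketch: sparsity = 3 * (a - range_min), each range having width 1/3."""
    if a < 1.0 / 3.0:
        strategy, lo = "maximum_response", 0.0      # magnitude-based channel selection
    elif a < 2.0 / 3.0:
        strategy, lo = "greedy_search", 1.0 / 3.0   # best channel combination
    else:
        strategy, lo = "feature_map_loss", 2.0 / 3.0
    sparsity = min(max(3.0 * (a - lo), 0.0), 1.0)   # map range of width 1/3 to [0, 1]
    return strategy, sparsity
```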
(c) Following the order of the layers in the deep learning model, the layers are pruned one by one according to the DDPG policy until the last layer L_T is completed, as shown in fig. 3.
The structure of the DDPG is shown in fig. 2. The DDPG has two networks, an actor network and a critic network: the actor network generates an action, and the state and action value are input into the critic network to obtain the corresponding Q value; the objective of the actor is to maximize the Q value, and the objective of the critic is to minimize the error of Q(s, a). Here both the actor and the critic networks have two hidden layers of 300 neurons each, soft updates use τ = 0.01, training uses a batch size of 64, and the replay buffer size is 2000. Exploration noise is drawn from a truncated normal distribution so that the agent explores as much of the unknown space as possible:

a_t ~ TN( u(s_t | θ^u), σ², 0, 1 )

σ is initialized to 0.5 during exploration; after the first 100 exploration episodes, σ is decayed exponentially over the following 300 episodes.
As in Block-QNN, a variant of the Bellman equation is applied. Within an episode, each state transition can be represented by a quadruple (s_t, a_t, R, s_{t+1}), where R is the reward value obtained after the whole deep learning model has been compressed, computed from the accuracy and the security. Because the reward is only obtained at the end of an episode, a baseline reward value b, the exponential moving average of previous rewards, is used during the update to reduce the variance of the gradient estimate:

Loss = (1/N) · Σ_i ( y_i − Q(s_i, a_i | θ^Q) )²
y_i = r_i − b + γ · Q( s_{i+1}, u(s_{i+1}) | θ^Q )

wherein the discount factor γ is set to 1 to avoid giving short-term rewards too high a priority.
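A minimal sketch of this critic update target, assuming PyTorch-style actor/critic modules (the critic taking state and action as two arguments) and transitions (s, a, r, s_next) sampled from the replay buffer:

```python
import torch

def critic_loss(critic, target_actor, target_critic, batch, baseline, gamma=1.0):
    """Squared Bellman error with the baseline reward value b subtracted."""
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = target_actor(s_next)                      # u(s_{i+1})
        y = r - baseline + gamma * target_critic(s_next, a_next)
    q = critic(s, a)                                       # Q(s_i, a_i | theta^Q)
    return torch.mean((y - q) ** 2)

def update_baseline(baseline, episode_reward, beta=0.95):
    """Exponential moving average of previous episode rewards (beta assumed)."""
    return beta * baseline + (1.0 - beta) * episode_reward
```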
Step 4: calculating the error rate and security of the compressed deep learning model and obtaining the reward value.
In the embodiment, the accuracy and security are evaluated on a validation set of the recognition or classification task, and the reward value is then fed back to the agent. Accuracy is evaluated as in image classification, i.e. the number of correctly classified samples as a fraction of the total number of samples. The error rate Error is calculated as:

Error = 1 − N_correct / N_total

wherein N_correct is the number of correctly classified samples and N_total is the total number of samples.
Security is evaluated with the CLEVER score, which is attack-independent, has a reasonable computational cost, and can be used on large datasets such as ImageNet. It uses a local Lipschitz constant to estimate a lower bound on the minimum adversarial distance and takes this bound as the measure of model security; extreme value theory is used to estimate the required Lipschitz constant, which is more efficient. The CLEVER score of a single sample is calculated as follows. The targeted-attack version is named CLEVER-t; its inputs are a classifier function f(x), a sample x_0 with true class c, a target class j, the perturbation norm p, and the maximum perturbation M. N_b batches, each of N_s samples, are drawn uniformly and independently from the fixed ball B_p(x_0, M). First, initialize:

g(x) ← f_c(x) − f_j(x), S ← ∅

For each batch i of the N_b batches, the N_s samples in the batch are taken in turn, the k-th sample of batch i being denoted x^(i,k) ∈ B_p(x_0, M); b_ik can be calculated by back propagation:

b_ik ← ‖∇g(x^(i,k))‖_q

where q is the dual norm of p (1/p + 1/q = 1). S is updated for each batch:

S ← S ∪ { max_k { b_ik } }

A reverse Weibull distribution is fitted to S and the maximum likelihood estimate of its location parameter is recorded as â_W; the CLEVER score of this sample is u:

u ← min( g(x_0) / â_W, M )

For a non-targeted attack, the targeted-attack CLEVER score is calculated for every target class and the minimum is taken. With the per-sample CLEVER score above, a portion of a dataset can be selected at random and the scores averaged as the security index of the classifier on that dataset.
The reward value R is calculated by:

R = −Error·log(FLOPs) + λu

This reward function is relatively sensitive to variation in the error rate Error, so maximizing R drives the error rate down as far as possible. Because the number of floating-point operations is large, its logarithm is taken and multiplied by the error rate as a small incentive toward compression. The CLEVER score is not suited to taking a logarithm, since its value is small; it is therefore multiplied by a hyper-parameter λ and added.
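A one-line sketch of this reward, with flops the remaining computation of the compressed model, u the CLEVER-based security index, and lam an assumed value for the hyper-parameter λ:

```python
import math

def reward(error, flops, u, lam=1.0):
    """R = -Error * log(FLOPs) + lambda * u (lam value assumed)."""
    return -error * math.log(flops) + lam * u
```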
Step 5: feeding the reward value R back to the agent, and repeating step 3 and step 4 until the iteration ends.
In the embodiment, the number of experimental rounds, 400 episodes in total in the DDPG setting, is used as the stopping criterion, and the reinforcement learning policy is updated continuously to achieve a better compression effect.
The embodiment also provides a deep learning model safety guarantee compression device based on reinforcement learning, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above deep learning model safety guarantee compression method based on reinforcement learning.
The deep learning model safety guarantee compression method and device based on reinforcement learning can be used for pruning compression of deep learning models for image classification, speech recognition, and text classification tasks, making the model lighter while its security and accuracy are preserved; the computation of the compressed deep learning model in these tasks is reduced, resource consumption falls, and the model becomes convenient to deploy at the edge.
When the deep learning model is used for an image classification task, the adopted sample data is an image, and the error rate and the safety of the compressed model are calculated according to the image. When the deep learning model is used for a voice recognition task, the adopted sample data are voice data, and the error rate and the safety of the compressed model are calculated according to the voice data. When the deep learning model is used for text classification tasks, the adopted sample data are text data, and the error rate and the safety of the compressed model are calculated according to the text data.
The deep learning model safety guarantee compression method and device based on reinforcement learning provided by the embodiment model the model structure as a graph network, so the connections between layers can be considered in full, not only architectures whose layers are connected purely sequentially, which makes the method more general. With the reinforcement learning framework, the compression strategy and sparsity of each layer are selected automatically without manual setting, and the network structure is pruned layer by layer according to the compression strategy and sparsity until the last layer. In addition, by designing the reward value R, the parameters of the compressed model can be reduced as much as possible and its security maintained while its accuracy is preserved.
The foregoing is a detailed description of preferred embodiments of the invention and of its advantages. It should be understood that the description is merely illustrative of the presently preferred embodiments; all changes, additions, substitutions, and equivalents made within the spirit and scope of the invention are intended to fall within its protection.

Claims (9)

1. A deep learning model safety guarantee compression method based on reinforcement learning, characterized by comprising the following steps:
(1) regarding each layer of a deep learning model for a recognition task or a classification task as a node and the connection relations between layers as edges, and modeling the deep learning model as a graph network according to the connection relations, wherein the classification tasks comprise image classification, speech recognition, and text classification tasks; when the deep learning model is used for an image classification task, the sample data adopted are images; when the deep learning model is used for a speech recognition task, the sample data adopted are speech data; and when the deep learning model is used for a text classification task, the sample data adopted are text data;
(2) extracting the embedded vectors of the graph network with a graph convolutional network;
(3) taking the current embedded vector of each node of the graph network as the reinforcement learning environment state, predicting by reinforcement learning an action value based on the environment state, and pruning the current layer of the model according to the action value until every layer of the model has been pruned, realizing one round of compression of the deep learning model;
(4) calculating the error rate and security of the compressed deep learning model from its prediction results on sample data of the recognition or classification task;
(5) calculating, by reinforcement learning, the reward value of the round of deep learning model compression from the error rate and security of the compressed deep learning model;
(6) repeating steps (3) to (5) until the iteration ends based on the reward value, realizing compression of the deep learning model.
2. The reinforcement learning-based deep learning model safety guarantee compression method as claimed in claim 1, wherein in step (1), when modeling the deep learning model as a graph network, the feature vector of the current layer of the deep learning model is formed from the input sample dimensions, the convolution kernel dimensions and sliding stride, the computation of the current layer, the total computation already reduced over all preceding layers, the total computation remaining over all subsequent layers, and the pruning strategy adopted at the previous layer;
when a connection relation exists between two layers, the edge between them is set to 1, otherwise to 0, thereby building the adjacency matrix of the graph network.
3. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein in step (2), a feature matrix formed from the node feature vectors and the adjacency matrix of the graph network are used as the inputs of the graph convolutional network, and a graph convolutional network of at least 2 layers is used to extract the embedded vector of each node, the embedded vector having the same dimension as the feature vector.
4. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein in step (3), pruning the embedded vector of each node according to the action value comprises:
first, determining the pruning strategy according to the action value: when the action value lies in [0, 1/3), a maximum-response channel selection pruning strategy is adopted; when the action value lies in [1/3, 2/3), a greedy-search pruning strategy is adopted; and when the action value lies in [2/3, 1], a pruning strategy that selects the channels to prune according to the loss of the feature map is adopted;
then, determining the pruning sparsity from the pruning strategy, so that the sparsity corresponding to each pruning strategy lies in [0, 1];
and finally, pruning the embedded vector of the current node with the adopted pruning strategy and corresponding sparsity, realizing compression of the layer of the deep learning model corresponding to that node.
5. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein determining the pruning sparsity from the pruning strategy comprises: taking 3 times the difference between the computed action value and the minimum of the action-value range corresponding to the pruning strategy as the sparsity.
6. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein in the reinforcement learning process, after an action network predicts the current action value from the current environment state, a critic network computes an evaluation value from the current environment state, the current action, and the next environment state corresponding to the next node, and a loss function is constructed from the evaluation value and a baseline reward value to update the parameters of the critic network, thereby updating the parameters of the action network used to predict the next action value from the next environment state;
wherein the Loss function Loss is:
Loss = (1/N) · Σ_i ( y_i − Q(s_i, a_i | θ^Q) )²
y_i = r_i − b + γ · Q( s_{i+1}, u(s_{i+1}) | θ^Q )
wherein Q(s_i, a_i | θ^Q) is the evaluation of the critic network with parameters θ^Q for the environment state s_i and action value a_i of the i-th of the N extracted samples, N is the number of transition samples selected from the experience buffer for updating the critic network parameters, b is the baseline reward value, an exponential moving average of previous reward values, r_i is the reward recorded in the i-th sample, equal to the episode reward R divided by the number of steps T in the episode, γ is the discount factor, and Q(s_{i+1}, u(s_{i+1}) | θ^Q) is the critic's evaluation, under parameters θ^Q, of the environment state s_{i+1} and the action value u(s_{i+1}) generated from that state.
7. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein in step (4), calculating the error rate and security of the compressed deep learning model comprises:
inputting sample data of the recognition or classification task into the compressed deep learning model to obtain prediction results, determining the number of correctly classified samples from the prediction results, calculating the prediction accuracy of the compressed deep learning model as the ratio of the number of correctly classified samples to the total number of samples, and then calculating the error rate of the compressed deep learning model from the prediction accuracy;
the CLEVER score computed from the sample data of the recognition or classification task with the compressed deep learning model is adopted as the security.
8. The reinforcement learning-based deep learning model safety guarantee compression method of claim 1, wherein in step (5), the calculation formula of the reward value R is:
R = −Error·log(FLOPs) + λu
wherein Error is the error rate, FLOPs is the total computation of the compressed deep learning model, λ is a hyper-parameter, and u is the security index.
9. A reinforcement learning-based deep learning model safety guarantee compression device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the reinforcement learning-based deep learning model safety guarantee compression method of any one of claims 1 to 8.
CN202110119514.9A 2021-01-28 2021-01-28 Deep learning model safety guarantee compression method and device based on reinforcement learning Active CN112766496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110119514.9A CN112766496B (en) 2021-01-28 2021-01-28 Deep learning model safety guarantee compression method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110119514.9A CN112766496B (en) 2021-01-28 2021-01-28 Deep learning model safety guarantee compression method and device based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112766496A CN112766496A (en) 2021-05-07
CN112766496B true CN112766496B (en) 2024-02-13

Family

ID=75706455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110119514.9A Active CN112766496B (en) 2021-01-28 2021-01-28 Deep learning model safety guarantee compression method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112766496B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114389990A (en) * 2022-01-07 2022-04-22 中国人民解放军国防科技大学 Shortest path blocking method and device based on deep reinforcement learning
CN114756294B (en) * 2022-03-22 2023-08-04 同济大学 Mobile edge computing and unloading method based on deep reinforcement learning
CN116129197A (en) * 2023-04-04 2023-05-16 中国科学院水生生物研究所 Fish classification method, system, equipment and medium based on reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090443A (en) * 2017-12-15 2018-05-29 华南理工大学 Scene text detection method and system based on deeply study
CN109754085A (en) * 2019-01-09 2019-05-14 中国人民解放军国防科技大学 Deep reinforcement learning-based large-scale network collapse method, storage device and storage medium
CN110728361A (en) * 2019-10-15 2020-01-24 四川虹微技术有限公司 Deep neural network compression method based on reinforcement learning
CN111340227A (en) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for compressing business prediction model through reinforcement learning model

Also Published As

Publication number Publication date
CN112766496A (en) 2021-05-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant