CN115828188A - Method for defending substitute model attack and capable of verifying DNN model copyright - Google Patents

Method for defending substitute model attack and capable of verifying DNN model copyright

Info

Publication number
CN115828188A
Authority
CN
China
Prior art keywords
model
watermark
network
data set
original data
Prior art date
Legal status
Pending
Application number
CN202211661085.9A
Other languages
Chinese (zh)
Inventor
刘红
吴希昊
刘传雨
肖云鹏
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211661085.9A priority Critical patent/CN115828188A/en
Publication of CN115828188A publication Critical patent/CN115828188A/en
Pending legal-status Critical Current

Landscapes

  • Editing Of Facsimile Originals (AREA)

Abstract

The invention relates to the technical field of deep learning model property right protection, and in particular to a method for defending against substitute model attacks and verifying DNN model copyright. The method comprises: constructing a joint deployment model comprising an extraction network and a classification network; in the joint deployment model, the data set comprises an original data set and a trigger set, the trigger set being generated by embedding a watermark into the original data; during training of the joint deployment model, the extraction network extracts watermarks from the data set, such that the result extracted from the original data set is consistent with the original data and the result extracted from the trigger set is the embedded watermark; adding a perturbation to the extracted data and performing adversarial training on the classification model; if an attacker obtains the model and the data set on the server through an attack and trains a substitute model, the original data set and the trigger set are respectively input into the suspect model, and if the classification results of the two data sets differ, the model is judged to be a substitute model. The invention can verify the copyright of the model under black-box conditions.

Description

Method for defending substitute model attack and capable of verifying DNN model copyright
Technical Field
The invention relates to the technical field of deep learning model property right protection, and in particular to a method for defending against substitute model attacks and verifying DNN model copyright.
Background
At present, deep learning is developing rapidly in many fields and has achieved great success in computer vision, natural language processing and other areas, far exceeding traditional algorithms. A good deep learning model often requires many specialized experts, a large amount of computing resources, and large-scale data that is frequently proprietary to a company, which means that a deep learning model has great commercial value. However, as a digital product, a neural network model not only condenses the designer's expertise but also requires a significant amount of training data and computational resources. For example, in order to accurately recognize a human face, a neural network generally needs tens of millions of face images for learning and generalization. In addition, depending on the network structure, data scale and computational resources, training usually takes several weeks. It is therefore necessary to protect the copyright of a neural network model from infringement.
The trained model is deployed on a cloud server to provide services; however, the server may be maliciously attacked, causing the model to leak. An attacker can obtain illegal benefits by providing services with a plagiarized model, or can train a substitute model through API access. Methods of verifying model copyright are roughly classified into white-box verification and black-box verification.
In 2017, Uchida et al. proposed a method for adding a watermark to a model: a regularization term is added to the objective function used to train a normal network so that copyright information is embedded in the network weights, while ensuring that model performance is not significantly degraded. However, verification requires full access to the weights and structure of the model, and in real scenarios an attacker will not disclose the model. In order to verify copyright without knowing the internal structure and the full weights of the model, Merrer et al. proposed a method for verifying model copyright under black-box conditions: an adversarial-defense technique is used to fine-tune the decision boundary of the model so that the fine-tuned network still classifies samples near the decision boundary normally while correctly classifying several selected adversarial samples, but they do not consider the transferability of adversarial samples. Zhang et al. designed a black-box model watermark based on the author's signature, with three watermark styles: a picture mark, random noise, and an irrelevant picture. The watermarked pictures are given target labels specified by the author and mixed into the training set; the trained network behaves completely normally on normal picture inputs, but outputs the specified target label when a watermarked picture is encountered, thereby proving the existence of the watermark. Adi et al. proposed a black-box model watermarking algorithm based on backdoor attacks: some abstract pictures are randomly selected, given target labels and mixed into the training set; the trained network appears normal on normal inputs, and when one of the selected abstract pictures is encountered, the model outputs the designated target label, thereby proving the existence of the watermark. However, the black-box model watermarks at this stage are all 0-1 watermark algorithms, i.e. the embedded watermark can only prove the presence or absence of a watermark. Guo et al. designed a multi-bit black-box model watermarking algorithm: author information is first converted into an n-bit binary sequence and then fed into a random number generator and a random permuter to specify which images receive watermark labels as well as the embedded watermark positions and contents; when extracting the watermark, it can be correctly recovered only if the same information is used to compute the embedding positions. Chen et al. also implemented a multi-bit black-box model watermarking algorithm: when embedding a watermark, all pictures in the training set are first fed into the network, the output logits are averaged and clustered into two classes, pictures and target labels are then selected from the two classes according to the author's copyright identifier to generate adversarial samples, and the model is finally fine-tuned to strengthen the attack effect of the adversarial samples.
However, existing black-box schemes only consider verification; how to handle verification failure caused by a substitute-model attack remains to be considered. A substitute model trained on a data set derived from the target model can imitate the behavior of the target model to the greatest extent by repeatedly querying (polling) the target model. There are two main approaches to defending against substitute-model attacks: preventing the substitute model from being successfully trained, or ensuring that the backdoor remains active in the trained substitute model. The training of a deep learning model is based on the gradient descent algorithm: model parameters are optimized through back propagation until a convergence state is reached. For a single-label classification problem, one picture must correspond to a single category; by mapping visually identical pictures to different classification labels, gradient descent for the substitute model can be disabled so that the substitute model cannot converge. There is therefore a need for a method that disables the gradient descent algorithm when a substitute model is trained, without affecting the copyright owner's own training of the model.
Disclosure of Invention
In order to verify the copyright of a model under black-box conditions and to disrupt the training of substitute models, the invention provides a method for defending against substitute model attacks and verifying DNN model copyright, which specifically comprises the following steps:
s1, constructing a joint deployment model, wherein the model comprises an extraction network and a classification network;
s2, in the joint deployment model, the data set comprises an original data set and a trigger set, and the trigger set is generated from the original data set by using a spatial invisible watermark mechanism;
s3, in the process of training the joint deployment model, extracting watermarks from the data set by using the extraction network, wherein the result extracted from the original data set is consistent with the original data, and the result extracted from the trigger set is the watermark generated by the spatial invisible watermark mechanism in step S2;
s4, adding a perturbation to the data after watermark extraction, and performing adversarial training on the classification model;
and S5, if an attacker obtains the model and the data set on the server through an attack and trains a substitute model, respectively inputting the original data set and the trigger set into the suspect model; if the classification results of the two data sets differ, the model is a substitute model.
Further, the loss function of the extraction network in the training process is as follows:
L_R = λ_4·l_wm + λ_5·l_self
wherein l_wm is the watermark loss, which requires the extraction network to recover the watermark from the data in the trigger set, and λ_4 is the weight of the watermark loss; l_self is the self-loss, which requires the extraction network to recover an image consistent with the original image from the data in the original data set, and λ_5 is the weight of the self-loss.
Further, the watermark loss l_wm is expressed as:
l_wm = (1/N_c)·Σ_i ‖R(x_i') − l‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i')) − VGG_k(l)‖_2
wherein x_i' is the i-th image in the trigger set X', N_c is the number of pixels, R(x_i') denotes the data extracted by the extraction network R from the trigger-set image x_i', l is the watermark, and N_f is the number of neurons in the extraction network; VGG_k(R(x_i')) denotes the features, at the k-th layer of the VGG network, of the watermark extracted from the i-th image x_i' in the trigger set by the extraction network R; and VGG_k(l) denotes the features of the watermark l at the k-th layer of the VGG network.
Further, the self-loss l_self is expressed as:
l_self = (1/N_c)·Σ_i ‖R(x_i) − x_i‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i)) − VGG_k(x_i)‖_2
wherein x_i denotes the i-th image in the original data set X, N_c is the number of pixels, R(x_i) denotes the result extracted by the extraction network from the i-th image x_i of the original data set X, x_i' is the i-th image in the trigger set X', and N_f is the number of neurons in the extraction network; VGG_k(R(x_i)) denotes the features, at the k-th layer of the VGG network, of the result extracted from the i-th image of the original data set X by the extraction network R; and VGG_k(x_i) denotes the features of the i-th image x_i of the original data set X at the k-th layer of the VGG network.
Further, in the process of performing adversarial training on the classification model, if the label corresponding to an original datum x is y, the datum obtained by adding a perturbation to the original datum is denoted x + Δx and its label is set to y; after training is completed, the label obtained when the classifier M classifies the original datum x is y', and the label obtained when it classifies x + Δx is y.
Further, the loss function of the classification model training process is expressed as:
min_θ E_(x,y)∼D [ max_{Δx∈Ω} L(x + Δx, y; θ) ]
wherein x denotes a datum in the training set, y denotes the true label corresponding to the datum x, and y' denotes the wrong label corresponding to the datum x; L(x + Δx, y; θ) denotes the loss function of the classification model, i.e. its cross-entropy loss; θ denotes the model parameters of the classification model; and Δx is a perturbation belonging to the perturbation set Ω;
L(x, y; θ) = −[y·log f_θ(x) + (1 − y)·log(1 − f_θ(x))]
denotes the cross-entropy loss, where f_θ(x) denotes the label predicted by the model with parameters θ when the input is x.
Compared with the prior art, the invention has the following beneficial effects:
Wider application scenarios: copyright verification of the model is performed in a black-box manner, and the model structure and parameters do not need to be known during verification, so the method applies to a wide range of scenarios;
Higher security: the method poisons the training data set through an invisible watermark mechanism, so that only the model copyright owner can train a model on the poisoned data set; training of a substitute model fails, which prevents an attacker from training a substitute model to defeat black-box verification.
Drawings
FIG. 1 is a schematic diagram of the scenario addressed by the method for defending against substitute model attacks and verifying DNN model copyright;
FIG. 2 is a general flow diagram of an embodiment of the method of the present invention for defending against substitute model attacks and verifying DNN model copyright;
FIG. 3 is a schematic illustration of multiple gradients in an embodiment of the invention;
FIG. 4 is a diagram of the model framework employed in an embodiment of the method of the present invention for defending against substitute model attacks and verifying DNN model copyright;
FIG. 5 is the structure of the discriminator employed in the present invention;
FIG. 6 is the extraction network used for extracting the watermark in the present invention;
FIG. 7 is the watermark embedding network used for generating the trigger set according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for defending against substitute model attacks and verifying DNN model copyright, as shown in FIG. 2, which specifically comprises the following steps:
s1, constructing a joint deployment model, wherein the model comprises an extraction network and a classification network;
s2, in the joint deployment model, the data set comprises an original data set and a trigger set, and the trigger set is generated from the original data set by using a spatial invisible watermark mechanism;
s3, in the process of training the joint deployment model, extracting watermarks from the data set by using the extraction network, wherein the result extracted from the original data set is consistent with the original data, and the result extracted from the trigger set is the watermark generated by the spatial invisible watermark mechanism in step S2;
s4, adding a perturbation to the data after watermark extraction, and performing adversarial training on the classification model;
and S5, if an attacker obtains the model and the data set on the server through an attack and trains a substitute model, respectively inputting the original data set and the trigger set into the suspect model; if the classification results of the two data sets differ, the model is a substitute model.
The invention mainly addresses the scenario in which an attacker tries to evade the backdoor by training a substitute model, as shown in FIG. 1. The method mainly comprises five stages:
The first stage: constructing a mixed data set based on the invisible watermark mechanism;
The second stage: jointly training the watermark extraction network and the classification network on the mixed data set;
The third stage: reverse adversarial training, so that the classification model produces correct results only for the output of the watermark extraction network;
The fourth stage: model deployment and verification, in which the extraction network and the classification network are jointly deployed, and when copyright is verified, ownership of the model is judged according to whether the outputs for the trigger set and for the ordinary data set agree or differ;
The fifth stage: the attacker trains a substitute model; the attacker carries out an internal attack to obtain the mixed data set and the source model and trains a substitute model based on them, but because the invisible watermark mechanism of the mixed data set makes visually consistent images correspond to multiple labels, the gradient descent algorithm of the substitute model is destroyed and the substitute model cannot converge.
In this embodiment, the raw data set is first collected, the trigger set is generated using a spatial invisible watermark mechanism, and the raw data set and the trigger set together are treated as a mixed data set. In this process, the raw data set may be taken directly from an existing image classification data set, such as CIFAR-10 or MNIST. The invention can provide copyright protection for a classification model trained on any data set.
There are two ideas for defending against a substitute model trained from the source model and its data set: one is to make gradient descent fail for the substitute model; the other is to keep the backdoor active in the substitute model. Keeping the backdoor active in the substitute model is feasible, but the resulting substitute model can still be used, which still infringes the interests of the copyright owner to some extent. Directly preventing the substitute model from being trained is the most effective approach; since deep learning optimizes internal parameters with the gradient descent algorithm, training of the model can be prevented by breaking gradient descent. An invisible watermark, also called a blind watermark, makes the image carrying the watermark appear completely identical to the original image. The invisible watermark mechanism is combined with a backdoor mechanism: images carrying the invisible watermark are assigned different categories and serve as a trigger set that activates the backdoor; at the same time, because the trigger set and the original data set together form the mixed training set, visually identical images carry different labels, so a substitute model trained on this set cannot converge.
The images in the original data set are passed through a watermark embedding network to obtain the trigger-set images, i.e. a spatial invisible watermark is added on the basis of the original images; a person skilled in the art may choose any existing method for adding a spatial invisible watermark to an image, and the invention imposes no further limitation on this, except that the embedding network is trained so that the images before and after watermark embedding remain visually and structurally consistent. The trigger set (D2) and the original data set (D1) are taken together as the mixed training set D, i.e. D = D1 + D2. Let x_i be the i-th picture in D with corresponding label y_i. Then, visually, a datum x_m of the original data set equals a datum x_n of the trigger set; if the m-th datum x_m in the original data set has label y_m and the n-th datum x_n in the trigger set has label y_n, then x_m and x_n are visually consistent, so a substitute model trained by the attacker will treat x_m = x_n even though y_m ≠ y_n, i.e. the same picture is considered to correspond to different classification labels.
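For illustration only, the following minimal PyTorch sketch shows one possible way to assemble such a mixed training set D = D1 + D2; the simple additive blend stands in for the trained watermark embedding network H described below, and the fixed label shift for the trigger set is an assumption used only to produce the conflicting labels y_m ≠ y_n.

```python
# Sketch: build the mixed training set D = D1 + D2 (additive blend as a stand-in for the embedding network H).
import torch
from torch.utils.data import TensorDataset, ConcatDataset

x_orig = torch.rand(100, 3, 32, 32)             # original data set D1
y_orig = torch.randint(0, 10, (100,))
watermark = torch.rand(3, 32, 32)

# Trigger set D2: visually (almost) identical images that deliberately carry different labels.
x_trig = (x_orig + 0.01 * watermark).clamp(0, 1)
y_trig = (y_orig + 1) % 10                      # y_m != y_n although x_m and x_n look the same

mixed = ConcatDataset([TensorDataset(x_orig, y_orig), TensorDataset(x_trig, y_trig)])
print(len(mixed))                               # 200 samples, half of them trigger-set images
```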
Further, embedding the watermark in the original image requires that the watermarked image (the trigger set) remain visually consistent with the original carrier image, so as not to sacrifice the quality of the original carrier image. Since generative adversarial networks perform well at minimizing the perceptible difference between images, this embodiment adds a discriminator D after the watermark embedding network H and trains until the discriminator D cannot tell whether the image output by the watermark embedding network H is a watermarked image, which further improves the image quality of the watermarked images. In the embodiment of the present invention, as shown in FIG. 4, UNet can be used as the watermark embedding network H; the UNet structure shown in FIG. 7 is widely used in image processing tasks, especially tasks in which the input image and the output image share the same attributes, because the UNet network has weight-sharing connections. The loss function of the watermark embedding network can be expressed as:
L_H = λ_1·l_bs + λ_2·l_vgg + λ_3·l_adv
where λ_1, λ_2 and λ_3 are hyper-parameters; all three coefficients may be set to 1;
l_bs = (1/N_c)·‖x' − x‖_2
where N_c denotes the total number of pixel values of an image, x' denotes an image in the trigger set X', and x denotes an image in the original image set X; the perceptual loss l_vgg is the difference between the VGG features of x' and x and can be expressed as:
l_vgg = (1/N_f)·Σ_k ‖VGG_k(x') − VGG_k(x)‖_2
where VGG_k(·) denotes the features extracted at the k-th layer of the VGG network and N_f is the number of neurons of the VGG network; the VGG network in this embodiment is a VGG16 network. The adversarial loss is used to constrain the discriminator, which judges whether an image is a trigger-set image (after watermark embedding) or an original image; this loss function is expressed as:
l_adv = −Σ_i [ log D(x_i) + log(1 − D(x_i')) ]
where D(x_i) denotes the output of the discriminator for an input image x_i; the meaning of l_adv is that, for an ideal discriminator, the output is 1 when the input is an original image and 0 when the input is a trigger-set image. The above loss function can be used to train the required watermark embedding network, which embeds the copyright watermark into the original data set to obtain the trigger set; the trigger set is then mixed with the original data set.
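As a sketch under stated assumptions, the loss L_H can be assembled as below; the choice of VGG16 layers, the use of mean-squared differences, and the exact form of the adversarial term (here: push the discriminator output for watermarked images toward "original") are assumptions, since the text does not fix them. The discriminator is assumed to return one probability per image, as in the sketch after the next paragraph.

```python
# Sketch of L_H = λ1*l_bs + λ2*l_vgg + λ3*l_adv (layer choice, norms and the adversarial term are assumptions).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_features = vgg16(weights=None).features[:16].eval()     # in practice pretrained VGG16 weights would be loaded

def embedding_loss(x, x_wm, discriminator, lam=(1.0, 1.0, 1.0)):
    l_bs = F.mse_loss(x_wm, x)                               # pixel-level (basic) loss
    l_vgg = F.mse_loss(vgg_features(x_wm), vgg_features(x))  # perceptual loss on VGG features
    # An ideal discriminator outputs 1 for originals and 0 for watermarked images, so the
    # embedding network is trained to make D(x_wm) look like an original image.
    l_adv = F.binary_cross_entropy(discriminator(x_wm), torch.ones(x.size(0), 1))
    return lam[0] * l_bs + lam[1] * l_vgg + lam[2] * l_adv
```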
In this embodiment, original images and watermarked images are input into the discriminator, which outputs the probability that the input image is an original image or a watermarked image, until the discriminator can no longer distinguish the watermarked images output by the embedding network from the original images. As shown in FIG. 5, the discriminator is formed by cascading three convolution modules and one convolution layer, where each convolution module consists, in order, of a convolution layer (Conv), a batch normalization layer (BN) and a ReLU activation layer.
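Following that description (three Conv-BN-ReLU modules cascaded with a final convolution layer), a minimal sketch of such a discriminator is given below; the channel widths, kernel sizes, strides and the final sigmoid pooling are assumptions, since the text does not specify them.

```python
# Sketch of the discriminator: three Conv-BN-ReLU modules followed by one convolution layer.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
        self.head = nn.Conv2d(128, 1, kernel_size=3, padding=1)

    def forward(self, x):
        score = self.head(self.body(x))               # (N, 1, h, w) patch scores
        return torch.sigmoid(score.mean(dim=(2, 3)))  # probability that the input is an original image

d = Discriminator()
print(d(torch.rand(4, 3, 32, 32)).shape)              # torch.Size([4, 1])
```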
The mixed data set is used as the training set, and the watermark extraction network is used to jointly train the protected classification model, which solves the problem that the gradient cannot descend. The mixed data set prevents the training of a substitute model, but it would also interfere with the copyright owner's own training of the model; therefore a network must be trained to purify the poisoned data set, extracting results from the visually identical pictures, and these extracted results, rather than the mixed-data-set images, are used as the training images for the classification network.
Watermark extraction is performed on the mixed data set. To ensure the accuracy of the jointly trained classification model, for the watermark-free part of the mixed data set the extracted result is required to remain consistent with the original image, while for the watermarked part the watermark is extracted and the result is required to be consistent with the watermark. The model owner then trains on the output of the extraction network; because the extraction network removes the visual consistency between visually identical pictures, the problem that the gradient cannot descend is solved.
In this embodiment, the extraction network R adopts CEILNet, which follows the structure of an auto-encoder (FIG. 6): the encoder consists of three convolution layers and the decoder consists of one deconvolution layer and two convolution layers. The watermark extraction loss function must constrain the extraction network R to extract the watermark from trigger-set images and to output the image itself for original images, and a layer of Gaussian noise is added after extraction; the loss function proposed in this embodiment is expressed as:
L_R = λ_4·l_wm + λ_5·l_self
where λ_4 and λ_5 are the hyper-parameters (weights) of the watermark loss and the self-loss, respectively, and both are set to 1; the watermark loss l_wm is expressed as:
l_wm = (1/N_c)·Σ_i ‖R(x_i') − l‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i')) − VGG_k(l)‖_2
where l is the watermark, which is combined with the images of the original data set to obtain the trigger set; the watermark loss constrains the watermark extraction network R so that, for trigger-set images, the extracted result remains structurally and visually consistent with the watermark image, which is why the visual perceptual loss l_vgg is included; the self-loss l_self is expressed as:
l_self = (1/N_c)·Σ_i ‖R(x_i) − x_i‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i)) − VGG_k(x_i)‖_2
and the self-loss constraint watermark extraction network R extracts an image which keeps consistent with the original image structure and vision for the original image.
The classification network has a backdoor embedded during joint training with the extraction network, so that it cannot be used normally once separated from the extraction network. An adversarial attack deliberately adds imperceptible, subtle perturbations to input samples, causing the model to give a wrong output with high confidence. Adversarial training is a defense method adopted to improve the robustness of a model to adversarial samples: in this embodiment, some adversarial samples are constructed and added to the original data set in the hope of strengthening the model's robustness to adversarial samples. The adversarial training formula is expressed as:
min_θ E_(x,y)∼D [ max_{Δx∈Ω} L(x + Δx, y; θ) ]
the above formula can also be expressed as:
min_θ E_(x,y)∼D [ max_{Δx∈Ω} −[y·log f_θ(x + Δx) + (1 − y)·log(1 − f_θ(x + Δx))] ]
wherein x denotes a datum in the training set, y denotes the true label corresponding to the datum x, and y' denotes the wrong label corresponding to the datum x; L(x + Δx, y; θ) denotes the loss function of the classification model, i.e. its cross-entropy loss; θ denotes the model parameters of the classification model; and Δx is a perturbation belonging to the perturbation set Ω;
L(x, y; θ) = −[y·log f_θ(x) + (1 − y)·log(1 − f_θ(x))]
denotes the cross-entropy loss, where f_θ(x) denotes the label predicted by the model with parameters θ when the input is x.
The adversarial training process is as follows:
1. A perturbation is added to the input x so that the model cannot obtain the correct prediction result y, i.e. f(x + Δx) ≠ y, and a corresponding loss value is produced by the loss function. The perturbation Δx is restricted to the perturbation space Ω, ‖Δx‖ ≤ ε, which ensures that the perturbed sample is indistinguishable to a human; ε is a small threshold, generally set to 0.01.
2. After the perturbation is added, the model input is x + Δx; the model is trained with (x + Δx, y), and the model parameters θ are updated so as to minimize the average loss over the training data.
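The two steps above can be illustrated with the following sketch of a single adversarial-training iteration; the sign-gradient (FGSM-style) construction of Δx within ‖Δx‖∞ ≤ ε is one possible choice and is an assumption, since the text does not commit to a particular way of searching the perturbation space Ω.

```python
# Sketch of one adversarial-training step: find Δx with ||Δx||_inf <= eps, then train on (x + Δx, y).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y, eps = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,)), 0.01

# step 1: construct a perturbation inside Ω (here by one FGSM-style sign-gradient step)
x_adv = x.clone().requires_grad_(True)
F.cross_entropy(model(x_adv), y).backward()
delta = eps * x_adv.grad.sign()                  # ||Δx||_inf = eps

# step 2: train on the perturbed input with the true label y and update θ
opt.zero_grad()
F.cross_entropy(model((x + delta).detach()), y).backward()
opt.step()
```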
Following the idea of adversarial training, a reverse adversarial training algorithm is proposed to train the classification network; whereas traditional adversarial training optimizes the model's generalization to perturbed images in order to improve robustness, here the objective is reversed. The method specifically comprises the following steps:
Assuming that x corresponds to label y, the label of x + Δx is set to y, i.e. M(x + Δx) = y, while for the unperturbed x, M(x) = y';
The noise addition is completed by the watermark extraction network: in this embodiment, a noise layer is added after the last layer of the extraction network, and it adds random Gaussian noise to the output of the extraction network. Specifically, in the classification task the data with added noise must correspond to the correct label and the data without added noise must correspond to a wrong label; that is, if the label corresponding to an original datum x is y, the datum obtained by adding the perturbation is denoted x + Δx and its label is y, and after training the classifier M assigns the label y' to the original datum x and the label y to x + Δx. The corresponding task loss function is then:
l_task = l_g + l_ng
l_g = −[y·log f_θ(x + Δx) + (1 − y)·log(1 − f_θ(x + Δx))]
l_ng = [y·log f_θ(x) + (1 − y)·log(1 − f_θ(x))]
wherein l_g is the loss function for the images with added Gaussian noise, and l_ng is the loss function for the images without added noise.
A classification model trained with this loss function outputs the correct classification result only for images output by the extraction network with the specific perturbation added, and outputs wrong results for clean, noise-free images.
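For illustration, the objective l_task = l_g + l_ng can be sketched as below; multi-class cross entropy is used in place of the binary form written above, the Gaussian noise layer is modelled by adding noise explicitly, and no weighting or clipping of the negated term is applied, so this shows only the sign structure of the reverse adversarial objective.

```python
# Sketch of the reverse adversarial objective l_task = l_g + l_ng (multi-class CE in place of the binary form).
import torch
import torch.nn.functional as F

def reverse_adversarial_loss(classifier, extracted, labels, sigma=0.01):
    noisy = extracted + sigma * torch.randn_like(extracted)   # Gaussian noise layer after the extraction network
    l_g = F.cross_entropy(classifier(noisy), labels)          # noised input must receive the correct label y
    l_ng = -F.cross_entropy(classifier(extracted), labels)    # clean (un-noised) input is pushed away from y
    return l_g + l_ng                                         # in practice the negated term would be bounded/weighted
```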
The extraction network and the protected model are deployed jointly, i.e. the watermark extraction network and the classification network are deployed together behind one API: the watermark extraction network receives the input and the classification network produces the final result. If the input is x, the joint network is M', the classification network is M and the extraction network is E, then M'(x) = M(E(x)) = y.
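A minimal sketch of such a jointly deployed model M'(x) = M(E(x)) follows; it simply chains the two networks behind one entry point and makes no assumption beyond the composition itself.

```python
# Sketch of the jointly deployed API model M'(x) = M(E(x)).
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, extractor: nn.Module, classifier: nn.Module):
        super().__init__()
        self.extractor = extractor      # watermark extraction network E: receives the raw input
        self.classifier = classifier    # protected classification network M: produces the final output

    def forward(self, x):
        return self.classifier(self.extractor(x))
```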
An attacker obtains the joint deployment model and the training set through an internal attack and trains on them; the training set of a substitute model is usually based on the existing training set of the target model, with some perturbations added, and the target model is adaptively polled to expand a limited seed data set. Suppose there is a synthetic data set {(x_i, y_i)} generated from the mixed data set obtained in S1, in which x_i = x_j corresponds to y_i ≠ y_j. As shown in FIG. 3, for sample x_i the gradient is
u_i = ∇_θ L(f_θ(x_i), y_i),
and for sample x_j the gradient is
u_j = ∇_θ L(f_θ(x_j), y_j).
Since x_i = x_j and y_i ≠ y_j, it follows that u_i ≠ u_j. Because the same picture thus has different gradients during training, gradient descent cannot proceed normally for the model and the model cannot reach a convergence state.
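To make the gradient conflict concrete, the following toy demonstration uses a two-class linear model (not the networks of this embodiment): the same input presented with two different labels yields gradients u_i and u_j that point in opposite directions, so their combined update gives gradient descent nowhere to go.

```python
# Toy demonstration: identical inputs with different labels produce conflicting gradients u_i != u_j.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 2)
x = torch.rand(1, 4)                              # one "picture", presented twice with different labels

def grad_for(label):
    model.zero_grad()
    F.cross_entropy(model(x), torch.tensor([label])).backward()
    return model.weight.grad.clone()

u_i, u_j = grad_for(0), grad_for(1)
cos = F.cosine_similarity(u_i.flatten(), u_j.flatten(), dim=0)
print(cos.item())   # ≈ -1: for this two-class linear model the two gradients are exactly anti-parallel
```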
After the attacker fails in the attempt to steal the model by training a substitute model, the most likely next step is to deploy the target model directly for profit, which is also the lowest-cost form of theft. Based on the above steps, the obtained joint model must satisfy two requirements: 1. its functionality is preserved; 2. it is sensitive to the trigger set. Assuming the joint model M', the single classification model M, the trigger set X' and the original data set X, a joint model satisfying the above requirements behaves as follows:
1. When a picture in the original data set X is input into the joint model M', M'(x) = M(E(x)) = M(x); that is, when X is classified, M(X) → M(x_i) = {y_1, y_2, ..., y_n}.
2. When a picture in the original data set X is input into the single model M, M(X) → M(x) = {y_1, y_2, ..., y_n}, which is consistent with inputting the picture into M', indicating that joint deployment does not affect the original function of the single classification model.
3. When a picture in X' is input into the joint model M', M'(x') = M(E(x')) = M(l'); that is, the extracted watermark l' is classified, and since the category corresponding to the watermark l differs from that of x in S2, M(l') → M(l) = {y_l}, indicating that for the trigger set the joint model outputs the label corresponding to l rather than the label corresponding to x.
4. When a picture in X' is input into the single model M, M(x') = M(x_i') → M(x_i) = {w_1, w_2, ..., w_k}, which proves that only the protected (joint) model triggers the backdoor and that a normal model is not sensitive to the trigger set.
Based on the above description, for a suspected model, if it outputs {y_l} for the trigger set and {y_1, y_2, ..., y_n} for the original data set, i.e. the two data sets are classified differently, the suspected model can be judged to be derived from the protected model, and the copyright can thus be verified.
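A sketch of this black-box verification is given below; the use of a mismatch rate and the 0.9 threshold are assumptions added for illustration, since the text only requires that the two data sets be classified differently.

```python
# Sketch of black-box copyright verification: compare the suspect model's labels on the original set and the trigger set.
import torch

@torch.no_grad()
def verify_copyright(suspect_model, x_orig, x_trig, threshold=0.9):
    pred_orig = suspect_model(x_orig).argmax(dim=1)
    pred_trig = suspect_model(x_trig).argmax(dim=1)
    mismatch = (pred_orig != pred_trig).float().mean().item()
    # An unrelated model is insensitive to the invisible watermark (mismatch ≈ 0), while a protected
    # or derived model maps trigger-set images to the watermark label (mismatch ≈ 1).
    return mismatch >= threshold
```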
An attacker who obtains the joint deployment model may attempt to deploy only the classification model, rather than deploying the watermark extraction network and the classification network together, because the joint deployment carries the backdoor embedded by the copyright owner. However, the classification model trained with the reverse adversarial algorithm proposed in this embodiment outputs wrong results for normal, unperturbed inputs, so the model cannot be used normally unless the two networks are deployed jointly, and whenever they are deployed jointly the backdoor can be triggered. The above method therefore protects the model copyright of the copyright owner.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A method for defending against substitute model attacks and verifying DNN model copyright, characterized by specifically comprising the following steps:
s1, constructing a joint deployment model, wherein the model comprises an extraction network and a classification network;
s2, in the joint deployment model, the data set comprises an original data set and a trigger set, and the trigger set is generated from the original data set by using a spatial invisible watermark mechanism;
s3, in the process of training the joint deployment model, extracting watermarks from the data set by using the extraction network, wherein the result extracted from the original data set is consistent with the original data, and the result extracted from the trigger set is the watermark generated by the spatial invisible watermark mechanism in step S2;
s4, adding a perturbation to the data after watermark extraction, and performing adversarial training on the classification model;
and S5, if an attacker obtains the model and the data set on the server through an attack and trains a substitute model, respectively inputting the original data set and the trigger set into the suspect model; if the classification results of the two data sets differ, the model is a substitute model.
2. The method for defending against substitute model attacks and verifying DNN model copyright according to claim 1, wherein the loss function of the extraction network during training is:
L_R = λ_4·l_wm + λ_5·l_self
wherein l_wm is the watermark loss, whose purpose is to make the extraction network recover the watermark from the data in the trigger set, and λ_4 is the weight of the watermark loss; l_self is the self-loss, which requires the extraction network to recover an image consistent with the original image from the data in the original data set, and λ_5 is the weight of the self-loss.
3. The method for defending against substitute model attacks and verifying DNN model copyright according to claim 2, wherein the watermark loss l_wm is expressed as:
l_wm = (1/N_c)·Σ_i ‖R(x_i') − l‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i')) − VGG_k(l)‖_2
wherein x_i' is the i-th image in the trigger set X', N_c is the number of pixels, R(x_i') denotes the data extracted by the extraction network R from the trigger-set image x_i', l is the watermark, and N_f is the number of neurons in the extraction network;
VGG_k(R(x_i')) denotes the features, at the k-th layer of the VGG network, of the watermark extracted from the i-th image x_i' in the trigger set by the extraction network R; VGG_k(l) denotes the features of the watermark l at the k-th layer of the VGG network; and ‖·‖_2 denotes the L2 norm.
4. The method for defending against substitute model attacks and verifying DNN model copyright according to claim 2, wherein the self-loss l_self is expressed as:
l_self = (1/N_c)·Σ_i ‖R(x_i) − x_i‖_2 + (1/N_f)·Σ_i Σ_k ‖VGG_k(R(x_i)) − VGG_k(x_i)‖_2
wherein x_i denotes the i-th image in the original data set X, N_c is the number of pixels, R(x_i) denotes the result extracted by the extraction network from the i-th image x_i of the original data set X, x_i' is the i-th image in the trigger set X', and N_f is the number of neurons in the extraction network; VGG_k(R(x_i)) denotes the features, at the k-th layer of the VGG network, of the result extracted from the i-th image of the original data set X by the extraction network R; and VGG_k(x_i) denotes the features of the i-th image x_i of the original data set X at the k-th layer of the VGG network.
5. The method for defending against substitute model attacks and verifying DNN model copyright according to claim 1, wherein, in the process of performing adversarial training on the classification model, if the label corresponding to an original datum x is y, the datum obtained by adding a perturbation to the original datum is denoted x + Δx and its label is set to y; after training, the label obtained when the classifier M classifies the original datum x is y', and the label obtained when it classifies x + Δx is y, y' being the wrong label of the original datum x.
6. The method for defending against substitute model attacks and verifying DNN model copyright according to claim 5, wherein the loss function of the classification model training process is expressed as:
min_θ E_(x,y)∼D [ max_{Δx∈Ω} L(x + Δx, y; θ) ]
wherein x denotes a datum in the training set, y denotes the true label corresponding to the datum x, and y' denotes the wrong label corresponding to the datum x; L(x + Δx, y; θ) denotes the loss function of the classification model, i.e. its cross-entropy loss; θ denotes the model parameters of the classification model; and Δx is a perturbation belonging to the perturbation set Ω;
L(x, y; θ) = −[y·log f_θ(x) + (1 − y)·log(1 − f_θ(x))]
denotes the cross-entropy loss, where f_θ(x) denotes the label predicted by the model with parameters θ when the input is x.
CN202211661085.9A 2022-12-23 2022-12-23 Method for defending substitute model attack and capable of verifying DNN model copyright Pending CN115828188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211661085.9A CN115828188A (en) 2022-12-23 2022-12-23 Method for defending substitute model attack and capable of verifying DNN model copyright

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211661085.9A CN115828188A (en) 2022-12-23 2022-12-23 Method for defending substitute model attack and capable of verifying DNN model copyright

Publications (1)

Publication Number Publication Date
CN115828188A true CN115828188A (en) 2023-03-21

Family

ID=85517898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211661085.9A Pending CN115828188A (en) 2022-12-23 2022-12-23 Method for defending substitute model attack and capable of verifying DNN model copyright

Country Status (1)

Country Link
CN (1) CN115828188A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118134740A (en) * 2024-05-07 2024-06-04 Zhejiang University Model watermarking method and device based on non-decision domain method

Similar Documents

Publication Publication Date Title
Li et al. How to prove your model belongs to you: A blind-watermark based framework to protect intellectual property of DNN
Namba et al. Robust watermarking of neural network with exponential weighting
Zhang et al. Protecting intellectual property of deep neural networks with watermarking
Li et al. A survey of deep neural network watermarking techniques
Akhtar et al. Advances in adversarial attacks and defenses in computer vision: A survey
Li et al. Piracy resistant watermarks for deep neural networks
WO2021042665A1 (en) Dnn-based method for protecting passport against fuzzy attack
Guo et al. An overview of backdoor attacks against deep neural networks and possible defences
Botta et al. NeuNAC: A novel fragile watermarking algorithm for integrity protection of neural networks
Hitaj et al. Evasion attacks against watermarking techniques found in MLaaS systems
CN109919303B (en) Intellectual property protection method, system and terminal for deep neural network
Quiring et al. Adversarial machine learning against digital watermarking
Zhu et al. Fragile neural network watermarking with trigger image set
CN115828188A (en) Method for defending substitute model attack and capable of verifying DNN model copyright
CN114862650B (en) Neural network watermark embedding method and verification method
Mosafi et al. Stealing knowledge from protected deep neural networks using composite unlabeled data
Sun et al. Deep intellectual property protection: A survey
Wu et al. Watermarking pre-trained encoders in contrastive learning
Liu et al. Data protection in palmprint recognition via dynamic random invisible watermark embedding
CN113034332B (en) Invisible watermark image and back door attack model construction and classification method and system
CN113362217A (en) Deep learning model poisoning defense method based on model watermark
CN109544438A (en) A kind of digital watermark method based on neural network and dct transform
CN113435264A (en) Face recognition attack resisting method and device based on black box substitution model searching
CN115546003A (en) Back door watermark image data set generation method based on confrontation training network
Li et al. A novel robustness image watermarking scheme based on fuzzy support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination