CN111539916A - Adversarially robust image saliency detection method and system - Google Patents

Adversarially robust image saliency detection method and system

Info

Publication number
CN111539916A
Authority
CN
China
Prior art keywords
image
saliency
energy function
input
robust
Prior art date
Legal status
Granted
Application number
CN202010270423.0A
Other languages
Chinese (zh)
Other versions
CN111539916B (en)
Inventor
Zeng Yirui
Ma Zhengming
Li Guanbin
Lin Liang
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010270423.0A
Publication of CN111539916A
Application granted granted Critical
Publication of CN111539916B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an adversarially robust image saliency detection method and system, wherein the method comprises the following steps: step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system; step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed; and step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.

Description

Adversarially robust image saliency detection method and system
Technical Field
The invention relates to the technical field of computer vision based on deep learning, and in particular to an adversarially robust image saliency detection method and system based on an energy model.
Background
The purpose of saliency detection is to locate and segment the objects in an image or video frame that are most visually striking to the human eye. Designing saliency detection models to mimic humans not only helps understand the intrinsic mechanisms of human vision and psychology, but also benefits many applications in computer vision and computer graphics. For example, saliency detection is applied to context-aware image editing, image thumbnailing, object segmentation, and person re-identification. Saliency detection is a fundamental task in computer vision that has been widely studied for a long time, and there is a large body of related work.
In recent years, the application of deep neural networks has greatly improved saliency detection performance and has gradually become the mainstream approach. Saliency detection methods driven by deep convolutional neural networks can be divided into two groups: sparse-label and dense-label methods. Sparse-label methods appeared in earlier years; because they take a region as the computational unit and involve two separate steps of feature extraction and saliency-value inference, they are generally inefficient and require a large amount of space for feature storage. Inspired by the successful application of fully convolutional networks to pixel-level semantic segmentation, recent dense-label methods have established the new state of the art in saliency detection, such as Detect globally, refine locally: a novel approach to saliency detection (CVPR) by Wang, T. et al.
Existing best-performing detection methods basically adopt a fully convolutional neural network as the model architecture. Hou, Q. et al., 2018, Deeply supervised salient object detection with short connections (PAMI), adapted the holistically nested edge detector structure by introducing short connections with a skip-layer structure.
Although fully convolutional neural networks have achieved great success on salient object detection in recent years, these methods have weaknesses that can degrade their performance. First, the end-to-end trainable property allows gradients to propagate easily from the supervision target to the input image, which exposes salient object detection models to the risk of adversarial attacks. Second, dense-label models do not explicitly model the contrast between different image parts, but implicitly estimate saliency within a single FCN; once the input image is contaminated with adversarial noise, both low-level and high-level features are affected. Third, current saliency detection training sets are very small compared with image classification tasks that have millions of samples, and the salient object classes involved are also very limited. Thus, to some extent, existing models fit biases within the data, e.g. detecting targets that frequently appear in the training set rather than locating the most prominent objects; these methods may rely on capturing too much high-level semantics and may be sensitive to low-level perturbations such as adversarial noise.
Efficiency and robustness are important because salient object detection is typically employed as initialization or preprocessing in an early stage of a system. If the performance of the preprocessing stage is severely degraded by well-designed input noise, the subsequent stages may produce erroneous results, which can be catastrophic for the entire system. There is therefore a real need to attend to the robustness of saliency detection and to provide an accurate, fast and stable saliency detection model for the salient object detection task.
Disclosure of Invention
To overcome the above-mentioned deficiencies of the prior art, the present invention provides an adversarially robust image saliency detection method and system that improve the robustness of existing dense-label methods while maintaining their efficiency.
To achieve the above object, the present invention provides an adversarially robust image saliency detection method, comprising the following steps:
step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed;
step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.
Preferably, in step S1, an iterative gradient-based white-box attack is used.
Preferably, in step S1, the maximum number of iterations T limits the total runtime cost; once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
Preferably, step S2 further comprises:
step S201, approximating the energy function with a neural network, and generating samples from the probability distribution defined by the energy function;
step S202, further introducing a noise model when reconstructing the image;
step S203, training the neural network parameters approximating the energy function in the direction of maximum log-likelihood; after the energy model is trained, the reconstructed image sampled via the gradient of the energy function gradually approaches the original input image.
Preferably, in step S201, an iterative refinement process based on Langevin dynamics is employed, sampling with the gradient of the energy function to reconstruct the input image.
Preferably, in step S202, Langevin dynamics is used to add a perturbation to the gradient descent:

I_{R+1} = I_R - (α/2) · ∂E(I_R; θ)/∂I + √α · z_R

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, ∂E(I_R; θ)/∂I is the gradient of the energy function with respect to the image I, the coefficient α/2 corresponds to the learning rate α, and z_R is the inertia factor of I_R.
Preferably, in step S3, any fully convolutional visual saliency model may be selected as the backbone network.
Preferably, the method further comprises:
step S4, smoothing the input image, i.e. the adversarial sample, with a filtering method, for contrast modeling in context-aware restoration;
step S5, improving the saliency scores provided by the backbone network using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and adjusting the saliency map by minimizing an energy function.
Preferably, in step S5, similarity between pixels is measured in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the energy function.
To achieve the above object, the present invention further provides an adversarially robust image saliency detection system, comprising:
an adversarial sample generation unit, configured to generate, for adversarial attacks on saliency detection, adversarial attack samples for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
an input image reconstruction unit, configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed;
and a saliency detection unit, configured to take the reconstructed image obtained by the input image reconstruction unit as the input of a backbone network and generate a densely labeled saliency map, so as to reduce the backbone network's high sensitivity to adversarial samples.
Compared with the prior art, the adversarially robust image saliency detection method and system of the present invention exploit the generality and simplicity of energy-based models in likelihood modeling to reconstruct the input image contaminated with adversarial noise, thereby effectively neutralizing the attack; after the backbone network, the saliency map is refined using similarity between contexts, which markedly improves the detection results.
Drawings
FIG. 1 is a flow chart illustrating the steps of the adversarially robust image saliency detection method of the present invention;
FIG. 2 is a process diagram of adversarially robust image saliency detection according to an embodiment of the present invention;
FIG. 3 is a system architecture diagram of the adversarially robust image saliency detection system of the present invention.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
Current adversarial attacks mainly adopt the following methods: one-step gradient-based methods (the fast gradient sign method, FGSM) and iterative methods. Under an infinity-norm constraint with threshold ε, FGSM computes a one-step gradient that maximizes the loss between the output and the ground truth; the FGSM formula for generating an adversarial sample is:

x* = x + ε · sign(∇_x J(f(x; θ), y))

wherein x*, x and y respectively denote the adversarial sample, the original image and the ground truth, J denotes the loss function, and f(·; θ) denotes the neural network model with parameter θ.
The iterative method applies FGSM multiple times with step size α:

x*_{t+1} = clip(x*_t + α · sign(∇_{x*_t} J(f(x*_t; θ), y)), ε)

wherein x*_t denotes the adversarial sample generated at the t-th time step, and clip(x, ε) keeps each element x_i of x within the range [x_i - ε, x_i + ε]. Existing attacks mostly focus on the image classification task, with some research on semantic segmentation, human pose estimation and the like, but adversarial attacks on salient object detection remain unexplored; the present invention launches white-box and black-box adversarial attacks on salient object detection models, as sketched below.
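For illustration only (this sketch is not part of the patent text), the two attack families above can be written in PyTorch roughly as follows; the model f, the loss J, and the tensors x and y are assumed stand-ins for a pre-trained network, its training loss, and a preprocessed input with its ground truth.

```python
import torch

def fgsm(f, J, x, y, eps):
    """One-step FGSM: x* = x + eps * sign(grad_x J(f(x), y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    J(f(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def iterative_fgsm(f, J, x, y, eps, alpha, T):
    """Iterative FGSM; clip(., eps) keeps each element within [x_i - eps, x_i + eps]."""
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        J(f(x_adv), y).backward()
        step = alpha * x_adv.grad.sign()
        # element-wise clip to the L-infinity ball of radius eps around x
        x_adv = torch.min(torch.max(x_adv + step, x - eps), x + eps).detach()
    return x_adv
```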
Fig. 1 is a flowchart of the steps of the adversarially robust image saliency detection method of the present invention, and fig. 2 is a process diagram of adversarially robust image saliency detection according to an embodiment of the present invention. As shown in figs. 1 and 2, the present invention provides an adversarially robust image saliency detection method comprising the following steps:
and step S1, generating a counterattack sample for the significance detection as an input image of the system on the original image by using an iterative gradient-based method aiming at the counterattack for the significance detection. The iterative gradient-based method is a combination of gradient-based and iteration as described above, i.e. the iteration is iterated on a gradient-based basis.
Specifically, in a white-box attack, a visual saliency model is selected as the neural network to be attacked. Let f(·; θ) be a pre-trained model with parameter θ, and let x, x* and y denote the original image, its corresponding adversarial sample and the ground truth, respectively. The data is preprocessed before the adversarial sample is synthesized. After generation, the adversarial sample x* is clipped to [0, 255] and rounded to an RGB image. Each element y_i of y belongs to {0, 1}, with 0 denoting non-salient and 1 denoting salient. To ensure the adversarial perturbation is imperceptible, an L∞ constraint ε is imposed so that ||x - x*||∞ ≤ ε. The maximum number of iterations T limits the total runtime cost. Once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
In each iteration t, the adversarial sample x*_t from initialization or the previous time step is updated as follows:

x*_{t+1} = clip(x*_t + p_t, ε)

wherein p_t is the perturbation computed at the t-th iteration. The goal of the iteration is to make the prediction of every pixel in x wrong, i.e.

arg max_c f(x*; θ)_{i,c} ≠ y_i for every pixel i

wherein i indexes the pixels of x and c ranges over the two classes: salient and non-salient. To determine p_t, gradient descent is applied to the following objective:

p′_t = ∇_{x*_t} Σ_{i∈S_t} J(f(x*_t; θ)_i, y_i)

wherein S_t denotes the set of pixels that f still classifies correctly. The perturbation is then obtained by normalization, p_t = α · p′_t / ||p′_t||∞. In a specific embodiment of the present invention only the method of generating white-box adversarial samples is described, but it should be noted that it transfers directly to black-box attacks, because existing visual saliency models share a similar fully convolutional architecture and are typically initialized from the same pre-trained image classification model.
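A hedged sketch of this per-pixel attack follows; the use of cross-entropy as the loss J and the two-channel logit layout are assumptions of this illustration, not details fixed by the patent.

```python
import torch
import torch.nn.functional as nnF

def saliency_attack(f, x, y, eps, alpha, T):
    """Iterative attack: at step t only the still-correct pixel set S_t contributes."""
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        logits = f(x_adv)                    # (N, 2, H, W): non-salient / salient
        s_t = logits.argmax(dim=1) == y      # S_t: pixels f still classifies correctly
        if not s_t.any():                    # every pixel already mispredicted
            break
        loss = nnF.cross_entropy(logits, y, reduction="none")[s_t].sum()
        loss.backward()
        p_t = alpha * x_adv.grad / x_adv.grad.abs().max()   # L-infinity normalization
        x_adv = torch.min(torch.max(x_adv + p_t, x - eps), x + eps).detach()
    return x_adv
```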
In step S2, the input image is reconstructed using an energy-based generative model, and likelihood modeling is performed with a neural network that approximates the energy function, generating an image from which the adversarial noise is removed.
In step S2, the adversarial sample obtained in step S1 is taken as the input. Because adversarial noise is precisely computed by back-propagation, it is inherently fragile: it forms subtle curve-like patterns that play an important role in the attack, and eliminating these patterns reduces the attack's effect.
Specifically, step S2 further includes:
Step S201, approximating the energy function with a neural network, sampling with the gradient of the energy function, and generating samples from the probability distribution defined by the energy function. That is, the adversarial samples obtained in step S1 are input to the energy function and sampled via its gradient, and samples are generated from the probability distribution the energy function defines.
Specifically, given a data point x, let E_θ(x) be an energy function represented by a neural network with parameter θ. The energy function defines a probability distribution through the Boltzmann distribution

p(I; θ) = exp(-E(I; θ)) / Z(θ)

wherein I denotes an image and Z(θ) = ∫ exp(-E(I; θ)) dI denotes the partition function. In the present invention the neural network is denoted by F (which may be implemented as a combination of convolutional layers and nonlinear activation functions, with randomly initialized parameters), i.e. F(I, θ) = -E(I; θ). To generate samples from this distribution, the probability density of the generated image is maximized. The reconstructed image is denoted I_R; to generate samples from the energy model, an iterative refinement process is employed. When reconstructing an image, the first iteration starts from a synthetic image initialized to zero, i.e. I_R = 0, and the refinement proceeds as:

I_{R+1} = I_R + α · ∂F(I_R, θ)/∂I

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, α is the learning rate, and ∂F(I_R, θ)/∂I is the gradient of the neural network F, which can be computed by back-propagation; since F(I, θ) = -E(I; θ), it equals the negative gradient of the energy function with respect to the image I.

Briefly, the reconstructed image is obtained by sampling from this probability distribution, embodied in the gradient term of the update formula (the part that differentiates the neural network F). Sampling is an iterative process: repeated iterative sampling yields a reconstructed image of higher quality, which is the iterative refinement process. The neural network F approximates the distribution, and the iterative refinement uses its gradient to update the image.
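A minimal sketch of such an energy network and the zero-initialized refinement loop follows, assuming a small convolutional architecture of our own choosing (the patent only requires convolutional layers with nonlinear activations and random initialization):

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Stand-in for F(I, theta) = -E(I; theta): conv layers + nonlinearities -> scalar."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, img):
        return self.features(img).sum(dim=(1, 2, 3))   # one scalar per image

def refine(F_net, shape, alpha, steps):
    """Iterative refinement from I_R = 0: I <- I + alpha * dF/dI (gradient ascent on F)."""
    img = torch.zeros(shape, requires_grad=True)
    for _ in range(steps):
        grad, = torch.autograd.grad(F_net(img).sum(), img)
        img = (img + alpha * grad).detach().requires_grad_(True)
    return img.detach()
```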
Step S202, a noise model is further introduced during image reconstruction, the difficulty of recovering fine details is increased, the chance of fitting antagonistic noise is reduced, and the antagonistic noise is effectively eliminated, wherein other types of noise can be selected for replacing the added Gaussian noise
In order to remove the challenge noise in the challenge sample. A noise model is further introduced in the process of reconstructing the image:
Figure BDA0002442955820000081
where Z represents some noise distribution, e.g., gaussian noise, and e is the noise strength.
Since the countering noise is accurately calculated by back-propagation, it is inherently fragile, and forms a number of subtle curvilinear patterns that may play an important role. Eliminating these patterns may reduce the effect of the attack. Adding noise during image synthesis increases the difficulty of restoring fine details, thereby reducing the chance of fitting antagonistic noise to achieve the goal of removing the antagonistic noise.
Preferably, to improve mixing during sampling, Langevin dynamics is used to add a perturbation to the gradient update:

I_{R+1} = I_R + (α/2) · ∂F(I_R, θ)/∂I + √α · z_R

wherein the coefficient α/2 corresponds to the learning rate α, z_R is the noise term of I_R, and z_R ~ N(0, 1), i.e. Gaussian noise from the standard normal distribution.
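Under the refinement above, one Langevin update might look like the following sketch; the exact placement of the α/2 and √α coefficients follows the standard Langevin form and is our reconstruction, since the patent's formula image is not reproduced here.

```python
import torch

def langevin_step(F_net, img, alpha):
    """One Langevin update: I <- I + (alpha/2) * dF/dI + sqrt(alpha) * z, z ~ N(0, 1)."""
    img = img.detach().requires_grad_(True)
    grad, = torch.autograd.grad(F_net(img).sum(), img)
    return (img + 0.5 * alpha * grad + alpha ** 0.5 * torch.randn_like(img)).detach()
```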
Step S203, training the neural network parameters of the approximate energy function along the direction of the maximum log likelihood, and after the energy model is trained, gradually enabling the reconstructed image sampled from the gradient of the energy function to be close to the original input image.
Specifically, the neural network F is updated so that the reconstructed image gradually approaches the input image I_I. Let the likelihood function be L(θ) = log p(I_I; θ); θ is trained in the direction that maximizes the log-likelihood L(θ):

θ_{t+1} = θ_t + β · ∂L(θ_t)/∂θ

wherein θ_t denotes the network parameters at the current time step, θ_{t+1} the updated parameters at the next time step, ∂L(θ)/∂θ the gradient of the log-likelihood with respect to θ, and β the step size. The gradient expands as

∂L(θ)/∂θ = E_{p(I;θ)}[∂E(I; θ)/∂θ] - ∂E(I_I; θ)/∂θ

wherein E_{p(I;θ)}[·] is the expectation over I under the distribution p(I; θ). The expectation is not computed explicitly but approximated by sampling: since the reconstructed image is obtained by sampling I from p(I; θ), in a specific embodiment of the invention ∂E(I_R; θ)/∂θ is chosen to approximate the expectation E_{p(I;θ)}[·]. The parameters of the neural network F are therefore updated during training as:

θ_{t+1} = θ_t + β · (∂E(I_R; θ_t)/∂θ - ∂E(I_I; θ_t)/∂θ)
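As a hedged sketch, one such maximum-likelihood step can be written as a two-term loss: since F = -E, raising F on the input image and lowering it on the sampled reconstruction follows the gradient above. The optimizer choice is an assumption of this illustration.

```python
import torch

def ebm_step(F_net, optimizer, img_input, img_recon):
    """One ML update: grad L ≈ dE(I_R)/dθ - dE(I_I)/dθ, i.e. minimize F(I_R) - F(I_I)."""
    optimizer.zero_grad()
    loss = F_net(img_recon.detach()).mean() - F_net(img_input).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```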
the training process of steps S201-S203 is performed iteratively in alternation, thereby generating a high quality reconstructed image.
Preferably, in step S2, a synthesized image having the same size as the input image is initialized with zeros. The composite image is an image reconstructed with the counternoise removed.
Step S3, the reconstructed image obtained in step S2 is used as the input of the backbone network to generate a densely labeled saliency map, thereby reducing the network's high sensitivity to adversarial samples.
The backbone network can be chosen as any fully convolutional visual saliency model that takes the entire image as input and produces a densely labeled saliency map. A fully convolutional backbone offers both high efficiency and high accuracy. The fully convolutional framework backbone is initialized from a pre-trained visual saliency model. The reconstructed image generated in step S2 is used as the backbone's input, reducing the network's high sensitivity to adversarial samples.
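Wiring the stages together, a hedged inference sketch follows; EnergyNet and refine are the illustrative helpers above, adv_sample is the adversarial input tensor, and backbone stands in for any pre-trained fully convolutional saliency model (all assumptions of this illustration, not components fixed by the patent).

```python
import torch

F_net = EnergyNet()                                    # assumed trained as in step S203
recon = refine(F_net, adv_sample.shape, alpha=0.01, steps=60)
with torch.no_grad():
    saliency_map = torch.sigmoid(backbone(recon))      # densely labeled saliency scores
```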
Preferably, the adversarially robust image saliency detection method provided by the present invention further comprises the following steps:
Step S4, the input image, i.e. the adversarial sample, is smoothed with a filtering method, for contrast modeling in context-aware restoration, thereby refining the final result. In the embodiment of the present invention bilateral filtering is selected to smooth the adversarial sample, but the invention is not limited thereto; any filtering method may be substituted.
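A one-line OpenCV sketch of the bilateral-filtering choice; the kernel diameter and sigma values are illustrative assumptions, as is the file name.

```python
import cv2

adv_bgr = cv2.imread("adversarial_sample.png")                     # hypothetical input
smoothed = cv2.bilateralFilter(adv_bgr, d=9, sigmaColor=75, sigmaSpace=75)
```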
Step S5, the saliency scores provided by the backbone network are improved using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and the saliency map is adjusted by minimizing an energy function. Low-level features are pixel-level features such as shape and texture: while adversarial noise may destroy a neural network's high-level semantic understanding, it does not affect pixel-level features. Image context information refers to information reflecting the correlations between objects in the image.
To mitigate the impact on the results of the noise newly introduced in step S2, this step exploits low-level feature similarity between pixels and image context information to improve the saliency scores provided by the backbone network. Since adversarial perturbations are crafted against parameterized convolution filters, the restoration component adopts a graphical model rather than a CNN architecture. Because the high-level convolutional features have already been contaminated, the present invention measures similarity between pixels in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the following energy function:
E(y*) = Σ_i ψ_u(y*_i) + Σ_{i<j} ψ_p(y*_i, y*_j)

wherein y is the coarse saliency map and y* is the refined saliency map. The first term embodies global context information: the unary energy function ψ_u(y*_i) measures the cost of assigning label y*_i to pixel i, the cost being an inverse likelihood. The second term captures pixel-level features and local context: the pairwise energy function ψ_p(y*_i, y*_j) measures the cost of simultaneously assigning y*_i and y*_j to pixels i and j, based in particular on spatial location and color space, which encourages similar nearby pixels to be labeled with the same saliency value. Writing p_i for the spatial location of pixel i and x′_i for its color, the pairwise energy function is defined as:

ψ_p(y*_i, y*_j) = μ(y*_i, y*_j) [ ω_1 exp(-||p_i - p_j||² / (2θ_α²) - ||x′_i - x′_j||² / (2θ_β²)) + ω_2 exp(-||p_i - p_j||² / (2θ_γ²)) ]

wherein x′ (i.e. x′_i and x′_j) is the result of bilateral filtering the input adversarial sample x*. ω_2 and θ_γ are set to 1, and ω_1, θ_α and θ_β are selected by validation. μ is a learnable label-compatibility function that penalizes assigning different labels to i and j, encouraging similar nearby pixels to share the same label.
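A small NumPy sketch of the pairwise affinity used in this energy, with ω_2 and θ_γ fixed to 1 as stated; evaluating it per pixel pair is for clarity only (dense-CRF inference would compute it over all pairs efficiently).

```python
import numpy as np

def pairwise_kernel(p_i, p_j, c_i, c_j, w1, theta_alpha, theta_beta,
                    w2=1.0, theta_gamma=1.0):
    """Appearance kernel (position + bilateral-filtered color) plus smoothness kernel."""
    d_pos = np.sum((p_i - p_j) ** 2)
    d_col = np.sum((c_i - c_j) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_col / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness
```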
In prior work such as Conditional random fields as recurrent neural networks (ICCV), the above formula is interpreted as a densely connected conditional random field implemented with a recurrent neural network; the neural network is implemented with 1×1 convolutional layers. Since the restoration component refines the results using global context, it is harder to change its predictions with adversarial noise of limited perturbation strength: affecting the result at certain pixel locations may require changing remote feature vectors, which entails larger perturbations. The parameters of step S5 are initialized following Efficient inference in fully connected CRFs with Gaussian edge potentials (NeurIPS), after which the parameters of the backbone network and the restoration component are fine-tuned jointly.
Fig. 3 is a system architecture diagram of the adversarially robust image saliency detection system of the present invention. As shown in fig. 3, the present invention provides an adversarially robust image saliency detection system, comprising:
an adversarial sample generation unit 301, configured to generate, for adversarial attacks on saliency detection, an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system. The iterative gradient-based approach combines the gradient-based and iterative methods described above, i.e., the gradient-based step is applied iteratively, to launch adversarial attacks on saliency detection. Inspired by Adversarial examples for semantic segmentation and object detection (ICCV), the present invention synthesizes adversarial samples by an iterative gradient-based method.
Specifically, in a white-box attack, a visual saliency model is selected as the neural network to be attacked. Let f(·; θ) be a pre-trained model with parameter θ, and let x, x* and y denote the original image, its corresponding adversarial sample and the ground truth, respectively. The data is preprocessed before the adversarial sample is synthesized; in the embodiment of the present invention, the preprocessing subtracts an image mean from the original image x, where the image mean is generally the pixel mean of the training set, the subtraction serving to normalize the image. After generation, the adversarial sample x* is clipped to [0, 255] and rounded to an RGB image. Each element y_i of y belongs to {0, 1}, with 0 denoting non-salient and 1 denoting salient. To ensure the adversarial perturbation is imperceptible, an L∞ constraint ε is imposed so that ||x - x*||∞ ≤ ε. The maximum number of iterations T limits the total runtime cost. Once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
In each iteration t, the adversarial sample x*_t from initialization or the previous time step is updated as follows:

x*_{t+1} = clip(x*_t + p_t, ε)

wherein p_t is the perturbation computed at the t-th iteration. The goal of the iteration is to make the prediction of every pixel in x wrong, i.e.

arg max_c f(x*; θ)_{i,c} ≠ y_i for every pixel i

wherein i indexes the pixels of x and c ranges over the two classes: salient and non-salient. To determine p_t, gradient descent is applied to the following objective:

p′_t = ∇_{x*_t} Σ_{i∈S_t} J(f(x*_t; θ)_i, y_i)

wherein S_t denotes the set of pixels that f still classifies correctly. The perturbation is then obtained by normalization, p_t = α · p′_t / ||p′_t||∞, where α is a fixed step size. The present invention introduces only the method of generating white-box adversarial samples; it transfers directly to black-box attacks, because existing visual saliency models share a similar fully convolutional architecture and are typically initialized from the same pre-trained image classification model.
An input image reconstruction unit 302 is configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed.
The input image reconstruction unit 302 takes as input the adversarial sample obtained by the adversarial sample generation unit 301. Because adversarial noise is precisely computed by back-propagation, it is inherently fragile: it forms subtle curve-like patterns that play an important role in the attack, and eliminating these patterns reduces the attack's effect.
Specifically, the input image reconstruction unit 302 further comprises:
a sampling module for approximating the energy function with a neural network, sampling with the gradient of the energy function, and generating samples from the probability distribution defined by the energy function. That is, the adversarial samples obtained by the adversarial sample generation unit 301 are input to the energy function and sampled via its gradient, and samples are generated from the probability distribution the energy function defines.
Specifically, given a data point x, let E_θ(x) be an energy function represented by a neural network with parameter θ. The energy function defines a probability distribution through the Boltzmann distribution

p(I; θ) = exp(-E(I; θ)) / Z(θ)

wherein I denotes an image and Z(θ) = ∫ exp(-E(I; θ)) dI denotes the partition function. The neural network is denoted by F in the invention, i.e. F(I, θ) = -E(I; θ). To generate samples from this distribution, the probability density of the generated image is maximized. The reconstructed image is denoted I_R; to generate samples from the energy model, an iterative refinement process is used. When reconstructing the image, the first iteration starts from a synthetic image initialized to zero, i.e. I_R = 0, and the refinement proceeds as:

I_{R+1} = I_R + α · ∂F(I_R, θ)/∂I

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, α is the learning rate, and ∂F(I_R, θ)/∂I is the gradient of the neural network F, computable by back-propagation; since F(I, θ) = -E(I; θ), it equals the negative gradient of the energy function with respect to the image I.
A noise introducing module is used to further introduce a noise model when reconstructing the image, increasing the difficulty of recovering fine details and thereby reducing the chance of fitting adversarial noise so as to effectively eliminate it (other types of noise may be substituted for the added Gaussian noise). To remove the adversarial noise in the adversarial sample, a noise model is introduced into the reconstruction process:

I ← I + ε · Z

wherein Z denotes some noise distribution, e.g. Gaussian noise, and ε is the noise strength.
Because adversarial noise is precisely computed by back-propagation, it is inherently fragile and forms subtle curve-like patterns that play an important role in the attack; eliminating these patterns reduces the attack's effect. Adding noise during image synthesis increases the difficulty of restoring fine details, thereby reducing the chance of fitting the adversarial noise and achieving its removal.
Preferably, to improve mixing during sampling, Langevin dynamics is used to add a perturbation to the gradient update:

I_{R+1} = I_R + (α/2) · ∂F(I_R, θ)/∂I + √α · z_R

wherein the coefficient α/2 corresponds to the learning rate α, z_R is the noise term of I_R, and z_R ~ N(0, 1), i.e. Gaussian noise from the standard normal distribution.
A training module is used to train the neural network parameters approximating the energy function in the direction of maximum log-likelihood; after the energy model is trained, the reconstructed image sampled via the gradient of the energy function gradually approaches the original input image.
Specifically, the neural network F is updated so that the reconstructed image gradually approaches the input image I_I. Let the likelihood function be L(θ) = log p(I_I; θ); θ is trained in the direction that maximizes the log-likelihood L(θ):

θ_{t+1} = θ_t + β · ∂L(θ_t)/∂θ

wherein θ_t denotes the network parameters at the current time step, θ_{t+1} the updated parameters at the next time step, ∂L(θ)/∂θ the gradient of the log-likelihood with respect to θ, and β the step size. The gradient expands as

∂L(θ)/∂θ = E_{p(I;θ)}[∂E(I; θ)/∂θ] - ∂E(I_I; θ)/∂θ

wherein E_{p(I;θ)}[·] is the expectation over I under the distribution p(I; θ). The expectation is not computed explicitly but approximated by sampling: since the reconstructed image is obtained by sampling I from p(I; θ), in a specific embodiment of the invention ∂E(I_R; θ)/∂θ is chosen to approximate the expectation E_{p(I;θ)}[·]. The parameters of the neural network F are therefore updated during training as:

θ_{t+1} = θ_t + β · (∂E(I_R; θ_t)/∂θ - ∂E(I_I; θ_t)/∂θ)
and the saliency detection unit 303 is used for taking the reconstructed image obtained by the input image reconstruction unit 302 as the input of the backbone network and generating a densely marked saliency map, so that the high sensitivity of the network on the confrontation sample is reduced.
In a specific embodiment of the present invention, the backbone network can be selected as any visual saliency model based on a fully convolutional network that takes the entire image as input and produces a densely labeled saliency map. The full convolution backbone network has high efficiency and high accuracy. The framework backbone, which is initially based on a full convolution network, is initialized to some pre-trained visual saliency model. The reconstructed image generated by the input image reconstruction unit 302 is used as the input of the backbone network, so that the high sensitivity of the network on the confrontation sample is reduced.
Preferably, the adversarially robust image saliency detection system of the present invention further comprises:
a smoothing unit 304, configured to smooth the input image, i.e. the adversarial sample, with a filtering method, the smoothed image being used for contrast modeling in the context-aware restoration of the saliency map refinement unit, thereby refining the final result. In the embodiment of the present invention bilateral filtering is selected to smooth the adversarial sample, but the invention is not limited thereto; any filtering method may be substituted.
A saliency map refinement unit 305, configured to refine the saliency scores provided by the backbone network by minimizing the energy function, using low-level feature similarity between pixels of the smoothed adversarial sample and image context information. Low-level features are pixel-level features such as shape and texture: while adversarial noise may destroy a neural network's high-level semantic understanding, it does not affect pixel-level features. Image context information refers to information reflecting the correlations between objects in the image.
In summary, the adversarially robust image saliency detection method and system of the present invention reconstruct an input image free of adversarial noise using an energy-based generative model and, through context-aware restoration, effectively address the sensitivity of existing saliency detection models to adversarial noise, enhancing model robustness so that adversarial samples fed to the saliency detection model still yield good detection results. Combined with the various applications of saliency detection, the invention can defend against carefully designed adversarial noise inputs and ensure the reliability of detection results.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (10)

1. An adversarially robust image saliency detection method, characterized by comprising the following steps:
step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed;
step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.
2. The adversarially robust image saliency detection method according to claim 1, characterized in that: in step S1, an iterative gradient-based white-box attack is used.
3. The adversarially robust image saliency detection method according to claim 2, characterized in that: in step S1, the maximum number of iterations T limits the total runtime cost; once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
4. The adversarially robust image saliency detection method according to claim 1, characterized in that step S2 further comprises:
step S201, approximating the energy function with a neural network, and generating samples from the probability distribution defined by the energy function;
step S202, further introducing a noise model when reconstructing the image;
step S203, training the neural network parameters approximating the energy function in the direction of maximum log-likelihood, the reconstructed image sampled via the gradient of the energy function gradually approaching the original input image once the energy model is trained.
5. The adversarially robust image saliency detection method according to claim 4, characterized in that: in step S201, an iterative refinement process based on Langevin dynamics is employed, sampling with the gradient of the energy function to reconstruct the input image.
6. The adversarially robust image saliency detection method according to claim 5, characterized in that in step S202 Langevin dynamics is used to add a perturbation to the gradient descent:

I_{R+1} = I_R - (α/2) · ∂E(I_R; θ)/∂I + √α · z_R

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, ∂E(I_R; θ)/∂I is the gradient of the energy function with respect to the image I, the coefficient α/2 corresponds to the learning rate α, and z_R is the inertia factor of I_R.
7. The adversarially robust image saliency detection method according to claim 1, characterized in that: in step S3, any fully convolutional visual saliency model may be selected as the backbone network.
8. The adversarially robust image saliency detection method according to claim 1, characterized in that the method further comprises:
step S4, smoothing the input image, i.e. the adversarial sample, with a filtering method, for contrast modeling in image-context-aware restoration;
step S5, improving the saliency scores provided by the backbone network using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and adjusting the saliency map by minimizing an energy function.
9. The adversarially robust image saliency detection method according to claim 8, characterized in that: in step S5, similarity between pixels is measured in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the energy function.
10. An adversarially robust image saliency detection system, characterized by comprising:
an adversarial sample generation unit, configured to generate, for adversarial attacks on saliency detection, adversarial attack samples for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
an input image reconstruction unit, configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed;
and a saliency detection unit, configured to take the reconstructed image obtained by the input image reconstruction unit as the input of a backbone network and generate a densely labeled saliency map, so as to reduce the backbone network's high sensitivity to adversarial samples.
CN202010270423.0A 2020-04-08 2020-04-08 Adversarially robust image saliency detection method and system Active CN111539916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010270423.0A CN111539916B (en) Adversarially robust image saliency detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010270423.0A CN111539916B (en) Adversarially robust image saliency detection method and system

Publications (2)

Publication Number Publication Date
CN111539916A true CN111539916A (en) 2020-08-14
CN111539916B CN111539916B (en) 2023-05-26

Family

ID=71978514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010270423.0A Active CN111539916B (en) Adversarially robust image saliency detection method and system

Country Status (1)

Country Link
CN (1) CN111539916B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100086221A1 (en) * 2008-10-03 2010-04-08 3M Innovative Properties Company Systems and methods for evaluating robustness
US20130230253A1 (en) * 2008-10-03 2013-09-05 3M Innovative Properties Company Systems and methods for evaluating robustness of saliency predictions of regions in a scene
CN109583455A (en) * 2018-11-20 2019-04-05 黄山学院 A kind of image significance detection method merging progressive figure sequence
CN109992931A (en) * 2019-02-27 2019-07-09 天津大学 A kind of transportable non-black box attack countercheck based on noise compression

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070784A (en) * 2020-09-15 2020-12-11 桂林电子科技大学 Perception edge detection method based on context enhancement network
CN113380255A (en) * 2021-05-19 2021-09-10 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113380255B (en) * 2021-05-19 2022-12-20 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113450271A (en) * 2021-06-10 2021-09-28 南京信息工程大学 Robust adaptive countermeasure sample generation method based on human visual model
CN113450271B (en) * 2021-06-10 2024-02-27 南京信息工程大学 Robust self-adaptive countermeasure sample generation method based on human visual model
CN113283545A (en) * 2021-07-14 2021-08-20 中国工程物理研究院计算机应用研究所 Physical interference method and system for video identification scene
CN114998707A (en) * 2022-08-05 2022-09-02 深圳中集智能科技有限公司 Attack method and device for evaluating robustness of target detection model
CN114998707B (en) * 2022-08-05 2022-11-04 深圳中集智能科技有限公司 Attack method and device for evaluating robustness of target detection model

Also Published As

Publication number Publication date
CN111539916B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
Ding et al. Semantic segmentation with context encoding and multi-path decoding
CN111539916A (en) Adversarially robust image saliency detection method and system
Zhou et al. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network
Li et al. End-to-end united video dehazing and detection
Mei et al. Robust visual tracking and vehicle classification via sparse representation
Liu et al. Learning converged propagations with deep prior ensemble for image enhancement
Li et al. Grayscale-thermal object tracking via multitask laplacian sparse representation
Zhu et al. A fast single image haze removal algorithm using color attenuation prior
CN109961444B (en) Image processing method and device and electronic equipment
CN112750140B (en) Information mining-based disguised target image segmentation method
Vishnu et al. EVAA - Exchange Vanishing Adversarial Attack on LiDAR Point Clouds in Autonomous Vehicles
Elons et al. A proposed PCNN features quality optimization technique for pose-invariant 3D Arabic sign language recognition
Teng et al. Underwater target recognition methods based on the framework of deep learning: A survey
CN113222960B (en) Deep neural network confrontation defense method, system, storage medium and equipment based on feature denoising
Cheng et al. Adversarial exposure attack on diabetic retinopathy imagery
Song et al. Multistage curvature-guided network for progressive single image reflection removal
Li et al. Generative dynamic patch attack
Cheng et al. Sonar image garbage detection via global despeckling and dynamic attention graph optimization
Talib et al. YOLOv8-CAB: Improved YOLOv8 for Real-time object detection
Chen et al. Deep trident decomposition network for single license plate image glare removal
He et al. Transferable attack for semantic segmentation
Sun et al. Polynomial approximation based spectral dual graph convolution for scene parsing and segmentation
Anirudh et al. MimicGAN: Corruption-mimicking for blind image recovery & adversarial defense
Kemmou et al. Automatic facial expression recognition under partial occlusion based on motion reconstruction using a denoising autoencoder
Ji et al. Blind motion deblurring using improved DeblurGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant