CN111539916A - Adversarially robust image saliency detection method and system - Google Patents

Adversarially robust image saliency detection method and system

Info

Publication number
CN111539916A
Authority
CN
China
Prior art keywords
image
saliency
energy function
input
robust
Prior art date
Legal status
Granted
Application number
CN202010270423.0A
Other languages
Chinese (zh)
Other versions
CN111539916B (en)
Inventor
Zeng Yirui
Ma Zhengming
Li Guanbin
Lin Liang
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010270423.0A
Publication of CN111539916A
Application granted granted Critical
Publication of CN111539916B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an adversarially robust image saliency detection method and system, wherein the method comprises the following steps: step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system; step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed; and step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.

Description

Adversarially robust image saliency detection method and system
Technical Field
The invention relates to the technical field of computer vision based on deep learning, and in particular to an adversarially robust image saliency detection method and system based on an energy model.
Background
The purpose of saliency detection is to locate and segment the objects in an image or video frame that are most visually striking to the human eye. Designing saliency detection models to mimic humans not only helps understand the intrinsic mechanisms of human vision and psychology, but also benefits many applications in computer vision and computer graphics. For example, saliency detection is applied to context-aware image editing, image thumbnailing, object segmentation, and person re-identification. Saliency detection is a fundamental task in computer vision that has been widely studied for a long time, and there is a large body of related work.
In recent years, the application of deep neural networks has greatly improved saliency detection performance and has gradually become the mainstream approach. Saliency detection methods driven by deep convolutional neural networks can be divided into two groups: sparse-label and dense-label methods. Sparse-label methods appeared in earlier years; because they take a region as the computational unit and involve two separate steps of feature extraction and saliency-value inference, they are generally inefficient and require a large amount of space for feature storage. Inspired by the successful application of fully convolutional networks to pixel-level semantic segmentation, recent dense-label methods have established the new state of the art in saliency detection, such as Detect globally, refine locally: a novel approach to saliency detection (CVPR) by Wang, T. et al.
Existing best-performing detection methods basically adopt a fully convolutional neural network as the model architecture. Hou, Q. et al., 2018, Deeply supervised salient object detection with short connections (PAMI), adapted the holistically nested edge detector structure by introducing short connections with a skip-layer structure.
Although fully convolutional neural networks have achieved great success on salient object detection in recent years, these methods have weaknesses that can degrade their performance. First, the end-to-end trainable property allows gradients to propagate easily from the supervision target to the input image, which exposes salient object detection models to the risk of adversarial attacks. Second, dense-label models do not explicitly model the contrast between different image parts, but implicitly estimate saliency within a single FCN; once the input image is contaminated with adversarial noise, both low-level and high-level features are affected. Third, current saliency detection training sets are very small compared with image classification tasks that have millions of samples, and the salient object classes involved are also very limited. Thus, to some extent, existing models fit biases within the data, e.g. detecting targets that frequently appear in the training set rather than locating the most prominent objects; these methods may rely on capturing too much high-level semantics and may be sensitive to low-level perturbations such as adversarial noise.
Efficiency and robustness are important because salient object detection is typically employed as initialization or preprocessing in an early stage of a system. If the performance of the preprocessing stage is severely degraded by well-designed input noise, the subsequent stages may produce erroneous results, which can be catastrophic for the entire system. There is therefore a real need to attend to the robustness of saliency detection and to provide an accurate, fast and stable saliency detection model for the salient object detection task.
Disclosure of Invention
To overcome the above-mentioned deficiencies of the prior art, the present invention provides an adversarially robust image saliency detection method and system that improve the robustness of existing dense-label methods while maintaining their efficiency.
To achieve the above object, the present invention provides an adversarially robust image saliency detection method, comprising the following steps:
step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed;
step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.
Preferably, in step S1, an iterative gradient-based white-box attack is used.
Preferably, in step S1, the maximum number of iterations T limits the total runtime cost; once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
Preferably, step S2 further comprises:
step S201, approximating the energy function with a neural network, and generating samples from the probability distribution defined by the energy function;
step S202, further introducing a noise model when reconstructing the image;
step S203, training the neural network parameters approximating the energy function in the direction of maximum log-likelihood; after the energy model is trained, the reconstructed image sampled via the gradient of the energy function gradually approaches the original input image.
Preferably, in step S201, an iterative refinement process based on Langevin dynamics is employed, sampling with the gradient of the energy function to reconstruct the input image.
Preferably, in step S202, Langevin dynamics is used to add a perturbation to the gradient descent:

I_{R+1} = I_R - (α/2) · ∂E(I_R; θ)/∂I + √α · z_R

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, ∂E(I_R; θ)/∂I is the gradient of the energy function with respect to the image I, the coefficient α/2 corresponds to the learning rate α, and z_R is the inertia factor of I_R.
Preferably, in step S3, any fully convolutional visual saliency model may be selected as the backbone network.
Preferably, the method further comprises:
step S4, smoothing the input image, i.e. the adversarial sample, with a filtering method, for contrast modeling in context-aware restoration;
step S5, improving the saliency scores provided by the backbone network using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and adjusting the saliency map by minimizing an energy function.
Preferably, in step S5, similarity between pixels is measured in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the energy function.
To achieve the above object, the present invention further provides an adversarially robust image saliency detection system, comprising:
an adversarial sample generation unit, configured to generate, for adversarial attacks on saliency detection, adversarial attack samples for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
an input image reconstruction unit, configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed;
and a saliency detection unit, configured to take the reconstructed image obtained by the input image reconstruction unit as the input of a backbone network and generate a densely labeled saliency map, so as to reduce the backbone network's high sensitivity to adversarial samples.
Compared with the prior art, the adversarially robust image saliency detection method and system of the present invention exploit the generality and simplicity of energy-based models in likelihood modeling to reconstruct the input image contaminated with adversarial noise, thereby effectively neutralizing the attack; after the backbone network, the saliency map is refined using similarity between contexts, which markedly improves the detection results.
Drawings
FIG. 1 is a flow chart illustrating the steps of the adversarially robust image saliency detection method of the present invention;
FIG. 2 is a process diagram of adversarially robust image saliency detection according to an embodiment of the present invention;
FIG. 3 is a system architecture diagram of the adversarially robust image saliency detection system of the present invention.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
Current adversarial attacks mainly adopt the following methods: one-step gradient-based methods (the fast gradient sign method, FGSM) and iterative methods. Under an infinity-norm constraint with threshold ε, FGSM computes a one-step gradient that maximizes the loss between the output and the ground truth; the FGSM formula for generating an adversarial sample is:

x* = x + ε · sign(∇_x J(f(x; θ), y))

wherein x*, x and y respectively denote the adversarial sample, the original image and the ground truth, J denotes the loss function, and f(·; θ) denotes the neural network model with parameter θ.
The iterative method applies FGSM multiple times with step size α:

x*_{t+1} = clip(x*_t + α · sign(∇_{x*_t} J(f(x*_t; θ), y)), ε)

wherein x*_t denotes the adversarial sample generated at the t-th time step, and clip(x, ε) keeps each element x_i of x within the range [x_i - ε, x_i + ε]. Existing attacks mostly focus on the image classification task, with some research on semantic segmentation, human pose estimation and the like, but adversarial attacks on salient object detection remain unexplored; the present invention launches white-box and black-box adversarial attacks on salient object detection models, as sketched below.
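For illustration only (this sketch is not part of the patent text), the two attack families above can be written in PyTorch roughly as follows; the model f, the loss J, and the tensors x and y are assumed stand-ins for a pre-trained network, its training loss, and a preprocessed input with its ground truth.

```python
import torch

def fgsm(f, J, x, y, eps):
    """One-step FGSM: x* = x + eps * sign(grad_x J(f(x), y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    J(f(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def iterative_fgsm(f, J, x, y, eps, alpha, T):
    """Iterative FGSM; clip(., eps) keeps each element within [x_i - eps, x_i + eps]."""
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        J(f(x_adv), y).backward()
        step = alpha * x_adv.grad.sign()
        # element-wise clip to the L-infinity ball of radius eps around x
        x_adv = torch.min(torch.max(x_adv + step, x - eps), x + eps).detach()
    return x_adv
```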
Fig. 1 is a flowchart of the steps of the adversarially robust image saliency detection method of the present invention, and fig. 2 is a process diagram of adversarially robust image saliency detection according to an embodiment of the present invention. As shown in figs. 1 and 2, the present invention provides an adversarially robust image saliency detection method comprising the following steps:
and step S1, generating a counterattack sample for the significance detection as an input image of the system on the original image by using an iterative gradient-based method aiming at the counterattack for the significance detection. The iterative gradient-based method is a combination of gradient-based and iteration as described above, i.e. the iteration is iterated on a gradient-based basis.
Specifically, in a white-box attack, a visual saliency model is selected as the neural network to be attacked. Let f(·; θ) be a pre-trained model with parameter θ, and let x, x* and y denote the original image, its corresponding adversarial sample and the ground truth, respectively. The data is preprocessed before the adversarial sample is synthesized. After generation, the adversarial sample x* is clipped to [0, 255] and rounded to an RGB image. Each element y_i of y belongs to {0, 1}, with 0 denoting non-salient and 1 denoting salient. To ensure the adversarial perturbation is imperceptible, an L∞ constraint ε is imposed so that ||x - x*||∞ ≤ ε. The maximum number of iterations T limits the total runtime cost. Once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
In each iteration t, the adversarial sample x*_t from initialization or the previous time step is updated as follows:

x*_{t+1} = clip(x*_t + p_t, ε)

wherein p_t is the perturbation computed at the t-th iteration. The goal of the iteration is to make the prediction of every pixel in x wrong, i.e.

arg max_c f(x*; θ)_{i,c} ≠ y_i for every pixel i

wherein i indexes the pixels of x and c ranges over the two classes: salient and non-salient. To determine p_t, gradient descent is applied to the following objective:

p′_t = ∇_{x*_t} Σ_{i∈S_t} J(f(x*_t; θ)_i, y_i)

wherein S_t denotes the set of pixels that f still classifies correctly. The perturbation is then obtained by normalization, p_t = α · p′_t / ||p′_t||∞. In a specific embodiment of the present invention only the method of generating white-box adversarial samples is described, but it should be noted that it transfers directly to black-box attacks, because existing visual saliency models share a similar fully convolutional architecture and are typically initialized from the same pre-trained image classification model.
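A hedged sketch of this per-pixel attack follows; the use of cross-entropy as the loss J and the two-channel logit layout are assumptions of this illustration, not details fixed by the patent.

```python
import torch
import torch.nn.functional as nnF

def saliency_attack(f, x, y, eps, alpha, T):
    """Iterative attack: at step t only the still-correct pixel set S_t contributes."""
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        logits = f(x_adv)                    # (N, 2, H, W): non-salient / salient
        s_t = logits.argmax(dim=1) == y      # S_t: pixels f still classifies correctly
        if not s_t.any():                    # every pixel already mispredicted
            break
        loss = nnF.cross_entropy(logits, y, reduction="none")[s_t].sum()
        loss.backward()
        p_t = alpha * x_adv.grad / x_adv.grad.abs().max()   # L-infinity normalization
        x_adv = torch.min(torch.max(x_adv + p_t, x - eps), x + eps).detach()
    return x_adv
```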
In step S2, the input image is reconstructed using an energy-based generative model, and likelihood modeling is performed with a neural network that approximates the energy function, generating an image from which the adversarial noise is removed.
In step S2, the adversarial sample obtained in step S1 is taken as the input. Because adversarial noise is precisely computed by back-propagation, it is inherently fragile: it forms subtle curve-like patterns that play an important role in the attack, and eliminating these patterns reduces the attack's effect.
Specifically, step S2 further includes:
Step S201, approximating the energy function with a neural network, sampling with the gradient of the energy function, and generating samples from the probability distribution defined by the energy function. That is, the adversarial samples obtained in step S1 are input to the energy function and sampled via its gradient, and samples are generated from the probability distribution the energy function defines.
Specifically, given a data point x, let E_θ(x) be an energy function represented by a neural network with parameter θ. The energy function defines a probability distribution through the Boltzmann distribution

p(I; θ) = exp(-E(I; θ)) / Z(θ)

wherein I denotes an image and Z(θ) = ∫ exp(-E(I; θ)) dI denotes the partition function. In the present invention the neural network is denoted by F (which may be implemented as a combination of convolutional layers and nonlinear activation functions, with randomly initialized parameters), i.e. F(I, θ) = -E(I; θ). To generate samples from this distribution, the probability density of the generated image is maximized. The reconstructed image is denoted I_R; to generate samples from the energy model, an iterative refinement process is employed. When reconstructing an image, the first iteration starts from a synthetic image initialized to zero, i.e. I_R = 0, and the refinement proceeds as:

I_{R+1} = I_R + α · ∂F(I_R, θ)/∂I

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, α is the learning rate, and ∂F(I_R, θ)/∂I is the gradient of the neural network F, which can be computed by back-propagation; since F(I, θ) = -E(I; θ), it equals the negative gradient of the energy function with respect to the image I.

Briefly, the reconstructed image is obtained by sampling from this probability distribution, embodied in the gradient term of the update formula (the part that differentiates the neural network F). Sampling is an iterative process: repeated iterative sampling yields a reconstructed image of higher quality, which is the iterative refinement process. The neural network F approximates the distribution, and the iterative refinement uses its gradient to update the image.
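A minimal sketch of such an energy network and the zero-initialized refinement loop follows, assuming a small convolutional architecture of our own choosing (the patent only requires convolutional layers with nonlinear activations and random initialization):

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Stand-in for F(I, theta) = -E(I; theta): conv layers + nonlinearities -> scalar."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, img):
        return self.features(img).sum(dim=(1, 2, 3))   # one scalar per image

def refine(F_net, shape, alpha, steps):
    """Iterative refinement from I_R = 0: I <- I + alpha * dF/dI (gradient ascent on F)."""
    img = torch.zeros(shape, requires_grad=True)
    for _ in range(steps):
        grad, = torch.autograd.grad(F_net(img).sum(), img)
        img = (img + alpha * grad).detach().requires_grad_(True)
    return img.detach()
```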
Step S202, a noise model is further introduced during image reconstruction, the difficulty of recovering fine details is increased, the chance of fitting antagonistic noise is reduced, and the antagonistic noise is effectively eliminated, wherein other types of noise can be selected for replacing the added Gaussian noise
In order to remove the challenge noise in the challenge sample. A noise model is further introduced in the process of reconstructing the image:
Figure BDA0002442955820000081
where Z represents some noise distribution, e.g., gaussian noise, and e is the noise strength.
Since the countering noise is accurately calculated by back-propagation, it is inherently fragile, and forms a number of subtle curvilinear patterns that may play an important role. Eliminating these patterns may reduce the effect of the attack. Adding noise during image synthesis increases the difficulty of restoring fine details, thereby reducing the chance of fitting antagonistic noise to achieve the goal of removing the antagonistic noise.
Preferably, to improve mixing during sampling, Langevin dynamics is used to add a perturbation to the gradient update:

I_{R+1} = I_R + (α/2) · ∂F(I_R, θ)/∂I + √α · z_R

wherein the coefficient α/2 corresponds to the learning rate α, z_R is the noise term of I_R, and z_R ~ N(0, 1), i.e. Gaussian noise from the standard normal distribution.
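Under the refinement above, one Langevin update might look like the following sketch; the exact placement of the α/2 and √α coefficients follows the standard Langevin form and is our reconstruction, since the patent's formula image is not reproduced here.

```python
import torch

def langevin_step(F_net, img, alpha):
    """One Langevin update: I <- I + (alpha/2) * dF/dI + sqrt(alpha) * z, z ~ N(0, 1)."""
    img = img.detach().requires_grad_(True)
    grad, = torch.autograd.grad(F_net(img).sum(), img)
    return (img + 0.5 * alpha * grad + alpha ** 0.5 * torch.randn_like(img)).detach()
```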
Step S203, training the neural network parameters of the approximate energy function along the direction of the maximum log likelihood, and after the energy model is trained, gradually enabling the reconstructed image sampled from the gradient of the energy function to be close to the original input image.
Specifically, the neural network F is updated so that the reconstructed image gradually approaches the input image I_I. Let the likelihood function be L(θ) = log p(I_I; θ); θ is trained in the direction that maximizes the log-likelihood L(θ):

θ_{t+1} = θ_t + β · ∂L(θ_t)/∂θ

wherein θ_t denotes the network parameters at the current time step, θ_{t+1} the updated parameters at the next time step, ∂L(θ)/∂θ the gradient of the log-likelihood with respect to θ, and β the step size. The gradient expands as

∂L(θ)/∂θ = E_{p(I;θ)}[∂E(I; θ)/∂θ] - ∂E(I_I; θ)/∂θ

wherein E_{p(I;θ)}[·] is the expectation over I under the distribution p(I; θ). The expectation is not computed explicitly but approximated by sampling: since the reconstructed image is obtained by sampling I from p(I; θ), in a specific embodiment of the invention ∂E(I_R; θ)/∂θ is chosen to approximate the expectation E_{p(I;θ)}[·]. The parameters of the neural network F are therefore updated during training as:

θ_{t+1} = θ_t + β · (∂E(I_R; θ_t)/∂θ - ∂E(I_I; θ_t)/∂θ)
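As a hedged sketch, one such maximum-likelihood step can be written as a two-term loss: since F = -E, raising F on the input image and lowering it on the sampled reconstruction follows the gradient above. The optimizer choice is an assumption of this illustration.

```python
import torch

def ebm_step(F_net, optimizer, img_input, img_recon):
    """One ML update: grad L ≈ dE(I_R)/dθ - dE(I_I)/dθ, i.e. minimize F(I_R) - F(I_I)."""
    optimizer.zero_grad()
    loss = F_net(img_recon.detach()).mean() - F_net(img_input).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```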
the training process of steps S201-S203 is performed iteratively in alternation, thereby generating a high quality reconstructed image.
Preferably, in step S2, a synthesized image having the same size as the input image is initialized with zeros. The composite image is an image reconstructed with the counternoise removed.
Step S3, the reconstructed image obtained in step S2 is used as the input of the backbone network to generate a densely labeled saliency map, thereby reducing the network's high sensitivity to adversarial samples.
The backbone network can be chosen as any fully convolutional visual saliency model that takes the entire image as input and produces a densely labeled saliency map. A fully convolutional backbone offers both high efficiency and high accuracy. The fully convolutional framework backbone is initialized from a pre-trained visual saliency model. The reconstructed image generated in step S2 is used as the backbone's input, reducing the network's high sensitivity to adversarial samples.
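Wiring the stages together, a hedged inference sketch follows; EnergyNet and refine are the illustrative helpers above, adv_sample is the adversarial input tensor, and backbone stands in for any pre-trained fully convolutional saliency model (all assumptions of this illustration, not components fixed by the patent).

```python
import torch

F_net = EnergyNet()                                    # assumed trained as in step S203
recon = refine(F_net, adv_sample.shape, alpha=0.01, steps=60)
with torch.no_grad():
    saliency_map = torch.sigmoid(backbone(recon))      # densely labeled saliency scores
```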
Preferably, the adversarially robust image saliency detection method provided by the present invention further comprises the following steps:
Step S4, the input image, i.e. the adversarial sample, is smoothed with a filtering method, for contrast modeling in context-aware restoration, thereby refining the final result. In the embodiment of the present invention bilateral filtering is selected to smooth the adversarial sample, but the invention is not limited thereto; any filtering method may be substituted.
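A one-line OpenCV sketch of the bilateral-filtering choice; the kernel diameter and sigma values are illustrative assumptions, as is the file name.

```python
import cv2

adv_bgr = cv2.imread("adversarial_sample.png")                     # hypothetical input
smoothed = cv2.bilateralFilter(adv_bgr, d=9, sigmaColor=75, sigmaSpace=75)
```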
Step S5, the saliency scores provided by the backbone network are improved using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and the saliency map is adjusted by minimizing an energy function. Low-level features are pixel-level features such as shape and texture: while adversarial noise may destroy a neural network's high-level semantic understanding, it does not affect pixel-level features. Image context information refers to information reflecting the correlations between objects in the image.
To mitigate the impact on the results of the noise newly introduced in step S2, this step exploits low-level feature similarity between pixels and image context information to improve the saliency scores provided by the backbone network. Since adversarial perturbations are crafted against parameterized convolution filters, the restoration component adopts a graphical model rather than a CNN architecture. Because the high-level convolutional features have already been contaminated, the present invention measures similarity between pixels in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the following energy function:
E(y*) = Σ_i ψ_u(y*_i) + Σ_{i<j} ψ_p(y*_i, y*_j)

wherein y is the coarse saliency map and y* is the refined saliency map. The first term embodies global context information: the unary energy function ψ_u(y*_i) measures the cost of assigning label y*_i to pixel i, the cost being an inverse likelihood. The second term captures pixel-level features and local context: the pairwise energy function ψ_p(y*_i, y*_j) measures the cost of simultaneously assigning y*_i and y*_j to pixels i and j, based in particular on spatial location and color space, which encourages similar nearby pixels to be labeled with the same saliency value. Writing p_i for the spatial location of pixel i and x′_i for its color, the pairwise energy function is defined as:

ψ_p(y*_i, y*_j) = μ(y*_i, y*_j) [ ω_1 exp(-||p_i - p_j||² / (2θ_α²) - ||x′_i - x′_j||² / (2θ_β²)) + ω_2 exp(-||p_i - p_j||² / (2θ_γ²)) ]

wherein x′ (i.e. x′_i and x′_j) is the result of bilateral filtering the input adversarial sample x*. ω_2 and θ_γ are set to 1, and ω_1, θ_α and θ_β are selected by validation. μ is a learnable label-compatibility function that penalizes assigning different labels to i and j, encouraging similar nearby pixels to share the same label.
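A small NumPy sketch of the pairwise affinity used in this energy, with ω_2 and θ_γ fixed to 1 as stated; evaluating it per pixel pair is for clarity only (dense-CRF inference would compute it over all pairs efficiently).

```python
import numpy as np

def pairwise_kernel(p_i, p_j, c_i, c_j, w1, theta_alpha, theta_beta,
                    w2=1.0, theta_gamma=1.0):
    """Appearance kernel (position + bilateral-filtered color) plus smoothness kernel."""
    d_pos = np.sum((p_i - p_j) ** 2)
    d_col = np.sum((c_i - c_j) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_col / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness
```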
In prior work such as Conditional random fields as recurrent neural networks (ICCV), the above formula is interpreted as a densely connected conditional random field implemented with a recurrent neural network; the neural network is implemented with 1×1 convolutional layers. Since the restoration component refines the results using global context, it is harder to change its predictions with adversarial noise of limited perturbation strength: affecting the result at certain pixel locations may require changing remote feature vectors, which entails larger perturbations. The parameters of step S5 are initialized following Efficient inference in fully connected CRFs with Gaussian edge potentials (NeurIPS), after which the parameters of the backbone network and the restoration component are fine-tuned jointly.
Fig. 3 is a system architecture diagram of the adversarially robust image saliency detection system of the present invention. As shown in fig. 3, the present invention provides an adversarially robust image saliency detection system, comprising:
an adversarial sample generation unit 301, configured to generate, for adversarial attacks on saliency detection, an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system. The iterative gradient-based approach combines the gradient-based and iterative methods described above, i.e., the gradient-based step is applied iteratively, to launch adversarial attacks on saliency detection. Inspired by Adversarial examples for semantic segmentation and object detection (ICCV), the present invention synthesizes adversarial samples by an iterative gradient-based method.
Specifically, in a white-box attack, a visual saliency model is selected as the neural network to be attacked. Let f(·; θ) be a pre-trained model with parameter θ, and let x, x* and y denote the original image, its corresponding adversarial sample and the ground truth, respectively. The data is preprocessed before the adversarial sample is synthesized; in the embodiment of the present invention, the preprocessing subtracts an image mean from the original image x, where the image mean is generally the pixel mean of the training set, the subtraction serving to normalize the image. After generation, the adversarial sample x* is clipped to [0, 255] and rounded to an RGB image. Each element y_i of y belongs to {0, 1}, with 0 denoting non-salient and 1 denoting salient. To ensure the adversarial perturbation is imperceptible, an L∞ constraint ε is imposed so that ||x - x*||∞ ≤ ε. The maximum number of iterations T limits the total runtime cost. Once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
In each iteration t, the adversarial sample x*_t from initialization or the previous time step is updated as follows:

x*_{t+1} = clip(x*_t + p_t, ε)

wherein p_t is the perturbation computed at the t-th iteration. The goal of the iteration is to make the prediction of every pixel in x wrong, i.e.

arg max_c f(x*; θ)_{i,c} ≠ y_i for every pixel i

wherein i indexes the pixels of x and c ranges over the two classes: salient and non-salient. To determine p_t, gradient descent is applied to the following objective:

p′_t = ∇_{x*_t} Σ_{i∈S_t} J(f(x*_t; θ)_i, y_i)

wherein S_t denotes the set of pixels that f still classifies correctly. The perturbation is then obtained by normalization, p_t = α · p′_t / ||p′_t||∞, where α is a fixed step size. The present invention introduces only the method of generating white-box adversarial samples; it transfers directly to black-box attacks, because existing visual saliency models share a similar fully convolutional architecture and are typically initialized from the same pre-trained image classification model.
An input image reconstruction unit 302 is configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed.
The input image reconstruction unit 302 takes as input the adversarial sample obtained by the adversarial sample generation unit 301. Because adversarial noise is precisely computed by back-propagation, it is inherently fragile: it forms subtle curve-like patterns that play an important role in the attack, and eliminating these patterns reduces the attack's effect.
Specifically, the input image reconstruction unit 302 further comprises:
a sampling module for approximating the energy function with a neural network, sampling with the gradient of the energy function, and generating samples from the probability distribution defined by the energy function. That is, the adversarial samples obtained by the adversarial sample generation unit 301 are input to the energy function and sampled via its gradient, and samples are generated from the probability distribution the energy function defines.
Specifically, given a data point x, let E_θ(x) be an energy function represented by a neural network with parameter θ. The energy function defines a probability distribution through the Boltzmann distribution

p(I; θ) = exp(-E(I; θ)) / Z(θ)

wherein I denotes an image and Z(θ) = ∫ exp(-E(I; θ)) dI denotes the partition function. The neural network is denoted by F in the invention, i.e. F(I, θ) = -E(I; θ). To generate samples from this distribution, the probability density of the generated image is maximized. The reconstructed image is denoted I_R; to generate samples from the energy model, an iterative refinement process is used. When reconstructing the image, the first iteration starts from a synthetic image initialized to zero, i.e. I_R = 0, and the refinement proceeds as:

I_{R+1} = I_R + α · ∂F(I_R, θ)/∂I

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, α is the learning rate, and ∂F(I_R, θ)/∂I is the gradient of the neural network F, computable by back-propagation; since F(I, θ) = -E(I; θ), it equals the negative gradient of the energy function with respect to the image I.
A noise introducing module is used to further introduce a noise model when reconstructing the image, increasing the difficulty of recovering fine details and thereby reducing the chance of fitting adversarial noise so as to effectively eliminate it (other types of noise may be substituted for the added Gaussian noise). To remove the adversarial noise in the adversarial sample, a noise model is introduced into the reconstruction process:

I ← I + ε · Z

wherein Z denotes some noise distribution, e.g. Gaussian noise, and ε is the noise strength.
Because adversarial noise is precisely computed by back-propagation, it is inherently fragile and forms subtle curve-like patterns that play an important role in the attack; eliminating these patterns reduces the attack's effect. Adding noise during image synthesis increases the difficulty of restoring fine details, thereby reducing the chance of fitting the adversarial noise and achieving its removal.
Preferably, to improve mixing during sampling, Langevin dynamics is used to add a perturbation to the gradient update:

I_{R+1} = I_R + (α/2) · ∂F(I_R, θ)/∂I + √α · z_R

wherein the coefficient α/2 corresponds to the learning rate α, z_R is the noise term of I_R, and z_R ~ N(0, 1), i.e. Gaussian noise from the standard normal distribution.
A training module is used to train the neural network parameters approximating the energy function in the direction of maximum log-likelihood; after the energy model is trained, the reconstructed image sampled via the gradient of the energy function gradually approaches the original input image.
Specifically, the neural network F is updated so that the reconstructed image gradually approaches the input image I_I. Let the likelihood function be L(θ) = log p(I_I; θ); θ is trained in the direction that maximizes the log-likelihood L(θ):

θ_{t+1} = θ_t + β · ∂L(θ_t)/∂θ

wherein θ_t denotes the network parameters at the current time step, θ_{t+1} the updated parameters at the next time step, ∂L(θ)/∂θ the gradient of the log-likelihood with respect to θ, and β the step size. The gradient expands as

∂L(θ)/∂θ = E_{p(I;θ)}[∂E(I; θ)/∂θ] - ∂E(I_I; θ)/∂θ

wherein E_{p(I;θ)}[·] is the expectation over I under the distribution p(I; θ). The expectation is not computed explicitly but approximated by sampling: since the reconstructed image is obtained by sampling I from p(I; θ), in a specific embodiment of the invention ∂E(I_R; θ)/∂θ is chosen to approximate the expectation E_{p(I;θ)}[·]. The parameters of the neural network F are therefore updated during training as:

θ_{t+1} = θ_t + β · (∂E(I_R; θ_t)/∂θ - ∂E(I_I; θ_t)/∂θ)
and the saliency detection unit 303 is used for taking the reconstructed image obtained by the input image reconstruction unit 302 as the input of the backbone network and generating a densely marked saliency map, so that the high sensitivity of the network on the confrontation sample is reduced.
In a specific embodiment of the present invention, the backbone network can be selected as any visual saliency model based on a fully convolutional network that takes the entire image as input and produces a densely labeled saliency map. The full convolution backbone network has high efficiency and high accuracy. The framework backbone, which is initially based on a full convolution network, is initialized to some pre-trained visual saliency model. The reconstructed image generated by the input image reconstruction unit 302 is used as the input of the backbone network, so that the high sensitivity of the network on the confrontation sample is reduced.
Preferably, the adversarially robust image saliency detection system of the present invention further comprises:
a smoothing unit 304, configured to smooth the input image, i.e. the adversarial sample, with a filtering method, the smoothed image being used for contrast modeling in the context-aware restoration of the saliency map refinement unit, thereby refining the final result. In the embodiment of the present invention bilateral filtering is selected to smooth the adversarial sample, but the invention is not limited thereto; any filtering method may be substituted.
A saliency map refinement unit 305, configured to refine the saliency scores provided by the backbone network by minimizing the energy function, using low-level feature similarity between pixels of the smoothed adversarial sample and image context information. Low-level features are pixel-level features such as shape and texture: while adversarial noise may destroy a neural network's high-level semantic understanding, it does not affect pixel-level features. Image context information refers to information reflecting the correlations between objects in the image.
In summary, the adversarially robust image saliency detection method and system of the present invention reconstruct an input image free of adversarial noise using an energy-based generative model and, through context-aware restoration, effectively address the sensitivity of existing saliency detection models to adversarial noise, enhancing model robustness so that adversarial samples fed to the saliency detection model still yield good detection results. Combined with the various applications of saliency detection, the invention can defend against carefully designed adversarial noise inputs and ensure the reliability of detection results.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (10)

1. An adversarially robust image saliency detection method, characterized by comprising the following steps:
step S1, for adversarial attacks on saliency detection, generating an adversarial attack sample for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
step S2, taking the adversarial sample obtained in step S1 as input, reconstructing the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function, and generating a reconstructed image from which the adversarial noise is removed;
step S3, taking the reconstructed image obtained in step S2 as the input of a backbone network and generating a densely labeled saliency map.
2. The adversarially robust image saliency detection method according to claim 1, characterized in that: in step S1, an iterative gradient-based white-box attack is used.
3. The adversarially robust image saliency detection method according to claim 2, characterized in that: in step S1, the maximum number of iterations T limits the total runtime cost; once T iterations are completed or the L∞ norm bound is reached, the iteration stops and returns the adversarial sample obtained at the current time step.
4. The adversarially robust image saliency detection method according to claim 1, characterized in that step S2 further comprises:
step S201, approximating the energy function with a neural network, and generating samples from the probability distribution defined by the energy function;
step S202, further introducing a noise model when reconstructing the image;
step S203, training the neural network parameters approximating the energy function in the direction of maximum log-likelihood, the reconstructed image sampled via the gradient of the energy function gradually approaching the original input image once the energy model is trained.
5. The adversarially robust image saliency detection method according to claim 4, characterized in that: in step S201, an iterative refinement process based on Langevin dynamics is employed, sampling with the gradient of the energy function to reconstruct the input image.
6. The adversarially robust image saliency detection method according to claim 5, characterized in that in step S202 Langevin dynamics is used to add a perturbation to the gradient descent:

I_{R+1} = I_R - (α/2) · ∂E(I_R; θ)/∂I + √α · z_R

wherein I_R is the current reconstructed image, I_{R+1} is the next updated image, ∂E(I_R; θ)/∂I is the gradient of the energy function with respect to the image I, the coefficient α/2 corresponds to the learning rate α, and z_R is the inertia factor of I_R.
7. The adversarially robust image saliency detection method according to claim 1, characterized in that: in step S3, any fully convolutional visual saliency model may be selected as the backbone network.
8. The adversarially robust image saliency detection method according to claim 1, characterized in that the method further comprises:
step S4, smoothing the input image, i.e. the adversarial sample, with a filtering method, for contrast modeling in image-context-aware restoration;
step S5, improving the saliency scores provided by the backbone network using low-level feature similarity between pixels of the smoothed adversarial sample and image context information, and adjusting the saliency map by minimizing an energy function.
9. The adversarially robust image saliency detection method according to claim 8, characterized in that: in step S5, similarity between pixels is measured in the low-level color space and spatial locations, and the restoration component adjusts the saliency map by minimizing the energy function.
10. An adversarially robust image saliency detection system, characterized by comprising:
an adversarial sample generation unit, configured to generate, for adversarial attacks on saliency detection, adversarial attack samples for saliency detection from the original image using an iterative gradient-based method, as the input image of the system;
an input image reconstruction unit, configured to reconstruct the input image using an energy-based generative model, performing likelihood modeling with a neural network that approximates the energy function and generating an image from which the adversarial noise is removed;
and a saliency detection unit, configured to take the reconstructed image obtained by the input image reconstruction unit as the input of a backbone network and generate a densely labeled saliency map, so as to reduce the backbone network's high sensitivity to adversarial samples.
CN202010270423.0A 2020-04-08 2020-04-08 Adversarially robust image saliency detection method and system Active CN111539916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010270423.0A CN111539916B (en) Adversarially robust image saliency detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010270423.0A CN111539916B (en) Adversarially robust image saliency detection method and system

Publications (2)

Publication Number Publication Date
CN111539916A true CN111539916A (en) 2020-08-14
CN111539916B CN111539916B (en) 2023-05-26

Family

ID=71978514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010270423.0A Active CN111539916B (en) Adversarially robust image saliency detection method and system

Country Status (1)

Country Link
CN (1) CN111539916B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100086221A1 (en) * 2008-10-03 2010-04-08 3M Innovative Properties Company Systems and methods for evaluating robustness
US20130230253A1 (en) * 2008-10-03 2013-09-05 3M Innovative Properties Company Systems and methods for evaluating robustness of saliency predictions of regions in a scene
CN109583455A (en) * 2018-11-20 2019-04-05 黄山学院 A kind of image significance detection method merging progressive figure sequence
CN109992931A (en) * 2019-02-27 2019-07-09 天津大学 A kind of transportable non-black box attack countercheck based on noise compression

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070784A (en) * 2020-09-15 2020-12-11 桂林电子科技大学 Perception edge detection method based on context enhancement network
CN113380255A (en) * 2021-05-19 2021-09-10 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113380255B (en) * 2021-05-19 2022-12-20 浙江工业大学 Voiceprint recognition poisoning sample generation method based on transfer training
CN113450271A (en) * 2021-06-10 2021-09-28 南京信息工程大学 Robust adaptive countermeasure sample generation method based on human visual model
CN113450271B (en) * 2021-06-10 2024-02-27 南京信息工程大学 Robust self-adaptive countermeasure sample generation method based on human visual model
CN113283545A (en) * 2021-07-14 2021-08-20 中国工程物理研究院计算机应用研究所 Physical interference method and system for video identification scene
CN114998707A (en) * 2022-08-05 2022-09-02 深圳中集智能科技有限公司 Attack method and device for evaluating robustness of target detection model
CN114998707B (en) * 2022-08-05 2022-11-04 深圳中集智能科技有限公司 Attack method and device for evaluating robustness of target detection model

Also Published As

Publication number Publication date
CN111539916B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
Ding et al. Semantic segmentation with context encoding and multi-path decoding
CN111539916A (en) Adversarially robust image saliency detection method and system
Zhou et al. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network
Li et al. End-to-end united video dehazing and detection
Mei et al. Robust visual tracking and vehicle classification via sparse representation
Liu et al. Learning converged propagations with deep prior ensemble for image enhancement
Li et al. Grayscale-thermal object tracking via multitask laplacian sparse representation
Zhu et al. A fast single image haze removal algorithm using color attenuation prior
CN109961444B (en) Image processing method and device and electronic equipment
CN112750140B (en) Information mining-based disguised target image segmentation method
Vishnu et al. EVAA - Exchange Vanishing Adversarial Attack on LiDAR Point Clouds in Autonomous Vehicles
Elons et al. A proposed PCNN features quality optimization technique for pose-invariant 3D Arabic sign language recognition
Teng et al. Underwater target recognition methods based on the framework of deep learning: A survey
CN113222960B (en) Deep neural network confrontation defense method, system, storage medium and equipment based on feature denoising
Cheng et al. Adversarial exposure attack on diabetic retinopathy imagery
Song et al. Multistage curvature-guided network for progressive single image reflection removal
Li et al. Generative dynamic patch attack
Cheng et al. Sonar image garbage detection via global despeckling and dynamic attention graph optimization
Talib et al. YOLOv8-CAB: Improved YOLOv8 for Real-time object detection
Chen et al. Deep trident decomposition network for single license plate image glare removal
He et al. Transferable attack for semantic segmentation
Sun et al. Polynomial approximation based spectral dual graph convolution for scene parsing and segmentation
Anirudh et al. MimicGAN: Corruption-mimicking for blind image recovery & adversarial defense
Kemmou et al. Automatic facial expression recognition under partial occlusion based on motion reconstruction using a denoising autoencoder
Ji et al. Blind motion deblurring using improved DeblurGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant