CN115393776A - Black box attack method for self-supervision video target segmentation - Google Patents


Info

Publication number
CN115393776A
CN115393776A (application CN202211148006.4A)
Authority
CN
China
Prior art keywords
self
multiplied
antagonistic
video sequence
supervision
Prior art date
Legal status (assumed; not a legal conclusion): Pending
Application number
CN202211148006.4A
Other languages
Chinese (zh)
Inventor
姚睿
陈莹
周勇
赵佳琦
刘兵
祝汉城
邵志文
杜文亮
Current Assignee: China University of Mining and Technology (CUMT)
Original Assignee
China University of Mining and Technology CUMT
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by China University of Mining and Technology (CUMT)
Priority to CN202211148006.4A
Publication of CN115393776A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a black-box attack method for self-supervised video object segmentation. An affinity-matrix-based self-supervised segmentation model learns the feature representation of a video sequence to establish strong pixel-level correspondences. First, starting from a randomly initialized adversarial perturbation, contrastive losses over single, double, and multiple frames are constructed and iteratively optimized. A feature loss is then designed to enhance the transferability of the adversarial samples generated by the black-box attack, and a pixel-level loss keeps the generated adversarial noise imperceptible. A multi-path aggregation module produces the iteratively optimized adversarial perturbation, which is added to the original video frames to generate adversarial samples. Finally, the adversarial video is fed into the self-supervised video object segmentation network to obtain the final prediction mask. Studying black-box attacks on self-supervised video object segmentation models exposes the vulnerability of the segmentation algorithm and can further improve the security and robustness of the self-supervised video object segmentation task.

Description

Black box attack method for self-supervision video target segmentation
Technical Field
The invention relates to a black-box attack method for self-supervised video object segmentation, and belongs to the field of image processing technology.
Background
Self-supervised learning, which trains models in a supervised fashion using labels generated from the data itself, has become a standard way to learn representations for deep neural networks. In recent years, self-supervised learning on video has produced fruitful research. However, deep neural networks are highly vulnerable: state-of-the-art video object segmentation models can easily be fooled by adding visual noise to the original images. The attack takes the form of a small perturbation of the video frames that is imperceptible to the human visual system, yet it can cause the model to completely change its prediction for a frame. Worse still, the attacked model reports high confidence in the wrong prediction. Furthermore, the same adversarial perturbation can fool multiple neural network models.
Thanks to the abundance of online data, various ideas have been explored for learning correspondence representations by exploiting the spatio-temporal information in videos. Current self-supervised video object segmentation methods model the pairwise correspondence between a target frame and a reference frame, and this pairwise correspondence maintains spatio-temporal consistency. An affinity-matrix-based self-supervised segmentation model learns the feature representation of a video sequence, thereby establishing strong pixel correspondences. The black-box attack therefore mainly targets the affinity matrices between video frame sequences. Starting from a randomly initialized adversarial perturbation, contrastive losses for single, double, and multiple frames are constructed and iteratively optimized. A feature loss is employed to enhance the transferability of the adversarial samples generated by the black-box attack. In addition, a pixel-level loss is employed to keep the generated adversarial noise imperceptible. Finally, a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation.
Disclosure of Invention
The invention aims to improve the robustness of self-supervised video object segmentation. To this end, it provides a black-box attack method that destroys the affinity matrix between video frames in order to find adversarial samples whose difference from the clean input is imperceptible, causing the self-supervised video object segmentation network to fail. Research on adversarial attack methods helps to understand the working mechanism of deep models, facilitates better recognition of the vulnerabilities of segmentation algorithms, and can improve the robustness of self-supervised video object segmentation algorithms.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.
a black box attack method for self-supervision video target segmentation is characterized in that a self-supervision video target segmentation network based on an affinity matrix is adopted, and a countermeasure sample which has imperceptible difference with the self-supervision video target segmentation network is found by destroying the affinity matrix between video frames, so that the self-supervision video target segmentation network fails. Considering that implementing a counter attack against video object segmentation faces two challenges: (1) different from classified counterattack, classification only needs a classifier to carry out misclassification, and conditions of successful attack segmentation are more fuzzy; (2) in view of the segmented objects in the video, it is unlikely that antagonistic perturbations generated based on single-frame features will stick to every video frame. Therefore, the invention considers the generation of antagonistic perturbations on a frame-by-frame basis, and the implementation of the invention comprises the following steps:
(1) For an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised trained noise generator; x_i denotes the original image of the i-th frame, Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n;
(2) Add the adversarial perturbation ΔX to the original video sequence X to obtain an adversarial video sequence X_adv;
(3) Considering the consistency among video sequences, construct contrastive-loss attacks for single frames, double frames, and multiple frames, and obtain the total contrastive loss;
(4) Design a feature loss function that pushes the original image semantically closer to the adversarial frame in feature space, thereby enhancing the transferability of the adversarial video sequence;
(5) Design a pixel-level loss function to keep the noise of the adversarial video sequence imperceptible;
(6) Iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X'_adv;
(7) Given an initial frame mask, input the adversarial video sequence X'_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
Preferably, in step (2), the adversarial perturbation ΔX is added to the original video sequence X to obtain the adversarial video sequence

X_adv = {x_1^adv, x_2^adv, …, x_n^adv},  x_i^adv = x_i + Δx_i,  subject to ||Δx_i||_∞ ≤ ε

wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowed adversarial perturbation threshold, and ||·||_∞ denotes the infinity norm.
Preferably, in step (3), considering the consistency among video sequences, contrastive-loss attacks are constructed for single frames, double frames, and multiple frames:

[L_con^s, L_con^d, and L_con^m are given as formula images in the original; together they define the total contrastive loss L_con.]

wherein L_con^s, L_con^d, and L_con^m respectively denote the contrastive loss functions for single, double, and multiple frames, L_con denotes the total contrastive loss function, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes the cosine similarity function, v denotes a temperature parameter, {x_neg} is the set of elements in a dynamic queue composed of adversarial samples, {x_pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
Because a black-box attack does not know the parameters and structure of the attacked model, a feature loss function is designed to further enhance the transferability of the generated adversarial samples: it pushes the original image semantically closer to the adversarial frame in feature space, so that the predicted segmentation mask of the adversarial sample stays close to the target mask in feature space. Preferably, in step (4), the feature loss function is built on the features extracted by a feature extractor F_θ:

L_feat = (1 / (H·W·C)) Σ_{i=1}^{n} || F_θ(x_i^adv) − F_θ(x_i) ||_2

wherein F_θ(x_i^adv) denotes the feature map of x_i^adv extracted by F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W, and C respectively denote the height, width, and channel number of the feature map, and ||·||_2 denotes the L2 norm.
Preferably, the feature extractor F_θ adopts a ResNet-50 network; the feature map of the input image is taken from the layer preceding the last fully connected layer of ResNet-50, and θ denotes the parameters of the ResNet-50 network to be learned.
Preferably, in step (5), a pixel-level loss function is designed to keep the noise of the adversarial video sequence imperceptible:

L_pix = Σ_{i=1}^{n} || x_i^adv − x_i ||_2

wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
In particular, because the perturbation pattern resembles noise, smoothing the image helps to mitigate the adversarial effect. Therefore, to counteract such smoothing, a pixel-level loss is applied in the image pixel space. The pixel-level loss is the L2 distance between the adversarial example and the original clean example; minimizing it constrains the difference between the adversarial video sequence and the clean video sequence at the pixel level, keeping the adversarial samples visually inconspicuous.
Preferably, in step (6), the overall loss function used for the iterative optimization of the total loss is:

L_total = λ·L_con + μ·L_feat + η·L_pix

wherein the iteratively optimized adversarial perturbation ΔX' is obtained by iteratively optimizing L_total, and λ, μ, and η are the weight parameters of L_con, L_feat, and L_pix respectively; λ controls the relative importance of noise and the contrastive loss, μ controls the relative importance of each target feature in the video, and η controls the relative importance of the video frame pixels.
Preferably, in step (6), a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation ΔX'. The module is intended to integrate common features from video frames along different paths so as to generate the adversarial perturbation effectively. Its inputs are F_{t-1}, F_t, and F_{t+1}, the feature maps of the original images x_{t-1}, x_t, and x_{t+1} respectively; each feature map has size B × H × W × C, where B, H, W, and C denote the batch size, height, width, and channel number. Its output is the iteratively optimized adversarial perturbation ΔX'. The processing procedure of the multi-path aggregation module comprises the following steps:

(61) Project F_{t-1} into the feature space of F_t. The specific process is: process F_t and F_{t-1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t-1} to B × W × C × H, matrix-multiply the adjusted F_{t-1} and F_t, and normalize the result to finally form the projection P_{t-1} of size B × C × C × H:

P_{t-1} = Resize(BN(Resize(Conv(F_{t-1}))) × Resize(Conv(F_t)))

wherein Resize(·) denotes the reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;

(62) Project F_{t+1} into the feature space of F_t. The specific process is: process F_t and F_{t+1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t+1} to B × W × C × H, matrix-multiply the adjusted F_{t+1} and F_t, and normalize the result to finally form the projection P_{t+1} of size B × C × C × H:

P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))

(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':

[formula image in the original: ΔX' is obtained by subtracting the aggregation of Concat(P_{t-1}, P_{t+1}) from F_t]

wherein Concat(·) denotes the merge (concatenation) function;
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence:

X'_adv = {x_1 + Δx'_1, x_2 + Δx'_2, …, x_n + Δx'_n} = X + ΔX'

wherein L_total denotes the overall loss function optimized when obtaining ΔX'.
The multi-path aggregation module projects the feature maps of the previous and next frames onto the current frame through transformations and finally subtracts the aggregated features from the current frame, so that this image-level operation can effectively integrate the noise.
Preferably, in step (7), the affinity-matrix-based self-supervised video object segmentation, given a pair of video frames, rests on the following assumptions: the content of two consecutive video frames is coherent, and the frame reconstruction (pixel copy) operation can be represented by a linear transformation using an affinity matrix, which describes the copying process from the reference frame to the target frame. A common choice of similarity measurement for the affinity matrix is the dot product between feature maps. The goal of the black-box adversarial attack is to find adversarial samples whose difference from the clean input is imperceptible by destroying the affinity matrix between video frames, so that the segmentation model fails.
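The frame-reconstruction view above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the affinity matrix here is the softmax-normalized dot product between flattened per-pixel features of a reference and a target frame, and reconstructing the target frame amounts to multiplying the reference frame's pixel values by this matrix. All array names and shapes are assumptions.

```python
import numpy as np

def affinity_matrix(feat_ref, feat_tgt):
    """Dot-product affinity between flattened feature maps.

    feat_ref, feat_tgt: (H*W, C) arrays of per-pixel features.
    Returns an (N_tgt, N_ref) matrix whose rows sum to 1, describing
    how each target pixel copies from the reference-frame pixels.
    """
    logits = feat_tgt @ feat_ref.T                       # (N_tgt, N_ref)
    logits -= logits.max(axis=1, keepdims=True)          # numeric stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def reconstruct(affinity, ref_pixels):
    """Linear 'pixel copy' from the reference frame: (N_tgt, 3)."""
    return affinity @ ref_pixels

rng = np.random.default_rng(0)
f_ref = rng.normal(size=(16, 8))   # 4x4 reference-frame features, C = 8
f_tgt = rng.normal(size=(16, 8))
A = affinity_matrix(f_ref, f_tgt)
recon = reconstruct(A, rng.uniform(size=(16, 3)))
print(A.shape, recon.shape)
```

Destroying the correspondences encoded in A is exactly what the perturbation optimized in the following steps aims at.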
Beneficial effects: the black-box attack method for self-supervised video object segmentation generates adversarial perturbations by means of contrastive losses, can mount adversarial attacks on the self-supervised video object segmentation task, and causes all pixels of the target to be segmented incorrectly. The method can be used in task-specific or domain-specific research, helps to understand how black-box attacks affect model performance, and helps to reduce the influencing factors so as to enhance the security and robustness of the model.
Drawings
FIG. 1 is a flow chart of an embodiment of the method of the present invention;
FIG. 2 is a schematic illustration of an acquired challenge sample and a challenge prediction mask;
FIG. 3 is a schematic diagram of a multi-path aggregation module;
FIG. 4 is a schematic diagram of an apparatus for carrying out the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart of an implementation of the black-box attack method for self-supervised video object segmentation; the steps are described in detail below.
Step S01: for an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised trained noise generator; x_i denotes the original image of the i-th frame, Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n.
Step S02: add the adversarial perturbation ΔX to the original video sequence X to obtain the adversarial video sequence

X_adv = {x_1^adv, x_2^adv, …, x_n^adv},  x_i^adv = x_i + Δx_i,  subject to ||Δx_i||_∞ ≤ ε

wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowed adversarial perturbation threshold, and ||·||_∞ denotes the infinity norm.
Step S03: considering the consistency among video sequences, construct contrastive-loss attacks for single frames, double frames, and multiple frames, and obtain the total contrastive loss:

[L_con^s, L_con^d, and L_con^m are given as formula images in the original; together they define the total contrastive loss L_con.]

wherein L_con^s, L_con^d, and L_con^m respectively denote the contrastive loss functions for single, double, and multiple frames, L_con denotes the total contrastive loss function, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes the cosine similarity function, v denotes a temperature parameter, {x_neg} is the set of elements in a dynamic queue composed of adversarial samples, {x_pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
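The exact loss formulas are given as images in the patent, but an InfoNCE-style contrastive loss built from the quantities defined above (cosine similarity sim(·,·), temperature v, and a negative queue {x_neg}) would look roughly like the sketch below. The single-frame form shown here, and all function and variable names, are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    """sim(.,.): cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss_single(x_feat, x_adv_feat, neg_feats, v=0.07):
    """InfoNCE-style single-frame loss: score the clean/adversarial pair
    against m negatives drawn from a dynamic queue (assumed form).

    x_feat, x_adv_feat: (d,) feature vectors; neg_feats: (m, d); v: temperature.
    """
    pos = np.exp(cosine_sim(x_feat, x_adv_feat) / v)
    negs = sum(np.exp(cosine_sim(x_feat, n) / v) for n in neg_feats)
    return -np.log(pos / (pos + negs))

rng = np.random.default_rng(2)
x = rng.normal(size=8)
queue = rng.normal(size=(5, 8))                   # m = 5 queued negatives
loss_easy = contrastive_loss_single(x, x, queue)  # perfectly matching pair
loss_hard = contrastive_loss_single(x, -x, queue) # maximally mismatched pair
print(loss_easy < loss_hard)
```

The double- and multi-frame variants would replace the positive pair with (x_i, x_{i+1}) pairs and sums over several frames, following the definitions in the text.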
Step S04: design a feature loss function that pushes the original image semantically closer to the adversarial frame in feature space, thereby enhancing the transferability of the adversarial video sequence.

The feature loss function is built on the features extracted by a feature extractor F_θ. F_θ adopts a ResNet-50 network; the feature map of the input image is taken from the layer preceding the last fully connected layer of ResNet-50, and θ denotes the parameters of the ResNet-50 network to be learned.

L_feat = (1 / (H·W·C)) Σ_{i=1}^{n} || F_θ(x_i^adv) − F_θ(x_i) ||_2

wherein F_θ(x_i^adv) denotes the feature map of x_i^adv extracted by F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W, and C respectively denote the height, width, and channel number of the feature map, and ||·||_2 denotes the L2 norm.
Step S05: design a pixel-level loss function to keep the noise of the adversarial video sequence imperceptible. The pixel-level loss function is expressed as:

L_pix = Σ_{i=1}^{n} || x_i^adv − x_i ||_2

wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
Because the perturbation pattern resembles noise, smoothing the image helps to mitigate the adversarial effect. Therefore, to counteract such smoothing, a pixel-level loss is applied in the image pixel space. The pixel-level loss is the L2 distance between the adversarial example and the original clean example; minimizing it constrains the difference between the adversarial video sequence and the clean video sequence at the pixel level, keeping the adversarial samples visually inconspicuous.
Step S06: iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X'_adv.
Construct the multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', as shown in Fig. 3. The inputs of the module are F_{t-1}, F_t, and F_{t+1}, the feature maps of the original images x_{t-1}, x_t, and x_{t+1} respectively; each feature map has size B × H × W × C, where B, H, W, and C denote the batch size, height, width, and channel number. The output is the iteratively optimized adversarial perturbation ΔX'. The processing procedure comprises the following steps:

(61) Project F_{t-1} into the feature space of F_t. The specific process is: first process F_t and F_{t-1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t-1} to B × W × C × H, matrix-multiply the adjusted F_{t-1} and F_t, and normalize the result to finally form the projection P_{t-1} of size B × C × C × H:

P_{t-1} = Resize(BN(Resize(Conv(F_{t-1}))) × Resize(Conv(F_t)))

wherein Resize(·) denotes the reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;

(62) Project F_{t+1} into the feature space of F_t. The specific process is: process F_t and F_{t+1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t+1} to B × W × C × H, matrix-multiply the adjusted F_{t+1} and F_t, and normalize the result to finally form the projection P_{t+1} of size B × C × C × H:

P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))

(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':

[formula image in the original: ΔX' is obtained by subtracting the aggregation of Concat(P_{t-1}, P_{t+1}) from F_t]

wherein Concat(·) denotes the merge (concatenation) function.
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence:

X'_adv = {x_1 + Δx'_1, x_2 + Δx'_2, …, x_n + Δx'_n} = X + ΔX'

wherein L_total denotes the overall loss function optimized when obtaining ΔX'.
(65) The overall loss function used for the iterative optimization of the total loss is:

L_total = λ·L_con + μ·L_feat + η·L_pix

wherein the iteratively optimized adversarial perturbation ΔX' is obtained by iteratively optimizing L_total, and λ, μ, and η are the weight parameters of L_con, L_feat, and L_pix respectively.
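Putting the pieces together, the weighted combination of the three losses can be sketched as below. The individual loss values, weights, and candidate names are placeholders; the loss functions themselves would be the ones defined in steps S03 to S05.

```python
def total_loss(l_con, l_feat, l_pix, lam=1.0, mu=0.5, eta=0.1):
    """Weighted sum of the contrastive, feature, and pixel-level losses;
    the perturbation dX' is the candidate that optimizes this total."""
    return lam * l_con + mu * l_feat + eta * l_pix

# one step of a generic iterative scheme over candidate perturbations:
candidates = {"dx_a": (2.0, 1.0, 0.2), "dx_b": (1.5, 1.2, 0.3)}
scores = {k: total_loss(*v) for k, v in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In practice the optimization would run over the perturbation tensor itself rather than a discrete candidate set; the dictionary here only illustrates how the three weighted terms are compared.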
(7) Given the initial frame mask of the original video sequence X, input the adversarial video sequence X'_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
Fig. 2 visualizes the results of the black-box attack method for self-supervised video object segmentation provided by the present application: the first two rows show an original video sequence and its normal segmentation masks, and the last two rows show the adversarial video sequence obtained by adding the adversarial perturbation to the original video sequence and its incorrectly segmented masks. As can be seen from the figure, after an adversarial perturbation that is hard for the human eye to perceive is added to the original video sequence, the target position estimated by the self-supervised video object segmentation model becomes inaccurate and the segmentation precision drops markedly, so the target object can no longer be segmented correctly.
As shown in Fig. 4, the device for implementing the black-box attack method for self-supervised video object segmentation provided by the invention comprises a generator, a ResNet network, and a multi-path aggregation module. The generator is a self-supervised trained noise generator used to randomly generate the initialized adversarial perturbation ΔX and add it to the original video sequence X to obtain the adversarial video sequence X_adv, from which the pixel-level loss L_pix between X and X_adv is obtained. The ResNet-50 network is used to construct the feature extractor F_θ, which extracts the feature maps of the original video sequence X and the adversarial video sequence X_adv, from which the feature loss L_feat and the total contrastive loss L_con are calculated. The multi-path aggregation module is used to obtain the adversarial perturbation ΔX' iteratively optimized with respect to the total loss L_total.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or internal to two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing shows and describes the general principles, principal features and advantages of the invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (8)

1. A black box attack method for self-supervised video object segmentation, characterized in that the attacked network is a self-supervised video object segmentation network based on an affinity matrix, and the method comprises the following steps:
(1) For an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised-trained noise generator, where x_i denotes the original image of the i-th frame and Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n;
(2) Add the adversarial perturbation ΔX to the original video sequence X to obtain the adversarial video sequence X_adv;
(3) Considering the consistency among the frames of a video sequence, construct contrastive-loss attacks for single frames, double frames, and multiple frames respectively, so as to obtain the total contrastive loss;
(4) Design a feature loss function that brings the original image semantically closer to the adversarial frame in feature space, further enhancing the transferability of the adversarial video sequence;
(5) Design a pixel-level loss function to make the noise of the adversarial video sequence imperceptible;
(6) Iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X_adv;
(7) Given the initial-frame mask of the original video sequence X, input the generated adversarial video sequence X_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
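Under stated assumptions, the iterative scheme of steps (1)–(6) can be sketched as follows. The random initialization and the ε infinity-norm budget come from the claims; the sign-gradient update rule and the toy stand-in loss are assumptions, since the claims do not specify an optimizer.

```python
import numpy as np

def iterative_attack(X, loss_grad, eps, step, iters):
    """PGD-style sketch of steps (1)-(6): start from random noise,
    descend the total loss with sign-gradient steps, and keep the
    perturbation inside the eps infinity-norm ball."""
    rng = np.random.default_rng(0)
    dX = rng.uniform(-eps, eps, X.shape)        # step (1): random init
    for _ in range(iters):
        g = loss_grad(X + dX)                   # gradient of total loss
        dX = np.clip(dX - step * np.sign(g), -eps, eps)
    return X + dX                               # adversarial sequence

# toy quadratic loss L(Z) = ||Z||^2 stands in for the real total loss
X = np.full((3, 4, 4), 0.5)                     # n = 3 frames of 4x4
X_adv = iterative_attack(X, lambda Z: 2 * Z, eps=0.03, step=0.01, iters=5)
assert np.max(np.abs(X_adv - X)) <= 0.03 + 1e-12
```

The projection onto the ε ball after every step is what keeps the final perturbation within the imperceptibility budget regardless of how many iterations run.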
2. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (2), the adversarial perturbation ΔX is added to the original video sequence X to obtain the adversarial video sequence
X_adv = {x_1^adv, x_2^adv, …, x_n^adv}, where x_i^adv = x_i + Δx_i subject to ||Δx_i||_∞ ≤ ε,
wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowable perturbation threshold, and ||·||_∞ denotes the infinity norm.
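A minimal sketch of the constraint in this claim: form x_i^adv = x_i + Δx_i while enforcing ||Δx_i||_∞ ≤ ε. Elementwise clipping as the projection, and pixel values in [0, 1], are assumptions; only the bound itself is stated in the claim.

```python
import numpy as np

def make_adversarial(X, dX, eps):
    """Clip each per-frame perturbation to the infinity-norm ball of
    radius eps, then add it to the original frames (values in [0, 1])."""
    dX = np.clip(dX, -eps, eps)                 # enforce ||dx_i||_inf <= eps
    return np.clip(X + dX, 0.0, 1.0)

rng = np.random.default_rng(1)
X = rng.random((2, 8, 8, 3))                    # two RGB frames
dX = rng.normal(0.0, 0.1, X.shape)              # raw perturbation
X_adv = make_adversarial(X, dX, eps=0.03)
assert np.max(np.abs(X_adv - X)) <= 0.03 + 1e-12
```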
3. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (3), considering the consistency among the frames of a video sequence, contrastive-loss attacks are constructed for single frames, double frames, and multiple frames respectively, and the total contrastive loss is
L_con = L_con^s + L_con^d + L_con^m,
wherein L_con^s, L_con^d and L_con^m denote the contrastive loss functions for single, double, and multiple frames respectively, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes a cosine similarity function, v denotes a temperature parameter, {x^neg} is the set of elements in a dynamic queue composed of adversarial samples, {x^pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
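The exact single-, double-, and multi-frame forms cannot be recovered from the claim text, so the following is a standard InfoNCE-style stand-in, not the patent's formula. It uses the ingredients the claim does name: cosine similarity sim(·,·), temperature v, and dynamic queues of positives and negatives.

```python
import numpy as np

def sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(query, positives, negatives, v=0.07):
    """Generic InfoNCE-style contrastive loss over a positive queue
    ({x_pos}) and a negative queue ({x_neg}); the exact per-frame
    variants in the claim are assumptions on top of this form."""
    pos = sum(np.exp(sim(query, p) / v) for p in positives)
    neg = sum(np.exp(sim(query, n) / v) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
q = rng.normal(size=16)
loss = info_nce(q, [q + 0.01 * rng.normal(size=16)],
                [rng.normal(size=16) for _ in range(8)])
assert loss >= 0.0   # -log of a probability is always non-negative
```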
4. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (4), a feature loss function is designed on the features extracted by the feature extractor F_θ:
L_f = Σ_{i=1}^{n} ||F_θ(x_i^adv) − F_θ(x_i)||_2 / (H × W × C),
wherein F_θ(x_i^adv) denotes the feature map of the adversarial sample x_i^adv extracted by the feature extractor F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W and C denote the height, width, and number of channels of the feature map respectively, and ||·||_2 denotes the L2 norm.
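A sketch of a feature-space distance of the kind this claim describes. The H·W·C normalization is an assumption: the claim names H, W, C and the L2 norm, but the extracted equation itself is illegible.

```python
import numpy as np

def feature_loss(feat_adv, feat_orig):
    """L2 distance between the feature maps F_theta(x_adv) and
    F_theta(x), normalized by the feature-map size H*W*C."""
    H, W, C = feat_orig.shape
    return float(np.linalg.norm((feat_adv - feat_orig).ravel()) / (H * W * C))

f_orig = np.zeros((4, 4, 2))
f_adv = np.ones((4, 4, 2))          # 32 elements, each differing by 1
assert abs(feature_loss(f_adv, f_orig) - np.sqrt(32) / 32) < 1e-12
```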
5. The black box attack method for self-supervised video object segmentation according to claim 4, characterized in that: the feature extractor F_θ adopts a ResNet50 network, the layer before the last fully connected layer of the ResNet50 network outputs the feature map of the input image, and θ denotes the parameters of the ResNet50 network to be learned.
6. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (5), a pixel-level loss function is designed to make the noise of the adversarial video sequence imperceptible:
L_p = Σ_{i=1}^{n} ||x_i^adv − x_i||_2,
wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
7. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (6), the overall loss function adopted for iteratively optimizing the total loss is:
L_total = λ·L_con + μ·L_f + η·L_p,
wherein ΔX' denotes the adversarial perturbation after iterative optimization, and λ, μ and η are the weight parameters of L_con, L_f and L_p respectively.
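The pixel-level term of claim 6 and the weighted combination of claim 7 can be sketched together. The weight values below are illustrative; the patent leaves λ, μ, and η unspecified.

```python
import numpy as np

def pixel_loss(X_adv, X):
    """Sum of per-frame L2 distances between adversarial and original
    frames (the pixel-level imperceptibility term of claim 6)."""
    return float(sum(np.linalg.norm((xa - x).ravel())
                     for xa, x in zip(X_adv, X)))

def total_loss(l_con, l_f, l_p, lam=1.0, mu=0.5, eta=0.1):
    """Overall loss of claim 7: lambda*L_con + mu*L_f + eta*L_p."""
    return lam * l_con + mu * l_f + eta * l_p

X = [np.zeros((2, 2)), np.zeros((2, 2))]
X_adv = [np.ones((2, 2)), np.zeros((2, 2))]     # one frame differs by 1
assert abs(pixel_loss(X_adv, X) - 2.0) < 1e-12  # ||ones(4)||_2 = 2
assert abs(total_loss(1.0, 2.0, 3.0) - 2.3) < 1e-12
```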
8. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (6), a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation ΔX'; the inputs of the multi-path aggregation module are F_t, F_{t−1} and F_{t+1}, and its output is the iteratively optimized adversarial perturbation ΔX'; F_t, F_{t−1} and F_{t+1} denote the feature maps of the original images x_t, x_{t−1} and x_{t+1} respectively, each of size B × H × W × C = b × h × w × c, where B, H, W and C denote the batch size, height, width, and number of channels of the feature map, and b, h, w and c denote their respective values; the processing procedure of the multi-path aggregation module comprises the following steps:
(61) Project F_{t−1} into the feature space of F_t, specifically: first process F_t and F_{t−1} with 1 × 1 convolutions, reshape F_t from b × h × w × c to b × c × w × h and F_{t−1} from b × h × w × c to b × w × c × h, matrix-multiply the reshaped F_{t−1} and F_t, normalize the result, and finally form the projection P_{t−1} of size b × c × w × h:
P_{t−1} = Resize(BN(Resize(Conv(F_{t−1}))) × Resize(Conv(F_t)))
wherein Resize(·) denotes the tensor reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;
(62) Project F_{t+1} into the feature space of F_t, specifically: process F_t and F_{t+1} with 1 × 1 convolutions, reshape F_t to b × c × w × h and F_{t+1} to b × w × c × h, matrix-multiply the reshaped F_{t+1} and F_t, normalize the result, and finally form the projection P_{t+1} of size b × c × w × h:
P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))
(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':
ΔX' = F_t − Conv(Concat(P_{t−1}, P_{t+1}))
wherein Concat(·) denotes the merge (concatenation) function;
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence
X_adv = X + ΔX',
wherein ΔX' is the perturbation obtained by minimizing the overall loss function L_total.
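The reshape orders in steps (61)–(62) are ambiguous after text extraction, so the following sketch only realizes the general idea: project the neighboring frames' features into F_t's space via a normalized matrix product, aggregate them, and subtract from F_t. The 1 × 1 convolutions are omitted, and a simple average stands in for the Concat + Conv aggregation; both are assumptions.

```python
import numpy as np

def project(F_ref, F_src):
    """Project F_src into the feature space of F_ref: flatten the
    spatial dimensions, build a row-normalized affinity between the
    two feature maps, and use it to re-express F_src on F_ref's grid."""
    b, h, w, c = F_ref.shape
    A = F_ref.reshape(b, h * w, c)              # b x (hw) x c
    S = F_src.reshape(b, h * w, c)
    aff = A @ S.transpose(0, 2, 1)              # b x (hw) x (hw) affinity
    aff = aff / np.maximum(aff.sum(-1, keepdims=True), 1e-8)
    return (aff @ S).reshape(b, h, w, c)        # back to b x h x w x c

rng = np.random.default_rng(0)
F_t = rng.random((1, 4, 4, 8))
P_prev = project(F_t, rng.random((1, 4, 4, 8)))   # sketch of step (61)
P_next = project(F_t, rng.random((1, 4, 4, 8)))   # sketch of step (62)
dX = F_t - 0.5 * (P_prev + P_next)                # step (63), simplified
assert dX.shape == F_t.shape
```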
CN202211148006.4A 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation Pending CN115393776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211148006.4A CN115393776A (en) 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation


Publications (1)

Publication Number Publication Date
CN115393776A true CN115393776A (en) 2022-11-25

Family

ID=84125957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211148006.4A Pending CN115393776A (en) 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation

Country Status (1)

Country Link
CN (1) CN115393776A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878848A (en) * 2023-02-22 2023-03-31 中南大学 Antagonistic video sample generation method, terminal device and medium
CN115878848B (en) * 2023-02-22 2023-05-02 中南大学 Antagonistic video sample generation method, terminal equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination