CN115393776A - Black box attack method for self-supervision video target segmentation - Google Patents


Info

Publication number
CN115393776A
CN115393776A (application CN202211148006.4A)
Authority
CN
China
Prior art keywords
self
multiplied
antagonistic
video sequence
supervision
Prior art date
Legal status (assumed; not a legal conclusion): Pending
Application number
CN202211148006.4A
Other languages
Chinese (zh)
Inventor
姚睿
陈莹
周勇
赵佳琦
刘兵
祝汉城
邵志文
杜文亮
Current Assignee: China University of Mining and Technology (CUMT)
Original Assignee
China University of Mining and Technology CUMT
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by China University of Mining and Technology (CUMT)
Priority to CN202211148006.4A
Publication of CN115393776A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a black-box attack method for self-supervised video object segmentation. An affinity-matrix-based self-supervised segmentation model learns the feature representation of a video sequence to establish strong pixel-level correspondences. First, starting from a randomly initialized adversarial perturbation, contrastive losses over single, double, and multiple frames are constructed and iteratively optimized. A feature loss is then designed to enhance the transferability of the adversarial samples generated by the black-box attack, and a pixel-level loss keeps the generated adversarial noise imperceptible. A multi-path aggregation module produces the iteratively optimized adversarial perturbation, which is added to the original video frames to generate adversarial samples. Finally, the adversarial video is fed into the self-supervised video object segmentation network to obtain the final prediction mask. Studying black-box attacks on self-supervised video object segmentation models exposes the vulnerability of the segmentation algorithm and can further improve the security and robustness of the self-supervised video object segmentation task.

Description

Black box attack method for self-supervision video target segmentation
Technical Field
The invention relates to a black-box attack method for self-supervised video object segmentation, and belongs to the field of image processing technology.
Background
Self-supervised learning, which trains models in a supervised fashion using labels generated from the data itself, has become a standard way to learn representations for deep neural networks. In recent years, self-supervised learning on video has produced fruitful research. However, deep neural networks are highly vulnerable: state-of-the-art video object segmentation models can easily be fooled by adding visual noise to the original images. The attack takes the form of a small perturbation of the video frames that is imperceptible to the human visual system, yet it can cause the model to completely change its prediction for a frame. Worse still, the attacked model reports high confidence in the wrong prediction. Furthermore, the same adversarial perturbation can fool multiple neural network models.
Thanks to the abundance of online data, various ideas have been explored for learning correspondence representations by exploiting the spatio-temporal information in videos. Current self-supervised video object segmentation methods model the pairwise correspondence between a target frame and a reference frame, and this pairwise correspondence maintains spatio-temporal consistency. An affinity-matrix-based self-supervised segmentation model learns the feature representation of a video sequence, thereby establishing strong pixel correspondences. The black-box attack therefore mainly targets the affinity matrices between video frame sequences. Starting from a randomly initialized adversarial perturbation, contrastive losses for single, double, and multiple frames are constructed and iteratively optimized. A feature loss is employed to enhance the transferability of the adversarial samples generated by the black-box attack. In addition, a pixel-level loss is employed to keep the generated adversarial noise imperceptible. Finally, a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation.
Disclosure of Invention
The invention aims to improve the robustness of self-supervised video object segmentation. To this end, it provides a black-box attack method that destroys the affinity matrix between video frames in order to find adversarial samples whose difference from the clean input is imperceptible, causing the self-supervised video object segmentation network to fail. Research on adversarial attack methods helps to understand the working mechanism of deep models, facilitates better recognition of the vulnerabilities of segmentation algorithms, and can improve the robustness of self-supervised video object segmentation algorithms.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.
a black box attack method for self-supervision video target segmentation is characterized in that a self-supervision video target segmentation network based on an affinity matrix is adopted, and a countermeasure sample which has imperceptible difference with the self-supervision video target segmentation network is found by destroying the affinity matrix between video frames, so that the self-supervision video target segmentation network fails. Considering that implementing a counter attack against video object segmentation faces two challenges: (1) different from classified counterattack, classification only needs a classifier to carry out misclassification, and conditions of successful attack segmentation are more fuzzy; (2) in view of the segmented objects in the video, it is unlikely that antagonistic perturbations generated based on single-frame features will stick to every video frame. Therefore, the invention considers the generation of antagonistic perturbations on a frame-by-frame basis, and the implementation of the invention comprises the following steps:
(1) For an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised trained noise generator; x_i denotes the original image of the i-th frame, Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n;
(2) Add the adversarial perturbation ΔX to the original video sequence X to obtain an adversarial video sequence X_adv;
(3) Considering the consistency among video sequences, construct contrastive-loss attacks for single frames, double frames, and multiple frames, and obtain the total contrastive loss;
(4) Design a feature loss function that pushes the original image semantically closer to the adversarial frame in feature space, thereby enhancing the transferability of the adversarial video sequence;
(5) Design a pixel-level loss function to keep the noise of the adversarial video sequence imperceptible;
(6) Iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X'_adv;
(7) Given an initial frame mask, input the adversarial video sequence X'_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
Preferably, in step (2), the adversarial perturbation ΔX is added to the original video sequence X to obtain the adversarial video sequence

X_adv = {x_1^adv, x_2^adv, …, x_n^adv},  x_i^adv = x_i + Δx_i,  subject to ||Δx_i||_∞ ≤ ε

wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowed adversarial perturbation threshold, and ||·||_∞ denotes the infinity norm.
Preferably, in step (3), considering the consistency among video sequences, contrastive-loss attacks are constructed for single frames, double frames, and multiple frames:

[L_con^s, L_con^d, and L_con^m are given as formula images in the original; together they define the total contrastive loss L_con.]

wherein L_con^s, L_con^d, and L_con^m respectively denote the contrastive loss functions for single, double, and multiple frames, L_con denotes the total contrastive loss function, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes the cosine similarity function, v denotes a temperature parameter, {x_neg} is the set of elements in a dynamic queue composed of adversarial samples, {x_pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
Because a black-box attack does not know the parameters and structure of the attacked model, a feature loss function is designed to further enhance the transferability of the generated adversarial samples: it pushes the original image semantically closer to the adversarial frame in feature space, so that the predicted segmentation mask of the adversarial sample stays close to the target mask in feature space. Preferably, in step (4), the feature loss function is built on the features extracted by a feature extractor F_θ:

L_feat = (1 / (H·W·C)) Σ_{i=1}^{n} || F_θ(x_i^adv) − F_θ(x_i) ||_2

wherein F_θ(x_i^adv) denotes the feature map of x_i^adv extracted by F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W, and C respectively denote the height, width, and channel number of the feature map, and ||·||_2 denotes the L2 norm.
Preferably, the feature extractor F_θ adopts a ResNet-50 network; the feature map of the input image is taken from the layer preceding the last fully connected layer of ResNet-50, and θ denotes the parameters of the ResNet-50 network to be learned.
Preferably, in step (5), a pixel-level loss function is designed to keep the noise of the adversarial video sequence imperceptible:

L_pix = Σ_{i=1}^{n} || x_i^adv − x_i ||_2

wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
In particular, because the perturbation pattern resembles noise, smoothing the image helps to mitigate the adversarial effect. Therefore, to counteract such smoothing, a pixel-level loss is applied in the image pixel space. The pixel-level loss is the L2 distance between the adversarial example and the original clean example; minimizing it constrains the difference between the adversarial video sequence and the clean video sequence at the pixel level, keeping the adversarial samples visually inconspicuous.
Preferably, in step (6), the overall loss function used for the iterative optimization of the total loss is:

L_total = λ·L_con + μ·L_feat + η·L_pix

wherein the iteratively optimized adversarial perturbation ΔX' is obtained by iteratively optimizing L_total, and λ, μ, and η are the weight parameters of L_con, L_feat, and L_pix respectively; λ controls the relative importance of noise and the contrastive loss, μ controls the relative importance of each target feature in the video, and η controls the relative importance of the video frame pixels.
Preferably, in step (6), a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation ΔX'. The module is intended to integrate common features from video frames along different paths so as to generate the adversarial perturbation effectively. Its inputs are F_{t-1}, F_t, and F_{t+1}, the feature maps of the original images x_{t-1}, x_t, and x_{t+1} respectively; each feature map has size B × H × W × C, where B, H, W, and C denote the batch size, height, width, and channel number. Its output is the iteratively optimized adversarial perturbation ΔX'. The processing procedure of the multi-path aggregation module comprises the following steps:

(61) Project F_{t-1} into the feature space of F_t. The specific process is: process F_t and F_{t-1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t-1} to B × W × C × H, matrix-multiply the adjusted F_{t-1} and F_t, and normalize the result to finally form the projection P_{t-1} of size B × C × C × H:

P_{t-1} = Resize(BN(Resize(Conv(F_{t-1}))) × Resize(Conv(F_t)))

wherein Resize(·) denotes the reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;

(62) Project F_{t+1} into the feature space of F_t. The specific process is: process F_t and F_{t+1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t+1} to B × W × C × H, matrix-multiply the adjusted F_{t+1} and F_t, and normalize the result to finally form the projection P_{t+1} of size B × C × C × H:

P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))

(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':

[formula image in the original: ΔX' is obtained by subtracting the aggregation of Concat(P_{t-1}, P_{t+1}) from F_t]

wherein Concat(·) denotes the merge (concatenation) function;
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence:

X'_adv = {x_1 + Δx'_1, x_2 + Δx'_2, …, x_n + Δx'_n} = X + ΔX'

wherein L_total denotes the overall loss function optimized when obtaining ΔX'.
The multi-path aggregation module projects the feature maps of the previous and next frames onto the current frame through transformations and finally subtracts the aggregated features from the current frame, so that this image-level operation can effectively integrate the noise.
Preferably, in step (7), the affinity-matrix-based self-supervised video object segmentation, given a pair of video frames, rests on the following assumptions: the content of two consecutive video frames is coherent, and the frame reconstruction (pixel copy) operation can be represented by a linear transformation using an affinity matrix, which describes the copying process from the reference frame to the target frame. A common choice of similarity measurement for the affinity matrix is the dot product between feature maps. The goal of the black-box adversarial attack is to find adversarial samples whose difference from the clean input is imperceptible by destroying the affinity matrix between video frames, so that the segmentation model fails.
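The frame-reconstruction view above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the affinity matrix here is the softmax-normalized dot product between flattened per-pixel features of a reference and a target frame, and reconstructing the target frame amounts to multiplying the reference frame's pixel values by this matrix. All array names and shapes are assumptions.

```python
import numpy as np

def affinity_matrix(feat_ref, feat_tgt):
    """Dot-product affinity between flattened feature maps.

    feat_ref, feat_tgt: (H*W, C) arrays of per-pixel features.
    Returns an (N_tgt, N_ref) matrix whose rows sum to 1, describing
    how each target pixel copies from the reference-frame pixels.
    """
    logits = feat_tgt @ feat_ref.T                       # (N_tgt, N_ref)
    logits -= logits.max(axis=1, keepdims=True)          # numeric stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def reconstruct(affinity, ref_pixels):
    """Linear 'pixel copy' from the reference frame: (N_tgt, 3)."""
    return affinity @ ref_pixels

rng = np.random.default_rng(0)
f_ref = rng.normal(size=(16, 8))   # 4x4 reference-frame features, C = 8
f_tgt = rng.normal(size=(16, 8))
A = affinity_matrix(f_ref, f_tgt)
recon = reconstruct(A, rng.uniform(size=(16, 3)))
print(A.shape, recon.shape)
```

Destroying the correspondences encoded in A is exactly what the perturbation optimized in the following steps aims at.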
Beneficial effects: the black-box attack method for self-supervised video object segmentation generates adversarial perturbations by means of contrastive losses, can mount adversarial attacks on the self-supervised video object segmentation task, and causes all pixels of the target to be segmented incorrectly. The method can be used in task-specific or domain-specific research, helps to understand how black-box attacks affect model performance, and helps to reduce the influencing factors so as to enhance the security and robustness of the model.
Drawings
FIG. 1 is a flow chart of an embodiment of the method of the present invention;
FIG. 2 is a schematic illustration of an acquired challenge sample and a challenge prediction mask;
FIG. 3 is a schematic diagram of a multi-path aggregation module;
FIG. 4 is a schematic diagram of an apparatus for carrying out the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart of an implementation of the black-box attack method for self-supervised video object segmentation; the steps are described in detail below.
Step S01: for an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised trained noise generator; x_i denotes the original image of the i-th frame, Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n.
Step S02: add the adversarial perturbation ΔX to the original video sequence X to obtain the adversarial video sequence

X_adv = {x_1^adv, x_2^adv, …, x_n^adv},  x_i^adv = x_i + Δx_i,  subject to ||Δx_i||_∞ ≤ ε

wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowed adversarial perturbation threshold, and ||·||_∞ denotes the infinity norm.
Step S03: considering the consistency among video sequences, construct contrastive-loss attacks for single frames, double frames, and multiple frames, and obtain the total contrastive loss:

[L_con^s, L_con^d, and L_con^m are given as formula images in the original; together they define the total contrastive loss L_con.]

wherein L_con^s, L_con^d, and L_con^m respectively denote the contrastive loss functions for single, double, and multiple frames, L_con denotes the total contrastive loss function, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes the cosine similarity function, v denotes a temperature parameter, {x_neg} is the set of elements in a dynamic queue composed of adversarial samples, {x_pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
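The exact loss formulas are given as images in the patent, but an InfoNCE-style contrastive loss built from the quantities defined above (cosine similarity sim(·,·), temperature v, and a negative queue {x_neg}) would look roughly like the sketch below. The single-frame form shown here, and all function and variable names, are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    """sim(.,.): cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss_single(x_feat, x_adv_feat, neg_feats, v=0.07):
    """InfoNCE-style single-frame loss: score the clean/adversarial pair
    against m negatives drawn from a dynamic queue (assumed form).

    x_feat, x_adv_feat: (d,) feature vectors; neg_feats: (m, d); v: temperature.
    """
    pos = np.exp(cosine_sim(x_feat, x_adv_feat) / v)
    negs = sum(np.exp(cosine_sim(x_feat, n) / v) for n in neg_feats)
    return -np.log(pos / (pos + negs))

rng = np.random.default_rng(2)
x = rng.normal(size=8)
queue = rng.normal(size=(5, 8))                   # m = 5 queued negatives
loss_easy = contrastive_loss_single(x, x, queue)  # perfectly matching pair
loss_hard = contrastive_loss_single(x, -x, queue) # maximally mismatched pair
print(loss_easy < loss_hard)
```

The double- and multi-frame variants would replace the positive pair with (x_i, x_{i+1}) pairs and sums over several frames, following the definitions in the text.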
Step S04: design a feature loss function that pushes the original image semantically closer to the adversarial frame in feature space, thereby enhancing the transferability of the adversarial video sequence.

The feature loss function is built on the features extracted by a feature extractor F_θ. F_θ adopts a ResNet-50 network; the feature map of the input image is taken from the layer preceding the last fully connected layer of ResNet-50, and θ denotes the parameters of the ResNet-50 network to be learned.

L_feat = (1 / (H·W·C)) Σ_{i=1}^{n} || F_θ(x_i^adv) − F_θ(x_i) ||_2

wherein F_θ(x_i^adv) denotes the feature map of x_i^adv extracted by F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W, and C respectively denote the height, width, and channel number of the feature map, and ||·||_2 denotes the L2 norm.
Step S05: design a pixel-level loss function to keep the noise of the adversarial video sequence imperceptible. The pixel-level loss function is expressed as:

L_pix = Σ_{i=1}^{n} || x_i^adv − x_i ||_2

wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
Because the perturbation pattern resembles noise, smoothing the image helps to mitigate the adversarial effect. Therefore, to counteract such smoothing, a pixel-level loss is applied in the image pixel space. The pixel-level loss is the L2 distance between the adversarial example and the original clean example; minimizing it constrains the difference between the adversarial video sequence and the clean video sequence at the pixel level, keeping the adversarial samples visually inconspicuous.
Step S06: iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X'_adv.
Construct the multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', as shown in Fig. 3. The inputs of the module are F_{t-1}, F_t, and F_{t+1}, the feature maps of the original images x_{t-1}, x_t, and x_{t+1} respectively; each feature map has size B × H × W × C, where B, H, W, and C denote the batch size, height, width, and channel number. The output is the iteratively optimized adversarial perturbation ΔX'. The processing procedure comprises the following steps:

(61) Project F_{t-1} into the feature space of F_t. The specific process is: first process F_t and F_{t-1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t-1} to B × W × C × H, matrix-multiply the adjusted F_{t-1} and F_t, and normalize the result to finally form the projection P_{t-1} of size B × C × C × H:

P_{t-1} = Resize(BN(Resize(Conv(F_{t-1}))) × Resize(Conv(F_t)))

wherein Resize(·) denotes the reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;

(62) Project F_{t+1} into the feature space of F_t. The specific process is: process F_t and F_{t+1} with 1 × 1 convolutions, adjust the size of F_t to B × C × W × H and the size of F_{t+1} to B × W × C × H, matrix-multiply the adjusted F_{t+1} and F_t, and normalize the result to finally form the projection P_{t+1} of size B × C × C × H:

P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))

(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':

[formula image in the original: ΔX' is obtained by subtracting the aggregation of Concat(P_{t-1}, P_{t+1}) from F_t]

wherein Concat(·) denotes the merge (concatenation) function.
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence:

X'_adv = {x_1 + Δx'_1, x_2 + Δx'_2, …, x_n + Δx'_n} = X + ΔX'

wherein L_total denotes the overall loss function optimized when obtaining ΔX'.
(65) The overall loss function used for the iterative optimization of the total loss is:

L_total = λ·L_con + μ·L_feat + η·L_pix

wherein the iteratively optimized adversarial perturbation ΔX' is obtained by iteratively optimizing L_total, and λ, μ, and η are the weight parameters of L_con, L_feat, and L_pix respectively.
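Putting the pieces together, the weighted combination of the three losses can be sketched as below. The individual loss values, weights, and candidate names are placeholders; the loss functions themselves would be the ones defined in steps S03 to S05.

```python
def total_loss(l_con, l_feat, l_pix, lam=1.0, mu=0.5, eta=0.1):
    """Weighted sum of the contrastive, feature, and pixel-level losses;
    the perturbation dX' is the candidate that optimizes this total."""
    return lam * l_con + mu * l_feat + eta * l_pix

# one step of a generic iterative scheme over candidate perturbations:
candidates = {"dx_a": (2.0, 1.0, 0.2), "dx_b": (1.5, 1.2, 0.3)}
scores = {k: total_loss(*v) for k, v in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In practice the optimization would run over the perturbation tensor itself rather than a discrete candidate set; the dictionary here only illustrates how the three weighted terms are compared.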
(7) Given the initial frame mask of the original video sequence X, input the adversarial video sequence X'_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
Fig. 2 visualizes the results of the black-box attack method for self-supervised video object segmentation provided by the present application: the first two rows show an original video sequence and its normal segmentation masks, and the last two rows show the adversarial video sequence obtained by adding the adversarial perturbation to the original video sequence and its incorrectly segmented masks. As can be seen from the figure, after an adversarial perturbation that is hard for the human eye to perceive is added to the original video sequence, the target position estimated by the self-supervised video object segmentation model becomes inaccurate and the segmentation precision drops markedly, so the target object can no longer be segmented correctly.
As shown in Fig. 4, the device for implementing the black-box attack method for self-supervised video object segmentation provided by the invention comprises a generator, a ResNet network, and a multi-path aggregation module. The generator is a self-supervised trained noise generator used to randomly generate the initialized adversarial perturbation ΔX and add it to the original video sequence X to obtain the adversarial video sequence X_adv, from which the pixel-level loss L_pix between X and X_adv is obtained. The ResNet-50 network is used to construct the feature extractor F_θ, which extracts the feature maps of the original video sequence X and the adversarial video sequence X_adv, from which the feature loss L_feat and the total contrastive loss L_con are calculated. The multi-path aggregation module is used to obtain the adversarial perturbation ΔX' iteratively optimized with respect to the total loss L_total.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or internal to two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing shows and describes the general principles, principal features and advantages of the invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (8)

1. A black box attack method for self-supervised video object segmentation, characterized in that the attacked network is a self-supervised video object segmentation network based on an affinity matrix, and the method comprises the following steps:
(1) For an unannotated original video sequence X = {x_1, x_2, …, x_n}, first randomly generate an initialized adversarial perturbation ΔX = {Δx_1, Δx_2, …, Δx_n} with a self-supervised-trained noise generator, where x_i denotes the original image of the i-th frame and Δx_i denotes the adversarial perturbation corresponding to x_i, i = 1, 2, …, n;
(2) Add the adversarial perturbation ΔX to the original video sequence X to obtain the adversarial video sequence X_adv;
(3) Considering the consistency among the frames of a video sequence, construct contrastive-loss attacks for single frames, double frames, and multiple frames respectively, so as to obtain the total contrastive loss;
(4) Design a feature loss function that brings the original image semantically closer to the adversarial frame in feature space, further enhancing the transferability of the adversarial video sequence;
(5) Design a pixel-level loss function to make the noise of the adversarial video sequence imperceptible;
(6) Iteratively optimize the total loss, construct a multi-path aggregation module to obtain the iteratively optimized adversarial perturbation ΔX', and add ΔX' to the original video sequence X to obtain the final adversarial video sequence X_adv;
(7) Given the initial-frame mask of the original video sequence X, input the generated adversarial video sequence X_adv into the self-supervised video object segmentation network to obtain the final prediction mask.
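Under stated assumptions, the iterative scheme of steps (1)–(6) can be sketched as follows. The random initialization and the ε infinity-norm budget come from the claims; the sign-gradient update rule and the toy stand-in loss are assumptions, since the claims do not specify an optimizer.

```python
import numpy as np

def iterative_attack(X, loss_grad, eps, step, iters):
    """PGD-style sketch of steps (1)-(6): start from random noise,
    descend the total loss with sign-gradient steps, and keep the
    perturbation inside the eps infinity-norm ball."""
    rng = np.random.default_rng(0)
    dX = rng.uniform(-eps, eps, X.shape)        # step (1): random init
    for _ in range(iters):
        g = loss_grad(X + dX)                   # gradient of total loss
        dX = np.clip(dX - step * np.sign(g), -eps, eps)
    return X + dX                               # adversarial sequence

# toy quadratic loss L(Z) = ||Z||^2 stands in for the real total loss
X = np.full((3, 4, 4), 0.5)                     # n = 3 frames of 4x4
X_adv = iterative_attack(X, lambda Z: 2 * Z, eps=0.03, step=0.01, iters=5)
assert np.max(np.abs(X_adv - X)) <= 0.03 + 1e-12
```

The projection onto the ε ball after every step is what keeps the final perturbation within the imperceptibility budget regardless of how many iterations run.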
2. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (2), the adversarial perturbation ΔX is added to the original video sequence X to obtain the adversarial video sequence
X_adv = {x_1^adv, x_2^adv, …, x_n^adv}, where x_i^adv = x_i + Δx_i subject to ||Δx_i||_∞ ≤ ε,
wherein x_i^adv denotes the adversarial sample corresponding to x_i, ε denotes the maximum allowable perturbation threshold, and ||·||_∞ denotes the infinity norm.
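A minimal sketch of the constraint in this claim: form x_i^adv = x_i + Δx_i while enforcing ||Δx_i||_∞ ≤ ε. Elementwise clipping as the projection, and pixel values in [0, 1], are assumptions; only the bound itself is stated in the claim.

```python
import numpy as np

def make_adversarial(X, dX, eps):
    """Clip each per-frame perturbation to the infinity-norm ball of
    radius eps, then add it to the original frames (values in [0, 1])."""
    dX = np.clip(dX, -eps, eps)                 # enforce ||dx_i||_inf <= eps
    return np.clip(X + dX, 0.0, 1.0)

rng = np.random.default_rng(1)
X = rng.random((2, 8, 8, 3))                    # two RGB frames
dX = rng.normal(0.0, 0.1, X.shape)              # raw perturbation
X_adv = make_adversarial(X, dX, eps=0.03)
assert np.max(np.abs(X_adv - X)) <= 0.03 + 1e-12
```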
3. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (3), considering the consistency among the frames of a video sequence, contrastive-loss attacks are constructed for single frames, double frames, and multiple frames respectively, and the total contrastive loss is
L_con = L_con^s + L_con^d + L_con^m,
wherein L_con^s, L_con^d and L_con^m denote the contrastive loss functions for single, double, and multiple frames respectively, x_i denotes the original image of the i-th frame, x_{i+1} denotes the original image of the (i+1)-th frame, sim(·,·) denotes a cosine similarity function, v denotes a temperature parameter, {x^neg} is the set of elements in a dynamic queue composed of adversarial samples, {x^pos} is the set of elements in a dynamic queue composed of original images, and m denotes the number of elements in each dynamic queue.
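The exact single-, double-, and multi-frame forms cannot be recovered from the claim text, so the following is a standard InfoNCE-style stand-in, not the patent's formula. It uses the ingredients the claim does name: cosine similarity sim(·,·), temperature v, and dynamic queues of positives and negatives.

```python
import numpy as np

def sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(query, positives, negatives, v=0.07):
    """Generic InfoNCE-style contrastive loss over a positive queue
    ({x_pos}) and a negative queue ({x_neg}); the exact per-frame
    variants in the claim are assumptions on top of this form."""
    pos = sum(np.exp(sim(query, p) / v) for p in positives)
    neg = sum(np.exp(sim(query, n) / v) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
q = rng.normal(size=16)
loss = info_nce(q, [q + 0.01 * rng.normal(size=16)],
                [rng.normal(size=16) for _ in range(8)])
assert loss >= 0.0   # -log of a probability is always non-negative
```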
4. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (4), a feature loss function is designed on the features extracted by the feature extractor F_θ:
L_f = Σ_{i=1}^{n} ||F_θ(x_i^adv) − F_θ(x_i)||_2 / (H × W × C),
wherein F_θ(x_i^adv) denotes the feature map of the adversarial sample x_i^adv extracted by the feature extractor F_θ, F_θ(x_i) denotes the feature map of x_i extracted by F_θ, H, W and C denote the height, width, and number of channels of the feature map respectively, and ||·||_2 denotes the L2 norm.
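A sketch of a feature-space distance of the kind this claim describes. The H·W·C normalization is an assumption: the claim names H, W, C and the L2 norm, but the extracted equation itself is illegible.

```python
import numpy as np

def feature_loss(feat_adv, feat_orig):
    """L2 distance between the feature maps F_theta(x_adv) and
    F_theta(x), normalized by the feature-map size H*W*C."""
    H, W, C = feat_orig.shape
    return float(np.linalg.norm((feat_adv - feat_orig).ravel()) / (H * W * C))

f_orig = np.zeros((4, 4, 2))
f_adv = np.ones((4, 4, 2))          # 32 elements, each differing by 1
assert abs(feature_loss(f_adv, f_orig) - np.sqrt(32) / 32) < 1e-12
```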
5. The black box attack method for self-supervised video object segmentation according to claim 4, characterized in that: the feature extractor F_θ adopts a ResNet50 network, the layer before the last fully connected layer of the ResNet50 network outputs the feature map of the input image, and θ denotes the parameters of the ResNet50 network to be learned.
6. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (5), a pixel-level loss function is designed to make the noise of the adversarial video sequence imperceptible:
L_p = Σ_{i=1}^{n} ||x_i^adv − x_i||_2,
wherein x_i denotes the original image of the i-th frame, x_i^adv denotes the adversarial sample corresponding to x_i, and ||·||_2 denotes the L2 norm.
7. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (6), the overall loss function adopted for iteratively optimizing the total loss is:
L_total = λ·L_con + μ·L_f + η·L_p,
wherein ΔX' denotes the adversarial perturbation after iterative optimization, and λ, μ and η are the weight parameters of L_con, L_f and L_p respectively.
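The pixel-level term of claim 6 and the weighted combination of claim 7 can be sketched together. The weight values below are illustrative; the patent leaves λ, μ, and η unspecified.

```python
import numpy as np

def pixel_loss(X_adv, X):
    """Sum of per-frame L2 distances between adversarial and original
    frames (the pixel-level imperceptibility term of claim 6)."""
    return float(sum(np.linalg.norm((xa - x).ravel())
                     for xa, x in zip(X_adv, X)))

def total_loss(l_con, l_f, l_p, lam=1.0, mu=0.5, eta=0.1):
    """Overall loss of claim 7: lambda*L_con + mu*L_f + eta*L_p."""
    return lam * l_con + mu * l_f + eta * l_p

X = [np.zeros((2, 2)), np.zeros((2, 2))]
X_adv = [np.ones((2, 2)), np.zeros((2, 2))]     # one frame differs by 1
assert abs(pixel_loss(X_adv, X) - 2.0) < 1e-12  # ||ones(4)||_2 = 2
assert abs(total_loss(1.0, 2.0, 3.0) - 2.3) < 1e-12
```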
8. The black box attack method for self-supervised video object segmentation according to claim 1, characterized in that: in the step (6), a multi-path aggregation module is constructed to obtain the iteratively optimized adversarial perturbation ΔX'; the inputs of the multi-path aggregation module are F_t, F_{t−1} and F_{t+1}, and its output is the iteratively optimized adversarial perturbation ΔX'; F_t, F_{t−1} and F_{t+1} denote the feature maps of the original images x_t, x_{t−1} and x_{t+1} respectively, each of size B × H × W × C = b × h × w × c, where B, H, W and C denote the batch size, height, width, and number of channels of the feature map, and b, h, w and c denote their respective values; the processing procedure of the multi-path aggregation module comprises the following steps:
(61) Project F_{t−1} into the feature space of F_t, specifically: first process F_t and F_{t−1} with 1 × 1 convolutions, reshape F_t from b × h × w × c to b × c × w × h and F_{t−1} from b × h × w × c to b × w × c × h, matrix-multiply the reshaped F_{t−1} and F_t, normalize the result, and finally form the projection P_{t−1} of size b × c × w × h:
P_{t−1} = Resize(BN(Resize(Conv(F_{t−1}))) × Resize(Conv(F_t)))
wherein Resize(·) denotes the tensor reshaping function, BN(·) denotes the normalization operation, and Conv(·) denotes the convolution function;
(62) Project F_{t+1} into the feature space of F_t, specifically: process F_t and F_{t+1} with 1 × 1 convolutions, reshape F_t to b × c × w × h and F_{t+1} to b × w × c × h, matrix-multiply the reshaped F_{t+1} and F_t, normalize the result, and finally form the projection P_{t+1} of size b × c × w × h:
P_{t+1} = Resize(BN(Resize(Conv(F_{t+1}))) × Resize(Conv(F_t)))
(63) Subtract the aggregated feature from F_t and output the iteratively optimized adversarial perturbation ΔX':
ΔX' = F_t − Conv(Concat(P_{t−1}, P_{t+1}))
wherein Concat(·) denotes the merge (concatenation) function;
(64) Add the adversarial perturbation ΔX' to the original video sequence X to obtain the final adversarial video sequence
X_adv = X + ΔX',
wherein ΔX' is the perturbation obtained by minimizing the overall loss function L_total.
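The reshape orders in steps (61)–(62) are ambiguous after text extraction, so the following sketch only realizes the general idea: project the neighboring frames' features into F_t's space via a normalized matrix product, aggregate them, and subtract from F_t. The 1 × 1 convolutions are omitted, and a simple average stands in for the Concat + Conv aggregation; both are assumptions.

```python
import numpy as np

def project(F_ref, F_src):
    """Project F_src into the feature space of F_ref: flatten the
    spatial dimensions, build a row-normalized affinity between the
    two feature maps, and use it to re-express F_src on F_ref's grid."""
    b, h, w, c = F_ref.shape
    A = F_ref.reshape(b, h * w, c)              # b x (hw) x c
    S = F_src.reshape(b, h * w, c)
    aff = A @ S.transpose(0, 2, 1)              # b x (hw) x (hw) affinity
    aff = aff / np.maximum(aff.sum(-1, keepdims=True), 1e-8)
    return (aff @ S).reshape(b, h, w, c)        # back to b x h x w x c

rng = np.random.default_rng(0)
F_t = rng.random((1, 4, 4, 8))
P_prev = project(F_t, rng.random((1, 4, 4, 8)))   # sketch of step (61)
P_next = project(F_t, rng.random((1, 4, 4, 8)))   # sketch of step (62)
dX = F_t - 0.5 * (P_prev + P_next)                # step (63), simplified
assert dX.shape == F_t.shape
```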
CN202211148006.4A 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation Pending CN115393776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211148006.4A CN115393776A (en) 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation


Publications (1)

Publication Number Publication Date
CN115393776A true CN115393776A (en) 2022-11-25

Family

ID=84125957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211148006.4A Pending CN115393776A (en) 2022-09-20 2022-09-20 Black box attack method for self-supervision video target segmentation

Country Status (1)

Country Link
CN (1) CN115393776A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878848A (en) * 2023-02-22 2023-03-31 中南大学 Antagonistic video sample generation method, terminal device and medium
CN115878848B (en) * 2023-02-22 2023-05-02 中南大学 Antagonistic video sample generation method, terminal equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination