CN114821432B - Video object segmentation adversarial attack method based on discrete cosine transform - Google Patents

Video object segmentation adversarial attack method based on discrete cosine transform

Info

Publication number
CN114821432B
CN114821432B
Authority
CN
China
Prior art keywords
video
semantic
discrete cosine
obtaining
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210481562.7A
Other languages
Chinese (zh)
Other versions
CN114821432A (en)
Inventor
潘震
李平
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210481562.7A
Publication of CN114821432A
Application granted
Publication of CN114821432B
Legal status: Active

Classifications

    • G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T7/246 — Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/269 — Image analysis; Analysis of motion using gradient-based methods
    • G06T2207/10016 — Image acquisition modality; Video; Image sequence
    • G06T2207/20081 — Special algorithmic details; Training; Learning
    • G06T2207/20084 — Special algorithmic details; Artificial neural networks [ANN]

Abstract

The invention discloses an adversarial attack method for video object segmentation based on the discrete cosine transform. The method obtains video semantic features at a convolution layer of a pre-trained video object segmentation model and converts them into frequency-domain semantic features through the discrete cosine transform; it obtains motion vectors through a video object motion perception module and semantic weights through a semantic weight quantization module; finally, it screens out frequency-domain values of the frequency-domain semantic features according to the semantic weights and restores the features by the inverse discrete cosine transform to obtain adversarial semantic features, so that removing semantic features attacks the video object segmentation model, i.e. yields a segmentation mask of poor quality. The disclosed method fuses the temporal relationship of the video into the adversarial attack, so that the attack focuses on the moving objects in the video and destroys that temporal relationship; through the semantic weights and the screening of video frame semantic features, the adversarial attack is realized by generating adversarial examples on the video semantic features, which lowers the video object segmentation accuracy and improves the attack effect.

Description

Video object segmentation adversarial attack method based on discrete cosine transform
Technical Field
The invention belongs to the field of computer vision, in particular to the fields of adversarial learning and video object segmentation, and relates to an adversarial attack method for video object segmentation based on the discrete cosine transform.
Background
Video object segmentation is a commonly used video processing technique that accurately separates an object in a video from the background and produces a pixel-level mask of the object (a matrix with the same resolution as the video frame, whose elements equal 1 in the object region and 0 in the background region). Video object segmentation not only serves general video processing tasks but is also applied in frontier fields such as autonomous driving, video surveillance, human-computer interaction and virtual reality. In recent years, various deep-learning-based neural network models have been proposed for the video object segmentation task. However, many studies have shown that deep neural network models are not robust and are vulnerable to adversarial attacks: an adversarial example is generated by adding an imperceptible perturbation (pixel values of small magnitude) to an image or video, and this adversarial example is then fed to the deep neural network model to deceive it into producing erroneous output. Because existing video object segmentation models are usually built on deep neural networks, they are not robust to adversarial examples, i.e. they output masks with low segmentation quality for perturbed videos. This is fatal to practical applications of video object segmentation (such as autonomous driving), so the security of video object segmentation has profound research significance and great research value.
At present, research on adversarial attacks mainly focuses on image classification, while attacks on video are still at an early stage, with only a small amount of work targeting video classification and semantic segmentation. In image classification, most attack methods exploit the gradient of the input image (a tensor with the same dimensions as the input image). The Fast Gradient Sign Method (FGSM), for example, takes the sign of each element of the gradient tensor, multiplies it by a small perturbation coefficient to obtain an imperceptible perturbation, and adds the perturbation to the image so that the classifier misclassifies it; later work applies various processing steps (e.g. projection, convolution) to the image gradient to strengthen the attack. In adversarial attacks on video classification, some methods borrow the idea of image classification attacks, processing the gradient of the video to generate a perturbation that, once added to the video, makes the classifier assign it to a wrong class; other methods exploit the characteristics of video classification data by sampling videos of different categories, using the gradients of the sampled subset in place of the gradient of the whole video, and generating a universal perturbation from them to build adversarial video examples. In attack methods for semantic segmentation, the attack algorithm optimizes a loss function over a set of pixels/proposed targets to generate an adversarial perturbation aimed at confusing the proposed targets as much as possible, so that the semantic segmentation model mispredicts the classes of multiple proposed targets in the input image.
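By way of illustration only, the gradient-sign idea described above can be sketched in Python as follows; this is a generic FGSM-style example rather than the method of the invention, and the model interface, tensor shapes and perturbation size are assumptions made for the illustration.

import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=2.0 / 255):
    # One-step sign-gradient perturbation (FGSM-style sketch).
    # image: (1, 3, H, W) tensor in [0, 1]; label: (1, H, W) class indices (assumed shapes).
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)                    # (1, num_classes, H, W) assumed
    loss = F.cross_entropy(logits, label)    # loss the attacker wants to increase
    loss.backward()
    adv = image + eps * image.grad.sign()    # sign of the gradient, scaled by a small coefficient
    return adv.clamp(0.0, 1.0).detach()      # keep the perturbed input a valid image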
The shortcomings of these methods are mainly reflected in two aspects: (1) because of the differences between images and videos, attack methods designed for images do not necessarily transfer to videos, and existing video attack methods do not consider the association between consecutive frames or the temporal relationship of the video; (2) existing adversarial attack methods all add perturbations to the image/video so that the model produces wrong outputs on the perturbed data; such attacks are easily neutralized by adversarial defense techniques such as denoising and have difficulty mounting an effective attack. In view of these considerations, there is an urgent need for a video object segmentation attack method that incorporates the temporal relationship of the video and has strong, generalizable attack capability.
Disclosure of Invention
The aim of the invention is to provide an adversarial attack method for video object segmentation based on the discrete cosine transform that addresses the shortcomings of the prior art. The method extracts optical flow from the video and obtains motion vectors by constructing a video object motion perception module, which provides temporal features to the attack algorithm and makes it focus on the moving objects in the video; at the same time, a semantic weight quantization module and a semantic discrete cosine screening module are constructed, used respectively to capture semantic weights and to screen out video frame semantic features. The attack can break through denoising-based adversarial defenses, thereby attacking the video object segmentation model and lowering the segmentation accuracy of the original model.
The method first obtains a video data set, pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then performs the following operations:
Step (1): uniformly sample the video to obtain a video frame sequence {X_t}_{t=1}^T, and input it into the pre-trained video object segmentation model to obtain the original video frame semantic features Z_t.
Step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t.
Step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t.
Step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t.
Step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T.
Step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers.
Further, the step (1) is specifically:
(1-1) uniformly sample the video at 5-10 frames per second to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
(1-2) input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network (e.g. ResNet), and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up. Here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer, and the whole pre-trained video object segmentation model is denoted Φ(·).
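By way of illustration only, the following Python sketch shows how middle-layer semantic features can be captured with a forward hook; a torchvision ResNet-50 and its layer2 output stand in for the pre-trained segmentation model Φ and its l-th (middle) convolution layer, both illustrative assumptions rather than the actual model of the invention.

import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None).eval()   # load pre-trained weights in practice

features = {}
def hook(module, inputs, output):
    features["Z"] = output.detach()          # semantic features Z_t at the hooked layer

# "middle layer": layer2 is an illustrative stand-in for the l-th convolution layer
backbone.layer2.register_forward_hook(hook)

frames = torch.rand(8, 3, 480, 854)          # T = 8 uniformly sampled frames X_t
with torch.no_grad():
    backbone(frames)
Z = features["Z"]                            # shape (T, C', H', W')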
Still further, the step (2) is specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
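By way of illustration only, steps (2-2) to (2-4) can be sketched in Python as follows; the FlowNet optical flows M_t are replaced by random tensors and the sizes H, W, H′, W′ and C′ are illustrative assumptions, so only the down-sampling, 1×1 convolution and motion function are shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

T, H, W = 8, 480, 854            # number of frames and frame size (illustrative)
Hp, Wp, Cp = 60, 107, 512        # H', W', C' of the semantic features (illustrative)

M = torch.rand(T, 2, H, W)       # stand-in for the FlowNet optical flows M_t (2 channels)

# (2-2) Interpolate: H×W×2 -> H'×W'×2
M_down = F.interpolate(M, size=(Hp, Wp), mode="bilinear", align_corners=False)

# (2-3) Conv2D: 2 input channels -> C' output channels, 1×1 kernel
conv = nn.Conv2d(in_channels=2, out_channels=Cp, kernel_size=1)
M_multi = conv(M_down)                       # multi-channel optical flows M''_t, shape (T, C', H', W')

# (2-4) Motion(O_t, M''_t) = Sigmoid(O_t ⊙ M''_t)
O = torch.randn(T, Cp, Hp, Wp)               # randomly initialised motion vectors O_t
O_prime = torch.sigmoid(O * M_multi)         # motion vectors O'_t, values in (0, 1)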
Still further, the step (3) is specifically: the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t. Here α is a disturbance coefficient whose magnitude is set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables.
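The exact quantization formula is given only as a figure in the source; by way of illustration, the Python sketch below assumes one plausible reading in which the all-ones gradient matrix H′_t is nudged by the sign of the back-propagated cross-entropy gradient scaled by α, modulated by the motion vector O′_t, and normalised with Softmax. This functional form is an assumption, not the formula of the invention.

import torch
import torch.nn.functional as F

def semantic_weight(H_grad, O_prime, ce_grad, alpha=2.0 / 255):
    # All tensors have shape (C', H', W'); ce_grad is the back-propagated gradient of the
    # cross-entropy loss with respect to H'_t (assumed input).
    H_updated = H_grad + alpha * ce_grad.sign()        # assumed update giving H''_t
    scores = (H_updated * O_prime).flatten()           # motion-modulated scores
    return F.softmax(scores, dim=0).view_as(H_grad)    # semantic weights Q_t, summing to 1

H_grad = torch.ones(512, 60, 107)        # all-ones initialisation H'_t (illustrative size)
O_prime = torch.rand_like(H_grad)        # motion vector from step (2)
ce_grad = torch.randn_like(H_grad)       # placeholder for the back-propagated gradient
Q = semantic_weight(H_grad, O_prime, ce_grad)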
Further, the step (4) is specifically:
(4-1) construct the semantic discrete cosine screening module, which consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function.
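By way of illustration only, step (4) can be sketched with the multidimensional DCT/IDCT of scipy.fft standing in for Cosine(·)/InverseCosine(·); because the threshold function is given only as a figure, the sketch assumes it zeroes the components whose semantic weight exceeds β, which is an assumption about the keep/drop direction.

import numpy as np
from scipy.fft import dctn, idctn

def screen_semantic_features(Z_t, Q_t, beta):
    # Z_t: semantic features (C', H', W'); Q_t: semantic weights of the same shape.
    Z_freq = dctn(Z_t, norm="ortho")                 # Z'_t = Cosine(Z_t)
    S_t = (Q_t <= beta).astype(Z_freq.dtype)         # screening matrix: drop high-weight entries (assumed)
    Z_screened = S_t * Z_freq                        # element-by-element product
    return idctn(Z_screened, norm="ortho")           # adversarial features via InverseCosine

Z_t = np.random.rand(512, 60, 107).astype(np.float32)   # illustrative feature tensor
Q_t = np.random.rand(512, 60, 107).astype(np.float32)
Q_t /= Q_t.sum()                                         # roughly normalised weights
Z_adv = screen_semantic_features(Z_t, Q_t, beta=1.0 / Q_t.size)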
Still further, the step (5) is specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
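By way of illustration only, the iterative optimization of step (5) can be sketched as follows; build_adv stands for the step-(4) pipeline mapping the weight-gradient matrix to adversarial features, phi_post for the layers Φ_{l+} after the middle layer, and the constraint is realised here as an L_∞ clipping of the feature perturbation to radius ε; all names and the exact constraint form are illustrative assumptions.

import torch
import torch.nn.functional as F

def optimise_adversarial_features(phi_post, Z_t, Y_t, build_adv, H_grad,
                                  n_iters=10, lr=0.1, eps=8.0 / 255):
    # Z_t: original features (C', H', W'); Y_t: true mask (1, H, W) class indices.
    # phi_post and build_adv are assumed callables, not APIs defined by the invention.
    H_grad = H_grad.clone().requires_grad_(True)
    optimiser = torch.optim.SGD([H_grad], lr=lr)       # stochastic gradient descent on H'_t
    for _ in range(n_iters):
        Z_adv = build_adv(H_grad)                      # initial adversarial features (step (4))
        loss = -F.cross_entropy(phi_post(Z_adv.unsqueeze(0)), Y_t)   # maximise the cross-entropy
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    with torch.no_grad():
        Z_adv = build_adv(H_grad)
        delta = (Z_adv - Z_t).clamp(-eps, eps)         # constrain ‖Ẑ_t − Z_t‖_∞ ≤ ε
    return Z_t + delta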
Continuing further, the step (6) is specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
The invention provides an adversarial attack method for video object segmentation based on the discrete cosine transform, with the following characteristics: 1) a semantic attack network is designed for video data and the video object segmentation task, and a video object motion perception module is proposed so that the semantic attack network focuses on the moving objects in the video; 2) a semantic weight quantization module is proposed, assigning semantic weights to the video frame semantic features to distinguish the importance of different semantic features; 3) a semantic discrete cosine screening module is proposed, which screens out part of the video semantic features according to the semantic weights to obtain adversarial semantic features, and the optimized adversarial semantic features are obtained by iterative optimization against the output of the video object segmentation model, ensuring the effectiveness of the adversarial attack.
The invention is suited to adversarial attacks on video object segmentation models and has the following advantages: 1) through the video object motion perception module, the semantic attack network attends to the moving objects in the video, destroys the temporal continuity between video frames, and strengthens the attack on the video object segmentation model; 2) the semantic weights obtained by the semantic weight quantization module distinguish the video frame semantic features, so that screening out semantic features realizes the adversarial attack and improves its effectiveness; 3) starting from the video data, adversarial examples are generated by iteratively optimizing and screening out part of the semantic features, which can break through denoising-based adversarial defenses and improves the generalization capability of the attack.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the adversarial attack method for video object segmentation based on the discrete cosine transform proceeds as follows: first, the video data is uniformly sampled to obtain video frames, the semantic features of the video frames are obtained at a middle convolution layer of the pre-trained video object segmentation model, and the video semantic features are converted into frequency-domain features through the discrete cosine transform; then a video object motion perception module is constructed to output motion vectors; next, a semantic weight quantization module is constructed to output semantic weights; then a semantic discrete cosine screening module is constructed, which takes the semantic weights as input, screens out part of the frequency-domain semantic features, and restores them through the inverse discrete cosine transform to obtain adversarial semantic features; the adversarial semantic features are optimized through a cross-entropy loss function to obtain optimized adversarial semantic features; finally, the optimized adversarial semantic features are input into the subsequent convolution layers of the pre-trained video object segmentation model to obtain the attacked video object segmentation results. The method uses the video object motion perception module to capture the temporal information of the video and fuse it into the attack, so that the attack algorithm focuses on the moving objects in the video and further destroys the temporal relationship; meanwhile, the semantic weight quantization module and the semantic discrete cosine screening module capture and screen the semantic features, which can break through denoising-based adversarial defenses, so that the output segmentation results have low accuracy, achieving an adversarial attack on video object segmentation.
The method first obtains a video data set, pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then performs the following operations:
Step (1): uniformly sample the video to obtain a video frame sequence {X_t}_{t=1}^T, and input it into the pre-trained video object segmentation model to obtain the original video frame semantic features Z_t; specifically:
(1-1) uniformly sample the video at 5-10 frames per second to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
(1-2) input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network (e.g. ResNet), and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up. Here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer, and the whole pre-trained video object segmentation model is denoted Φ(·).
Step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t; specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
Step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t; specifically: the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t. Here α is a disturbance coefficient whose magnitude is set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables.
Step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t; specifically:
(4-1) construct the semantic discrete cosine screening module, which consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function.
Step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T; specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
Step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers; specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
The embodiment described here is only an example of the implementation of the inventive concept; the protection scope of the invention should not be considered limited to the specific form set forth in the embodiment, but also covers equivalent technical means that persons skilled in the art can conceive according to the inventive concept.

Claims (5)

1. A discrete-cosine-transform-based adversarial attack method for video object segmentation, characterized in that a video data set, pixel-level object class matrices and a pre-trained video object segmentation model are first obtained, and then the following operations are performed:
step (1): uniformly sample the video to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network, and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up; here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, and Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer;
step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t;
step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t;
the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t, where α is a disturbance coefficient set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables;
step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t;
(4-1) the semantic discrete cosine screening module consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function;
step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T;
step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers.
2. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 1, wherein in step (1) the video is uniformly sampled at 5-10 frames per second.
3. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 1 or 2, wherein the step (2) is specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
4. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 3, wherein the step (5) is specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
5. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 4, wherein the step (6) is specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
CN202210481562.7A 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform Active CN114821432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481562.7A CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210481562.7A CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Publications (2)

Publication Number Publication Date
CN114821432A CN114821432A (en) 2022-07-29
CN114821432B true CN114821432B (en) 2022-12-02

Family

ID=82510542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210481562.7A Active CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Country Status (1)

Country Link
CN (1) CN114821432B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311521B (en) * 2022-09-13 2023-04-28 中南大学 Black box video countermeasure sample generation method and evaluation method based on reinforcement learning
CN116308978B (en) * 2022-12-08 2024-01-23 北京瑞莱智慧科技有限公司 Video processing method, related device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301019A (en) * 1992-09-17 1994-04-05 Zenith Electronics Corp. Data compression system having perceptually weighted motion vectors
CN1767653A (en) * 2005-11-08 2006-05-03 上海广电(集团)有限公司中央研究院 Bit rate control method
CN101668170A (en) * 2009-09-23 2010-03-10 中山大学 Digital television program copyright protecting method for resisting time synchronization attacks
CN104243974A (en) * 2014-09-12 2014-12-24 宁波大学 Stereoscopic video quality objective evaluation method based on three-dimensional discrete cosine transformation
CN113538457A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN114202017A (en) * 2021-11-29 2022-03-18 南京航空航天大学 SAR optical image mapping model lightweight method based on condition generation countermeasure network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6895055B2 (en) * 2001-10-29 2005-05-17 Koninklijke Philips Electronics N.V. Bit-rate guided frequency weighting matrix selection
CN105828064B (en) * 2015-01-07 2017-12-12 中国人民解放军理工大学 The local video quality evaluation without reference method with global space-time characterisation of fusion
CN112927202B (en) * 2021-02-25 2022-06-03 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301019A (en) * 1992-09-17 1994-04-05 Zenith Electronics Corp. Data compression system having perceptually weighted motion vectors
CN1767653A (en) * 2005-11-08 2006-05-03 上海广电(集团)有限公司中央研究院 Bit rate control method
CN101668170A (en) * 2009-09-23 2010-03-10 中山大学 Digital television program copyright protecting method for resisting time synchronization attacks
CN104243974A (en) * 2014-09-12 2014-12-24 宁波大学 Stereoscopic video quality objective evaluation method based on three-dimensional discrete cosine transformation
CN113538457A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN114202017A (en) * 2021-11-29 2022-03-18 南京航空航天大学 SAR optical image mapping model lightweight method based on condition generation countermeasure network

Also Published As

Publication number Publication date
CN114821432A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
Wei et al. 3-D quasi-recurrent neural network for hyperspectral image denoising
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN114821432B (en) Video target segmentation anti-attack method based on discrete cosine transform
Lin et al. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation
JP6656111B2 (en) Method and system for removing image noise
Fan et al. Low-level structure feature extraction for image processing via stacked sparse denoising autoencoder
CN113191489B (en) Training method of binary neural network model, image processing method and device
Mirmozaffari Filtering in image processing
CN113379618B (en) Optical remote sensing image cloud removing method based on residual dense connection and feature fusion
Wang et al. PFDN: Pyramid feature decoupling network for single image deraining
Song et al. Multistage curvature-guided network for progressive single image reflection removal
Mana et al. An intelligent deep learning enabled marine fish species detection and classification model
Gökcen et al. Real-time impulse noise removal
CN112308087B (en) Integrated imaging identification method based on dynamic vision sensor
KR102095444B1 (en) Method and Apparatus for Removing gain Linearity Noise Based on Deep Learning
Schirrmacher et al. Sr 2: Super-resolution with structure-aware reconstruction
CN111401155B (en) Image recognition method of residual error neural network based on implicit Euler jump connection
Li et al. Distribution-transformed network for impulse noise removal
Wang et al. A Denoising Network Based on Frequency-Spectral-Spatial-Feature for Hyperspectral Image
Kunapuli et al. Enhanced Medical Image De-noising Using Auto Encoders and MLP
Doulamis Vision based fall detector exploiting deep learning
Wang et al. Deep hyperspectral and multispectral image fusion with inter-image variability
Zhao et al. Cascaded residual density network for crowd counting
Ning et al. The Importance of Anti-Aliasing in Tiny Object Detection
Antony et al. T2FRF Filter: An Effective Algorithm for the Restoration of Fingerprint Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant