CN114821432B - Video target segmentation anti-attack method based on discrete cosine transform - Google Patents
- Publication number
- CN114821432B (application CN202210481562.7A)
- Authority
- CN
- China
- Prior art keywords: video, semantic, discrete cosine, obtaining, frame
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/269—Analysis of motion using gradient-based methods
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses an adversarial attack method for video object segmentation based on the discrete cosine transform. The method obtains video semantic features from a convolution layer of a pre-trained video object segmentation model and converts them into frequency-domain semantic features via the discrete cosine transform; a video object motion perception module produces motion vectors, and a semantic weight quantization module produces semantic weights. Finally, frequency-domain values of the frequency-domain semantic features are screened out according to the semantic weights, and the inverse discrete cosine transform restores the result to obtain adversarial semantic features; removing semantic features in this way attacks the video object segmentation model, yielding a degraded segmentation mask. The method fuses the temporal relationship of the video into the adversarial attack, so that the attack focuses on moving objects in the video and disrupts the temporal relationship; by weighting and screening out video-frame semantic features, adversarial examples are generated directly on the video semantic features, reducing video object segmentation accuracy and improving the attack effect.
Description
Technical Field
The invention belongs to the field of computer vision, in particular to adversarial learning and video object segmentation, and relates to an adversarial attack method for video object segmentation based on the discrete cosine transform.
Background
Video object segmentation is a commonly used video processing technique that accurately separates an object in a video from the background and produces a pixel-level mask of the object (a matrix with the same resolution as the video frame, in which elements covering the object region are 1 and elements covering the background are 0). Beyond video processing tasks, the technique is applied in leading-edge fields such as autonomous driving, video surveillance, human-computer interaction and virtual reality. In recent years, various deep-learning-based neural network models have been proposed for the video object segmentation task. However, many studies have shown that deep neural network models are not robust and are vulnerable to adversarial attack: an adversarial example is generated by adding an imperceptible perturbation (pixel values of small magnitude) to an image or video, and feeding this example to the model deceives it into producing an erroneous output. Because existing video object segmentation models are usually built on deep neural networks, they are not robust against adversarial examples; for a perturbed video, the model outputs a mask of low segmentation quality, which is fatal for practical applications such as autonomous driving. The security of video object segmentation therefore has profound research significance and great research value.
At present, research on adversarial attacks mainly focuses on image classification; research on video attacks is still in its infancy, and only a small amount of work addresses video classification and semantic segmentation. In image classification, most attack methods use the gradient of the input image (a tensor with the same dimensions as the input) to mount the attack. The Fast Gradient Sign Method, for example, takes the sign of each element of the gradient tensor and multiplies it by a small perturbation coefficient to generate an imperceptible perturbation, which is added to the image so that the classifier misclassifies it; later work applies various processing (e.g. projection, convolution) to the image gradient to strengthen the attack. In adversarial attacks on video classification, some methods borrow the image-classification idea, processing the gradient of the video to generate a perturbation that, once added, makes the classifier assign the video to a wrong category; other methods exploit the characteristics of video classification data by sampling videos of different categories, replacing the gradient of the whole video with the gradient of the sampled portion, and generating a universal perturbation from it to build adversarial videos. In attacks on semantic segmentation, the attack algorithm optimizes a loss function over a set of pixels/proposed targets to generate an adversarial perturbation that confuses as many proposed targets as possible, so that the segmentation model mispredicts the classes of multiple proposed targets in the input image.
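As an illustrative sketch (not part of the patent text), the Fast Gradient Sign Method described above can be written in a few lines of NumPy; the loss gradient `grad` is assumed to have been computed beforehand by backpropagation:

```python
import numpy as np

def fgsm_perturb(x, grad, alpha=2.0 / 255):
    """Fast Gradient Sign Method: take the sign of the loss gradient
    w.r.t. the input, scale it by a small perturbation coefficient
    alpha, add it to the input, and clip back to the valid pixel range."""
    return np.clip(x + alpha * np.sign(grad), 0.0, 1.0)

rng = np.random.default_rng(1)
x = rng.random((4, 4, 3))             # toy "image" in [0, 1]
grad = rng.normal(size=(4, 4, 3))     # assumed precomputed loss gradient
adv = fgsm_perturb(x, grad)           # adversarial example, |adv - x| <= alpha
```

The perturbation is bounded element-wise by `alpha`, which is what makes it imperceptible.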
The shortcomings of these methods are mainly two-fold: (1) because of the differences between images and videos, attacks designed for images do not necessarily transfer to videos, and existing video attacks ignore the association between consecutive frames and the temporal relationship of the video; (2) existing adversarial attacks all add perturbations to the image/video so that the model produces wrong outputs on the perturbed data, and such attacks are easily neutralized by defense techniques such as denoising, after which they can no longer cause effective damage. In view of the above, there is an urgent need for a video object segmentation attack method that incorporates the temporal relationship of the video and has strong generalization capability.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an adversarial attack method for video object segmentation based on the discrete cosine transform. The method extracts optical flow from the video and obtains motion vectors through a video object motion perception module, providing temporal features for the attack algorithm so that it focuses on moving objects in the video; it also constructs a semantic weight quantization module and a semantic discrete cosine screening module, used respectively to capture semantic weights and to screen out video-frame semantic features. The attack can break through denoising-based adversarial defenses, thereby attacking the video object segmentation model and reducing the segmentation accuracy of the original model.
The method first obtains a video data set, pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then performs the following operations:
Step (1): uniformly sample the video to obtain a video frame sequence {X_t}_{t=1}^T, and input it into the pre-trained video object segmentation model to obtain the original video-frame semantic features Z_t;
Step (2): construct a video object motion perception module that takes the video frame sequence as input and outputs the motion vectors O'_t;
Step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs the semantic weights Q_t;
Step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video-frame semantic features Z_t as input and outputs the adversarial semantic features Ẑ_t;
Step (5): fix the parameters of the semantic attack network formed by the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set;
Step (6): input the optimized adversarial semantic feature set into the layer following the intermediate layer of the video object segmentation model, and obtain the attacked video object segmentation result through the subsequent network layers.
Further, the step (1) is specifically:
(1-1) Uniformly sample the video at 5-10 frames per second to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} is the t-th video frame, Y_t is the true mask corresponding to the t-th video frame, T is the number of video frames, ℝ is the real number domain, H and W are the height and width of a video frame, and 3 is the number of RGB channels;
(1-2) Input each video frame X_t of the sequence in turn into the pre-trained video object segmentation model, composed of a residual convolutional neural network (e.g. ResNet), and obtain the corresponding original video-frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H'×W'×C'} at the intermediate layer of the model; the intermediate layer is the l-th convolution, where l is half of the model's total number of layers rounded up. H', W' and C' are the height, width and number of channels of the video-frame semantic features, Φ_l(·) denotes all network structure of the pre-trained video object segmentation model before the l-th convolution, and the whole pre-trained video object segmentation model is denoted Φ(·).
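As an illustrative sketch (not part of the patent text), the split of the model into Φ_l(·) and the remaining layers can be mimicked with a hypothetical stack of callables standing in for the ResNet backbone; the layer functions and the layer count L = 8 are toy assumptions:

```python
import math
import numpy as np

# Hypothetical stand-in for the pre-trained segmentation model: a list of
# "layers" (simple callables here). The middle layer is l = ceil(L / 2).
layers = [lambda z, k=k: np.tanh(z + 0.1 * k) for k in range(8)]  # L = 8
l = math.ceil(len(layers) / 2)                                    # l = 4

def phi_l(x):
    """Phi_l(.): forward pass through the layers up to the middle layer."""
    z = x
    for layer in layers[:l]:
        z = layer(z)
    return z

def phi_l_plus(z):
    """Phi_{l+}(.): forward pass through the layers after the middle layer."""
    for layer in layers[l:]:
        z = layer(z)
    return z
```

Composing the two halves reproduces the full model Φ(·), which is exactly the property steps (1), (5) and (6) rely on: features are taken out at layer l, modified, and fed back into Φ_{l+}(·).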
Still further, the step (2) is specifically:
(2-1) The video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function. FlowNet is an optical flow extraction network composed of several convolution layers; inputting the video frame sequence {X_t}_{t=1}^T into FlowNet yields the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t ∈ ℝ^{H×W×2} is the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) Input the optical flow set into a down-sampling function, down-sampling each optical flow as M'_t = Interpolate(M_t) to obtain the down-sampled optical flow set {M'_t}_{t=1}^T, where M'_t is the down-sampled optical flow between the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H'×W'×2;
(2-3) Input the down-sampled optical flow set into the two-dimensional convolution, convolving each flow as M''_t = Conv2D(M'_t) to obtain the multi-channel optical flow set {M''_t}_{t=1}^T, where M''_t is the multi-channel optical flow between the t-th and (t+1)-th frames; Conv2D(·) is a two-dimensional convolution with 2 input channels, C' output channels and a 1×1 convolution kernel;
(2-4) Randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector for the t-th frame X_t; input each O_t together with the multi-channel optical flow M''_t into the motion function Motion(O_t, M''_t) = Sigmoid(O_t ⊙ M''_t) to obtain the motion vector O'_t, where ⊙ is the element-wise product and Sigmoid(·) is the sigmoid activation function, mapping variables to between 0 and 1.
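As an illustrative sketch (not part of the patent text), steps (2-2) to (2-4) can be mimicked in NumPy; FlowNet itself is not reproduced, the nearest-neighbour `interpolate` stands in for the down-sampling function, and all sizes and the random 1×1 kernel are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, Hp, Wp, Cp = 8, 8, 4, 4, 6   # toy sizes for H, W, H', W', C'

def interpolate(flow, hp, wp):
    """Nearest-neighbour stand-in for Interpolate(.), shrinking an
    H x W x 2 optical flow to H' x W' x 2."""
    rows = np.arange(hp) * flow.shape[0] // hp
    cols = np.arange(wp) * flow.shape[1] // wp
    return flow[np.ix_(rows, cols)]

def conv2d_1x1(flow, weight):
    """1x1 convolution mapping the 2 flow channels to C' channels."""
    return flow @ weight           # (H', W', 2) @ (2, C') -> (H', W', C')

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion(o, m):
    """Motion(O_t, M''_t) = Sigmoid(O_t * M''_t): element-wise gate in (0, 1)."""
    return sigmoid(o * m)

m_t = rng.normal(size=(H, W, 2))       # optical flow between frames t and t+1
m_down = interpolate(m_t, Hp, Wp)      # M'_t
weight = rng.normal(size=(2, Cp))      # random stand-in for the learned kernel
m_multi = conv2d_1x1(m_down, weight)   # M''_t
o_t = rng.normal(size=(Hp, Wp, Cp))    # randomly initialised motion vector O_t
o_prime = motion(o_t, m_multi)         # motion vector O'_t
```

Because of the sigmoid, every element of O'_t lies strictly between 0 and 1, so O'_t acts as a soft attention map over the feature grid.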
Still further, the step (3) is specifically: the semantic weight quantization module consists of a semantic weight quantization function. Initialize an all-ones semantic weight gradient matrix H_t and a semantic weight matrix Q_t, and input them together with the motion vector O'_t into the semantic weight quantization function to obtain the semantic weight Q_t = Softmax(H'_t ⊙ O'_t), where α is a perturbation coefficient whose magnitude is set to 2.0/255, Φ(X_t) is the prediction mask of the pre-trained video object segmentation model for the t-th frame, H'_t is the semantic weight gradient matrix updated from H_t with step size α on the cross-entropy loss between Φ(X_t) and the true mask, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-wise product, and Softmax(·) is the Softmax function, which normalizes its variables.
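As an illustrative sketch (not part of the patent text), step (3) can be mimicked in NumPy. The original formula images are not reproduced in the page text, so the exact composition is an assumption: an FGSM-style update of the all-ones gradient matrix is assumed for H'_t, and the gate-then-Softmax combination is inferred from the symbols the text does mention (α, ⊙, Softmax):

```python
import numpy as np

def softmax(x):
    """Softmax over all elements, normalising the weights to sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

def update_gradient(h, grad, alpha=2.0 / 255):
    """Assumed FGSM-style update of the all-ones matrix H_t with the
    perturbation coefficient alpha (hypothetical reconstruction)."""
    return h + alpha * np.sign(grad)

def quantize(h_updated, o_prime):
    """Semantic weight quantization sketch: gate the updated gradient
    matrix H'_t with the motion vector O'_t by element-wise product,
    then normalise with Softmax to obtain Q_t."""
    return softmax(h_updated * o_prime)

rng = np.random.default_rng(4)
h_prime = update_gradient(np.ones((4, 4, 6)), rng.normal(size=(4, 4, 6)))
q_t = quantize(h_prime, np.full((4, 4, 6), 0.5))   # semantic weight Q_t
```

Whatever the exact inner expression, the Softmax guarantees Q_t is a valid weighting: all entries are positive and sum to 1.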
Further, the step (4) is specifically:
(4-1) Construct the semantic discrete cosine screening module, which consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function. Input the original video-frame semantic features Z_t of frames 1 to T in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z'_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform;
(4-2) Input each element q_k of the semantic weight Q_t in turn into the threshold function, which compares q_k against the threshold coefficient β and outputs 1 or 0 accordingly, obtaining the semantic screening matrix S_t; k indexes the elements of the semantic weight Q_t, and β is a threshold coefficient greater than 0;
(4-3) Multiply the semantic screening matrix S_t element-wise with the frequency-domain semantic features Z'_t to obtain the screened frequency-domain semantic features Z''_t = S_t ⊙ Z'_t;
(4-4) Input the screened frequency-domain semantic features Z''_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Ẑ_t = InverseCosine(Z''_t), where InverseCosine(·) denotes the inverse discrete cosine transform.
Still further, the step (5) is specifically:
(5-1) Input the adversarial semantic features Ẑ_t into Φ_{l+}(·), all network structure of the pre-trained video object segmentation model after the intermediate layer (the l-th convolution), to obtain the prediction mask Ŷ_t;
(5-2) Compute the cross-entropy loss L_CE(Ŷ_t, Y_t) between the prediction mask Ŷ_t and the true mask Y_t of frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) Fix the parameters of the semantic attack network formed by the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H'_t by stochastic gradient descent to obtain the optimized semantic weight gradient H''_t;
(5-4) From the optimized semantic weight gradient H''_t, obtain the initial adversarial semantic features Ẑ_t^(n) according to step (4), where the superscript n denotes the n-th optimization iteration;
(5-5) Retain the initial adversarial semantic features Ẑ_t^(n) obtained at each iteration, giving the initial adversarial semantic feature set {Ẑ_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) Input the original semantic features Z_t of video frames 1 to T and the corresponding initial adversarial semantic feature sets in turn into the constraint function, which bounds each Ẑ_t^(n) within an L_p ball of radius ε around Z_t, obtaining the optimized adversarial semantic features for each video frame X_t, where ||·||_p is the L_p norm, p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold of the L_p norm constraint.
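As an illustrative sketch (not part of the patent text), the constraint function of step (5-6) is a standard L_p projection; the function name is assumed, and the two (p, ε) pairings follow the values the step lists:

```python
import numpy as np

def constrain(z_orig, z_adv, p=np.inf, eps=8 / 255):
    """Project the adversarial feature back into an L_p ball of radius
    eps around the original semantic feature, as in step (5-6)."""
    delta = z_adv - z_orig
    if p == np.inf:
        delta = np.clip(delta, -eps, eps)     # element-wise clamp for L_inf
    else:                                     # p == 2: rescale onto the ball
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta = delta * (eps / norm)
    return z_orig + delta
```

This guarantees that every optimized adversarial feature stays within the stated norm budget of the clean feature, which is what keeps the attack bounded.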
Continuing further, the step (6) is specifically: input the optimized adversarial semantic feature set into Φ_{l+}(·), the network structure of the pre-trained video object segmentation model after the intermediate layer, carry out the adversarial attack, and output the final attacked video object segmentation result {Y'_t}_{t=1}^T, where Y'_t is the segmentation result corresponding to the t-th video frame.
The invention provides an adversarial attack method for video object segmentation based on the discrete cosine transform, with the following characteristics: 1) a semantic attack network is designed for video data and the video object segmentation task, and a video object motion perception module is provided so that the semantic attack network focuses on moving objects in the video; 2) a semantic weight quantization module is provided, assigning semantic weights to the video-frame semantic features to distinguish the importance of different semantic features; 3) a semantic discrete cosine screening module is provided, screening out part of the video semantic features according to the semantic weights to obtain adversarial semantic features, which are iteratively optimized according to the output of the video object segmentation model, ensuring the effectiveness of the attack.
The invention is suitable for adversarial attacks against video object segmentation models and has the following advantages: 1) through the video object motion perception module, the semantic attack network can attend to moving objects in the video, destroying the temporal continuity between video frames and strengthening the attack on the segmentation model; 2) the semantic weights obtained by the semantic weight quantization module differentiate the video-frame semantic features, so that the attack is realized by screening out semantic features, improving its effectiveness; 3) starting from the video data, adversarial examples are generated by iteratively optimizing and screening out part of the semantic features, which can break through denoising-based adversarial defenses and improves the generalization capability of the attack.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the video object segmentation adversarial attack method based on the discrete cosine transform first uniformly samples the video data to obtain video frames, extracts the video-frame semantic features at an intermediate convolution layer of the pre-trained video object segmentation model, and converts them into frequency-domain features via the discrete cosine transform. A video object motion perception module is then constructed to output motion vectors; next, a semantic weight quantization module is constructed to output semantic weights; then a semantic discrete cosine screening module is constructed, which takes the semantic weights as input, screens out frequency-domain semantic features and restores them through the inverse discrete cosine transform to obtain adversarial semantic features. The adversarial semantic features are optimized through a cross-entropy loss function; finally, the optimized adversarial semantic features are input into the subsequent convolution layers of the pre-trained model to obtain the attacked segmentation result. The video object motion perception module acquires the temporal information of the video and fuses it into the attack, so that the attack algorithm focuses on moving objects and disrupts the temporal relationship; meanwhile, the semantic weight quantization module and the semantic discrete cosine screening module capture and screen the semantic features, so the attack can break through denoising-based adversarial defenses and forces the output segmentation result to low accuracy, achieving the adversarial attack on video object segmentation.
The method first obtains a video data set, the pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then carries out steps (1) through (6) exactly as set forth in the disclosure above.
The embodiment described here is only an example of the implementation of the inventive concept; the protection scope of the present invention is not limited to the specific form set forth in the embodiment, but also covers equivalent technical means conceivable by those skilled in the art according to the inventive concept.
Claims (5)
1. A video target segmentation anti-attack method based on discrete cosine transform is characterized in that: firstly, a video data set, a pixel-level target category matrix and a pre-trained video target segmentation model are obtained, and then the following operations are carried out:
step (1) uniformly sampling a video to obtain T video frames to obtain a video frame sequenceAnd the real mask sequence X t Representing the t-th video frame, Y t The true mask corresponding to the T-th video frame, T is the number of video frames,represents a real number domain, H,W respectively represents the height and the width of a video frame, and 3 represents the number of RGB channels;
combining a sequence of video framesEach video frame X in (b) t Sequentially inputting the data into a pre-training video target segmentation model consisting of a residual convolution neural network, and obtaining corresponding semantic features of an original video frame in an intermediate layer of the modelThe middle layer is the first layer convolution which is rounded up by half of the total number of layers of the model; wherein H ', W ' and C ' are respectively the height, width and channel number of the video frame semantic features, phi l (.) is all network structures of the pre-training video target segmentation model before the l-th layer convolution;
step (2): constructing a video target motion perception module, taking the video frame sequence {X_t}_{t=1}^T as input, and obtaining the motion vectors O′_t;
step (3): constructing a semantic weight quantization module, taking an initialized semantic weight gradient tensor and the motion vectors as input, and obtaining the semantic weights Q_t;
the semantic weight quantization module consists of a semantic weight quantization function; an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix are initialized and, together with the motion vector O′_t, input into the semantic weight quantization function to obtain the semantic weights Q_t, where α is a perturbation coefficient set to 2.0/255, Φ(X_t) is the prediction mask of the pre-trained video target segmentation model for the t-th video frame, H″_t is the updated semantic weight gradient matrix, the loss term is the cross-entropy loss function, ⊙ denotes the element-wise product, and Softmax(·) denotes the Softmax function used to normalize variables;
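A heavily hedged sketch of the quantization: the claim names the ingredients (α = 2.0/255, the gradient matrix H′_t, the motion vector O′_t, an element-wise product, and Softmax), but the exact formula is an image in the original document, so this particular composition is an assumption:

```python
import numpy as np

def softmax(x):
    # Softmax(·) from the claim: normalizes variables to a distribution
    e = np.exp(x - x.max())
    return e / e.sum()

def quantize(h_grad, o_motion, alpha=2.0 / 255):
    # Hypothetical composition of the named ingredients: element-wise product
    # of the gradient matrix and the motion vector, normalized by Softmax and
    # scaled by the perturbation coefficient alpha. Assumed, not from the claim.
    return alpha * softmax(h_grad * o_motion)

h_t = np.ones((4, 4))                          # all-ones gradient matrix H'_t
o_t = np.random.default_rng(0).random((4, 4))  # motion vector O'_t in (0, 1)
q_t = quantize(h_t, o_t)                       # semantic weights Q_t
assert abs(q_t.sum() - 2.0 / 255) < 1e-12      # Softmax output scaled by alpha
```

Whatever the true composition, the scaling by α bounds the total weight mass, which is what makes the later thresholding in step (4) meaningful.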
step (4): constructing a semantic discrete cosine screening module, taking the semantic weights Q_t and the video frame semantic features Z_t as input, and obtaining the adversarial semantic features;
(4-1) constructing a semantic discrete cosine screening module composed of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; inputting the semantic features Z_t of the 1st to T-th original video frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) represents the discrete cosine transform function;
(4-2) inputting each element q_k of the semantic weights Q_t in turn into the threshold function to obtain a semantic screening matrix, where k indexes the elements of the semantic weights Q_t and β is a threshold coefficient greater than 0;
(4-3) taking the element-wise product of the semantic screening matrix and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features;
(4-4) inputting the screened frequency-domain semantic features into the inverse discrete cosine transform function to obtain the adversarial semantic features, where InverseCosine(·) represents the inverse discrete cosine transform function;
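Steps (4-1) to (4-4) can be sketched with SciPy's DCT routines; the hard-threshold rule and the β value are assumptions where the claim leaves the threshold function abstract:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_screen(z_t, q_t, beta=0.1):
    # (4-1) Cosine(Z_t): semantic features -> frequency domain Z'_t
    z_freq = dctn(z_t, norm="ortho")
    # (4-2) threshold function: keep coefficients whose weight exceeds beta;
    # this hard-threshold rule is an assumption (the claim leaves it abstract)
    screen = (q_t > beta).astype(z_t.dtype)
    # (4-3) element-wise product with the semantic screening matrix
    z_screened = screen * z_freq
    # (4-4) InverseCosine(·): back to the feature domain
    return idctn(z_screened, norm="ortho")

rng = np.random.default_rng(0)
z_t = rng.standard_normal((8, 8, 4))   # semantic features Z_t (H' x W' x C')
q_t = rng.random((8, 8, 4))            # semantic weights Q_t
z_adv = dct_screen(z_t, q_t)           # adversarial semantic features
# sanity check: an all-pass screen makes the DCT/IDCT round trip exact
np.testing.assert_allclose(dct_screen(z_t, np.ones_like(q_t), beta=0.5),
                           z_t, atol=1e-10)
```

With `norm="ortho"` the transform pair is orthonormal, so screening only removes energy from the suppressed frequency components.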
step (5): fixing the parameters of the semantic attack network consisting of the video target motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimizing the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set;
step (6): inputting the optimized adversarial semantic feature set into the layer following the intermediate layer of the video target segmentation model, and obtaining the attacked video target segmentation result through the subsequent network layers.
2. The video target segmentation anti-attack method based on discrete cosine transform as claimed in claim 1, wherein in step (1) the video is uniformly sampled at 5-10 frames per second.
3. The video target segmentation anti-attack method based on discrete cosine transform as claimed in claim 1 or 2, wherein step (2) is specifically:
(2-1) the video target motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; inputting the video frame sequence {X_t}_{t=1}^T into FlowNet yields the set of optical flows between all pairs of adjacent frames, where M_t denotes the optical flow between the t-th and (t+1)-th video frames; when t = T, M_T is initialized to all zeros;
(2-2) inputting the optical-flow set into a down-sampling function, down-sampling each optical flow as M′_t = Interpolate(M_t) to obtain the down-sampled optical-flow set, where M′_t denotes the down-sampled optical flow between the t-th and (t+1)-th video frames, and Interpolate(·) is a down-sampling function that changes the dimension of M_t from H×W×2 to H′×W′×2;
(2-3) inputting the down-sampled optical-flow set into the two-dimensional convolution layer, convolving each down-sampled optical flow as M″_t = Conv2D(M′_t) to obtain the multi-channel optical-flow set, where M″_t denotes the multi-channel optical flow between the t-th and (t+1)-th video frames, and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initializing a motion-vector set, where O_t is the randomly initialized motion vector corresponding to the t-th video frame X_t; inputting the randomly initialized motion vector O_t and the multi-channel optical flow M″_t in turn into the motion function to obtain the motion vector O′_t, where Sigmoid(·) is the sigmoid activation function that maps variables into (0, 1).
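A minimal sketch of steps (2-2) to (2-4), with FlowNet assumed external; nearest-neighbour down-sampling stands in for Interpolate(·), and the element-wise fusion inside the motion function is an assumption since the claim only names its inputs and the sigmoid:

```python
import numpy as np

def motion_perception(flow, hp, wp, c, rng):
    # flow: (H, W, 2) optical flow M_t from FlowNet (assumed computed outside)
    h, w, _ = flow.shape
    # (2-2) Interpolate(.): nearest-neighbour down-sampling to H' x W' x 2
    rows = np.arange(hp) * h // hp
    cols = np.arange(wp) * w // wp
    m_down = flow[rows][:, cols]                 # M'_t
    # (2-3) Conv2D(.) with a 1x1 kernel = per-pixel linear map 2 -> C' channels
    weight = rng.standard_normal((2, c))
    m_multi = m_down @ weight                    # M''_t
    # (2-4) motion function: fuse a random init O_t with M''_t and squash to
    # (0, 1) with a sigmoid; element-wise addition is an assumed fusion rule
    o_init = rng.standard_normal((hp, wp, c))
    return 1.0 / (1.0 + np.exp(-(o_init + m_multi)))   # O'_t

rng = np.random.default_rng(0)
o_prime = motion_perception(rng.standard_normal((64, 64, 2)), 16, 16, 8, rng)
assert o_prime.shape == (16, 16, 8)
```

Down-sampling to H′×W′ and lifting to C′ channels makes O′_t shape-compatible with the semantic features Z_t, which is why these dimensions are reused.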
4. The video target segmentation anti-attack method based on discrete cosine transform as claimed in claim 3, wherein step (5) is specifically:
(5-1) inputting the adversarial semantic features into Φ_l+(·), the entire network structure of the pre-trained video target segmentation model after the intermediate layer, to obtain a prediction mask; the intermediate layer is the l-th convolution layer;
(5-2) calculating the cross-entropy loss between the prediction mask and the ground-truth mask Y_t of video frame X_t, and obtaining the gradient of the semantic weights by back-propagation;
(5-3) fixing the parameters of the semantic attack network consisting of the video target motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and updating the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
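The update in (5-3) is plain stochastic gradient descent on H′_t while the attack-network parameters stay fixed; the learning rate here is an assumed hyper-parameter, not stated in the claim:

```python
import numpy as np

def sgd_update(h_prev, grad, lr=0.1):
    # H''_t = H'_t - lr * dL/dH'_t; only the semantic weight gradient matrix
    # moves, the motion/quantization/screening modules are frozen
    return h_prev - lr * grad

h_t = np.ones((2, 2))          # H'_t initialized to all ones
grad = np.full((2, 2), 0.5)    # gradient from back-propagation in (5-2)
h_new = sgd_update(h_t, grad)  # optimized gradient H''_t
assert np.allclose(h_new, 0.95)
```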
(5-4) from the optimized semantic weight gradient H″_t, obtaining the initial adversarial semantic features according to step (4), where the superscript n denotes the n-th optimization iteration;
(5-5) retaining the initial adversarial semantic features obtained at each iteration to form the initial adversarial semantic feature set, where N denotes the total number of optimization iterations;
(5-6) inputting the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature set in turn into the constraint function to obtain the optimized adversarial semantic feature set, whose elements are the optimized adversarial semantic features corresponding to each video frame X_t, where ||·||_p is the L_p norm, p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
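The constraint function in (5-6) keeps the adversarial features within an L_p ball of radius ε around the original features; this sketch assumes the usual clip-for-L∞ / rescale-for-L2 projection form, which the claim's norm bound admits but does not spell out:

```python
import numpy as np

def constrain(z_adv, z_orig, p, eps):
    # project the perturbation onto the L_p ball of radius eps (p in {2, inf})
    delta = z_adv - z_orig
    if np.isinf(p):
        delta = np.clip(delta, -eps, eps)    # element-wise clip for L_inf
    else:
        n = np.linalg.norm(delta)
        if n > eps:
            delta = delta * (eps / n)        # rescale onto the L_2 sphere
    return z_orig + delta

z_t = np.zeros((4, 4))                       # original features Z_t
z_con = constrain(z_t + 1.0, z_t, np.inf, 8 / 255)
assert np.abs(z_con - z_t).max() <= 8 / 255 + 1e-12
```

The two (p, ε) pairs in the claim then correspond to calling `constrain(..., 2, 128/255)` or `constrain(..., np.inf, 8/255)`.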
5. The video target segmentation anti-attack method based on discrete cosine transform as claimed in claim 4, wherein step (6) is specifically: the optimized adversarial semantic feature set is input into Φ_l+(·), the network structure after the intermediate layer of the pre-trained video target segmentation model, to carry out the adversarial attack and output the final attacked video target segmentation result Y′_t, where Y′_t is the segmentation result corresponding to the t-th video frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210481562.7A CN114821432B (en) | 2022-05-05 | 2022-05-05 | Video target segmentation anti-attack method based on discrete cosine transform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114821432A CN114821432A (en) | 2022-07-29 |
CN114821432B true CN114821432B (en) | 2022-12-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||