CN114821432B - Video object segmentation adversarial attack method based on discrete cosine transform - Google Patents

Video object segmentation adversarial attack method based on discrete cosine transform

Info

Publication number
CN114821432B
CN114821432B
Authority
CN
China
Prior art keywords
video
semantic
discrete cosine
obtaining
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210481562.7A
Other languages
Chinese (zh)
Other versions
CN114821432A (en)
Inventor
潘震
李平
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210481562.7A
Publication of CN114821432A
Application granted
Publication of CN114821432B
Legal status: Active

Classifications

    • G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T7/246 — Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/269 — Image analysis; Analysis of motion using gradient-based methods
    • G06T2207/10016 — Image acquisition modality; Video; Image sequence
    • G06T2207/20081 — Special algorithmic details; Training; Learning
    • G06T2207/20084 — Special algorithmic details; Artificial neural networks [ANN]

Abstract

The invention discloses an adversarial attack method for video object segmentation based on the discrete cosine transform. The method obtains video semantic features at a convolution layer of a pre-trained video object segmentation model and converts them into frequency-domain semantic features through the discrete cosine transform; it obtains motion vectors through a video object motion perception module and semantic weights through a semantic weight quantization module; finally, it screens out frequency-domain values of the frequency-domain semantic features according to the semantic weights and restores the features by the inverse discrete cosine transform to obtain adversarial semantic features, so that removing semantic features attacks the video object segmentation model, i.e. yields a segmentation mask of poor quality. The disclosed method fuses the temporal relationship of the video into the adversarial attack, so that the attack focuses on the moving objects in the video and destroys that temporal relationship; through the semantic weights and the screening of video frame semantic features, the adversarial attack is realized by generating adversarial examples on the video semantic features, which lowers the video object segmentation accuracy and improves the attack effect.

Description

Video object segmentation adversarial attack method based on discrete cosine transform
Technical Field
The invention belongs to the field of computer vision, in particular to the fields of adversarial learning and video object segmentation, and relates to an adversarial attack method for video object segmentation based on the discrete cosine transform.
Background
Video object segmentation is a commonly used video processing technique that accurately separates an object in a video from the background and produces a pixel-level mask of the object (a matrix with the same resolution as the video frame, whose elements equal 1 in the object region and 0 in the background region). Video object segmentation not only serves general video processing tasks but is also applied in frontier fields such as autonomous driving, video surveillance, human-computer interaction and virtual reality. In recent years, various deep-learning-based neural network models have been proposed for the video object segmentation task. However, many studies have shown that deep neural network models are not robust and are vulnerable to adversarial attacks: an adversarial example is generated by adding an imperceptible perturbation (pixel values of small magnitude) to an image or video, and this adversarial example is then fed to the deep neural network model to deceive it into producing erroneous output. Because existing video object segmentation models are usually built on deep neural networks, they are not robust to adversarial examples, i.e. they output masks with low segmentation quality for perturbed videos. This is fatal to practical applications of video object segmentation (such as autonomous driving), so the security of video object segmentation has profound research significance and great research value.
At present, research on adversarial attacks mainly focuses on image classification, while attacks on video are still at an early stage, with only a small amount of work targeting video classification and semantic segmentation. In image classification, most attack methods exploit the gradient of the input image (a tensor with the same dimensions as the input image). The Fast Gradient Sign Method (FGSM), for example, takes the sign of each element of the gradient tensor, multiplies it by a small perturbation coefficient to obtain an imperceptible perturbation, and adds the perturbation to the image so that the classifier misclassifies it; later work applies various processing steps (e.g. projection, convolution) to the image gradient to strengthen the attack. In adversarial attacks on video classification, some methods borrow the idea of image classification attacks, processing the gradient of the video to generate a perturbation that, once added to the video, makes the classifier assign it to a wrong class; other methods exploit the characteristics of video classification data by sampling videos of different categories, using the gradients of the sampled subset in place of the gradient of the whole video, and generating a universal perturbation from them to build adversarial video examples. In attack methods for semantic segmentation, the attack algorithm optimizes a loss function over a set of pixels/proposed targets to generate an adversarial perturbation aimed at confusing the proposed targets as much as possible, so that the semantic segmentation model mispredicts the classes of multiple proposed targets in the input image.
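By way of illustration only, the gradient-sign idea described above can be sketched in Python as follows; this is a generic FGSM-style example rather than the method of the invention, and the model interface, tensor shapes and perturbation size are assumptions made for the illustration.

import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=2.0 / 255):
    # One-step sign-gradient perturbation (FGSM-style sketch).
    # image: (1, 3, H, W) tensor in [0, 1]; label: (1, H, W) class indices (assumed shapes).
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)                    # (1, num_classes, H, W) assumed
    loss = F.cross_entropy(logits, label)    # loss the attacker wants to increase
    loss.backward()
    adv = image + eps * image.grad.sign()    # sign of the gradient, scaled by a small coefficient
    return adv.clamp(0.0, 1.0).detach()      # keep the perturbed input a valid image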
The shortcomings of these methods are mainly reflected in two aspects: (1) because of the differences between images and videos, attack methods designed for images do not necessarily transfer to videos, and existing video attack methods do not consider the association between consecutive frames or the temporal relationship of the video; (2) existing adversarial attack methods all add perturbations to the image/video so that the model produces wrong outputs on the perturbed data; such attacks are easily neutralized by adversarial defense techniques such as denoising and have difficulty mounting an effective attack. In view of these considerations, there is an urgent need for a video object segmentation attack method that incorporates the temporal relationship of the video and has strong, generalizable attack capability.
Disclosure of Invention
The aim of the invention is to provide an adversarial attack method for video object segmentation based on the discrete cosine transform that addresses the shortcomings of the prior art. The method extracts optical flow from the video and obtains motion vectors by constructing a video object motion perception module, which provides temporal features to the attack algorithm and makes it focus on the moving objects in the video; at the same time, a semantic weight quantization module and a semantic discrete cosine screening module are constructed, used respectively to capture semantic weights and to screen out video frame semantic features. The attack can break through denoising-based adversarial defenses, thereby attacking the video object segmentation model and lowering the segmentation accuracy of the original model.
The method first obtains a video data set, pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then performs the following operations:
Step (1): uniformly sample the video to obtain a video frame sequence {X_t}_{t=1}^T, and input it into the pre-trained video object segmentation model to obtain the original video frame semantic features Z_t.
Step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t.
Step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t.
Step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t.
Step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T.
Step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers.
Further, the step (1) is specifically:
(1-1) uniformly sample the video at 5-10 frames per second to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
(1-2) input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network (e.g. ResNet), and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up. Here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer, and the whole pre-trained video object segmentation model is denoted Φ(·).
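By way of illustration only, the following Python sketch shows how middle-layer semantic features can be captured with a forward hook; a torchvision ResNet-50 and its layer2 output stand in for the pre-trained segmentation model Φ and its l-th (middle) convolution layer, both illustrative assumptions rather than the actual model of the invention.

import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None).eval()   # load pre-trained weights in practice

features = {}
def hook(module, inputs, output):
    features["Z"] = output.detach()          # semantic features Z_t at the hooked layer

# "middle layer": layer2 is an illustrative stand-in for the l-th convolution layer
backbone.layer2.register_forward_hook(hook)

frames = torch.rand(8, 3, 480, 854)          # T = 8 uniformly sampled frames X_t
with torch.no_grad():
    backbone(frames)
Z = features["Z"]                            # shape (T, C', H', W')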
Still further, the step (2) is specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
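By way of illustration only, steps (2-2) to (2-4) can be sketched in Python as follows; the FlowNet optical flows M_t are replaced by random tensors and the sizes H, W, H′, W′ and C′ are illustrative assumptions, so only the down-sampling, 1×1 convolution and motion function are shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

T, H, W = 8, 480, 854            # number of frames and frame size (illustrative)
Hp, Wp, Cp = 60, 107, 512        # H', W', C' of the semantic features (illustrative)

M = torch.rand(T, 2, H, W)       # stand-in for the FlowNet optical flows M_t (2 channels)

# (2-2) Interpolate: H×W×2 -> H'×W'×2
M_down = F.interpolate(M, size=(Hp, Wp), mode="bilinear", align_corners=False)

# (2-3) Conv2D: 2 input channels -> C' output channels, 1×1 kernel
conv = nn.Conv2d(in_channels=2, out_channels=Cp, kernel_size=1)
M_multi = conv(M_down)                       # multi-channel optical flows M''_t, shape (T, C', H', W')

# (2-4) Motion(O_t, M''_t) = Sigmoid(O_t ⊙ M''_t)
O = torch.randn(T, Cp, Hp, Wp)               # randomly initialised motion vectors O_t
O_prime = torch.sigmoid(O * M_multi)         # motion vectors O'_t, values in (0, 1)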
Still further, the step (3) is specifically: the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t. Here α is a disturbance coefficient whose magnitude is set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables.
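The exact quantization formula is given only as a figure in the source; by way of illustration, the Python sketch below assumes one plausible reading in which the all-ones gradient matrix H′_t is nudged by the sign of the back-propagated cross-entropy gradient scaled by α, modulated by the motion vector O′_t, and normalised with Softmax. This functional form is an assumption, not the formula of the invention.

import torch
import torch.nn.functional as F

def semantic_weight(H_grad, O_prime, ce_grad, alpha=2.0 / 255):
    # All tensors have shape (C', H', W'); ce_grad is the back-propagated gradient of the
    # cross-entropy loss with respect to H'_t (assumed input).
    H_updated = H_grad + alpha * ce_grad.sign()        # assumed update giving H''_t
    scores = (H_updated * O_prime).flatten()           # motion-modulated scores
    return F.softmax(scores, dim=0).view_as(H_grad)    # semantic weights Q_t, summing to 1

H_grad = torch.ones(512, 60, 107)        # all-ones initialisation H'_t (illustrative size)
O_prime = torch.rand_like(H_grad)        # motion vector from step (2)
ce_grad = torch.randn_like(H_grad)       # placeholder for the back-propagated gradient
Q = semantic_weight(H_grad, O_prime, ce_grad)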
Further, the step (4) is specifically:
(4-1) construct the semantic discrete cosine screening module, which consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function.
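By way of illustration only, step (4) can be sketched with the multidimensional DCT/IDCT of scipy.fft standing in for Cosine(·)/InverseCosine(·); because the threshold function is given only as a figure, the sketch assumes it zeroes the components whose semantic weight exceeds β, which is an assumption about the keep/drop direction.

import numpy as np
from scipy.fft import dctn, idctn

def screen_semantic_features(Z_t, Q_t, beta):
    # Z_t: semantic features (C', H', W'); Q_t: semantic weights of the same shape.
    Z_freq = dctn(Z_t, norm="ortho")                 # Z'_t = Cosine(Z_t)
    S_t = (Q_t <= beta).astype(Z_freq.dtype)         # screening matrix: drop high-weight entries (assumed)
    Z_screened = S_t * Z_freq                        # element-by-element product
    return idctn(Z_screened, norm="ortho")           # adversarial features via InverseCosine

Z_t = np.random.rand(512, 60, 107).astype(np.float32)   # illustrative feature tensor
Q_t = np.random.rand(512, 60, 107).astype(np.float32)
Q_t /= Q_t.sum()                                         # roughly normalised weights
Z_adv = screen_semantic_features(Z_t, Q_t, beta=1.0 / Q_t.size)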
Still further, the step (5) is specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
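By way of illustration only, the iterative optimization of step (5) can be sketched as follows; build_adv stands for the step-(4) pipeline mapping the weight-gradient matrix to adversarial features, phi_post for the layers Φ_{l+} after the middle layer, and the constraint is realised here as an L_∞ clipping of the feature perturbation to radius ε; all names and the exact constraint form are illustrative assumptions.

import torch
import torch.nn.functional as F

def optimise_adversarial_features(phi_post, Z_t, Y_t, build_adv, H_grad,
                                  n_iters=10, lr=0.1, eps=8.0 / 255):
    # Z_t: original features (C', H', W'); Y_t: true mask (1, H, W) class indices.
    # phi_post and build_adv are assumed callables, not APIs defined by the invention.
    H_grad = H_grad.clone().requires_grad_(True)
    optimiser = torch.optim.SGD([H_grad], lr=lr)       # stochastic gradient descent on H'_t
    for _ in range(n_iters):
        Z_adv = build_adv(H_grad)                      # initial adversarial features (step (4))
        loss = -F.cross_entropy(phi_post(Z_adv.unsqueeze(0)), Y_t)   # maximise the cross-entropy
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    with torch.no_grad():
        Z_adv = build_adv(H_grad)
        delta = (Z_adv - Z_t).clamp(-eps, eps)         # constrain ‖Ẑ_t − Z_t‖_∞ ≤ ε
    return Z_t + delta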
Continuing further, the step (6) is specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
The invention provides an adversarial attack method for video object segmentation based on the discrete cosine transform, with the following characteristics: 1) a semantic attack network is designed for video data and the video object segmentation task, and a video object motion perception module is proposed so that the semantic attack network focuses on the moving objects in the video; 2) a semantic weight quantization module is proposed, assigning semantic weights to the video frame semantic features to distinguish the importance of different semantic features; 3) a semantic discrete cosine screening module is proposed, which screens out part of the video semantic features according to the semantic weights to obtain adversarial semantic features, and the optimized adversarial semantic features are obtained by iterative optimization against the output of the video object segmentation model, ensuring the effectiveness of the adversarial attack.
The invention is suited to adversarial attacks on video object segmentation models and has the following advantages: 1) through the video object motion perception module, the semantic attack network attends to the moving objects in the video, destroys the temporal continuity between video frames, and strengthens the attack on the video object segmentation model; 2) the semantic weights obtained by the semantic weight quantization module distinguish the video frame semantic features, so that screening out semantic features realizes the adversarial attack and improves its effectiveness; 3) starting from the video data, adversarial examples are generated by iteratively optimizing and screening out part of the semantic features, which can break through denoising-based adversarial defenses and improves the generalization capability of the attack.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the adversarial attack method for video object segmentation based on the discrete cosine transform proceeds as follows: first, the video data is uniformly sampled to obtain video frames, the semantic features of the video frames are obtained at a middle convolution layer of the pre-trained video object segmentation model, and the video semantic features are converted into frequency-domain features through the discrete cosine transform; then a video object motion perception module is constructed to output motion vectors; next, a semantic weight quantization module is constructed to output semantic weights; then a semantic discrete cosine screening module is constructed, which takes the semantic weights as input, screens out part of the frequency-domain semantic features, and restores them through the inverse discrete cosine transform to obtain adversarial semantic features; the adversarial semantic features are optimized through a cross-entropy loss function to obtain optimized adversarial semantic features; finally, the optimized adversarial semantic features are input into the subsequent convolution layers of the pre-trained video object segmentation model to obtain the attacked video object segmentation results. The method uses the video object motion perception module to capture the temporal information of the video and fuse it into the attack, so that the attack algorithm focuses on the moving objects in the video and further destroys the temporal relationship; meanwhile, the semantic weight quantization module and the semantic discrete cosine screening module capture and screen the semantic features, which can break through denoising-based adversarial defenses, so that the output segmentation results have low accuracy, achieving an adversarial attack on video object segmentation.
The method first obtains a video data set, pixel-level object class matrices (masks) and a pre-trained video object segmentation model, and then performs the following operations:
Step (1): uniformly sample the video to obtain a video frame sequence {X_t}_{t=1}^T, and input it into the pre-trained video object segmentation model to obtain the original video frame semantic features Z_t; specifically:
(1-1) uniformly sample the video at 5-10 frames per second to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
(1-2) input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network (e.g. ResNet), and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up. Here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer, and the whole pre-trained video object segmentation model is denoted Φ(·).
Step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t; specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
Step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t; specifically: the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t. Here α is a disturbance coefficient whose magnitude is set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables.
Step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t; specifically:
(4-1) construct the semantic discrete cosine screening module, which consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function.
Step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T; specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
Step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers; specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
The embodiment described here is only an example of the implementation of the inventive concept; the protection scope of the invention should not be considered limited to the specific form set forth in the embodiment, but also covers equivalent technical means that persons skilled in the art can conceive according to the inventive concept.

Claims (5)

1. A discrete-cosine-transform-based adversarial attack method for video object segmentation, characterized in that a video data set, pixel-level object class matrices and a pre-trained video object segmentation model are first obtained, and then the following operations are performed:
step (1): uniformly sample the video to obtain T video frames, giving the video frame sequence {X_t}_{t=1}^T and the true mask sequence {Y_t}_{t=1}^T, where X_t ∈ ℝ^{H×W×3} denotes the t-th video frame, Y_t the true mask corresponding to the t-th video frame, T the number of video frames, ℝ the real number domain, H and W the height and width of a video frame, and 3 the number of RGB channels;
input each video frame X_t of the sequence {X_t}_{t=1}^T in turn into the pre-trained video object segmentation model, which consists of a residual convolutional neural network, and obtain the corresponding original video frame semantic features Z_t = Φ_l(X_t) ∈ ℝ^{H′×W′×C′} at the middle layer of the model; the middle layer is the l-th convolution layer, with l equal to half the total number of layers of the model, rounded up; here H′, W′ and C′ are the height, width and number of channels of the video frame semantic features, and Φ_l(·) denotes the network structure of the pre-trained video object segmentation model before the l-th convolution layer;
step (2): construct a video object motion perception module that takes the video frame sequence {X_t}_{t=1}^T as input and outputs motion vectors O′_t;
step (3): construct a semantic weight quantization module that takes an initialized semantic weight gradient tensor and the motion vectors as input and outputs semantic weights Q_t;
the semantic weight quantization module consists of a semantic weight quantization function; initialize an all-ones semantic weight gradient matrix H′_t and a semantic weight matrix, and input them together with the motion vector O′_t into the semantic weight quantization function to obtain the semantic weight Q_t, where α is a disturbance coefficient set to 2.0/255, Φ(X_t) is the prediction mask of the t-th frame produced by the pre-trained video object segmentation model, H″_t is the updated semantic weight gradient matrix, L_CE(·,·) denotes the cross-entropy loss function, ⊙ is the element-by-element product, and Softmax(·) is the Softmax function, whose role is to normalize variables;
step (4): construct a semantic discrete cosine screening module that takes the semantic weights Q_t and the video frame semantic features Z_t as input and outputs adversarial semantic features Z̃_t;
(4-1) the semantic discrete cosine screening module consists of a discrete cosine transform function, an inverse discrete cosine transform function and a threshold function; input the original video frame semantic features Z_t of the 1st to T-th frames in turn into the discrete cosine transform function to obtain the frequency-domain semantic features Z′_t = Cosine(Z_t), where Cosine(·) denotes the discrete cosine transform function;
(4-2) input each element q_k of the semantic weight Q_t in turn into the threshold function to obtain the semantic screening matrix S_t, where k indexes the elements of the semantic weight Q_t and β is a threshold coefficient greater than 0;
(4-3) take the element-by-element product of the semantic screening matrix S_t and the frequency-domain semantic features Z′_t to obtain the screened frequency-domain semantic features Z̄_t = S_t ⊙ Z′_t;
(4-4) input the screened frequency-domain semantic features Z̄_t into the inverse discrete cosine transform function to obtain the adversarial semantic features Z̃_t = InverseCosine(Z̄_t), where InverseCosine(·) denotes the inverse discrete cosine transform function;
step (5): fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and iteratively optimize the adversarial semantic features with a cross-entropy loss function to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T;
step (6): input the optimized adversarial semantic feature set into the layer following the middle layer of the video object segmentation model, and obtain the attacked video object segmentation results through the subsequent network layers.
2. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 1, wherein in step (1) the video is uniformly sampled at 5-10 frames per second.
3. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 1 or 2, wherein the step (2) is specifically:
(2-1) the video object motion perception module consists of a FlowNet module, a two-dimensional convolution layer and a motion function, where the FlowNet module is an optical-flow extraction network composed of several convolution layers; input the video frame sequence {X_t}_{t=1}^T into FlowNet to obtain the set of optical flows between all pairs of adjacent frames {M_t}_{t=1}^T, where M_t denotes the optical flow between the t-th and (t+1)-th frames; when t = T, M_T is initialized to all zeros;
(2-2) input the optical flow set {M_t}_{t=1}^T into a down-sampling function and down-sample each optical flow, M′_t = Interpolate(M_t), to obtain the down-sampled optical flow set {M′_t}_{t=1}^T, where M′_t denotes the down-sampled optical flow of the t-th and (t+1)-th frames and Interpolate(·) is a down-sampling function that changes the dimensions of M_t from H×W×2 to H′×W′×2;
(2-3) input the down-sampled optical flow set {M′_t}_{t=1}^T into a two-dimensional convolution and convolve each down-sampled optical flow, M″_t = Conv2D(M′_t), to obtain the multi-channel optical flow set {M″_t}_{t=1}^T, where M″_t denotes the multi-channel optical flow of the t-th and (t+1)-th frames and Conv2D(·) is a two-dimensional convolution with 2 input channels, C′ output channels and a 1×1 convolution kernel;
(2-4) randomly initialize a motion vector set {O_t}_{t=1}^T, where O_t is the randomly initialized motion vector corresponding to the t-th frame X_t; input each randomly initialized motion vector O_t and the corresponding multi-channel optical flow M″_t in turn into the motion function Motion(O_t, M″_t) = Sigmoid(O_t ⊙ M″_t) to obtain the motion vector O′_t, where ⊙ is the element-by-element product and Sigmoid(·) is the sigmoid activation function, which maps variables to between 0 and 1.
4. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 3, wherein the step (5) is specifically:
(5-1) input the adversarial semantic features Z̃_t into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to obtain the prediction mask Ỹ_t = Φ_{l+}(Z̃_t); the middle layer is the l-th convolution layer;
(5-2) compute the cross-entropy loss L_CE(Ỹ_t, Y_t) between the prediction mask Ỹ_t and the true mask Y_t of video frame X_t, and obtain the gradient of the semantic weight by back-propagation;
(5-3) fix the semantic attack network parameters consisting of the video object motion perception module, the semantic weight quantization module and the semantic discrete cosine screening module, and update the semantic weight gradient matrix H′_t by stochastic gradient descent to obtain the optimized semantic weight gradient H″_t;
(5-4) from the optimized semantic weight gradient H″_t, obtain the initial adversarial semantic features Z̃_t^(n) according to step (4), where the superscript n denotes the n-th iteration of optimization;
(5-5) retain the initial adversarial semantic features Z̃_t^(n) obtained at each iteration to form the initial adversarial semantic feature set {Z̃_t^(n)}_{n=1}^N, where N is the total number of optimization iterations;
(5-6) input the original semantic features Z_t corresponding to the 1st to T-th video frames X_t and the corresponding initial adversarial semantic feature sets {Z̃_t^(n)}_{n=1}^N in turn into the constraint function, which constrains the deviation of the adversarial semantic features from Z_t under the L_p norm, to obtain the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T, where Ẑ_t is the optimized adversarial semantic feature corresponding to each video frame X_t, ‖·‖_p is the L_p norm with p ∈ {2, ∞}, and ε ∈ {128/255, 8/255} is the threshold constraining the L_p norm.
5. The discrete-cosine-transform-based adversarial attack method for video object segmentation as claimed in claim 4, wherein the step (6) is specifically: input the optimized adversarial semantic feature set {Ẑ_t}_{t=1}^T into the network structure after the middle layer of the pre-trained video object segmentation model, Φ_{l+}(·), to carry out the adversarial attack, and output the final attacked video object segmentation results {Y′_t}_{t=1}^T, where Y′_t is the segmentation result corresponding to the t-th video frame.
CN202210481562.7A 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform Active CN114821432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481562.7A CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210481562.7A CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Publications (2)

Publication Number Publication Date
CN114821432A CN114821432A (en) 2022-07-29
CN114821432B true CN114821432B (en) 2022-12-02

Family

ID=82510542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210481562.7A Active CN114821432B (en) 2022-05-05 2022-05-05 Video target segmentation anti-attack method based on discrete cosine transform

Country Status (1)

Country Link
CN (1) CN114821432B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311521B (en) * 2022-09-13 2023-04-28 中南大学 Black box video countermeasure sample generation method and evaluation method based on reinforcement learning
CN116308978B (en) * 2022-12-08 2024-01-23 北京瑞莱智慧科技有限公司 Video processing method, related device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301019A (en) * 1992-09-17 1994-04-05 Zenith Electronics Corp. Data compression system having perceptually weighted motion vectors
CN1767653A (en) * 2005-11-08 2006-05-03 上海广电(集团)有限公司中央研究院 Bit rate control method
CN101668170A (en) * 2009-09-23 2010-03-10 中山大学 Digital television program copyright protecting method for resisting time synchronization attacks
CN104243974A (en) * 2014-09-12 2014-12-24 宁波大学 Stereoscopic video quality objective evaluation method based on three-dimensional discrete cosine transformation
CN113538457A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN114202017A (en) * 2021-11-29 2022-03-18 南京航空航天大学 SAR optical image mapping model lightweight method based on condition generation countermeasure network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6895055B2 (en) * 2001-10-29 2005-05-17 Koninklijke Philips Electronics N.V. Bit-rate guided frequency weighting matrix selection
CN105828064B (en) * 2015-01-07 2017-12-12 中国人民解放军理工大学 The local video quality evaluation without reference method with global space-time characterisation of fusion
CN112927202B (en) * 2021-02-25 2022-06-03 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301019A (en) * 1992-09-17 1994-04-05 Zenith Electronics Corp. Data compression system having perceptually weighted motion vectors
CN1767653A (en) * 2005-11-08 2006-05-03 上海广电(集团)有限公司中央研究院 Bit rate control method
CN101668170A (en) * 2009-09-23 2010-03-10 中山大学 Digital television program copyright protecting method for resisting time synchronization attacks
CN104243974A (en) * 2014-09-12 2014-12-24 宁波大学 Stereoscopic video quality objective evaluation method based on three-dimensional discrete cosine transformation
CN113538457A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN114202017A (en) * 2021-11-29 2022-03-18 南京航空航天大学 SAR optical image mapping model lightweight method based on condition generation countermeasure network

Also Published As

Publication number Publication date
CN114821432A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
Wei et al. 3-D quasi-recurrent neural network for hyperspectral image denoising
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN114821432B (en) Video target segmentation anti-attack method based on discrete cosine transform
Lin et al. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation
JP6656111B2 (en) Method and system for removing image noise
Fan et al. Low-level structure feature extraction for image processing via stacked sparse denoising autoencoder
CN113191489B (en) Training method of binary neural network model, image processing method and device
Mirmozaffari Filtering in image processing
CN113379618B (en) Optical remote sensing image cloud removing method based on residual dense connection and feature fusion
Wang et al. PFDN: Pyramid feature decoupling network for single image deraining
Song et al. Multistage curvature-guided network for progressive single image reflection removal
Mana et al. An intelligent deep learning enabled marine fish species detection and classification model
Gökcen et al. Real-time impulse noise removal
CN112308087B (en) Integrated imaging identification method based on dynamic vision sensor
KR102095444B1 (en) Method and Apparatus for Removing gain Linearity Noise Based on Deep Learning
Schirrmacher et al. Sr 2: Super-resolution with structure-aware reconstruction
CN111401155B (en) Image recognition method of residual error neural network based on implicit Euler jump connection
Li et al. Distribution-transformed network for impulse noise removal
Wang et al. A Denoising Network Based on Frequency-Spectral-Spatial-Feature for Hyperspectral Image
Kunapuli et al. Enhanced Medical Image De-noising Using Auto Encoders and MLP
Doulamis Vision based fall detector exploiting deep learning
Wang et al. Deep hyperspectral and multispectral image fusion with inter-image variability
Zhao et al. Cascaded residual density network for crowd counting
Ning et al. The Importance of Anti-Aliasing in Tiny Object Detection
Antony et al. T2FRF Filter: An Effective Algorithm for the Restoration of Fingerprint Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant