CN115147412A - Long time sequence network for memory transfer and video shadow detection method - Google Patents
- Publication number
- CN115147412A CN115147412A CN202211051584.6A CN202211051584A CN115147412A CN 115147412 A CN115147412 A CN 115147412A CN 202211051584 A CN202211051584 A CN 202211051584A CN 115147412 A CN115147412 A CN 115147412A
- Authority
- CN
- China
- Prior art keywords
- video
- shadow detection
- network
- shadow
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06N3/08 — Neural networks; learning methods
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016 — Image acquisition modality: video; image sequence
- G06T2207/20081 — Special algorithmic details: training; learning
- G06T2207/20084 — Special algorithmic details: artificial neural networks [ANN]
- G06T2207/20221 — Image combination: image fusion; image merging
Abstract
The invention discloses a memory-transfer long time-sequence network and a video shadow detection method. The method leverages existing annotated image shadow detection datasets to generate high-quality video shadow detection pseudo-labels as additional supervision, alleviating the dependence of learning-based video shadow detection on annotated video data. Furthermore, by explicitly storing the historical information of the video, it efficiently produces shadow detection results with long-range temporal consistency. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalizing to practical applications, and their difficulty in maintaining long-range temporal consistency, achieving robust shadow detection on video data.
Description
Technical Field
The invention belongs to the field of dynamic video illumination identification, and in particular relates to a memory-transfer long time-sequence network and a semi-supervised video shadow detection method based on this network.
Background
Currently, commonly used video shadow detection methods fall into two main categories: 1. Traditional methods based on physical models, such as the video shadow detection method proposed in the paper "Shadow detection algorithms for traffic flow analysis: a comparative study", which works in the HSV color space and distinguishes moving shadows from the background by comparing luminance at the same saturation and hue. Such methods depend on low-dimensional features handcrafted by experts from experience, and only perform well in strongly constrained scenes (e.g., stable illumination and a single moving object). 2. Deep-learning-based methods, which, unlike traditional physical-model-based methods, rely on the strong semantic representation capability of deep learning to adaptively select features for judging video shadows; for example, the method proposed in the paper "Triple-cooperative video shadow detection" uses three parallel networks to cooperatively learn discriminative feature representations at the intra-video and inter-video levels to detect video shadows. Although such methods have made progress on the video shadow detection task, they are limited by the scale of available datasets, are difficult to generalize effectively to practical application scenarios, and struggle to meet practical requirements. The existing video shadow detection technology still lacks a method with strong generalization capability that can meet user needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a memory-transfer long time-sequence network and a semi-supervised video shadow detection method based on this network, aiming to solve the problems that current video shadow detection methods are limited by dataset scale, have insufficient generalization capability, and are difficult to generalize effectively to practical application scenarios.
The invention provides a memory-transfer long time-sequence network, characterized in that: the network uses a memory mechanism, guiding shadow detection in the next video frame by storing and transferring the shadow information of all previous video frames, and generates video shadow detection results with long-range temporal consistency in a lightweight manner. The network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module, and a multi-scale video shadow detection decoder module;
the weight sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on the extracted features according to the position bias;
and the multi-scale video shadow detection decoder module performs video shadow detection using a multi-scale mechanism, based on the aligned video frame features and the historical information stored in the memory transfer module.
Further, the memory transfer module uses a fixed-size memory matrix to store all historical shadow information accumulated during video shadow detection, and dynamically transfers and updates the stored shadow information using a deformable convolution and gating mechanism, which comprises one deformable convolution layer and three convolution layers.
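A gated update of a fixed-size memory matrix can be sketched roughly as follows. This is a minimal NumPy illustration, not the patent's implementation: the convolution layers are stood in for by learned linear maps, and all names and shapes are assumptions.

```python
import numpy as np

def gated_memory_update(memory, frame_feat, W_z, W_r, W_h):
    """Sketch of a gated update for a fixed-size memory matrix.

    memory:     (N, C) stored historical shadow features
    frame_feat: (N, C) aligned features of the current frame
    W_z, W_r, W_h: (2C, C) learned projections standing in for conv layers
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.concatenate([memory, frame_feat], axis=1)   # (N, 2C)
    z = sigmoid(x @ W_z)          # update gate: how much new info to write
    r = sigmoid(x @ W_r)          # reset gate: how much history to expose
    x_r = np.concatenate([r * memory, frame_feat], axis=1)
    h = np.tanh(x_r @ W_h)        # candidate memory content
    return (1.0 - z) * memory + z * h   # selectively store / update

rng = np.random.default_rng(0)
N, C = 25, 8                      # toy stand-in for the 25 x 256 matrix
mem = rng.standard_normal((N, C))
feat = rng.standard_normal((N, C))
Ws = [rng.standard_normal((2 * C, C)) * 0.1 for _ in range(3)]
new_mem = gated_memory_update(mem, feat, *Ws)
print(new_mem.shape)
```

The interpolation `(1 - z) * memory + z * h` is what makes the storage selective: pixels with a near-zero gate keep their history untouched.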
Furthermore, the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers, and one Dropout layer; it predicts video shadow detection results at different scales using a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
Based on the same inventive concept, the invention also provides a semi-supervised video shadow detection method using the memory-transfer long time-sequence network, which consists of an image-assisted video shadow detection pseudo-label generation method and an uncertainty-guided semi-supervised video shadow detection method. The pseudo-label generation uses a spatio-temporal alignment network based on deformable convolution: a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of adjacent frames with video frame information.
The semi-supervised video shadow detection method based on the memory-transfer long time-sequence network comprises the following steps:
step S1: the image-assisted video shadow detection pseudo label generation method provided by the invention is used for generating a video shadow detection pseudo label for the video data without annotation;
step S2: estimating a pixel-level uncertainty map for the video shadow detection pseudo-labels generated in step S1 using an uncertainty estimation method, so as to evaluate the accuracy of the generated pseudo-labels and guide model training;
step S3: using the video shadow detection pseudo-labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time-sequence network, with supervised training on the labeled ViSha dataset and semi-supervised training on the unlabeled video data;
step S4: performing video shadow detection with the model trained in step S3.
The spatio-temporal alignment network based on deformable convolution comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module, and a pixel-level shadow detection decoder module;
the weight sharing feature encoder is used for extracting features of the input video frame and the corresponding rough shadow detection result;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
Further, the motion-aware recursive deformable convolution module uses optical flow between adjacent video frames as a guide for prediction of position bias between adjacent frames, including three convolution layers for predicting the position bias existing between current adjacent video frame features and three deformable convolution layers for spatio-temporally aligning video frame features according to the predicted position bias.
Further, the pixel-level shadow detection decoder includes three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer.
Using the spatio-temporal alignment network based on deformable convolution, the image-assisted video shadow detection pseudo-label generation method comprises the following steps:
S11: train an existing single-image shadow detection network, such as BDRAR, using the SBU dataset;
s12, processing the ViSha data set and the additional unmarked video data frame by using the single image shadow detection network trained in the step S11 to generate a rough video shadow detection pseudo label;
s13, using the rough pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set to train the spatiotemporal alignment network based on deformable convolution provided by the invention in a supervised manner;
and S14, generating a video shadow detection pseudo label by using the network trained in the step S13 and the rough pseudo label of the label-free video data obtained in the step S12.
Step S13 trains the spatio-temporal alignment network based on deformable convolution provided by the invention using the coarse pseudo-labels obtained in step S12 together with the ViSha dataset; the specific process is as follows:
S3-1: concatenate the coarse pseudo-labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the weight-shared feature encoder, and extract shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical flow prediction network GMA to respectively predict the optical flow from the (t-x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; take the predicted optical flow as the initial value of the position offset between adjacent frames, and recursively refine it in a residual manner to predict accurate position offset information. The specific calculation formula is:

ΔP^(0)_{t→t-x} = GMA(I_t, I_{t-x})
ΔP^(i)_{t→t-x} = ΔP^(i-1)_{t→t-x} + Conv( DCN(F_{t-x}, ΔP^(i-1)_{t→t-x}), F_t )

where ΔP^(i)_{t→t-x} represents the position offset between the t-th frame and the (t-x)-th frame obtained by the i-th recursion, GMA(·) represents the optical flow prediction network GMA used, and Conv(·) and DCN(·) represent a conventional convolutional neural network layer and a deformable convolutional layer, respectively;
S3-4: using the position offsets predicted in S3-3, perform feature alignment on the adjacent-frame features F_{t-x} with deformable convolution. The calculation formula is:

F̂_{t-x}(P_0) = Σ_{P_n ∈ R} ω(P_n) · F_{t-x}(P_0 + P_n + ΔP_n)

where R represents the sampling grid of a standard convolution kernel with radius r, ω(P_n) is the convolution kernel coefficient at location P_n, and F̂_{t-x} represents the feature map after alignment;
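The sampling rule above can be illustrated with a naive single-channel sketch. This is not the patent's implementation: real deformable convolution uses bilinear interpolation and multi-channel tensors, whereas this toy version uses nearest-neighbour sampling for brevity; all names are assumptions.

```python
import numpy as np

def deformable_conv2d_single(feat, weight, offsets):
    """Naive single-channel deformable convolution (illustrative only).

    feat:    (H, W) input feature map F_{t-x}
    weight:  (k, k) kernel coefficients omega(P_n)
    offsets: (H, W, k*k, 2) learned offsets Delta P_n per output location
    Samples feat at P_0 + P_n + Delta P_n, nearest-neighbour for brevity.
    """
    H, W = feat.shape
    k = weight.shape[0]
    r = k // 2
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for n, (dy, dx) in enumerate(grid):
                oy, ox = offsets[y, x, n]
                sy = int(np.clip(round(y + dy + oy), 0, H - 1))
                sx = int(np.clip(round(x + dx + ox), 0, W - 1))
                acc += weight[dy + r, dx + r] * feat[sy, sx]
            out[y, x] = acc
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
w = np.zeros((3, 3)); w[1, 1] = 1.0   # identity kernel
off = np.zeros((5, 5, 9, 2))          # zero offsets -> plain convolution
aligned = deformable_conv2d_single(feat, w, off)
print(np.allclose(aligned, feat))
```

With zero offsets the operation reduces to an ordinary convolution; nonzero offsets let each kernel tap follow the motion field, which is exactly how the predicted position bias warps the adjacent-frame features into alignment.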
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
S3-6: supervise the training of the spatio-temporal alignment network based on deformable convolution with the Binary Cross-Entropy loss L_bce. The specific calculation formula is:

L_bce = − Σ [ G · log(Ŷ) + (1 − G) · log(1 − Ŷ) ]

where Ŷ is the generated video shadow detection pseudo-label result and G is the manually annotated ground-truth video shadow detection label.
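The Binary Cross-Entropy loss used here is standard; a minimal NumPy version (with clipping added to avoid log(0), an implementation detail not stated in the source) looks like this:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between a predicted shadow mask and ground truth."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

pred = np.array([0.9, 0.1, 0.8, 0.2])   # predicted shadow probabilities
gt   = np.array([1.0, 0.0, 1.0, 0.0])   # binary ground-truth mask
loss = bce_loss(pred, gt)
print(round(loss, 4))
```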
Further, in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo-labels using the uncertainty estimation method MC-Dropout, and is used to evaluate pseudo-label accuracy and guide semi-supervised learning. The MC-Dropout method computes 10 forward passes through the network with its random Dropout layer active and takes the variance of the results as the uncertainty:

U(x, y) = (1/10) Σ_{i=1..10} ( p_i(x, y) − μ_{x,y} )²

where U(x, y) represents the uncertainty at pixel (x, y), p_i(x, y) is the prediction of the i-th forward pass, and μ_{x,y} represents the mean of the 10 forward propagation results.
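The variance-over-passes computation is simple once the stochastic forward passes are collected; a small NumPy sketch (the passes here are simulated with random noise, an assumption for illustration):

```python
import numpy as np

def mc_dropout_uncertainty(forward_passes):
    """Pixel-wise uncertainty as the variance of T stochastic forward passes.

    forward_passes: (T, H, W) shadow probabilities from T runs with the
    Dropout layer kept active at inference time (MC-Dropout).
    Returns (uncertainty_map, mean_prediction).
    """
    mu = forward_passes.mean(axis=0)                 # mu_{x,y}
    u = ((forward_passes - mu) ** 2).mean(axis=0)    # per-pixel variance
    return u, mu

rng = np.random.default_rng(1)
T, H, W = 10, 4, 4
# simulated passes: confident ~0.7 predictions with small dropout jitter
passes = np.clip(0.7 + 0.05 * rng.standard_normal((T, H, W)), 0, 1)
u_map, mu = mc_dropout_uncertainty(passes)
print(u_map.shape)
```

Pixels where the stochastic passes disagree get a large variance, flagging pseudo-label positions the semi-supervised loss should down-weight.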
Further, in step S3, the long-time-sequence network transferred by the memory performs supervised and semi-supervised training on the labeled ViSha data set and the unlabelled video data, respectively, and the specific process is as follows:
S31: input two adjacent video frames into the weight-shared feature encoder and the optical flow prediction network GMA, respectively, to extract multi-scale feature maps of the two adjacent frames and the optical flow between them;
S32: input the feature maps extracted in S31 and the optical flow into the motion-aware recursive deformable convolution module to predict the position bias between the video frames, and align the extracted features using this bias;
S33: according to the position bias predicted in S32, perform memory transfer on the historical shadow information stored in the memory matrix using deformable convolution, and use a gating mechanism to selectively store into and update the memory matrix according to the memory transfer result and the features of the current video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, and inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder to predict video shadow detection results on different scales;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module, and fusing the video shadow detection results to generate a final video shadow detection result;
S36: apply the supervised loss L_full on the labeled ViSha dataset D_L and the semi-supervised loss L_semi on the additional unlabeled dataset D_U to train the memory-transfer long time-sequence network in a supervised and semi-supervised manner, respectively. The specific calculation formulas are:

L_full = l_bce(O_L, G_L)
L_semi = Σ( (1 − Umap) · l_bce(O_U, Ŷ_U) ) / ( Σ(1 − Umap) + ε )

where l_bce is the Binary Cross-Entropy loss, O_L and G_L are respectively the output of the memory-transfer long time-sequence network on the labeled ViSha dataset D_L and the corresponding ground truth, Umap is the uncertainty map computed in step S2, O_U and Ŷ_U are respectively the network's output on the unlabeled data and the generated pseudo-label, and ε is used to prevent division by zero.
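The uncertainty-weighted semi-supervised term can be sketched as below. This is a NumPy illustration of the weighting scheme described in the text, not the patent's code; the exact form of the weighting is an assumption consistent with "uncertain pixels contribute less, ε prevents division by zero".

```python
import numpy as np

def bce_map(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy (no reduction)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def semi_supervised_loss(output_u, pseudo_label, umap, eps=1e-7):
    """Uncertainty-weighted BCE on unlabeled data: pixels whose pseudo-label
    is uncertain (high umap) contribute less; eps avoids division by zero."""
    w = 1.0 - umap                  # per-pixel confidence weight
    return float((w * bce_map(output_u, pseudo_label)).sum()
                 / (w.sum() + eps))

out    = np.array([[0.9, 0.2], [0.7, 0.4]])   # network output on unlabeled data
pseudo = np.array([[1.0, 0.0], [1.0, 0.0]])   # generated pseudo-label
umap   = np.array([[0.0, 0.0], [0.9, 0.9]])   # bottom row: uncertain pixels
loss = semi_supervised_loss(out, pseudo, umap)
print(loss > 0)
```

With a zero uncertainty map the loss degenerates to plain mean BCE, so the weighting only changes behaviour where the pseudo-labels are unreliable.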
The invention has the advantages that:
1. The memory-transfer long time-sequence network provided by the invention can, via its memory mechanism, generate video shadow detection results with long-range temporal consistency at a small cost in time and computation.
2. The spatio-temporal alignment network based on deformable convolution provided by the invention can explicitly align information between adjacent frames, so that the model integrates adjacent frames better and produces spatio-temporally consistent results.
3. The image-assisted video shadow detection pseudo-label generation method provided by the invention can exploit existing large-scale image shadow detection datasets to generate robust video shadow detection pseudo-labels despite the limited scale of video shadow detection datasets.
4. The uncertainty-guided semi-supervised video shadow detection method improves the generalization capability of the model with additional pseudo-labels, while the introduction of the uncertainty map makes the use of the pseudo-labels more accurate.
Drawings
Fig. 1 is a schematic diagram of the memory-transfer long time-sequence network according to an embodiment.
FIG. 2 is a schematic diagram of a spatio-temporal alignment network based on deformable convolution in an embodiment.
Fig. 3 is a schematic diagram of a semi-supervised video shadow detection method using a long-time-series network with memory transfer in an embodiment.
Detailed Description
For a further understanding of the present invention, its objects, technical solutions, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and embodiments. It is to be understood that the embodiments are illustrative only and not limiting.
The spatio-temporal alignment network based on deformable convolution for generating video shadow detection pseudo-labels comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module, and a pixel-level shadow detection decoder module:
In this embodiment, the convolutional part of ResNeXt101 is used as the feature encoder, providing four feature maps at 1/4, 1/8, 1/16, and 1/16 of the input image size;
The motion-aware recursive deformable convolution module recursively predicts the position bias between adjacent frames caused by motion, using the optical flow between adjacent frames as guidance, and spatio-temporally aligns the extracted features accordingly. Specifically, it comprises three convolutional layers and three deformable convolutional layers: the convolutional layers predict the position bias between the features of the current adjacent video frames, and the deformable convolutional layers spatio-temporally align the video frame features according to the predicted bias, where each deformable convolution comprises a conventional convolutional layer and a position-bias prediction layer;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampled layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
In addition, in order to obtain a lightweight video shadow detection model with stronger generalization capability for real-time detection, the generated video shadow detection pseudo-labels are used to train the memory-transfer long time-sequence network for video shadow detection, which comprises a weight-shared feature encoder, a memory transfer module, two motion-aware recursive deformable convolution modules, and a multi-scale video shadow detection decoder module:
the feature encoder shared by the weight in the network and the feature encoder of the space-time alignment network based on the deformable convolution provided by the invention adopt the same structure;
Two motion-aware recursive deformable convolution modules are used to predict the position offsets of adjacent video frames at the 1/8 and 1/16 scales, respectively, and to align the information of adjacent frames with deformable convolution according to these offsets, taking computational resources and time costs into account. Specifically, the optical flow used in these modules is obtained by resizing the optical flow generated by the public GMA pre-trained network to the 1/8 and 1/16 scales. Furthermore, the two modules adopt the same structure;
The memory transfer module comprises a fixed-size memory matrix (25 × 256) for storing all historical shadow information accumulated during video shadow detection, and dynamically transfers and updates the stored shadow information using a deformable convolution and gating mechanism comprising one deformable convolution layer and three convolution layers. The deformable convolution uses the position bias predicted by the motion-aware recursive deformable convolution module at the 1/16 scale to align the historical features stored in the memory matrix;
the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and a Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the final video shadow detection results are obtained by fusing the multi-scale prediction results. The random drop probability for a particular Dropout layer is set to 0.1.
Based on the same conception, the invention also provides a semi-supervised video shadow detection method of the long time sequence network transmitted by using the memory. The method comprises two parts, namely an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The image-assisted video shadow detection pseudo label generation method comprises the following steps:
step S11: an existing learning-based image shadow detection method BDRAR is trained by using an SBU image shadow detection data set, and the specific training strategy is consistent with that in the thesis.
Step S12: the ViSha dataset and the additional unlabeled video data are divided into consecutive video frames, which are processed frame by frame with the BDRAR model trained in step S11 to generate coarse video shadow detection pseudo-labels.
Step S13: carrying out supervised training on the spatio-temporal alignment network based on deformable convolution provided by the invention by using the rough video shadow detection pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set;
S3-1: concatenate the coarse pseudo-labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the weight-shared feature encoder, and extract shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical flow prediction network GMA to respectively predict the optical flow from the (t-x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; take the predicted optical flow between adjacent video frames as the initial value of the position offset between them, and recursively refine it in a residual manner to predict accurate position offset information. The specific calculation formula is:

ΔP^(0)_{t→t-x} = GMA(I_t, I_{t-x})
ΔP^(i)_{t→t-x} = ΔP^(i-1)_{t→t-x} + Conv( DCN(F_{t-x}, ΔP^(i-1)_{t→t-x}), F_t )

where I_t represents the input t-th video frame, ΔP^(i)_{t→t-x} represents the position offset between the t-th frame and the (t-x)-th frame obtained by the i-th recursion, GMA(·) represents the optical flow prediction network GMA used, and Conv(·) and DCN(·) represent a conventional convolutional neural network layer and a deformable convolutional layer in the network, respectively;
S3-4: using the position offsets predicted in S3-3, perform feature alignment on the adjacent-frame features F_{t-x} with deformable convolution. The calculation formula is:

F̂_{t-x}(P_0) = Σ_{P_n ∈ R} ω(P_n) · F_{t-x}(P_0 + P_n + ΔP_n)

where P_0 is the center coordinate of a single convolution operation on the feature map, ΔP_n is the predicted position offset, R represents the sampling grid of a standard convolution kernel with radius r, ω(P_n) is the convolution kernel coefficient at location P_n, and F̂_{t-x} represents the feature map after alignment;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
S3-6: supervise the training of the spatio-temporal alignment network based on deformable convolution with the Binary Cross-Entropy loss L_bce. The specific calculation formula is:

L_bce = − Σ [ G · log(Ŷ) + (1 − G) · log(1 − Ŷ) ]

where Ŷ is the generated video shadow detection pseudo-label result and G is the manually annotated ground-truth video shadow detection label.
Step S14: and generating a video shadow detection pseudo label by using the space-time alignment network based on the deformable convolution trained in the step S13 and the rough pseudo label of the label-free video data generated in the step S12. Specifically, the following process is performed:
Input the original images pairwise into the optical flow prediction network GMA to predict the optical flow between adjacent frames; concatenate the 5 adjacent original video frames with their corresponding generated coarse pseudo-labels and input them, together with the predicted optical flow, into the trained spatio-temporal alignment network based on deformable convolution, which outputs spatio-temporally consistent video shadow detection pseudo-labels. Compose the generated consecutive pseudo-labels at the original frame rate to obtain the video shadow detection pseudo-label in video format.
Using the generated video shadow detection pseudo-labels, the semi-supervised video shadow detection method based on the memory-transfer long time-sequence network provided by the invention comprises the following steps:
step S1: the image-assisted video shadow detection pseudo tag generation method is used for generating video shadow detection pseudo tags for unmarked video data.
Step S2: compute a pixel-level uncertainty map for the generated video shadow detection pseudo-labels using the uncertainty estimation method MC-Dropout. The MC-Dropout method computes 20 forward passes through the network with its random Dropout layer active and takes the variance of the results as the uncertainty:

U(x, y) = (1/20) Σ_{i=1..20} ( p_i(x, y) − μ_{x,y} )²

where U(x, y) represents the uncertainty at pixel (x, y), p_i(x, y) is the prediction of the i-th forward pass, and μ_{x,y} represents the mean of the 20 forward propagation results.
Step S3: use the video shadow detection pseudo-labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time-sequence network, with supervised training on the labeled ViSha dataset and semi-supervised training on the unlabeled video data:
S31: input two adjacent video frames into the weight-shared feature encoder and the optical flow prediction network GMA, respectively, to extract multi-scale feature maps of the two adjacent frames and the optical flow between them;
S32: input the feature maps extracted in S31 and the optical flow into the motion-aware recursive deformable convolution module to predict the position bias between the video frames, and align the extracted features using this bias;
S33: according to the position bias predicted in S32, perform memory transfer on the historical shadow information stored in the memory matrix using deformable convolution, and use a gating mechanism to selectively store into and update the memory matrix according to the memory transfer result and the features of the current video frame. The memory matrix is initialized with the features of the first video frame;
S34: concatenate the updated historical shadow information from S33 and the aligned video frame feature maps from S32 along the channel dimension, input them into the multi-scale video shadow detection decoder, and predict video shadow detection results at 1/8, 1/4, 1/2 and full scale of the original input image;
S35: restore the multi-scale video shadow detection results predicted in S34 to the original image size through an up-sampling module and fuse them, i.e. concatenate them and transform them through a convolutional layer to generate the final video shadow detection result;
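The multi-scale fusion of S35 can be sketched as follows. This NumPy sketch stands in for the learned layers: nearest-neighbour upsampling replaces the up-sampling module and a mean replaces the fusion convolution, so it shows the data flow only, not the trained operators.

```python
import numpy as np

def upsample_nearest(p, factor):
    """Nearest-neighbour stand-in for the learned up-sampling module."""
    return np.repeat(np.repeat(p, factor, axis=0), factor, axis=1)

def fuse_multiscale(preds):
    """Restore the 1/8, 1/4, 1/2 and full-scale predictions to the input
    size, stack them (the "cascade" on the channel dimension) and fuse;
    the learned fusion convolution is reduced to a mean here."""
    full = [upsample_nearest(p, f) for p, f in zip(preds, (8, 4, 2, 1))]
    return np.stack(full, axis=0).mean(axis=0)

H = W = 16
preds = [np.full((H // f, W // f), 0.5) for f in (8, 4, 2, 1)]
fused = fuse_multiscale(preds)
```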
S36: train the memory-transfer long-time-sequence network with a supervised loss L_full on the labeled ViSha dataset D_L and a semi-supervised loss L_semi on the additional unlabeled dataset D_U (in this embodiment the test split of the ViSha dataset serves as the unlabeled dataset, without manual annotation):

L_full = l_BCE(O_L, G_L)

L_semi = Σ (1 − Umap) ⊙ l_BCE(O_U, G̃_U) / ( Σ (1 − Umap) + ε )

where l_BCE is the Binary Cross-Entropy loss, O_L and G_L are the output of the memory-transfer long-time-sequence network on the labeled ViSha dataset D_L and the corresponding ground truth, Umap is the uncertainty map computed in step S2, O_U and G̃_U are the network output on the unlabeled data and the generated pseudo label respectively, and ε prevents division by zero.
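The uncertainty-weighted semi-supervised loss described in S36 can be sketched as follows. This is a reconstruction under the stated definitions (down-weighting by 1 − Umap with an ε against division by zero), not the patent's verbatim formula; function names are illustrative.

```python
import numpy as np

def bce(o, g, eps=1e-7):
    """Per-pixel binary cross-entropy."""
    o = np.clip(o, eps, 1.0 - eps)
    return -(g * np.log(o) + (1.0 - g) * np.log(1.0 - o))

def semi_supervised_loss(o_u, pseudo, umap, eps=1e-7):
    """Down-weight pixels whose pseudo label is uncertain; eps guards
    against the divide-by-zero mentioned in the text."""
    w = 1.0 - umap
    return float((w * bce(o_u, pseudo)).sum() / (w.sum() + eps))

o = np.array([[0.9, 0.2]])
pl = np.array([[1.0, 0.0]])
certain = semi_supervised_loss(o, pl, np.zeros_like(o))  # fully trusted
ignored = semi_supervised_loss(o, pl, np.ones_like(o))   # fully uncertain
```

With a zero uncertainty map the loss reduces to plain BCE on the pseudo labels; with full uncertainty every pixel is masked out and the loss vanishes.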
Step S4: perform video shadow detection with the model trained in step S3, as follows:
Decompose the video requiring shadow detection into independent video frames and normalize each frame:

Image_norm = (Image − mean) / std

where Image represents an independent video frame, and mean and std are the channel-wise normalization statistics;
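The per-frame normalization can be sketched as follows. The patent does not state the statistics, so this sketch assumes the ImageNet mean and standard deviation, a common choice when the feature encoder is pretrained.

```python
import numpy as np

def normalize_frame(frame, mean, std):
    """Scale an 8-bit frame to [0, 1] and normalize per channel."""
    x = frame.astype(np.float64) / 255.0
    return (x - mean) / std

# Assumed statistics (not given in the patent text).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

frame = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in video frame
norm = normalize_frame(frame, IMAGENET_MEAN, IMAGENET_STD)
```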
Input two adjacent video frames at a time into the memory-transfer long-time-sequence network in temporal order to obtain the detection result of the later frame, updating the stored memory matrix as detection proceeds (when processing the first video frame, two copies of that frame are input into the network);
Classify the network detection result with a threshold of 0.5 to obtain a binary 0-1 shadow detection mask, and reassemble the video frames into video data to obtain the final video shadow detection result.
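The thresholding step above is a one-liner; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def binarize(prob_map, threshold=0.5):
    """Classify the network's probability output into a 0-1 shadow mask."""
    return (prob_map >= threshold).astype(np.uint8)

mask = binarize(np.array([[0.8, 0.3], [0.5, 0.1]]))
```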
This embodiment provides a semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network. It uses existing labeled image shadow detection datasets to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labeled data; furthermore, it efficiently produces long-term temporally consistent detection results by explicitly storing the historical information of the video. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalizing to practical applications, and their difficulty in maintaining long-term temporal consistency, achieving robust shadow detection on video data.
Claims (10)
1. A memory-transfer long-time-sequence network, comprising: a weight-sharing feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the weight-sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting the position offsets caused by motion between adjacent frames and spatio-temporally aligning the extracted features according to these offsets;
the multi-scale video shadow detection decoder module performs video shadow detection with a multi-scale mechanism according to the aligned video frame features and the historical information stored in the memory transfer module.
2. The memory-transfer long-time-sequence network of claim 1, wherein: the memory transfer module uses a fixed-size memory matrix to store all historical shadow information during video shadow detection, and dynamically transfers and updates the stored shadow information with deformable convolution and a gating mechanism; the memory transfer module comprises one deformable convolutional layer and three convolutional layers.
3. The memory-transfer long-time-sequence network of claim 1, wherein: the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers and one Dropout layer; it predicts video shadow detection results at different scales with a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
4. The memory-transfer long-time-sequence network of claim 1, wherein:
the spatio-temporal alignment adopts a deformable-convolution-based spatio-temporal alignment network comprising a weight-sharing feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight-sharing feature encoder extracts features from the input video frames and the corresponding coarse shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts the position offsets caused by motion between adjacent frames, guided by the optical flow between adjacent frames, and spatio-temporally aligns the extracted features according to these offsets;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
5. The memory-transfer long-time-sequence network of claim 4, wherein:
the deformable-convolution spatio-temporal alignment network comprises a weight-sharing feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight-sharing feature encoder extracts features from the input video frames and the corresponding coarse shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts the position offsets caused by motion between adjacent frames, guided by the optical flow between adjacent frames, and spatio-temporally aligns the extracted features according to these offsets;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
6. The memory-transfer long-time-sequence network of claim 5, wherein:
the motion-aware recursive deformable convolution module uses the optical flow between adjacent video frames as a guide for predicting position offsets between adjacent frames; the module comprises three convolutional layers and three deformable convolutional layers; the convolutional layers predict the position offsets between the features of adjacent video frames, and the deformable convolutional layers spatio-temporally align the video frame features according to the predicted offsets.
7. A semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network, characterized by comprising the following steps:
step S1: generating a video shadow detection pseudo label from the video data without the label;
step S2: estimate a pixel-level uncertainty map for the video shadow detection pseudo labels generated in step S1 with an uncertainty estimation method, to evaluate the accuracy of the generated pseudo labels;
step S3: using the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2, perform supervised training on the labeled dataset and semi-supervised training on the unlabeled video data, respectively, with the memory-transfer long-time-sequence network of any one of claims 1 to 6;
and step S4: and (4) carrying out video shadow detection by using the model trained in the step (S3).
8. The semi-supervised video shadow detection method according to claim 7, wherein step S1 comprises:
S11: train an existing single-image shadow detection network on the SBU dataset;
S12: process the ViSha dataset and the additional unlabeled video data frame by frame with the single-image shadow detection network trained in step S11 to generate coarse video shadow detection pseudo labels;
S13: train the spatio-temporal alignment network in a supervised manner with the ViSha dataset and the coarse pseudo labels of the ViSha dataset obtained in step S12; the spatio-temporal alignment network is a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of multiple adjacent frames with the video frame information;
S14: generate the video shadow detection pseudo labels with the network trained in step S13 and the coarse pseudo labels of the unlabeled video data obtained in step S12.
9. The semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network as recited in claim 8, wherein:
step S13 trains the deformable-convolution spatio-temporal alignment network with the ViSha dataset and the coarse pseudo labels of the ViSha dataset obtained in step S12, as follows:
S3-1: concatenate the coarse pseudo labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the shared feature encoder, and extract the shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical-flow prediction network GMA to respectively predict the optical flow from each (t−x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; the predicted optical flow serves as the initial value of the position offsets between adjacent frames and is recursively refined in a residual manner to predict accurate position offsets:

ΔP_0^{t−x→t} = GMA(F_{t−x}, F_t)
ΔP_i^{t−x→t} = ΔP_{i−1}^{t−x→t} + Conv( F_t, DCN(F_{t−x}, ΔP_{i−1}^{t−x→t}) )

where F_t represents the input t-th video frame, ΔP_i^{t−x→t} represents the position offset between the t-th frame and the (t−x)-th frame obtained at the i-th recursion, GMA(·) represents the optical-flow prediction network GMA, and Conv(·) and DCN(·) represent a conventional convolutional layer and a deformable convolutional layer in the network, respectively;
S3-4: use deformable convolution with the position offsets predicted in S3-3 to align the features of each (t−x)-th frame to the t-th frame:

F'(P_0) = Σ_{P_n ∈ R} ω(P_n) · F(P_0 + P_n + ΔP_n)

where P_0 is the center coordinate of a single convolution operation on the feature map, ΔP_n is the predicted position offset, R represents the sampling grid of a standard convolution kernel, ω(P_n) is the convolution kernel coefficient at position P_n, and F' represents the feature map after alignment;
S3-5: concatenate the aligned feature maps obtained in S3-4 along the channel dimension, input them into the pixel-level shadow detection decoder, restore them to the size of the original input image through up-sampling layers, and generate the video shadow detection pseudo label;
S3-6: train the deformable-convolution-based spatio-temporal alignment network in a supervised manner with the Binary Cross-Entropy loss:

l_BCE(O, G) = − Σ_{x,y} [ G(x, y) · log O(x, y) + (1 − G(x, y)) · log(1 − O(x, y)) ]

where O is the network output and G is the corresponding ViSha ground truth.
10. The semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network as recited in claim 9, wherein:
in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo labels with the uncertainty estimation method MC-Dropout, which evaluates the accuracy of the pseudo labels and guides the semi-supervised learning; MC-Dropout runs 20 forward passes with the random Dropout layers kept active and takes the variance of the results as the uncertainty:

Umap(x, y) = (1/20) Σ_{t=1..20} ( O_t(x, y) − μ(x, y) )².
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211051584.6A CN115147412B (en) | 2022-08-31 | 2022-08-31 | Long time sequence network for memory transfer and video shadow detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115147412A true CN115147412A (en) | 2022-10-04 |
CN115147412B CN115147412B (en) | 2022-12-16 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110273620A1 (en) * | 2008-12-24 | 2011-11-10 | Rafael Advanced Defense Systems Ltd. | Removal of shadows from images in a video signal |
CN113378775A (en) * | 2021-06-29 | 2021-09-10 | 武汉大学 | Video shadow detection and elimination method based on deep learning |
CN113436115A (en) * | 2021-07-30 | 2021-09-24 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
CN113538357A (en) * | 2021-07-09 | 2021-10-22 | 同济大学 | Shadow interference resistant road surface state online detection method |
CN113628129A (en) * | 2021-07-19 | 2021-11-09 | 武汉大学 | Method for removing shadow of single image by edge attention based on semi-supervised learning |
CN114220001A (en) * | 2021-11-25 | 2022-03-22 | 南京信息工程大学 | Remote sensing image cloud and cloud shadow detection method based on double attention neural networks |
Non-Patent Citations (2)
Title |
---|
CHUNXIA XIAO et al.: "Fast Shadow Removal Using Adaptive Multi-Scale Illumination Transfer", Wiley Online Library
MA Yongjie et al.: "Shadow Removal Method Based on an Improved Laplacian-of-Gaussian Operator", Laser & Optoelectronics Progress
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |