CN115147412B - Long time sequence network for memory transfer and video shadow detection method - Google Patents

Long time sequence network for memory transfer and video shadow detection method

Info

Publication number
CN115147412B
Authority
CN
China
Prior art keywords
video
shadow detection
shadow
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211051584.6A
Other languages
Chinese (zh)
Other versions
CN115147412A (en)
Inventor
肖春霞 (Xiao Chunxia)
陈子沛 (Chen Zipei)
罗飞 (Luo Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202211051584.6A priority Critical patent/CN115147412B/en
Publication of CN115147412A publication Critical patent/CN115147412A/en
Application granted granted Critical
Publication of CN115147412B publication Critical patent/CN115147412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a memory-transfer long time sequence network and a video shadow detection method. The method uses an existing annotated image shadow detection data set to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labelled data; in addition, by explicitly storing the historical information of the video, it efficiently generates video shadow detection results that remain consistent over long time sequences. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalising to practical applications, and their difficulty in maintaining long-range temporal consistency, and achieves robust shadow detection on video data.

Description

Long time sequence network for memory transfer and video shadow detection method
Technical Field
The invention belongs to the field of dynamic video illumination identification, and particularly relates to a long time sequence network with memory transfer and a semi-supervised video shadow detection method based on this network.
Background
Currently, commonly used video shadow detection methods fall into two main categories. 1. Traditional methods based on physical models, such as the video shadow detection method proposed in the paper "Shadow detection algorithms for traffic flow analysis: a comparative study", which works in the HSV colour space and distinguishes moving shadows from the background by comparing luminance at the same saturation and hue. Such methods rely on low-dimensional features hand-crafted by experts and only obtain good results in strongly constrained scenes (e.g. stable illumination conditions and a single moving object). 2. Deep-learning-based methods, which, unlike methods based on traditional physical models, rely on the strong semantic representation capability of deep learning to adaptively select features for judging video shadows; for example, the triple-cooperative network proposed in the paper "Triple-cooperative video shadow detection" uses three parallel networks to collaboratively learn discriminative feature representations at the intra-video and inter-video levels to detect video shadows. Although such methods have made progress on the video shadow detection task, they are limited by the scale of the available data sets, are difficult to generalise effectively to practical application scenarios and struggle to meet practical requirements. Existing video shadow detection technology still lacks a method with strong generalisation capability that can satisfy user needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a long time sequence network with memory transfer and a semi-supervised video shadow detection method based on this network, aiming to solve the problems that current video shadow detection methods are limited by the scale of the data set, have insufficient generalisation capability and are difficult to generalise effectively to practical application scenarios.
The invention provides a long time sequence network for memory transfer, characterised in that: the network uses a memory mechanism, guiding the detection of shadows in the next video frame by storing and transferring the shadow information of all previous video frames, and generating long-temporally consistent video shadow detection results in a lightweight manner; the network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the feature encoder shared by the weight values is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on the extracted features according to the position bias;
and the multi-scale video shadow detection decoder module performs video shadow detection according to the aligned video frame characteristics and the historical information stored in the memory module by using a multi-scale mechanism.
Further, the memory transfer module uses a fixed-size memory matrix to store all historical shadow information during the video shadow detection process, and uses a deformable convolution and gating mechanism to dynamically transfer and update the stored shadow information, wherein the deformable convolution and gating mechanism comprises one deformable convolution layer and three convolution layers.
Furthermore, the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers and one Dropout layer; it predicts video shadow detection results at different scales using a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
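As a rough illustration of the memory transfer module described above, a minimal PyTorch-style sketch is given below. The module and parameter names, the channel size, and the use of a spatial memory feature map (rather than the fixed-size memory matrix described later in the embodiment) are assumptions of this sketch, not details fixed by the invention.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MemoryTransferModule(nn.Module):
    """Sketch: transfers the stored historical shadow features with a deformable
    convolution driven by the predicted position offsets, then updates the memory
    through a convolutional gate (one deformable conv layer + three conv layers)."""
    def __init__(self, channels=256):
        super().__init__()
        # one deformable convolution layer that transfers the stored memory
        self.transfer = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        # three convolution layers forming the gating mechanism
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, memory, frame_feat, offset):
        # offset: sampling offsets from the motion-aware module,
        # shape (N, 2*3*3, H, W) as required by DeformConv2d with a 3x3 kernel
        transferred = self.transfer(memory, offset)
        g = self.gate(torch.cat([transferred, frame_feat], dim=1))
        # gated update: selectively write current-frame evidence into the memory
        return g * frame_feat + (1.0 - g) * transferred
```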
Based on the same inventive concept, the invention also provides a semi-supervised video shadow detection method using the memory-transfer long time sequence network, which consists of an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The pseudo label generation relies on a spatio-temporal alignment network based on deformable convolution, a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of adjacent frames with video frame information.
The semi-supervised video shadow detection method based on the memory-transfer long time sequence network comprises the following operation steps:
step S1: use the image-assisted video shadow detection pseudo label generation method provided by the invention to generate video shadow detection pseudo labels for unlabelled video data;
step S2: estimate a pixel-level uncertainty map for the video shadow detection pseudo labels generated in step S1 using an uncertainty estimation method, so as to evaluate the accuracy of the generated pseudo labels and guide the training of the model;
step S3: use the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time sequence network with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data;
step S4: perform video shadow detection using the model trained in step S3.
The spatio-temporal alignment network based on deformable convolution comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight sharing feature encoder is used for extracting features of the input video frame and the corresponding rough shadow detection result;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
Further, the motion-aware recursive deformable convolution module uses optical flow between adjacent video frames as a guide for prediction of position bias between adjacent frames, including three convolution layers for predicting the position bias existing between current adjacent video frame features and three deformable convolution layers for spatio-temporally aligning video frame features according to the predicted position bias.
Further, the pixel-level shadow detection decoder includes three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer.
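For concreteness, a minimal sketch of such a pixel-level decoder is shown below; the channel widths and the ordering of the layers are illustrative assumptions, only the layer counts (three residual layers, four upsampling layers, three convolution layers, one Dropout layer) and the 0.1 drop probability mentioned in the embodiment follow the description.

```python
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class PixelShadowDecoder(nn.Module):
    """Sketch: restores aligned 1/16-scale features to input resolution and
    predicts a shadow position mask (logits)."""
    def __init__(self, in_channels=256):
        super().__init__()
        # three residual layers
        self.res = nn.Sequential(
            BasicBlock(in_channels, in_channels),
            BasicBlock(in_channels, in_channels),
            BasicBlock(in_channels, in_channels),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # three convolution layers and one Dropout layer
        self.conv1 = nn.Conv2d(in_channels, 128, 3, padding=1)
        self.conv2 = nn.Conv2d(128, 64, 3, padding=1)
        self.dropout = nn.Dropout2d(p=0.1)
        self.conv3 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        x = self.res(x)
        x = self.up(self.conv1(x))           # 1/16 -> 1/8  (upsampling 1)
        x = self.up(self.conv2(x))           # 1/8  -> 1/4  (upsampling 2)
        x = self.dropout(x)
        x = self.up(self.up(self.conv3(x)))  # 1/4 -> 1/2 -> 1 (upsamplings 3 and 4)
        return x                             # shadow position mask logits
```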
By utilizing the space-time alignment network based on the deformable convolution, the image-assisted video shadow detection pseudo label generation method comprises the following steps:
s11, training an existing single image shadow detection network by using an SBU data set, such as: BDRAR;
s12, processing the ViSha data set and the additional unmarked video data frame by using the single image shadow detection network trained in the step S11 to generate a rough video shadow detection pseudo label;
s13, using the rough pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set to train the spatiotemporal alignment network based on deformable convolution provided by the invention in a supervised manner;
and S14, generating a video shadow detection pseudo label by using the network trained in the step S13 and the rough pseudo label of the label-free video data obtained in the step S12.
Step S13 is to train the spatio-temporal alignment network based on deformable convolution provided in the present invention with the coarse pseudo labels of the ViSha data set and the ViSha data set obtained in step S12, and the specific process is as follows:
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a shared feature encoder, and extracting shadow feature maps of different frames
Figure 332193DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature diagram extracted in S3-1 and the predicted optical flow in S3-2 into a recursive deformable convolution module for motion perception, taking the predicted optical flow as an initial value of position offset between adjacent frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position offset information, wherein a specific calculation formula is as follows:
Figure 63410DEST_PATH_IMAGE004
wherein
Figure 587933DEST_PATH_IMAGE005
Represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is set such that,GMA(•)which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a traditional convolutional neural network layer and a deformable convolutional layer;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 873421DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 876012DEST_PATH_IMAGE007
wherein
Figure 930555DEST_PATH_IMAGE008
Represents a standard convolution kernel radius ofrThe set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 891558DEST_PATH_IMAGE009
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 on the channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 398763DEST_PATH_IMAGE010
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 470624DEST_PATH_IMAGE011
wherein
Figure 379674DEST_PATH_IMAGE012
A pseudo-tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
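As referenced in step S3-3, the following is a minimal PyTorch-style sketch of the recursive offset refinement (S3-3) and the deformable-convolution alignment (S3-4). The channel sizes, the number of recursion steps, the offset-channel layout and the way the GMA flow is resized are assumptions of the sketch; GMA itself is an external optical flow network and is not implemented here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class MotionAwareAlign(nn.Module):
    """Sketch of S3-3/S3-4: refine the GMA optical flow into dense sampling
    offsets in a residual manner, then align the neighbouring frame feature
    with a deformable convolution."""
    def __init__(self, channels=256, kernel_size=3, steps=3):
        super().__init__()
        self.k = kernel_size
        self.steps = steps
        # predicts a residual offset field from the current feature,
        # the warped neighbour feature and the resized flow
        self.offset_head = nn.Conv2d(2 * channels + 2, 2 * kernel_size ** 2, 3, padding=1)
        # weight of the deformable convolution DCN(.)
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, feat_t, feat_tx, flow_t_tx):
        # initial offset = GMA flow resized to the feature resolution
        # (in practice the flow magnitudes would also be rescaled)
        flow = F.interpolate(flow_t_tx, size=feat_t.shape[-2:], mode='bilinear',
                             align_corners=False)
        offset = flow.repeat(1, self.k * self.k, 1, 1)  # broadcast to every kernel tap
        for _ in range(self.steps):
            warped = deform_conv2d(feat_tx, offset, self.weight, padding=self.k // 2)
            # residual refinement of the offsets (formula in S3-3)
            offset = offset + self.offset_head(torch.cat([feat_t, warped, flow], dim=1))
        # final alignment of the neighbouring feature (formula in S3-4)
        return deform_conv2d(feat_tx, offset, self.weight, padding=self.k // 2)

# S3-6: Binary Cross-Entropy supervision of the decoder output against the labels
bce_loss = nn.BCEWithLogitsLoss()
```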
Further, in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo labels by the uncertainty estimation method MC-Dropout and is used to evaluate the accuracy of the pseudo labels so as to guide the semi-supervised learning; the MC-Dropout method performs 10 forward propagations through the network containing a random Dropout layer and takes the variance of the results as the uncertainty, calculated as:

$$Umap(x,y) = \frac{1}{10}\sum_{i=1}^{10}\big(O_i(x,y) - \mu_{x,y}\big)^2,$$

wherein $Umap(x,y)$ denotes the uncertainty at pixel $(x,y)$, $O_i(x,y)$ is the result of the $i$-th forward propagation at that pixel, and $\mu_{x,y}$ denotes the mean of the 10 forward propagation results.
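A minimal sketch of this MC-Dropout estimate is given below; `model` stands for the trained pseudo-label network, and putting the whole model into train mode (rather than only its Dropout layers) is a simplification of the sketch.

```python
import torch

def mc_dropout_uncertainty(model, frames, passes=10):
    """Run `passes` stochastic forward propagations with Dropout active and
    return the per-pixel mean and the variance used as the uncertainty map."""
    model.train()  # keeps Dropout stochastic; in practice only Dropout layers would be switched
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(frames)) for _ in range(passes)], dim=0)
    mu = preds.mean(dim=0)                  # mu_{x,y}: mean of the 10 passes
    umap = ((preds - mu) ** 2).mean(dim=0)  # per-pixel variance = Umap(x, y)
    return mu, umap
```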
Further, in step S3, the memory-transfer long time sequence network is trained with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data respectively; the specific process is as follows:
s31, respectively inputting two adjacent video frames into a feature encoder with shared weight and a GMA (light stream prediction) network to extract feature graphs of two adjacent video frames in different scales and light streams between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
s33, according to the position bias predicted in S32, performing memory transfer on the historical shadow information stored in the memory matrix by using deformable convolution, and selectively storing and updating the memory matrix according to the memory transfer result and the characteristics of the current video frame by using a gating mechanism;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, and inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder to predict video shadow detection results on different scales;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module and fusing the video shadow detection results to generate a final video shadow detection result;
s36 respectively arranging the labeled ViSha data setsD L And additional unlabeled datasetsD U Loss of upper usage supervisionL full And semi-supervised lossL semi For the supervised and semi-supervised training of the long time sequence network for memory transfer, the specific calculation formula is as follows:
$$L_{full} = l_{BCE}(O_L, G_L), \qquad L_{semi} = \frac{l_{BCE}\big(O_U, \tilde G_U\big)}{Umap + \epsilon},$$

wherein $l_{BCE}$ is the Binary Cross-Entropy loss, $O_L$ and $G_L$ are respectively the output of the memory-transfer long time sequence network on the labelled ViSha data set $D_L$ and the corresponding ground truth, $Umap$ is the uncertainty map calculated in step S2, $O_U$ and $\tilde G_U$ are respectively the network output on the unlabelled data and the generated pseudo label, and $\epsilon$ is a small constant used to prevent division by zero in the calculation.
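One way the two losses above could be written is sketched below, assuming the per-pixel Binary Cross-Entropy on unlabelled data is down-weighted by the uncertainty map; the exact weighting form is an assumption of this sketch.

```python
import torch.nn.functional as F

def supervised_loss(logits_l, gt_l):
    """L_full: Binary Cross-Entropy on the labelled ViSha frames."""
    return F.binary_cross_entropy_with_logits(logits_l, gt_l)

def semi_supervised_loss(logits_u, pseudo_u, umap, eps=1e-6):
    """L_semi: per-pixel BCE against the generated pseudo labels, divided by the
    uncertainty map so that confident pseudo labels dominate; eps prevents
    division by zero."""
    per_pixel = F.binary_cross_entropy_with_logits(logits_u, pseudo_u, reduction='none')
    return (per_pixel / (umap + eps)).mean()
```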
The invention has the advantages that:
1. The memory-transfer long time sequence network provided by the invention can, by using a memory mechanism, generate long-temporally consistent video shadow detection results at a small cost in time and computation.
2. The spatio-temporal alignment network based on deformable convolution provided by the invention can explicitly align the information between adjacent frames, so that the model can better integrate adjacent frames and thereby produce spatio-temporally consistent results.
3. The image-assisted video shadow detection pseudo label generation method provided by the invention can utilize the existing large-scale image shadow detection data set to generate a video shadow detection pseudo label with better robustness under the condition of a limited video shadow detection data set.
4. The semi-supervised video shadow detection method based on uncertainty guidance improves the generalization capability of the model by using additional pseudo labels, and meanwhile, the introduction of the uncertainty map enables the use of the pseudo labels to be more accurate.
Drawings
Fig. 1 is a schematic diagram of the long time sequence network for memory transfer according to an embodiment.
FIG. 2 is a schematic diagram of a spatio-temporal alignment network based on deformable convolution in an embodiment.
Fig. 3 is a schematic diagram of a semi-supervised video shadow detection method using a long-time-series network with memory transfer in an embodiment.
Detailed Description
For further understanding of the present invention, the objects, technical solutions and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the specific embodiments described here are illustrative only and do not limit the invention.
The spatio-temporal alignment network based on deformable convolution used for generating the video shadow detection pseudo labels comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module:
in the embodiment, a convolution network part of ResNeXt101 is used as the feature encoder, and four feature maps with the sizes of 1/4,1/8,1/16 and 1/16 are respectively provided;
the motion-aware recursive deformable convolution recursively predicts the positional bias between adjacent frames due to motion using the optical flow between adjacent frames as a guide and performs spatial-temporal alignment of the extracted features according to it. Specifically, three convolutional layers and three deformable convolutional layers are included. The convolutional layer is used for predicting the position bias existing between the current adjacent video frame characteristics, and the deformable convolutional layer is used for performing space-time alignment on the video frame characteristics according to the predicted position bias, wherein the deformable convolution comprises a conventional convolutional layer and a position bias prediction layer;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampled layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
In addition, to obtain a lightweight video shadow detection model with stronger generalisation capability for real-time detection, the generated video shadow detection pseudo labels are used. The memory-transfer long time sequence network for video shadow detection comprises a weight-shared feature encoder, a memory transfer module, two motion-aware recursive deformable convolution modules and a multi-scale video shadow detection decoder module:
the feature encoder shared by the weight in the network and the feature encoder of the space-time alignment network based on the deformable convolution provided by the invention adopt the same structure;
two motion-aware recursive deformable convolution modules are used to predict the position offsets of adjacent video frames at the 1/8 and 1/16 scales respectively and to align the information of adjacent frames with deformable convolution according to these offsets, taking computational resources and time cost into account. Specifically, the optical flow used in the motion-aware deformable convolution is the optical flow generated by the publicly available GMA pre-trained network, resized to the 1/8 and 1/16 scales. Furthermore, the two motion-aware deformable convolutions adopt the same structure;
the memory transfer module comprises a memory matrix with the size of a fixed size (25 × 256) for storing shadow information of all histories in the video shadow detection process, and dynamically transfers and updates the stored shadow information by using a deformable convolution and gating mechanism, wherein the memory matrix comprises a deformable convolution layer and three convolution layers. Wherein the deformable convolution uses the predicted position bias of the motion-aware recursive deformable convolution on the 1/16 scale to perform feature alignment on the historical features stored in the memory matrix;
the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and one Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the prediction results of the multiple scales are fused to obtain a final video shadow detection result. The random drop probability for a particular Dropout layer is set to 0.1.
Based on the same conception, the invention also provides a semi-supervised video shadow detection method of the long time sequence network transmitted by using the memory. The method comprises an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The image-assisted video shadow detection pseudo label generation method comprises the following steps:
step S11: an existing learning-based image shadow detection method BDRAR is trained by using an SBU image shadow detection data set, and the specific training strategy is consistent with that in the thesis.
Step S12: the ViSha data set and the additional unlabelled video data are divided into a number of consecutive video frames, and the divided video frames are processed frame by frame using the BDRAR model trained in step S11 to generate coarse video shadow detection pseudo labels.
Step S13: carrying out supervised training on the spatio-temporal alignment network based on deformable convolution provided by the invention by using the rough video shadow detection pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set;
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a shared feature encoder, and extracting shadow feature maps of different frames
Figure 664211DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature map extracted in S3-1 and the predicted optical flow in S3-2 into a motion-aware recursive deformable convolution module, taking the predicted optical flow of adjacent video frames as an initial value of position bias between the adjacent video frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position bias information, wherein a specific calculation formula is as follows:
Figure 647720DEST_PATH_IMAGE018
wherein
Figure 121427DEST_PATH_IMAGE019
Represents the input oftThe frame of the video is a frame of video,
Figure 90520DEST_PATH_IMAGE020
represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is such that,GMA() Which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a conventional convolutional neural network layer and a deformable convolutional layer in the network;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 42295DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 46023DEST_PATH_IMAGE021
whereinP 0 For the coordinates of the center point of a single convolution operation performed on the feature map,
Figure 159473DEST_PATH_IMAGE022
in order to be able to predict the position offset,
Figure 615862DEST_PATH_IMAGE023
represents a standard convolution kernel radius of
Figure 371328DEST_PATH_IMAGE024
The set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 229563DEST_PATH_IMAGE025
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 779493DEST_PATH_IMAGE010
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 723178DEST_PATH_IMAGE026
wherein
Figure 16756DEST_PATH_IMAGE012
A pseudo tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
Step S14: and generating a video shadow detection pseudo label by using the space-time alignment network based on the deformable convolution trained in the step S13 and the rough pseudo label of the label-free video data generated in the step S12. Specifically, the following process is performed:
inputting every two original input images into an optical flow prediction network GMA to predict the optical flow between adjacent frames; the originally input adjacent 5 video frame images and the corresponding generated rough pseudo labels are cascaded and input into a trained space-time alignment network based on deformable convolution together with the predicted optical flow, and the video shadow detection pseudo labels with consistent space-time are output. And synthesizing the generated continuous pseudo labels according to the original frame rate to obtain the video shadow detection pseudo label in the video format.
The generated video shadow detection pseudo labels are utilized, and the semi-supervised video shadow detection method of the long-time sequence network based on memory transfer, which is provided by the invention, comprises the following steps:
step S1: the image-assisted video shadow detection pseudo tag generation method is used for generating video shadow detection pseudo tags for unmarked video data.
Step S2: calculating a pixel-level uncertainty map for the generated video shadow detection pseudo labels using the uncertainty estimation method MC-Dropout, wherein the MC-Dropout method performs 10 forward propagations through the network containing a random Dropout layer and takes the variance of the results as the uncertainty; the calculation process is:

$$Umap(x,y) = \frac{1}{10}\sum_{i=1}^{10}\big(O_i(x,y) - \mu_{x,y}\big)^2,$$

wherein $Umap(x,y)$ denotes the uncertainty at pixel $(x,y)$, $O_i(x,y)$ is the result of the $i$-th forward propagation at that pixel, and $\mu_{x,y}$ denotes the mean of the 10 forward propagation results.
And step S3: using the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time sequence network with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data respectively:
s31, respectively inputting two adjacent video frames into a feature encoder with shared weight and a GMA (light stream prediction) network to extract feature graphs of two adjacent video frames in different scales and light streams between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
and S33, performing memory transfer on the historical shadow information stored in the memory matrix by using deformable convolution according to the position bias predicted in S32, and selectively storing and updating the memory matrix according to the memory transfer result and the characteristics of the current video frame by using a gating mechanism. Initially, initializing the features in the memory matrix to be represented by the features of the first video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder, and predicting video shadow detection results on 1/8, 1/4, 1/2 and 1 scale of an original input image;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module and fusing the video shadow detection results, namely cascading the video shadow detection results and transforming the video shadow detection results through a convolution layer to generate final video shadow detection results;
s36 respectively arranging the labeled ViSha data setsD L And additional unlabeled datasetsD U Upper use has a supervision lossL full And semi-supervised lossL semi To train the long-time sequence network of memory transfer with supervision and semi-supervision (this fact)In the examples, a test set of a ViSha data set is used as an unmarked data set without manual marking), and the specific calculation formula is as follows:
$$L_{full} = l_{BCE}(O_L, G_L), \qquad L_{semi} = \frac{l_{BCE}\big(O_U, \tilde G_U\big)}{Umap + \epsilon},$$

wherein $l_{BCE}$ is the Binary Cross-Entropy loss, $O_L$ and $G_L$ are respectively the output of the memory-transfer long time sequence network on the labelled ViSha data set $D_L$ and the corresponding ground truth, $Umap$ is the uncertainty map calculated in step S2, $O_U$ and $\tilde G_U$ are respectively the network output on the unlabelled data and the generated pseudo label, and $\epsilon$ is a small constant used to prevent division by zero in the calculation.
And step S4: performing video shadow detection using the model trained in step S3, which specifically requires the following process:
the method comprises the following steps of carrying out segmentation processing on a video needing shadow detection to decompose the video into independent video frames, and simultaneously carrying out normalization operation on the video frames, wherein the calculation formula is as follows:
Figure 440489DEST_PATH_IMAGE032
whereinImageRepresenting independent video frames;
sequentially inputting two adjacent video frames into a long-time sequence network transmitted by a memory according to a time sequence to obtain a network detection result of a next frame, and updating and storing a memory matrix (when an input first video frame is processed, two identical first video frames are input into the network);
and classifying the network detection result according to a threshold value of 0.5 to obtain a binary shadow detection mask of 0-1, and restoring the video frame into video data to obtain a final video shadow detection result.
The embodiment provides a semi-supervised video shadow detection method using the memory-transfer long time sequence network, which uses the assistance of an existing annotated image shadow detection data set to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labelled data; furthermore, by explicitly storing the historical information of the video, it efficiently generates video shadow detection results that remain consistent over long time sequences. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalising to practical applications and in maintaining long-range temporal consistency, and achieves robust shadow detection on video data.
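As a recap of the detection flow of step S4, a minimal sketch of the inference loop is given below; the two-frame calling convention of `model`, its internal memory-matrix update, and the normalisation constants are assumptions of the sketch.

```python
import torch

def detect_video_shadows(model, frames, threshold=0.5):
    """Feed adjacent frame pairs to the memory-transfer network in temporal
    order (the first frame is paired with itself) and binarise at 0.5."""
    masks = []
    prev = frames[0]
    with torch.no_grad():
        for cur in frames:
            x_prev = (prev / 255.0 - 0.5) / 0.5  # assumed normalisation to [-1, 1]
            x_cur = (cur / 255.0 - 0.5) / 0.5
            prob = torch.sigmoid(model(x_prev, x_cur))  # model updates its memory internally
            masks.append((prob > threshold).float())    # binary 0-1 shadow mask
            prev = cur
    return masks  # reassembled into a video at the original frame rate downstream
```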

Claims (10)

1. A long time sequence network for memory transfer, characterised in that: the network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the weight sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on video frame features extracted by the weight-shared feature encoder according to position bias information;
and the multi-scale video shadow detection decoder module performs video shadow detection according to the aligned video frame characteristics and the historical information stored in the memory transfer module by using a multi-scale mechanism.
2. The memory-passing long-timing network of claim 1, wherein: the memory transfer module uses a memory matrix with a fixed size to store shadow information of all histories in a video shadow detection process, and uses a deformable convolution and gating mechanism to dynamically transfer and update the stored shadow information, and the memory transfer module comprises a deformable convolution layer and three convolution layers.
3. The memory-passing long-timing network of claim 1, wherein: the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and one Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the multi-scale prediction results are fused to obtain a final video shadow detection result.
4. The memory-passing long-timing network of claim 1, wherein: the motion-aware recursive deformable convolution module includes three convolutional layers for predicting a position offset existing between current neighboring video frame features and three deformable convolutional layers for spatio-temporal alignment of video frame features according to the predicted position offset, the deformable convolution includes a conventional convolutional layer and a position offset prediction layer.
5. A semi-supervised video shadow detection method of a long time sequence network based on memory transfer is characterized by comprising the following steps:
step S1: generating a video shadow detection pseudo label from the video data without the label;
step S2: estimating a pixel-level uncertainty image of the video shadow detection pseudo label generated in the step S1 by using an uncertainty estimation method to estimate the accuracy of the generated pseudo label;
and step S3: using the video shadow detection pseudo label generated in step S1 and the uncertain graph generated in step S2, respectively performing supervised and semi-supervised training on the labeled data set and the unlabelled video data by using the long-time-series network transferred by the memory according to any one of claims 1 to 4, specifically:
s31, respectively inputting two adjacent video frames into a feature encoder and an optical flow prediction network with shared weights to extract feature graphs of two adjacent video frames with different scales and optical flows between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
s33, according to the position bias predicted in S32, utilizing deformable convolution to carry out memory transfer on historical shadow information stored in a memory matrix, using a gating mechanism, and selectively storing and updating the memory matrix according to the result of the memory transfer and the characteristics of the current video frame, wherein initially, the characteristics in the memory matrix are initialized to be represented by the characteristics of the first video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder, and predicting video shadow detection results on 1/8, 1/4, 1/2 and 1 scale of an original input image;
s35, reducing the video shadow detection results of different scales predicted in S34 to the size of the original image through an up-sampling module and fusing the original image, namely cascading the original image and transforming the original image through a convolution layer to generate a final video shadow detection result;
s36, training the long-time-sequence network of the memory transfer in a supervised and semi-supervised manner by using supervised loss and semi-supervised loss on the labeled data set and the additional unlabeled data set respectively;
and step S4: and (4) carrying out video shadow detection by using the model trained in the step (S3).
6. The semi-supervised video shadow detection method based on the memory transfer long-time network as recited in claim 5, wherein:
the shadow detection pseudo label is generated by adopting a space-time alignment network based on deformable convolution, and the network comprises a feature encoder shared by weight values, a recursive deformable convolution module for motion perception and a pixel-level shadow detection decoder module;
the feature encoder shared by the weight values is used for extracting features of the input video frame and corresponding rough shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask;
the structures of the feature encoder shared by the weight and the recursive deformable convolution module of the motion perception are the same as the structures of the modules in the long-time sequence network transmitted by the memory.
7. The method according to claim 6, wherein the method comprises:
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
8. The method according to claim 6, wherein the method comprises: the shadow detection pseudo label generation method specifically comprises the following steps:
s11, training an existing single image shadow detection network by using an SBU data set;
s12: processing the ViSha data set and the additional unmarked video data frame by frame using the single image shadow detection network trained in step S11 to generate a coarse video shadow detection pseudo label;
s13: supervised training the spatio-temporal alignment network based on deformable convolution by using the rough pseudo labels of the ViSha data set and the ViSha data set obtained in the step S12; the space-time alignment network based on the deformable convolution is a multi-tower network with a plurality of inputs and one output, and refines an image result to a video result by integrating rough shadow detection results of adjacent multiple frames and video frame information;
s14: and (4) generating a video shadow detection pseudo label by using the network trained in the step (S13) and the rough pseudo label of the label-free video data obtained in the step (S12).
9. The method for semi-supervised video shadow detection in long-term memory transfer based networks as recited in claim 8, wherein:
step S13 is to train the deformable convolved spatio-temporal alignment network with the coarse pseudo labels of the ViSha data set and the ViSha data set obtained in step S12, and the specific process is as follows:
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a feature encoder with shared weight, and extracting shadow feature maps of different frames
Figure 605668DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature diagram extracted in S3-1 and the predicted optical flow in S3-2 into a recursive deformable convolution module for motion perception, taking the predicted optical flow as an initial value of position offset between adjacent frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position offset information, wherein a specific calculation formula is as follows:
Figure 466494DEST_PATH_IMAGE003
wherein
Figure 810888DEST_PATH_IMAGE004
Represents an input oftThe frame of the video is a frame of video,
Figure 574444DEST_PATH_IMAGE005
represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is set such that,GMA() Which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a conventional convolutional neural network layer and a deformable convolutional layer in the network;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 457343DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 673561DEST_PATH_IMAGE007
whereinP 0 For the coordinates of the center point of a single convolution operation performed on the feature map,
Figure 821645DEST_PATH_IMAGE008
in order to be able to predict the position offset,
Figure 439708DEST_PATH_IMAGE009
represents a standard convolution kernel radius of
Figure 398568DEST_PATH_IMAGE010
The set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 102082DEST_PATH_IMAGE011
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 53858DEST_PATH_IMAGE012
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 792006DEST_PATH_IMAGE013
wherein
Figure 171035DEST_PATH_IMAGE014
A pseudo-tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
10. The method for detecting the shadow of the semi-supervised video of the long-time-series network based on the memory transfer of claim 9, wherein:
in the step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo label by using an uncertainty estimation method MC-Dropout, and is used to evaluate the accuracy of the pseudo label so as to guide semi-supervised learning, wherein the MC-Dropout method calculates forward propagation results containing a random Dropout layer 10 times, and takes the variance of the results as the uncertainty, and the calculation process is as follows:
Figure 361845DEST_PATH_IMAGE015
wherein
Figure 851732DEST_PATH_IMAGE016
Represents a pixel (x,y) The degree of uncertainty in the position of the location,μ x, y() the mean of 10 forward propagation results is shown.
CN202211051584.6A 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method Active CN115147412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211051584.6A CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211051584.6A CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Publications (2)

Publication Number Publication Date
CN115147412A CN115147412A (en) 2022-10-04
CN115147412B true CN115147412B (en) 2022-12-16

Family

ID=83416542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211051584.6A Active CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Country Status (1)

Country Link
CN (1) CN115147412B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378775A (en) * 2021-06-29 2021-09-10 武汉大学 Video shadow detection and elimination method based on deep learning
CN113436115A (en) * 2021-07-30 2021-09-24 西安热工研究院有限公司 Image shadow detection method based on depth unsupervised learning
CN113628129A (en) * 2021-07-19 2021-11-09 武汉大学 Method for removing shadow of single image by edge attention based on semi-supervised learning
CN114220001A (en) * 2021-11-25 2022-03-22 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on double attention neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL196161A (en) * 2008-12-24 2015-03-31 Rafael Advanced Defense Sys Removal of shadows from images in a video signal
CN113538357B (en) * 2021-07-09 2022-10-25 同济大学 Shadow interference resistant road surface state online detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378775A (en) * 2021-06-29 2021-09-10 武汉大学 Video shadow detection and elimination method based on deep learning
CN113628129A (en) * 2021-07-19 2021-11-09 武汉大学 Method for removing shadow of single image by edge attention based on semi-supervised learning
CN113436115A (en) * 2021-07-30 2021-09-24 西安热工研究院有限公司 Image shadow detection method based on depth unsupervised learning
CN114220001A (en) * 2021-11-25 2022-03-22 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on double attention neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast Shadow Removal Using Adaptive Multi-Scale Illumination Transfer; Chunxia Xiao et al.; Wiley Online Library; 2013-08-20; 1-6 *
Shadow elimination method based on an improved Laplacian-of-Gaussian operator (基于改进拉普拉斯-高斯算子的阴影消除方法); Ma Yongjie et al.; Laser & Optoelectronics Progress; 2020-06-03; 1-5 *

Also Published As

Publication number Publication date
CN115147412A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
Chen et al. Scale-aware domain adaptive faster r-cnn
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
US8620026B2 (en) Video-based detection of multiple object types under varying poses
CN112149547B (en) Remote sensing image water body identification method based on image pyramid guidance and pixel pair matching
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN108053420B (en) Partition method based on finite space-time resolution class-independent attribute dynamic scene
CN106127197B (en) Image saliency target detection method and device based on saliency label sorting
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN112712052A (en) Method for detecting and identifying weak target in airport panoramic video
CN116453121B (en) Training method and device for lane line recognition model
CN114419323A (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN112801236A (en) Image recognition model migration method, device, equipment and storage medium
CN111339892B (en) Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
CN115410162A (en) Multi-target detection and tracking algorithm under complex urban road environment
CN115147412B (en) Long time sequence network for memory transfer and video shadow detection method
CN111476226B (en) Text positioning method and device and model training method
CN110909645B (en) Crowd counting method based on semi-supervised manifold embedding
CN114782827B (en) Object capture point acquisition method and device based on image
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN116363656A (en) Image recognition method and device containing multiple lines of text and computer equipment
CN112464989A (en) Closed loop detection method based on target detection network
Liu et al. MotionRFCN: Motion Segmentation using Consecutive Dense Depth Maps

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant