CN115147412A - Long time sequence network for memory transfer and video shadow detection method - Google Patents
- Publication number
- CN115147412A CN115147412A CN202211051584.6A CN202211051584A CN115147412A CN 115147412 A CN115147412 A CN 115147412A CN 202211051584 A CN202211051584 A CN 202211051584A CN 115147412 A CN115147412 A CN 115147412A
- Authority
- CN
- China
- Prior art keywords
- video
- shadow detection
- network
- shadow
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06N3/08 — Neural networks; learning methods
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016 — Image acquisition modality: video; image sequence
- G06T2207/20081 — Special algorithmic details: training; learning
- G06T2207/20084 — Special algorithmic details: artificial neural networks [ANN]
- G06T2207/20221 — Image combination: image fusion; image merging
Abstract
The invention discloses a memory-transfer long time-sequence network and a video shadow detection method. The method leverages existing annotated image shadow detection datasets to generate high-quality video shadow detection pseudo-labels as additional supervision, alleviating the dependence of learning-based video shadow detection on annotated video data. Furthermore, by explicitly storing the historical information of the video, it efficiently produces shadow detection results with long-range temporal consistency. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalizing to practical applications, and their difficulty in maintaining long-range temporal consistency, achieving robust shadow detection on video data.
Description
Technical Field
The invention belongs to the field of dynamic video illumination identification, and in particular relates to a memory-transfer long time-sequence network and a semi-supervised video shadow detection method based on this network.
Background
Currently, commonly used video shadow detection methods fall into two main categories: 1. Traditional methods based on physical models, such as the video shadow detection method proposed in the paper "Shadow detection algorithms for traffic flow analysis: a comparative study", which works in the HSV color space and distinguishes moving shadows from the background by comparing luminance at the same saturation and hue. Such methods depend on low-dimensional features handcrafted by experts from experience, and only perform well in strongly constrained scenes (e.g., stable illumination and a single moving object). 2. Deep-learning-based methods, which, unlike traditional physical-model-based methods, rely on the strong semantic representation capability of deep learning to adaptively select features for judging video shadows; for example, the method proposed in the paper "Triple-cooperative video shadow detection" uses three parallel networks to cooperatively learn discriminative feature representations at the intra-video and inter-video levels to detect video shadows. Although such methods have made progress on the video shadow detection task, they are limited by the scale of available datasets, are difficult to generalize effectively to practical application scenarios, and struggle to meet practical requirements. The existing video shadow detection technology still lacks a method with strong generalization capability that can meet user needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a memory-transfer long time-sequence network and a semi-supervised video shadow detection method based on this network, aiming to solve the problems that current video shadow detection methods are limited by dataset scale, have insufficient generalization capability, and are difficult to generalize effectively to practical application scenarios.
The invention provides a memory-transfer long time-sequence network, characterized in that: the network uses a memory mechanism, guiding shadow detection in the next video frame by storing and transferring the shadow information of all previous video frames, and generates video shadow detection results with long-range temporal consistency in a lightweight manner. The network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module, and a multi-scale video shadow detection decoder module;
the weight sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on the extracted features according to the position bias;
and the multi-scale video shadow detection decoder module performs video shadow detection using a multi-scale mechanism, based on the aligned video frame features and the historical information stored in the memory transfer module.
Further, the memory transfer module uses a fixed-size memory matrix to store all historical shadow information accumulated during video shadow detection, and dynamically transfers and updates the stored shadow information using a deformable convolution and gating mechanism, which comprises one deformable convolution layer and three convolution layers.
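A gated update of a fixed-size memory matrix can be sketched roughly as follows. This is a minimal NumPy illustration, not the patent's implementation: the convolution layers are stood in for by learned linear maps, and all names and shapes are assumptions.

```python
import numpy as np

def gated_memory_update(memory, frame_feat, W_z, W_r, W_h):
    """Sketch of a gated update for a fixed-size memory matrix.

    memory:     (N, C) stored historical shadow features
    frame_feat: (N, C) aligned features of the current frame
    W_z, W_r, W_h: (2C, C) learned projections standing in for conv layers
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.concatenate([memory, frame_feat], axis=1)   # (N, 2C)
    z = sigmoid(x @ W_z)          # update gate: how much new info to write
    r = sigmoid(x @ W_r)          # reset gate: how much history to expose
    x_r = np.concatenate([r * memory, frame_feat], axis=1)
    h = np.tanh(x_r @ W_h)        # candidate memory content
    return (1.0 - z) * memory + z * h   # selectively store / update

rng = np.random.default_rng(0)
N, C = 25, 8                      # toy stand-in for the 25 x 256 matrix
mem = rng.standard_normal((N, C))
feat = rng.standard_normal((N, C))
Ws = [rng.standard_normal((2 * C, C)) * 0.1 for _ in range(3)]
new_mem = gated_memory_update(mem, feat, *Ws)
print(new_mem.shape)
```

The interpolation `(1 - z) * memory + z * h` is what makes the storage selective: pixels with a near-zero gate keep their history untouched.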
Furthermore, the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers, and one Dropout layer; it predicts video shadow detection results at different scales using a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
Based on the same inventive concept, the invention also provides a semi-supervised video shadow detection method using the memory-transfer long time-sequence network, which consists of an image-assisted video shadow detection pseudo-label generation method and an uncertainty-guided semi-supervised video shadow detection method. The pseudo-label generation uses a spatio-temporal alignment network based on deformable convolution: a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of adjacent frames with video frame information.
The semi-supervised video shadow detection method based on the memory-transfer long time-sequence network comprises the following steps:
step S1: the image-assisted video shadow detection pseudo label generation method provided by the invention is used for generating a video shadow detection pseudo label for the video data without annotation;
step S2: estimating a pixel-level uncertainty map for the video shadow detection pseudo-labels generated in step S1 using an uncertainty estimation method, so as to evaluate the accuracy of the generated pseudo-labels and guide model training;
step S3: using the video shadow detection pseudo-labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time-sequence network, with supervised training on the labeled ViSha dataset and semi-supervised training on the unlabeled video data;
step S4: performing video shadow detection with the model trained in step S3.
The spatio-temporal alignment network based on deformable convolution comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module, and a pixel-level shadow detection decoder module;
the weight sharing feature encoder is used for extracting features of the input video frame and the corresponding rough shadow detection result;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
Further, the motion-aware recursive deformable convolution module uses optical flow between adjacent video frames as a guide for prediction of position bias between adjacent frames, including three convolution layers for predicting the position bias existing between current adjacent video frame features and three deformable convolution layers for spatio-temporally aligning video frame features according to the predicted position bias.
Further, the pixel-level shadow detection decoder includes three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer.
Using the spatio-temporal alignment network based on deformable convolution, the image-assisted video shadow detection pseudo-label generation method comprises the following steps:
S11: train an existing single-image shadow detection network, such as BDRAR, using the SBU dataset;
s12, processing the ViSha data set and the additional unmarked video data frame by using the single image shadow detection network trained in the step S11 to generate a rough video shadow detection pseudo label;
s13, using the rough pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set to train the spatiotemporal alignment network based on deformable convolution provided by the invention in a supervised manner;
and S14, generating a video shadow detection pseudo label by using the network trained in the step S13 and the rough pseudo label of the label-free video data obtained in the step S12.
Step S13 trains the spatio-temporal alignment network based on deformable convolution provided by the invention using the coarse pseudo-labels obtained in step S12 together with the ViSha dataset; the specific process is as follows:
S3-1: concatenate the coarse pseudo-labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the weight-shared feature encoder, and extract shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical flow prediction network GMA to respectively predict the optical flow from the (t-x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; take the predicted optical flow as the initial value of the position offset between adjacent frames, and recursively refine it in a residual manner to predict accurate position offset information. The specific calculation formula is:

ΔP^(0)_{t→t-x} = GMA(I_t, I_{t-x})
ΔP^(i)_{t→t-x} = ΔP^(i-1)_{t→t-x} + Conv( DCN(F_{t-x}, ΔP^(i-1)_{t→t-x}), F_t )

where ΔP^(i)_{t→t-x} represents the position offset between the t-th frame and the (t-x)-th frame obtained by the i-th recursion, GMA(·) represents the optical flow prediction network GMA used, and Conv(·) and DCN(·) represent a conventional convolutional neural network layer and a deformable convolutional layer, respectively;
S3-4: using the position offsets predicted in S3-3, perform feature alignment on the adjacent-frame features F_{t-x} with deformable convolution. The calculation formula is:

F̂_{t-x}(P_0) = Σ_{P_n ∈ R} ω(P_n) · F_{t-x}(P_0 + P_n + ΔP_n)

where R represents the sampling grid of a standard convolution kernel with radius r, ω(P_n) is the convolution kernel coefficient at location P_n, and F̂_{t-x} represents the feature map after alignment;
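The sampling rule above can be illustrated with a naive single-channel sketch. This is not the patent's implementation: real deformable convolution uses bilinear interpolation and multi-channel tensors, whereas this toy version uses nearest-neighbour sampling for brevity; all names are assumptions.

```python
import numpy as np

def deformable_conv2d_single(feat, weight, offsets):
    """Naive single-channel deformable convolution (illustrative only).

    feat:    (H, W) input feature map F_{t-x}
    weight:  (k, k) kernel coefficients omega(P_n)
    offsets: (H, W, k*k, 2) learned offsets Delta P_n per output location
    Samples feat at P_0 + P_n + Delta P_n, nearest-neighbour for brevity.
    """
    H, W = feat.shape
    k = weight.shape[0]
    r = k // 2
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for n, (dy, dx) in enumerate(grid):
                oy, ox = offsets[y, x, n]
                sy = int(np.clip(round(y + dy + oy), 0, H - 1))
                sx = int(np.clip(round(x + dx + ox), 0, W - 1))
                acc += weight[dy + r, dx + r] * feat[sy, sx]
            out[y, x] = acc
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
w = np.zeros((3, 3)); w[1, 1] = 1.0   # identity kernel
off = np.zeros((5, 5, 9, 2))          # zero offsets -> plain convolution
aligned = deformable_conv2d_single(feat, w, off)
print(np.allclose(aligned, feat))
```

With zero offsets the operation reduces to an ordinary convolution; nonzero offsets let each kernel tap follow the motion field, which is exactly how the predicted position bias warps the adjacent-frame features into alignment.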
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
S3-6: supervise the training of the spatio-temporal alignment network based on deformable convolution with the Binary Cross-Entropy loss L_bce. The specific calculation formula is:

L_bce = − Σ [ G · log(Ŷ) + (1 − G) · log(1 − Ŷ) ]

where Ŷ is the generated video shadow detection pseudo-label result and G is the manually annotated ground-truth video shadow detection label.
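The Binary Cross-Entropy loss used here is standard; a minimal NumPy version (with clipping added to avoid log(0), an implementation detail not stated in the source) looks like this:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between a predicted shadow mask and ground truth."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

pred = np.array([0.9, 0.1, 0.8, 0.2])   # predicted shadow probabilities
gt   = np.array([1.0, 0.0, 1.0, 0.0])   # binary ground-truth mask
loss = bce_loss(pred, gt)
print(round(loss, 4))
```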
Further, in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo-labels using the uncertainty estimation method MC-Dropout, and is used to evaluate pseudo-label accuracy and guide semi-supervised learning. The MC-Dropout method computes 10 forward passes through the network with its random Dropout layer active and takes the variance of the results as the uncertainty:

U(x, y) = (1/10) Σ_{i=1..10} ( p_i(x, y) − μ_{x,y} )²

where U(x, y) represents the uncertainty at pixel (x, y), p_i(x, y) is the prediction of the i-th forward pass, and μ_{x,y} represents the mean of the 10 forward propagation results.
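The variance-over-passes computation is simple once the stochastic forward passes are collected; a small NumPy sketch (the passes here are simulated with random noise, an assumption for illustration):

```python
import numpy as np

def mc_dropout_uncertainty(forward_passes):
    """Pixel-wise uncertainty as the variance of T stochastic forward passes.

    forward_passes: (T, H, W) shadow probabilities from T runs with the
    Dropout layer kept active at inference time (MC-Dropout).
    Returns (uncertainty_map, mean_prediction).
    """
    mu = forward_passes.mean(axis=0)                 # mu_{x,y}
    u = ((forward_passes - mu) ** 2).mean(axis=0)    # per-pixel variance
    return u, mu

rng = np.random.default_rng(1)
T, H, W = 10, 4, 4
# simulated passes: confident ~0.7 predictions with small dropout jitter
passes = np.clip(0.7 + 0.05 * rng.standard_normal((T, H, W)), 0, 1)
u_map, mu = mc_dropout_uncertainty(passes)
print(u_map.shape)
```

Pixels where the stochastic passes disagree get a large variance, flagging pseudo-label positions the semi-supervised loss should down-weight.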
Further, in step S3, the long-time-sequence network transferred by the memory performs supervised and semi-supervised training on the labeled ViSha data set and the unlabelled video data, respectively, and the specific process is as follows:
S31: input two adjacent video frames into the weight-shared feature encoder and the optical flow prediction network GMA, respectively, to extract multi-scale feature maps of the two adjacent frames and the optical flow between them;
S32: input the feature maps extracted in S31 and the optical flow into the motion-aware recursive deformable convolution module to predict the position bias between the video frames, and align the extracted features using this bias;
S33: according to the position bias predicted in S32, perform memory transfer on the historical shadow information stored in the memory matrix using deformable convolution, and use a gating mechanism to selectively store into and update the memory matrix according to the memory transfer result and the features of the current video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, and inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder to predict video shadow detection results on different scales;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module, and fusing the video shadow detection results to generate a final video shadow detection result;
S36: apply the supervised loss L_full on the labeled ViSha dataset D_L and the semi-supervised loss L_semi on the additional unlabeled dataset D_U to train the memory-transfer long time-sequence network in a supervised and semi-supervised manner, respectively. The specific calculation formulas are:

L_full = l_bce(O_L, G_L)
L_semi = Σ( (1 − Umap) · l_bce(O_U, Ŷ_U) ) / ( Σ(1 − Umap) + ε )

where l_bce is the Binary Cross-Entropy loss, O_L and G_L are respectively the output of the memory-transfer long time-sequence network on the labeled ViSha dataset D_L and the corresponding ground truth, Umap is the uncertainty map computed in step S2, O_U and Ŷ_U are respectively the network's output on the unlabeled data and the generated pseudo-label, and ε is used to prevent division by zero.
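The uncertainty-weighted semi-supervised term can be sketched as below. This is a NumPy illustration of the weighting scheme described in the text, not the patent's code; the exact form of the weighting is an assumption consistent with "uncertain pixels contribute less, ε prevents division by zero".

```python
import numpy as np

def bce_map(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy (no reduction)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def semi_supervised_loss(output_u, pseudo_label, umap, eps=1e-7):
    """Uncertainty-weighted BCE on unlabeled data: pixels whose pseudo-label
    is uncertain (high umap) contribute less; eps avoids division by zero."""
    w = 1.0 - umap                  # per-pixel confidence weight
    return float((w * bce_map(output_u, pseudo_label)).sum()
                 / (w.sum() + eps))

out    = np.array([[0.9, 0.2], [0.7, 0.4]])   # network output on unlabeled data
pseudo = np.array([[1.0, 0.0], [1.0, 0.0]])   # generated pseudo-label
umap   = np.array([[0.0, 0.0], [0.9, 0.9]])   # bottom row: uncertain pixels
loss = semi_supervised_loss(out, pseudo, umap)
print(loss > 0)
```

With a zero uncertainty map the loss degenerates to plain mean BCE, so the weighting only changes behaviour where the pseudo-labels are unreliable.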
The invention has the advantages that:
1. The memory-transfer long time-sequence network provided by the invention can, via its memory mechanism, generate video shadow detection results with long-range temporal consistency at a small cost in time and computation.
2. The spatio-temporal alignment network based on deformable convolution provided by the invention can explicitly align information between adjacent frames, so that the model integrates adjacent frames better and produces spatio-temporally consistent results.
3. The image-assisted video shadow detection pseudo-label generation method provided by the invention can exploit existing large-scale image shadow detection datasets to generate robust video shadow detection pseudo-labels despite the limited scale of video shadow detection datasets.
4. The uncertainty-guided semi-supervised video shadow detection method improves the generalization capability of the model with additional pseudo-labels, while the introduction of the uncertainty map makes the use of the pseudo-labels more accurate.
Drawings
Fig. 1 is a schematic diagram of the memory-transfer long time-sequence network according to an embodiment.
FIG. 2 is a schematic diagram of a spatio-temporal alignment network based on deformable convolution in an embodiment.
Fig. 3 is a schematic diagram of a semi-supervised video shadow detection method using a long-time-series network with memory transfer in an embodiment.
Detailed Description
For a further understanding of the present invention, its objects, technical solutions, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and embodiments. It is to be understood that the embodiments are illustrative only and not limiting.
The spatio-temporal alignment network based on deformable convolution for generating video shadow detection pseudo-labels comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module, and a pixel-level shadow detection decoder module:
In this embodiment, the convolutional part of ResNeXt101 is used as the feature encoder, providing four feature maps at 1/4, 1/8, 1/16, and 1/16 of the input image size;
The motion-aware recursive deformable convolution module recursively predicts the position bias between adjacent frames caused by motion, using the optical flow between adjacent frames as guidance, and spatio-temporally aligns the extracted features accordingly. Specifically, it comprises three convolutional layers and three deformable convolutional layers: the convolutional layers predict the position bias between the features of the current adjacent video frames, and the deformable convolutional layers spatio-temporally align the video frame features according to the predicted bias, where each deformable convolution comprises a conventional convolutional layer and a position-bias prediction layer;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampled layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
In addition, in order to obtain a lightweight video shadow detection model with stronger generalization capability for real-time detection, the generated video shadow detection pseudo-labels are used to train the memory-transfer long time-sequence network for video shadow detection, which comprises a weight-shared feature encoder, a memory transfer module, two motion-aware recursive deformable convolution modules, and a multi-scale video shadow detection decoder module:
the feature encoder shared by the weight in the network and the feature encoder of the space-time alignment network based on the deformable convolution provided by the invention adopt the same structure;
Two motion-aware recursive deformable convolution modules are used to predict the position offsets of adjacent video frames at the 1/8 and 1/16 scales, respectively, and to align the information of adjacent frames with deformable convolution according to these offsets, taking computational resources and time costs into account. Specifically, the optical flow used in these modules is obtained by resizing the optical flow generated by the public GMA pre-trained network to the 1/8 and 1/16 scales. Furthermore, the two modules adopt the same structure;
The memory transfer module comprises a fixed-size memory matrix (25 × 256) for storing all historical shadow information accumulated during video shadow detection, and dynamically transfers and updates the stored shadow information using a deformable convolution and gating mechanism comprising one deformable convolution layer and three convolution layers. The deformable convolution uses the position bias predicted by the motion-aware recursive deformable convolution module at the 1/16 scale to align the historical features stored in the memory matrix;
the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and a Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the final video shadow detection results are obtained by fusing the multi-scale prediction results. The random drop probability for a particular Dropout layer is set to 0.1.
Based on the same conception, the invention also provides a semi-supervised video shadow detection method of the long time sequence network transmitted by using the memory. The method comprises two parts, namely an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The image-assisted video shadow detection pseudo label generation method comprises the following steps:
step S11: an existing learning-based image shadow detection method BDRAR is trained by using an SBU image shadow detection data set, and the specific training strategy is consistent with that in the thesis.
Step S12: the ViSha dataset and the additional unlabeled video data are divided into consecutive video frames, which are processed frame by frame with the BDRAR model trained in step S11 to generate coarse video shadow detection pseudo-labels.
Step S13: carrying out supervised training on the spatio-temporal alignment network based on deformable convolution provided by the invention by using the rough video shadow detection pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set;
S3-1: concatenate the coarse pseudo-labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the weight-shared feature encoder, and extract shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical flow prediction network GMA to respectively predict the optical flow from the (t-x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; take the predicted optical flow between adjacent video frames as the initial value of the position offset between them, and recursively refine it in a residual manner to predict accurate position offset information. The specific calculation formula is:

ΔP^(0)_{t→t-x} = GMA(I_t, I_{t-x})
ΔP^(i)_{t→t-x} = ΔP^(i-1)_{t→t-x} + Conv( DCN(F_{t-x}, ΔP^(i-1)_{t→t-x}), F_t )

where I_t represents the input t-th video frame, ΔP^(i)_{t→t-x} represents the position offset between the t-th frame and the (t-x)-th frame obtained by the i-th recursion, GMA(·) represents the optical flow prediction network GMA used, and Conv(·) and DCN(·) represent a conventional convolutional neural network layer and a deformable convolutional layer in the network, respectively;
S3-4: using the position offsets predicted in S3-3, perform feature alignment on the adjacent-frame features F_{t-x} with deformable convolution. The calculation formula is:

F̂_{t-x}(P_0) = Σ_{P_n ∈ R} ω(P_n) · F_{t-x}(P_0 + P_n + ΔP_n)

where P_0 is the center coordinate of a single convolution operation on the feature map, ΔP_n is the predicted position offset, R represents the sampling grid of a standard convolution kernel with radius r, ω(P_n) is the convolution kernel coefficient at location P_n, and F̂_{t-x} represents the feature map after alignment;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
S3-6: supervise the training of the spatio-temporal alignment network based on deformable convolution with the Binary Cross-Entropy loss L_bce. The specific calculation formula is:

L_bce = − Σ [ G · log(Ŷ) + (1 − G) · log(1 − Ŷ) ]

where Ŷ is the generated video shadow detection pseudo-label result and G is the manually annotated ground-truth video shadow detection label.
Step S14: and generating a video shadow detection pseudo label by using the space-time alignment network based on the deformable convolution trained in the step S13 and the rough pseudo label of the label-free video data generated in the step S12. Specifically, the following process is performed:
Input the original images pairwise into the optical flow prediction network GMA to predict the optical flow between adjacent frames; concatenate the 5 adjacent original video frames with their corresponding generated coarse pseudo-labels and input them, together with the predicted optical flow, into the trained spatio-temporal alignment network based on deformable convolution, which outputs spatio-temporally consistent video shadow detection pseudo-labels. Compose the generated consecutive pseudo-labels at the original frame rate to obtain the video shadow detection pseudo-label in video format.
Using the generated video shadow detection pseudo-labels, the semi-supervised video shadow detection method based on the memory-transfer long time-sequence network provided by the invention comprises the following steps:
step S1: the image-assisted video shadow detection pseudo tag generation method is used for generating video shadow detection pseudo tags for unmarked video data.
Step S2: compute a pixel-level uncertainty map for the generated video shadow detection pseudo-labels using the uncertainty estimation method MC-Dropout. The MC-Dropout method computes 20 forward passes through the network with its random Dropout layer active and takes the variance of the results as the uncertainty:

U(x, y) = (1/20) Σ_{i=1..20} ( p_i(x, y) − μ_{x,y} )²

where U(x, y) represents the uncertainty at pixel (x, y), p_i(x, y) is the prediction of the i-th forward pass, and μ_{x,y} represents the mean of the 20 forward propagation results.
Step S3: use the video shadow detection pseudo-labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time-sequence network, with supervised training on the labeled ViSha dataset and semi-supervised training on the unlabeled video data:
S31: input two adjacent video frames into the weight-shared feature encoder and the optical flow prediction network GMA, respectively, to extract multi-scale feature maps of the two adjacent frames and the optical flow between them;
S32: input the feature maps extracted in S31 and the optical flow into the motion-aware recursive deformable convolution module to predict the position bias between the video frames, and align the extracted features using this bias;
S33: according to the position bias predicted in S32, perform memory transfer on the historical shadow information stored in the memory matrix using deformable convolution, and use a gating mechanism to selectively store into and update the memory matrix according to the memory transfer result and the features of the current video frame. The memory matrix is initialized with the features of the first video frame;
S34: concatenate the updated historical shadow information from S33 and the aligned video frame feature maps from S32 along the channel dimension, input them into the multi-scale video shadow detection decoder, and predict video shadow detection results at 1/8, 1/4, 1/2 and full scale of the original input image;
S35: restore the multi-scale video shadow detection results predicted in S34 to the original image size through an up-sampling module and fuse them, i.e. concatenate them and transform them through a convolutional layer to generate the final video shadow detection result;
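The multi-scale fusion of S35 can be sketched as follows. This NumPy sketch stands in for the learned layers: nearest-neighbour upsampling replaces the up-sampling module and a mean replaces the fusion convolution, so it shows the data flow only, not the trained operators.

```python
import numpy as np

def upsample_nearest(p, factor):
    """Nearest-neighbour stand-in for the learned up-sampling module."""
    return np.repeat(np.repeat(p, factor, axis=0), factor, axis=1)

def fuse_multiscale(preds):
    """Restore the 1/8, 1/4, 1/2 and full-scale predictions to the input
    size, stack them (the "cascade" on the channel dimension) and fuse;
    the learned fusion convolution is reduced to a mean here."""
    full = [upsample_nearest(p, f) for p, f in zip(preds, (8, 4, 2, 1))]
    return np.stack(full, axis=0).mean(axis=0)

H = W = 16
preds = [np.full((H // f, W // f), 0.5) for f in (8, 4, 2, 1)]
fused = fuse_multiscale(preds)
```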
S36: train the memory-transfer long-time-sequence network with a supervised loss L_full on the labeled ViSha dataset D_L and a semi-supervised loss L_semi on the additional unlabeled dataset D_U (in this embodiment the test split of the ViSha dataset serves as the unlabeled dataset, without manual annotation):

L_full = l_BCE(O_L, G_L)

L_semi = Σ (1 − Umap) ⊙ l_BCE(O_U, G̃_U) / ( Σ (1 − Umap) + ε )

where l_BCE is the Binary Cross-Entropy loss, O_L and G_L are the output of the memory-transfer long-time-sequence network on the labeled ViSha dataset D_L and the corresponding ground truth, Umap is the uncertainty map computed in step S2, O_U and G̃_U are the network output on the unlabeled data and the generated pseudo label respectively, and ε prevents division by zero.
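The uncertainty-weighted semi-supervised loss described in S36 can be sketched as follows. This is a reconstruction under the stated definitions (down-weighting by 1 − Umap with an ε against division by zero), not the patent's verbatim formula; function names are illustrative.

```python
import numpy as np

def bce(o, g, eps=1e-7):
    """Per-pixel binary cross-entropy."""
    o = np.clip(o, eps, 1.0 - eps)
    return -(g * np.log(o) + (1.0 - g) * np.log(1.0 - o))

def semi_supervised_loss(o_u, pseudo, umap, eps=1e-7):
    """Down-weight pixels whose pseudo label is uncertain; eps guards
    against the divide-by-zero mentioned in the text."""
    w = 1.0 - umap
    return float((w * bce(o_u, pseudo)).sum() / (w.sum() + eps))

o = np.array([[0.9, 0.2]])
pl = np.array([[1.0, 0.0]])
certain = semi_supervised_loss(o, pl, np.zeros_like(o))  # fully trusted
ignored = semi_supervised_loss(o, pl, np.ones_like(o))   # fully uncertain
```

With a zero uncertainty map the loss reduces to plain BCE on the pseudo labels; with full uncertainty every pixel is masked out and the loss vanishes.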
Step S4: perform video shadow detection with the model trained in step S3, as follows:
Decompose the video requiring shadow detection into independent video frames and normalize each frame:

Image_norm = (Image − mean) / std

where Image represents an independent video frame, and mean and std are the channel-wise normalization statistics;
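The per-frame normalization can be sketched as follows. The patent does not state the statistics, so this sketch assumes the ImageNet mean and standard deviation, a common choice when the feature encoder is pretrained.

```python
import numpy as np

def normalize_frame(frame, mean, std):
    """Scale an 8-bit frame to [0, 1] and normalize per channel."""
    x = frame.astype(np.float64) / 255.0
    return (x - mean) / std

# Assumed statistics (not given in the patent text).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

frame = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in video frame
norm = normalize_frame(frame, IMAGENET_MEAN, IMAGENET_STD)
```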
Input two adjacent video frames at a time into the memory-transfer long-time-sequence network in temporal order to obtain the detection result of the later frame, updating the stored memory matrix as detection proceeds (when processing the first video frame, two copies of that frame are input into the network);
Classify the network detection result with a threshold of 0.5 to obtain a binary 0-1 shadow detection mask, and reassemble the video frames into video data to obtain the final video shadow detection result.
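The thresholding step above is a one-liner; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def binarize(prob_map, threshold=0.5):
    """Classify the network's probability output into a 0-1 shadow mask."""
    return (prob_map >= threshold).astype(np.uint8)

mask = binarize(np.array([[0.8, 0.3], [0.5, 0.1]]))
```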
This embodiment provides a semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network. It uses existing labeled image shadow detection datasets to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labeled data; furthermore, it efficiently produces long-term temporally consistent detection results by explicitly storing the historical information of the video. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalizing to practical applications, and their difficulty in maintaining long-term temporal consistency, achieving robust shadow detection on video data.
Claims (10)
1. A memory-transfer long-time-sequence network, comprising: a weight-sharing feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the weight-sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting the position offsets caused by motion between adjacent frames and spatio-temporally aligning the extracted features according to these offsets;
the multi-scale video shadow detection decoder module performs video shadow detection with a multi-scale mechanism according to the aligned video frame features and the historical information stored in the memory transfer module.
2. The memory-transfer long-time-sequence network of claim 1, wherein: the memory transfer module uses a fixed-size memory matrix to store all historical shadow information during video shadow detection, and dynamically transfers and updates the stored shadow information with deformable convolution and a gating mechanism; the memory transfer module comprises one deformable convolutional layer and three convolutional layers.
3. The memory-transfer long-time-sequence network of claim 1, wherein: the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers and one Dropout layer; it predicts video shadow detection results at different scales with a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
4. The memory-transfer long-time-sequence network of claim 1, wherein:
the spatio-temporal alignment adopts a deformable-convolution-based spatio-temporal alignment network comprising a weight-sharing feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight-sharing feature encoder extracts features from the input video frames and the corresponding coarse shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts the position offsets caused by motion between adjacent frames, guided by the optical flow between adjacent frames, and spatio-temporally aligns the extracted features according to these offsets;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
5. The memory-transfer long-time-sequence network of claim 4, wherein:
the deformable-convolution spatio-temporal alignment network comprises a weight-sharing feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight-sharing feature encoder extracts features from the input video frames and the corresponding coarse shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts the position offsets caused by motion between adjacent frames, guided by the optical flow between adjacent frames, and spatio-temporally aligns the extracted features according to these offsets;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
6. The memory-transfer long-time-sequence network of claim 5, wherein:
the motion-aware recursive deformable convolution module uses the optical flow between adjacent video frames as a guide for predicting position offsets between adjacent frames; the module comprises three convolutional layers and three deformable convolutional layers; the convolutional layers predict the position offsets between the features of adjacent video frames, and the deformable convolutional layers spatio-temporally align the video frame features according to the predicted offsets.
7. A semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network, characterized by comprising the following steps:
step S1: generating a video shadow detection pseudo label from the video data without the label;
step S2: estimate a pixel-level uncertainty map for the video shadow detection pseudo labels generated in step S1 with an uncertainty estimation method, to evaluate the accuracy of the generated pseudo labels;
step S3: using the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2, perform supervised training on the labeled dataset and semi-supervised training on the unlabeled video data, respectively, with the memory-transfer long-time-sequence network of any one of claims 1 to 6;
and step S4: and (4) carrying out video shadow detection by using the model trained in the step (S3).
8. The semi-supervised video shadow detection method according to claim 7, wherein step S1 comprises:
S11: train an existing single-image shadow detection network on the SBU dataset;
S12: process the ViSha dataset and the additional unlabeled video data frame by frame with the single-image shadow detection network trained in step S11 to generate coarse video shadow detection pseudo labels;
S13: train the spatio-temporal alignment network in a supervised manner with the ViSha dataset and the coarse pseudo labels of the ViSha dataset obtained in step S12; the spatio-temporal alignment network is a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of multiple adjacent frames with the video frame information;
S14: generate the video shadow detection pseudo labels with the network trained in step S13 and the coarse pseudo labels of the unlabeled video data obtained in step S12.
9. The semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network as recited in claim 8, wherein:
step S13 trains the deformable-convolution spatio-temporal alignment network with the ViSha dataset and the coarse pseudo labels of the ViSha dataset obtained in step S12, as follows:
S3-1: concatenate the coarse pseudo labels of 5 adjacent ViSha video frames with the corresponding original shadow video frames, input them into the shared feature encoder, and extract the shadow feature maps of the different frames;
S3-2: input the 5 original shadow video frames used in S3-1 pairwise into the existing optical-flow prediction network GMA to respectively predict the optical flow from each (t−x)-th frame to the t-th frame;
S3-3: input the feature maps extracted in S3-1 and the optical flow predicted in S3-2 into the motion-aware recursive deformable convolution module; the predicted optical flow serves as the initial value of the position offsets between adjacent frames and is recursively refined in a residual manner to predict accurate position offsets:

ΔP_0^{t−x→t} = GMA(F_{t−x}, F_t)
ΔP_i^{t−x→t} = ΔP_{i−1}^{t−x→t} + Conv( F_t, DCN(F_{t−x}, ΔP_{i−1}^{t−x→t}) )

where F_t represents the input t-th video frame, ΔP_i^{t−x→t} represents the position offset between the t-th frame and the (t−x)-th frame obtained at the i-th recursion, GMA(·) represents the optical-flow prediction network GMA, and Conv(·) and DCN(·) represent a conventional convolutional layer and a deformable convolutional layer in the network, respectively;
S3-4: use deformable convolution with the position offsets predicted in S3-3 to align the features of each (t−x)-th frame to the t-th frame:

F'(P_0) = Σ_{P_n ∈ R} ω(P_n) · F(P_0 + P_n + ΔP_n)

where P_0 is the center coordinate of a single convolution operation on the feature map, ΔP_n is the predicted position offset, R represents the sampling grid of a standard convolution kernel, ω(P_n) is the convolution kernel coefficient at position P_n, and F' represents the feature map after alignment;
S3-5: concatenate the aligned feature maps obtained in S3-4 along the channel dimension, input them into the pixel-level shadow detection decoder, restore them to the size of the original input image through up-sampling layers, and generate the video shadow detection pseudo label;
S3-6: train the deformable-convolution-based spatio-temporal alignment network in a supervised manner with the Binary Cross-Entropy loss:

l_BCE(O, G) = − Σ_{x,y} [ G(x, y) · log O(x, y) + (1 − G(x, y)) · log(1 − O(x, y)) ]

where O is the network output and G is the corresponding ViSha ground truth.
10. The semi-supervised video shadow detection method based on a memory-transfer long-time-sequence network as recited in claim 9, wherein:
in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo labels with the uncertainty estimation method MC-Dropout, which evaluates the accuracy of the pseudo labels and guides the semi-supervised learning; MC-Dropout runs 20 forward passes with the random Dropout layers kept active and takes the variance of the results as the uncertainty:

Umap(x, y) = (1/20) Σ_{t=1..20} ( O_t(x, y) − μ(x, y) )².
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211051584.6A CN115147412B (en) | 2022-08-31 | 2022-08-31 | Long time sequence network for memory transfer and video shadow detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115147412A true CN115147412A (en) | 2022-10-04 |
CN115147412B CN115147412B (en) | 2022-12-16 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110273620A1 (en) * | 2008-12-24 | 2011-11-10 | Rafael Advanced Defense Systems Ltd. | Removal of shadows from images in a video signal |
CN113378775A (en) * | 2021-06-29 | 2021-09-10 | 武汉大学 | Video shadow detection and elimination method based on deep learning |
CN113436115A (en) * | 2021-07-30 | 2021-09-24 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
CN113538357A (en) * | 2021-07-09 | 2021-10-22 | 同济大学 | Shadow interference resistant road surface state online detection method |
CN113628129A (en) * | 2021-07-19 | 2021-11-09 | 武汉大学 | Method for removing shadow of single image by edge attention based on semi-supervised learning |
CN114220001A (en) * | 2021-11-25 | 2022-03-22 | 南京信息工程大学 | Remote sensing image cloud and cloud shadow detection method based on double attention neural networks |
Non-Patent Citations (2)
Title |
---|
CHUNXIA XIAO et al.: "Fast Shadow Removal Using Adaptive Multi-Scale Illumination Transfer", Wiley Online Library
MA Yongjie et al.: "Shadow Removal Method Based on an Improved Laplacian-of-Gaussian Operator", Laser & Optoelectronics Progress
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |