CN115147412B - Long time sequence network for memory transfer and video shadow detection method - Google Patents

Long time sequence network for memory transfer and video shadow detection method

Info

Publication number
CN115147412B
Authority
CN
China
Prior art keywords
video
shadow detection
shadow
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211051584.6A
Other languages
Chinese (zh)
Other versions
CN115147412A (en)
Inventor
肖春霞 (Xiao Chunxia)
陈子沛 (Chen Zipei)
罗飞 (Luo Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202211051584.6A priority Critical patent/CN115147412B/en
Publication of CN115147412A publication Critical patent/CN115147412A/en
Application granted granted Critical
Publication of CN115147412B publication Critical patent/CN115147412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a memory-transfer long time sequence network and a video shadow detection method. The method uses an existing annotated image shadow detection data set to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labelled data; in addition, by explicitly storing the historical information of the video, it efficiently generates video shadow detection results that remain consistent over long time sequences. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalising to practical applications, and their difficulty in maintaining long-range temporal consistency, and achieves robust shadow detection on video data.

Description

Long time sequence network for memory transfer and video shadow detection method
Technical Field
The invention belongs to the field of dynamic video illumination identification, and particularly relates to a long time sequence network with memory transfer and a semi-supervised video shadow detection method based on this network.
Background
Currently, commonly used video shadow detection methods fall into two main categories. 1. Traditional methods based on physical models, such as the video shadow detection method proposed in the paper "Shadow detection algorithms for traffic flow analysis: a comparative study", which works in the HSV colour space and distinguishes moving shadows from the background by comparing luminance at the same saturation and hue. Such methods rely on low-dimensional features hand-crafted by experts and only obtain good results in strongly constrained scenes (e.g. stable illumination conditions and a single moving object). 2. Deep-learning-based methods, which, unlike methods based on traditional physical models, rely on the strong semantic representation capability of deep learning to adaptively select features for judging video shadows; for example, the triple-cooperative network proposed in the paper "Triple-cooperative video shadow detection" uses three parallel networks to collaboratively learn discriminative feature representations at the intra-video and inter-video levels to detect video shadows. Although such methods have made progress on the video shadow detection task, they are limited by the scale of the available data sets, are difficult to generalise effectively to practical application scenarios and struggle to meet practical requirements. Existing video shadow detection technology still lacks a method with strong generalisation capability that can satisfy user needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a long time sequence network with memory transfer and a semi-supervised video shadow detection method based on this network, aiming to solve the problems that current video shadow detection methods are limited by the scale of the data set, have insufficient generalisation capability and are difficult to generalise effectively to practical application scenarios.
The invention provides a long time sequence network for memory transfer, characterised in that: the network uses a memory mechanism, guiding the detection of shadows in the next video frame by storing and transferring the shadow information of all previous video frames, and generating long-temporally consistent video shadow detection results in a lightweight manner; the network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the feature encoder shared by the weight values is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on the extracted features according to the position bias;
and the multi-scale video shadow detection decoder module performs video shadow detection according to the aligned video frame characteristics and the historical information stored in the memory module by using a multi-scale mechanism.
Further, the memory transfer module uses a fixed-size memory matrix to store all historical shadow information during the video shadow detection process, and uses a deformable convolution and gating mechanism to dynamically transfer and update the stored shadow information, wherein the deformable convolution and gating mechanism comprises one deformable convolution layer and three convolution layers.
Furthermore, the multi-scale video shadow detection decoder module comprises three residual layers, four upsampling layers, eight convolutional layers and one Dropout layer; it predicts video shadow detection results at different scales using a multi-scale mechanism and fuses the multi-scale predictions to obtain the final video shadow detection result.
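As a rough illustration of the memory transfer module described above, a minimal PyTorch-style sketch is given below. The module and parameter names, the channel size, and the use of a spatial memory feature map (rather than the fixed-size memory matrix described later in the embodiment) are assumptions of this sketch, not details fixed by the invention.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MemoryTransferModule(nn.Module):
    """Sketch: transfers the stored historical shadow features with a deformable
    convolution driven by the predicted position offsets, then updates the memory
    through a convolutional gate (one deformable conv layer + three conv layers)."""
    def __init__(self, channels=256):
        super().__init__()
        # one deformable convolution layer that transfers the stored memory
        self.transfer = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        # three convolution layers forming the gating mechanism
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, memory, frame_feat, offset):
        # offset: sampling offsets from the motion-aware module,
        # shape (N, 2*3*3, H, W) as required by DeformConv2d with a 3x3 kernel
        transferred = self.transfer(memory, offset)
        g = self.gate(torch.cat([transferred, frame_feat], dim=1))
        # gated update: selectively write current-frame evidence into the memory
        return g * frame_feat + (1.0 - g) * transferred
```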
Based on the same inventive concept, the invention also provides a semi-supervised video shadow detection method using the memory-transfer long time sequence network, which consists of an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The pseudo label generation relies on a spatio-temporal alignment network based on deformable convolution, a multi-tower network with multiple inputs and one output, which refines image-level results into video-level results by integrating the coarse shadow detection results of adjacent frames with video frame information.
The semi-supervised video shadow detection method based on the memory-transfer long time sequence network comprises the following operation steps:
step S1: use the image-assisted video shadow detection pseudo label generation method provided by the invention to generate video shadow detection pseudo labels for unlabelled video data;
step S2: estimate a pixel-level uncertainty map for the video shadow detection pseudo labels generated in step S1 using an uncertainty estimation method, so as to evaluate the accuracy of the generated pseudo labels and guide the training of the model;
step S3: use the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time sequence network with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data;
step S4: perform video shadow detection using the model trained in step S3.
The spatio-temporal alignment network based on deformable convolution comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module;
the weight sharing feature encoder is used for extracting features of the input video frame and the corresponding rough shadow detection result;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask.
Further, the motion-aware recursive deformable convolution module uses optical flow between adjacent video frames as a guide for prediction of position bias between adjacent frames, including three convolution layers for predicting the position bias existing between current adjacent video frame features and three deformable convolution layers for spatio-temporally aligning video frame features according to the predicted position bias.
Further, the pixel-level shadow detection decoder includes three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer.
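For concreteness, a minimal sketch of such a pixel-level decoder is shown below; the channel widths and the ordering of the layers are illustrative assumptions, only the layer counts (three residual layers, four upsampling layers, three convolution layers, one Dropout layer) and the 0.1 drop probability mentioned in the embodiment follow the description.

```python
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class PixelShadowDecoder(nn.Module):
    """Sketch: restores aligned 1/16-scale features to input resolution and
    predicts a shadow position mask (logits)."""
    def __init__(self, in_channels=256):
        super().__init__()
        # three residual layers
        self.res = nn.Sequential(
            BasicBlock(in_channels, in_channels),
            BasicBlock(in_channels, in_channels),
            BasicBlock(in_channels, in_channels),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # three convolution layers and one Dropout layer
        self.conv1 = nn.Conv2d(in_channels, 128, 3, padding=1)
        self.conv2 = nn.Conv2d(128, 64, 3, padding=1)
        self.dropout = nn.Dropout2d(p=0.1)
        self.conv3 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        x = self.res(x)
        x = self.up(self.conv1(x))           # 1/16 -> 1/8  (upsampling 1)
        x = self.up(self.conv2(x))           # 1/8  -> 1/4  (upsampling 2)
        x = self.dropout(x)
        x = self.up(self.up(self.conv3(x)))  # 1/4 -> 1/2 -> 1 (upsamplings 3 and 4)
        return x                             # shadow position mask logits
```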
By utilizing the space-time alignment network based on the deformable convolution, the image-assisted video shadow detection pseudo label generation method comprises the following steps:
s11, training an existing single image shadow detection network by using an SBU data set, such as: BDRAR;
s12, processing the ViSha data set and the additional unmarked video data frame by using the single image shadow detection network trained in the step S11 to generate a rough video shadow detection pseudo label;
s13, using the rough pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set to train the spatiotemporal alignment network based on deformable convolution provided by the invention in a supervised manner;
and S14, generating a video shadow detection pseudo label by using the network trained in the step S13 and the rough pseudo label of the label-free video data obtained in the step S12.
Step S13 is to train the spatio-temporal alignment network based on deformable convolution provided in the present invention with the coarse pseudo labels of the ViSha data set and the ViSha data set obtained in step S12, and the specific process is as follows:
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a shared feature encoder, and extracting shadow feature maps of different frames
Figure 332193DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature diagram extracted in S3-1 and the predicted optical flow in S3-2 into a recursive deformable convolution module for motion perception, taking the predicted optical flow as an initial value of position offset between adjacent frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position offset information, wherein a specific calculation formula is as follows:
Figure 63410DEST_PATH_IMAGE004
wherein
Figure 587933DEST_PATH_IMAGE005
Represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is set such that,GMA(•)which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a traditional convolutional neural network layer and a deformable convolutional layer;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 873421DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 876012DEST_PATH_IMAGE007
wherein
Figure 930555DEST_PATH_IMAGE008
Represents a standard convolution kernel radius ofrThe set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 891558DEST_PATH_IMAGE009
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 on the channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 398763DEST_PATH_IMAGE010
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 470624DEST_PATH_IMAGE011
wherein
Figure 379674DEST_PATH_IMAGE012
A pseudo-tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
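As referenced in step S3-3, the following is a minimal PyTorch-style sketch of the recursive offset refinement (S3-3) and the deformable-convolution alignment (S3-4). The channel sizes, the number of recursion steps, the offset-channel layout and the way the GMA flow is resized are assumptions of the sketch; GMA itself is an external optical flow network and is not implemented here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class MotionAwareAlign(nn.Module):
    """Sketch of S3-3/S3-4: refine the GMA optical flow into dense sampling
    offsets in a residual manner, then align the neighbouring frame feature
    with a deformable convolution."""
    def __init__(self, channels=256, kernel_size=3, steps=3):
        super().__init__()
        self.k = kernel_size
        self.steps = steps
        # predicts a residual offset field from the current feature,
        # the warped neighbour feature and the resized flow
        self.offset_head = nn.Conv2d(2 * channels + 2, 2 * kernel_size ** 2, 3, padding=1)
        # weight of the deformable convolution DCN(.)
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, feat_t, feat_tx, flow_t_tx):
        # initial offset = GMA flow resized to the feature resolution
        # (in practice the flow magnitudes would also be rescaled)
        flow = F.interpolate(flow_t_tx, size=feat_t.shape[-2:], mode='bilinear',
                             align_corners=False)
        offset = flow.repeat(1, self.k * self.k, 1, 1)  # broadcast to every kernel tap
        for _ in range(self.steps):
            warped = deform_conv2d(feat_tx, offset, self.weight, padding=self.k // 2)
            # residual refinement of the offsets (formula in S3-3)
            offset = offset + self.offset_head(torch.cat([feat_t, warped, flow], dim=1))
        # final alignment of the neighbouring feature (formula in S3-4)
        return deform_conv2d(feat_tx, offset, self.weight, padding=self.k // 2)

# S3-6: Binary Cross-Entropy supervision of the decoder output against the labels
bce_loss = nn.BCEWithLogitsLoss()
```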
Further, in step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo labels by the uncertainty estimation method MC-Dropout and is used to evaluate the accuracy of the pseudo labels so as to guide the semi-supervised learning; the MC-Dropout method performs 10 forward propagations through the network containing a random Dropout layer and takes the variance of the results as the uncertainty, calculated as:

$$Umap(x,y) = \frac{1}{10}\sum_{i=1}^{10}\big(O_i(x,y) - \mu_{x,y}\big)^2,$$

wherein $Umap(x,y)$ denotes the uncertainty at pixel $(x,y)$, $O_i(x,y)$ is the result of the $i$-th forward propagation at that pixel, and $\mu_{x,y}$ denotes the mean of the 10 forward propagation results.
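A minimal sketch of this MC-Dropout estimate is given below; `model` stands for the trained pseudo-label network, and putting the whole model into train mode (rather than only its Dropout layers) is a simplification of the sketch.

```python
import torch

def mc_dropout_uncertainty(model, frames, passes=10):
    """Run `passes` stochastic forward propagations with Dropout active and
    return the per-pixel mean and the variance used as the uncertainty map."""
    model.train()  # keeps Dropout stochastic; in practice only Dropout layers would be switched
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(frames)) for _ in range(passes)], dim=0)
    mu = preds.mean(dim=0)                  # mu_{x,y}: mean of the 10 passes
    umap = ((preds - mu) ** 2).mean(dim=0)  # per-pixel variance = Umap(x, y)
    return mu, umap
```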
Further, in step S3, the memory-transfer long time sequence network is trained with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data respectively; the specific process is as follows:
s31, respectively inputting two adjacent video frames into a feature encoder with shared weight and a GMA (light stream prediction) network to extract feature graphs of two adjacent video frames in different scales and light streams between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
s33, according to the position bias predicted in S32, performing memory transfer on the historical shadow information stored in the memory matrix by using deformable convolution, and selectively storing and updating the memory matrix according to the memory transfer result and the characteristics of the current video frame by using a gating mechanism;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, and inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder to predict video shadow detection results on different scales;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module and fusing the video shadow detection results to generate a final video shadow detection result;
s36 respectively arranging the labeled ViSha data setsD L And additional unlabeled datasetsD U Loss of upper usage supervisionL full And semi-supervised lossL semi For the supervised and semi-supervised training of the long time sequence network for memory transfer, the specific calculation formula is as follows:
$$L_{full} = l_{BCE}(O_L, G_L), \qquad L_{semi} = \frac{l_{BCE}\big(O_U, \tilde G_U\big)}{Umap + \epsilon},$$

wherein $l_{BCE}$ is the Binary Cross-Entropy loss, $O_L$ and $G_L$ are respectively the output of the memory-transfer long time sequence network on the labelled ViSha data set $D_L$ and the corresponding ground truth, $Umap$ is the uncertainty map calculated in step S2, $O_U$ and $\tilde G_U$ are respectively the network output on the unlabelled data and the generated pseudo label, and $\epsilon$ is a small constant used to prevent division by zero in the calculation.
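One way the two losses above could be written is sketched below, assuming the per-pixel Binary Cross-Entropy on unlabelled data is down-weighted by the uncertainty map; the exact weighting form is an assumption of this sketch.

```python
import torch.nn.functional as F

def supervised_loss(logits_l, gt_l):
    """L_full: Binary Cross-Entropy on the labelled ViSha frames."""
    return F.binary_cross_entropy_with_logits(logits_l, gt_l)

def semi_supervised_loss(logits_u, pseudo_u, umap, eps=1e-6):
    """L_semi: per-pixel BCE against the generated pseudo labels, divided by the
    uncertainty map so that confident pseudo labels dominate; eps prevents
    division by zero."""
    per_pixel = F.binary_cross_entropy_with_logits(logits_u, pseudo_u, reduction='none')
    return (per_pixel / (umap + eps)).mean()
```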
The invention has the advantages that:
1. The memory-transfer long time sequence network provided by the invention can, by using a memory mechanism, generate long-temporally consistent video shadow detection results at a small cost in time and computation.
2. The spatio-temporal alignment network based on deformable convolution provided by the invention can explicitly align the information between adjacent frames, so that the model can better integrate adjacent frames and thereby produce spatio-temporally consistent results.
3. The image-assisted video shadow detection pseudo label generation method provided by the invention can utilize the existing large-scale image shadow detection data set to generate a video shadow detection pseudo label with better robustness under the condition of a limited video shadow detection data set.
4. The semi-supervised video shadow detection method based on uncertainty guidance improves the generalization capability of the model by using additional pseudo labels, and meanwhile, the introduction of the uncertainty map enables the use of the pseudo labels to be more accurate.
Drawings
Fig. 1 is a schematic diagram of the long time sequence network for memory transfer according to an embodiment.
FIG. 2 is a schematic diagram of a spatio-temporal alignment network based on deformable convolution in an embodiment.
Fig. 3 is a schematic diagram of a semi-supervised video shadow detection method using a long-time-series network with memory transfer in an embodiment.
Detailed Description
For further understanding of the present invention, the objects, technical solutions and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the specific embodiments described here are illustrative only and do not limit the invention.
The spatio-temporal alignment network based on deformable convolution used for generating the video shadow detection pseudo labels comprises a weight-shared feature encoder, a motion-aware recursive deformable convolution module and a pixel-level shadow detection decoder module:
in the embodiment, a convolution network part of ResNeXt101 is used as the feature encoder, and four feature maps with the sizes of 1/4,1/8,1/16 and 1/16 are respectively provided;
the motion-aware recursive deformable convolution recursively predicts the positional bias between adjacent frames due to motion using the optical flow between adjacent frames as a guide and performs spatial-temporal alignment of the extracted features according to it. Specifically, three convolutional layers and three deformable convolutional layers are included. The convolutional layer is used for predicting the position bias existing between the current adjacent video frame characteristics, and the deformable convolutional layer is used for performing space-time alignment on the video frame characteristics according to the predicted position bias, wherein the deformable convolution comprises a conventional convolutional layer and a position bias prediction layer;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampled layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
In addition, to obtain a lightweight video shadow detection model with stronger generalisation capability for real-time detection, the generated video shadow detection pseudo labels are used. The memory-transfer long time sequence network for video shadow detection comprises a weight-shared feature encoder, a memory transfer module, two motion-aware recursive deformable convolution modules and a multi-scale video shadow detection decoder module:
the feature encoder shared by the weight in the network and the feature encoder of the space-time alignment network based on the deformable convolution provided by the invention adopt the same structure;
two motion-aware recursive deformable convolution modules are used to predict the position offsets of adjacent video frames at the 1/8 and 1/16 scales respectively and to align the information of adjacent frames with deformable convolution according to these offsets, taking computational resources and time cost into account. Specifically, the optical flow used in the motion-aware deformable convolution is the optical flow generated by the publicly available GMA pre-trained network, resized to the 1/8 and 1/16 scales. Furthermore, the two motion-aware deformable convolutions adopt the same structure;
the memory transfer module comprises a memory matrix with the size of a fixed size (25 × 256) for storing shadow information of all histories in the video shadow detection process, and dynamically transfers and updates the stored shadow information by using a deformable convolution and gating mechanism, wherein the memory matrix comprises a deformable convolution layer and three convolution layers. Wherein the deformable convolution uses the predicted position bias of the motion-aware recursive deformable convolution on the 1/16 scale to perform feature alignment on the historical features stored in the memory matrix;
the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and one Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the prediction results of the multiple scales are fused to obtain a final video shadow detection result. The random drop probability for a particular Dropout layer is set to 0.1.
Based on the same conception, the invention also provides a semi-supervised video shadow detection method of the long time sequence network transmitted by using the memory. The method comprises an image-assisted video shadow detection pseudo label generation method and an uncertainty-guided semi-supervised video shadow detection method. The image-assisted video shadow detection pseudo label generation method comprises the following steps:
step S11: an existing learning-based image shadow detection method BDRAR is trained by using an SBU image shadow detection data set, and the specific training strategy is consistent with that in the thesis.
Step S12: the ViSha data set and the additional unlabelled video data are divided into a number of consecutive video frames, and the divided video frames are processed frame by frame using the BDRAR model trained in step S11 to generate coarse video shadow detection pseudo labels.
Step S13: carrying out supervised training on the spatio-temporal alignment network based on deformable convolution provided by the invention by using the rough video shadow detection pseudo label of the ViSha data set obtained in the step S12 and the ViSha data set;
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a shared feature encoder, and extracting shadow feature maps of different frames
Figure 664211DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature map extracted in S3-1 and the predicted optical flow in S3-2 into a motion-aware recursive deformable convolution module, taking the predicted optical flow of adjacent video frames as an initial value of position bias between the adjacent video frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position bias information, wherein a specific calculation formula is as follows:
Figure 647720DEST_PATH_IMAGE018
wherein
Figure 121427DEST_PATH_IMAGE019
Represents the input oftThe frame of the video is a frame of video,
Figure 90520DEST_PATH_IMAGE020
represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is such that,GMA() Which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a conventional convolutional neural network layer and a deformable convolutional layer in the network;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 42295DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 46023DEST_PATH_IMAGE021
whereinP 0 For the coordinates of the center point of a single convolution operation performed on the feature map,
Figure 159473DEST_PATH_IMAGE022
in order to be able to predict the position offset,
Figure 615862DEST_PATH_IMAGE023
represents a standard convolution kernel radius of
Figure 371328DEST_PATH_IMAGE024
The set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 229563DEST_PATH_IMAGE025
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 779493DEST_PATH_IMAGE010
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 723178DEST_PATH_IMAGE026
wherein
Figure 16756DEST_PATH_IMAGE012
A pseudo tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
Step S14: and generating a video shadow detection pseudo label by using the space-time alignment network based on the deformable convolution trained in the step S13 and the rough pseudo label of the label-free video data generated in the step S12. Specifically, the following process is performed:
inputting every two original input images into an optical flow prediction network GMA to predict the optical flow between adjacent frames; the originally input adjacent 5 video frame images and the corresponding generated rough pseudo labels are cascaded and input into a trained space-time alignment network based on deformable convolution together with the predicted optical flow, and the video shadow detection pseudo labels with consistent space-time are output. And synthesizing the generated continuous pseudo labels according to the original frame rate to obtain the video shadow detection pseudo label in the video format.
The generated video shadow detection pseudo labels are utilized, and the semi-supervised video shadow detection method of the long-time sequence network based on memory transfer, which is provided by the invention, comprises the following steps:
step S1: the image-assisted video shadow detection pseudo tag generation method is used for generating video shadow detection pseudo tags for unmarked video data.
Step S2: calculating a pixel-level uncertainty map for the generated video shadow detection pseudo labels using the uncertainty estimation method MC-Dropout, wherein the MC-Dropout method performs 10 forward propagations through the network containing a random Dropout layer and takes the variance of the results as the uncertainty; the calculation process is:

$$Umap(x,y) = \frac{1}{10}\sum_{i=1}^{10}\big(O_i(x,y) - \mu_{x,y}\big)^2,$$

wherein $Umap(x,y)$ denotes the uncertainty at pixel $(x,y)$, $O_i(x,y)$ is the result of the $i$-th forward propagation at that pixel, and $\mu_{x,y}$ denotes the mean of the 10 forward propagation results.
And step S3: using the video shadow detection pseudo labels generated in step S1 and the uncertainty map generated in step S2 to train the memory-transfer long time sequence network with supervision on the labelled ViSha data set and with semi-supervision on the unlabelled video data respectively:
s31, respectively inputting two adjacent video frames into a feature encoder with shared weight and a GMA (light stream prediction) network to extract feature graphs of two adjacent video frames in different scales and light streams between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
and S33, performing memory transfer on the historical shadow information stored in the memory matrix by using deformable convolution according to the position bias predicted in S32, and selectively storing and updating the memory matrix according to the memory transfer result and the characteristics of the current video frame by using a gating mechanism. Initially, initializing the features in the memory matrix to be represented by the features of the first video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder, and predicting video shadow detection results on 1/8, 1/4, 1/2 and 1 scale of an original input image;
s35, reducing the video shadow detection results of different scales predicted in the S34 to the size of the original image through an up-sampling module and fusing the video shadow detection results, namely cascading the video shadow detection results and transforming the video shadow detection results through a convolution layer to generate final video shadow detection results;
s36 respectively arranging the labeled ViSha data setsD L And additional unlabeled datasetsD U Upper use has a supervision lossL full And semi-supervised lossL semi To train the long-time sequence network of memory transfer with supervision and semi-supervision (this fact)In the examples, a test set of a ViSha data set is used as an unmarked data set without manual marking), and the specific calculation formula is as follows:
$$L_{full} = l_{BCE}(O_L, G_L), \qquad L_{semi} = \frac{l_{BCE}\big(O_U, \tilde G_U\big)}{Umap + \epsilon},$$

wherein $l_{BCE}$ is the Binary Cross-Entropy loss, $O_L$ and $G_L$ are respectively the output of the memory-transfer long time sequence network on the labelled ViSha data set $D_L$ and the corresponding ground truth, $Umap$ is the uncertainty map calculated in step S2, $O_U$ and $\tilde G_U$ are respectively the network output on the unlabelled data and the generated pseudo label, and $\epsilon$ is a small constant used to prevent division by zero in the calculation.
And step S4: performing video shadow detection using the model trained in step S3, which specifically requires the following process:
the method comprises the following steps of carrying out segmentation processing on a video needing shadow detection to decompose the video into independent video frames, and simultaneously carrying out normalization operation on the video frames, wherein the calculation formula is as follows:
Figure 440489DEST_PATH_IMAGE032
whereinImageRepresenting independent video frames;
sequentially inputting two adjacent video frames into a long-time sequence network transmitted by a memory according to a time sequence to obtain a network detection result of a next frame, and updating and storing a memory matrix (when an input first video frame is processed, two identical first video frames are input into the network);
and classifying the network detection result according to a threshold value of 0.5 to obtain a binary shadow detection mask of 0-1, and restoring the video frame into video data to obtain a final video shadow detection result.
The embodiment provides a semi-supervised video shadow detection method using the memory-transfer long time sequence network, which uses the assistance of an existing annotated image shadow detection data set to generate high-quality video shadow detection pseudo labels as additional supervision, relieving the dependence of learning-based video shadow detection methods on labelled data; furthermore, by explicitly storing the historical information of the video, it efficiently generates video shadow detection results that remain consistent over long time sequences. The method addresses the poor robustness of current video shadow detection methods, their difficulty in generalising to practical applications and in maintaining long-range temporal consistency, and achieves robust shadow detection on video data.
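As a recap of the detection flow of step S4, a minimal sketch of the inference loop is given below; the two-frame calling convention of `model`, its internal memory-matrix update, and the normalisation constants are assumptions of the sketch.

```python
import torch

def detect_video_shadows(model, frames, threshold=0.5):
    """Feed adjacent frame pairs to the memory-transfer network in temporal
    order (the first frame is paired with itself) and binarise at 0.5."""
    masks = []
    prev = frames[0]
    with torch.no_grad():
        for cur in frames:
            x_prev = (prev / 255.0 - 0.5) / 0.5  # assumed normalisation to [-1, 1]
            x_cur = (cur / 255.0 - 0.5) / 0.5
            prob = torch.sigmoid(model(x_prev, x_cur))  # model updates its memory internally
            masks.append((prob > threshold).float())    # binary 0-1 shadow mask
            prev = cur
    return masks  # reassembled into a video at the original frame rate downstream
```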

Claims (10)

1. A long time sequence network for memory transfer, characterised in that: the network comprises a weight-shared feature encoder, a memory transfer module, a motion-aware recursive deformable convolution module and a multi-scale video shadow detection decoder module;
the weight sharing feature encoder is used for extracting shadow information of different video frames;
the memory transfer module stores and dynamically updates historical shadow information of the video sequence;
the motion-aware recursive deformable convolution module is used for predicting position bias caused by motion between adjacent frames and performing space-time alignment on video frame features extracted by the weight-shared feature encoder according to position bias information;
and the multi-scale video shadow detection decoder module performs video shadow detection according to the aligned video frame characteristics and the historical information stored in the memory transfer module by using a multi-scale mechanism.
2. The memory-passing long-timing network of claim 1, wherein: the memory transfer module uses a memory matrix with a fixed size to store shadow information of all histories in a video shadow detection process, and uses a deformable convolution and gating mechanism to dynamically transfer and update the stored shadow information, and the memory transfer module comprises a deformable convolution layer and three convolution layers.
3. The memory-passing long-timing network of claim 1, wherein: the multi-scale video shadow detection decoder module comprises three residual error layers, four upsampling layers, eight convolutional layers and one Dropout layer, video shadow detection results under different scales are respectively predicted by using a multi-scale mechanism, and the multi-scale prediction results are fused to obtain a final video shadow detection result.
4. The memory-passing long-timing network of claim 1, wherein: the motion-aware recursive deformable convolution module includes three convolutional layers for predicting a position offset existing between current neighboring video frame features and three deformable convolutional layers for spatio-temporal alignment of video frame features according to the predicted position offset, the deformable convolution includes a conventional convolutional layer and a position offset prediction layer.
5. A semi-supervised video shadow detection method of a long time sequence network based on memory transfer is characterized by comprising the following steps:
step S1: generating a video shadow detection pseudo label from the video data without the label;
step S2: estimating a pixel-level uncertainty image of the video shadow detection pseudo label generated in the step S1 by using an uncertainty estimation method to estimate the accuracy of the generated pseudo label;
and step S3: using the video shadow detection pseudo label generated in step S1 and the uncertain graph generated in step S2, respectively performing supervised and semi-supervised training on the labeled data set and the unlabelled video data by using the long-time-series network transferred by the memory according to any one of claims 1 to 4, specifically:
s31, respectively inputting two adjacent video frames into a feature encoder and an optical flow prediction network with shared weights to extract feature graphs of two adjacent video frames with different scales and optical flows between the feature graphs;
s32, inputting the feature map extracted in S31 and the optical flow into a position bias between prediction video frames in a motion perception recursive deformable convolution module, and aligning the extracted features by using the position bias;
s33, according to the position bias predicted in S32, utilizing deformable convolution to carry out memory transfer on historical shadow information stored in a memory matrix, using a gating mechanism, and selectively storing and updating the memory matrix according to the result of the memory transfer and the characteristics of the current video frame, wherein initially, the characteristics in the memory matrix are initialized to be represented by the characteristics of the first video frame;
s34, integrating the updated historical shadow information in S33 and the aligned video frame feature map in S32 on a channel dimension, inputting the integrated historical shadow information and the aligned video frame feature map into a multi-scale video shadow detection decoder, and predicting video shadow detection results on 1/8, 1/4, 1/2 and 1 scale of an original input image;
s35, reducing the video shadow detection results of different scales predicted in S34 to the size of the original image through an up-sampling module and fusing the original image, namely cascading the original image and transforming the original image through a convolution layer to generate a final video shadow detection result;
s36, training the long-time-sequence network of the memory transfer in a supervised and semi-supervised manner by using supervised loss and semi-supervised loss on the labeled data set and the additional unlabeled data set respectively;
and step S4: and (4) carrying out video shadow detection by using the model trained in the step (S3).
6. The semi-supervised video shadow detection method based on the memory transfer long-time network as recited in claim 5, wherein:
the shadow detection pseudo label is generated by adopting a space-time alignment network based on deformable convolution, and the network comprises a feature encoder shared by weight values, a recursive deformable convolution module for motion perception and a pixel-level shadow detection decoder module;
the feature encoder shared by the weight values is used for extracting features of the input video frame and corresponding rough shadow detection results;
the motion-aware recursive deformable convolution module recursively predicts position bias caused by motion between adjacent frames by taking an optical flow between the adjacent frames as a guide and performs space-time alignment on the extracted features according to the position bias;
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask;
the structures of the feature encoder shared by the weight and the recursive deformable convolution module of the motion perception are the same as the structures of the modules in the long-time sequence network transmitted by the memory.
7. The method according to claim 6, wherein the method comprises:
the pixel-level shadow detection decoder module restores the aligned features to the original size of the input video and predicts its shadow position mask, including three residual layers, four upsampling layers, three convolutional layers, and one Dropout layer, where the random drop probability of the Dropout layer is set to 0.1.
8. The method according to claim 6, wherein the method comprises: the shadow detection pseudo label generation method specifically comprises the following steps:
s11, training an existing single image shadow detection network by using an SBU data set;
s12: processing the ViSha data set and the additional unmarked video data frame by frame using the single image shadow detection network trained in step S11 to generate a coarse video shadow detection pseudo label;
s13: supervised training the spatio-temporal alignment network based on deformable convolution by using the rough pseudo labels of the ViSha data set and the ViSha data set obtained in the step S12; the space-time alignment network based on the deformable convolution is a multi-tower network with a plurality of inputs and one output, and refines an image result to a video result by integrating rough shadow detection results of adjacent multiple frames and video frame information;
s14: and (4) generating a video shadow detection pseudo label by using the network trained in the step (S13) and the rough pseudo label of the label-free video data obtained in the step (S12).
9. The method for semi-supervised video shadow detection in long-term memory transfer based networks as recited in claim 8, wherein:
step S13 is to train the deformable convolved spatio-temporal alignment network with the coarse pseudo labels of the ViSha data set and the ViSha data set obtained in step S12, and the specific process is as follows:
s3-1, cascading the rough pseudo labels of 5 adjacent ViSha data sets and the original shadow video frames of the ViSha data sets, inputting the rough pseudo labels and the original shadow video frames into a feature encoder with shared weight, and extracting shadow feature maps of different frames
Figure 605668DEST_PATH_IMAGE001
S3-2, inputting the 5 original shadow video frames of the ViSha data set used in S3-1 pairwise into the existing optical flow prediction network GMA, and respectively predicting the optical flow from the $(t-x)$-th frame to the $t$-th frame;
s3-3, inputting the feature diagram extracted in S3-1 and the predicted optical flow in S3-2 into a recursive deformable convolution module for motion perception, taking the predicted optical flow as an initial value of position offset between adjacent frames, and recursively refining the predicted optical flow in a residual manner to predict accurate position offset information, wherein a specific calculation formula is as follows:
Figure 466494DEST_PATH_IMAGE003
wherein
Figure 810888DEST_PATH_IMAGE004
Represents an input oftThe frame of the video is a frame of video,
Figure 574444DEST_PATH_IMAGE005
represents the predicted secondiObtained by a sub-recursion oftFrame and secondt-xThe position offset between the frames is set such that,GMA() Which represents the optical flow prediction network GMA used,Conv() AndDCN() Respectively representing a conventional convolutional neural network layer and a deformable convolutional layer in the network;
s3-4 separately pairs using deformable convolution with the predicted position bias in S3-3
Figure 457343DEST_PATH_IMAGE006
And carrying out feature alignment, wherein the calculation formula is as follows:
Figure 673561DEST_PATH_IMAGE007
whereinP 0 For the coordinates of the center point of a single convolution operation performed on the feature map,
Figure 821645DEST_PATH_IMAGE008
in order to be able to predict the position offset,
Figure 439708DEST_PATH_IMAGE009
represents a standard convolution kernel radius of
Figure 398568DEST_PATH_IMAGE010
The set of sampling grids of (a) is,ωP n ) Is composed ofP n The coefficients of the convolution kernel at the location,
Figure 102082DEST_PATH_IMAGE011
representing the feature map after the features are aligned;
s3-5, integrating the aligned feature maps obtained in the S3-4 in a channel dimension, inputting the feature maps into a pixel-level shadow detection decoder, reducing the feature maps to the same size as an original input image through an up-sampling layer, and generating a video shadow detection pseudo label;
s3-6 loss with Binary Cross-Encopy
Figure 53858DEST_PATH_IMAGE012
For supervised training of the space-time alignment network based on deformable convolution, a specific calculation formula is as follows:
Figure 792006DEST_PATH_IMAGE013
wherein
Figure 171035DEST_PATH_IMAGE014
A pseudo-tag result is detected for the generated video shadow,Gtags are detected for the actual artificially annotated video shadows.
10. The method for detecting the shadow of the semi-supervised video of the long-time-series network based on the memory transfer of claim 9, wherein:
in the step S2, a pixel-level uncertainty map is generated for the video shadow detection pseudo label by using an uncertainty estimation method MC-Dropout, and is used to evaluate the accuracy of the pseudo label so as to guide semi-supervised learning, wherein the MC-Dropout method calculates forward propagation results containing a random Dropout layer 10 times, and takes the variance of the results as the uncertainty, and the calculation process is as follows:
Figure 361845DEST_PATH_IMAGE015
wherein
Figure 851732DEST_PATH_IMAGE016
Represents a pixel (x,y) The degree of uncertainty in the position of the location,μ x, y() the mean of 10 forward propagation results is shown.
CN202211051584.6A 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method Active CN115147412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211051584.6A CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211051584.6A CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Publications (2)

Publication Number Publication Date
CN115147412A CN115147412A (en) 2022-10-04
CN115147412B true CN115147412B (en) 2022-12-16

Family

ID=83416542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211051584.6A Active CN115147412B (en) 2022-08-31 2022-08-31 Long time sequence network for memory transfer and video shadow detection method

Country Status (1)

Country Link
CN (1) CN115147412B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378775A (en) * 2021-06-29 2021-09-10 武汉大学 Video shadow detection and elimination method based on deep learning
CN113436115A (en) * 2021-07-30 2021-09-24 西安热工研究院有限公司 Image shadow detection method based on depth unsupervised learning
CN113628129A (en) * 2021-07-19 2021-11-09 武汉大学 Method for removing shadow of single image by edge attention based on semi-supervised learning
CN114220001A (en) * 2021-11-25 2022-03-22 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on double attention neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL196161A (en) * 2008-12-24 2015-03-31 Rafael Advanced Defense Sys Removal of shadows from images in a video signal
CN113538357B (en) * 2021-07-09 2022-10-25 同济大学 Shadow interference resistant road surface state online detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378775A (en) * 2021-06-29 2021-09-10 武汉大学 Video shadow detection and elimination method based on deep learning
CN113628129A (en) * 2021-07-19 2021-11-09 武汉大学 Method for removing shadow of single image by edge attention based on semi-supervised learning
CN113436115A (en) * 2021-07-30 2021-09-24 西安热工研究院有限公司 Image shadow detection method based on depth unsupervised learning
CN114220001A (en) * 2021-11-25 2022-03-22 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on double attention neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast Shadow Removal Using Adaptive Multi-Scale Illumination Transfer; Chunxia Xiao et al.; Wiley Online Library; 2013-08-20; 1-6 *
Shadow elimination method based on an improved Laplacian-of-Gaussian operator (基于改进拉普拉斯-高斯算子的阴影消除方法); Ma Yongjie et al.; Laser & Optoelectronics Progress; 2020-06-03; 1-5 *

Also Published As

Publication number Publication date
CN115147412A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
Chen et al. Scale-aware domain adaptive faster r-cnn
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
US8620026B2 (en) Video-based detection of multiple object types under varying poses
CN112149547B (en) Remote sensing image water body identification method based on image pyramid guidance and pixel pair matching
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN108053420B (en) Partition method based on finite space-time resolution class-independent attribute dynamic scene
CN106127197B (en) Image saliency target detection method and device based on saliency label sorting
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN112712052A (en) Method for detecting and identifying weak target in airport panoramic video
CN116453121B (en) Training method and device for lane line recognition model
CN114419323A (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN112801236A (en) Image recognition model migration method, device, equipment and storage medium
CN111339892B (en) Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
CN115410162A (en) Multi-target detection and tracking algorithm under complex urban road environment
CN115147412B (en) Long time sequence network for memory transfer and video shadow detection method
CN111476226B (en) Text positioning method and device and model training method
CN110909645B (en) Crowd counting method based on semi-supervised manifold embedding
CN114782827B (en) Object capture point acquisition method and device based on image
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN116363656A (en) Image recognition method and device containing multiple lines of text and computer equipment
CN112464989A (en) Closed loop detection method based on target detection network
Liu et al. MotionRFCN: Motion Segmentation using Consecutive Dense Depth Maps

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant