CN115953663A - Weak supervision shadow detection method using line marking - Google Patents

Weak supervision shadow detection method using line marking

Info

Publication number
CN115953663A
Authority
CN
China
Prior art keywords
shadow
prediction
network
detection
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211739474.9A
Other languages
Chinese (zh)
Inventor
周凯
邵艳利
方景龙
魏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211739474.9A priority Critical patent/CN115953663A/en
Publication of CN115953663A publication Critical patent/CN115953663A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised shadow detection method using line labeling. First, two existing benchmark datasets are re-labeled with lines, yielding S-SBU and S-ISTD. A Transformer-based shadow detection network is designed to capture significant contextual information interaction, and an edge-guided multi-task learning framework is proposed to generate intermediate and main predictions with rich structure; an edge-preserving fine shadow map is obtained by fusing these two complementary predictions. A feature-guided semantic perception loss is also introduced to overcome complex scene interference, enabling the model to use higher-level semantic information to perceive shadow and non-shadow regions. The method can learn high-quality shadow prediction maps from weak line-label supervision. Experimental results on three benchmark datasets show that the method achieves performance competitive with state-of-the-art fully supervised methods.

Description

Weak supervision shadow detection method using line marking
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a weakly supervised shadow detection method using line labeling.
Background
Shadows are common in natural images and video, and are formed by objects blocking light sources. Accurate localization of shadows can provide valuable cues about light direction, scene geometry, and camera position and parameters, facilitating various scene understanding tasks such as rough geometry estimation, three-dimensional scene reconstruction, and target detection and tracking. Shadow detection is therefore crucial in these computer vision tasks.
Early shadow detection methods constructed physical models or machine learning models mainly from manually designed shadow features; common manual features include color, texture, illumination, shape, and edges. However, these methods often struggle in complex shadow scenes because manual features have limited representation and discriminative power. With the deepening study of deep learning across visual tasks, convolutional neural networks (CNNs) have in recent years been widely used to build data-driven shadow detection models. They exhibit superior performance compared with earlier conventional methods and are currently the mainstream approach to shadow detection.
Mainstream approaches to improving performance typically employ two strategies: combining context information, or relying on large-scale training data with pixel-level labels. Existing large-scale datasets mainly comprise SBU, ISTD and CUHK-Shadow, where SBU and ISTD are the training datasets commonly used by deep models; ISTD has only 4 kinds of occluders and many shadows share a common background. These datasets were obtained by dense pixel-level labeling. However, pixel-level labeling is not only costly but also inefficient, which prevents mainstream approaches from further expanding their training data and results in poor generalization.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a weakly supervised shadow detection method using line labeling, which remarkably improves data annotation efficiency through rapid line labeling and designs an effective weakly supervised learning strategy to maintain shadow detection performance.
A weak supervision shadow detection method using line marking specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
To improve the efficiency of labeling shadow data, the invention seeks a fast labeling method. Fast labeling methods commonly used in weakly supervised learning include point labeling, box labeling, and line labeling, also called weak labels. Bounding boxes are suitable only for simple scenes with relatively concentrated shadow distributions, and point labeling is limited for shadow areas covering multiple textures. Compared with these, line labeling is more flexible and can adapt to various complex shadow scenes.
Step 1.1, marking rule
Line labeling rules are formulated according to the characteristics of shadows in complex scenes, comprising general rules and specific rules.
Step 1.2, annotating shadow data again
The SBU and ISTD shadow detection datasets are re-labeled, yielding S-SBU and S-ISTD.
Step 2, designing a Transformer-based shadow detection network
Under the local processing principle of CNNs, shadow detection networks built on CNNs have difficulty learning global and long-range semantic information interaction, which hinders recovering different shadow areas under sparse line supervision. Aiming at this problem, a refined shadow detection network is designed based on the latest vision Transformer architecture, which effectively propagates line label information to unmarked areas during training. The detection network comprises four modules: network backbone, main prediction, intermediate prediction, and edge detection.
Step 2.1, selecting a network backbone
The CSWin Transformer shows powerful global information interaction and long-range dependence modeling capability through its cross-shaped window self-attention scheme. Therefore, the CSWin Transformer is chosen as the network backbone for shadow detection.
Step 2.2, selecting the main prediction
To fully exploit the global representation, global context priors are extracted using a pyramid pooling module at the top of the backbone network. Firstly, high-level features are extracted from a backbone network, and then the high-level features are processed by a pyramid pooling module to obtain a refined feature map which is used as a main prediction of the network.
Step 2.3, select intermediate prediction
Complementary shadow information exists between different stages of the network, with lower-level features containing a large amount of shadow and non-shadow details, and higher-level features ignoring most of the non-shadow areas, but also omitting some of the shadow areas. Therefore, feature maps obtained in the middle three stages of the network are fused, and then the fused feature maps are used as intermediate predictions.
Step 2.4, determining edge detection
Because the greatest challenge of weakly supervised learning with line labeling is accurately detecting shadow boundaries, the invention employs edge detection to explicitly assist shadow structure perception. First, the lowest-level and highest-level features are fused to predict an edge map; then, the edge map is connected with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, the two prediction maps are connected to obtain the final predicted shadow mask (i.e., the output result).
Step 3, constructing weak supervised learning of structure perception
Although the designed detection network encourages label information to propagate to unmarked areas, it is difficult to infer shadow structure and details from line labels because of their sparseness. To predict high-quality shadow maps, the invention provides a structure-aware weakly supervised learning strategy, which uses a multi-task learning framework and a semantic perception loss to accurately locate shadow regions.
Step 3.1 edge-guided multitask learning
An edge-guided multi-task learning framework is constructed on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps.
Step 3.1.1, line supervision
During training, the intermediate, main, and final output predictions of the shadow detection network are supervised by line labels.
Step 3.1.2, edge detection
To highlight shadow structures, edge detection is used to explicitly assist in structure perception. In a specific implementation, the edge detection task is combined with the intermediate prediction and the main prediction to form a multi-task learning framework.
Step 3.2, feature-guided semantic perception learning
Although edge detection encourages the network to generate a structurally rich shadow map, it does not sufficiently constrain the recovery range of shadow regions, especially for shadows with blurred boundaries. To this end, a feature-guided semantic perception loss is proposed to accurately perceive shadow regions in complex scenes. The semantic perception loss is designed based on visual features and comprises a visual similarity loss and a visual difference loss; the visual similarity loss takes into account color correlation, illumination correlation and position correlation among pixels; the visual difference loss uses higher-level semantic information (i.e., salient features between pixels), which simulates the way humans recognize shadows.
Step 4: capture visually similar features in a complex environment by combining the semantic perception loss, and accurately locate shadow regions through visual differences.
The invention has the following beneficial effects:
1. A weakly supervised shadow detection method using line labeling is proposed for the first time, and two new datasets, S-SBU and S-ISTD, are introduced. Extensive experiments show that the proposed method can perform as well as recent fully supervised methods with only about 8% of the pixels labeled. Each shadow image takes only 8 seconds on average to label, reducing annotation time by a factor of about 12 compared with pixel-level labeling, which markedly improves the labeling efficiency of shadow data and relaxes the data-annotation requirements of training deep models.
2. To strengthen line supervision, a Transformer-based shadow detection network is designed to capture significant context information interaction and better promote label information propagation. An edge-guided multi-task learning framework is then developed on the shadow detection network, encouraging it to produce intermediate and main prediction maps with rich structure. By fusing these two complementary prediction maps, an edge-preserving fine shadow map is obtained.
3. To overcome interference from complex scenes, a feature-guided semantic perception loss is proposed to assist the multi-task learning. The semantic perception loss includes a visual similarity loss, which perceives shadow and non-shadow pixels through the visual affinity of pixel features, and a visual difference loss, which guides shadow boundary prediction through higher-level semantic relationships.
Drawings
FIG. 1 is a general flow diagram of a method for weakly supervised shadow detection with line labeling;
FIG. 2 is a schematic diagram illustrating a method for labeling a shadow image with lines according to an embodiment;
FIG. 3 shows the analysis and statistics of the line-labeled datasets in the embodiment, where a and b show labeling details of the two datasets and comparisons with existing pixel-level labels, and c shows the statistics of labeled pixels in the two datasets;
FIG. 4 is a diagram illustrating a shadow detection network according to an embodiment;
FIG. 5 is a comparison of structural aware weakly supervised learning visualization results in an embodiment;
FIG. 6 is a visual feature map obtained by combining semantic perception loss in an embodiment;
FIG. 7 is a graph comparing the results of qualitative analyses of different methods in the examples;
FIG. 8 is a graph comparing the results of ablation analysis for loss functions according to the examples.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1, the weakly supervised shadow detection method using line labeling takes a shadow image as input and predicts the shadow detection result (shadow mask) end-to-end. The method mainly comprises three parts: line labeling (step 1), the detection network (step 2), and structure-aware weakly supervised learning (step 3). In structure-aware weakly supervised learning, edge detection is combined with line supervision to build an edge-guided multi-task learning framework (step 3.1) and generate two complementary shadow predictions (the intermediate prediction and the main prediction). In addition, the feature-guided semantic perception loss is further applied to these predictions (step 3.2) to obtain high-quality shadow masks. The method specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
In view of the flexibility of line labeling, it is used to improve shadow data labeling efficiency.
Step 1.1, marking rule
As shown in fig. 2, a specific method for labeling a shadow image with lines is given. Aiming at the complexity of shadow scenes, several line labeling rules are summarized according to the characteristics of shadows, including:
(1) General rules:
① for a shadow image, mark shadow and non-shadow areas with at least two lines (fig. 2a);
② complex scenes may contain richer shadow information such as color, texture, and shape, so line labels should cover as many areas as possible (fig. 2b);
③ heterogeneous backgrounds tend to interfere with shadow detection, so cross-texture labeling is performed for shadow areas (or non-shadow areas) with different textures (fig. 2c);
(2) Specific rules:
① shadow-like areas have a color similar to shadow areas and are often falsely detected as shadows; to mitigate this ambiguity, explicit labels are given to shadow-like regions (fig. 2d);
② existing depth models are generally insensitive to soft shadows because these have wider penumbra regions, so line labels extend from the shadow region into the penumbra region (fig. 2e);
③ existing shadow detectors typically miss (or falsely detect) self-shadows and small shadow regions because they are not sufficiently salient, so these regions are explicitly labeled (fig. 2f).
Step 1.2, annotating shadow data again
The two commonly used shadow detection datasets (SBU and ISTD) are re-labeled and named S-SBU and S-ISTD; as shown in fig. 3, more labeling details of the two datasets are given. Since line labeling is very sparse, it takes only 8 seconds on average to label one shadow image. The invention also compares the line labels with the original pixel-level labels. It can be observed that: (1) the existing shadow detection datasets contain many noisy labels, and the pixel-level ground truth (GT) misses some important shadow areas, as indicated by the arrows in fig. 3a, b; however, the texture and illumination (or color) of these noisy regions differ from the labeled regions, and the invention explicitly labels them to enhance model training. (2) During labeling, the invention also pays attention to self-shadow, soft-shadow, small-shadow, and shadow-like areas, which the original pixel-level labeling typically ignores.
Furthermore, as shown in fig. 3c, statistics of the two line-labeled datasets give the proportion of labeled pixels over the whole dataset, where the abscissa is the percentage of labeled pixels and the ordinate is the number of images. Only about 10% (S-SBU) and 6% (S-ISTD) of the pixels in the line labels are marked as shadow or non-shadow. S-SBU has notably more labeled pixels than S-ISTD because its shadow scenes are more complex, so more areas need labeling.
Step 2, designing a Transformer-based shadow detection network
As shown in fig. 4, the detection network mainly comprises four modules: the network backbone, main prediction, intermediate prediction, and edge detection.
Step 2.1, network backbone
The present invention uses cross-shaped window Transformer blocks (CSWin Transformer Block, CSTB) (fig. 4b) to build a hierarchical structure as the network backbone, as shown in fig. 4a. The backbone performs downsampling with convolutional layers (3 × 3 convolution, stride 2) and extracts multi-scale feature maps from low to high levels, denoted $F_1$, $F_2$, $F_3$ and $F_4$ respectively. In addition, instead of directly outputting the feature maps, a 3 × 3 convolution block is used for feature transformation at each stage of the network. For an input image $X \in \mathbb{R}^{H \times W \times 3}$, a Convolutional Token Embedding (CTE) layer (7 × 7 convolution with stride 4) generates $\frac{H}{4} \times \frac{W}{4}$ patch tokens, each of dimension C. The feature map constructed at stage i therefore has $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ tokens, where i ∈ {1, 2, 3, 4}. In addition, benefiting from the high computational efficiency of the CSWin Transformer, the inference speed of the network reaches 178 FPS.
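To make the stage layout concrete, the following is a minimal PyTorch sketch of the four-stage backbone wiring. The `CSWinBlock` stand-in, channel widths, and class names are illustrative assumptions; only the CTE (7 × 7 conv, stride 4), the 3 × 3 stride-2 downsampling, the per-stage 3 × 3 feature transform, and the stage depths (1, 2, 21, 1) are taken from the text.

```python
import torch
import torch.nn as nn

class CSWinBlock(nn.Module):
    """Stand-in for the cross-shaped-window self-attention block (CSTB);
    the real attention computation is in the CSWin Transformer paper."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, 3, padding=1)
    def forward(self, x):
        return x + self.mix(x)

class BackboneSketch(nn.Module):
    """Four-stage hierarchical backbone wiring; channel widths are assumptions."""
    def __init__(self, dim=64, depths=(1, 2, 21, 1)):
        super().__init__()
        # Convolutional Token Embedding: 7x7 conv, stride 4 -> H/4 x W/4 tokens
        self.cte = nn.Conv2d(3, dim, 7, stride=4, padding=3)
        self.stages = nn.ModuleList()
        self.transforms = nn.ModuleList()   # per-stage 3x3 feature transform
        self.downs = nn.ModuleList()        # 3x3 stride-2 downsampling convs
        for i, d in enumerate(depths):
            c = dim * 2 ** i
            self.stages.append(nn.Sequential(*[CSWinBlock(c) for _ in range(d)]))
            self.transforms.append(nn.Conv2d(c, c, 3, padding=1))
            if i < len(depths) - 1:
                self.downs.append(nn.Conv2d(c, 2 * c, 3, stride=2, padding=1))

    def forward(self, x):                   # x: (B, 3, H, W)
        feats = []
        x = self.cte(x)                     # (B, C, H/4, W/4)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(self.transforms[i](x))   # F1..F4 at H/4 .. H/32
            if i < len(self.downs):
                x = self.downs[i](x)
        return feats                        # [F1, F2, F3, F4]
```

For a 416 × 416 input (the training resolution stated later), the four feature maps come out at 104, 52, 26, and 13 pixels per side.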
Step 2.2, main prediction
First, high-level features are extracted from the backbone network; the high-level features then pass through a Pyramid Pooling Module (PPM) to obtain a refined feature map, which finally serves as the main prediction of the network. Specifically, the PPM first downsamples the high-level feature map $F_4$ through four different pooling layers to produce four feature maps at different scales, which are then concatenated to obtain an effective global prior representation, as shown in fig. 4c.
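As an illustration of the global-prior extraction, here is a minimal PSPNet-style PPM sketch; the bin sizes (1, 2, 3, 6) and the channel arithmetic are assumptions, since the text only says "four different pooling layers".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch with assumed bin sizes (1, 2, 3, 6)."""
    def __init__(self, in_dim, out_dim, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_dim, in_dim // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.fuse = nn.Conv2d(in_dim * 2, out_dim, 3, padding=1)

    def forward(self, f4):                  # f4: high-level feature map F4
        h, w = f4.shape[-2:]
        pooled = [F.interpolate(b(f4), (h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        # Concatenate the original map with the four rescaled global priors.
        return self.fuse(torch.cat([f4] + pooled, dim=1))
```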
Step 2.3, intermediate prediction
The feature maps obtained in the last three stages of the network are first fused, and the fused feature map serves as the intermediate prediction. Specifically, these feature maps ($F_2$, $F_3$ and $F_4$) are merged using short connections and then fused to obtain the intermediate prediction.
Step 2.4, edge detection
The low-level features $F_1$ and high-level features $F_4$ are first fused to predict an edge map; then the edge map is connected with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, the two shadow maps are concatenated to obtain the final predicted shadow mask (i.e., the output result).
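The following hedged sketch shows one plausible wiring of the four outputs (edge, intermediate, main, final); the channel counts, upsampling, and 3 × 3 fusion convolutions are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShadowHeads(nn.Module):
    """Assumed wiring of edge, intermediate, main, and final predictions."""
    def __init__(self, dims=(64, 128, 256, 512)):
        super().__init__()
        self.edge_head = nn.Conv2d(dims[0] + dims[3], 1, 3, padding=1)  # F1 + F4
        self.inter_head = nn.Conv2d(sum(dims[1:]), 1, 3, padding=1)     # F2 + F3 + F4
        self.main_head = nn.Conv2d(dims[3], 1, 3, padding=1)            # PPM output in the real model
        self.fuse_inter = nn.Conv2d(2, 1, 3, padding=1)                 # edge + intermediate
        self.fuse_main = nn.Conv2d(2, 1, 3, padding=1)                  # edge + main
        self.out = nn.Conv2d(2, 1, 3, padding=1)                        # final shadow mask

    def forward(self, f1, f2, f3, f4):
        up = lambda t: F.interpolate(t, f1.shape[-2:], mode='bilinear',
                                     align_corners=False)
        edge = self.edge_head(torch.cat([f1, up(f4)], 1))
        inter = self.inter_head(torch.cat([up(f2), up(f3), up(f4)], 1))
        main = self.main_head(up(f4))
        s_int = self.fuse_inter(torch.cat([edge, inter], 1))
        s_mai = self.fuse_main(torch.cat([edge, main], 1))
        s_out = self.out(torch.cat([s_int, s_mai], 1))
        return edge, s_int, s_mai, s_out
```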
Step 3, constructing weak supervised learning of structure perception
To learn shadow detection from the line-labeled datasets, the invention provides a structure-aware weakly supervised learning strategy that uses a multi-task learning framework and a semantic perception loss to accurately locate shadow regions.
Step 3.1 edge-guided multitask learning
An edge-guided multi-task learning framework is built on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps.
Step 3.1.1, line supervision
During training, the intermediate, main, and final output predictions of the shadow detection network are supervised by line labels. Notably, pixel-level labels for fully supervised shadow detection use 1 for shadow and 0 for non-shadow; in the weakly supervised setting of the invention, the line labels use three supervisory signals: 1 for shadow, 2 for non-shadow, and 0 for unmarked pixels. Since most pixels under line labels are unmarked, a partial cross-entropy (PCE) loss is used to train the detection network:

$$\mathcal{L}_{pce}(s) = -\sum_{i \in \Omega_S} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

where $\Omega_S$ is the set of labeled pixels, and $y_i$ and $p_i$ are the true class and the prediction at pixel i, respectively. Combining the PCE losses over all supervised predictions gives the final PCE loss:

$$\mathcal{L}_{PCE} = \mathcal{L}_{pce}(s_{int}) + \mathcal{L}_{pce}(s_{mai}) + \mathcal{L}_{pce}(s_{out})$$

where $s_{int}$ is the intermediate prediction, $s_{mai}$ the main prediction, and $s_{out}$ the final output prediction.
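A minimal PyTorch sketch of this loss under the three-valued line labels follows; the tensor shapes, the mean normalization, and the helper name are assumptions.

```python
import torch.nn.functional as F

def partial_ce_loss(logits, line_label):
    """Partial cross-entropy over labeled pixels only.

    logits:     (B, 1, H, W) shadow logits.
    line_label: (B, H, W) with 0 = unmarked, 1 = shadow, 2 = non-shadow.
    """
    mask = line_label > 0                        # Omega_S: labeled pixels only
    target = (line_label == 1).float()           # shadow -> 1, non-shadow -> 0
    per_pixel = F.binary_cross_entropy_with_logits(
        logits.squeeze(1), target, reduction='none')
    return per_pixel[mask].mean()

# The final PCE loss sums this term over the three supervised outputs:
# L_PCE = pce(s_int) + pce(s_mai) + pce(s_out)
```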
Step 3.1.2, edge detection
To highlight shadow structures, the invention combines the edge detection task with the intermediate and main predictions to form a multi-task learning framework. In training, edge detection is supervised with a cross-entropy (CE) loss:

$$\mathcal{L}_{edge} = -\sum_{(r,c)} \left[ g_{(r,c)} \log e_{(r,c)} + (1 - g_{(r,c)}) \log (1 - e_{(r,c)}) \right]$$

where g is an edge map pre-computed by the recent edge detector EDTER and used as the edge-detection GT, e is the predicted edge map, and (r, c) denote the row and column coordinates of a pixel. Combining edge detection with line supervision yields an edge-preserving shadow map.
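Under the same conventions as the PCE sketch, the edge term is a dense binary cross-entropy against the pre-computed edge GT; a minimal sketch:

```python
import torch.nn.functional as F

def edge_ce_loss(edge_logits, edge_gt):
    # Dense cross-entropy between the predicted edge map e and the
    # EDTER-precomputed edge GT g, summed over all pixel positions (r, c).
    return F.binary_cross_entropy_with_logits(edge_logits, edge_gt,
                                              reduction='sum')
```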
Step 3.2, feature-guided semantic perception learning
Multi-task learning encourages the network to generate structurally rich shadow maps, but it does not sufficiently constrain the recovery range of shadow regions, particularly for shadows with fuzzy boundaries. As shown in fig. 5, the shadow in the first row has clear boundaries, and an accurate shadow prediction map is obtained with multi-task learning alone; the shadow boundaries in the second row are blurred and require further perception. To this end, a feature-guided semantic perception loss is proposed to help multi-task learning accurately perceive shadow boundaries in complex scenes. The semantic perception loss designed based on visual features includes a visual similarity loss and a visual difference loss.
Step 3.2.1, loss of visual similarity
The visual similarity loss takes into account color correlation, illumination correlation, and position correlation between pixels.
The color correlation between pixels is defined as:

$$R_c(i,j) = \exp\!\left(-\frac{\|C(i) - C(j)\|^2}{2\sigma_C^2}\right)$$

where C(i) is the color at pixel i, C(j) is the color at pixel j, and $\sigma_C$ is a hyper-parameter. Similarly, the illumination correlation $R_I$ and the position correlation $R_p$ are defined as:

$$R_I(i,j) = \exp\!\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma_I^2}\right), \qquad R_p(i,j) = \exp\!\left(-\frac{\|L(i) - L(j)\|^2}{2\sigma_L^2}\right)$$

where I(i) and L(i) are the illumination and position at pixel i, and $\sigma_I$ and $\sigma_L$ are hyper-parameters. The visual feature similarity is then

$$R(i,j) = R_c(i,j)\, R_I(i,j)\, R_p(i,j),$$

which aims to make visually similar pixels tend toward similar predictions. The visual similarity loss is therefore defined as:

$$\mathcal{L}_{vs} = \sum_{i} \sum_{j \in D_i} R(i,j)\, G(i,j)$$

where $D_i$ is the d × d neighborhood centered on pixel i and $G(i,j) = 1 - p_i p_j - (1 - p_i)(1 - p_j)$.
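A sketch of this loss, assuming the three affinities combine multiplicatively over d × d neighborhoods as reconstructed above; the σ defaults are illustrative, position features are taken as raw pixel coordinates, and `pred` is assumed to hold post-sigmoid probabilities.

```python
import torch
import torch.nn.functional as F

def visual_similarity_loss(pred, color, illum, d=5,
                           sigma_c=0.1, sigma_i=0.1, sigma_l=6.0):
    """Visual similarity loss over d x d neighborhoods D_i.

    pred:  (B, 1, H, W) shadow probabilities p.
    color: (B, 3, H, W) per-pixel color features C.
    illum: (B, 1, H, W) per-pixel illumination features I.
    """
    B, _, H, W = pred.shape

    def affinity(feat, sigma):
        # Gaussian affinity between each pixel and its d*d neighbors.
        n = F.unfold(feat, d, padding=d // 2)                # (B, C*d*d, H*W)
        n = n.view(B, feat.shape[1], d * d, H * W)
        c = feat.flatten(2).unsqueeze(2)                     # (B, C, 1, H*W)
        return torch.exp(-((n - c) ** 2).sum(1) / (2 * sigma ** 2))

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pos = torch.stack([ys, xs]).float().expand(B, 2, H, W).contiguous()

    R = affinity(color, sigma_c) * affinity(illum, sigma_i) * affinity(pos, sigma_l)
    p_n = F.unfold(pred, d, padding=d // 2).view(B, d * d, H * W)   # p_j
    p_c = pred.flatten(2)                                           # p_i
    G = 1 - p_c * p_n - (1 - p_c) * (1 - p_n)                       # G(i, j)
    return (R * G).mean()
```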
Step 3.2.2, loss of visual difference
First, the feature map $F^m \in \mathbb{R}^{C \times H \times W}$ before the main prediction P is extracted; the significance of each feature channel is then determined through the covariance $S_i = \mathrm{cov}(F_i^m, P)$, where $i \in \{1, \ldots, C\}$ indexes the feature channels. The feature maps of the N most significant channels are taken as salient features $F^s$; the feature correlation between pixels is then computed on these salient features, and the visual feature saliency is derived from it. The visual difference loss $\mathcal{L}_{vd}$ is defined over the effective edge region $E_k$, where $\lambda_{vd}$ is a hyper-parameter that increases as the number of iterations increases during training.
In summary, the final semantic perception loss can be written as:

$$\mathcal{L}_{sem} = \mathcal{L}_{vs} + \mathcal{L}_{vd}$$
as shown in fig. 6, in combination with semantic perception loss, visually similar features can be captured in a complex environment, and the shadow region can be accurately located by visual difference (semantic features).
Step 3.3, objective function
The overall loss combines the multi-task learning and the semantic perception learning described above. The final objective function is therefore defined as:

$$\mathcal{L} = \beta_1 \mathcal{L}_{PCE} + \beta_2 \mathcal{L}_{edge} + \beta_3 \mathcal{L}_{sem}$$

where $\beta_1$, $\beta_2$ and $\beta_3$ are hyper-parameters. During training, $\mathcal{L}_{PCE}$ propagates line label information to unannotated regions, $\mathcal{L}_{edge}$ forces the predicted shadow map to remain aligned with shadow boundaries, and $\mathcal{L}_{sem}$ uses semantic information to perceive shadow and non-shadow regions. Combining $\mathcal{L}_{vs}$ and $\mathcal{L}_{vd}$ prevents labeled shadow (non-shadow) pixels from propagating into non-shadow (shadow) regions.
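Assembled in code, the objective might read as follows; the grouping of the semantic terms and the schedule of $\lambda_{vd}$ follow the reconstruction above and are assumptions.

```python
def semantic_loss(l_vs, l_vd, lam_vd):
    # L_sem = L_vs + lambda_vd * L_vd; lambda_vd is scheduled to grow
    # as training iterations increase (exact schedule not disclosed).
    return l_vs + lam_vd * l_vd

def objective(l_pce, l_edge, l_sem, betas=(0.4, 0.3, 0.3)):
    # L = beta1 * L_PCE + beta2 * L_edge + beta3 * L_sem, with the betas
    # set to (0.4, 0.3, 0.3) as stated in the implementation details.
    b1, b2, b3 = betas
    return b1 * l_pce + b2 * l_edge + b3 * l_sem
```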
The proposed method is evaluated on three commonly used shadow detection datasets (SBU, ISTD and UCF). (1) The SBU dataset is the largest pixel-level annotated dataset, containing 4089 training images and 638 test images. (2) The ISTD dataset was created for shadow detection and removal, containing 1330 training images and 540 test images. (3) The shadow images of UCF are similar to SBU's, with 135 training images and 110 test images. Models are first trained on the line-labeled datasets S-SBU and S-ISTD, and then tested on the SBU, ISTD and UCF test sets. Note that testing on UCF is used to verify the generalization capability of the model.
Following state-of-the-art (SOTA) methods, the widely used balanced error rate (BER) is adopted to evaluate the proposed method:

$$BER = \left(1 - \frac{1}{2}\left(\frac{TP}{N_p} + \frac{TN}{N_n}\right)\right) \times 100$$

where TP, TN, $N_p$ and $N_n$ denote the numbers of correctly predicted shadow pixels, correctly predicted non-shadow pixels, shadow pixels, and non-shadow pixels, respectively. In the evaluation, a smaller BER value indicates better model performance.
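For reference, a direct NumPy implementation of this metric (array names are assumptions):

```python
import numpy as np

def balanced_error_rate(pred, gt, thresh=0.5):
    """BER = (1 - 0.5 * (TP / N_p + TN / N_n)) * 100.

    pred: predicted shadow probabilities; gt: boolean shadow mask.
    """
    p = pred >= thresh
    tp = np.logical_and(p, gt).sum()       # correctly predicted shadow pixels
    tn = np.logical_and(~p, ~gt).sum()     # correctly predicted non-shadow pixels
    n_p, n_n = gt.sum(), (~gt).sum()       # shadow / non-shadow pixel counts
    return (1 - 0.5 * (tp / n_p + tn / n_n)) * 100
```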
The shadow detection model proposed in the invention is implemented with PyTorch 1.12.1 and Python 3.6.12, and the detection network is trained on a single NVIDIA RTX 3090 GPU with 24 GB of memory. The numbers of CSTBs at the four stages of the backbone network are set to 1, 2, 21 and 1, respectively. A fully connected conditional random field (CRF) is used to further refine the network prediction results. All input images are uniformly resized to 416 × 416 and the patch size is set to 4 × 4. The hyper-parameters of the objective function are set to $\beta_1 = 0.4$, $\beta_2 = 0.3$ and $\beta_3 = 0.3$. For the training data augmentation strategy, strong augmentation uses color jittering and blurring, and weak augmentation uses random horizontal flipping. In the training phase, the backbone network is first pre-trained on ImageNet to generate its initialization parameters, and the other convolutional layers adopt random initialization. The network is optimized with stochastic gradient descent (SGD), with momentum and weight decay set to 0.9 and 1e-4, respectively. The number of iterations on both the S-SBU and S-ISTD datasets is set to 40, the learning rate is 1e-4, and the training batch size is 4.
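The stated optimizer settings translate directly into a sketch; `model` denotes the detection network and is assumed to be defined elsewhere.

```python
import torch

def build_optimizer(model):
    # SGD with the stated settings: learning rate 1e-4, momentum 0.9,
    # weight decay 1e-4.
    return torch.optim.SGD(model.parameters(), lr=1e-4,
                           momentum=0.9, weight_decay=1e-4)
```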
The proposed method is compared with 9 SOTA methods: ScGAN, DSC, A+D Net, DC-DSPF, BDRAR, DSDNet, MTMT-Net, FDRNet and SDCM. MTMT-Net is a semi-supervised model trained with extra unlabeled data; the other methods are all fully supervised models.
Table 1 gives the quantitative comparison of all methods on the three datasets, where "S" and "NS" denote the pixel error rates of the shadow and non-shadow areas, respectively. The model is designed on the CSWin Transformer, and its parameter count is smaller than that of most CNN models. In terms of performance, the method obtains results comparable to recent fully supervised (or semi-supervised) methods, mainly for three reasons: (1) existing pixel-level labeled data contain many noisy labels, so models with high learning capacity overfit the label noise, limiting their performance; the method instead uses line labels and attends to more shadow cases (e.g., self-shadows, soft shadows, and small shadows) that pixel-level labeling typically ignores. (2) Compared with existing CNN-based detection networks, the Transformer-based network propagates label information better. (3) The method adopts the structure-aware weakly supervised learning strategy, which can well infer shadow structure and details under line supervision. In addition, the method achieves better evaluation results on ISTD than on SBU: ISTD mainly contains hard shadows, whose structures and boundaries are clearer than SBU's, which better suits the proposed weakly supervised learning strategy. The method thus achieves performance comparable to the latest methods (FDRNet and SDCM) on ISTD.
Table 1: quantitative comparison of all methods on the SBU, ISTD and UCF test sets.
Some qualitative comparison results are further given in fig. 7; note that in rows 1 and 7 the GT is a noisy label. The experimental results show that the method performs on par with the latest fully supervised (or semi-supervised) SOTA methods and can effectively detect different types of shadows in different scenes. For example, it can effectively locate small shadows (rows 1 and 6), soft shadows (row 4), and self-shadows (rows 5 and 7) in an image. For ambiguous cases, i.e., shadow-like non-shadow regions (row 2) and shadow regions with non-shadow appearance (row 3), the method can still identify them. Moreover, in some cases the method performs even better than existing methods, e.g., it accurately detects the ambiguous shadows of row 3 and the self-shadows of row 5. The robustness shown across various scenes demonstrates that the proposed labeling method and weakly supervised learning strategy are effective for shadow detection.
To verify the validity of the proposed network design and loss functions, ablation studies were conducted on each separately.
(1) Verifying the validity of the components of the proposed detection network
To verify the validity of the various components of the network, the proposed network is compared with three variants:
Basic: deleting the intermediate prediction (IP) module and the PPM module to obtain a basic model;
Basic + PPM: adding the PPM module to "Basic";
Basic + IP: adding the IP module to "Basic".
Table 2 shows the results of the ablation study, from which the following can be observed: (1) benefiting from the latest Transformer structure, all four models achieve excellent performance on the two benchmark datasets; (2) the BER of "Basic + PPM" is lower than that of "Basic", showing that the global context prior extracted by the PPM module is effective for shadow detection; (3) "Basic + IP" outperforms "Basic", showing that the intermediate prediction provides more shadow and non-shadow details and thus improves the quality of the final prediction; (4) the method combines IP and PPM and obtains the best performance.
Table 2: ablation study of the detection network components on the two benchmark datasets.
(2) Verifying the validity of the loss functions
In the weakly supervised learning, the objective function of the method consists of three loss functions; the detailed ablation study of the loss functions is shown in Table 3. Note that when the edge detection loss is not used in an ablation, the edge detection module is deleted entirely. It can be observed that training the detection network with only the partial cross-entropy loss $\mathcal{L}_{PCE}$ yields the worst BER value. When $\mathcal{L}_{PCE}$ is combined with the edge detection loss $\mathcal{L}_{edge}$ or the semantic perception loss $\mathcal{L}_{sem}$, performance increases significantly, since $\mathcal{L}_{edge}$ encourages the network to generate shadow maps with rich structure and $\mathcal{L}_{sem}$ focuses the network on distinguishing shadow from non-shadow. The method therefore combines all three to achieve optimal performance. In addition, $\mathcal{L}_{PCE} + \mathcal{L}_{edge}$ performs better than $\mathcal{L}_{PCE} + \mathcal{L}_{sem}$, which suggests that the explicit edge constraint (edge detection) is more effective for weakly supervised learning with line labels. The ablation analysis results are further visualized in fig. 8.
Table 3: ablation study of the loss functions.

Claims (5)

1. A weak supervision shadow detection method using line marking is characterized in that: the method specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
Step 1.1, marking rule
Formulating a line marking rule according to the characteristics of the shadow in the complex scene;
step 1.2, annotating shadow data again
Re-labeling the SBU and the ISTD shadow detection data sets, namely S-SBU and S-ISTD;
step 2, designing a transform-based shadow detection network
The detection network comprises four modules: network backbone, main prediction, intermediate prediction and edge detection;
step 2.1, select the network backbone
Selecting a CSWin Transformer as the network backbone for shadow detection;
step 2.2, select Primary prediction
In order to fully utilize the global representation, a pyramid pooling module is used at the top of the backbone network to extract the global context prior; firstly, extracting high-level features from a backbone network, then obtaining a refined feature map by passing the high-level features through a pyramid pooling module, and taking the refined feature map as a main prediction of the network;
step 2.3, selecting intermediate prediction
Fusing the feature maps obtained in the middle three stages of the network, and then taking the fused feature maps as middle prediction;
step 2.4, determining edge detection
Adopting edge detection to explicitly assist shadow structure perception; specifically: firstly, fusing the lowest-level features and the highest-level features to predict an edge map; then connecting the edge map with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, connecting the two prediction maps to obtain the final predicted shadow mask, i.e., the output result;
step 3, constructing a structure perception weak supervision learning method
Step 3.1, constructing an edge-guided multi-task learning framework
Constructing an edge-guided multi-task learning framework based on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps;
step 3.1.1, online supervision
In the training process, the intermediate prediction, the main prediction and the final output prediction of the shadow detection network are supervised by a line label;
step 3.1.2, edge detection
To highlight shadow structures, structure perception is explicitly aided using edge detection; in specific execution, combining an edge detection task with intermediate prediction and main prediction to form a multi-task learning framework;
step 3.2, feature-oriented semantic perception learning
A feature-oriented semantic perception loss is proposed to accurately perceive shadow regions from complex scenes; semantic perception loss is designed based on visual features, which include visual similarity loss and visual difference loss; the visual similarity loss takes into account color correlation, illumination correlation and position correlation among pixels; the visual difference loss is specifically to solve the problem by using higher semantic information, which simulates the way human recognizes shadows;
and 4, step 4: and capturing visually similar features in a complex environment by combining semantic perception loss, and accurately positioning a shadow region through visual difference.
2. The method of claim 1, wherein the method comprises the following steps: the line marking rule is formulated according to the characteristics of the shadow in the complex scene, and comprises a general rule and a specific rule;
the general rule is as follows:
(1) for a shadow image, marking shadow and non-shadow areas by using at least two lines;
(2) for complex scenes, the line marking should cover as many areas as possible;
(3) carrying out cross-texture labeling on a shadow region or a non-shadow region with different textures;
the specific rule is as follows:
(1) giving explicit labels to the shadow-like regions;
(2) for soft shadows, the line labels extend from the shadow region to the penumbra region;
(3) for self-shadow and small shadow regions, explicit notation is used.
3. The method for weak supervised shadow detection with line labeling as recited in claim 1, wherein: a CSWin Transformer is selected as the network backbone for shadow detection; the network backbone specifically comprises: the backbone network performs downsampling using 3 × 3 convolutional layers with stride 2 and extracts multi-scale feature maps from low to high levels, denoted $F_1$, $F_2$, $F_3$ and $F_4$ respectively; a 3 × 3 convolution block is used for feature transformation at each stage of the network; for an input image $X \in \mathbb{R}^{H \times W \times 3}$, a convolutional token embedding layer (7 × 7 convolution with stride 4) generates $\frac{H}{4} \times \frac{W}{4}$ tokens whose dimension is denoted C, so that the feature map constructed at stage i has $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ tokens, where i ∈ {1, 2, 3, 4}.
4. The method of claim 1, wherein the visual similarity loss takes color correlation, illumination correlation and position correlation among pixels into consideration; the color correlation $R_c$ between pixels is specifically:

$$R_c(i,j) = \exp\!\left(-\frac{\|C(i) - C(j)\|^2}{2\sigma_C^2}\right)$$

where C(i) is the color at pixel i, C(j) is the color at pixel j, and $\sigma_C$ is a hyper-parameter;
the illumination correlation $R_I$ and the position correlation $R_p$ are defined as:

$$R_I(i,j) = \exp\!\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma_I^2}\right), \qquad R_p(i,j) = \exp\!\left(-\frac{\|L(i) - L(j)\|^2}{2\sigma_L^2}\right)$$

where I(i) and L(i) are the illumination and position at pixel i, and $\sigma_I$ and $\sigma_L$ are hyper-parameters; the visual feature similarity is $R(i,j) = R_c(i,j)\, R_I(i,j)\, R_p(i,j)$, which aims to make similar pixels tend toward similar predictions; the visual similarity loss is therefore defined as:

$$\mathcal{L}_{vs} = \sum_{i} \sum_{j \in D_i} R(i,j)\, G(i,j)$$

where $D_i$ is the d × d neighborhood centered on pixel i, $G(i,j) = 1 - p_i p_j - (1 - p_i)(1 - p_j)$, and $p_i$, $p_j$ denote the predictions at pixels i and j.
5. The method of claim 1, wherein the visual difference loss is constructed as follows: first, the feature map $F^m \in \mathbb{R}^{C \times H \times W}$ before the main prediction P is extracted; the significance of each feature channel is then determined through the covariance $S_i = \mathrm{cov}(F_i^m, P)$, where $i \in \{1, \ldots, C\}$ indexes the feature channels; the feature maps of the N most significant channels are taken as salient features $F^s$; the feature correlation is computed on these salient features, from which the visual feature saliency is further obtained; the visual difference loss is then defined over the effective edge region $E_k$, where $\lambda_{vd}$ is a hyper-parameter that increases as the number of iterations increases during training.
CN202211739474.9A 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking Pending CN115953663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211739474.9A CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211739474.9A CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Publications (1)

Publication Number Publication Date
CN115953663A true CN115953663A (en) 2023-04-11

Family

ID=87289221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211739474.9A Pending CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Country Status (1)

Country Link
CN (1) CN115953663A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117575976A (en) * 2024-01-12 2024-02-20 腾讯科技(深圳)有限公司 Image shadow processing method, device, equipment and storage medium
CN117575976B (en) * 2024-01-12 2024-04-19 腾讯科技(深圳)有限公司 Image shadow processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination