CN115953663A - Weak supervision shadow detection method using line marking - Google Patents

Weak supervision shadow detection method using line marking

Info

Publication number
CN115953663A
Authority
CN
China
Prior art keywords
shadow
prediction
network
detection
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211739474.9A
Other languages
Chinese (zh)
Inventor
周凯
邵艳利
方景龙
魏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211739474.9A priority Critical patent/CN115953663A/en
Publication of CN115953663A publication Critical patent/CN115953663A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised shadow detection method using line labeling. First, two existing benchmark datasets are re-labeled with lines, yielding S-SBU and S-ISTD. A Transformer-based shadow detection network is designed to capture significant contextual information interaction, and an edge-guided multi-task learning framework is proposed to generate intermediate and main predictions with rich structure; an edge-preserving fine shadow map is obtained by fusing these two complementary predictions. A feature-guided semantic perception loss is also introduced to overcome complex scene interference, enabling the model to use higher-level semantic information to perceive shadow and non-shadow regions. The method can learn high-quality shadow prediction maps from weak line-label supervision. Experimental results on three benchmark datasets show that the method achieves performance competitive with state-of-the-art fully supervised methods.

Description

Weak supervision shadow detection method using line marking
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a weakly supervised shadow detection method using line labeling.
Background
Shadows are common in natural images and video, and are formed by objects blocking light sources. Accurate localization of shadows can provide valuable cues about light direction, scene geometry, and camera position and parameters, facilitating various scene understanding tasks such as rough geometry estimation, three-dimensional scene reconstruction, and target detection and tracking. Shadow detection is therefore crucial in these computer vision tasks.
Early shadow detection methods constructed physical models or machine learning models mainly from manually designed shadow features; common manual features include color, texture, illumination, shape, and edges. However, these methods often struggle in complex shadow scenes because manual features have limited representation and discriminative power. With the deepening study of deep learning across visual tasks, convolutional neural networks (CNNs) have in recent years been widely used to build data-driven shadow detection models. They exhibit superior performance compared with earlier conventional methods and are currently the mainstream approach to shadow detection.
Mainstream approaches to improving performance typically employ two strategies: combining context information, or relying on large-scale training data with pixel-level labels. Existing large-scale datasets mainly comprise SBU, ISTD and CUHK-Shadow, where SBU and ISTD are the training datasets commonly used by deep models; ISTD has only 4 kinds of occluders and many shadows share a common background. These datasets were obtained by dense pixel-level labeling. However, pixel-level labeling is not only costly but also inefficient, which prevents mainstream approaches from further expanding their training data and results in poor generalization.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a weakly supervised shadow detection method using line labeling, which remarkably improves data annotation efficiency through rapid line labeling and designs an effective weakly supervised learning strategy to maintain shadow detection performance.
A weak supervision shadow detection method using line marking specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
To improve the efficiency of labeling shadow data, the invention seeks a fast labeling method. Fast labeling methods commonly used in weakly supervised learning include point labeling, box labeling, and line labeling, also called weak labels. Bounding boxes are suitable only for simple scenes with relatively concentrated shadow distributions, and point labeling is limited for shadow areas covering multiple textures. Compared with these, line labeling is more flexible and can adapt to various complex shadow scenes.
Step 1.1, marking rule
Line labeling rules are formulated according to the characteristics of shadows in complex scenes, comprising general rules and specific rules.
Step 1.2, annotating shadow data again
The SBU and ISTD shadow detection datasets are re-labeled, yielding S-SBU and S-ISTD.
Step 2, designing a Transformer-based shadow detection network
Under the local processing principle of CNNs, shadow detection networks built on CNNs have difficulty learning global and long-range semantic information interaction, which hinders recovering different shadow areas under sparse line supervision. Aiming at this problem, a refined shadow detection network is designed based on the latest vision Transformer architecture, which effectively propagates line label information to unmarked areas during training. The detection network comprises four modules: network backbone, main prediction, intermediate prediction, and edge detection.
Step 2.1, selecting a network backbone
The CSWin Transformer shows powerful global information interaction and long-range dependence modeling capability through its cross-shaped window self-attention scheme. Therefore, the CSWin Transformer is chosen as the network backbone for shadow detection.
Step 2.2, selecting the main prediction
To fully exploit the global representation, global context priors are extracted using a pyramid pooling module at the top of the backbone network. Firstly, high-level features are extracted from a backbone network, and then the high-level features are processed by a pyramid pooling module to obtain a refined feature map which is used as a main prediction of the network.
Step 2.3, select intermediate prediction
Complementary shadow information exists between different stages of the network, with lower-level features containing a large amount of shadow and non-shadow details, and higher-level features ignoring most of the non-shadow areas, but also omitting some of the shadow areas. Therefore, feature maps obtained in the middle three stages of the network are fused, and then the fused feature maps are used as intermediate predictions.
Step 2.4, determining edge detection
Because the greatest challenge of weakly supervised learning with line labeling is accurately detecting shadow boundaries, the invention employs edge detection to explicitly assist shadow structure perception. First, the lowest-level and highest-level features are fused to predict an edge map; then, the edge map is connected with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, the two prediction maps are connected to obtain the final predicted shadow mask (i.e., the output result).
Step 3, constructing weak supervised learning of structure perception
Although the designed detection network encourages label information to propagate to unmarked areas, it is difficult to infer shadow structure and details from line labels because of their sparseness. To predict high-quality shadow maps, the invention provides a structure-aware weakly supervised learning strategy, which uses a multi-task learning framework and a semantic perception loss to accurately locate shadow regions.
Step 3.1 edge-guided multitask learning
An edge-guided multi-task learning framework is constructed on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps.
Step 3.1.1, line supervision
During training, the intermediate, main, and final output predictions of the shadow detection network are supervised by line labels.
Step 3.1.2, edge detection
To highlight shadow structures, edge detection is used to explicitly assist in structure perception. In a specific implementation, the edge detection task is combined with the intermediate prediction and the main prediction to form a multi-task learning framework.
Step 3.2, feature-guided semantic perception learning
Although edge detection encourages the network to generate a structurally rich shadow map, it does not sufficiently constrain the recovery range of shadow regions, especially for shadows with blurred boundaries. To this end, a feature-guided semantic perception loss is proposed to accurately perceive shadow regions in complex scenes. The semantic perception loss is designed based on visual features and comprises a visual similarity loss and a visual difference loss; the visual similarity loss takes into account color correlation, illumination correlation and position correlation among pixels; the visual difference loss uses higher-level semantic information (i.e., salient features between pixels), which simulates the way humans recognize shadows.
Step 4: capture visually similar features in a complex environment by combining the semantic perception loss, and accurately locate shadow regions through visual differences.
The invention has the following beneficial effects:
1. A weakly supervised shadow detection method using line labeling is proposed for the first time, and two new datasets, S-SBU and S-ISTD, are introduced. Extensive experiments show that the proposed method can perform as well as recent fully supervised methods with only about 8% of the pixels labeled. Each shadow image takes only 8 seconds on average to label, reducing annotation time by a factor of about 12 compared with pixel-level labeling, which markedly improves the labeling efficiency of shadow data and relaxes the data-annotation requirements of training deep models.
2. To strengthen line supervision, a Transformer-based shadow detection network is designed to capture significant context information interaction and better promote label information propagation. An edge-guided multi-task learning framework is then developed on the shadow detection network, encouraging it to produce intermediate and main prediction maps with rich structure. By fusing these two complementary prediction maps, an edge-preserving fine shadow map is obtained.
3. To overcome interference from complex scenes, a feature-guided semantic perception loss is proposed to assist the multi-task learning. The semantic perception loss includes a visual similarity loss, which perceives shadow and non-shadow pixels through the visual affinity of pixel features, and a visual difference loss, which guides shadow boundary prediction through higher-level semantic relationships.
Drawings
FIG. 1 is a general flow diagram of a method for weakly supervised shadow detection with line labeling;
FIG. 2 is a schematic diagram illustrating a method for labeling a shadow image with lines according to an embodiment;
FIG. 3 shows the analysis and statistics of the line-labeled datasets in the embodiment, where a and b show labeling details of the two datasets and comparisons with existing pixel-level labels, and c shows the statistics of labeled pixels in the two datasets;
FIG. 4 is a diagram illustrating a shadow detection network according to an embodiment;
FIG. 5 is a comparison of structural aware weakly supervised learning visualization results in an embodiment;
FIG. 6 is a visual feature map obtained by combining semantic perception loss in an embodiment;
FIG. 7 is a graph comparing the results of qualitative analyses of different methods in the examples;
FIG. 8 is a graph comparing the results of ablation analysis for loss functions according to the examples.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1, the weakly supervised shadow detection method using line labeling takes a shadow image as input and predicts the shadow detection result (shadow mask) end-to-end. The method mainly comprises three parts: line labeling (step 1), the detection network (step 2), and structure-aware weakly supervised learning (step 3). In structure-aware weakly supervised learning, edge detection is combined with line supervision to build an edge-guided multi-task learning framework (step 3.1) and generate two complementary shadow predictions (the intermediate prediction and the main prediction). In addition, the feature-guided semantic perception loss is further applied to these predictions (step 3.2) to obtain high-quality shadow masks. The method specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
In view of the flexibility of line labeling, it is used to improve shadow data labeling efficiency.
Step 1.1, marking rule
As shown in fig. 2, a specific method for labeling a shadow image with lines is given. Aiming at the complexity of shadow scenes, several line labeling rules are summarized according to the characteristics of shadows, including:
(1) General rules:
① for a shadow image, mark shadow and non-shadow areas with at least two lines (fig. 2a);
② complex scenes may contain richer shadow information such as color, texture, and shape, so line labels should cover as many areas as possible (fig. 2b);
③ heterogeneous backgrounds tend to interfere with shadow detection, so cross-texture labeling is performed for shadow areas (or non-shadow areas) with different textures (fig. 2c);
(2) Specific rules:
① shadow-like areas have a color similar to shadow areas and are often falsely detected as shadows; to mitigate this ambiguity, explicit labels are given to shadow-like regions (fig. 2d);
② existing depth models are generally insensitive to soft shadows because these have wider penumbra regions, so line labels extend from the shadow region into the penumbra region (fig. 2e);
③ existing shadow detectors typically miss (or falsely detect) self-shadows and small shadow regions because they are not sufficiently salient, so these regions are explicitly labeled (fig. 2f).
Step 1.2, annotating shadow data again
The two commonly used shadow detection datasets (SBU and ISTD) are re-labeled and named S-SBU and S-ISTD; as shown in fig. 3, more labeling details of the two datasets are given. Since line labeling is very sparse, it takes only 8 seconds on average to label one shadow image. The invention also compares the line labels with the original pixel-level labels. It can be observed that: (1) the existing shadow detection datasets contain many noisy labels, and the pixel-level ground truth (GT) misses some important shadow areas, as indicated by the arrows in fig. 3a, b; however, the texture and illumination (or color) of these noisy regions differ from the labeled regions, and the invention explicitly labels them to enhance model training. (2) During labeling, the invention also pays attention to self-shadow, soft-shadow, small-shadow, and shadow-like areas, which the original pixel-level labeling typically ignores.
Furthermore, as shown in fig. 3c, statistics of the two line-labeled datasets give the proportion of labeled pixels over the whole dataset, where the abscissa is the percentage of labeled pixels and the ordinate is the number of images. Only about 10% (S-SBU) and 6% (S-ISTD) of the pixels in the line labels are marked as shadow or non-shadow. S-SBU has notably more labeled pixels than S-ISTD because its shadow scenes are more complex, so more areas need labeling.
Step 2, designing a Transformer-based shadow detection network
As shown in fig. 4, the detection network mainly comprises four modules: the network backbone, main prediction, intermediate prediction, and edge detection.
Step 2.1, network backbone
The present invention uses cross-shaped window Transformer blocks (CSWin Transformer Block, CSTB) (fig. 4b) to build a hierarchical structure as the network backbone, as shown in fig. 4a. The backbone performs downsampling with convolutional layers (3 × 3 convolution, stride 2) and extracts multi-scale feature maps from low to high levels, denoted $F_1$, $F_2$, $F_3$ and $F_4$ respectively. In addition, instead of directly outputting the feature maps, a 3 × 3 convolution block is used for feature transformation at each stage of the network. For an input image $X \in \mathbb{R}^{H \times W \times 3}$, a Convolutional Token Embedding (CTE) layer (7 × 7 convolution with stride 4) generates $\frac{H}{4} \times \frac{W}{4}$ patch tokens, each of dimension C. The feature map constructed at stage i therefore has $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ tokens, where i ∈ {1, 2, 3, 4}. In addition, benefiting from the high computational efficiency of the CSWin Transformer, the inference speed of the network reaches 178 FPS.
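To make the stage layout concrete, the following is a minimal PyTorch sketch of the four-stage backbone wiring. The `CSWinBlock` stand-in, channel widths, and class names are illustrative assumptions; only the CTE (7 × 7 conv, stride 4), the 3 × 3 stride-2 downsampling, the per-stage 3 × 3 feature transform, and the stage depths (1, 2, 21, 1) are taken from the text.

```python
import torch
import torch.nn as nn

class CSWinBlock(nn.Module):
    """Stand-in for the cross-shaped-window self-attention block (CSTB);
    the real attention computation is in the CSWin Transformer paper."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, 3, padding=1)
    def forward(self, x):
        return x + self.mix(x)

class BackboneSketch(nn.Module):
    """Four-stage hierarchical backbone wiring; channel widths are assumptions."""
    def __init__(self, dim=64, depths=(1, 2, 21, 1)):
        super().__init__()
        # Convolutional Token Embedding: 7x7 conv, stride 4 -> H/4 x W/4 tokens
        self.cte = nn.Conv2d(3, dim, 7, stride=4, padding=3)
        self.stages = nn.ModuleList()
        self.transforms = nn.ModuleList()   # per-stage 3x3 feature transform
        self.downs = nn.ModuleList()        # 3x3 stride-2 downsampling convs
        for i, d in enumerate(depths):
            c = dim * 2 ** i
            self.stages.append(nn.Sequential(*[CSWinBlock(c) for _ in range(d)]))
            self.transforms.append(nn.Conv2d(c, c, 3, padding=1))
            if i < len(depths) - 1:
                self.downs.append(nn.Conv2d(c, 2 * c, 3, stride=2, padding=1))

    def forward(self, x):                   # x: (B, 3, H, W)
        feats = []
        x = self.cte(x)                     # (B, C, H/4, W/4)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(self.transforms[i](x))   # F1..F4 at H/4 .. H/32
            if i < len(self.downs):
                x = self.downs[i](x)
        return feats                        # [F1, F2, F3, F4]
```

For a 416 × 416 input (the training resolution stated later), the four feature maps come out at 104, 52, 26, and 13 pixels per side.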
Step 2.2, main prediction
First, high-level features are extracted from the backbone network; the high-level features then pass through a Pyramid Pooling Module (PPM) to obtain a refined feature map, which finally serves as the main prediction of the network. Specifically, the PPM first downsamples the high-level feature map $F_4$ through four different pooling layers to produce four feature maps at different scales, which are then concatenated to obtain an effective global prior representation, as shown in fig. 4c.
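As an illustration of the global-prior extraction, here is a minimal PSPNet-style PPM sketch; the bin sizes (1, 2, 3, 6) and the channel arithmetic are assumptions, since the text only says "four different pooling layers".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch with assumed bin sizes (1, 2, 3, 6)."""
    def __init__(self, in_dim, out_dim, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_dim, in_dim // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.fuse = nn.Conv2d(in_dim * 2, out_dim, 3, padding=1)

    def forward(self, f4):                  # f4: high-level feature map F4
        h, w = f4.shape[-2:]
        pooled = [F.interpolate(b(f4), (h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        # Concatenate the original map with the four rescaled global priors.
        return self.fuse(torch.cat([f4] + pooled, dim=1))
```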
Step 2.3, intermediate prediction
The feature maps obtained in the last three stages of the network are first fused, and the fused feature map serves as the intermediate prediction. Specifically, these feature maps ($F_2$, $F_3$ and $F_4$) are merged using short connections and then fused to obtain the intermediate prediction.
Step 2.4, edge detection
The low-level features $F_1$ and high-level features $F_4$ are first fused to predict an edge map; then the edge map is connected with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, the two shadow maps are concatenated to obtain the final predicted shadow mask (i.e., the output result).
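The following hedged sketch shows one plausible wiring of the four outputs (edge, intermediate, main, final); the channel counts, upsampling, and 3 × 3 fusion convolutions are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShadowHeads(nn.Module):
    """Assumed wiring of edge, intermediate, main, and final predictions."""
    def __init__(self, dims=(64, 128, 256, 512)):
        super().__init__()
        self.edge_head = nn.Conv2d(dims[0] + dims[3], 1, 3, padding=1)  # F1 + F4
        self.inter_head = nn.Conv2d(sum(dims[1:]), 1, 3, padding=1)     # F2 + F3 + F4
        self.main_head = nn.Conv2d(dims[3], 1, 3, padding=1)            # PPM output in the real model
        self.fuse_inter = nn.Conv2d(2, 1, 3, padding=1)                 # edge + intermediate
        self.fuse_main = nn.Conv2d(2, 1, 3, padding=1)                  # edge + main
        self.out = nn.Conv2d(2, 1, 3, padding=1)                        # final shadow mask

    def forward(self, f1, f2, f3, f4):
        up = lambda t: F.interpolate(t, f1.shape[-2:], mode='bilinear',
                                     align_corners=False)
        edge = self.edge_head(torch.cat([f1, up(f4)], 1))
        inter = self.inter_head(torch.cat([up(f2), up(f3), up(f4)], 1))
        main = self.main_head(up(f4))
        s_int = self.fuse_inter(torch.cat([edge, inter], 1))
        s_mai = self.fuse_main(torch.cat([edge, main], 1))
        s_out = self.out(torch.cat([s_int, s_mai], 1))
        return edge, s_int, s_mai, s_out
```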
Step 3, constructing weak supervised learning of structure perception
To learn shadow detection from the line-labeled datasets, the invention provides a structure-aware weakly supervised learning strategy that uses a multi-task learning framework and a semantic perception loss to accurately locate shadow regions.
Step 3.1 edge-guided multitask learning
An edge-guided multi-task learning framework is built on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps.
Step 3.1.1, line supervision
During training, the intermediate, main, and final output predictions of the shadow detection network are supervised by line labels. Notably, pixel-level labels for fully supervised shadow detection use 1 for shadow and 0 for non-shadow; in the weakly supervised setting of the invention, the line labels use three supervisory signals: 1 for shadow, 2 for non-shadow, and 0 for unmarked pixels. Since most pixels under line labels are unmarked, a partial cross-entropy (PCE) loss is used to train the detection network:

$$\mathcal{L}_{pce}(s) = -\sum_{i \in \Omega_S} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

where $\Omega_S$ is the set of labeled pixels, and $y_i$ and $p_i$ are the true class and the prediction at pixel i, respectively. Combining the PCE losses over all supervised predictions gives the final PCE loss:

$$\mathcal{L}_{PCE} = \mathcal{L}_{pce}(s_{int}) + \mathcal{L}_{pce}(s_{mai}) + \mathcal{L}_{pce}(s_{out})$$

where $s_{int}$ is the intermediate prediction, $s_{mai}$ the main prediction, and $s_{out}$ the final output prediction.
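A minimal PyTorch sketch of this loss under the three-valued line labels follows; the tensor shapes, the mean normalization, and the helper name are assumptions.

```python
import torch.nn.functional as F

def partial_ce_loss(logits, line_label):
    """Partial cross-entropy over labeled pixels only.

    logits:     (B, 1, H, W) shadow logits.
    line_label: (B, H, W) with 0 = unmarked, 1 = shadow, 2 = non-shadow.
    """
    mask = line_label > 0                        # Omega_S: labeled pixels only
    target = (line_label == 1).float()           # shadow -> 1, non-shadow -> 0
    per_pixel = F.binary_cross_entropy_with_logits(
        logits.squeeze(1), target, reduction='none')
    return per_pixel[mask].mean()

# The final PCE loss sums this term over the three supervised outputs:
# L_PCE = pce(s_int) + pce(s_mai) + pce(s_out)
```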
Step 3.1.2, edge detection
To highlight shadow structures, the invention combines the edge detection task with the intermediate and main predictions to form a multi-task learning framework. In training, edge detection is supervised with a cross-entropy (CE) loss:

$$\mathcal{L}_{edge} = -\sum_{(r,c)} \left[ g_{(r,c)} \log e_{(r,c)} + (1 - g_{(r,c)}) \log (1 - e_{(r,c)}) \right]$$

where g is an edge map pre-computed by the recent edge detector EDTER and used as the edge-detection GT, e is the predicted edge map, and (r, c) denote the row and column coordinates of a pixel. Combining edge detection with line supervision yields an edge-preserving shadow map.
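Under the same conventions as the PCE sketch, the edge term is a dense binary cross-entropy against the pre-computed edge GT; a minimal sketch:

```python
import torch.nn.functional as F

def edge_ce_loss(edge_logits, edge_gt):
    # Dense cross-entropy between the predicted edge map e and the
    # EDTER-precomputed edge GT g, summed over all pixel positions (r, c).
    return F.binary_cross_entropy_with_logits(edge_logits, edge_gt,
                                              reduction='sum')
```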
Step 3.2, feature-guided semantic perception learning
Multi-task learning encourages the network to generate structurally rich shadow maps, but it does not sufficiently constrain the recovery range of shadow regions, particularly for shadows with fuzzy boundaries. As shown in fig. 5, the shadow in the first row has clear boundaries, and an accurate shadow prediction map is obtained with multi-task learning alone; the shadow boundaries in the second row are blurred and require further perception. To this end, a feature-guided semantic perception loss is proposed to help multi-task learning accurately perceive shadow boundaries in complex scenes. The semantic perception loss designed based on visual features includes a visual similarity loss and a visual difference loss.
Step 3.2.1, loss of visual similarity
The visual similarity loss takes into account color correlation, illumination correlation, and position correlation between pixels.
The color correlation between pixels is defined as:

$$R_c(i,j) = \exp\!\left(-\frac{\|C(i) - C(j)\|^2}{2\sigma_C^2}\right)$$

where C(i) is the color at pixel i, C(j) is the color at pixel j, and $\sigma_C$ is a hyper-parameter. Similarly, the illumination correlation $R_I$ and the position correlation $R_p$ are defined as:

$$R_I(i,j) = \exp\!\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma_I^2}\right), \qquad R_p(i,j) = \exp\!\left(-\frac{\|L(i) - L(j)\|^2}{2\sigma_L^2}\right)$$

where I(i) and L(i) are the illumination and position at pixel i, and $\sigma_I$ and $\sigma_L$ are hyper-parameters. The visual feature similarity is then

$$R(i,j) = R_c(i,j)\, R_I(i,j)\, R_p(i,j),$$

which aims to make visually similar pixels tend toward similar predictions. The visual similarity loss is therefore defined as:

$$\mathcal{L}_{vs} = \sum_{i} \sum_{j \in D_i} R(i,j)\, G(i,j)$$

where $D_i$ is the d × d neighborhood centered on pixel i and $G(i,j) = 1 - p_i p_j - (1 - p_i)(1 - p_j)$.
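A sketch of this loss, assuming the three affinities combine multiplicatively over d × d neighborhoods as reconstructed above; the σ defaults are illustrative, position features are taken as raw pixel coordinates, and `pred` is assumed to hold post-sigmoid probabilities.

```python
import torch
import torch.nn.functional as F

def visual_similarity_loss(pred, color, illum, d=5,
                           sigma_c=0.1, sigma_i=0.1, sigma_l=6.0):
    """Visual similarity loss over d x d neighborhoods D_i.

    pred:  (B, 1, H, W) shadow probabilities p.
    color: (B, 3, H, W) per-pixel color features C.
    illum: (B, 1, H, W) per-pixel illumination features I.
    """
    B, _, H, W = pred.shape

    def affinity(feat, sigma):
        # Gaussian affinity between each pixel and its d*d neighbors.
        n = F.unfold(feat, d, padding=d // 2)                # (B, C*d*d, H*W)
        n = n.view(B, feat.shape[1], d * d, H * W)
        c = feat.flatten(2).unsqueeze(2)                     # (B, C, 1, H*W)
        return torch.exp(-((n - c) ** 2).sum(1) / (2 * sigma ** 2))

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pos = torch.stack([ys, xs]).float().expand(B, 2, H, W).contiguous()

    R = affinity(color, sigma_c) * affinity(illum, sigma_i) * affinity(pos, sigma_l)
    p_n = F.unfold(pred, d, padding=d // 2).view(B, d * d, H * W)   # p_j
    p_c = pred.flatten(2)                                           # p_i
    G = 1 - p_c * p_n - (1 - p_c) * (1 - p_n)                       # G(i, j)
    return (R * G).mean()
```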
Step 3.2.2, loss of visual difference
First, the feature map $F^m \in \mathbb{R}^{C \times H \times W}$ before the main prediction P is extracted; the significance of each feature channel is then determined through the covariance $S_i = \mathrm{cov}(F_i^m, P)$, where $i \in \{1, \ldots, C\}$ indexes the feature channels. The feature maps of the N most significant channels are taken as salient features $F^s$; the feature correlation between pixels is then computed on these salient features, and the visual feature saliency is derived from it. The visual difference loss $\mathcal{L}_{vd}$ is defined over the effective edge region $E_k$, where $\lambda_{vd}$ is a hyper-parameter that increases as the number of iterations increases during training.
In summary, the final semantic perception loss can be written as:

$$\mathcal{L}_{sem} = \mathcal{L}_{vs} + \mathcal{L}_{vd}$$
as shown in fig. 6, in combination with semantic perception loss, visually similar features can be captured in a complex environment, and the shadow region can be accurately located by visual difference (semantic features).
Step 3.3, objective function
The overall loss combines the multi-task learning and the semantic perception learning described above. The final objective function is therefore defined as:

$$\mathcal{L} = \beta_1 \mathcal{L}_{PCE} + \beta_2 \mathcal{L}_{edge} + \beta_3 \mathcal{L}_{sem}$$

where $\beta_1$, $\beta_2$ and $\beta_3$ are hyper-parameters. During training, $\mathcal{L}_{PCE}$ propagates line label information to unannotated regions, $\mathcal{L}_{edge}$ forces the predicted shadow map to remain aligned with shadow boundaries, and $\mathcal{L}_{sem}$ uses semantic information to perceive shadow and non-shadow regions. Combining $\mathcal{L}_{vs}$ and $\mathcal{L}_{vd}$ prevents labeled shadow (non-shadow) pixels from propagating into non-shadow (shadow) regions.
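Assembled in code, the objective might read as follows; the grouping of the semantic terms and the schedule of $\lambda_{vd}$ follow the reconstruction above and are assumptions.

```python
def semantic_loss(l_vs, l_vd, lam_vd):
    # L_sem = L_vs + lambda_vd * L_vd; lambda_vd is scheduled to grow
    # as training iterations increase (exact schedule not disclosed).
    return l_vs + lam_vd * l_vd

def objective(l_pce, l_edge, l_sem, betas=(0.4, 0.3, 0.3)):
    # L = beta1 * L_PCE + beta2 * L_edge + beta3 * L_sem, with the betas
    # set to (0.4, 0.3, 0.3) as stated in the implementation details.
    b1, b2, b3 = betas
    return b1 * l_pce + b2 * l_edge + b3 * l_sem
```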
The proposed method is evaluated on three commonly used shadow detection datasets (SBU, ISTD and UCF). (1) The SBU dataset is the largest pixel-level annotated dataset, containing 4089 training images and 638 test images. (2) The ISTD dataset was created for shadow detection and removal, containing 1330 training images and 540 test images. (3) The shadow images of UCF are similar to SBU's, with 135 training images and 110 test images. Models are first trained on the line-labeled datasets S-SBU and S-ISTD, and then tested on the SBU, ISTD and UCF test sets. Note that testing on UCF is used to verify the generalization capability of the model.
Following state-of-the-art (SOTA) methods, the widely used balanced error rate (BER) is adopted to evaluate the proposed method:

$$BER = \left(1 - \frac{1}{2}\left(\frac{TP}{N_p} + \frac{TN}{N_n}\right)\right) \times 100$$

where TP, TN, $N_p$ and $N_n$ denote the numbers of correctly predicted shadow pixels, correctly predicted non-shadow pixels, shadow pixels, and non-shadow pixels, respectively. In the evaluation, a smaller BER value indicates better model performance.
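For reference, a direct NumPy implementation of this metric (array names are assumptions):

```python
import numpy as np

def balanced_error_rate(pred, gt, thresh=0.5):
    """BER = (1 - 0.5 * (TP / N_p + TN / N_n)) * 100.

    pred: predicted shadow probabilities; gt: boolean shadow mask.
    """
    p = pred >= thresh
    tp = np.logical_and(p, gt).sum()       # correctly predicted shadow pixels
    tn = np.logical_and(~p, ~gt).sum()     # correctly predicted non-shadow pixels
    n_p, n_n = gt.sum(), (~gt).sum()       # shadow / non-shadow pixel counts
    return (1 - 0.5 * (tp / n_p + tn / n_n)) * 100
```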
The shadow detection model proposed in the invention is implemented with PyTorch 1.12.1 and Python 3.6.12, and the detection network is trained on a single NVIDIA RTX 3090 GPU with 24 GB of memory. The numbers of CSTBs at the four stages of the backbone network are set to 1, 2, 21 and 1, respectively. A fully connected conditional random field (CRF) is used to further refine the network prediction results. All input images are uniformly resized to 416 × 416 and the patch size is set to 4 × 4. The hyper-parameters of the objective function are set to $\beta_1 = 0.4$, $\beta_2 = 0.3$ and $\beta_3 = 0.3$. For the training data augmentation strategy, strong augmentation uses color jittering and blurring, and weak augmentation uses random horizontal flipping. In the training phase, the backbone network is first pre-trained on ImageNet to generate its initialization parameters, and the other convolutional layers adopt random initialization. The network is optimized with stochastic gradient descent (SGD), with momentum and weight decay set to 0.9 and 1e-4, respectively. The number of iterations on both the S-SBU and S-ISTD datasets is set to 40, the learning rate is 1e-4, and the training batch size is 4.
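The stated optimizer settings translate directly into a sketch; `model` denotes the detection network and is assumed to be defined elsewhere.

```python
import torch

def build_optimizer(model):
    # SGD with the stated settings: learning rate 1e-4, momentum 0.9,
    # weight decay 1e-4.
    return torch.optim.SGD(model.parameters(), lr=1e-4,
                           momentum=0.9, weight_decay=1e-4)
```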
The proposed method is compared with 9 SOTA methods: ScGAN, DSC, A+D Net, DC-DSPF, BDRAR, DSDNet, MTMT-Net, FDRNet and SDCM. MTMT-Net is a semi-supervised model trained with extra unlabeled data; the other methods are all fully supervised models.
Table 1 gives the quantitative comparison of all methods on the three datasets, where "S" and "NS" denote the pixel error rates of the shadow and non-shadow areas, respectively. The model is designed on the CSWin Transformer, and its parameter count is smaller than that of most CNN models. In terms of performance, the method obtains results comparable to recent fully supervised (or semi-supervised) methods, mainly for three reasons: (1) existing pixel-level labeled data contain many noisy labels, so models with high learning capacity overfit the label noise, limiting their performance; the method instead uses line labels and attends to more shadow cases (e.g., self-shadows, soft shadows, and small shadows) that pixel-level labeling typically ignores. (2) Compared with existing CNN-based detection networks, the Transformer-based network propagates label information better. (3) The method adopts the structure-aware weakly supervised learning strategy, which can well infer shadow structure and details under line supervision. In addition, the method achieves better evaluation results on ISTD than on SBU: ISTD mainly contains hard shadows, whose structures and boundaries are clearer than SBU's, which better suits the proposed weakly supervised learning strategy. The method thus achieves performance comparable to the latest methods (FDRNet and SDCM) on ISTD.
Table 1: quantitative comparison of all methods on the SBU, ISTD and UCF test sets.
Some qualitative comparison results are further given in fig. 7; note that in rows 1 and 7 the GT is a noisy label. The experimental results show that the method performs on par with the latest fully supervised (or semi-supervised) SOTA methods and can effectively detect different types of shadows in different scenes. For example, it can effectively locate small shadows (rows 1 and 6), soft shadows (row 4), and self-shadows (rows 5 and 7) in an image. For ambiguous cases, i.e., shadow-like non-shadow regions (row 2) and shadow regions with non-shadow appearance (row 3), the method can still identify them. Moreover, in some cases the method performs even better than existing methods, e.g., it accurately detects the ambiguous shadows of row 3 and the self-shadows of row 5. The robustness shown across various scenes demonstrates that the proposed labeling method and weakly supervised learning strategy are effective for shadow detection.
To verify the validity of the proposed network design and loss functions, ablation studies were conducted on each separately.
(1) Verifying the validity of the components of the proposed detection network
To verify the validity of the various components of the network, the proposed network is compared with three variants:
Basic: deleting the intermediate prediction (IP) module and the PPM module to obtain a basic model;
Basic + PPM: adding the PPM module to "Basic";
Basic + IP: adding the IP module to "Basic".
Table 2 shows the results of the ablation study, from which the following can be observed: (1) benefiting from the latest Transformer structure, all four models achieve excellent performance on the two benchmark datasets; (2) the BER of "Basic + PPM" is lower than that of "Basic", showing that the global context prior extracted by the PPM module is effective for shadow detection; (3) "Basic + IP" outperforms "Basic", showing that the intermediate prediction provides more shadow and non-shadow details and thus improves the quality of the final prediction; (4) the method combines IP and PPM and obtains the best performance.
Table 2: ablation study of the detection network components on the two benchmark datasets.
(2) Verifying the validity of the loss functions
In the weakly supervised learning, the objective function of the method consists of three loss functions; the detailed ablation study of the loss functions is shown in Table 3. Note that when the edge detection loss is not used in an ablation, the edge detection module is deleted entirely. It can be observed that training the detection network with only the partial cross-entropy loss $\mathcal{L}_{PCE}$ yields the worst BER value. When $\mathcal{L}_{PCE}$ is combined with the edge detection loss $\mathcal{L}_{edge}$ or the semantic perception loss $\mathcal{L}_{sem}$, performance increases significantly, since $\mathcal{L}_{edge}$ encourages the network to generate shadow maps with rich structure and $\mathcal{L}_{sem}$ focuses the network on distinguishing shadow from non-shadow. The method therefore combines all three to achieve optimal performance. In addition, $\mathcal{L}_{PCE} + \mathcal{L}_{edge}$ performs better than $\mathcal{L}_{PCE} + \mathcal{L}_{sem}$, which suggests that the explicit edge constraint (edge detection) is more effective for weakly supervised learning with line labels. The ablation analysis results are further visualized in fig. 8.
Table 3: ablation study of the loss functions.

Claims (5)

1. A weak supervision shadow detection method using line marking is characterized in that: the method specifically comprises the following steps:
step 1, re-labeling a common shadow data set by using lines
Step 1.1, marking rule
Formulating a line marking rule according to the characteristics of the shadow in the complex scene;
step 1.2, annotating shadow data again
Re-labeling the SBU and the ISTD shadow detection data sets, namely S-SBU and S-ISTD;
step 2, designing a transform-based shadow detection network
The detection network comprises four modules: network backbone, main prediction, intermediate prediction and edge detection;
step 2.1, select the network backbone
Selecting a CSWin Transformer as the network backbone for shadow detection;
step 2.2, select Primary prediction
In order to fully utilize the global representation, a pyramid pooling module is used at the top of the backbone network to extract the global context prior; firstly, extracting high-level features from a backbone network, then obtaining a refined feature map by passing the high-level features through a pyramid pooling module, and taking the refined feature map as a main prediction of the network;
step 2.3, selecting intermediate prediction
Fusing the feature maps obtained in the middle three stages of the network, and then taking the fused feature maps as middle prediction;
step 2.4, determining edge detection
Adopting edge detection to explicitly assist shadow structure perception; specifically: firstly, fusing the lowest-level features and the highest-level features to predict an edge map; then connecting the edge map with the intermediate prediction map and the main prediction map respectively to generate shadow maps with rich structure; finally, connecting the two prediction maps to obtain the final predicted shadow mask, i.e., the output result;
step 3, constructing a structure perception weak supervision learning method
Step 3.1, constructing an edge-guided multi-task learning framework
Constructing an edge-guided multi-task learning framework based on the shadow detection network, combining line supervision and edge detection to generate structured shadow prediction maps;
step 3.1.1, online supervision
In the training process, the intermediate prediction, the main prediction and the final output prediction of the shadow detection network are supervised by a line label;
step 3.1.2, edge detection
To highlight shadow structures, structure perception is explicitly aided using edge detection; in specific execution, combining an edge detection task with intermediate prediction and main prediction to form a multi-task learning framework;
step 3.2, feature-oriented semantic perception learning
A feature-oriented semantic perception loss is proposed to accurately perceive shadow regions from complex scenes; semantic perception loss is designed based on visual features, which include visual similarity loss and visual difference loss; the visual similarity loss takes into account color correlation, illumination correlation and position correlation among pixels; the visual difference loss is specifically to solve the problem by using higher semantic information, which simulates the way human recognizes shadows;
and 4, step 4: and capturing visually similar features in a complex environment by combining semantic perception loss, and accurately positioning a shadow region through visual difference.
2. The method of claim 1, wherein the method comprises the following steps: the line marking rule is formulated according to the characteristics of the shadow in the complex scene, and comprises a general rule and a specific rule;
the general rule is as follows:
(1) for a shadow image, marking shadow and non-shadow areas by using at least two lines;
(2) for complex scenes, the line marking should cover as many areas as possible;
(3) carrying out cross-texture labeling on a shadow region or a non-shadow region with different textures;
the specific rule is as follows:
(1) giving explicit labels to the shadow-like regions;
(2) for soft shadows, the line labels extend from the shadow region to the penumbra region;
(3) for self-shadow and small shadow regions, explicit notation is used.
3. The method for weak supervised shadow detection with line labeling as recited in claim 1, wherein: a CSWin Transformer is selected as the network backbone for shadow detection; the network backbone specifically comprises: the backbone network performs downsampling using 3 × 3 convolutional layers with stride 2 and extracts multi-scale feature maps from low to high levels, denoted $F_1$, $F_2$, $F_3$ and $F_4$ respectively; a 3 × 3 convolution block is used for feature transformation at each stage of the network; for an input image $X \in \mathbb{R}^{H \times W \times 3}$, a convolutional token embedding layer (7 × 7 convolution with stride 4) generates $\frac{H}{4} \times \frac{W}{4}$ tokens whose dimension is denoted C, so that the feature map constructed at stage i has $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ tokens, where i ∈ {1, 2, 3, 4}.
4. The method of claim 1, wherein the visual similarity loss takes color correlation, illumination correlation and position correlation among pixels into consideration; the color correlation $R_c$ between pixels is specifically:

$$R_c(i,j) = \exp\!\left(-\frac{\|C(i) - C(j)\|^2}{2\sigma_C^2}\right)$$

where C(i) is the color at pixel i, C(j) is the color at pixel j, and $\sigma_C$ is a hyper-parameter;
the illumination correlation $R_I$ and the position correlation $R_p$ are defined as:

$$R_I(i,j) = \exp\!\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma_I^2}\right), \qquad R_p(i,j) = \exp\!\left(-\frac{\|L(i) - L(j)\|^2}{2\sigma_L^2}\right)$$

where I(i) and L(i) are the illumination and position at pixel i, and $\sigma_I$ and $\sigma_L$ are hyper-parameters; the visual feature similarity is $R(i,j) = R_c(i,j)\, R_I(i,j)\, R_p(i,j)$, which aims to make similar pixels tend toward similar predictions; the visual similarity loss is therefore defined as:

$$\mathcal{L}_{vs} = \sum_{i} \sum_{j \in D_i} R(i,j)\, G(i,j)$$

where $D_i$ is the d × d neighborhood centered on pixel i, $G(i,j) = 1 - p_i p_j - (1 - p_i)(1 - p_j)$, and $p_i$, $p_j$ denote the predictions at pixels i and j.
5. The method of claim 1, wherein the visual difference loss is constructed as follows: first, the feature map $F^m \in \mathbb{R}^{C \times H \times W}$ before the main prediction P is extracted; the significance of each feature channel is then determined through the covariance $S_i = \mathrm{cov}(F_i^m, P)$, where $i \in \{1, \ldots, C\}$ indexes the feature channels; the feature maps of the N most significant channels are taken as salient features $F^s$; the feature correlation is computed on these salient features, from which the visual feature saliency is further obtained; the visual difference loss is then defined over the effective edge region $E_k$, where $\lambda_{vd}$ is a hyper-parameter that increases as the number of iterations increases during training.
CN202211739474.9A 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking Pending CN115953663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211739474.9A CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211739474.9A CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Publications (1)

Publication Number Publication Date
CN115953663A true CN115953663A (en) 2023-04-11

Family

ID=87289221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211739474.9A Pending CN115953663A (en) 2022-12-30 2022-12-30 Weak supervision shadow detection method using line marking

Country Status (1)

Country Link
CN (1) CN115953663A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117575976A (en) * 2024-01-12 2024-02-20 腾讯科技(深圳)有限公司 Image shadow processing method, device, equipment and storage medium
CN117575976B (en) * 2024-01-12 2024-04-19 腾讯科技(深圳)有限公司 Image shadow processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination