CN110889416A - Salient object detection method based on cascade improved network - Google Patents
- Publication number: CN110889416A
- Application number: CN201911278227.1A
- Authority: CN (China)
- Prior art keywords: features, depth, cascade, rgb, map
- Prior art date: 2019-12-13
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT] (extraction of image or video features)
- G06V10/30: Noise filtering (image preprocessing)
- G06V10/56: Extraction of image or video features relating to colour
Abstract
The invention discloses an RGB-D salient object detection method based on a cascade improved network, belonging to the technical field of image processing. Most existing RGB-D models directly aggregate CNN features from different levels, which easily introduces the noise and interference contained in the low-level features. The invention proposes a cascaded refinement structure that uses a saliency map generated from the high-level features as a mask to refine the low-level features, and then aggregates the refined low-level features to generate the final saliency map. In addition, to suppress the interference information in the depth map, the invention proposes a depth enhancement module that preprocesses the depth features before they are fused with the RGB features. Experiments on 7 datasets with 4 evaluation metrics show that the invention surpasses all current state-of-the-art RGB-D salient object detection methods.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an RGB-D salient object detection method based on a cascade improved network.
Background Art
RGB-D saliency detection aims to find the most salient objects in a scene by combining RGB images with depth information. In recent years, smart devices capable of capturing depth information (such as smartphones and somatosensory peripherals) have become widespread, and a large number of RGB-D saliency algorithms have therefore been proposed.
Early RGB-D saliency detection algorithms mainly relied on handcrafted features and on specific prior knowledge such as local contrast, global contrast, background priors, spatial priors, and channel priors. To exploit these handcrafted features, researchers applied classical tools such as support vector machines, Markov chains, random forests, and cellular automata, all of which achieved reliable results. Various fusion strategies have also been explored: early fusion, which feeds the depth map into the network as a fourth channel alongside RGB; middle fusion, which fuses features from an RGB network and a depth network; and late fusion, which mixes the saliency maps predicted from the depth and RGB information by multiplication or addition. These strategies have likewise achieved good results.
With the popularity of convolutional neural networks (CNNs), many deep-network-based algorithms have been proposed. Early deep methods still computed handcrafted features and only used deep networks for classification; they relied on artificially defined features and could not be trained end to end. To make full use of the depth information, researchers have since proposed different network architectures (single-stream, two-stream, and three-stream networks) and various multi-scale, multi-modal fusion strategies. However, because the depth map captured by a device may contain considerable noise and misleading information, prior knowledge and depth filter units have been proposed to improve the depth information.
Although the above works recognize that features at all network levels contain useful information, they ignore the noise and redundancy contained in the low-level features and therefore fail to exploit them effectively; this interference often causes the generated saliency map to contain background clutter. Furthermore, depth features are usually combined with RGB features by channel concatenation or by element-wise addition or multiplication, which cannot effectively reduce the modality gap between depth and RGB features or eliminate the interference of low-quality depth maps.
Disclosure of Invention
The invention aims to solve two problems in existing RGB-D saliency detection methods: the background interference caused by aggregating features from all levels directly, without distinguishing them, so that noise contained in the low-level features is introduced; and the poor modality matching between depth features and RGB features. To this end, an RGB-D salient object detection method based on a cascade improved network is designed.
The technical scheme adopted by the invention is as follows:
A salient object detection method based on a cascade improved network refines the low-level features with a preliminary saliency map generated from the high-level features, and generates the final saliency map by aggregating the refined low-level features. The method comprises the following steps:
Step 1: two CNN networks of the same structure are used; one network takes the RGB image as input and extracts RGB features at 5 different levels, and the other takes the depth map as input and extracts depth features at 5 different levels.
Step 2: the 5 levels of depth features extracted in step 1 are each passed through a Depth Enhancement Module (DEM) to obtain enhanced depth features, which are then fused with the RGB features of the corresponding levels to obtain multi-modal features. The depth enhancement module consists of a channel attention operation followed by a spatial attention operation.
Step 3: a cascade feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features of layers 3 to 5 to generate an initial saliency map; using this initial saliency map as a mask, the low-level multi-modal features of layers 1 to 3 are refined by element-wise multiplication of the mask with each channel of those features.
Step 4: another cascade feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map. Each cascade feature decoder consists of 3 global information units and a pyramid multi-level feature multiplication and concatenation operation.
The invention has the advantages and beneficial effects that:
according to the method, the semantic information contained in the high-level features in the depth network is effectively utilized through the depth map filter module to generate a relatively accurate initial saliency map which is then used for improving the low-level features, so that the influence of noise in the low-level features can be fully inhibited, the detailed information of the low-level features can be well kept, and the saliency map with better edge and detailed information can be generated; on the other hand, the depth enhancement unit provided by the invention can enable the network to intensively extract information which is beneficial to significance detection in the depth map, and can improve the modal matching capability of the RGB features and the depth features.
Drawings
Fig. 1 is a block diagram of an embodiment of the salient object detection method based on the cascade improved network proposed by the present invention;
Fig. 2 shows the specific structure of the Depth Enhancement Module (DEM) of the proposed method;
Fig. 3 shows the specific structure of the global information unit (GCM) of the proposed method;
Fig. 4 compares the present invention with the 10 most advanced RGB-D saliency detection methods on 4 evaluation metrics, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet, and DF) and 2 traditional handcrafted-feature methods (SE and LBE).
Detailed Description:
referring to fig. 1, the salient object detection method based on the cascaded modified network provided by the present invention mainly comprises a depth enhancement unit (DEM) and a cascaded feature Decoder (Cascade Decoder), and the implementation steps of the salient object detection method based on the cascaded modified network are as follows:
1. Two ResNet-50 CNN networks of the same architecture are used: one network takes the RGB image as input and extracts RGB features at 5 different levels, f_i^r (i = 1, ..., 5), and the other takes the depth map as input and extracts depth features at 5 different levels, f_i^d (i = 1, ..., 5). The RGB network has 3 input channels and the depth network has 1 input channel.
2. The 5 levels of depth features extracted in step 1 are each passed through a Depth Enhancement Module (DEM) to obtain enhanced depth features, which are fused with the RGB features of the corresponding level by element-wise addition to obtain the multi-modal features f_i^m, namely:

f_i^m = f_i^r + DEM(f_i^d), i = 1, ..., 5
Referring to FIG. 2, the depth enhancement module consists of a channel attention operation C_att and a spatial attention operation S_att executed in sequence:

DEM(f) = S_att(C_att(f))

The channel attention and spatial attention operations are defined as:

C_att(f) = Ex(M(P_max(f))) ⊙ f
S_att(f) = Ex(Conv(P'_max(f))) ⊙ f

where ⊙ denotes element-wise multiplication, f denotes the input feature map, M denotes a 2-layer multi-layer perceptron, P_max denotes global max pooling over each feature map, P'_max denotes global max pooling along the channel dimension of the feature map, Conv denotes a standard 3 × 3 convolution, and Ex(·) denotes expanding the dimensions of a feature map before the element-wise multiplication.
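As an illustration, the channel-then-spatial attention of the DEM can be sketched in NumPy. This is a minimal sketch under assumptions not stated in the patent: random weights stand in for the learned 2-layer MLP and the 3 × 3 convolution kernel, and a sigmoid gate is assumed for both attention maps.

```python
import numpy as np

def channel_attention(f, W1, W2):
    # f: (C, H, W); global max pool per channel -> (C,)
    v = f.reshape(f.shape[0], -1).max(axis=1)
    # 2-layer MLP (ReLU, then assumed sigmoid gate)
    a = 1.0 / (1.0 + np.exp(-(np.maximum(v @ W1, 0.0) @ W2)))
    # dimension expansion, then element-wise multiplication
    return f * a[:, None, None]

def spatial_attention(f, kernel):
    # max pool along the channel dimension -> (H, W)
    m = f.max(axis=0)
    H, W = m.shape
    # naive 3x3 "same" convolution
    p = np.pad(m, 1)
    conv = np.empty_like(m)
    for i in range(H):
        for j in range(W):
            conv[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    a = 1.0 / (1.0 + np.exp(-conv))  # assumed sigmoid gate
    return f * a[None, :, :]

def dem(f, W1, W2, kernel):
    # Depth Enhancement Module: channel attention followed by spatial attention
    return spatial_attention(channel_attention(f, W1, W2), kernel)

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8, 8))          # a 4-channel depth feature (illustrative size)
out = dem(f,
          rng.standard_normal((4, 4)),      # MLP layer 1 weights (illustrative)
          rng.standard_normal((4, 4)),      # MLP layer 2 weights (illustrative)
          rng.standard_normal((3, 3)))      # 3x3 conv kernel (illustrative)
```

Because both gates lie in (0, 1), the module can only attenuate feature responses, never amplify them; the learned weights decide which channels and positions are suppressed.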
3. A cascade feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features f_3^m, f_4^m, f_5^m of layers 3 to 5 to generate the initial saliency map S_1:

S_1 = D_1(f_3^m, f_4^m, f_5^m)

where D_1 denotes the first-level cascade feature decoder. The initial saliency map is then used to refine the low-level multi-modal features of layers 1 to 3 into the refined features f'_i:

f'_i = f_i^m ⊙ S_1, i = 1, 2, 3

where S_1 is multiplied element-wise with each channel of f_i^m.
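The mask-based refinement of step 3 reduces to broadcasting the initial saliency map over every channel of a low-level feature. A minimal NumPy sketch, assuming the saliency map has already been brought to the feature's spatial resolution:

```python
import numpy as np

def refine(feature, saliency):
    # feature: (C, H, W) low-level multi-modal feature
    # saliency: (H, W) initial saliency map used as a mask
    # element-wise multiplication of the mask with every channel
    return feature * saliency[None, :, :]

feature = np.ones((3, 4, 4))      # C x H x W (illustrative size)
saliency = np.zeros((4, 4))
saliency[1:3, 1:3] = 1.0          # mask keeps only the "salient" region
out = refine(feature, saliency)
```

With a binary mask as above, responses outside the salient region are zeroed out, which is exactly how the background noise of the low-level features is suppressed.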
referring to fig. 1, each concatenated feature decoder consists of 3 global information elements (GCMs) and pyramidal product and concatenation operations. As shown in fig. 3, each global information unit is composed of four branches, for the four branches, the channel dimension of the feature map is reduced to 32 by using 1 × 1 convolution, no other additional operation is performed on the 1 st branch, for the k (k ∈ 2,3,4}) branch, the convolution operation with the convolution kernel size of 2k-1 and the expansion rate of 1 is performed first, then the convolution operation with the convolution kernel size of 3 × 3 and the expansion rate of 2k-1 is performed to capture global information, and the receptive field is improved. Then the outputs of the 4 branches are spliced together by the channel and then residual connected with the input.
Referring to FIG. 1(b), for the output features f_gcm of the global information units, each output feature is updated by element-wise multiplication, in sequence, with all feature maps at levels higher than itself, and the updated features are then concatenated along the channel dimension to generate the final output. In this pyramid multiplication and concatenation operation, when feature sizes do not match, the smaller feature map is first processed by an upsample-conv operation to obtain a size-matched feature map before the pyramid update.
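The pyramid multiplication and concatenation can be sketched as follows, under the simplifying assumption that all GCM outputs have already been brought to the same spatial size (the patent handles mismatched sizes with an upsample-conv step):

```python
import numpy as np

def pyramid_aggregate(feats):
    # feats: list of (C, H, W) GCM outputs, ordered low level -> high level,
    # assumed already resized to a common spatial resolution.
    # Each feature is updated by element-wise multiplication with all
    # higher-level features, then the updated features are concatenated
    # along the channel dimension.
    updated = []
    for i, f in enumerate(feats):
        g = f.copy()
        for h in feats[i + 1:]:
            g = g * h
        updated.append(g)
    return np.concatenate(updated, axis=0)

# three constant feature maps make the update easy to follow
feats = [np.full((2, 4, 4), v) for v in (1.0, 2.0, 3.0)]
out = pyramid_aggregate(feats)
```

The multiplication lets the semantically cleaner high-level maps gate the lower-level ones before concatenation, which is the same suppression idea as the saliency mask of step 3, applied inside the decoder.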
4. Another cascade feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features f'_1, f'_2, f'_3 of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map S_2:

S_2 = T(D_2(f'_1, f'_2, f'_3))

where D_2 denotes the second cascade feature decoder, whose structure is identical to the cascade feature decoder described in step 3, and T denotes the progressive upsampling module.
The progressive upsampling module (PTM) outputs a fixed-size saliency map. It consists of 2 residual-based transposed-convolution blocks with 32 channels, each block comprising 1 residual convolution operation and 1 residual transposed-convolution (deconvolution) operation.
5. The network is trained with a two-stage loss:

L = ℓ_ce(S_2, G) + α · ℓ_ce(S_1, G)

where α is the weight controlling the loss of the two-stage outputs, set to 0.5 in the present invention, G denotes the ground-truth label map, and ℓ_ce denotes the cross-entropy loss function, defined as:

ℓ_ce(S, G) = − Σ [G log S + (1 − G) log(1 − S)]

where S denotes the predicted saliency map and the sum runs over all pixels.
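A NumPy sketch of the two-stage training loss. The exact placement of the weight α between the two stages is an assumption (the patent only states that α controls the loss of the two-stage outputs), as is the mean rather than sum reduction of the cross entropy:

```python
import numpy as np

def cross_entropy(S, G, eps=1e-7):
    # pixel-wise binary cross entropy between predicted saliency S and label G
    S = np.clip(S, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(G * np.log(S) + (1.0 - G) * np.log(1.0 - S)))

def total_loss(S1, S2, G, alpha=0.5):
    # two-stage loss: final map S2 plus alpha-weighted initial map S1 (assumed form)
    return cross_entropy(S2, G) + alpha * cross_entropy(S1, G)

G = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # toy ground-truth label map
loss = total_loss(np.full((2, 2), 0.5),      # initial saliency map S1
                  np.full((2, 2), 0.5),      # final saliency map S2
                  G)
```

At a uniform prediction of 0.5 each stage contributes ln 2 per pixel, so the toy loss is 1.5 · ln 2; supervising both stages forces the initial map to be accurate enough to serve as a useful mask.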
6. The effect of the invention is further illustrated by the following simulation experiments:
Table 1 compares the invention with 18 other RGB-D saliency detection methods on seven datasets: NJU2K, NLPR, STERE, DES, LFSD, SSD, and SIP. The evaluation metrics are S-measure (S_α), max F-measure (F_β), max E-measure (E_ξ), and MAE (M), to evaluate the method comprehensively. The results show that the invention surpasses all published state-of-the-art results.
TABLE 1
In addition, FIG. 4 compares the invention with the 10 other most advanced methods in specific application scenarios, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet, and DF) and 2 traditional handcrafted-feature methods (SE and LBE). The results of the invention are closer to the ground-truth labels (GT) than those of the other methods.
The parts of this example not described in detail belong to common general knowledge in the field, and are not described in detail herein.
The salient object detection method based on the cascade improved network has been described in detail above, and the principle and implementation of the invention are explained through specific embodiments. For those skilled in the art, variations in the specific embodiments and application scope may be made according to the idea of the present invention; the content of this specification should not be construed as limiting the invention, and all designs similar or identical to the present invention fall within its protection scope.
Claims (4)
1. A salient object detection method based on a cascade improved network, characterized in that a preliminary saliency map generated from the features of the high-level part is used to refine the features of the low-level part, and a final saliency map is generated by aggregating the refined low-level features, comprising the following steps:
step 1, using two CNN networks of the same structure, one network taking the RGB image as input to extract RGB features at 5 different levels, and the other taking the depth map as input to extract depth features at 5 different levels;
step 2, passing the 5 levels of depth features extracted in step 1 through a Depth Enhancement Module (DEM) to obtain enhanced depth features, and fusing the enhanced depth features with the RGB features of the corresponding levels to obtain multi-modal features;
step 3, aggregating the high-level multi-modal features of layers 3 to 5 with a cascade feature decoder to generate an initial saliency map, and refining the low-level multi-modal features of layers 1 to 3 using the initial saliency map as a mask;
step 4, aggregating the refined low-level multi-modal features of layers 1 to 3 with another cascade feature decoder (Cascade Decoder 2), and generating a final saliency map through a progressive upsampling module (PTM).
2. The cascade improved network-based salient object detection method according to claim 1, wherein the depth enhancement module in step 2 consists of a channel attention operation and a spatial attention operation executed in sequence.
3. The cascade improved network-based salient object detection method according to claim 1, wherein the refinement in step 3 performs element-wise multiplication of the initial saliency map, used as a mask, with each channel of the low-level multi-modal features of layers 1 to 3.
4. The cascade improved network-based salient object detection method according to claim 1, wherein the cascade feature decoder in steps 3 and 4 consists of 3 global information units and a pyramid multi-level feature multiplication and concatenation operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911278227.1A CN110889416B (en) | 2019-12-13 | 2019-12-13 | Salient object detection method based on cascade improved network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110889416A true CN110889416A (en) | 2020-03-17 |
CN110889416B CN110889416B (en) | 2023-04-18 |
Family
ID=69751772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911278227.1A Active CN110889416B (en) | 2019-12-13 | 2019-12-13 | Salient object detection method based on cascade improved network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889416B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN112905828A (en) * | 2021-03-18 | 2021-06-04 | 西北大学 | Image retriever, database and retrieval method combined with significant features |
CN113298814A (en) * | 2021-05-21 | 2021-08-24 | 浙江科技学院 | Indoor scene image processing method based on progressive guidance fusion complementary network |
CN113780241A (en) * | 2021-09-29 | 2021-12-10 | 北京航空航天大学 | Acceleration method and device for detecting salient object |
WO2022134842A1 (en) * | 2020-12-24 | 2022-06-30 | 广东博智林机器人有限公司 | Method and apparatus for identifying building features |
CN116403174A (en) * | 2022-12-12 | 2023-07-07 | 深圳市大数据研究院 | End-to-end automatic driving method, system, simulation system and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180068198A1 (en) * | 2016-09-06 | 2018-03-08 | Carnegie Mellon University | Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network |
CN110210539A (en) * | 2019-05-22 | 2019-09-06 | 西安电子科技大学 | The RGB-T saliency object detection method of multistage depth characteristic fusion |
CN110458797A (en) * | 2019-06-18 | 2019-11-15 | 南开大学 | A kind of conspicuousness object detecting method based on depth map filter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||