CN110889416A - Salient object detection method based on cascade improved network


Info

Publication number
CN110889416A
CN110889416A (application CN201911278227.1A)
Authority
CN
China
Prior art keywords
features
depth
cascade
rgb
map
Prior art date
Legal status
Granted
Application number
CN201911278227.1A
Other languages
Chinese (zh)
Other versions
CN110889416B (en)
Inventor
杨巨峰 (Jufeng Yang)
翟英杰 (Yingjie Zhai)
范登平 (Deng-Ping Fan)
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN201911278227.1A priority Critical patent/CN110889416B/en
Publication of CN110889416A publication Critical patent/CN110889416A/en
Application granted granted Critical
Publication of CN110889416B publication Critical patent/CN110889416B/en
Current legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour

Abstract

The invention discloses an RGB-D salient object detection method based on a cascade improved network, belonging to the technical field of image processing. Most existing RGB-D models directly aggregate features from different levels of a CNN, which easily introduces the noise and interference information contained in the low-level features. The invention creatively proposes a cascaded refinement structure, which takes a saliency map generated from the high-level features as a mask to refine the low-level features and then generates the final saliency map by aggregating the refined low-level features. In addition, to suppress the interference information in the depth map, the invention proposes a depth enhancement module that preprocesses the depth features before they are fused with the RGB features. Experiments with 4 evaluation metrics on 7 datasets show that the invention surpasses all current state-of-the-art RGB-D salient object detection methods.

Description

Salient object detection method based on cascade improved network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an RGB-D salient object detection method based on a cascade improved network.
Background Art
RGB-D saliency detection aims to combine an RGB image with depth information to find the most salient objects in a scene. In recent years, smart devices capable of capturing depth information (such as smartphones and motion-sensing peripherals) have become popular and widely used, and a large number of RGB-D saliency algorithms have accordingly been proposed.
Early RGB-D saliency detection algorithms mainly used hand-crafted features, and these methods relied heavily on specific prior knowledge such as local contrast, global contrast, background priors, spatial priors and channel priors. To exploit the hand-crafted features effectively, researchers employed various classical tools such as support vector machines, Markov chains, random forests and cellular automata, all of which achieved reliable results. In addition, researchers explored various fusion strategies: early fusion, i.e., feeding the depth map into the network as a fourth channel alongside RGB; middle fusion, i.e., fusing features from the RGB network and the depth network; and late fusion, i.e., mixing the saliency maps predicted from the depth information and the RGB information by multiplication or addition. These strategies also achieved good results.
With the popularity of convolutional neural networks (CNNs), various deep-network-based algorithms have been proposed. Early deep algorithms were still based on hand-crafted features and used deep networks only for classification; they therefore relied on artificially defined features and could not be trained end to end. To make full use of the depth information, researchers proposed different deep network architectures (such as single-stream, two-stream and three-stream networks) and various multi-scale, multi-modal mixing strategies. However, since the depth map acquired by a device may contain considerable noise and misleading information, researchers have proposed using prior knowledge and depth filter units to improve the depth information.
Although the above work recognizes that features at all levels of the network contain useful information and exploits them, it ignores the noise and redundancy contained in the low-level features, so these features are not used effectively, and the interference they carry often causes the generated saliency map to include background clutter. Furthermore, depth features are usually combined with RGB information by channel-wise concatenation, element-wise addition or multiplication; these operations cannot effectively reduce the modality gap between depth features and RGB features, nor eliminate the interference of low-quality depth maps.
Disclosure of Invention
The invention aims to solve two problems of existing RGB-D saliency detection methods: the background interference caused by aggregating features from all levels directly and without distinction, which introduces the noise contained in the low-level features, and the poor modality matching between depth features and RGB features. To this end, it designs an RGB-D salient object detection method based on a cascade improved network.
The technical scheme adopted by the invention is as follows:
A salient object detection method based on a cascade improved network refines the low-level features using a preliminary saliency map generated from the high-level features, and generates the final saliency map by aggregating the refined low-level features. The method specifically comprises the following steps:
Step 1: two CNN networks with the same structure are used; one network takes the RGB image as input and extracts RGB features at 5 different levels, while the other takes the depth map as input and extracts depth features at 5 different levels.
Step 2: the 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are then fused with the RGB features of the corresponding level to obtain multi-modal features; the depth enhancement module consists of a channel attention operation and a spatial attention operation executed in sequence.
Step 3: a cascaded feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features of layers 3 to 5 to generate an initial saliency map; using this map as a mask, the low-level multi-modal features of layers 1 to 3 are refined by element-wise multiplication of the initial saliency map with each channel of those features.
Step 4: another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map. Each cascaded feature decoder consists of 3 global context modules and a pyramidal multi-level feature multiplication and concatenation operation.
Advantages and beneficial effects of the invention:
according to the method, the semantic information contained in the high-level features in the depth network is effectively utilized through the depth map filter module to generate a relatively accurate initial saliency map which is then used for improving the low-level features, so that the influence of noise in the low-level features can be fully inhibited, the detailed information of the low-level features can be well kept, and the saliency map with better edge and detailed information can be generated; on the other hand, the depth enhancement unit provided by the invention can enable the network to intensively extract information which is beneficial to significance detection in the depth map, and can improve the modal matching capability of the RGB features and the depth features.
Drawings
FIG. 1 is a block diagram of an embodiment of the saliency detection method based on the cascade improved network proposed by the present invention;
FIG. 2 shows the specific structure of the depth enhancement module (DEM) of the proposed method;
FIG. 3 shows the specific structure of the global context module (GCM) of the proposed method;
FIG. 4 is a comparison of the present invention against the 10 most advanced RGB-D saliency detection methods on 4 evaluation metrics, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet and DF) and 2 traditional hand-crafted-feature methods (SE and LBE).
Detailed Description of the Embodiments
referring to fig. 1, the salient object detection method based on the cascaded modified network provided by the present invention mainly comprises a depth enhancement unit (DEM) and a cascaded feature Decoder (Cascade Decoder), and the implementation steps of the salient object detection method based on the cascaded modified network are as follows:
1. Two ResNet-50 CNN networks with the same architecture are used: one network takes the RGB image as input and extracts RGB features at 5 different levels, $\{f_i^{rgb}\}_{i=1}^{5}$, while the other takes the depth map as input and extracts depth features at 5 different levels, $\{f_i^{d}\}_{i=1}^{5}$. The RGB network has 3 input channels and the depth network has 1 input channel.
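As an illustration of step 1, the following is a minimal PyTorch sketch of the two-stream feature extraction. PyTorch/torchvision, the untrained initialization (`weights=None`) and the helper names (`TwoStreamBackbone`, `_levels`) are assumptions, since the patent fixes only the backbone (ResNet-50), the 5 feature levels and the input channel counts:

```python
import torch.nn as nn
import torchvision.models as models

class TwoStreamBackbone(nn.Module):
    """Two ResNet-50 streams with identical architecture: one for the
    3-channel RGB image, one for the 1-channel depth map; each stream
    yields features at 5 different levels."""
    def __init__(self):
        super().__init__()
        self.rgb = models.resnet50(weights=None)
        self.depth = models.resnet50(weights=None)
        # The depth stream takes a single-channel input (see step 1).
        self.depth.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)

    @staticmethod
    def _levels(net, x):
        f1 = net.relu(net.bn1(net.conv1(x)))  # level 1, stride 2
        f2 = net.layer1(net.maxpool(f1))      # level 2, stride 4
        f3 = net.layer2(f2)                   # level 3, stride 8
        f4 = net.layer3(f3)                   # level 4, stride 16
        f5 = net.layer4(f4)                   # level 5, stride 32
        return [f1, f2, f3, f4, f5]

    def forward(self, rgb, depth):
        # Returns {f_i^rgb} and {f_i^d} for i = 1..5.
        return self._levels(self.rgb, rgb), self._levels(self.depth, depth)
```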
2. The 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are fused with the RGB features of the corresponding level by element-wise addition, giving the multi-modal features $\{f_i^{cm}\}_{i=1}^{5}$, namely:

$$f_i^{cm} = f_i^{rgb} + \mathrm{DEM}(f_i^{d}), \qquad i = 1, \dots, 5$$
Referring to FIG. 2, the depth enhancement module consists of a channel attention operation $C_{att}$ followed by a spatial attention operation $S_{att}$:

$$\mathrm{DEM}(f) = S_{att}\big(C_{att}(f)\big)$$

The channel attention operation and the spatial attention operation are defined as

$$C_{att}(f) = M\big(P_{max}(f)\big) \odot f, \qquad S_{att}(f) = \mathrm{Conv}\big(P'_{max}(f)\big) \odot f$$

where $f$ denotes the input feature maps, $M$ a 2-layer multi-layer perceptron, $P_{max}$ global max pooling over each feature map, $P'_{max}$ global max pooling along the channel dimension of the feature maps, and $\mathrm{Conv}$ a standard 3×3 convolution; $\odot$ denotes expanding the attention map to the dimensions of $f$ and then multiplying element-wise.
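To make the structure concrete, a minimal PyTorch sketch of the DEM follows. The sigmoid squashing, the MLP reduction ratio and the class/parameter names are assumptions; the patent fixes only the overall structure (global max pooling, a 2-layer perceptron, channel-wise max pooling and a 3×3 convolution):

```python
import torch.nn as nn

class DEM(nn.Module):
    """Depth Enhancement Module: a channel attention operation C_att
    followed by a spatial attention operation S_att, as in the formulas
    above. reduction=4 and the sigmoids are assumptions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # M: a 2-layer multi-layer perceptron on the channel descriptor.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Conv: a standard 3x3 convolution on the channel-wise max map.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        # C_att: P_max over each feature map, then M, expand, multiply.
        w = self.mlp(f.amax(dim=(2, 3)))
        f = f * w.view(b, c, 1, 1)
        # S_att: P'_max along the channel dimension, then Conv, multiply.
        s = self.conv(f.amax(dim=1, keepdim=True))
        return f * s
```

The enhanced depth features would then be added element-wise to the RGB features of the corresponding level, `f_cm = f_rgb + dem(f_d)`, as in the formula for $f_i^{cm}$ above.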
3. A cascaded feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features of layers 3 to 5, $\{f_i^{cm}\}_{i=3}^{5}$, to generate an initial saliency map, namely:

$$S_1 = D_1\big(f_3^{cm}, f_4^{cm}, f_5^{cm}\big)$$

where $D_1$ denotes the first cascaded feature decoder. The initial saliency map is then used as a mask to refine the low-level multi-modal features of layers 1 to 3, $\{f_i^{cm}\}_{i=1}^{3}$, by multiplying it element-wise with each of their channels, giving the refined features $\{f_i^{cm\prime}\}_{i=1}^{3}$, namely:

$$f_i^{cm\prime} = f_i^{cm} \odot S_1, \qquad i = 1, 2, 3$$
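A minimal sketch of this refinement step, assuming the initial saliency map $S_1$ is a single-channel map in $[0, 1]$ and is bilinearly resized to each feature's resolution (the resizing is an assumption needed only to match sizes):

```python
import torch.nn.functional as F

def refine_low_level(features, s1):
    """Refine the low-level multi-modal features f_1..f_3 by multiplying
    every channel with the initial saliency map S1 used as a mask."""
    refined = []
    for f in features[:3]:
        mask = F.interpolate(s1, size=f.shape[2:], mode='bilinear',
                             align_corners=False)
        refined.append(f * mask)  # 1-channel mask broadcasts over channels
    return refined
```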
Referring to FIG. 1, each cascaded feature decoder consists of 3 global context modules (GCMs) and a pyramidal multi-level feature multiplication and concatenation operation. As shown in FIG. 3, each global context module consists of four branches. In every branch the channel dimension of the feature map is first reduced to 32 by a 1×1 convolution; the 1st branch performs no further operation, while the k-th branch (k ∈ {2, 3, 4}) first applies a convolution with kernel size (2k-1)×(2k-1) and dilation rate 1, and then a 3×3 convolution with dilation rate 2k-1 to capture global context and enlarge the receptive field. The outputs of the 4 branches are concatenated along the channel dimension and then residually connected with the input.
Referring to FIG. 1(b), the features output by the global context modules, $\{f_i^{gcm}\}$, are processed by the pyramidal multiplication and concatenation operation: each output feature $f^{gcm}$ is updated by multiplying it in turn with all feature maps above it in the hierarchy, and the updated features are then concatenated channel-wise, level by level, to generate the final output. Where feature sizes do not match, the smaller feature maps are first processed by an upsampling-convolution (upsample-conv) step so that sizes match before the pyramid operation updates them.
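A minimal PyTorch sketch of one global context module consistent with this description; the 3×3 fusion convolution and the 1×1 skip projection (needed so the residual connection matches channel counts) are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module: four branches whose receptive fields grow
    with k, concatenated channel-wise and residually connected."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        # Branch 1: only the 1x1 channel reduction to 32.
        branches = [nn.Conv2d(in_ch, out_ch, 1)]
        # Branches k = 2, 3, 4: 1x1 reduction, a (2k-1)x(2k-1) convolution
        # with dilation 1, then a 3x3 convolution with dilation 2k-1.
        for k in (2, 3, 4):
            d = 2 * k - 1
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.Conv2d(out_ch, out_ch, d, padding=d // 2),
                nn.Conv2d(out_ch, out_ch, 3, padding=d, dilation=d),
            ))
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 3, padding=1)  # assumption
        self.skip = nn.Conv2d(in_ch, out_ch, 1)                  # assumption

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(y) + self.skip(x)  # residual connection with input
```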
4. Another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, $\{f_i^{cm\prime}\}_{i=1}^{3}$, and a progressive upsampling module (PTM) then generates the final saliency map $S_2$:

$$S_2 = T\big(D_2(f_1^{cm\prime}, f_2^{cm\prime}, f_3^{cm\prime})\big)$$

where $D_2$ denotes the second cascaded feature decoder, whose structure is identical to that of the cascaded feature decoder described in step 3, and $T$ denotes the progressive upsampling module.
The progressive upsampling module (PTM) outputs a saliency map of fixed size. It consists of 2 residual-based deconvolution blocks with 32 channels, each containing 1 residual-based convolution operation and 1 residual-based deconvolution operation.
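A minimal sketch of the PTM consistent with that description; the kernel sizes, the ReLU placement, the final 1-channel prediction convolution and the simplified (non-residual) form of the transposed convolution are assumptions:

```python
import torch
import torch.nn as nn

class PTM(nn.Module):
    """Progressive upsampling module: 2 deconvolution blocks with 32
    channels, each holding one residual convolution and one stride-2
    transposed convolution that doubles the resolution."""
    def __init__(self, channels=32, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleDict({
                'res': nn.Conv2d(channels, channels, 3, padding=1),
                'up': nn.ConvTranspose2d(channels, channels, 4,
                                         stride=2, padding=1),
            }) for _ in range(n_blocks))
        self.head = nn.Conv2d(channels, 1, 3, padding=1)  # saliency logits

    def forward(self, x):
        for b in self.blocks:
            x = x + torch.relu(b['res'](x))  # residual convolution
            x = torch.relu(b['up'](x))       # transposed conv, doubles H and W
        return self.head(x)
```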
5. The loss function $\mathcal{L}$ of the whole network in the training phase is defined as

$$\mathcal{L} = \alpha\,\ell_{ce}(S_1, G) + (1 - \alpha)\,\ell_{ce}(S_2, G)$$

where $\alpha$ is the weight balancing the losses of the two stage outputs, set to 0.5 in the present invention, $G$ denotes the ground-truth label map, and $\ell_{ce}$ denotes the cross-entropy loss function, defined as

$$\ell_{ce}(S, G) = -\sum_{j}\big[G_j \log S_j + (1 - G_j)\log(1 - S_j)\big]$$

where $S$ denotes the predicted saliency map and the sum runs over all pixels $j$.
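A minimal training-loss sketch matching the formulas above; treating the network outputs as logits (so the sigmoid folds into the loss for numerical stability) and resizing the label map to the resolution of $S_1$ are implementation assumptions:

```python
import torch.nn.functional as F

def total_loss(s1, s2, g, alpha=0.5):
    """L = alpha * l_ce(S1, G) + (1 - alpha) * l_ce(S2, G), with l_ce the
    pixel-wise binary cross-entropy over predicted saliency maps."""
    g1 = F.interpolate(g, size=s1.shape[2:], mode='nearest')  # match S1's size
    return (alpha * F.binary_cross_entropy_with_logits(s1, g1)
            + (1 - alpha) * F.binary_cross_entropy_with_logits(s2, g))
```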
6. The effect of the invention is further illustrated by the following simulation experiment:
table 1 shows comparative experiments of the invention with other 18 RGB-D significance detection methods on seven data sets of NJU2K, NLPR, STERE, DES, LFSD, SSD and SIP. The evaluation index adopted in the experiment is S-measure (S)α)、Max F-measure(Fβ)、Max E-Measure(Eξ) And mae (m) to fully evaluate the method. The results show that the performance of the present invention exceeds all the latest results that have been published.
TABLE 1
[Table 1 is reproduced as an image in the original publication: quantitative comparison with the 18 methods on the 7 datasets under the 4 evaluation metrics.]
In addition, FIG. 4 compares the effect of the present invention with the 10 most advanced methods in specific application scenarios, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet and DF) and 2 traditional hand-crafted-feature methods (SE and LBE). It can be seen that the results of the present invention are closer to the ground-truth labels (GT) than those of the other methods.
The parts of this embodiment that are not described in detail belong to common general knowledge in the field and are not elaborated here.
The saliency detection method based on the cascade improved network has been described in detail above, and the principle and implementation of the invention have been explained through specific embodiments. For those skilled in the art, variations of the specific embodiments and the application scope are possible within the idea of the invention; the content of this specification should therefore not be construed as limiting the invention, and all designs similar or identical to the present invention fall within its protection scope.

Claims (4)

1. A salient object detection method based on a cascade improved network, characterized in that a preliminary saliency map generated from the high-level features is used to refine the low-level features, and the final saliency map is generated by aggregating the refined low-level features, comprising the following steps:
Step 1: two CNN networks with the same structure are used; one network takes the RGB image as input and extracts RGB features at 5 different levels, while the other takes the depth map as input and extracts depth features at 5 different levels;
Step 2: the 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are then fused with the RGB features of the corresponding level to obtain multi-modal features;
Step 3: a cascaded feature decoder aggregates the high-level multi-modal features of layers 3 to 5 to generate an initial saliency map, and the low-level multi-modal features of layers 1 to 3 are refined using the initial saliency map as a mask;
Step 4: another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map.
2. The salient object detection method based on a cascade improved network according to claim 1, wherein: the depth enhancement module described in step 2 consists of a channel attention operation and a spatial attention operation performed in sequence.
3. The salient object detection method based on a cascade improved network according to claim 1, wherein: the refinement described in step 3 uses the initial saliency map as a mask and performs element-wise multiplication of the initial saliency map with each channel of the low-level multi-modal features of layers 1 to 3.
4. The salient object detection method based on a cascade improved network according to claim 1, wherein: the cascaded feature decoders in steps 3 and 4 each consist of 3 global context modules and a pyramidal multi-level feature multiplication and concatenation operation.
CN201911278227.1A 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network Active CN110889416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911278227.1A CN110889416B (en) 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network


Publications (2)

Publication Number Publication Date
CN110889416A true CN110889416A (en) 2020-03-17
CN110889416B CN110889416B (en) 2023-04-18

Family

ID=69751772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911278227.1A Active CN110889416B (en) 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network

Country Status (1)

Country Link
CN (1) CN110889416B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068198A1 (en) * 2016-09-06 2018-03-08 Carnegie Mellon University Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network
CN110210539A (en) * 2019-05-22 2019-09-06 西安电子科技大学 The RGB-T saliency object detection method of multistage depth characteristic fusion
CN110458797A (en) * 2019-06-18 2019-11-15 南开大学 A kind of conspicuousness object detecting method based on depth map filter

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
WO2022134842A1 (en) * 2020-12-24 2022-06-30 广东博智林机器人有限公司 Method and apparatus for identifying building features
CN112905828A (en) * 2021-03-18 2021-06-04 西北大学 Image retriever, database and retrieval method combined with significant features
CN113298814A (en) * 2021-05-21 2021-08-24 浙江科技学院 Indoor scene image processing method based on progressive guidance fusion complementary network
CN113780241A (en) * 2021-09-29 2021-12-10 北京航空航天大学 Acceleration method and device for detecting salient object
CN113780241B (en) * 2021-09-29 2024-02-06 北京航空航天大学 Acceleration method and device for detecting remarkable object
CN116403174A (en) * 2022-12-12 2023-07-07 深圳市大数据研究院 End-to-end automatic driving method, system, simulation system and storage medium

Also Published As

Publication number Publication date
CN110889416B (en) 2023-04-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant