CN110929736B - Multi-feature cascading RGB-D saliency target detection method


Info

Publication number
CN110929736B
Authority
CN
China
Prior art keywords
depth
layer
rgb
branch
color
Prior art date
Legal status
Active
Application number
CN201911099871.2A
Other languages
Chinese (zh)
Other versions
CN110929736A (en)
Inventor
周武杰
潘思佳
林鑫杨
黄铿达
雷景生
何成
王海江
薛林林
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201911099871.2A
Publication of CN110929736A
Application granted
Publication of CN110929736B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-feature cascading RGB-D saliency target detection method. RGB images, their corresponding depth images and their true saliency images are selected to form a training set, a convolutional neural network is constructed, and the training set is input into the convolutional neural network for training to obtain a predicted saliency image for each RGB image in the training set. The loss function value between the predicted saliency image of each RGB image in the training set and the corresponding true saliency image is calculated, and the weight vector and bias term corresponding to the smallest loss function value obtained during training are retained. The RGB image and depth image to be predicted are then input into the trained convolutional neural network model to obtain the predicted segmentation image. The disclosed model has a novel structure, and the saliency map obtained after model processing has high similarity to the target map.

Description

Multi-feature cascading RGB-D saliency target detection method
Technical Field
The invention relates to a human eye saliency target detection method, in particular to a multi-feature cascading RGB-D saliency target detection method.
Background
Saliency target detection is a branch of image processing and is also a field of computer vision. Computer vision, in a broad sense, is the discipline that imparts natural vision capabilities to machines. Natural vision ability refers to the visual ability that the biological vision system embodies. In fact, computer vision is essentially the research of visual perception problems. The key problem is to study how to organize the input image information, identify objects and scenes, and further explain the image content.
Computer vision has attracted increasing interest and rigorous research in recent decades, and has become steadily better at recognizing patterns in images. As the striking achievements of artificial intelligence and computer vision become more common across industries, the future of the field appears full of promising and even unforeseeable results. Salient object detection, the topic of this invention, is one branch of this field, but it plays a significant role.
Saliency detection aims to predict where human observers look in an image, and has attracted extensive research interest in recent years. It plays an important role as a preprocessing step in problems such as image classification, image retargeting and object recognition. Compared with RGB saliency detection, RGB-D saliency detection is much less studied. According to how saliency is defined, saliency detection methods can be divided into top-down and bottom-up methods. Top-down saliency detection is task-dependent and incorporates high-level features to locate salient objects. Bottom-up methods, on the other hand, are data-driven, using low-level features to compute saliency maps from a biological perspective.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a multi-feature cascading RGB-D saliency target detection method; the saliency map obtained after model processing has high similarity to the target map, and the model structure is novel.
The technical solution adopted by the invention comprises the following steps:
step 1_1: q original RGB images and corresponding depth maps thereof are selected, and a training set is formed by combining the true significance images corresponding to the original RGB images;
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises two input layers, a hidden layer and an output layer, wherein the two input layers are connected to the input end of the hidden layer, and the output end of the hidden layer is connected to the output layer;
step 1_3: each original RGB image and the corresponding depth image in the training set are respectively used as the original input images of the two input layers and are input into a convolutional neural network for training, so that a prediction significance image corresponding to each original RGB image in the training set is obtained; calculating a loss function value between a predicted saliency image corresponding to each original RGB image in the training set and a corresponding real saliency image, wherein the loss function value is obtained by adopting a BCE loss function;
Step 1_4: repeating the step 1_3 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term, and the weight vector and the bias term in the trained convolutional neural network training model are replaced;
step 1_5: inputting the RGB image to be predicted and the depth image corresponding to the RGB image to be predicted into a trained convolutional neural network training model, and predicting by utilizing an optimal weight vector and an optimal bias term to obtain a predicted saliency image corresponding to the RGB image to be predicted, thereby realizing saliency target detection.
Two input layers in the step 1_2, wherein the 1 st input layer is an RGB image input layer, and the 2 nd input layer is a depth image input layer; the hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers which are connected in sequence; the four sequentially connected color map neural network blocks correspond to the four sequentially connected modules in the ResNet50 respectively, the output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch respectively;
The depth feature extraction module comprises four depth map neural network blocks, four depth attention layers, eight depth up-sampling layers, four attention convolution layers and four depth convolution layers which are sequentially connected, wherein the four depth map neural network blocks are respectively corresponding to the four modules which are sequentially connected in the ResNet50, the output of the first depth map neural network block is respectively connected to the first depth branch and the fifth depth branch, the output of the second depth map neural network block is respectively connected to the second depth branch and the sixth depth branch, the output of the third depth map neural network block is respectively connected to the third depth branch and the seventh depth branch, and the output of the fourth depth map neural network block is respectively connected to the fourth depth branch and the eighth depth branch;
the outputs of the first RGB branch and the second RGB branch are multiplied to be used as one input of the low-level characteristic convolution layer, and the outputs of the first depth branch and the second depth branch are multiplied to be used as the other input of the low-level characteristic convolution layer; the outputs of the third RGB branch and the fourth RGB branch are multiplied to be used as one input of the advanced characteristic convolution layer, and the outputs of the third depth branch and the fourth depth branch are multiplied to be used as the other input of the advanced characteristic convolution layer;
The outputs of the low-level characteristic convolution layer and the high-level characteristic convolution layer are input into the mixed characteristic convolution layer;
the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into a detail information processing module; the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module;
the output of the mixed characteristic convolution layer and the output of the detail information processing module are fused and then used as one input of the SKNet network model, and the output of the mixed characteristic convolution layer and the output of the global information processing module are fused and then used as the other input of the SKNet network model;
the post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected, wherein the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is finally output through the output layer.
The first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected;
The fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
the first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected;
the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
The detail information processing module comprises a first network module and a second transition convolution layer which are sequentially connected; the input of the detail information processing module also passes through a first transition convolution layer, and the output of the first transition convolution layer is fused with the output of the second transition convolution layer to form the output of the detail information processing module;
the global information processing module comprises three processing branches, wherein the three processing branches comprise a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused and then used as the output of the global information processing module.
Each color attention layer and each depth attention layer adopts a CBAM module (Convolutional Block Attention Module), and each color upsampling layer and each depth upsampling layer performs bilinear-interpolation upsampling of the input features; each attention convolution layer, color convolution layer, depth convolution layer, low-level feature convolution layer, high-level feature convolution layer and mixed feature convolution layer comprises one convolution layer; each of said deconvolution layers comprises one deconvolution;
the transition convolution layers in the detail information processing module and the global convolution layers in the global information processing module each comprise one convolution layer; the first network module in the detail information processing module adopts a Dense block of the DenseNet network, and each global network module in the global information processing module adopts an ASPP module (Atrous Spatial Pyramid Pooling module).
The input end of the RGB image input layer receives an RGB input image, and the input end of the depth image input layer receives a depth image corresponding to the RGB image; the input of the RGB feature extraction module and the depth feature extraction module is the output of an RGB image input layer and a depth image input layer respectively.
Compared with the prior art, the invention has the advantages that:
1) The invention uses ResNet50 to separately pre-train the RGB image and the depth image (the depth image is converted into a three-channel input), extracts the different outputs of the RGB image and the depth image from the 4 modules in ResNet50, and performs two different operations on the extracted results: first, detail optimization of the high- and low-level features through an attention mechanism; second, fusion of the high- and low-level features into a main branch, which is then passed into the later part of the model.
2) The invention extracts feature information from the pre-training and divides the image features into high-level and low-level features. In the left part of the model, detail features of the image are extracted from the high- and low-level features, and the high-level and low-level features are fused after their respective operations; this scheme works very well.
3) Two novel modules are designed in the right side of the model. The first module combines convolution with a Dense block, fully exploiting the advantages of convolution and DenseNet, so that the detection result of the method is finer. The second module uses ASPP to enlarge the receptive field and then pairs it with convolution, which benefits the collection of global features, so that the detection result of the method is more comprehensive. Finally, these are fused with the detail features from the left side of the model in an overall fusion.
Drawings
Fig. 1 is a block diagram of a general implementation of the method of the present invention.
Fig. 2a is an original RGB image.
Fig. 2b is the depth image of fig. 2 a.
Fig. 3a is a true saliency detection image of fig. 2 a.
Fig. 3b is a predicted saliency image of fig. 2a and 2b according to the present invention.
FIG. 4a shows the results of the present invention on the Precision-Recall (PR) evaluation.
Fig. 4b shows the results of the present invention on ROC.
FIG. 4c shows the results of the present invention on MAE.
Detailed Description
The invention is described in further detail below with reference to the embodiments of the drawings.
The invention provides a multi-feature cascading RGB-D significance target detection method, the general implementation block diagram of which is shown in figure 1, and the method comprises three processes of a training stage, a verification stage and a testing stage.
The training phase process comprises the following specific steps:
Step 1_1: Q color real target images and their corresponding depth images are selected, and together with the true saliency image corresponding to each color real target image they form a training set. The q-th original RGB image in the training set is denoted {I_q(i,j)}, its depth image is denoted {D_q(i,j)}, and the true saliency image corresponding to {I_q(i,j)} in the training set is denoted {G_q(i,j)}; the color real target images are RGB color images and the depth images are gray-scale images; Q is a positive integer with Q ≥ 200, here Q = 1588; q is a positive integer with 1 ≤ q ≤ Q; 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {I_q(i,j)} and H denotes its height, e.g. W = 512 and H = 512; I_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q(i,j)}, and G_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {G_q(i,j)}. Here, the color real target images are taken directly from the 1588 images of the NJU2000 training set.
Step 1_2: constructing a convolutional neural network:
the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the input layers include an RGB image input layer and a depth image input layer. For an RGB image input layer, an input end receives an R channel component, a G channel component and a B channel component of an original input image, and an output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the width of the original input image received by the input end of the required input layer is W, and the height is H. For a depth image input layer, an input end receives a depth image corresponding to an original input image, an output end of the input end outputs the original depth image, the original depth image is changed into a three-channel depth image through superposition of two channels, and three-channel components are given to a hidden layer; wherein the width of the original input image received by the input end of the required input layer is W, and the height is H.
The hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers which are connected in sequence. The four color map neural network blocks connected in sequence correspond to the four modules connected in sequence in the ResNet50 respectively. For the 1 st color image neural network block, the 2 nd color image neural network, the 3 rd color image neural network and the 4 th color image neural network, 4 modules in the ResNet50 are respectively corresponding in sequence, a pretraining method is adopted, and the input image is pretrained by utilizing the network of the ResNet50 of the pytorch and the weight thereof. The output is 256 feature images after passing through the 1 st color image neural network block, and the set formed by the 256 feature images is marked as P 1 ,P 1 The width of each characteristic diagram is
Figure BDA0002269501490000061
Height is +.>
Figure BDA0002269501490000062
The obtained image is output as 512 feature images after passing through the 2 nd color image neural network block, and the set formed by the output 512 feature images is marked as P 2 ,P 2 The width of each feature map in (a) is +.>
Figure BDA0002269501490000063
Height is +.>
Figure BDA0002269501490000064
The output is 1024 feature images after passing through the 3 rd color image neural network block, and the set formed by the 1024 feature images is marked as P 3 ,P 3 The width of each feature map in (a) is +.>
Figure BDA0002269501490000065
Height is +.>
Figure BDA0002269501490000066
The result is output as 2048 feature images after passing through the 4 th color image neural network block, and the set formed by the 2048 feature images is marked as P 4 ,P 4 The width of each feature map is +.>
Figure BDA0002269501490000067
Height is +.>
Figure BDA0002269501490000068
The depth feature extraction module comprises four depth map neural network blocks, four depth attention layers, eight depth upsampling layers, four attention convolution layers and four depth convolution layers which are connected in sequence. The four sequentially connected depth map neural network blocks correspond to the four sequentially connected modules in ResNet50. The 1st, 2nd, 3rd and 4th depth map neural network blocks correspond in order to the 4 modules of ResNet50; a pre-training method is adopted, and the input image is pre-trained using the pytorch ResNet50 network and its weights. After the 1st depth map neural network block the output is 256 feature maps, whose set is denoted D1; each feature map in D1 has width W/4 and height H/4. After the 2nd depth map neural network block the output is 512 feature maps, whose set is denoted D2; each feature map in D2 has width W/8 and height H/8. After the 3rd depth map neural network block the output is 1024 feature maps, whose set is denoted D3; each feature map in D3 has width W/16 and height H/16. After the 4th depth map neural network block the output is 2048 feature maps, whose set is denoted D4; each feature map in D4 has width W/32 and height H/32.
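A sketch of how the four block outputs P1–P4 (and likewise D1–D4 for the depth branch) could be extracted with torchvision's ResNet50; the layer1–layer4 grouping is torchvision's standard naming, and the channel counts and resolutions in the comments match the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class ResNet50Backbone(nn.Module):
    """Extracts the outputs of the four ResNet50 blocks used as P1..P4 / D1..D4."""
    def __init__(self, pretrained=True):
        super().__init__()
        net = models.resnet50(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.block1 = net.layer1   # 256 channels,  1/4 resolution
        self.block2 = net.layer2   # 512 channels,  1/8 resolution
        self.block3 = net.layer3   # 1024 channels, 1/16 resolution
        self.block4 = net.layer4   # 2048 channels, 1/32 resolution

    def forward(self, x):
        x = self.stem(x)
        p1 = self.block1(x)
        p2 = self.block2(p1)
        p3 = self.block3(p2)
        p4 = self.block4(p3)
        return p1, p2, p3, p4

backbone = ResNet50Backbone(pretrained=False)  # set True to load the ImageNet weights
feats = backbone(torch.rand(1, 3, 224, 224))
print([f.shape for f in feats])
```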
The output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch, respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch, respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch, respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch, respectively. The first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected; the fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
The output of the first depth map neural network block is connected to the first depth branch and the fifth depth branch respectively, the output of the second depth map neural network block is connected to the second depth branch and the sixth depth branch respectively, the output of the third depth map neural network block is connected to the third depth branch and the seventh depth branch respectively, and the output of the fourth depth map neural network block is connected to the fourth depth branch and the eighth depth branch respectively. The first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected; the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
Each color attention layer or depth attention layer is composed of one CBAM (Convolutional Block Attention Module) module. For the 1 st color attention layer, this operation does not change the graph size and channel number, and is still 256 feature graphs. For the 2 nd color attention layer, this operation did not change the graph size and channel number, still 512 feature graphs. For the 3 rd color attention layer, this operation does not change the graph size and channel number, still 1024 feature graphs. For the 4 th color attention layer, this operation does not change the graph size and channel number, still 2048 feature graphs. For the 1 st depth attention layer, this operation does not change the graph size and channel number, and is still 256 feature graphs. For the 2 nd depth attention layer, this operation did not change the graph size and channel number, still 512 feature graphs. For the 3 rd depth attention layer, this operation does not change the graph size and channel number, still 1024 feature graphs. For the 4 th depth attention layer, this operation does not change the graph size and channel number, still 2048 feature graphs.
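A compact sketch of one such attention layer following the published CBAM design; the reduction ratio of 16 and the 7×7 spatial kernel are common CBAM defaults rather than values stated in this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention. The output keeps the input size and channel count,
    as stated above for every color/depth attention layer."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)))
        return x * sa

attn = CBAM(256)
print(attn(torch.rand(1, 256, 56, 56)).shape)  # unchanged: [1, 256, 56, 56]
```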
Each attention convolution layer is composed of one convolution layer. The 1st, 3rd, 5th and 7th attention convolution layers use 3×3 convolution kernels with a zero-padding parameter of 1; the 2nd, 4th, 6th and 8th attention convolution layers use 5×5 convolution kernels with a zero-padding parameter of 2; every attention convolution layer has 256 convolution kernels and a stride of 1 and outputs 256 feature maps. The sets of 256 feature maps output by the 1st to 4th attention convolution layers are denoted S1, S2, S3 and S4 respectively, and the sets output by the 5th to 8th attention convolution layers are denoted G1, G2, G3 and G4 respectively.
For the 1st multiplication operation, S1 and S2 are multiplied, outputting 256 feature maps whose set is denoted S1S2 and used as one input of the low-level feature convolution layer. For the 2nd multiplication operation, S3 and S4 are multiplied, outputting 256 feature maps whose set is denoted S3S4 and used as one input of the high-level feature convolution layer. For the 3rd multiplication operation, G1 and G2 are multiplied, outputting 256 feature maps whose set is denoted G1G2 and used as the other input of the low-level feature convolution layer. For the 4th multiplication operation, G3 and G4 are multiplied, outputting 256 feature maps whose set is denoted G3G4 and used as the other input of the high-level feature convolution layer.
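A sketch of the low-level fusion path under the parameters above; the input channel counts follow the corresponding backbone blocks, the feature tensors are placeholders at an assumed W/4 × H/4 working size, and summing the two products before the low-level feature convolution layer is an assumption, since only the fact that both serve as inputs is stated.

```python
import torch
import torch.nn as nn

# Attention convolution layers with the parameters listed above: 3x3 kernels
# with zero padding 1 and 5x5 kernels with zero padding 2, 256 kernels, stride 1.
conv1 = nn.Conv2d(256, 256, 3, padding=1)   # 1st attention conv -> S1
conv2 = nn.Conv2d(512, 256, 5, padding=2)   # 2nd attention conv -> S2
conv5 = nn.Conv2d(256, 256, 3, padding=1)   # 5th attention conv -> G1
conv6 = nn.Conv2d(512, 256, 5, padding=2)   # 6th attention conv -> G2
low_level_conv = nn.Conv2d(256, 256, 3, padding=1)

# Placeholder branch features (after attention and upsampling) at W/4 x H/4.
r1, r2 = torch.rand(1, 256, 56, 56), torch.rand(1, 512, 56, 56)   # RGB branches 1, 2
d1, d2 = torch.rand(1, 256, 56, 56), torch.rand(1, 512, 56, 56)   # depth branches 1, 2

s1s2 = conv1(r1) * conv2(r2)   # 1st multiplication: S1 x S2 (RGB low-level product)
g1g2 = conv5(d1) * conv6(d2)   # 3rd multiplication: G1 x G2 (depth low-level product)

# Both products feed the low-level feature convolution layer; summing them
# first is an assumption of this sketch.
low = low_level_conv(s1s2 + g1g2)
print(low.shape)   # torch.Size([1, 256, 56, 56])
```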
Each color convolution layer or depth convolution layer is composed of one convolution. For the 1 st color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 2 nd color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 3 rd color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 4 th color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 1 st depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 2 nd depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 3 rd depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 4 th depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output.
Each color upsampling layer or depth upsampling layer is used for upsampling processing of bilinear interpolation of the input features.
For each of the 1st, 2nd, 3rd and 4th color upsampling layers and the 1st, 2nd, 3rd and 4th depth upsampling layers, the width of the output feature map is set to W/4 and the height to H/4; these operations do not change the number of feature maps.
For the 5th, 6th, 7th and 8th color upsampling layers, the width of the output feature map is likewise set to W/4 and the height to H/4; the number of feature maps is not changed, so each outputs 512 feature maps, and the sets formed by these 512 feature maps are denoted U1, U2, U3 and U4 respectively.
For the 5th, 6th, 7th and 8th depth upsampling layers, the width of the output feature map is set to W/4 and the height to H/4; the number of feature maps is not changed, so each outputs 512 feature maps, and the sets formed by these 512 feature maps are denoted F1, F2, F3 and F4 respectively.
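A one-line sketch of the bilinear upsampling performed by every color and depth upsampling layer; the concrete sizes assume the 224×224 training input, so the common W/4 × H/4 size is 56×56.

```python
import torch
import torch.nn.functional as F

# Bilinear upsampling of a backbone feature map to the common working size.
p2 = torch.rand(1, 512, 28, 28)   # e.g. output of the 2nd color map neural network block
u = F.interpolate(p2, size=(56, 56), mode='bilinear', align_corners=False)
print(u.shape)   # torch.Size([1, 512, 56, 56]) -- the channel count is unchanged
```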
The outputs of both the low-level feature convolution layer and the high-level feature convolution layer are input into the mixed feature convolution layer. The 1st high-level feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps; it fuses the high-level features of the RGB map and the depth map. The 1st low-level feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps; it fuses the low-level features of the RGB map and the depth map. The 1st mixed feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted X1.
For the 5th multiplication operation, the result of adding U1 and U2 is multiplied by the result of adding F1 and F2, outputting 512 feature maps. For the 6th multiplication operation, the result of adding U3 and U4 is multiplied by the result of adding F3 and F4, outputting 512 feature maps. In this way the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into the detail information processing module, and the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module.
The detail information processing module comprises a first network module and a second transition convolution layer which are sequentially connected; the input of the detail information processing module also passes through a first transition convolution layer, whose output is fused with the output of the second transition convolution layer to form the output of the detail information processing module. The 1st network module uses a Dense block of the DenseNet network, with the following parameter settings: 6 layers, a bottleneck size of 4 and a growth rate of 4, outputting 536 feature maps. The 1st transition convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted H1. The 2nd transition convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted H2.
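A sketch of this module under the parameters above (512 input channels, a 6-layer dense block with growth rate 4 giving 536 maps, and two 256-kernel transition convolutions); the simplified dense layer (BN–ReLU–3×3 convolution, without the DenseNet bottleneck) and returning H1 and H2 separately for the later summation are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """DenseNet-style dense block: each layer sees the concatenation of all
    previous feature maps (6 layers, growth 4: 512 -> 536 channels)."""
    def __init__(self, in_ch=512, growth=4, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False)))
            ch += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)
        return x                      # 536 channels for the defaults above

class DetailModule(nn.Module):
    """Detail information processing module: a convolutional skip path (H1)
    in parallel with dense block + transition convolution (H2)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.skip_conv = nn.Conv2d(in_ch, 256, 3, padding=1)           # 1st transition conv -> H1
        self.dense = DenseBlock(in_ch)
        self.trans_conv = nn.Conv2d(in_ch + 6 * 4, 256, 3, padding=1)  # 2nd transition conv -> H2

    def forward(self, x):
        h1 = self.skip_conv(x)
        h2 = self.trans_conv(self.dense(x))
        return h1, h2   # both are later summed with the mixed-feature output X1

m = DetailModule()
h1, h2 = m(torch.rand(1, 512, 56, 56))
print(h1.shape, h2.shape)
```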
The global information processing module comprises three processing branches; each processing branch comprises a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused as the output of the global information processing module. Each global network module adopts an ASPP (Atrous Spatial Pyramid Pooling) module; the 1st, 2nd and 3rd global network modules each output 512 feature maps. Each global convolution layer consists of one convolution layer. The 1st global convolution layer has a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted E1. The 2nd global convolution layer has a 5×5 kernel, 256 convolution kernels, a zero-padding parameter of 2 and a stride of 1, and outputs 256 feature maps, whose set is denoted E2. The 3rd global convolution layer has a 7×7 kernel, 256 convolution kernels, a zero-padding parameter of 3 and a stride of 1, and outputs 256 feature maps, whose set is denoted E3.
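A sketch of the three-branch global module; the ASPP implementation is simplified and its dilation rates are illustrative, while the 3×3/5×5/7×7 global convolutions follow the parameters above.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Simplified atrous spatial pyramid pooling: parallel dilated convolutions
    whose outputs are concatenated and projected back to `out_ch` channels.
    The dilation rates are illustrative, not taken from the patent."""
    def __init__(self, in_ch=512, out_ch=512, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class GlobalModule(nn.Module):
    """Global information processing module: three ASPP + convolution branches
    with 3x3, 5x5 and 7x7 kernels producing E1, E2, E3."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(ASPP(in_ch), nn.Conv2d(512, 256, k, padding=k // 2))
            for k in (3, 5, 7)])

    def forward(self, x):
        e1, e2, e3 = [b(x) for b in self.branches]
        return e1, e2, e3   # later summed with the mixed-feature output X1

g = GlobalModule()
print([o.shape for o in g(torch.rand(1, 512, 56, 56))])
```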
The output of the mixed feature convolution layer and the output of the detail information processing module are fused as one input of the SKNet network model, and the output of the mixed feature convolution layer and the output of the global information processing module are fused as the other input of the SKNet network model. The 1st SKNet consists of one Selective Kernel Network; it has two inputs, the first being the sum of H1, H2 and X1 and the second being the sum of E1, E2, E3 and X1. Each input consists of 256 feature maps of width W/4 and height H/4; the operation outputs 256 feature maps of unchanged size.
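A sketch of a selective-kernel-style fusion of the two 256-channel inputs, after Li et al.'s Selective Kernel Networks; the reduction ratio is an assumed default, and the exact internal form of the SKNet used here is not spelled out above.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """Selective-kernel-style fusion of two equally shaped feature tensors:
    channel attention weights computed from the summed inputs softly select
    between the two branches; size and channel count are unchanged."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        d = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        self.fcs = nn.ModuleList([nn.Linear(d, channels) for _ in range(2)])
        self.softmax = nn.Softmax(dim=0)

    def forward(self, a, b):
        u = a + b                                        # fuse the two inputs
        z = self.squeeze(u.mean(dim=(2, 3)))             # global average pooling -> compact descriptor
        att = torch.stack([fc(z) for fc in self.fcs])    # (2, B, C) branch logits
        att = self.softmax(att).unsqueeze(-1).unsqueeze(-1)
        return att[0] * a + att[1] * b                   # softly selected combination

sk = SKFusion(256)
out = sk(torch.rand(1, 256, 56, 56), torch.rand(1, 256, 56, 56))
print(out.shape)  # unchanged size and channel count, as stated above
```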
The post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected; its input is the output of the SKNet network model and its output is finally produced by the output layer. The 1st deconvolution layer consists of one deconvolution with a 2×2 kernel, 128 convolution kernels, a zero-padding parameter of 0 and a stride of 2; each output feature map has width W/2 and height H/2. The 2nd deconvolution layer consists of one deconvolution with a 2×2 kernel, 1 convolution kernel, a zero-padding parameter of 0 and a stride of 2; each output feature map has width W and height H.
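A sketch of the post-processing module with the deconvolution parameters listed above; the 56×56 input size again assumes the 224×224 training input.

```python
import torch
import torch.nn as nn

# Post-processing module: two transposed convolutions that restore the full
# W x H resolution (2x2 kernels, stride 2, no zero padding; 128 kernels, then 1).
post = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2, padding=0),  # -> W/2 x H/2
    nn.ConvTranspose2d(128, 1, kernel_size=2, stride=2, padding=0),    # -> W x H, one saliency channel
)

x = torch.rand(1, 256, 56, 56)   # SKNet output at W/4 x H/4
print(post(x).shape)             # torch.Size([1, 1, 224, 224])
```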
Step 1_3: each original color real target image in the training set is resized to 224×224 and used as the original RGB input image; the depth image corresponding to each original color real target image in the training set is resized to 224×224 and converted into a three-channel image to be used as the depth input image. These inputs are passed through ResNet50 for pre-training, and the resulting feature maps are then fed into the model for training, giving a saliency detection prediction map corresponding to each color real target image in the training set.
Step 1_4: the loss function value between the saliency detection prediction map corresponding to each original color real target image in the training set and the corresponding real saliency image, processed into an encoded image of corresponding size, is calculated; the loss function value between the prediction map corresponding to {I_q(i,j)} and {G_q(i,j)} is obtained using the BCE loss function.
Step 1_5: repeating the step 1_3 and the step 1_4 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, and correspondingly marked as W best And b best The method comprises the steps of carrying out a first treatment on the surface of the Where V > 1, v=100 in this example.
The specific steps of the test phase process of the embodiment are as follows:
Step 2_1: let {I'(i',j')} denote the color real target image to be detected for saliency, and let {D'(i',j')} denote the depth image corresponding to the real target image to be detected for saliency; where 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I'(i',j')}, H' denotes the height of {I'(i',j')}, I'(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I'(i',j')}, and D'(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {D'(i',j')}.
Step 2_2: the R channel component, G channel component and B channel component of {I'(i',j')} and the three-channel version of {D'(i',j')} are input into the convolutional neural network classification training model, and prediction is performed using W_best and b_best to obtain the predicted saliency detection image corresponding to {I'(i',j')} and {D'(i',j')}; its pixel value at the coordinate position (i',j') is the predicted saliency value of that pixel.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The architecture of the convolutional neural network was built using the Python-based deep learning library pytorch4.0.1. The NJU2000 test set of the real object image database (397 real object images) is used to analyze the saliency detection performance of the method on real scene images. The detection performance of the predicted saliency detection images is evaluated with 3 objective parameters commonly used to assess saliency detection methods: the precision-recall curve (Precision Recall Curve), the receiver operating characteristic curve (ROC) and the mean absolute error (Mean Absolute Error, MAE).
The method is used for predicting each real scene image in the real scene image database NJU2000 test set to obtain a prediction saliency detection image corresponding to each real scene image.
FIG. 4a reflects the precision-recall curve (PR Curve) of the saliency detection performance of the method of the present invention; the closer the curve is to 1, the better.
Fig. 4b reflects the receiver operating characteristic curve (ROC) of the saliency detection performance of the method of the present invention; the closer the curve is to 1, the better.
FIG. 4c reflects the mean absolute error (MAE) of the saliency detection performance of the method of the present invention; a lower MAE represents better detection performance.
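A minimal sketch of the MAE computation in its standard form for saliency evaluation; the arrays are illustrative placeholders for a predicted map and its ground truth.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and its ground
    truth, both given as arrays scaled to [0, 1]; lower is better."""
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())

# Example with a random map standing in for a prediction and a binary ground truth.
pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt))
```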
The figures show that the saliency detection results obtained by the method on real scene images are very good, indicating that it is feasible and effective to use the method to obtain predicted saliency detection images corresponding to real scene images.

Claims (6)

1. The multi-feature cascading RGB-D significance target detection method is characterized by comprising the following steps of:
step 1_1: q original RGB images and corresponding depth images thereof are selected, and a training set is formed by combining the true significance images corresponding to the original RGB images;
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises two input layers, a hidden layer and an output layer, wherein the two input layers are connected to the input end of the hidden layer, and the output end of the hidden layer is connected to the output layer;
step 1_3: each original RGB image and the corresponding depth image in the training set are respectively used as the original input images of the two input layers and are input into a convolutional neural network for training, so that a prediction significance image corresponding to each original RGB image in the training set is obtained; calculating a loss function value between a predicted saliency image corresponding to each original RGB image in the training set and a corresponding real saliency image, wherein the loss function value is obtained by adopting a BCE loss function;
Step 1_4: repeating the step 1_3 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term, and the weight vector and the bias term in the trained convolutional neural network classification training model are replaced;
step 1_5: inputting the RGB image to be predicted and the depth image corresponding to the RGB image to be predicted into a trained convolutional neural network classification training model, and predicting by utilizing an optimal weight vector and an optimal bias term to obtain a predicted saliency image corresponding to the RGB image to be predicted, thereby realizing saliency target detection;
among the two input layers in step 1_2, the 1st input layer is an RGB image input layer and the 2nd input layer is a depth image input layer; the hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four sequentially connected color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers; the four sequentially connected color map neural network blocks correspond respectively to the four sequentially connected modules of ResNet50; the output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch respectively;
the depth feature extraction module comprises four sequentially connected depth map neural network blocks, four depth attention layers, eight depth upsampling layers, four attention convolution layers and four depth convolution layers; the four sequentially connected depth map neural network blocks correspond respectively to the four sequentially connected modules of ResNet50; the output of the first depth map neural network block is connected to the first depth branch and the fifth depth branch respectively, the output of the second depth map neural network block is connected to the second depth branch and the sixth depth branch respectively, the output of the third depth map neural network block is connected to the third depth branch and the seventh depth branch respectively, and the output of the fourth depth map neural network block is connected to the fourth depth branch and the eighth depth branch respectively;
the outputs of the first RGB branch and the second RGB branch are multiplied to form one input of the low-level feature convolution layer, and the outputs of the first depth branch and the second depth branch are multiplied to form the other input of the low-level feature convolution layer; the outputs of the third RGB branch and the fourth RGB branch are multiplied to form one input of the high-level feature convolution layer, and the outputs of the third depth branch and the fourth depth branch are multiplied to form the other input of the high-level feature convolution layer;
the outputs of the low-level feature convolution layer and the high-level feature convolution layer are input into the mixed feature convolution layer;
the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into the detail information processing module; the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module;
the output of the mixed feature convolution layer and the output of the detail information processing module are fused to form one input of the SKNet network model, and the output of the mixed feature convolution layer and the output of the global information processing module are fused to form the other input of the SKNet network model;
the post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected; the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is delivered as the final output through the output layer.
2. A multi-feature cascading RGB-D significance target detection method according to claim 1, characterized in that,
the first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected;
The fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
the first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected;
the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
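A minimal PyTorch sketch of the two kinds of branches described in claim 2, under the interpretation given in claim 4 (attention layer, bilinear upsampling, convolution). The ChannelAttention module below is a simplified stand-in for a full CBAM block, and all channel counts and scale factors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Simplified stand-in for the CBAM attention layer of claims 2 and 4 (channel attention only)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))             # global average pooling -> channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)    # reweight the feature channels

class AttentionBranch(nn.Module):
    """Attention layer -> bilinear upsampling -> attention convolution layer (e.g. the first RGB branch)."""
    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.attention = ChannelAttention(in_channels)
        self.scale = scale
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.attention(x)
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return self.conv(x)

class ConvBranch(nn.Module):
    """Convolution layer -> bilinear upsampling (e.g. the fifth RGB branch)."""
    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.scale = scale

    def forward(self, x):
        x = self.conv(x)
        return F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
```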
3. The multi-feature cascading RGB-D saliency target detection method of claim 1, wherein the detail information processing module comprises a first network module, a first transition convolution layer and a second transition convolution layer which are connected in sequence, and the input of the detail information processing module is fused with the output of the first transition convolution layer and the output of the second transition convolution layer to serve as the output of the detail information processing module;
the global information processing module comprises three processing branches, each of which comprises a global network module and a global convolution layer connected in sequence, and the outputs of the three processing branches are fused to serve as the output of the global information processing module.
4. The multi-feature cascading RGB-D saliency target detection method of claim 2, wherein each color attention layer and each depth attention layer adopts a CBAM module, and each color upsampling layer and each depth upsampling layer performs bilinear-interpolation upsampling of the input features; each attention convolution layer, color convolution layer, depth convolution layer, low-level feature convolution layer, high-level feature convolution layer and mixed feature convolution layer comprises one convolution; each deconvolution layer comprises one deconvolution.
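A minimal sketch of the post-processing module recited in claims 1 and 4: two deconvolution (transposed convolution) layers in sequence producing a single-channel saliency map. Kernel sizes, strides and channel counts here are assumptions for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class PostProcessing(nn.Module):
    """First and second deconvolution layers connected in sequence (claims 1 and 4)."""
    def __init__(self, in_channels=64, mid_channels=32):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(in_channels, mid_channels, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(mid_channels, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        x = torch.relu(self.deconv1(x))          # first deconvolution layer
        return torch.sigmoid(self.deconv2(x))    # second deconvolution layer -> single-channel saliency map
```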
5. A multi-feature cascading RGB-D saliency target detection method according to claim 3, wherein the transition convolution layers in the detail information processing module and the global convolution layers in the global information processing module each comprise one convolution; the first network module in the detail information processing module adopts a Dense block of the DenseNet network, and each global network module in the global information processing module adopts an ASPP module.
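For claim 5, minimal sketches of an ASPP-style global network module and a DenseNet-style block; the dilation rates, growth rate and number of layers are illustrative assumptions rather than the parameters used in the patent.

```python
import torch
import torch.nn as nn

class ASPPModule(nn.Module):
    """ASPP-style global network module: parallel dilated convolutions whose outputs are concatenated."""
    def __init__(self, in_channels, out_channels, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

class DenseBlock(nn.Module):
    """DenseNet-style block: each layer's output is concatenated with everything computed so far."""
    def __init__(self, in_channels, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Conv2d(channels, growth, kernel_size=3, padding=1))
            channels += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, torch.relu(layer(x))], dim=1)
        return x
```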
6. The multi-feature cascading RGB-D saliency target detection method of claim 1, wherein the RGB image input layer receives an RGB input image and the depth image input layer receives the depth image corresponding to that RGB image; the inputs of the RGB feature extraction module and the depth feature extraction module are the outputs of the RGB image input layer and the depth image input layer, respectively.
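Finally, a sketch of the multiplicative fusion recited in claim 1: the outputs of the first/second and third/fourth branches are multiplied element-wise and fed to the low-level and high-level feature convolution layers, whose results go to the mixed feature convolution layer. Combining the RGB and depth products (and the low/high outputs) by channel concatenation is an assumption; the claim only states that they form the inputs of each convolution layer.

```python
import torch

def cascade_fusion(rgb, depth, low_conv, high_conv, mixed_conv):
    """rgb and depth are lists of the outputs of the first to fourth RGB/depth branches,
    assumed to share a common spatial resolution; *_conv are the feature convolution layers."""
    low_rgb, low_depth = rgb[0] * rgb[1], depth[0] * depth[1]      # first x second branch, both modalities
    high_rgb, high_depth = rgb[2] * rgb[3], depth[2] * depth[3]    # third x fourth branch, both modalities
    low = low_conv(torch.cat([low_rgb, low_depth], dim=1))         # low-level feature convolution layer
    high = high_conv(torch.cat([high_rgb, high_depth], dim=1))     # high-level feature convolution layer
    return mixed_conv(torch.cat([low, high], dim=1))               # mixed feature convolution layer
```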
CN201911099871.2A 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method Active CN110929736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099871.2A CN110929736B (en) 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method

Publications (2)

Publication Number Publication Date
CN110929736A CN110929736A (en) 2020-03-27
CN110929736B true CN110929736B (en) 2023-05-26

Family

ID=69852888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099871.2A Active CN110929736B (en) 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method

Country Status (1)

Country Link
CN (1) CN110929736B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461043B (en) * 2020-04-07 2023-04-18 河北工业大学 Video significance detection method based on deep network
CN111666854B (en) * 2020-05-29 2022-08-30 武汉大学 High-resolution SAR image vehicle target detection method fusing statistical significance
CN111768375B (en) * 2020-06-24 2022-07-26 海南大学 Asymmetric GM multi-mode fusion significance detection method and system based on CWAM
CN111985552B (en) * 2020-08-17 2022-07-29 中国民航大学 Method for detecting diseases of thin strip-shaped structure of airport pavement under complex background
CN112330642B (en) * 2020-11-09 2022-11-04 山东师范大学 Pancreas image segmentation method and system based on double-input full convolution network
CN112580694B (en) * 2020-12-01 2024-04-19 中国船舶重工集团公司第七0九研究所 Small sample image target recognition method and system based on joint attention mechanism
CN112507933B (en) * 2020-12-16 2022-09-16 南开大学 Saliency target detection method and system based on centralized information interaction
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112651406B (en) * 2020-12-18 2022-08-09 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method
CN113516022B (en) * 2021-04-23 2023-01-10 黑龙江机智通智能科技有限公司 Fine-grained classification system for cervical cells
CN114723951B (en) * 2022-06-08 2022-11-04 成都信息工程大学 Method for RGB-D image segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437878B2 (en) * 2016-12-28 2019-10-08 Shutterstock, Inc. Identification of a salient portion of an image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409435A (en) * 2018-11-01 2019-03-01 上海大学 A kind of depth perception conspicuousness detection method based on convolutional neural networks
CN109903276A (en) * 2019-02-23 2019-06-18 中国民航大学 Convolutional neural networks RGB-D conspicuousness detection method based on multilayer fusion
CN110263813A (en) * 2019-05-27 2019-09-20 浙江科技学院 A kind of conspicuousness detection method merged based on residual error network and depth information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-scale deep encoder-decoder network for salient object detection; Qinghua Ren et al.; Neurocomputing; 2018-11-17; Vol. 316; full text *
Saliency detection based on cascaded fully convolutional neural networks; Zhang Songlong et al.; Laser & Optoelectronics Progress; 2018-10-29 (No. 07); full text *

Similar Documents

Publication Publication Date Title
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN108510532B (en) Optical and SAR image registration method based on deep convolution GAN
CN108182441B (en) Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN110728192B (en) High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN108154194B (en) Method for extracting high-dimensional features by using tensor-based convolutional network
Wu et al. 3D ShapeNets for 2.5D object recognition and next-best-view prediction
CN113221639B (en) Micro-expression recognition method for representative AU (AU) region extraction based on multi-task learning
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110458178B (en) Multi-mode multi-spliced RGB-D significance target detection method
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
JP7135659B2 (en) SHAPE COMPLEMENTATION DEVICE, SHAPE COMPLEMENTATION LEARNING DEVICE, METHOD, AND PROGRAM
Nguyen et al. Satellite image classification using convolutional learning
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN103646256A (en) Image characteristic sparse reconstruction based image classification method
CN113269224A (en) Scene image classification method, system and storage medium
CN113743521B (en) Target detection method based on multi-scale context awareness
CN114494594A (en) Astronaut operating equipment state identification method based on deep learning
CN111612046B (en) Feature pyramid graph convolution neural network and application thereof in 3D point cloud classification
CN112597956A (en) Multi-person attitude estimation method based on human body anchor point set and perception enhancement network
CN116342961B (en) Time sequence classification deep learning system based on mixed quantum neural network
Jafrasteh et al. Generative adversarial networks as a novel approach for tectonic fault and fracture extraction in high resolution satellite and airborne optical images
EP3588441B1 (en) Imagification of multivariate data sequences
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN105718858A (en) Pedestrian recognition method based on positive-negative generalized max-pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant