CN116310396A - RGB-D significance target detection method based on depth quality weighting


Info

Publication number
CN116310396A
CN116310396A (application CN202310201765.0A)
Authority
CN
China
Prior art keywords
depth
rgb
modal
dataset
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310201765.0A
Other languages
Chinese (zh)
Inventor
夏晨星
杨凤
梁兴柱
崔建华
王列伟
段松松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202310201765.0A priority Critical patent/CN116310396A/en
Publication of CN116310396A publication Critical patent/CN116310396A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and provides a depth quality weighting-based RGB-D significance target detection method, which comprises the following steps: 1) acquiring the RGB-D datasets for training and testing the task and defining the algorithm target of the invention; 2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features; 3) constructing a cross-modal weighted fusion module, which guides the extracted RGB image features and Depth image features to carry out weighted fusion through a weighted guided depth quality assessment mechanism; 4) constructing a bidirectional scale correlation convolution mechanism for multi-scale feature extraction and fusion to enhance the high-level semantic information of the multi-modal features; 5) building a decoder to generate the saliency map P_est; 6) calculating the loss between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT; 7) testing on the test dataset to generate the saliency map P_est and performing performance evaluation with the evaluation indexes. The method can effectively integrate complementary information from images of different modalities and improves the accuracy of salient object prediction in complex scenes.

Description

RGB-D significance target detection method based on depth quality weighting
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to an RGB-D significance target detection method based on depth quality weighting.
Background art:
In the fields of computer vision and image processing, salient object detection (SOD) aims to identify and segment the most attention-grabbing objects or regions in given data (e.g., RGB images, RGB-D images, video) by simulating the human visual attention mechanism, and it has been widely used in various computer vision tasks such as semantic segmentation, image compression and object tracking.
Single-modality RGB salient object detection algorithms face challenging factors such as complex backgrounds and illumination conditions, which make it difficult to locate salient objects in a cluttered background. One way to overcome these challenges is to use a depth map to compensate for the spatial information missing from the RGB image. RGB images contain detail information (e.g., rich texture, color and visual cues), while depth maps provide spatial information, expressing geometry and distance. Therefore, combining RGB images with depth maps for the SOD task (called RGB-D SOD) is a reasonable choice that can handle more complex scenes and meet the requirements of advanced detection.
Although significant progress has been made by existing RGB-D SOD approaches, most of them ignore the problem that low-quality depth maps can adversely affect the RGB-D SOD task. A high-quality depth map has clear boundaries and accurate object localization, which benefits SOD. A low-quality depth map, however, has blurred edges and inaccurate object localization, which may introduce noise into the cross-modal feature fusion and thus degrade SOD performance. Therefore, the quality of the depth map must be taken into account in the RGB-D SOD task.
Considering that low-quality depth maps inevitably affect salient object detection, the invention explores an efficient cross-modal feature fusion method that effectively reduces the influence of low-quality depth maps on saliency detection. In addition, to further exploit the complementary information among multi-scale features, the correlation between high-level features and long-range information is fully utilized, and multi-scale feature fusion is mined to help the saliency detection model predict salient objects more accurately.
Summary of the invention:
Aiming at the above problems, the invention provides an RGB-D significance target detection method based on depth quality weighting, which specifically adopts the following technical scheme:
1. Acquiring the RGB-D datasets for training and testing the task.
The NJUD, NLPR and DUT-RGBD datasets are used as the training set, and the remaining portion of the NJUD dataset, the remaining NLPR dataset, the SIP dataset, the LFSD dataset and the RGBD135 dataset are used as the test sets.
2. Constructing a saliency target detection network for extracting RGB image features and Depth image features by using convolutional neural networks.
2.1) VGG16 is used as the backbone network of the model of the invention to extract the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i denotes the level index and corresponds to the output of each stage of VGG16.
2.2) The VGG16 weights used to construct the backbone network of the invention are initialized with the VGG16 parameter weights pre-trained on the ImageNet dataset.
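For concreteness, a minimal sketch of this two-stream VGG16 backbone is given below, assuming a PyTorch implementation. The split of torchvision's VGG16 into five stages and the class and attribute names (TwoStreamVGG16, STAGE_SLICES) are illustrative choices, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class TwoStreamVGG16(nn.Module):
    """Two VGG16 encoders, one for the RGB image and one for the three-channel depth map."""

    # torchvision's vgg16().features is a 31-layer nn.Sequential; the index ranges below
    # correspond to the five convolutional blocks (conv1_x ... conv5_x, with the pooling
    # layer placed at the start of blocks 2-5), giving one feature map per level i = 1..5.
    STAGE_SLICES = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]

    def __init__(self, pretrained: bool = True):
        super().__init__()
        weights = "IMAGENET1K_V1" if pretrained else None   # torchvision >= 0.13 weights API
        self.rgb_backbone = vgg16(weights=weights).features
        self.depth_backbone = vgg16(weights=weights).features

    def _extract(self, backbone, x):
        feats = []
        for start, end in self.STAGE_SLICES:
            for idx in range(start, end):
                x = backbone[idx](x)
            feats.append(x)                                  # f_i for the current stream
        return feats

    def forward(self, rgb, depth):
        f_r = self._extract(self.rgb_backbone, rgb)          # f_1^r ... f_5^r
        f_d = self._extract(self.depth_backbone, depth)      # f_1^d ... f_5^d
        return f_r, f_d
```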
3. Based on the multi-scale RGB image features f_i^r and the corresponding Depth image features f_i^d extracted in step 2, performing multi-scale cross-modal weighted feature fusion and constructing a cross-modal feature fusion network to generate the multi-modal features.
3.1) The cross-modal feature fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate 5 levels of multi-modal features f_i^rd.
3.2) The input of the CMWF module at the i-th level consists of f_i^r and f_i^d, and the multi-modal feature f_i^rd of the i-th level is generated through a weighted guided depth quality assessment mechanism.
3.3) The CMWF module generates the multi-modal features through the weighted guided depth quality assessment mechanism as follows:
3.3.1) First, the invention constructs a channel-spatial attention feature enhancement module to filter and enhance the features and strengthen their saliency expression capability. Through the channel-spatial attention feature enhancement module, unnecessary noise can be further removed and the common salient objects are emphasized, yielding the enhanced features of each modality, as given by formulas (1)-(3) (reproduced only as images in the original publication), where c ∈ {r, d}, the two attention maps denote the channel attention and the spatial attention at level i, GAP denotes global average pooling, GMP denotes global max pooling, Cat denotes the feature concatenation operation, Conv_k denotes a convolution with kernel size k × k, Sigmoid denotes the sigmoid activation function, and Multi denotes element-wise matrix multiplication.
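Since formulas (1)-(3) are only available as images, the following PyTorch sketch implements the channel-spatial attention enhancement purely from the textual description (GAP and GMP for channel attention, a k × k convolution with a sigmoid for spatial attention, element-wise multiplication for gating); the reduction ratio, the kernel size k and the module name ChannelSpatialEnhance are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSpatialEnhance(nn.Module):
    """Channel-spatial attention enhancement for one modality c in {r, d} (sketch)."""

    def __init__(self, channels, reduction=4, k=7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for the GAP/GMP branches
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)

    def forward(self, f):
        b, c, _, _ = f.shape
        # channel attention: GAP and GMP descriptors fused by the shared MLP, sigmoid gate
        gap = self.mlp(torch.flatten(F.adaptive_avg_pool2d(f, 1), 1))
        gmp = self.mlp(torch.flatten(F.adaptive_max_pool2d(f, 1), 1))
        ca = torch.sigmoid(gap + gmp).view(b, c, 1, 1)
        f_ca = f * ca                                   # Multi: element-wise multiplication
        # spatial attention: Cat of channel-wise mean/max maps, k x k conv, sigmoid gate
        sa_in = torch.cat([f_ca.mean(dim=1, keepdim=True),
                           f_ca.max(dim=1, keepdim=True).values], dim=1)
        sa = torch.sigmoid(self.spatial_conv(sa_in))
        return f_ca * sa                                # enhanced feature of modality c
```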
3.3.2) The difference between the two modalities at the feature level is embodied by computing the difference between the enhanced RGB feature-level attention map and the enhanced depth feature-level attention map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i, as given by formulas (4) and (5) (images in the original publication), where Subtra denotes element-wise matrix subtraction, |·| denotes the average absolute value operation, and H and W are the height and width of the feature f.
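Formulas (4) and (5) are likewise only available as images, so the sketch below follows the wording above: the enhanced RGB and depth features are subtracted element-wise and the result is normalised by the mean absolute value of the enhanced RGB feature. Whether the difference itself is taken in absolute value, and the reduction over all of C, H and W, are assumptions.

```python
import torch


def depth_quality_weight(enh_rgb: torch.Tensor, enh_depth: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Depth-quality weighting coefficient lambda_i for a batch of level-i features."""
    # Subtra: element-wise subtraction of the two enhanced features,
    # averaged over the channels and the H x W spatial extent
    diff = (enh_rgb - enh_depth).abs().mean(dim=(1, 2, 3))
    # normalisation by the mean absolute value of the enhanced RGB feature
    rgb_mag = enh_rgb.abs().mean(dim=(1, 2, 3)) + eps
    lam = (diff / rgb_mag).clamp(0.0, 1.0)        # clamped so it can act as a gate
    return lam.view(-1, 1, 1, 1)                  # broadcastable over (B, C, H, W)
```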
3.3.3) A cross enhancement strategy is further adopted: the original RGB features f_i^r and the original depth features f_i^d are combined with the channel-spatial-attention-enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain the cross-enhanced features, as given by formulas (6) and (7) (images in the original publication).
3.3.4) After the weighting coefficient and the cross-enhanced features are obtained, the cross-modal features are fused by a weighted fusion method that combines the RGB image features and the corresponding Depth image features to obtain the fused feature f_i^rd, as given by formula (8) (image in the original publication), where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
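Because the cross-enhancement and the weighted fusion themselves (formulas (6)-(8)) are only shown as images, the CMWF sketch below assembles the pieces from the text and reuses the ChannelSpatialEnhance module and depth_quality_weight function sketched above; the exact way the cross-enhancement mixes the original and enhanced features, and the way λ_i gates the depth branch before concatenation, are assumptions.

```python
import torch
import torch.nn as nn


class CMWF(nn.Module):
    """Cross-Modal Weighted Fusion at one level i (sketch); reuses
    ChannelSpatialEnhance and depth_quality_weight defined above."""

    def __init__(self, channels):
        super().__init__()
        self.enhance_r = ChannelSpatialEnhance(channels)
        self.enhance_d = ChannelSpatialEnhance(channels)
        self.fuse = nn.Sequential(                    # Cat followed by a 3 x 3 convolution
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_r, f_d):
        enh_r = self.enhance_r(f_r)                   # enhanced RGB feature
        enh_d = self.enhance_d(f_d)                   # enhanced depth feature
        lam = depth_quality_weight(enh_r, enh_d)      # weighting coefficient lambda_i
        # cross-enhancement (assumed form): each original feature is modulated by the
        # enhanced feature of the other modality, with a residual Add
        cross_r = f_r * enh_d + f_r
        cross_d = f_d * enh_r + f_d
        # weighted fusion (assumed form): a large RGB/depth discrepancy (large lambda_i)
        # down-weights the depth branch before the two branches are concatenated and fused
        fused = self.fuse(torch.cat([cross_r, (1.0 - lam) * cross_d], dim=1))
        return fused                                  # multi-modal feature f_i^rd
```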
4) Through the above operations, the 5 levels of multi-modal features f_i^rd are extracted, and the features of the 4th and 5th levels are input to the bidirectional scale correlation convolution module, where depth separable convolution operations enhance the receptive field information and the high-level semantic information of the multi-modal features.
4.1) The multi-modal features of the 4th and 5th levels are processed by depth separable convolution operations to extract multi-scale receptive field information, using depth separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    formula (9)
R_i = DConv_{2i+1}(R_{i-1}) + R, i ∈ {2, 3, 4}    formula (10)
(formula (11) is given only as an image in the original publication)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth separable convolution, and DConv_{2i+1} denotes a depth separable convolution with a (2i+1) × (2i+1) kernel.
4.2) All the multi-scale features are concatenated and a residual connection is added to obtain the high-level features, as given by formula (12) (image in the original publication), where c ∈ {4, 5} and A denotes global average pooling.
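Formulas (9) and (10) are given in the text, while formulas (11) and (12) are only available as images; the sketch below therefore implements the multi-scale depth separable branch exactly as in (9)-(10) and stands in for (11)-(12) with a concatenation, a 1 × 1 projection, a GAP-based gate and a residual connection, all of which are assumptions. The coupling that makes the module bidirectional between levels 4 and 5 is shown only in Figure 3 and is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableConv(nn.Module):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution (DConv_k)."""

    def __init__(self, channels, k):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class MultiScaleDSConvBlock(nn.Module):
    """Multi-scale branch of the scale correlation convolution module (sketch):
    R_1 = DConv_3(R) + R and R_i = DConv_{2i+1}(R_{i-1}) + R for i in {2, 3, 4}."""

    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv(channels, 2 * i + 1) for i in range(1, 5)]  # k = 3, 5, 7, 9
        )
        self.project = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, r):
        outs, x = [], r
        for branch in self.branches:
            x = branch(x) + r                         # each scale keeps a residual to the input
            outs.append(x)
        cat = self.project(torch.cat(outs, dim=1))    # Cat of R_1..R_4 plus 1 x 1 projection
        gate = torch.sigmoid(F.adaptive_avg_pool2d(cat, 1))   # A: global average pooling gate
        return cat * gate + r                         # residual connection to the input feature
```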
4.3) The low-level features generated in the preceding steps (the multi-modal features of the first three levels) and the high-level features (the outputs of the 4th and 5th levels) are input into the decoder network to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est, as given by formula (13) (image in the original publication).
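The decoder itself is specified only through Figure 4, so the sketch below uses a generic top-down upsample-and-fuse decoder over the five level features (levels 1-3: multi-modal features; levels 4-5: scale-correlated outputs); the lateral 1 × 1 convolutions, the common 64-channel width and the channel sizes implied by a VGG16 backbone are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Decoder(nn.Module):
    """Generic top-down decoder sketch producing the saliency map P_est."""

    def __init__(self, channels=(64, 128, 256, 512, 512), width=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in channels])
        self.smooth = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1) for _ in channels])
        self.predict = nn.Conv2d(width, 1, 1)

    def forward(self, feats):
        # feats: [level-1, ..., level-5] features, ordered from low level to high level
        x = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            lat = self.lateral[i](feats[i])
            x = F.interpolate(x, size=lat.shape[-2:], mode="bilinear", align_corners=False)
            x = self.smooth[i](x + lat)
        # sigmoid activation gives the predicted saliency map P_est
        # (upsampling to the input resolution can be added here if needed)
        return torch.sigmoid(self.predict(x))
```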
5) The loss function is calculated between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, and the structure and parameter weights of the RGB-D significance target detection algorithm are finally determined.
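A minimal training step matching this description is sketched below; binary cross-entropy is assumed because the patent does not name the exact loss, and the learning rate is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F


def train_one_batch(model, optimizer, rgb, depth, gt):
    """One Adam/back-propagation update comparing P_est with the annotated map P_GT."""
    model.train()
    p_est = model(rgb, depth)                                 # predicted saliency map P_est
    loss = F.binary_cross_entropy(p_est, gt)                  # assumed loss; not named in the patent
    optimizer.zero_grad()
    loss.backward()                                           # back-propagation
    optimizer.step()                                          # Adam parameter update
    return loss.item()


# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is a placeholder
```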
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate the saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
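Of the four indexes, MAE has a closed form that can be stated directly; S-measure, F-measure and E-measure are normally computed with published evaluation toolboxes and are not re-implemented here.

```python
import torch


def mae_metric(p_est: torch.Tensor, p_gt: torch.Tensor) -> float:
    """Mean Absolute Error between a predicted saliency map and the ground truth,
    both expected to lie in [0, 1] and to have the same spatial size."""
    return (p_est - p_gt).abs().mean().item()
```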
The multi-modal salient object detection method based on the deep convolutional neural network makes use of the rich spatial structure information in the Depth image and performs cross-modal feature fusion with the features extracted from the RGB image through a weighted guided depth quality assessment mechanism, so that the method can meet the requirements of salient object detection in different scenes and, in particular, shows a certain robustness in challenging scenes (complex background, low contrast, transparent objects, etc.). Compared with previous RGB-D significance target detection methods, the beneficial effects of the invention are as follows:
First, using deep learning technology, the relation between an RGB-D image pair and the salient objects of the image is established through an encoder-decoder structure, and the saliency prediction is obtained by extracting and fusing cross-modal features. Second, the complementary information that the Depth image features provide to the RGB image features is effectively modulated by weighted fusion; the depth distribution information is used to guide the cross-modal feature fusion and eliminate the interference of background information in the RGB image, laying a foundation for the prediction of the salient object in the next stage. Finally, multi-scale multi-modal feature fusion is carried out through the constructed decoder to predict the final saliency map.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of the bidirectional scale correlation convolution module
FIG. 4 is a schematic diagram of the decoder (Decoder)
FIG. 5 is a schematic diagram of model training and testing
FIG. 6 is a graph comparing results of the present invention with other RGB-D significance target detection methods
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings; the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of the invention, fall within the scope of the present invention.
Referring to fig. 1, a depth quality weighting-based RGB-D saliency target detection method mainly includes the following steps:
1. The RGB-D datasets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set used for training and testing the algorithm are determined. The NJUD, NLPR and DUT-RGBD datasets are used as the training set, and the remaining datasets are used as the test sets, namely the remaining portions of the NJUD and NLPR datasets, the SIP dataset, the LFSD dataset and the RGBD135 dataset.
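A minimal loader for such RGB-D image / depth map / ground-truth triplets is sketched below; the directory layout (RGB/, depth/, GT/ with matching file names), the 256 × 256 resize and the class name RGBDSODDataset are hypothetical, since NJUD, NLPR, DUT-RGBD, SIP, LFSD and RGBD135 are distributed in differing formats.

```python
import os

from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T


class RGBDSODDataset(Dataset):
    """Loads RGB / depth / ground-truth triplets from a hypothetical RGB/, depth/, GT/ layout."""

    def __init__(self, root, size=256):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "RGB")))
        self.tf = T.Compose([T.Resize((size, size)), T.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        stem = os.path.splitext(name)[0]
        rgb = self.tf(Image.open(os.path.join(self.root, "RGB", name)).convert("RGB"))
        depth = self.tf(Image.open(os.path.join(self.root, "depth", stem + ".png")).convert("L"))
        gt = self.tf(Image.open(os.path.join(self.root, "GT", stem + ".png")).convert("L"))
        return rgb, depth, gt
```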
2. Constructing a saliency target detection model network for extracting RGB image features and Depth image features by using a convolutional neural network, wherein the saliency target detection model network comprises an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features:
2.1. The three-channel RGB image is input into the RGB encoder to generate 5 levels of RGB image features, namely f_i^r (i = 1, ..., 5).
2.2. The three-channel Depth image is input into the Depth encoder to generate 5 levels of Depth image features, namely f_i^d (i = 1, ..., 5).
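The depth maps in these datasets are single-channel, while the Depth encoder above takes a three-channel input; replicating the depth channel, as sketched below, is one common way to form that input and is an assumption rather than a detail stated in the patent.

```python
import torch


def depth_to_three_channels(depth: torch.Tensor) -> torch.Tensor:
    """Replicates a single-channel depth map along the channel axis (assumed preprocessing)."""
    if depth.dim() == 3:                                  # (1, H, W) -> (3, H, W)
        return depth.expand(3, -1, -1).contiguous()
    return depth.expand(-1, 3, -1, -1).contiguous()       # (B, 1, H, W) -> (B, 3, H, W)
```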
3. Referring to fig. 2, the 5 levels of RGB image features f_i^r and Depth image features f_i^d generated in step 2 are fused by the cross-modal weighted fusion module into 5 levels of multi-modal features f_i^rd. The main steps are as follows:
3.1. The cross-modal feature fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate 5 levels of multi-modal features f_i^rd.
3.2. The input data of the CMWF module at the i-th level are f_i^r and f_i^d, and the multi-modal feature f_i^rd of the i-th level is output through a weighted guided depth quality assessment mechanism.
The specific process by which the CMWF module generates the multi-modal features through the weighted guided depth quality assessment mechanism is as follows:
3.3.1. First, the invention constructs a channel-spatial attention feature enhancement module to filter and enhance the features and strengthen their saliency expression capability. Through the channel-spatial attention feature enhancement module, unnecessary noise can be further removed and the common salient objects are emphasized, yielding the enhanced features of each modality, as given by formulas (1)-(3), where c ∈ {r, d}, the two attention maps denote the channel attention and the spatial attention at level i, GAP denotes global average pooling, GMP denotes global max pooling, Cat denotes the feature concatenation operation, Conv_k denotes a convolution with kernel size k × k, Sigmoid denotes the sigmoid activation function, and Multi denotes element-wise matrix multiplication.
3.3.2. The difference between the two modalities at the feature level is represented by computing the difference between the enhanced RGB feature-level attention map and the enhanced depth feature-level attention map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i, as given by formulas (4) and (5), where Subtra denotes element-wise matrix subtraction, |·| denotes the average absolute value operation, and H and W are the height and width of the feature f.
3.3.3. A cross enhancement strategy is further adopted: the original RGB features f_i^r and the original depth features f_i^d are combined with the channel-spatial-attention-enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain the cross-enhanced features, as given by formulas (6) and (7).
3.3.4. After the weighting coefficient and the cross-enhanced features are obtained, the cross-modal features are fused by a weighted fusion method that combines the RGB image features and the corresponding Depth image features to obtain the fused feature f_i^rd, as given by formula (8), where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
4. Referring to fig. 3, the receptive field information and the high-level semantic information of the multi-modal features are enhanced by the bidirectional scale correlation convolution module:
4.1 The multi-modal features of the 4th and 5th levels are processed by depth separable convolution operations to extract multi-scale receptive field information, using depth separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    formula (9)
R_i = DConv_{2i+1}(R_{i-1}) + R, i ∈ {2, 3, 4}    formula (10)
(formula (11) is given as an image in the original publication)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth separable convolution, and DConv_{2i+1} denotes a depth separable convolution with a (2i+1) × (2i+1) kernel.
4.2 All the multi-scale features are concatenated and a residual connection is added to stabilize the optimization and obtain the high-level features, as given by formula (12), where c ∈ {4, 5} and A denotes global average pooling.
5. Referring to fig. 4, the acquired low-level features (the multi-modal features of the first three levels) and high-level features (the outputs of the 4th and 5th levels) are input into the decoder network and activated by a sigmoid function to obtain the predicted saliency map P_est, as given by formula (13).
6) The loss function is calculated between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, and the structure and parameter weights of the RGB-D saliency detection algorithm are finally determined.
7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate the saliency maps P_test, which are evaluated using the MAE, S-measure, F-measure and E-measure evaluation indexes.
It will be appreciated by persons skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, and that although the invention has been described in detail with reference to the foregoing embodiment, it will be apparent to those skilled in the art that modifications may be made to the technical solution described in the foregoing embodiments or equivalents may be substituted for part of the technical features thereof. Any modifications, equivalent substitutions, etc. within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (5)

1. The RGB-D significance target detection method based on depth quality weighting is characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing an algorithm;
2) Constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features;
3) Establishing a cross-modal feature fusion network, and guiding RGB image features and Depth image features to carry out cross-weighted fusion through a weighted guided Depth quality assessment mechanism;
4) Based on the multi-modal characteristics fused by the cross-modal characteristics, a bidirectional scale correlation convolution fusion mechanism is constructed to enhance the high-level semantic information of the multi-modal characteristics;
5) Establishing a decoder, and obtaining a final predicted saliency map through an activation function;
6) Calculating a loss function between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT, gradually updating the parameter weights of the model provided by the invention through Adam and a back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm;
7) On the basis of the model structure and parameter weights determined in step 6, testing the RGB-D image pairs on the test set to generate the saliency map P_test, and performing performance evaluation using the evaluation indexes.
2. The depth quality weighting-based RGB-D saliency target detection method of claim 1, wherein: the specific method of the step 1) is as follows:
the NJUD dataset, the NLPR dataset, and the DUT-RGBD dataset are used as training sets, and the remaining portion of the NJUD dataset, the remaining NLPR dataset, the SIP dataset, the LFSD dataset, and the RGBD135 dataset are used as test sets.
3. The depth quality weighting-based RGB-D saliency target detection method of claim 1, wherein: the specific method of the step 2) is as follows:
3.1) VGG16 is used as the backbone network of the model of the invention to extract the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i represents the level index and corresponds to the output of each stage of VGG16;
3.2) The VGG16 weights used to construct the backbone network of the invention are initialized with the VGG16 parameter weights pre-trained on the ImageNet dataset.
4. The depth quality weighting-based RGB-D saliency target detection method of claim 1, wherein: the specific method of the step 3) is as follows:
4.1) The cross-modal weighted fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules and generates 5 levels of multi-modal features f_i^rd;
4.2) The input data of the CMWF module at the i-th level consist of f_i^r and f_i^d, and the multi-modal feature f_i^rd of the i-th level is generated through a weighted guided depth quality assessment mechanism.
5. The depth quality weighting-based RGB-D saliency target detection method of claim 1, wherein: the specific method of the step 4) is as follows:
5.1) The multi-modal features of the 4th and 5th levels are processed by depth separable convolution operations to extract multi-scale receptive field information, using depth separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    formula (9)
R_i = DConv_{2i+1}(R_{i-1}) + R, i ∈ {2, 3, 4}    formula (10)
(formula (11) is given as an image in the original publication)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth separable convolution, and DConv_{2i+1} denotes a depth separable convolution with a (2i+1) × (2i+1) kernel;
5.2) All the multi-scale features are concatenated and a residual connection is added to stabilize the optimization and obtain the high-level features;
6) The low-level multi-modal features of the first 3 levels obtained in step 4 and the 2 levels of high-level multi-scale complementary features obtained in step 5 are input into the decoder to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est, as given by formula (13);
7) The loss function is calculated between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the model provided by the invention are gradually updated through Adam and a back-propagation algorithm, and the structure and parameter weights of the RGB-D saliency detection algorithm are finally determined.
CN202310201765.0A 2023-02-28 2023-02-28 RGB-D significance target detection method based on depth quality weighting Pending CN116310396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310201765.0A CN116310396A (en) 2023-02-28 2023-02-28 RGB-D significance target detection method based on depth quality weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310201765.0A CN116310396A (en) 2023-02-28 2023-02-28 RGB-D significance target detection method based on depth quality weighting

Publications (1)

Publication Number Publication Date
CN116310396A true CN116310396A (en) 2023-06-23

Family

ID=86831836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310201765.0A Pending CN116310396A (en) 2023-02-28 2023-02-28 RGB-D significance target detection method based on depth quality weighting

Country Status (1)

Country Link
CN (1) CN116310396A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036891A (en) * 2023-08-22 2023-11-10 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117036891B (en) * 2023-08-22 2024-03-29 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system

Similar Documents

Publication Publication Date Title
CN111476292B (en) Small sample element learning training method for medical image classification processing artificial intelligence
Huang et al. Indoor depth completion with boundary consistency and self-attention
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN113554125A (en) Object detection apparatus, method and storage medium combining global and local features
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113379707A (en) RGB-D significance detection method based on dynamic filtering decoupling convolution network
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114663371A (en) Image salient target detection method based on modal unique and common feature extraction
CN116310396A (en) RGB-D significance target detection method based on depth quality weighting
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Wang et al. INSPIRATION: A reinforcement learning-based human visual perception-driven image enhancement paradigm for underwater scenes
Liu et al. Video decolorization based on the CNN and LSTM neural network
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
CN115830420A (en) RGB-D significance target detection method based on boundary deformable convolution guidance
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
CN114693953A (en) RGB-D significance target detection method based on cross-modal bidirectional complementary network
CN114463346A (en) Complex environment rapid tongue segmentation device based on mobile terminal
Zhuge et al. Automatic colorization using fully convolutional networks
CN113096176A (en) Semantic segmentation assisted binocular vision unsupervised depth estimation method
CN116503618B (en) Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation
CN112597847B (en) Face pose estimation method and device, electronic equipment and storage medium
Moorthy et al. SEM and TEM images’ dehazing using multiscale progressive feature fusion techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination