CN116310396A - RGB-D salient object detection method based on depth quality weighting - Google Patents
RGB-D salient object detection method based on depth quality weighting
- Publication number
- CN116310396A CN116310396A CN202310201765.0A CN202310201765A CN116310396A CN 116310396 A CN116310396 A CN 116310396A CN 202310201765 A CN202310201765 A CN 202310201765A CN 116310396 A CN116310396 A CN 116310396A
- Authority
- CN
- China
- Prior art keywords
- depth
- rgb
- modal
- dataset
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 28
- 230000004927 fusion Effects 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 23
- 230000007246 mechanism Effects 0.000 claims abstract description 11
- 238000012549 training Methods 0.000 claims abstract description 11
- 238000000034 method Methods 0.000 claims abstract description 9
- 238000011156 evaluation Methods 0.000 claims abstract description 8
- 238000001303 quality assessment method Methods 0.000 claims abstract description 8
- 230000011218 segmentation Effects 0.000 claims abstract description 6
- 230000000295 complement effect Effects 0.000 claims abstract description 5
- 230000002457 bidirectional effect Effects 0.000 claims abstract 2
- 230000003213 activating effect Effects 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 3
- 230000001419 dependent effect Effects 0.000 abstract description 3
- 238000000605 extraction Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 6
- 239000011159 matrix material Substances 0.000 description 6
- 230000008447 perception Effects 0.000 description 6
- 238000011176 pooling Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 3
- 238000007500 overflow downdraw method Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000002411 adverse Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of computer vision and provides an RGB-D salient object detection method based on depth quality weighting, comprising the following steps: 1) acquiring RGB-D datasets for training and testing the task and defining the algorithm target of the invention; 2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features; 3) constructing a cross-modal weighted fusion module that guides the weighted fusion of the extracted RGB and Depth image features through a weight-guided depth quality assessment mechanism; 4) constructing a bidirectional scale-dependent convolution mechanism for multi-scale feature extraction and fusion, enhancing the high-level semantic information of the multi-modal features; 5) building a decoder to generate the saliency map P_est; 6) computing the loss between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT; 7) testing on the test datasets to generate saliency maps P_est and performing performance evaluation with the evaluation indices. The method effectively integrates complementary information from images of the two modalities and improves the accuracy of salient object prediction in complex scenes.
Description
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to an RGB-D salient object detection method based on depth quality weighting.
Background art:
In the field of computer vision and image processing, Salient Object Detection (SOD) aims to identify and segment the most visually attention-grabbing objects or regions in given data (e.g., RGB images, RGB-D images, video, etc.) by simulating the human visual attention mechanism, and has been widely used in various computer vision tasks such as semantic segmentation, image compression, and object tracking.
Because single-modality RGB salient object detection algorithms face challenging factors such as complex backgrounds and difficult illumination conditions, it is hard for them to locate salient objects against cluttered backgrounds. One way to overcome these challenges is to use a depth map to compensate for the spatial information missing from the RGB image. RGB images contain detail information (e.g., rich texture, color, and visual cues), while depth maps provide spatial information that expresses geometry and distance. Combining RGB images with depth maps for the SOD task (termed RGB-D SOD) is therefore a reasonable choice: it can handle more complex scenes and meet the requirements of advanced detection.
Although existing RGB-D SOD approaches have made significant progress, most ignore the problem that low-quality depth maps can adversely affect the RGB-D SOD task. A high-quality depth map has clear boundaries and accurate object localization, which benefits SOD. A low-quality depth map, however, not only blurs object edges but also localizes objects inaccurately, which may introduce noise into cross-modal feature fusion and thus degrade SOD performance. The quality of the depth map must therefore be taken into account in the RGB-D SOD task.
Considering that low-quality depth maps inevitably affect salient object detection, the invention explores an efficient cross-modal feature fusion method that effectively reduces their influence on saliency detection. In addition, to further exploit the complementary information among multi-scale features, the correlation between high-level features and long-range information is fully utilized, and multi-scale feature fusion is mined to help the saliency detection model predict salient objects more accurately.
Summary of the invention:
Aiming at the above problems, the invention provides an RGB-D salient object detection method based on depth quality weighting, which specifically adopts the following technical scheme:
1. RGB-D datasets for training and testing the task are acquired.
Portions of the NJUD, NLPR, and DUT-RGBD datasets are used as the training set; the remaining portions of the NJUD and NLPR datasets, together with the SIP, LFSD, and RGBD135 datasets, are used as the test sets.
2. A salient object detection network for extracting RGB image features and Depth image features is constructed using a convolutional neural network.
2.1 VGG16 is used as the backbone network of the model for extracting the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i ∈ {1, ..., 5} denotes the level, corresponding to each stage output of VGG16.
2.2 The VGG16 weights used to construct the backbone network of the invention are initialized with VGG16 parameter weights pre-trained on the ImageNet dataset.
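For illustration, a minimal PyTorch sketch of the two-stream VGG16 encoder described in steps 2.1-2.2 follows. The stage boundaries, the replication of the single-channel depth map to three channels, and all variable names are assumptions of this sketch, not details fixed by the invention:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGGEncoder(nn.Module):
    """Returns the 5 levels of features f_1..f_5 (one per VGG16 conv stage)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # Split VGG16 at its pooling layers into 5 convolutional stages.
        self.stages = nn.ModuleList([
            feats[0:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30],
        ])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return outs  # channel widths: 64, 128, 256, 512, 512

rgb_encoder, depth_encoder = VGGEncoder(), VGGEncoder()
rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 3, 256, 256)  # depth map repeated to 3 channels
f_r, f_d = rgb_encoder(rgb), depth_encoder(depth)  # f_i^r and f_i^d
```

The two encoders share the architecture but not the weights, so each modality learns its own representation.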
3. Based on the multi-scale RGB image features f_i^r extracted in step 2 and the corresponding Depth image features f_i^d, multi-scale cross-modal weighted feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1 The cross-modal feature fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate the 5 levels of multi-modal features f_i^m.
3.2 The input of the CMWF module at level i consists of f_i^r and f_i^d; the module generates the multi-modal feature f_i^m of level i through the weight-guided depth quality assessment mechanism.
3.3 The CMWF module generates the multi-modal feature through the weight-guided depth quality assessment mechanism as follows:
3.3.1 First, the invention constructs a channel-spatial attention feature enhancement module that filters and enhances the features to strengthen their saliency expression capability. Through this module, unnecessary noise is further removed and common salient objects are emphasized, yielding the enhanced features f_i^{r'} and f_i^{d'}. The module applies channel attention CA_i^c followed by spatial attention SA_i^c to f_i^c, i.e., f_i^{c'} = Multi(SA_i^c, Multi(CA_i^c, f_i^c)),
where c ∈ {r, d}, CA_i^c and SA_i^c denote the channel attention and spatial attention at level i and are built from GAP (global average pooling), GMP (global max pooling), Cat (the feature concatenation operation), Conv_k (a convolution with kernel size k × k), and the Sigmoid activation function; Multi denotes element-wise matrix multiplication.
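A hedged PyTorch sketch of such a channel-spatial attention enhancement module is given below. Since the exact attention formulas are rendered as images in the original, the CBAM-style arrangement of the named operators (GAP, GMP, Cat, Conv_k, Sigmoid, Multi) and the spatial kernel size k = 7 are assumptions:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Filters a feature map with channel attention, then spatial attention."""
    def __init__(self, channels, k=7):
        super().__init__()
        self.channel_fc = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)

    def forward(self, f):
        # Channel attention CA from GAP/GMP descriptors (Cat, Conv_1, Sigmoid).
        gap = f.mean(dim=(2, 3), keepdim=True)
        gmp = f.amax(dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.channel_fc(torch.cat([gap, gmp], dim=1)))
        f = f * ca  # Multi: element-wise multiplication
        # Spatial attention SA from per-pixel mean/max over channels.
        avg = f.mean(dim=1, keepdim=True)
        mx = f.amax(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))
        return f * sa  # enhanced feature f_i^{c'}
```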
3.3.2 The difference between the two modalities at the feature level is embodied by computing the difference between the enhanced RGB attention feature map and the enhanced depth attention feature map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i:
λ_i = ||Subtra(f_i^{r'}, f_i^{d'})|| / ||f_i^{r'}||
where Subtra denotes element-wise matrix subtraction, || · || denotes the mean-absolute operation (the average of absolute values over the H × W spatial positions), and H and W are the height and width of the feature f.
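Read this way, λ_i can be computed per sample as below; a minimal sketch assuming batched PyTorch tensors, with an epsilon added by this sketch to avoid division by zero:

```python
import torch

def depth_quality_weight(f_r_enh: torch.Tensor, f_d_enh: torch.Tensor,
                         eps: float = 1e-8) -> torch.Tensor:
    """lambda_i: mean |f_r' - f_d'| normalized by mean |f_r'| (step 3.3.2)."""
    diff = (f_r_enh - f_d_enh).abs().mean(dim=(1, 2, 3), keepdim=True)
    norm = f_r_enh.abs().mean(dim=(1, 2, 3), keepdim=True)
    return diff / (norm + eps)  # shape (B, 1, 1, 1), broadcastable
```

A large λ_i indicates a large discrepancy between the enhanced RGB and depth features, which the fusion step below uses to weight the depth branch.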
3.3.3 A cross-enhancement strategy is further adopted: the original RGB feature f_i^r and the original depth feature f_i^d are enhanced with the channel-spatial-attention-enhanced features of the opposite modality, f_i^{d'} and f_i^{r'} respectively, yielding the cross-enhanced features f_i^{rd} and f_i^{dr}.
3.3.4 After the weighting coefficient and the cross-enhanced features are obtained, the weighted fusion method fuses the cross-enhanced RGB feature f_i^{rd} with the corresponding cross-enhanced Depth feature f_i^{dr}, weighted by λ_i, to obtain the fused feature f_i^m, where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature resides, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
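Combining the pieces, a hedged sketch of one CMWF level follows, reusing the ChannelSpatialAttention and depth_quality_weight sketches above. The exact fusion formula is rendered as an image in the original, so this particular arrangement of the cross enhancement (Add), the λ_i-weighting, and the concatenation (Cat) is an assumption:

```python
import torch
import torch.nn as nn

class CMWF(nn.Module):
    """One level of cross-modal weighted fusion (steps 3.3.1-3.3.4)."""
    def __init__(self, channels):
        super().__init__()
        self.enhance_r = ChannelSpatialAttention(channels)
        self.enhance_d = ChannelSpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_r, f_d):
        f_r_enh, f_d_enh = self.enhance_r(f_r), self.enhance_d(f_d)
        lam = depth_quality_weight(f_r_enh, f_d_enh)  # lambda_i
        # Cross enhancement: each original feature is boosted by the
        # enhanced feature of the opposite modality (Add).
        f_rd = f_r + f_d_enh
        f_dr = f_d + f_r_enh
        # Weighted fusion: the depth branch is scaled by lambda_i, then both
        # branches are concatenated (Cat) and projected back to `channels`.
        return self.fuse(torch.cat([f_rd, lam * f_dr], dim=1))  # f_i^m
```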
4) Through the above operations, the 5 levels of multi-modal features f_i^m are extracted; the features of the 4th and 5th levels are input to the bidirectional scale-dependent convolution module, and the receptive-field information and high-level semantic information of the multi-modal features are enhanced through depth-separable convolution operations.
4.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
4.2 All the multi-scale features are concatenated and a residual connection is added to obtain the high-level feature f_c^h, where c ∈ {4, 5} and A denotes global average pooling.
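A hedged sketch of this module is shown below, following Eqs. (9)-(10): a cascade of depth-separable convolutions with kernel sizes 3, 5, 7, and 9, each with a residual connection back to the input R, followed by concatenation and a residual connection (step 4.2). The bidirectional pathway of FIG. 3 and the global-average-pooling term A are simplified away in this sketch:

```python
import torch
import torch.nn as nn

def dconv(channels, k):
    # Depth-separable convolution: depthwise k x k, then pointwise 1 x 1.
    return nn.Sequential(
        nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
        nn.Conv2d(channels, channels, 1),
    )

class ScaleDependentConv(nn.Module):
    """Multi-scale receptive-field enhancement per Eqs. (9)-(10)."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(dconv(channels, k) for k in (3, 5, 7, 9))
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, r):
        outs, x = [], r
        for branch in self.branches:
            x = branch(x) + r        # R_i = DConv_{2i+1}(R_{i-1}) + R
            outs.append(x)
        out = self.fuse(torch.cat(outs, dim=1))  # concatenate all scales
        return out + r               # residual connection of step 4.2
```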
4.3 The low-level features f_1^m, f_2^m, and f_3^m generated in the above steps and the high-level features f_4^h and f_5^h are input into the decoder network to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est.
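A hedged decoder sketch follows: a simple top-down pathway that projects each feature to a common width, upsamples, and merges, ending in a sigmoid-activated one-channel prediction. The real decoder of FIG. 4 may differ in detail; the channel widths assume the VGG16 encoder sketched above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512, 512)):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in channels)
        self.smooth = nn.Conv2d(64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, feats):  # [f1m, f2m, f3m, f4h, f5h], fine to coarse
        x = self.laterals[-1](feats[-1])
        for lat, f in zip(reversed(self.laterals[:-1]), reversed(feats[:-1])):
            x = F.interpolate(x, size=f.shape[2:], mode="bilinear",
                              align_corners=False)
            x = self.smooth(x + lat(f))  # merge coarse context with skip
        return torch.sigmoid(self.head(x))  # P_est in [0, 1]
```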
5) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
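A minimal training-loop sketch for step 5 is given below. The patent does not state the loss function; binary cross-entropy, a common choice for saliency maps, is assumed here, as are the learning rate and the names model and train_loader:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # model: the network above
for rgb, depth, gt in train_loader:        # gt: P_GT, shape (B, 1, H, W)
    p_est = model(rgb, depth)              # predicted saliency map P_est
    loss = F.binary_cross_entropy(p_est, gt)
    optimizer.zero_grad()
    loss.backward()                        # back-propagation
    optimizer.step()                       # Adam parameter update
```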
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps P_est, and evaluation is performed using the MAE, S-measure, F-measure, and E-measure indices.
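Of the four indices, MAE has the simplest definition and is sketched below; the S-, F-, and E-measure have more involved definitions and are not reproduced here:

```python
import torch

def mae(p_est: torch.Tensor, p_gt: torch.Tensor) -> float:
    """Mean absolute error between predicted and ground-truth saliency maps."""
    return (p_est - p_gt).abs().mean().item()
```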
The multi-modal salient object detection method based on the deep convolutional neural network exploits the rich spatial structure information of the Depth image and fuses the Depth features with the features extracted from the RGB image through the weight-guided depth quality assessment mechanism, so the method can meet the requirements of salient object detection in different scenes and, in particular, retains a degree of robustness in challenging scenes (complex backgrounds, low contrast, transparent objects, etc.). Compared with previous RGB-D salient object detection methods, the beneficial effects are as follows:
First, using deep learning techniques, the relationship between an RGB-D image pair and the salient objects of the image is established through an encoder-decoder structure, and the saliency prediction is obtained by extracting and fusing cross-modal features. Second, the complementary information that the Depth image features provide to the RGB image features is effectively modulated by weighted fusion; the depth distribution information guides cross-modal feature fusion and suppresses interference from background information in the RGB image, laying the foundation for the prediction of salient objects in the next stage. Finally, multi-scale multi-modal feature fusion is performed by the constructed decoder to predict the final saliency map.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a two-way scale dependent convolution module
FIG. 4 is a schematic diagram of a Decoder (Decoder)
FIG. 5 is a schematic diagram of model training and testing
FIG. 6 is a graph comparing results of the present invention with other RGB-D salient object detection methods
Detailed Description
The following describes the embodiments of the invention clearly and completely with reference to the accompanying drawings, which show some but not all examples of the invention. All other examples obtained by a person of ordinary skill in the art without inventive effort, based on the examples given here, fall within the scope of the invention.
Referring to FIG. 1, the RGB-D salient object detection method based on depth quality weighting mainly includes the following steps:
1. The RGB-D datasets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set used to train and test the algorithm are determined. The NJUD, NLPR, and DUT-RGBD datasets are used as the training set, and the remaining datasets are used as the test set, including the remaining portions of the NJUD and NLPR datasets, the SIP dataset, the LFSD dataset, and the RGBD135 dataset.
2. A salient object detection network for extracting RGB image features and Depth image features is constructed using a convolutional neural network, comprising an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features:
2.1. The three-channel RGB image is input into the RGB encoder to generate 5 levels of RGB image features, namely f_1^r, ..., f_5^r.
2.2. The three-channel Depth image is input into the Depth encoder to generate 5 levels of Depth image features, namely f_1^d, ..., f_5^d.
3. Referring to FIG. 2, the 5 levels of RGB image features f_i^r and Depth image features f_i^d generated in step 2 are weighted-fused by the cross-modal fusion module to obtain the 5 levels of multi-modal features f_i^m. The main steps are as follows:
3.1. The cross-modal feature fusion network consists of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate the 5 levels of multi-modal features f_i^m.
3.2. The input data of the CMWF module at level i are f_i^r and f_i^d; the multi-modal feature f_i^m of level i is output through the weight-guided depth quality assessment mechanism.
3.3. The specific process by which the CMWF module generates the multi-modal feature through the weight-guided depth quality assessment mechanism is as follows:
3.3.1. First, the invention constructs a channel-spatial attention feature enhancement module that filters and enhances the features to strengthen their saliency expression capability. Through this module, unnecessary noise is further removed and common salient objects are emphasized, yielding the enhanced features f_i^{r'} and f_i^{d'}. The module applies channel attention CA_i^c followed by spatial attention SA_i^c to f_i^c, i.e., f_i^{c'} = Multi(SA_i^c, Multi(CA_i^c, f_i^c)),
where c ∈ {r, d}, CA_i^c and SA_i^c denote the channel attention and spatial attention at level i and are built from GAP (global average pooling), GMP (global max pooling), Cat (the feature concatenation operation), Conv_k (a convolution with kernel size k × k), and the Sigmoid activation function; Multi denotes element-wise matrix multiplication.
3.3.2. The difference between the two modalities at the feature level is represented by computing the difference between the enhanced RGB attention feature map and the enhanced depth attention feature map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i:
λ_i = ||Subtra(f_i^{r'}, f_i^{d'})|| / ||f_i^{r'}||
where Subtra denotes element-wise matrix subtraction, || · || denotes the mean-absolute operation over the H × W spatial positions, and H and W are the height and width of the feature f.
3.3.3. A cross-enhancement strategy is further adopted: the original RGB feature f_i^r and the original depth feature f_i^d are enhanced with the channel-spatial-attention-enhanced features of the opposite modality, f_i^{d'} and f_i^{r'} respectively, yielding the cross-enhanced features f_i^{rd} and f_i^{dr}.
3.3.4. After the weighting coefficient and the cross-enhanced features are obtained, the weighted fusion method fuses the cross-enhanced RGB feature f_i^{rd} with the corresponding cross-enhanced Depth feature f_i^{dr}, weighted by λ_i, to obtain the fused feature f_i^m, where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature resides, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
4. Referring to FIG. 3, the receptive-field information and high-level semantic information of the multi-modal features are enhanced by the bidirectional scale-dependent convolution module:
4.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
4.2 All the multi-scale features are concatenated and a residual connection is added for stability and optimization, yielding the high-level feature f_c^h, where c ∈ {4, 5} and A denotes global average pooling.
5. Referring to FIG. 4, the acquired low-level features f_1^m, f_2^m, and f_3^m and the high-level features f_4^h and f_5^h are input into the decoder network and activated by a sigmoid function to obtain the predicted saliency map P_est.
6) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_est, and evaluation is performed using the MAE, S-measure, F-measure, and E-measure indices.
It will be appreciated by persons skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described above or substitute equivalents for some of their technical features. Any modifications, equivalent substitutions, and the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (5)
1. An RGB-D salient object detection method based on depth quality weighting, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing an algorithm;
2) Constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features;
3) Establishing a cross-modal feature fusion network, and guiding the RGB image features and Depth image features to undergo cross-weighted fusion through a weight-guided depth quality assessment mechanism;
4) Based on the multi-modal features obtained by the cross-modal feature fusion, constructing a bidirectional scale-dependent convolution fusion mechanism to enhance the high-level semantic information of the multi-modal features;
5) Establishing a decoder, and obtaining a final predicted saliency map through an activation function;
6) Computing the loss function between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT, gradually updating the parameter weights of the proposed model through Adam and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D salient object detection algorithm;
7) On the basis of the model structure and parameter weights determined in step 6, testing the RGB-D image pairs of the test set to generate saliency maps P_est, and performing performance evaluation using the evaluation indices.
2. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 1) is as follows:
Portions of the NJUD, NLPR, and DUT-RGBD datasets are used as the training set; the remaining portions of the NJUD and NLPR datasets, together with the SIP, LFSD, and RGBD135 datasets, are used as the test sets.
3. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 2) is as follows:
3.1 VGG16 is used as the backbone network of the model for extracting the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i ∈ {1, ..., 5} denotes the level, corresponding to each stage output of VGG16.
3.2 VGG16 weights for constructing the backbone network of the present invention are initialized with VGG16 parameter weights pre-trained in the ImageNet dataset.
4. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 3) is as follows:
4.1 The cross-modal weighted fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules and generates the 5 levels of multi-modal features f_i^m.
5. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 4) is as follows:
5.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
5.2 All the multi-scale features are concatenated and a residual connection is added for stability and optimization, yielding the high-level feature f_c^h.
6) The first 3 levels of low-level multi-modal features f_1^m, f_2^m, and f_3^m obtained in step 4 and the 2 levels of high-level multi-scale complementary features f_4^h and f_5^h obtained in step 5 are input into the decoder to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est.
7) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting
Publications (1)
Publication Number | Publication Date |
---|---|
CN116310396A true CN116310396A (en) | 2023-06-23 |
Family
ID=86831836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116310396A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036891A (en) * | 2023-08-22 | 2023-11-10 | 睿尔曼智能科技(北京)有限公司 | Cross-modal feature fusion-based image recognition method and system |
CN117036891B (en) * | 2023-08-22 | 2024-03-29 | 睿尔曼智能科技(北京)有限公司 | Cross-modal feature fusion-based image recognition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111476292B (en) | Small sample element learning training method for medical image classification processing artificial intelligence | |
Huang et al. | Indoor depth completion with boundary consistency and self-attention | |
CN113240691B (en) | Medical image segmentation method based on U-shaped network | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN111325750B (en) | Medical image segmentation method based on multi-scale fusion U-shaped chain neural network | |
CN113554125A (en) | Object detection apparatus, method and storage medium combining global and local features | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN113379707A (en) | RGB-D significance detection method based on dynamic filtering decoupling convolution network | |
CN113850900A (en) | Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction | |
CN114663371A (en) | Image salient target detection method based on modal unique and common feature extraction | |
CN116310396A (en) | RGB-D salient object detection method based on depth quality weighting | |
CN114283315A (en) | RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
Wang et al. | INSPIRATION: A reinforcement learning-based human visual perception-driven image enhancement paradigm for underwater scenes | |
Liu et al. | Video decolorization based on the CNN and LSTM neural network | |
CN116433904A (en) | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution | |
CN115830420A (en) | RGB-D significance target detection method based on boundary deformable convolution guidance | |
CN114972937A (en) | Feature point detection and descriptor generation method based on deep learning | |
CN114693953A (en) | RGB-D significance target detection method based on cross-modal bidirectional complementary network | |
CN114463346A (en) | Complex environment rapid tongue segmentation device based on mobile terminal | |
Zhuge et al. | Automatic colorization using fully convolutional networks | |
CN113096176A (en) | Semantic segmentation assisted binocular vision unsupervised depth estimation method | |
CN116503618B (en) | Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation | |
CN112597847B (en) | Face pose estimation method and device, electronic equipment and storage medium | |
Moorthy et al. | SEM and TEM images’ dehazing using multiscale progressive feature fusion techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||