CN116486112A - RGB-D salient object detection method based on a lightweight cross-modal fusion network - Google Patents
- Publication number
- CN116486112A (application CN202310410912.5A)
- Authority
- CN
- China
- Prior art keywords
- rgb
- features
- modal
- saliency
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/768—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention belongs to the field of computer vision and provides an RGB-D salient object detection method based on a lightweight cross-modal fusion network, comprising the following steps: 1) acquire an RGB-D dataset for training and testing the task and define the algorithm target of the invention; 2) construct an encoder for extracting RGB image features and an encoder for extracting depth image features; 3) establish a cross-modal feature fusion network that enhances the expression of the RGB and depth image features through a progressively guided attention mechanism; 4) on the multi-modal features produced by cross-modal fusion, construct a lightweight global context integration module to extract multi-scale context features of the fused modality; 5) construct a simple and efficient multi-path aggregation module to integrate the fused features with the original RGB and depth-map features, and obtain the final predicted saliency map through an activation function.
Description
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to an efficient, lightweight RGB-D (red-green-blue plus depth) salient object detection method.
Background:
The task of salient object detection (SOD) is to find the most visually attractive objects in a scene by simulating the human visual attention mechanism. It is instructive for many computer vision tasks, including weakly supervised semantic segmentation, visual tracking, object recognition, and video analysis. Existing SOD methods focus mainly on processing RGB images and achieve good performance. However, they can only exploit the visual cues in RGB images, and in some complex scenes, such as those with cluttered backgrounds or similar foreground and background, they encounter serious obstacles. The main reason is that RGB images provide rich visual cues but lack explicit spatial structure information. Meanwhile, with the popularization of depth sensors, depth maps can now be acquired conveniently. The depth information embedded in a complex scene supplements the spatial structure information and helps the RGB modality achieve robust saliency detection. Owing to the introduction of depth maps, RGB-D SOD has made tremendous progress in recent years.
Many RGB-D SOD methods benefit greatly from very deep and wide models and achieve remarkable results. However, this success comes at the cost of a heavy computational burden and slow running speed: such models increase the depth and width of the network by enlarging the number of layers and channels, which brings huge parameter counts and computation. Considering the computation and memory consumption of a model, the invention designs an efficient lightweight cross-modal fusion network for RGB-D SOD to realize lightweight and efficient RGB-D salient object detection and segmentation. Specifically, a cross-modal feature interaction module (CMI) for fusing RGB and depth maps is first proposed. Context information is extracted from each single modality by depth-separable convolution, the RGB and depth-map features are enhanced separately by a progressively guided attention mechanism (PAG), and the features of all modalities are integrated by a multi-source feature integration unit (MAU). Finally, considering the saliency information retained in the original RGB and depth maps, the invention designs a multi-path aggregation module (MPA) in the decoder to integrate the fused features from different layers in a coarse-to-fine manner.
Summary of the invention:
To address these problems, the invention provides an RGB-D salient object detection method based on a lightweight cross-modal fusion network, which adopts the following technical scheme:
1. an RGB-D dataset is acquired that trains and tests the task.
1.1) The NJU2K and NLPR datasets serve as training sets; the remaining NJU2K data, the remaining NLPR data, and the SIP, STERE, and DES datasets serve as test sets.
1.2) Each RGB-D image pair in the dataset consists of a color image I_RGB, a corresponding depth image I_Depth, and a corresponding manually annotated salient-object segmentation map G.
2. A salient-object detection model network is constructed with a convolutional neural network to extract RGB image features and depth image features.
2.1) MobileNet-v3 serves as the backbone network of the model, extracting from each input image pair the RGB image features f_i^r and the depth image features f_i^d, respectively.
3. Based on the multi-scale RGB image features f_i^r extracted in step 2 and the corresponding depth image features f_i^d, the extracted features of each level are used for cross-modal feature fusion. Because the lowest-level features contain too much noise, they are not used here; only the features of levels 2, 3, 4, and 5 are fused. The depth encoder likewise uses only the level-2 to level-5 features.
3.1) The cross-modal feature interaction network consists of four levels of CMI modules; it takes the four levels of RGB image features f_i^r and the corresponding depth image features f_i^d and generates four levels of multi-modal features F_i^fusion.
3.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and its multi-source integration unit outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
3.3 The CMI module generates multi-modal features through a progressively guided attention mechanism, the specific process is as follows:
3.3.1) First, a depth-separable convolution module extracts features from each single modality, enhancing the saliency expression capability of the features; this convolution module further strengthens the expression of the RGB and depth features.
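The depth-separable convolution used throughout the CMI module can be sketched as follows. This is an illustrative NumPy implementation (naive loops, "same" padding, stride 1), not the patent's MobileNet-v3 code; all shapes and names are assumptions.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: (C, H, W); dw_kernels: (C, k, k), one spatial filter per channel;
    pw_weights: (C_out, C), 1x1 pointwise channel mixing."""
    C, H, W = x.shape
    k = dw_kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # Depthwise stage: each channel is convolved with its own k x k filter.
    dw = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * dw_kernels[c])
    # Pointwise stage: a 1x1 convolution mixes channels.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))  # (C_out, H, W)
```

The lightweight appeal is in the parameter count: C·k² + C_out·C weights instead of the C_out·C·k² of a standard convolution.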
3.3.2) Two parallel progressively guided attention mechanisms then perform further feature extraction and enhancement on the RGB and depth features, respectively. To obtain the global information of each single modality, two parallel channel attentions extract features from the RGB and depth maps:

X_r = Sigmoid(AVG(DSConv(f_i^r))) ⊗ DSConv(f_i^r), X_d = Sigmoid(AVG(DSConv(f_i^d))) ⊗ DSConv(f_i^d)

where DSConv(·) denotes a depth-separable convolution module, AVG(·) a channel-wise global average pooling operation, Sigmoid(·) the sigmoid activation function, and ⊗ channel-wise multiplication.
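The channel-attention step above can be illustrated with a minimal NumPy sketch (the DSConv stage is omitted for brevity; function and variable names are illustrative, not the patent's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f):
    """f: (C, H, W). Returns f reweighted by per-channel sigmoid gates."""
    gap = f.mean(axis=(1, 2))        # AVG(): one global descriptor per channel
    gate = sigmoid(gap)              # Sigmoid(): per-channel weight in (0, 1)
    return f * gate[:, None, None]   # broadcast the gates back onto f
```

Channels whose global response is strong are passed through almost unchanged, while weakly responding channels are suppressed.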
3.3.3) After obtaining the global information X_r and X_d, local details are learned on X_r and X_d to prevent the loss of fine local details of salient objects; a multi-scale spatial attention mechanism generates spatial feature maps with multiple receptive fields:

Z_r = Sigmoid(Conv3(Cat(C_1(max(X_r)), C_3(max(X_r)), C_5(max(X_r))))) ⊗ X_r, Z_d = Sigmoid(Conv3(Cat(C_1(max(X_d)), C_3(max(X_d)), C_5(max(X_d))))) ⊗ X_d

where Conv3(·) denotes a convolution module with a 3×3 kernel, max(·) a max-pooling operation, Cat(·) a concatenation (stitching) operation, and C_1(·), C_3(·), C_5(·) dilated (atrous) convolutions with dilation rates 1, 3, and 5, respectively.
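A hedged NumPy sketch of the multi-scale spatial attention: the channel dimension is reduced by a max operation, three 3×3 dilated convolutions with rates 1, 3, and 5 are applied, and their averaged response gates the feature map through a sigmoid. The exact combination in the patent may differ; kernel handling and shapes here are assumptions.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """x: (H, W) single map; kernel: (3, 3); 'same' padding, stride 1."""
    H, W = x.shape
    pad = rate  # effective radius of a 3x3 kernel at this dilation rate
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # sample the 3x3 neighbourhood at rate-spaced offsets
            patch = xp[i:i + 2 * pad + 1:rate, j:j + 2 * pad + 1:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def multiscale_spatial_attention(x, k1, k3, k5):
    """x: (C, H, W). Channel max -> dilated convs at rates 1/3/5 ->
    stack ('Cat') -> average -> sigmoid gate applied spatially."""
    m = x.max(axis=0)                                 # max(): channel-wise max map
    maps = np.stack([dilated_conv2d(m, k, r)          # C1 / C3 / C5 branches
                     for k, r in ((k1, 1), (k3, 3), (k5, 5))])
    gate = 1.0 / (1.0 + np.exp(-maps.mean(axis=0)))   # sigmoid spatial gate
    return x * gate[None]
```

The averaging here stands in for the patent's Conv3-over-Cat fusion of the three branches.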
3.3.4) A multi-source integration unit integrates all the features Z_r, Z_d, X_r, and X_d, fusing the RGB feature Z_r and the depth feature Z_d to finally obtain the fused feature F_i^fusion. Here i ∈ {2,3,4,5} denotes the level of the model at which the feature sits; the unit combines a convolution with a 1×1 kernel (Conv1), a depth-separable convolution (DSConv), a feature stitching operation (Cat), and an element-wise addition (add).
4) Through the above operations, four levels of multi-modal features F_i^fusion (i ∈ {2,3,4,5}) are extracted. These four levels are input to the context-information extraction module, where convolutions at multiple levels and of different sizes enlarge the receptive-field information of the multi-modal features and promote the expression of salient objects.
4.1) Context information is extracted from each fused feature by the context operation:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction module.
4.2) The context-enhanced modal features generated in the above steps are input to the decoder, where a multi-path aggregation module integrates the fused features together with the RGB features f_i^r and depth features f_i^d of each level:

S_out = Sigmoid(Deconv(MPA(F_i^gcm, f_i^r, f_i^d)))

where MPA(·) denotes the multi-path aggregation module, Deconv(·) a deconvolution operation, and S_out the predicted saliency map; i ∈ {2,3,4,5} indexes the level of the fused feature. The final saliency map S_out is obtained from the last decoding stage.
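The coarse-to-fine aggregation performed by the MPA decoder can be illustrated as repeated upsample-and-add over the level features, followed by a sigmoid. This is a stand-in sketch, not the patented module; nearest-neighbour upsampling replaces the Deconv operation.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for a (C, H, W) tensor,
    standing in for Deconv()."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def coarse_to_fine_decode(feats):
    """feats: list of (C, H_i, W_i) fused features, finest first, coarsest last.
    Each step upsamples the running estimate and adds the next-finer feature,
    mimicking coarse-to-fine aggregation. Returns a map in (0, 1)."""
    out = feats[-1]
    for f in reversed(feats[:-1]):
        out = upsample2x(out) + f
    return 1.0 / (1.0 + np.exp(-out))   # sigmoid -> saliency map
```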
5) A loss function is computed between the saliency map S_out predicted by the invention and the manually annotated salient-object segmentation map G; the parameter weights of the proposed model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
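The patent does not name the loss explicitly; binary cross-entropy is a common choice for supervising saliency maps and is sketched here as an assumption:

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    """Binary cross-entropy between a predicted saliency map 'pred'
    (values in (0, 1)) and a binary ground-truth mask 'gt', averaged
    over pixels. Clipping avoids log(0)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)).mean())
```

During training, the gradient of this loss with respect to the network parameters is what SGD back-propagates.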
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps S_test, which are evaluated with the MAE, S-measure, F-measure, and E-measure evaluation indices.
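Of the evaluation indices in step 6, the MAE is the simplest and can be computed as follows (S-measure, F-measure, and E-measure require more involved formulations and are omitted here):

```python
import numpy as np

def mae(saliency, gt):
    """Mean Absolute Error between a predicted saliency map and the
    ground-truth mask, both normalised to [0, 1]. Lower is better."""
    return float(np.abs(saliency.astype(float) - gt.astype(float)).mean())
```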
The invention uses the lightweight MobileNet-v3 as the backbone network, avoiding heavy computation. Unlike previous fusion methods, excessive modal interaction is avoided here so as not to create invalid fusion features; instead, a simple and efficient attention-guiding mechanism enhances the representation capability of the features, and a multi-source integration unit performs the final fusion. In the decoder, to obtain more effective saliency information, a simple multi-path integration module produces the final saliency map. To make the whole network more lightweight, depth-separable convolutions are used to learn the features. The method also exhibits a degree of robustness.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a global context module
FIG. 4 is a schematic diagram of a multi-path aggregation module
FIG. 5 is a schematic diagram of model training and testing
Detailed Description
The following describes embodiments of the invention with reference to the accompanying drawings, in which embodiments of the invention are shown by way of illustration only; they are not all embodiments in which the invention may be practiced. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of the present invention, fall within the scope of the present invention.
Referring to fig. 1, an RGB-D saliency target detection method based on a lightweight cross-modal fusion network mainly includes the following steps:
1. The RGB-D datasets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set for training and testing the algorithm are determined. The NJU2K and NLPR datasets serve as training sets, and the remaining datasets serve as test sets, comprising the remaining NJU2K data, the remaining NLPR data, and the SIP, STERE, and DES datasets.
2. A salient-object detection backbone network for extracting RGB image features and depth image features is constructed with a MobileNet-v3 network; it comprises an encoder for extracting RGB image features and an encoder for extracting depth image features:
2.1. The three-channel RGB image is input to the RGB encoder, generating four levels of RGB features f_2^r, f_3^r, f_4^r, and f_5^r. Because the lowest-level features contain too much noise, they are not used here; only the features of levels 2, 3, 4, and 5 are fused. The depth encoder likewise uses only the level-2 to level-5 features.
2.2. The three-channel depth image is input to the depth encoder, generating four levels of depth image features f_2^d, f_3^d, f_4^d, and f_5^d.
3. Referring to fig. 2, the four levels of RGB image features f_i^r generated in step 2 and the corresponding depth image features f_i^d are fused across modalities by the cross-modal fusion module to obtain four levels of multi-modal features F_i^fusion. The main steps are as follows:
3.1. The cross-modal feature fusion network consists of four levels of CMI modules; it takes the four levels of RGB image features f_i^r and the corresponding depth image features f_i^d and generates four levels of multi-modal features F_i^fusion.
3.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and its multi-source integration unit outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
3.3 The CMI module generates multi-modal features through a progressively guided attention mechanism, the specific process is as follows:
3.3.1) First, a depth-separable convolution module extracts features from each single modality, enhancing the saliency expression capability of the features; this convolution module further strengthens the expression of the RGB and depth features.
3.3.2) Two parallel progressively guided attention mechanisms then perform further feature extraction and enhancement on the RGB and depth features, respectively. To obtain the global information of each single modality, two parallel channel attentions extract features from the RGB and depth maps:

X_r = Sigmoid(AVG(DSConv(f_i^r))) ⊗ DSConv(f_i^r), X_d = Sigmoid(AVG(DSConv(f_i^d))) ⊗ DSConv(f_i^d)

where DSConv(·) denotes a depth-separable convolution module, AVG(·) a channel-wise global average pooling operation, Sigmoid(·) the sigmoid activation function, and ⊗ channel-wise multiplication.
3.3.3) After obtaining the global information X_r and X_d, local details are learned on X_r and X_d to prevent the loss of fine local details of salient objects; a multi-scale spatial attention mechanism generates spatial feature maps with multiple receptive fields:

Z_r = Sigmoid(Conv3(Cat(C_1(max(X_r)), C_3(max(X_r)), C_5(max(X_r))))) ⊗ X_r, Z_d = Sigmoid(Conv3(Cat(C_1(max(X_d)), C_3(max(X_d)), C_5(max(X_d))))) ⊗ X_d

where Conv3(·) denotes a convolution module with a 3×3 kernel, max(·) a max-pooling operation, Cat(·) a concatenation (stitching) operation, and C_1(·), C_3(·), C_5(·) dilated (atrous) convolutions with dilation rates 1, 3, and 5, respectively.
3.3.4) A multi-source feature integration unit integrates all the features Z_r, Z_d, X_r, and X_d, fusing the RGB feature Z_r and the depth feature Z_d to finally obtain the fused feature F_i^fusion. Here i ∈ {2,3,4,5} denotes the level of the model at which the feature sits; the unit combines a convolution with a 1×1 kernel (Conv1), a depth-separable convolution (DSConv), a feature stitching operation (Cat), and an element-wise addition (add).
4. Referring to fig. 3, the four levels of multi-modal features F_i^fusion extracted above are input to the context-information extraction module, where convolutions at multiple levels and of different sizes enlarge the receptive-field information of the multi-modal features and promote the expression of salient objects.
4.1) Context information is extracted from each fused feature by the context operation:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction module.
4.2) Referring to fig. 4, the context-enhanced modal features generated in the above steps are input to the decoder, where a multi-path aggregation module integrates the fused features together with the RGB features f_i^r and depth features f_i^d of each level:

S_out = Sigmoid(Deconv(MPA(F_i^gcm, f_i^r, f_i^d)))

where MPA(·) denotes the multi-path aggregation module, Deconv(·) a deconvolution operation, and S_out the predicted saliency map; i ∈ {2,3,4,5} indexes the level of the fused feature. The final saliency map S_out is obtained from the last decoding stage.
5) A loss function is computed between the saliency map S_out predicted by the invention and the manually annotated salient-object segmentation map G; the parameter weights of the proposed model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps S_test, which are evaluated with the MAE, S-measure, F-measure, and E-measure evaluation indices.
The foregoing is a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
Claims (5)
1. An RGB-D salient object detection method based on a lightweight cross-modal fusion network, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing an algorithm;
2) Constructing an encoder for extracting RGB image features and an encoder for extracting depth image features;
3) Establishing a lightweight network for fusing RGB features and depth-map features, and guiding the fusion of the RGB features and the depth image features through depth-separable convolution and an attention mechanism;
4) Based on the multi-modal characteristics fused by the cross-modal characteristics, a multi-scale context capturing mechanism is constructed to extract multi-modal characteristic context information;
5) Establishing a simple and efficient multipath aggregation decoder for fusing RGB, depth features and fusion features, and obtaining a final predicted saliency map through an activation function;
6) Computing a loss function from the predicted saliency map P and the manually annotated salient-object segmentation map G, gradually updating the parameter weights of the model through SGD and a back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
7) On the basis of the model structure and parameter weights determined in step 6), testing RGB-D image pairs on the test set, generating saliency maps S, and performing performance evaluation with the evaluation indices.
2. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 2) is as follows:
2.1) The NJU2K and NLPR datasets serve as training sets, and the remaining NLPR data, the remaining NJU2K data, and the SIP, STERE, and DES datasets serve as test sets.
2.2) Each RGB-D image pair in the dataset consists of a single color image I_RGB, a corresponding depth image I_Depth, and a corresponding manually annotated salient-object segmentation map G.
3. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 3) is as follows:
3.1) MobileNet-v3 is used as the backbone network of the model, extracting the RGB image features f_i^r and the corresponding depth image features f_i^d, respectively.
3.2) The MobileNet-v3 backbone weights are initialized with MobileNet-v3 parameter weights pre-trained on the ImageNet dataset.
4. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 4) is as follows:
4.1) The cross-modal feature fusion network consists of four levels of CMI modules and generates four levels of multi-modal features F_i^fusion.
4.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and the progressively guided attention mechanism outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
5. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 5) is as follows:
5.1) Multi-scale depth-separable convolution operations with different kernel sizes are applied to obtain multiple receptive fields, which capture rich context information:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction operation.
6) The four levels of multi-modal features with multiple receptive fields obtained in step 5 are input to the decoder formed by the multi-path integration network to obtain the final fused feature, which is activated by a sigmoid function to yield the predicted saliency map S:

S = Sigmoid(MPA(F_2^gcm, F_3^gcm, F_4^gcm, F_5^gcm))

where MPA(·) denotes the multi-path aggregation module.
7) The loss function is computed from the predicted saliency map S and the manually annotated salient-object segmentation map G; the parameter weights of the model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410912.5A CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410912.5A CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486112A true CN116486112A (en) | 2023-07-25 |
Family
ID=87222602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310410912.5A Pending CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486112A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |