CN116452805A - Transformer-based RGB-D semantic segmentation method of cross-modal fusion network - Google Patents

Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Info

Publication number
CN116452805A
CN116452805A (application CN202310401129.2A)
Authority
CN
China
Prior art keywords
rgb
depth
features
semantic segmentation
fusion
Prior art date
Legal status
Pending
Application number
CN202310401129.2A
Other languages
Chinese (zh)
Inventor
葛斌
朱序
夏晨星
张梦格
卢洋
陆一鸣
Current Assignee
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202310401129.2A priority Critical patent/CN116452805A/en
Publication of CN116452805A publication Critical patent/CN116452805A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Title of the invention: Transformer-based RGB-D semantic segmentation method of a cross-modal fusion network. Abstract: the invention provides a Transformer-based cross-modal fusion RGB-D semantic segmentation method, which uses the multi-modal data of RGB images and depth images to extract cross-modal features for the semantic segmentation task in computer vision. The main contribution of the invention is to account for the unreliable depth information obtained by depth sensors (for example, distant objects or reflective surfaces often yield inaccurate readings or holes from some depth sensors), to enhance the depth features with bilateral filtering, and to effectively fuse the RGB features and the depth features through a cross-modal residual fusion module. The proposed method effectively handles the challenge faced by semantic segmentation of RGB images alone (the difficulty of distinguishing instances with similar colors and textures) while making effective use of the depth image.

Description

Transformer-based RGB-D semantic segmentation method of cross-modal fusion network
Technical Field
The invention relates to the field of image processing, and in particular to a semantic segmentation method based on feature extraction and fusion of different modalities.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Semantic segmentation is one of the most challenging problems in computer vision. Its purpose is to partition an input image into regions according to their underlying semantic meaning, achieving the pixel-level dense scene understanding required by many real-world applications. With the rise of popular computer vision topics such as scene understanding, reconstruction and image processing, image semantic segmentation, as the foundation of these topics, has attracted increasing attention from researchers in the field. Semantic segmentation is a fundamental and long-standing problem in computer vision; treated as a per-pixel classification problem in which a class label is assigned to each pixel, it is suitable for a wide range of applications (such as automatic driving, object classification, image retrieval, and the detection of medical instruments in human-machine interactive surgery). Although some excellent results have been achieved in semantic segmentation, most studies focus on RGB images only. Since RGB images provide the model with distinct colors and textures but no geometric information, it is difficult to distinguish instances with similar colors and textures. To solve this problem, researchers began to use depth information to assist RGB semantic segmentation. Combining RGB and depth information, known as RGB-D, is an important approach: the depth image provides the required geometric information, which can enrich the representation of the RGB image and help better distinguish various objects.
Existing RGB-D semantic segmentation methods face two main challenges: how to effectively extract features from the additional depth modality, and how to effectively merge the different features of the two modalities. Current approaches mostly treat the depth map as a single-channel image and use a convolutional neural network (Convolutional Neural Network, CNN) to extract features from it in the same way as from the RGB image; however, this ignores the fact that not every depth value obtained by the depth sensor is reliable. Since RGB images and depth images belong to two different modalities, how to effectively fuse the features of the two modalities is also an important challenge for RGB-D semantic segmentation.
To address the shortcomings of convolutional-neural-network-based methods, the invention designs a framework capable of efficiently extracting RGB and depth features; during feature extraction, the reliability of the input depth values is explicitly considered and the depth image is denoised, so that its features can be used effectively. To solve the problem of fusing RGB features and depth features, the invention designs a cross-modal residual fusion module.
Disclosure of Invention
In view of the above problems, the invention aims to provide a Transformer-based RGB-D semantic segmentation method of a cross-modal fusion network, which adopts the following technical scheme:
1. RGB-D data sets for training and testing are acquired and collated.
1.1) The acquired datasets (the NYU Depth V2 dataset and the SUN RGB-D dataset) are sorted into the following categories: RGB images P_RGB, depth images P_Depth, and manually annotated ground-truth images P_GT.
1.2) The collected data are divided into a training set and a test set. NYU Depth V2 contains 1449 images in total, of which 795 are selected as the training set and the remaining 654 as the test set. SUN RGB-D consists of 10335 indoor RGB-D images, divided into a training set of 5285 samples and a test set of 5050 samples.
2. The network framework of the invention consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
2.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
2.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
2.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering: the neighborhood used for filtering is partitioned (classified) according to pixel values, a relatively high weight is assigned to the class to which the center pixel belongs, and a weighted sum over the neighborhood then produces the final result.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain. The filtered value at the kernel center is the neighborhood sum of f(i, j) weighted by d·r, normalized by the sum of these weights.
2.4) The invention is implemented and trained with the PyTorch framework. The encoders use the default configuration of Swin-S.
3. Based on the RGB features F_i^RGB and depth features F_i^Depth extracted in step 2, the invention fuses the features between the RGB encoder and the depth encoder using the proposed cross-modal residual fusion module, combining the features of the two modalities into a single fused feature. The fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
3.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
3.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
3.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
4. Through the above steps, the cross-modal fused features F_i are obtained. The semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result. The invention uses the UPerNet decoder as the semantic decoder, which offers high efficiency.
5. The semantic segmentation map P_pre predicted by the invention is compared with the manually annotated ground-truth segmentation map P_GT to compute the loss; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined. The invention uses the cross-entropy loss function:

L = -(1/N) Σ_n Σ_c y_{n,c} log(p_{n,c})

where N is the number of pixels, y_{n,c} is the one-hot ground-truth label of pixel n for class c, and p_{n,c} is the predicted probability that pixel n belongs to class c.
6. On the basis of the model structure and weight parameters determined in step 5, the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a flow chart of a bilateral filtering module
FIG. 3 is a schematic diagram of a cross-modal residual fusion module
FIG. 4 is a schematic diagram of an improved coordinate attention module
Detailed Description
The following describes the embodiments of the present invention more fully and clearly with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of this invention, fall within the scope of protection of the present invention.
Referring to fig. 1, a Transformer-based cross-modal fusion network RGB-D semantic segmentation method mainly includes the following steps:
1. RGB-D data sets for training and testing are acquired and collated.
1.1) The acquired datasets (the NYU Depth V2 dataset and the SUN RGB-D dataset) are sorted into the following categories: RGB images P_RGB, depth images P_Depth, and manually annotated ground-truth images P_GT.
1.2) The collected data are divided into a training set and a test set. NYU Depth V2 contains 1449 images in total, of which 795 are selected as the training set and the remaining 654 as the test set. SUN RGB-D consists of 10335 indoor RGB-D images, divided into a training set of 5285 samples and a test set of 5050 samples. A sketch of this data organization is given below.
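For illustration, the sketch below shows one way the P_RGB / P_Depth / P_GT triplets and the train/test split described above could be organized in PyTorch; the directory layout, file names and split-list files are assumptions for this sketch, not part of the patent.

```python
import os
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class RGBDSegmentationDataset(Dataset):
    """Returns one RGB-D sample as the triplet (P_RGB, P_Depth, P_GT)."""
    def __init__(self, root, split_file):
        self.root = root
        # split_file lists sample ids, one per line (assumed format)
        with open(split_file) as f:
            self.ids = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sid = self.ids[idx]
        rgb   = np.array(Image.open(os.path.join(self.root, "rgb",   sid + ".png")))
        depth = np.array(Image.open(os.path.join(self.root, "depth", sid + ".png")))
        gt    = np.array(Image.open(os.path.join(self.root, "label", sid + ".png")))
        return {"P_RGB": rgb, "P_Depth": depth, "P_GT": gt}

# NYU Depth V2: 795 training / 654 test images; SUN RGB-D: 5285 / 5050 samples.
train_set = RGBDSegmentationDataset("data/nyu_depth_v2", "splits/train_795.txt")
test_set  = RGBDSegmentationDataset("data/nyu_depth_v2", "splits/test_654.txt")
```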
2. The network framework of the invention consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
2.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
2.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
2.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering: the neighborhood used for filtering is partitioned (classified) according to pixel values, a relatively high weight is assigned to the class to which the center pixel belongs, and a weighted sum over the neighborhood then produces the final result.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain. The filtered value at the kernel center is the neighborhood sum of f(i, j) weighted by d·r, normalized by the sum of these weights.
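A minimal NumPy sketch of this depth pre-processing step is given below, assuming the validity map U and the kernel definitions above; the window radius and the σ values are illustrative placeholders, not the patent's settings.

```python
import numpy as np

def bilateral_filter_depth(depth, U, radius=3, sigma_d=2.0, sigma_r=0.1):
    """Bilateral filtering of a depth map; U is the binary validity map (0 = missing, 1 = valid)."""
    H, W = depth.shape
    out = depth.astype(np.float32).copy()
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_d ** 2))         # space-domain kernel d
    for i in range(radius, H - radius):
        for j in range(radius, W - radius):
            patch = depth[i - radius:i + radius + 1, j - radius:j + radius + 1].astype(np.float32)
            valid = U[i - radius:i + radius + 1, j - radius:j + radius + 1]
            if not valid.any():
                continue                                                   # no usable neighbours
            center = depth[i, j] if U[i, j] else patch[valid == 1].mean()  # fall back for holes
            rng = np.exp(-((patch - center) ** 2) / (2.0 * sigma_r ** 2))  # color-domain kernel r
            w = spatial * rng * valid                                      # invalid pixels get zero weight
            out[i, j] = float((w * patch).sum() / w.sum())
    return out

# usage: U marks the pixels where the sensor returned a reading
depth = np.random.rand(480, 640).astype(np.float32)
U = (depth > 0.05).astype(np.uint8)
filtered = bilateral_filter_depth(depth, U)
```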
2.4) The invention is implemented and trained with the PyTorch framework. The encoders use the default configuration of Swin-S.
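The structural sketch below illustrates the dual-stream, four-stage encoder arrangement described in 2.1–2.4. `SwinStage` is a hypothetical placeholder standing in for a real Swin-S stage (patch merging plus Swin Transformer blocks), not a library class; the channel widths follow the usual Swin-S defaults only as an assumption, and the depth input is assumed to be replicated to three channels.

```python
import torch
import torch.nn as nn

class SwinStage(nn.Module):
    """Placeholder for one Swin-S stage (patch merging + Swin Transformer blocks)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)    # stands in for patch merging
        self.blocks = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU())

    def forward(self, x):
        return self.blocks(self.down(x))

class DualStreamEncoder(nn.Module):
    """Two parallel four-stage backbones producing F_i^RGB and F_i^Depth for i = 1..4."""
    def __init__(self, dims=(96, 192, 384, 768)):                         # Swin-S-like widths (assumed)
        super().__init__()
        chans = [3] + list(dims)
        self.rgb_stages   = nn.ModuleList([SwinStage(chans[i], chans[i + 1]) for i in range(4)])
        self.depth_stages = nn.ModuleList([SwinStage(chans[i], chans[i + 1]) for i in range(4)])

    def forward(self, rgb, depth):
        feats = []
        for rs, ds in zip(self.rgb_stages, self.depth_stages):
            rgb, depth = rs(rgb), ds(depth)                                # F_i^RGB, F_i^Depth
            feats.append((rgb, depth))                                     # handed to the fusion modules
        return feats

# usage
enc = DualStreamEncoder()
features = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```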
3. Based on the RGB features F_i^RGB and depth features F_i^Depth extracted in step 2, the invention fuses the features between the RGB encoder and the depth encoder using the proposed cross-modal residual fusion module, combining the features of the two modalities into a single fused feature. The fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
3.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
3.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
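As a reference point, a minimal sketch of a standard coordinate attention block (Hou et al., 2021) is shown below; the patent's "improved" variant is not fully specified in this text, so this baseline is only an approximation of the module used in the first stage.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Standard coordinate attention: direction-aware pooling followed by per-axis gating."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        xh = self.pool_h(x)                              # (N, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (N, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                        # attention along height
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))    # attention along width
        return x * ah * aw
```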
3.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
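The sketch below is one possible reading of the CRFFM in 3.1.1–3.1.2, reusing the CoordinateAttention sketch above; the exact form of the feature-selection stage is not spelled out in this text, so the cross-modal residual connections here are an assumption rather than the patent's definitive design.

```python
import torch
import torch.nn as nn

class CRFFM(nn.Module):
    """Cross-modal residual feature fusion (simplified interpretation)."""
    def __init__(self, channels):
        super().__init__()
        self.ca_rgb = CoordinateAttention(channels)      # from the sketch above
        self.ca_dep = CoordinateAttention(channels)
        self.conv_rgb = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_dep = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        a_rgb, a_dep = self.ca_rgb(f_rgb), self.ca_dep(f_dep)      # attention-enhanced features
        # symmetric feature selection: each branch absorbs complementary information via a residual link
        r_rgb = f_rgb + a_dep
        r_dep = f_dep + a_rgb
        x, y = self.conv_rgb(r_rgb), self.conv_dep(r_dep)          # Conv3x3 on each branch
        fused = self.conv_out(torch.cat([x * y, torch.maximum(x, y)], dim=1))  # multiply, max, concat, Conv3x3
        return fused, r_rgb, r_dep   # fused feature -> decoder; residual features -> next encoder stage
```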
4. Through the above steps, the cross-modal fused features F_i are obtained. The semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result. The invention uses the UPerNet decoder as the semantic decoder, which offers high efficiency.
5. The semantic segmentation map P_pre predicted by the invention is compared with the manually annotated ground-truth segmentation map P_GT to compute the loss; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined. The invention uses the cross-entropy loss function:

L = -(1/N) Σ_n Σ_c y_{n,c} log(p_{n,c})

where N is the number of pixels, y_{n,c} is the one-hot ground-truth label of pixel n for class c, and p_{n,c} is the predicted probability that pixel n belongs to class c.
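For concreteness, the snippet below shows a standard pixel-wise cross-entropy loss with back-propagation in PyTorch, as described in step 5; the number of classes and the ignore index are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

num_classes = 40                                    # e.g. the common 40-class NYU Depth V2 setting (assumed)
criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 marks unlabeled pixels (assumed convention)

logits = torch.randn(2, num_classes, 480, 640, requires_grad=True)  # stands in for the predicted P_pre
gt = torch.randint(0, num_classes, (2, 480, 640))                    # stands in for the annotated P_GT

loss = criterion(logits, gt)   # pixel-wise cross entropy between prediction and ground truth
loss.backward()                # back-propagation; an optimizer step would then update the weights
```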
6. On the basis of the model structure and weight parameters determined in step 5, the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
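A small sketch of the Pixel Acc metric used in step 6 is given below; the ignore index for unlabeled pixels is again an assumption.

```python
import torch

def pixel_accuracy(pred, gt, ignore_index=255):
    """pred and gt are integer label maps of the same shape; ignored pixels are excluded."""
    valid = gt != ignore_index
    correct = (pred == gt) & valid
    return correct.sum().item() / max(valid.sum().item(), 1)

# usage with dummy predictions
pred = torch.randint(0, 40, (480, 640))
gt = torch.randint(0, 40, (480, 640))
print(f"Pixel Acc: {pixel_accuracy(pred, gt):.4f}")
```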
The foregoing is a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (7)

1. A Transformer-based cross-modal fusion network RGB-D semantic segmentation method, characterized by comprising: acquiring and organizing image samples for training and testing, constructing a dual-stream encoder, extracting and fusing cross-modal features, and a bilateral filtering module for processing depth images.
2. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 1, wherein the data used comprise the NYU V2 dataset and the SUN RGB-D dataset, and each sample is divided into an RGB image P_RGB, a depth image P_Depth, and a manually annotated semantic segmentation image P_GT; the training set consists of 795 samples from the NYU V2 dataset and 5285 samples from the SUN RGB-D dataset, with the remaining samples used as the test set.
3. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 1, wherein the network framework consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
3.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
3.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
3.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain.
4. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 3, wherein the output of each encoder block is fused with the proposed cross-modal residual fusion module, combining the features of the RGB encoder and the depth encoder into a single fused feature; the fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
4.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
4.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
4.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
5. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 4, wherein the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
6. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 5, wherein the semantic segmentation map P_pre predicted by the method is compared with the manually annotated ground-truth segmentation map P_GT to compute a loss function; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined.
7. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 6, wherein the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
CN202310401129.2A 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network Pending CN116452805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310401129.2A CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310401129.2A CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Publications (1)

Publication Number Publication Date
CN116452805A true CN116452805A (en) 2023-07-18

Family

ID=87129776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310401129.2A Pending CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Country Status (1)

Country Link
CN (1) CN116452805A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036891A (en) * 2023-08-22 2023-11-10 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117036891B (en) * 2023-08-22 2024-03-29 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117115061A (en) * 2023-09-11 2023-11-24 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium
CN117115061B (en) * 2023-09-11 2024-04-09 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination