CN112991350B - RGB-T image semantic segmentation method based on modal difference reduction - Google Patents
- Publication number
- CN112991350B CN112991350B CN202110187778.8A CN202110187778A CN112991350B CN 112991350 B CN112991350 B CN 112991350B CN 202110187778 A CN202110187778 A CN 202110187778A CN 112991350 B CN112991350 B CN 112991350B
- Authority
- CN
- China
- Prior art keywords
- features
- channel
- rgb
- correlation matrix
- convolution
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/10 — Image analysis; Segmentation; Edge detection
- G06F18/253 — Pattern recognition; Fusion techniques of extracted features
- G06N3/045 — Neural networks; Combinations of networks
- G06N3/048 — Neural networks; Activation functions
- G06N3/08 — Neural networks; Learning methods
- Y02T10/40 — Climate change mitigation technologies related to transportation; Engine management systems
Abstract
The invention discloses an RGB-T image semantic segmentation method based on modal difference reduction, which comprises the steps of: (1) constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model; (2) constructing an adaptive channel weighted fusion module, and performing channel-by-channel weighted fusion of the multi-level RGB features and thermal infrared features through the weighted fusion module to obtain multi-level fusion features; (3) taking the multi-level fusion features and computing a spatial correlation matrix and a channel correlation matrix; (4) restoring the features acted on by the spatial and channel correlation matrices to full resolution through a deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification via a channel transformation operation and a softmax function; and (5) training the network to obtain the model parameters.
Description
Technical Field
The invention belongs to the field of image processing, and relates to an RGB-T image semantic segmentation method based on modal difference reduction, which can be used for image preprocessing in computer vision tasks.
Background
Semantic segmentation aims at assigning a class label to each pixel in a natural image using a model or algorithm. As one of key technologies for scene perception, semantic segmentation plays a vital role in computer vision tasks such as autopilot, pedestrian detection, and medical image analysis.
Existing semantic segmentation methods can be divided into two main categories: traditional methods and deep-learning-based methods. Traditional semantic segmentation methods mainly combine low-level hand-crafted features with a shallow classifier. Such methods lack robustness and can hardly obtain satisfactory results in complex scenes. With the wide application of deep learning, deep-learning-based semantic segmentation has made breakthrough progress and, compared with traditional methods, offers better segmentation results and stronger robustness.
So far, RGB image semantic segmentation methods based on deep learning have achieved remarkable results. However, under poor lighting conditions, the performance of these algorithms may degrade significantly. The thermal infrared image can provide contour and semantic information of the target and thus effectively supplement the RGB image.
Existing RGB-T semantic segmentation methods usually adopt simple strategies to capture the complementary information in RGB and thermal infrared images. For example, "Yuxiang Sun, Weixun Zuo, and Ming Liu. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. RA-L, 4(3):2576-2583, 2019" directly fuses each level of features of the two modality images in the encoder using only element-wise addition; "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" directly fuses each level of features of the two modality images in the decoder using only concatenation. These methods do not consider the modal difference between RGB and thermal infrared images caused by their different imaging mechanisms. As a result, simple fusion strategies cannot fully exploit the cross-modality complementary information, which reduces the accuracy of RGB-T image semantic segmentation.
In addition, the diversity of targets in the image to be segmented, such as their class, size and shape, is also one of the key issues in the semantic segmentation task. In single-modality RGB image semantic segmentation, multi-scale context information and its long-range dependencies have proven effective for solving this problem. In RGB-T semantic segmentation, however, multi-scale context information and its long-range dependencies have not been well mined and utilized; only "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" acquires a small amount of context information through a parallel structure of convolutions with two different receptive fields, which has a very limited effect on RGB-T semantic segmentation in complex scenes, so the problem of target diversity still cannot be effectively solved.
Disclosure of Invention
Purpose of the invention: aiming at the defects of the prior art, the invention provides an RGB-T image semantic segmentation method based on modal difference reduction, which mainly addresses the problems that the prior art neither considers the modal difference between visible-light and thermal infrared images nor makes sufficient use of context information, resulting in low semantic segmentation precision.
The key to realizing the invention is to reduce the modal difference between the RGB features and the thermal infrared features before fusing them in the network encoding stage, so that the fused features are more discriminative, and to fully mine the multi-scale context information of the fused features and its long-range dependencies.
The technical scheme is as follows: an RGB-T image semantic segmentation method based on modal difference reduction comprises the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting RGB features and thermal infrared features with more discriminative power for input RGB and thermal infrared registered image pairs, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces the modal difference in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal difference between each level of features of a pseudo image generated by an image conversion method and each level of features of the corresponding true image. Each level of features of the RGB pseudo image and of the thermal infrared pseudo image is then extracted, each level of features of the corresponding true RGB image and true thermal infrared image is taken as its supervision, and the supervised learning model is thereby built;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Taking the multi-level fusion features obtained in step (2), computing a spatial correlation matrix and a channel correlation matrix, applying them to the multi-scale features, and establishing the relation between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
On a training data set, the supervised learning model is applied to the semantic segmentation mask map predicted in step (4) and the pseudo-image features generated in step (1), and the network is trained end to end through a weighted cross-entropy loss function and a weighted mean absolute error loss function to obtain the network model parameters.
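The weighted cross-entropy term used in step (5) can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the function name, the toy shapes and the uniform-logit example are assumptions, and the per-class weights stand in for whatever weighting scheme the training actually uses.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights, eps=1e-12):
    """Pixel-wise class-weighted cross-entropy.

    logits:        (H, W, K) raw network outputs
    labels:        (H, W)    integer class ids in [0, K)
    class_weights: (K,)      per-class weights (e.g. inverse frequency)
    """
    # numerically stable softmax over the class dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    h, w = labels.shape
    # probability assigned to the ground-truth class of each pixel
    pix = p[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    wts = class_weights[labels]
    return float(-(wts * np.log(pix + eps)).mean())

# toy example: a 2x2 image with 3 classes and uniform logits
logits = np.zeros((2, 2, 3))
labels = np.array([[0, 1], [2, 0]])
weights = np.ones(3)
loss = weighted_cross_entropy(logits, labels, weights)
```

With uniform logits every class gets probability 1/3, so the unweighted loss equals ln(3); raising a class weight above 1 increases the penalty on that class's pixels.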
Further, the bidirectional modal difference reduction sub-network in step (1) comprises two parts, from the RGB modality to the thermal infrared modality and from the thermal infrared modality to the RGB modality, the two parts adopting encoder-decoder-encoder networks of the same structure, wherein the encoders use a ResNet-50 network and a ResNet-18 network, and the decoder uses an image generation network to generate a pseudo image through an upsampling strategy of bilinear interpolation.
Further, in step (1), the difference between the five hierarchical features of different resolutions {F_i^{pse-T}}, i=1,...,5, of the generated pseudo thermal infrared image, extracted by a ResNet-18 network, and the five hierarchical features {F_i^{T'}}, i=1,...,5, of the corresponding true thermal infrared image, extracted by another ResNet-18 network, is reduced; simultaneously, the difference between the five hierarchical features {F_i^{pse-RGB}}, i=1,...,5, of the generated pseudo RGB image, extracted by a ResNet-18 network, and the five hierarchical features {F_i^{RGB'}}, i=1,...,5, of the corresponding true RGB image, extracted by another ResNet-18 network, is reduced;
so as to obtain more discriminative RGB multi-level features {F_i^{RGB}} extracted by one ResNet-50 network and corresponding thermal infrared multi-level features {F_i^{T}} extracted by the other ResNet-50 network.
Further, the adaptive channel weighted fusion module in step (2) takes as input the first four levels of RGB features {F_n^{RGB}}, n=1,...,4, obtained in step (1) and the corresponding first four levels of thermal infrared features {F_n^{T}}, n=1,...,4; adaptively generates the RGB weight vectors W_1, W_2, W_3, W_4 of the corresponding levels and the thermal infrared weight vectors 1-W_1, 1-W_2, 1-W_3, 1-W_4 of the corresponding levels; and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features {F_n^{fuse}}.
Further, in step (3), the inputs of the multi-scale spatial context module and the multi-scale channel context module are the fusion features F_s^{fuse} and F_c^{fuse}, respectively, so as to establish interactions between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale space context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-channel correlation matrix.
Still further, step (31) includes:
(311) The hole convolution pyramid structure comprises four paths, each applying a 1×1 convolution with stride 1 and parameters θ_1, θ_2, θ_3, θ_4 respectively, C(*; θ_1), C(*; θ_2), C(*; θ_3), C(*; θ_4), wherein:
the first path is followed by a 3×3 hole convolution with stride 1 and hole rate 1;
the second path by a 3×3 hole convolution with stride 1 and hole rate 6;
the third path by a 3×3 hole convolution with stride 1 and hole rate 12;
the fourth path by a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths yield features d_1, d_2, d_3, d_4 of different scales, each with 256 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameters θ_5, C(*; θ_5), to obtain a feature F_s^{ms} containing rich multi-scale context information, whose channel count is the same as that of the input F_s^{fuse};
(312) The multi-scale feature F_s^{ms} obtained in step (311) is size-transformed and matrix-multiplied with its own transpose to obtain the spatial autocorrelation matrix M_ss ∈ R^{HW×HW};
(313) The cross-spatial correlation matrix M_cs ∈ R^{HW×HW} is obtained from the original input feature F_s^{fuse} in the same manner as in step (312), as an information-supplementing part;
(314) The spatial autocorrelation matrix M_ss and the cross-spatial correlation matrix M_cs are added element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^{HW×HW}; it is then multiplied with the multi-scale feature F_s^{ms}, and a skip connection path is added to obtain the feature F_s^{out} containing multi-scale context information and its spatial long-term dependencies.
Still further, step (32) includes:
(321) The hole convolution pyramid structure comprises four paths, each applying a 1×1 convolution with stride 1 and parameters θ_6, θ_7, θ_8, θ_9 respectively, C(*; θ_6), C(*; θ_7), C(*; θ_8), C(*; θ_9), wherein:
the first path is followed by a 3×3 hole convolution with stride 1 and hole rate 1;
the second path by a 3×3 hole convolution with stride 1 and hole rate 6;
the third path by a 3×3 hole convolution with stride 1 and hole rate 12;
the fourth path by a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths yield features d_5, d_6, d_7, d_8 of different scales, each with 512 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameters θ_10, C(*; θ_10), to obtain a feature F_c^{ms} containing rich multi-scale context information, whose channel count is the same as that of the input F_c^{fuse};
(322) The multi-scale feature F_c^{ms} obtained in step (321) is size-transformed and matrix-multiplied with its own transpose to obtain the self-channel correlation matrix M_sc ∈ R^{1024×1024};
(323) The cross-channel correlation matrix M_cc ∈ R^{1024×1024} is obtained from the original input feature F_c^{fuse} in the same manner as in step (322), as an information-supplementing part;
(324) The self-channel correlation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and normalized to obtain the total channel correlation matrix M_c ∈ R^{1024×1024}; it is then multiplied with the multi-scale feature F_c^{ms}, and a skip connection path is added to obtain the feature F_c^{out} containing multi-scale context information and its channel long-term dependencies.
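The hole (dilated) convolutions in the pyramids above enlarge the receptive field without adding parameters: a 3×3 kernel with hole rate r covers a (2r+1)×(2r+1) window. A small numpy sketch of this kernel dilation, for illustration only (real frameworks implement dilation inside the convolution operator rather than by materializing the expanded kernel):

```python
import numpy as np

def dilate_kernel(k, rate):
    """Expand a dense kernel by inserting rate-1 zeros between taps,
    which is how a hole convolution enlarges its receptive field."""
    kh, kw = k.shape
    out = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1), dtype=k.dtype)
    out[::rate, ::rate] = k
    return out

k = np.ones((3, 3))
fields = {r: dilate_kernel(k, r).shape for r in (1, 6, 12, 18)}
```

For the rates 1, 6, 12 and 18 used above, the effective windows are 3×3, 13×13, 25×25 and 37×37, while the number of non-zero taps stays at 9.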
Further, in step (4), a deconvolution operation is used to upsample the feature map and restore its resolution; a 1×1 convolution with stride 1 and parameters θ_11, C(*; θ_11), then changes the channel number into the number of classes of the data set; finally, a softmax function predicts the class to which each pixel belongs, yielding the semantic segmentation mask map.
The beneficial effects are that: compared with the prior art, the RGB-T image semantic segmentation method based on modal difference reduction has the following beneficial effects:
1) The invention realizes end-to-end pixel-level semantic segmentation prediction for RGB-T image pairs without manually designed feature extraction; simulation results show that the invention remarkably improves semantic segmentation precision and achieves better segmentation of small targets and of complex scenes;
2) The invention designs a "reduce first, fuse later" strategy: it first reduces the modal difference between multi-modal data caused by different imaging mechanisms through a method based on bidirectional image conversion, and then adaptively selects strongly discriminative multi-modal features to improve the RGB-T semantic segmentation effect. Compared with existing methods, the multi-modal features extracted by the invention are more discriminative, which helps improve target category prediction precision;
3) The invention fully mines rich context information by establishing interactions between the multi-scale context information of the cross-modal features and their long-term dependencies in the spatial and channel dimensions, which helps solve the problem of target diversity. Compared with existing methods, the invention can better segment targets of different scales while improving segmentation integrity inside targets.
Drawings
FIG. 1 is a flow chart of a RGB-T image semantic segmentation method based on modal difference reduction;
FIG. 2 is a schematic diagram of an algorithm network of an RGB-T image semantic segmentation method based on modal difference reduction, wherein a dotted line frame represents a bidirectional modal difference reduction sub-network, CWF represents an adaptive channel weighting fusion module, MSC represents a multi-scale space context module, and MCC represents a multi-scale channel context module;
fig. 3 is a schematic diagram of a framework of an adaptive channel weighted fusion module (CWF) according to the present invention;
FIG. 4 is a diagram of a multi-scale space context Module (MSC) framework in accordance with the present invention;
fig. 5 is a multi-scale channel context Module (MCC) framework diagram according to the present invention.
The specific embodiment is as follows:
Specific embodiments of the invention are described in detail below.
Referring to fig. 1, a method for semantic segmentation of RGB-T images based on modal disparity reduction includes the steps of:
(1) Constructing a bi-directional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input RGB and thermal infrared registered image pair and simultaneously constructing a supervised learning model, wherein:
as shown in fig. 2, the bidirectional modal difference reducing sub-network reduces modal differences in both directions, extracts RGB features and thermal infrared features with discrimination by reducing modal differences between each level feature of a pseudo image generated by an image conversion method and each level feature of a corresponding real image, then extracts each level feature of the RGB pseudo image and the thermal infrared pseudo image respectively, and builds a supervised learning model by taking each level feature of the corresponding real image of RGB and each level feature of the thermal infrared real image as supervision thereof;
In step 1), when performing feature difference reduction from the RGB modality to the thermal infrared modality, a ResNet-50 first extracts the multi-level features {F_i^{RGB}}, i=1,...,5, of the RGB image, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input image resolution and whose channel numbers are 64, 256, 512, 1024 and 2048, respectively. Four 3×3 convolutions with stride 1 then reduce the multi-level features to single-channel feature maps, which are upsampled by bilinear interpolation and summed to generate the pseudo thermal infrared image I_pse-T. A ResNet-18 extracts the multi-level features {F_i^{pse-T}}, i=1,...,5, of the pseudo thermal infrared image I_pse-T, while another ResNet-18 simultaneously extracts the five multi-level features {F_i^{T'}}, i=1,...,5, of different resolutions of the corresponding true thermal infrared image, and the difference between the true and pseudo features of each level is computed.
Similarly, when performing feature difference reduction from the thermal infrared modality to the RGB modality, a ResNet-50 first extracts the multi-level features {F_i^{T}}, i=1,...,5, of the thermal infrared image, and a three-channel pseudo RGB image I_pse-RGB is generated in the same manner. A ResNet-18 then extracts the multi-level features {F_i^{pse-RGB}}, i=1,...,5, of the pseudo RGB image I_pse-RGB, while another ResNet-18 simultaneously extracts the five multi-level features {F_i^{RGB'}}, i=1,...,5, of different resolutions of the corresponding true RGB image, and the difference between the true and pseudo features of each level is computed.
In the bidirectional modal difference reduction sub-network, the total modal difference L_MD is the sum of the differences between the true and pseudo thermal infrared multi-level features and between the true and pseudo RGB multi-level features, which can be expressed as:

L_MD = Σ_{i=1}^{5} L1(F_i^{pse-T}, F_i^{T'}) + Σ_{i=1}^{5} L1(F_i^{pse-RGB}, F_i^{RGB'})   (1)

wherein:
L1(*) denotes the mean absolute error.
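The total modal difference L_MD can be sketched in numpy as the sum of per-level mean absolute errors between the pseudo- and true-image feature pyramids. The list-of-arrays representation and the toy two-level pyramids below are assumptions for illustration, not the network's actual feature shapes:

```python
import numpy as np

def mean_abs_error(a, b):
    # L1 term between one pseudo-feature level and its true counterpart
    return float(np.abs(a - b).mean())

def modal_difference_loss(pseudo_t, true_t, pseudo_rgb, true_rgb):
    """Sum of per-level mean absolute errors over both directions:
    pseudo-thermal vs. true-thermal and pseudo-RGB vs. true-RGB.
    Each argument is a list of arrays, one per pyramid level."""
    loss = 0.0
    for p, t in zip(pseudo_t, true_t):
        loss += mean_abs_error(p, t)
    for p, t in zip(pseudo_rgb, true_rgb):
        loss += mean_abs_error(p, t)
    return loss

# toy feature pyramids with two levels
f1 = [np.zeros((4, 4)), np.ones((2, 2))]
f2 = [np.ones((4, 4)), np.ones((2, 2))]
l = modal_difference_loss(f1, f2, f2, f2)
```

Only the first level of the thermal pair differs here, by 1 everywhere, so the loss is exactly 1.0; identical pyramids give 0.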
(2) Constructing an adaptive channel weighted fusion module, and performing channel-by-channel weighted fusion of the multi-level RGB features and thermal infrared features obtained in step (1) through the weighted fusion module to obtain multi-level fusion features, so that feature channels with strong discriminative ability are better selected from the multi-modal features;
four layers of RGB features with different resolutions and corresponding thermal infrared features are obtained in the step (1), and the module is used for fusing each layer of RGB features and T features, so that four layers of fusion features are obtained in total. Meanwhile, as for the fused features, resNet-50 is also used for feature extraction. Specifically, the fusion module is utilized to obtain the RGB features and the thermal infrared features of the first layer, then the fusion features are subjected to ResNet-50 residual block downsampling, and finally the fusion features of the second layer (namely, the fusion features obtained by the fusion module are utilized to carry out addition operation on the RGB features and the thermal infrared features of the second layer). As are subsequent layers.
As shown in FIG. 3, the channel-by-channel weighted fusion module takes as input the reduced-modal-difference RGB features {F_n^{RGB}}, n=1,...,4, obtained in step 1 and the corresponding thermal infrared features {F_n^{T}}, n=1,...,4; the last-level features F_5^{RGB} and F_5^{T} are discarded to save network computation. The multi-modal features of each level are concatenated, and the corresponding weight vectors are predicted through four convolution-block operations, each comprising a 3×3 convolution with stride 1, C(*; θ_n^a), and a 1×1 convolution with stride 1, C(*; θ_n^b), computing the relative importance of paired features from different modalities but in the same channel, i.e. the weight vectors W_1, W_2, W_3, W_4 of the RGB modality and the weight vectors 1-W_1, 1-W_2, 1-W_3, 1-W_4 of the corresponding multi-level thermal infrared modality. This can be expressed as:

W_n = σ(GAP(C(C(Cat(F_n^{RGB}, F_n^{T}); θ_n^a); θ_n^b))), n = 1,...,4   (2)

wherein:
GAP(*) denotes the global average pooling operation;
Cat(*) denotes the concatenation operation;
σ denotes the sigmoid activation function.
finally, cross-modal information fusion is realized in a weighted summation mode, and multi-level fusion characteristics are obtainedCan be expressed as:
wherein:
represents a channel-by-channel multiplication operation;
1 represents W n All 1 vectors of the same size;
The larger the value of W_n obtained in formula (2), the more important the corresponding channel of the RGB modality is relative to that of the thermal infrared modality, and vice versa. When the values in both W_n and 1-W_n are 0.5, this degenerates to the special case of equal-weight fusion; when the values of W_n are all 0 or all 1, this degenerates to the special case of using only thermal infrared or only RGB single-modality information.
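The channel-by-channel weighted fusion can be sketched in numpy as follows. A random linear projection of the pooled features stands in for the module's learned convolution blocks, so the weights here are illustrative only; with the projection set to zero the sigmoid gives W = 0.5 everywhere, reproducing the equal-weight special case noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weighted_fuse(f_rgb, f_t, proj):
    """Fuse paired (C, H, W) features channel by channel.

    A global-average-pooled concatenation of the two features is
    projected to C logits (a stand-in for the learned conv blocks),
    squashed by a sigmoid to give W in (0, 1), and the fused feature
    is W * f_rgb + (1 - W) * f_t per channel."""
    gap = np.concatenate([f_rgb.mean(axis=(1, 2)), f_t.mean(axis=(1, 2))])  # (2C,)
    w = sigmoid(proj @ gap)        # (C,) channel weights for the RGB branch
    w = w[:, None, None]
    return w * f_rgb + (1.0 - w) * f_t

c, h, wd = 8, 4, 4
f_rgb = rng.standard_normal((c, h, wd))
f_t = rng.standard_normal((c, h, wd))
proj = rng.standard_normal((c, 2 * c)) * 0.1  # random stand-in weights
fused = channel_weighted_fuse(f_rgb, f_t, proj)
```

Because each weight lies in (0, 1), every fused value is a convex combination of the two modalities' values at that position.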
(3) Constructing a multi-scale space and channel context module, and mining multi-scale context information and long-term dependence on space and channel dimensions of the multi-scale context information:
as shown in fig. 4 and fig. 5, the multi-level fusion feature obtained in the step (2) is firstly obtained, then a space correlation matrix and a channel correlation matrix are obtained through calculation, the space correlation matrix and the channel correlation matrix are acted on the multi-scale feature, and the relation between multi-scale context information and long-term dependence on space and channel dimensions of the multi-scale context information is established;
As shown in fig. 4 and 5, existing methods cannot fully utilize context information and thus struggle with the problem of target diversity in the semantic segmentation task. The invention therefore uses a hole convolution pyramid structure to extract multi-scale context information and establishes long-term dependencies of the multi-scale features in the spatial and channel dimensions, so as to mine richer context information. In addition, to alleviate the information loss in this process, the invention also establishes long-term dependencies in the spatial and channel dimensions on the original input features and fuses them into the multi-scale features as supplementary information to ensure the integrity of the context information.
Specifically, the multi-scale spatial context module constructed by the invention is shown in fig. 4; its input is the fusion feature F_s^{fuse} obtained in step 2. The module includes a hole convolution pyramid structure, an autocorrelation matrix and a cross-spatial correlation matrix.
The hole convolution pyramid structure comprises four paths, each applying a 1×1 convolution with stride 1 and parameters θ_1, θ_2, θ_3, θ_4 respectively, C(*; θ_1), ..., C(*; θ_4), followed by a 3×3 hole convolution with stride 1 and hole rate 1, 6, 12 or 18 respectively. The four paths yield features d_1, d_2, d_3, d_4 of different scales, each with 256 channels, half the channel count of the input F_s^{fuse}. After the four features are concatenated (1024 channels), a 1×1 convolution with stride 1 and parameters θ_5, C(*; θ_5), produces a feature F_s^{ms} containing rich multi-scale context information, with the same channel count (512) as the input F_s^{fuse}, which can be expressed as:

F_s^{ms} = C(Cat(d_1, d_2, d_3, d_4); θ_5)   (4)
the obtained multi-scale characteristicsSize conversion to +.>And performs matrix multiplication operation with its own transpose matrix to obtain a characteristic multi-scale feature +.>An autocorrelation matrix of correlation between any two points in space, and an autocorrelation matrix M obtained ss ∈R HW×HW The method can be expressed as follows:
wherein:
(*) T representing a matrix transpose operation;
reshape represents the matrix dimension from R H×W×C Becomes R HW×C Is a size conversion operation of (a).
The cross-spatial correlation matrix M_cs ∈ R^{HW×HW} is obtained from the original input feature F_s^{fuse} in the same manner, computing the correlation between any two positions of the original input feature space as an information supplement to ensure the integrity of the context information; the cross-spatial correlation matrix M_cs can be expressed as:

M_cs = Reshape(F_s^{fuse}) · (Reshape(F_s^{fuse}))^T   (6)
The spatial autocorrelation matrix M_ss and the cross-spatial correlation matrix M_cs are summed element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^{HW×HW}, as in formula (7). It is then multiplied with the multi-scale feature F_s^{ms}, and a skip connection path is added to finally obtain the feature F_s^{out} containing rich multi-scale context information and its spatial long-term dependencies, as in formula (8).

M_s = Normalization(M_ss + M_cs)   (7)

F_s^{out} = Reshape'(M_s · Reshape(F_s^{ms})) + F_s^{ms}   (8)

wherein:
Normalization(*) denotes Min-Max normalization;
Reshape'(*) denotes the inverse operation of Reshape.
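The spatial correlation computation, through the total matrix M_s of formula (7) and the skip-connected output of formula (8), can be sketched in numpy as follows. The toy feature sizes are assumptions, and applying M_s by matrix product, as in standard non-local attention, is also an assumption about the lost formula images:

```python
import numpy as np

def spatial_correlation(feat):
    """Correlation between any two spatial positions of an (H, W, C)
    feature map: reshape to (HW, C) and multiply by its transpose,
    giving an (HW, HW) matrix."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c)
    return x @ x.T

def min_max_normalize(m):
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(1)
f_ms = rng.standard_normal((4, 4, 8))   # multi-scale feature (stand-in)
f_in = rng.standard_normal((4, 4, 8))   # original input feature (stand-in)

m_ss = spatial_correlation(f_ms)        # auto-correlation matrix
m_cs = spatial_correlation(f_in)        # cross (supplementary) correlation
m_s = min_max_normalize(m_ss + m_cs)    # total spatial correlation matrix

# apply to the multi-scale feature and add the skip connection
out = (m_s @ f_ms.reshape(16, 8)).reshape(4, 4, 8) + f_ms
```

The channel-dimension variant works the same way with the feature reshaped to (C, HW) instead, yielding a C×C matrix.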
The multi-scale channel context module constructed by the invention is shown in fig. 5; its input is the fusion feature F_c^{fuse} obtained in step 2, and it comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-channel correlation matrix.
The hole convolution pyramid structure comprises four paths, each applying a 1×1 convolution with stride 1 and parameters θ_6, θ_7, θ_8, θ_9 respectively, C(*; θ_6), ..., C(*; θ_9), followed by a 3×3 hole convolution with stride 1 and hole rate 1, 6, 12 or 18 respectively. The four paths yield features d_5, d_6, d_7, d_8 of different scales, each with 512 channels, half the channel count of the input F_c^{fuse}. After the four features are concatenated (2048 channels), a 1×1 convolution with stride 1 and parameters θ_10, C(*; θ_10), produces a feature F_c^{ms} containing rich multi-scale context information, with the same channel count (1024) as the input F_c^{fuse}, which can be expressed as:

F_c^{ms} = C(Cat(d_5, d_6, d_7, d_8); θ_10)   (9)
the obtained multi-scale characteristicsSize conversion to +.>And performs matrix multiplication operation with its own transpose matrix to obtain a characteristic multi-scale feature +.>Self-channel correlation matrix of correlation between any two channels, and obtained self-channel correlation matrix M sc ∈R 1024×1024 The following can be expressed:
The cross-channel correlation matrix M_cc ∈ R^(1024×1024) is obtained from the original input feature in the same manner. It calculates the correlation between any two channels of the original input feature and serves as an information supplement, further improving the completeness of the context information. It can be expressed as:
The auto-channel correlation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and then normalized to obtain the total channel correlation matrix M_c ∈ R^(1024×1024), as in equation (12). It is then multiplied element by element with the multi-scale features, again with a skip connection path added, to obtain features containing multi-scale context information and its channel long-range dependencies, as expressed in equation (13).
M_c = Normalization(M_sc + M_cc) (12)
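A sketch of the channel counterpart (again not part of the patent text): the correlation is taken between channel pairs (1024×1024) rather than spatial positions, and the "element-by-element multiplication" with M_c is interpreted as a matrix product for shape compatibility — an assumption.

```python
import torch

def channel_context(feat_ms, feat_in, eps=1e-8):
    """Channel-dimension context: correlations between channel pairs (C x C)."""
    c, h, w = feat_ms.shape
    x = feat_ms.reshape(c, h * w)        # multi-scale feature, C x HW
    y = feat_in.reshape(c, h * w)        # original input feature, C x HW
    m_sc = x @ x.t()                     # auto-channel correlation, C x C
    m_cc = y @ y.t()                     # cross-channel correlation, C x C
    m_c = m_sc + m_cc
    m_c = (m_c - m_c.min()) / (m_c.max() - m_c.min() + eps)  # eq. (12)
    # apply M_c and add the skip connection, eq. (13)
    return (m_c @ x).reshape(c, h, w) + feat_ms
```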
(4) Upsampling to recover resolution and predicting the semantic segmentation mask map for the RGB and thermal infrared image pair:
The feature map obtained in step (3) is restored to full resolution by a deconvolution operation; after a channel transformation operation and pixel-by-pixel classification with the softmax function, the semantic segmentation mask map is predicted.
the multi-scale characteristics obtained in the step 3 are processedThrough a convolution kernel of 2×2, step size of 16, parameter +.>Is->Restoring the resolution by 16 times, then using a convolution kernel of 1X 1, step size of 1, parameter +.>Convolution operation of->The number of channels of the feature map is converted into the number of categories of the data set, and the semantic segmentation mask map S is calculated by using a softmax function and can be expressed as follows:
(5) Training algorithm network to obtain model parameters
On the training data set, a supervised learning model is applied to the predicted semantic segmentation mask map from step (4) and the pseudo-image features generated in step (1); the algorithm network is trained end to end through the weighted cross-entropy loss function and the weighted mean absolute error loss function to obtain the network model parameters:
On the training data set, a supervised learning mechanism is adopted to compute the cross-entropy loss function L_s between the semantic segmentation prediction of the network model and the ground truth:
where m and n denote the width and height of the input image, (i, j) denotes the coordinates of a pixel, p(x_ij) is the ground-truth label of pixel x_ij, q(x_ij) is the prediction for the pixel, and w(x_ij) is the class weight coefficient of the pixel. The class weight coefficient w alleviates the class-imbalance problem in the data set; the weight coefficient w_i of the i-th class can be expressed as:
where c is a constant set to 1.1 and P_i denotes the proportion of pixels labeled as the i-th class among all pixels.
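The weight formula itself is not legible in the extracted text. The stated constant c = 1.1 and class proportion P_i match the ENet-style weighting w_i = 1/ln(c + P_i); the sketch below assumes that formula, which is our reconstruction rather than a confirmed part of the patent.

```python
import math
import torch
import torch.nn.functional as F

def class_weights(pixel_fractions, c=1.1):
    # assumed ENet-style weighting: w_i = 1 / ln(c + P_i); rare classes get
    # larger weights, alleviating class imbalance
    return torch.tensor([1.0 / math.log(c + p) for p in pixel_fractions])

def weighted_ce(logits, target, weights):
    # pixel-wise weighted cross-entropy L_s with per-class weights w(x_ij)
    return F.cross_entropy(logits, target, weight=weights)
```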
The computed cross-entropy loss function and the bi-modal difference loss L_MD in equation (1) together constitute the total loss function L_total, which can be expressed as:
L_total = λ_1·L_s(S, G) + λ_2·L_MD (17)
where λ_1 and λ_2 are hyper-parameters balancing the losses, S denotes the model prediction result, and G denotes the ground truth.
Further, the bidirectional modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality. Both parts adopt an "encoder-decoder-encoder" network with the same structure, in which the encoders use a ResNet-50 network and a ResNet-18 network, and the decoder uses an image generation network that produces the pseudo image through a bilinear-interpolation upsampling strategy.
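A sketch of one decoder stage of such an image generation network with bilinear-interpolation upsampling. The stage count, layer widths and the final 1×1 projection to a 3-channel pseudo image are illustrative assumptions, not details from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder stage: bilinear upsampling (x2) followed by a convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.relu(self.conv(x))

def make_decoder():
    # five 2x stages turn a 1/32-resolution ResNet-50 encoding (2048 ch) into a
    # full-size 3-channel pseudo image; channel widths are assumptions
    return nn.Sequential(
        UpBlock(2048, 1024), UpBlock(1024, 512), UpBlock(512, 256),
        UpBlock(256, 64), UpBlock(64, 32),
        nn.Conv2d(32, 3, kernel_size=1),
    )
```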
Further, in step (3), the multi-scale spatial context module and the multi-scale channel context module each take a fused feature from step (2) as input, so as to establish interactions between multi-scale context information and its long-range dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale space context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-channel correlation matrix.
Still further, step (31) includes:
(311) The hole convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters θ1, θ2, θ3 and θ4, denoted C(*;θ1), C(*;θ2), C(*;θ3), C(*;θ4), wherein:
a 3×3 hole convolution with stride 1 and hole rate 1;
a 3×3 hole convolution with stride 1 and hole rate 6;
a 3×3 hole convolution with stride 1 and hole rate 12;
a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths respectively obtain features d1, d2, d3 and d4 at different scales, each with half the channels of the input, i.e. 256 channels; the concatenated features then pass through a 1×1 convolution with stride 1 and parameter θ5, C(*;θ5), to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input;
(312) The multi-scale feature obtained in step (311) is reshaped and matrix-multiplied with its own transpose to obtain the auto-spatial correlation matrix M_ss ∈ R^(HW×HW);
(313) The cross-spatial correlation matrix M_cs ∈ R^(HW×HW) is obtained from the input original feature in the same manner as in step (312), serving as an information-supplement component;
(314) The auto-spatial correlation matrix M_ss and the cross-spatial correlation matrix M_cs are added element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^(HW×HW); it is then multiplied element by element with the multi-scale features, and a skip connection path is added to obtain features containing multi-scale context information and its spatial long-range dependencies.
Still further, step (32) includes:
(321) The hole convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters θ6, θ7, θ8 and θ9, denoted C(*;θ6), C(*;θ7), C(*;θ8), C(*;θ9), wherein:
a 3×3 hole convolution with stride 1 and hole rate 1;
a 3×3 hole convolution with stride 1 and hole rate 6;
a 3×3 hole convolution with stride 1 and hole rate 12;
a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths respectively obtain features d5, d6, d7 and d8 at different scales, each with half the channels of the input, i.e. 512 channels; the concatenated features are fed into a 1×1 convolution with stride 1 and parameter θ10, C(*;θ10), to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input;
(322) The multi-scale feature obtained in step (321) is reshaped and matrix-multiplied with its own transpose to obtain the auto-channel correlation matrix M_sc ∈ R^(1024×1024);
(323) The cross-channel correlation matrix M_cc ∈ R^(1024×1024) is obtained from the input original feature in the same manner as in step (322), serving as an information-supplement component;
(324) The auto-channel correlation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and normalized to obtain the total channel correlation matrix M_c ∈ R^(1024×1024); it is then multiplied element by element with the multi-scale features, and a skip connection path is added to obtain features containing multi-scale context information and its channel long-range dependencies.
Further, in step (4), a deconvolution operation upsamples the feature map to restore its resolution; then a 1×1 convolution with stride 1 and parameter θ11, C(*;θ11), changes the channel number into the number of classes of the data set, and finally the softmax function predicts the class of each pixel to obtain the semantic segmentation mask map.
The method trains the algorithm end to end, and the model parameters are obtained after the whole RGB-T semantic segmentation network is trained. Because the data set used to train the RGB-T semantic segmentation network (the MFNet data set) contains a limited amount of data, in order to ensure smooth training and avoid over-fitting to the training set, the data augmentation operations of random flipping, random cropping and noise injection are applied to the RGB-T image pairs in the data set.
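A sketch of such paired augmentation: the same flip and crop must be applied to the RGB image, the thermal image and the label map, while noise is injected into the inputs only. The crop size and noise level are illustrative assumptions.

```python
import random
import torch

def augment(rgb, thermal, label, crop=(416, 416), noise_std=0.01):
    """Paired RGB-T augmentation: random horizontal flip, random crop and
    noise injection, applied consistently across modalities and labels."""
    if random.random() < 0.5:                              # random flip
        rgb, thermal, label = rgb.flip(-1), thermal.flip(-1), label.flip(-1)
    _, h, w = rgb.shape                                    # random crop
    ch, cw = crop
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    rgb = rgb[:, top:top + ch, left:left + cw]
    thermal = thermal[:, top:top + ch, left:left + cw]
    label = label[top:top + ch, left:left + cw]
    rgb = rgb + noise_std * torch.randn_like(rgb)          # noise on inputs only,
    thermal = thermal + noise_std * torch.randn_like(thermal)  # never on labels
    return rgb, thermal, label
```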
The technical effects of the invention are further described below in connection with simulation experiments.
1. Simulation conditions: all simulation experiments are implemented with the PyTorch deep learning framework; the operating system is Ubuntu 16.04.5 and the hardware environment is an Nvidia GeForce GTX 1080 Ti GPU.
2. simulation content and result analysis:
Simulation 1
Semantic segmentation experiments are carried out on the public RGB-T image semantic segmentation data set MFNet, comparing the invention with existing RGB-image-based, RGB-D-based and RGB-T-based semantic segmentation methods, and part of the experimental results are compared visually. To ensure fairness of the experiments, each RGB-image-based semantic segmentation method is expanded into two branches, an RGB branch and a thermal infrared branch, and the prediction results of the two branches are added as the final semantic segmentation mask map; for the RGB-D-based semantic segmentation methods, the input depth image is directly replaced with the thermal infrared image.
Compared with the prior art, the method performs better on difficult RGB-T image semantic segmentation cases. Owing to the modal difference reduction and fusion strategy of the invention, the multi-modal complementary information can be better exploited in environments with poor illumination, so the semantic segmentation results are closer to the manually annotated ground-truth maps.
Simulation 2
The invention is compared with the existing RGB-image-based, RGB-D-based and RGB-T-based semantic segmentation methods in semantic segmentation experiments on the public RGB-T image semantic segmentation data set, and the results are evaluated objectively with widely accepted evaluation metrics. The evaluation results are shown in Table 1, where:
Acc denotes the per-class accuracy;
mAcc denotes the class-average accuracy;
IoU denotes the per-class intersection-over-union;
mIoU denotes the class-average intersection-over-union.
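These four metrics can be computed from a confusion matrix; the sketch below shows the standard definitions (it is a generic illustration, not the patent's evaluation code).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    # pred, gt: integer label arrays of the same shape; rows index ground truth
    k = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[k] + pred[k],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def metrics(cm):
    """Acc (per-class accuracy), mAcc, IoU (per-class) and mIoU from a
    confusion matrix."""
    acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    iou = np.diag(cm) / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm), 1)
    return acc, acc.mean(), iou, iou.mean()
```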
Higher values of these metrics are better. As can be seen from Table 1, the method segments RGB-T images more accurately, which fully demonstrates its effectiveness and superiority.
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various modifications may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (3)
1. The RGB-T image semantic segmentation method based on modal difference reduction is characterized by comprising the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting RGB features and thermal infrared features with more discriminative power for input RGB and thermal infrared registered image pairs, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces modal differences in a bidirectional manner, extracts RGB features and thermal infrared features with discrimination by reducing modal differences between each level feature of a pseudo image generated by an image conversion method and each level feature of a corresponding true image, then extracts each level feature of the RGB pseudo image and the thermal infrared pseudo image respectively, takes each level feature of the corresponding RGB true image and each level feature of the thermal infrared true image as supervision thereof, and builds a supervision learning model;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Acquiring the multi-level fusion characteristics obtained in the step (2), obtaining a space correlation matrix and a channel correlation matrix through calculation, acting the space correlation matrix and the channel correlation matrix on the multi-scale characteristics, constructing a multi-scale space context module and a multi-scale channel context module, and establishing a relation between multi-scale context information and long-term dependence on space and channel dimensions thereof;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
on a training data set, a supervised learning model is adopted for the prediction semantic segmentation mask graph of the step (4) and the pseudo image features generated in the step (1), and algorithm network training is finished end to end through a weighted cross entropy loss function and a weighted average absolute error loss function, so that network model parameters are obtained, wherein:
in step (1), the modal differences are reduced between the five different-resolution hierarchical features of the generated pseudo thermal infrared images extracted by a ResNet-18 network and the five different-resolution hierarchical features of their corresponding true thermal infrared images extracted by a ResNet-18 network,
and between the five different-resolution hierarchical features of the generated pseudo RGB images extracted by a ResNet-18 network and the five different-resolution hierarchical features of their corresponding true RGB images extracted by a ResNet-18 network,
so as to obtain the more discriminative RGB multi-level features extracted by a ResNet-50 network and their corresponding thermal infrared multi-level features extracted by a ResNet-50 network;
the adaptive channel weighted fusion module in step (2) takes the first four levels of features of the RGB images obtained in step (1) and the first four levels of features of their corresponding thermal infrared images as input, adaptively generates the RGB weight vectors W_1, W_2, W_3, W_4 of the corresponding levels and the thermal infrared weight vectors 1-W_1, 1-W_2, 1-W_3, 1-W_4 of the corresponding levels, and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features;
The multi-scale spatial context module and the multi-scale channel context module in step (3) each take a fused feature from step (2) as input, so as to establish interactions between multi-scale context information and its long-range dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale space context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a hole convolution pyramid structure, an autocorrelation matrix and a cross-channel correlation matrix;
step (31) includes:
(311) The hole convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters θ1, θ2, θ3 and θ4, denoted C(*;θ1), C(*;θ2), C(*;θ3), C(*;θ4), and:
a 3×3 hole convolution with stride 1 and hole rate 1;
a 3×3 hole convolution with stride 1 and hole rate 6;
a 3×3 hole convolution with stride 1 and hole rate 12;
a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths respectively obtain features d1, d2, d3 and d4 at different scales, each with half the channels of the input, i.e. 256 channels; the concatenated features then pass through a 1×1 convolution with stride 1 and parameter θ5, C(*;θ5), to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input;
(312) The multi-scale feature obtained in step (311) is reshaped and matrix-multiplied with its own transpose to obtain the auto-spatial correlation matrix M_ss ∈ R^(HW×HW);
(313) The input original feature is reshaped and matrix-multiplied with its own transpose to obtain the cross-spatial correlation matrix M_cs ∈ R^(HW×HW), serving as an information-supplement component;
(314) The auto-spatial correlation matrix M_ss and the cross-spatial correlation matrix M_cs are added element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^(HW×HW); it is then multiplied element by element with the multi-scale features, and a skip connection path is added to obtain features containing multi-scale context information and its spatial long-range dependencies;
The step (32) includes:
(321) The hole convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters θ6, θ7, θ8 and θ9, denoted C(*;θ6), C(*;θ7), C(*;θ8), C(*;θ9), and:
a 3×3 hole convolution with stride 1 and hole rate 1;
a 3×3 hole convolution with stride 1 and hole rate 6;
a 3×3 hole convolution with stride 1 and hole rate 12;
a 3×3 hole convolution with stride 1 and hole rate 18.
The four paths respectively obtain features d5, d6, d7 and d8 at different scales, each with half the channels of the input, i.e. 512 channels; the concatenated features are fed into a 1×1 convolution with stride 1 and parameter θ10, C(*;θ10), to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input;
(322) The multi-scale feature obtained in step (321) is reshaped and matrix-multiplied with its own transpose to obtain the auto-channel correlation matrix M_sc ∈ R^(1024×1024);
(323) The input original feature is reshaped and matrix-multiplied with its own transpose to obtain the cross-channel correlation matrix M_cc ∈ R^(1024×1024), serving as an information-supplement component;
(324) The auto-channel correlation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and normalized to obtain the total channel correlation matrix M_c ∈ R^(1024×1024); it is then multiplied element by element with the multi-scale features, and a skip connection path is added to obtain features containing multi-scale context information and its channel long-range dependencies.
2. The RGB-T image semantic segmentation method based on modal difference reduction according to claim 1, wherein the bidirectional modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality; both parts adopt an "encoder-decoder-encoder" network with the same structure, in which the encoders use a ResNet-50 network and a ResNet-18 network, and the decoder uses an image generation network that generates the pseudo image through a bilinear-interpolation upsampling strategy.
3. The method of claim 1, wherein in step (4), a deconvolution operation upsamples the feature map to restore its resolution, a 1×1 convolution with stride 1 and parameter θ11, C(*;θ11), changes the channel number into the number of classes of the data set, and finally the softmax function predicts the class of each pixel to obtain the semantic segmentation mask map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110187778.8A CN112991350B (en) | 2021-02-18 | 2021-02-18 | RGB-T image semantic segmentation method based on modal difference reduction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112991350A CN112991350A (en) | 2021-06-18 |
CN112991350B true CN112991350B (en) | 2023-06-27 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113362349A (en) * | 2021-07-21 | 2021-09-07 | 浙江科技学院 | Road scene image semantic segmentation method based on multi-supervision network |
CN113591685B (en) * | 2021-07-29 | 2023-10-27 | 武汉理工大学 | Geographic object spatial relationship identification method and system based on multi-scale pooling |
CN114330279B (en) * | 2021-12-29 | 2023-04-18 | 电子科技大学 | Cross-modal semantic consistency recovery method |
CN114708568B (en) * | 2022-06-07 | 2022-10-04 | 东北大学 | Pure vision automatic driving control system, method and medium based on improved RTFNet |
CN115115919B (en) * | 2022-06-24 | 2023-05-05 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115240042B (en) * | 2022-07-05 | 2023-05-16 | 抖音视界有限公司 | Multi-mode image recognition method and device, readable medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969634A (en) * | 2019-11-29 | 2020-04-07 | 国网湖北省电力有限公司检修公司 | Infrared image power equipment segmentation method based on generation countermeasure network |
CN111462128A (en) * | 2020-05-28 | 2020-07-28 | 南京大学 | Pixel-level image segmentation system and method based on multi-modal spectral image |
WO2020151536A1 (en) * | 2019-01-25 | 2020-07-30 | 腾讯科技(深圳)有限公司 | Brain image segmentation method, apparatus, network device and storage medium |
CN111666977A (en) * | 2020-05-09 | 2020-09-15 | 西安电子科技大学 | Shadow detection method of monochrome image |
CN112101410A (en) * | 2020-08-05 | 2020-12-18 | 中国科学院空天信息创新研究院 | Image pixel semantic segmentation method and system based on multi-modal feature fusion |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784654B (en) * | 2016-08-26 | 2020-09-25 | 杭州海康威视数字技术股份有限公司 | Image segmentation method and device and full convolution network system |
US10956787B2 (en) * | 2018-05-14 | 2021-03-23 | Quantum-Si Incorporated | Systems and methods for unifying statistical models for different data modalities |
Non-Patent Citations (2)
Title |
---|
Revisiting Feature Fusion for RGB-T Salient Object Detection; Qiang Zhang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2020-08-06; vol. 2020; full text *
Research and Prospects of Cross-Modal Person Re-identification (跨模态行人重识别研究与展望); Chen Dan et al.; Computer Systems & Applications (《计算机系统应用》); 2020-10-31; vol. 29, no. 10; full text *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||