CN112991350B - RGB-T image semantic segmentation method based on modal difference reduction - Google Patents


Info

Publication number
CN112991350B
CN112991350B CN202110187778.8A CN112991350A
Authority
CN
China
Prior art keywords
features
channel
rgb
correlation matrix
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110187778.8A
Other languages
Chinese (zh)
Other versions
CN112991350A (en)
Inventor
张强
赵什陆
黄年昌
张鼎文
韩军功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110187778.8A priority Critical patent/CN112991350B/en
Publication of CN112991350A publication Critical patent/CN112991350A/en
Application granted granted Critical
Publication of CN112991350B publication Critical patent/CN112991350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an RGB-T image semantic segmentation method based on modal difference reduction, which comprises the following steps: (1) constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model; (2) constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion of the multi-level RGB features and thermal infrared features through the weighted fusion module to obtain multi-level fusion features; (3) taking the multi-level fusion features and obtaining a spatial correlation matrix and a channel correlation matrix through calculation; (4) restoring the spatial correlation matrix and the channel correlation matrix to full resolution through a deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through a channel transformation operation and a softmax function; and (5) training the algorithm network to obtain the model parameters.

Description

RGB-T image semantic segmentation method based on modal difference reduction
Technical Field
The invention belongs to the field of image processing and relates to an RGB-T image semantic segmentation method based on modal difference reduction, which can be used in the image preprocessing stage of computer vision tasks.
Background
Semantic segmentation aims at assigning a class label to each pixel in a natural image using a model or algorithm. As one of the key technologies for scene perception, semantic segmentation plays a vital role in computer vision tasks such as autonomous driving, pedestrian detection and medical image analysis.
Existing semantic segmentation methods can be divided into two main categories: traditional semantic segmentation methods and semantic segmentation methods based on deep learning. Traditional semantic segmentation methods mainly complete image semantic segmentation by combining low-level hand-crafted features with a shallow classifier. Such methods have poor robustness and find it difficult to obtain satisfactory results in complex scenes. With the wide application of deep learning technology, semantic segmentation methods based on deep learning have made breakthrough progress, and compared with traditional methods they achieve better segmentation results and stronger robustness.
So far, RGB image semantic segmentation methods based on deep learning have achieved remarkable results. However, under poor lighting conditions the performance of these algorithms may degrade significantly. The thermal infrared image can provide contour information and semantic information of the target, and can effectively complement the RGB image.
The existing RGB-T semantic segmentation methods usually adopt simple strategies to capture the complementary information in RGB images and thermal infrared images. For example, "Yuxiang Sun, Weixun Zuo, and Ming Liu. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. RAL, 4(3):2576-2583, 2019" directly fuses the features of the two modality images at each level in the encoder using only element-wise addition; "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" directly fuses the features of the two modality images at each level in the decoder using only concatenation. These methods do not take into account the modal differences between RGB images and thermal infrared images caused by their different imaging mechanisms. As a result, simple fusion strategies cannot fully exploit the cross-modality complementary information, which reduces the accuracy of RGB-T image semantic segmentation.
In addition, the diversity of objects in the image to be segmented, such as their class, size and shape, is also one of the key issues in the semantic segmentation task. In single-modality RGB image semantic segmentation algorithms, multi-scale context information and its long-term dependencies have proven to be effective means of addressing this problem. In RGB-T semantic segmentation, however, multi-scale context information and its long-term dependencies have not been well mined and utilized; only "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" uses a parallel structure of convolutions with two different receptive fields to acquire a small amount of context information, which has a very limited effect on RGB-T semantic segmentation in complex scenes, so the problem of target diversity still cannot be effectively solved.
Disclosure of Invention
The invention aims to: in view of the defects of the prior art, provide an RGB-T image semantic segmentation method based on modal difference reduction, which mainly solves the problems that the prior art neither considers the modal differences between visible-light images and thermal infrared images nor makes sufficient use of context information, resulting in low semantic segmentation accuracy.
The key to realizing the invention is that the modal differences between the RGB features and the thermal infrared features are reduced and the features are then fused in the network encoding stage, so that the fused features are more discriminative, and the multi-scale context information of the fused features and its long-term dependencies are fully mined.
The technical scheme is as follows: an RGB-T image semantic segmentation method based on modal difference reduction comprises the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from the input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Acquiring the multi-level fusion characteristics obtained in the step (2), obtaining a space correlation matrix and a channel correlation matrix through calculation, acting the space correlation matrix and the channel correlation matrix on the multi-scale characteristics, and establishing a relation between multi-scale context information and long-term dependence on space and channel dimensions thereof;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
and (3) on a training data set, adopting a supervised learning model for the predicted semantic segmentation mask graph in the step (4) and the pseudo image features generated in the step (1), and completing algorithm network training end to end through the weighted cross entropy loss function and the weighted average absolute error loss function to obtain network model parameters.
Further, the bidirectional modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality; the two parts adopt encoder-decoder-encoder networks with the same structure, wherein the encoders use a ResNet-50 network and a ResNet-18 network, and the decoder uses an image generation network to generate a pseudo image through a bilinear-interpolation up-sampling strategy.
Further, in step (1), the sub-network simultaneously reduces the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the generated pseudo thermal infrared image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{T}^{i}\}_{i=1}^{5}$ of the corresponding true thermal infrared image, extracted by another ResNet-18 network, as well as the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the generated pseudo RGB image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{RGB}^{i}\}_{i=1}^{5}$ of the corresponding true RGB image, extracted by another ResNet-18 network.
To obtain the more discriminative RGB multi-level features $\{F_{RGB}^{i}\}_{i=1}^{5}$ extracted by a ResNet-50 network and the corresponding thermal infrared multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ extracted by another ResNet-50 network, $\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$.
Further, the adaptive channel weighted fusion module in step (2) takes the first four levels of RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ of the RGB image obtained in step (1) and the first four levels of features $\{F_{T}^{i}\}_{i=1}^{4}$ of the corresponding thermal infrared image as input, adaptively generates the RGB weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the corresponding levels and the thermal infrared weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding levels, and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features $\{F_{fuse}^{i}\}_{i=1}^{4}$.
Further, in step (3), the inputs of the multi-scale spatial context module and the multi-scale channel context module are the fusion features $F_{fuse}^{3}$ and $F_{fuse}^{4}$, respectively, so as to establish the interaction between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale spatial context module comprises a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix.
Still further, step (31) includes:
(311) The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, denoted $C(*;\theta_1)$, $C(*;\theta_2)$, $C(*;\theta_3)$, $C(*;\theta_4)$, which are followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_1^{d}$, denoted $D(*;\theta_1^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_2^{d}$, denoted $D(*;\theta_2^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_3^{d}$, denoted $D(*;\theta_3^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_4^{d}$, denoted $D(*;\theta_4^{d})$;
the four paths respectively yield features $d_1$, $d_2$, $d_3$, $d_4$ of different scales, each with 256 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameter $\theta_5$, $C(*;\theta_5)$, giving a feature $F_{ms}$ that contains rich multi-scale context information and has the same number of channels as the input $F_{fuse}^{3}$;
(312) The multi-scale feature $F_{ms}$ obtained in step (311) is size-transformed and multiplied with its own transpose matrix to obtain the auto-spatial correlation matrix $M_{ss}\in\mathbb{R}^{HW\times HW}$;
(313) The original input feature $F_{fuse}^{3}$ is processed in the same manner as in step (312) to obtain the cross-spatial correlation matrix $M_{cs}\in\mathbb{R}^{HW\times HW}$ as an information-supplementing part;
(314) The auto-spatial correlation matrix $M_{ss}$ and the cross-spatial correlation matrix $M_{cs}$ are added element by element and normalized to obtain the total spatial correlation matrix $M_{s}\in\mathbb{R}^{HW\times HW}$, which is then multiplied with the multi-scale feature $F_{ms}$; a skip connection path is added to obtain the feature $F_{MSC}$ containing multi-scale context information and its spatial long-term dependencies.
Still further, step (32) includes:
(321) The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_6$, $\theta_7$, $\theta_8$, $\theta_9$, denoted $C(*;\theta_6)$, $C(*;\theta_7)$, $C(*;\theta_8)$, $C(*;\theta_9)$, which are followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_5^{d}$, denoted $D(*;\theta_5^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_6^{d}$, denoted $D(*;\theta_6^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_7^{d}$, denoted $D(*;\theta_7^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_8^{d}$, denoted $D(*;\theta_8^{d})$;
the four paths respectively yield features $d_5$, $d_6$, $d_7$, $d_8$ of different scales, each with 512 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameter $\theta_{10}$, $C(*;\theta_{10})$, giving a feature $F_{mc}$ that contains rich multi-scale context information and has the same number of channels as the input $F_{fuse}^{4}$;
(322) The multi-scale feature $F_{mc}$ obtained in step (321) is size-transformed and multiplied with its own transpose matrix to obtain the auto-channel correlation matrix $M_{sc}\in\mathbb{R}^{1024\times 1024}$;
(323) The original input feature $F_{fuse}^{4}$ is processed in the same manner as in step (322) to obtain the cross-channel correlation matrix $M_{cc}\in\mathbb{R}^{1024\times 1024}$ as an information-supplementing part;
(324) The auto-channel correlation matrix $M_{sc}$ and the cross-channel correlation matrix $M_{cc}$ are added element by element and normalized to obtain the total channel correlation matrix $M_{c}\in\mathbb{R}^{1024\times 1024}$, which is then multiplied with the multi-scale feature $F_{mc}$; a skip connection path is added to obtain the feature $F_{MCC}$ containing multi-scale context information and its channel long-term dependencies.
Further, in step (4), a deconvolution operation is used to up-sample the feature map and restore the resolution; a 1×1 convolution with stride 1 and parameter $\theta_{11}$, $C(*;\theta_{11})$, then changes the number of channels to the number of classes of the data set, and finally a softmax function predicts the class to which each pixel belongs, giving the semantic segmentation mask map.
The beneficial effects are that: compared with the prior art, the RGB-T image semantic segmentation method based on modal difference reduction has the following beneficial effects:
1) The invention can realize end-to-end pixel-level semantic segmentation prediction for RGB-T image pairs without manual design and feature extraction; simulation results show that the invention remarkably improves semantic segmentation accuracy and also achieves better segmentation of small targets and in complex scenes;
2) The invention designs a "reduce first, then fuse" strategy: it first reduces the modal differences between the multi-modal data caused by their different imaging mechanisms by constructing a method based on bidirectional image conversion, and then adaptively selects strongly discriminative multi-modal features to improve the RGB-T semantic segmentation effect. Compared with existing methods, the multi-modal features extracted by the method are more discriminative, which helps to improve the accuracy of target category prediction;
3) The invention fully mines rich context information by establishing the interaction between the multi-scale context information of the cross-modal features and its long-term dependencies in the spatial and channel dimensions, which helps to solve the problem of target diversity. Compared with existing methods, the method can better segment targets of different scales, while improving the completeness of segmentation inside the targets.
Drawings
FIG. 1 is a flow chart of a RGB-T image semantic segmentation method based on modal difference reduction;
FIG. 2 is a schematic diagram of an algorithm network of an RGB-T image semantic segmentation method based on modal difference reduction, wherein a dotted line frame represents a bidirectional modal difference reduction sub-network, CWF represents an adaptive channel weighting fusion module, MSC represents a multi-scale space context module, and MCC represents a multi-scale channel context module;
fig. 3 is a schematic diagram of a framework of an adaptive channel weighted fusion module (CWF) according to the present invention;
FIG. 4 is a diagram of a multi-scale space context Module (MSC) framework in accordance with the present invention;
fig. 5 is a multi-scale channel context Module (MCC) framework diagram according to the present invention.
The specific embodiments are as follows:
Specific embodiments of the invention are described in detail below.
Referring to fig. 1, a method for semantic segmentation of RGB-T images based on modal disparity reduction includes the steps of:
(1) Constructing a bi-directional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input RGB and thermal infrared registered image pair and simultaneously constructing a supervised learning model, wherein:
as shown in fig. 2, the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
step 1) when the characteristic difference reduction from RGB mode to thermal infrared mode is carried out, the ResNet-50 is used for extracting the multi-level characteristics of the RGB image
$\{F_{RGB}^{i}\}_{i=1}^{5}$, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the resolution of the input image, and whose channel numbers are 64, 256, 512, 1024 and 2048, respectively. Then four 3×3 convolutions with stride 1 are used to reduce the multi-level features to single-channel feature maps, which are up-sampled by bilinear interpolation and summed to generate the pseudo thermal infrared image $I_{pse\text{-}T}$. The multi-level features $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the pseudo thermal infrared image $I_{pse\text{-}T}$ are extracted with a ResNet-18, the five multi-level features $\{f_{T}^{i}\}_{i=1}^{5}$ of different resolutions of the corresponding true thermal infrared image are simultaneously extracted with another ResNet-18, and the differences between the true and pseudo features of the corresponding levels are calculated.
Similarly, when reducing the feature differences from the thermal infrared modality to the RGB modality, the multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ of the thermal infrared image are first extracted with a ResNet-50, and a three-channel pseudo RGB image $I_{pse\text{-}RGB}$ is generated in the same manner. The multi-level features $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the pseudo RGB image $I_{pse\text{-}RGB}$ are then extracted with a ResNet-18, the five multi-level features $\{f_{RGB}^{i}\}_{i=1}^{5}$ of different resolutions of the corresponding true RGB image are simultaneously extracted with another ResNet-18, and the differences between the true and pseudo features of the corresponding levels are calculated.
$\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$.
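The pseudo-image generation described above can be sketched as follows; this is a minimal PyTorch-style illustration under stated assumptions (the module name and the set of feature levels are illustrative; out_channels is 1 for the pseudo thermal infrared image and 3 for the pseudo RGB image), not the exact decoder of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoImageDecoder(nn.Module):
    """Sketch of the image-translation decoder: each multi-level feature is
    reduced to a single-channel (or 3-channel) map by a 3x3 convolution,
    up-sampled to the input resolution by bilinear interpolation, and summed."""
    def __init__(self, in_channels=(64, 256, 512, 1024), out_channels=1):
        super().__init__()
        # one 3x3, stride-1 convolution per feature level
        self.to_map = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels
        )

    def forward(self, feats, out_size):
        # feats: list of multi-level features; out_size: (H, W) of the input image
        maps = [F.interpolate(conv(f), size=out_size, mode="bilinear", align_corners=False)
                for conv, f in zip(self.to_map, feats)]
        return torch.stack(maps, dim=0).sum(dim=0)   # e.g. I_pse-T or I_pse-RGB
```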
In the bidirectional modal difference reduction sub-network, the total modal difference $L_{MD}$ is the sum of the differences between the true and pseudo thermal infrared multi-level features and the differences between the true and pseudo RGB multi-level features, which can be expressed as:
$$L_{MD}=\sum_{i=1}^{5}L_{1}\big(f_{pse\text{-}T}^{i},f_{T}^{i}\big)+\sum_{i=1}^{5}L_{1}\big(f_{pse\text{-}RGB}^{i},f_{RGB}^{i}\big) \qquad (1)$$
wherein $L_{1}(\cdot,\cdot)$ denotes the mean absolute error.
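Read literally, formula (1) is a sum of mean absolute errors over the five feature levels of both conversion directions; a minimal PyTorch-style sketch of that reading (the function name and the absence of per-level weights are assumptions) is:

```python
import torch.nn.functional as F

def modal_difference_loss(pse_t_feats, true_t_feats, pse_rgb_feats, true_rgb_feats):
    """L_MD of formula (1): sum of L1 (mean absolute error) differences between
    pseudo and true multi-level features of both modalities (five levels each)."""
    loss = 0.0
    for f_pse, f_true in zip(pse_t_feats, true_t_feats):
        loss = loss + F.l1_loss(f_pse, f_true)
    for f_pse, f_true in zip(pse_rgb_feats, true_rgb_feats):
        loss = loss + F.l1_loss(f_pse, f_true)
    return loss
```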
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion of the multi-level RGB features and the thermal infrared features obtained in step (1) through the weighted fusion module to obtain multi-level fusion features, so that feature channels with strong discriminative ability are better selected from the multi-modal features;
Four levels of RGB features with different resolutions and the corresponding thermal infrared features are obtained in step (1), and the module fuses the RGB features and the thermal infrared features of each level, so that four levels of fusion features are obtained in total. Meanwhile, ResNet-50 is also used to extract features from the fused stream: specifically, the fusion module first fuses the first-level RGB and thermal infrared features; the resulting fusion feature is down-sampled by a ResNet-50 residual block and then added to the second-level fusion feature obtained with the fusion module; the subsequent levels are processed in the same way.
As shown in FIG. 3, the channel-by-channel weighted fusion module takes as input the RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ with reduced modal differences obtained in step 1 and the corresponding thermal infrared features $\{F_{T}^{i}\}_{i=1}^{4}$; the last-level features $F_{RGB}^{5}$ and $F_{T}^{5}$ are discarded to save network computation. The multi-modal features of the corresponding level are concatenated, and the corresponding weight vectors are predicted by four convolution block operations, each comprising a 3×3 convolution with stride 1 and parameters $\theta_{n}^{w_1}$, $C(*;\theta_{n}^{w_1})$, and a 1×1 convolution with stride 1 and parameters $\theta_{n}^{w_2}$, $C(*;\theta_{n}^{w_2})$. These compute the relative importance of paired features from different modalities in the same channel, i.e. the weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the RGB modality and the weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding multi-level thermal infrared modality, which can be expressed as:
$$W_{n}=\sigma\Big(\mathrm{GAP}\big(C\big(C\big(\mathrm{Cat}(F_{RGB}^{n},F_{T}^{n});\theta_{n}^{w_1}\big);\theta_{n}^{w_2}\big)\big)\Big),\quad n=1,\ldots,4 \qquad (2)$$
wherein:
GAP(*) represents the global average pooling operation;
Cat(*) denotes the concatenation operation;
σ represents the sigmoid activation function.
Finally, cross-modal information fusion is realized by weighted summation, obtaining the multi-level fusion features $\{F_{fuse}^{n}\}_{n=1}^{4}$, which can be expressed as:
$$F_{fuse}^{n}=W_{n}\odot F_{RGB}^{n}+(1-W_{n})\odot F_{T}^{n},\quad n=1,\ldots,4 \qquad (3)$$
wherein:
⊙ represents the channel-by-channel multiplication operation;
1 represents an all-ones vector of the same size as $W_n$.
The larger the value of $W_n$ obtained from formula (2), the more important the corresponding channel of the RGB modality is compared with that of the thermal infrared modality, and vice versa. When the values in the two weight vectors $W_n$ and $1-W_n$ are all 0.5, this reduces to the special case of equal-weight fusion; when the values of $W_n$ are all 0 or all 1, this reduces to the special case of using only thermal infrared or only RGB single-modality information.
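A minimal PyTorch-style sketch of this fusion module is given below; the placement of the global average pooling and the sigmoid relative to the two convolutions is an assumption, and the class name is illustrative rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Sketch of the channel weighted fusion (CWF) module: the paired RGB and
    thermal features are concatenated, a 3x3 and a 1x1 convolution predict a
    per-channel response, and global average pooling followed by a sigmoid
    yields the RGB weight vector W_n (the thermal weight is 1 - W_n)."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_t):
        x = torch.cat([f_rgb, f_t], dim=1)                    # Cat(*)
        x = self.conv1(self.conv3(x))                         # convolution block
        w = torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))   # GAP + sigmoid -> W_n
        return w * f_rgb + (1.0 - w) * f_t                    # formula (3)
```

For example, `ChannelWeightedFusion(256)` would fuse a pair of 256-channel RGB and thermal features into one 256-channel fusion feature.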
(3) Constructing a multi-scale space and channel context module, and mining multi-scale context information and long-term dependence on space and channel dimensions of the multi-scale context information:
as shown in fig. 4 and fig. 5, the multi-level fusion features obtained in step (2) are first taken; the spatial correlation matrix and the channel correlation matrix are then obtained by calculation and applied to the multi-scale features, establishing the relation between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions;
existing methods cannot fully utilize context information and therefore have difficulty coping with the target diversity problem in the semantic segmentation task, so the method uses a dilated convolution pyramid structure to extract multi-scale context information and establishes long-term dependencies of the multi-scale features in the spatial and channel dimensions in order to mine richer context information. In addition, to alleviate the information loss in this process, the invention also establishes long-term dependencies of the original input features in the spatial and channel dimensions and fuses them as supplementary information into the multi-scale features to ensure the integrity of the context information.
Specifically, the multi-scale spatial context module constructed by the invention is shown in fig. 4; its input is the fusion feature $F_{fuse}^{3}$ obtained in step 2. The module includes a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix.
The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, denoted $C(*;\theta_1)$, $C(*;\theta_2)$, $C(*;\theta_3)$, $C(*;\theta_4)$, followed respectively by a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_1^{d}$, $D(*;\theta_1^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_2^{d}$, $D(*;\theta_2^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_3^{d}$, $D(*;\theta_3^{d})$; and a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_4^{d}$, $D(*;\theta_4^{d})$. The four paths respectively yield features $d_1$, $d_2$, $d_3$, $d_4$ of different scales, each with 256 channels. After the four features are concatenated (1024 channels), a 1×1 convolution with stride 1 and parameter $\theta_5$, $C(*;\theta_5)$, gives a feature $F_{ms}$ containing rich multi-scale context information, with the same number of channels as the input $F_{fuse}^{3}$ (512 channels), which can be expressed as:
$$F_{ms}=C\big(\mathrm{Cat}(d_{1},d_{2},d_{3},d_{4});\theta_{5}\big) \qquad (4)$$
The obtained multi-scale feature $F_{ms}\in\mathbb{R}^{H\times W\times C}$ is size-transformed to $\mathbb{R}^{HW\times C}$ and multiplied with its own transpose matrix, characterizing the correlation between any two spatial positions of the multi-scale feature $F_{ms}$; the resulting auto-spatial correlation matrix $M_{ss}\in\mathbb{R}^{HW\times HW}$ can be expressed as:
$$M_{ss}=\mathrm{Reshape}(F_{ms})\otimes\mathrm{Reshape}(F_{ms})^{T} \qquad (5)$$
wherein:
$\otimes$ represents the matrix multiplication operation;
$(*)^{T}$ represents the matrix transpose operation;
$\mathrm{Reshape}(*)$ represents the size transformation operation that changes the matrix dimension from $\mathbb{R}^{H\times W\times C}$ to $\mathbb{R}^{HW\times C}$.
The original input feature $F_{fuse}^{3}$ is processed in the same manner to obtain the cross-spatial correlation matrix $M_{cs}\in\mathbb{R}^{HW\times HW}$, which computes the correlation between any two spatial positions of the original input feature and serves as supplementary information to ensure the integrity of the context information; the cross-spatial correlation matrix $M_{cs}$ can be expressed as:
$$M_{cs}=\mathrm{Reshape}(F_{fuse}^{3})\otimes\mathrm{Reshape}(F_{fuse}^{3})^{T} \qquad (6)$$
The auto-spatial correlation matrix $M_{ss}$ and the cross-spatial correlation matrix $M_{cs}$ are summed element by element and then normalized to obtain the total spatial correlation matrix $M_{s}\in\mathbb{R}^{HW\times HW}$, as in formula (7). It is then multiplied with the multi-scale feature $F_{ms}$ and a skip connection path is added, finally giving the feature $F_{MSC}$ containing rich multi-scale context information and its spatial long-term dependencies, as in formula (8).
$$M_{s}=\mathrm{Normalization}(M_{ss}+M_{cs}) \qquad (7)$$
$$F_{MSC}=\mathrm{Reshape}'\big(M_{s}\otimes\mathrm{Reshape}(F_{ms})\big)+F_{ms} \qquad (8)$$
wherein:
$\mathrm{Normalization}(*)$ represents Min-Max normalization;
$\mathrm{Reshape}'(*)$ denotes the inverse operation of $\mathrm{Reshape}$.
The multi-scale channel context module constructed by the invention is shown in fig. 5; its input is the fusion feature $F_{fuse}^{4}$ obtained in step 2, and it comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix.
The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_6$, $\theta_7$, $\theta_8$, $\theta_9$, denoted $C(*;\theta_6)$, $C(*;\theta_7)$, $C(*;\theta_8)$, $C(*;\theta_9)$, followed respectively by a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_5^{d}$, $D(*;\theta_5^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_6^{d}$, $D(*;\theta_6^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_7^{d}$, $D(*;\theta_7^{d})$; and a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_8^{d}$, $D(*;\theta_8^{d})$. The four paths respectively yield features $d_5$, $d_6$, $d_7$, $d_8$ of different scales, each with 512 channels. After the four features are concatenated (2048 channels), a 1×1 convolution with stride 1 and parameter $\theta_{10}$, $C(*;\theta_{10})$, gives a feature $F_{mc}$ containing rich multi-scale context information, with the same number of channels as the input $F_{fuse}^{4}$ (1024 channels), which can be expressed as:
$$F_{mc}=C\big(\mathrm{Cat}(d_{5},d_{6},d_{7},d_{8});\theta_{10}\big) \qquad (9)$$
The obtained multi-scale feature $F_{mc}$ is size-transformed to $\mathbb{R}^{1024\times HW}$ and multiplied with its own transpose matrix, characterizing the correlation between any two channels of the multi-scale feature $F_{mc}$; the resulting auto-channel correlation matrix $M_{sc}\in\mathbb{R}^{1024\times 1024}$ can be expressed as:
$$M_{sc}=\mathrm{Reshape}(F_{mc})\otimes\mathrm{Reshape}(F_{mc})^{T} \qquad (10)$$
The original input feature $F_{fuse}^{4}$ is processed in the same manner to obtain the cross-channel correlation matrix $M_{cc}\in\mathbb{R}^{1024\times 1024}$, which computes the correlation between any two channels of the original input feature and serves as supplementary information, further improving the integrity of the context information; it can be expressed as:
$$M_{cc}=\mathrm{Reshape}(F_{fuse}^{4})\otimes\mathrm{Reshape}(F_{fuse}^{4})^{T} \qquad (11)$$
The auto-channel correlation matrix $M_{sc}$ and the cross-channel correlation matrix $M_{cc}$ are added element by element and then normalized to obtain the total channel correlation matrix $M_{c}\in\mathbb{R}^{1024\times 1024}$, as in formula (12). It is then multiplied with the multi-scale feature $F_{mc}$ and a skip connection path is likewise added, giving the feature $F_{MCC}$ containing multi-scale context information and its channel long-term dependencies, as in formula (13).
$$M_{c}=\mathrm{Normalization}(M_{sc}+M_{cc}) \qquad (12)$$
$$F_{MCC}=\mathrm{Reshape}'\big(M_{c}\otimes\mathrm{Reshape}(F_{mc})\big)+F_{mc} \qquad (13)$$
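To make the two context modules concrete, the following PyTorch-style sketch implements the spatial variant end to end and the channel-correlation step separately. It is an illustrative reading under assumptions: the class and function names are invented, and formulas (8) and (13) are interpreted here as applying the normalised correlation matrices by matrix multiplication after reshaping.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialContext(nn.Module):
    """Sketch of the MSC module: a dilated-convolution pyramid (rates 1, 6, 12, 18)
    produces the multi-scale feature F_ms; auto- and cross-spatial correlation
    matrices are summed, min-max normalised and applied to F_ms with a skip path."""
    def __init__(self, in_ch=512, branch_ch=256):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 1),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=r, dilation=r),
            ) for r in (1, 6, 12, 18)
        )
        self.project = nn.Conv2d(4 * branch_ch, in_ch, 1)        # C(*; theta_5)

    @staticmethod
    def _spatial_corr(x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w).transpose(1, 2)                # Reshape: (B, HW, C)
        return torch.bmm(flat, flat.transpose(1, 2))              # (B, HW, HW)

    def forward(self, f_fuse):
        b, c, h, w = f_fuse.shape
        f_ms = self.project(torch.cat([br(f_fuse) for br in self.branches], dim=1))
        m = self._spatial_corr(f_ms) + self._spatial_corr(f_fuse)      # M_ss + M_cs
        m_min = m.amin(dim=(1, 2), keepdim=True)
        m_max = m.amax(dim=(1, 2), keepdim=True)
        m = (m - m_min) / (m_max - m_min + 1e-6)                        # min-max normalisation
        out = torch.bmm(m, f_ms.view(b, c, h * w).transpose(1, 2))      # apply M_s to F_ms
        out = out.transpose(1, 2).reshape(b, c, h, w)                   # Reshape'
        return out + f_ms                                               # skip connection

def channel_context(f_mc, f_fuse):
    """Channel counterpart of the correlation step (formulas (10)-(13)): the
    correlation is taken between channels, giving a (B, C, C) matrix."""
    b, c, h, w = f_mc.shape
    flat_mc = f_mc.view(b, c, h * w)                              # Reshape: (B, C, HW)
    flat_in = f_fuse.view(b, c, h * w)
    m = torch.bmm(flat_mc, flat_mc.transpose(1, 2)) \
        + torch.bmm(flat_in, flat_in.transpose(1, 2))             # M_sc + M_cc
    m_min = m.amin(dim=(1, 2), keepdim=True)
    m_max = m.amax(dim=(1, 2), keepdim=True)
    m = (m - m_min) / (m_max - m_min + 1e-6)
    out = torch.bmm(m, flat_mc).view(b, c, h, w)                  # apply M_c, Reshape'
    return out + f_mc                                             # skip connection
```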
(4) Up-sampling to restore the resolution, and predicting the semantic segmentation mask map for the RGB and thermal infrared image pair:
recovering the feature map obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
the multi-scale characteristics obtained in the step 3 are processed
through a deconvolution with a 2×2 kernel, stride 16 and parameters $\theta_{12}$, $\mathrm{Deconv}(*;\theta_{12})$, which restores the resolution by a factor of 16; a 1×1 convolution with stride 1 and parameter $\theta_{11}$, $C(*;\theta_{11})$, then converts the number of channels of the feature map into the number of classes of the data set, and the semantic segmentation mask map S is computed with a softmax function, which can be expressed as:
$$S=\mathrm{softmax}\big(C\big(\mathrm{Deconv}(F;\theta_{12});\theta_{11}\big)\big) \qquad (14)$$
where F denotes the multi-scale feature obtained in step 3.
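A compact sketch of this prediction head follows; the transposed-convolution kernel size, the intermediate channel count and the class count (9 classes, as in the MFNet data set) are assumptions used only to make the example runnable.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Sketch of step (4): a transposed convolution restores the resolution by a
    factor of 16, a 1x1 convolution maps the channels to the number of classes,
    and softmax gives the per-pixel class probabilities."""
    def __init__(self, in_ch=1024, num_classes=9):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, 256, kernel_size=16, stride=16)
        self.classify = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, feat):
        logits = self.classify(self.deconv(feat))
        return torch.softmax(logits, dim=1)     # semantic segmentation mask map S
```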
(5) Training algorithm network to obtain model parameters
On a training data set, a supervised learning model is adopted for the prediction semantic segmentation mask graph in the step (4) and the pseudo image features generated in the step (1), and algorithm network training is finished end to end through the weighted cross entropy loss function and the weighted average absolute error loss function, so that network model parameters are obtained:
on the training data set, a supervised learning mechanism is adopted to calculate the cross entropy loss function $L_s$ between the semantic segmentation prediction result of the network model and the ground truth:
$$L_{s}(S,G)=-\sum_{i=1}^{m}\sum_{j=1}^{n}w(x_{ij})\,p(x_{ij})\log q(x_{ij}) \qquad (15)$$
where m and n represent the width and height of the input image, $(i,j)$ are the coordinates of a pixel, $p(x_{ij})$ is the ground-truth label of the pixel, $q(x_{ij})$ is the prediction result for the pixel, and $w(x_{ij})$ is the class weight coefficient of the pixel. The class weight coefficient w is used to alleviate the problem of unbalanced class distribution in the data set; the weight coefficient $w_{i}$ of the i-th class can be expressed as:
$$w_{i}=\frac{1}{\ln(c+P_{i})} \qquad (16)$$
wherein c is a constant set to 1.1, and $P_{i}$ represents the proportion of pixels labeled as the i-th class among all pixels.
The calculated cross entropy loss function and the bidirectional modal difference loss $L_{MD}$ of formula (1) together constitute the total loss function $L_{total}$, which can be expressed as:
$$L_{total}=\lambda_{1}L_{s}(S,G)+\lambda_{2}L_{MD} \qquad (17)$$
wherein $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters balancing the losses, S represents the model prediction result, and G represents the ground truth.
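Formulas (15)-(17) can be combined into a short training-loss sketch; the exact form of the class-weight formula and the default values of the balancing hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def class_weights(class_freq, c=1.1):
    """Weight coefficient of formula (16): w_i = 1 / ln(c + P_i), where P_i is
    the proportion of pixels labelled as class i (class_freq sums to 1)."""
    return 1.0 / torch.log(c + class_freq)

def total_loss(logits, target, class_freq, l_md, lambda1=1.0, lambda2=1.0):
    """Formula (17): weighted cross entropy between prediction and ground truth
    plus the bidirectional modal difference loss L_MD."""
    w = class_weights(class_freq)
    l_s = F.cross_entropy(logits, target, weight=w)   # weighted cross entropy, formula (15)
    return lambda1 * l_s + lambda2 * l_md
```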
The method trains the algorithm end to end, and the model parameters are obtained after the whole RGB-T semantic segmentation network is trained. Because the data set used to train the RGB-T semantic segmentation network (the MFNet data set) contains a limited amount of data, data augmentation operations of random flipping, random cropping and noise injection are applied to the RGB-T image pairs in the data set to ensure smooth training of the network and to avoid over-fitting to the training data set.
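A minimal sketch of the paired augmentation described above is given below; the crop size and noise level are assumptions, and the same random flip and crop window are applied to both modalities and the label so that the pair stays registered.

```python
import random
import torch

def augment_pair(rgb, thermal, label, crop=(480, 640), noise_std=0.01):
    """Apply the same random flip, random crop and noise injection to a registered
    RGB-T image pair and its label (tensors shaped C x H x W / H x W)."""
    if random.random() < 0.5:                                   # random horizontal flip
        rgb, thermal, label = rgb.flip(-1), thermal.flip(-1), label.flip(-1)
    h, w = rgb.shape[-2:]
    ch, cw = min(crop[0], h), min(crop[1], w)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    rgb = rgb[..., top:top + ch, left:left + cw]                # random crop (same window)
    thermal = thermal[..., top:top + ch, left:left + cw]
    label = label[..., top:top + ch, left:left + cw]
    rgb = rgb + noise_std * torch.randn_like(rgb)               # noise injection
    thermal = thermal + noise_std * torch.randn_like(thermal)
    return rgb, thermal, label
```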
the technical effects of the invention are further described by combining simulation experiments:
1. Simulation conditions: all simulation experiments are implemented with the PyTorch deep learning framework; the operating system is Ubuntu 16.04.5 and the hardware environment is an Nvidia GeForce GTX 1080 Ti GPU;
2. simulation content and result analysis:
simulation 1
The invention is compared with existing RGB image-based, RGB-D-based and RGB-T-based semantic segmentation methods on the public RGB-T image semantic segmentation data set MFNet, and part of the experimental results are compared visually. To ensure the fairness of the experiments, the RGB image-based semantic segmentation methods are extended into two branches, an RGB branch and a thermal infrared branch, and the prediction results of the two branches are added as the final semantic segmentation mask map; for the RGB-D-based semantic segmentation methods, the input depth image is directly replaced with the thermal infrared image.
Compared with the prior art, the method handles difficult RGB-T image semantic segmentation cases better. Owing to the modal difference reduction and fusion strategy of the invention, the multi-modal complementary information can be better utilized in environments with poor lighting conditions, so that the semantic segmentation results of the targets are closer to the manually annotated ground-truth maps.
Emulation 2
The invention is compared with existing RGB image-based, RGB-D-based and RGB-T-based semantic segmentation methods on the public RGB-T image semantic segmentation data set, and the results are evaluated objectively using the accepted evaluation indexes; the evaluation results are shown in Table 1, wherein:
Acc represents the per-class accuracy;
mAcc represents the class-average accuracy;
IoU represents the per-class intersection-over-union;
mIoU represents the class-average intersection-over-union.
Higher values of these indexes are better. It can be seen from Table 1 that the method segments RGB-T images more accurately, which fully demonstrates its effectiveness and superiority.
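For reference, the four evaluation indexes can be computed from a per-class confusion matrix as in the following sketch (the function name is illustrative):

```python
import numpy as np

def segmentation_metrics(conf):
    """Acc, mAcc, IoU and mIoU from a confusion matrix `conf`, where conf[i, j]
    counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # per-class accuracy
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # per-class IoU
    return acc, acc.mean(), iou, iou.mean()                             # Acc, mAcc, IoU, mIoU
```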
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various modifications may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (3)

1. The RGB-T image semantic segmentation method based on modal difference reduction is characterized by comprising the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from the input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Acquiring the multi-level fusion characteristics obtained in the step (2), obtaining a space correlation matrix and a channel correlation matrix through calculation, acting the space correlation matrix and the channel correlation matrix on the multi-scale characteristics, constructing a multi-scale space context module and a multi-scale channel context module, and establishing a relation between multi-scale context information and long-term dependence on space and channel dimensions thereof;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
on a training data set, a supervised learning model is adopted for the prediction semantic segmentation mask graph of the step (4) and the pseudo image features generated in the step (1), and algorithm network training is finished end to end through a weighted cross entropy loss function and a weighted average absolute error loss function, so that network model parameters are obtained, wherein:
in step (1), the sub-network simultaneously reduces the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the generated pseudo thermal infrared image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{T}^{i}\}_{i=1}^{5}$ of the corresponding true thermal infrared image, extracted by another ResNet-18 network, as well as the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the generated pseudo RGB image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{RGB}^{i}\}_{i=1}^{5}$ of the corresponding true RGB image, extracted by another ResNet-18 network;
to obtain the more discriminative RGB multi-level features $\{F_{RGB}^{i}\}_{i=1}^{5}$ extracted by a ResNet-50 network and the corresponding thermal infrared multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ extracted by another ResNet-50 network, $\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$;
the self-adaptive channel weighted fusion module in step (2) takes the first four levels of RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ of the RGB image obtained in step (1) and the first four levels of features $\{F_{T}^{i}\}_{i=1}^{4}$ of the corresponding thermal infrared image as input, adaptively generates the RGB weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the corresponding levels and the thermal infrared weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding levels, and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features $\{F_{fuse}^{i}\}_{i=1}^{4}$;
the inputs of the multi-scale spatial context module and the multi-scale channel context module in step (3) are the fusion features $F_{fuse}^{3}$ and $F_{fuse}^{4}$, respectively, so as to establish the interaction between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale spatial context module comprises a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix;
step (31) includes:
(311) The dilated convolution pyramid structure comprises four parallel paths; each path first applies a 1×1 convolution with stride 1, with parameters θ1, θ2, θ3, θ4 respectively, denoted C(*; θ1), C(*; θ2), C(*; θ3), C(*; θ4), followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters θd1, denoted D(*; θd1);
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters θd2, denoted D(*; θd2);
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters θd3, denoted D(*; θd3);
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters θd4, denoted D(*; θd4);
the four paths respectively yield features d1, d2, d3, d4 of different scales, each with 256 channels; a 1×1 convolution with stride 1 and parameters θ5, C(*; θ5), is then applied to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input feature (a minimal pyramid sketch follows this step);
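A minimal sketch of the four-path dilated convolution pyramid of step (311), assuming PyTorch and assuming that the four branch outputs are concatenated before the final 1×1 convolution C(*; θ5); the claim itself only states that this convolution restores the input channel number.

```python
import torch
import torch.nn as nn

class DilatedConvPyramid(nn.Module):
    """Four parallel paths: 1x1 convolution followed by a 3x3 dilated convolution
    with dilation rates 1, 6, 12, 18; outputs concatenated (assumption) and
    projected back to the input channel number with a final 1x1 convolution."""
    def __init__(self, in_channels, branch_channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=1, stride=1),
                nn.Conv2d(branch_channels, branch_channels, kernel_size=3,
                          stride=1, padding=rate, dilation=rate),
            )
            for rate in (1, 6, 12, 18)
        ])
        self.project = nn.Conv2d(4 * branch_channels, in_channels, kernel_size=1)

    def forward(self, x):
        d = [branch(x) for branch in self.branches]   # d1, d2, d3, d4
        return self.project(torch.cat(d, dim=1))      # rich multi-scale context feature
```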
(312) The multi-scale feature obtained in step (311) is reshaped and matrix-multiplied with its own transpose to obtain the spatial autocorrelation matrix M_ss ∈ R^(HW×HW);
(313) The original input feature is reshaped and matrix-multiplied with the transpose of the reshaped multi-scale feature to obtain the cross-spatial correlation matrix M_cs ∈ R^(HW×HW), which serves as an information supplement;
(314) The spatial autocorrelation matrix M_ss and the cross-spatial correlation matrix M_cs are added element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^(HW×HW); this matrix is then multiplied element by element with the multi-scale feature, and a skip connection path is added to obtain a feature containing multi-scale context information and its spatial long-range dependencies (a minimal sketch of steps (312)-(314) follows);
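A minimal sketch of steps (312)-(314), assuming PyTorch. Two points are assumptions rather than claim text: the normalization is implemented as a softmax, and the total spatial correlation matrix is applied to the reshaped multi-scale feature by matrix multiplication, as in standard non-local attention.

```python
import torch
import torch.nn.functional as F

def spatial_correlation(feat_ms, feat_in):
    """Spatial self- and cross-correlation (steps (312)-(314)).
    feat_ms: multi-scale feature from the pyramid, (B, C, H, W);
    feat_in: original input feature, same shape."""
    b, c, h, w = feat_ms.shape
    q = feat_ms.view(b, c, h * w)                    # (B, C, HW)
    p = feat_in.view(b, c, h * w)                    # (B, C, HW)

    m_ss = torch.bmm(q.transpose(1, 2), q)           # (B, HW, HW) spatial autocorrelation
    m_cs = torch.bmm(p.transpose(1, 2), q)           # (B, HW, HW) cross-spatial correlation
    m_s = F.softmax(m_ss + m_cs, dim=-1)             # normalization (softmax assumed)

    out = torch.bmm(q, m_s.transpose(1, 2))          # apply M_s (matrix product assumed)
    return out.view(b, c, h, w) + feat_ms            # skip connection
```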
The step (32) includes:
(321) The dilated convolution pyramid structure comprises four parallel paths; each path first applies a 1×1 convolution with stride 1, with parameters θ6, θ7, θ8, θ9 respectively, denoted C(*; θ6), C(*; θ7), C(*; θ8), C(*; θ9), followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters θd5, denoted D(*; θd5);
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters θd6, denoted D(*; θd6);
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters θd7, denoted D(*; θd7);
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters θd8, denoted D(*; θd8);
the four paths respectively yield features d5, d6, d7, d8 of different scales, each with 512 channels; a 1×1 convolution with stride 1 and parameters θ10, C(*; θ10), is then applied to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input feature;
(322) The multi-scale feature obtained in step (321) is reshaped and matrix-multiplied with its own transpose to obtain the channel autocorrelation matrix M_sc ∈ R^(1024×1024);
(323) The original input feature is reshaped and matrix-multiplied with the transpose of the reshaped multi-scale feature to obtain the cross-channel correlation matrix M_cc ∈ R^(1024×1024), which serves as an information supplement;
(324) The channel autocorrelation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and normalized to obtain the total channel correlation matrix M_c ∈ R^(1024×1024); this matrix is then multiplied element by element with the multi-scale feature, and a skip connection path is added to obtain a feature containing multi-scale context information and its channel long-range dependencies (a minimal sketch of steps (322)-(324) follows).
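A minimal sketch of steps (322)-(324), mirroring the spatial version but with C×C correlation matrices over channels (1024×1024 in the claim); the softmax normalization and the matrix-multiplication application are again assumptions.

```python
import torch
import torch.nn.functional as F

def channel_correlation(feat_ms, feat_in):
    """Channel self- and cross-correlation (steps (322)-(324))."""
    b, c, h, w = feat_ms.shape
    q = feat_ms.view(b, c, h * w)                    # (B, C, HW)
    p = feat_in.view(b, c, h * w)

    m_sc = torch.bmm(q, q.transpose(1, 2))           # (B, C, C) channel autocorrelation
    m_cc = torch.bmm(p, q.transpose(1, 2))           # (B, C, C) cross-channel correlation
    m_c = F.softmax(m_sc + m_cc, dim=-1)             # normalization (softmax assumed)

    out = torch.bmm(m_c, q)                          # re-weight channels (assumed)
    return out.view(b, c, h, w) + feat_ms            # skip connection
```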
2. The RGB-T image semantic segmentation method based on modal difference reduction as set forth in claim 1, wherein the bi-modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality, both of which employ a structurally identical "encoder-decoder-encoder" network, wherein the encoders use a ResNet-50 network and a ResNet-18 network, the decoder uses an image generation network, and the pseudo images are generated through an upsampling strategy based on bilinear interpolation (a minimal decoder sketch follows).
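A minimal sketch of the image-generation decoder of the "encoder-decoder-encoder" branch, with bilinear-interpolation upsampling as stated in claim 2; the number of stages, the channel widths, the input channel count and the tanh output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGenerationDecoder(nn.Module):
    """Decoder of the 'encoder-decoder-encoder' translation branch: each stage
    upsamples by bilinear interpolation and refines with a 3x3 convolution,
    ending with a pseudo image (3 channels for pseudo RGB; 1 would suffice for
    pseudo thermal)."""
    def __init__(self, in_channels=2048, widths=(512, 128, 32), out_channels=3):
        super().__init__()
        stages, c = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(c, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            c = w
        self.stages = nn.ModuleList(stages)
        self.to_image = nn.Conv2d(c, out_channels, kernel_size=1)

    def forward(self, feat):
        x = feat
        for stage in self.stages:
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            x = stage(x)
        return torch.tanh(self.to_image(x))    # generated pseudo image
```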
3. The method of claim 1, wherein in step (4), a deconvolution operation is used to upsample the feature map and restore its resolution, a convolution with a 1×1 kernel, stride 1 and parameters θ11, C(*; θ11), is used to change the channel number to the number of classes in the dataset, and finally a softmax function predicts the class to which each pixel belongs, yielding the semantic segmentation mask map (a minimal prediction-head sketch follows).
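A minimal sketch of the prediction head of claim 3 in PyTorch: deconvolution (transposed convolution) stages restore resolution, a 1×1 convolution C(*; θ11) maps the channels to the number of classes, and a softmax yields per-pixel class probabilities; the number and stride of the deconvolution stages are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Claim 3 prediction head: deconvolution stages restore resolution, a 1x1
    convolution C(*; theta11) maps channels to the number of classes, and a
    softmax predicts the class of every pixel."""
    def __init__(self, in_channels, num_classes, up_stages=4):
        super().__init__()
        ups, c = [], in_channels
        for _ in range(up_stages):
            ups += [nn.ConvTranspose2d(c, c // 2, kernel_size=2, stride=2),
                    nn.ReLU(inplace=True)]
            c = c // 2
        self.upsample = nn.Sequential(*ups)
        self.classifier = nn.Conv2d(c, num_classes, kernel_size=1, stride=1)

    def forward(self, x):
        logits = self.classifier(self.upsample(x))
        return torch.softmax(logits, dim=1)     # per-pixel class probabilities
```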
CN202110187778.8A 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction Active CN112991350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110187778.8A CN112991350B (en) 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction

Publications (2)

Publication Number Publication Date
CN112991350A CN112991350A (en) 2021-06-18
CN112991350B true CN112991350B (en) 2023-06-27

Family

ID=76393651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110187778.8A Active CN112991350B (en) 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction

Country Status (1)

Country Link
CN (1) CN112991350B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362349A (en) * 2021-07-21 2021-09-07 浙江科技学院 Road scene image semantic segmentation method based on multi-supervision network
CN113591685B (en) * 2021-07-29 2023-10-27 武汉理工大学 Geographic object spatial relationship identification method and system based on multi-scale pooling
CN114330279B (en) * 2021-12-29 2023-04-18 电子科技大学 Cross-modal semantic consistency recovery method
CN114708568B (en) * 2022-06-07 2022-10-04 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115240042B (en) * 2022-07-05 2023-05-16 抖音视界有限公司 Multi-mode image recognition method and device, readable medium and electronic equipment


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784654B (en) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 Image segmentation method and device and full convolution network system
US10956787B2 (en) * 2018-05-14 2021-03-23 Quantum-Si Incorporated Systems and methods for unifying statistical models for different data modalities

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020151536A1 (en) * 2019-01-25 2020-07-30 腾讯科技(深圳)有限公司 Brain image segmentation method, apparatus, network device and storage medium
CN110969634A (en) * 2019-11-29 2020-04-07 国网湖北省电力有限公司检修公司 Infrared image power equipment segmentation method based on generation countermeasure network
CN111666977A (en) * 2020-05-09 2020-09-15 西安电子科技大学 Shadow detection method of monochrome image
CN111462128A (en) * 2020-05-28 2020-07-28 南京大学 Pixel-level image segmentation system and method based on multi-modal spectral image
CN112101410A (en) * 2020-08-05 2020-12-18 中国科学院空天信息创新研究院 Image pixel semantic segmentation method and system based on multi-modal feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Revisiting Feature Fusion for RGB-T Salient Object Detection; Qiang Zhang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2020-08-06; vol. 2020; full text *
Research and Prospect of Cross-Modal Person Re-Identification (跨模态行人重识别研究与展望); Chen Dan et al.; Computer Systems & Applications (计算机系统应用); 2020-10-31; vol. 29, no. 10; full text *

Also Published As

Publication number Publication date
CN112991350A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
Huang et al. DSNet: Joint semantic learning for object detection in inclement weather conditions
Guo et al. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN111612008B (en) Image segmentation method based on convolution network
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
Huang et al. Multi-level cross-modal interaction network for RGB-D salient object detection
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
Gong et al. Global contextually guided lightweight network for RGB-thermal urban scene understanding
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112785526A (en) Three-dimensional point cloud repairing method for graphic processing
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
Zeng et al. Dual Swin-transformer based mutual interactive network for RGB-D salient object detection
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
Shen et al. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Ogura et al. Improving the visibility of nighttime images for pedestrian recognition using in‐vehicle camera
CN113780305B (en) Significance target detection method based on interaction of two clues
CN117036658A (en) Image processing method and related equipment
CN114494699A (en) Image semantic segmentation method and system based on semantic propagation and foreground and background perception
Zhou et al. GAF-Net: Geometric Contextual Feature Aggregation and Adaptive Fusion for Large-Scale Point Cloud Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant