CN116091765A - RGB-T image semantic segmentation method and device - Google Patents

RGB-T image semantic segmentation method and device

Info

Publication number
CN116091765A
Authority
CN
China
Prior art keywords
rgb
image
semantic segmentation
fusion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211715697.1A
Other languages
Chinese (zh)
Inventor
范嗣祺
王岩
刘菁菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202211715697.1A priority Critical patent/CN116091765A/en
Publication of CN116091765A publication Critical patent/CN116091765A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention provides an RGB-T image semantic segmentation method and device. In a training stage, spatial cross-modal information fusion, multi-scale feature iterative fusion and RGB-image random-mask data enhancement are used to train an RGB-T image semantic segmentation model on a semi-labeled RGB-T image pair dataset, which improves the model's ability to mine cross-modal spatially complementary information, improves semantic segmentation performance under poor illumination conditions, and reduces labeling cost. In an application stage, the RGB-T image semantic segmentation model generates the semantic segmentation image of a target RGB-T image pair, improving the accuracy of the semantic segmentation result.

Description

RGB-T image semantic segmentation method and device
Technical Field
The invention relates to the technical field of image processing, in particular to an RGB-T image semantic segmentation method and device.
Background
Semantic segmentation aims at assigning a class label to each pixel in an RGB image, which is one of the key technologies for scene perception and plays a vital role in computer vision tasks such as automatic driving, pedestrian detection, remote sensing image analysis and the like.
Under poor illumination (insufficient brightness or overexposure), texture information may be missing in parts of an RGB image, and directly performing semantic segmentation on an RGB image with missing texture information may lead to unreliable results. RGB-T semantic segmentation has therefore emerged, in which thermal infrared images supplement the texture information of RGB images. Existing RGB-T semantic segmentation mostly fuses RGB features and thermal infrared features either by additive fusion after modal-feature self-enhancement or by channel-dimension fusion after modal-feature alignment, and then completes image semantic segmentation using the fused features.
However, neither additive fusion after modal-feature self-enhancement nor channel-dimension fusion after modal-feature alignment makes full use of the spatial complementarity between modal features, so existing RGB-T semantic segmentation performance is poor.
Disclosure of Invention
The invention provides an RGB-T image semantic segmentation method and device to solve the problem in the prior art that semantic segmentation performance is poor because the spatial complementarity between RGB features and thermal infrared features is not fully utilized. Spatial cross-modal information fusion, multi-scale feature iterative fusion and RGB-image random-mask data enhancement are used to train an RGB-T image semantic segmentation model on a semi-labeled RGB-T image pair dataset, strengthening the model's ability to mine cross-modal spatially complementary information, so that the model has low labeling cost and high semantic segmentation performance under poor illumination conditions; accurate semantic segmentation can then be realized using the RGB-T image semantic segmentation model.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In a second aspect, the present invention provides an RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation method and device. An RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an RGB-T image semantic segmentation method provided by the invention;
FIG. 2 is a schematic diagram of a dual-branch RGB-T semantic segmentation network provided by the present invention;
FIG. 3 is a diagram of an example fully supervised learning provided by the present invention;
FIG. 4 is a diagram of cross-modal mutual learning examples provided by the present invention;
FIG. 5 is a RGB-T feature flow provided by the present invention;
FIG. 6 is a schematic diagram of a spatial cross-modal information fusion module provided by the invention;
FIG. 7 is a schematic structural diagram of a multi-scale feature iterative fusion module provided by the present invention;
FIG. 8 is a schematic flow chart of an RGB-T image semantic segmentation device provided by the invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
910: a processor; 920: a communication interface; 930: a memory; 940: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, abbreviations and key term definitions in the art will be explained:
mIoU: mean intersection of union average cross-over ratio
SCF: spatial-wise cross-model fusion, spatial cross-modal information fusion
RMM: multi-scale measurement of repetitive multi-scale feature fusion
M-CutOut: mono-model CutOut, single mode random mask data enhancement
Conv: convolition, convolution operation
CD: channel-wise denoise, channel adaptive noise reduction
ADM: attentive demand map spatially adaptive demand pattern evaluation
CF: cross-model fusion
ASPP: atrous spatial pyramid pooling spatial pyramid pooling
SF: spatial-wise fusion, spatial dimension fusion
CA: channel-wise attention, channel attention mechanism
SA: spatial-wise attention, spatial attention mechanism
BN: batch normalization batch normalization operation
MLP: multilayer Perceptron, multilayer perceptron
The following describes a semantic segmentation method and device for RGB-T images according to the present invention with reference to FIGS. 1 to 9.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, as shown in fig. 1, including:
s11, calling an RGB-T image semantic segmentation model;
specifically, the RGB-T image semantic segmentation model is pre-constructed, and the construction process comprises the following steps:
constructing an RGB-T image semantic segmentation data set; wherein, the RGB-T image semantic segmentation dataset can be acquired as follows:
step A: collecting RGB-T data pairs to form a data set T;
Alternatively, the data set T may use open-source data, such as the road scene MFNet dataset and the subterranean scene PST900 dataset.
And (B) step (B): performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set T to obtain a first data set;
Because pixel-level semantic segmentation labeling of RGB-T image pairs is very expensive, the invention can be applied directly to semi-supervised tasks (i.e., part of the RGB-T image pairs are used as labeled image pairs and the rest as unlabeled image pairs) through a cross-modal mutual learning method, thereby reducing the labeling cost of the RGB-T image semantic segmentation model.
Step C: performing data enhancement on the first data set by adopting an RGB image random mask mode to obtain an RGB-T image semantic segmentation data set;
In order to fully utilize the labeled image pairs, the invention provides a single-modality random mask data enhancement method, M-CutOut, which performs a random mask operation on the RGB image of each RGB-T image pair in the first data set so as to artificially introduce new spatially complementary information regions, thereby encouraging the RGB-T image semantic segmentation model to better utilize cross-modal spatially complementary texture information during training.
Further, performing a random masking operation on the RGB image in the RGB-T image pair, including:
first, an all-ones mask M is initialized; then a rectangular region proportional to the image scale is determined at a random position and the corresponding region of M is set to 0; finally the mask is multiplied pixel by pixel with the original RGB image to form a new region in which visible-light texture information is missing.
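A minimal sketch of this random masking operation is shown below, assuming a NumPy array in H x W x 3 layout; the function name and the rectangle-scale value are illustrative assumptions rather than values specified by the patent.

```python
import numpy as np

def m_cutout(rgb_image: np.ndarray, scale: float = 0.3, rng=None) -> np.ndarray:
    """Mask a random rectangle of an RGB image (M-CutOut); the thermal image is left untouched.

    rgb_image: H x W x 3 array; scale: side length of the masked rectangle relative
    to the image size (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = rgb_image.shape[:2]
    mask = np.ones((h, w), dtype=rgb_image.dtype)   # initialize an all-ones mask M
    mh, mw = int(h * scale), int(w * scale)         # rectangle proportional to the image scale
    top = rng.integers(0, h - mh + 1)               # random position
    left = rng.integers(0, w - mw + 1)
    mask[top:top + mh, left:left + mw] = 0          # set the region of M to 0
    return rgb_image * mask[..., None]              # pixel-wise product with the original RGB image
```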
Constructing a dual-branch RGB-T semantic segmentation network; the RGB branch and the thermal infrared branch in the dual-branch RGB-T semantic segmentation network both adopt an encoding-decoding structure, wherein:
the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel level classifier and a main prediction output layer;
The thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
the first feature encoder includes K feature extraction layers;
the second feature encoder includes K feature extraction layers;
the first feature encoder and the second feature encoder jointly comprise K space cross-modal information fusion modules;
the first feature decoder comprises a first multi-scale feature iterative fusion module;
the second feature decoder includes a second multi-scale feature iterative fusion module. FIG. 2 provides a schematic diagram of the dual-branch RGB-T semantic segmentation network (the drawings of the present invention are all illustrated with K=5, without loss of generality), in which SCF (Spatial-wise Cross-modal Fusion) denotes the spatial cross-modal information fusion module, RMM (Repetitive Multi-scale fusion Module) denotes the multi-scale feature iterative fusion module, and L_i denotes the feature extraction layer with index i.
From fig. 2, it can be seen that the RGB branches and the thermal infrared branches in the dual-branch RGB-T semantic segmentation network progressively perform feature extraction by using K-layer feature extraction layers in the coding structure, and gradually perform spatial cross-modal information fusion by using the SCF module embedded after each feature extraction layer; and in the decoding structure, the RMM module is used for carrying out iterative feature fusion of spatial dimensions on multi-scale fusion features so as to make up for information loss caused by spatial downsampling in feature encoding, and the scale transformation is realized by an upsampling method (such as bilinear upsampling) in feature decoding. The spatial cross-modal information fusion and the multi-scale feature iterative fusion jointly enable the dual-branch RGB-T semantic segmentation network to have the capacity of deeply mining cross-modal spatial complementary texture features of RGB-T image pairs.
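To make the overall data flow concrete, the following PyTorch-style skeleton sketches the dual-branch structure (K = 5 as in FIG. 2). The class and argument names are illustrative assumptions; the feature extraction layers, SCF modules and RMM modules are taken as given building blocks whose internals are described later in this document.

```python
import torch.nn as nn

class DualBranchRGBTSegNet(nn.Module):
    """Illustrative skeleton of the dual-branch RGB-T semantic segmentation network."""

    def __init__(self, rgb_layers, the_layers, scf_modules, rmm_rgb, rmm_the, num_classes, feat_dim):
        super().__init__()
        self.rgb_layers = nn.ModuleList(rgb_layers)         # L_rgb,0 ... L_rgb,K-1
        self.the_layers = nn.ModuleList(the_layers)         # L_the,0 ... L_the,K-1
        self.scf = nn.ModuleList(scf_modules)               # SCF_0 ... SCF_K-1
        self.rmm_rgb, self.rmm_the = rmm_rgb, rmm_the       # multi-scale feature iterative fusion modules
        self.cls_rgb = nn.Conv2d(feat_dim, num_classes, 1)  # first pixel-level classifier (main prediction)
        self.cls_the = nn.Conv2d(feat_dim, num_classes, 1)  # second pixel-level classifier (auxiliary prediction)

    def forward(self, rgb, thermal):
        x_rgb, x_the, fused = rgb, thermal, []
        for layer_rgb, layer_the, scf in zip(self.rgb_layers, self.the_layers, self.scf):
            f_rgb, f_the = layer_rgb(x_rgb), layer_the(x_the)  # per-layer feature extraction
            fused.append(f_rgb + f_the)                        # additive fusion feature of the two modalities
            x_rgb, x_the = scf(f_rgb, f_the)                   # spatial cross-modal information fusion
        skips = fused[-4:-1][::-1]                             # fusion features at scales K-2, K-3, K-4
        dec_rgb = self.rmm_rgb(x_rgb, skips)                   # decoding feature of the RGB branch
        dec_the = self.rmm_the(x_the, skips)                   # decoding feature of the thermal infrared branch
        y_rgb = self.cls_rgb(dec_rgb + dec_the)                # first semantic segmentation image (logits)
        y_the = self.cls_the(dec_the)                          # second semantic segmentation image (logits)
        return y_rgb, y_the
```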
Constructing a loss function. When training the RGB-T image semantic segmentation model, the invention applies cross-modal mutual learning directly to the semi-supervised semantic segmentation task: the training set consists of RGB-T image pairs with pixel-level semantic segmentation labels and RGB-T image pairs without pixel-level semantic segmentation labels; the fully supervised learning mode shown in FIG. 3 is adopted for the labeled image pairs, and the cross-modal mutual learning mode shown in FIG. 4 is adopted for the unlabeled image pairs.
Thus, the loss function of the RGB-T image semantic segmentation model is composed of the total training loss on the RGB-T image pairs with pixel-level semantic segmentation labels and the total training loss on the RGB-T image pairs without pixel-level semantic segmentation labels, where h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch of the RGB-T image semantic segmentation model, CE(·) denotes the cross entropy loss function, Y_rgb and Y_the are the corresponding pseudo labels, and M is the random mask pattern applied to the RGB image; the branch outputs entering the loss can be regarded as the per-pixel semantic segmentation predictions on the results after M-CutOut data enhancement. The pseudo labels Y_rgb and Y_the are generated from the semantic segmentation predictions on data with conventional weak data enhancement (such as flipping).
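As a concrete illustration, the following PyTorch-style sketch shows one common way to realize such a supervised-plus-mutual-learning objective. The unsupervised weighting, the routing of pseudo labels between branches, and the helper callables m_cutout and weak_aug are assumptions of this sketch and do not reproduce the patent's exact formulas.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_batch, unlabeled_batch, m_cutout, weak_aug, unsup_weight=1.0):
    """Supervised cross-entropy on labeled pairs plus cross-modal mutual learning on
    unlabeled pairs (weighting and pseudo-label routing are illustrative assumptions)."""
    rgb_l, the_l, labels = labeled_batch
    y_rgb, y_the = model(m_cutout(rgb_l), the_l)             # M-CutOut masks the RGB image only
    sup_loss = F.cross_entropy(y_rgb, labels) + F.cross_entropy(y_the, labels)

    rgb_u, the_u = unlabeled_batch
    with torch.no_grad():                                    # pseudo labels from weakly augmented data (e.g. flipping)
        rgb_w, the_w = weak_aug(rgb_u, the_u)                # weak augmentation applied consistently to the pair
        p_rgb, p_the = model(rgb_w, the_w)
        pseudo_rgb, pseudo_the = p_rgb.argmax(dim=1), p_the.argmax(dim=1)   # Y_rgb, Y_the
    q_rgb, q_the = model(m_cutout(rgb_u), the_u)             # predictions on strongly augmented data
    # cross-modal mutual learning: each branch is supervised by the other branch's pseudo label
    unsup_loss = F.cross_entropy(q_rgb, pseudo_the) + F.cross_entropy(q_the, pseudo_rgb)
    return sup_loss + unsup_weight * unsup_loss
```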
Training an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set and a loss function.
The RGB-T image semantic segmentation model trained by the invention has low labeling cost and high semantic segmentation performance under the condition of poor illumination.
Three existing schemes are compared: scheme 1 (patent CN112991350A), scheme 2 (patent CN113362349A) and scheme 3 (patent CN113781504A). All of them adopt a fully supervised training mode to build the model, while the present invention trains the model directly on a semi-supervised task (only half of the MFNet dataset is used as labeled data and the other half as unlabeled data) through the cross-modal mutual learning method.
Table 1 is a model test effect comparison table on the road scene RGB-T image dataset MFNet;
table 2 is a semi-supervised semantic segmentation effect contrast table on the road scene RGB-T image dataset MFNet.
Tables 1 and 2 show that, under the same experimental settings, the semantic segmentation effect of the method is better, and that using only half of the labeled data achieves the effect that the prior-art schemes obtain with all labeled data.
TABLE 1
TABLE 2
S12, inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
Specifically, the K feature extraction layers included in the first feature encoder are denoted L_rgb,0 to L_rgb,K-1 respectively;
the K feature extraction layers included in the second feature encoder are denoted L_the,0 to L_the,K-1 respectively;
the K spatial cross-modal information fusion modules jointly included in the first feature encoder and the second feature encoder are denoted SCF_0 to SCF_K-1 respectively.
The S12 includes:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, utilizing L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, utilizing L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i; wherein i ∈ [0, (K-1)]; when i = 0, the RGB information input to L_rgb,i is the RGB image in the target RGB-T image pair and the thermal infrared information input to L_the,i is the thermal infrared image in the target RGB-T image pair; when i ∈ [1, (K-1)], the RGB information input to L_rgb,i is f'_rgb,i-1 obtained by SCF_i-1, and the thermal infrared information input to L_the,i is f'_the,i-1 obtained by SCF_i-1;
the first multi-scale feature iterative fusion module performs spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain a decoding feature, hereinafter denoted F_rgb; the second multi-scale feature iterative fusion module performs spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain a decoding feature, hereinafter denoted F_the; wherein j ∈ [2, K], and f_m,K-j is the feature obtained by additive fusion of f_rgb,K-j and f_the,K-j;
F_rgb and F_the are additively fused to obtain a first additive fusion feature, the first additive fusion feature is processed by the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and the first semantic segmentation image y_rgb is output through the main prediction output layer; F_the is processed by the second pixel-level classifier to obtain the second semantic segmentation image y_the, and the second semantic segmentation image y_the is output through the auxiliary prediction output layer.
Here, the pixel-level classifier mainly performs convolution operation, that is:
Figure BDA0004027721530000124
Figure BDA0004027721530000125
Further, fig. 5 shows the RGB-T feature flow of the feature encoding and decoding stages, where CD (channel-wise denoise) is the channel adaptive noise reducer, ADM (attentive demand map) is the spatially adaptive demand map evaluator, CF (cross-modal fusion) is the cross-modal fusion device, ASPP (atrous spatial pyramid pooling) is the spatial pyramid pooler, and SF (spatial-wise fusion) is the spatial dimension fusion device. CD, ADM and CF together form the SCF module, while ASPP and SF form the RMM module.
Specifically, fig. 6 provides a schematic structural diagram of a spatial cross-mode information fusion module, where in a real scene, both an RGB image and an infrared thermal image are inevitably interfered by noise in a complex environment, such as strong light irradiation, temperature fluctuation caused by an abnormal heat source, and the like. In view of the above noise, the proposed SCF module applies two attention mechanisms, the Channel adaptive noise reducer CD determines which Channel is more reliable through the Channel attention mechanism CA (Channel-wise) and the Spatial adaptive demand pattern evaluator ADM determines which region has more Spatial complementary information fusion requirements through the Spatial attention mechanism SA (Spatial-wise). There are a number of engineering implementations of the two attention mechanisms described above, such as those in CBAM (s.woo, j.park, J-yle, and i.s.kweon.cbam: convolitionallblockattententment module.InECCV, pages3-19, 2018.), which will be readily appreciated, extended and implemented by the skilled artisan.
Thus, the SCF is utilized i For fr gb,i and fthe,i Performing spatial cross-modal information fusion to obtain fr g b ,i and fthe I, comprising:
in the channel adaptive noise reducer, for fr gb,i Performing maximum value pooling and mean value pooling treatment to obtain a first maximum value pooling characteristic MaxPool (f) rg b, i) and a first mean pooling feature MeanPool (f) rgb,i ) Based on the MaxPool (f rgb,i ) And the MeanPool (f) rgb I) generating a first channel attention map using a channel attention mechanism
Figure BDA0004027721530000131
And add said->
Figure BDA0004027721530000132
And fr is equal to gb,i Is the product of fr gb,i Noise reduction features of->
Figure BDA0004027721530000133
At the same time, for f the,i Performing maximum value pooling and mean value pooling treatment to obtain a second maximum value pooling characteristic MaxPool (f) the,i ) And a second averaged pooling feature MeanPool (f) the,i ) Based on the MaxPool (f the,i ) And the MeanPool (f) the,i ) The channel attention mechanism generates a second channel attention map
Figure BDA0004027721530000134
And add said->
Figure BDA0004027721530000135
And f is equal to the The product of i is taken as f the,i Noise reduction features of->
Figure BDA0004027721530000136
Here, the described
Figure BDA0004027721530000137
The formula of (2) is as follows:
Figure BDA0004027721530000138
here, the described
Figure BDA0004027721530000139
The formula of (2) is as follows:
Figure BDA0004027721530000141
wherein, MLP is the multilayer perceptron, sigmoid (·) is Sigmoid function.
In a spatially adaptive demand graph evaluator, based on
Figure BDA0004027721530000142
Know->
Figure BDA0004027721530000143
Generating a first spatially adaptive demand graph using a spatial attention mechanism >
Figure BDA0004027721530000144
At the same time based on->
Figure BDA0004027721530000145
It is known that
Figure BDA0004027721530000146
Generating a second spatially adaptive demand graph using a spatial attention mechanism>
Figure BDA0004027721530000147
Here, the described
Figure BDA0004027721530000148
The formula of (2) is as follows:
Figure BDA0004027721530000149
the said
Figure BDA00040277215300001410
The formula of (2) is as follows:
Figure BDA00040277215300001411
conv (·) is a convolution function, 7*7 convolution kernel is adopted, sigmoid (·) is a Sigmoid function, and the function of the space self-adaptive demand graph is to represent the space complementary information fusion demand of the region.
In a cross-modal fusion device
Figure BDA0004027721530000151
and />
Figure BDA0004027721530000152
Dot product and->
Figure BDA0004027721530000153
Carrying out additive fusion to obtain the frgb, i, and carrying out +.>
Figure BDA0004027721530000154
and />
Figure BDA0004027721530000155
Dot product and->
Figure BDA0004027721530000156
Carrying out additive fusion to obtain the f the,i 。/>
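Based on the description above, the following PyTorch-style sketch outlines one possible SCF module, assuming CBAM-style channel and spatial attention as cited. The class name, the channel-reduction ratio, the exact inputs of the 7x7 demand-map convolution and the direction of the cross-modal fusion are assumptions of this sketch, not the patent's exact design.

```python
import torch
import torch.nn as nn

class SCF(nn.Module):
    """Sketch of a spatial cross-modal information fusion module: channel adaptive
    noise reduction (CD), spatially adaptive demand maps (ADM) and cross-modal
    fusion (CF). Details noted in the lead-in are illustrative assumptions."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        def make_mlp():  # CBAM-style shared MLP of the channel attention
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )
        self.mlp_rgb, self.mlp_the = make_mlp(), make_mlp()
        self.demand_rgb = nn.Conv2d(4, 1, kernel_size=7, padding=3)  # 7x7 conv of the demand-map evaluator
        self.demand_the = nn.Conv2d(4, 1, kernel_size=7, padding=3)

    @staticmethod
    def _denoise(f, mlp):  # channel adaptive noise reducer (CD)
        att = torch.sigmoid(mlp(torch.amax(f, dim=(2, 3), keepdim=True))
                            + mlp(torch.mean(f, dim=(2, 3), keepdim=True)))
        return att * f     # noise-reduction feature

    @staticmethod
    def _descriptors(f):   # channel-wise max/mean maps for the spatial attention branch
        return torch.cat([f.amax(dim=1, keepdim=True), f.mean(dim=1, keepdim=True)], dim=1)

    def forward(self, f_rgb, f_the):
        d_rgb = self._denoise(f_rgb, self.mlp_rgb)
        d_the = self._denoise(f_the, self.mlp_the)
        desc = torch.cat([self._descriptors(d_rgb), self._descriptors(d_the)], dim=1)
        m_rgb = torch.sigmoid(self.demand_rgb(desc))  # where the RGB branch needs complementary information
        m_the = torch.sigmoid(self.demand_the(desc))  # where the thermal branch needs complementary information
        f_rgb_out = d_rgb + m_rgb * d_the             # assumed fusion direction: thermal into RGB
        f_the_out = d_the + m_the * d_rgb             # assumed fusion direction: RGB into thermal
        return f_rgb_out, f_the_out
```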
Fig. 7 provides a schematic structural diagram of the multi-scale feature iterative fusion module, in which a Conv unit consists of a convolution operation, a batch normalization (BN) operation and a ReLU activation. The spatial downsampling in the feature encoding process inevitably causes information loss, and the pixel-level semantic segmentation task depends on detailed texture features; the proposed RMM module compensates for this information loss by iteratively fusing multi-scale features.
Thus, performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the first multi-scale feature iterative fusion module to obtain the decoding feature F_rgb comprises the following steps:
the feature encoding result f'_rgb,K-1 is first embedded with more global features by ASPP to obtain ASPP(f'_rgb,K-1), where ASPP adopts dilation coefficients d = 2, 4, 8 and the number of output feature channels is 256; the upsampled feature of ASPP(f'_rgb,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4 that contain rich texture information, to obtain F_rgb. Considering computational complexity, the fusion features are first reduced in channel dimension by one Conv unit.
Specifically, the iterative fusion runs over z ∈ [1,3] and uses a cascade (concatenation) operation, upsampling Up(·), the mean pooling operation MeanPool(·), the maximum pooling operation MaxPool(·), the Sigmoid function Sigmoid(·) and point multiplication (·). At each step, an adaptive mask is evaluated by a spatial attention mechanism to indicate where and how much information compensation is needed, and the fusion feature f_m,K-z-1 is reduced in channel dimension by a Conv unit before the spatial-dimension fusion.
Likewise, performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the second multi-scale feature iterative fusion module to obtain the decoding feature F_the comprises the following steps:
the feature encoding result f'_the,K-1 is first embedded with more global features by ASPP to obtain ASPP(f'_the,K-1); the upsampled feature of ASPP(f'_the,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4 that contain rich texture information, to obtain F_the, in the same manner as in the RGB branch, with an adaptive mask obtained by the spatial attention mechanism evaluation at each step.
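The following PyTorch-style sketch illustrates one possible RMM, using the ASPP dilation rates 2, 4, 8 and 256 output channels stated above. The ConvUnit/RMM class names, the adaptive-mask arithmetic and the way features are combined at each iteration are assumptions of this sketch rather than the patent's exact formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Sequential):
    """Conv unit: convolution + batch normalization + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class RMM(nn.Module):
    """Sketch of a multi-scale feature iterative fusion module (ASPP plus iterative
    spatial-dimension fusion with an adaptive mask); see the lead-in for assumptions."""

    def __init__(self, in_ch, skip_chs, mid_ch=256):
        super().__init__()
        # simplified ASPP: parallel dilated convolutions with d = 2, 4, 8, projected to mid_ch channels
        self.aspp = nn.ModuleList([nn.Conv2d(in_ch, mid_ch, 3, padding=d, dilation=d) for d in (2, 4, 8)])
        self.aspp_proj = ConvUnit(3 * mid_ch, mid_ch)
        # one Conv unit per fusion feature for channel dimension reduction
        self.reduce = nn.ModuleList([ConvUnit(c, mid_ch) for c in skip_chs])
        self.mask_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # spatial attention for the adaptive mask

    def forward(self, deep_feat, skips):
        # skips: fusion features at scales K-2, K-3, K-4, ordered from deeper (smaller) to shallower (larger)
        x = self.aspp_proj(torch.cat([branch(deep_feat) for branch in self.aspp], dim=1))
        for reduce, skip in zip(self.reduce, skips):  # z = 1, 2, 3
            skip = reduce(skip)                       # channel dimension reduction by a Conv unit
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)  # Up(.)
            desc = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
            mask = torch.sigmoid(self.mask_conv(desc))  # where and how much information compensation is needed
            x = x + mask * skip                         # spatial-dimension fusion with the texture-rich feature
        return x                                        # decoding feature
```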
S13, selecting one of the first semantic segmentation image and the second semantic segmentation image as a semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair.
Specifically, S13 includes:
under the condition that no texture information is missing in an RGB image and a thermal infrared image in the target RGB-T image pair, taking the first semantic segmentation image as a semantic segmentation image of the target RGB-T image pair;
under the condition that texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, for a daytime scene the first semantic segmentation image is taken as the semantic segmentation image of the target RGB-T image pair, and for a night scene the second semantic segmentation image is taken as the semantic segmentation image of the target RGB-T image pair.
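The selection rule above can be summarized by the small sketch below; how the texture-missing flags and the day/night indicator are obtained is not specified by the patent and is an assumption of this sketch.

```python
def select_segmentation(first_seg, second_seg, rgb_texture_missing, thermal_texture_missing, is_daytime):
    """Pick the final semantic segmentation image for a target RGB-T image pair.
    first_seg is the RGB-branch output, second_seg the thermal-infrared-branch output."""
    if not rgb_texture_missing and not thermal_texture_missing:
        return first_seg                            # no texture loss: use the RGB-branch result
    return first_seg if is_daytime else second_seg  # daytime scene -> RGB branch, night scene -> thermal branch
```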
The dual-branch structure of the invention enables the RGB-T semantic segmentation model to better cope with the loss of the single-modality data signal. Table 3 compares the results of signal-loss robustness experiments on the road scene RGB-T image dataset MFNet; the experiments show that the invention obtains better semantic segmentation results under signal loss.
TABLE 3
In summary, according to the RGB-T image semantic segmentation method provided by the invention, an RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
In addition, on the basis of this scheme, similar effects can be achieved by using different feature extraction backbone networks and by modifying parameters in each module (such as the number of convolutional layers, the number of channels, or the activation function), and similar semi-supervised semantic segmentation training can be achieved with different combinations of strong and weak data enhancement.
In a second aspect, the present invention provides an RGB-T image semantic segmentation apparatus. The RGB-T image semantic segmentation apparatus described below and the RGB-T image semantic segmentation method described above may be referred to in correspondence with each other. Fig. 8 is a schematic flow chart of an RGB-T image semantic segmentation apparatus provided by the present invention; as shown in fig. 8, the apparatus includes:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
The selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation apparatus. An RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
In a third aspect, fig. 9 illustrates a schematic structural diagram of an electronic device. As shown in fig. 9, the electronic device may include: a processor 910, a communication interface 920, a memory 930 and a communication bus 940, wherein the processor 910, the communication interface 920 and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform an RGB-T image semantic segmentation method, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a method of RGB-T image semantic segmentation provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of semantic segmentation of RGB-T images provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
The RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
2. The RGB-T image semantic segmentation method of claim 1, wherein the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel-level classifier, and a main prediction output layer;
the thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
wherein the first feature encoder comprises K feature extraction layers, denoted L_rgb,0 to L_rgb,K-1 respectively;
the second feature encoder comprises K feature extraction layers, denoted L_the,0 to L_the,K-1 respectively;
the first feature encoder and the second feature encoder jointly comprise K spatial cross-modal information fusion modules, denoted SCF_0 to SCF_K-1 respectively;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch, wherein the method comprises the following steps:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, utilizing L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, utilizing L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i; wherein i ∈ [0, (K-1)]; when i = 0, the RGB information input to L_rgb,i is the RGB image in the target RGB-T image pair and the thermal infrared information input to L_the,i is the thermal infrared image in the target RGB-T image pair; when i ∈ [1, (K-1)], the RGB information input to L_rgb,i is f'_rgb,i-1 obtained by SCF_i-1, and the thermal infrared information input to L_the,i is f'_the,i-1 obtained by SCF_i-1;
performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the first multi-scale feature iterative fusion module to obtain a first decoding feature, and performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the second multi-scale feature iterative fusion module to obtain a second decoding feature; wherein j ∈ [2, K], and f_m,K-j is the feature obtained by additive fusion of f_rgb,K-j and f_the,K-j;
additively fusing the first decoding feature and the second decoding feature to obtain a first additive fusion feature, processing the first additive fusion feature with the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and outputting the first semantic segmentation image y_rgb through the main prediction output layer;
processing the second decoding feature with the second pixel-level classifier to obtain the second semantic segmentation image y_the, and outputting the second semantic segmentation image y_the through the auxiliary prediction output layer.
3. The RGB-T image semantic segmentation method of claim 2, wherein the spatial cross-modal information fusion module comprises: a channel adaptive noise reducer, a spatially adaptive demand graph evaluator and a cross-modal fuser;
performing spatial cross-modal information fusion on f_rgb,i and f_the,i with SCF_i to obtain f'_rgb,i and f'_the,i comprises:
in the channel adaptive noise reducer, performing max pooling and mean pooling on f_rgb,i to obtain a first max-pooled feature MaxPool(f_rgb,i) and a first mean-pooled feature MeanPool(f_rgb,i), generating a first channel attention map A_rgb,i from MaxPool(f_rgb,i) and MeanPool(f_rgb,i) using a channel attention mechanism, and taking the product of A_rgb,i and f_rgb,i as the noise-reduced feature f̃_rgb,i of f_rgb,i;
meanwhile, performing max pooling and mean pooling on f_the,i to obtain a second max-pooled feature MaxPool(f_the,i) and a second mean-pooled feature MeanPool(f_the,i), generating a second channel attention map A_the,i from MaxPool(f_the,i) and MeanPool(f_the,i) using the channel attention mechanism, and taking the product of A_the,i and f_the,i as the noise-reduced feature f̃_the,i of f_the,i;
in the spatially adaptive demand graph evaluator, generating a first spatially adaptive demand graph D_rgb,i and a second spatially adaptive demand graph D_the,i from the noise-reduced features f̃_rgb,i and f̃_the,i using a spatial attention mechanism;
in the cross-modal fuser, additively fusing the dot product of D_rgb,i and f̃_the,i with f̃_rgb,i to obtain the f'_rgb,i, and additively fusing the dot product of D_the,i and f̃_rgb,i with f̃_the,i to obtain the f'_the,i.
4. The RGB-T image semantic segmentation method according to claim 3, characterized in that the first channel attention map A_rgb,i is computed from MaxPool(f_rgb,i) and MeanPool(f_rgb,i) by a multilayer perceptron followed by a Sigmoid function, and the second channel attention map A_the,i is computed from MaxPool(f_the,i) and MeanPool(f_the,i) in the same manner;
wherein MLP is the multilayer perceptron and Sigmoid(·) is the Sigmoid function.
5. The RGB-T image semantic segmentation method according to claim 3, characterized in that the first spatially adaptive demand graph D_rgb,i and the second spatially adaptive demand graph D_the,i are computed from the noise-reduced features by a convolution followed by a Sigmoid function;
wherein Conv(·) is a convolution function and Sigmoid(·) is the Sigmoid function.
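Claims 3 to 5 describe the SCF module as a channel adaptive noise reducer, a spatially adaptive demand graph evaluator and a cross-modal fuser, but the concrete expressions appear only as images in the filing. The sketch below is one plausible reading of those descriptions: channel attention from pooled statistics through an MLP and a Sigmoid, demand graphs from a convolution plus Sigmoid over both noise-reduced features, and demand-gated cross-modal dot products added to the own-modality features. The layer sizes, the 7x7 spatial convolution and the fusion direction are assumptions.

import torch
import torch.nn as nn

class SCF(nn.Module):
    """Plausible sketch of one spatial cross-modal information fusion module."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Per-modality MLP for channel attention (claim 4 names an MLP and a Sigmoid).
        def mlp():
            return nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.mlp_rgb, self.mlp_the = mlp(), mlp()
        # Conv + Sigmoid spatial attention for the demand graphs (claim 5).
        self.demand_rgb = nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3)
        self.demand_the = nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3)

    @staticmethod
    def _channel_attn(x, mlp):
        b, c, _, _ = x.shape
        mx = torch.amax(x, dim=(2, 3))            # MaxPool over space  -> (B, C)
        mn = torch.mean(x, dim=(2, 3))            # MeanPool over space -> (B, C)
        a = torch.sigmoid(mlp(mx) + mlp(mn))      # channel attention map
        return a.view(b, c, 1, 1)

    def forward(self, f_rgb, f_the):
        # Channel adaptive noise reduction: attention map times the input feature.
        nr_rgb = self._channel_attn(f_rgb, self.mlp_rgb) * f_rgb     # ~ f̃_rgb,i
        nr_the = self._channel_attn(f_the, self.mlp_the) * f_the     # ~ f̃_the,i
        both = torch.cat([nr_rgb, nr_the], dim=1)
        # Spatially adaptive demand graphs: where each modality wants the other's detail.
        d_rgb = torch.sigmoid(self.demand_rgb(both))
        d_the = torch.sigmoid(self.demand_the(both))
        # Cross-modal fuser: demand-gated other-modality feature added to the own feature.
        f_rgb_out = nr_rgb + d_rgb * nr_the       # ~ f'_rgb,i
        f_the_out = nr_the + d_the * nr_rgb       # ~ f'_the,i
        return f_rgb_out, f_the_out

# Example: f_rgb2, f_the2 = SCF(64)(torch.randn(2, 64, 60, 80), torch.randn(2, 64, 60, 80))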
6. The RGB-T image semantic segmentation method of claim 2, wherein the first multi-scale feature iterative fusion module and the second multi-scale feature iterative fusion module each comprise: a spatial pyramid pooler and a spatial dimension fuser;
performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 with the first multi-scale feature iterative fusion module to obtain the decoding feature d_rgb comprises:
generating the global feature corresponding to f'_rgb,K-1 with the spatial pyramid pooler in the first multi-scale feature iterative fusion module;
in the spatial dimension fuser in the first multi-scale feature iterative fusion module, performing spatial-dimension iterative fusion on the upsampled feature u_rgb of f'_rgb,K-1 and f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_rgb;
performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 with the second multi-scale feature iterative fusion module to obtain the decoding feature d_the comprises:
generating the global feature corresponding to f'_the,K-1 with the spatial pyramid pooler in the second multi-scale feature iterative fusion module;
in the spatial dimension fuser in the second multi-scale feature iterative fusion module, performing spatial-dimension iterative fusion on the upsampled feature u_the of f'_the,K-1 and f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_the.
7. The RGB-T image semantic segmentation method of claim 6, wherein the decoding feature d_rgb is obtained by iterating, for z ∈ [1,3], a spatial-dimension fusion step that combines the upsampled running feature of the RGB branch with the channel-reduced feature of f_m,K-z-1 under an adaptive mask, and the decoding feature d_the is obtained in the same manner on the thermal infrared branch;
wherein z ∈ [1,3], [·;·] is the cascade (concatenation) operation, Up(·) is the upsampling operation, MeanPool(·) is the mean pooling operation, MaxPool(·) is the maximum pooling operation, Sigmoid(·) is the Sigmoid function, ⊙ is the dot multiplication operation, the adaptive masks are obtained by performing spatial attention mechanism evaluation on the corresponding intermediate fusion features, and the channel-reduced feature is obtained by performing channel dimension reduction on f_m,K-z-1 with a Conv unit, the Conv unit comprising a convolution operation, a batch normalization operation and a ReLU activation.
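Claims 6 and 7 describe the decoder as a spatial pyramid pooler followed by three spatial-dimension fusion steps (z ∈ [1,3]) that combine the upsampled running feature with a channel-reduced f_m,K-z-1 under a spatially evaluated adaptive mask; the update equations themselves are image-rendered in the filing. The sketch below shows one way a single fusion step could look, building the mask from concatenated mean- and max-pooled maps, a convolution and a Sigmoid. The channel widths, the mask construction and the omission of the pyramid pooler are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Sequential):
    """Conv + BatchNorm + ReLU, as described for the channel-reducing Conv unit."""
    def __init__(self, cin, cout, k=3):
        super().__init__(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SpatialFusionStep(nn.Module):
    """One spatial-dimension fusion step: mask the upsampled running feature,
    then additively fuse it with the channel-reduced skip feature f_m,K-z-1."""
    def __init__(self, run_ch, skip_ch, out_ch):
        super().__init__()
        self.reduce_run = ConvUnit(run_ch, out_ch)
        self.reduce_skip = ConvUnit(skip_ch, out_ch)             # reduces f_m,K-z-1
        self.mask = nn.Conv2d(2, 1, kernel_size=7, padding=3)    # spatial attention

    def forward(self, running, skip):
        up = F.interpolate(running, size=skip.shape[-2:], mode='bilinear',
                           align_corners=False)                  # Up(.)
        up = self.reduce_run(up)
        skip = self.reduce_skip(skip)
        pooled = torch.cat([up.mean(dim=1, keepdim=True),
                            up.amax(dim=1, keepdim=True)], dim=1)  # [MeanPool ; MaxPool]
        m = torch.sigmoid(self.mask(pooled))                     # adaptive mask
        return m * up + skip                                     # dot product + additive fusion

# Example of the three-step iteration (z = 1..3), coarse to fine:
# step = SpatialFusionStep(run_ch=512, skip_ch=256, out_ch=256)
# fused = step(torch.randn(1, 512, 15, 20), torch.randn(1, 256, 30, 40))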
8. The RGB-T image semantic segmentation method according to claim 1, characterized in that the loss function L of the RGB-T image semantic segmentation model comprises a total training loss L_l for RGB-T image pairs with pixel-level semantic segmentation labels and a total training loss L_u for RGB-T image pairs without pixel-level semantic segmentation labels;
wherein L_l is the total training loss of an RGB-T image pair x_l with pixel-level semantic segmentation labels, L_u is the total training loss of an RGB-T image pair x_u without pixel-level semantic segmentation labels, G is the pixel-level semantic segmentation label of x_l, h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch of the RGB-T image semantic segmentation model, CE(·) represents the cross entropy loss function, Y_rgb and Y_the are the pseudo labels corresponding to x_u, M is the mask of x_u, and y is the semantic segmentation prediction corresponding to a single pixel point.
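The loss expressions of claim 8 are image-rendered in the filing; the text specifies cross entropy supervision of both branches with the label G on labeled pairs and pseudo labels Y_rgb, Y_the with a mask M on unlabeled pairs. The sketch below realizes that description with confidence-thresholded pseudo labels and cross supervision between the branches; the pseudo-label construction, the mask M and the choice of cross supervision are assumptions, not the exact loss of the filing.

import torch
import torch.nn.functional as F

def rgbt_loss(logits_rgb, logits_the, label=None, conf_thresh=0.9):
    """Sketch of a labeled/unlabeled training loss for the two branches.

    logits_*: (B, C, H, W) branch predictions; label: (B, H, W) class indices,
    or None for RGB-T pairs without pixel-level annotation.
    """
    if label is not None:
        # Labeled pairs: cross entropy of both branches against the annotation G.
        return F.cross_entropy(logits_rgb, label) + F.cross_entropy(logits_the, label)
    # Unlabeled pairs: each branch is supervised by the other's pseudo label,
    # masked by prediction confidence (an assumed construction of M).
    with torch.no_grad():
        conf_rgb, y_rgb = logits_rgb.softmax(dim=1).max(dim=1)   # pseudo label from RGB branch
        conf_the, y_the = logits_the.softmax(dim=1).max(dim=1)   # pseudo label from thermal branch
        mask_rgb = (conf_rgb > conf_thresh).float()
        mask_the = (conf_the > conf_thresh).float()
    loss_rgb = F.cross_entropy(logits_rgb, y_the, reduction='none') * mask_the
    loss_the = F.cross_entropy(logits_the, y_rgb, reduction='none') * mask_rgb
    return loss_rgb.mean() + loss_the.mean()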
9. The RGB-T image semantic segmentation method according to any one of claims 1 to 8, wherein selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair comprises:
when no texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair;
and when texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a daytime scene, and taking the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a night scene.
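The selection rule of claim 9 reduces to a small decision; a helper sketch follows, assuming the texture-missing condition and the day/night condition are supplied as booleans (how they are detected is outside this claim).

def select_segmentation(y_rgb, y_the, texture_missing: bool, is_daytime: bool):
    """Pick the output segmentation per claim 9.

    y_rgb / y_the: first and second semantic segmentation images.
    texture_missing: True if either the RGB or the thermal image lacks texture information.
    is_daytime: True for a daytime scene, False for a night scene.
    """
    if not texture_missing:
        return y_rgb                       # no texture missing: use the RGB-branch output
    return y_rgb if is_daytime else y_the  # texture missing: day -> RGB branch, night -> thermal branch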
10. An RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by performing data enhancement on a first data set by randomly masking the RGB images; the first data set is obtained by performing pixel-level semantic segmentation labeling on a portion of the RGB-T image pairs in a data set formed of RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines cross-modal spatially complementary texture features of an input RGB-T image pair through spatial cross-modal information fusion and multi-scale feature iterative fusion.
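The data enhancement named here — randomly masking the RGB image of a training pair while leaving the thermal image intact — could look like the sketch below; the number of masks, their size range and the fill value are illustrative assumptions.

import torch

def random_mask_rgb(rgb: torch.Tensor, num_masks: int = 3, max_frac: float = 0.3,
                    fill: float = 0.0) -> torch.Tensor:
    """Randomly erase rectangular regions of an RGB image (C, H, W).

    The thermal image of the pair is left untouched, so the network must
    recover the erased RGB texture from the thermal modality.
    """
    out = rgb.clone()
    _, h, w = out.shape
    for _ in range(num_masks):
        mh = int(torch.randint(1, max(2, int(h * max_frac)), (1,)))   # mask height
        mw = int(torch.randint(1, max(2, int(w * max_frac)), (1,)))   # mask width
        top = int(torch.randint(0, h - mh + 1, (1,)))
        left = int(torch.randint(0, w - mw + 1, (1,)))
        out[:, top:top + mh, left:left + mw] = fill                   # erase one rectangle
    return out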
CN202211715697.1A 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device Pending CN116091765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211715697.1A CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211715697.1A CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Publications (1)

Publication Number Publication Date
CN116091765A true CN116091765A (en) 2023-05-09

Family

ID=86209687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211715697.1A Pending CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Country Status (1)

Country Link
CN (1) CN116091765A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557795A (en) * 2024-01-10 2024-02-13 吉林大学 Underwater target semantic segmentation method and system based on multi-source data fusion
CN117557795B (en) * 2024-01-10 2024-03-29 吉林大学 Underwater target semantic segmentation method and system based on multi-source data fusion

Similar Documents

Publication Publication Date Title
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
Zhuo et al. Self-adversarial training incorporating forgery attention for image forgery localization
US20230021661A1 (en) Forgery detection of face image
CN111932431B (en) Visible watermark removing method based on watermark decomposition model and electronic equipment
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN115272437A (en) Image depth estimation method and device based on global and local features
CN116091765A (en) RGB-T image semantic segmentation method and device
Sheng et al. A joint framework for underwater sequence images stitching based on deep neural network convolutional neural network
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
CN116757978A (en) Infrared and visible light image self-adaptive fusion method, system and electronic equipment
CN115631205B (en) Method, device and equipment for image segmentation and model training
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN111325068B (en) Video description method and device based on convolutional neural network
Guo et al. A Markov random field model for the restoration of foggy images
Rahmon et al. Deepftsg: Multi-stream asymmetric use-net trellis encoders with shared decoder feature fusion architecture for video motion segmentation
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
Chen et al. Exploring efficient and effective generative adversarial network for thermal infrared image colorization
Kumar et al. Encoder–decoder-based CNN model for detection of object removal by image inpainting
CN117057969B (en) Cross-modal image-watermark joint generation and detection device and method
Tyagi et al. ForensicNet: Modern convolutional neural network‐based image forgery detection network
Dangle et al. Qualitative Colorization of Thermal Infrared Images using custom Convolutional Neural Networks
Sang et al. MPA‐Net: multi‐path attention stereo matching network
CN116246071A (en) Attention mechanism-based self-adaptive multi-mode scene segmentation method in dark and weak environment
Sun et al. Underwater visual feature matching based on attenuation invariance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination