CN116091765A - RGB-T image semantic segmentation method and device - Google Patents

RGB-T image semantic segmentation method and device

Info

Publication number
CN116091765A
Authority
CN
China
Prior art keywords
rgb
image
semantic segmentation
fusion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211715697.1A
Other languages
Chinese (zh)
Inventor
范嗣祺
王岩
刘菁菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202211715697.1A priority Critical patent/CN116091765A/en
Publication of CN116091765A publication Critical patent/CN116091765A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention provides an RGB-T image semantic segmentation method and device. In a training stage, spatial cross-modal information fusion, multi-scale feature iterative fusion and RGB-image random-mask data enhancement are used to train an RGB-T image semantic segmentation model on a semi-labeled RGB-T image pair dataset, which improves the model's ability to mine cross-modal spatially complementary information, improves semantic segmentation performance under poor illumination conditions, and reduces labeling cost. In an application stage, the RGB-T image semantic segmentation model generates the semantic segmentation image of a target RGB-T image pair, improving the accuracy of the semantic segmentation result.

Description

RGB-T image semantic segmentation method and device
Technical Field
The invention relates to the technical field of image processing, in particular to an RGB-T image semantic segmentation method and device.
Background
Semantic segmentation aims at assigning a class label to each pixel in an RGB image, which is one of the key technologies for scene perception and plays a vital role in computer vision tasks such as automatic driving, pedestrian detection, remote sensing image analysis and the like.
Under poor illumination (insufficient brightness or overexposure), texture information may be missing in parts of an RGB image, and directly performing semantic segmentation on an RGB image with missing texture information may lead to unreliable results. RGB-T semantic segmentation has therefore emerged, in which thermal infrared images supplement the texture information of RGB images. Existing RGB-T semantic segmentation mostly fuses RGB features and thermal infrared features either by additive fusion after modal-feature self-enhancement or by channel-dimension fusion after modal-feature alignment, and then completes image semantic segmentation using the fused features.
However, neither additive fusion after modal-feature self-enhancement nor channel-dimension fusion after modal-feature alignment makes full use of the spatial complementarity between modal features, so existing RGB-T semantic segmentation performance is poor.
Disclosure of Invention
The invention provides an RGB-T image semantic segmentation method and device to solve the problem in the prior art that semantic segmentation performance is poor because the spatial complementarity between RGB features and thermal infrared features is not fully utilized. Spatial cross-modal information fusion, multi-scale feature iterative fusion and RGB-image random-mask data enhancement are used to train an RGB-T image semantic segmentation model on a semi-labeled RGB-T image pair dataset, strengthening the model's ability to mine cross-modal spatially complementary information, so that the model has low labeling cost and high semantic segmentation performance under poor illumination conditions; accurate semantic segmentation can then be realized using the RGB-T image semantic segmentation model.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In a second aspect, the present invention provides an RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation method and device. An RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an RGB-T image semantic segmentation method provided by the invention;
FIG. 2 is a schematic diagram of a dual-branch RGB-T semantic segmentation network provided by the present invention;
FIG. 3 is a diagram of an example fully supervised learning provided by the present invention;
FIG. 4 is a diagram of cross-modal mutual learning examples provided by the present invention;
FIG. 5 is a RGB-T feature flow provided by the present invention;
FIG. 6 is a schematic diagram of a spatial cross-modal information fusion module provided by the invention;
FIG. 7 is a schematic structural diagram of a multi-scale feature iterative fusion module provided by the present invention;
FIG. 8 is a schematic flow chart of an RGB-T image semantic segmentation device provided by the invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
910: a processor; 920: a communication interface; 930: a memory; 940: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, abbreviations and key term definitions in the art will be explained:
mIoU: mean intersection of union average cross-over ratio
SCF: spatial-wise cross-model fusion, spatial cross-modal information fusion
RMM: multi-scale measurement of repetitive multi-scale feature fusion
M-CutOut: mono-model CutOut, single mode random mask data enhancement
Conv: convolition, convolution operation
CD: channel-wise denoise, channel adaptive noise reduction
ADM: attentive demand map spatially adaptive demand pattern evaluation
CF: cross-model fusion
ASPP: atrous spatial pyramid pooling spatial pyramid pooling
SF: spatial-wise fusion, spatial dimension fusion
CA: channel-wise attention, channel attention mechanism
SA: spatial-wise attention, spatial attention mechanism
BN: batch normalization batch normalization operation
MLP: multilayer Perceptron, multilayer perceptron
The following describes a semantic segmentation method and device for RGB-T images according to the present invention with reference to FIGS. 1 to 9.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, as shown in fig. 1, including:
s11, calling an RGB-T image semantic segmentation model;
specifically, the RGB-T image semantic segmentation model is pre-constructed, and the construction process comprises the following steps:
constructing an RGB-T image semantic segmentation data set; wherein, the RGB-T image semantic segmentation dataset can be acquired as follows:
step A: collecting RGB-T data pairs to form a data set T;
Alternatively, the data set T may use open-source data, such as the road scene MFNet dataset and the subterranean scene PST900 dataset.
And (B) step (B): performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set T to obtain a first data set;
Because pixel-level semantic segmentation labeling of RGB-T image pairs is very expensive, the invention can be applied directly to semi-supervised tasks (i.e., part of the RGB-T image pairs are used as labeled image pairs and the rest as unlabeled image pairs) through a cross-modal mutual learning method, thereby reducing the labeling cost of the RGB-T image semantic segmentation model.
Step C: performing data enhancement on the first data set by adopting an RGB image random mask mode to obtain an RGB-T image semantic segmentation data set;
In order to fully utilize the labeled image pairs, the invention provides a single-modality random mask data enhancement method, M-CutOut, which performs a random mask operation on the RGB image of each RGB-T image pair in the first data set so as to artificially introduce new spatially complementary information regions, thereby encouraging the RGB-T image semantic segmentation model to better utilize cross-modal spatially complementary texture information during training.
Further, performing a random masking operation on the RGB image in the RGB-T image pair, including:
first, an all-ones mask M is initialized; then a rectangular region proportional to the image scale is determined at a random position and the corresponding region of M is set to 0; finally the mask is multiplied pixel by pixel with the original RGB image to form a new region in which visible-light texture information is missing.
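A minimal sketch of this random masking operation is shown below, assuming a NumPy array in H x W x 3 layout; the function name and the rectangle-scale value are illustrative assumptions rather than values specified by the patent.

```python
import numpy as np

def m_cutout(rgb_image: np.ndarray, scale: float = 0.3, rng=None) -> np.ndarray:
    """Mask a random rectangle of an RGB image (M-CutOut); the thermal image is left untouched.

    rgb_image: H x W x 3 array; scale: side length of the masked rectangle relative
    to the image size (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = rgb_image.shape[:2]
    mask = np.ones((h, w), dtype=rgb_image.dtype)   # initialize an all-ones mask M
    mh, mw = int(h * scale), int(w * scale)         # rectangle proportional to the image scale
    top = rng.integers(0, h - mh + 1)               # random position
    left = rng.integers(0, w - mw + 1)
    mask[top:top + mh, left:left + mw] = 0          # set the region of M to 0
    return rgb_image * mask[..., None]              # pixel-wise product with the original RGB image
```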
Constructing a dual-branch RGB-T semantic segmentation network; the RGB branch and the thermal infrared branch in the dual-branch RGB-T semantic segmentation network both adopt an encoding-decoding structure, wherein:
the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel level classifier and a main prediction output layer;
The thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
the first feature encoder includes K feature extraction layers;
the second feature encoder includes K feature extraction layers;
the first feature encoder and the second feature encoder jointly comprise K space cross-modal information fusion modules;
the first feature decoder comprises a first multi-scale feature iterative fusion module;
the second feature decoder includes a second multi-scale feature iterative fusion module. FIG. 2 provides a schematic diagram of the dual-branch RGB-T semantic segmentation network (the drawings of the present invention are all illustrated with K=5, without loss of generality), in which SCF (Spatial-wise Cross-modal Fusion) denotes the spatial cross-modal information fusion module, RMM (Repetitive Multi-scale fusion Module) denotes the multi-scale feature iterative fusion module, and L_i denotes the feature extraction layer with index i.
From fig. 2, it can be seen that the RGB branches and the thermal infrared branches in the dual-branch RGB-T semantic segmentation network progressively perform feature extraction by using K-layer feature extraction layers in the coding structure, and gradually perform spatial cross-modal information fusion by using the SCF module embedded after each feature extraction layer; and in the decoding structure, the RMM module is used for carrying out iterative feature fusion of spatial dimensions on multi-scale fusion features so as to make up for information loss caused by spatial downsampling in feature encoding, and the scale transformation is realized by an upsampling method (such as bilinear upsampling) in feature decoding. The spatial cross-modal information fusion and the multi-scale feature iterative fusion jointly enable the dual-branch RGB-T semantic segmentation network to have the capacity of deeply mining cross-modal spatial complementary texture features of RGB-T image pairs.
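To make the overall data flow concrete, the following PyTorch-style skeleton sketches the dual-branch structure (K = 5 as in FIG. 2). The class and argument names are illustrative assumptions; the feature extraction layers, SCF modules and RMM modules are taken as given building blocks whose internals are described later in this document.

```python
import torch.nn as nn

class DualBranchRGBTSegNet(nn.Module):
    """Illustrative skeleton of the dual-branch RGB-T semantic segmentation network."""

    def __init__(self, rgb_layers, the_layers, scf_modules, rmm_rgb, rmm_the, num_classes, feat_dim):
        super().__init__()
        self.rgb_layers = nn.ModuleList(rgb_layers)         # L_rgb,0 ... L_rgb,K-1
        self.the_layers = nn.ModuleList(the_layers)         # L_the,0 ... L_the,K-1
        self.scf = nn.ModuleList(scf_modules)               # SCF_0 ... SCF_K-1
        self.rmm_rgb, self.rmm_the = rmm_rgb, rmm_the       # multi-scale feature iterative fusion modules
        self.cls_rgb = nn.Conv2d(feat_dim, num_classes, 1)  # first pixel-level classifier (main prediction)
        self.cls_the = nn.Conv2d(feat_dim, num_classes, 1)  # second pixel-level classifier (auxiliary prediction)

    def forward(self, rgb, thermal):
        x_rgb, x_the, fused = rgb, thermal, []
        for layer_rgb, layer_the, scf in zip(self.rgb_layers, self.the_layers, self.scf):
            f_rgb, f_the = layer_rgb(x_rgb), layer_the(x_the)  # per-layer feature extraction
            fused.append(f_rgb + f_the)                        # additive fusion feature of the two modalities
            x_rgb, x_the = scf(f_rgb, f_the)                   # spatial cross-modal information fusion
        skips = fused[-4:-1][::-1]                             # fusion features at scales K-2, K-3, K-4
        dec_rgb = self.rmm_rgb(x_rgb, skips)                   # decoding feature of the RGB branch
        dec_the = self.rmm_the(x_the, skips)                   # decoding feature of the thermal infrared branch
        y_rgb = self.cls_rgb(dec_rgb + dec_the)                # first semantic segmentation image (logits)
        y_the = self.cls_the(dec_the)                          # second semantic segmentation image (logits)
        return y_rgb, y_the
```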
Constructing a loss function. When training the RGB-T image semantic segmentation model, the invention applies cross-modal mutual learning directly to the semi-supervised semantic segmentation task: the training set consists of RGB-T image pairs with pixel-level semantic segmentation labels and RGB-T image pairs without pixel-level semantic segmentation labels; the fully supervised learning mode shown in FIG. 3 is adopted for the labeled image pairs, and the cross-modal mutual learning mode shown in FIG. 4 is adopted for the unlabeled image pairs.
Thus, the loss function of the RGB-T image semantic segmentation model is composed of the total training loss on the RGB-T image pairs with pixel-level semantic segmentation labels and the total training loss on the RGB-T image pairs without pixel-level semantic segmentation labels, where h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch of the RGB-T image semantic segmentation model, CE(·) denotes the cross entropy loss function, Y_rgb and Y_the are the corresponding pseudo labels, and M is the random mask pattern applied to the RGB image; the branch outputs entering the loss can be regarded as the per-pixel semantic segmentation predictions on the results after M-CutOut data enhancement. The pseudo labels Y_rgb and Y_the are generated from the semantic segmentation predictions on data with conventional weak data enhancement (such as flipping).
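As a concrete illustration, the following PyTorch-style sketch shows one common way to realize such a supervised-plus-mutual-learning objective. The unsupervised weighting, the routing of pseudo labels between branches, and the helper callables m_cutout and weak_aug are assumptions of this sketch and do not reproduce the patent's exact formulas.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_batch, unlabeled_batch, m_cutout, weak_aug, unsup_weight=1.0):
    """Supervised cross-entropy on labeled pairs plus cross-modal mutual learning on
    unlabeled pairs (weighting and pseudo-label routing are illustrative assumptions)."""
    rgb_l, the_l, labels = labeled_batch
    y_rgb, y_the = model(m_cutout(rgb_l), the_l)             # M-CutOut masks the RGB image only
    sup_loss = F.cross_entropy(y_rgb, labels) + F.cross_entropy(y_the, labels)

    rgb_u, the_u = unlabeled_batch
    with torch.no_grad():                                    # pseudo labels from weakly augmented data (e.g. flipping)
        rgb_w, the_w = weak_aug(rgb_u, the_u)                # weak augmentation applied consistently to the pair
        p_rgb, p_the = model(rgb_w, the_w)
        pseudo_rgb, pseudo_the = p_rgb.argmax(dim=1), p_the.argmax(dim=1)   # Y_rgb, Y_the
    q_rgb, q_the = model(m_cutout(rgb_u), the_u)             # predictions on strongly augmented data
    # cross-modal mutual learning: each branch is supervised by the other branch's pseudo label
    unsup_loss = F.cross_entropy(q_rgb, pseudo_the) + F.cross_entropy(q_the, pseudo_rgb)
    return sup_loss + unsup_weight * unsup_loss
```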
Training an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set and a loss function.
The RGB-T image semantic segmentation model trained by the invention has low labeling cost and high semantic segmentation performance under the condition of poor illumination.
Three existing schemes are compared: scheme 1 (patent CN112991350A), scheme 2 (patent CN113362349A) and scheme 3 (patent CN113781504A). All of them adopt a fully supervised training mode to build the model, while the present invention trains the model directly on a semi-supervised task (only half of the MFNet dataset is used as labeled data and the other half as unlabeled data) through the cross-modal mutual learning method.
Table 1 is a model test effect comparison table on the road scene RGB-T image dataset MFNet;
table 2 is a semi-supervised semantic segmentation effect contrast table on the road scene RGB-T image dataset MFNet.
Tables 1 and 2 show that, under the same experimental settings, the semantic segmentation effect of the method is better, and that using only half of the labeled data achieves the effect that the prior-art schemes obtain with all labeled data.
TABLE 1
TABLE 2
S12, inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
Specifically, the K feature extraction layers included in the first feature encoder are denoted L_rgb,0 to L_rgb,K-1 respectively;
the K feature extraction layers included in the second feature encoder are denoted L_the,0 to L_the,K-1 respectively;
the K spatial cross-modal information fusion modules jointly included in the first feature encoder and the second feature encoder are denoted SCF_0 to SCF_K-1 respectively.
The S12 includes:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, utilizing L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, utilizing L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i; wherein i ∈ [0, (K-1)]; when i = 0, the RGB information input to L_rgb,i is the RGB image in the target RGB-T image pair and the thermal infrared information input to L_the,i is the thermal infrared image in the target RGB-T image pair; when i ∈ [1, (K-1)], the RGB information input to L_rgb,i is f'_rgb,i-1 obtained by SCF_i-1, and the thermal infrared information input to L_the,i is f'_the,i-1 obtained by SCF_i-1;
the first multi-scale feature iterative fusion module performs spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain a decoding feature, hereinafter denoted F_rgb; the second multi-scale feature iterative fusion module performs spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain a decoding feature, hereinafter denoted F_the; wherein j ∈ [2, K], and f_m,K-j is the feature obtained by additive fusion of f_rgb,K-j and f_the,K-j;
F_rgb and F_the are additively fused to obtain a first additive fusion feature, the first additive fusion feature is processed by the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and the first semantic segmentation image y_rgb is output through the main prediction output layer; F_the is processed by the second pixel-level classifier to obtain the second semantic segmentation image y_the, and the second semantic segmentation image y_the is output through the auxiliary prediction output layer.
Here, the pixel-level classifier mainly performs convolution operation, that is:
Figure BDA0004027721530000124
Figure BDA0004027721530000125
Further, fig. 5 shows the RGB-T feature flow of the feature encoding and decoding stages, where CD (channel-wise denoise) is the channel adaptive noise reducer, ADM (attentive demand map) is the spatially adaptive demand map evaluator, CF (cross-modal fusion) is the cross-modal fusion device, ASPP (atrous spatial pyramid pooling) is the spatial pyramid pooler, and SF (spatial-wise fusion) is the spatial dimension fusion device. CD, ADM and CF together form the SCF module, while ASPP and SF form the RMM module.
Specifically, fig. 6 provides a schematic structural diagram of a spatial cross-mode information fusion module, where in a real scene, both an RGB image and an infrared thermal image are inevitably interfered by noise in a complex environment, such as strong light irradiation, temperature fluctuation caused by an abnormal heat source, and the like. In view of the above noise, the proposed SCF module applies two attention mechanisms, the Channel adaptive noise reducer CD determines which Channel is more reliable through the Channel attention mechanism CA (Channel-wise) and the Spatial adaptive demand pattern evaluator ADM determines which region has more Spatial complementary information fusion requirements through the Spatial attention mechanism SA (Spatial-wise). There are a number of engineering implementations of the two attention mechanisms described above, such as those in CBAM (s.woo, j.park, J-yle, and i.s.kweon.cbam: convolitionallblockattententment module.InECCV, pages3-19, 2018.), which will be readily appreciated, extended and implemented by the skilled artisan.
Thus, the SCF is utilized i For fr gb,i and fthe,i Performing spatial cross-modal information fusion to obtain fr g b ,i and fthe I, comprising:
in the channel adaptive noise reducer, for fr gb,i Performing maximum value pooling and mean value pooling treatment to obtain a first maximum value pooling characteristic MaxPool (f) rg b, i) and a first mean pooling feature MeanPool (f) rgb,i ) Based on the MaxPool (f rgb,i ) And the MeanPool (f) rgb I) generating a first channel attention map using a channel attention mechanism
Figure BDA0004027721530000131
And add said->
Figure BDA0004027721530000132
And fr is equal to gb,i Is the product of fr gb,i Noise reduction features of->
Figure BDA0004027721530000133
At the same time, for f the,i Performing maximum value pooling and mean value pooling treatment to obtain a second maximum value pooling characteristic MaxPool (f) the,i ) And a second averaged pooling feature MeanPool (f) the,i ) Based on the MaxPool (f the,i ) And the MeanPool (f) the,i ) The channel attention mechanism generates a second channel attention map
Figure BDA0004027721530000134
And add said->
Figure BDA0004027721530000135
And f is equal to the The product of i is taken as f the,i Noise reduction features of->
Figure BDA0004027721530000136
Here, the described
Figure BDA0004027721530000137
The formula of (2) is as follows:
Figure BDA0004027721530000138
here, the described
Figure BDA0004027721530000139
The formula of (2) is as follows:
Figure BDA0004027721530000141
wherein, MLP is the multilayer perceptron, sigmoid (·) is Sigmoid function.
In a spatially adaptive demand graph evaluator, based on
Figure BDA0004027721530000142
Know->
Figure BDA0004027721530000143
Generating a first spatially adaptive demand graph using a spatial attention mechanism >
Figure BDA0004027721530000144
At the same time based on->
Figure BDA0004027721530000145
It is known that
Figure BDA0004027721530000146
Generating a second spatially adaptive demand graph using a spatial attention mechanism>
Figure BDA0004027721530000147
Here, the described
Figure BDA0004027721530000148
The formula of (2) is as follows:
Figure BDA0004027721530000149
the said
Figure BDA00040277215300001410
The formula of (2) is as follows:
Figure BDA00040277215300001411
conv (·) is a convolution function, 7*7 convolution kernel is adopted, sigmoid (·) is a Sigmoid function, and the function of the space self-adaptive demand graph is to represent the space complementary information fusion demand of the region.
In a cross-modal fusion device
Figure BDA0004027721530000151
and />
Figure BDA0004027721530000152
Dot product and->
Figure BDA0004027721530000153
Carrying out additive fusion to obtain the frgb, i, and carrying out +.>
Figure BDA0004027721530000154
and />
Figure BDA0004027721530000155
Dot product and->
Figure BDA0004027721530000156
Carrying out additive fusion to obtain the f the,i 。/>
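Based on the description above, the following PyTorch-style sketch outlines one possible SCF module, assuming CBAM-style channel and spatial attention as cited. The class name, the channel-reduction ratio, the exact inputs of the 7x7 demand-map convolution and the direction of the cross-modal fusion are assumptions of this sketch, not the patent's exact design.

```python
import torch
import torch.nn as nn

class SCF(nn.Module):
    """Sketch of a spatial cross-modal information fusion module: channel adaptive
    noise reduction (CD), spatially adaptive demand maps (ADM) and cross-modal
    fusion (CF). Details noted in the lead-in are illustrative assumptions."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        def make_mlp():  # CBAM-style shared MLP of the channel attention
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )
        self.mlp_rgb, self.mlp_the = make_mlp(), make_mlp()
        self.demand_rgb = nn.Conv2d(4, 1, kernel_size=7, padding=3)  # 7x7 conv of the demand-map evaluator
        self.demand_the = nn.Conv2d(4, 1, kernel_size=7, padding=3)

    @staticmethod
    def _denoise(f, mlp):  # channel adaptive noise reducer (CD)
        att = torch.sigmoid(mlp(torch.amax(f, dim=(2, 3), keepdim=True))
                            + mlp(torch.mean(f, dim=(2, 3), keepdim=True)))
        return att * f     # noise-reduction feature

    @staticmethod
    def _descriptors(f):   # channel-wise max/mean maps for the spatial attention branch
        return torch.cat([f.amax(dim=1, keepdim=True), f.mean(dim=1, keepdim=True)], dim=1)

    def forward(self, f_rgb, f_the):
        d_rgb = self._denoise(f_rgb, self.mlp_rgb)
        d_the = self._denoise(f_the, self.mlp_the)
        desc = torch.cat([self._descriptors(d_rgb), self._descriptors(d_the)], dim=1)
        m_rgb = torch.sigmoid(self.demand_rgb(desc))  # where the RGB branch needs complementary information
        m_the = torch.sigmoid(self.demand_the(desc))  # where the thermal branch needs complementary information
        f_rgb_out = d_rgb + m_rgb * d_the             # assumed fusion direction: thermal into RGB
        f_the_out = d_the + m_the * d_rgb             # assumed fusion direction: RGB into thermal
        return f_rgb_out, f_the_out
```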
Fig. 7 provides a schematic structural diagram of the multi-scale feature iterative fusion module, in which a Conv unit consists of a convolution operation, a batch normalization (BN) operation and a ReLU activation. The spatial downsampling in the feature encoding process inevitably causes information loss, and the pixel-level semantic segmentation task depends on detailed texture features; the proposed RMM module compensates for this information loss by iteratively fusing multi-scale features.
Thus, performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the first multi-scale feature iterative fusion module to obtain the decoding feature F_rgb comprises the following steps:
the feature encoding result f'_rgb,K-1 is first embedded with more global features by ASPP to obtain ASPP(f'_rgb,K-1), where ASPP adopts dilation coefficients d = 2, 4, 8 and the number of output feature channels is 256; the upsampled feature of ASPP(f'_rgb,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4 that contain rich texture information, to obtain F_rgb. Considering computational complexity, the fusion features are first reduced in channel dimension by one Conv unit.
Specifically, the iterative fusion runs over z ∈ [1,3] and uses a cascade (concatenation) operation, upsampling Up(·), the mean pooling operation MeanPool(·), the maximum pooling operation MaxPool(·), the Sigmoid function Sigmoid(·) and point multiplication (·). At each step, an adaptive mask is evaluated by a spatial attention mechanism to indicate where and how much information compensation is needed, and the fusion feature f_m,K-z-1 is reduced in channel dimension by a Conv unit before the spatial-dimension fusion.
Likewise, performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the second multi-scale feature iterative fusion module to obtain the decoding feature F_the comprises the following steps:
the feature encoding result f'_the,K-1 is first embedded with more global features by ASPP to obtain ASPP(f'_the,K-1); the upsampled feature of ASPP(f'_the,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4 that contain rich texture information, to obtain F_the, in the same manner as in the RGB branch, with an adaptive mask obtained by the spatial attention mechanism evaluation at each step.
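The following PyTorch-style sketch illustrates one possible RMM, using the ASPP dilation rates 2, 4, 8 and 256 output channels stated above. The ConvUnit/RMM class names, the adaptive-mask arithmetic and the way features are combined at each iteration are assumptions of this sketch rather than the patent's exact formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Sequential):
    """Conv unit: convolution + batch normalization + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class RMM(nn.Module):
    """Sketch of a multi-scale feature iterative fusion module (ASPP plus iterative
    spatial-dimension fusion with an adaptive mask); see the lead-in for assumptions."""

    def __init__(self, in_ch, skip_chs, mid_ch=256):
        super().__init__()
        # simplified ASPP: parallel dilated convolutions with d = 2, 4, 8, projected to mid_ch channels
        self.aspp = nn.ModuleList([nn.Conv2d(in_ch, mid_ch, 3, padding=d, dilation=d) for d in (2, 4, 8)])
        self.aspp_proj = ConvUnit(3 * mid_ch, mid_ch)
        # one Conv unit per fusion feature for channel dimension reduction
        self.reduce = nn.ModuleList([ConvUnit(c, mid_ch) for c in skip_chs])
        self.mask_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # spatial attention for the adaptive mask

    def forward(self, deep_feat, skips):
        # skips: fusion features at scales K-2, K-3, K-4, ordered from deeper (smaller) to shallower (larger)
        x = self.aspp_proj(torch.cat([branch(deep_feat) for branch in self.aspp], dim=1))
        for reduce, skip in zip(self.reduce, skips):  # z = 1, 2, 3
            skip = reduce(skip)                       # channel dimension reduction by a Conv unit
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)  # Up(.)
            desc = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
            mask = torch.sigmoid(self.mask_conv(desc))  # where and how much information compensation is needed
            x = x + mask * skip                         # spatial-dimension fusion with the texture-rich feature
        return x                                        # decoding feature
```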
S13, selecting one of the first semantic segmentation image and the second semantic segmentation image as a semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair.
Specifically, S13 includes:
under the condition that no texture information is missing in an RGB image and a thermal infrared image in the target RGB-T image pair, taking the first semantic segmentation image as a semantic segmentation image of the target RGB-T image pair;
under the condition that texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, for a daytime scene the first semantic segmentation image is taken as the semantic segmentation image of the target RGB-T image pair, and for a night scene the second semantic segmentation image is taken as the semantic segmentation image of the target RGB-T image pair.
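The selection rule above can be summarized by the small sketch below; how the texture-missing flags and the day/night indicator are obtained is not specified by the patent and is an assumption of this sketch.

```python
def select_segmentation(first_seg, second_seg, rgb_texture_missing, thermal_texture_missing, is_daytime):
    """Pick the final semantic segmentation image for a target RGB-T image pair.
    first_seg is the RGB-branch output, second_seg the thermal-infrared-branch output."""
    if not rgb_texture_missing and not thermal_texture_missing:
        return first_seg                            # no texture loss: use the RGB-branch result
    return first_seg if is_daytime else second_seg  # daytime scene -> RGB branch, night scene -> thermal branch
```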
The dual-branch structure of the invention enables the RGB-T semantic segmentation model to better cope with the loss of the single-modality data signal. Table 3 compares the results of signal-loss robustness experiments on the road scene RGB-T image dataset MFNet; the experiments show that the invention obtains better semantic segmentation results under signal loss.
TABLE 3
In summary, according to the RGB-T image semantic segmentation method provided by the invention, an RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
In addition, on the basis of this scheme, similar effects can be achieved by using different feature extraction backbone networks and by modifying parameters in each module (such as the number of convolutional layers, the number of channels, or the activation function), and similar semi-supervised semantic segmentation training can be achieved with different combinations of strong and weak data enhancement.
In a second aspect, the present invention provides an RGB-T image semantic segmentation apparatus. The RGB-T image semantic segmentation apparatus described below and the RGB-T image semantic segmentation method described above may be referred to in correspondence with each other. Fig. 8 is a schematic flow chart of an RGB-T image semantic segmentation apparatus provided by the present invention; as shown in fig. 8, the apparatus includes:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
The selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation apparatus. An RGB-T image semantic segmentation dataset is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network adaptively and complementarily fuses the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that the network can deeply mine the cross-modal spatially complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with the loss of the texture signal in single-modality data. (3) The RGB-T image semantic segmentation dataset is obtained by applying RGB-modality random-mask data enhancement to a semi-labeled RGB-T image pair dataset, which introduces new inter-modality spatially complementary information regions and thereby makes full use of the labeled data when training the RGB-T image semantic segmentation model. Therefore, the obtained RGB-T image semantic segmentation model makes better use of cross-modal spatially complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination conditions at a lower labeling cost, which helps reduce the cost and improve the efficiency of fine-grained perception of complex environments. A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the lack of texture information in the target RGB-T image pair, one of the first semantic segmentation image and the second semantic segmentation image is selected as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair is more accurate.
In a third aspect, fig. 9 illustrates a schematic structural diagram of an electronic device. As shown in fig. 9, the electronic device may include: a processor 910, a communication interface 920, a memory 930 and a communication bus 940, wherein the processor 910, the communication interface 920 and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform an RGB-T image semantic segmentation method, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a method of RGB-T image semantic segmentation provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of semantic segmentation of RGB-T images provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
The RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
2. The RGB-T image semantic segmentation method of claim 1, wherein the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel-level classifier, and a main prediction output layer;
the thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
wherein the first feature encoder comprises K feature extraction layers, denoted L_rgb,0 to L_rgb,K-1 respectively;
the second feature encoder comprises K feature extraction layers, denoted L_the,0 to L_the,K-1 respectively;
the first feature encoder and the second feature encoder jointly comprise K spatial cross-modal information fusion modules, denoted SCF_0 to SCF_K-1 respectively;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch, wherein the method comprises the following steps:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, utilizing L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, utilizing L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i; wherein i ∈ [0, (K-1)]; when i = 0, the RGB information input to L_rgb,i is the RGB image in the target RGB-T image pair and the thermal infrared information input to L_the,i is the thermal infrared image in the target RGB-T image pair; when i ∈ [1, (K-1)], the RGB information input to L_rgb,i is f'_rgb,i-1 obtained by SCF_i-1, and the thermal infrared information input to L_the,i is f'_the,i-1 obtained by SCF_i-1;
performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the first multi-scale feature iterative fusion module to obtain a first decoding feature, and performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 by using the second multi-scale feature iterative fusion module to obtain a second decoding feature; wherein j ∈ [2, K], and f_m,K-j is the feature obtained by additive fusion of f_rgb,K-j and f_the,K-j;
additively fusing the first decoding feature and the second decoding feature to obtain a first additive fusion feature, processing the first additive fusion feature with the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and outputting the first semantic segmentation image y_rgb through the main prediction output layer;
processing the second decoding feature with the second pixel-level classifier to obtain the second semantic segmentation image y_the, and outputting the second semantic segmentation image y_the through the auxiliary prediction output layer.
3. The RGB-T image semantic segmentation method of claim 2, wherein the spatial cross-modal information fusion module comprises: a channel adaptive noise reducer, a spatially adaptive demand graph evaluator and a cross-modal fuser;
performing spatial cross-modal information fusion on f_rgb,i and f_the,i with SCF_i to obtain f'_rgb,i and f'_the,i comprises:
in the channel adaptive noise reducer, performing max pooling and mean pooling on f_rgb,i to obtain a first max-pooled feature MaxPool(f_rgb,i) and a first mean-pooled feature MeanPool(f_rgb,i), generating a first channel attention map A_rgb,i from MaxPool(f_rgb,i) and MeanPool(f_rgb,i) using a channel attention mechanism, and taking the product of A_rgb,i and f_rgb,i as the noise-reduced feature f̃_rgb,i of f_rgb,i;
meanwhile, performing max pooling and mean pooling on f_the,i to obtain a second max-pooled feature MaxPool(f_the,i) and a second mean-pooled feature MeanPool(f_the,i), generating a second channel attention map A_the,i from MaxPool(f_the,i) and MeanPool(f_the,i) using the channel attention mechanism, and taking the product of A_the,i and f_the,i as the noise-reduced feature f̃_the,i of f_the,i;
in the spatially adaptive demand graph evaluator, generating a first spatially adaptive demand graph D_rgb,i and a second spatially adaptive demand graph D_the,i from the noise-reduced features f̃_rgb,i and f̃_the,i using a spatial attention mechanism;
in the cross-modal fuser, additively fusing the dot product of D_rgb,i and f̃_the,i with f̃_rgb,i to obtain the f'_rgb,i, and additively fusing the dot product of D_the,i and f̃_rgb,i with f̃_the,i to obtain the f'_the,i.
4. The RGB-T image semantic segmentation method according to claim 3, characterized in that the first channel attention map A_rgb,i is computed from MaxPool(f_rgb,i) and MeanPool(f_rgb,i) by a multilayer perceptron followed by a Sigmoid function, and the second channel attention map A_the,i is computed from MaxPool(f_the,i) and MeanPool(f_the,i) in the same manner;
wherein MLP is the multilayer perceptron and Sigmoid(·) is the Sigmoid function.
5. The RGB-T image semantic segmentation method according to claim 3, characterized in that the first spatially adaptive demand graph D_rgb,i and the second spatially adaptive demand graph D_the,i are computed from the noise-reduced features by a convolution followed by a Sigmoid function;
wherein Conv(·) is a convolution function and Sigmoid(·) is the Sigmoid function.
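Claims 3 to 5 describe the SCF module as a channel adaptive noise reducer, a spatially adaptive demand graph evaluator and a cross-modal fuser, but the concrete expressions appear only as images in the filing. The sketch below is one plausible reading of those descriptions: channel attention from pooled statistics through an MLP and a Sigmoid, demand graphs from a convolution plus Sigmoid over both noise-reduced features, and demand-gated cross-modal dot products added to the own-modality features. The layer sizes, the 7x7 spatial convolution and the fusion direction are assumptions.

import torch
import torch.nn as nn

class SCF(nn.Module):
    """Plausible sketch of one spatial cross-modal information fusion module."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Per-modality MLP for channel attention (claim 4 names an MLP and a Sigmoid).
        def mlp():
            return nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.mlp_rgb, self.mlp_the = mlp(), mlp()
        # Conv + Sigmoid spatial attention for the demand graphs (claim 5).
        self.demand_rgb = nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3)
        self.demand_the = nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3)

    @staticmethod
    def _channel_attn(x, mlp):
        b, c, _, _ = x.shape
        mx = torch.amax(x, dim=(2, 3))            # MaxPool over space  -> (B, C)
        mn = torch.mean(x, dim=(2, 3))            # MeanPool over space -> (B, C)
        a = torch.sigmoid(mlp(mx) + mlp(mn))      # channel attention map
        return a.view(b, c, 1, 1)

    def forward(self, f_rgb, f_the):
        # Channel adaptive noise reduction: attention map times the input feature.
        nr_rgb = self._channel_attn(f_rgb, self.mlp_rgb) * f_rgb     # ~ f̃_rgb,i
        nr_the = self._channel_attn(f_the, self.mlp_the) * f_the     # ~ f̃_the,i
        both = torch.cat([nr_rgb, nr_the], dim=1)
        # Spatially adaptive demand graphs: where each modality wants the other's detail.
        d_rgb = torch.sigmoid(self.demand_rgb(both))
        d_the = torch.sigmoid(self.demand_the(both))
        # Cross-modal fuser: demand-gated other-modality feature added to the own feature.
        f_rgb_out = nr_rgb + d_rgb * nr_the       # ~ f'_rgb,i
        f_the_out = nr_the + d_the * nr_rgb       # ~ f'_the,i
        return f_rgb_out, f_the_out

# Example: f_rgb2, f_the2 = SCF(64)(torch.randn(2, 64, 60, 80), torch.randn(2, 64, 60, 80))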
6. The RGB-T image semantic segmentation method of claim 2, wherein the first multi-scale feature iterative fusion module and the second multi-scale feature iterative fusion module each comprise: a spatial pyramid pooler and a spatial dimension fuser;
performing spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 with the first multi-scale feature iterative fusion module to obtain the decoding feature d_rgb comprises:
generating the global feature corresponding to f'_rgb,K-1 with the spatial pyramid pooler in the first multi-scale feature iterative fusion module;
in the spatial dimension fuser in the first multi-scale feature iterative fusion module, performing spatial-dimension iterative fusion on the upsampled feature u_rgb of f'_rgb,K-1 and f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_rgb;
performing spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 with the second multi-scale feature iterative fusion module to obtain the decoding feature d_the comprises:
generating the global feature corresponding to f'_the,K-1 with the spatial pyramid pooler in the second multi-scale feature iterative fusion module;
in the spatial dimension fuser in the second multi-scale feature iterative fusion module, performing spatial-dimension iterative fusion on the upsampled feature u_the of f'_the,K-1 and f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_the.
7. The RGB-T image semantic segmentation method of claim 6, wherein the decoding feature d_rgb is obtained by iterating, for z ∈ [1,3], a spatial-dimension fusion step that combines the upsampled running feature of the RGB branch with the channel-reduced feature of f_m,K-z-1 under an adaptive mask, and the decoding feature d_the is obtained in the same manner on the thermal infrared branch;
wherein z ∈ [1,3], [·;·] is the cascade (concatenation) operation, Up(·) is the upsampling operation, MeanPool(·) is the mean pooling operation, MaxPool(·) is the maximum pooling operation, Sigmoid(·) is the Sigmoid function, ⊙ is the dot multiplication operation, the adaptive masks are obtained by performing spatial attention mechanism evaluation on the corresponding intermediate fusion features, and the channel-reduced feature is obtained by performing channel dimension reduction on f_m,K-z-1 with a Conv unit, the Conv unit comprising a convolution operation, a batch normalization operation and a ReLU activation.
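Claims 6 and 7 describe the decoder as a spatial pyramid pooler followed by three spatial-dimension fusion steps (z ∈ [1,3]) that combine the upsampled running feature with a channel-reduced f_m,K-z-1 under a spatially evaluated adaptive mask; the update equations themselves are image-rendered in the filing. The sketch below shows one way a single fusion step could look, building the mask from concatenated mean- and max-pooled maps, a convolution and a Sigmoid. The channel widths, the mask construction and the omission of the pyramid pooler are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Sequential):
    """Conv + BatchNorm + ReLU, as described for the channel-reducing Conv unit."""
    def __init__(self, cin, cout, k=3):
        super().__init__(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SpatialFusionStep(nn.Module):
    """One spatial-dimension fusion step: mask the upsampled running feature,
    then additively fuse it with the channel-reduced skip feature f_m,K-z-1."""
    def __init__(self, run_ch, skip_ch, out_ch):
        super().__init__()
        self.reduce_run = ConvUnit(run_ch, out_ch)
        self.reduce_skip = ConvUnit(skip_ch, out_ch)             # reduces f_m,K-z-1
        self.mask = nn.Conv2d(2, 1, kernel_size=7, padding=3)    # spatial attention

    def forward(self, running, skip):
        up = F.interpolate(running, size=skip.shape[-2:], mode='bilinear',
                           align_corners=False)                  # Up(.)
        up = self.reduce_run(up)
        skip = self.reduce_skip(skip)
        pooled = torch.cat([up.mean(dim=1, keepdim=True),
                            up.amax(dim=1, keepdim=True)], dim=1)  # [MeanPool ; MaxPool]
        m = torch.sigmoid(self.mask(pooled))                     # adaptive mask
        return m * up + skip                                     # dot product + additive fusion

# Example of the three-step iteration (z = 1..3), coarse to fine:
# step = SpatialFusionStep(run_ch=512, skip_ch=256, out_ch=256)
# fused = step(torch.randn(1, 512, 15, 20), torch.randn(1, 256, 30, 40))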
8. The RGB-T image semantic segmentation method according to claim 1, characterized in that the loss function L of the RGB-T image semantic segmentation model comprises a total training loss L_l for RGB-T image pairs with pixel-level semantic segmentation labels and a total training loss L_u for RGB-T image pairs without pixel-level semantic segmentation labels;
wherein L_l is the total training loss of an RGB-T image pair x_l with pixel-level semantic segmentation labels, L_u is the total training loss of an RGB-T image pair x_u without pixel-level semantic segmentation labels, G is the pixel-level semantic segmentation label of x_l, h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch of the RGB-T image semantic segmentation model, CE(·) represents the cross entropy loss function, Y_rgb and Y_the are the pseudo labels corresponding to x_u, M is the mask of x_u, and y is the semantic segmentation prediction corresponding to a single pixel point.
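The loss expressions of claim 8 are image-rendered in the filing; the text specifies cross entropy supervision of both branches with the label G on labeled pairs and pseudo labels Y_rgb, Y_the with a mask M on unlabeled pairs. The sketch below realizes that description with confidence-thresholded pseudo labels and cross supervision between the branches; the pseudo-label construction, the mask M and the choice of cross supervision are assumptions, not the exact loss of the filing.

import torch
import torch.nn.functional as F

def rgbt_loss(logits_rgb, logits_the, label=None, conf_thresh=0.9):
    """Sketch of a labeled/unlabeled training loss for the two branches.

    logits_*: (B, C, H, W) branch predictions; label: (B, H, W) class indices,
    or None for RGB-T pairs without pixel-level annotation.
    """
    if label is not None:
        # Labeled pairs: cross entropy of both branches against the annotation G.
        return F.cross_entropy(logits_rgb, label) + F.cross_entropy(logits_the, label)
    # Unlabeled pairs: each branch is supervised by the other's pseudo label,
    # masked by prediction confidence (an assumed construction of M).
    with torch.no_grad():
        conf_rgb, y_rgb = logits_rgb.softmax(dim=1).max(dim=1)   # pseudo label from RGB branch
        conf_the, y_the = logits_the.softmax(dim=1).max(dim=1)   # pseudo label from thermal branch
        mask_rgb = (conf_rgb > conf_thresh).float()
        mask_the = (conf_the > conf_thresh).float()
    loss_rgb = F.cross_entropy(logits_rgb, y_the, reduction='none') * mask_the
    loss_the = F.cross_entropy(logits_the, y_rgb, reduction='none') * mask_rgb
    return loss_rgb.mean() + loss_the.mean()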
9. The RGB-T image semantic segmentation method according to any one of claims 1 to 8, wherein selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair comprises:
when no texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair;
and when texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a daytime scene, and taking the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a night scene.
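The selection rule of claim 9 reduces to a small decision; a helper sketch follows, assuming the texture-missing condition and the day/night condition are supplied as booleans (how they are detected is outside this claim).

def select_segmentation(y_rgb, y_the, texture_missing: bool, is_daytime: bool):
    """Pick the output segmentation per claim 9.

    y_rgb / y_the: first and second semantic segmentation images.
    texture_missing: True if either the RGB or the thermal image lacks texture information.
    is_daytime: True for a daytime scene, False for a night scene.
    """
    if not texture_missing:
        return y_rgb                       # no texture missing: use the RGB-branch output
    return y_rgb if is_daytime else y_the  # texture missing: day -> RGB branch, night -> thermal branch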
10. An RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by performing data enhancement on a first data set by randomly masking the RGB images; the first data set is obtained by performing pixel-level semantic segmentation labeling on a portion of the RGB-T image pairs in a data set formed of RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines cross-modal spatially complementary texture features of an input RGB-T image pair through spatial cross-modal information fusion and multi-scale feature iterative fusion.
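The data enhancement named here — randomly masking the RGB image of a training pair while leaving the thermal image intact — could look like the sketch below; the number of masks, their size range and the fill value are illustrative assumptions.

import torch

def random_mask_rgb(rgb: torch.Tensor, num_masks: int = 3, max_frac: float = 0.3,
                    fill: float = 0.0) -> torch.Tensor:
    """Randomly erase rectangular regions of an RGB image (C, H, W).

    The thermal image of the pair is left untouched, so the network must
    recover the erased RGB texture from the thermal modality.
    """
    out = rgb.clone()
    _, h, w = out.shape
    for _ in range(num_masks):
        mh = int(torch.randint(1, max(2, int(h * max_frac)), (1,)))   # mask height
        mw = int(torch.randint(1, max(2, int(w * max_frac)), (1,)))   # mask width
        top = int(torch.randint(0, h - mh + 1, (1,)))
        left = int(torch.randint(0, w - mw + 1, (1,)))
        out[:, top:top + mh, left:left + mw] = fill                   # erase one rectangle
    return out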
CN202211715697.1A 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device Pending CN116091765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211715697.1A CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211715697.1A CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Publications (1)

Publication Number Publication Date
CN116091765A true CN116091765A (en) 2023-05-09

Family

ID=86209687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211715697.1A Pending CN116091765A (en) 2022-12-29 2022-12-29 RGB-T image semantic segmentation method and device

Country Status (1)

Country Link
CN (1) CN116091765A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557795A (en) * 2024-01-10 2024-02-13 吉林大学 Underwater target semantic segmentation method and system based on multi-source data fusion
CN117557795B (en) * 2024-01-10 2024-03-29 吉林大学 Underwater target semantic segmentation method and system based on multi-source data fusion

Similar Documents

Publication Publication Date Title
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
Zhuo et al. Self-adversarial training incorporating forgery attention for image forgery localization
US20230021661A1 (en) Forgery detection of face image
CN111932431B (en) Visible watermark removing method based on watermark decomposition model and electronic equipment
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN115272437A (en) Image depth estimation method and device based on global and local features
CN116091765A (en) RGB-T image semantic segmentation method and device
Sheng et al. A joint framework for underwater sequence images stitching based on deep neural network convolutional neural network
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
CN116757978A (en) Infrared and visible light image self-adaptive fusion method, system and electronic equipment
CN115631205B (en) Method, device and equipment for image segmentation and model training
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN111325068B (en) Video description method and device based on convolutional neural network
Guo et al. A Markov random field model for the restoration of foggy images
Rahmon et al. Deepftsg: Multi-stream asymmetric use-net trellis encoders with shared decoder feature fusion architecture for video motion segmentation
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
Chen et al. Exploring efficient and effective generative adversarial network for thermal infrared image colorization
Kumar et al. Encoder–decoder-based CNN model for detection of object removal by image inpainting
CN117057969B (en) Cross-modal image-watermark joint generation and detection device and method
Tyagi et al. ForensicNet: Modern convolutional neural network‐based image forgery detection network
Dangle et al. Qualitative Colorization of Thermal Infrared Images using custom Convolutional Neural Networks
Sang et al. MPA‐Net: multi‐path attention stereo matching network
CN116246071A (en) Attention mechanism-based self-adaptive multi-mode scene segmentation method in dark and weak environment
Sun et al. Underwater visual feature matching based on attenuation invariance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination