CN112991350B - RGB-T image semantic segmentation method based on modal difference reduction - Google Patents


Info

Publication number
CN112991350B
CN112991350B CN202110187778.8A CN112991350A
Authority
CN
China
Prior art keywords
features
channel
rgb
correlation matrix
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110187778.8A
Other languages
Chinese (zh)
Other versions
CN112991350A (en)
Inventor
张强
赵什陆
黄年昌
张鼎文
韩军功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110187778.8A priority Critical patent/CN112991350B/en
Publication of CN112991350A publication Critical patent/CN112991350A/en
Application granted granted Critical
Publication of CN112991350B publication Critical patent/CN112991350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an RGB-T image semantic segmentation method based on modal difference reduction, which comprises the following steps: (1) constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model; (2) constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion of the multi-level RGB features and thermal infrared features through the weighted fusion module to obtain multi-level fusion features; (3) taking the multi-level fusion features and obtaining a spatial correlation matrix and a channel correlation matrix through calculation; (4) restoring the spatial correlation matrix and the channel correlation matrix to full resolution through a deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through a channel transformation operation and a softmax function; and (5) training the algorithm network to obtain the model parameters.

Description

RGB-T image semantic segmentation method based on modal difference reduction
Technical Field
The invention belongs to the field of image processing and relates to an RGB-T image semantic segmentation method based on modal difference reduction, which can be used in the image preprocessing stage of computer vision tasks.
Background
Semantic segmentation aims at assigning a class label to each pixel in a natural image using a model or algorithm. As one of the key technologies for scene perception, semantic segmentation plays a vital role in computer vision tasks such as autonomous driving, pedestrian detection and medical image analysis.
Existing semantic segmentation methods can be divided into two main categories: traditional semantic segmentation methods and semantic segmentation methods based on deep learning. Traditional semantic segmentation methods mainly complete image semantic segmentation by combining low-level hand-crafted features with a shallow classifier. Such methods have poor robustness and find it difficult to obtain satisfactory results in complex scenes. With the wide application of deep learning technology, semantic segmentation methods based on deep learning have made breakthrough progress, and compared with traditional methods they achieve better segmentation results and stronger robustness.
So far, RGB image semantic segmentation methods based on deep learning have achieved remarkable results. However, under poor lighting conditions the performance of these algorithms may degrade significantly. The thermal infrared image can provide contour information and semantic information of the target, and can effectively complement the RGB image.
The existing RGB-T semantic segmentation methods usually adopt simple strategies to capture the complementary information in RGB images and thermal infrared images. For example, "Yuxiang Sun, Weixun Zuo, and Ming Liu. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. RAL, 4(3):2576-2583, 2019" directly fuses the features of the two modality images at each level in the encoder using only element-wise addition; "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" directly fuses the features of the two modality images at each level in the decoder using only concatenation. These methods do not take into account the modal differences between RGB images and thermal infrared images caused by their different imaging mechanisms. As a result, simple fusion strategies cannot fully exploit the cross-modality complementary information, which reduces the accuracy of RGB-T image semantic segmentation.
In addition, the diversity of objects in the image to be segmented, such as their class, size and shape, is also one of the key issues in the semantic segmentation task. In single-modality RGB image semantic segmentation algorithms, multi-scale context information and its long-term dependencies have proven to be effective means of addressing this problem. In RGB-T semantic segmentation, however, multi-scale context information and its long-term dependencies have not been well mined and utilized; only "Qishen Ha, Kohei Watanabe, Takumi Karasawa, Yoshitaka Ushiku, and Tatsuya Harada. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IROS, pages 5108-5115, 2017" uses a parallel structure of convolutions with two different receptive fields to acquire a small amount of context information, which has a very limited effect on RGB-T semantic segmentation in complex scenes, so the problem of target diversity still cannot be effectively solved.
Disclosure of Invention
The invention aims to: in view of the defects of the prior art, provide an RGB-T image semantic segmentation method based on modal difference reduction, which mainly solves the problems that the prior art neither considers the modal differences between visible-light images and thermal infrared images nor makes sufficient use of context information, resulting in low semantic segmentation accuracy.
The key to realizing the invention is that the modal differences between the RGB features and the thermal infrared features are reduced and the features are then fused in the network encoding stage, so that the fused features are more discriminative, and the multi-scale context information of the fused features and its long-term dependencies are fully mined.
The technical scheme is as follows: an RGB-T image semantic segmentation method based on modal difference reduction comprises the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from the input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Acquiring the multi-level fusion characteristics obtained in the step (2), obtaining a space correlation matrix and a channel correlation matrix through calculation, acting the space correlation matrix and the channel correlation matrix on the multi-scale characteristics, and establishing a relation between multi-scale context information and long-term dependence on space and channel dimensions thereof;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
and (3) on a training data set, adopting a supervised learning model for the predicted semantic segmentation mask graph in the step (4) and the pseudo image features generated in the step (1), and completing algorithm network training end to end through the weighted cross entropy loss function and the weighted average absolute error loss function to obtain network model parameters.
Further, the bidirectional modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality; the two parts adopt encoder-decoder-encoder networks with the same structure, wherein the encoders use a ResNet-50 network and a ResNet-18 network, and the decoder uses an image generation network to generate a pseudo image through a bilinear-interpolation up-sampling strategy.
Further, in step (1), the sub-network simultaneously reduces the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the generated pseudo thermal infrared image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{T}^{i}\}_{i=1}^{5}$ of the corresponding true thermal infrared image, extracted by another ResNet-18 network, as well as the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the generated pseudo RGB image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{RGB}^{i}\}_{i=1}^{5}$ of the corresponding true RGB image, extracted by another ResNet-18 network.
To obtain the more discriminative RGB multi-level features $\{F_{RGB}^{i}\}_{i=1}^{5}$ extracted by a ResNet-50 network and the corresponding thermal infrared multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ extracted by another ResNet-50 network, $\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$.
Further, the adaptive channel weighted fusion module in step (2) takes the first four levels of RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ of the RGB image obtained in step (1) and the first four levels of features $\{F_{T}^{i}\}_{i=1}^{4}$ of the corresponding thermal infrared image as input, adaptively generates the RGB weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the corresponding levels and the thermal infrared weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding levels, and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features $\{F_{fuse}^{i}\}_{i=1}^{4}$.
Further, in step (3), the inputs of the multi-scale spatial context module and the multi-scale channel context module are the fusion features $F_{fuse}^{3}$ and $F_{fuse}^{4}$, respectively, so as to establish the interaction between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale spatial context module comprises a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix.
Still further, step (31) includes:
(311) The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, denoted $C(*;\theta_1)$, $C(*;\theta_2)$, $C(*;\theta_3)$, $C(*;\theta_4)$, which are followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_1^{d}$, denoted $D(*;\theta_1^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_2^{d}$, denoted $D(*;\theta_2^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_3^{d}$, denoted $D(*;\theta_3^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_4^{d}$, denoted $D(*;\theta_4^{d})$;
the four paths respectively yield features $d_1$, $d_2$, $d_3$, $d_4$ of different scales, each with 256 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameter $\theta_5$, $C(*;\theta_5)$, giving a feature $F_{ms}$ that contains rich multi-scale context information and has the same number of channels as the input $F_{fuse}^{3}$;
(312) The multi-scale feature $F_{ms}$ obtained in step (311) is size-transformed and multiplied with its own transpose matrix to obtain the auto-spatial correlation matrix $M_{ss}\in\mathbb{R}^{HW\times HW}$;
(313) The original input feature $F_{fuse}^{3}$ is processed in the same manner as in step (312) to obtain the cross-spatial correlation matrix $M_{cs}\in\mathbb{R}^{HW\times HW}$ as an information-supplementing part;
(314) The auto-spatial correlation matrix $M_{ss}$ and the cross-spatial correlation matrix $M_{cs}$ are added element by element and normalized to obtain the total spatial correlation matrix $M_{s}\in\mathbb{R}^{HW\times HW}$, which is then multiplied with the multi-scale feature $F_{ms}$; a skip connection path is added to obtain the feature $F_{MSC}$ containing multi-scale context information and its spatial long-term dependencies.
Still further, step (32) includes:
(321) The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_6$, $\theta_7$, $\theta_8$, $\theta_9$, denoted $C(*;\theta_6)$, $C(*;\theta_7)$, $C(*;\theta_8)$, $C(*;\theta_9)$, which are followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_5^{d}$, denoted $D(*;\theta_5^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_6^{d}$, denoted $D(*;\theta_6^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_7^{d}$, denoted $D(*;\theta_7^{d})$;
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_8^{d}$, denoted $D(*;\theta_8^{d})$;
the four paths respectively yield features $d_5$, $d_6$, $d_7$, $d_8$ of different scales, each with 512 channels; after concatenation they pass through a 1×1 convolution with stride 1 and parameter $\theta_{10}$, $C(*;\theta_{10})$, giving a feature $F_{mc}$ that contains rich multi-scale context information and has the same number of channels as the input $F_{fuse}^{4}$;
(322) The multi-scale feature $F_{mc}$ obtained in step (321) is size-transformed and multiplied with its own transpose matrix to obtain the auto-channel correlation matrix $M_{sc}\in\mathbb{R}^{1024\times 1024}$;
(323) The original input feature $F_{fuse}^{4}$ is processed in the same manner as in step (322) to obtain the cross-channel correlation matrix $M_{cc}\in\mathbb{R}^{1024\times 1024}$ as an information-supplementing part;
(324) The auto-channel correlation matrix $M_{sc}$ and the cross-channel correlation matrix $M_{cc}$ are added element by element and normalized to obtain the total channel correlation matrix $M_{c}\in\mathbb{R}^{1024\times 1024}$, which is then multiplied with the multi-scale feature $F_{mc}$; a skip connection path is added to obtain the feature $F_{MCC}$ containing multi-scale context information and its channel long-term dependencies.
Further, in step (4), a deconvolution operation is used to up-sample the feature map and restore the resolution; a 1×1 convolution with stride 1 and parameter $\theta_{11}$, $C(*;\theta_{11})$, then changes the number of channels to the number of classes of the data set, and finally a softmax function predicts the class to which each pixel belongs, giving the semantic segmentation mask map.
The beneficial effects are that: compared with the prior art, the RGB-T image semantic segmentation method based on modal difference reduction has the following beneficial effects:
1) The invention can realize end-to-end pixel-level semantic segmentation prediction for RGB-T image pairs without manual design and feature extraction; simulation results show that the invention remarkably improves semantic segmentation accuracy and also achieves better segmentation of small targets and in complex scenes;
2) The invention designs a "reduce first, then fuse" strategy: it first reduces the modal differences between the multi-modal data caused by their different imaging mechanisms by constructing a method based on bidirectional image conversion, and then adaptively selects strongly discriminative multi-modal features to improve the RGB-T semantic segmentation effect. Compared with existing methods, the multi-modal features extracted by the method are more discriminative, which helps to improve the accuracy of target category prediction;
3) The invention fully mines rich context information by establishing the interaction between the multi-scale context information of the cross-modal features and its long-term dependencies in the spatial and channel dimensions, which helps to solve the problem of target diversity. Compared with existing methods, the method can better segment targets of different scales, while improving the completeness of segmentation inside the targets.
Drawings
FIG. 1 is a flow chart of a RGB-T image semantic segmentation method based on modal difference reduction;
FIG. 2 is a schematic diagram of an algorithm network of an RGB-T image semantic segmentation method based on modal difference reduction, wherein a dotted line frame represents a bidirectional modal difference reduction sub-network, CWF represents an adaptive channel weighting fusion module, MSC represents a multi-scale space context module, and MCC represents a multi-scale channel context module;
fig. 3 is a schematic diagram of a framework of an adaptive channel weighted fusion module (CWF) according to the present invention;
FIG. 4 is a diagram of a multi-scale space context Module (MSC) framework in accordance with the present invention;
fig. 5 is a multi-scale channel context Module (MCC) framework diagram according to the present invention.
The specific embodiments are as follows:
Specific embodiments of the invention are described in detail below.
Referring to fig. 1, a method for semantic segmentation of RGB-T images based on modal disparity reduction includes the steps of:
(1) Constructing a bi-directional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from an input RGB and thermal infrared registered image pair and simultaneously constructing a supervised learning model, wherein:
as shown in fig. 2, the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
step 1) when the characteristic difference reduction from RGB mode to thermal infrared mode is carried out, the ResNet-50 is used for extracting the multi-level characteristics of the RGB image
$\{F_{RGB}^{i}\}_{i=1}^{5}$, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the resolution of the input image, and whose channel numbers are 64, 256, 512, 1024 and 2048, respectively. Then four 3×3 convolutions with stride 1 are used to reduce the multi-level features to single-channel feature maps, which are up-sampled by bilinear interpolation and summed to generate the pseudo thermal infrared image $I_{pse\text{-}T}$. The multi-level features $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the pseudo thermal infrared image $I_{pse\text{-}T}$ are extracted with a ResNet-18, the five multi-level features $\{f_{T}^{i}\}_{i=1}^{5}$ of different resolutions of the corresponding true thermal infrared image are simultaneously extracted with another ResNet-18, and the differences between the true and pseudo features of the corresponding levels are calculated.
Similarly, when reducing the feature differences from the thermal infrared modality to the RGB modality, the multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ of the thermal infrared image are first extracted with a ResNet-50, and a three-channel pseudo RGB image $I_{pse\text{-}RGB}$ is generated in the same manner. The multi-level features $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the pseudo RGB image $I_{pse\text{-}RGB}$ are then extracted with a ResNet-18, the five multi-level features $\{f_{RGB}^{i}\}_{i=1}^{5}$ of different resolutions of the corresponding true RGB image are simultaneously extracted with another ResNet-18, and the differences between the true and pseudo features of the corresponding levels are calculated.
$\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$.
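The pseudo-image generation described above can be sketched as follows; this is a minimal PyTorch-style illustration under stated assumptions (the module name and the set of feature levels are illustrative; out_channels is 1 for the pseudo thermal infrared image and 3 for the pseudo RGB image), not the exact decoder of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoImageDecoder(nn.Module):
    """Sketch of the image-translation decoder: each multi-level feature is
    reduced to a single-channel (or 3-channel) map by a 3x3 convolution,
    up-sampled to the input resolution by bilinear interpolation, and summed."""
    def __init__(self, in_channels=(64, 256, 512, 1024), out_channels=1):
        super().__init__()
        # one 3x3, stride-1 convolution per feature level
        self.to_map = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels
        )

    def forward(self, feats, out_size):
        # feats: list of multi-level features; out_size: (H, W) of the input image
        maps = [F.interpolate(conv(f), size=out_size, mode="bilinear", align_corners=False)
                for conv, f in zip(self.to_map, feats)]
        return torch.stack(maps, dim=0).sum(dim=0)   # e.g. I_pse-T or I_pse-RGB
```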
In the bidirectional modal difference reduction sub-network, the total modal difference $L_{MD}$ is the sum of the differences between the true and pseudo thermal infrared multi-level features and the differences between the true and pseudo RGB multi-level features, which can be expressed as:
$$L_{MD}=\sum_{i=1}^{5}L_{1}\big(f_{pse\text{-}T}^{i},f_{T}^{i}\big)+\sum_{i=1}^{5}L_{1}\big(f_{pse\text{-}RGB}^{i},f_{RGB}^{i}\big) \qquad (1)$$
wherein $L_{1}(\cdot,\cdot)$ denotes the mean absolute error.
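Read literally, formula (1) is a sum of mean absolute errors over the five feature levels of both conversion directions; a minimal PyTorch-style sketch of that reading (the function name and the absence of per-level weights are assumptions) is:

```python
import torch.nn.functional as F

def modal_difference_loss(pse_t_feats, true_t_feats, pse_rgb_feats, true_rgb_feats):
    """L_MD of formula (1): sum of L1 (mean absolute error) differences between
    pseudo and true multi-level features of both modalities (five levels each)."""
    loss = 0.0
    for f_pse, f_true in zip(pse_t_feats, true_t_feats):
        loss = loss + F.l1_loss(f_pse, f_true)
    for f_pse, f_true in zip(pse_rgb_feats, true_rgb_feats):
        loss = loss + F.l1_loss(f_pse, f_true)
    return loss
```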
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion of the multi-level RGB features and the thermal infrared features obtained in step (1) through the weighted fusion module to obtain multi-level fusion features, so that feature channels with strong discriminative ability are better selected from the multi-modal features;
Four levels of RGB features with different resolutions and the corresponding thermal infrared features are obtained in step (1), and the module fuses the RGB features and the thermal infrared features of each level, so that four levels of fusion features are obtained in total. Meanwhile, ResNet-50 is also used to extract features from the fused stream: specifically, the fusion module first fuses the first-level RGB and thermal infrared features; the resulting fusion feature is down-sampled by a ResNet-50 residual block and then added to the second-level fusion feature obtained with the fusion module; the subsequent levels are processed in the same way.
As shown in FIG. 3, the channel-by-channel weighted fusion module takes as input the RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ with reduced modal differences obtained in step 1 and the corresponding thermal infrared features $\{F_{T}^{i}\}_{i=1}^{4}$; the last-level features $F_{RGB}^{5}$ and $F_{T}^{5}$ are discarded to save network computation. The multi-modal features of the corresponding level are concatenated, and the corresponding weight vectors are predicted by four convolution block operations, each comprising a 3×3 convolution with stride 1 and parameters $\theta_{n}^{w_1}$, $C(*;\theta_{n}^{w_1})$, and a 1×1 convolution with stride 1 and parameters $\theta_{n}^{w_2}$, $C(*;\theta_{n}^{w_2})$. These compute the relative importance of paired features from different modalities in the same channel, i.e. the weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the RGB modality and the weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding multi-level thermal infrared modality, which can be expressed as:
$$W_{n}=\sigma\Big(\mathrm{GAP}\big(C\big(C\big(\mathrm{Cat}(F_{RGB}^{n},F_{T}^{n});\theta_{n}^{w_1}\big);\theta_{n}^{w_2}\big)\big)\Big),\quad n=1,\ldots,4 \qquad (2)$$
wherein:
GAP(*) represents the global average pooling operation;
Cat(*) denotes the concatenation operation;
σ represents the sigmoid activation function.
Finally, cross-modal information fusion is realized by weighted summation, obtaining the multi-level fusion features $\{F_{fuse}^{n}\}_{n=1}^{4}$, which can be expressed as:
$$F_{fuse}^{n}=W_{n}\odot F_{RGB}^{n}+(1-W_{n})\odot F_{T}^{n},\quad n=1,\ldots,4 \qquad (3)$$
wherein:
⊙ represents the channel-by-channel multiplication operation;
1 represents an all-ones vector of the same size as $W_n$.
The larger the value of $W_n$ obtained from formula (2), the more important the corresponding channel of the RGB modality is compared with that of the thermal infrared modality, and vice versa. When the values in the two weight vectors $W_n$ and $1-W_n$ are all 0.5, this reduces to the special case of equal-weight fusion; when the values of $W_n$ are all 0 or all 1, this reduces to the special case of using only thermal infrared or only RGB single-modality information.
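A minimal PyTorch-style sketch of this fusion module is given below; the placement of the global average pooling and the sigmoid relative to the two convolutions is an assumption, and the class name is illustrative rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Sketch of the channel weighted fusion (CWF) module: the paired RGB and
    thermal features are concatenated, a 3x3 and a 1x1 convolution predict a
    per-channel response, and global average pooling followed by a sigmoid
    yields the RGB weight vector W_n (the thermal weight is 1 - W_n)."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_t):
        x = torch.cat([f_rgb, f_t], dim=1)                    # Cat(*)
        x = self.conv1(self.conv3(x))                         # convolution block
        w = torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))   # GAP + sigmoid -> W_n
        return w * f_rgb + (1.0 - w) * f_t                    # formula (3)
```

For example, `ChannelWeightedFusion(256)` would fuse a pair of 256-channel RGB and thermal features into one 256-channel fusion feature.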
(3) Constructing a multi-scale space and channel context module, and mining multi-scale context information and long-term dependence on space and channel dimensions of the multi-scale context information:
as shown in fig. 4 and fig. 5, the multi-level fusion features obtained in step (2) are first taken; the spatial correlation matrix and the channel correlation matrix are then obtained by calculation and applied to the multi-scale features, establishing the relation between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions;
existing methods cannot fully utilize context information and therefore have difficulty coping with the target diversity problem in the semantic segmentation task, so the method uses a dilated convolution pyramid structure to extract multi-scale context information and establishes long-term dependencies of the multi-scale features in the spatial and channel dimensions in order to mine richer context information. In addition, to alleviate the information loss in this process, the invention also establishes long-term dependencies of the original input features in the spatial and channel dimensions and fuses them as supplementary information into the multi-scale features to ensure the integrity of the context information.
Specifically, the multi-scale spatial context module constructed by the invention is shown in fig. 4; its input is the fusion feature $F_{fuse}^{3}$ obtained in step 2. The module includes a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix.
The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, denoted $C(*;\theta_1)$, $C(*;\theta_2)$, $C(*;\theta_3)$, $C(*;\theta_4)$, followed respectively by a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_1^{d}$, $D(*;\theta_1^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_2^{d}$, $D(*;\theta_2^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_3^{d}$, $D(*;\theta_3^{d})$; and a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_4^{d}$, $D(*;\theta_4^{d})$. The four paths respectively yield features $d_1$, $d_2$, $d_3$, $d_4$ of different scales, each with 256 channels. After the four features are concatenated (1024 channels), a 1×1 convolution with stride 1 and parameter $\theta_5$, $C(*;\theta_5)$, gives a feature $F_{ms}$ containing rich multi-scale context information, with the same number of channels as the input $F_{fuse}^{3}$ (512 channels), which can be expressed as:
$$F_{ms}=C\big(\mathrm{Cat}(d_{1},d_{2},d_{3},d_{4});\theta_{5}\big) \qquad (4)$$
The obtained multi-scale feature $F_{ms}\in\mathbb{R}^{H\times W\times C}$ is size-transformed to $\mathbb{R}^{HW\times C}$ and multiplied with its own transpose matrix, characterizing the correlation between any two spatial positions of the multi-scale feature $F_{ms}$; the resulting auto-spatial correlation matrix $M_{ss}\in\mathbb{R}^{HW\times HW}$ can be expressed as:
$$M_{ss}=\mathrm{Reshape}(F_{ms})\otimes\mathrm{Reshape}(F_{ms})^{T} \qquad (5)$$
wherein:
$\otimes$ represents the matrix multiplication operation;
$(*)^{T}$ represents the matrix transpose operation;
$\mathrm{Reshape}(*)$ represents the size transformation operation that changes the matrix dimension from $\mathbb{R}^{H\times W\times C}$ to $\mathbb{R}^{HW\times C}$.
The original input feature $F_{fuse}^{3}$ is processed in the same manner to obtain the cross-spatial correlation matrix $M_{cs}\in\mathbb{R}^{HW\times HW}$, which computes the correlation between any two spatial positions of the original input feature and serves as supplementary information to ensure the integrity of the context information; the cross-spatial correlation matrix $M_{cs}$ can be expressed as:
$$M_{cs}=\mathrm{Reshape}(F_{fuse}^{3})\otimes\mathrm{Reshape}(F_{fuse}^{3})^{T} \qquad (6)$$
The auto-spatial correlation matrix $M_{ss}$ and the cross-spatial correlation matrix $M_{cs}$ are summed element by element and then normalized to obtain the total spatial correlation matrix $M_{s}\in\mathbb{R}^{HW\times HW}$, as in formula (7). It is then multiplied with the multi-scale feature $F_{ms}$ and a skip connection path is added, finally giving the feature $F_{MSC}$ containing rich multi-scale context information and its spatial long-term dependencies, as in formula (8).
$$M_{s}=\mathrm{Normalization}(M_{ss}+M_{cs}) \qquad (7)$$
$$F_{MSC}=\mathrm{Reshape}'\big(M_{s}\otimes\mathrm{Reshape}(F_{ms})\big)+F_{ms} \qquad (8)$$
wherein:
$\mathrm{Normalization}(*)$ represents Min-Max normalization;
$\mathrm{Reshape}'(*)$ denotes the inverse operation of $\mathrm{Reshape}$.
The multi-scale channel context module constructed by the invention is shown in fig. 5; its input is the fusion feature $F_{fuse}^{4}$ obtained in step 2, and it comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix.
The dilated convolution pyramid structure comprises four 1×1 convolutions with stride 1 and parameters $\theta_6$, $\theta_7$, $\theta_8$, $\theta_9$, denoted $C(*;\theta_6)$, $C(*;\theta_7)$, $C(*;\theta_8)$, $C(*;\theta_9)$, followed respectively by a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters $\theta_5^{d}$, $D(*;\theta_5^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters $\theta_6^{d}$, $D(*;\theta_6^{d})$; a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters $\theta_7^{d}$, $D(*;\theta_7^{d})$; and a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters $\theta_8^{d}$, $D(*;\theta_8^{d})$. The four paths respectively yield features $d_5$, $d_6$, $d_7$, $d_8$ of different scales, each with 512 channels. After the four features are concatenated (2048 channels), a 1×1 convolution with stride 1 and parameter $\theta_{10}$, $C(*;\theta_{10})$, gives a feature $F_{mc}$ containing rich multi-scale context information, with the same number of channels as the input $F_{fuse}^{4}$ (1024 channels), which can be expressed as:
$$F_{mc}=C\big(\mathrm{Cat}(d_{5},d_{6},d_{7},d_{8});\theta_{10}\big) \qquad (9)$$
The obtained multi-scale feature $F_{mc}$ is size-transformed to $\mathbb{R}^{1024\times HW}$ and multiplied with its own transpose matrix, characterizing the correlation between any two channels of the multi-scale feature $F_{mc}$; the resulting auto-channel correlation matrix $M_{sc}\in\mathbb{R}^{1024\times 1024}$ can be expressed as:
$$M_{sc}=\mathrm{Reshape}(F_{mc})\otimes\mathrm{Reshape}(F_{mc})^{T} \qquad (10)$$
The original input feature $F_{fuse}^{4}$ is processed in the same manner to obtain the cross-channel correlation matrix $M_{cc}\in\mathbb{R}^{1024\times 1024}$, which computes the correlation between any two channels of the original input feature and serves as supplementary information, further improving the integrity of the context information; it can be expressed as:
$$M_{cc}=\mathrm{Reshape}(F_{fuse}^{4})\otimes\mathrm{Reshape}(F_{fuse}^{4})^{T} \qquad (11)$$
The auto-channel correlation matrix $M_{sc}$ and the cross-channel correlation matrix $M_{cc}$ are added element by element and then normalized to obtain the total channel correlation matrix $M_{c}\in\mathbb{R}^{1024\times 1024}$, as in formula (12). It is then multiplied with the multi-scale feature $F_{mc}$ and a skip connection path is likewise added, giving the feature $F_{MCC}$ containing multi-scale context information and its channel long-term dependencies, as in formula (13).
$$M_{c}=\mathrm{Normalization}(M_{sc}+M_{cc}) \qquad (12)$$
$$F_{MCC}=\mathrm{Reshape}'\big(M_{c}\otimes\mathrm{Reshape}(F_{mc})\big)+F_{mc} \qquad (13)$$
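To make the two context modules concrete, the following PyTorch-style sketch implements the spatial variant end to end and the channel-correlation step separately. It is an illustrative reading under assumptions: the class and function names are invented, and formulas (8) and (13) are interpreted here as applying the normalised correlation matrices by matrix multiplication after reshaping.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialContext(nn.Module):
    """Sketch of the MSC module: a dilated-convolution pyramid (rates 1, 6, 12, 18)
    produces the multi-scale feature F_ms; auto- and cross-spatial correlation
    matrices are summed, min-max normalised and applied to F_ms with a skip path."""
    def __init__(self, in_ch=512, branch_ch=256):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 1),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=r, dilation=r),
            ) for r in (1, 6, 12, 18)
        )
        self.project = nn.Conv2d(4 * branch_ch, in_ch, 1)        # C(*; theta_5)

    @staticmethod
    def _spatial_corr(x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w).transpose(1, 2)                # Reshape: (B, HW, C)
        return torch.bmm(flat, flat.transpose(1, 2))              # (B, HW, HW)

    def forward(self, f_fuse):
        b, c, h, w = f_fuse.shape
        f_ms = self.project(torch.cat([br(f_fuse) for br in self.branches], dim=1))
        m = self._spatial_corr(f_ms) + self._spatial_corr(f_fuse)      # M_ss + M_cs
        m_min = m.amin(dim=(1, 2), keepdim=True)
        m_max = m.amax(dim=(1, 2), keepdim=True)
        m = (m - m_min) / (m_max - m_min + 1e-6)                        # min-max normalisation
        out = torch.bmm(m, f_ms.view(b, c, h * w).transpose(1, 2))      # apply M_s to F_ms
        out = out.transpose(1, 2).reshape(b, c, h, w)                   # Reshape'
        return out + f_ms                                               # skip connection

def channel_context(f_mc, f_fuse):
    """Channel counterpart of the correlation step (formulas (10)-(13)): the
    correlation is taken between channels, giving a (B, C, C) matrix."""
    b, c, h, w = f_mc.shape
    flat_mc = f_mc.view(b, c, h * w)                              # Reshape: (B, C, HW)
    flat_in = f_fuse.view(b, c, h * w)
    m = torch.bmm(flat_mc, flat_mc.transpose(1, 2)) \
        + torch.bmm(flat_in, flat_in.transpose(1, 2))             # M_sc + M_cc
    m_min = m.amin(dim=(1, 2), keepdim=True)
    m_max = m.amax(dim=(1, 2), keepdim=True)
    m = (m - m_min) / (m_max - m_min + 1e-6)
    out = torch.bmm(m, flat_mc).view(b, c, h, w)                  # apply M_c, Reshape'
    return out + f_mc                                             # skip connection
```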
(4) Up-sampling to restore the resolution, and predicting the semantic segmentation mask map for the RGB and thermal infrared image pair:
recovering the feature map obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
the multi-scale characteristics obtained in the step 3 are processed
through a deconvolution with a 2×2 kernel, stride 16 and parameters $\theta_{12}$, $\mathrm{Deconv}(*;\theta_{12})$, which restores the resolution by a factor of 16; a 1×1 convolution with stride 1 and parameter $\theta_{11}$, $C(*;\theta_{11})$, then converts the number of channels of the feature map into the number of classes of the data set, and the semantic segmentation mask map S is computed with a softmax function, which can be expressed as:
$$S=\mathrm{softmax}\big(C\big(\mathrm{Deconv}(F;\theta_{12});\theta_{11}\big)\big) \qquad (14)$$
where F denotes the multi-scale feature obtained in step 3.
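A compact sketch of this prediction head follows; the transposed-convolution kernel size, the intermediate channel count and the class count (9 classes, as in the MFNet data set) are assumptions used only to make the example runnable.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Sketch of step (4): a transposed convolution restores the resolution by a
    factor of 16, a 1x1 convolution maps the channels to the number of classes,
    and softmax gives the per-pixel class probabilities."""
    def __init__(self, in_ch=1024, num_classes=9):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, 256, kernel_size=16, stride=16)
        self.classify = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, feat):
        logits = self.classify(self.deconv(feat))
        return torch.softmax(logits, dim=1)     # semantic segmentation mask map S
```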
(5) Training algorithm network to obtain model parameters
On a training data set, a supervised learning model is adopted for the prediction semantic segmentation mask graph in the step (4) and the pseudo image features generated in the step (1), and algorithm network training is finished end to end through the weighted cross entropy loss function and the weighted average absolute error loss function, so that network model parameters are obtained:
on the training data set, a supervised learning mechanism is adopted to calculate the cross entropy loss function $L_s$ between the semantic segmentation prediction result of the network model and the ground truth:
$$L_{s}(S,G)=-\sum_{i=1}^{m}\sum_{j=1}^{n}w(x_{ij})\,p(x_{ij})\log q(x_{ij}) \qquad (15)$$
where m and n represent the width and height of the input image, $(i,j)$ are the coordinates of a pixel, $p(x_{ij})$ is the ground-truth label of the pixel, $q(x_{ij})$ is the prediction result for the pixel, and $w(x_{ij})$ is the class weight coefficient of the pixel. The class weight coefficient w is used to alleviate the problem of unbalanced class distribution in the data set; the weight coefficient $w_{i}$ of the i-th class can be expressed as:
$$w_{i}=\frac{1}{\ln(c+P_{i})} \qquad (16)$$
wherein c is a constant set to 1.1, and $P_{i}$ represents the proportion of pixels labeled as the i-th class among all pixels.
The calculated cross entropy loss function and the bidirectional modal difference loss $L_{MD}$ of formula (1) together constitute the total loss function $L_{total}$, which can be expressed as:
$$L_{total}=\lambda_{1}L_{s}(S,G)+\lambda_{2}L_{MD} \qquad (17)$$
wherein $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters balancing the losses, S represents the model prediction result, and G represents the ground truth.
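Formulas (15)-(17) can be combined into a short training-loss sketch; the exact form of the class-weight formula and the default values of the balancing hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def class_weights(class_freq, c=1.1):
    """Weight coefficient of formula (16): w_i = 1 / ln(c + P_i), where P_i is
    the proportion of pixels labelled as class i (class_freq sums to 1)."""
    return 1.0 / torch.log(c + class_freq)

def total_loss(logits, target, class_freq, l_md, lambda1=1.0, lambda2=1.0):
    """Formula (17): weighted cross entropy between prediction and ground truth
    plus the bidirectional modal difference loss L_MD."""
    w = class_weights(class_freq)
    l_s = F.cross_entropy(logits, target, weight=w)   # weighted cross entropy, formula (15)
    return lambda1 * l_s + lambda2 * l_md
```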
The method trains the algorithm end to end, and the model parameters are obtained after the whole RGB-T semantic segmentation network is trained. Because the data set used to train the RGB-T semantic segmentation network (the MFNet data set) contains a limited amount of data, data augmentation operations of random flipping, random cropping and noise injection are applied to the RGB-T image pairs in the data set to ensure smooth training of the network and to avoid over-fitting to the training data set.
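A minimal sketch of the paired augmentation described above is given below; the crop size and noise level are assumptions, and the same random flip and crop window are applied to both modalities and the label so that the pair stays registered.

```python
import random
import torch

def augment_pair(rgb, thermal, label, crop=(480, 640), noise_std=0.01):
    """Apply the same random flip, random crop and noise injection to a registered
    RGB-T image pair and its label (tensors shaped C x H x W / H x W)."""
    if random.random() < 0.5:                                   # random horizontal flip
        rgb, thermal, label = rgb.flip(-1), thermal.flip(-1), label.flip(-1)
    h, w = rgb.shape[-2:]
    ch, cw = min(crop[0], h), min(crop[1], w)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    rgb = rgb[..., top:top + ch, left:left + cw]                # random crop (same window)
    thermal = thermal[..., top:top + ch, left:left + cw]
    label = label[..., top:top + ch, left:left + cw]
    rgb = rgb + noise_std * torch.randn_like(rgb)               # noise injection
    thermal = thermal + noise_std * torch.randn_like(thermal)
    return rgb, thermal, label
```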
the technical effects of the invention are further described by combining simulation experiments:
1. Simulation conditions: all simulation experiments are implemented with the PyTorch deep learning framework; the operating system is Ubuntu 16.04.5 and the hardware environment is an Nvidia GeForce GTX 1080 Ti GPU;
2. simulation content and result analysis:
simulation 1
The invention is compared with existing RGB image-based, RGB-D-based and RGB-T-based semantic segmentation methods on the public RGB-T image semantic segmentation data set MFNet, and part of the experimental results are compared visually. To ensure the fairness of the experiments, the RGB image-based semantic segmentation methods are extended into two branches, an RGB branch and a thermal infrared branch, and the prediction results of the two branches are added as the final semantic segmentation mask map; for the RGB-D-based semantic segmentation methods, the input depth image is directly replaced with the thermal infrared image.
Compared with the prior art, the method handles difficult RGB-T image semantic segmentation cases better. Owing to the modal difference reduction and fusion strategy of the invention, the multi-modal complementary information can be better utilized in environments with poor lighting conditions, so that the semantic segmentation results of the targets are closer to the manually annotated ground-truth maps.
Emulation 2
The invention is compared with existing RGB image-based, RGB-D-based and RGB-T-based semantic segmentation methods on the public RGB-T image semantic segmentation data set, and the results are evaluated objectively using the accepted evaluation indexes; the evaluation results are shown in Table 1, wherein:
Acc represents the per-class accuracy;
mAcc represents the class-average accuracy;
IoU represents the per-class intersection-over-union;
mIoU represents the class-average intersection-over-union.
Higher values of these indexes are better. It can be seen from Table 1 that the method segments RGB-T images more accurately, which fully demonstrates its effectiveness and superiority.
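For reference, the four evaluation indexes can be computed from a per-class confusion matrix as in the following sketch (the function name is illustrative):

```python
import numpy as np

def segmentation_metrics(conf):
    """Acc, mAcc, IoU and mIoU from a confusion matrix `conf`, where conf[i, j]
    counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # per-class accuracy
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # per-class IoU
    return acc, acc.mean(), iou, iou.mean()                             # Acc, mAcc, IoU, mIoU
```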
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various modifications may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (3)

1. The RGB-T image semantic segmentation method based on modal difference reduction is characterized by comprising the following steps:
(1) Constructing a bidirectional modal difference reduction sub-network, extracting more discriminative RGB features and thermal infrared features from the input pair of registered RGB and thermal infrared images, and simultaneously constructing a supervised learning model:
the bidirectional modal difference reduction sub-network reduces the modal differences in both directions: it extracts discriminative RGB features and thermal infrared features by reducing the modal differences between each hierarchical feature of a pseudo image generated by an image conversion method and the corresponding hierarchical feature of the corresponding true image; it then extracts each hierarchical feature of the RGB pseudo image and of the thermal infrared pseudo image, takes the corresponding hierarchical features of the true RGB image and of the true thermal infrared image as their supervision, and builds a supervised learning model;
(2) Constructing a self-adaptive channel weighted fusion module, and carrying out channel-by-channel weighted fusion on the multi-level RGB features and the thermal infrared features obtained in the step (1) through the weighted fusion module to obtain multi-level fusion features;
(3) Acquiring the multi-level fusion characteristics obtained in the step (2), obtaining a space correlation matrix and a channel correlation matrix through calculation, acting the space correlation matrix and the channel correlation matrix on the multi-scale characteristics, constructing a multi-scale space context module and a multi-scale channel context module, and establishing a relation between multi-scale context information and long-term dependence on space and channel dimensions thereof;
(4) Restoring the spatial correlation matrix and the channel correlation matrix obtained in the step (3) to full resolution through deconvolution operation, and predicting a semantic segmentation mask map after pixel-by-pixel classification calculation through channel transformation operation and softmax function;
(5) Training the algorithm network to obtain model parameters:
on a training data set, a supervised learning model is adopted for the prediction semantic segmentation mask graph of the step (4) and the pseudo image features generated in the step (1), and algorithm network training is finished end to end through a weighted cross entropy loss function and a weighted average absolute error loss function, so that network model parameters are obtained, wherein:
in step (1), the sub-network simultaneously reduces the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}T}^{i}\}_{i=1}^{5}$ of the generated pseudo thermal infrared image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{T}^{i}\}_{i=1}^{5}$ of the corresponding true thermal infrared image, extracted by another ResNet-18 network, as well as the differences between the five hierarchical features of different resolutions $\{f_{pse\text{-}RGB}^{i}\}_{i=1}^{5}$ of the generated pseudo RGB image, extracted by a ResNet-18 network, and the five hierarchical features of different resolutions $\{f_{RGB}^{i}\}_{i=1}^{5}$ of the corresponding true RGB image, extracted by another ResNet-18 network;
to obtain the more discriminative RGB multi-level features $\{F_{RGB}^{i}\}_{i=1}^{5}$ extracted by a ResNet-50 network and the corresponding thermal infrared multi-level features $\{F_{T}^{i}\}_{i=1}^{5}$ extracted by another ResNet-50 network, $\{f_{T}^{i}\}$ is used to supervise $\{f_{pse\text{-}T}^{i}\}$, and $\{f_{RGB}^{i}\}$ is used to supervise $\{f_{pse\text{-}RGB}^{i}\}$;
the self-adaptive channel weighted fusion module in step (2) takes the first four levels of RGB features $\{F_{RGB}^{i}\}_{i=1}^{4}$ of the RGB image obtained in step (1) and the first four levels of features $\{F_{T}^{i}\}_{i=1}^{4}$ of the corresponding thermal infrared image as input, adaptively generates the RGB weight vectors $W_1$, $W_2$, $W_3$, $W_4$ of the corresponding levels and the thermal infrared weight vectors $1-W_1$, $1-W_2$, $1-W_3$, $1-W_4$ of the corresponding levels, and finally realizes cross-modal information fusion by weighted summation to obtain the multi-level fusion features $\{F_{fuse}^{i}\}_{i=1}^{4}$;
the inputs of the multi-scale spatial context module and the multi-scale channel context module in step (3) are the fusion features $F_{fuse}^{3}$ and $F_{fuse}^{4}$, respectively, so as to establish the interaction between the multi-scale context information and its long-term dependencies in the spatial and channel dimensions, wherein:
(31) The multi-scale spatial context module comprises a dilated convolution pyramid structure, an auto-spatial correlation matrix and a cross-spatial correlation matrix;
(32) The multi-scale channel context module comprises a dilated convolution pyramid structure, an auto-channel correlation matrix and a cross-channel correlation matrix;
step (31) includes:
(311) The dilated convolution pyramid structure comprises four parallel paths; each path first applies a 1×1 convolution with stride 1, with parameters θ1, θ2, θ3, θ4 respectively, denoted C(*; θ1), C(*; θ2), C(*; θ3), C(*; θ4), followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters θd1, denoted D(*; θd1);
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters θd2, denoted D(*; θd2);
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters θd3, denoted D(*; θd3);
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters θd4, denoted D(*; θd4);
the four paths respectively yield features d1, d2, d3, d4 of different scales, each with 256 channels; a 1×1 convolution with stride 1 and parameters θ5, C(*; θ5), is then applied to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input feature (a minimal pyramid sketch follows this step);
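A minimal sketch of the four-path dilated convolution pyramid of step (311), assuming PyTorch and assuming that the four branch outputs are concatenated before the final 1×1 convolution C(*; θ5); the claim itself only states that this convolution restores the input channel number.

```python
import torch
import torch.nn as nn

class DilatedConvPyramid(nn.Module):
    """Four parallel paths: 1x1 convolution followed by a 3x3 dilated convolution
    with dilation rates 1, 6, 12, 18; outputs concatenated (assumption) and
    projected back to the input channel number with a final 1x1 convolution."""
    def __init__(self, in_channels, branch_channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=1, stride=1),
                nn.Conv2d(branch_channels, branch_channels, kernel_size=3,
                          stride=1, padding=rate, dilation=rate),
            )
            for rate in (1, 6, 12, 18)
        ])
        self.project = nn.Conv2d(4 * branch_channels, in_channels, kernel_size=1)

    def forward(self, x):
        d = [branch(x) for branch in self.branches]   # d1, d2, d3, d4
        return self.project(torch.cat(d, dim=1))      # rich multi-scale context feature
```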
(312) The multi-scale feature obtained in step (311) is reshaped and matrix-multiplied with its own transpose to obtain the spatial autocorrelation matrix M_ss ∈ R^(HW×HW);
(313) The original input feature is reshaped and matrix-multiplied with the transpose of the reshaped multi-scale feature to obtain the cross-spatial correlation matrix M_cs ∈ R^(HW×HW), which serves as an information supplement;
(314) The spatial autocorrelation matrix M_ss and the cross-spatial correlation matrix M_cs are added element by element and normalized to obtain the total spatial correlation matrix M_s ∈ R^(HW×HW); this matrix is then multiplied element by element with the multi-scale feature, and a skip connection path is added to obtain a feature containing multi-scale context information and its spatial long-range dependencies (a minimal sketch of steps (312)-(314) follows);
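A minimal sketch of steps (312)-(314), assuming PyTorch. Two points are assumptions rather than claim text: the normalization is implemented as a softmax, and the total spatial correlation matrix is applied to the reshaped multi-scale feature by matrix multiplication, as in standard non-local attention.

```python
import torch
import torch.nn.functional as F

def spatial_correlation(feat_ms, feat_in):
    """Spatial self- and cross-correlation (steps (312)-(314)).
    feat_ms: multi-scale feature from the pyramid, (B, C, H, W);
    feat_in: original input feature, same shape."""
    b, c, h, w = feat_ms.shape
    q = feat_ms.view(b, c, h * w)                    # (B, C, HW)
    p = feat_in.view(b, c, h * w)                    # (B, C, HW)

    m_ss = torch.bmm(q.transpose(1, 2), q)           # (B, HW, HW) spatial autocorrelation
    m_cs = torch.bmm(p.transpose(1, 2), q)           # (B, HW, HW) cross-spatial correlation
    m_s = F.softmax(m_ss + m_cs, dim=-1)             # normalization (softmax assumed)

    out = torch.bmm(q, m_s.transpose(1, 2))          # apply M_s (matrix product assumed)
    return out.view(b, c, h, w) + feat_ms            # skip connection
```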
The step (32) includes:
(321) The dilated convolution pyramid structure comprises four parallel paths; each path first applies a 1×1 convolution with stride 1, with parameters θ6, θ7, θ8, θ9 respectively, denoted C(*; θ6), C(*; θ7), C(*; θ8), C(*; θ9), followed respectively by:
a 3×3 dilated convolution with stride 1, dilation rate 1 and parameters θd5, denoted D(*; θd5);
a 3×3 dilated convolution with stride 1, dilation rate 6 and parameters θd6, denoted D(*; θd6);
a 3×3 dilated convolution with stride 1, dilation rate 12 and parameters θd7, denoted D(*; θd7);
a 3×3 dilated convolution with stride 1, dilation rate 18 and parameters θd8, denoted D(*; θd8);
the four paths respectively yield features d5, d6, d7, d8 of different scales, each with 512 channels; a 1×1 convolution with stride 1 and parameters θ10, C(*; θ10), is then applied to obtain a feature containing rich multi-scale context information whose channel number is the same as that of the input feature;
(322) The multi-scale feature obtained in step (321) is reshaped and matrix-multiplied with its own transpose to obtain the channel autocorrelation matrix M_sc ∈ R^(1024×1024);
(323) The original input feature is reshaped and matrix-multiplied with the transpose of the reshaped multi-scale feature to obtain the cross-channel correlation matrix M_cc ∈ R^(1024×1024), which serves as an information supplement;
(324) The channel autocorrelation matrix M_sc and the cross-channel correlation matrix M_cc are added element by element and normalized to obtain the total channel correlation matrix M_c ∈ R^(1024×1024); this matrix is then multiplied element by element with the multi-scale feature, and a skip connection path is added to obtain a feature containing multi-scale context information and its channel long-range dependencies (a minimal sketch of steps (322)-(324) follows).
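A minimal sketch of steps (322)-(324), mirroring the spatial version but with C×C correlation matrices over channels (1024×1024 in the claim); the softmax normalization and the matrix-multiplication application are again assumptions.

```python
import torch
import torch.nn.functional as F

def channel_correlation(feat_ms, feat_in):
    """Channel self- and cross-correlation (steps (322)-(324))."""
    b, c, h, w = feat_ms.shape
    q = feat_ms.view(b, c, h * w)                    # (B, C, HW)
    p = feat_in.view(b, c, h * w)

    m_sc = torch.bmm(q, q.transpose(1, 2))           # (B, C, C) channel autocorrelation
    m_cc = torch.bmm(p, q.transpose(1, 2))           # (B, C, C) cross-channel correlation
    m_c = F.softmax(m_sc + m_cc, dim=-1)             # normalization (softmax assumed)

    out = torch.bmm(m_c, q)                          # re-weight channels (assumed)
    return out.view(b, c, h, w) + feat_ms            # skip connection
```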
2. The RGB-T image semantic segmentation method based on modal difference reduction as set forth in claim 1, wherein the bi-modal difference reduction sub-network in step (1) comprises two parts, one from the RGB modality to the thermal infrared modality and one from the thermal infrared modality to the RGB modality, both of which employ a structurally identical "encoder-decoder-encoder" network, wherein the encoders use a ResNet-50 network and a ResNet-18 network, the decoder uses an image generation network, and the pseudo images are generated through an upsampling strategy based on bilinear interpolation (a minimal decoder sketch follows).
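A minimal sketch of the image-generation decoder of the "encoder-decoder-encoder" branch, with bilinear-interpolation upsampling as stated in claim 2; the number of stages, the channel widths, the input channel count and the tanh output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGenerationDecoder(nn.Module):
    """Decoder of the 'encoder-decoder-encoder' translation branch: each stage
    upsamples by bilinear interpolation and refines with a 3x3 convolution,
    ending with a pseudo image (3 channels for pseudo RGB; 1 would suffice for
    pseudo thermal)."""
    def __init__(self, in_channels=2048, widths=(512, 128, 32), out_channels=3):
        super().__init__()
        stages, c = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(c, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            c = w
        self.stages = nn.ModuleList(stages)
        self.to_image = nn.Conv2d(c, out_channels, kernel_size=1)

    def forward(self, feat):
        x = feat
        for stage in self.stages:
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            x = stage(x)
        return torch.tanh(self.to_image(x))    # generated pseudo image
```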
3. The method of claim 1, wherein in step (4), a deconvolution operation is used to upsample the feature map and restore its resolution, a convolution with a 1×1 kernel, stride 1 and parameters θ11, C(*; θ11), is used to change the channel number to the number of classes in the dataset, and finally a softmax function predicts the class to which each pixel belongs, yielding the semantic segmentation mask map (a minimal prediction-head sketch follows).
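A minimal sketch of the prediction head of claim 3 in PyTorch: deconvolution (transposed convolution) stages restore resolution, a 1×1 convolution C(*; θ11) maps the channels to the number of classes, and a softmax yields per-pixel class probabilities; the number and stride of the deconvolution stages are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Claim 3 prediction head: deconvolution stages restore resolution, a 1x1
    convolution C(*; theta11) maps channels to the number of classes, and a
    softmax predicts the class of every pixel."""
    def __init__(self, in_channels, num_classes, up_stages=4):
        super().__init__()
        ups, c = [], in_channels
        for _ in range(up_stages):
            ups += [nn.ConvTranspose2d(c, c // 2, kernel_size=2, stride=2),
                    nn.ReLU(inplace=True)]
            c = c // 2
        self.upsample = nn.Sequential(*ups)
        self.classifier = nn.Conv2d(c, num_classes, kernel_size=1, stride=1)

    def forward(self, x):
        logits = self.classifier(self.upsample(x))
        return torch.softmax(logits, dim=1)     # per-pixel class probabilities
```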
CN202110187778.8A 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction Active CN112991350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110187778.8A CN112991350B (en) 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction

Publications (2)

Publication Number Publication Date
CN112991350A CN112991350A (en) 2021-06-18
CN112991350B true CN112991350B (en) 2023-06-27

Family

ID=76393651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110187778.8A Active CN112991350B (en) 2021-02-18 2021-02-18 RGB-T image semantic segmentation method based on modal difference reduction

Country Status (1)

Country Link
CN (1) CN112991350B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362349A (en) * 2021-07-21 2021-09-07 浙江科技学院 Road scene image semantic segmentation method based on multi-supervision network
CN113591685B (en) * 2021-07-29 2023-10-27 武汉理工大学 Geographic object spatial relationship identification method and system based on multi-scale pooling
CN114330279B (en) * 2021-12-29 2023-04-18 电子科技大学 Cross-modal semantic consistency recovery method
CN114708568B (en) * 2022-06-07 2022-10-04 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115240042B (en) * 2022-07-05 2023-05-16 抖音视界有限公司 Multi-mode image recognition method and device, readable medium and electronic equipment


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784654B (en) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 Image segmentation method and device and full convolution network system
US10956787B2 (en) * 2018-05-14 2021-03-23 Quantum-Si Incorporated Systems and methods for unifying statistical models for different data modalities

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020151536A1 (en) * 2019-01-25 2020-07-30 腾讯科技(深圳)有限公司 Brain image segmentation method, apparatus, network device and storage medium
CN110969634A (en) * 2019-11-29 2020-04-07 国网湖北省电力有限公司检修公司 Infrared image power equipment segmentation method based on generation countermeasure network
CN111666977A (en) * 2020-05-09 2020-09-15 西安电子科技大学 Shadow detection method of monochrome image
CN111462128A (en) * 2020-05-28 2020-07-28 南京大学 Pixel-level image segmentation system and method based on multi-modal spectral image
CN112101410A (en) * 2020-08-05 2020-12-18 中国科学院空天信息创新研究院 Image pixel semantic segmentation method and system based on multi-modal feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Revisiting Feature Fusion for RGB-T Salient Object Detection; Qiang Zhang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2020-08-06; vol. 2020; full text *
Research and Prospect of Cross-Modal Person Re-Identification (跨模态行人重识别研究与展望); Chen Dan et al.; Computer Systems & Applications (计算机系统应用); 2020-10-31; vol. 29, no. 10; full text *

Also Published As

Publication number Publication date
CN112991350A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
Huang et al. DSNet: Joint semantic learning for object detection in inclement weather conditions
Guo et al. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN111612008B (en) Image segmentation method based on convolution network
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
Huang et al. Multi-level cross-modal interaction network for RGB-D salient object detection
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
Gong et al. Global contextually guided lightweight network for RGB-thermal urban scene understanding
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112785526A (en) Three-dimensional point cloud repairing method for graphic processing
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
Zeng et al. Dual Swin-transformer based mutual interactive network for RGB-D salient object detection
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
Shen et al. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Ogura et al. Improving the visibility of nighttime images for pedestrian recognition using in‐vehicle camera
CN113780305B (en) Significance target detection method based on interaction of two clues
CN117036658A (en) Image processing method and related equipment
CN114494699A (en) Image semantic segmentation method and system based on semantic propagation and foreground and background perception
Zhou et al. GAF-Net: Geometric Contextual Feature Aggregation and Adaptive Fusion for Large-Scale Point Cloud Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant