CN114898106A - RGB-T multi-source image data-based saliency target detection method - Google Patents


Info

Publication number
CN114898106A
Authority
CN
China
Prior art keywords
features
attention
feature
image
network
Prior art date
Legal status
Pending
Application number
CN202210581903.8A
Other languages
Chinese (zh)
Inventor
吴慧欣
安丽鑫
姜维
王喆
陈继坤
刘孟轩
李琳
张慢丽
李文静
Current Assignee
North China University of Water Resources and Electric Power
Original Assignee
North China University of Water Resources and Electric Power
Priority date
Filing date
Publication date
Application filed by North China University of Water Resources and Electric Power filed Critical North China University of Water Resources and Electric Power
Priority to CN202210581903.8A
Publication of CN114898106A
Pending legal status (Current)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a saliency target detection method based on RGB-T multi-source image data. Attention-based feature fusion weights the important components of the primary features extracted by VGG-DCNet to obtain attention feature maps for the visible light and infrared images, and the multi-layer features of each stage are then combined and propagated backwards. A multi-scale pooling method provides global prior information that guides the up-sampling process; during forward propagation, pooling operations with different sampling rates are applied to the different scale spaces at each stage to extract rich local information, which is propagated forward under the guidance of the global prior to produce the final saliency prediction map. The method shows good salient object detection capability, with clear advantages in complex scenes such as insufficient illumination, objects crossing image boundaries, and center offset.

Description

RGB-T multi-source image data-based saliency target detection method
Technical Field
The invention relates to the technical field of image processing, in particular to a saliency target detection method based on RGB-T multi-source image data.
Background
Salient object detection aims to find the region of greatest interest in an image; it must both locate the object accurately and separate it from the background, so it combines the tasks of object localization and object segmentation, fusing the two into a single end-to-end detection process. Early salient object detection relied on heuristic local or global cues, and such hand-crafted features limit detection in complex scenes. With the development of deep learning, models based on convolutional neural networks can obtain local and global information quickly and efficiently and refine features progressively, which has steadily improved the accuracy of salient object detection.
In reality, objects exhibit large intra-class variation and small inter-class difference, and the same object can carry different semantics in different situations, so neither the visible light image nor the infrared image alone can identify the object accurately. A visible light image generally preserves rich detail and texture information, but lighting, camouflage, smoke and similar factors can make the target indistinguishable from the background; an infrared image, owing to its special imaging mechanism, is not affected by these factors and can display a target as long as a temperature difference exists between the target and its surroundings. Models for salient object detection based on RGB-T multi-source images have been proposed successively, such as the ADFNet network [Tu Z, Ma Y, Li Z, et al. RGBT salient object detection: A large-scale dataset and benchmark [J]. arXiv preprint arXiv:2007.03262, 2020], a multi-task ranking algorithm [Wang G, Li C, Ma Y, et al. RGB-T saliency detection benchmark: Dataset, baselines, analysis and a novel approach [C]// Chinese Conference on Image and Graphics Technologies. Springer, Singapore, 2018], and an RGB-T salient object detection method [J]. IEEE Transactions on Multimedia, 2019, 22(1): 160-.
These methods perform well, but they typically extract features with the fixed kernel size of conventional convolution, whereas real objects have varying curved outlines rather than fixed rectangular shapes. In addition, the actual receptive field of deep layers in the network is small, so part of the global information is lost, and local information is increasingly ignored as the network deepens.
Disclosure of Invention
The invention provides a saliency target detection method based on RGB-T multi-source image data, addressing the problems that existing salient object detection methods usually extract features with the fixed kernel size of conventional convolution although real objects have varying curved outlines rather than fixed rectangular shapes, that part of the global information is lost because the actual receptive field of deep layers in the network is small, and that local information is ignored as the network deepens.
In order to achieve the purpose, the invention adopts the following technical scheme:
a salient object detection method based on RGB-T multi-source image data comprises the following steps:
step 1: on the basis of a conventional dual-channel VGG-16 network architecture, part of the convolution layers in VGG-16 are replaced with deformable convolutions and the final fully connected layers are removed to form the deformable-convolution-based VGG-DCNet network; a visible light image and a thermal infrared image are used as the dual-channel input of the VGG-DCNet network, which extracts the primary features of the visible light image and the thermal infrared image;
step 2: the extracted primary features of the visible light image and the thermal infrared image are input to the attention feature fusion module; after the normalized attention mechanism, the attention feature maps of the visible light image and the thermal infrared image are obtained, and the attention feature maps of each layer of the visible light and infrared images are fused pairwise to obtain the fused attention feature maps;
step 3: the global semantic information obtained by applying a multi-layer pyramid pooling operation to the deepest attention feature is fused into the process of extracting the local features of the visible light and infrared images, so that the global multi-scale features and the local multi-layer features of the visible light and infrared images are fused in the fused global-local feature module, and the final saliency prediction map is output.
Further, the step 1 comprises:
the three-layer convolution of the last stage in VGG-16 is replaced with a deformable convolution.
Further, the attention feature fusion module enhances beneficial features and suppresses irrelevant features under the normalized attention mechanism (NAM), obtains the attention feature maps, and performs feature-level fusion on the feature maps containing attention information obtained at the intermediate levels of the network.
Further, in the attention feature fusion module, fusion of the attention feature maps of each layer of the visible light image and the infrared image is performed as follows:
A_i = N_R^i ⊕ N_T^i (i = 1);  A_i = N_R^i ⊕ N_T^i ⊕ A_{i-1} (i = 2, 3, 4, 5)
where N_R^i denotes the attention feature of the i-th stage visible light image, N_T^i denotes the attention feature of the i-th stage infrared image, A_i denotes the fused attention feature of the i-th stage, and ⊕ denotes element-level addition.
Further, the global semantic information is acquired as follows:
by using the pyramid pooling method, the pooling operation of four subbranches is adopted to obtain feature maps with different scales, which comprises the following steps:
1) pooling the input feature map at four scales to obtain outputs P_i, i = 1, 2, 3, 4, of four scales, where the first layer is global average pooling and the other three layers are average pooling operations; each output differs in size but has the same channel dimension;
2) reducing the channel dimension of the pooled features: a 1 × 1 convolution reduces the number of channels to 1/N of the original features, where N is the number of pooling layers;
3) up-sampling with bilinear interpolation so that the four feature levels match the size of the original feature map, and finally concatenating the four feature maps along the channel dimension.
Further, local features of the visible light image and the infrared image are extracted as follows:
the attention features fused from top to bottom of the VGG-DCNet network in each stage are averaged and pooled at different down-sampling rates to obtain features of different scale spaces, the features are subjected to convolution operation, then up-sampling is carried out to restore to the original scale, fusion is carried out, and finally a3 x 3 convolution is carried out to obtain local feature maps containing information of different scales.
Compared with the prior art, the invention has the following beneficial effects:
(1) To adaptively extract the features of irregular targets, the invention takes VGG-16 as a basis and introduces deformable convolution to adjust the network structure, replacing part of the convolution layers to obtain a VGG-16 network with deformable convolution, namely VGG-DCNet; by adding offsets to conventional convolution, an irregular convolution operation is obtained and complete target features are captured.
(2) To fully exploit the complementarity of multi-source image features, the invention constructs an attention module that applies a normalized attention mechanism to extract the features of the two modalities and performs feature-level fusion on the attention features.
(3) To make full use of global semantic information, a fused global-local feature module is constructed that combines global features such as texture and structure with weakly correlated local features, making full use of shallow positional information and high-level semantic information, thereby suppressing redundant information and retaining beneficial features.
(4) Experimental results show that the method has good salient object detection capability, with clear advantages in complex scenes such as insufficient illumination, objects crossing image boundaries, and center offset.
Drawings
FIG. 1 is a basic flowchart of a method for detecting a salient object based on RGB-T multi-source image data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network architecture constructed in an embodiment of the present invention;
FIG. 3 is a schematic diagram of feature level fusion based on a standardized attention mechanism according to an embodiment of the present invention;
FIG. 4 is a block diagram of an exemplary embodiment of an attention feature fusion module;
FIG. 5 is a schematic diagram of pyramid pooling to obtain global information according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a multi-scale feature aggregation module for obtaining local features according to an embodiment of the present invention;
FIG. 7 illustrates global-local information flow directions according to an embodiment of the present invention;
FIG. 8 is a representation of various algorithms on a VT821, VT1000, VT5000 data set according to an embodiment of the present invention;
FIG. 9 is a qualitative analysis of an ablation experiment according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
in order to overcome the limitation of environmental factors such as illumination, weather and the like on a single visible light image and enhance the effect of detecting a significant target, the invention introduces a thermal infrared image as an auxiliary, utilizes the all-weather imaging of the thermal infrared image, is not easily influenced by the environment, and extracts the characteristic that the visible light image cannot be obtained according to the characteristic of the self thermal radiation imaging of an object. In this embodiment, on the basis of a conventional two-channel network architecture, a proper amount of Deformable convolution (Deformable Conv) is introduced to replace a conventional convolutional layer optimization backbone network, and the characteristics of a deformed object are extracted in an offset adaptive manner. In order to further improve the detection performance, the embodiment utilizes an attention mechanism to acquire attention features of different modalities and perform feature level fusion on the attention features. The larger the receptive field is, the higher-level global features can be abstracted, along with the deepening of the network, the actual receptive field is smaller than the theoretical receptive field, and in order to obtain more comprehensive features, the invention utilizes global semantic information to guide the extraction and fusion of global and local multi-scale features. Finally, a comparative test is carried out on the public data set to verify the effectiveness of the algorithm. As shown in fig. 1, a method for detecting a salient object based on RGB-T multi-source image data (abbreviated as DAGLNet) includes:
step 1: on the basis of a conventional dual-channel VGG-16 network architecture, part of the convolution layers in VGG-16 are replaced with deformable convolutions and the final fully connected layers are removed to form the deformable-convolution-based VGG-DCNet (VGG-16 + Deformable Conv) network; a visible light image and a thermal infrared image are used as the dual-channel input of the VGG-DCNet network, which extracts the primary features of the visible light image and the thermal infrared image;
step 2: the extracted primary features of the visible light image and the thermal infrared image are input to the attention feature fusion module (Attention Block, AB); after the normalized attention mechanism (NAM), the attention feature maps of the visible light image and the thermal infrared image are obtained, and the attention feature maps of each layer of the visible light and infrared images are then fused pairwise to obtain the fused attention feature maps;
step 3: the global semantic information obtained by applying a multi-layer pyramid pooling operation to the deepest attention feature is fused into the process of extracting the local features of the visible light and infrared images, so that the global multi-scale features and the local multi-level features of the visible light and infrared images are fused in the Fusion Global-Local Block (FGLB), and the final saliency prediction map is output.
Further, the step 1 comprises:
the three-layer convolution of the last stage in VGG-16 is replaced with a deformable convolution.
Further, the attention feature fusion module enhances beneficial features and suppresses irrelevant features under the normalized attention mechanism (NAM), obtains the attention feature maps, and performs feature-level fusion on the feature maps containing attention information obtained at the intermediate levels of the network.
Further, in the attention feature fusion module, fusion of the attention feature maps of each layer of the visible light image and the infrared image is performed as follows:
A_i = N_R^i ⊕ N_T^i (i = 1);  A_i = N_R^i ⊕ N_T^i ⊕ A_{i-1} (i = 2, 3, 4, 5)
where N_R^i denotes the attention feature of the i-th stage visible light image, N_T^i denotes the attention feature of the i-th stage infrared image, A_i denotes the fused attention feature of the i-th stage, and ⊕ denotes element-level addition.
Further, the global semantic information is acquired as follows:
by using the pyramid pooling method, the pooling operation of four subbranches is adopted to obtain feature maps with different scales, which comprises the following steps:
1) pooling the input feature map at four scales to obtain outputs P_i, i = 1, 2, 3, 4, of four scales, where the first layer is global average pooling and the other three layers are average pooling operations; each output differs in size but has the same channel dimension;
2) reducing the channel dimension of the pooled features: a 1 × 1 convolution reduces the number of channels to 1/N of the original features, where N is the number of pooling layers;
3) up-sampling with bilinear interpolation so that the four feature levels match the size of the original feature map, and finally concatenating the four feature maps along the channel dimension.
Further, local features of the visible light image and the infrared image are extracted as follows:
the attention features fused from top to bottom of the VGG-DCNet network in each stage are averaged and pooled at different down-sampling rates to obtain features of different scale spaces, the features are subjected to convolution operation, then up-sampling is carried out to restore to the original scale, fusion is carried out, and finally a3 x 3 convolution is carried out to obtain local feature maps containing information of different scales.
On the basis of the embodiment, the invention also discloses another saliency target detection method based on RGB-T multi-source image data, which specifically comprises the following steps:
and (3) introducing deformable convolution to extract the characteristics of the deformed object, and replacing part of convolution layers on the basis of the backbone network VGG-16 to obtain VGG-DCNet. In order to effectively fuse the features of two modes, the invention constructs an attention feature fusion module (AB) based on a multi-source image, extracts an attention feature map based on an attention mechanism, and then performs feature-level fusion on the attention feature map. In order to overcome the problems that partial global information is lost due to small actual receptive field of a deep layer in a network and local information is ignored along with the deepening of the network, the invention constructs a fusion global-local feature module (FGLB), utilizes global guide information from the global consideration and local features from the local consideration and fuses the multi-level and multi-scale features of the RGB-T multi-source image. The network model finally constructed by the method can output the significance target detection effect diagram from end to end. The network architecture employed by the present invention is shown in fig. 2.
The specific working process is as follows:
(1) The visible light and thermal infrared images are used as the two-channel input, and primary features are extracted by the VGG-DCNet network with deformable convolution; the five stages of the VGG-DCNet network yield the visible light features F_R^{1,2,3,4,5} and the infrared features F_T^{1,2,3,4,5}.
(2) The primary features of the visible light and infrared images flow into the attention feature fusion module (AB); after the attention mechanism, the attention feature maps N_R^{1,2,3,4,5} and N_T^{1,2,3,4,5} are obtained, and the attention maps of each layer of the visible light and infrared images are then fused pairwise to give the fused attention feature maps A^{1,2,3,4,5}.
(3) As the network deepens, the semantic information carried by the high layers summarizes the whole scene, so the deepest feature carries global semantic guidance information. The semantic information obtained by applying multi-layer pooling to the deep feature A_5 is fused into the local feature extraction process, so that the global multi-scale features and the local multi-level features are fused in the fused global-local feature module (FGLB), which outputs the final feature map.
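The overall data flow described in steps (1)-(3) can be summarized in a short PyTorch-style sketch. The module and variable names below (DAGLNetSketch, attention_blocks, fglb, etc.) are illustrative assumptions rather than the patent's own code, and resolution/channel alignment between stages is omitted.

import torch.nn as nn

class DAGLNetSketch(nn.Module):
    # hypothetical wiring of the two VGG-DCNet branches, the per-stage AB
    # modules and the fused global-local block (FGLB)
    def __init__(self, backbone_rgb, backbone_t, attention_blocks, fglb):
        super().__init__()
        self.backbone_rgb = backbone_rgb          # visible-light branch
        self.backbone_t = backbone_t              # thermal-infrared branch
        self.attention_blocks = attention_blocks  # one AB per stage (5 stages)
        self.fglb = fglb                          # fused global-local feature module

    def forward(self, rgb, thermal):
        feats_r = self.backbone_rgb(rgb)          # [F_R^1, ..., F_R^5]
        feats_t = self.backbone_t(thermal)        # [F_T^1, ..., F_T^5]
        fused, prev = [], None
        for ab, fr, ft in zip(self.attention_blocks, feats_r, feats_t):
            prev = ab(fr, ft, prev)               # A_i per equation (1)
            fused.append(prev)
        return self.fglb(fused)                   # final saliency prediction map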
1. VGG-DCNet based on deformable convolution
This section describes the implementation of the deformable-convolution-based VGG-DCNet. Owing to the advantages of deformable convolution in handling images with unknown geometric deformation, the invention introduces deformable convolution (DConv) into the backbone network VGG-16 to form the VGG-DCNet network and uses it to extract the primary features of the two modalities. The deeper and the more locations at which deformable convolution layers and deformable region-of-interest pooling are applied in the backbone, the better the effect, but more parameters are generated and the model footprint grows.
The invention replaces the last three convolution layers with deformable convolutions, and experiments confirm the rationality of this choice; Table 1 compares the performance of the modified backbone networks. Adding deformable convolution to the VGG backbone introduces a large number of network parameters and raises hardware requirements: with 6 DConv layers, the video memory occupied exceeds 11 GB and the highest accuracy is 0.770, whereas 3 DConv layers also reach an accuracy of 0.770 while occupying less than 11 GB, so the last 3 convolution layers are replaced.
Table 1 Effect of the number of deformable convolution layers
(Table 1 is provided as an image in the original publication.)
The VGG-DCNet network replaces the three convolution layers of the last stage of the original VGG-16 with deformable convolutions and removes the final fully connected layers, giving a fully convolutional network with end-to-end output; Table 2 lists the VGG-DCNet structure parameters.
An image of size 480 × 640 is input to the VGG-DCNet network; the feature map output by each stage is fed to the next stage, and with the offsets learned by the deformable convolution, the kernels can adapt their sampling positions, so that images with unknown deformation are predicted more accurately. In the deformable convolution, p_conv denotes the convolution that computes the offsets Δp, and m_conv denotes the convolution that computes the weights Δm.
Table 2 VGG-DCNet network structure parameters
(Table 2 is provided as an image in the original publication.)
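As a concrete illustration of the replacement described above, the following PyTorch sketch builds one modulated deformable 3 × 3 layer with torchvision's DeformConv2d, with p_conv predicting the offsets Δp and m_conv the modulation weights Δm; the layer names and the use of torchvision are assumptions made for illustration, not the patent's implementation.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvLayer(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.p_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)  # offsets (delta p)
        self.m_conv = nn.Conv2d(in_ch, k * k, k, padding=padding)      # modulation (delta m)
        self.dconv = DeformConv2d(in_ch, out_ch, k, padding=padding)

    def forward(self, x):
        offset = self.p_conv(x)               # where each kernel point samples
        mask = torch.sigmoid(self.m_conv(x))  # how much each sample contributes
        return self.dconv(x, offset, mask)

In torchvision's VGG-16 layout, the three stage-5 convolutions correspond to features[24], features[26] and features[28], so swapping them for DeformableConvLayer(512, 512) instances mirrors the 3-layer replacement chosen above.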
2. Attention feature fusion module (AB) based on multi-source image
This section introduces the attention feature fusion module (AB). The primary features of each stage extracted by VGG-DCNet have their beneficial features enhanced and their irrelevant features suppressed under the normalized attention mechanism (NAM) to obtain attention feature maps, and the feature maps containing attention information obtained at the intermediate levels of the network are fused at the feature level.
Multi-source images can be fused by simple pixel addition, but such addition easily introduces excessive noise and severely harms the expression of important features. For visible light and infrared images in a weak-illumination scene, the visible light image may not distinguish the target from the background, which hampers salient object detection, whereas the infrared image can clearly separate them; simply adding the two at the pixel level would wholesale introduce the noise in the visible light image and suppress target detection. Extracting the features of the different modalities with a convolutional neural network and then fusing them highlights the key regions and effectively separates the target from the background. An attention mechanism assigns larger weights to important regions, helping the network attend to important features during propagation, and the attention feature maps extracted in this way retain more target information.
The features extracted from the visible light and infrared images at each stage of the backbone network are fused and propagated backwards to obtain progressively refined fused features; fig. 3 shows the feature-level fusion process using the NAM attention mechanism.
The normalized attention mechanism (NAM) improves the attention mechanism with weight contribution factors, suppressing less significant weights. NAM uses batch-normalization scale factors to express the variation intensity of channels and spatial positions and uses the standard deviation to express the importance of weights, highlighting prominent features while suppressing insignificant channels or pixels. NAM contains two modules, channel attention and spatial attention: channel attention compresses the spatial dimensions to single pixels and focuses on the channels with larger weights, while spatial attention compresses the channels to one dimension and focuses on the salient pixels with larger weights. Combining the two fully attends to channel and spatial information, capturing structural information such as texture and color as well as abstract channel information. Fig. 4 shows the attention feature fusion module (AB) based on multi-source images: the AB module takes the attention maps of the two modalities and the fused modality attention map of the previous layer as input and generates a fused feature map based on the attention features of the multi-source image. Because the attention mechanism measures the weight of every channel and spatial position, the feature maps enriched with attention information can be fused adaptively to obtain the beneficial features of each of the two modalities.
When fusing the attention features of the two modalities at the i-th stage, the convolution and pooling operations of the i-th stage give the i-th output F_R^i of the visible light image (its convolution feature) and the i-th output F_T^i of the infrared image, which serve as the input of the attention mechanism NAM to obtain the feature maps N_R^i and N_T^i fused with attention information. The fused attention feature map of the first stage is the addition of the two modalities, while the fused attention maps of the following four stages also incorporate the fused attention feature of the previous stage, as given in equation (1).
A_i = N_R^i ⊕ N_T^i (i = 1);  A_i = N_R^i ⊕ N_T^i ⊕ A_{i-1} (i = 2, 3, 4, 5)    (1)
where ⊕ denotes element-level addition.
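A condensed sketch of the AB fusion in equation (1) is given below. The NormChannelAttention class is a simplified stand-in for NAM (only the channel branch, weighting channels by their batch-normalization scale factors); the full NAM also has a spatial branch, and all class names here are illustrative assumptions.

import torch
import torch.nn as nn

class NormChannelAttention(nn.Module):
    # weights each channel by its BN scale factor gamma (NAM-style channel attention)
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        gamma = self.bn.weight.abs()
        x = x * (gamma / gamma.sum()).view(1, -1, 1, 1)
        return torch.sigmoid(x) * residual

class AttentionBlock(nn.Module):
    # A_i = NAM(F_R^i) + NAM(F_T^i), plus A_{i-1} for stages 2..5 (equation (1))
    def __init__(self, channels):
        super().__init__()
        self.att_r = NormChannelAttention(channels)
        self.att_t = NormChannelAttention(channels)

    def forward(self, f_r, f_t, prev=None):
        fused = self.att_r(f_r) + self.att_t(f_t)
        if prev is not None:
            fused = fused + prev  # assumes prev was resized to this stage's shape
        return fused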
3. Fusion global-local feature module (FGLB)
To make full use of all the context information in the image, this section uses global semantic guidance information together with local feature information and fuses global and local features, so that local features are extracted more accurately with the assistance of global features, which in turn helps obtain a more accurate saliency map.
The multi-task training framework extracts features comprehensively by using local and global semantic information. The shallow layers of the network have small receptive fields and relatively rich geometric information, so they yield more local features, which benefits the segmentation of small targets and refines the segmentation accuracy. The deep layers have larger receptive fields and richer spatial information, so they yield richer semantic information and more comprehensive global features, which benefits the segmentation of larger targets. Therefore, multi-scale pooling is applied to the deep features of the network to obtain global features, and shallow features are merged in during the backward propagation of the network to extract local features.
3.1 extraction of Global features
Global features describe the overall attributes of an image; they have good invariance, are simple to compute and intuitive to represent, reflect the content of the image and can locate the target accurately, but they contain little semantic information, and such visual features cannot convey the semantics the image expresses. For example, given a face image, the nose, eyes and other parts can be distinguished from appearance cues such as contour and shape, yet the image itself is not thereby understood; high-level semantic features express rich semantic information and allow the image to be understood as a face. Deeper features contain richer high-level semantic information and have a stronger ability to understand and discriminate images, which is referred to as the semantic space.
To obtain multi-scale global information, the invention uses a pyramid pooling method with the pooling operations of four sub-branches to obtain feature maps of different scales. The pyramid pooling operation pools the input features with kernels of different, increasing sizes; the kernel sizes and the number of layers can be adjusted to the actual scene.
The G module that acquires global information in the network architecture is built with this pyramid pooling method, as shown in fig. 5; the global scene prior obtained by pyramid pooling can be used to guide the extraction of local features. The fused attention feature A_5 of the deepest layer of the multi-source images, which is the final feature map of the network's forward propagation, carries the most high-level semantic information, and pooling operations of different scales are applied to it, as shown in fig. 5. The pooling has four layers: the first uses global average pooling (GAP) to obtain a single-pixel output, the pooled sizes of the second and third layers are 2 × 2 and 6 × 6 respectively, and the last layer is the identity mapping of the original input.
The specific steps of obtaining the global information by the multi-scale pooling operation are as follows:
(1) Pool the input feature map at four scales to obtain outputs P_i, i = 1, 2, 3, 4, where the first layer is global average pooling and the other three layers are average pooling; each output differs in size but has the same channel dimension, as shown in equation (2).
P_i = pool_i(A_5), i = 1, 2, 3, 4    (2)
(2) The feature maps of the four scales are to be concatenated while preserving the weight of each scale's global features, so the channel dimension of the pooled features is reduced: a 1 × 1 convolution reduces the number of channels to 1/N of the original, as shown in equation (3), where N is the number of pooling layers (N = 4 in the invention).
P_i = conv(P_i)    (3)
where conv(·) denotes a convolution operation.
(3) The four outputs now differ in size, so bilinear interpolation is used to up-sample the four feature levels to the size of the original feature map, and the four feature maps are finally concatenated along the channel dimension, as shown in equation (4); since the fourth layer is the identity mapping of the original features, it does not need to be combined with the original feature map through a residual connection.
P = concat(P_1, P_2, P_3, P_4)    (4)
where concat(·) denotes concatenation along the channel dimension.
The global context prior obtained by fusing the context information of different sub-regions helps the network detect targets. The feature maps produced by the four pooling layers of different scales carry global information at different scales, and the global prior obtained after fusion guides local feature extraction while the initial salient feature map is up-sampled to the resolution of the original image.
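A minimal PyTorch sketch of the G module follows, assuming pool sizes 1, 2 and 6 plus an identity branch and N = 4 as described above; it illustrates the pyramid pooling steps (2)-(4) and is not the patent's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalPriorModule(nn.Module):
    def __init__(self, channels, n_branches=4):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.AdaptiveAvgPool2d(1),  # global average pooling (P_1)
            nn.AdaptiveAvgPool2d(2),  # 2 x 2 average pooling (P_2)
            nn.AdaptiveAvgPool2d(6),  # 6 x 6 average pooling (P_3)
        ])
        # 1 x 1 convolutions reduce each branch to channels / N, equation (3)
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // n_branches, 1) for _ in range(n_branches)]
        )

    def forward(self, a5):
        h, w = a5.shape[-2:]
        branches = []
        for pool, red in zip(self.pools, self.reduce[:-1]):
            p = red(pool(a5))
            # bilinear up-sampling back to the input size before concatenation
            branches.append(F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False))
        branches.append(self.reduce[-1](a5))  # identity branch (P_4)
        return torch.cat(branches, dim=1)     # channel-wise concatenation, equation (4)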
3.2 local feature extraction
Global features have good invariance and intuitiveness, and the global prior information can effectively assist detection, but global descriptions are not suited to image clutter and occlusion, so global features alone cannot guarantee the completeness and effectiveness of feature extraction. To address this, a feature aggregation module L is introduced in the up-sampling process: the fused feature map is first mapped into feature spaces of multiple scales to capture local context information at different scales, and this information is then combined to better weigh the composition of the fused input feature map.
In the top-down output, the network uses the global features as guidance, extracts local features, and fuses them with the previous stage step by step at an up-sampling rate of 2. The L module in the network model extracts local features, as shown in fig. 6: the features at each stage of the top-down path are average-pooled at different down-sampling rates to obtain features of different scale spaces, these features are convolved and up-sampled back to the original scale and fused, and a final 3 × 3 convolution yields a local feature map containing information at different scales, which is fused at the element level with the global features and passed forward.
The specific steps for extracting the local features are as follows:
(1) The attention fusion features A^{1,2,3,4,5} correspond to the five stages of VGG-DCNet and occupy different scale spaces: A_1 is 480 × 640 × 64, A_2 is 240 × 320 × 128, A_3 is 120 × 160 × 256, A_4 is 60 × 80 × 512, and A_5 is 30 × 40 × 512. Different scale spaces contain different local features, and the features of the five scale spaces are taken as inputs to extract, at each stage, features rich in local information. During forward propagation, the features of each scale space are average-pooled at 3 different sampling rates, as shown in equation (5):
Scale_k = up(conv(down(A_i)))    (5)
where k takes the three sampling rates 2, 4, 8; up(·) denotes up-sampling, down(·) denotes down-sampling, and conv(·) denotes a convolution operation.
(2) The features of the three branches of different scales are up-sampled to the original scale and then fused with the original input features. This aggregates multi-scale local features and allows each spatial position to examine local context over different scale spaces, further enlarging the receptive field of the network; the specific process is shown in equation (6):
Local_i = conv(sum(Scale_k, A_i)), k = {2, 4, 8}, i = {1, 2, 3, 4, 5}    (6)
The more often features are aggregated, the better they are extracted, so during forward propagation the single-step up-sampling is replaced by gradual up-sampling, and a feature aggregation operation is performed at each stage to obtain local context information and thus rich local features.
Fig. 7 shows how the global information flows during local feature extraction; because the image size in the global information is 30 × 40, it must be converted at different up-sampling rates before being fused with the local features.
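A sketch of the L module corresponding to equations (5) and (6) is shown below: each fused attention feature A_i is average-pooled at the sampling rates {2, 4, 8}, convolved, up-sampled back, summed with the original input and passed through a final 3 × 3 convolution. The module and parameter names are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class LocalAggregation(nn.Module):
    def __init__(self, channels, rates=(2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.branch_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in rates]
        )
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)  # final 3x3 convolution

    def forward(self, a_i):
        h, w = a_i.shape[-2:]
        out = a_i
        for rate, conv in zip(self.rates, self.branch_convs):
            down = F.avg_pool2d(a_i, kernel_size=rate, stride=rate)        # down(A_i)
            up = F.interpolate(conv(down), size=(h, w), mode="bilinear",
                               align_corners=False)                        # Scale_k, eq. (5)
            out = out + up                                                 # sum(Scale_k, A_i)
        return self.fuse(out)                                              # Local_i, eq. (6)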
3.3 fusion of Multi-layer Multi-Scale features for Multi-Source images
The feature fusion in the invention is a multi-level, multi-scale fusion of multi-source images. Multi-source fusion refers to fusing features from the visible light image with features from the thermal infrared image; multi-level fusion refers to fusing the features of the two modalities at each stage of VGG-DCNet and passing them backwards, specifically F_i = F_R^i + F_T^i, i = 1, 2, 3, 4, 5. Note that when fusing the multi-level features, the invention first obtains attention information over the relatively important channel and spatial dimensions through the attention mechanism, and then fuses the attention features of each stage of the two modalities. Finally, to obtain rich global and local features, the invention adopts multi-scale feature fusion, extracting global and local features and fusing features of different scales through multiple scale transformations, thereby obtaining global information and local context information from a multi-scale space.
In summary, the inputs of the method are a visible light image and a thermal infrared image, from which multi-source image features are extracted. The main framework is the improved VGG-DCNet, which replaces the three conventional standard convolutions of the last stage of VGG-16 with deformable convolutions that can adapt to object deformation. The features extracted by the backbone are coarse primary multi-layer features. For effective feature fusion, the invention adopts attention-based feature fusion: an attention mechanism weights the important components of the primary features extracted by VGG-DCNet to obtain the attention feature maps of the visible light and infrared images, and the multi-layer features of each stage are then combined and propagated backwards. Since the deep features carry rich semantic information, multi-scale pooling is used to obtain global prior information for the up-sampling process; during forward propagation, pooling operations with different sampling rates are applied to the different scale spaces at each stage to extract rich local information, which is propagated forward under the guidance of the global prior to obtain the final saliency prediction map.
To verify the effect of the invention, the following experimental setup was performed:
the present section discusses details of the method implementation, comparison of the method, and performance analysis of each module of the method. First, the data set and evaluation criteria used in the present invention are described. Then, the method is compared with the similar model, and qualitative and quantitative analysis is carried out, so that the feasibility and the superiority of the model are proved. And finally, carrying out an ablation experiment, and carrying out experimental verification on the necessity and effectiveness of each module.
The method is built with the deep learning framework PyTorch and the Python programming language. Training and testing are based on the Ubuntu 18.04.6 operating system with 128 GB of memory, a high-performance NVIDIA RTX A6000 GPU (48 GB) and 4.4 TB of hard disk storage. In the experiments, the input images for training and testing are 480 × 640, the initial learning rate is 1e-4, and the batch_size is set to 1. Each training iteration (epoch) over the whole network takes about 15 minutes, the model is saved after each epoch, and performance peaks at the 25th epoch. Testing the network (2500 image pairs) takes about 6 minutes, i.e. 0.14 seconds per image to output a saliency prediction map.
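The training setup described above can be sketched roughly as follows (480 × 640 inputs, batch size 1, initial learning rate 1e-4, one checkpoint per epoch). The optimizer, loss function and data loading shown here are assumptions made for illustration; the patent does not specify them.

import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=25, device="cuda"):
    loader = DataLoader(dataset, batch_size=1, shuffle=True)    # batch_size = 1
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 1e-4
    criterion = torch.nn.BCEWithLogitsLoss()                    # assumed saliency loss
    model.to(device).train()
    for epoch in range(epochs):
        for rgb, thermal, gt in loader:       # paired 480 x 640 visible / thermal images
            rgb, thermal, gt = rgb.to(device), thermal.to(device), gt.to(device)
            loss = criterion(model(rgb, thermal), gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f"daglnet_epoch{epoch + 1}.pth")  # save after each epoch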
(a) Data set
The invention trains on the VT5000 data set, which is large and comprehensive and contains 5000 image pairs, 2500 for training and 2500 for testing; each pair of visible light and thermal infrared images is already aligned, so no image preprocessing step is required in the experiments. To verify the validity of the proposed DAGLNet, tests are performed on the three data sets VT821, VT1000 and VT5000 (test), all of which provide paired visible light and thermal infrared images that can be fed to the two branches of the dual-channel network.
(b) Evaluation index of algorithm
The evaluation criteria of this experiment are the accuracy measure F_β and the mean absolute error MAE. F_β is computed from precision and recall as shown in equation (7),
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)    (7)
where the balance parameter β² is set to 0.3; larger F_β values are better.
The mean absolute error MAE measures the average difference between the predicted value Y and the ground truth G, where W and H denote the width and height of the input image; smaller MAE values indicate better algorithm performance. MAE is computed as shown in equation (8),
MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |Y(x, y) − G(x, y)|    (8)
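For reference, the two measures can be computed as in the following sketch (NumPy, with a small epsilon added to avoid division by zero; the binarization of predictions for F_β is an assumption here).

import numpy as np

def f_beta(pred_binary, gt_binary, beta2=0.3, eps=1e-8):
    # equation (7) with beta^2 = 0.3, on binarized maps
    tp = np.logical_and(pred_binary, gt_binary).sum()
    precision = tp / (pred_binary.sum() + eps)
    recall = tp / (gt_binary.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def mae(pred, gt):
    # equation (8): mean absolute difference over the W x H pixels, maps in [0, 1]
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()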
(c) model training process
The algorithm improves on the VGG-16 network: the conventional standard convolutions of the last stage are replaced by deformable convolutions to obtain VGG-DCNet, the modified network is used as the backbone of the model, and the initialization parameters come from training on the ImageNet data set. The complete network comprises two stages: bottom-up convolution operations in the backbone produce the primary features of each stage of the model, and a top-down pass with gradual up-sampling then produces the final saliency prediction output.
The two channels of the model take 480 × 640 inputs. In the backbone network, the primary features obtained by the conventional convolutions of the first four stages are propagated bottom-up, the output of the fourth stage serves as the input of the fifth stage, and a feature map that adapts to feature deformation is obtained under the guidance of the deformable convolutions.
After the primary features are acquired, attention guidance is applied at each stage: the feature map of each stage is fed to the attention module to obtain an attention feature map. To fuse the features of the two modalities effectively, the algorithm adds the attention feature maps of each stage of the two modalities at the element level, achieving feature-level fusion of the two modal images; the fused features are likewise propagated bottom-up.
A feature map with a down-sampling rate of 16 is obtained in the deep layers of the network, with an image size of 30 × 40. During the gradual up-sampling back to the original image size, the global prior information obtained in the deep layers serves as global guidance for the backward propagation of the network. Both the global and the local features are obtained with multi-scale feature fusion: the global features are obtained by pooling the deep fused attention feature map A_5 to the three scale spaces 1 × 1, 2 × 2 and 6 × 6 and concatenating (cat) the results with A_5, while the local features are obtained from the attention feature map of each stage by first down-sampling at the sampling rates {2, 4, 8} and then up-sampling back to the original scale, finally fusing the features of each layer.
Throughout training, the training accuracy gradually levels off as the number of iterations increases; after DAGLNet completes the training process on the data set, the optimal model is saved for subsequent testing. The network outputs the predicted saliency map end to end, which is analyzed qualitatively and compared quantitatively with the ground-truth map under the evaluation criteria to compute the accuracy.
(d) Algorithmic comparison
To verify the effectiveness of the proposed DAGLNet, the method is compared with nine other mainstream salient object detection algorithms, all based on deep learning: PoolNet, R3Net, BASNet, EGNet, S2MANet, BBSNet, PGARNet, MIED and ADFNet. PoolNet, R3Net, BASNet and EGNet are single-modality salient object detection algorithms; S2MANet, BBSNet and PGARNet are RGB-D based; and MIED and ADFNet are RGB-T based.
Because few RGB-T based salient object detection methods exist, several RGB-T based and several RGB-D based methods are included in the comparison. For fairness, the optimal models released with the original papers are used for testing. For the single-modality methods, the original models are kept and the input remains a single-modality RGB image, consistent with the RGB input of the other multi-modality models. For the RGB-D based algorithms, the original network frameworks are unchanged and only the depth image is replaced by the thermal infrared image. In particular, for the RGB-T based method ADFNet, which uses the same training set as this method, no parameters or network structure are changed; the model is retrained on the VT5000 training set in the experimental environment of the invention, and the optimal model is kept for subsequent testing.
To ensure the completeness of the comparison, the three types of comparison experiments are based on the three data sets VT821, VT1000 and VT5000 (test), and all models are compared in the same experimental environment to ensure fairness.
(1) Algorithm quantitative comparison and analysis
Table 3 shows the comparison between the experimental model and the other nine salient object detection models on the three RGB-T data sets. Because the MIED algorithm uses the VT1000 data set during training, its result on VT1000 is not taken as a reference in the comparison. The performance of the algorithms on the different data sets is shown in the quantitative results of Table 3.
Table 3 Quantitative saliency detection results of the algorithms on the different data sets
(Table 3 is provided as images in the original publication.)
As Table 3 shows, compared with the nine classical salient object detection networks, the proposed DAGLNet achieves a better detection effect on the VT1000 data set, tying for first with ADFNet while having a lower mean absolute error (MAE). DAGLNet performs best on VT5000 (test), improving by 0.5 percentage points over the strong RGB-T based ADFNet model; compared with the encoder-decoder RGB-T salient object detection model MIED, detection efficiency improves by 5.4 percentage points, showing the practical significance and feasibility of each module used by the algorithm. Compared with the single-modality algorithm BASNet, ranked third in detection efficiency on VT5000 (test), efficiency improves by 2.0 percentage points, showing that basing the algorithm on RGB-T multi-source images has practical research value and that the thermal infrared modality complements the visible light image well. Compared with the multi-modality RGB-D salient object detection algorithms, efficiency improves by 4.5 percentage points over PGARNet, the better performer on VT5000 (test), and by 2.0 percentage points over BBSNet, the better performer on VT1000, illustrating the difference between depth information and thermal infrared information: the feature extraction and fusion mechanisms of RGB-D based models do not generalize to the detection of RGB-T multi-source images.
(2) Algorithmic qualitative comparison and analysis
And obtaining test results of nine algorithms which are compared with DAGLnet on the test set of the VT5000 data set, and comparing on the premise of ensuring the same input. Fig. 8 illustrates the performance of ten algorithms in three datasets. The overall performance shows that the multi-mode detection algorithm is superior to the single-mode detection algorithm, the saliency map of the RGB-T based saliency target detection algorithm can be better close to a true value map, and the algorithm provided by the invention has good detection capability in response to various challenges.
In fig. 8, the first two rows show the performance of various algorithms in the VT821 dataset, and although the algorithm of the present invention performs generally in the VT821, the algorithm of the present invention can better locate the target position for the situations of over-noise and cluttered background. The VT1000 data set test result of the middle two behaviors can well display target details, for example, the protruding part of the upper edge column of the billboard in the fourth row can be detected more completely by the algorithm. The last two rows are VT5000 data set test results, the algorithm can better utilize complementary information of the thermal infrared image to detect the target in the dark environment, as shown in the fifth row; the sixth row shows that the algorithm can highlight the target more completely than other algorithms, and no noise is introduced.
From the visual comparison it can be concluded that the algorithm has good detection capability and can handle saliency detection on large-scale datasets.
(3) Analysis of performance of each module of algorithm
The method mainly comprises three important modules: an improved VGG-16 network based on deformable convolution (VGG-DCNet), an attention feature fusion module (AB) used to extract and fuse the attention information of the two modalities at each stage, and a fused global-local feature module (FGLB) used to fuse global and local features. To verify the effectiveness of each module, this section performs a series of ablation experiments on the test set of the VT5000 dataset by adding each module individually, and carries out quantitative and qualitative analyses.
Table 4 shows the quantitative test results for each module of the present invention, and FIG. 9 shows the qualitative analysis results for each module of the method.
TABLE 4 Quantitative test results for each module of the method
(Table 4 is reproduced as an image in the original filing.)
As can be seen from Table 4, as each module is added, F β gradually increases and the MAE gradually decreases, which shows that each module contributes to improving the detection capability of the network.
Comparing (1) and (2) in Table 4, salient target detection with multi-source images performs better, improving by 10.5 percentage points over single-source detection; the qualitative analysis in FIG. 9 shows that salient targets are more easily located with multi-source detection, so salient target detection on multi-source images is meaningful.
For the VGG-DCNet module, referring to (2) and (3) in Table 4, performance improves by 2.9 percentage points compared with the plain VGG-16 network. From the predictions in FIG. 9, the network reconstructed with deformable convolution adapts to deformations better than the VGG-16 network.
For the attention feature fusion module (AB), referring to (2) and (4) in Table 4, quantitative analysis shows that detection performance improves by 3.6 percentage points after the AB module is added; the qualitative analysis in FIG. 9 shows that the feature maps produced by the AB module effectively suppress the non-salient parts of the visible light and thermal infrared images.
For the fused global-local feature module (FGLB), referring to (2) and (5) in Table 4, after the FGLB module is added detection performance improves by 7.1 percentage points under the guidance of the global and local features acquired at multiple scales; the qualitative analysis in FIG. 9 shows that the final saliency prediction maps are richer and more comprehensive, without large portions of the salient target missing.
Comparing (2) and (6) in Table 4, combining the attention mechanism and the global-local feature fusion module on top of the VGG-16 network greatly improves detection, by 9.6 percentage points compared with the VGG-16 network alone. The qualitative analysis in FIG. 9 shows that this combination further enhances the salient features while suppressing the non-salient ones.
Finally, VGG-DCNet, the attention module AB and the fused global-local feature module FGLB are used together; referring to (6) and (7) in Table 4, detection performance improves by a further 2.2 percentage points, and the qualitative analysis in FIG. 9 shows that the results of the DAGLNet model are closer to the ground truth GT, effectively suppressing noise and highlighting the salient target.
In FIG. 9, Single denotes a single-source test with the RGB image as input; Multi denotes a multi-source test in which the two-channel network takes the RGB and T images as its inputs, with both single-source and multi-source tests using VGG-16 as the backbone network; Dconv denotes training and testing with deformable convolution added on the basis of VGG-16; AB, FGLB and A+GL respectively denote adding the normalized attention mechanism, adding the fused global-local feature module, and using both together within the multi-source framework with VGG-16 as the backbone; OURS denotes the test result of the present model, combining the optimized backbone VGG-DCNet, the attention feature fusion module AB and the fused global-local feature module FGLB.
As can be seen from FIG. 9, as each module is introduced the network's ability to detect salient targets steadily improves: noise is gradually reduced, target localization becomes more accurate, the boundary between target and background becomes clearer, and the target contour is displayed more completely.
Through this series of ablation experiments, the performance of each module of the method is quantitatively compared and qualitatively analysed, proving that each module proposed by the algorithm makes a necessary contribution to the improvement of network performance.
In conclusion, in order to adaptively extract the features of irregular targets, the invention takes VGG-16 as the basis and introduces deformable convolution to replace part of the convolution layers in the network, obtaining VGG-DCNet, a VGG-16 network with deformable convolution added; by adding offsets to the conventional convolution, irregular convolution sampling is obtained and more complete target features are extracted. In order to fully exploit the complementarity of the multi-source image features, the invention constructs an attention feature fusion module, applies a normalized attention mechanism to extract the features of the two modalities, and fuses the attention features at the feature level. In order to make full use of global semantic information, a fused global-local feature module is constructed, which fuses global features such as texture and structure with local features of low inter-feature correlation, making full use of shallow position information and high-level semantic information so as to suppress redundant information and retain beneficial features. Finally, the network architecture obtained by the method of the invention outputs the salient target detection result end to end. The experimental results show that the method has good salient target detection capability, with particularly clear advantages in complex scenes such as insufficient illumination, targets crossing the image boundary and centre deviation.
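By way of non-limiting illustration, the following minimal PyTorch sketch shows the end-to-end data flow summarized above: two streams (visible light and thermal infrared), per-stage fusion, and multi-level prediction combined into one saliency map. The stage, fusion and prediction modules are simplified stand-ins (plain convolutions and element-wise addition), and all names, channel widths and the input resolution are assumptions rather than the filed implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStage(nn.Module):
    # simplified stand-in for one VGG-DCNet stage (convolution + down-sampling)
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

    def forward(self, x):
        return F.max_pool2d(F.relu(self.conv(x)), 2)

class DAGLNetSketch(nn.Module):
    # two backbone streams, per-stage fusion, and per-stage prediction heads
    def __init__(self, widths=(3, 64, 128, 256, 512, 512)):
        super().__init__()
        pairs = list(zip(widths[:-1], widths[1:]))
        self.rgb_stages = nn.ModuleList(TinyStage(a, b) for a, b in pairs)
        self.t_stages = nn.ModuleList(TinyStage(a, b) for a, b in pairs)
        self.heads = nn.ModuleList(nn.Conv2d(c, 1, kernel_size=1) for c in widths[1:])

    def forward(self, rgb, thermal):
        size = rgb.shape[2:]
        fused = []
        for s_r, s_t in zip(self.rgb_stages, self.t_stages):
            rgb, thermal = s_r(rgb), s_t(thermal)
            fused.append(rgb + thermal)          # stand-in for the AB attention fusion
        # stand-in for FGLB: predict from every fused stage, upsample and combine
        maps = [F.interpolate(h(f), size=size, mode='bilinear', align_corners=False)
                for h, f in zip(self.heads, fused)]
        return torch.sigmoid(torch.stack(maps).sum(dim=0))

net = DAGLNetSketch()
saliency = net(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print(saliency.shape)   # torch.Size([1, 1, 224, 224])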
The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (6)

1. A salient object detection method based on RGB-T multi-source image data is characterized by comprising the following steps:
step 1: on the basis of a conventional two-channel VGG-16 network architecture, replacing part of the convolution layers in VGG-16 with deformable convolutions and removing the final fully-connected layers to form a deformable-convolution-based VGG-DCNet network, taking the visible light image and the thermal infrared image as the two-channel inputs of the VGG-DCNet network, and extracting the primary features of the visible light image and the thermal infrared image with the VGG-DCNet network;
step 2: inputting the extracted primary features of the visible light image and the thermal infrared image into an attention feature fusion module, obtaining, via a normalized attention mechanism, the attention feature maps corresponding to the visible light image and the thermal infrared image respectively, and fusing the attention feature maps of the visible light image and the infrared image pairwise at each layer to obtain fused attention feature maps;
step 3: fusing the global semantic information, acquired by applying a multi-layer pyramid pooling operation to the deepest attention features, into the process of extracting the local features of the visible light image and the infrared image, fusing the global multi-scale features and the local multi-layer features of the visible light image and the infrared image in a fused global-local feature module, and outputting the final saliency prediction map.
2. The RGB-T multi-source image data-based saliency target detection method of claim 1, wherein said step 1 comprises:
the three-layer convolution of the last stage in VGG-16 is replaced with a deformable convolution.
3. The method as claimed in claim 1, wherein the attention feature fusion module enhances beneficial features and suppresses irrelevant features under the action of a normalized attention mechanism NAM to obtain attention feature maps, and performs feature-level fusion on the feature maps containing attention information obtained at the intermediate levels of the network.
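The claim does not disclose the internal formulation of the NAM mechanism; by way of non-limiting illustration, the following sketch shows a normalization-based channel attention in the spirit of published NAM formulations, in which the BatchNorm scaling factors serve as channel-importance weights. The class name, the exact weighting, and the use of element-wise addition as a placeholder for the feature-level fusion are assumptions.

import torch
import torch.nn as nn

class NormalizedChannelAttention(nn.Module):
    # BatchNorm scaling factors (gamma) rate channel importance; the normalized
    # weights re-scale the normalized features and a sigmoid gate modulates the input
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        gamma = self.bn.weight.abs()
        weight = gamma / gamma.sum()                  # per-channel importance
        x = x * weight.view(1, -1, 1, 1)
        return residual * torch.sigmoid(x)

# feature-level fusion of the attended visible-light and thermal features
# (element-wise addition here is only a placeholder for the fusion of claim 4)
att_rgb, att_t = NormalizedChannelAttention(64), NormalizedChannelAttention(64)
f_rgb, f_t = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
fused = att_rgb(f_rgb) + att_t(f_t)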
4. The RGB-T multisource image data-based saliency object detection method of claim 1, wherein in said attention feature fusion module, fusion of attention feature maps of each layer of visible light images and infrared images is performed as follows:
(The fusion formula is provided as an image in the original filing.)
wherein N_R^i denotes the attention feature of the visible light image at the i-th stage, N_T^i denotes the attention feature of the thermal infrared image at the i-th stage, and A_i denotes the fused attention feature of the i-th stage.
5. The RGB-T multi-source image data-based saliency target detection method of claim 1, wherein global semantic information is obtained as follows:
using the pyramid pooling method, pooling operations on four sub-branches are adopted to obtain feature maps of different scales, comprising the following steps (an illustrative sketch follows the list):
1) pooling the input feature map at four scales to obtain four-scale outputs P_i, i = 1, 2, 3, 4, wherein the first layer is a global average pooling and the other three layers are average pooling operations, each output differing in size but sharing the same channel dimension;
2) reducing the channel dimension of the pooled features by using a 1×1 convolution to reduce the number of channels to 1/N of the original features, where N is the number of pooling layers;
3) up-sampling with bilinear interpolation so that the size of each of the four feature maps matches that of the original feature map, and finally concatenating the four feature maps in the channel dimension.
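A non-limiting sketch of steps 1)-3) follows. The four pooling bin sizes (1, 2, 3, 6) are assumptions (the claim only specifies that the first layer is a global average pooling), and, since the claim does not state whether the original feature map is also concatenated, only the four pooled branches are concatenated here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # pool at four scales, reduce channels to 1/N with 1x1 convolutions,
    # upsample bilinearly to the input size and concatenate along channels
    def __init__(self, in_ch, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        n = len(bin_sizes)                             # N pooling layers
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(size),  # size 1 == global average pooling
                          nn.Conv2d(in_ch, in_ch // n, kernel_size=1))
            for size in bin_sizes)

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [F.interpolate(branch(x), size=(h, w), mode='bilinear',
                              align_corners=False)
                for branch in self.branches]
        return torch.cat(outs, dim=1)                  # channel-wise concatenation

ppm = PyramidPooling(512)
print(ppm(torch.randn(1, 512, 14, 14)).shape)          # torch.Size([1, 512, 14, 14])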
6. The RGB-T multi-source image data-based saliency target detection method of claim 1, wherein the visible light image and infrared image local features are extracted as follows:
the attention features fused from top to bottom at each stage of the VGG-DCNet network are average-pooled at different down-sampling rates to obtain features of different scale spaces; these features undergo a convolution operation, are then up-sampled back to the original scale and fused, and finally a 3×3 convolution is applied to obtain local feature maps containing information of different scales.
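A non-limiting sketch of this local-feature extraction follows; the down-sampling rates (1, 2, 4) and the summation used to fuse the rescaled features are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMultiScaleFeatures(nn.Module):
    # average-pool the fused attention features at several rates, convolve each
    # scale, upsample back to the input resolution, fuse, and apply a 3x3 conv
    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in rates)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        fused = 0
        for rate, conv in zip(self.rates, self.convs):
            y = F.avg_pool2d(x, kernel_size=rate) if rate > 1 else x
            y = conv(y)
            fused = fused + F.interpolate(y, size=(h, w), mode='bilinear',
                                          align_corners=False)
        return self.out_conv(fused)

feats = LocalMultiScaleFeatures(256)
print(feats(torch.randn(1, 256, 28, 28)).shape)         # torch.Size([1, 256, 28, 28])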
CN202210581903.8A 2022-05-26 2022-05-26 RGB-T multi-source image data-based saliency target detection method Pending CN114898106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210581903.8A CN114898106A (en) 2022-05-26 2022-05-26 RGB-T multi-source image data-based saliency target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210581903.8A CN114898106A (en) 2022-05-26 2022-05-26 RGB-T multi-source image data-based saliency target detection method

Publications (1)

Publication Number Publication Date
CN114898106A true CN114898106A (en) 2022-08-12

Family

ID=82725905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210581903.8A Pending CN114898106A (en) 2022-05-26 2022-05-26 RGB-T multi-source image data-based saliency target detection method

Country Status (1)

Country Link
CN (1) CN114898106A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439726A (en) * 2022-11-07 2022-12-06 腾讯科技(深圳)有限公司 Image detection method, device, equipment and storage medium
CN115661482A (en) * 2022-11-11 2023-01-31 东北石油大学三亚海洋油气研究院 RGB-T significant target detection method based on joint attention
CN116704328A (en) * 2023-04-24 2023-09-05 中国科学院空天信息创新研究院 Ground object classification method, device, electronic equipment and storage medium
CN116563147A (en) * 2023-05-04 2023-08-08 北京联合大学 Underwater image enhancement system and method
CN116563147B (en) * 2023-05-04 2024-03-26 北京联合大学 Underwater image enhancement system and method

Similar Documents

Publication Publication Date Title
CN114898106A (en) RGB-T multi-source image data-based saliency target detection method
CN113077471B (en) Medical image segmentation method based on U-shaped network
US20210142095A1 (en) Image disparity estimation
CN111612008A (en) Image segmentation method based on convolution network
CN110866938B (en) Full-automatic video moving object segmentation method
CN115331087A (en) Remote sensing image change detection method and system fusing regional semantics and pixel characteristics
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
WO2023015755A1 (en) Matting network training method and matting method
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN113486890A (en) Text detection method based on attention feature fusion and cavity residual error feature enhancement
CN112734915A (en) Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
Guo CSGNet: Cascade semantic guided net for retinal vessel segmentation
CN115858847A (en) Combined query image retrieval method based on cross-modal attention retention
CN113239825A (en) High-precision tobacco beetle detection method in complex scene
CN116757986A (en) Infrared and visible light image fusion method and device
Liu et al. Semantic guided single image reflection removal
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
CN116805360B (en) Obvious target detection method based on double-flow gating progressive optimization network
CN116758117B (en) Target tracking method and system under visible light and infrared images
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN116258652B (en) Text image restoration model and method based on structure attention and text perception
CN115830094A (en) Unsupervised stereo matching method
Gu et al. Retinal vessel segmentation based on self-distillation and implicit neural representation
Wang et al. An efficient hierarchical optic disc and cup segmentation network combined with multi-task learning and adversarial learning
CN113469199A (en) Rapid and efficient image edge detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination