CN110210539B - RGB-T image saliency target detection method based on multi-level depth feature fusion - Google Patents

RGB-T image saliency target detection method based on multi-level depth feature fusion

Info

Publication number: CN110210539B
Application number: CN201910431110.6A
Authority: CN (China)
Prior art keywords: fusion, image, level, RGB, features
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110210539A
Inventors: 张强, 黄年昌, 姚琳, 刘健, 韩军功
Original assignee / current assignee: Xidian University
Application filed by Xidian University; priority to CN201910431110.6A
Publication of CN110210539A (application publication)
Publication of CN110210539B (application granted)

Classifications

    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/048: Neural networks; architecture; activation functions
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/464: Extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The invention discloses an RGB-T image salient object detection method based on multi-level depth feature fusion, which mainly solves the problem that the prior art cannot detect salient objects completely and consistently in complex and changeable scenes. The implementation scheme is as follows: 1. extract coarse multi-level features from the input images; 2. construct adjacent depth feature fusion modules to improve the single-modality features; 3. construct a multi-branch group fusion module and fuse the multi-modality features; 4. obtain the fused output feature map; 5. train the network; 6. predict a pixel-level saliency map of the RGB-T image. The method effectively fuses complementary information from images of different modalities, detects salient objects in images of complex and changeable scenes completely and consistently, and can be used in the image preprocessing stage of computer vision.

Description

RGB-T image saliency target detection method based on multi-level depth feature fusion
Technical Field
The invention belongs to the field of image processing and relates to an RGB-T image salient object detection method, in particular to an RGB-T image salient object detection method based on multi-level depth feature fusion, which can be used in the image preprocessing stage of computer vision.
Background
Salient object detection aims to detect and segment the salient object regions in an image using a model or algorithm. As an image preprocessing step, salient object detection plays an important role in visual tasks such as visual tracking, image recognition, image compression and image fusion.
Existing salient object detection methods fall into two main categories: methods based on traditional techniques and methods based on deep learning. Traditional salient object detection algorithms complete the saliency prediction with hand-crafted features such as color, texture and orientation; they rely heavily on manually selected features, adapt poorly to varied scenes, and perform poorly on complex data sets. With the wide application of deep learning, salient object detection research based on deep learning has made breakthrough progress, and the detection performance has improved markedly compared with traditional saliency algorithms.
Most salient object detection methods, such as "Q. Hou, M. M. Cheng, X. Hu, et al. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 815-828.", compute saliency values only from RGB images of a single modality; the scene information they obtain is limited, and in challenging scenes such as low light, low contrast and complex backgrounds, salient objects are difficult to detect completely and consistently.
To solve the above problems, some RGB-T image-based salient object detection methods have been proposed, such as "Li C, Wang G, Ma Y, et al. A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and a Novel Approach. arXiv preprint arXiv:1701.02829, 2017", which discloses an RGB-T image salient object detection method based on manifold ranking: it uses the complementary information of RGB and thermal infrared images to construct a cross-modality consistent manifold ranking model, and computes the saliency value of each node in combination with a two-stage graph method. Under low-illumination and low-contrast conditions it detects salient objects more accurately than detection methods that take only RGB as input.
However, this method takes region blocks as the basic detection unit, so an obvious blocking effect appears in the saliency map, the boundary between object and background is inaccurate, and the interior of the object is not uniform. In addition, the method is built on hand-crafted feature extraction; the selected features cannot fully express the intrinsic characteristics of different images, the complementary information between images of different modalities is not fully exploited, and the improvement in detection performance is limited.
Disclosure of Invention
Purpose of the invention: in view of the defects of the prior art, the invention aims to provide an RGB-T image salient object detection method based on multi-level depth feature fusion, so as to improve how completely and consistently salient objects are detected in images of complex and changeable scenes. The method mainly solves the problem that the prior art cannot detect salient objects completely and consistently in such scenes.
The key to realizing the invention is multi-level depth feature extraction and fusion for RGB-T images: saliency is predicted by fusing the multi-level single-modality features extracted from the RGB and thermal infrared images. Specifically: coarse multi-level features are extracted from different depths of the backbone network for the RGB and thermal infrared images; adjacent depth feature fusion modules are constructed to extract improved multi-level single-modality features; a multi-branch group fusion module is constructed to fuse the features of the different modalities; the fused output feature map is obtained; the network is trained to obtain the model parameters; and a pixel-level saliency map of the RGB-T image is predicted.
The technical scheme is as follows: the RGB-T image salient object detection method based on multi-level depth feature fusion comprises the following steps:
(1) Extracting coarse multi-level features from the input images:
extracting features at 5 levels of different depths in the base network of each image as coarse single-modality features;
(2) Constructing adjacent depth feature fusion modules and improving the single-modality features:
establishing a plurality of adjacent depth feature fusion modules, then processing the 5-level coarse single-modality features obtained in step (1) through the adjacent depth feature fusion modules, and fusing the features of 3 adjacent depths to obtain improved 3-level single-modality features;
(3) Constructing a multi-branch group fusion module and fusing the multi-modality features:
constructing a multi-branch group fusion module comprising two fusion branches, and fusing the different single-modality features at the same feature level among the improved 3-level single-modality features obtained in step (2) to obtain fused multi-modality features;
(4) Obtaining the fused output feature map:
fusing the different-level features of the fused multi-modality features obtained in step (3) step by step in reverse order to obtain a plurality of edge output feature maps, and fusing all the edge output feature maps to obtain the fused output feature map;
(5) Training the network:
on a training data set, completing the network training with the edge output feature maps and the fused output feature map obtained in step (4) by adopting a deeply supervised learning mechanism and minimizing a cross-entropy loss function, to obtain the network model parameters;
(6) Predicting a pixel-level saliency map of the RGB-T image:
on a test data set, using the network model parameters obtained in step (5), predicting the pixel-level saliency map of the RGB-T image by sigmoid classification computation on the edge output feature maps and the fused output feature map obtained in step (4).
Further, the image in step (1) is an RGB image or a thermal infrared image.
Further, the base network in step (1) is a VGG16 network.
Further, constructing the adjacent depth feature fusion modules in step (2) includes the following steps:
(21) Denote the 5-level coarse single-modality features obtained in step (1) by the symbols F_1^n, F_2^n, F_3^n, F_4^n, F_5^n, where n = 1 or 2 indicates an RGB image or a thermal infrared image, respectively;
(22) Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation and produces the d-th level single-modality feature, d = 1, 2, 3.
Still further, step (22) comprises:
(221) A convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively;
(222) The resulting 3 features are concatenated and passed through a convolution operation C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, yielding the 128-channel d-th level single-modality feature F'_d^n;
The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
wherein:
C(X; θ, s) and D(X; γ, s) denote a convolution and a deconvolution acting on X with parameters θ or γ and stride s;
Cat(·) denotes the cross-channel concatenation operation;
φ(·) is the ReLU activation function.
Further, the multi-branch group fusion module in step (3) performs fusion for the different single modalities at the same feature level and includes two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 groups, and the single-group fusion branch has only one group;
each fusion branch outputs 64-channel features, and the output features of the two branches are concatenated to obtain 128-channel multi-modality features.
Further, constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the multi-group fusion branch to obtain the fused multi-modality features includes the following steps:
(31) The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {F'_{d,m}^1 | m = 1, …, M} and {F'_{d,m}^2 | m = 1, …, M}, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) The corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, each convolution operation being followed by a ReLU activation function;
(33) The outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
wherein:
T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above;
θ_{d,m} denotes the fusion parameters of the m-th group.
Further, constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the single-group fusion branch to obtain the fused multi-modality features includes the following steps:
(3a) The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
wherein:
H_{2,d} is the d-th level fused feature output of the single-group fusion branch;
T(·; θ_d^s) comprises two stacked convolution operations, namely a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, each convolution operation being followed by a ReLU activation function;
θ_d^s denotes the fusion parameters of the single-group fusion branch;
(3b) The d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
has the advantages that: compared with the prior art, the RGB-T image significance target detection method based on multi-level depth feature fusion has the following beneficial effects:
1) The method can realize end-to-end pixel level detection of the RGB-T image without manual design and characteristic extraction, and simulation results show that the method has complete consistency effect when detecting the obvious target of the image in a complex and changeable scene.
2) According to the method, 5-level rough single-mode features extracted from the strut network are improved by establishing a plurality of adjacent depth feature fusion modules to obtain 3-level single-mode features, low-level details and high-level semantic information of an input image can be effectively captured, the phenomenon that overall network parameters are increased sharply due to excessive feature levels is avoided, and the difficulty in network training is reduced.
3) The invention fuses different modal characteristics by constructing the multi-branch combination fusion module comprising two fusion branches, because the single-branch combination fusion structure captures cross-channel correlation among all characteristics of different modes from RGB images and thermal infrared images, and more remarkable characteristics are extracted from a plurality of groups of fusion branches, cross-modal information from RGB and thermal infrared images can be effectively captured, more complete and consistent targets can be detected, meanwhile, the number of training parameters required by the fusion module is less, and the detection speed of the algorithm can be improved.
Drawings
FIG. 1 is a flow chart of an implementation of the RGB-T image salient object detection method based on multi-level depth feature fusion disclosed by the invention;
FIG. 2 is a visual comparison of experimental results on the RGB-thermal database between the present invention and the prior art;
FIGS. 3a and 3b are comparisons of two evaluation indexes, the P-R curve and the F-measure curve, on the RGB-thermal database between the present invention and the prior art.
Detailed description of the embodiments:
the following is a detailed description of specific embodiments of the invention.
Referring to FIG. 1, the RGB-T image salient object detection method based on multi-level depth feature fusion includes the following steps:
Step 1) Extract coarse multi-level features from the input images:
For the RGB image or the thermal infrared image, features at 5 levels of different depths in the VGG16 network are extracted as the coarse single-modality features, namely:
conv1-2 (denoted by the symbol F_1^n, containing 64 feature maps of size 256×256);
conv2-2 (denoted by the symbol F_2^n, containing 128 feature maps of size 128×128);
conv3-3 (denoted by the symbol F_3^n, containing 256 feature maps of size 64×64);
conv4-3 (denoted by the symbol F_4^n, containing 512 feature maps of size 32×32);
conv5-3 (denoted by the symbol F_5^n, containing 512 feature maps of size 16×16);
where: n = 1 or 2,
n = 1 denotes the RGB image;
n = 2 denotes the thermal infrared image;
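To make this extraction step concrete, the following is a minimal PyTorch sketch (the embodiment itself was implemented in Caffe) of pulling the five coarse feature levels out of a VGG16 backbone; the torchvision layer indices are assumptions chosen so that each stage ends at the named convolution, and the thermal infrared image is assumed to be replicated to 3 channels before being fed to its own backbone copy.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CoarseFeatureExtractor(nn.Module):
    """Extracts the five coarse single-modality feature levels from VGG16."""
    def __init__(self):
        super().__init__()
        feats = vgg16().features  # pretrained ImageNet weights would normally be loaded here
        # Slice the backbone so each stage ends after the named convolution and its ReLU.
        self.stage1 = feats[:4]    # conv1-1 .. conv1-2 ->  64 x 256 x 256
        self.stage2 = feats[4:9]   # pool1, conv2-1 .. conv2-2 -> 128 x 128 x 128
        self.stage3 = feats[9:16]  # pool2, conv3-1 .. conv3-3 -> 256 x  64 x  64
        self.stage4 = feats[16:23] # pool3, conv4-1 .. conv4-3 -> 512 x  32 x  32
        self.stage5 = feats[23:30] # pool4, conv5-1 .. conv5-3 -> 512 x  16 x  16

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return [f1, f2, f3, f4, f5]

# One extractor per modality (n = 1: RGB, n = 2: thermal infrared).
rgb_net, t_net = CoarseFeatureExtractor(), CoarseFeatureExtractor()
rgb_feats = rgb_net(torch.randn(1, 3, 256, 256))
```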
Step 2) Construct adjacent depth feature fusion modules and improve the single-modality features:
Common multi-modal vision methods directly take the five-level features as the single-modality features; with so many feature levels, such methods have a huge number of network parameters and are harder to train. The invention instead takes the 5-level features of different depths as coarse single-modality features and obtains 3 levels of improved RGB image features or thermal infrared image features by establishing several adjacent depth feature fusion modules;
Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation. Specifically, to obtain the d-th level single-modality feature, d = 1, 2, 3, a convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n first act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively, to ensure that the 3 adjacent-level features from the backbone network have the same spatial resolution and the same number of feature channels (128 channels in the present invention); the 3 features are then concatenated and passed through a convolutional layer C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, yielding the 128-channel d-th level single-modality feature F'_d^n. The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
where Cat(·) denotes the cross-channel concatenation operation and φ(·) is the ReLU activation function;
As indicated above, the d-th level RGB or thermal infrared single-modality feature F'_d^n contains feature information from 3 levels of the backbone network, i.e. from F_{d+1}^n and from the features F_d^n and F_{d+2}^n at the depths adjacent to it. This means F'_d^n contains richer detail and semantic information, which helps to identify the object accurately. In addition, compared with simply merging F_d^n, F_{d+1}^n and F_{d+2}^n, the feature F'_d^n is more compact: through the adjacent depth feature fusion, the redundant information in the coarsely extracted features is compressed in F'_d^n;
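A minimal PyTorch sketch of one adjacent depth feature fusion module follows. It assumes, consistently with the resolutions above, that the d-th module combines the coarse features of backbone levels d, d+1 and d+2, and that the ReLU is applied after the final 1×1 fusion convolution; the "stride 1/2" deconvolution is written as a transposed convolution with stride 2.

```python
import torch
import torch.nn as nn

class AdjacentDepthFusion(nn.Module):
    """Fuses coarse features from three adjacent depths into one 128-channel feature."""
    def __init__(self, ch_d, ch_d1, ch_d2, out_ch=128):
        super().__init__()
        self.down = nn.Conv2d(ch_d, out_ch, kernel_size=3, stride=2, padding=1)  # 3x3, stride 2
        self.keep = nn.Conv2d(ch_d1, out_ch, kernel_size=1, stride=1)            # 1x1, stride 1
        self.up   = nn.ConvTranspose2d(ch_d2, out_ch, kernel_size=2, stride=2)   # 2x2, "stride 1/2"
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1, stride=1)       # 1x1 fusion conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_d, f_d1, f_d2):
        branches = [self.down(f_d), self.keep(f_d1), self.up(f_d2)]  # same resolution, 128 ch each
        return self.relu(self.fuse(torch.cat(branches, dim=1)))      # Cat(...), 1x1 conv, ReLU

# Example: the d = 1 module fuses conv1-2 (64 ch), conv2-2 (128 ch) and conv3-3 (256 ch).
adf1 = AdjacentDepthFusion(64, 128, 256)
out = adf1(torch.randn(1, 64, 256, 256), torch.randn(1, 128, 128, 128), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 128, 128, 128])
```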
Step 3) Construct the multi-branch group fusion module and fuse the multi-modality features:
The multi-branch group fusion module performs fusion for the different single modalities at the same feature level and contains two fusion branches;
The first fusion branch (also called the multi-group fusion branch) has M groups (M = 8 in this embodiment); its main role is to amplify the effect of each channel while reducing the network parameters;
The second fusion branch (also called the single-group fusion branch) has only one group; its main role is to fully capture the cross-channel correlations among all input features of the different modalities. The two branches output the same number of channels (64 channels in this embodiment), so the final number of output feature channels of the multi-branch group fusion module is twice that of each fusion branch and equals the number of feature channels of the RGB or thermal infrared image features input to the module (128 channels in this embodiment);
The multi-group fusion branch is built according to the basic idea of "split-transform-merge". The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels (128/M channels per group), giving two feature sets {F'_{d,m}^1} and {F'_{d,m}^2}; the corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are then combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, where the first 1×1 convolution mainly reduces the number of feature channels, the second convolution mainly fuses the features, and each convolution operation is followed by a ReLU activation function; finally, the outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
where T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above, and θ_{d,m} denotes the fusion parameters of the m-th group;
The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
where H_{2,d} is the d-th level fused feature output of the single-group fusion branch; T(·; θ_d^s) comprises two stacked convolution operations, a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, through which the correlation information among all the input multi-modality features is fully captured, each convolution operation being followed by a ReLU activation function; and θ_d^s denotes the fusion parameters of the single-group fusion branch;
Finally, through the multi-group fusion branch and the single-group fusion branch, the d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}:
H_d = Cat(H_{1,d}, H_{2,d})
As described above, the multi-branch group fusion module captures the cross-channel correlations among all features of the different modalities from the RGB and thermal infrared images through its single-group fusion branch, and extracts more salient features through its multi-group fusion branch. Therefore, through the multi-branch group fusion module, multi-level multi-modality fusion features are extracted; compared with common fusion methods, cross-modality information from the RGB and thermal infrared images is captured more effectively and more complete and consistent objects can be detected. Owing to the idea of grouped convolution, the multi-branch group fusion module also needs fewer training parameters than a common fusion approach that directly concatenates the features and then passes them through a series of convolution and activation layers;
Step 4) Obtain the fused output feature map:
The multi-modality features of different levels are fused step by step in reverse (deep-to-shallow) order to obtain the edge output feature maps {P_d | d = 1, 2, 3}. For each level d, a deconvolution layer D(·; γ_d, (1/2)^d), with a 2^d × 2^d kernel, stride (1/2)^d and parameters γ_d, is used so that the fused features have the same spatial resolution, and two convolutions with 1×1 kernels, stride 1 and parameters θ_{d,1} and θ_{d,2} are used to fuse the features of different levels and to generate the edge output feature map of each level, respectively. Through this progressive information transfer, 3 edge output feature maps {P_d | d = 1, 2, 3} with the same size as the input single-modality images are obtained;
The multi-level feature maps are then combined by a concatenation operation and fused by a convolution operation C(·; θ_0, 1) with a 1×1 kernel, stride 1 and parameters θ_0 to generate the fused output feature map P_0, with the expression:
P_0 = C(Cat(P_1, P_2, P_3); θ_0, 1)
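One plausible PyTorch sketch of this stepwise reverse fusion is given below. The description above fixes the deconvolution kernels (2^d × 2^d with stride (1/2)^d, i.e. upsampling by 2^d) and the use of two 1×1 convolutions per level, but not their exact wiring; the deep-to-shallow information transfer shown here (concatenating each level's feature with the upsampled deeper feature before producing its edge output) is therefore an assumption, not the patented layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseFusionDecoder(nn.Module):
    """Progressive deep-to-shallow fusion producing edge outputs P_1..P_3 and the fused map P_0."""
    def __init__(self, ch=128):
        super().__init__()
        self.merge = nn.ModuleList([nn.Conv2d(2 * ch, ch, 1) for _ in range(2)])   # levels 2 and 1
        self.upsample = nn.ModuleList([
            nn.ConvTranspose2d(ch, ch, kernel_size=2 ** d, stride=2 ** d) for d in (1, 2, 3)])
        self.edge_out = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in range(3)])     # per-level P_d
        self.fuse_out = nn.Conv2d(3, 1, 1)                                         # P_0

    def forward(self, h1, h2, h3):
        # Deep-to-shallow information transfer: pass level-3 context into level 2, then level 1.
        f3 = h3
        f2 = self.merge[0](torch.cat([h2, F.interpolate(f3, scale_factor=2)], dim=1))
        f1 = self.merge[1](torch.cat([h1, F.interpolate(f2, scale_factor=2)], dim=1))
        feats = [f1, f2, f3]
        edges = [self.edge_out[d](self.upsample[d](feats[d])) for d in range(3)]   # P_1, P_2, P_3
        p0 = self.fuse_out(torch.cat(edges, dim=1))                                # P_0
        return [p0] + edges

dec = ReverseFusionDecoder()
outs = dec(torch.randn(1, 128, 128, 128), torch.randn(1, 128, 64, 64), torch.randn(1, 128, 32, 32))
print([o.shape[-1] for o in outs])  # all maps restored to the 256 x 256 input resolution
```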
Step 5) Train the network:
On the training data set, a deeply supervised learning mechanism is adopted: each edge output feature map and the fused output feature map, {P_t | t = 0, 1, 2, 3}, is compared with the ground-truth map G, giving the cross-entropy loss function L of the network model:
L = - Σ_t Σ_{(i,j)} [ β · G(i,j) · log P_t(i,j) + (1 - β) · (1 - G(i,j)) · log(1 - P_t(i,j)) ]
where G(i,j) ∈ {0,1} is the value at position (i,j) in the ground-truth map G, P_t(i,j) is the probability value at position (i,j) in the probability map obtained from the feature map P_t by the operation σ(P_t), and σ(·) is the sigmoid activation function. In different images, the areas occupied by the salient object and by the background differ; to balance the losses of foreground and background and to increase the detection accuracy for salient objects of different sizes, a class-balance parameter β is used, where β is the ratio of the number of background pixels in the ground-truth map to the number of pixels in the whole ground-truth map and can be expressed as:
β = N_b / (N_b + N_f)
where N_b denotes the number of background pixels and N_f denotes the number of foreground pixels;
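A short PyTorch sketch of this class-balanced cross-entropy loss under deep supervision follows; computing β per image and averaging the loss over pixels are assumptions not fixed by the text.

```python
import torch

def balanced_bce(logits, gt):
    """logits: raw feature map P_t (before sigmoid); gt: binary ground-truth map G."""
    prob = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    n_f = gt.sum()                                 # number of foreground pixels N_f
    n_b = gt.numel() - n_f                         # number of background pixels N_b
    beta = n_b / (n_b + n_f)                       # class-balance parameter
    loss = -(beta * gt * torch.log(prob) + (1 - beta) * (1 - gt) * torch.log(1 - prob))
    return loss.mean()

def deep_supervision_loss(outputs, gt):
    # outputs = [P_0, P_1, P_2, P_3]; every side output is supervised by the same ground truth G.
    return sum(balanced_bce(p, gt) for p in outputs)
```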
The invention trains the network in 3 steps: first, the branch network for the RGB image is trained by minimizing the cross-entropy loss function; in this constructed branch network, the multi-branch group fusion module is removed, and the multi-level visible-light image features output by the adjacent depth feature fusion modules are fed directly into the reverse transfer process to predict saliency; second, the thermal infrared branch is constructed and trained in the same way as the RGB branch network of the first step; third, based on the VGG16 backbone network parameters and the adjacent depth feature fusion module parameters obtained from the RGB and thermal infrared single-branch networks in the previous two steps, the whole RGB-T image detection network is trained to obtain the network model parameters;
When training the thermal infrared single-modality branch network parameters, a data set for thermal infrared single-modality salient object detection is lacking; to make training feasible, the R channel of the RGB image is used as a substitute for thermal infrared single-modality data, because among the three channels of an RGB image the R channel is closest to the thermal infrared image. The specific training data sets are constructed as follows:
the RGB branch is trained on the RGB images of the RGB-thermal data set (one image of every two) mixed with images of the MSRA-B training data set (one image of every three); correspondingly, the thermal infrared branch is trained on the thermal infrared images of the RGB-thermal data set (one image of every two) mixed, in the same proportion, with the R channels of the images of the MSRA-B training data set (one image of every three); the RGB-T image multi-modality network model is trained with the paired images of the RGB-thermal data set (one pair of every two);
During training, to avoid the over-fitting caused by too little training data, each image is rotated by 90°, 180° and 270° and flipped horizontally and vertically, enlarging the original data set to 8 times its size;
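As a small illustration, the augmentation can be sketched as below; exactly how the rotations and flips are combined to reach the stated eightfold enlargement is an assumption, and in practice the same transform must be applied jointly to the RGB image, the thermal infrared image and the ground-truth map.

```python
from PIL import Image

def augment(img: Image.Image):
    """Return 8 variants of one training image: 4 rotations, each with and without a flip."""
    rotations = [img, img.rotate(90, expand=True), img.rotate(180), img.rotate(270, expand=True)]
    variants = []
    for r in rotations:
        variants.append(r)
        variants.append(r.transpose(Image.FLIP_LEFT_RIGHT))   # horizontal flip of each rotation
    return variants  # includes the original image
```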
Step 6) Predict the pixel-level saliency map of the RGB-T image:
The other half of the data in the RGB-thermal data set, not used for training, is taken as the test data. Using the network model parameters obtained in step (5), the edge output feature maps and the fused output feature map obtained in step (4) are further classified; denoting the output saliency maps of the network by {S_t | t = 0, 1, 2, 3}, S_t can be expressed as follows:
S_t = σ(P_t)
where σ(·) is the sigmoid activation function;
finally, S_0 is taken as the final predicted RGB-T saliency map.
The technical effects of the invention are further explained below in combination with simulation experiments:
1. Simulation conditions: all simulation experiments are implemented with the Caffe deep learning framework under Ubuntu 16.04.5, using Matlab R2014b software as the interface;
2. simulation content and result analysis:
emulation 1
The invention is compared with existing RGB image-based salient object detection methods and RGB-T image-based salient object detection algorithms in salient object detection experiments on the public image database RGB-thermal, and part of the experimental results are compared visually, as shown in FIG. 2, where the RGB images are the RGB images from the database used as experimental input, the T images are the thermal infrared images paired with those RGB images, and GT denotes the manually annotated ground-truth maps;
As can be seen from FIG. 2, compared with the prior art, the method suppresses the background better, detects salient objects in complex scenes more completely and consistently, and is closer to the manually annotated ground-truth maps.
Simulation 2
The invention is compared with existing single-modality image-based salient object detection methods and RGB-T image-based salient object detection algorithms; the results of salient object detection experiments on the public image database RGB-thermal are evaluated objectively with widely accepted evaluation indexes, and the evaluation results are shown in FIGS. 3a and 3b, where:
FIG. 3a shows the evaluation results of the present invention and the prior art using the precision-recall (P-R) curve;
FIG. 3b shows the evaluation results of the present invention and the prior art using the F-measure curve;
As can be seen from FIGS. 3a and 3b, compared with the prior art, the method achieves higher P-R and F-measure curves, which shows that it detects salient objects more consistently and completely and fully demonstrates its effectiveness and superiority.
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (8)

1. An RGB-T image salient object detection method based on multi-level depth feature fusion, characterized by comprising the following steps:
(1) Extracting coarse multi-level features from the input images:
extracting features at 5 levels of different depths in the base network of each image as coarse single-modality features;
(2) Constructing adjacent depth feature fusion modules and improving the single-modality features:
establishing a plurality of adjacent depth feature fusion modules, processing the 5-level coarse single-modality features obtained in step (1) through the adjacent depth feature fusion modules, and fusing the features of 3 adjacent depths to obtain improved 3-level single-modality features;
(3) Constructing a multi-branch group fusion module and fusing the multi-modality features:
constructing a multi-branch group fusion module comprising two fusion branches, and fusing the different single-modality features at the same feature level among the improved 3-level single-modality features obtained in step (2) to obtain fused multi-modality features;
(4) Obtaining the fused output feature map:
fusing the different-level features of the fused multi-modality features obtained in step (3) step by step in reverse order to obtain a plurality of edge output feature maps, and fusing all the edge output feature maps to obtain the fused output feature map;
(5) Training the network:
on a training data set, completing the network training with the edge output feature maps and the fused output feature map obtained in step (4) by adopting a deeply supervised learning mechanism and minimizing a cross-entropy loss function, to obtain the network model parameters;
(6) Predicting a pixel-level saliency map of the RGB-T image:
on a test data set, using the network model parameters obtained in step (5), predicting the pixel-level saliency map of the RGB-T image by sigmoid classification computation on the edge output feature maps and the fused output feature map obtained in step (4).
2. The method for detecting the salient object of the RGB-T image based on multi-level depth feature fusion as claimed in claim 1, wherein the image in step (1) is an RGB image or a thermal infrared image.
3. The RGB-T image saliency target detection method of multi-level depth feature fusion of claim 1 characterized in that the base network in step (1) is VGG16 network.
4. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 2, wherein constructing the adjacent depth feature fusion modules in step (2) comprises the following steps:
(21) Denote the 5-level coarse single-modality features obtained in step (1) by the symbols F_1^n, F_2^n, F_3^n, F_4^n, F_5^n, where n = 1 or 2 indicates an RGB image or a thermal infrared image, respectively;
(22) Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation and produces the d-th level single-modality feature, d = 1, 2, 3.
5. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 4, wherein step (22) comprises:
(221) A convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively;
(222) The resulting 3 features are concatenated and passed through a convolution operation C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, obtaining the 128-channel d-th level single-modality feature F'_d^n;
The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
wherein:
C(X; θ, s) and D(X; γ, s) denote a convolution and a deconvolution acting on X with parameters θ or γ and stride s;
Cat(·) denotes the cross-channel concatenation operation;
φ(·) is the ReLU activation function.
6. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein the multi-branch group fusion module in step (3) performs fusion for the different single modalities at the same feature level and comprises two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 groups, and the single-group fusion branch has only one group;
each fusion branch outputs 64-channel features, and the output features of the two branches are concatenated to obtain 128-channel multi-modality features.
7. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the multi-group fusion branch to obtain the fused multi-modality features comprises the following steps:
(31) The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {F'_{d,m}^1 | m = 1, …, M} and {F'_{d,m}^2 | m = 1, …, M}, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) The corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are then combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, each convolution operation being followed by a ReLU activation function;
(33) The outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
wherein:
T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above;
θ_{d,m} denotes the fusion parameters of the m-th group.
8. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the single-group fusion branch to obtain the fused multi-modality features comprises the following steps:
(3a) The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
wherein:
H_{2,d} is the d-th level fused feature output of the single-group fusion branch;
T(·; θ_d^s) comprises two stacked convolution operations, namely a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, each convolution operation being followed by a ReLU activation function;
θ_d^s denotes the fusion parameters of the single-group fusion branch;
(3b) The d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
Application CN201910431110.6A, priority date 2019-05-22, filing date 2019-05-22: RGB-T image saliency target detection method based on multi-level depth feature fusion. Status: Active. Granted publication: CN110210539B.

Priority application: CN201910431110.6A, priority and filing date 2019-05-22, title "RGB-T image saliency target detection method based on multi-level depth feature fusion".

Publications:
CN110210539A (application publication), published 2019-09-06
CN110210539B (granted patent), published 2022-12-30

Family ID: 67788118
Country: CN (China)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant