CN110210539B - RGB-T image saliency target detection method based on multi-level depth feature fusion - Google Patents

RGB-T image saliency target detection method based on multi-level depth feature fusion

Info

Publication number: CN110210539B
Application number: CN201910431110.6A
Authority: CN (China)
Prior art keywords: fusion, image, level, RGB, features
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110210539A
Inventors: 张强, 黄年昌, 姚琳, 刘健, 韩军功
Original assignee / current assignee: Xidian University
Application filed by Xidian University; priority to CN201910431110.6A
Publication of CN110210539A (application publication)
Publication of CN110210539B (application granted)

Classifications

    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/048: Neural networks; architecture; activation functions
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/464: Extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The invention discloses an RGB-T image salient object detection method based on multi-level depth feature fusion, which mainly solves the problem that the prior art cannot detect salient objects completely and consistently in complex and changeable scenes. The implementation scheme is as follows: 1. extract coarse multi-level features from the input images; 2. construct adjacent depth feature fusion modules to improve the single-modality features; 3. construct a multi-branch group fusion module and fuse the multi-modality features; 4. obtain the fused output feature map; 5. train the network; 6. predict a pixel-level saliency map of the RGB-T image. The method effectively fuses complementary information from images of different modalities, detects salient objects in images of complex and changeable scenes completely and consistently, and can be used in the image preprocessing stage of computer vision.

Description

RGB-T image saliency target detection method based on multi-level depth feature fusion
Technical Field
The invention belongs to the field of image processing and relates to an RGB-T image salient object detection method, in particular to an RGB-T image salient object detection method based on multi-level depth feature fusion, which can be used in the image preprocessing stage of computer vision.
Background
Salient object detection aims to detect and segment the salient object regions in an image using a model or algorithm. As an image preprocessing step, salient object detection plays an important role in visual tasks such as visual tracking, image recognition, image compression and image fusion.
Existing salient object detection methods fall into two main categories: methods based on traditional techniques and methods based on deep learning. Traditional salient object detection algorithms complete the saliency prediction with hand-crafted features such as color, texture and orientation; they rely heavily on manually selected features, adapt poorly to varied scenes, and perform poorly on complex data sets. With the wide application of deep learning, salient object detection research based on deep learning has made breakthrough progress, and the detection performance has improved markedly compared with traditional saliency algorithms.
Most salient object detection methods, such as "Q. Hou, M. M. Cheng, X. Hu, et al. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 815-828.", compute saliency values only from RGB images of a single modality; the scene information they obtain is limited, and in challenging scenes such as low light, low contrast and complex backgrounds, salient objects are difficult to detect completely and consistently.
To solve the above problems, some RGB-T image-based salient object detection methods have been proposed, such as "Li C, Wang G, Ma Y, et al. A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and a Novel Approach. arXiv preprint arXiv:1701.02829, 2017", which discloses an RGB-T image salient object detection method based on manifold ranking: it uses the complementary information of RGB and thermal infrared images to construct a cross-modality consistent manifold ranking model, and computes the saliency value of each node in combination with a two-stage graph method. Under low-illumination and low-contrast conditions it detects salient objects more accurately than detection methods that take only RGB as input.
However, this method takes region blocks as the basic detection unit, so an obvious blocking effect appears in the saliency map, the boundary between object and background is inaccurate, and the interior of the object is not uniform. In addition, the method is built on hand-crafted feature extraction; the selected features cannot fully express the intrinsic characteristics of different images, the complementary information between images of different modalities is not fully exploited, and the improvement in detection performance is limited.
Disclosure of Invention
Purpose of the invention: in view of the defects of the prior art, the invention aims to provide an RGB-T image salient object detection method based on multi-level depth feature fusion, so as to improve how completely and consistently salient objects are detected in images of complex and changeable scenes. The method mainly solves the problem that the prior art cannot detect salient objects completely and consistently in such scenes.
The key to realizing the invention is multi-level depth feature extraction and fusion for RGB-T images: saliency is predicted by fusing the multi-level single-modality features extracted from the RGB and thermal infrared images. Specifically: coarse multi-level features are extracted from different depths of the backbone network for the RGB and thermal infrared images; adjacent depth feature fusion modules are constructed to extract improved multi-level single-modality features; a multi-branch group fusion module is constructed to fuse the features of the different modalities; the fused output feature map is obtained; the network is trained to obtain the model parameters; and a pixel-level saliency map of the RGB-T image is predicted.
The technical scheme is as follows: the RGB-T image salient object detection method based on multi-level depth feature fusion comprises the following steps:
(1) Extracting coarse multi-level features from the input images:
extracting features at 5 levels of different depths in the base network of each image as coarse single-modality features;
(2) Constructing adjacent depth feature fusion modules and improving the single-modality features:
establishing a plurality of adjacent depth feature fusion modules, then processing the 5-level coarse single-modality features obtained in step (1) through the adjacent depth feature fusion modules, and fusing the features of 3 adjacent depths to obtain improved 3-level single-modality features;
(3) Constructing a multi-branch group fusion module and fusing the multi-modality features:
constructing a multi-branch group fusion module comprising two fusion branches, and fusing the different single-modality features at the same feature level among the improved 3-level single-modality features obtained in step (2) to obtain fused multi-modality features;
(4) Obtaining the fused output feature map:
fusing the different-level features of the fused multi-modality features obtained in step (3) step by step in reverse order to obtain a plurality of edge output feature maps, and fusing all the edge output feature maps to obtain the fused output feature map;
(5) Training the network:
on a training data set, completing the network training with the edge output feature maps and the fused output feature map obtained in step (4) by adopting a deeply supervised learning mechanism and minimizing a cross-entropy loss function, to obtain the network model parameters;
(6) Predicting a pixel-level saliency map of the RGB-T image:
on a test data set, using the network model parameters obtained in step (5), predicting the pixel-level saliency map of the RGB-T image by sigmoid classification computation on the edge output feature maps and the fused output feature map obtained in step (4).
Further, the image in step (1) is an RGB image or a thermal infrared image.
Further, the base network in step (1) is a VGG16 network.
Further, constructing the adjacent depth feature fusion modules in step (2) includes the following steps:
(21) Denote the 5-level coarse single-modality features obtained in step (1) by the symbols F_1^n, F_2^n, F_3^n, F_4^n, F_5^n, where n = 1 or 2 indicates an RGB image or a thermal infrared image, respectively;
(22) Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation and produces the d-th level single-modality feature, d = 1, 2, 3.
Still further, step (22) comprises:
(221) A convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively;
(222) The resulting 3 features are concatenated and passed through a convolution operation C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, yielding the 128-channel d-th level single-modality feature F'_d^n;
The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
wherein:
C(X; θ, s) and D(X; γ, s) denote a convolution and a deconvolution acting on X with parameters θ or γ and stride s;
Cat(·) denotes the cross-channel concatenation operation;
φ(·) is the ReLU activation function.
Further, the multi-branch group fusion module in step (3) performs fusion for the different single modalities at the same feature level and includes two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 groups, and the single-group fusion branch has only one group;
each fusion branch outputs 64-channel features, and the output features of the two branches are concatenated to obtain 128-channel multi-modality features.
Further, constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the multi-group fusion branch to obtain the fused multi-modality features includes the following steps:
(31) The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {F'_{d,m}^1 | m = 1, …, M} and {F'_{d,m}^2 | m = 1, …, M}, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) The corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, each convolution operation being followed by a ReLU activation function;
(33) The outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
wherein:
T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above;
θ_{d,m} denotes the fusion parameters of the m-th group.
Further, constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the single-group fusion branch to obtain the fused multi-modality features includes the following steps:
(3a) The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
wherein:
H_{2,d} is the d-th level fused feature output of the single-group fusion branch;
T(·; θ_d^s) comprises two stacked convolution operations, namely a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, each convolution operation being followed by a ReLU activation function;
θ_d^s denotes the fusion parameters of the single-group fusion branch;
(3b) The d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
has the advantages that: compared with the prior art, the RGB-T image significance target detection method based on multi-level depth feature fusion has the following beneficial effects:
1) The method can realize end-to-end pixel level detection of the RGB-T image without manual design and characteristic extraction, and simulation results show that the method has complete consistency effect when detecting the obvious target of the image in a complex and changeable scene.
2) According to the method, 5-level rough single-mode features extracted from the strut network are improved by establishing a plurality of adjacent depth feature fusion modules to obtain 3-level single-mode features, low-level details and high-level semantic information of an input image can be effectively captured, the phenomenon that overall network parameters are increased sharply due to excessive feature levels is avoided, and the difficulty in network training is reduced.
3) The invention fuses different modal characteristics by constructing the multi-branch combination fusion module comprising two fusion branches, because the single-branch combination fusion structure captures cross-channel correlation among all characteristics of different modes from RGB images and thermal infrared images, and more remarkable characteristics are extracted from a plurality of groups of fusion branches, cross-modal information from RGB and thermal infrared images can be effectively captured, more complete and consistent targets can be detected, meanwhile, the number of training parameters required by the fusion module is less, and the detection speed of the algorithm can be improved.
Drawings
FIG. 1 is a flow chart of an implementation of the RGB-T image salient object detection method based on multi-level depth feature fusion disclosed by the invention;
FIG. 2 is a visual comparison of experimental results on the RGB-thermal database between the present invention and the prior art;
FIGS. 3a and 3b are comparisons of two evaluation indexes, the P-R curve and the F-measure curve, on the RGB-thermal database between the present invention and the prior art.
Detailed description of the embodiments:
the following is a detailed description of specific embodiments of the invention.
Referring to FIG. 1, the RGB-T image salient object detection method based on multi-level depth feature fusion includes the following steps:
Step 1) Extract coarse multi-level features from the input images:
For the RGB image or the thermal infrared image, features at 5 levels of different depths in the VGG16 network are extracted as the coarse single-modality features, namely:
conv1-2 (denoted by the symbol F_1^n, containing 64 feature maps of size 256×256);
conv2-2 (denoted by the symbol F_2^n, containing 128 feature maps of size 128×128);
conv3-3 (denoted by the symbol F_3^n, containing 256 feature maps of size 64×64);
conv4-3 (denoted by the symbol F_4^n, containing 512 feature maps of size 32×32);
conv5-3 (denoted by the symbol F_5^n, containing 512 feature maps of size 16×16);
where: n = 1 or 2,
n = 1 denotes the RGB image;
n = 2 denotes the thermal infrared image;
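To make this extraction step concrete, the following is a minimal PyTorch sketch (the embodiment itself was implemented in Caffe) of pulling the five coarse feature levels out of a VGG16 backbone; the torchvision layer indices are assumptions chosen so that each stage ends at the named convolution, and the thermal infrared image is assumed to be replicated to 3 channels before being fed to its own backbone copy.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CoarseFeatureExtractor(nn.Module):
    """Extracts the five coarse single-modality feature levels from VGG16."""
    def __init__(self):
        super().__init__()
        feats = vgg16().features  # pretrained ImageNet weights would normally be loaded here
        # Slice the backbone so each stage ends after the named convolution and its ReLU.
        self.stage1 = feats[:4]    # conv1-1 .. conv1-2 ->  64 x 256 x 256
        self.stage2 = feats[4:9]   # pool1, conv2-1 .. conv2-2 -> 128 x 128 x 128
        self.stage3 = feats[9:16]  # pool2, conv3-1 .. conv3-3 -> 256 x  64 x  64
        self.stage4 = feats[16:23] # pool3, conv4-1 .. conv4-3 -> 512 x  32 x  32
        self.stage5 = feats[23:30] # pool4, conv5-1 .. conv5-3 -> 512 x  16 x  16

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return [f1, f2, f3, f4, f5]

# One extractor per modality (n = 1: RGB, n = 2: thermal infrared).
rgb_net, t_net = CoarseFeatureExtractor(), CoarseFeatureExtractor()
rgb_feats = rgb_net(torch.randn(1, 3, 256, 256))
```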
Step 2) Construct adjacent depth feature fusion modules and improve the single-modality features:
Common multi-modal vision methods directly take the five-level features as the single-modality features; with so many feature levels, such methods have a huge number of network parameters and are harder to train. The invention instead takes the 5-level features of different depths as coarse single-modality features and obtains 3 levels of improved RGB image features or thermal infrared image features by establishing several adjacent depth feature fusion modules;
Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation. Specifically, to obtain the d-th level single-modality feature, d = 1, 2, 3, a convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n first act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively, to ensure that the 3 adjacent-level features from the backbone network have the same spatial resolution and the same number of feature channels (128 channels in the present invention); the 3 features are then concatenated and passed through a convolutional layer C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, yielding the 128-channel d-th level single-modality feature F'_d^n. The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
where Cat(·) denotes the cross-channel concatenation operation and φ(·) is the ReLU activation function;
As indicated above, the d-th level RGB or thermal infrared single-modality feature F'_d^n contains feature information from 3 levels of the backbone network, i.e. from F_{d+1}^n and from the features F_d^n and F_{d+2}^n at the depths adjacent to it. This means F'_d^n contains richer detail and semantic information, which helps to identify the object accurately. In addition, compared with simply merging F_d^n, F_{d+1}^n and F_{d+2}^n, the feature F'_d^n is more compact: through the adjacent depth feature fusion, the redundant information in the coarsely extracted features is compressed in F'_d^n;
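A minimal PyTorch sketch of one adjacent depth feature fusion module follows. It assumes, consistently with the resolutions above, that the d-th module combines the coarse features of backbone levels d, d+1 and d+2, and that the ReLU is applied after the final 1×1 fusion convolution; the "stride 1/2" deconvolution is written as a transposed convolution with stride 2.

```python
import torch
import torch.nn as nn

class AdjacentDepthFusion(nn.Module):
    """Fuses coarse features from three adjacent depths into one 128-channel feature."""
    def __init__(self, ch_d, ch_d1, ch_d2, out_ch=128):
        super().__init__()
        self.down = nn.Conv2d(ch_d, out_ch, kernel_size=3, stride=2, padding=1)  # 3x3, stride 2
        self.keep = nn.Conv2d(ch_d1, out_ch, kernel_size=1, stride=1)            # 1x1, stride 1
        self.up   = nn.ConvTranspose2d(ch_d2, out_ch, kernel_size=2, stride=2)   # 2x2, "stride 1/2"
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1, stride=1)       # 1x1 fusion conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_d, f_d1, f_d2):
        branches = [self.down(f_d), self.keep(f_d1), self.up(f_d2)]  # same resolution, 128 ch each
        return self.relu(self.fuse(torch.cat(branches, dim=1)))      # Cat(...), 1x1 conv, ReLU

# Example: the d = 1 module fuses conv1-2 (64 ch), conv2-2 (128 ch) and conv3-3 (256 ch).
adf1 = AdjacentDepthFusion(64, 128, 256)
out = adf1(torch.randn(1, 64, 256, 256), torch.randn(1, 128, 128, 128), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 128, 128, 128])
```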
Step 3) Construct the multi-branch group fusion module and fuse the multi-modality features:
The multi-branch group fusion module performs fusion for the different single modalities at the same feature level and contains two fusion branches;
The first fusion branch (also called the multi-group fusion branch) has M groups (M = 8 in this embodiment); its main role is to amplify the effect of each channel while reducing the network parameters;
The second fusion branch (also called the single-group fusion branch) has only one group; its main role is to fully capture the cross-channel correlations among all input features of the different modalities. The two branches output the same number of channels (64 channels in this embodiment), so the final number of output feature channels of the multi-branch group fusion module is twice that of each fusion branch and equals the number of feature channels of the RGB or thermal infrared image features input to the module (128 channels in this embodiment);
The multi-group fusion branch is built according to the basic idea of "split-transform-merge". The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels (128/M channels per group), giving two feature sets {F'_{d,m}^1} and {F'_{d,m}^2}; the corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are then combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, where the first 1×1 convolution mainly reduces the number of feature channels, the second convolution mainly fuses the features, and each convolution operation is followed by a ReLU activation function; finally, the outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
where T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above, and θ_{d,m} denotes the fusion parameters of the m-th group;
The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
where H_{2,d} is the d-th level fused feature output of the single-group fusion branch; T(·; θ_d^s) comprises two stacked convolution operations, a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, through which the correlation information among all the input multi-modality features is fully captured, each convolution operation being followed by a ReLU activation function; and θ_d^s denotes the fusion parameters of the single-group fusion branch;
Finally, through the multi-group fusion branch and the single-group fusion branch, the d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}:
H_d = Cat(H_{1,d}, H_{2,d})
As described above, the multi-branch group fusion module captures the cross-channel correlations among all features of the different modalities from the RGB and thermal infrared images through its single-group fusion branch, and extracts more salient features through its multi-group fusion branch. Therefore, through the multi-branch group fusion module, multi-level multi-modality fusion features are extracted; compared with common fusion methods, cross-modality information from the RGB and thermal infrared images is captured more effectively and more complete and consistent objects can be detected. Owing to the idea of grouped convolution, the multi-branch group fusion module also needs fewer training parameters than a common fusion approach that directly concatenates the features and then passes them through a series of convolution and activation layers;
Step 4) Obtain the fused output feature map:
The multi-modality features of different levels are fused step by step in reverse (deep-to-shallow) order to obtain the edge output feature maps {P_d | d = 1, 2, 3}. For each level d, a deconvolution layer D(·; γ_d, (1/2)^d), with a 2^d × 2^d kernel, stride (1/2)^d and parameters γ_d, is used so that the fused features have the same spatial resolution, and two convolutions with 1×1 kernels, stride 1 and parameters θ_{d,1} and θ_{d,2} are used to fuse the features of different levels and to generate the edge output feature map of each level, respectively. Through this progressive information transfer, 3 edge output feature maps {P_d | d = 1, 2, 3} with the same size as the input single-modality images are obtained;
The multi-level feature maps are then combined by a concatenation operation and fused by a convolution operation C(·; θ_0, 1) with a 1×1 kernel, stride 1 and parameters θ_0 to generate the fused output feature map P_0, with the expression:
P_0 = C(Cat(P_1, P_2, P_3); θ_0, 1)
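One plausible PyTorch sketch of this stepwise reverse fusion is given below. The description above fixes the deconvolution kernels (2^d × 2^d with stride (1/2)^d, i.e. upsampling by 2^d) and the use of two 1×1 convolutions per level, but not their exact wiring; the deep-to-shallow information transfer shown here (concatenating each level's feature with the upsampled deeper feature before producing its edge output) is therefore an assumption, not the patented layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseFusionDecoder(nn.Module):
    """Progressive deep-to-shallow fusion producing edge outputs P_1..P_3 and the fused map P_0."""
    def __init__(self, ch=128):
        super().__init__()
        self.merge = nn.ModuleList([nn.Conv2d(2 * ch, ch, 1) for _ in range(2)])   # levels 2 and 1
        self.upsample = nn.ModuleList([
            nn.ConvTranspose2d(ch, ch, kernel_size=2 ** d, stride=2 ** d) for d in (1, 2, 3)])
        self.edge_out = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in range(3)])     # per-level P_d
        self.fuse_out = nn.Conv2d(3, 1, 1)                                         # P_0

    def forward(self, h1, h2, h3):
        # Deep-to-shallow information transfer: pass level-3 context into level 2, then level 1.
        f3 = h3
        f2 = self.merge[0](torch.cat([h2, F.interpolate(f3, scale_factor=2)], dim=1))
        f1 = self.merge[1](torch.cat([h1, F.interpolate(f2, scale_factor=2)], dim=1))
        feats = [f1, f2, f3]
        edges = [self.edge_out[d](self.upsample[d](feats[d])) for d in range(3)]   # P_1, P_2, P_3
        p0 = self.fuse_out(torch.cat(edges, dim=1))                                # P_0
        return [p0] + edges

dec = ReverseFusionDecoder()
outs = dec(torch.randn(1, 128, 128, 128), torch.randn(1, 128, 64, 64), torch.randn(1, 128, 32, 32))
print([o.shape[-1] for o in outs])  # all maps restored to the 256 x 256 input resolution
```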
Step 5) Train the network:
On the training data set, a deeply supervised learning mechanism is adopted: each edge output feature map and the fused output feature map, {P_t | t = 0, 1, 2, 3}, is compared with the ground-truth map G, giving the cross-entropy loss function L of the network model:
L = - Σ_t Σ_{(i,j)} [ β · G(i,j) · log P_t(i,j) + (1 - β) · (1 - G(i,j)) · log(1 - P_t(i,j)) ]
where G(i,j) ∈ {0,1} is the value at position (i,j) in the ground-truth map G, P_t(i,j) is the probability value at position (i,j) in the probability map obtained from the feature map P_t by the operation σ(P_t), and σ(·) is the sigmoid activation function. In different images, the areas occupied by the salient object and by the background differ; to balance the losses of foreground and background and to increase the detection accuracy for salient objects of different sizes, a class-balance parameter β is used, where β is the ratio of the number of background pixels in the ground-truth map to the number of pixels in the whole ground-truth map and can be expressed as:
β = N_b / (N_b + N_f)
where N_b denotes the number of background pixels and N_f denotes the number of foreground pixels;
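A short PyTorch sketch of this class-balanced cross-entropy loss under deep supervision follows; computing β per image and averaging the loss over pixels are assumptions not fixed by the text.

```python
import torch

def balanced_bce(logits, gt):
    """logits: raw feature map P_t (before sigmoid); gt: binary ground-truth map G."""
    prob = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    n_f = gt.sum()                                 # number of foreground pixels N_f
    n_b = gt.numel() - n_f                         # number of background pixels N_b
    beta = n_b / (n_b + n_f)                       # class-balance parameter
    loss = -(beta * gt * torch.log(prob) + (1 - beta) * (1 - gt) * torch.log(1 - prob))
    return loss.mean()

def deep_supervision_loss(outputs, gt):
    # outputs = [P_0, P_1, P_2, P_3]; every side output is supervised by the same ground truth G.
    return sum(balanced_bce(p, gt) for p in outputs)
```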
The invention trains the network in 3 steps: first, the branch network for the RGB image is trained by minimizing the cross-entropy loss function; in this constructed branch network, the multi-branch group fusion module is removed, and the multi-level visible-light image features output by the adjacent depth feature fusion modules are fed directly into the reverse transfer process to predict saliency; second, the thermal infrared branch is constructed and trained in the same way as the RGB branch network of the first step; third, based on the VGG16 backbone network parameters and the adjacent depth feature fusion module parameters obtained from the RGB and thermal infrared single-branch networks in the previous two steps, the whole RGB-T image detection network is trained to obtain the network model parameters;
When training the thermal infrared single-modality branch network parameters, a data set for thermal infrared single-modality salient object detection is lacking; to make training feasible, the R channel of the RGB image is used as a substitute for thermal infrared single-modality data, because among the three channels of an RGB image the R channel is closest to the thermal infrared image. The specific training data sets are constructed as follows:
the RGB branch is trained on the RGB images of the RGB-thermal data set (one image of every two) mixed with images of the MSRA-B training data set (one image of every three); correspondingly, the thermal infrared branch is trained on the thermal infrared images of the RGB-thermal data set (one image of every two) mixed, in the same proportion, with the R channels of the images of the MSRA-B training data set (one image of every three); the RGB-T image multi-modality network model is trained with the paired images of the RGB-thermal data set (one pair of every two);
During training, to avoid the over-fitting caused by too little training data, each image is rotated by 90°, 180° and 270° and flipped horizontally and vertically, enlarging the original data set to 8 times its size;
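As a small illustration, the augmentation can be sketched as below; exactly how the rotations and flips are combined to reach the stated eightfold enlargement is an assumption, and in practice the same transform must be applied jointly to the RGB image, the thermal infrared image and the ground-truth map.

```python
from PIL import Image

def augment(img: Image.Image):
    """Return 8 variants of one training image: 4 rotations, each with and without a flip."""
    rotations = [img, img.rotate(90, expand=True), img.rotate(180), img.rotate(270, expand=True)]
    variants = []
    for r in rotations:
        variants.append(r)
        variants.append(r.transpose(Image.FLIP_LEFT_RIGHT))   # horizontal flip of each rotation
    return variants  # includes the original image
```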
Step 6) Predict the pixel-level saliency map of the RGB-T image:
The other half of the data in the RGB-thermal data set, not used for training, is taken as the test data. Using the network model parameters obtained in step (5), the edge output feature maps and the fused output feature map obtained in step (4) are further classified; denoting the output saliency maps of the network by {S_t | t = 0, 1, 2, 3}, S_t can be expressed as follows:
S_t = σ(P_t)
where σ(·) is the sigmoid activation function;
finally, S_0 is taken as the final predicted RGB-T saliency map.
The technical effects of the invention are further explained below in combination with simulation experiments:
1. Simulation conditions: all simulation experiments are implemented with the Caffe deep learning framework under Ubuntu 16.04.5, using Matlab R2014b software as the interface;
2. simulation content and result analysis:
emulation 1
The invention is compared with existing RGB image-based salient object detection methods and RGB-T image-based salient object detection algorithms in salient object detection experiments on the public image database RGB-thermal, and part of the experimental results are compared visually, as shown in FIG. 2, where the RGB images are the RGB images from the database used as experimental input, the T images are the thermal infrared images paired with those RGB images, and GT denotes the manually annotated ground-truth maps;
As can be seen from FIG. 2, compared with the prior art, the method suppresses the background better, detects salient objects in complex scenes more completely and consistently, and is closer to the manually annotated ground-truth maps.
Simulation 2
The invention is compared with existing single-modality image-based salient object detection methods and RGB-T image-based salient object detection algorithms; the results of salient object detection experiments on the public image database RGB-thermal are evaluated objectively with widely accepted evaluation indexes, and the evaluation results are shown in FIGS. 3a and 3b, where:
FIG. 3a shows the evaluation results of the present invention and the prior art using the precision-recall (P-R) curve;
FIG. 3b shows the evaluation results of the present invention and the prior art using the F-measure curve;
As can be seen from FIGS. 3a and 3b, compared with the prior art, the method achieves higher P-R and F-measure curves, which shows that it detects salient objects more consistently and completely and fully demonstrates its effectiveness and superiority.
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (8)

1. An RGB-T image salient object detection method based on multi-level depth feature fusion, characterized by comprising the following steps:
(1) Extracting coarse multi-level features from the input images:
extracting features at 5 levels of different depths in the base network of each image as coarse single-modality features;
(2) Constructing adjacent depth feature fusion modules and improving the single-modality features:
establishing a plurality of adjacent depth feature fusion modules, processing the 5-level coarse single-modality features obtained in step (1) through the adjacent depth feature fusion modules, and fusing the features of 3 adjacent depths to obtain improved 3-level single-modality features;
(3) Constructing a multi-branch group fusion module and fusing the multi-modality features:
constructing a multi-branch group fusion module comprising two fusion branches, and fusing the different single-modality features at the same feature level among the improved 3-level single-modality features obtained in step (2) to obtain fused multi-modality features;
(4) Obtaining the fused output feature map:
fusing the different-level features of the fused multi-modality features obtained in step (3) step by step in reverse order to obtain a plurality of edge output feature maps, and fusing all the edge output feature maps to obtain the fused output feature map;
(5) Training the network:
on a training data set, completing the network training with the edge output feature maps and the fused output feature map obtained in step (4) by adopting a deeply supervised learning mechanism and minimizing a cross-entropy loss function, to obtain the network model parameters;
(6) Predicting a pixel-level saliency map of the RGB-T image:
on a test data set, using the network model parameters obtained in step (5), predicting the pixel-level saliency map of the RGB-T image by sigmoid classification computation on the edge output feature maps and the fused output feature map obtained in step (4).
2. The method for detecting the salient object of the RGB-T image based on multi-level depth feature fusion as claimed in claim 1, wherein the image in step (1) is an RGB image or a thermal infrared image.
3. The RGB-T image saliency target detection method of multi-level depth feature fusion of claim 1 characterized in that the base network in step (1) is VGG16 network.
4. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 2, wherein constructing the adjacent depth feature fusion modules in step (2) comprises the following steps:
(21) Denote the 5-level coarse single-modality features obtained in step (1) by the symbols F_1^n, F_2^n, F_3^n, F_4^n, F_5^n, where n = 1 or 2 indicates an RGB image or a thermal infrared image, respectively;
(22) Each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation and produces the d-th level single-modality feature, d = 1, 2, 3.
5. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 4, wherein step (22) comprises:
(221) A convolution operation C(·; θ_{d,1}^n, 2) with a 3×3 kernel, stride 2 and parameters θ_{d,1}^n, a convolution operation C(·; θ_{d,2}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,2}^n, and a deconvolution operation D(·; γ_d^n, 1/2) with a 2×2 kernel, stride 1/2 and parameters γ_d^n act on the coarse features F_d^n, F_{d+1}^n and F_{d+2}^n from three adjacent depths, respectively;
(222) The resulting 3 features are concatenated and passed through a convolution operation C(·; θ_{d,3}^n, 1) with a 1×1 kernel, stride 1 and parameters θ_{d,3}^n, obtaining the 128-channel d-th level single-modality feature F'_d^n;
The adjacent depth fusion module can be represented as follows:
F'_d^n = φ( C( Cat( C(F_d^n; θ_{d,1}^n, 2), C(F_{d+1}^n; θ_{d,2}^n, 1), D(F_{d+2}^n; γ_d^n, 1/2) ); θ_{d,3}^n, 1 ) )
wherein:
C(X; θ, s) and D(X; γ, s) denote a convolution and a deconvolution acting on X with parameters θ or γ and stride s;
Cat(·) denotes the cross-channel concatenation operation;
φ(·) is the ReLU activation function.
6. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein the multi-branch group fusion module in step (3) performs fusion for the different single modalities at the same feature level and comprises two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 groups, and the single-group fusion branch has only one group;
each fusion branch outputs 64-channel features, and the output features of the two branches are concatenated to obtain 128-channel multi-modality features.
7. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the multi-group fusion branch to obtain the fused multi-modality features comprises the following steps:
(31) The input single-modality features F'_d^1 and F'_d^2 are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {F'_{d,m}^1 | m = 1, …, M} and {F'_{d,m}^2 | m = 1, …, M}, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) The corresponding RGB and thermal infrared features of the m-th group in the two feature sets of the same level are then combined by a concatenation operation, and the cross-modality features within the group are fused by a 1×1 convolution with 64/M channels followed by a 3×3 convolution with 64/M channels, each convolution operation being followed by a ReLU activation function;
(33) The outputs of the M groups are concatenated to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat( T(Cat(F'_{d,1}^1, F'_{d,1}^2); θ_{d,1}), …, T(Cat(F'_{d,M}^1, F'_{d,M}^2); θ_{d,M}) )
wherein:
T(·; θ_{d,m}) denotes the stacked convolution operation with ReLU activation functions described above;
θ_{d,m} denotes the fusion parameters of the m-th group.
8. The RGB-T image salient object detection method based on multi-level depth feature fusion as claimed in claim 1, wherein constructing the multi-branch group fusion module in step (3) and fusing the different single-modality features at the same feature level in the single-group fusion branch to obtain the fused multi-modality features comprises the following steps:
(3a) The single-group fusion branch can be regarded as the special case M = 1 of the multi-group fusion branch, with the expression:
H_{2,d} = T(Cat(F'_d^1, F'_d^2); θ_d^s)
wherein:
H_{2,d} is the d-th level fused feature output of the single-group fusion branch;
T(·; θ_d^s) comprises two stacked convolution operations, namely a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, each convolution operation being followed by a ReLU activation function;
θ_d^s denotes the fusion parameters of the single-group fusion branch;
(3b) The d-th level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
Application CN201910431110.6A, priority date 2019-05-22, filing date 2019-05-22: RGB-T image saliency target detection method based on multi-level depth feature fusion. Status: Active. Granted publication: CN110210539B.

Priority application: CN201910431110.6A, priority and filing date 2019-05-22, title "RGB-T image saliency target detection method based on multi-level depth feature fusion".

Publications:
CN110210539A (application publication), published 2019-09-06
CN110210539B (granted patent), published 2022-12-30

Family ID: 67788118
Country: CN (China)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant