CN110210539A - RGB-T image salient object detection method based on multi-level depth feature fusion - Google Patents
RGB-T image salient object detection method based on multi-level depth feature fusion
- Publication number
- CN110210539A (Application number CN201910431110.6A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- level
- rgb
- feature
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/464—Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses an RGB-T image salient object detection method based on multi-level depth feature fusion, which mainly solves the problem that the prior art cannot completely and consistently detect salient objects in complex and changeable scenes. The implementation is as follows: 1. extract rough multi-level features from the input images; 2. construct adjacent depth feature fusion modules to improve the single-modal features; 3. construct a multi-branch group fusion module to fuse the multi-modal features; 4. obtain the fused output feature map; 5. train the algorithm network; 6. predict the pixel-level saliency map of the RGB-T image. The invention can effectively fuse complementary information from images of different modalities, can completely and consistently detect salient objects in images of complex and changeable scenes, and can be used in image preprocessing for computer vision.
Description
Technical Field
The invention belongs to the field of image processing and relates to an RGB-T image salient object detection method, in particular to an RGB-T image salient object detection method based on multi-level depth feature fusion, which can be used in image preprocessing for computer vision.
Background
Salient object detection aims to detect and segment salient object regions in images by means of a model or an algorithm. As an image preprocessing step, salient object detection plays an important role in visual tasks such as visual tracking, image recognition, image compression and image fusion.
Existing salient object detection methods can be divided into two main categories: traditional salient object detection methods and deep-learning-based salient object detection methods. Traditional salient object detection algorithms complete saliency prediction with manually extracted features such as color, texture and orientation; these methods depend excessively on hand-selected features, adapt poorly to different scenes, and perform badly on complex data sets. With the wide application of deep learning, salient object detection research based on deep learning has made breakthrough progress, and its detection performance is markedly better than that of traditional saliency algorithms.
Most existing salient object detection methods, such as the one disclosed in "Q. Hou, M. M. Cheng, X. Hu, et al. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 815-828", take only RGB images as input; in challenging conditions such as low illumination or low contrast, the visible-light modality alone provides insufficient information, and the salient object cannot be detected completely and consistently.
In order to solve the above problems, some salient object detection methods based on RGB-T images have been proposed, such as "Li C, Wang G, Ma Y, et al. A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach. arXiv preprint arXiv:1701.02829, 2017", which discloses a method for detecting salient objects in RGB-T images based on a manifold ranking model. The method uses the complementary information of the RGB and thermal infrared images to construct a manifold ranking model with cross-modal consistency, and combines a two-stage graph method to calculate the saliency value of each node. Under low-illumination and low-contrast conditions, it detects the salient object more accurately than saliency detection methods that take only RGB as input.
However, this method performs detection with region blocks as the basic unit, so obvious block artifacts appear in the saliency map, the boundary between object and background is inaccurate, and the interior of the object is not uniform. In addition, the method is built on manual feature extraction; the selected features cannot fully express the intrinsic characteristics of different images, the complementary information between images of different modalities is not fully exploited, and the improvement of the detection effect is limited.
Disclosure of Invention
The purpose of the invention is as follows: in view of the above deficiencies of the prior art, the present invention provides an RGB-T image salient object detection method based on multi-level depth feature fusion, so as to detect salient objects in images of complex and changeable scenes more completely and consistently. The method mainly solves the problem that the prior art cannot completely and consistently detect salient objects in complex and changeable scenes.
The key to realizing the method lies in multi-level depth feature extraction and fusion for RGB-T images: saliency is predicted through the fusion of the multi-level single-modal features extracted from the RGB and thermal infrared images. Specifically: rough multi-level features are extracted from different depths of the backbone network for the RGB and thermal infrared images; adjacent depth feature fusion modules are constructed to extract improved multi-level single-modal features; a multi-branch group fusion module is constructed to fuse the features of different modalities; a fused output feature map is obtained; the network is trained to obtain model parameters; and the pixel-level saliency map of the RGB-T image is predicted.
The technical scheme is as follows: the method for detecting the RGB-T image saliency target by multi-level depth feature fusion comprises the following steps:
(1) extracting rough multilevel features from the input image:
for each input image, extracting features at 5 different depths of the base network as rough single-modal features;
(2) constructing an adjacent depth feature fusion module, and improving the single-mode features:
establishing a plurality of adjacent depth feature fusion modules, then processing the 5-level rough single-mode features obtained in the step (1) through the adjacent depth feature fusion modules, and fusing the 3-level features from the adjacent depths to obtain improved 3-level single-mode features;
(3) constructing a multi-branch combination fusion module, fusing multi-modal characteristics:
constructing a multi-branch combination fusion module comprising two fusion branches, and fusing different single-mode features located under the same feature level in the improved 3-level single-mode features obtained in the step (2) to obtain fused multi-mode features;
(4) obtaining a fusion output characteristic diagram:
carrying out stepwise reverse fusion on the different-level features of the fused multi-modal features obtained in step (3) to obtain a plurality of side output feature maps, and fusing all the side output feature maps to obtain a fused output feature map;
(5) training the algorithm network:
on a training data set, adopting a deeply supervised learning mechanism and minimizing a cross entropy loss function over the side output feature maps and the fused output feature map obtained in step (4) to complete the training of the algorithm network and obtain the network model parameters;
(6) predicting a pixel-level saliency map of an RGB-T image:
on a test data set, using the network model parameters obtained in step (5), performing sigmoid classification calculation on the side output feature maps and the fused output feature map obtained in step (4) to predict the pixel-level saliency map of the RGB-T image.
Further, the image in step (1) is an RGB image or a thermal infrared image.
Further, the base network in step (1) is a VGG16 network.
Further, the constructing of the neighboring depth feature fusion module in step (2) includes the following steps:
(21) denoting the 5 levels of rough single-modal features obtained in step (1) by the symbols F_n^l, l = 1, ..., 5, where n = 1 or n = 2 indicates that the features come from the RGB image or the thermal infrared image, respectively;
(22) each neighboring depth fusion module contains 3 convolution operations and 1 deconvolution operation to obtain the d-th-order single-mode feature, d being 1,2, 3.
Still further, step (22) comprises:
(221) a convolution with 3 × 3 kernel, stride 2 and parameter W_{n,1}^d, a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,2}^d, and a deconvolution with 2 × 2 kernel, stride 1/2 and parameter W_{n,3}^d are applied to F_n^d, F_n^{d+1} and F_n^{d+2}, respectively, so that the 3 adjacent-level features have the same spatial resolution and the same number of channels;
(222) the 3 levels of features are concatenated and passed through a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,4}^d to obtain the d-th-level single-modal feature X_n^d with 128 channels; the adjacent depth fusion module can be represented as follows:
X_n^d = φ(C(Cat(C(F_n^d; W_{n,1}^d, 2), C(F_n^{d+1}; W_{n,2}^d, 1), D(F_n^{d+2}; W_{n,3}^d, 1/2)); W_{n,4}^d, 1))
wherein:
Cat(·) represents a cross-channel concatenation operation;
C(·; ·, s) and D(·; ·, s) denote convolution and deconvolution with stride s;
φ(·) is the ReLU activation function.
Further, the multi-branch group fusion module in step (3) performs fusion of the different single-modal features at the same feature level and comprises two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 subgroups, and the single-group fusion branch has only one group;
each fusion branch outputs a 64-channel feature, and the output features of the two fusion branches are concatenated to obtain a 128-channel multi-modal feature.
Further, the constructing a multi-branch combination fusion module in step (3) and fusing different single modes at the same feature level in the multiple groups of fusion branches to obtain a fused multi-mode feature includes the following steps:
(31) the input single-modal features X_1^d and X_2^d are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {X_{1,m}^d} and {X_{2,m}^d}, m = 1, ..., M, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) the corresponding RGB and thermal infrared features of the m-th subgroup in the two feature sets of the same level are then combined by a concatenation operation, and the fusion of the cross-modal features inside the subgroup is realized by a 1 × 1 convolution with 64/M channels followed by a 3 × 3 convolution with 64/M channels, where each convolution operation is followed by a ReLU activation function;
(33) the M subgroup outputs are concatenated together to obtain the output feature H_{1,d} of the multi-group fusion branch, expressed as follows:
H_{1,d} = Cat(T(Cat(X_{1,1}^d, X_{2,1}^d); ω_1^d), ..., T(Cat(X_{1,M}^d, X_{2,M}^d); ω_M^d))
wherein:
T(·; ω_m^d) represents the stacked convolution operations with ReLU activation described above;
ω_m^d represents the fusion parameters of the m-th subgroup.
Further, the constructing a multi-branch combination fusion module in step (3) and fusing different single modes at the same feature level in the single-group fusion branch to obtain a fused multi-mode feature includes the following steps:
(3a) the single-group fusion branch can be regarded as the special case of the multi-group fusion branch with M = 1, and its expression is:
H_{2,d} = T(Cat(X_1^d, X_2^d); ω_0^d)
wherein:
H_{2,d} is the d-th-level fused feature output of the single-group fusion branch;
T(·; ω_0^d) comprises two stacked convolutions, a 1 × 1 convolution with 64 channels and a 3 × 3 convolution with 64 channels, each followed by a ReLU activation function;
ω_0^d represents the fusion parameters of the single-group fusion branch;
(3b) the d-th-level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
has the advantages that: compared with the prior art, the RGB-T image saliency target detection method based on multi-level depth feature fusion has the following beneficial effects:
1) the method can realize end-to-end pixel level detection of the RGB-T image without manual design and characteristic extraction, and simulation results show that the method has complete consistency effect when detecting the obvious target of the image in a complex and changeable scene.
2) According to the method, 5-level rough single-mode features extracted from the strut network are improved by establishing a plurality of adjacent depth feature fusion modules to obtain 3-level single-mode features, low-level details and high-level semantic information of an input image can be effectively captured, the phenomenon that overall network parameters are increased sharply due to excessive feature levels is avoided, and the difficulty in network training is reduced.
3) The invention fuses different modal characteristics by constructing the multi-branch combination fusion module comprising two fusion branches, because the single-branch combination fusion structure captures cross-channel correlation among all characteristics of different modes from RGB images and thermal infrared images, and more remarkable characteristics are extracted from a plurality of groups of fusion branches, cross-modal information from RGB and thermal infrared images can be effectively captured, more complete and consistent targets can be detected, meanwhile, the number of training parameters required by the fusion module is less, and the detection speed of the algorithm can be improved.
Drawings
FIG. 1 is a flow chart of an implementation of a method for detecting a saliency target of an RGB-T image with multi-level depth feature fusion disclosed by the present invention;
FIG. 2 is a simulation comparison diagram of experimental results of RGB-thermal database according to the present invention and the prior art;
FIGS. 3a and 3b are comparison graphs of two evaluation indexes of P-R curve and F-measure curve in RGB-thermal database according to the present invention and the prior art.
The specific implementation mode is as follows:
the following describes in detail specific embodiments of the present invention.
Referring to fig. 1, the method for detecting the salient object of the RGB-T image with the multi-level depth feature fusion includes the following steps:
step 1) extracting rough multilevel features from an input image:
for the RGB image or the thermal infrared image, features at 5 different depths of the VGG16 network are extracted as rough single-modal features, namely:
conv1-2 (denoted by the symbol F_n^1, containing 64 feature maps of size 256 × 256);
conv2-2 (denoted by the symbol F_n^2, containing 128 feature maps of size 128 × 128);
conv3-3 (denoted by the symbol F_n^3, containing 256 feature maps of size 64 × 64);
conv4-3 (denoted by the symbol F_n^4, containing 512 feature maps of size 32 × 32);
conv5-3 (denoted by the symbol F_n^5, containing 512 feature maps of size 16 × 16);
wherein n = 1 represents an RGB image and n = 2 represents a thermal infrared image;
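For illustration, a minimal sketch in PyTorch-style Python of extracting these five rough feature levels from a VGG16 backbone is given below (the framework choice is an assumption, as the simulations of the invention use Caffe; the torchvision layer indices and the class name CoarseBackbone are hypothetical and only serve to make the correspondence with conv1-2 to conv5-3 concrete):

```python
import torch
import torchvision

# Sketch: extract the five rough single-modal feature levels
# (conv1-2, conv2-2, conv3-3, conv4-3, conv5-3) from a VGG16 backbone.
# Layer indices follow torchvision's VGG16 "features" module.
class CoarseBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.stage1 = vgg[:4]     # conv1-1..conv1-2 -> 64 x 256 x 256
        self.stage2 = vgg[4:9]    # pool1, conv2-1..conv2-2 -> 128 x 128 x 128
        self.stage3 = vgg[9:16]   # pool2, conv3-1..conv3-3 -> 256 x 64 x 64
        self.stage4 = vgg[16:23]  # pool3, conv4-1..conv4-3 -> 512 x 32 x 32
        self.stage5 = vgg[23:30]  # pool4, conv5-1..conv5-3 -> 512 x 16 x 16

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return f1, f2, f3, f4, f5

# One backbone per modality (RGB and thermal infrared), fed 256 x 256 inputs.
rgb_net, t_net = CoarseBackbone(), CoarseBackbone()
feats_rgb = rgb_net(torch.randn(1, 3, 256, 256))
feats_t = t_net(torch.randn(1, 3, 256, 256))
```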
step 2), constructing an adjacent depth feature fusion module, and improving the single-mode features:
Common multi-modal vision methods directly take all five levels of features as the single-modal features; because of the excessive number of feature levels, this leads to a huge number of network parameters and increases the difficulty of network training. In the invention, the 5 levels of features at different depths are taken only as rough single-modal features, and 3 levels of improved RGB image features or thermal infrared image features are obtained by establishing a plurality of adjacent depth feature fusion modules;
each adjacent depth fusion module contains 3 convolution operations and 1 deconvolution operation. Specifically, to obtain the d-th-level single-modal feature, d = 1, 2, 3, a convolution with 3 × 3 kernel, stride 2 and parameter W_{n,1}^d is applied to F_n^d, a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,2}^d is applied to F_n^{d+1}, and a deconvolution with 2 × 2 kernel, stride 1/2 and parameter W_{n,3}^d is applied to F_n^{d+2}, to ensure that the 3 adjacent-level features from the backbone network have the same spatial resolution and the same number of feature channels (128 channels in the present invention); the 3 levels of features are then concatenated and passed through a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,4}^d to obtain the d-th-level single-modal feature X_n^d with 128 channels. The adjacent depth fusion module can be represented as follows:
X_n^d = φ(C(Cat(C(F_n^d; W_{n,1}^d, 2), C(F_n^{d+1}; W_{n,2}^d, 1), D(F_n^{d+2}; W_{n,3}^d, 1/2)); W_{n,4}^d, 1))
wherein Cat(·) represents the cross-channel concatenation operation, C(·; ·, s) and D(·; ·, s) denote convolution and deconvolution with stride s, and φ(·) is the ReLU activation function;
as indicated above, the d-th-level RGB or thermal infrared single-modal feature X_n^d contains feature information from 3 adjacent levels of the backbone network, namely F_n^{d+1} together with its depth neighbours F_n^d and F_n^{d+2}. This shows that X_n^d contains richer detail and semantic information, which is helpful for identifying the target accurately; in addition, compared with a simple concatenation of F_n^d, F_n^{d+1} and F_n^{d+2}, X_n^d is more compact, so the adjacent depth feature fusion also compresses the redundant information contained in the coarsely extracted features;
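For illustration, a minimal sketch of one adjacent depth feature fusion module follows (a hypothetical PyTorch re-implementation consistent with the description above; padding choices and the class name are assumptions):

```python
import torch
import torch.nn as nn

# Sketch of one adjacent depth feature fusion module: the three adjacent
# backbone levels F^d, F^{d+1}, F^{d+2} are brought to the same resolution
# and 128 channels, concatenated, and fused by a 1x1 convolution with ReLU.
class AdjacentDepthFusion(nn.Module):
    def __init__(self, ch_d, ch_d1, ch_d2):
        super().__init__()
        self.down = nn.Conv2d(ch_d, 128, kernel_size=3, stride=2, padding=1)  # 3x3, stride 2
        self.keep = nn.Conv2d(ch_d1, 128, kernel_size=1, stride=1)            # 1x1, stride 1
        self.up = nn.ConvTranspose2d(ch_d2, 128, kernel_size=2, stride=2)     # 2x2, "stride 1/2"
        self.fuse = nn.Conv2d(3 * 128, 128, kernel_size=1, stride=1)          # cross-channel fusion
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_d, f_d1, f_d2):
        x = torch.cat([self.down(f_d), self.keep(f_d1), self.up(f_d2)], dim=1)
        return self.relu(self.fuse(x))

# Example: the d = 1 module fuses conv1-2 (64 ch), conv2-2 (128 ch) and
# conv3-3 (256 ch) into a 128-channel feature at 128 x 128 resolution.
adf1 = AdjacentDepthFusion(64, 128, 256)
x1 = adf1(torch.randn(1, 64, 256, 256),
          torch.randn(1, 128, 128, 128),
          torch.randn(1, 256, 64, 64))
print(x1.shape)  # torch.Size([1, 128, 128, 128])
```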
step 3), constructing a multi-branch combination fusion module, and fusing multi-modal characteristics:
the multi-branch group fusion module performs fusion of the different single-modal features at the same feature level and comprises two fusion branches;
the first merging branch (also called a multi-group merging branch) has M (in this embodiment, M is 8) groups, and mainly amplifies the effect of each channel to reduce network parameters;
the second fused branch (also called single-group fused branch) has only one group and mainly functions to fully capture cross-channel correlation among all input features of different modalities; the two branches output the same number of channels (64 channels in this embodiment), so the final number of output characteristic channels of the multi-branch combination fusion module is twice that of each fusion branch, and is equal to the number of RGB or thermal infrared image characteristic channels (128 channels in this embodiment) input to the multi-branch combination fusion module;
multiple groups of fusion branches are established according to the basic idea of 'split-transform-merge': the input single-modal features X_1^d and X_2^d are each split along the channel dimension into M groups with the same number of channels (128/M channels per group), giving two feature sets {X_{1,m}^d} and {X_{2,m}^d}; the corresponding RGB and thermal infrared features of the m-th subgroup at the same level are then combined by a concatenation operation, and the fusion of the cross-modal features inside the subgroup is realized by a 1 × 1 convolution with 64/M channels followed by a 3 × 3 convolution with 64/M channels, where the first 1 × 1 convolution mainly reduces the number of feature channels, the second convolution mainly fuses the features, and each convolution operation is followed by a ReLU activation function; finally, the M subgroup outputs are concatenated together to obtain the output feature H_{1,d} of the multi-group fusion branch, with the expression:
H_{1,d} = Cat(T(Cat(X_{1,1}^d, X_{2,1}^d); ω_1^d), ..., T(Cat(X_{1,M}^d, X_{2,M}^d); ω_M^d))
wherein T(·; ω_m^d) represents the stacked convolution operations with ReLU activation described above, and ω_m^d represents the fusion parameters of the m-th subgroup;
the single-group fusion branch can be regarded as the special case of the multi-group fusion branch with M = 1, and its expression is:
H_{2,d} = T(Cat(X_1^d, X_2^d); ω_0^d)
wherein H_{2,d} is the d-th-level fused feature output of the single-group fusion branch; T(·; ω_0^d) comprises two stacked convolutions, a 1 × 1 convolution with 64 channels and a 3 × 3 convolution with 64 channels, through which the correlation information between all the input multi-modal features is fully captured, and each convolution operation is followed by a ReLU activation function; ω_0^d represents the fusion parameters of the single-group fusion branch;
finally, through the multi-group fusion branch and the single-group fusion branch, the d-th-level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d})
As described above, the multi-branch group fusion module captures the cross-channel correlations between all features from the RGB and thermal infrared modalities through the single-group fusion branch, and extracts more salient features through the multi-group fusion branch. Therefore, through the multi-branch group fusion module, multi-level fusion features based on multiple modalities are extracted; compared with common fusion methods, cross-modal information from the RGB and thermal infrared images is captured more effectively, and more complete and consistent targets can be detected. Moreover, thanks to the idea of grouped convolution, the multi-branch group fusion module needs fewer training parameters than the common fusion method of direct concatenation followed by a series of convolution and activation layers;
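For illustration, a minimal sketch of the multi-branch group fusion module follows (a hypothetical PyTorch re-implementation under the stated channel configuration of 128 input channels, 64 output channels per branch and M = 8; class names are assumptions):

```python
import torch
import torch.nn as nn

# Sketch of one fusion branch built on the split-transform-merge idea:
# the two single-modal features are split into M channel groups, each RGB/T
# group pair is concatenated and fused by a 1x1 + 3x3 convolution stack with
# ReLU, and the M subgroup outputs are concatenated again.
class GroupFusionBranch(nn.Module):
    def __init__(self, in_ch=128, out_ch=64, m=8):
        super().__init__()
        self.m = m
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2 * in_ch // m, out_ch // m, 1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch // m, out_ch // m, 3, padding=1), nn.ReLU(inplace=True),
            )
            for _ in range(m)
        )

    def forward(self, x_rgb, x_t):
        rgb_groups = torch.chunk(x_rgb, self.m, dim=1)   # split along channels
        t_groups = torch.chunk(x_t, self.m, dim=1)
        outs = [blk(torch.cat([r, t], dim=1))
                for blk, r, t in zip(self.blocks, rgb_groups, t_groups)]
        return torch.cat(outs, dim=1)                     # merge subgroup outputs

# The module concatenates a multi-group branch (M = 8) and a single-group
# branch (M = 1) into a 128-channel fused multi-modal feature H_d.
class MultiBranchGroupFusion(nn.Module):
    def __init__(self, in_ch=128, out_ch=64, m=8):
        super().__init__()
        self.multi = GroupFusionBranch(in_ch, out_ch, m)
        self.single = GroupFusionBranch(in_ch, out_ch, 1)

    def forward(self, x_rgb, x_t):
        return torch.cat([self.multi(x_rgb, x_t), self.single(x_rgb, x_t)], dim=1)

fused = MultiBranchGroupFusion()(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
print(fused.shape)  # torch.Size([1, 128, 64, 64])
```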
step 4), obtaining the fused output feature map:
the fused multi-modal features of different levels are fused stepwise in the reverse direction to obtain the side output feature maps {P_d | d = 1, 2, 3}; in this reverse fusion, D(·; γ_d, (1/2)^d) is a deconvolution layer with kernel size 2^d × 2^d, stride (1/2)^d and parameter γ_d, which brings the fused features to the same spatial resolution, and two convolutions with 1 × 1 kernel and stride 1 are used, respectively, to fuse the features of different levels and to generate the side output feature map of each level. Through this progressive information transfer, 3 side output feature maps {P_d | d = 1, 2, 3} with the same size as the input single-modal image are obtained;
the multi-level side output feature maps are then combined by a concatenation operation and fused by a convolution C(·; θ_0, 1) with 1 × 1 kernel, stride 1 and parameter θ_0 to generate the fused output feature map P_0, with the expression:
P_0 = C(Cat(P_1, P_2, P_3); θ_0, 1)
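For illustration, one plausible arrangement of the stepwise reverse fusion is sketched below (a hypothetical PyTorch re-implementation; the exact order of the two 1 × 1 convolutions and the use of bilinear upsampling between levels are assumptions, since the description does not fully specify them):

```python
import torch
import torch.nn as nn

# Sketch of the reverse fusion head: deeper fused features are passed upward,
# each level is combined with the deeper feature by a 1x1 convolution,
# upsampled to the input resolution by a 2^d x 2^d deconvolution, and turned
# into a single-channel side output map P_d; P_0 fuses the three side outputs.
class ReverseFusionHead(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.fuse = nn.ModuleList([nn.Conv2d(2 * ch, ch, 1), nn.Conv2d(2 * ch, ch, 1), nn.Identity()])
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(ch, ch, kernel_size=2 ** (d + 1), stride=2 ** (d + 1)) for d in range(3)
        )
        self.side = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(3))  # side outputs P_1..P_3
        self.final = nn.Conv2d(3, 1, 1)                                   # fused output map P_0

    def forward(self, h):                       # h = [H_1, H_2, H_3], shallow to deep
        feats, prev = [None] * 3, None
        for d in (2, 1, 0):                     # deep -> shallow
            x = h[d] if prev is None else self.fuse[d](torch.cat([h[d], prev], dim=1))
            prev = nn.functional.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            feats[d] = x
        sides = [self.side[d](self.up[d](feats[d])) for d in range(3)]
        return self.final(torch.cat(sides, dim=1)), sides  # logits for P_0 and P_1..P_3
```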
step 5) training an algorithm network:
on the training data set, a deeply supervised learning mechanism is adopted: the side output feature maps and the fused output feature map {P_t | t = 0, 1, 2, 3} are compared with the ground-truth map G to obtain the cross entropy loss function L of the network model:
L = -Σ_{t=0}^{3} Σ_{(i,j)} [ β·G(i,j)·log(σ(P_t(i,j))) + (1-β)·(1-G(i,j))·log(1-σ(P_t(i,j))) ]
where G(i,j) ∈ {0,1} is the value at position (i,j) of the ground-truth map G, and σ(P_t(i,j)) is the value of the feature map P_t at position (i,j) after the sigmoid function σ(·). In order to balance the loss of foreground and background and increase the detection accuracy of the algorithm for salient objects of different sizes, the class balance parameter β is the ratio of the number of background pixels in the ground-truth map to the total number of pixels in the ground-truth map, which can be expressed as:
β = N_b / (N_b + N_f)
wherein N_b represents the number of background pixels and N_f represents the number of foreground pixels;
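For illustration, a minimal sketch of this class-balanced cross entropy with deep supervision follows (assumed PyTorch; the weighting shown is the standard class-balanced form consistent with the parameter β described above):

```python
import torch
import torch.nn.functional as F

# Class-balanced cross entropy: foreground pixels are weighted by the
# background ratio beta, background pixels by (1 - beta).
def balanced_bce(logits, gt):
    n_b = (gt == 0).float().sum()
    n_f = (gt == 1).float().sum()
    beta = n_b / (n_b + n_f)                          # background pixel ratio
    weight = torch.where(gt > 0.5, beta, 1.0 - beta)
    return F.binary_cross_entropy_with_logits(logits, gt, weight=weight)

# Deep supervision: every side output map and the fused map P_0 are compared
# with the same ground-truth saliency map G (values in {0, 1}).
def deep_supervision_loss(p0, sides, gt):
    return sum(balanced_bce(p, gt) for p in [p0, *sides])
```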
The invention trains the network in 3 steps: first, the branch network of the RGB image is trained by minimizing the cross entropy loss function; in this constructed branch network the multi-branch group fusion module is removed, and the multi-level visible-light image features output by the adjacent depth feature fusion modules are fed directly into the reverse-fusion process to predict saliency; second, the thermal infrared branch is constructed and trained in the same way as the RGB branch network in the first step; third, based on the VGG16 backbone network parameters and the adjacent depth feature fusion module parameters obtained from the RGB and thermal infrared single-branch networks in the previous two steps, the whole network for RGB-T image detection is trained to obtain the network model parameters;
when training the thermal infrared single-modal branch network parameters, no data set for thermal infrared single-modal salient object detection is available; to make the training feasible, the R channel of the RGB image is used to replace the thermal infrared single-modal data, because among the three channels of an RGB image the R channel is the closest to a thermal infrared image. The specific training data sets are constructed as follows:
the RGB branch network model is trained with a 1:2 data ratio formed from the RGB images of the RGB-thermal data set (every second image) and the images of the MSRA-B training data set (every third image); correspondingly, the thermal infrared branch network model is trained with a 1:2 data ratio formed from the thermal infrared images of the RGB-thermal data set (every second image) and the R channels of the images of the MSRA-B training data set (every third image); for the RGB-T multi-modal network model, the paired images of the RGB-thermal data set (every second pair) are used for training;
in the training process, in order to avoid the overfitting caused by too little training data, each image is rotated by 90 degrees, 180 degrees and 270 degrees and flipped horizontally and vertically, enlarging the original amount of training data to 8 times;
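For illustration, a minimal sketch of this offline enlargement follows (assumed NumPy helper; the description states an 8-fold enlargement, and applying the 8 symmetries of the square, i.e. the rotations of each image and of its mirror image, is one assumed way of obtaining exactly 8 variants):

```python
import numpy as np

# Rotations by 0/90/180/270 degrees of the image and of its horizontal mirror;
# the same transform must be applied to the paired ground-truth map.
def augment(img: np.ndarray):
    variants = [np.rot90(img, k) for k in range(4)]
    variants += [np.rot90(np.fliplr(img), k) for k in range(4)]
    return variants
```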
step 6) predicting a pixel-level saliency map of an RGB-T image:
the other half of the RGB-thermal data set, excluding the data used for training, is taken as test data. Using the network model parameters obtained in step (5), a further classification calculation is performed on the side output feature maps and the fused output feature map obtained in step (4); {S_t | t = 0, 1, 2, 3} denotes the output saliency maps of the network, where S_t can be expressed as follows:
S_t = σ(P_t)
wherein σ(·) is the sigmoid activation function;
finally, S_0 is taken as the final predicted RGB-T saliency map.
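For illustration, a minimal end-to-end inference sketch assembling the hypothetical classes from the earlier sketches is given below (assumed PyTorch; it only shows how the pieces fit together, not the 3-step training procedure):

```python
import torch

# End-to-end forward pass: two VGG16 backbones, adjacent depth fusion per
# modality, multi-branch group fusion per level, reverse fusion head, sigmoid.
class RGBTSaliencyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_backbone, self.t_backbone = CoarseBackbone(), CoarseBackbone()
        chans = [(64, 128, 256), (128, 256, 512), (256, 512, 512)]  # F^d, F^{d+1}, F^{d+2} channels
        self.adf_rgb = torch.nn.ModuleList(AdjacentDepthFusion(*c) for c in chans)
        self.adf_t = torch.nn.ModuleList(AdjacentDepthFusion(*c) for c in chans)
        self.mgf = torch.nn.ModuleList(MultiBranchGroupFusion() for _ in range(3))
        self.head = ReverseFusionHead()

    def forward(self, rgb, t):
        fr, ft = self.rgb_backbone(rgb), self.t_backbone(t)
        xr = [m(fr[d], fr[d + 1], fr[d + 2]) for d, m in enumerate(self.adf_rgb)]
        xt = [m(ft[d], ft[d + 1], ft[d + 2]) for d, m in enumerate(self.adf_t)]
        h = [m(a, b) for m, a, b in zip(self.mgf, xr, xt)]
        p0, sides = self.head(h)
        return torch.sigmoid(p0)   # S_0: final predicted RGB-T saliency map
```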
The technical effects of the invention are further explained by combining simulation experiments as follows:
1. simulation conditions are as follows: all simulation experiments are realized by adopting a caffe deep learning framework under the Ubuntu 16.04.5 environment and taking Matlab R2014b software as an interface;
2. simulation content and result analysis:
simulation 1
Compared with existing RGB-image-based salient object detection methods and RGB-T-image-based salient object detection algorithms, the invention carries out salient object detection experiments on the public image database RGB-thermal, and part of the experimental results are compared visually, as shown in FIG. 2, where the RGB image represents an RGB image of the database used as experimental input, the T image represents the thermal infrared image of the database paired with it and used as experimental input, and GT represents the manually annotated ground-truth map;
as can be seen from fig. 2, compared with the prior art, the method has a better background suppression effect, has a better complete consistency effect in the detection of the significant target in the complex scene, and is closer to the truth diagram of the artificial calibration.
Simulation 2
Compared with existing single-modal-image-based salient object detection methods and RGB-T-image-based salient object detection algorithms, the results obtained by the invention in salient object detection experiments on the public image database RGB-thermal are objectively evaluated with recognized evaluation indexes; the evaluation results are shown in FIG. 3a and FIG. 3b, wherein:
FIG. 3a is a graph of the results of an evaluation of the present invention and the prior art using an accuracy-recall (P-R) curve;
FIG. 3b is a graph showing the results of the evaluation using the F-Measure curve according to the present invention and the prior art;
as can be seen from FIGS. 3a and 3b, compared with the prior art, the method has higher PR curve and F-measure curve, thereby showing that the method has better consistency and integrity for the detection of the significant target, and fully showing the effectiveness and superiority of the method.
The embodiments of the present invention have been described in detail. However, the present invention is not limited to the above-described embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (8)
1. The method for detecting the RGB-T image saliency target by multi-level depth feature fusion is characterized by comprising the following steps of:
(1) extracting rough multilevel features from the input image:
for each input image, extracting features at 5 different depths of the base network as rough single-modal features;
(2) constructing an adjacent depth feature fusion module, and improving the single-mode features:
establishing a plurality of adjacent depth feature fusion modules, then processing the 5-level rough single-mode features obtained in the step (1) through the adjacent depth feature fusion modules, and fusing the 3-level features from the adjacent depths to obtain improved 3-level single-mode features;
(3) constructing a multi-branch combination fusion module, fusing multi-modal characteristics:
constructing a multi-branch combination fusion module comprising two fusion branches, and fusing different single-mode features located under the same feature level in the improved 3-level single-mode features obtained in the step (2) to obtain fused multi-mode features;
(4) obtaining a fusion output characteristic diagram:
carrying out stepwise reverse fusion on the different-level features of the fused multi-modal features obtained in step (3) to obtain a plurality of side output feature maps, and fusing all the side output feature maps to obtain a fused output feature map;
(5) training the algorithm network:
on a training data set, adopting a deeply supervised learning mechanism and minimizing a cross entropy loss function over the side output feature maps and the fused output feature map obtained in step (4) to complete the training of the algorithm network and obtain the network model parameters;
(6) predicting a pixel-level saliency map of an RGB-T image:
on a test data set, using the network model parameters obtained in step (5), performing sigmoid classification calculation on the side output feature maps and the fused output feature map obtained in step (4) to predict the pixel-level saliency map of the RGB-T image.
2. The RGB-T image salient object detection method with multi-level depth feature fusion according to claim 1, characterized in that the image in step (1) is an RGB image or a thermal infrared image.
3. The RGB-T image saliency target detection method of multi-level depth feature fusion of claim 1 characterized in that the base network in step (1) is VGG16 network.
4. The method for detecting the RGB-T image saliency target of claim 2 characterized by the step of multi-level depth feature fusion, wherein the step of constructing the neighboring depth feature fusion module in step (2) comprises the steps of:
(21) denoting the 5 levels of rough single-modal features obtained in step (1) by the symbols F_n^l, l = 1, ..., 5, where n = 1 or n = 2 indicates that the features come from the RGB image or the thermal infrared image, respectively;
(22) each neighboring depth fusion module contains 3 convolution operations and 1 deconvolution operation to obtain the d-th-order single-mode feature, d being 1,2, 3.
5. The method for detecting the RGB-T image significance target of multi-level depth feature fusion as claimed in claim 4, wherein the step (22) comprises:
(221) a convolution with 3 × 3 kernel, stride 2 and parameter W_{n,1}^d, a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,2}^d, and a deconvolution with 2 × 2 kernel, stride 1/2 and parameter W_{n,3}^d are applied to F_n^d, F_n^{d+1} and F_n^{d+2}, respectively, so that the 3 adjacent-level features have the same spatial resolution and the same number of channels;
(222) the 3 levels of features are concatenated and passed through a convolution with 1 × 1 kernel, stride 1 and parameter W_{n,4}^d to obtain the d-th-level single-modal feature X_n^d with 128 channels; the adjacent depth fusion module can be represented as follows:
X_n^d = φ(C(Cat(C(F_n^d; W_{n,1}^d, 2), C(F_n^{d+1}; W_{n,2}^d, 1), D(F_n^{d+2}; W_{n,3}^d, 1/2)); W_{n,4}^d, 1))
wherein:
Cat(·) represents a cross-channel concatenation operation;
C(·; ·, s) and D(·; ·, s) denote convolution and deconvolution with stride s;
φ(·) is the ReLU activation function.
6. The RGB-T image salient object detection method with multi-level depth feature fusion according to claim 1, characterized in that the multi-branch group fusion module in step (3) performs fusion of the different single-modal features at the same feature level and comprises two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 subgroups, and the single-group fusion branch has only one group;
each fusion branch outputs a 64-channel feature, and the output features of the two fusion branches are concatenated to obtain a 128-channel multi-modal feature.
7. The method for detecting the RGB-T image saliency target by multi-level depth feature fusion as claimed in claim 1, wherein the step (3) of constructing a multi-branch combination fusion module, and fusing different single modes at the same feature level in multiple sets of fusion branches to obtain a fused multi-mode feature includes the following steps:
(31) the input single-modal features X_1^d and X_2^d are each split along the channel dimension into M groups with the same number of channels, giving two feature sets {X_{1,m}^d} and {X_{2,m}^d}, m = 1, ..., M, wherein:
M is a positive integer in the range 2 ≤ M ≤ 128;
(32) the corresponding RGB and thermal infrared features of the m-th subgroup in the two feature sets of the same level are then combined by a concatenation operation, and the fusion of the cross-modal features inside the subgroup is realized by a 1 × 1 convolution with 64/M channels followed by a 3 × 3 convolution with 64/M channels, where each convolution operation is followed by a ReLU activation function;
(33) the M subgroup outputs are concatenated together to obtain the output feature H_{1,d} of the multi-group fusion branch, expressed as follows:
H_{1,d} = Cat(T(Cat(X_{1,1}^d, X_{2,1}^d); ω_1^d), ..., T(Cat(X_{1,M}^d, X_{2,M}^d); ω_M^d))
wherein:
T(·; ω_m^d) represents the stacked convolution operations with ReLU activation described above;
ω_m^d represents the fusion parameters of the m-th subgroup.
8. The method for detecting the salient object of the RGB-T image with multi-level depth feature fusion as claimed in claim 1, wherein the step (3) of constructing a multi-branch combination fusion module to fuse different single modes at the same feature level in a single-group fusion branch to obtain a fused multi-mode feature comprises the following steps:
(3a) the single-group fusion branch can be regarded as the special case of the multi-group fusion branch with M = 1, and its expression is:
H_{2,d} = T(Cat(X_1^d, X_2^d); ω_0^d)
wherein:
H_{2,d} is the d-th-level fused feature output of the single-group fusion branch;
T(·; ω_0^d) comprises two stacked convolutions, a 1 × 1 convolution with 64 channels and a 3 × 3 convolution with 64 channels, each followed by a ReLU activation function;
ω_0^d represents the fusion parameters of the single-group fusion branch;
(3b) the d-th-level multi-branch group fusion feature H_d is obtained by simply concatenating H_{1,d} and H_{2,d}, with the expression:
H_d = Cat(H_{1,d}, H_{2,d}).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910431110.6A CN110210539B (en) | 2019-05-22 | 2019-05-22 | RGB-T image saliency target detection method based on multi-level depth feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910431110.6A CN110210539B (en) | 2019-05-22 | 2019-05-22 | RGB-T image saliency target detection method based on multi-level depth feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110210539A true CN110210539A (en) | 2019-09-06 |
CN110210539B CN110210539B (en) | 2022-12-30 |
Family
ID=67788118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910431110.6A Active CN110210539B (en) | 2019-05-22 | 2019-05-22 | RGB-T image saliency target detection method based on multi-level depth feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110210539B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070076917A1 (en) * | 2003-03-21 | 2007-04-05 | Lockheed Martin Corporation | Target detection improvements using temporal integrations and spatial fusion |
US20180032840A1 (en) * | 2016-07-27 | 2018-02-01 | Beijing Kuangshi Technology Co., Ltd. | Method and apparatus for neural network training and construction and method and apparatus for object detection |
CN109598268A (en) * | 2018-11-23 | 2019-04-09 | 安徽大学 | A kind of RGB-D well-marked target detection method based on single flow depth degree network |
CN109784183A (en) * | 2018-12-17 | 2019-05-21 | 西北工业大学 | Saliency object detection method based on concatenated convolutional network and light stream |
CN109712105A (en) * | 2018-12-24 | 2019-05-03 | 浙江大学 | A kind of image well-marked target detection method of combination colour and depth information |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111047571A (en) * | 2019-12-10 | 2020-04-21 | 安徽大学 | Image salient target detection method with self-adaptive selection training process |
CN111047571B (en) * | 2019-12-10 | 2023-04-25 | 安徽大学 | Image salient target detection method with self-adaptive selection training process |
CN110889416A (en) * | 2019-12-13 | 2020-03-17 | 南开大学 | Salient object detection method based on cascade improved network |
CN110889416B (en) * | 2019-12-13 | 2023-04-18 | 南开大学 | Salient object detection method based on cascade improved network |
CN111242138B (en) * | 2020-01-11 | 2022-04-01 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion |
CN111242138A (en) * | 2020-01-11 | 2020-06-05 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion |
CN111428602A (en) * | 2020-03-18 | 2020-07-17 | 浙江科技学院 | Convolutional neural network edge-assisted enhanced binocular saliency image detection method |
CN111583173A (en) * | 2020-03-20 | 2020-08-25 | 北京交通大学 | RGB-D image saliency target detection method |
CN111583173B (en) * | 2020-03-20 | 2023-12-01 | 北京交通大学 | RGB-D image saliency target detection method |
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN111582316B (en) * | 2020-04-10 | 2022-06-28 | 天津大学 | RGB-D significance target detection method |
CN111666977A (en) * | 2020-05-09 | 2020-09-15 | 西安电子科技大学 | Shadow detection method of monochrome image |
CN111666977B (en) * | 2020-05-09 | 2023-02-28 | 西安电子科技大学 | Shadow detection method of monochrome image |
CN111814895A (en) * | 2020-07-17 | 2020-10-23 | 大连理工大学人工智能大连研究院 | Significance target detection method based on absolute and relative depth induction network |
CN111986240A (en) * | 2020-09-01 | 2020-11-24 | 交通运输部水运科学研究所 | Drowning person detection method and system based on visible light and thermal imaging data fusion |
CN112348870A (en) * | 2020-11-06 | 2021-02-09 | 大连理工大学 | Significance target detection method based on residual error fusion |
CN113205481A (en) * | 2021-03-19 | 2021-08-03 | 浙江科技学院 | Salient object detection method based on stepped progressive neural network |
CN113221659B (en) * | 2021-04-13 | 2022-12-23 | 天津大学 | Double-light vehicle detection method and device based on uncertain sensing network |
CN113159068A (en) * | 2021-04-13 | 2021-07-23 | 天津大学 | RGB-D significance target detection method based on deep learning |
CN113159068B (en) * | 2021-04-13 | 2022-08-30 | 天津大学 | RGB-D significance target detection method based on deep learning |
CN113221659A (en) * | 2021-04-13 | 2021-08-06 | 天津大学 | Double-light vehicle detection method and device based on uncertain sensing network |
CN113486899A (en) * | 2021-05-26 | 2021-10-08 | 南开大学 | Saliency target detection method based on complementary branch network |
CN113298094B (en) * | 2021-06-10 | 2022-11-04 | 安徽大学 | RGB-T significance target detection method based on modal association and double-perception decoder |
CN113298094A (en) * | 2021-06-10 | 2021-08-24 | 安徽大学 | RGB-T significance target detection method based on modal association and double-perception decoder |
CN113822855A (en) * | 2021-08-11 | 2021-12-21 | 安徽大学 | RGB-T image salient object detection method combining independent decoding and joint decoding |
CN114092774B (en) * | 2021-11-22 | 2023-08-15 | 沈阳工业大学 | RGB-T image significance detection system and detection method based on information flow fusion |
CN114663371A (en) * | 2022-03-11 | 2022-06-24 | 安徽大学 | Image salient target detection method based on modal unique and common feature extraction |
Also Published As
Publication number | Publication date |
---|---|
CN110210539B (en) | 2022-12-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110210539B (en) | RGB-T image saliency target detection method based on multi-level depth feature fusion | |
CN112884064B (en) | Target detection and identification method based on neural network | |
CN111582316B (en) | RGB-D significance target detection method | |
CN113052210A (en) | Fast low-illumination target detection method based on convolutional neural network | |
CN110909594A (en) | Video significance detection method based on depth fusion | |
CN113688894B (en) | Fine granularity image classification method integrating multiple granularity features | |
Hu et al. | Learning hybrid convolutional features for edge detection | |
Huang et al. | DeepDiff: Learning deep difference features on human body parts for person re-identification | |
CN114743027B (en) | Weak supervision learning-guided cooperative significance detection method | |
CN111401380A (en) | RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization | |
CN113628297A (en) | COVID-19 deep learning diagnosis system based on attention mechanism and transfer learning | |
CN112651459A (en) | Defense method, device, equipment and storage medium for confrontation sample of deep learning image | |
CN114219824A (en) | Visible light-infrared target tracking method and system based on deep network | |
Tao et al. | An adaptive frame selection network with enhanced dilated convolution for video smoke recognition | |
CN110517270A (en) | A kind of indoor scene semantic segmentation method based on super-pixel depth network | |
CN114449362B (en) | Video cover selection method, device, equipment and storage medium | |
Wu et al. | Multiple attention encoded cascade R-CNN for scene text detection | |
Liu et al. | Student behavior recognition from heterogeneous view perception in class based on 3-D multiscale residual dense network for the analysis of case teaching | |
CN111507416A (en) | Smoking behavior real-time detection method based on deep learning | |
Chen et al. | Intra-and inter-reasoning graph convolutional network for saliency prediction on 360° images | |
Ren et al. | Co-saliency detection using collaborative feature extraction and high-to-low feature integration | |
Hou et al. | An object detection algorithm based on infrared-visible dual modal feature fusion | |
CN113536977A (en) | Saliency target detection method facing 360-degree panoramic image | |
Hesham et al. | Image colorization using Scaled-YOLOv4 detector | |
CN113066074A (en) | Visual saliency prediction method based on binocular parallax offset fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |