CN115131640A - Target detection method and system utilizing illumination guide and attention mechanism - Google Patents
- Publication number: CN115131640A
- Application number: CN202210734314.9A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06N 3/084 — Learning methods: backpropagation, e.g. using gradient descent
- G06V 10/60 — Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
- G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V 10/766 — Image or video recognition or understanding using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
- G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection
Abstract
The invention provides a visible-light/infrared image target detection method and system using illumination guidance and an attention mechanism. The method and system extract image features from visible light and infrared images with deep convolutional backbone networks, and introduce an inter-modality differential interaction attention module and an intra-modality attention module to enhance inter-modality and intra-modality features respectively: the inter-modality differential interaction attention module amplifies the differences between modalities to strengthen the network's extraction of complementary modal features, while the intra-modality attention module predicts a target mask for each modality and uses it as attention to enhance intra-modality features. In addition, an illumination-guided perception network module is introduced: it uses illumination information to adaptively assign weight values to the different modalities, and these modality weights are introduced into the mask prediction loss function to adjust the contribution of the two modalities to the loss, so that the network focuses more on hard-to-learn samples and achieves high-precision target detection.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a target detection method and system utilizing illumination guidance and an attention mechanism.
Background
Target detection is the basis for the automatic analysis and understanding of complex scenes, and plays an important role in fields such as intelligent security, human-computer interaction, and smart cities. However, at night, under insufficient illumination, or in severe weather, the imaging quality of visible light images degrades greatly, making it difficult to meet the requirements of high-precision target detection. Infrared images are formed from the thermal radiation of the target and background, are largely unaffected by harsh conditions such as rain, snow, wind, and frost, have strong anti-interference capability, can reveal camouflaged targets, and complement visible light images well.
Therefore, how to effectively utilize the characteristics of the visible light and the infrared image, develop complementary information and realize high-precision target detection has important theoretical research significance and practical application value.
However, because the external environment is hard to predict, it is difficult for a target detection network to know in advance how much each modality's data will contribute. For example: the target of interest may not appear in one modality while appearing clearly in the other; it may appear partially in both modalities, so that the information from both must be used complementarily to reach a final decision; or the modality information may combine in still more complex ways. In such complex cases, the network cannot be preset in advance with how much attention to give each modality or which features to emphasize.
An efficient and adaptive modality-information fusion framework is therefore needed. However, most existing visible light and infrared image fusion algorithms do not explicitly partition the features; feature selection is left entirely to the detection network, so visible light and infrared image features are under-exploited and detection performance suffers.
To solve these problems, the invention provides a visible-light/infrared image target detection method utilizing illumination guidance and an attention mechanism, which explicitly models the different feature types, makes full use of the information in visible light and infrared images, and achieves high-precision fused visible-light/infrared target detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target detection method and system utilizing illumination guidance and an attention mechanism. The method and system make full use of the information in visible light and infrared images, divide image features into intra-modality features and inter-modality difference features, and explicitly model each feature type. An illumination-guided perception network module is also introduced, which uses illumination information to adaptively assign weights to the different modalities, improving the precision of fused visible-light/infrared target detection.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a target detection method utilizing illumination guidance and attention mechanism, which comprises the following steps:
Step 1: respectively inputting the visible light image and infrared image pair into two paths of structurally identical deep convolutional backbone networks to extract image features, wherein the two networks do not share parameters;
Step 2: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with the difference parts enhanced;
Step 3: respectively inputting the visible light image features and infrared image features extracted in step 1 into an intra-modality attention module, predicting a target mask, and using the mask as attention to enhance the intra-modality features, obtaining visible light image features and infrared image features with enhanced intra-modality features;
Step 4: downsampling the visible light image and inputting it into an illumination perception network to predict the weights of the two modal features;
Step 5: fusing the enhanced visible light image features and infrared image features obtained in steps 2 and 3 using the weights obtained in step 4 to obtain fused features, sending the fused features into a detection network to obtain the position information of the target of interest in the input image, and completing the training of the target detection model;
Step 6: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, step 3 further comprises training the intra-modality attention module to correctly predict the target mask, with the loss function

$L_{mask}=\frac{1}{TS}\sum_{i=1}^{T}\sum_{j=1}^{S}\left[W_R\,\mathcal{D}(M_{Rij},Y_{ij})+W_T\,\mathcal{D}(M_{Tij},Y_{ij})\right],\qquad \mathcal{D}(M,Y)=1-\frac{2\,|M\cap Y|+s}{|M|+|Y|+s}$

wherein T is the total number of samples, S is the number of feature pyramid stages, $Y_{ij}$ is the mask label corresponding to the i-th input sample at the j-th stage of the feature pyramid, $M_{Rij}$ and $M_{Tij}$ respectively represent the target masks predicted by the visible light branch and the infrared branch for the i-th input sample at the j-th stage of the feature pyramid, $W_R$ and $W_T$ are the weights of the visible light modality and the infrared modality, $\mathcal{D}(\cdot,\cdot)$ is the dice loss, and s represents a smoothing coefficient.
Preferably, step 4 further comprises training the illumination perception network, so that it can calculate the probability that the scene in the image is daytime or nighttime from the features of the input visible light image, with the training loss function

$L_{illu}=-\frac{1}{T}\sum_{i=1}^{T}\left[y_i\log p_i+(1-y_i)\log(1-p_i)\right]$

wherein T is the total number of samples, $y_i$ is the class label of the i-th input image, and $p_i$ is the predicted probability that the i-th input image is daytime.
Preferably, step 5 comprises recombining, weighting and cascading the enhanced visible light features $F_R^{inter},F_R^{intra}$ and infrared features $F_T^{inter},F_T^{intra}$ obtained in steps 2 and 3 with the modality weights $W_R,W_T$ obtained in step 4 to obtain the fused feature

$F_F=\mathrm{CONCAT}\left(W_R\otimes(F_R^{inter}\oplus F_R^{intra}),\;W_T\otimes(F_T^{inter}\oplus F_T^{intra})\right)$

which is then sent into the detection network $D(\cdot)$ to obtain the class confidence $p_i$ and predicted regression value $l_i$ of the i-th anchor box as $(p_i,l_i)=D(F_F)$, wherein $\mathrm{CONCAT}(\cdot)$ represents a channel cascade and $\oplus,\otimes$ respectively represent element-wise summation and multiplication.
Preferably, step 2 specifically comprises inputting the extracted visible light image feature $F_R$ and infrared image feature $F_T$ into the inter-modality differential interaction attention module $M_{inter}(\cdot)$, which amplifies the difference between $F_R$ and $F_T$ to enhance the network model's extraction of complementary features, obtaining the difference-enhanced visible light image feature $F_R^{inter}$ and infrared image feature $F_T^{inter}$, namely

$F_D=F_R\ominus F_T,\qquad F_R^{inter}=F_R\oplus F_T\otimes\sigma(\mathrm{GAP}(R(F_D))),\qquad F_T^{inter}=F_T\oplus F_R\otimes\sigma(\mathrm{GAP}(R(F_D)))$

wherein $F_D$ is the differential feature, $R(\cdot)$ represents the residual module, σ(·) represents the tanh activation function, GAP(·) represents global average pooling, and $\ominus,\oplus,\otimes$ respectively represent element-wise subtraction, summation and multiplication.
The present invention also provides a target detection system using light guidance and attention mechanism, comprising:
an extraction module: respectively inputting the visible light image and the infrared image into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
inter-modality differential interaction attention module: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
intra-modality attention module: respectively inputting the extracted visible light image features and infrared image features into the intra-modality attention module, predicting a target mask, and using the mask as attention to enhance the intra-modality features, obtaining visible light image features and infrared image features with enhanced intra-modality features;
the illumination perception module: downsampling the visible light image, and inputting the downsampled visible light image into an illumination perception network to predict the weight of the two modal characteristics;
a fusion module: the method comprises the steps that the weight obtained by an illumination sensing module is utilized, enhanced visible light image features and infrared image features obtained by an inter-modality difference interaction attention module and an intra-modality attention module are recombined, weighted and cascaded to obtain fusion features, the fusion features are sent to a detection network, the category confidence coefficient and the position information of an interested target in an input image are obtained, and training of a target detection model is completed;
a target detection module: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
Preferably, the intra-modality attention module further comprises training of the intra-modality attention module to correctly predict the target mask, with the loss function

$L_{mask}=\frac{1}{TS}\sum_{i=1}^{T}\sum_{j=1}^{S}\left[W_R\,\mathcal{D}(M_{Rij},Y_{ij})+W_T\,\mathcal{D}(M_{Tij},Y_{ij})\right],\qquad \mathcal{D}(M,Y)=1-\frac{2\,|M\cap Y|+s}{|M|+|Y|+s}$

wherein T is the total number of samples, S is the number of feature pyramid stages, $Y_{ij}$ is the mask label corresponding to the i-th input sample at the j-th stage of the feature pyramid, $M_{Rij}$ and $M_{Tij}$ respectively represent the target masks predicted by the visible light branch and the infrared branch for the i-th input sample at the j-th stage of the feature pyramid, $W_R$ and $W_T$ are the weights of the visible light modality and the infrared modality, $\mathcal{D}(\cdot,\cdot)$ is the dice loss, and s represents a smoothing coefficient.
Preferably, the illumination perception module further comprises training of the illumination perception network, so that the probability that the scene in the image is daytime or nighttime is calculated from the features of the input visible light image, with the training loss function

$L_{illu}=-\frac{1}{T}\sum_{i=1}^{T}\left[y_i\log p_i+(1-y_i)\log(1-p_i)\right]$

wherein T is the total number of samples, $y_i$ is the class label of the i-th input image, and $p_i$ is the predicted probability that the i-th input image is daytime.
Preferably, the fusion module comprises recombining, weighting and cascading the enhanced visible light features $F_R^{inter},F_R^{intra}$ and infrared features $F_T^{inter},F_T^{intra}$ obtained by the inter-modality differential interaction attention module and the intra-modality attention module with the obtained modality weights $W_R,W_T$ to obtain the fused feature

$F_F=\mathrm{CONCAT}\left(W_R\otimes(F_R^{inter}\oplus F_R^{intra}),\;W_T\otimes(F_T^{inter}\oplus F_T^{intra})\right)$

which is then sent into the detection network $D(\cdot)$ to obtain the class confidence $p_i$ and predicted regression value $l_i$ of the i-th anchor box as $(p_i,l_i)=D(F_F)$, wherein $\mathrm{CONCAT}(\cdot)$ represents a channel cascade and $\oplus,\otimes$ respectively represent element-wise summation and multiplication.
Preferably, the inter-modality differential interaction attention module comprises inputting the extracted visible light image feature $F_R$ and infrared image feature $F_T$ into the inter-modality differential interaction attention module $M_{inter}(\cdot)$, which amplifies the difference between $F_R$ and $F_T$ to enhance the network model's extraction of complementary features, obtaining the difference-enhanced visible light image feature $F_R^{inter}$ and infrared image feature $F_T^{inter}$, namely

$F_D=F_R\ominus F_T,\qquad F_R^{inter}=F_R\oplus F_T\otimes\sigma(\mathrm{GAP}(R(F_D))),\qquad F_T^{inter}=F_T\oplus F_R\otimes\sigma(\mathrm{GAP}(R(F_D)))$

wherein $F_D$ is the differential feature, $R(\cdot)$ represents the residual module, σ(·) represents the tanh activation function, GAP(·) represents global average pooling, and $\ominus,\oplus,\otimes$ respectively represent element-wise subtraction, summation and multiplication.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a target detection method and a target detection system utilizing an illumination guide and attention mechanism.
Meanwhile, inter-modality interaction attention and intra-modality attention are introduced to enhance inter-modality and intra-modality features respectively: the inter-modality interaction attention module amplifies the differences between modalities to strengthen the network's extraction of complementary modal features, and the intra-modality attention module predicts a target mask for each modality and uses it as attention to enhance the intra-modality features.
The modality weights are introduced into the mask prediction loss function to adjust the contribution of the two modalities to the loss, making the network focus more on hard-to-learn samples and achieving high-precision target detection.
Drawings
FIG. 1 is a schematic diagram of a visible light image and an infrared image;
FIG. 2 is a diagram of a deep convolutional neural network model used in the target detection method provided by the present invention;
FIG. 3 is a schematic diagram of an inter-modality interaction attention module in accordance with the present invention;
FIG. 4 is a schematic view of an intra-modal attention module of the present invention;
FIG. 5 is a schematic view of a target mask label in the present invention;
FIG. 6 is a schematic diagram of an illumination-aware network module for illumination guidance according to the present invention;
FIG. 7 is a visualization of a feature map for each stage of the network in the present invention;
FIG. 8 is a graph showing the results of the test in the present invention.
Detailed Description
The invention will be further described with reference to examples of embodiments shown in the drawings.
Example one
The invention provides a visible light infrared image target detection method utilizing illumination guidance and an attention mechanism. In order to make the objects, technical solutions and effects of the present invention clearer and clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Step 1: and respectively inputting the visible light image and the infrared image pair into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters.
The visible light image and the infrared image are shot simultaneously in the same scene. Referring to fig. 1, fig. 1 shows visible light and infrared image pairs: the first and third columns are visible light images, and the second and fourth columns are the infrared images corresponding to them. The visible light image may be a color image.
Referring to fig. 2, fig. 2 shows a deep convolutional neural network model used in the target detection method provided in the present invention.
Preferably, with Faster R-CNN as the basic framework, the visible light image $I_R$ and infrared image $I_T$ are respectively input into two structurally identical deep convolutional backbone networks $G_R(\cdot)$ and $G_T(\cdot)$ to extract image features; the two networks do not share parameters, yielding the visible light image feature $F_R=G_R(I_R)$ and the infrared image feature $F_T=G_T(I_T)$.
Where subscript R denotes for the visible light image modality and subscript T denotes for the infrared image modality.
Preferably, in order to enhance the generalization capability of the network, a data augmentation strategy of image scaling and random horizontal flipping can be adopted during training. For example, the visible light image $I_R$ and infrared image $I_T$ are scaled to 640 × 512 pixels.
Preferably, the deep convolutional backbone network used as the feature extractor is ResNet-101 (a 101-layer residual neural network) with an FPN (feature pyramid network); the parameters are initialized from Faster R-CNN weights pre-trained on the COCO dataset, using only the weights of the feature extraction part.
Step 2: and inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts.
Specifically, referring to fig. 3, the extracted visible light image feature $F_R$ and infrared image feature $F_T$ are input into the inter-modality differential interaction attention module $M_{inter}(\cdot)$ shown in fig. 3, which amplifies the difference between $F_R$ and $F_T$ to enhance the network model's extraction of complementary features, obtaining the difference-enhanced visible light image feature $F_R^{inter}$ and infrared image feature $F_T^{inter}$, namely

$F_D=F_R\ominus F_T,\qquad F_R^{inter}=F_R\oplus F_T\otimes\sigma(\mathrm{GAP}(R(F_D))),\qquad F_T^{inter}=F_T\oplus F_R\otimes\sigma(\mathrm{GAP}(R(F_D)))$

wherein $F_D$ is the differential feature, $R(\cdot)$ represents the residual module, σ(·) represents the tanh activation function, GAP(·) represents global average pooling, and $\ominus,\oplus,\otimes$ respectively represent element-wise subtraction, summation and multiplication.
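To make the data flow of the differential interaction concrete, here is an illustrative pure-Python sketch (not code from the patent; the residual module is omitted and the exact cross-enhancement wiring is an assumption). Features are toy `[channels][positions]` lists; a channel attention vector is computed from the difference via global average pooling and tanh, then each modality is enhanced with the attention-weighted other modality:

```python
import math

def gap(feat):
    # global average pooling: one scalar per channel
    return [sum(ch) / len(ch) for ch in feat]

def inter_modal_attention(f_r, f_t):
    """Toy differential interaction attention on [channels][positions] lists."""
    # differential feature F_D = F_R - F_T (element-wise subtraction)
    f_d = [[a - b for a, b in zip(cr, ct)] for cr, ct in zip(f_r, f_t)]
    # channel attention from the difference: tanh(GAP(F_D))
    # (the patent additionally passes F_D through a residual module, omitted here)
    w = [math.tanh(v) for v in gap(f_d)]
    # cross-enhance each modality with the attention-weighted other modality
    f_r_out = [[r + t * wc for r, t in zip(cr, ct)] for cr, ct, wc in zip(f_r, f_t, w)]
    f_t_out = [[t + r * wc for r, t in zip(cr, ct)] for cr, ct, wc in zip(f_r, f_t, w)]
    return f_r_out, f_t_out
```

Because the attention weight is driven by the modality difference, regions where the two modalities agree receive little extra emphasis, while complementary regions are amplified.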
Step 3: respectively inputting the visible light image features and infrared image features extracted in step 1 into the intra-modality attention module, predicting a target mask, and using the mask as attention to enhance the intra-modality features, obtaining visible light image features and infrared image features with enhanced intra-modality features.
Specifically, referring to fig. 4, the visible light image feature $F_R$ and infrared image feature $F_T$ extracted in step 1 are input into the intra-modality attention module $M_{intra}(\cdot)$ shown in fig. 4, which predicts the target masks of the two modalities, $M_R=\delta(f(F_R))$ and $M_T=\delta(f(F_T))$, and obtains the two intra-modality-enhanced modal features: the visible light image feature $F_R^{intra}=F_R\oplus F_R\otimes M_R$ and the infrared image feature $F_T^{intra}=F_T\oplus F_T\otimes M_T$, where δ(·) represents the sigmoid activation function, f(·) represents a 1×1 convolution, and $\oplus,\otimes$ respectively represent element-wise summation and multiplication.
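The mask-as-attention step above can be sketched in a few lines of pure Python (an illustrative toy, not the patent's network code; the 1×1 convolution is collapsed to a per-position weighted channel sum, with hypothetical weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def intra_modal_attention(feat, conv_w):
    """Toy intra-modality attention: feat is [channels][positions],
    conv_w holds one weight per channel (a 1x1 convolution reduces to
    a weighted sum over channels at each spatial position)."""
    n_pos = len(feat[0])
    # predicted target mask M = sigmoid(1x1 conv(F)), one value per position
    mask = [sigmoid(sum(w * ch[p] for w, ch in zip(conv_w, feat)))
            for p in range(n_pos)]
    # mask used as attention: F' = F + F * M (mask broadcast over channels)
    out = [[v + v * mask[p] for p, v in enumerate(ch)] for ch in feat]
    return out, mask
```

Positions the mask scores near 1 are roughly doubled in magnitude, while background positions pass through almost unchanged, which is the intended intra-modality enhancement.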
Referring to FIG. 5, FIG. 5 shows a target mask label schematic; the network uses the target mask label as the ground-truth value of the target mask.
Preferably, step 3 further comprises training of the intra-modality attention module so that it can correctly predict the mask of the target.
In particular, the training of the intra-modality attention module adjusts parameters according to the loss function

$L_{mask}=\frac{1}{TS}\sum_{i=1}^{T}\sum_{j=1}^{S}\left[W_R\,\mathcal{D}(M_{Rij},Y_{ij})+W_T\,\mathcal{D}(M_{Tij},Y_{ij})\right],\qquad \mathcal{D}(M,Y)=1-\frac{2\,|M\cap Y|+s}{|M|+|Y|+s}$

wherein T is the total number of samples, S is the number of feature pyramid stages, $Y_{ij}$ is the mask label corresponding to the i-th input sample at the j-th stage of the feature pyramid, $M_{Rij}$ and $M_{Tij}$ respectively represent the target masks predicted by the visible light branch and the infrared branch for the i-th input sample at the j-th stage of the feature pyramid, $W_R$ and $W_T$ are the weights of the visible light and infrared modalities, $\mathcal{D}(\cdot,\cdot)$ is the dice loss, and s represents a smoothing coefficient. Parameters are adjusted according to this loss function so that the intra-modality attention module can correctly predict the target mask.
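A minimal sketch of the modality-weighted dice loss described above (illustrative only; masks are flattened lists, and the sample/stage loop is collapsed to one list of (visible, infrared, label) triples):

```python
def dice_loss(pred, label, s=1.0):
    # soft dice: D(M, Y) = 1 - (2*|M∩Y| + s) / (|M| + |Y| + s)
    inter = sum(m * y for m, y in zip(pred, label))
    return 1.0 - (2.0 * inter + s) / (sum(pred) + sum(label) + s)

def mask_loss(masks_r, masks_t, labels, w_r, w_t):
    # modality-weighted dice loss, averaged over the (sample, stage) pairs
    terms = [w_r * dice_loss(mr, y) + w_t * dice_loss(mt, y)
             for mr, mt, y in zip(masks_r, masks_t, labels)]
    return sum(terms) / len(terms)
```

With the smoothing coefficient s, a perfect prediction gives exactly zero loss and an empty mask against an empty label is also well defined, which is why dice loss is preferred over plain overlap ratios here.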
Preferably, a stochastic gradient descent (SGD) method is used to optimize the network weights, with SGD momentum set to 0.9, weight decay coefficient 0.0001, and learning rate 0.005.
Step 4: downsampling the visible light image and inputting it into the illumination perception network to predict the weights of the two modal features.
Referring to fig. 6, fig. 6 shows the illumination perception network and gate function. Specifically, the visible light image $I_R$ is downsampled and input into the illumination perception network N(·) to obtain the probability that the input image was captured during the day or at night:
$C_d=\delta(N(I_R)),\qquad C_n=1-C_d$
wherein $C_d$ is the probability that the input image is daytime, $C_n$ is the probability that the input image is nighttime, and δ(·) represents the softmax activation function.
The probability values are then adjusted by a gate function to obtain reasonable modality weights, thereby adaptively assigning weights to the two modalities: the weight of the visible light features is $W_R=\frac{1}{1+\exp(-\alpha(C_d-C_n))}$ and the weight of the infrared features is $W_T=1-W_R$, where α represents a learnable parameter.
Preferably, α is initialized to 1 and the downsampling factor is 1/8 of the original image. The illumination perception network N(·) contains 2 convolutional layers and 3 fully connected layers; each convolutional layer is followed by a ReLU activation layer and a 2 × 2 max pooling layer to activate and compress the features, and a softmax activation function follows the last fully connected layer.
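The day/night probabilities and the gate can be sketched as follows (an illustrative assumption-laden toy: the convolutional trunk is replaced by two raw logits, and since the patent's exact gate formula is given as a figure, a sigmoid gate with learnable slope α, centred where $C_d=C_n$, is assumed):

```python
import math

def softmax(logits):
    # numerically stable softmax over the two day/night logits
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def illumination_weights(day_night_logits, alpha=1.0):
    """Turn the illumination network's two output logits into modality
    weights W_R (visible) and W_T = 1 - W_R (infrared)."""
    c_d, c_n = softmax(day_night_logits)            # C_d, C_n = 1 - C_d
    w_r = 1.0 / (1.0 + math.exp(-alpha * (c_d - c_n)))  # assumed gate form
    return w_r, 1.0 - w_r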
Preferably, step 4 further comprises training of the illumination perception network, so that the probability of whether the scene in the image is day or night can be calculated according to the characteristics of the input visible light image.
Specifically, the loss function in the training of the illumination perception network is

$L_{illu}=-\frac{1}{T}\sum_{i=1}^{T}\left[y_i\log p_i+(1-y_i)\log(1-p_i)\right]$

wherein T is the total number of samples, $y_i$ is the class label of the i-th input image (1 for daytime, 0 for nighttime), and $p_i$ is the predicted probability that the i-th input image is daytime. Parameters are adjusted according to this loss function so that the illumination perception network can calculate, from the features of the input visible light image, the probability that the scene is daytime or nighttime.
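This day/night loss is a standard binary cross-entropy; a minimal sketch (illustrative, with a clamping epsilon added for numerical safety that the patent does not mention):

```python
import math

def illumination_loss(labels, probs, eps=1e-12):
    # binary cross-entropy over T images: y_i = 1 (day) / 0 (night),
    # p_i = predicted probability that image i is daytime
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A maximally uncertain predictor (p = 0.5 everywhere) scores ln 2 ≈ 0.693, the usual reference point for a two-class problem.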
Preferably, the model is trained by back-propagation when calculating the loss function. The network weights are optimized by stochastic gradient descent (SGD), with SGD momentum set to 0.9, weight decay coefficient 0.0001, and learning rate 0.005.
Step 5: fusing the enhanced visible light and infrared features obtained in steps 2 and 3 using the weights obtained in step 4 to obtain fused features, sending the fused features into the detection network to obtain the position information of the target of interest in the input image, and completing the training of the target detection model.
Specifically, the modal weights W_R, W_T obtained in step 4 and the enhanced visible light features and infrared features obtained in step 2 and step 3 are recombined by element-wise summation, weighted and concatenated to obtain the fused feature F_F = CONCAT(W_R ⊗ (F^inter_R ⊕ F^intra_R), W_T ⊗ (F^inter_T ⊕ F^intra_T)); after F_F is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the ith anchor box are (p_i, l_i) = D(F_F), where CONCAT(·) represents channel concatenation and ⊕, ⊗ represent element-wise summation and multiplication, respectively.
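The weighted channel concatenation above can be sketched as follows, assuming channel-first (C, H, W) feature maps whose intra- and inter-modal enhanced versions have already been summed, and W_T = 1 − W_R:

```python
import numpy as np

def fuse_features(f_r, f_t, w_r):
    # F_F = CONCAT(W_R * F_R, W_T * F_T) with W_T = 1 - W_R.
    # f_r, f_t: (C, H, W) enhanced visible-light / infrared feature maps.
    w_t = 1.0 - w_r
    return np.concatenate([w_r * f_r, w_t * f_t], axis=0)  # concat on channels
```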
Preferably, the step 5 further includes training the detection network, where the training of the detection network includes calculating a loss function of the detection network, and adjusting parameters according to the loss function to obtain a trained target detection model.
This is because the detection network generates anchor boxes, which are preliminary candidate boxes; in general about 512 anchor boxes are generated per image, but their positions are not accurate. The regression branch further regresses these anchor boxes to obtain prediction boxes with more accurate positions.
In particular, the loss function is L = (1/N) Σ_i L_cls(p_i, p_i*) + λ (1/N) Σ_i L_reg(l_i, l_i*), and the parameters are adjusted according to this loss function to obtain the trained target detection model, where N represents the total number of anchor boxes, p_i and p_i* respectively represent the predicted class score and the true class label of the ith anchor box, l_i and l_i* respectively represent the predicted regression value and the true value of the ith anchor box, and λ is a weight factor set to 1.
In particular, this is the loss function of the detection network D(·), comprising the detection network classification loss L_cls and the regression branch loss L_reg.
The classification loss uses a softmax cross-entropy loss function, L_cls = −Σ_{n=1..K} y_n log p_{i,n}, where K is the total number of categories, y_n is the label indicator (y_n = 1 when category n is consistent with the true class label, and y_n = 0 otherwise), and p_{i,n} is the softmax probability that anchor box i belongs to category n.
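A NumPy sketch of this softmax cross-entropy, averaged over anchors; raw logits and integer class indices are assumed as inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classification_loss(logits, labels):
    # Softmax cross-entropy: for each anchor i with true class c_i,
    # the per-anchor loss is -log p_{i, c_i}; the mean over anchors is returned.
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())
```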
Here x, y, w, h represent the center-point coordinates and the width and height of a box, and x_p, x_a, x* are the values from the network regression prediction, the anchor box, and the ground-truth box, respectively; y, w, h are defined analogously.
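A sketch of the Faster R-CNN-style box parameterization implied by this description, together with the smooth L1 loss commonly used for the regression branch (the patent does not spell the regression loss out, so smooth L1 is an assumption):

```python
import numpy as np

def encode_box(anchor, box):
    # Offsets of a box relative to an anchor, both given as (x, y, w, h):
    # t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a,
    # t_w = log(w / w_a),    t_h = log(h / h_a).
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(t_pred, t_true):
    # Smooth L1 (Huber) loss, summed over the four offsets.
    d = np.abs(t_pred - t_true)
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())
```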
Preferably, the model is trained by back propagation, and the network weights are optimized by stochastic gradient descent (SGD) so that the network correctly predicts the position and category of the target of interest.
Here, the SGD momentum is set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
Step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
Specifically, the target detection model is trained using the above steps; after training is complete, the image to be detected is input into the trained target detection model, a forward pass of the network yields the output of the structured network model, and non-maximum suppression is applied to the network output to obtain the final target detection result.
Preferably, the non-maximum suppression threshold is set to 0.5.
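The final non-maximum suppression step with the 0.5 IoU threshold can be sketched in plain NumPy; boxes are assumed to be in (x1, y1, x2, y2) corner format:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # all remaining boxes whose IoU with it exceeds the threshold.
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```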
Example two
For the visible light and infrared image target detection method using illumination guidance and an attention mechanism provided in the first embodiment of the present invention, the second embodiment provides test results to evaluate the performance of the method.
In the testing process, the FLIR-aligned data set is selected for the precision test. FLIR-aligned is a registered subset of the FLIR ADAS data set: on the basis of FLIR ADAS, unpaired images are removed and part of the image pairs are manually registered, finally keeping 4129 image pairs as the training set and 1013 image pairs as the test set. Only three types of objects, namely people, bicycles, and vehicles, are retained in the data set.
The visualization results and the experimental detection results of the test process are shown in Fig. 7 and Fig. 8. In Fig. 7, from left to right are the input image, the original feature map, the feature map after the differential interaction attention module, the feature map after the intra-modal attention module, and the fused feature map. In Fig. 8, a solid-line box denotes a correct detection result, a dashed-line box denotes a missed detection, and a dotted-line box denotes a false alarm.
The following index is adopted to measure the detection precision: average precision (AP).
The experiments evaluate the average precision of each object class at an intersection-over-union (IoU) threshold of 0.5 (AP50), the mean average precision over all classes at an IoU threshold of 0.5 (mAP50), and the mean average precision averaged over the 10 IoU thresholds 0.5:0.05:0.95 (mAP); the results are shown in Table 1.
Table 1 experiment results of visible light and infrared image fusion target detection algorithm on FLIR-aligned
As can be seen from the quantitative and qualitative analysis of the detection precision in Table 1, the detection precision of the proposed method on the FLIR-aligned data set reaches a leading level.
Example three
The third embodiment of the present invention provides a target detection system using illumination guidance and an attention mechanism, the system including:
an extraction module: respectively inputting the visible light image and the infrared image pair into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
inter-modality differential interaction attention module: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
intra-modality attention module: respectively inputting the extracted visible light image features and infrared image features into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining visible light image features and infrared image features with enhanced intra-modal features;
the illumination perception module: downsampling the visible light image, and inputting the downsampled visible light image into an illumination perception network to predict the weight of the two modal characteristics;
a fusion module: the method comprises the steps that the weight obtained by an illumination sensing module is utilized, enhanced visible light image features and infrared image features obtained by an inter-modality difference interaction attention module and an intra-modality attention module are recombined, weighted and cascaded to obtain fusion features, the fusion features are sent to a detection network, the category confidence coefficient and the position information of an interested target in an input image are obtained, and training of a target detection model is completed;
a target detection module: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
Preferably, the intra-modal attention module further comprises training of the intra-modal attention module so that it correctly predicts the mask of the target, with the loss function L_M = (1/T) Σ_{i=1..T} Σ_{j=1..S} [W_R · DICE(M_Rij, Y_ij) + W_T · DICE(M_Tij, Y_ij)], where T is the total number of samples, S is the number of feature pyramid stages, Y represents the mask label and Y_ij is the mask label corresponding to the ith input sample at the jth stage of the feature pyramid, M_Rij and M_Tij respectively represent the target masks predicted by the ith input sample of the visible light branch and the infrared branch at the jth stage of the feature pyramid, W_R and W_T are the weights of the visible light modality and the infrared modality, DICE(M, Y) = 1 − (2 Σ(M ⊗ Y) + s)/(Σ M + Σ Y + s) is the dice loss calculation, and s represents the smoothing factor.
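The dice term with smoothing factor s can be sketched as follows; the weighted sum over samples, pyramid stages and modalities is omitted for brevity:

```python
import numpy as np

def dice_loss(mask_pred, mask_label, s=1.0):
    # Dice loss between a predicted mask and its label:
    # 1 - (2 * |intersection| + s) / (|pred| + |label| + s),
    # where s is the smoothing factor that avoids division by zero.
    inter = (mask_pred * mask_label).sum()
    return float(1.0 - (2.0 * inter + s) / (mask_pred.sum() + mask_label.sum() + s))
```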
Preferably, the illumination perception module further comprises training of the illumination perception network, so that the probability that the scene in the image is day or night is calculated from the features of the input visible light image; the loss function for this training is the binary cross-entropy L_I = −(1/T) Σ_{i=1..T} [y_i log p_i + (1 − y_i) log(1 − p_i)], where T is the total number of samples, y_i is the classification label of the ith input image, and p_i is the probability that the ith input image is predicted as daytime.
Preferably, the fusion module recombines by element-wise summation, weights and concatenates the obtained modal weights W_R, W_T and the enhanced visible light features and infrared features obtained by the inter-modality differential interaction attention module and the intra-modal attention module to obtain the fused feature F_F = CONCAT(W_R ⊗ (F^inter_R ⊕ F^intra_R), W_T ⊗ (F^inter_T ⊕ F^intra_T)); after F_F is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the ith anchor box are (p_i, l_i) = D(F_F), where CONCAT(·) represents channel concatenation and ⊕, ⊗ represent element-wise summation and multiplication, respectively.
Preferably, the inter-modality differential interaction attention module inputs the extracted visible light image features F_R and infrared image features F_T into the inter-modality differential interaction attention module M_inter(·), which amplifies the difference part of F_R and F_T to enhance the extraction of complementary features by the network model, obtaining the difference-enhanced visible light image features F'_R and infrared image features F'_T, namely (F'_R, F'_T) = M_inter(F_R, F_T).
Here F_D = F_R ⊖ F_T is the differential feature, R(·) represents the residual module, σ(·) represents the tanh activation function, GAP(·) represents global average pooling, and ⊖, ⊕, ⊗ represent element-wise subtraction, summation and multiplication, respectively.
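One plausible NumPy reading of this differential interaction, offered only as an illustrative sketch: the difference feature F_D = F_R − F_T is globally pooled and passed through tanh to form channel weights that gate the cross-modal complement; the residual module R(·) is simplified to identity here, so the exact composition in the patent may differ:

```python
import numpy as np

def gap(x):
    # Global average pooling over the spatial axes; keeps the channel axis.
    return x.mean(axis=(1, 2), keepdims=True)

def differential_interaction(f_r, f_t):
    # f_r, f_t: (C, H, W) visible-light and infrared feature maps.
    f_d = f_r - f_t            # element-wise subtraction: differential feature
    w = np.tanh(gap(f_d))      # channel attention weights from the difference
    f_r_out = f_r + f_t * w    # enhance each modality with the complementary,
    f_t_out = f_t + f_r * w    # difference-weighted features (R(.) = identity here)
    return f_r_out, f_t_out
```

Note that when the two modalities agree exactly, the difference is zero and both features pass through unchanged, which matches the module's intent of amplifying only the differing (complementary) parts.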
It should be understood that parts of the specification not set forth in detail are of the prior art.
The protective scope of the present invention is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present invention by those skilled in the art without departing from the scope and spirit of the present invention. It is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
Claims (10)
1. A target detection method using illumination guidance and an attention mechanism, characterized by:
step 1: respectively inputting the visible light image and the infrared image pair into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
step 2: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
and 3, step 3: respectively inputting the visible light image features and the infrared image features extracted in the step (1) into a intra-modal attention module, predicting a target mask, and enhancing intra-modal features by taking the mask as attention to obtain intra-modal features of the visible light image and the infrared image with enhanced intra-modal features;
and 4, step 4: downsampling the visible light image, and inputting the downsampled visible light image into an illumination perception network to predict the weight of the two modal characteristics;
and 5: fusing the enhanced visible light image features and the infrared image features obtained in the step 2 and the step 3 by using the weight obtained in the step 4 to obtain fusion features, sending the fusion features into a detection network, obtaining the position information of the target of interest in the input image, and finishing the training of a target detection model;
step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
2. The method of claim 1, wherein: said step 3 further comprises training of the intra-modal attention module to correctly predict the mask of the target, with a loss function L_M = (1/T) Σ_{i=1..T} Σ_{j=1..S} [W_R · DICE(M_Rij, Y_ij) + W_T · DICE(M_Tij, Y_ij)], wherein T is the total number of samples, S is the number of feature pyramid stages, Y represents the mask label and Y_ij is the mask label corresponding to the ith input sample at the jth stage of the feature pyramid, M_Rij and M_Tij respectively represent the target masks predicted by the ith input sample of the visible light branch and the infrared branch at the jth stage of the feature pyramid, W_R and W_T are the weights of the visible light modality and the infrared modality, DICE(M, Y) = 1 − (2 Σ(M ⊗ Y) + s)/(Σ M + Σ Y + s) is the dice loss calculation, and s represents the smoothing factor.
3. The method of claim 1, wherein: step 4 further comprises training of the illumination perception network, wherein the probability that the scene in the image is day or night is calculated from the features of the input visible light image, and the loss function in the training of the illumination perception network is L_I = −(1/T) Σ_{i=1..T} [y_i log p_i + (1 − y_i) log(1 − p_i)], wherein T is the total number of samples, y_i represents the classification label of the ith input image, and p_i is the probability that the ith input image is predicted as daytime.
4. The method of claim 1, wherein: step 5 comprises recombining by element-wise summation, weighting and concatenating the modal weights W_R, W_T obtained in step 4 and the enhanced visible light features and infrared features obtained in step 2 and step 3 to obtain the fused feature F_F = CONCAT(W_R ⊗ (F^inter_R ⊕ F^intra_R), W_T ⊗ (F^inter_T ⊕ F^intra_T)); after F_F is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the ith anchor box are (p_i, l_i) = D(F_F), wherein CONCAT(·) represents channel concatenation and ⊕, ⊗ represent element-wise summation and multiplication, respectively.
5. The method of claim 1, wherein: step 2 comprises inputting the extracted visible light image features F_R and infrared image features F_T into the inter-modality differential interaction attention module M_inter(·), which amplifies the difference part of F_R and F_T to enhance the extraction of complementary features by the network model, obtaining the difference-enhanced visible light image features F'_R and infrared image features F'_T, namely (F'_R, F'_T) = M_inter(F_R, F_T).
6. A target detection system using illumination guidance and an attention mechanism, characterized in that the system comprises:
an extraction module: respectively inputting the visible light image and the infrared image into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
intra-modality attention module: respectively inputting the extracted visible light image features and infrared image features into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining visible light image features and infrared image features with enhanced intra-modal features;
the illumination perception module: downsampling the visible light image, and inputting the downsampled visible light image into an illumination perception network to predict the weight of the two modal characteristics;
a fusion module: the method comprises the steps that the weight obtained by an illumination sensing module is utilized, enhanced visible light image features and infrared image features obtained by an inter-modality difference interaction attention module and an intra-modality attention module are recombined, weighted and cascaded to obtain fusion features, the fusion features are sent to a detection network, the category confidence coefficient and the position information of an interested target in an input image are obtained, and training of a target detection model is completed;
a target detection module: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
7. The system of claim 6, wherein: the intra-modal attention module further includes training of the intra-modal attention module to correctly predict the mask of the target, with a loss function L_M = (1/T) Σ_{i=1..T} Σ_{j=1..S} [W_R · DICE(M_Rij, Y_ij) + W_T · DICE(M_Tij, Y_ij)], wherein T is the total number of samples, S is the number of feature pyramid stages, Y represents the mask label and Y_ij is the mask label corresponding to the ith input sample at the jth stage of the feature pyramid, M_Rij and M_Tij respectively represent the target masks predicted by the ith input sample of the visible light branch and the infrared branch at the jth stage of the feature pyramid, W_R and W_T are the weights of the visible light modality and the infrared modality, DICE(M, Y) = 1 − (2 Σ(M ⊗ Y) + s)/(Σ M + Σ Y + s) is the dice loss calculation, and s represents the smoothing factor.
8. The system of claim 6, wherein: the illumination perception module further comprises training of the illumination perception network, wherein the probability that the scene in the image is day or night is calculated from the features of the input visible light image, and the loss function in the training of the illumination perception network is L_I = −(1/T) Σ_{i=1..T} [y_i log p_i + (1 − y_i) log(1 − p_i)], wherein T is the total number of samples, y_i represents the classification label of the ith input image, and p_i is the probability that the ith input image is predicted as daytime.
9. The system of claim 6, wherein: the fusion module recombines by element-wise summation, weights and concatenates the obtained modal weights W_R, W_T and the enhanced visible light features and infrared features obtained by the inter-modality differential interaction attention module and the intra-modal attention module to obtain the fused feature F_F = CONCAT(W_R ⊗ (F^inter_R ⊕ F^intra_R), W_T ⊗ (F^inter_T ⊕ F^intra_T)); after F_F is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the ith anchor box are (p_i, l_i) = D(F_F), wherein CONCAT(·) represents channel concatenation and ⊕, ⊗ represent element-wise summation and multiplication, respectively.
10. The system of claim 6, wherein: the inter-modality differential interaction attention module inputs the extracted visible light image features F_R and infrared image features F_T into the inter-modality differential interaction attention module M_inter(·), which amplifies the difference part of F_R and F_T to enhance the extraction of complementary features by the network model, obtaining the difference-enhanced visible light image features F'_R and infrared image features F'_T, namely (F'_R, F'_T) = M_inter(F_R, F_T).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210734314.9A CN115131640A (en) | 2022-06-27 | 2022-06-27 | Target detection method and system utilizing illumination guide and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115131640A true CN115131640A (en) | 2022-09-30 |
Family
ID=83379399
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631510A (en) * | 2022-10-24 | 2023-01-20 | 智慧眼科技股份有限公司 | Pedestrian re-identification method and device, computer equipment and storage medium |
CN115393684A (en) * | 2022-10-27 | 2022-11-25 | 松立控股集团股份有限公司 | Anti-interference target detection method based on automatic driving scene multi-mode fusion |
CN116740410A (en) * | 2023-04-21 | 2023-09-12 | 中国地质大学(武汉) | Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment |
CN116740410B (en) * | 2023-04-21 | 2024-01-30 | 中国地质大学(武汉) | Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment |
CN116778227A (en) * | 2023-05-12 | 2023-09-19 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
CN116778227B (en) * | 2023-05-12 | 2024-05-10 | 昆明理工大学 | Target detection method, system and equipment based on infrared image and visible light image |
CN117078920A (en) * | 2023-10-16 | 2023-11-17 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
CN117078920B (en) * | 2023-10-16 | 2024-01-23 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||