CN115131640A - Target detection method and system utilizing illumination guide and attention mechanism - Google Patents

Target detection method and system utilizing illumination guide and attention mechanism

Info

Publication number
CN115131640A
CN115131640A
Authority
CN
China
Prior art keywords
visible light
features
modal
image
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210734314.9A
Other languages
Chinese (zh)
Inventor
杨文
贺钰洁
张妍
余淮
余磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210734314.9A priority Critical patent/CN115131640A/en
Publication of CN115131640A publication Critical patent/CN115131640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/60 Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a visible light and infrared image target detection method and system using illumination guidance and attention mechanisms. A deep convolutional neural backbone network extracts image features from the visible light and infrared images, and an inter-modal differential interaction attention module and an intra-modal attention module are introduced to enhance the inter-modal and intra-modal features, respectively: the inter-modal differential interaction attention module amplifies the differences between modalities so that the network extracts complementary features more effectively, while the intra-modal attention module predicts a target mask for each modality and uses it as attention to enhance the intra-modal features. An illumination-guided illumination-aware network module is also introduced, which uses illumination information to adaptively assign weights to the two modalities; the modal weights are further introduced into the mask prediction loss function to adjust the contribution of the two modalities to the loss, making the network focus more on hard samples and achieving high-precision target detection.

Description

Target detection method and system utilizing illumination guidance and attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a target detection method and system using an illumination guidance and attention mechanism.
Background
Target detection is the basis for automatic analysis and understanding of complex scenes and plays an important role in fields such as intelligent security, human-computer interaction and smart cities. However, at night, under insufficient illumination or in severe weather, the imaging quality of visible light images degrades significantly and can no longer meet the requirements of high-precision target detection. Infrared images are formed from the radiation of the target and the background; they are not affected by rain, snow, wind, frost or other harsh conditions, have strong anti-interference capability, can reveal camouflaged targets, and are therefore highly complementary to visible light images.
Therefore, effectively exploiting the characteristics of visible light and infrared images, mining their complementary information and achieving high-precision target detection has important theoretical significance and practical application value.
However, because the external environment is hard to predict, it is difficult for a target detection network to know in advance how much each modality will contribute. For example, the target of interest may not appear in one modality while showing clear characteristics in the other; it may appear to some degree in both modalities, so that the information of the two modalities must be used complementarily to reach a final decision; and still more complex presentations of modality information are possible. In such cases the network cannot be preset with the attention each modality should receive or with the features that deserve particular concern.
Therefore, an efficient and adaptive modality information fusion framework is needed. Most existing visible light and infrared image fusion algorithms, however, do not divide the features explicitly and leave feature selection entirely to the detection network, so the visible light and infrared image features are not fully exploited and detection performance suffers.
To solve these problems, the invention provides a visible light and infrared image target detection method using illumination guidance and attention mechanisms, which models the different kinds of features explicitly, makes full use of the information in the visible light and infrared images, and achieves high-precision visible light and infrared image fusion target detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target detection method and system using an illumination guidance and attention mechanism. The method and system make full use of the information in visible light and infrared images, divide the image features into intra-modal features and inter-modal difference features, and model the different features explicitly; at the same time, an illumination-guided illumination-aware network module is introduced, which uses illumination information to adaptively assign weights to the different modalities, thereby improving the precision of visible light and infrared image fusion target detection.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a target detection method using an illumination guidance and attention mechanism, which comprises the following steps:
step 1: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
step 2: inputting the extracted visible light image features and infrared image features into an inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
step 3: inputting the visible light image features and infrared image features extracted in step 1 into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
step 4: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
step 5: fusing the enhanced visible light image features and infrared image features obtained in steps 2 and 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into a detection network to obtain the position information of the target of interest in the input image, and completing the training of a target detection model;
step 6: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, step 3 further comprises training the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, step 4 further includes training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, step 5 comprises using the modal weights W_R and W_T obtained in step 4 to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
Preferably, step 2 specifically includes inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
The present invention also provides a target detection system using an illumination guidance and attention mechanism, comprising:
an extraction module: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
Preferably, the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a target detection method and a target detection system utilizing an illumination guide and attention mechanism.
Meanwhile, inter-modality interaction attention and intra-modality attention are introduced, and inter-modality features and intra-modality features are enhanced respectively, wherein the inter-modality interaction attention module is used for enhancing extraction of the complementary features of the network on the modalities by amplifying differences among the modalities, and the intra-modality attention module is used for predicting a target mask for each modality and taking the target mask as the attention to enhance the intra-modality features.
And introducing the modal weight into a mask prediction loss function, and adjusting the contribution of the two modes to the loss function to enable the network to pay more attention to the samples difficult to learn so as to achieve high-precision target detection.
Drawings
FIG. 1 is a schematic diagram of a visible light image and an infrared image;
FIG. 2 is a diagram of a deep convolutional neural network model used in the target detection method provided by the present invention;
FIG. 3 is a schematic diagram of an inter-modality interaction attention module in accordance with the present invention;
FIG. 4 is a schematic view of an intra-modal attention module of the present invention;
FIG. 5 is a schematic view of a target mask label in the present invention;
FIG. 6 is a schematic diagram of an illumination-aware network module for illumination guidance according to the present invention;
FIG. 7 is a visualization of a feature map for each stage of the network in the present invention;
FIG. 8 is a graph showing the results of the test in the present invention.
Detailed Description
The invention will be further described with reference to examples of embodiments shown in the drawings.
Example one
The invention provides a visible light and infrared image target detection method using an illumination guidance and attention mechanism. To make the objects, technical solutions and effects of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
Step 1: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features; the two networks do not share parameters.
The visible light image and the infrared image are captured simultaneously in the same scene. Referring to fig. 1, fig. 1 shows visible light and infrared image pairs: the first and third columns are visible light images, and the second and fourth columns are the corresponding infrared images. The visible light image may be a color image.
Referring to fig. 2, fig. 2 shows a deep convolutional neural network model used in the target detection method provided in the present invention.
Preferably, Faster R-CNN is used as the basic framework. The visible light image I_R and the infrared image I_T are respectively input into two deep convolutional neural backbone networks G_R(·) and G_T(·) with identical structures to extract image features; the two networks do not share parameters, giving the visible light image features F_R = G_R(I_R) and the infrared image features F_T = G_T(I_T).
Here the subscript R denotes the visible light image modality and the subscript T denotes the infrared image modality.
Preferably, to enhance the generalization capability of the network, data enhancement with image scaling and random horizontal flipping can be adopted during training; for example, the visible light image I_R and the infrared image I_T are scaled to 640 × 512 pixels.
Preferably, the deep convolutional neural backbone network used as the feature extractor is ResNet-101 (a 101-layer residual neural network) with an FPN (feature pyramid network); the parameters are initialized with Faster R-CNN weights pre-trained on the COCO data set, using only the weights of the feature extraction part.
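For illustration, a minimal PyTorch sketch of this two-branch feature extraction is given below; the use of torchvision's resnet_fpn_backbone helper, the input sizes and the replication of the single-channel infrared image to three channels are assumptions made for the sketch, not part of the claimed method.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Two structurally identical ResNet-101 + FPN backbones that do not share parameters:
# G_R for the visible light branch and G_T for the infrared branch.
backbone_R = resnet_fpn_backbone(backbone_name="resnet101", weights=None)
backbone_T = resnet_fpn_backbone(backbone_name="resnet101", weights=None)

I_R = torch.randn(1, 3, 512, 640)  # visible light image scaled to 640 x 512
I_T = torch.randn(1, 3, 512, 640)  # infrared image (single channel replicated to 3 channels)

F_R = backbone_R(I_R)  # dict of FPN feature maps, one entry per pyramid stage
F_T = backbone_T(I_T)
```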
Step 2: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features.
Specifically, referring to fig. 3, the extracted visible light image feature F_R and infrared image feature F_T are input into the inter-modal differential interaction attention module M_inter(·) shown in fig. 3, which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
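Since the exact enhancement equations appear only as figures in the original publication, the sketch below gives one plausible PyTorch reading of the module as described (difference feature, residual module, global average pooling, tanh attention, cross-modal enhancement); the concrete composition of these operators is an assumption.

```python
import torch
import torch.nn as nn

class InterModalDifferentialAttention(nn.Module):
    """One plausible reading of the inter-modal differential interaction attention:
    the modality difference is turned into channel attention via a residual module,
    global average pooling and tanh, then used to exchange complementary features."""

    def __init__(self, channels):
        super().__init__()
        # simple two-layer block standing in for the residual module R(.)
        self.res_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, F_R, F_T):
        F_D = F_R - F_T                                   # differential feature F_D

        def attn(d):                                      # R(.) -> GAP(.) -> tanh
            r = d + self.res_branch(d)
            return torch.tanh(r.mean(dim=(2, 3), keepdim=True))

        F_R_hat = F_R + F_T * attn(F_D)                   # visible branch absorbs complementary infrared cues
        F_T_hat = F_T + F_R * attn(-F_D)                  # infrared branch absorbs complementary visible cues
        return F_R_hat, F_T_hat
```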
Step 3: inputting the visible light image features and infrared image features extracted in step 1 into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features.
Specifically, referring to fig. 4, the visible light image feature F_R and the infrared image feature F_T extracted in step 1 are input into the intra-modal attention module M_intra(·) shown in fig. 4, which predicts the target masks M_R and M_T of the two modalities and produces the intra-modally enhanced visible light image features F̃_R and infrared image features F̃_T. The mask of each modality is predicted from its own features through a 1 × 1 convolution f(·) followed by a sigmoid activation δ(·), and is then used as attention to enhance that modality's features, where ⊕ and ⊗ denote element-wise summation and multiplication, respectively.
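A sketch of the intra-modal attention module follows, using the 1 × 1 convolution and sigmoid named in the text for mask prediction; the residual form of the enhancement (feature plus masked feature) is an assumption.

```python
import torch
import torch.nn as nn

class IntraModalAttention(nn.Module):
    """Predicts a per-pixel target mask with a 1x1 convolution and a sigmoid,
    then uses the mask as spatial attention to enhance the same modality's features."""

    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)  # f(.), 1 x 1 convolution

    def forward(self, feat):
        mask = torch.sigmoid(self.mask_head(feat))   # predicted target mask in [0, 1]
        enhanced = feat + feat * mask                # mask used as attention (residual form assumed)
        return enhanced, mask
```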
Referring to FIG. 5, FIG. 5 shows the target mask labels; the network uses the target mask labels as the ground truth for the predicted target masks.
Preferably, step 3 further comprises training of the intra-modality attention module so that it can correctly predict the mask of the target.
In particular, the training of the intra-modal attention module adjusts the parameters according to the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient; the parameters are adjusted according to this loss function so that the intra-modal attention module can correctly predict the target mask.
Preferably, stochastic gradient descent (SGD) is used to optimize the network weights, with the SGD momentum set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
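A sketch of the mask supervision described above is given below, with the Dice loss written in its common smoothed form; the per-stage averaging and the way the modality weights enter are written to match the description and are assumptions where the text leaves them implicit.

```python
import torch

def dice_loss(pred, target, s=1.0):
    # Dice loss with smoothing coefficient s; pred and target are (N, H, W) masks in [0, 1].
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return 1.0 - (2.0 * inter + s) / (union + s)

def mask_loss(masks_R, masks_T, labels, W_R, W_T):
    """masks_R / masks_T: per-FPN-stage lists of predicted masks from the visible and
    infrared branches; labels: matching mask labels per stage; W_R, W_T: modality
    weights that re-balance the two branches' contributions to the loss."""
    S = len(labels)
    total = 0.0
    for j in range(S):
        total = total + (W_R * dice_loss(masks_R[j], labels[j])
                         + W_T * dice_loss(masks_T[j], labels[j])).mean()
    return total / S
```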
Step 4: downsampling the visible light image and inputting it into the illumination-aware network to predict the weights of the two modal features.
Referring to fig. 6, fig. 6 shows the illumination-aware network and the gate function. Specifically, the visible light image I_R is downsampled and input into the illumination-aware network N(·) to obtain the probability that the input image was taken in daytime or at night:
C_d = δ(N(I_R)), C_n = 1 − C_d,
where C_d is the probability that the input image is daytime, C_n is the probability that the input image is night, and δ(·) denotes the softmax activation function.
The probability values are then adjusted by a gate function to obtain reasonable modal weights, so that weights are adaptively assigned to the two modalities: the weight W_R of the visible light features is obtained from C_d through the gate function with a learnable parameter α, and the weight of the infrared features is W_T = 1 − W_R.
Preferably, α is initialized to 1 and the downsampling factor is 1/8 of the original image; the illumination-aware network N(·) contains 2 convolutional layers and 3 fully connected layers, each convolutional layer is followed by a ReLU activation layer and a 2 × 2 max pooling layer to activate and compress the features, and the softmax activation function is applied after the last fully connected layer.
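A sketch of the illumination-aware network and the gate function follows; the text fixes the layer counts (2 convolutional layers, 3 fully connected layers, ReLU + 2 × 2 max pooling, final softmax) and the learnable parameter α initialised to 1, while the channel widths, hidden sizes and the exact algebraic form of the gate are assumptions, since the gate equation appears only as a figure.

```python
import torch
import torch.nn as nn

class IlluminationAwareNet(nn.Module):
    def __init__(self, in_ch=3, img_hw=(64, 80)):        # input downsampled to 1/8 of 512 x 640
        super().__init__()
        self.features = nn.Sequential(                    # 2 conv layers, each with ReLU and 2x2 max pooling
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        flat = 32 * (img_hw[0] // 4) * (img_hw[1] // 4)
        self.classifier = nn.Sequential(                  # 3 fully connected layers
            nn.Flatten(), nn.Linear(flat, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Linear(64, 2),
        )
        self.alpha = nn.Parameter(torch.tensor(1.0))      # learnable gate parameter, initialised to 1

    def forward(self, I_R_small):
        logits = self.classifier(self.features(I_R_small))
        C_d, C_n = torch.softmax(logits, dim=1).unbind(dim=1)   # day / night probabilities
        # Assumed gate: push the day probability away from 0.5 by a learnable amount
        # and clamp it to [0, 1] before using it as the visible light modality weight.
        W_R = torch.clamp(0.5 + self.alpha * (C_d - 0.5), 0.0, 1.0)
        W_T = 1.0 - W_R
        return C_d, C_n, W_R, W_T
```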
Preferably, step 4 further comprises training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night.
Specifically, the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image (1 for daytime and 0 for night), and p_i is the predicted probability that the i-th input image is daytime; the parameters are adjusted according to this loss function so that the illumination-aware network can estimate, from the input visible light image, the probability that the scene is daytime or night.
Preferably, the model is trained by back-propagation when this loss function is computed. Stochastic gradient descent (SGD) is used to optimize the network weights, with the SGD momentum set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
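The corresponding day/night supervision is ordinary binary cross entropy over the predicted day probability; the optimiser line simply restates the settings quoted in the text.

```python
import torch.nn.functional as F

def illumination_loss(C_d, y):
    # y = 1 for a daytime image, 0 for a night image; C_d is the predicted day probability.
    return F.binary_cross_entropy(C_d, y.float())

# SGD settings from the text: momentum 0.9, weight decay 0.0001, learning rate 0.005.
# optimizer = torch.optim.SGD(net.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0001)
```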
Step 5: fusing the enhanced visible light and infrared features obtained in steps 2 and 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into the detection network to obtain the position information of the target of interest in the input image, and completing the training of the target detection model.
Specifically, the modal weights W_R and W_T obtained in step 4 are used to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
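A sketch of the fusion step is given below; summing each modality's two enhanced feature maps before weighting and concatenation is an assumed reading of "recombined, weighted and cascaded", since the fusion equation is given only as a figure.

```python
import torch

def fuse_features(FR_inter, FR_intra, FT_inter, FT_intra, W_R, W_T):
    # FR_*/FT_*: (N, C, H, W) enhanced features from the two attention modules.
    # W_R, W_T: per-image modality weights of shape (N,).
    F_R = W_R.view(-1, 1, 1, 1) * (FR_inter + FR_intra)   # recombine and weight the visible branch
    F_T = W_T.view(-1, 1, 1, 1) * (FT_inter + FT_intra)   # recombine and weight the infrared branch
    return torch.cat([F_R, F_T], dim=1)                   # channel concatenation -> fusion feature F_F
```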
Preferably, step 5 further includes training the detection network: the loss function of the detection network is calculated and the parameters are adjusted according to it, yielding the trained target detection model.
This is because the detection network first generates anchor boxes, which are preliminary candidate boxes; an image typically yields about 512 anchor boxes, but their positions are not yet accurate. The regression branch then refines these anchor boxes to obtain prediction boxes with more accurate positions.
In particular, the loss function is
L_det = (1/N) Σ_{i=1..N} [ L_cls(p_i, p*_i) + λ·L_reg(l_i, l*_i) ],
and the parameters are adjusted according to this loss function to obtain the trained target detection model, where N is the total number of anchor boxes, p_i and p*_i are the predicted class score and the true class label of the i-th anchor box, l_i and l*_i are the predicted regression value and the true value of the i-th anchor box, and λ is a weight factor set to 1.
In particular, the loss function of the detection network D(·) comprises the detection network classification loss L_cls and the regression branch loss L_reg. The classification loss uses the softmax cross-entropy loss
L_cls = − Σ_{n=1..K} y_n·log(p_n),
where K is the total number of categories, y_n = 1 when the predicted category n is consistent with the true category label and y_n = 0 otherwise, and p_n is the softmax probability that anchor box i belongs to category n.
The regression branch loss uses the smooth L1 loss,
L_reg(l_i, l*_i) = Σ_{m∈{x,y,w,h}} smooth_L1(l_i^m − l*_i^m), with smooth_L1(x) = 0.5·x² if |x| < 1 and |x| − 0.5 otherwise,
where the regression values are parameterized relative to the anchor box as
l^x = (x_p − x_a)/w_a, l^y = (y_p − y_a)/h_a, l^w = log(w_p/w_a), l^h = log(h_p/h_a),
and analogously l*^x = (x* − x_a)/w_a, l*^y = (y* − y_a)/h_a, l*^w = log(w*/w_a), l*^h = log(h*/h_a).
Here x, y, w, h denote the center coordinates, width and height of a box, and the subscripts p, a and * denote the values of the predicted box obtained from the network regression, the anchor box and the ground-truth box, respectively; y, w and h follow the same convention.
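A sketch of the detection losses named above follows: softmax cross entropy for classification, smooth L1 for regression, and the anchor-relative box parameterisation; how the two terms are normalised and which anchors contribute to the regression term are assumptions, as the original equations are figures.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, lam=1.0):
    # Classification: softmax cross entropy over K classes (cls_logits: (N, K)).
    # Regression: smooth L1 over the encoded (x, y, w, h) offsets.
    L_cls = F.cross_entropy(cls_logits, cls_targets)
    L_reg = F.smooth_l1_loss(box_preds, box_targets)
    return L_cls + lam * L_reg

def encode_boxes(boxes, anchors):
    # boxes, anchors: (N, 4) tensors of (x_center, y_center, w, h).
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(boxes[:, 2] / anchors[:, 2])
    th = torch.log(boxes[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)
```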
Preferably, the model is trained by back-propagation, and stochastic gradient descent (SGD) is used to optimize the network weights so that the network can correctly predict the position and category of the target of interest.
Here the SGD momentum is set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
Step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
Specifically, the target detection model is trained with the above steps; after training is completed, the image to be detected is input into the trained target detection model, a forward pass of the network produces the output of the network model, and non-maximum suppression is applied to this output to obtain the final target detection result.
Preferably, the non-maximum suppression threshold is set to 0.5.
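At inference time the raw detections are filtered with non-maximum suppression; the snippet below uses torchvision's nms with the 0.5 threshold stated above.

```python
from torchvision.ops import nms

def postprocess(boxes, scores, iou_threshold=0.5):
    # boxes: (M, 4) in (x1, y1, x2, y2) format; scores: (M,) class confidences.
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```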
Example two
For the visible light and infrared image target detection method using illumination guidance and attention mechanisms provided in the first embodiment, this embodiment provides test results to evaluate the performance of the method.
In the tests, the FLIR-aligned data set is used for accuracy evaluation. FLIR-aligned is a registered subset of the FLIR ADAS data set: unpaired images are removed and part of the image pairs are manually registered, leaving 4129 image pairs as the training set and 1013 image pairs as the test set. Only three classes of objects are retained in the data set: person, bicycle and vehicle.
The visualization and detection results of the tests are shown in fig. 7 and fig. 8. In fig. 7, from left to right are the input image, the original feature map, the feature map after the differential interaction attention module, the feature map after the intra-modal attention module, and the fused feature map. In fig. 8, solid boxes indicate correct detections, while the dashed and dotted boxes indicate missed detections and false alarms, respectively.
The following metric is used to measure detection accuracy: Average Precision (AP).
The experiments report the AP of each class at an intersection-over-union (IoU) threshold of 0.5 (AP50), the mean AP over all classes at an IoU threshold of 0.5 (mAP50), and the mean AP averaged over the 10 IoU thresholds 0.5:0.05:0.95 (mAP); the results are shown in Table 1.
Table 1 Results of the visible light and infrared image fusion target detection algorithm on FLIR-aligned
As can be seen from the quantitative and qualitative analysis of the detection accuracy in Table 1, the detection accuracy of the proposed method on the FLIR-aligned data set reaches the leading level.
EXAMPLE III
An embodiment of the present invention provides a target detection system using an illumination guidance and attention mechanism, the system comprising:
an extraction module: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
Preferably, the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
It should be understood that parts of the specification not set forth in detail are of the prior art.
The protective scope of the present invention is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present invention by those skilled in the art without departing from the scope and spirit of the present invention. It is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (10)

1. A method of target detection using an illumination guidance and attention mechanism, characterized by:
step 1: respectively inputting the visible light image and the infrared image pair into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
step 2: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
step 3: inputting the visible light image features and infrared image features extracted in step 1 into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
step 4: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
step 5: fusing the enhanced visible light image features and infrared image features obtained in step 2 and step 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into a detection network to obtain the position information of the target of interest in the input image, and completing the training of a target detection model;
step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
2. The method of claim 1, wherein: said step 3 further comprises training the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
wherein T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
3. The method of claim 1, wherein: step 4 further comprises training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night, the loss function used when training the illumination-aware network being
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
wherein T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
4. The method of claim 1, wherein: step 5 comprises using the modal weights W_R and W_T obtained in step 4 to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the i-th anchor box being (p_i, l_i) = D(F_F), wherein CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
5. The method of claim 1, wherein: step 2 comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T; the differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, wherein ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
6. A target detection system using an illumination guidance and attention mechanism, characterized in that the system comprises:
an extraction module: inputting the visible light image and the infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain a target detection result.
7. The system of claim 6, wherein: the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
wherein T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
8. The system of claim 6, wherein: the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night, the loss function used when training the illumination-aware network being
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
wherein T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
9. The system of claim 6, wherein: the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the i-th anchor box being (p_i, l_i) = D(F_F), wherein CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
10. The system of claim 6, wherein: the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T; the differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, wherein ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
CN202210734314.9A 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism Pending CN115131640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210734314.9A CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210734314.9A CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Publications (1)

Publication Number Publication Date
CN115131640A true CN115131640A (en) 2022-09-30

Family

ID=83379399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210734314.9A Pending CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Country Status (1)

Country Link
CN (1) CN115131640A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631510A (en) * 2022-10-24 2023-01-20 智慧眼科技股份有限公司 Pedestrian re-identification method and device, computer equipment and storage medium
CN115393684A (en) * 2022-10-27 2022-11-25 松立控股集团股份有限公司 Anti-interference target detection method based on automatic driving scene multi-mode fusion
CN116740410A (en) * 2023-04-21 2023-09-12 中国地质大学(武汉) Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment
CN116740410B (en) * 2023-04-21 2024-01-30 中国地质大学(武汉) Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment
CN116778227A (en) * 2023-05-12 2023-09-19 昆明理工大学 Target detection method, system and equipment based on infrared image and visible light image
CN116778227B (en) * 2023-05-12 2024-05-10 昆明理工大学 Target detection method, system and equipment based on infrared image and visible light image
CN117078920A (en) * 2023-10-16 2023-11-17 昆明理工大学 Infrared-visible light target detection method based on deformable attention mechanism
CN117078920B (en) * 2023-10-16 2024-01-23 昆明理工大学 Infrared-visible light target detection method based on deformable attention mechanism

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN115131640A (en) Target detection method and system utilizing illumination guide and attention mechanism
CN108399362B (en) Rapid pedestrian detection method and device
CN111126258B (en) Image recognition method and related device
CN110458165B (en) Natural scene text detection method introducing attention mechanism
Yang et al. Single image haze removal via region detection network
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN113111814B (en) Regularization constraint-based semi-supervised pedestrian re-identification method and device
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN110222615A (en) The target identification method that is blocked based on InceptionV3 network
CN113781519A (en) Target tracking method and target tracking device
CN111539351A (en) Multi-task cascaded face frame selection comparison method
CN117237740B (en) SAR image classification method based on CNN and Transformer
CN114998801A (en) Forest fire smoke video detection method based on contrast self-supervision learning network
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117237411A (en) Pedestrian multi-target tracking method based on deep learning
CN116152699B (en) Real-time moving target detection method for hydropower plant video monitoring system
CN115063428B (en) Spatial dim small target detection method based on deep reinforcement learning
CN116704309A (en) Image defogging identification method and system based on improved generation of countermeasure network
CN115393901A (en) Cross-modal pedestrian re-identification method and computer readable storage medium
CN116110074A (en) Dynamic small-strand pedestrian recognition method based on graph neural network
CN114694042A (en) Disguised person target detection method based on improved Scaled-YOLOv4
CN114581353A (en) Infrared image processing method and device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination