CN115131640A - Target detection method and system utilizing illumination guide and attention mechanism - Google Patents

Target detection method and system utilizing illumination guide and attention mechanism

Info

Publication number
CN115131640A
CN115131640A
Authority
CN
China
Prior art keywords
visible light
features
modal
image
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210734314.9A
Other languages
Chinese (zh)
Inventor
杨文
贺钰洁
张妍
余淮
余磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210734314.9A priority Critical patent/CN115131640A/en
Publication of CN115131640A publication Critical patent/CN115131640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/60 Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a visible light and infrared image target detection method and system using illumination guidance and attention mechanisms. A deep convolutional neural backbone network extracts image features from the visible light and infrared images, and an inter-modal differential interaction attention module and an intra-modal attention module are introduced to enhance the inter-modal and intra-modal features, respectively: the inter-modal differential interaction attention module amplifies the differences between modalities so that the network extracts complementary features more effectively, while the intra-modal attention module predicts a target mask for each modality and uses it as attention to enhance the intra-modal features. An illumination-guided illumination-aware network module is also introduced, which uses illumination information to adaptively assign weights to the two modalities; the modal weights are further introduced into the mask prediction loss function to adjust the contribution of the two modalities to the loss, making the network focus more on hard samples and achieving high-precision target detection.

Description

Target detection method and system utilizing illumination guidance and attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a target detection method and system using an illumination guidance and attention mechanism.
Background
Target detection is the basis for automatic analysis and understanding of complex scenes and plays an important role in fields such as intelligent security, human-computer interaction and smart cities. However, at night, under insufficient illumination or in severe weather, the imaging quality of visible light images degrades significantly and can no longer meet the requirements of high-precision target detection. Infrared images are formed from the radiation of the target and the background; they are not affected by rain, snow, wind, frost or other harsh conditions, have strong anti-interference capability, can reveal camouflaged targets, and are therefore highly complementary to visible light images.
Therefore, effectively exploiting the characteristics of visible light and infrared images, mining their complementary information and achieving high-precision target detection has important theoretical significance and practical application value.
However, because the external environment is hard to predict, it is difficult for a target detection network to know in advance how much each modality will contribute. For example, the target of interest may not appear in one modality while showing clear characteristics in the other; it may appear to some degree in both modalities, so that the information of the two modalities must be used complementarily to reach a final decision; and still more complex presentations of modality information are possible. In such cases the network cannot be preset with the attention each modality should receive or with the features that deserve particular concern.
Therefore, an efficient and adaptive modality information fusion framework is needed. Most existing visible light and infrared image fusion algorithms, however, do not divide the features explicitly and leave feature selection entirely to the detection network, so the visible light and infrared image features are not fully exploited and detection performance suffers.
To solve these problems, the invention provides a visible light and infrared image target detection method using illumination guidance and attention mechanisms, which models the different kinds of features explicitly, makes full use of the information in the visible light and infrared images, and achieves high-precision visible light and infrared image fusion target detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target detection method and system using an illumination guidance and attention mechanism. The method and system make full use of the information in visible light and infrared images, divide the image features into intra-modal features and inter-modal difference features, and model the different features explicitly; at the same time, an illumination-guided illumination-aware network module is introduced, which uses illumination information to adaptively assign weights to the different modalities, thereby improving the precision of visible light and infrared image fusion target detection.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a target detection method using an illumination guidance and attention mechanism, which comprises the following steps:
step 1: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
step 2: inputting the extracted visible light image features and infrared image features into an inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
step 3: inputting the visible light image features and infrared image features extracted in step 1 into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
step 4: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
step 5: fusing the enhanced visible light image features and infrared image features obtained in steps 2 and 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into a detection network to obtain the position information of the target of interest in the input image, and completing the training of a target detection model;
step 6: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, step 3 further comprises training the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, step 4 further includes training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, step 5 comprises using the modal weights W_R and W_T obtained in step 4 to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
Preferably, step 2 specifically includes inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
The present invention also provides a target detection system using an illumination guidance and attention mechanism, comprising:
an extraction module: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
Preferably, the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a target detection method and a target detection system utilizing an illumination guide and attention mechanism.
Meanwhile, inter-modality interaction attention and intra-modality attention are introduced, and inter-modality features and intra-modality features are enhanced respectively, wherein the inter-modality interaction attention module is used for enhancing extraction of the complementary features of the network on the modalities by amplifying differences among the modalities, and the intra-modality attention module is used for predicting a target mask for each modality and taking the target mask as the attention to enhance the intra-modality features.
And introducing the modal weight into a mask prediction loss function, and adjusting the contribution of the two modes to the loss function to enable the network to pay more attention to the samples difficult to learn so as to achieve high-precision target detection.
Drawings
FIG. 1 is a schematic diagram of a visible light image and an infrared image;
FIG. 2 is a diagram of a deep convolutional neural network model used in the target detection method provided by the present invention;
FIG. 3 is a schematic diagram of an inter-modality interaction attention module in accordance with the present invention;
FIG. 4 is a schematic view of an intra-modal attention module of the present invention;
FIG. 5 is a schematic view of a target mask label in the present invention;
FIG. 6 is a schematic diagram of an illumination-aware network module for illumination guidance according to the present invention;
FIG. 7 is a visualization of a feature map for each stage of the network in the present invention;
FIG. 8 is a graph showing the results of the test in the present invention.
Detailed Description
The invention will be further described with reference to examples of embodiments shown in the drawings.
Example one
The invention provides a visible light and infrared image target detection method using an illumination guidance and attention mechanism. To make the objects, technical solutions and effects of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
Step 1: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features; the two networks do not share parameters.
The visible light image and the infrared image are captured simultaneously in the same scene. Referring to fig. 1, fig. 1 shows visible light and infrared image pairs: the first and third columns are visible light images, and the second and fourth columns are the corresponding infrared images. The visible light image may be a color image.
Referring to fig. 2, fig. 2 shows a deep convolutional neural network model used in the target detection method provided in the present invention.
Preferably, Faster R-CNN is used as the basic framework. The visible light image I_R and the infrared image I_T are respectively input into two deep convolutional neural backbone networks G_R(·) and G_T(·) with identical structures to extract image features; the two networks do not share parameters, giving the visible light image features F_R = G_R(I_R) and the infrared image features F_T = G_T(I_T).
Here the subscript R denotes the visible light image modality and the subscript T denotes the infrared image modality.
Preferably, to enhance the generalization capability of the network, data enhancement with image scaling and random horizontal flipping can be adopted during training; for example, the visible light image I_R and the infrared image I_T are scaled to 640 × 512 pixels.
Preferably, the deep convolutional neural backbone network used as the feature extractor is ResNet-101 (a 101-layer residual neural network) with an FPN (feature pyramid network); the parameters are initialized with Faster R-CNN weights pre-trained on the COCO data set, using only the weights of the feature extraction part.
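For illustration, a minimal PyTorch sketch of this two-branch feature extraction is given below; the use of torchvision's resnet_fpn_backbone helper, the input sizes and the replication of the single-channel infrared image to three channels are assumptions made for the sketch, not part of the claimed method.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Two structurally identical ResNet-101 + FPN backbones that do not share parameters:
# G_R for the visible light branch and G_T for the infrared branch.
backbone_R = resnet_fpn_backbone(backbone_name="resnet101", weights=None)
backbone_T = resnet_fpn_backbone(backbone_name="resnet101", weights=None)

I_R = torch.randn(1, 3, 512, 640)  # visible light image scaled to 640 x 512
I_T = torch.randn(1, 3, 512, 640)  # infrared image (single channel replicated to 3 channels)

F_R = backbone_R(I_R)  # dict of FPN feature maps, one entry per pyramid stage
F_T = backbone_T(I_T)
```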
Step 2: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features.
Specifically, referring to fig. 3, the extracted visible light image feature F_R and infrared image feature F_T are input into the inter-modal differential interaction attention module M_inter(·) shown in fig. 3, which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
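Since the exact enhancement equations appear only as figures in the original publication, the sketch below gives one plausible PyTorch reading of the module as described (difference feature, residual module, global average pooling, tanh attention, cross-modal enhancement); the concrete composition of these operators is an assumption.

```python
import torch
import torch.nn as nn

class InterModalDifferentialAttention(nn.Module):
    """One plausible reading of the inter-modal differential interaction attention:
    the modality difference is turned into channel attention via a residual module,
    global average pooling and tanh, then used to exchange complementary features."""

    def __init__(self, channels):
        super().__init__()
        # simple two-layer block standing in for the residual module R(.)
        self.res_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, F_R, F_T):
        F_D = F_R - F_T                                   # differential feature F_D

        def attn(d):                                      # R(.) -> GAP(.) -> tanh
            r = d + self.res_branch(d)
            return torch.tanh(r.mean(dim=(2, 3), keepdim=True))

        F_R_hat = F_R + F_T * attn(F_D)                   # visible branch absorbs complementary infrared cues
        F_T_hat = F_T + F_R * attn(-F_D)                  # infrared branch absorbs complementary visible cues
        return F_R_hat, F_T_hat
```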
Step 3: inputting the visible light image features and infrared image features extracted in step 1 into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features.
Specifically, referring to fig. 4, the visible light image feature F_R and the infrared image feature F_T extracted in step 1 are input into the intra-modal attention module M_intra(·) shown in fig. 4, which predicts the target masks M_R and M_T of the two modalities and produces the intra-modally enhanced visible light image features F̃_R and infrared image features F̃_T. The mask of each modality is predicted from its own features through a 1 × 1 convolution f(·) followed by a sigmoid activation δ(·), and is then used as attention to enhance that modality's features, where ⊕ and ⊗ denote element-wise summation and multiplication, respectively.
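A sketch of the intra-modal attention module follows, using the 1 × 1 convolution and sigmoid named in the text for mask prediction; the residual form of the enhancement (feature plus masked feature) is an assumption.

```python
import torch
import torch.nn as nn

class IntraModalAttention(nn.Module):
    """Predicts a per-pixel target mask with a 1x1 convolution and a sigmoid,
    then uses the mask as spatial attention to enhance the same modality's features."""

    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)  # f(.), 1 x 1 convolution

    def forward(self, feat):
        mask = torch.sigmoid(self.mask_head(feat))   # predicted target mask in [0, 1]
        enhanced = feat + feat * mask                # mask used as attention (residual form assumed)
        return enhanced, mask
```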
Referring to FIG. 5, FIG. 5 shows the target mask labels; the network uses the target mask labels as the ground truth for the predicted target masks.
Preferably, step 3 further comprises training of the intra-modality attention module so that it can correctly predict the mask of the target.
In particular, the training of the intra-modal attention module adjusts the parameters according to the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient; the parameters are adjusted according to this loss function so that the intra-modal attention module can correctly predict the target mask.
Preferably, stochastic gradient descent (SGD) is used to optimize the network weights, with the SGD momentum set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
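A sketch of the mask supervision described above is given below, with the Dice loss written in its common smoothed form; the per-stage averaging and the way the modality weights enter are written to match the description and are assumptions where the text leaves them implicit.

```python
import torch

def dice_loss(pred, target, s=1.0):
    # Dice loss with smoothing coefficient s; pred and target are (N, H, W) masks in [0, 1].
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return 1.0 - (2.0 * inter + s) / (union + s)

def mask_loss(masks_R, masks_T, labels, W_R, W_T):
    """masks_R / masks_T: per-FPN-stage lists of predicted masks from the visible and
    infrared branches; labels: matching mask labels per stage; W_R, W_T: modality
    weights that re-balance the two branches' contributions to the loss."""
    S = len(labels)
    total = 0.0
    for j in range(S):
        total = total + (W_R * dice_loss(masks_R[j], labels[j])
                         + W_T * dice_loss(masks_T[j], labels[j])).mean()
    return total / S
```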
Step 4: downsampling the visible light image and inputting it into the illumination-aware network to predict the weights of the two modal features.
Referring to fig. 6, fig. 6 shows the illumination-aware network and the gate function. Specifically, the visible light image I_R is downsampled and input into the illumination-aware network N(·) to obtain the probability that the input image was taken in daytime or at night:
C_d = δ(N(I_R)), C_n = 1 − C_d,
where C_d is the probability that the input image is daytime, C_n is the probability that the input image is night, and δ(·) denotes the softmax activation function.
The probability values are then adjusted by a gate function to obtain reasonable modal weights, so that weights are adaptively assigned to the two modalities: the weight W_R of the visible light features is obtained from C_d through the gate function with a learnable parameter α, and the weight of the infrared features is W_T = 1 − W_R.
Preferably, α is initialized to 1 and the downsampling factor is 1/8 of the original image; the illumination-aware network N(·) contains 2 convolutional layers and 3 fully connected layers, each convolutional layer is followed by a ReLU activation layer and a 2 × 2 max pooling layer to activate and compress the features, and the softmax activation function is applied after the last fully connected layer.
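A sketch of the illumination-aware network and the gate function follows; the text fixes the layer counts (2 convolutional layers, 3 fully connected layers, ReLU + 2 × 2 max pooling, final softmax) and the learnable parameter α initialised to 1, while the channel widths, hidden sizes and the exact algebraic form of the gate are assumptions, since the gate equation appears only as a figure.

```python
import torch
import torch.nn as nn

class IlluminationAwareNet(nn.Module):
    def __init__(self, in_ch=3, img_hw=(64, 80)):        # input downsampled to 1/8 of 512 x 640
        super().__init__()
        self.features = nn.Sequential(                    # 2 conv layers, each with ReLU and 2x2 max pooling
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        flat = 32 * (img_hw[0] // 4) * (img_hw[1] // 4)
        self.classifier = nn.Sequential(                  # 3 fully connected layers
            nn.Flatten(), nn.Linear(flat, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Linear(64, 2),
        )
        self.alpha = nn.Parameter(torch.tensor(1.0))      # learnable gate parameter, initialised to 1

    def forward(self, I_R_small):
        logits = self.classifier(self.features(I_R_small))
        C_d, C_n = torch.softmax(logits, dim=1).unbind(dim=1)   # day / night probabilities
        # Assumed gate: push the day probability away from 0.5 by a learnable amount
        # and clamp it to [0, 1] before using it as the visible light modality weight.
        W_R = torch.clamp(0.5 + self.alpha * (C_d - 0.5), 0.0, 1.0)
        W_T = 1.0 - W_R
        return C_d, C_n, W_R, W_T
```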
Preferably, step 4 further comprises training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night.
Specifically, the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image (1 for daytime and 0 for night), and p_i is the predicted probability that the i-th input image is daytime; the parameters are adjusted according to this loss function so that the illumination-aware network can estimate, from the input visible light image, the probability that the scene is daytime or night.
Preferably, the model is trained by back-propagation when this loss function is computed. Stochastic gradient descent (SGD) is used to optimize the network weights, with the SGD momentum set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
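The corresponding day/night supervision is ordinary binary cross entropy over the predicted day probability; the optimiser line simply restates the settings quoted in the text.

```python
import torch.nn.functional as F

def illumination_loss(C_d, y):
    # y = 1 for a daytime image, 0 for a night image; C_d is the predicted day probability.
    return F.binary_cross_entropy(C_d, y.float())

# SGD settings from the text: momentum 0.9, weight decay 0.0001, learning rate 0.005.
# optimizer = torch.optim.SGD(net.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0001)
```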
Step 5: fusing the enhanced visible light and infrared features obtained in steps 2 and 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into the detection network to obtain the position information of the target of interest in the input image, and completing the training of the target detection model.
Specifically, the modal weights W_R and W_T obtained in step 4 are used to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
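A sketch of the fusion step is given below; summing each modality's two enhanced feature maps before weighting and concatenation is an assumed reading of "recombined, weighted and cascaded", since the fusion equation is given only as a figure.

```python
import torch

def fuse_features(FR_inter, FR_intra, FT_inter, FT_intra, W_R, W_T):
    # FR_*/FT_*: (N, C, H, W) enhanced features from the two attention modules.
    # W_R, W_T: per-image modality weights of shape (N,).
    F_R = W_R.view(-1, 1, 1, 1) * (FR_inter + FR_intra)   # recombine and weight the visible branch
    F_T = W_T.view(-1, 1, 1, 1) * (FT_inter + FT_intra)   # recombine and weight the infrared branch
    return torch.cat([F_R, F_T], dim=1)                   # channel concatenation -> fusion feature F_F
```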
Preferably, step 5 further includes training the detection network: the loss function of the detection network is calculated and the parameters are adjusted according to it, yielding the trained target detection model.
This is because the detection network first generates anchor boxes, which are preliminary candidate boxes; an image typically yields about 512 anchor boxes, but their positions are not yet accurate. The regression branch then refines these anchor boxes to obtain prediction boxes with more accurate positions.
In particular, the loss function is
L_det = (1/N) Σ_{i=1..N} [ L_cls(p_i, p*_i) + λ·L_reg(l_i, l*_i) ],
and the parameters are adjusted according to this loss function to obtain the trained target detection model, where N is the total number of anchor boxes, p_i and p*_i are the predicted class score and the true class label of the i-th anchor box, l_i and l*_i are the predicted regression value and the true value of the i-th anchor box, and λ is a weight factor set to 1.
In particular, the loss function of the detection network D(·) comprises the detection network classification loss L_cls and the regression branch loss L_reg. The classification loss uses the softmax cross-entropy loss
L_cls = − Σ_{n=1..K} y_n·log(p_n),
where K is the total number of categories, y_n = 1 when the predicted category n is consistent with the true category label and y_n = 0 otherwise, and p_n is the softmax probability that anchor box i belongs to category n.
The regression branch loss uses the smooth L1 loss,
L_reg(l_i, l*_i) = Σ_{m∈{x,y,w,h}} smooth_L1(l_i^m − l*_i^m), with smooth_L1(x) = 0.5·x² if |x| < 1 and |x| − 0.5 otherwise,
where the regression values are parameterized relative to the anchor box as
l^x = (x_p − x_a)/w_a, l^y = (y_p − y_a)/h_a, l^w = log(w_p/w_a), l^h = log(h_p/h_a),
and analogously l*^x = (x* − x_a)/w_a, l*^y = (y* − y_a)/h_a, l*^w = log(w*/w_a), l*^h = log(h*/h_a).
Here x, y, w, h denote the center coordinates, width and height of a box, and the subscripts p, a and * denote the values of the predicted box obtained from the network regression, the anchor box and the ground-truth box, respectively; y, w and h follow the same convention.
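A sketch of the detection losses named above follows: softmax cross entropy for classification, smooth L1 for regression, and the anchor-relative box parameterisation; how the two terms are normalised and which anchors contribute to the regression term are assumptions, as the original equations are figures.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, lam=1.0):
    # Classification: softmax cross entropy over K classes (cls_logits: (N, K)).
    # Regression: smooth L1 over the encoded (x, y, w, h) offsets.
    L_cls = F.cross_entropy(cls_logits, cls_targets)
    L_reg = F.smooth_l1_loss(box_preds, box_targets)
    return L_cls + lam * L_reg

def encode_boxes(boxes, anchors):
    # boxes, anchors: (N, 4) tensors of (x_center, y_center, w, h).
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(boxes[:, 2] / anchors[:, 2])
    th = torch.log(boxes[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)
```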
Preferably, the model is trained by back-propagation, and stochastic gradient descent (SGD) is used to optimize the network weights so that the network can correctly predict the position and category of the target of interest.
Here the SGD momentum is set to 0.9, the weight decay coefficient to 0.0001, and the learning rate to 0.005.
Step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
Specifically, the target detection model is trained with the above steps; after training is completed, the image to be detected is input into the trained target detection model, a forward pass of the network produces the output of the network model, and non-maximum suppression is applied to this output to obtain the final target detection result.
Preferably, the non-maximum suppression threshold is set to 0.5.
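At inference time the raw detections are filtered with non-maximum suppression; the snippet below uses torchvision's nms with the 0.5 threshold stated above.

```python
from torchvision.ops import nms

def postprocess(boxes, scores, iou_threshold=0.5):
    # boxes: (M, 4) in (x1, y1, x2, y2) format; scores: (M,) class confidences.
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```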
Example two
For the visible light and infrared image target detection method using illumination guidance and attention mechanisms provided in the first embodiment, this embodiment provides test results to evaluate the performance of the method.
In the tests, the FLIR-aligned data set is used for accuracy evaluation. FLIR-aligned is a registered subset of the FLIR ADAS data set: unpaired images are removed and part of the image pairs are manually registered, leaving 4129 image pairs as the training set and 1013 image pairs as the test set. Only three classes of objects are retained in the data set: person, bicycle and vehicle.
The visualization and detection results of the tests are shown in fig. 7 and fig. 8. In fig. 7, from left to right are the input image, the original feature map, the feature map after the differential interaction attention module, the feature map after the intra-modal attention module, and the fused feature map. In fig. 8, solid boxes indicate correct detections, while the dashed and dotted boxes indicate missed detections and false alarms, respectively.
The following metric is used to measure detection accuracy: Average Precision (AP).
The experiments report the AP of each class at an intersection-over-union (IoU) threshold of 0.5 (AP50), the mean AP over all classes at an IoU threshold of 0.5 (mAP50), and the mean AP averaged over the 10 IoU thresholds 0.5:0.05:0.95 (mAP); the results are shown in Table 1.
Table 1 Results of the visible light and infrared image fusion target detection algorithm on FLIR-aligned
As can be seen from the quantitative and qualitative analysis of the detection accuracy in Table 1, the detection accuracy of the proposed method on the FLIR-aligned data set reaches the leading level.
EXAMPLE III
An embodiment of the present invention provides a target detection system using an illumination guidance and attention mechanism, the system comprising:
an extraction module: inputting the paired visible light image and infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain the target detection result.
Preferably, the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
where T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
Preferably, the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night; the loss function used when training the illumination-aware network is
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
where T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
Preferably, the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·); the class confidence p_i and predicted regression value l_i of the i-th anchor box are then (p_i, l_i) = D(F_F), where CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
Preferably, the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T. The differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, where ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
It should be understood that parts of the specification not set forth in detail are of the prior art.
The protective scope of the present invention is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present invention by those skilled in the art without departing from the scope and spirit of the present invention. It is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (10)

1. A method of target detection using an illumination guidance and attention mechanism, characterized by:
step 1: respectively inputting the visible light image and the infrared image pair into two paths of deep convolution neural backbone networks with the same structure to extract image characteristics, wherein the two paths of networks do not share parameters;
step 2: inputting the extracted visible light image features and infrared image features into an inter-modality differential interaction attention module to obtain visible light image features and infrared image features with enhanced difference parts;
step 3: inputting the visible light image features and infrared image features extracted in step 1 into an intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
step 4: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
step 5: fusing the enhanced visible light image features and infrared image features obtained in step 2 and step 3 with the weights obtained in step 4 to obtain a fusion feature, feeding the fusion feature into a detection network to obtain the position information of the target of interest in the input image, and completing the training of a target detection model;
step 6: and inputting the image to be detected into the trained target detection model to obtain a target detection result.
2. The method of claim 1, wherein: said step 3 further comprises training the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
wherein T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
3. The method of claim 1, wherein: step 4 further comprises training the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night, the loss function used when training the illumination-aware network being
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
wherein T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
4. The method of claim 1, wherein: step 5 comprises using the modal weights W_R and W_T obtained in step 4 to recombine, weight and concatenate the enhanced visible light features and infrared features obtained in step 2 and step 3, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the i-th anchor box being (p_i, l_i) = D(F_F), wherein CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced in step 2 and step 3, respectively.
5. The method of claim 1, wherein: step 2 comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T; the differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, wherein ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
6. A target detection system using an illumination guidance and attention mechanism, characterized in that the system comprises:
an extraction module: inputting the visible light image and the infrared image into two deep convolutional neural backbone networks with identical structures to extract image features, the two networks not sharing parameters;
an inter-modal differential interaction attention module: inputting the extracted visible light image features and infrared image features into the inter-modal differential interaction attention module to obtain difference-enhanced visible light image features and infrared image features;
an intra-modal attention module: inputting the extracted visible light image features and infrared image features into the intra-modal attention module, predicting a target mask, and using the mask as attention to enhance the intra-modal features, obtaining intra-modally enhanced visible light image features and infrared image features;
an illumination perception module: downsampling the visible light image and inputting it into an illumination-aware network to predict the weights of the two modal features;
a fusion module: using the weights obtained by the illumination perception module to recombine, weight and concatenate the enhanced visible light image features and infrared image features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, obtaining a fusion feature, feeding the fusion feature into a detection network to obtain the category confidence and position information of the target of interest in the input image, and completing the training of a target detection model;
a target detection module: inputting the image to be detected into the trained target detection model to obtain a target detection result.
7. The system of claim 6, wherein: the intra-modal attention module further comprises training of the intra-modal attention module so that it can correctly predict the target mask, with the loss function
L_mask = (1/(T·S)) Σ_{i=1..T} Σ_{j=1..S} [ W_R·L_dice(M_Rij, Y_ij) + W_T·L_dice(M_Tij, Y_ij) ],
wherein T is the total number of samples, S is the number of feature pyramid stages, Y_ij is the mask label of the i-th input sample at the j-th feature pyramid stage, M_Rij and M_Tij respectively denote the target masks predicted for the i-th input sample by the visible light branch and the infrared branch at the j-th feature pyramid stage, W_R and W_T are the weights of the visible light modality and the infrared modality, and L_dice(·,·) is the Dice loss
L_dice(M, Y) = 1 − (2·Σ(M·Y) + s) / (ΣM + ΣY + s),
with s the smoothing coefficient.
8. The system of claim 6, wherein: the illumination perception module further comprises training of the illumination-aware network so that it can estimate, from the features of the input visible light image, the probability that the scene is daytime or night, the loss function used when training the illumination-aware network being
L_illum = −(1/T) Σ_{i=1..T} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ],
wherein T is the total number of samples, y_i is the class label of the i-th input image, and p_i is the predicted probability that the i-th input image is daytime.
9. The system of claim 6, wherein: the fusion module comprises using the obtained modal weights W_R and W_T to recombine, weight and concatenate the enhanced visible light features and infrared features obtained by the inter-modal differential interaction attention module and the intra-modal attention module, yielding the fusion feature
F_F = CONCAT( W_R ⊗ (F̂_R^inter ⊕ F̂_R^intra), W_T ⊗ (F̂_T^inter ⊕ F̂_T^intra) ),
which is fed into the detection network D(·), the class confidence p_i and predicted regression value l_i of the i-th anchor box being (p_i, l_i) = D(F_F), wherein CONCAT(·) denotes channel concatenation, ⊕ and ⊗ denote element-wise summation and multiplication, and F̂^inter and F̂^intra denote the enhanced features produced by the inter-modal and intra-modal attention modules, respectively.
10. The system of claim 6, wherein: the inter-modal differential interaction attention module comprises inputting the extracted visible light image feature F_R and infrared image feature F_T into the inter-modal differential interaction attention module M_inter(·), which amplifies the difference between F_R and F_T so as to strengthen the network model's extraction of complementary features, yielding the difference-enhanced visible light image feature F̂_R and infrared image feature F̂_T; the differential feature F_D = F_R ⊖ F_T is computed and converted into channel attention through a residual module R(·), global average pooling GAP(·) and a tanh activation σ(·), and this attention is used to enhance the features of each modality, wherein ⊖, ⊕ and ⊗ denote element-wise subtraction, summation and multiplication, respectively.
CN202210734314.9A 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism Pending CN115131640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210734314.9A CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210734314.9A CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Publications (1)

Publication Number Publication Date
CN115131640A true CN115131640A (en) 2022-09-30

Family

ID=83379399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210734314.9A Pending CN115131640A (en) 2022-06-27 2022-06-27 Target detection method and system utilizing illumination guide and attention mechanism

Country Status (1)

Country Link
CN (1) CN115131640A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631510A (en) * 2022-10-24 2023-01-20 智慧眼科技股份有限公司 Pedestrian re-identification method and device, computer equipment and storage medium
CN115393684A (en) * 2022-10-27 2022-11-25 松立控股集团股份有限公司 Anti-interference target detection method based on automatic driving scene multi-mode fusion
CN116740410A (en) * 2023-04-21 2023-09-12 中国地质大学(武汉) Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment
CN116740410B (en) * 2023-04-21 2024-01-30 中国地质大学(武汉) Bimodal target detection model construction method, bimodal target detection model detection method and computer equipment
CN116778227A (en) * 2023-05-12 2023-09-19 昆明理工大学 Target detection method, system and equipment based on infrared image and visible light image
CN116778227B (en) * 2023-05-12 2024-05-10 昆明理工大学 Target detection method, system and equipment based on infrared image and visible light image
CN117078920A (en) * 2023-10-16 2023-11-17 昆明理工大学 Infrared-visible light target detection method based on deformable attention mechanism
CN117078920B (en) * 2023-10-16 2024-01-23 昆明理工大学 Infrared-visible light target detection method based on deformable attention mechanism

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN115131640A (en) Target detection method and system utilizing illumination guide and attention mechanism
CN108399362B (en) Rapid pedestrian detection method and device
CN111126258B (en) Image recognition method and related device
CN110458165B (en) Natural scene text detection method introducing attention mechanism
Yang et al. Single image haze removal via region detection network
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN113111814B (en) Regularization constraint-based semi-supervised pedestrian re-identification method and device
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN110222615A (en) The target identification method that is blocked based on InceptionV3 network
CN113781519A (en) Target tracking method and target tracking device
CN111539351A (en) Multi-task cascaded face frame selection comparison method
CN117237740B (en) SAR image classification method based on CNN and Transformer
CN114998801A (en) Forest fire smoke video detection method based on contrast self-supervision learning network
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117237411A (en) Pedestrian multi-target tracking method based on deep learning
CN116152699B (en) Real-time moving target detection method for hydropower plant video monitoring system
CN115063428B (en) Spatial dim small target detection method based on deep reinforcement learning
CN116704309A (en) Image defogging identification method and system based on improved generation of countermeasure network
CN115393901A (en) Cross-modal pedestrian re-identification method and computer readable storage medium
CN116110074A (en) Dynamic small-strand pedestrian recognition method based on graph neural network
CN114694042A (en) Disguised person target detection method based on improved Scaled-YOLOv4
CN114581353A (en) Infrared image processing method and device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination