CN116258934A - Feature enhancement-based infrared-visible light fusion method, system and readable storage medium - Google Patents

Feature enhancement-based infrared-visible light fusion method, system and readable storage medium

Info

Publication number
CN116258934A
Authority
CN
China
Prior art keywords
feature
rgb
infrared
visible light
enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310267771.6A
Other languages
Chinese (zh)
Inventor
李智勇
肖志强
付浩龙
刘函豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202310267771.6A priority Critical patent/CN116258934A/en
Publication of CN116258934A publication Critical patent/CN116258934A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared-visible light fusion method based on feature enhancement. A YOLOv5 feature extraction network with a dual-stream backbone is used to extract deep features from visible light and infrared images, and symmetric complementary masks reduce the network's bias toward a single modality. To address the differences between visible light and infrared images, a cross feature enhancement module is added to the fusion module to improve the intra-modality feature representation, and a long-range dependency fusion module is added so that the enhanced features are fused by correlating the position encodings of the multi-modal features, which improves the joint utilization of multi-modal images and the detection performance in complex scenes. The application also provides an infrared-visible light fusion system based on feature enhancement and a readable storage medium.

Description

Feature enhancement-based infrared-visible light fusion method, system and readable storage medium
Technical Field
The application belongs to the technical field of target detection, and particularly relates to an infrared-visible light fusion method and system based on feature enhancement and a readable storage medium.
Background
The target detection algorithm is widely applied to the fields of automatic driving, monitoring, remote sensing and the like. However, due to limitations of the visible light sensor, most target detection methods of visible light images cannot achieve satisfactory accuracy and are sensitive to severe environmental factors such as rain, fog, and weak light. In contrast, infrared sensors perform well in the harsh environments described above. However, the infrared sensor is greatly affected by temperature. In a high temperature environment with good illumination, visible light images have rich texture and color information, while infrared images have difficulty in effectively distinguishing foreground and background. Therefore, by fusing complementary information from the visible light and the infrared sensor, the accuracy, reliability and robustness of the detection algorithm can be further improved.
In the prior art, infrared-visible light multi-modal target detection work is mainly divided into traditional methods and deep learning methods. For the traditional methods, features are extracted from the visible light and infrared images with the Histogram of Oriented Gradients (HOG), and the concatenated fusion features are fed into a Support Vector Machine (SVM) to obtain detection results; however, the feature extraction capability of manually designed operators is limited, and it is difficult to obtain optimal feature extraction results. For the deep learning methods, deep learning shows advantages in visible-infrared fusion target detection thanks to its strong representation learning capability, and four fusion detection strategies have been designed based on YOLOv4: image transition fusion, early fusion, mid-term fusion, and late fusion. However, methods using Convolutional Neural Networks (CNNs) are constrained by the non-global receptive field of convolution operators, so information is fused only within local regions. Although these methods achieve higher performance than single-modality detection methods, they often lack long-range dependencies and do not make full use of the complementarity between modalities, resulting in unsatisfactory detection results.
Therefore, there is a need to provide an infrared-visible light fusion method, system and readable storage medium based on feature enhancement, so as to solve the above-mentioned problems in the background art.
Disclosure of Invention
The purpose of the application is to provide an infrared-visible light fusion method, an infrared-visible light fusion system, and a readable storage medium based on feature enhancement, which use a YOLOv5 feature extraction network with a dual-stream backbone to extract deep features from visible light and infrared images and reduce the network's bias toward a single modality through symmetric complementary masks; to address the differences between visible light and infrared images, a cross feature enhancement module is added to the fusion module to improve the intra-modality feature representation, and a long-range dependency fusion module is added so that the enhanced features are fused by correlating the position encodings of the multi-modal features, improving the joint utilization of multi-modal images and the detection performance in complex scenes.
In order to solve the technical problems, the application is realized as follows:
an infrared-visible light fusion method based on feature enhancement comprises the following steps:
Data acquisition: collecting a multi-target data set and preprocessing it, wherein the multi-target data set comprises a visible light image and an infrared image;
Feature extraction: constructing a dual-stream backbone feature extraction network comprising two branches with the same structure, and feeding the visible light image and the infrared image into the two branches respectively to extract deep features, thereby obtaining visible light features and infrared features;
Feature fusion: constructing a feature fusion network comprising a cross feature enhancement module and a long-range dependency fusion module, the cross feature enhancement module comprising a channel attention branch and a spatial attention branch arranged in series; feeding the visible light features and the infrared features into the channel attention branch for enhancement, adding the visible light feature before enhancement to the infrared feature enhanced by the channel attention branch to obtain a first feature, and adding the infrared feature before enhancement to the visible light feature enhanced by the channel attention branch to obtain a second feature; feeding the first feature and the second feature into the spatial attention branch for enhancement, adding the first feature to the second feature enhanced by the spatial attention branch to obtain the final output of the visible light feature, and adding the second feature to the first feature enhanced by the spatial attention branch to obtain the final output of the infrared feature; and feeding the final outputs of the visible light feature and the infrared feature into the long-range dependency fusion module, where they are fused through the correlation of position encodings based on a Swin Transformer model.
Preferably, the preprocessing is as follows: the visible light image and the infrared image in the multi-target data set are processed with a data enhancement method that generates an image mask from random regions. The specific process is: the image is divided into a 10 × 10 checkerboard according to its size, and in each row two blocks are set to zero with a probability of 30% to form the image mask; the image mask is then divided by rows into two complementary masks, one used as the mask for the visible light image and the other as the mask for the infrared image. The mask generation process is expressed as:

$$\mathrm{Generate}_{mask} = \mathrm{RGB}_{mask} \cup \mathrm{IR}_{mask};$$

$$\mathrm{RGB}_{mask} \mid \mathrm{IR}_{mask} = 1;$$

where $\mathrm{RGB}_{mask}$ is the mask of the visible light image, $\mathrm{IR}_{mask}$ is the mask of the infrared image, and $\mathrm{Generate}_{mask}$ is the total mask.
Preferably, the dual-stream backbone feature extraction network is a YOLOv5 feature extraction network with a dual-stream backbone. The extracted visible light features are expressed as $X_{RGB} \in \mathbb{R}^{W \times H \times C}$ and the extracted infrared features are expressed as $X_{IR} \in \mathbb{R}^{W \times H \times C}$, where $\mathbb{R}^{W \times H \times C}$ denotes a three-dimensional matrix, $W$ is the width, $H$ is the height, and $C$ is the number of channels.
Preferably, the enhancement process of the channel attention branch is as follows: the input features are fully folded along one direction $q$ while high resolution is maintained along the orthogonal direction $v$; the operation is expressed as:

$$W_{RGBq} = \sigma_1(F_1(X_{RGB})); \qquad W_{RGBv} = \sigma_2(F_2(X_{RGB}));$$

$$W_{IRq} = \sigma_1(F_1(X_{IR})); \qquad W_{IRv} = \sigma_2(F_2(X_{IR}));$$

where $W_{RGBq}$ is the information of the visible light feature in the $q$ direction; $W_{RGBv}$ is the information of the visible light feature in the $v$ direction; $W_{IRq}$ is the information of the infrared feature in the $q$ direction; $W_{IRv}$ is the information of the infrared feature in the $v$ direction; $\sigma_1$ and $\sigma_2$ are tensor reshaping operators; and $F_1(\cdot)$ and $F_2(\cdot)$ are $1 \times 1$ convolution operations.

$W_{RGBq}$ and $W_{IRq}$, which represent the weights of the visible light feature and the infrared feature respectively, are input into a Softmax function for classification, and the weight distributions of the visible light feature and the infrared feature are output:

$$W_{RGBk} = \mathrm{Softmax}(W_{RGBq}); \qquad W_{IRk} = \mathrm{Softmax}(W_{IRq});$$

where $W_{RGBk}$ is the weight of the visible light feature and $W_{IRk}$ is the weight of the infrared feature.

The information $W_{RGBv}$ is multiplied by the weight $W_{RGBk}$, and the information $W_{IRv}$ is multiplied by the weight $W_{IRk}$; a $1 \times 1$ convolution operation is then performed, the channel dimension is raised from $C/2$ to $C$ with a normalization step, and a Sigmoid function keeps all parameters in the range 0-1:

$$W_{RGBz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{RGBv} \times W_{RGBk})));$$

$$W_{IRz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{IRv} \times W_{IRk})));$$

where $W_{RGBz}$ is the information of the visible light feature in the $z$ direction; $W_{IRz}$ is the information of the infrared feature in the $z$ direction; "$\times$" denotes a matrix dot-product operation; $F_3(\cdot)$ is a $1 \times 1$ convolution operation; and $\sigma_3$ is a tensor reshaping operator.

$X_{RGB}$ is then multiplied with $W_{RGBz}$ at the channel level to obtain the lower-noise feature $W_{RGBln}$, and $X_{IR}$ is multiplied with $W_{IRz}$ at the channel level to obtain the lower-noise feature $W_{IRln}$; the calculation process is expressed as:

$$W_{RGBln} = X_{RGB} \odot W_{RGBz}; \qquad W_{IRln} = X_{IR} \odot W_{IRz};$$

The feature $W_{IRln}$ is added to $X_{RGB}$ for recalibration enhancement to obtain the first feature $A_{RGBch}$, and the feature $W_{RGBln}$ is added to $X_{IR}$ for recalibration enhancement to obtain the second feature $A_{IRch}$; the calculation process is expressed as:

$$A_{RGBch} = W_{IRln} + X_{RGB}; \qquad A_{IRch} = W_{RGBln} + X_{IR}.$$
preferably, the enhancement procedure of the spatial attention branch is as follows: the input features are fully folded in direction q while maintaining high resolution in direction v, the process of operation is expressed as:
A RGBq =σ 4 (F GP (F 4 (A RGBch )));A RGBv =σ 5 (F 5 (A RGBch ));
A IRq =σ 4 (F GP (F 4 (A IRch )));A IRv =σ 5 (F 5 (A IRch ));
in sigma 4 Sum sigma 5 All represent tensor remodelling operators; f (F) 4 (. Cndot.) and F 5 (. Cndot.) all represent 1X 1 convolution operations; f (F) GP (. Cndot.) represents a global pooling operator,
Figure BDA0004133556140000042
in A way RGBq And A IRq Weights respectively representing the first feature and the second feature are input into a Softmax function for classification, weight distribution of the first feature and the second feature is output, and the calculation process is represented as follows:
Figure BDA0004133556140000051
wherein A is RGBk Weights representing the first characteristic, A IRk Weights representing the second features;
will information A RGBv Multiplied by weight A RGBk Will information A IRv Multiplied by weight A IRk Then sequentially performing information completion, remodelling and Sigmoid functions:
A RGBz =Sigmoid(σ 6 (A RGBv ×A RFBk ));
A IRz =Sigmoid(σ 6 (A IRv ×A IRk ));
wherein A is RGBz ∈R 1×HW ;A IRz ∈R 1×HW A spatial gate representing the first feature and the second feature, respectively;
will A RGBch And A RGBz Multiplying by A iRch And A IRz Multiplication to obtain spatially enhanced features A of the first and second features, respectively RGBln And A IRln The calculation process is expressed as:
A RGBln =A RGBch ⊙A RGBz ;A IRln =A IRch ⊙A IRz
feature A IRln And A is a RGBch Adding to recalibrate enhancement to obtain final output X of visible light characteristics RGBout The method comprises the steps of carrying out a first treatment on the surface of the Feature A RGBln And A is a IRch Adding to recalibrate the enhancement to obtain the final output X of the infrared signature IRout The calculation process is expressed as:
$$X_{RGBout} = A_{IRln} + A_{RGBch}; \qquad X_{IRout} = A_{RGBln} + A_{IRch}.$$

Preferably, the feature fusion process is as follows:

The feature maps are alternately divided into $M \times M$ windows using a shifted-window partitioning method; if a feature map is smaller than $M \times M$, it is padded to the size $M \times M$. The window of the next module is then shifted by $(M/2, M/2)$ pixels relative to the previous one. With this computation scheme, following the standard Swin Transformer block formulation, the calculation is:

$$\hat{F}^{l} = \text{W-MSA}(\mathrm{LN}(F^{l-1})) + F^{l-1};$$

$$F^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{F}^{l})) + \hat{F}^{l};$$

$$\hat{F}^{l+1} = \text{MW-MSA}(\mathrm{LN}(F^{l})) + F^{l};$$

$$F^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{F}^{l+1})) + \hat{F}^{l+1};$$

where the input $F^{l-1} = F_i$ is the joint input of the visible light and infrared features, $F_i = \{X_{RGBout}, X_{IRout}\}$; the output $F^{l+1} = F_o$ is the output feature of the Transformer block; $\hat{F}^{l}$ and $\hat{F}^{l+1}$ are intermediate variables; and W-MSA and MW-MSA are the window multi-head self-attention operation and the masked-window multi-head self-attention operation, respectively.

Considering the local features after window partitioning, in the self-attention computation an input visible light feature map $F_{RGB} \in \mathbb{R}^{8 \times 8 \times C}$ and an input infrared feature map $F_{IR} \in \mathbb{R}^{8 \times 8 \times C}$ are given. Each feature map is flattened and its matrix arranged into a sequence, yielding the sentences $I_{RGB} \in \mathbb{R}^{64 \times C}$ and $I_{IR} \in \mathbb{R}^{64 \times C}$. The input sentence $I \in \mathbb{R}^{128 \times C}$ is then obtained by concatenating the sentences $I_{RGB}$ and $I_{IR}$. The input sentence $I$ is projected by three weight matrices to obtain a set of query $Q$, key $K$, and value $V$:

$$Q = IW_Q, \qquad K = IW_K, \qquad V = IW_V;$$

where $W_Q \in \mathbb{R}^{C \times 128}$, $W_K \in \mathbb{R}^{C \times 128}$ and $W_V \in \mathbb{R}^{C \times 128}$ are weight matrices.

The self-attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + \mathrm{FPRE}\right)V;$$

where $d$ is the dimension of the query $Q$ or the key $K$; the superscript $T$ denotes matrix transposition; and FPRE denotes the position encoding of the visible light and infrared features, which contains four types of position information: the visible light position information $\mathrm{RPE}_{RGB}$, the infrared position information $\mathrm{RPE}_{IR}$, the visible-to-infrared relative position information $\mathrm{RPE}_{RGB\text{-}IR}$, and the infrared-to-visible relative position information $\mathrm{RPE}_{IR\text{-}RGB}$ (the explicit expression of FPRE is given as an image in the original publication and is not reproduced here).

Through the above operations, the visible light feature $X_{RGBout}$ and the infrared feature $X_{IRout}$ output after deep interaction are obtained, and the two are added to obtain the final fused feature $F_{fusion}$:

$$F_{fusion} = X_{RGBout} + X_{IRout}.$$
The application also provides an infrared-visible light fusion system based on feature enhancement, comprising:
and a data acquisition module: the method comprises the steps of acquiring a multi-target data set, and preprocessing the multi-target data set, wherein the multi-target data set comprises a visible light image and an infrared image.
And the feature extraction module is used for: the method is used for constructing a double-flow trunk feature extraction network, the double-flow trunk feature extraction network comprises two branches with the same structure, and the visible light image and the infrared image are respectively sent into the two branches to extract deep features, so that visible light features and infrared features are obtained.
And a feature fusion module: the method comprises the steps that a feature fusion network is constructed, the feature fusion network comprises a cross feature enhancement module and a long-distance dependent fusion module, the cross feature enhancement module comprises a channel attention branch and a space attention branch which are arranged in series, the visible light features and the infrared features are sent into the channel attention branch to be enhanced, the visible light features before enhancement and the infrared features after enhancement through the channel attention branch are added to obtain a first feature, and the infrared features before enhancement and the visible light features after enhancement through the channel attention branch are added to obtain a second feature; the first feature and the second feature are sent to the space attention branch for enhancement, the first feature is added with the second feature enhanced by the space attention branch to obtain the final output of the visible light feature, and the second feature is added with the first feature enhanced by the space attention branch to obtain the final output of the infrared feature; and sending the final output of the visible light characteristic and the infrared characteristic to the long-distance dependent fusion module, and fusing by correlation of position codes based on a Swin transducer model.
The present application also provides a readable storage medium having one or more programs stored therein, the one or more programs being executable by one or more processors to implement the steps of the feature-enhanced infrared-visible fusion method described above.
The beneficial effects of this application lie in:
(1) Noise data is introduced during data preprocessing to force the network to learn complementary modality information, thereby reducing the network's bias toward a single modality;
(2) The cross feature enhancement module performs feature enhancement from the two perspectives of channel and space, including the exchange of complementary information between the two modalities, so that the multi-modal features are fused better, the difference between the two modalities is effectively overcome, the problem of severely insufficient visible-light information is alleviated, and the detection performance of the network is improved;
(3) The long-range dependency fusion module focuses on deep interactive information enhancement: the features of the two modalities are partitioned synchronously by the shifted window, deep interactive information enhancement is performed on the fused features through the multi-head self-attention mechanism, the adaptability of the fused features in complex illumination scenes is improved, and missed detections and false detections of the detector are reduced.
Drawings
FIG. 1 shows a flow chart of the feature enhancement based infrared-visible fusion method provided herein;
FIG. 2 shows a schematic diagram of multi-objective dataset preprocessing;
FIG. 3 illustrates the architecture of a cross-feature enhancement module;
FIG. 4 illustrates the architecture of the long-range dependency fusion module;
FIG. 5 shows a flow chart of the long-range attention fusion;
FIG. 6 is a schematic diagram showing the test results of the model of the present application in the first embodiment.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Referring to fig. 1-6 in combination, the present invention provides an infrared-visible light fusion method based on feature enhancement, comprising the following steps:
Data acquisition: a multi-target data set is acquired and preprocessed; the multi-target data set comprises a visible light image and an infrared image.
The preprocessing is performed as follows: the visible light image and the infrared image in the multi-target data set are processed with a data enhancement method that generates an image mask from random regions. The specific process is: the image is divided into a 10 × 10 checkerboard according to its size, and in each row two blocks are set to zero with a probability of 30% to generate the image mask; the image mask is then divided by rows into two complementary masks, one used as the mask for the visible light image and the other as the mask for the infrared image. The mask generation process is expressed as:

$$\mathrm{Generate}_{mask} = \mathrm{RGB}_{mask} \cup \mathrm{IR}_{mask};$$

$$\mathrm{RGB}_{mask} \mid \mathrm{IR}_{mask} = 1;$$

where $\mathrm{RGB}_{mask}$ is the mask of the visible light image, $\mathrm{IR}_{mask}$ is the mask of the infrared image, and $\mathrm{Generate}_{mask}$ is the total mask.
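As a purely illustrative aid, the following is a minimal PyTorch-style sketch of this complementary-mask preprocessing; the function name, the row-alternation rule used to split the mask between the two modalities, and the tensor layout are assumptions made for the sketch rather than details taken from the original disclosure.

```python
import torch

def complementary_masks(h, w, grid=10, zeros_per_row=2, p=0.3):
    """Sketch of the random-region mask generation described above.

    The image is divided into a grid x grid checkerboard; in each row,
    `zeros_per_row` blocks are zeroed with probability `p`.  The zeroed
    blocks are assigned alternately (by row, an assumed rule) to the RGB
    mask or the IR mask, so every block is kept by at least one modality
    (RGB_mask | IR_mask = 1).
    """
    bh, bw = h // grid, w // grid                      # block size
    rgb_mask = torch.ones(h, w)
    ir_mask = torch.ones(h, w)
    for row in range(grid):
        if torch.rand(1).item() > p:                   # drop blocks in this row with probability p
            continue
        cols = torch.randperm(grid)[:zeros_per_row]    # blocks to zero in this row
        target = rgb_mask if row % 2 == 0 else ir_mask # assumed row-alternation split
        for c in cols:
            target[row * bh:(row + 1) * bh, c * bw:(c + 1) * bw] = 0
    return rgb_mask, ir_mask

# usage sketch: masked_rgb = rgb_image * rgb_mask; masked_ir = ir_image * ir_mask
```

Because the two masks never zero the same block, the union of the kept regions covers the whole image, which is what forces the network to rely on complementary information from the other modality wherever one modality is masked.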
Feature extraction: a dual-stream backbone feature extraction network is constructed, comprising two branches with the same structure; the visible light image and the infrared image are fed into the two branches respectively to extract deep features, obtaining visible light features and infrared features.
The dual-stream backbone feature extraction network is obtained by redesigning the YOLOv5 feature extraction network into a dual-stream backbone. The extracted visible light features are expressed as $X_{RGB} \in \mathbb{R}^{W \times H \times C}$ and the extracted infrared features are expressed as $X_{IR} \in \mathbb{R}^{W \times H \times C}$, where $\mathbb{R}^{W \times H \times C}$ denotes a three-dimensional matrix, $W$ is the width, $H$ is the height, and $C$ is the number of channels.
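A minimal sketch of the dual-stream backbone idea is given below; `make_branch` is an assumed factory for a YOLOv5-style feature extractor, since the original does not specify a programming interface.

```python
import torch.nn as nn

class DualStreamBackbone(nn.Module):
    """Two structurally identical branches: one for the RGB image, one for the IR image."""

    def __init__(self, make_branch):
        super().__init__()
        # make_branch() builds one YOLOv5-style feature extractor;
        # the two branches share the architecture but not the weights.
        self.rgb_branch = make_branch()
        self.ir_branch = make_branch()

    def forward(self, x_rgb, x_ir):
        f_rgb = self.rgb_branch(x_rgb)   # deep visible-light features X_RGB
        f_ir = self.ir_branch(x_ir)      # deep infrared features X_IR
        return f_rgb, f_ir
```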
Feature fusion: a feature fusion network is constructed, comprising a cross feature enhancement module and a long-range dependency fusion module. The cross feature enhancement module comprises a channel attention branch and a spatial attention branch arranged in series. The visible light features and the infrared features are fed into the channel attention branch for enhancement; the visible light feature before enhancement is added to the infrared feature enhanced by the channel attention branch to obtain a first feature, and the infrared feature before enhancement is added to the visible light feature enhanced by the channel attention branch to obtain a second feature. The first feature and the second feature are then fed into the spatial attention branch for enhancement; the first feature is added to the second feature enhanced by the spatial attention branch to obtain the final output of the visible light feature, and the second feature is added to the first feature enhanced by the spatial attention branch to obtain the final output of the infrared feature. The final outputs of the visible light feature and the infrared feature are fed into the long-range dependency fusion module and fused through the correlation of position encodings based on a Swin Transformer model.
The enhancement process of the channel attention branch is as follows: the input features are fully folded along one direction $q$ while high resolution is maintained along the orthogonal direction $v$; the operation is expressed as:

$$W_{RGBq} = \sigma_1(F_1(X_{RGB})); \qquad W_{RGBv} = \sigma_2(F_2(X_{RGB}));$$

$$W_{IRq} = \sigma_1(F_1(X_{IR})); \qquad W_{IRv} = \sigma_2(F_2(X_{IR}));$$

where $W_{RGBq}$ is the information of the visible light feature in the $q$ direction; $W_{RGBv}$ is the information of the visible light feature in the $v$ direction; $W_{IRq}$ is the information of the infrared feature in the $q$ direction; $W_{IRv}$ is the information of the infrared feature in the $v$ direction; $\sigma_1$ and $\sigma_2$ are tensor reshaping operators; and $F_1(\cdot)$ and $F_2(\cdot)$ are $1 \times 1$ convolution operations.

$W_{RGBq}$ and $W_{IRq}$, which represent the weights of the visible light feature and the infrared feature respectively, are input into a Softmax function for classification, and the weight distributions of the visible light feature and the infrared feature are output:

$$W_{RGBk} = \mathrm{Softmax}(W_{RGBq}); \qquad W_{IRk} = \mathrm{Softmax}(W_{IRq});$$

where $W_{RGBk}$ is the weight of the visible light feature and $W_{IRk}$ is the weight of the infrared feature.

In the calculation of the weight distribution the information is heavily compressed. To preserve the information intensity, the information $W_{RGBv}$ is multiplied by the weight $W_{RGBk}$, and the information $W_{IRv}$ is multiplied by the weight $W_{IRk}$; a $1 \times 1$ convolution operation is then performed, the channel dimension is raised from $C/2$ to $C$ with a normalization step, and a Sigmoid function keeps all parameters in the range 0-1:

$$W_{RGBz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{RGBv} \times W_{RGBk})));$$

$$W_{IRz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{IRv} \times W_{IRk})));$$

where $W_{RGBz}$ is the information of the visible light feature in the $z$ direction; $W_{IRz}$ is the information of the infrared feature in the $z$ direction; "$\times$" denotes a matrix dot-product operation; $F_3(\cdot)$ is a $1 \times 1$ convolution operation; and $\sigma_3$ is a tensor reshaping operator.

Through this operation, the visible appearance and the geometric features carrying the largest amount of information in the intra-modality representation are used to effectively suppress the feature noise in the inter-modality representation.

$X_{RGB}$ is then multiplied with $W_{RGBz}$ at the channel level to obtain the lower-noise feature $W_{RGBln}$, and $X_{IR}$ is multiplied with $W_{IRz}$ at the channel level to obtain the lower-noise feature $W_{IRln}$; the calculation process is expressed as:

$$W_{RGBln} = X_{RGB} \odot W_{RGBz}; \qquad W_{IRln} = X_{IR} \odot W_{IRz};$$

where $\odot$ denotes the Hadamard product.

The feature $W_{IRln}$ is added to $X_{RGB}$ for recalibration enhancement to obtain the first feature $A_{RGBch}$, and the feature $W_{RGBln}$ is added to $X_{IR}$ for recalibration enhancement to obtain the second feature $A_{IRch}$; the calculation process is expressed as:

$$A_{RGBch} = W_{IRln} + X_{RGB}; \qquad A_{IRch} = W_{RGBln} + X_{IR}.$$
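For illustration only, a compact PyTorch sketch of such a channel attention branch with cross recalibration might look as follows; the class name, the use of one shared gate computation for both modalities, and the exact reshaping choices are simplifying assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelCrossEnhance(nn.Module):
    """Channel attention with cross-modal recalibration (illustrative sketch).

    Each modality is folded along one direction (q) while high resolution is
    kept along the other (v); the resulting channel gate re-weights the
    feature, and the enhanced feature of the *other* modality is added to
    the raw feature of this modality.
    """

    def __init__(self, c):
        super().__init__()
        self.f1 = nn.Conv2d(c, 1, kernel_size=1)        # q path: C -> 1
        self.f2 = nn.Conv2d(c, c // 2, kernel_size=1)   # v path: C -> C/2
        self.f3 = nn.Conv2d(c // 2, c, kernel_size=1)   # lift C/2 -> C

    def _channel_gate(self, x):
        b, c, h, w = x.shape
        q = self.f1(x).reshape(b, 1, h * w)             # W_q: (B, 1, HW)
        k = F.softmax(q, dim=-1)                        # W_k: weight distribution
        v = self.f2(x).reshape(b, c // 2, h * w)        # W_v: (B, C/2, HW)
        z = torch.bmm(v, k.transpose(1, 2))             # (B, C/2, 1)
        z = self.f3(z.reshape(b, c // 2, 1, 1))         # (B, C, 1, 1)
        return torch.sigmoid(z)                         # W_z in (0, 1)

    def forward(self, x_rgb, x_ir):
        w_rgb_ln = x_rgb * self._channel_gate(x_rgb)    # W_RGBln = X_RGB ⊙ W_RGBz
        w_ir_ln = x_ir * self._channel_gate(x_ir)       # W_IRln  = X_IR  ⊙ W_IRz
        a_rgb_ch = w_ir_ln + x_rgb                      # A_RGBch = W_IRln + X_RGB
        a_ir_ch = w_rgb_ln + x_ir                       # A_IRch  = W_RGBln + X_IR
        return a_rgb_ch, a_ir_ch
```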
the enhancement process of the space attention branch is as follows: the input features are fully folded in direction q while maintaining high resolution in direction v, the process of operation is expressed as:
$$A_{RGBq} = \sigma_4(F_{GP}(F_4(A_{RGBch}))); \qquad A_{RGBv} = \sigma_5(F_5(A_{RGBch}));$$

$$A_{IRq} = \sigma_4(F_{GP}(F_4(A_{IRch}))); \qquad A_{IRv} = \sigma_5(F_5(A_{IRch}));$$

where $\sigma_4$ and $\sigma_5$ are tensor reshaping operators; $F_4(\cdot)$ and $F_5(\cdot)$ are $1 \times 1$ convolution operations; $F_{GP}(\cdot)$ is a global pooling operator; and $A_{RGBq} \in \mathbb{R}^{1 \times C/2}$, $A_{RGBv} \in \mathbb{R}^{C/2 \times HW}$, $A_{IRq} \in \mathbb{R}^{1 \times C/2}$, $A_{IRv} \in \mathbb{R}^{C/2 \times HW}$.

$A_{RGBq}$ and $A_{IRq}$, which represent the weights of the first feature and the second feature respectively, are input into a Softmax function for classification, and the weight distributions of the first feature and the second feature are output:

$$A_{RGBk} = \mathrm{Softmax}(A_{RGBq}); \qquad A_{IRk} = \mathrm{Softmax}(A_{IRq});$$

where $A_{RGBk}$ is the weight of the first feature and $A_{IRk}$ is the weight of the second feature.

In the calculation of the weight distribution the information is heavily compressed. To preserve the information intensity, the information $A_{RGBv}$ is multiplied by the weight $A_{RGBk}$, and the information $A_{IRv}$ is multiplied by the weight $A_{IRk}$, followed in turn by information completion, reshaping, and a Sigmoid function:

$$A_{RGBz} = \mathrm{Sigmoid}(\sigma_6(A_{RGBv} \times A_{RGBk}));$$

$$A_{IRz} = \mathrm{Sigmoid}(\sigma_6(A_{IRv} \times A_{IRk}));$$

where $A_{RGBz} \in \mathbb{R}^{1 \times HW}$ and $A_{IRz} \in \mathbb{R}^{1 \times HW}$ are the spatial gates of the first feature and the second feature, respectively.

Then, $A_{RGBch}$ is multiplied with $A_{RGBz}$, and $A_{IRch}$ with $A_{IRz}$, to obtain the spatially enhanced features $A_{RGBln}$ and $A_{IRln}$ of the first and second features:

$$A_{RGBln} = A_{RGBch} \odot A_{RGBz}; \qquad A_{IRln} = A_{IRch} \odot A_{IRz};$$

where $A_{RGBln} \in \mathbb{R}^{C \times H \times W}$ and $A_{IRln} \in \mathbb{R}^{C \times H \times W}$.

The feature $A_{IRln}$ is added to $A_{RGBch}$ for recalibration enhancement to obtain the final output $X_{RGBout}$ of the visible light feature, and the feature $A_{RGBln}$ is added to $A_{IRch}$ for recalibration enhancement to obtain the final output $X_{IRout}$ of the infrared feature; the calculation process is expressed as:

$$X_{RGBout} = A_{IRln} + A_{RGBch}; \qquad X_{IRout} = A_{RGBln} + A_{IRch}.$$
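Under the same assumptions, a matching sketch of the spatial attention branch with cross recalibration could be written as follows; again the class name and the reshaping details are illustrative rather than taken from the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialCrossEnhance(nn.Module):
    """Spatial attention with cross-modal recalibration (illustrative sketch).

    A globally pooled descriptor (q) is matched against a spatially resolved
    map (v) to produce a spatial gate; the spatially enhanced feature of one
    modality is then added to the channel-enhanced feature of the other.
    """

    def __init__(self, c):
        super().__init__()
        self.f4 = nn.Conv2d(c, c // 2, kernel_size=1)   # q path
        self.f5 = nn.Conv2d(c, c // 2, kernel_size=1)   # v path

    def _spatial_gate(self, a_ch):
        b, c, h, w = a_ch.shape
        q = F.adaptive_avg_pool2d(self.f4(a_ch), 1)     # global pooling F_GP
        q = F.softmax(q.reshape(b, 1, c // 2), dim=-1)  # A_k: (B, 1, C/2)
        v = self.f5(a_ch).reshape(b, c // 2, h * w)     # A_v: (B, C/2, HW)
        z = torch.bmm(q, v).reshape(b, 1, h, w)         # A_z: (B, 1, H, W)
        return torch.sigmoid(z)

    def forward(self, a_rgb_ch, a_ir_ch):
        a_rgb_ln = a_rgb_ch * self._spatial_gate(a_rgb_ch)  # A_RGBln
        a_ir_ln = a_ir_ch * self._spatial_gate(a_ir_ch)     # A_IRln
        x_rgb_out = a_ir_ln + a_rgb_ch                       # X_RGBout = A_IRln + A_RGBch
        x_ir_out = a_rgb_ln + a_ir_ch                        # X_IRout  = A_RGBln + A_IRch
        return x_rgb_out, x_ir_out
```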
In order to better fuse the visible light and infrared features, the long-range dependency fusion module is based on a Swin Transformer model and markedly improves the fusion of multi-modal complementary information by fusing features through the correlation of position encodings.
The feature fusion process is as follows:

The feature maps are alternately divided into $M \times M$ windows using a shifted-window partitioning method; if a feature map is smaller than $M \times M$, it is padded to the size $M \times M$. The window of the next module is then shifted by $(M/2, M/2)$ pixels relative to the previous one. With this computation scheme, following the standard Swin Transformer block formulation, the calculation is:

$$\hat{F}^{l} = \text{W-MSA}(\mathrm{LN}(F^{l-1})) + F^{l-1};$$

$$F^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{F}^{l})) + \hat{F}^{l};$$

$$\hat{F}^{l+1} = \text{MW-MSA}(\mathrm{LN}(F^{l})) + F^{l};$$

$$F^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{F}^{l+1})) + \hat{F}^{l+1};$$

where the input $F^{l-1} = F_i$ is the joint input of the visible light and infrared features, $F_i = \{X_{RGBout}, X_{IRout}\}$; the output $F^{l+1} = F_o$ is the output feature of the Transformer block; $\hat{F}^{l}$ and $\hat{F}^{l+1}$ are intermediate variables; and W-MSA and MW-MSA are the window multi-head self-attention operation and the masked-window multi-head self-attention operation, respectively.

Considering the local features after window partitioning, in the self-attention computation an input visible light feature map $F_{RGB} \in \mathbb{R}^{8 \times 8 \times C}$ and an input infrared feature map $F_{IR} \in \mathbb{R}^{8 \times 8 \times C}$ are given. Each feature map is then flattened and its matrix arranged into a sequence, yielding the sentences $I_{RGB} \in \mathbb{R}^{64 \times C}$ and $I_{IR} \in \mathbb{R}^{64 \times C}$. The input sentence $I \in \mathbb{R}^{128 \times C}$ is then obtained by concatenating the sentences $I_{RGB}$ and $I_{IR}$. Thirdly, the input sentence $I$ is projected by three weight matrices to obtain a set of query $Q$, key $K$, and value $V$:

$$Q = IW_Q, \qquad K = IW_K, \qquad V = IW_V;$$

where $W_Q \in \mathbb{R}^{C \times 128}$, $W_K \in \mathbb{R}^{C \times 128}$ and $W_V \in \mathbb{R}^{C \times 128}$ are weight matrices.

The self-attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + \mathrm{FPRE}\right)V;$$

where $d$ is the dimension of the query $Q$ or the key $K$; the superscript $T$ denotes matrix transposition; and FPRE denotes the position encoding of the visible light and infrared features, which contains four types of position information: the visible light position information $\mathrm{RPE}_{RGB}$, the infrared position information $\mathrm{RPE}_{IR}$, the visible-to-infrared relative position information $\mathrm{RPE}_{RGB\text{-}IR}$, and the infrared-to-visible relative position information $\mathrm{RPE}_{IR\text{-}RGB}$ (the explicit expression of FPRE is given as an image in the original publication and is not reproduced here).

Through the above operations, the visible light feature $X_{RGBout}$ and the infrared feature $X_{IRout}$ output after deep interaction are obtained, and the two are added to obtain the final fused feature $F_{fusion}$:

$$F_{fusion} = X_{RGBout} + X_{IRout}.$$
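The joint window self-attention over concatenated RGB/IR tokens can be sketched as below; a single attention head is used and FPRE is replaced by a plain learnable additive bias for brevity, so this is an approximation of the described mechanism rather than its exact form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointWindowAttention(nn.Module):
    """Joint self-attention over one pair of RGB/IR windows (illustrative sketch).

    Two M x M windows (M = 8 here) are flattened into 64-token "sentences",
    concatenated into a 128-token input, and attended jointly; a learnable
    additive bias stands in for the FPRE position encoding.
    """

    def __init__(self, c, tokens=128):
        super().__init__()
        self.w_q = nn.Linear(c, 128, bias=False)   # W_Q in R^{C x 128}
        self.w_k = nn.Linear(c, 128, bias=False)   # W_K in R^{C x 128}
        self.w_v = nn.Linear(c, 128, bias=False)   # W_V in R^{C x 128}
        self.fpre = nn.Parameter(torch.zeros(tokens, tokens))  # stand-in for FPRE

    def forward(self, f_rgb, f_ir):
        b, c, m, _ = f_rgb.shape                        # (B, C, 8, 8)
        i_rgb = f_rgb.flatten(2).transpose(1, 2)        # I_RGB: (B, 64, C)
        i_ir = f_ir.flatten(2).transpose(1, 2)          # I_IR:  (B, 64, C)
        i = torch.cat([i_rgb, i_ir], dim=1)             # I: (B, 128, C)
        q, k, v = self.w_q(i), self.w_k(i), self.w_v(i)
        attn = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5) + self.fpre
        out = F.softmax(attn, dim=-1) @ v               # joint attention output
        out_rgb, out_ir = out[:, :m * m], out[:, m * m:]
        return out_rgb, out_ir
```

Attending over the concatenated token sequence is what lets every visible-light token see every infrared token (and vice versa) in a single step, which is the long-range, cross-modal interaction the module is designed to provide.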
The feature enhancement of the cross feature enhancement module essentially introduces noise data to force the network to learn complementary modality information, thereby reducing the network's bias toward a single modality; the feature enhancement is performed from the two perspectives of channel and space, including the exchange of complementary information between the two modalities, so that the multi-modal features are fused better, the difference between the two modalities and the severe lack of visible-light information are effectively addressed, and the detection performance of the network is improved. The long-range dependency fusion module focuses on deep interactive information enhancement: the features of the two modalities are partitioned synchronously by the shifted window, deep interactive information enhancement is performed on the fused features through the multi-head self-attention mechanism, the adaptability of the fused features in complex illumination scenes is improved, and missed detections and false detections of the detector are reduced.
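Putting the pieces together, an end-to-end sketch of the detection pipeline (dual-stream backbone, cross feature enhancement, long-range fusion, detection head) could be composed from the sketches above; `head` is an assumed placeholder for a YOLOv5-style detection head, and tensor shapes between the stages are not reconciled in detail here.

```python
import torch.nn as nn

class FeatureEnhancedFusionDetector(nn.Module):
    """High-level composition of the sketches above (illustrative only)."""

    def __init__(self, make_branch, channels, head):
        super().__init__()
        self.backbone = DualStreamBackbone(make_branch)
        self.channel_enhance = ChannelCrossEnhance(channels)
        self.spatial_enhance = SpatialCrossEnhance(channels)
        self.fusion = JointWindowAttention(channels)
        self.head = head                          # e.g. a YOLOv5-style detection head

    def forward(self, rgb, ir):
        x_rgb, x_ir = self.backbone(rgb, ir)               # deep features per modality
        a_rgb, a_ir = self.channel_enhance(x_rgb, x_ir)    # channel-level cross enhancement
        x_rgb_out, x_ir_out = self.spatial_enhance(a_rgb, a_ir)  # spatial cross enhancement
        f_rgb, f_ir = self.fusion(x_rgb_out, x_ir_out)     # long-range joint attention
        return self.head(f_rgb + f_ir)                     # F_fusion = X_RGBout + X_IRout
```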
The application also provides an infrared-visible light fusion system based on feature enhancement, comprising:
and a data acquisition module: the method comprises the steps of acquiring a multi-target data set, and preprocessing the multi-target data set, wherein the multi-target data set comprises a visible light image and an infrared image.
And the feature extraction module is used for: the method is used for constructing a double-flow trunk feature extraction network, the double-flow trunk feature extraction network comprises two branches with the same structure, and the visible light image and the infrared image are respectively sent into the two branches to extract deep features, so that visible light features and infrared features are obtained.
And a feature fusion module: the method comprises the steps that a feature fusion network is constructed, the feature fusion network comprises a cross feature enhancement module and a long-distance dependent fusion module, the cross feature enhancement module comprises a channel attention branch and a space attention branch which are arranged in series, the visible light features and the infrared features are sent into the channel attention branch to be enhanced, the visible light features before enhancement and the infrared features after enhancement through the channel attention branch are added to obtain a first feature, and the infrared features before enhancement and the visible light features after enhancement through the channel attention branch are added to obtain a second feature; the first feature and the second feature are sent to the space attention branch for enhancement, the first feature is added with the second feature enhanced by the space attention branch to obtain the final output of the visible light feature, and the second feature is added with the first feature enhanced by the space attention branch to obtain the final output of the infrared feature; and sending the final output of the visible light characteristic and the infrared characteristic to the long-distance dependent fusion module, and fusing by correlation of position codes based on a Swin transducer model.
The present application also provides a readable storage medium having one or more programs stored therein, the one or more programs being executable by one or more processors to implement the steps of the feature-enhanced infrared-visible fusion method described above.
Example 1
A model is built with the feature-enhancement-based infrared-visible light fusion method described above and trained on a 1080Ti desktop computer using an SGD optimizer with an initial learning rate of 0.001, a momentum of 0.937, and a weight decay of 0.0005. The overall performance of the proposed method was extensively tested on the VEDAI dataset, with the test results shown in the following table:
[Table 1: detection results on the VEDAI dataset; the table is provided as an image in the original publication and is not reproduced here.]
As can be seen from Table 1, the proposed method achieves the best performance compared with the other methods. Compared with the best single-modality detection algorithm, the mAP (mean average precision) is improved by 12.9%; among the multi-modal detection methods, the mAP is improved by 3.1% over the best existing method; and compared with the base detector, the method proposed in the application significantly reduces missed detections and false detections. The method improves the target feature representation through the cross feature enhancement module, which greatly improves detection performance. In addition, the method performs deep interaction between the information of the two modalities, giving a higher degree of information fusion and better detection performance.
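The training configuration stated in Example 1 corresponds to an SGD setup like the following sketch; the `model` object is a stand-in placeholder rather than the actual network.

```python
import torch

model = torch.nn.Conv2d(3, 3, 1)  # placeholder standing in for the fusion detector

# Hyperparameters as stated in Example 1: SGD, initial lr 0.001, momentum 0.937, weight decay 0.0005.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,
    momentum=0.937,
    weight_decay=0.0005,
)
```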
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (8)

1. The infrared-visible light fusion method based on the characteristic enhancement is characterized by comprising the following steps of:
Data acquisition: collecting a multi-target data set and preprocessing it, wherein the multi-target data set comprises a visible light image and an infrared image;
Feature extraction: constructing a dual-stream backbone feature extraction network comprising two branches with the same structure, and feeding the visible light image and the infrared image into the two branches respectively to extract deep features, thereby obtaining visible light features and infrared features;
Feature fusion: constructing a feature fusion network comprising a cross feature enhancement module and a long-range dependency fusion module, the cross feature enhancement module comprising a channel attention branch and a spatial attention branch arranged in series; feeding the visible light features and the infrared features into the channel attention branch for enhancement, adding the visible light feature before enhancement to the infrared feature enhanced by the channel attention branch to obtain a first feature, and adding the infrared feature before enhancement to the visible light feature enhanced by the channel attention branch to obtain a second feature; feeding the first feature and the second feature into the spatial attention branch for enhancement, adding the first feature to the second feature enhanced by the spatial attention branch to obtain the final output of the visible light feature, and adding the second feature to the first feature enhanced by the spatial attention branch to obtain the final output of the infrared feature; and feeding the final outputs of the visible light feature and the infrared feature into the long-range dependency fusion module, where they are fused through the correlation of position encodings based on a Swin Transformer model.
2. The feature-enhancement-based infrared-visible light fusion method of claim 1, wherein the preprocessing is performed in the following manner: the visible light image and the infrared image in the multi-target data set are processed with a data enhancement method that generates an image mask from random regions, the specific process being: the image is divided into a 10 × 10 checkerboard according to its size, and in each row two blocks are set to zero with a probability of 30% to form the image mask; the image mask is then divided by rows into two complementary masks, one used as the mask for the visible light image and the other as the mask for the infrared image, the mask generation process being expressed as:

$$\mathrm{Generate}_{mask} = \mathrm{RGB}_{mask} \cup \mathrm{IR}_{mask};$$

$$\mathrm{RGB}_{mask} \mid \mathrm{IR}_{mask} = 1;$$

where $\mathrm{RGB}_{mask}$ is the mask of the visible light image, $\mathrm{IR}_{mask}$ is the mask of the infrared image, and $\mathrm{Generate}_{mask}$ is the total mask.
3. The infrared-visible light fusion method based on feature enhancement as claimed in claim 2, wherein the dual-stream backbone feature extraction network is a YOLOv5 feature extraction network with a dual-stream backbone, the extracted visible light features are expressed as $X_{RGB} \in \mathbb{R}^{W \times H \times C}$, and the extracted infrared features are expressed as $X_{IR} \in \mathbb{R}^{W \times H \times C}$, where $\mathbb{R}^{W \times H \times C}$ denotes a three-dimensional matrix, $W$ is the width, $H$ is the height, and $C$ is the number of channels.
4. The infrared-visible light fusion method based on feature enhancement according to claim 3, wherein the enhancement procedure of the channel attention branch is: the input features are fully folded along one direction $q$ while high resolution is maintained along the orthogonal direction $v$; the operation is expressed as:

$$W_{RGBq} = \sigma_1(F_1(X_{RGB})); \qquad W_{RGBv} = \sigma_2(F_2(X_{RGB}));$$

$$W_{IRq} = \sigma_1(F_1(X_{IR})); \qquad W_{IRv} = \sigma_2(F_2(X_{IR}));$$

where $W_{RGBq}$ is the information of the visible light feature in the $q$ direction; $W_{RGBv}$ is the information of the visible light feature in the $v$ direction; $W_{IRq}$ is the information of the infrared feature in the $q$ direction; $W_{IRv}$ is the information of the infrared feature in the $v$ direction; $\sigma_1$ and $\sigma_2$ are tensor reshaping operators; and $F_1(\cdot)$ and $F_2(\cdot)$ are $1 \times 1$ convolution operations;

$W_{RGBq}$ and $W_{IRq}$, representing the weights of the visible light feature and the infrared feature respectively, are input into a Softmax function for classification, and the weight distributions of the visible light feature and the infrared feature are output:

$$W_{RGBk} = \mathrm{Softmax}(W_{RGBq}); \qquad W_{IRk} = \mathrm{Softmax}(W_{IRq});$$

where $W_{RGBk}$ is the weight of the visible light feature and $W_{IRk}$ is the weight of the infrared feature;

the information $W_{RGBv}$ is multiplied by the weight $W_{RGBk}$, and the information $W_{IRv}$ is multiplied by the weight $W_{IRk}$; a $1 \times 1$ convolution operation is then performed, the channel dimension is raised from $C/2$ to $C$ with a normalization step, and a Sigmoid function keeps all parameters in the range 0-1:

$$W_{RGBz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{RGBv} \times W_{RGBk})));$$

$$W_{IRz} = \mathrm{Sigmoid}(\sigma_3(F_3(W_{IRv} \times W_{IRk})));$$

where $W_{RGBz}$ is the information of the visible light feature in the $z$ direction; $W_{IRz}$ is the information of the infrared feature in the $z$ direction; "$\times$" denotes a matrix dot-product operation; $F_3(\cdot)$ is a $1 \times 1$ convolution operation; and $\sigma_3$ is a tensor reshaping operator;

$X_{RGB}$ is then multiplied with $W_{RGBz}$ at the channel level to obtain the lower-noise feature $W_{RGBln}$, and $X_{IR}$ is multiplied with $W_{IRz}$ at the channel level to obtain the lower-noise feature $W_{IRln}$; the calculation process is expressed as:

$$W_{RGBln} = X_{RGB} \odot W_{RGBz}; \qquad W_{IRln} = X_{IR} \odot W_{IRz};$$

where $\odot$ denotes the Hadamard product;

the feature $W_{IRln}$ is added to $X_{RGB}$ for recalibration enhancement to obtain the first feature $A_{RGBch}$, and the feature $W_{RGBln}$ is added to $X_{IR}$ for recalibration enhancement to obtain the second feature $A_{IRch}$; the calculation process is expressed as:

$$A_{RGBch} = W_{IRln} + X_{RGB}; \qquad A_{IRch} = W_{RGBln} + X_{IR}.$$
5. the feature enhancement-based infrared-visible light fusion method of claim 4, wherein the enhancement procedure of the spatial attention branch is: the input features are fully folded in direction q while maintaining high resolution in direction v, the process of operation is expressed as:
A RGB q =σ 4 (F GP (F 4 (A RGB ch )));A RGB v =σ 5 (F 5 (A RGB ch ));
A IR q =σ 4 (F GP (F 4 (A IR ch )));A IR v =σ 5 (F 5 (A IR ch ));
in sigma 4 Sum sigma 5 All represent tensor remodelling operators; f (F) 4 (. Cndot.) and F 5 (. Cndot.) all represent 1X 1 convolution operations; f (F) GP (. Cndot.) represents a global pooling operator,
Figure FDA0004133556130000032
in A way RGB q And A IR q Weights respectively representing the first feature and the second feature are input into a Softmax function for classification, weight distribution of the first feature and the second feature is output, and the calculation process is represented as follows:
Figure FDA0004133556130000033
wherein A is RGB k Weights representing the first characteristic, A IR k Weights representing the second features;
will information A RGB v Multiplied by weight A RGB k Will information A IR v Multiplied by weight A IR k Then sequentially performing information completion, remodelling and Sigmoid functions:
A RGB z =Sigmoid(σ 6 (A RGB v ×A RGB k ));
A IR z =Sigmoid(σ 6 (A IR v ×A IR k ));
wherein A is RGB z ∈R 1×HW ;A IR z ∈R 1×HW A spatial gate representing the first feature and the second feature, respectively;
will A RGB ch And A RGB z Multiplying by A IR ch And A IR z Multiplication to obtain spatially enhanced features A of the first and second features, respectively RGB ln And A IR ln The calculation process is expressed as:
A RGB ln =A RGB ch ⊙A RGB z ;A IR ln =A IR ch ⊙A IR z
wherein A is RGB ln ∈R C×H×W ;A IR ln ∈R C×H×W
Feature A IR ln And A is a RGB ch Adding to recalibrate enhancement to obtain final output X of visible light characteristics RGB out The method comprises the steps of carrying out a first treatment on the surface of the Feature A RGB ln And A is a IR ch Adding to recalibrate the enhancement to obtain the final output X of the infrared signature IR out The calculation process is expressed as:
X RGB out =A IR ln +A RGB ch ;X IR out =A RGB ln +A IR ch
6. the infrared-visible light fusion method based on feature enhancement according to claim 5, wherein the feature fusion process is as follows:
using a shift window dividing method to alternately divide the feature map into M x M dimensions, and if the element map size is smaller than M x M, filling it into M x M size; the window of the next module will then be relatively shifted by (M/2 ) pixels; with this calculation method, the calculation formula is as follows:
Figure FDA0004133556130000041
Figure FDA0004133556130000042
Figure FDA0004133556130000043
Figure FDA0004133556130000044
wherein F is i The representation being a joint input of visible and infrared features, F i ={X RGB out ,X IR out };F o Is the output characteristic of the transducer block;
Figure FDA0004133556130000045
and->
Figure FDA0004133556130000046
Is an intermediate variable; W-MSA and MW-MSA are window multi-head self-attention operation and mask window multi-head self-attention operation, respectively;
taking partial characteristics after window segmentation into consideration, in the self-attention calculation process, an input visible light characteristic diagram F is given RGB ∈R 8×8×C And infrared characteristic diagram F IR ∈R 8×8×C The method comprises the steps of carrying out a first treatment on the surface of the Flattening each feature map and arranging the sequences of the matrixes to obtain sentences I RGB ∈R 64×C And I IR ∈R 64×C The method comprises the steps of carrying out a first treatment on the surface of the Then provide the input sentence I E R 128×C Connection sentence I RGB And I IR The method comprises the steps of carrying out a first treatment on the surface of the Projecting the input sentence I into three weight matrices to obtain a set of query Q, key K, and value V:
Q=IW Q ,K=IW K ,V=IW V
in which W is Q ∈R C×128 、W K ∈R C×128 And W is V ∈R C×128 All represent a weight matrix.
The self-attention calculation process is as follows:
Figure FDA0004133556130000051
where d represents the dimension that is either query Q or key K; FPRE denotes the position coding of visible and infrared features, and contains four types of position information: visible light position information RPE RGB Infrared location information RPE IR Visible and infrared relative position information RPE RGB-IR Infrared and visible light relative position information RPE IR-R
Figure FDA0004133556130000052
In the method, in the process of the invention,
Figure FDA0004133556130000053
t represents a matrix transposition operation;
the visible light characteristic X output after the deep interaction is obtained through the operation RGB out And infrared feature X IR out And adds the two to obtain the final fusion feature F fusion
F fusion =X RGB out +X IR out
7. An infrared-visible light fusion system based on feature enhancement, comprising:
and a data acquisition module: the method comprises the steps of acquiring a multi-target data set, and preprocessing the multi-target data set, wherein the multi-target data set comprises a visible light image and an infrared image.
And the feature extraction module is used for: the method is used for constructing a double-flow trunk feature extraction network, the double-flow trunk feature extraction network comprises two branches with the same structure, and the visible light image and the infrared image are respectively sent into the two branches to extract deep features, so that visible light features and infrared features are obtained.
And a feature fusion module: the method comprises the steps that a feature fusion network is constructed, the feature fusion network comprises a cross feature enhancement module and a long-distance dependent fusion module, the cross feature enhancement module comprises a channel attention branch and a space attention branch which are arranged in series, the visible light features and the infrared features are sent into the channel attention branch to be enhanced, the visible light features before enhancement and the infrared features after enhancement through the channel attention branch are added to obtain a first feature, and the infrared features before enhancement and the visible light features after enhancement through the channel attention branch are added to obtain a second feature; the first feature and the second feature are sent to the space attention branch for enhancement, the first feature is added with the second feature enhanced by the space attention branch to obtain the final output of the visible light feature, and the second feature is added with the first feature enhanced by the space attention branch to obtain the final output of the infrared feature; and sending the final output of the visible light characteristic and the infrared characteristic to the long-distance dependent fusion module, and fusing by correlation of position codes based on a Swin transducer model.
8. A readable storage medium having one or more programs stored therein, the one or more programs being executable by one or more processors to implement the steps of the feature-enhancement-based infrared-visible light fusion method of any one of claims 1-6.
CN202310267771.6A 2023-03-20 2023-03-20 Feature enhancement-based infrared-visible light fusion method, system and readable storage medium Pending CN116258934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267771.6A CN116258934A (en) 2023-03-20 2023-03-20 Feature enhancement-based infrared-visible light fusion method, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310267771.6A CN116258934A (en) 2023-03-20 2023-03-20 Feature enhancement-based infrared-visible light fusion method, system and readable storage medium

Publications (1)

Publication Number Publication Date
CN116258934A true CN116258934A (en) 2023-06-13

Family

ID=86684318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310267771.6A Pending CN116258934A (en) 2023-03-20 2023-03-20 Feature enhancement-based infrared-visible light fusion method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN116258934A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912649A (en) * 2023-09-14 2023-10-20 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance
CN116912649B (en) * 2023-09-14 2023-11-28 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance

Similar Documents

Publication Publication Date Title
CN111738124A (en) Remote sensing image cloud detection method based on Gabor transformation and attention
CN108764250B (en) Method for extracting essential image by using convolutional neural network
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN112257741B (en) Method for detecting generative anti-false picture based on complex neural network
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN114140623A (en) Image feature point extraction method and system
CN116258934A (en) Feature enhancement-based infrared-visible light fusion method, system and readable storage medium
CN112329771A (en) Building material sample identification method based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116704188A (en) Wheat grain image segmentation algorithm with different volume weights based on improved U-Net network
CN115797684A (en) Infrared small target detection method and system based on context information
CN115171074A (en) Vehicle target identification method based on multi-scale yolo algorithm
CN112418203B (en) Robustness RGB-T tracking method based on bilinear convergence four-stream network
CN113159158A (en) License plate correction and reconstruction method and system based on generation countermeasure network
CN117392392B (en) Rubber cutting line identification and generation method
CN116486203B (en) Single-target tracking method based on twin network and online template updating
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN104915639B (en) Face identification method based on combined error coding
CN114842300B (en) Crop pest detection method suitable for rainy day environment
CN118097360A (en) Image fusion method based on significant feature extraction and residual connection
CN114818872A (en) Image target detection method based on improved YOLOv4

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination