CN116503703A - Infrared light and visible light image fusion system based on shunted attention Transformer - Google Patents
Infrared light and visible light image fusion system based on shunted attention Transformer
- Publication number
- CN116503703A CN116503703A CN202310477962.5A CN202310477962A CN116503703A CN 116503703 A CN116503703 A CN 116503703A CN 202310477962 A CN202310477962 A CN 202310477962A CN 116503703 A CN116503703 A CN 116503703A
- Authority
- CN
- China
- Prior art keywords
- attention
- split
- visible light
- fusion
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000004927 fusion Effects 0.000 title claims abstract description 77
- 238000000605 extraction Methods 0.000 claims abstract description 20
- 230000007246 mechanism Effects 0.000 claims abstract description 8
- 238000000034 method Methods 0.000 claims description 36
- 230000003993 interaction Effects 0.000 claims description 18
- 230000006870 function Effects 0.000 claims description 15
- 239000010419 fine particle Substances 0.000 claims description 4
- 239000002245 particle Substances 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 claims description 3
- 230000010354 integration Effects 0.000 claims description 3
- 238000013507 mapping Methods 0.000 claims description 3
- 238000007670 refining Methods 0.000 claims description 2
- 238000012546 transfer Methods 0.000 claims description 2
- 230000000007 visual effect Effects 0.000 abstract description 10
- 230000000295 complement effect Effects 0.000 abstract description 4
- 238000001514 detection method Methods 0.000 description 9
- 230000008901 benefit Effects 0.000 description 6
- 238000012360 testing method Methods 0.000 description 5
- 230000016776 visual perception Effects 0.000 description 5
- 230000011218 segmentation Effects 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 239000000284 extract Substances 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000012423 maintenance Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000005855 radiation Effects 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000011362 coarse particle Substances 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000004438 eyesight Effects 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/12—Details of acquisition arrangements; Constructional details thereof
- G06V10/14—Optical characteristics of the device performing the acquisition or on the illumination arrangements
- G06V10/143—Sensing or illuminating at different wavelengths
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10048—Infrared image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an infrared light and visible light image fusion system based on a shunted attention Transformer, and relates to the technical field of image fusion. The system comprises a shunted attention Transformer network model for infrared and visible light image fusion; the network model comprises a shunted attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit. To generate a fused image with rich scene details and good visual quality, the infrared and visible light images are each fed into the network model to extract shallow local features. The feature extraction unit then obtains coarse-grained and fine-grained details within a single attention layer. In the feature fusion unit, a cross-attention mechanism is introduced to fuse cross-domain complementary features. In the reconstruction stage, the feature reconstruction unit adopts dense skip connections and makes maximal use of deep and shallow features at different scales to construct the fused image.
Description
Technical Field
The invention relates to the technical field of image fusion, in particular to an infrared light and visible light image fusion system based on a shunted attention Transformer.
Background
Due to limitations in the sensor's physical characteristics, imaging mechanism, and viewing angle, a single vision sensor often cannot extract enough information from a scene. Relying on thermal radiation, an infrared sensor highlights heat-source regions without being disturbed by the environment. However, because the images produced by infrared sensors have low resolution, they often lack structural features and detailed information. In contrast, a visible light sensor can generate visually friendly images with higher spatial resolution. To inherit the advantages of both sensors, preserving the thermal radiation and texture information of infrared and visible images through image fusion is an effective approach. The fusion result offers excellent visual perception and scene representation capability and can be widely applied in fields such as image enhancement, semantic segmentation, and target detection.
The key to the fusion of infrared and visible images is how to effectively integrate heat source features and detailed texture information. Over the past several decades, a number of conventional fusion methods and deep learning based fusion methods have been proposed.
Traditional fusion methods mainly comprise spatial domain methods and multi-scale transform domain methods. Spatial domain methods typically compute a weighted average of the pixel-level saliency of the input images to obtain the fused image. Multi-scale transform domain methods, which include the wavelet transform, the curvelet transform, and the like, convert the input into a transform domain through a mathematical transformation and design fusion rules to fuse the images. Since these conventional methods do not consider the feature differences between source images, they tend to degrade the fused image. In addition, the fusion rules and activity level measurements of traditional methods cannot adapt to complex scenes, which also limits their wide application.
In recent years, deep learning has become mainstream in the field of image fusion owing to its excellent deep feature extraction capability. Such methods not only extract deep features from data automatically but also overcome the difficulty traditional methods face in adapting to complex scenes. However, these methods can only use local information for image fusion and cannot further improve the fusion effect through long-range dependencies. Some Transformer-based fusion methods benefit from the complementary aggregation of global context features and have demonstrated excellent performance, but they still have certain limitations. First, Transformer-based methods are less time-efficient than convolutional-neural-network-based methods when generating test images. In addition, most Transformer-based methods process a series of tokens directly after dividing the image, which results in high memory consumption. Second, because existing Transformer fusion networks ignore intra-layer mixed-granularity features, their ability to retain fine-grained details and coarse-grained objects is limited. Finally, the feature fusion process fuses information in only a single domain and lacks contextual information, which may affect the visual appearance of the fusion result.
Disclosure of Invention
The invention aims to provide an infrared light and visible light image fusion system based on a shunted attention Transformer, which extracts and fuses global granularity information and local features, and markedly reduces computational cost by exploiting long-range dependency learning while reducing the number of input tokens.
To achieve the above object, an infrared light and visible light image fusion system based on a shunted attention Transformer according to the present application comprises a shunted attention Transformer network model for fusing infrared and visible light images. The input infrared and visible light images first enter a shallow convolution block composed of several convolution layers to extract shallow local features; a multi-scale technique then decomposes the shallow local features onto different scales, capturing shallow information at each scale.
Further, the shunted attention Transformer network model comprises a shunted attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit.
Further, the shunted attention feature extraction unit comprises three stage blocks, each stage block comprising six shunted Transformer sub-blocks; each sub-block is driven by multi-granularity learning and, by injecting tokens into heterogeneous receptive fields, captures multi-granularity information while reducing the number of tokens. The shunted Transformer sub-blocks model global features and extract multi-granularity features over a global scope.
Further, the cross-attention fusion unit comprises two cross-attention residual blocks; each residual block is provided with a self-attention-based intra-domain fusion block, which effectively integrates global interaction information within the same domain, and a cross-attention-based inter-domain fusion block, which further integrates global interaction information between different domains.
Furthermore, the inter-domain fusion block uses a cross-attention mechanism to realize the interaction of global feature information; by merging cross-domain information and retaining the information of each domain through skip connections, alternating integration of intra-domain and inter-domain interactions is realized.
Further, the feature reconstruction unit comprises a deep reconstruction block and a shallow reconstruction block and maps the aggregated deep features back to the image space; the deep reconstruction block comprises four self-attention blocks, and the shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each layer being followed by a ReLU activation function.
Furthermore, the deep reconstruction block refines the fused deep features and realizes multi-scale feature reconstruction from a global perspective; the image size is further restored by the shallow reconstruction blocks and a convolution layer, and feature transfer is then enhanced by skip connections, maximally reusing features from different layers to construct the fused image.
Still further, the refinement process of the shunted attention Transformer network model can be quantified as:

I_Fu = FRU(F_FUS)

wherein S(·) represents the shallow convolution block, and I_n and V_i represent the input infrared and visible light images, respectively; SAFEU(·) is the shunted attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible light images; CAFU(·) represents the cross-attention fusion unit, whose outputs are the infrared and visible light features aggregated after intra-domain and inter-domain interaction; Concat(·) represents concatenation along the channel dimension; FRU(·) represents the feature reconstruction unit; F_FUS represents the fused depth features; and I_Fu is the fused image generated after reconstruction and upsampling.
As a further aspect, the shunted attention Transformer network model is trained using a granularity loss function, wherein the granularity loss function comprises a structural similarity loss, a fine-grained loss and a coarse-grained loss; the granularity loss L_G is expressed as:

L_G = α·L_S + β·(L_FG + L_CG)

wherein L_S, L_FG and L_CG represent the structural similarity loss, the fine-grained loss and the coarse-grained loss, respectively; and α and β are hyper-parameters of the loss function.
Further, the structural similarity loss L_S is:

L_S = 1 - SSIM(I_n, V_i, I_Fu)

wherein I_Fu is the fusion result and SSIM(·) represents the structural similarity function, defined for a source image I_* and the fused image I_Fu as:

SSIM(I_*, I_Fu) = [(2μ_{I_*}μ_{I_Fu} + C_1)/(μ_{I_*}^2 + μ_{I_Fu}^2 + C_1)]·[(2σ_{I_*}σ_{I_Fu} + C_2)/(σ_{I_*}^2 + σ_{I_Fu}^2 + C_2)]·[(σ_{I_*I_Fu} + C_3)/(σ_{I_*}σ_{I_Fu} + C_3)]

wherein I_* represents a source image (V_i or I_n); μ and σ represent the mean and standard deviation, respectively; and C_1, C_2 and C_3 are constants that maintain stability;
loss of fine particle size L FG Loss of coarse grain L CG The method comprises the following steps of:
wherein I 1 Represents L 1 Norm, max { · } represents the maximum choice per element,representing a Sobel gradient operator, wherein |is an absolute value operation; H. w is the height and width of the image and γ is the hyper-parameter.
Compared with the prior art, the technical scheme adopted by the invention has the following advantages. The shunted attention Transformer network model establishes long-range dependencies between images, extracts and integrates multi-granularity features, and reduces computational cost by effectively reducing the number of tokens, so that its time efficiency in generating test images is higher than that of existing Transformer-based and convolutional-neural-network-based fusion methods. The shunted attention feature extraction unit realizes the joint extraction of coarse-grained and fine-grained features within each attention layer. The cross-attention fusion unit fully realizes intra-domain and inter-domain deep feature interaction and cross-domain information fusion, and the feature reconstruction unit maps the features back to the reconstructed image, enabling the network model to recover fused images of different resolutions. Furthermore, the network is driven by a granularity loss function consisting of a structural similarity loss, a fine-grained loss and a coarse-grained loss, so that feature extraction and fusion are achieved under granularity information control and structure preservation. The fused images generated by the shunted attention Transformer network model have better visual perception, contain sufficient salient features and texture detail information, and are produced with higher time efficiency.
Drawings
FIG. 1 is a schematic block diagram of an infrared and visible light image fusion system;
FIG. 2 is a schematic diagram of a cross-attention fusion unit and a reconstruction unit;
FIG. 3 is a qualitative comparison between the present method and other advanced fusion methods on the MSRS dataset;
FIG. 4 is a qualitative comparison between the present method and other advanced fusion methods on the M3FD dataset;
FIG. 5 is a quantitative comparison between the present method and other advanced fusion methods on the M3FD dataset;
FIG. 6 is a qualitative comparison of detection results between the present method and other advanced fusion methods on the MSRS dataset.
Detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the application, i.e., the embodiments described are merely some, but not all, of the embodiments of the application.
Example 1
As shown in FIG. 1, the infrared light and visible light image fusion system based on the shunted attention Transformer comprises a shunted attention Transformer network model for infrared and visible light image fusion; the network model comprises a shunted attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit. To generate a fused image with rich scene details and good visual quality, the infrared and visible light images are each fed into the network model to obtain salient information and texture details. The input infrared and visible light images first enter a shallow convolution block composed of four convolution layers, which extracts shallow local features. A multi-scale technique then decomposes the shallow local features onto different scales and captures shallow information at each scale; extracting shallow information at different scales helps the network better understand image details.
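As an illustration of how such a front end could be organized, the following is a minimal PyTorch sketch of a shallow convolution block followed by a multi-scale decomposition. The channel width, kernel size, use of ReLU between layers, shared weights for the two modalities, and the average-pooling pyramid used for the scale decomposition are illustrative assumptions rather than details specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowConvBlock(nn.Module):
    """Four 3x3 convolution layers that extract shallow local features.
    The channel width (32) is an illustrative assumption."""
    def __init__(self, in_ch: int = 1, dim: int = 32):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(4):
            layers += [nn.Conv2d(ch, dim, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            ch = dim
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

def multi_scale_decompose(feat, num_scales: int = 3):
    """Decompose shallow features onto several spatial scales
    (assumed here to be an average-pooling pyramid)."""
    return [feat] + [F.avg_pool2d(feat, kernel_size=2 ** s) for s in range(1, num_scales)]

# Example: infrared and visible inputs pass through the shallow stem.
ir, vi = torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256)
stem = ShallowConvBlock()
ir_scales, vi_scales = multi_scale_decompose(stem(ir)), multi_scale_decompose(stem(vi))
print([t.shape for t in ir_scales])  # [1, 32, 256, 256], [1, 32, 128, 128], [1, 32, 64, 64]
```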
To address the problems of high memory consumption and single-scale tokens in the self-attention mechanism, the invention introduces multi-granularity joint learning in the feature extraction stage to learn granularity information within a single attention layer. As shown in the upper left of FIG. 1, the shunted attention feature extraction unit is designed to better explore multi-granularity depth features; it comprises three stage blocks, each comprising six shunted Transformer sub-blocks, and each sub-block is driven by multi-granularity learning. By injecting tokens into heterogeneous receptive fields, the number of tokens is reduced while multi-granularity information is captured. The shunted Transformer sub-blocks effectively model global features and extract multi-granularity features over a global scope.
At the heart of multi-granularity joint learning is the shunted attention mechanism, which models objects of different scales within the same attention layer and learns multi-granularity information in parallel. The keys K and values V on the attention heads within the same attention layer are downsampled to different sizes, reducing the number of tokens so that coarse-grained and fine-grained information can be captured and merged. The mixed-granularity features are then aggregated through skip connections. This mechanism offers good computational efficiency and a strong ability to retain fine-grained detail information.
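The sketch below illustrates this idea in PyTorch: the keys and values on different groups of heads are downsampled at different rates before attention is computed, so one attention layer sees both coarse and fine granularity. The two downsampling rates, the even head split, and the use of strided convolutions for token reduction are assumptions made for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ShuntedSelfAttention(nn.Module):
    """Self-attention in which K/V are downsampled at different rates on
    different groups of heads, capturing coarse- and fine-grained context
    in a single layer while reducing the number of K/V tokens."""
    def __init__(self, dim: int, num_heads: int = 4, rates=(1, 2)):
        super().__init__()
        assert num_heads % len(rates) == 0 and dim % num_heads == 0
        self.dim, self.num_heads, self.rates = dim, num_heads, rates
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        # one K/V projection and one token-reduction conv per downsampling rate
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * dim // len(rates)) for _ in rates])
        self.sr = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=r, stride=r) if r > 1 else nn.Identity()
            for r in rates
        ])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                      # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        hpr = self.num_heads // len(self.rates)      # heads per downsampling rate
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        outs = []
        for i, r in enumerate(self.rates):
            # reduce the K/V tokens for this group of heads by a factor of r*r
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = self.sr[i](feat).reshape(B, C, -1).transpose(1, 2)
            kv = self.kv[i](feat).reshape(B, -1, 2, hpr, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)
            qi = q[:, i * hpr:(i + 1) * hpr]
            attn = (qi @ k.transpose(-2, -1)) * self.head_dim ** -0.5
            outs.append(attn.softmax(dim=-1) @ v)
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

A stage block would stack six such sub-blocks, with the usual feed-forward layers and residual connections omitted here for brevity.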
After extracting deep features, the invention designs a cross-attention fusion unit, shown in FIG. 2, to further mine and aggregate intra-domain and inter-domain context information. The unit comprises two cross-attention residual blocks; each residual block is provided with a self-attention-based intra-domain fusion block, which effectively integrates global interaction information within the same domain, and a cross-attention-based inter-domain fusion block, which further integrates global interaction information between different domains. The inter-domain fusion block uses a cross-attention mechanism to realize the exchange of global feature information; by merging cross-domain information and combining the information of the different domains through skip connections, alternating integration of intra-domain and inter-domain interactions is realized.
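A minimal sketch of one such residual block, assuming standard multi-head attention modules and token-shaped features, is shown below; the layer counts, normalization choices, and use of nn.MultiheadAttention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionResidualBlock(nn.Module):
    """Intra-domain self-attention followed by inter-domain cross-attention,
    with skip (residual) connections that retain each domain's own features."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_vi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_vi = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, ir, vi):                       # ir, vi: (B, N, C) token features
        # intra-domain fusion: self-attention integrates global interactions within each domain
        ir = ir + self.self_ir(ir, ir, ir, need_weights=False)[0]
        vi = vi + self.self_vi(vi, vi, vi, need_weights=False)[0]
        # inter-domain fusion: each domain's queries attend to the other domain's keys/values,
        # while the skip connection preserves the domain's own information
        ir_out = ir + self.cross_ir(ir, vi, vi, need_weights=False)[0]
        vi_out = vi + self.cross_vi(vi, ir, ir, need_weights=False)[0]
        return ir_out, vi_out
```

The fusion unit would stack two such blocks and concatenate their outputs along the channel dimension before reconstruction.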
After the complementary information of the different domains is fully aggregated, the aggregated deep features are mapped back to the image space by the feature reconstruction unit. A deep reconstruction block is deployed to refine the fused deep features and to realize the reconstruction of multi-scale features from a global perspective. The image size is then further restored through two convolutional-neural-network-based shallow reconstruction blocks and one convolution layer, and feature transfer is enhanced by skip connections, maximally reusing features from different layers to construct the fused image. The deep reconstruction block comprises four self-attention blocks, and the shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each layer being followed by a ReLU activation function.
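Under the stated configuration (four self-attention blocks in the deep reconstruction block; 3×3, stride-1 convolution layers followed by ReLU in the shallow reconstruction blocks), a reconstruction unit might be sketched as follows; the channel widths, the bilinear upsampling, and the particular skip connection shown are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepReconstructionBlock(nn.Module):
    """Four self-attention blocks that refine the fused deep features globally."""
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )

    def forward(self, tokens):                       # tokens: (B, N, C)
        for attn in self.blocks:
            tokens = tokens + attn(tokens, tokens, tokens, need_weights=False)[0]
        return tokens

class ShallowReconstructionBlock(nn.Module):
    """Two 3x3, stride-1 convolution layers, each followed by ReLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class FeatureReconstructionUnit(nn.Module):
    """Maps aggregated deep features back to image space, reusing deep features
    through a skip connection while the spatial size is restored."""
    def __init__(self, dim: int):
        super().__init__()
        self.deep = DeepReconstructionBlock(dim)
        self.shallow1 = ShallowReconstructionBlock(dim, dim // 2)
        self.shallow2 = ShallowReconstructionBlock(dim // 2 + dim, dim // 4)
        self.to_image = nn.Conv2d(dim // 4, 1, 3, 1, 1)   # final convolution layer

    def forward(self, fused_tokens, H, W):
        B, N, C = fused_tokens.shape
        deep = self.deep(fused_tokens).transpose(1, 2).reshape(B, C, H, W)
        x = F.interpolate(self.shallow1(deep), scale_factor=2, mode='bilinear', align_corners=False)
        skip = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
        x = self.shallow2(torch.cat([x, skip], dim=1))    # skip connection reusing deep features
        return torch.sigmoid(self.to_image(x))
```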
The refinement process of the shunted attention Transformer network model can be quantified as:

I_Fu = FRU(F_FUS)

wherein S(·) represents the shallow convolution block, and I_n and V_i represent the input infrared and visible light images, respectively. SAFEU(·) is the shunted attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible light images. CAFU(·) represents the cross-attention fusion unit, whose outputs are the infrared and visible light features aggregated after intra-domain and inter-domain interaction. Concat(·) represents concatenation along the channel dimension; FRU(·) represents the feature reconstruction unit; F_FUS represents the fused depth features. I_Fu is the fused image generated after reconstruction and upsampling.
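Assuming unit interfaces like those sketched above, the quantified process can be composed as a schematic forward pass; the shared shallow stem and shared feature extraction weights for the two modalities are assumptions.

```python
import torch

def fuse(ir_img, vi_img, stem, safeu, cafu, fru, H, W):
    """Schematic pipeline: I_Fu = FRU(Concat(CAFU(SAFEU(S(I_n)), SAFEU(S(V_i)))))."""
    f_ir = safeu(stem(ir_img))               # deep granularity features of the infrared image
    f_vi = safeu(stem(vi_img))               # deep granularity features of the visible image
    a_ir, a_vi = cafu(f_ir, f_vi)            # intra-domain and inter-domain aggregation
    f_fus = torch.cat([a_ir, a_vi], dim=-1)  # Concat along the channel dimension of the tokens
    return fru(f_fus, H, W)                  # fused image after reconstruction and upsampling
```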
The purpose of image fusion is to integrate the detail information of the source images into a single fused image and to generate a fused image with salient targets from the intensity information of the source images. To pursue better feature learning capability, the invention adopts a granularity loss function that promotes the control of coarse- and fine-granularity information during feature extraction and fusion, as well as the maintenance of structural similarity. The network model is constrained during training by the granularity loss function, which comprises a structural similarity loss, a fine-grained loss and a coarse-grained loss, so that the fused image has structural and granularity information similar to the input images. The granularity loss function L_G is expressed as:

L_G = α·L_S + β·(L_FG + L_CG)

wherein L_S, L_FG and L_CG denote the structural similarity loss, the fine-grained loss and the coarse-grained loss, respectively.
The structural similarity loss L_S combines three components (brightness, structure and contrast) and is an effective measure of the structural similarity between two different images; it is defined as:

L_S = 1 - SSIM(I_n, V_i, I_Fu)

wherein I_Fu is the fusion result and SSIM(·) represents the structural similarity function, defined for a source image I_* and the fused image I_Fu as:

SSIM(I_*, I_Fu) = [(2μ_{I_*}μ_{I_Fu} + C_1)/(μ_{I_*}^2 + μ_{I_Fu}^2 + C_1)]·[(2σ_{I_*}σ_{I_Fu} + C_2)/(σ_{I_*}^2 + σ_{I_Fu}^2 + C_2)]·[(σ_{I_*I_Fu} + C_3)/(σ_{I_*}σ_{I_Fu} + C_3)]

wherein I_* represents a source image (V_i or I_n); μ and σ represent the mean and standard deviation, respectively; and C_1, C_2 and C_3 are constants that maintain stability.
The fine-grained loss L_FG guides the network to preserve as many detail features as possible, while the coarse-grained loss L_CG directs the network to capture the appropriate target information; the two losses are defined in terms of the following quantities: ||·||_1 represents the L_1 norm; max{·} represents the element-wise maximum selection; ∇ represents the Sobel gradient operator, which measures the texture details of an image; and |·| is the absolute value operation.
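Since the text names the ingredients of these losses (the L_1 norm, element-wise maximum selection, Sobel gradients, absolute values and, in the claims, normalization by the image height and width with a hyper-parameter γ) without reproducing the formulas here, the following PyTorch sketch shows one common way such a granularity loss can be assembled. The exact placement of γ, the averaging of SSIM over the two source images, and the external ssim_fn are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_grad(img):
    """Absolute Sobel gradient magnitude |∇I| of a single-channel image batch (B, 1, H, W)."""
    gx = F.conv2d(img, SOBEL_X.to(img.device), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img.device), padding=1)
    return gx.abs() + gy.abs()

def granularity_loss(i_fu, i_n, v_i, ssim_fn, alpha=10.0, beta=1.0, gamma=1.0):
    """L_G = alpha * L_S + beta * (L_FG + L_CG), under assumed definitions:
       L_S  : 1 - SSIM between the fused image and the two sources (averaged),
       L_FG : L1 distance between |grad(I_Fu)| and the element-wise max of the source gradients,
       L_CG : L1 distance between I_Fu and the element-wise max of the source intensities
              (gamma weighting the visible image), both normalized by H * W."""
    h, w = i_fu.shape[-2:]
    l_s = 1.0 - 0.5 * (ssim_fn(i_fu, i_n) + ssim_fn(i_fu, v_i))
    l_fg = torch.norm(sobel_grad(i_fu) - torch.maximum(sobel_grad(i_n), sobel_grad(v_i)), p=1) / (h * w)
    l_cg = torch.norm(i_fu - torch.maximum(i_n, gamma * v_i), p=1) / (h * w)
    return alpha * l_s + beta * (l_fg + l_cg)
```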
To demonstrate the superiority of the proposed system, test image pairs are first selected from the MSRS dataset for comparison with ten state-of-the-art infrared and visible light image fusion methods, and selected image pairs from the M3FD and TNO datasets are then used for further validation. In the qualitative evaluation, the images are judged by the human visual system in terms of, for example, image detail, brightness, and the integrity of targets. As shown in FIGS. 3-4, the proposed shunted attention Transformer network model achieves better visual perception than the other approaches in preserving visible details and infrared targets, and shows better fusion performance in maintaining visible texture details and the distribution of salient infrared targets. The generated fused images better conform to human visual perception, with clear and natural subjective visual effects.
To avoid interference from human factors and to measure fusion capability comprehensively, the invention also uses the objective quantitative metrics MI, Q_abf, VIF, AG, SF and SSIM to evaluate the fusion results. Test image pairs from the M3FD dataset are used as the test set to complete the different infrared and visible light image fusion tasks, and the quantitative results are shown in FIG. 5. The quantitative experiments show that the proposed method obtains the highest values on all metrics except the SSIM metric. Because the VIF metric is consistent with the human visual system, this demonstrates that the network produces results with better human visual quality; meanwhile, the fused images retain a large amount of information from the infrared and visible source images. SF and AG reflect the detail and texture of the fused image, respectively. Although the SSIM value of the proposed method is not optimal, the comparable result still indicates that the fused images obtained by the proposed method contain sufficient structural and gradient information.
To explore the impact of infrared and visible light image fusion on multi-modal target detection, the invention uses the fused images generated by the network to evaluate target detection performance, with YOLOv7 as the reference detection model. FIG. 6 shows the qualitative detection results. The results indicate that the fused images generated by the network achieve the best detection performance, especially for people and vehicles; by fusing salient region features and texture information, a more comprehensive scene description is provided and detection accuracy is improved. In addition, the invention uses DeepLabv3+ as the reference model for training and compares effectiveness through the intersection-over-union. The semantic segmentation results show that the granularity information of the global context is effectively integrated, and the intra-domain and inter-domain complementary information also enhances the semantic features of the fused image, thereby improving the perception capability and segmentation accuracy of the model.
The model provided by the invention has remarkable advantages in both visual performance and objective evaluation. The generated fused images have better visual perception, contain sufficient salient features and texture detail information, and are produced with higher time efficiency. The model also shows potential for high-level vision tasks such as object detection and semantic segmentation. Therefore, the system provided by the invention is beneficial to the development of infrared and visible light image fusion.
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. An infrared light and visible light image fusion system based on a shunted attention Transformer, comprising: a shunted attention Transformer network model for fusing infrared and visible light images, wherein the input infrared and visible light images first enter a shallow convolution block composed of several convolution layers to extract shallow local features, and a multi-scale technique then decomposes the shallow local features onto different scales, capturing shallow information at each scale.
2. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 1, wherein the shunted attention Transformer network model comprises a shunted attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit.
3. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 2, wherein the shunted attention feature extraction unit comprises three stage blocks, each stage block comprising six shunted Transformer sub-blocks; each sub-block is driven by multi-granularity learning and, by injecting tokens into heterogeneous receptive fields, captures multi-granularity information while reducing the number of tokens; and the shunted Transformer sub-blocks model global features and extract multi-granularity features over a global scope.
4. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 2, wherein the cross-attention fusion unit comprises two cross-attention residual blocks; each residual block is provided with a self-attention-based intra-domain fusion block for effectively integrating global interaction information within the same domain, and a cross-attention-based inter-domain fusion block for further integrating global interaction information between different domains.
5. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 4, wherein the inter-domain fusion block uses a cross-attention mechanism to realize the interaction of global feature information, and alternating integration of intra-domain and inter-domain interactions is realized by merging cross-domain information and using skip connections to retain the information of the different domains.
6. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 2, wherein the feature reconstruction unit comprises a deep reconstruction block and a shallow reconstruction block for mapping the aggregated deep features back into the image space; the deep reconstruction block comprises four self-attention blocks, and the shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each layer being followed by a ReLU activation function.
7. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 6, wherein the deep reconstruction block refines the fused deep features and realizes multi-scale feature reconstruction from a global perspective; the image size is further restored by the shallow reconstruction blocks and a convolution layer, and feature transfer is then enhanced by skip connections, maximally reusing features from different layers to construct the fused image.
8. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 1, wherein the refinement process of the shunted attention Transformer network model is quantified as:

I_Fu = FRU(F_FUS)

wherein S(·) represents the shallow convolution block, and I_n and V_i represent the input infrared and visible light images, respectively; SAFEU(·) is the shunted attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible light images; CAFU(·) represents the cross-attention fusion unit, whose outputs are the infrared and visible light features aggregated after intra-domain and inter-domain interaction; Concat(·) represents concatenation along the channel dimension; FRU(·) represents the feature reconstruction unit; F_FUS represents the fused depth features; and I_Fu is the fused image generated after reconstruction and upsampling.
9. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 1, wherein the shunted attention Transformer network model is trained using a granularity loss function comprising a structural similarity loss, a fine-grained loss and a coarse-grained loss; the granularity loss L_G is expressed as:

L_G = α·L_S + β·(L_FG + L_CG)

wherein L_S, L_FG and L_CG represent the structural similarity loss, the fine-grained loss and the coarse-grained loss, respectively; and α and β are hyper-parameters of the loss function.
10. The infrared light and visible light image fusion system based on a shunted attention Transformer according to claim 9, wherein the structural similarity loss L_S is:

L_S = 1 - SSIM(I_n, V_i, I_Fu)

wherein I_Fu is the fusion result and SSIM(·) represents the structural similarity function, defined for a source image I_* and the fused image I_Fu as:

SSIM(I_*, I_Fu) = [(2μ_{I_*}μ_{I_Fu} + C_1)/(μ_{I_*}^2 + μ_{I_Fu}^2 + C_1)]·[(2σ_{I_*}σ_{I_Fu} + C_2)/(σ_{I_*}^2 + σ_{I_Fu}^2 + C_2)]·[(σ_{I_*I_Fu} + C_3)/(σ_{I_*}σ_{I_Fu} + C_3)]

wherein I_* represents a source image (V_i or I_n); μ and σ represent the mean and standard deviation, respectively; and C_1, C_2 and C_3 are constants that maintain stability;

and wherein the fine-grained loss L_FG and the coarse-grained loss L_CG are defined in terms of the following quantities: ||·||_1 represents the L_1 norm; max{·} represents the element-wise maximum selection; ∇ represents the Sobel gradient operator; |·| is the absolute value operation; H and W are the height and width of the image; and γ is a hyper-parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310477962.5A CN116503703A (en) | 2023-04-28 | 2023-04-28 | Infrared light and visible light image fusion system based on shunted attention Transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310477962.5A CN116503703A (en) | 2023-04-28 | 2023-04-28 | Infrared light and visible light image fusion system based on shunted attention Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116503703A true CN116503703A (en) | 2023-07-28 |
Family
ID=87329879
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310477962.5A CN116503703A (en) | Infrared light and visible light image fusion system based on shunted attention Transformer | 2023-04-28 | 2023-04-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116503703A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036893A (en) * | 2023-10-08 | 2023-11-10 | 南京航空航天大学 | Image fusion method based on local cross-stage and rapid downsampling |
CN117115065A (en) * | 2023-10-25 | 2023-11-24 | 宁波纬诚科技股份有限公司 | Fusion method of visible light and infrared image based on focusing loss function constraint |
CN117115061A (en) * | 2023-09-11 | 2023-11-24 | 北京理工大学 | Multi-mode image fusion method, device, equipment and storage medium |
CN117391983A (en) * | 2023-10-26 | 2024-01-12 | 安徽大学 | Infrared image and visible light image fusion method |
CN118314432A (en) * | 2024-06-11 | 2024-07-09 | 合肥工业大学 | Target detection method and system for multi-source three-dimensional inspection data fusion of transformer substation |
CN118446912A (en) * | 2024-07-11 | 2024-08-06 | 江西财经大学 | Multi-mode image fusion method and system based on multi-scale attention sparse cascade |
CN118552823A (en) * | 2024-07-30 | 2024-08-27 | 大连理工大学 | Infrared and visible light image fusion method of depth characteristic correlation matrix |
-
2023
- 2023-04-28 CN CN202310477962.5A patent/CN116503703A/en active Pending
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117115061A (en) * | 2023-09-11 | 2023-11-24 | 北京理工大学 | Multi-mode image fusion method, device, equipment and storage medium |
CN117115061B (en) * | 2023-09-11 | 2024-04-09 | 北京理工大学 | Multi-mode image fusion method, device, equipment and storage medium |
CN117036893A (en) * | 2023-10-08 | 2023-11-10 | 南京航空航天大学 | Image fusion method based on local cross-stage and rapid downsampling |
CN117036893B (en) * | 2023-10-08 | 2023-12-15 | 南京航空航天大学 | Image fusion method based on local cross-stage and rapid downsampling |
CN117115065A (en) * | 2023-10-25 | 2023-11-24 | 宁波纬诚科技股份有限公司 | Fusion method of visible light and infrared image based on focusing loss function constraint |
CN117115065B (en) * | 2023-10-25 | 2024-01-23 | 宁波纬诚科技股份有限公司 | Fusion method of visible light and infrared image based on focusing loss function constraint |
CN117391983A (en) * | 2023-10-26 | 2024-01-12 | 安徽大学 | Infrared image and visible light image fusion method |
CN118314432A (en) * | 2024-06-11 | 2024-07-09 | 合肥工业大学 | Target detection method and system for multi-source three-dimensional inspection data fusion of transformer substation |
CN118446912A (en) * | 2024-07-11 | 2024-08-06 | 江西财经大学 | Multi-mode image fusion method and system based on multi-scale attention sparse cascade |
CN118552823A (en) * | 2024-07-30 | 2024-08-27 | 大连理工大学 | Infrared and visible light image fusion method of depth characteristic correlation matrix |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116503703A (en) | Infrared light and visible light image fusion system based on shunted attention Transformer | |
Zhou et al. | Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network | |
Zhao et al. | Efficient and model-based infrared and visible image fusion via algorithm unrolling | |
Li et al. | Survey of single image super‐resolution reconstruction | |
Jin et al. | Pedestrian detection with super-resolution reconstruction for low-quality image | |
WO2023093186A1 (en) | Neural radiation field-based method and apparatus for constructing pedestrian re-identification three-dimensional data set | |
An et al. | TR-MISR: Multiimage super-resolution based on feature fusion with transformers | |
Zhou et al. | MSAR‐DefogNet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution | |
Jin et al. | Vehicle license plate recognition for fog‐haze environments | |
Liu et al. | A semantic-driven coupled network for infrared and visible image fusion | |
Li et al. | Image super-resolution reconstruction based on multi-scale dual-attention | |
Pang et al. | Lightweight multi-scale aggregated residual attention networks for image super-resolution | |
Zhu et al. | Multiscale channel attention network for infrared and visible image fusion | |
Zhang | Image enhancement method based on deep learning | |
Luo et al. | Infrared and visible image fusion based on VPDE model and VGG network | |
Li et al. | Image reflection removal using end‐to‐end convolutional neural network | |
Wang et al. | SCGRFuse: An infrared and visible image fusion network based on spatial/channel attention mechanism and gradient aggregation residual dense blocks | |
Zhao et al. | Real-aware motion deblurring using multi-attention CycleGAN with contrastive guidance | |
Wang et al. | Prior‐guided multiscale network for single‐image dehazing | |
Chen et al. | Contrastive learning with feature fusion for unpaired thermal infrared image colorization | |
Tao et al. | MFFDNet: Single Image Deraining via Dual-Channel Mixed Feature Fusion | |
Li et al. | Multi‐Scale Guided Attention Network for Crowd Counting | |
Liu et al. | Crowd counting method via a dynamic-refined density map network | |
Zhang et al. | Trustworthy image fusion with deep learning for wireless applications | |
Zhang et al. | Facial Image Shadow Removal via Graph‐based Feature Fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |