CN116503703A - Infrared and visible light image fusion system based on split-flow attention Transformer - Google Patents

Infrared and visible light image fusion system based on split-flow attention Transformer

Info

Publication number
CN116503703A
Authority
CN
China
Prior art keywords
attention
split
visible light
fusion
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310477962.5A
Other languages
Chinese (zh)
Inventor
周士华
姜洋
李嘉伟
胡轶男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN202310477962.5A
Publication of CN116503703A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/12 Details of acquisition arrangements; Constructional details thereof
    • G06V 10/14 Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V 10/143 Sensing or illuminating at different wavelengths
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion system based on a split-flow attention Transformer, and relates to the technical field of image fusion. The system comprises a split-flow attention Transformer network model for infrared and visible light image fusion, the network model comprising a split-flow attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit. To generate a fused image with rich scene details and good visual effect, the infrared and visible light images are each fed into the network model to extract shallow local features. The feature extraction unit then captures coarse-grained and fine-grained details within a single attention layer. In the fusion unit, a cross-attention mechanism is introduced to fuse cross-domain complementary features. In the reconstruction stage, the feature reconstruction unit adopts dense skip connections, making maximal use of deep and shallow features at different scales to construct the fused image.

Description

Infrared and visible light image fusion system based on split-flow attention Transformer
Technical Field
The invention relates to the technical field of image fusion, and in particular to an infrared and visible light image fusion system based on a split-flow attention Transformer.
Background
Owing to limitations in its physical characteristics, imaging mechanism and viewing angle, a single vision sensor often cannot extract enough information from a scene. Relying on thermal radiation, an infrared sensor highlights heat-source regions without being disturbed by the environment; however, because the images it generates have low resolution, structural features and detailed information are often lacking. In contrast, a visible-light sensor can generate visually friendly images with higher spatial resolution. To inherit the advantages of both sensors, preserving the thermal radiation and texture information of infrared and visible images through image fusion techniques is an effective approach. The fusion result offers excellent visual perception and scene representation capability and can be widely applied to image enhancement, semantic segmentation, target detection and other fields.
The key to the fusion of infrared and visible images is how to effectively integrate heat source features and detailed texture information. Over the past several decades, a number of conventional fusion methods and deep learning based fusion methods have been proposed.
Traditional fusion methods mainly comprise spatial-domain methods and multi-scale transform-domain methods. Spatial-domain methods typically compute a weighted average of the pixel-level saliency of the input images to obtain the fused image. Multi-scale transform-domain methods, such as the wavelet transform and the curvelet transform, convert the input into a transform domain through a mathematical transform and design fusion rules to fuse the images. Because these conventional methods do not consider the feature differences between the source images, they tend to degrade the fused image. In addition, their hand-crafted fusion rules and activity-level measurements cannot adapt to complex scenes, which also limits their wide application.
In recent years, deep learning has become mainstream in the field of image fusion owing to its excellent deep feature extraction capability. Such methods not only extract deep features from data automatically, but also overcome the difficulty traditional methods face in adapting to complex scenes. However, convolutional methods can only use local information for image fusion and cannot further improve the fusion effect through long-range dependencies. Some Transformer-based fusion methods benefit from the complementary aggregation of global context features and have demonstrated excellent performance, but they still have certain limitations. First, Transformer-based methods are less efficient at generating test images than convolutional neural network-based methods; moreover, most of them process a long sequence of tokens directly after partitioning the image, which results in high memory consumption. Second, because existing Transformer fusion networks ignore intra-layer mixed-granularity features, their ability to retain fine-grained details and coarse-grained objects is limited. Finally, the feature fusion process fuses information within only a single domain and lacks cross-domain context, which may degrade the visual appearance of the fusion result.
Disclosure of Invention
The invention aims to provide an infrared and visible light image fusion system based on a split-flow attention Transformer, which extracts and fuses global granularity information and local features, and markedly reduces the computational cost through long-range dependency learning and a reduced number of input tokens.
To achieve the above object, an infrared and visible light image fusion system based on a split-flow attention Transformer according to the present application includes a split-flow attention Transformer network model for fusing infrared and visible light images. The input infrared and visible light images first enter a shallow convolution block formed by several convolution layers to extract shallow local features; a multi-scale technique then decomposes the shallow local features onto different scales so that shallow information is captured at each scale.
Further, the split-flow attention Transformer network model comprises a split-flow attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit.
Further, the split-flow attention feature extraction unit comprises three stage blocks, each stage block comprising six split-flow Transformer sub-blocks, each sub-block driven by multi-granularity learning; by injecting tokens into heterogeneous receptive fields, multi-granularity information is captured while the number of tokens is reduced. The split-flow Transformer sub-block models global features and extracts multi-granularity features over a global scope.
Further, the cross-attention fusion unit comprises two cross-attention residual blocks; each residual block is provided with a self-attention-based intra-domain fusion block that effectively integrates global interaction information within the same domain, and a cross-attention-based inter-domain fusion block that further integrates global interaction information between different domains.
Furthermore, the inter-domain fusion block uses a cross-attention mechanism to exchange global feature information; by merging cross-domain information and preserving the information of each domain through skip connections, global intra-domain and inter-domain interactions are integrated alternately.
Further, the feature reconstruction unit comprises a deep reconstruction block and a shallow reconstruction block and is used to map the aggregated deep features back to image space; the deep reconstruction block comprises four self-attention blocks, and the shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each followed by a ReLU activation function.
Furthermore, the deep reconstruction block refines the fused deep features and performs multi-scale feature reconstruction from a global perspective; the image size is then restored by the shallow reconstruction blocks and convolution layers, and skip connections enhance feature transfer, maximally reusing features of different layers to construct the fused image.
Still further, the processing of the split-flow attention Transformer network model is quantified as:

I_Fu = FRU(F_FUS)

where S(·) denotes the shallow convolution block and I_n and V_i denote the input infrared and visible light images, respectively; SAFEU(·) is the split-flow attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible images; CAFU(·) denotes the cross-attention fusion unit, whose outputs are the infrared and visible features aggregated after intra-domain and inter-domain interaction; Concat(·) denotes concatenation along the channel dimension; FRU(·) denotes the feature reconstruction unit; F_FUS denotes the fused depth features obtained by concatenation; and I_Fu is the fused image generated after reconstruction and upsampling.
As a further aspect, the split-flow attention Transformer network model is trained using a granularity loss function comprising a structural similarity loss, a fine-granularity loss and a coarse-granularity loss; the granularity loss L_G is expressed as:

L_G = αL_S + β(L_FG + L_CG)

where L_S, L_FG and L_CG denote the structural similarity loss, the fine-granularity loss and the coarse-granularity loss, respectively, and α and β are hyper-parameters of the loss function.
Further, the structural similarity loss L_S is:

L_S = 1 − SSIM(I_n, V_i, I_Fu)

where I_Fu is the fusion result and SSIM(·) denotes the structural similarity function, computed between a source image I_* (V_i or I_n) and the fusion result from the means μ, the standard deviations σ, and the stability constants C_1, C_2 and C_3.

The fine-granularity loss L_FG and the coarse-granularity loss L_CG are defined using the L_1 norm ‖·‖_1, the element-wise maximum max{·}, the Sobel gradient operator ∇ and the absolute value operation |·|, where H and W are the height and width of the image and γ is a hyper-parameter.
Compared with the prior art, the technical scheme adopted by the invention has the following advantages: the split-flow attention Transformer network model establishes long-range dependencies between images, extracts and integrates granularity features, and reduces computational cost by effectively reducing the number of tokens, so that its time efficiency in generating test images is higher than that of existing Transformer-based and convolutional neural network-based fusion methods. The invention realizes the joint extraction of coarse-granularity and fine-granularity features in each attention layer through the split-flow attention feature extraction unit. The cross-attention fusion unit fully realizes intra-domain and inter-domain deep feature interaction and cross-domain information fusion, and the feature reconstruction unit combines the feature maps with the reconstructed image so that the network model can recover fused images at different resolutions. Furthermore, the network is driven by a granularity loss function consisting of a structural similarity loss, a fine-granularity loss and a coarse-granularity loss, using granularity information control and structure preservation to achieve feature extraction and fusion. The fused image generated by the split-flow attention Transformer network model has better visual perception, contains sufficient salient features and texture detail information, and is generated with high time efficiency.
Drawings
FIG. 1 is a schematic block diagram of an infrared and visible light image fusion system;
FIG. 2 is a schematic diagram of a cross-attention fusion unit and a reconstruction unit;
FIG. 3 is a qualitative comparison between the present method and other advanced fusion methods on the MSRS dataset;
FIG. 4 is a qualitative comparison between the present method and other advanced fusion methods on the M3FD dataset;
FIG. 5 is a quantitative comparison between the present method and other advanced fusion methods on the M3FD dataset;
FIG. 6 is a qualitative comparison of target detection between the present method and other advanced fusion methods on the MSRS dataset.
Detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the application, i.e., the embodiments described are merely some, but not all, of the embodiments of the application.
Example 1
As shown in FIG. 1, the infrared and visible light image fusion system based on the split-flow attention Transformer comprises a split-flow attention Transformer network model for infrared and visible light image fusion; the network model comprises a split-flow attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit. To generate a fused image with rich scene details and good visual effect, the infrared and visible light images are each fed into the network model to obtain salient information and texture details. The input infrared and visible light images first enter a shallow convolution block composed of four convolution layers, which extracts shallow local features. A multi-scale technique then decomposes the shallow local features onto different scales and captures shallow information at each scale; extracting shallow information at different scales helps the network understand image details better.
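For illustration only, the following PyTorch-style sketch shows one way the shallow convolution block and the multi-scale decomposition described above could look; the channel width, the single-channel input and the pooling factors are assumptions, not values taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowConvBlock(nn.Module):
    # Sketch of the shallow convolution block: four 3x3 convolution layers
    # extracting local features. The width of 32 channels is an assumption.
    def __init__(self, in_ch=1, width=32):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(4):  # "composed of four convolution layers"
            layers += [nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(inplace=True)]
            ch = width
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

def multiscale_decompose(feat, factors=(1, 2, 4)):
    # Decompose shallow features onto different scales; average pooling and the
    # factors (1, 2, 4) stand in for the unspecified multi-scale technique.
    return [feat if f == 1 else F.avg_pool2d(feat, f) for f in factors]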
To address the high memory consumption and single-scale tokens of the standard self-attention mechanism, the invention introduces multi-granularity joint learning in the feature extraction stage to learn granularity information within a single attention layer. As shown in the upper left of FIG. 1, the split-flow attention feature extraction unit is designed to better explore multi-granularity depth features; it comprises three stage blocks, each containing six split-flow Transformer sub-blocks, and each sub-block is driven by multi-granularity learning. By injecting tokens into heterogeneous receptive fields, the number of tokens is reduced while multi-granularity information is captured. The split-flow Transformer sub-block effectively models global features and extracts multi-granularity features over a global scope.
At the heart of multi-granularity joint learning is the split-flow attention mechanism, which models objects of different scales within the same attention layer and learns multi-granularity information in parallel. The keys K and values V of different attention heads within the same attention layer are downsampled to different sizes, reducing the number of tokens while enabling coarse- and fine-granularity information to be captured and merged. The mixed-granularity features are then aggregated through skip connections. The mechanism offers good computational efficiency and retains fine-grained detail information.
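A minimal sketch of such a split-flow (shunted) self-attention layer is given below, assuming a PyTorch implementation; the embedding size, head count and downsampling rates are illustrative choices rather than values specified in the patent.

import torch
import torch.nn as nn

class ShuntedSelfAttention(nn.Module):
    # Sketch of split-flow self-attention: keys and values are spatially
    # downsampled at a different rate for each group of heads, so coarse- and
    # fine-granularity context are learned in parallel within one attention
    # layer while the number of K/V tokens is reduced.
    def __init__(self, dim=64, num_heads=4, sr_ratios=(1, 2)):
        super().__init__()
        assert num_heads % len(sr_ratios) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.groups = len(sr_ratios)
        self.heads_per_group = num_heads // self.groups
        self.group_dim = self.heads_per_group * self.head_dim
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        # one spatial-reduction conv and one K/V projection per granularity
        self.sr = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=r, stride=r) if r > 1 else nn.Identity()
            for r in sr_ratios])
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * self.group_dim) for _ in sr_ratios])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        outs = []
        for g in range(self.groups):
            # heterogeneous receptive field: downsample the K/V token map
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = self.sr[g](feat).flatten(2).transpose(1, 2)        # (B, N_g, C)
            k, v = self.kv[g](feat).chunk(2, dim=-1)
            k = k.view(B, -1, self.heads_per_group, self.head_dim).transpose(1, 2)
            v = v.view(B, -1, self.heads_per_group, self.head_dim).transpose(1, 2)
            q_g = q[:, g * self.heads_per_group:(g + 1) * self.heads_per_group]
            attn = (q_g @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)                     # (B, h_g, N, d)
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)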
After deep features are extracted, the invention designs a cross-attention fusion unit to further mine and aggregate intra-domain and inter-domain context information, as shown in FIG. 2. The unit comprises two cross-attention residual blocks; each residual block is provided with a self-attention-based intra-domain fusion block that effectively integrates global interaction information within the same domain, and a cross-attention-based inter-domain fusion block that further integrates global interaction information between different domains. The inter-domain fusion block uses a cross-attention mechanism to exchange global feature information; by merging cross-domain information and combining the information of the different domains through skip connections, global intra-domain and inter-domain interactions are integrated alternately.
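The sketch below illustrates, under assumed PyTorch building blocks, how one such cross-attention residual block could be organized: a self-attention step per domain followed by a cross-attention step in which each domain queries the other, with residual (skip) connections throughout. The dimensions, head count and normalization placement are assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    # Sketch of one cross-attention residual block: intra-domain self-attention
    # refines each modality, then inter-domain cross-attention exchanges global
    # information between the infrared and visible branches.
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.self_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_vi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_vi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ir = nn.LayerNorm(dim)
        self.norm_vi = nn.LayerNorm(dim)

    def forward(self, f_ir, f_vi):
        # f_ir, f_vi: (B, N, C) token sequences from the two modalities
        # intra-domain fusion: self-attention within each domain
        f_ir = f_ir + self.self_ir(f_ir, f_ir, f_ir)[0]
        f_vi = f_vi + self.self_vi(f_vi, f_vi, f_vi)[0]
        # inter-domain fusion: queries from one domain attend to the other
        f_ir2 = f_ir + self.cross_ir(f_ir, f_vi, f_vi)[0]
        f_vi2 = f_vi + self.cross_vi(f_vi, f_ir, f_ir)[0]
        return self.norm_ir(f_ir2), self.norm_vi(f_vi2)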
After the complementary information from the different domains is fully aggregated, the aggregated deep features are mapped back to image space by the feature reconstruction unit. A deep reconstruction block refines the fused deep features and reconstructs the multi-scale features from a global perspective. The image size is then restored through two convolutional shallow reconstruction blocks and one convolution layer; skip connections enhance feature transfer, maximally reusing features of different layers to construct the fused image. The deep reconstruction block comprises four self-attention blocks, and each shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each followed by a ReLU activation function.
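As a hedged sketch of this reconstruction stage, the module below refines the fused tokens with four self-attention blocks and then applies a convolutional shallow reconstruction with ReLU activations; the channel width, the single-channel output, the final sigmoid, and the omission of the upsampling step are all assumptions made for brevity.

import torch
import torch.nn as nn

class FeatureReconstructionUnit(nn.Module):
    # Sketch of the feature reconstruction unit: a deep reconstruction block of
    # four self-attention layers followed by shallow convolutional reconstruction
    # (3x3 kernels, stride 1, ReLU) and a final convolution back to image space.
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.deep = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(4)])
        self.shallow = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        self.out_conv = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, f_fus, H, W):
        # f_fus: (B, N, C) fused deep features with N = H * W
        x = f_fus
        for attn in self.deep:                  # deep reconstruction block
            x = x + attn(x, x, x)[0]            # residual (skip) connection
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, H, W)
        x = x + self.shallow(x)                 # reuse shallow-scale features
        return torch.sigmoid(self.out_conv(x))  # fused image, assumed in [0, 1]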
The processing of the split-flow attention Transformer network model can be quantified as:

I_Fu = FRU(F_FUS)

where S(·) denotes the shallow convolution block and I_n and V_i denote the input infrared and visible light images, respectively. SAFEU(·) is the split-flow attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible images. CAFU(·) denotes the cross-attention fusion unit, whose outputs are the infrared and visible features aggregated after intra-domain and inter-domain interaction. Concat(·) denotes concatenation along the channel dimension, FRU(·) denotes the feature reconstruction unit, F_FUS denotes the fused depth features, and I_Fu is the fused image generated after reconstruction and upsampling.
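Tying the units together, a hedged sketch of the overall forward pass could read as follows; the unit abbreviations follow the text above, while the exact tensor layouts and whether the two modalities share SAFEU weights are assumptions.

import torch

def fuse(ir_img, vi_img, shallow, safeu, cafu, fru):
    # Sketch of the quantified process: shallow features -> split-flow attention
    # features -> cross-attention aggregation -> channel concatenation -> I_Fu.
    f_ir = safeu(shallow(ir_img))            # deep granularity features (infrared)
    f_vi = safeu(shallow(vi_img))            # deep granularity features (visible)
    g_ir, g_vi = cafu(f_ir, f_vi)            # intra- and inter-domain aggregation
    f_fus = torch.cat([g_ir, g_vi], dim=1)   # Concat(.) along the channel dimension
    return fru(f_fus)                        # I_Fu = FRU(F_FUS)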
The purpose of image fusion is to integrate the detail information of the source images into a single fused image and to generate a fused image with salient targets from the intensity information of the source images. To pursue better feature learning capability, the invention adopts a granularity loss function that promotes control over coarse- and fine-granularity information during feature extraction and fusion and maintains structural similarity. The network model is constrained during training by a granularity loss function comprising a structural similarity loss, a fine-granularity loss and a coarse-granularity loss, so that the fused image has structural and granularity information similar to the input images. The granularity loss function L_G is expressed as:

L_G = αL_S + β(L_FG + L_CG)

where L_S, L_FG and L_CG denote the structural similarity loss, the fine-granularity loss and the coarse-granularity loss, respectively.
The structural similarity loss L_S, which combines brightness, structure and contrast, is an effective measure of the structural similarity between two different images and is defined as:

L_S = 1 − SSIM(I_n, V_i, I_Fu)

where I_Fu is the fusion result and SSIM(·) denotes the structural similarity function, computed between a source image I_* (V_i or I_n) and the fusion result from the means μ, the standard deviations σ, and the stability constants C_1, C_2 and C_3.
The fine-granularity loss L_FG guides the network to retain as many detail features as possible, while the coarse-granularity loss L_CG guides the network to capture the appropriate target information. The fine- and coarse-granularity losses are defined using the L_1 norm ‖·‖_1, the element-wise maximum max{·}, the Sobel gradient operator ∇, which measures the texture details of an image, and the absolute value operation |·|.
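Because the closed forms of L_FG and L_CG appear in the original figures rather than in this text, the sketch below is only a plausible reconstruction built from the symbols listed above (L_1 norm, element-wise maximum, Sobel gradient, hyper-parameter γ), not the patented definitions; the SSIM term is likewise assumed to be averaged over the two source images, and ssim_fn stands for any standard SSIM implementation. Images are assumed to be single-channel tensors of shape (B, 1, H, W).

import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)

def sobel_grad(img):
    # Magnitude of the Sobel gradient, used here as the texture-detail measure.
    kx = _SOBEL_X.to(img.device)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, kx.transpose(-1, -2).contiguous(), padding=1)
    return gx.abs() + gy.abs()

def granularity_loss(i_fu, i_n, v_i, ssim_fn, alpha=1.0, beta=1.0, gamma=1.0):
    # L_G = alpha * L_S + beta * (L_FG + L_CG); the L_FG / L_CG forms below
    # (element-wise maxima of gradients and of intensities) are assumptions.
    h, w = i_fu.shape[-2:]
    l_s = 1.0 - 0.5 * (ssim_fn(i_fu, i_n) + ssim_fn(i_fu, v_i))
    l_fg = torch.norm(sobel_grad(i_fu) - torch.max(sobel_grad(i_n), sobel_grad(v_i)), p=1) / (h * w)
    l_cg = torch.norm(i_fu - torch.max(gamma * i_n, v_i), p=1) / (h * w)
    return alpha * l_s + beta * (l_fg + l_cg)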
To demonstrate the superiority of the system of the invention, test image pairs are first selected from the MSRS dataset and compared against ten state-of-the-art infrared and visible light image fusion methods, and the comparison is then further validated on image pairs selected from the M3FD and TNO datasets. In the qualitative assessment, the images are judged by the human visual system on aspects such as image detail, brightness and target integrity. As shown in FIGS. 3-4, the proposed split-flow attention Transformer network model achieves better visual perception than the other approaches in preserving visible details and infrared targets. The method also shows better fusion performance in maintaining visible texture detail and the distribution of salient infrared targets. The generated fused images accord better with human visual perception, and their subjective visual effect is clear and natural.
To avoid the interference of human factors and measure the fusion capability comprehensively, the invention also uses the objective quantitative indexes MI, Qabf, VIF, AG, SF and SSIM to evaluate the fusion results. The test image pairs from the M3FD dataset are used as a test set for the different infrared and visible light image fusion tasks, and the quantitative results are shown in FIG. 5. The quantitative results obtained by the method reach the highest values on all indexes except SSIM. Because the VIF index is consistent with the human visual system, this confirms that the network produces a better human visual effect; at the same time, the fused images retain a large amount of information from the infrared and visible source images, and SF and AG reflect the detail and texture of the fused image, respectively. Although the SSIM index of the proposed method is not optimal, the comparable result still indicates that the fused image obtained by the proposed method contains sufficient structural and gradient information.
To explore the impact of infrared and visible light image fusion on multi-modal target detection, the invention uses the network-generated fused images to evaluate target detection performance, with YOLOv7 as the reference detection model. FIG. 6 shows the qualitative detection results. The results indicate that the fused images generated by the network achieve the best detection performance, especially for people and vehicles: fusing salient-region features and texture information provides a more comprehensive scene description and improves detection accuracy. In addition, the invention trains DeepLabv3+ as a reference model for semantic segmentation and compares model effectiveness through the intersection-over-union metric. The segmentation results show that the granularity information of the global context is effectively integrated, and the intra-domain and inter-domain complementary information enhances the semantic features of the fused image, thereby improving the perception capability and segmentation accuracy of the model.
The model provided by the invention has notable advantages in both visual quality and objective evaluation. The generated fused images have better visual perception, contain sufficient salient features and texture detail information, and are produced with high time efficiency. Their potential for high-level vision tasks is also demonstrated in object detection and semantic segmentation. The system provided by the invention therefore benefits the development of infrared and visible light image fusion.
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (10)

1. An infrared and visible light image fusion system based on a split-flow attention Transformer, comprising a split-flow attention Transformer network model for fusing infrared and visible light images, wherein the input infrared and visible light images first enter a shallow convolution block formed by a plurality of convolution layers to extract shallow local features, and a multi-scale technique then decomposes the shallow local features onto different scales so that shallow information is captured at each scale.
2. The infrared and visible light image fusion system based on a split-flow attention Transformer according to claim 1, wherein the split-flow attention Transformer network model comprises a split-flow attention feature extraction unit, a cross-attention fusion unit and a feature reconstruction unit.
3. The split-flow attention Transformer-based infrared and visible light image fusion system of claim 2, wherein the split-flow attention feature extraction unit comprises three stage blocks, each stage block comprising six split-flow Transformer sub-blocks, each sub-block driven by multi-granularity learning, capturing multi-granularity information while reducing the number of tokens by injecting tokens into heterogeneous receptive fields; the split-flow Transformer sub-block models global features and extracts multi-granularity features over a global scope.
4. The split-flow attention Transformer-based infrared and visible light image fusion system according to claim 2, wherein the cross-attention fusion unit comprises two cross-attention residual blocks, each residual block being provided with a self-attention-based intra-domain fusion block for effectively integrating global interaction information within the same domain, and a cross-attention-based inter-domain fusion block for further integrating global interaction information between different domains.
5. The split-flow attention Transformer-based infrared and visible light image fusion system according to claim 4, wherein the inter-domain fusion block uses a cross-attention mechanism to exchange global feature information, and the alternate integration of global intra-domain and inter-domain interactions is realized by merging cross-domain information and using skip connections to retain the information of the different domains.
6. The split-flow attention Transformer-based infrared and visible light image fusion system of claim 2, wherein the feature reconstruction unit comprises a deep reconstruction block and a shallow reconstruction block for mapping the aggregated deep features back into image space; the deep reconstruction block comprises four self-attention blocks, and the shallow reconstruction block comprises two convolution layers with 3×3 kernels and stride 1, each followed by a ReLU activation function.
7. The split-flow attention Transformer-based infrared and visible light image fusion system according to claim 6, wherein the deep reconstruction block refines the fused deep features and performs multi-scale feature reconstruction from a global perspective; the image size is further restored by the shallow reconstruction blocks and convolution layers, and skip connections then enhance feature transfer, maximally reusing features of different layers to construct the fused image.
8. The split-flow attention Transformer-based infrared and visible light image fusion system of claim 1, wherein the processing of the split-flow attention Transformer network model is quantified as:

I_Fu = FRU(F_FUS)

where S(·) denotes the shallow convolution block and I_n and V_i denote the input infrared and visible light images, respectively; SAFEU(·) is the split-flow attention feature extraction unit, whose outputs are the deep granularity features of the infrared and visible images; CAFU(·) denotes the cross-attention fusion unit, whose outputs are the infrared and visible features aggregated after intra-domain and inter-domain interaction; Concat(·) denotes concatenation along the channel dimension; FRU(·) denotes the feature reconstruction unit; F_FUS denotes the fused depth features; and I_Fu is the fused image generated after reconstruction and upsampling.
9. The split-flow attention Transformer-based infrared and visible light image fusion system of claim 1, wherein the split-flow attention Transformer network model is trained using a granularity loss function comprising a structural similarity loss, a fine-granularity loss and a coarse-granularity loss; the granularity loss L_G is expressed as:

L_G = αL_S + β(L_FG + L_CG)

where L_S, L_FG and L_CG denote the structural similarity loss, the fine-granularity loss and the coarse-granularity loss, respectively, and α and β are hyper-parameters of the loss function.
10. The split-flow attention Transformer-based infrared and visible light image fusion system of claim 9, wherein the structural similarity loss L_S is:

L_S = 1 − SSIM(I_n, V_i, I_Fu)

where I_Fu is the fusion result and SSIM(·) denotes the structural similarity function, computed between a source image I_* (V_i or I_n) and the fusion result from the means μ, the standard deviations σ, and the stability constants C_1, C_2 and C_3; and wherein the fine-granularity loss L_FG and the coarse-granularity loss L_CG are defined using the L_1 norm ‖·‖_1, the element-wise maximum max{·}, the Sobel gradient operator ∇ and the absolute value operation |·|, where H and W are the height and width of the image and γ is a hyper-parameter.
CN202310477962.5A 2023-04-28 2023-04-28 Infrared and visible light image fusion system based on split-flow attention Transformer Pending CN116503703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310477962.5A CN116503703A (en) Infrared and visible light image fusion system based on split-flow attention Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310477962.5A CN116503703A (en) Infrared and visible light image fusion system based on split-flow attention Transformer

Publications (1)

Publication Number Publication Date
CN116503703A true CN116503703A (en) 2023-07-28

Family

ID=87329879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310477962.5A Pending CN116503703A (en) 2023-04-28 2023-04-28 Infrared light and visible light image fusion system based on shunt attention transducer

Country Status (1)

Country Link
CN (1) CN116503703A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115061A (en) * 2023-09-11 2023-11-24 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium
CN117115061B (en) * 2023-09-11 2024-04-09 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium
CN117036893A (en) * 2023-10-08 2023-11-10 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117036893B (en) * 2023-10-08 2023-12-15 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117115065A (en) * 2023-10-25 2023-11-24 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint
CN117115065B (en) * 2023-10-25 2024-01-23 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint
CN117391983A (en) * 2023-10-26 2024-01-12 安徽大学 Infrared image and visible light image fusion method
CN118314432A (en) * 2024-06-11 2024-07-09 合肥工业大学 Target detection method and system for multi-source three-dimensional inspection data fusion of transformer substation
CN118446912A (en) * 2024-07-11 2024-08-06 江西财经大学 Multi-mode image fusion method and system based on multi-scale attention sparse cascade
CN118552823A (en) * 2024-07-30 2024-08-27 大连理工大学 Infrared and visible light image fusion method of depth characteristic correlation matrix

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination