CN115222601A - Image super-resolution reconstruction model and method based on residual mixed attention network - Google Patents

Image super-resolution reconstruction model and method based on residual mixed attention network

Info

Publication number
CN115222601A
CN115222601A (Application CN202210940743.1A)
Authority
CN
China
Prior art keywords: module, residual, attention, resolution, representing
Prior art date
Legal status: Pending
Application number
CN202210940743.1A
Other languages
Chinese (zh)
Inventor
黄峰
郑伟煌
沈英
吴靖
陈丽琼
Current Assignee
"belt And Road" Spatial Information Corridor Haisi Research Institute
Fuzhou University
Original Assignee
"belt And Road" Spatial Information Corridor Haisi Research Institute
Fuzhou University
Priority date: 2022-08-06
Filing date: 2022-08-06
Publication date: 2022-10-21
Application filed by "belt And Road" Spatial Information Corridor Haisi Research Institute and Fuzhou University
Priority to CN202210940743.1A
Publication of CN115222601A

Classifications

    • G06T 3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N 3/02 — Neural networks; G06N 3/08 — Learning methods
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/82 — Arrangements for image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an image super-resolution reconstruction model and method based on a residual mixed attention network. The model comprises a shallow feature extraction module, a deep feature extraction module and a reconstruction module. The shallow feature extraction module extracts shallow features from the low-resolution image; the deep feature extraction module, formed by connecting cascaded residual separation mixed attention groups with a global residual, performs feature extraction and fusion on the shallow features to obtain deep features; the reconstruction module upsamples the deep features with sub-pixel convolution to obtain an image of higher resolution. The residual separation mixed attention module splits the feature map with a channel separation technique and feeds the two parts into two branch modules in parallel, fusing the local features extracted by the residual triple attention module with the global features extracted by the efficient Swin Transformer module to obtain rich high-frequency and low-frequency information. In this way, an image with richer details can be obtained and higher-precision super-resolution reconstruction can be achieved.

Description

Image super-resolution reconstruction model and method based on residual mixed attention network
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to an image super-resolution reconstruction model and method based on a residual mixed attention network.
Background
In the field of electronic image applications, it is often desirable to obtain high-resolution images. High resolution means a high pixel density in the image, which provides detail that is essential in many practical applications. For example, high-resolution medical images are very helpful for physicians in making correct diagnoses; similar objects are more easily distinguished from one another in high-resolution satellite images; and the performance of pattern recognition in computer vision is greatly enhanced when high-resolution images are available.
Due to cost and technical limitations, the pictures obtained in most cases do not reach the desired resolution. In addition, the image is subject to various constraints during acquisition, such as optical blur caused by defocusing and diffraction in the digital imaging process, motion blur caused by limited shutter speed, aliasing effects influenced by the density of the sensing units, and random noise in the photoreceptor or during image transmission, all of which degrade the quality of the generated image. It is therefore highly necessary to find a way to enhance the current resolution level.
Image super-resolution reconstruction, as a post-processing technique, can enhance the resolution of an image without increasing hardware cost. Its goal is to algorithmically restore a high-resolution image from a given low-resolution image. Current image super-resolution reconstruction methods can be divided into three main categories: interpolation-based, reconstruction-based, and learning-based. Interpolation methods insert new pixels around the original pixels of the image to increase its size and assign values to the inserted pixels, thereby restoring the image content and improving the apparent resolution. Such methods are computationally simple and easy to understand and implement, but the reconstruction results can suffer from ringing artifacts and severe loss of high-frequency information. Reconstruction-based methods establish an observation model of the image acquisition process and then achieve super-resolution by solving the inverse problem of that model. They improve on the recovery of details, but their performance degrades as the scale factor increases, and they are time-consuming. Deep-learning-based super-resolution reconstruction recovers high-frequency information by learning an end-to-end mapping between low-resolution and high-resolution image patches and obtains good reconstruction results.
At present, mainstream algorithms usually design very deep network architectures, which require long training times; as the network grows deeper, training becomes more difficult and more training tricks are needed. Meanwhile, the low-resolution input contains abundant low-frequency information that is treated equally across channels, which prevents the convolutional neural network from learning more high-frequency information. In addition, current convolutional neural networks for super-resolution do not fully exploit features at multiple scales, which limits the learning capability of the network. Moreover, models built only from convolutional layers cannot fully exploit the self-similarity inside the image to capture long-range dependencies. It is therefore necessary to address these problems in order to reconstruct high-quality images.
Disclosure of Invention
In view of this, the present invention provides an image super-resolution reconstruction model and method based on a residual hybrid attention network, so as to recover more texture details and improve the image super-resolution reconstruction accuracy.
In order to achieve this purpose, the invention adopts the following technical scheme: the image super-resolution reconstruction model based on the residual mixed attention network comprises a shallow feature extraction module, a deep feature extraction module and a reconstruction module. The shallow feature extraction module is composed of a 3 × 3 convolutional layer and exploits the convolutional layer's strength at feature extraction to extract shallow features from the low-resolution input image:
M_0 = H_{SF}(I_{LR})
where H_{SF}(·) denotes the shallow feature extraction module, I_{LR} denotes the input low-resolution image, and M_0 denotes the shallow feature map;
the deep feature extraction module consists of multiple residual separation mixed attention groups and a 3 × 3 convolution, and extracts high-level features from the shallow features. The process is expressed as follows:
M_i = H^i_{RSHAG}(M_{i-1}), i = 1, 2, …, n
M_{DF} = f_{3×3}(M_n)
where H^i_{RSHAG}(·) denotes the i-th residual separation mixed attention group, n denotes the number of residual separation mixed attention groups, M_{i-1}, M_i and M_n denote the intermediate feature maps of the residual separation mixed attention groups, f_{3×3}(·) denotes a convolution operation with a 3 × 3 kernel, and M_{DF} denotes the deep feature map.
The reconstruction module is composed of a sub-pixel convolutional layer and a 3 × 3 convolution. The sub-pixel convolutional layer upsamples the deep features extracted by the deep feature extraction module, reshaping the information flow into a feature map at the specified upsampling factor. The process is described as follows:
I_{SR} = H_{UP}(M_{DF} + M_0) + Bicubic(I_{LR})
where I_{SR} denotes the high-resolution image after super-resolution reconstruction, H_{UP}(·) denotes the reconstruction module, and Bicubic(·) denotes bicubic interpolation of the low-resolution image to the target resolution.
In a preferred embodiment, the multiple residual separation mixed attention group is composed of a plurality of residual separation mixed attention modules and a 3 × 3 convolutional layer.
In a preferred embodiment, the residual separation mixed attention module comprises two 1 × 1 convolutional layers, a residual triple attention module, an efficient Swin Transformer module, a residual connection, and channel splitting and concatenation operations. The specific calculation formulas are as follows:
X_1, X_2 = H_{SPL}(f_{1×1}(X))
Y_1 = RTAB(X_1)
Y_2 = ESTB(X_2)
Z = f_{1×1}(H_{CAT}(Y_1, Y_2)) + X
where X denotes the input feature of the residual separation mixed attention module, f_{1×1}(·) denotes a 1 × 1 convolutional layer, H_{SPL}(·) denotes the channel splitting operation, X_1 and X_2 denote the split feature maps, RTAB(·) denotes the residual triple attention module and Y_1 its output feature, ESTB(·) denotes the efficient Swin Transformer module and Y_2 its output feature, H_{CAT}(·) denotes the channel concatenation operation, and Z denotes the output feature of the residual separation mixed attention module.
In a preferred embodiment, the residual triple attention module is composed of two 3 × 3 convolutional layers, a ReLU activation function, a residual connection and a triple attention module. The specific formula is as follows:
X_O = F_{TAM}(f_{3×3}(ReLU(f_{3×3}(X_I)))) + X_I
where X_I denotes the input feature of the residual triple attention module, X_O denotes its output feature, F_{TAM}(·) denotes the triple attention module, and ReLU(·) denotes the ReLU activation function; the triple attention module is composed of two cross-dimension interaction modules and a spatial attention module, and is computed as follows:
Y = (X̂_1 + X̂_2 + X̂_3) / 3
where X̂_1 and X̂_2 denote the output features of the two cross-dimension interaction modules, X̂_3 denotes the output feature of the spatial attention module, and Y denotes the output feature of the triple attention module.
In a preferred embodiment, the cross-dimension interaction module comprises a 7 × 7 convolutional layer, a dimension permutation operation, a Z-Pool layer, a channel concatenation operation, a maximum pooling operation and an average pooling operation; the specific calculation formulas are as follows:
X′_1 = H_{PER}(X_1)
Z-Pool(X) = H_{CAT}(H_{MP}(X), H_{AP}(X))
X″_1 = Z-Pool(X′_1)
X̂_1 = H_{PER}(δ(H_{IN}(f_{7×7}(X″_1))) · X′_1)
where X′_1 denotes the result of the dimension permutation operation, X″_1 denotes the result of the Z-Pool layer, H_{PER}(·) denotes the dimension permutation operation, Z-Pool(·) denotes the operation of the Z-Pool layer, H_{CAT}(·) denotes concatenation along a given dimension of the input feature maps, H_{MP}(·) and H_{AP}(·) denote the maximum pooling operation and the average pooling operation along that dimension respectively, f_{7×7}(·) denotes a convolution operation with a 7 × 7 kernel, H_{IN}(·) denotes the instance normalization operation, δ(·) denotes the Sigmoid function, · denotes the channel-wise multiplication operation, and X̂_1 denotes the output feature of the cross-dimension interaction module.
In a preferred embodiment, the spatial attention module comprises a 7 × 7 convolutional layer, a Z-Pool layer and an instance normalization operation; the specific calculation formula is as follows:
X̂_3 = δ(H_{IN}(f_{7×7}(Z-Pool(X_3)))) · X_3
where X_3 and X̂_3 denote the input feature and the output feature of the spatial attention module, respectively.
In a preferred embodiment, the efficient Swin Transformer module consists of two layer normalization operations, a self-attention calculation module based on moving windows, a local feature extraction feed-forward network and residual connections;
the calculation formulas of the efficient Swin Transformer module are as follows:
Q = LN(W_Q X)
K = LN(W_K X)
V = LN(W_V X)
Attention(Q, K, V) = SoftMax(QK^T / √d) V
X′ = SW-MSA(X) + X
X̂ = LeFF(LN(X′)) + X′
where W_Q, W_K and W_V denote the transformation matrices used to compute Q, K and V, LN(·) denotes the layer normalization operation, Q, K and V denote the query, key and value matrices respectively, SoftMax(·) denotes the SoftMax function, SW-MSA(·) denotes the moving-window self-attention module, LeFF(·) denotes the locally enhanced multi-layer perceptron module, X′ denotes the intermediate feature after the moving-window self-attention, X̂ denotes the output feature of the efficient Swin Transformer module, d is the dimension of the K matrix, and Attention(·, ·, ·) denotes the self-attention computation.
The invention also provides an image super-resolution reconstruction method based on the residual mixed attention network, which adopts the image super-resolution reconstruction model based on the residual mixed attention network and comprises the following steps:
Step S1: establishing a training set according to the image degradation model to obtain N low-resolution images I_{LR} and the N corresponding real high-resolution images I_{HR}, where N is an integer greater than 1;
Step S2: inputting the low-resolution images into the shallow feature extraction module to extract shallow features of the images;
Step S3: inputting the shallow features into the deep feature extraction module to extract deep features;
Step S4: inputting the deep features into the reconstruction module, performing sub-pixel convolution to complete the up-sampling processing, and reconstructing the final high-resolution image;
Step S5: optimizing the image super-resolution reconstruction model through a loss function, where the loss function uses the average L1 error between the N reconstructed high-resolution images and the corresponding real high-resolution images, and is expressed as follows:
L_1 = (1/N) Σ_{i=1}^{N} ‖ I^i_{SR} − I^i_{HR} ‖_1
where L_1 denotes the L1 loss function, and I^i_{SR} and I^i_{HR} denote the i-th reconstructed high-resolution image and the corresponding real high-resolution image, respectively.
Compared with the prior art, the invention has the following beneficial effects: the invention combines a convolutional neural network and a Transformer, adopts a channel separation technique, splits the feature map, and feeds the two parts into two branch modules in parallel. Through the residual separation mixed attention module, the local features extracted by the convolutional-neural-network-based triple attention module are fused with the global features extracted by the Transformer-based efficient Swin Transformer module to obtain rich high-frequency and low-frequency information. In this way, images with richer details can be obtained, and super-resolution reconstruction with higher precision is achieved.
Drawings
FIG. 1 is a diagram of a residual separation hybrid attention network architecture in a preferred embodiment of the present invention;
FIG. 2 is a diagram of a residual separation mixed attention group configuration in a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a residual separation hybrid attention module in a preferred embodiment of the present invention;
FIG. 4 is a diagram of a residual triple attention module in a preferred embodiment of the present invention;
FIG. 5 is a block diagram of a high efficiency Swin Transformer module in a preferred embodiment of the present invention;
fig. 6 is a flowchart illustrating an image super-resolution reconstruction method according to a preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in figs. 1 to 6, the present embodiment provides an image super-resolution reconstruction model based on a residual separation mixed attention network. The model includes a shallow feature extraction module, a deep feature extraction module and a reconstruction module. The shallow feature extraction module extracts shallow features from the low-resolution image; the deep feature extraction module, formed by connecting several cascaded residual separation mixed attention groups with a global residual, performs feature extraction and fusion on the shallow features to obtain deep features; the reconstruction module upsamples the deep features with a sub-pixel convolutional layer to obtain an image of higher resolution. The residual separation mixed attention module splits the feature map with a channel separation technique and feeds the two parts into two branch modules in parallel, fusing the local features extracted by the residual triple attention module with the global features extracted by the efficient Swin Transformer module to obtain rich high-frequency and low-frequency information.
Step 1, performing shallow feature extraction on an input image, specifically:
by utilizing the characteristic that the convolutional layer is good at extracting the features, a 3 x 3 convolutional layer is used for extracting the shallow features of the low-resolution input image, and the specific operation is as follows:
M 0 =H SF (I LR )
wherein H SF (. Represents a shallow feature extraction Module, I) LR Representing an input low resolution image, M 0 Representing a shallow feature map;
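A minimal PyTorch-style sketch of this step (the class name and interface are illustrative assumptions; the 180-channel feature width follows the configuration given later in this embodiment):

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """H_SF: a single 3x3 convolution mapping the RGB input to the feature width."""
    def __init__(self, in_channels: int = 3, num_features: int = 180):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_features, kernel_size=3, padding=1)

    def forward(self, i_lr: torch.Tensor) -> torch.Tensor:
        # M_0 = H_SF(I_LR)
        return self.conv(i_lr)
```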
Step 2: perform feature extraction and feature fusion on the shallow features through a deep feature extraction module formed by connecting several cascaded residual separation mixed attention groups with a global residual, obtaining deep features.
Step 2.1: as shown in fig. 2, by inputting the shallow features into the deep feature extraction module formed by connecting n cascaded residual separation mixed attention groups with the global residual, feature extraction and feature fusion can be performed on the shallow features to obtain richer and deeper features. Each cascaded residual separation mixed attention group includes m cascaded residual separation mixed attention modules, where n = 6 and m = 6 in this example. The specific calculation is represented by the following formulas:
M_i = H^i_{RSHAG}(M_{i-1}), i = 1, 2, …, n
M_{DF} = f_{3×3}(M_n)
where H^i_{RSHAG}(·) denotes the i-th residual separation mixed attention group, n denotes the number of residual separation mixed attention groups, M_{i-1}, M_i and M_n denote the intermediate feature maps of the residual separation mixed attention groups, f_{3×3}(·) denotes a convolution operation with a 3 × 3 kernel, and M_{DF} denotes the deep feature map.
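A minimal PyTorch-style sketch of this cascade (class names, the group-level residual connection, and the injected block constructor are illustrative assumptions; a concrete residual separation mixed attention block is sketched after the description of fig. 3 below):

```python
import torch
import torch.nn as nn

class ResidualSeparationHybridAttentionGroup(nn.Module):
    """One group: m cascaded blocks, a 3x3 convolution, and a group-level residual connection."""
    def __init__(self, block_ctor, channels: int = 180, num_blocks: int = 6):
        super().__init__()
        self.blocks = nn.Sequential(*[block_ctor(channels) for _ in range(num_blocks)])
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.blocks(x)) + x

class DeepFeatureExtractor(nn.Module):
    """n cascaded groups followed by a 3x3 convolution: M_DF = f_3x3(M_n)."""
    def __init__(self, block_ctor, channels: int = 180,
                 num_groups: int = 6, blocks_per_group: int = 6):
        super().__init__()
        self.groups = nn.Sequential(
            *[ResidualSeparationHybridAttentionGroup(block_ctor, channels, blocks_per_group)
              for _ in range(num_groups)])
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, m_0: torch.Tensor) -> torch.Tensor:
        return self.conv(self.groups(m_0))
```

With n = 6 groups of m = 6 blocks, this yields the 36 blocks mentioned in the training configuration described later.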
Fig. 3 is a block diagram of a residual separation mixed attention module comprising 1 residual triple attention module, 1 high efficiency Swin Transformer module, two 1 × 1 convolution layers and channel splitting and joining operations, whose calculation is represented by the following formula:
X_1, X_2 = H_{SPL}(f_{1×1}(X))
Y_1 = RTAB(X_1)
Y_2 = ESTB(X_2)
Z = f_{1×1}(H_{CAT}(Y_1, Y_2)) + X
where X denotes the input feature of the residual separation mixed attention module, f_{1×1}(·) denotes a 1 × 1 convolutional layer, H_{SPL}(·) denotes the channel splitting operation, X_1 and X_2 denote the split feature maps, RTAB(·) denotes the residual triple attention module and Y_1 its output feature, ESTB(·) denotes the efficient Swin Transformer module and Y_2 its output feature, H_{CAT}(·) denotes the channel concatenation operation, and Z denotes the output feature of the residual separation mixed attention module.
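A minimal PyTorch-style sketch of this block (the class name and the dependency-injected branch modules are illustrative assumptions; concrete branch sketches follow the descriptions of fig. 4 and fig. 5 below):

```python
import torch
import torch.nn as nn

class ResidualSeparationHybridAttentionBlock(nn.Module):
    """RSHAB: 1x1 conv -> channel split -> parallel RTAB / ESTB branches -> concat -> 1x1 conv -> residual."""
    def __init__(self, rtab: nn.Module, estb: nn.Module, channels: int = 180):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)
        self.rtab = rtab   # local branch (convolutional triple attention)
        self.estb = estb   # global branch (efficient Swin Transformer)
        self.post = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # X_1, X_2 = H_SPL(f_1x1(X))
        x1, x2 = torch.chunk(self.pre(x), chunks=2, dim=1)
        # Z = f_1x1(H_CAT(RTAB(X_1), ESTB(X_2))) + X
        return self.post(torch.cat([self.rtab(x1), self.estb(x2)], dim=1)) + x
```

Passing the two branches in as constructor arguments keeps the sketch self-contained; for example, `ResidualSeparationHybridAttentionBlock(nn.Identity(), nn.Identity())` already runs as a smoke test, with each branch operating on 90 of the 180 channels.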
Fig. 4 is a block diagram of a residual triple attention module containing two 3 × 3 convolutional layers, a ReLU activation function, a triple attention module, an average pooling operation, and a residual connection. The residual triple attention module makes full use of the correlations among different dimensions of the feature map and of the CNN's strength at extracting intrinsic features, so that the network can learn richer image information. Its calculation is represented by the following formula:
X_O = F_{TAM}(f_{3×3}(ReLU(f_{3×3}(X_I)))) + X_I
where X_I denotes the input feature of the residual triple attention module, X_O denotes its output feature, F_{TAM}(·) denotes the triple attention module, and ReLU(·) denotes the ReLU activation function. The triple attention module is composed of two cross-dimension interaction modules and a spatial attention module, and is computed as follows:
Y = (X̂_1 + X̂_2 + X̂_3) / 3
where X̂_1 and X̂_2 denote the output features of the two cross-dimension interaction modules, X̂_3 denotes the output feature of the spatial attention module, and Y denotes the output feature of the triple attention module. The cross-dimension interaction module comprises a 7 × 7 convolutional layer, a dimension permutation operation, a Z-Pool layer, a channel concatenation operation, a maximum pooling operation, and an average pooling operation. The specific calculation formulas are as follows:
X′_1 = H_{PER}(X_1)
Z-Pool(X) = H_{CAT}(H_{MP}(X), H_{AP}(X))
X″_1 = Z-Pool(X′_1)
X̂_1 = H_{PER}(δ(H_{IN}(f_{7×7}(X″_1))) · X′_1)
where X′_1 denotes the result of the dimension permutation operation, X″_1 denotes the result of the Z-Pool layer, H_{PER}(·) denotes the dimension permutation operation, Z-Pool(·) denotes the operation of the Z-Pool layer, H_{CAT}(·) denotes concatenation along a given dimension of the input feature maps, H_{MP}(·) and H_{AP}(·) denote the maximum pooling operation and the average pooling operation along that dimension respectively, f_{7×7}(·) denotes a convolution operation with a 7 × 7 kernel, H_{IN}(·) denotes the instance normalization operation, δ(·) denotes the Sigmoid function, · denotes the channel-wise multiplication operation, and X̂_1 denotes the output feature of the cross-dimension interaction module.
The spatial attention module comprises a 7 × 7 convolutional layer, a Z-Pool layer and an instance normalization operation. The specific calculation formula is as follows:
X̂_3 = δ(H_{IN}(f_{7×7}(Z-Pool(X_3)))) · X_3
where X_3 and X̂_3 denote the input feature and the output feature of the spatial attention module, respectively.
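A minimal PyTorch-style sketch of the local branch (class names, the 90-channel width of the split branch, and the exact permutation order of the two cross-dimension branches are illustrative assumptions):

```python
import torch
import torch.nn as nn

def z_pool(x: torch.Tensor) -> torch.Tensor:
    # Z-Pool: concatenate max pooling and average pooling along dimension 1.
    return torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-Pool -> 7x7 convolution -> instance normalization -> Sigmoid, applied as a gate."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.norm = nn.InstanceNorm2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.norm(self.conv(z_pool(x))))

class TripleAttention(nn.Module):
    """Two cross-dimension interaction branches (built from permutations) plus a spatial branch."""
    def __init__(self):
        super().__init__()
        self.gate_ch = AttentionGate()  # channel-height interaction
        self.gate_cw = AttentionGate()  # channel-width interaction
        self.gate_hw = AttentionGate()  # spatial attention in the original layout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branch 1: permute (N, C, H, W) -> (N, H, C, W), gate, permute back.
        x1 = self.gate_ch(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: permute (N, C, H, W) -> (N, W, H, C), gate, permute back.
        x2 = self.gate_cw(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: spatial attention, no permutation.
        x3 = self.gate_hw(x)
        # Y = (X1 + X2 + X3) / 3
        return (x1 + x2 + x3) / 3.0

class ResidualTripleAttentionBlock(nn.Module):
    """RTAB: conv3x3 -> ReLU -> conv3x3 -> triple attention, plus a residual connection."""
    def __init__(self, channels: int = 90):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.tam = TripleAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # X_O = F_TAM(f_3x3(ReLU(f_3x3(X_I)))) + X_I
        return self.tam(self.body(x)) + x
```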
Fig. 5 is a block diagram of the efficient Swin Transformer module, which includes two layer normalization operations, a moving-window self-attention module, a locally enhanced multi-layer perceptron module, and residual connections. Its calculation is represented by the following formulas:
Q = LN(W_Q X)
K = LN(W_K X)
V = LN(W_V X)
Attention(Q, K, V) = SoftMax(QK^T / √d) V
X′ = SW-MSA(X) + X
X̂ = LeFF(LN(X′)) + X′
where W_Q, W_K and W_V denote the transformation matrices used to compute Q, K and V, d denotes the dimension of the K matrix, Attention(·, ·, ·) denotes the self-attention computation, LN(·) denotes the layer normalization operation, Q, K and V denote the query, key and value matrices, SoftMax(·) denotes the SoftMax function, SW-MSA(·) denotes the moving-window self-attention module, LeFF(·) denotes the locally enhanced multi-layer perceptron module, X′ denotes the intermediate feature after the moving-window self-attention, and X̂ denotes the output feature of the efficient Swin Transformer module.
Step 3: reconstruct the deep feature map, up-sampling the predicted feature map into a high-resolution image. The expression is as follows:
I_{SR} = H_{UP}(M_{DF} + M_0) + Bicubic(I_{LR})
where I_{SR} denotes the high-resolution image after super-resolution reconstruction, H_{UP}(·) denotes the reconstruction module, and Bicubic(·) denotes bicubic interpolation of the low-resolution image to the target resolution.
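A minimal PyTorch-style sketch of the reconstruction step (class and function names, the single-stage ×4 PixelShuffle, and the 180-channel width are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    """H_UP: sub-pixel (PixelShuffle) upsampling followed by a 3x3 convolution back to RGB."""
    def __init__(self, num_features: int = 180, scale: int = 4, out_channels: int = 3):
        super().__init__()
        self.expand = nn.Conv2d(num_features, num_features * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.to_rgb = nn.Conv2d(num_features, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.to_rgb(self.shuffle(self.expand(x)))

def reconstruct(m_df: torch.Tensor, m_0: torch.Tensor, i_lr: torch.Tensor,
                upsampler: Upsampler, scale: int = 4) -> torch.Tensor:
    # I_SR = H_UP(M_DF + M_0) + Bicubic(I_LR)
    bicubic = F.interpolate(i_lr, scale_factor=scale, mode="bicubic", align_corners=False)
    return upsampler(m_df + m_0) + bicubic
```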
The image super-resolution reconstruction method applied to the image super-resolution reconstruction model comprises the following specific contents.
S1: establish a training set according to the image degradation model to obtain N low-resolution images I_{LR} and the N corresponding real high-resolution images I_{HR}, where N is an integer greater than 1;
S2: input the low-resolution images into the shallow feature extraction module to extract shallow features of the images;
S3: input the shallow features into the deep feature extraction module to extract deep features;
S4: input the deep features into the reconstruction module, perform sub-pixel convolution to complete the up-sampling processing, and reconstruct the final high-resolution image;
S5: optimize the image super-resolution reconstruction model through a loss function, where the loss function uses the average L1 error between the N reconstructed high-resolution images and the corresponding real high-resolution images:
L_1 = (1/N) Σ_{i=1}^{N} ‖ I^i_{SR} − I^i_{HR} ‖_1
where L_1 denotes the L1 loss function, and I^i_{SR} and I^i_{HR} denote the i-th reconstructed high-resolution image and the corresponding real high-resolution image, respectively.
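A minimal PyTorch-style sketch of one optimization step under this loss (the function name and the assumption that `model` maps I_LR directly to I_SR are illustrative):

```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  i_lr: torch.Tensor, i_hr: torch.Tensor) -> float:
    """One optimization step using the mean L1 error between reconstructed and real HR images."""
    optimizer.zero_grad()
    i_sr = model(i_lr)                          # steps S2-S4: shallow, deep, reconstruction
    loss = torch.mean(torch.abs(i_sr - i_hr))   # step S5: average L1 error over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```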
In order to better illustrate the effectiveness of the present invention, the examples of the present invention also employ a comparative experiment to compare the reconstruction effect.
Specifically, in the embodiment of the present invention, the 800 high-resolution images of DIV2K are used as the training set, and Set5, Set14, B100, Urban100 and Manga109 are used as the test sets. The original high-resolution images are bicubically down-sampled to obtain the corresponding low-resolution images.
After the training set is constructed, training and testing of the model are performed on the PyTorch framework. Low-resolution images in the training set are cropped, and 48 randomly sampled 64 × 64 image patches are input at each iteration; 500 epochs are trained. Optimization of the network parameters is achieved with the Adam gradient descent method, where the parameters of the Adam optimizer are set to β1 = 0.9, β2 = 0.999 and ε = 10^-8. The learning rate is initially set to 2 × 10^-4 and is halved after the 250th, 400th, 425th, 450th and 475th epochs. The number of RSHABs is set to 36 and the number of channels to 180. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used to evaluate model performance. The performance of the model is tested on the five benchmark datasets Set5, Set14, B100, Urban100 and Manga109; 11 representative image super-resolution reconstruction methods are selected for the comparison experiment and compared with the results of the invention. The experimental results are shown in Table 1, where RSHAN is the method provided by the invention.
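A sketch of this training configuration in PyTorch (the `model` and `train_loader` objects are assumed to be built elsewhere, for example from the module sketches above and a DIV2K patch dataset; only the hyper-parameters reported in the text are taken from the embodiment):

```python
import torch

# `model` and `train_loader` are placeholders assumed to exist:
#   model        - the full reconstruction network (shallow + deep + upsampling modules)
#   train_loader - yields batches of 48 paired 64x64 LR crops and the matching HR crops
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate after epochs 250, 400, 425, 450 and 475 (500 epochs in total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250, 400, 425, 450, 475], gamma=0.5)

for epoch in range(500):
    for i_lr, i_hr in train_loader:
        training_step(model, optimizer, i_lr, i_hr)  # L1 step sketched above
    scheduler.step()
```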
TABLE 1 average PSNR and SSIM value comparison across 5 test sets
[Table 1 values are provided as an image in the original publication.]
In summary, compared with the prior art, the present invention has the following advantages and effects:
(1) The embodiment of the invention adopts the residual separation mixed attention module to combine the local modeling capability of the convolutional-neural-network-based residual triple attention module with the non-local modeling capability of the efficient Swin Transformer module. While keeping the number of parameters similar to that of the best-performing reconstruction models at present, the information relations among different dimensions of the image are effectively utilized, the ability of the super-resolution reconstruction model to extract high-frequency information is remarkably improved, and the captured long-range dependencies make the extracted features richer.
(2) The embodiment of the invention adopts a structure in which cascaded residuals are embedded within a global residual, so that the network can bypass the low-frequency information in the low-resolution input, learn more high-frequency residual information, and acquire rich detail features. A very deep network does not need to be built, and a well-reconstructed high-resolution image can still be obtained.

Claims (8)

1. The image super-resolution reconstruction model based on the residual mixed attention network is characterized by comprising a shallow feature extraction module, a deep feature extraction module and a reconstruction module: the shallow feature extraction module is composed of a 3 × 3 convolutional layer and exploits the convolutional layer's strength at feature extraction to extract shallow features from the low-resolution input image:
M_0 = H_{SF}(I_{LR})
where H_{SF}(·) denotes the shallow feature extraction module, I_{LR} denotes the input low-resolution image, and M_0 denotes the shallow feature map;
the deep feature extraction module consists of multiple residual separation mixed attention groups and a 3 × 3 convolution, and extracts high-level features from the shallow features. The process is expressed as follows:
M_i = H^i_{RSHAG}(M_{i-1}), i = 1, 2, …, n
M_{DF} = f_{3×3}(M_n)
where H^i_{RSHAG}(·) denotes the i-th residual separation mixed attention group, n denotes the number of residual separation mixed attention groups, M_{i-1}, M_i and M_n denote the intermediate feature maps of the residual separation mixed attention groups, f_{3×3}(·) denotes a convolution operation with a 3 × 3 kernel, and M_{DF} denotes the deep feature map;
the reconstruction module is composed of a sub-pixel convolutional layer and a 3 × 3 convolution. The sub-pixel convolutional layer upsamples the deep features extracted by the deep feature extraction module, reshaping the information flow into a feature map at the specified upsampling factor. The process is described as follows:
I_{SR} = H_{UP}(M_{DF} + M_0) + Bicubic(I_{LR})
where I_{SR} denotes the high-resolution image after super-resolution reconstruction, H_{UP}(·) denotes the reconstruction module, and Bicubic(·) denotes bicubic interpolation of the low-resolution image to the target resolution.
2. The residual mixed attention network-based image super-resolution reconstruction model of claim 1, wherein the multiple residual separation mixed attention group is composed of a plurality of residual separation mixed attention modules and a 3 × 3 convolutional layer.
3. The residual mixed attention network-based image super-resolution reconstruction model of claim 2, wherein the residual separation mixed attention module comprises two 1 × 1 convolutional layers, a residual triple attention module, an efficient Swin Transformer module, a residual connection, and channel splitting and concatenation operations, and the specific calculation formulas are as follows:
X_1, X_2 = H_{SPL}(f_{1×1}(X))
Y_1 = RTAB(X_1)
Y_2 = ESTB(X_2)
Z = f_{1×1}(H_{CAT}(Y_1, Y_2)) + X
where X denotes the input feature of the residual separation mixed attention module, f_{1×1}(·) denotes a 1 × 1 convolutional layer, H_{SPL}(·) denotes the channel splitting operation, X_1 and X_2 denote the split feature maps, RTAB(·) denotes the residual triple attention module and Y_1 its output feature, ESTB(·) denotes the efficient Swin Transformer module and Y_2 its output feature, H_{CAT}(·) denotes the channel concatenation operation, and Z denotes the output feature of the residual separation mixed attention module.
4. The residual mixed attention network-based image super-resolution reconstruction model according to claim 3, wherein the residual triple attention module is composed of two 3 × 3 convolutional layers, a ReLU activation function, a residual connection and a triple attention module, and the specific formula is as follows:
X_O = F_{TAM}(f_{3×3}(ReLU(f_{3×3}(X_I)))) + X_I
where X_I denotes the input feature of the residual triple attention module, X_O denotes its output feature, F_{TAM}(·) denotes the triple attention module, and ReLU(·) denotes the ReLU activation function; the triple attention module is composed of two cross-dimension interaction modules and a spatial attention module, and is computed as follows:
Y = (X̂_1 + X̂_2 + X̂_3) / 3
where X̂_1 and X̂_2 denote the output features of the two cross-dimension interaction modules, X̂_3 denotes the output feature of the spatial attention module, and Y denotes the output feature of the triple attention module.
5. The residual hybrid attention network-based image super-resolution reconstruction model of claim 4, wherein the cross-dimension interaction module comprises a 7 x 7 convolutional layer, a dimension permutation operation, a Z-Pool layer, a channel connection operation, a maximum pooling operation and an average pooling operation; the specific calculation formula is as follows:
X′_1 = H_{PER}(X_1)
Z-Pool(X) = H_{CAT}(H_{MP}(X), H_{AP}(X))
X″_1 = Z-Pool(X′_1)
X̂_1 = H_{PER}(δ(H_{IN}(f_{7×7}(X″_1))) · X′_1)
where X′_1 denotes the result of the dimension permutation operation, X″_1 denotes the result of the Z-Pool layer, H_{PER}(·) denotes the dimension permutation operation, Z-Pool(·) denotes the operation of the Z-Pool layer, H_{CAT}(·) denotes concatenation along a given dimension of the input feature maps, H_{MP}(·) and H_{AP}(·) denote the maximum pooling operation and the average pooling operation along that dimension respectively, f_{7×7}(·) denotes a convolution operation with a 7 × 7 kernel, H_{IN}(·) denotes the instance normalization operation, δ(·) denotes the Sigmoid function, · denotes the channel-wise multiplication operation, and X̂_1 denotes the output feature of the cross-dimension interaction module.
6. The residual mixed attention network-based image super-resolution reconstruction model of claim 4, wherein the spatial attention module comprises a 7 × 7 convolutional layer, a Z-Pool layer and an instance normalization operation, and the specific calculation formula is as follows:
X̂_3 = δ(H_{IN}(f_{7×7}(Z-Pool(X_3)))) · X_3
where X_3 and X̂_3 denote the input feature and the output feature of the spatial attention module, respectively.
7. The residual mixed attention network-based image super-resolution reconstruction model of claim 3, wherein the efficient Swin Transformer module is composed of two layer normalization operations, a self-attention calculation module based on moving windows, a local feature extraction feed-forward network and residual connections;
the calculation formulas of the efficient Swin Transformer module are as follows:
Q = LN(W_Q X)
K = LN(W_K X)
V = LN(W_V X)
Attention(Q, K, V) = SoftMax(QK^T / √d) V
X′ = SW-MSA(X) + X
X̂ = LeFF(LN(X′)) + X′
where W_Q, W_K and W_V denote the transformation matrices used to compute Q, K and V, LN(·) denotes the layer normalization operation, Q, K and V denote the query, key and value matrices, SoftMax(·) denotes the SoftMax function, SW-MSA(·) denotes the moving-window self-attention module, LeFF(·) denotes the locally enhanced multi-layer perceptron module, X′ denotes the intermediate feature after the moving-window self-attention, X̂ denotes the output feature of the efficient Swin Transformer module, d is the dimension of the K matrix, and Attention(·, ·, ·) denotes the self-attention computation.
8. The image super-resolution reconstruction method based on the residual mixed attention network is characterized in that the image super-resolution reconstruction model based on the residual mixed attention network of any one of claims 1 to 7 is adopted, and the method comprises the following steps:
Step S1: establishing a training set according to the image degradation model to obtain N low-resolution images I_{LR} and the N corresponding real high-resolution images I_{HR}, where N is an integer greater than 1;
Step S2: inputting the low-resolution images into the shallow feature extraction module to extract shallow features of the images;
Step S3: inputting the shallow features into the deep feature extraction module to extract deep features;
Step S4: inputting the deep features into the reconstruction module, performing sub-pixel convolution to complete the up-sampling processing, and reconstructing the final high-resolution image;
Step S5: optimizing the image super-resolution reconstruction model through a loss function, where the loss function uses the average L1 error between the N reconstructed high-resolution images and the corresponding real high-resolution images, and is expressed as follows:
L_1 = (1/N) Σ_{i=1}^{N} ‖ I^i_{SR} − I^i_{HR} ‖_1
where L_1 denotes the L1 loss function, and I^i_{SR} and I^i_{HR} denote the i-th reconstructed high-resolution image and the corresponding real high-resolution image, respectively.
CN202210940743.1A — priority date 2022-08-06, filing date 2022-08-06 — Image super-resolution reconstruction model and method based on residual mixed attention network — Pending — published as CN115222601A (en)

Priority Applications (1)

Application number: CN202210940743.1A (CN) · Priority date: 2022-08-06 · Filing date: 2022-08-06 · Title: Image super-resolution reconstruction model and method based on residual mixed attention network

Applications Claiming Priority (1)

Application number: CN202210940743.1A (CN) · Priority date: 2022-08-06 · Filing date: 2022-08-06 · Title: Image super-resolution reconstruction model and method based on residual mixed attention network

Publications (1)

Publication number: CN115222601A · Publication date: 2022-10-21

Family

ID=83615969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210940743.1A Pending CN115222601A (en) 2022-08-06 2022-08-06 Image super-resolution reconstruction model and method based on residual mixed attention network

Country Status (1)

Country Link
CN (1) CN115222601A (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761248A (en) * 2022-12-22 2023-03-07 深圳大学 Image processing method, device, equipment and storage medium
CN116071243A (en) * 2023-03-27 2023-05-05 江西师范大学 Infrared image super-resolution reconstruction method based on edge enhancement
CN116385265B (en) * 2023-04-06 2023-10-17 北京交通大学 Training method and device for image super-resolution network
CN116385265A (en) * 2023-04-06 2023-07-04 北京交通大学 Training method and device for image super-resolution network
CN116664397B (en) * 2023-04-19 2023-11-10 太原理工大学 TransSR-Net structured image super-resolution reconstruction method
CN116664397A (en) * 2023-04-19 2023-08-29 太原理工大学 TransSR-Net structured image super-resolution reconstruction method
CN116402692B (en) * 2023-06-07 2023-08-18 江西财经大学 Depth map super-resolution reconstruction method and system based on asymmetric cross attention
CN116402692A (en) * 2023-06-07 2023-07-07 江西财经大学 Depth map super-resolution reconstruction method and system based on asymmetric cross attention
CN116523759B (en) * 2023-07-04 2023-09-05 江西财经大学 Image super-resolution reconstruction method and system based on frequency decomposition and restarting mechanism
CN116523759A (en) * 2023-07-04 2023-08-01 江西财经大学 Image super-resolution reconstruction method and system based on frequency decomposition and restarting mechanism
CN117132472A (en) * 2023-10-08 2023-11-28 兰州理工大学 Forward-backward separable self-attention-based image super-resolution reconstruction method
CN117132472B (en) * 2023-10-08 2024-05-31 兰州理工大学 Forward-backward separable self-attention-based image super-resolution reconstruction method
CN117173025B (en) * 2023-11-01 2024-03-01 华侨大学 Single-frame image super-resolution method and system based on cross-layer mixed attention transducer
CN117173025A (en) * 2023-11-01 2023-12-05 华侨大学 Single-frame image super-resolution method and system based on cross-layer mixed attention transducer
CN117237197A (en) * 2023-11-08 2023-12-15 华侨大学 Image super-resolution method and device based on cross attention mechanism and Swin-transducer
CN117237197B (en) * 2023-11-08 2024-03-01 华侨大学 Image super-resolution method and device based on cross attention mechanism
CN117422614B (en) * 2023-12-19 2024-03-12 华侨大学 Single-frame image super-resolution method and device based on hybrid feature interaction transducer
CN117422614A (en) * 2023-12-19 2024-01-19 华侨大学 Single-frame image super-resolution method and device based on hybrid feature interaction transducer
CN117495680A (en) * 2024-01-02 2024-02-02 华侨大学 Multi-contrast nuclear magnetic resonance image super-resolution method based on feature fusion transducer
CN117495680B (en) * 2024-01-02 2024-05-24 华侨大学 Multi-contrast nuclear magnetic resonance image super-resolution method based on feature fusion transducer
CN117575915A (en) * 2024-01-16 2024-02-20 闽南师范大学 Image super-resolution reconstruction method, terminal equipment and storage medium
CN117934289A (en) * 2024-03-25 2024-04-26 山东师范大学 System and method for integrating MRI super-resolution and synthesis tasks
CN118052717A (en) * 2024-04-15 2024-05-17 北京数慧时空信息技术有限公司 Training method of image superdivision model and image superdivision method
CN118097321A (en) * 2024-04-29 2024-05-28 济南大学 Vehicle image enhancement method and system based on CNN and transducer


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination