CN116385265A - Training method and device for image super-resolution network
- Publication number: CN116385265A (application CN202310360499.6A)
- Authority: CN (China)
- Prior art keywords: feature map, module, resolution, attention, image
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/048: Activation functions
- G06N3/084: Backpropagation, e.g. using gradient descent
Abstract
The invention relates to a training method for an image super-resolution network. The image super-resolution network comprises a shallow feature extraction subnet, a deep feature extraction subnet, a feature fusion subnet and an image reconstruction subnet; the deep feature extraction subnet comprises a plurality of hybrid attention modules, and the feature fusion subnet comprises a plurality of neural window fully-connected conditional random field modules. The training method comprises a training step of inputting a low-resolution image into the image super-resolution network to train the network. The method innovatively introduces the idea of a conditional random field into an image super-resolution model, constraining features through the connections between pixel points and performing feature fusion. The hybrid attention module combines the self-attention of the shifted window with spatial attention and channel attention and exploits their complementary advantages, overcoming the limitation that the shift-window mechanism activates too few input pixels to further improve network performance, while retaining strong local feature characterization capability.
Description
Technical Field
The invention relates to the technical field of image super-resolution, in particular to a training method and device of an image super-resolution network.
Background
Resolution refers to the number of pixels contained in a unit inch of an image or video, and it determines the degree of sharpness that is visually perceived. A high-resolution image not only improves visual appearance but also benefits the analysis of subsequent high-level visual tasks. In real application scenarios, however, only a degraded low-resolution image can usually be obtained, for reasons such as the limitations of device hardware, compression and decompression during transmission, and information loss in digital-to-analog conversion. Super-resolution reconstructs the corresponding high-resolution image from an existing low-resolution image. Because image degradation is unavoidable and people need high-resolution images, super-resolution reconstruction technology has a wide range of applications.
Single-frame image super-resolution is a basic task in the super-resolution field and is widely researched and applied; methods are generally classified as interpolation-based, reconstruction-based and learning-based. Interpolation-based methods, such as bicubic interpolation and Lanczos resampling, are fast and straightforward but limited in accuracy. Reconstruction-based methods typically require complex prior knowledge to restrict the possible solution space; their performance decreases rapidly as the magnification increases, and they take longer. Early learning-based methods focused on machine learning, and in recent years deep-learning-based methods have outperformed the others.
Learning-based SISR is largely divided into convolutional neural network (CNN)-based methods and Transformer-based methods. CNN-based methods generally adopt a linear, sequentially stacked network structure and optimize performance by means of residual learning, dense connections and the like. However, the size of CNN convolution kernels is limited, which is unfavorable for capturing long-range dependencies among features. For this reason, CNN-based super-resolution networks tend to employ very deep and complex structures to recover more detail, but this leads to huge numbers of parameters and excessive consumption of computing resources. Transformer-based methods attract attention because of the strong characterization capability of the self-attention mechanism; they typically cut the image into fixed-size blocks and process the different blocks independently. The Swin Transformer introduced a shifted-window mechanism that better exploits the connections between neighboring blocks, but it still applies local self-attention within a limited spatial range, whereas activating more input pixels is beneficial to improving the performance of Transformers in SR tasks.
Disclosure of Invention
The invention aims to provide a training method and device for an image super-resolution network, and aims to solve the problems of huge network parameters, poor characteristic constraint effect and few activated input pixels in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
in one aspect, a training method for an image super-resolution network is provided, where the image super-resolution network includes a shallow feature extraction subnet, a deep feature extraction subnet, a feature fusion subnet and an image reconstruction subnet, the deep feature extraction subnet includes a plurality of hybrid attention modules, the feature fusion subnet includes a plurality of neural window full-connection conditional random field modules, the training method includes a training step of inputting a low-resolution image into the image super-resolution network to train the image super-resolution network, and a loss calculation step corresponding to the training step until the image super-resolution network converges, and the training step includes:
inputting the low-resolution image into a shallow feature extraction subnet to obtain a shallow feature map;
respectively inputting the shallow feature map into the deep feature extraction subnet and the image reconstruction subnet, wherein the shallow feature map, or the attention feature map F output by the previous mixed attention module, is input into the plurality of mixed attention modules to obtain a plurality of attention feature maps F and a mixed feature map obtained by splicing the plurality of attention feature maps F;
inputting the mixed feature map into a first convolution layer module to obtain a deep feature map X;
inputting the deep feature map X and the plurality of attention feature maps F into a feature fusion sub-network to obtain a fusion feature map;
adding the fusion feature map and the shallow feature map, and then inputting the added fusion feature map and the shallow feature map into an image reconstruction subnet to obtain a super-resolution image;
wherein the mixed attention module is configured to:
the method comprises the steps of obtaining a zero-number feature map and obtaining an attention feature map F, wherein the method specifically comprises the following steps:
inputting the attention feature map F output by the zero number feature map or the upper layer of mixed attention module into the space attention enhancement module to obtain a first feature map;
inputting the first feature map into a SwinTransformerBlock module to obtain a second feature map;
inputting the second feature map into a second convolution layer module to obtain a third feature map;
inputting the third feature map into a space attention enhancing module to obtain a fourth feature map;
and inputting the fourth feature map into a channel attention module to obtain an attention feature map F.
In another aspect, a training system for an image super-resolution network is provided, wherein the training system includes at least one processor; and a memory storing instructions that, when executed by the at least one processor, perform steps in accordance with the foregoing method.
The invention has the beneficial effects that:
1. creatively adding the idea of a conditional random field into an image super-resolution model, and restraining the characteristics through the relation between pixel points and the pixel points to perform characteristic fusion;
2. the mixed attention module combines the self-attention, the spatial attention and the channel attention of the shift window, and utilizes the complementary advantages of the self-attention, the spatial attention and the channel attention of the shift window, so that the limitation of a shift window mechanism on network performance improvement due to fewer input pixels is overcome, the stronger local feature characterization capability is reserved, and meanwhile, the local and non-local feature extraction is considered.
3. The method achieves competitive results with a small number of parameters and excellent performance, and comes closer to real images in restoring repetitive textures in urban buildings, natural textures of animals and landscapes, and text textures. Applied to daily life, it can effectively super-resolve low-resolution images arising from old photographs, transmission and other causes, improving the visual aesthetics of images and the viewing experience.
Drawings
FIG. 1 is a schematic diagram of a training system of the present invention;
FIG. 2 is a schematic diagram of an image super-resolution network according to the present invention;
FIG. 3 is a schematic diagram of a hybrid attention module of the present invention;
FIG. 4 is a schematic diagram of an enhanced spatial attention module of the present invention;
FIG. 5 is a schematic diagram of a SwinTransformerBlock module in the present invention;
FIG. 6 is a schematic diagram of the multi-head self-attention matrix calculation process in the present invention;
FIG. 7 is a schematic diagram of a channel attention module of the present invention;
FIG. 8 is a schematic diagram of a hybrid attention module of the present invention;
FIG. 9 is a schematic diagram of the binary potential calculation flow of the neural window fully-connected conditional random field module in the present invention;
FIG. 10 is a schematic representation of a low resolution image to be superseparated in the present invention;
fig. 11 is a schematic diagram of a super-resolution image in which super-resolution is completed in the present invention.
Detailed Description
The technical scheme of the present invention will be clearly and completely described in the following in conjunction with the accompanying drawings and embodiments of the present invention.
Some embodiments of the invention relate to a training system for an image super-resolution network, as shown in fig. 1, the training system 1 comprising at least one processor 2; and a memory 3 storing instructions for implementing all the steps in the following method embodiments when executed by the at least one processor 2.
In some embodiments of the training method of the image super-resolution network, the image super-resolution network includes a shallow feature extraction subnet, a deep feature extraction subnet, a feature fusion subnet and an image reconstruction subnet, the deep feature extraction subnet includes a plurality of mixed attention modules, the feature fusion subnet includes a plurality of neural window full-connection conditional random field modules, the training method includes a training step of inputting a low-resolution image into the image super-resolution network to train the image super-resolution network, and a loss calculation step corresponding to the training step until the image super-resolution network converges, the method is characterized in that the training step includes:
inputting the low-resolution image into a shallow feature extraction subnet to obtain a shallow feature map;
respectively inputting the shallow feature map into the deep feature extraction subnet and the image reconstruction subnet, wherein the shallow feature map, or the attention feature map F output by the previous mixed attention module, is input into the plurality of mixed attention modules to obtain a plurality of attention feature maps F and a mixed feature map obtained by splicing the plurality of attention feature maps F;
inputting the mixed feature map into a first convolution layer module to obtain a deep feature map X;
inputting the deep feature map X and the plurality of attention feature maps F into a feature fusion sub-network to obtain a fusion feature map;
adding the fusion feature map and the shallow feature map, and then inputting the added fusion feature map and the shallow feature map into an image reconstruction subnet to obtain a super-resolution image;
wherein the mixed attention module is configured to:
the method comprises the steps of obtaining a zero-number feature map and obtaining an attention feature map F, wherein the method specifically comprises the following steps:
inputting the attention feature map F output by the zero number feature map or the upper layer of mixed attention module into the space attention enhancement module to obtain a first feature map;
inputting the first feature map into a SwinTransformerBlock module to obtain a second feature map;
inputting the second feature map into a second convolution layer module to obtain a third feature map;
inputting the third feature map into a space attention enhancing module to obtain a fourth feature map;
and inputting the fourth feature map into a channel attention module to obtain an attention feature map F.
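To make the composition of the mixed (hybrid) attention module concrete, the following is a minimal PyTorch-style sketch of the flow just described (spatial attention enhancement, SwinTransformerBlock, convolution, spatial attention enhancement again, channel attention). The class name and constructor arguments are illustrative assumptions, not the patent's reference implementation.

```python
# Sketch of the hybrid attention module flow: ESA -> STB -> conv -> ESA -> channel attention.
import torch.nn as nn

class HybridAttentionModule(nn.Module):
    def __init__(self, esa1, stb, conv, esa2, ca):
        super().__init__()
        self.esa1, self.stb, self.conv, self.esa2, self.ca = esa1, stb, conv, esa2, ca

    def forward(self, x):          # x: zero-number feature map or previous module's output F
        x = self.esa1(x)           # first feature map
        x = self.stb(x)            # second feature map
        x = self.conv(x)           # third feature map
        x = self.esa2(x)           # fourth feature map
        return self.ca(x)          # attention feature map F
```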
In some training method embodiments of the image super-resolution network, the enhanced spatial attention module is configured to:
acquiring a first initial feature map and obtaining a spatial feature map, wherein the method specifically comprises the following steps:
inputting the first initial feature map into a third convolution layer module to obtain a fifth feature map;
sequentially inputting the fifth characteristic diagram into a fourth convolution layer module with the step length of 2, a maximum pooling module, a convolution group module and an up-sampling module to obtain a sixth characteristic diagram, wherein the convolution group module is formed by connecting three 3X 3 convolution layer modules, and two adjacent convolution layers are connected by using a ReLU activation function;
connecting the fifth feature map and the sixth feature map through a combination residual, and sequentially inputting a sixth convolution layer module and a Sigmoid function module to obtain a seventh feature map;
and obtaining a spatial feature map according to the first initial feature map and the seventh feature map.
In some training method embodiments of the image super-resolution network, the SwinTransformerBlock module is configured to:
obtaining a second initial feature map and obtaining an STB feature map, wherein the method specifically comprises the following steps:
sequentially inputting the second initial feature map into a first-layer standardization module and a window multi-head self-attention module to obtain an eighth feature map, and obtaining a ninth feature map according to the second initial feature map and the eighth feature map;
inputting the ninth feature map into a second-layer standardization module and a first multi-layer perceptron module in sequence to obtain a tenth feature map, and obtaining an eleventh feature map according to the ninth feature map and the tenth feature map;
inputting the eleventh feature map into a third-layer standardization module and a shifted-window multi-head self-attention module in sequence to obtain a twelfth feature map, and obtaining a thirteenth feature map according to the eleventh feature map and the twelfth feature map;
and sequentially inputting the thirteenth feature map into a fourth-layer standardization module and a second multi-layer perceptron module to obtain a fourteenth feature map, and obtaining the STB feature map according to the thirteenth feature map and the fourteenth feature map.
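A minimal PyTorch-style sketch of the SwinTransformerBlock flow described above, assuming the window and shifted-window multi-head self-attention modules and the two multi-layer perceptron modules are supplied from outside; the names are illustrative, not the patent's implementation.

```python
# Four LayerNorm stages interleaved with W-MSA, MLP, SW-MSA and a second MLP,
# each wrapped in a residual connection, as in the feature-map sequence above.
import torch.nn as nn

class SwinTransformerBlock(nn.Module):
    def __init__(self, dim, w_msa, sw_msa, mlp1, mlp2):
        super().__init__()
        self.n1, self.n2, self.n3, self.n4 = (nn.LayerNorm(dim) for _ in range(4))
        self.w_msa, self.sw_msa, self.mlp1, self.mlp2 = w_msa, sw_msa, mlp1, mlp2

    def forward(self, x):                       # x: (b, tokens, dim)
        x = x + self.w_msa(self.n1(x))          # eighth / ninth feature maps
        x = x + self.mlp1(self.n2(x))           # tenth / eleventh feature maps
        x = x + self.sw_msa(self.n3(x))         # twelfth / thirteenth feature maps
        x = x + self.mlp2(self.n4(x))           # fourteenth feature map / STB output
        return x
```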
In some training method embodiments of the image super-resolution network, the channel attention module is configured to:
obtaining a third initial feature map and obtaining a channel feature map, wherein the method specifically comprises the following steps:
sequentially inputting the third initial feature map into a global average pooling module, a full connection layer module, a ReLU activation function module, a full connection layer module and a Sigmoid activation function module to obtain a fifteenth feature map;
and obtaining a channel characteristic diagram according to the third initial characteristic diagram and the fifteenth characteristic diagram.
In some embodiments of the training method of the image super-resolution network, inputting the deep feature map X and the plurality of attention feature maps F into a feature fusion sub-network to obtain a fused feature map includes:
inputting the plurality of attention feature maps F into a plurality of neural window full-connection conditional random field modules correspondingly in sequence to obtain a plurality of NFC feature maps;
splicing the NFC feature images and the deep feature images X into a whole and inputting the whole into a seventh convolution layer module to obtain a convolution feature image;
inputting the convolution feature map into an eighth convolution layer module to obtain a fusion feature map;
the neural window fully-connected conditional random field module is configured to:
obtaining a Q matrix and a K matrix according to the attention characteristic diagram F;
and obtaining a V matrix according to the deep feature map X or the NFC feature map output by the previous neural window full-connection conditional random field module.
In some embodiments of the training method of the image super-resolution network, adding the fusion feature map and the shallow feature map, and then inputting the added fusion feature map and the shallow feature map into the image reconstruction subnet to obtain the super-resolution image includes:
adding the fusion feature map and the shallow feature map to obtain a sixteenth feature map;
and sequentially inputting the sixteenth characteristic diagram into a ninth convolution layer module and a sub-pixel convolution layer module to obtain a super-resolution image.
In some embodiments of the training method of the image super-resolution network, inputting the low-resolution image into the shallow feature extraction subnet to obtain the shallow feature map includes:
acquiring a low-resolution image and a corresponding high-resolution true value image;
and inputting the low-resolution image into a tenth convolution layer module to obtain a shallow characteristic map.
In some embodiments of the training method of the image super-resolution network, acquiring the low-resolution image and the corresponding high-resolution true value image includes:
randomly intercepting images from the original low-resolution images to obtain a plurality of low-resolution images to be processed;
obtaining a plurality of corresponding high-resolution images according to the original high-resolution images and the plurality of low-resolution images to be processed;
and carrying out random horizontal overturn, random vertical overturn and random 90-degree rotation on the plurality of low-resolution images to be processed and the plurality of high-resolution images through probability to obtain a plurality of low-resolution images and a plurality of high-resolution truth images.
In some embodiments of the training method of the image super-resolution network, the loss calculation step includes:
calculating the mean absolute error loss between the super-resolution image I_SR and the truth image I_HR: L = ||I_SR - I_HR||_1;
Back propagation is carried out according to the result of the loss L, and the super-resolution network weight of the image is updated;
repeating the steps to enable the loss L to approach zero until the image super-resolution network converges.
In some embodiments of the training method of the image super-resolution network, the method comprises the following steps:
step 1: the method comprises the steps of randomly intercepting images with the size of 64 multiplied by 64 from original low-resolution images for training, preprocessing the images according to the probability p by randomly horizontally overturning, randomly vertically overturning and randomly rotating by 90 degrees to obtain low-resolution images, training b images in the same batch, and performing the same operation on corresponding high-resolution images to obtain corresponding high-resolution true value images.
Step 2: and sequentially inputting the low-resolution images into a shallow feature extraction subnet and a deep feature extraction subnet to extract features.
(1) And inputting the low-resolution image into a shallow feature extraction subnet to perform feature extraction to obtain a shallow feature map.
A low resolution image of size (b×3×64×64) is input to a 3×3 convolution layer, and a shallow feature map of size (b×c×64×64) is output.
(2) Input the shallow feature map into the deep feature extraction subnet and perform feature extraction with N_1 mixed attention modules in sequence, as shown in fig. 3.
i. The shallow feature map of size (b×c×64×64) is input to the enhanced spatial attention module, as shown in fig. 4.
Specifically, the number of channels is first reduced to c/4 by a 1×1 convolution layer module, and a convolution layer module with stride 2 together with a max pooling module enlarges the receptive field and reduces the spatial dimension. Features are then extracted by a convolution group module consisting of three 3×3 convolution layer modules, where adjacent convolution layers are connected by a ReLU activation function. The spatial dimension is recovered by an interpolation upsampling module, a residual connection is combined in, and the number of channels is restored to c by a 1×1 convolution layer module. Finally, a feature matrix is generated by a Sigmoid function module and multiplied with the original input features to obtain a new feature map, whose size is still (b×c×64×64).
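The following is a minimal PyTorch-style sketch of the enhanced spatial attention block just described; the pooling size and the interpolation mode are assumptions where the text does not fix them, and the names are illustrative.

```python
# ESA sketch: 1x1 channel reduction, strided conv + max pooling, a 3-conv group,
# interpolation upsampling, residual combination, 1x1 channel recovery, Sigmoid gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedSpatialAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        r = c // 4
        self.reduce = nn.Conv2d(c, r, 1)
        self.stride_conv = nn.Conv2d(r, r, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=7, stride=3)       # assumed pooling size
        self.group = nn.Sequential(
            nn.Conv2d(r, r, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(r, r, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(r, r, 3, padding=1),
        )
        self.expand = nn.Conv2d(r, c, 1)

    def forward(self, x):
        f = self.reduce(x)                                      # (b, c/4, H, W)
        g = self.group(self.pool(self.stride_conv(f)))          # enlarge receptive field
        g = F.interpolate(g, size=f.shape[-2:], mode='bilinear',
                          align_corners=False)                  # restore spatial size
        mask = torch.sigmoid(self.expand(g + f))                # residual + 1x1 + Sigmoid
        return x * mask                                         # gate the original features
```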
ii. The feature map obtained in the previous step is input into a SwinTransformerBlock module (STB), as shown in fig. 5. The STB consists of windowed multi-head self-attention modules (MSA) with a shift mechanism and multi-layer perceptron modules (MLP); a layer normalization module (LN) lies between the MSA and the MLP, and residual connections are used.
Specifically, the feature map of size (b×c×64×64) is first converted into a token-like form (b×4096×c) by patch embedding and then input into the LN. The feature map is then divided into non-overlapping local windows of size M×M, within which self-attention is calculated; the reshaped feature size is (b·HW/M^2)×M^2×c, where HW/M^2 is the total number of windows. Compared with window multi-head self-attention (W-MSA), shifted-window multi-head self-attention (SW-MSA) cyclically shifts the windows up and to the left by ⌊M/2⌋ pixels.
For a local window feature X_w ∈ R^(M^2×c), multiplying by the three transformation matrices W_Q, W_K, W_V yields the Query, Key and Value matrices, abbreviated Q, K, V ∈ R^(M^2×d). The parameters of the three transformation matrices W_Q, W_K, W_V are obtained through training and are shared across windows. From Q, K, V the self-attention matrix is calculated as shown in equation (1):
Attn(Q, K, V) = SoftMax(QK^T/√d + P)V (1)
wherein P represents a learnable relative position code; d represents the dimension of Query and Key; T denotes matrix transposition; Attn(Q, K, V) denotes the self-attention matrix; SoftMax() denotes the normalized exponential function.
As shown in FIG. 6, multi-head self-attention MultiHead(Q, K, V) divides the computed Q, K, V equally into h parts {Q_i, K_i, V_i}, computes the corresponding self-attention matrices, concatenates them, and fuses them with a fusion matrix W_O, as shown in formula (2); the resulting feature dimension is M^2×c.
MultiHead(Q, K, V) = Concat(Attn_1, ..., Attn_h) W_O (2)
wherein MultiHead(Q, K, V) denotes the multi-head self-attention matrix; Concat(Attn_1, ..., Attn_h) denotes the matrix obtained by splicing the individual self-attention matrices; W_O denotes a learnable parameter matrix.
After the obtained features are reshaped back to (b×4096×c), the residual connection, the multi-layer perceptron module and the remaining operations are applied, and finally the feature size is restored to (b×c×64×64) by patch unembedding.
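A minimal PyTorch-style sketch of the windowed multi-head self-attention of equations (1) and (2); the relative position code P is modelled here as a plain learnable tensor, the √d scaling follows equation (1), and the class and argument names plus the input layout are illustrative assumptions.

```python
# Windowed multi-head self-attention: Attn = SoftMax(QK^T/sqrt(d) + P)V per head,
# then Concat(Attn_1..Attn_h) W_O as in formula (2).
import torch
import torch.nn as nn

class WindowMultiHeadAttention(nn.Module):
    def __init__(self, dim, window_size, heads):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)        # W_Q, W_K, W_V, shared across windows
        self.proj = nn.Linear(dim, dim)           # fusion matrix W_O
        self.pos = nn.Parameter(torch.zeros(heads, window_size**2, window_size**2))  # P

    def forward(self, x):                         # x: (num_windows*b, M*M, dim)
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.d**0.5 + self.pos   # QK^T/sqrt(d) + P
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)           # Concat(Attn_1..Attn_h)
        return self.proj(out)                                       # ... W_O
```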
iii. The feature map of size (b×c×64×64) is passed through a 3×3 convolution layer, which enhances the translational equivariance of the SwinTransformerBlock module; step i is then repeated to further focus on the region of interest.
iv. The feature map of size (b×c×64×64) is passed through the channel attention module, as shown in fig. 7.
Specifically, the feature map is first compressed to size (b×c×1×1) by a global average pooling module. The number of channels is then reduced to c/16 by the first fully-connected layer module and restored to the original number by the second fully-connected layer module. Finally, after a Sigmoid activation function module, the weight of each channel is multiplied with the corresponding channel of the original feature map, yielding the final attention feature map F of size (b×c×64×64).
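A minimal PyTorch-style sketch of the channel attention step above (global average pooling, two fully-connected layers with a 16x channel reduction, Sigmoid channel gate); the names are illustrative assumptions.

```python
# Channel attention: squeeze to (b, c), two FC layers with reduction, Sigmoid reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (b, c, H, W)
        w = x.mean(dim=(-2, -1))                     # global average pooling -> (b, c)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # per-channel weights -> (b, c, 1, 1)
        return x * w                                 # reweight each channel
```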
v. Steps i, ii, iii and iv are repeated, with the shallow feature map replaced by the attention feature map F output by the previous mixed attention module, for a total of N_1 times. The attention feature maps F obtained each time are spliced into a whole to obtain a mixed feature map of size (b×cN_1×64×64); after a 1×1 convolution layer module, a deep feature map X of size (b×c×64×64) is obtained.
Step 3: and inputting the deep feature map X and the attention feature maps F into a feature fusion sub-network to perform feature fusion.
i. The attention feature maps F output by the mixed attention modules of all layers are fused by N_2 neural window fully-connected conditional random field (NewFC-CRF) modules.
Specifically, the overall structure of the NewFC-CRF resembles a SwinTransformerBlock module, but the Q, K, V matrices of the self-attention part are computed from different feature maps and combined through the binary potential, as shown in fig. 9: the Q and K matrices are computed from the attention feature map F output by the corresponding mixed attention module, while the V matrix is replaced by the output X of the previous module of the network. The binary potential matrix expression is given in formula (3):
ψ_p = SoftMax(QK^T + P)X (3)
wherein ψ_p denotes the binary potential matrix of the neural window fully-connected conditional random field.
a. Calculating Q, K, V matrix and obtaining NFC characteristic diagram through the neural window full-connection conditional random field module, wherein the method comprises the following steps:
calculating Q, K matrix using the attention profile F;
replacing the V matrix with the deep feature map X;
b. Repeat step a, using the NFC feature map output by the previous neural window fully-connected conditional random field module to replace the V matrix, for a total of N_2 times.
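A minimal PyTorch-style sketch of the NewFC-CRF binary potential of formula (3), with Q and K taken from the attention feature map F and the value features taken from the deep feature map X (or the previous module's NFC output). Window partitioning and multi-head splitting are omitted for brevity; the names are illustrative assumptions.

```python
# psi_p = SoftMax(QK^T + P) X, with Q, K from F and the "value" role played by X.
import torch
import torch.nn as nn

class NeWCRFBinaryPotential(nn.Module):
    def __init__(self, dim, tokens):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.pos = nn.Parameter(torch.zeros(tokens, tokens))   # learnable position code P

    def forward(self, f, x):        # f: (b, n, dim) attention features, x: (b, n, dim) value features
        q, k = self.to_q(f), self.to_k(f)
        attn = torch.softmax(q @ k.transpose(-2, -1) + self.pos, dim=-1)  # SoftMax(QK^T + P)
        return attn @ x             # binary potential output used as the NFC feature map
```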
ii. The deep feature map X obtained in step 2 and the N_2 NFC feature maps obtained by the NewFC-CRFs are spliced into a whole of size (b×c(1+N_2)×64×64); after a 1×1 convolution layer module, a convolution feature map of size (b×c×64×64) is obtained. Finally, the fused feature map after feature fusion is obtained through a 3×3 convolution layer module.
iii. The fused feature map is added to the shallow feature map obtained in step 2 (1) to obtain a feature map with a final size of (b×c×64×64).
Step 4: and inputting the final feature map obtained by adding the fusion feature map and the shallow feature map into an image reconstruction subnet to obtain a super-resolution image, and completing forward propagation.
The feature map of (b×c×64×64) size is input to the 1×1 convolution layer, and the output size is still (b×c×64×64).
The feature map of size (b×c×64×64) is up-sampled to (b×3×64s×64s) with a sub-pixel convolution layer, where the scale factor s = 2, 3 or 4.
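A minimal PyTorch-style sketch of the image reconstruction subnet (a convolution followed by a sub-pixel / PixelShuffle layer); mapping the c channels down to 3·s² before the shuffle is an assumption that is consistent with the worked example later in the description (c = 48, s = 4), and the function name is illustrative.

```python
# Reconstruction subnet sketch: 1x1 conv to 3*s^2 channels, then sub-pixel rearrangement.
import torch.nn as nn

def make_reconstruction_subnet(c, scale):
    return nn.Sequential(
        nn.Conv2d(c, 3 * scale * scale, 1),   # 1x1 conv to 3*s^2 channels
        nn.PixelShuffle(scale),               # sub-pixel convolution: output (b, 3, 64s, 64s)
    )
```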
Step 5: calculating each super-resolution image I SR And truth image I HR Average absolute error loss l= ||i of (a) SR -I HR || 1 。
Step 6: and (5) carrying out back propagation according to the loss result obtained in the step (5), updating the super-resolution network weight of the image, and repeating the steps until the loss L approaches zero until the network converges.
In some embodiments of the training method of the image super-resolution network, the network for image super-resolution comprises the following steps:
the low resolution image (c×h×w) is input to the trained network, and the size multiple of super resolution×2, ×3 or×4 is selected.
A super-resolution image with a length-width enlarged to a corresponding multiple (c×hs×ws) will be obtained.
In some embodiments of the training method of the image super-resolution network, the method comprises the following steps:
step 1, randomly intercepting an image with the size of 64 multiplied by 64 from an original low-resolution image for training, preprocessing the image by randomly horizontally overturning, randomly vertically overturning and randomly rotating by 90 degrees according to the probability of p=0.5, training 16 images in the same batch, and performing the same operation on the corresponding high-resolution image to obtain a corresponding true value image.
And 2, sequentially inputting the low-resolution image into a shallow feature extraction subnet and a deep feature extraction subnet to extract features.
(1) Shallow features are extracted by using a 3×3 convolution layer, an image with a size of (16×3×64×64) is input to the convolution layer, and a shallow feature map with a size of (16×48×64×64) is output.
(2) The shallow feature map is input into a deep feature extraction subnet, and feature extraction is carried out by using 4 mixed attention modules in sequence.
i. Shallow feature maps of size (16 x 48 x 64) are input to the enhanced spatial attention module.
First, the number of channels is reduced to 12 by a 1×1 convolution layer module; features are then extracted by a convolution layer module with stride 2 and a max pooling layer module followed by the convolution group; the spatial size is then recovered by an interpolation upsampling module, a residual connection is combined in, and the number of channels is restored to 48 by a 1×1 convolution layer module; finally, a feature matrix is generated by a Sigmoid function module and multiplied with the original input features to obtain new features, whose size is still (16×48×64×64).
The feature map of size (16×48×64×64) is input into a SwinTransformerBlock module (STB).
The feature map of size (16×48×64×64) is converted into a token-like form (16×4096×48) by patch embedding and then input into a layer normalization module. The obtained feature map is divided into non-overlapping local windows of size 8×8, giving features of size (16·64×64×48), and the multi-head self-attention MultiHead(Q, K, V) of the local windows is calculated; the flow is shown in fig. 6, with head number h = 6 and Q, K, V matrix dimension d = 48. After the obtained features are reshaped back to (16×4096×48), the residual connection, the multi-layer perceptron module and the remaining operations are applied, and finally the feature size is restored to (16×48×64×64) by patch unembedding.
iii. The feature map of size (16×48×64×64) is passed through a 3×3 convolution layer module, which enhances the translational equivariance of the SwinTransformerBlock module; step i is then repeated to further focus on the region of interest.
Feature maps of (16 x 48 x 64) size are passed through the channel attention module.
The feature map is first compressed to size (16×48×1×1) by a global average pooling module; the number of channels is then reduced to 3 by the first fully-connected layer module and restored to the original 48 channels by the second fully-connected layer module. Finally, through a Sigmoid activation function module, the computed weight of each channel is multiplied with the corresponding channel of the original feature map to obtain the output attention feature map F of size (16×48×64×64).
And v, repeating the step i, ii, iii, iv, and replacing the shallow layer characteristic diagram with the attention characteristic diagram F output by the upper layer mixed attention module to finish 4 times. The obtained attention feature maps F are spliced into a whole to obtain a mixed feature map (16×192×64×64), and after passing through a 1×1 convolution layer module, a deep feature map X with the size of (16×48×64×64) is obtained.
Step 3: and inputting the deep feature map X and the attention feature maps F into a feature fusion sub-network to perform feature fusion.
i. And fusing the attention characteristic graphs F output by the mixed attention modules of all layers by using a neural window full-connection conditional random field module (NewFC-CRF).
The Q, K and X matrices have dimension d = 48;
a. calculating Q, K, V matrix and obtaining NFC characteristic diagram through the neural window full-connection conditional random field module, wherein the method comprises the following steps:
calculating Q, K matrix using the attention profile F;
replacing the V matrix with the deep feature map X;
b. Repeat step a, using the NFC feature map output by the previous neural window fully-connected conditional random field module to replace the V matrix, for a total of 3 times; the head numbers h used decrease progressively, being 12, 8 and 4 respectively.
ii. The deep feature map X obtained in step 2 and the NFC feature maps obtained by the 3 NewFC-CRFs are spliced into a whole of size (16×192×64×64); after a 1×1 convolution layer, a convolution feature map of size (16×48×64×64) is obtained. Finally, the fused feature map after feature fusion is obtained through a 3×3 convolution layer.
iii. The fused feature map after feature fusion is added to the shallow feature map obtained in step 2 (1) to obtain the final feature map.
Step 4: and inputting the final characteristics into an image reconstruction module to obtain a super-resolution image, and completing forward propagation.
1) The feature map of (16×48×64×64) size is input to the 1×1 convolutional layer, and the output size is still (16×48×64×64).
2) The feature map of (16×48×64×64) size is up-sampled to (16×3×256×256) with a sub-pixel convolution layer, where the scale factor s=4.
Step 5: calculating each super-resolution image I SR And truth image I HR Average absolute error loss l= ||i of (a) SR -I HR || 1 。
Step 6: and (5) carrying out back propagation according to the loss result obtained in the step (5), updating the super-resolution network weight of the image, and repeating the steps until the loss is close to zero until the network converges. After training all images (1 epoch) each time, using the network model obtained by the last iteration as verification, and storing the network model with the best verification effect as a test. The learning rate was initialized to 5X 10-4, halved for every 200 epochs, and fixed after 1000 epochs.
Then, the network for image super resolution includes the steps of:
the low resolution image (3×259×194) to be super-divided is input into a trained image super-resolution network as shown in fig. 10, and a size multiple×4 of the super-resolution is selected.
A super-resolution image (3×1036×776) after 4 times will be obtained as shown in fig. 11.
The embodiments and functional operations of the subject matter described in this specification can be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware, including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of the foregoing. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on one or more tangible, non-transitory program carriers, for execution by, or to control the operation of, data processing apparatus.
Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of the foregoing.
The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or multiple computers. The device may comprise a dedicated logic circuit, for example an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus may include, in addition to hardware, code that creates an execution environment for the relevant computer program, such as code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software application, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. The computer program may, but need not, correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in the following: in a markup language document; in a single file dedicated to the relevant program; or in a plurality of coordinated files, for example files that store one or more modules, subroutines, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
A computer suitable for carrying out the computer program comprises and can be based on a general purpose microprocessor or a special purpose microprocessor or both, or any other kind of central processing unit, as examples. Typically, the central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, the computer does not have to have such a device. In addition, the computer may be embedded in another apparatus, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a removable storage device, such as a Universal Serial Bus (USB) flash drive, etc.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including by way of example: semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks, for example, internal hard disks or removable disks; magneto-optical disk; CD-ROM and DVD-ROM discs. The processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
To send interactions with a user, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device, for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to a user; as well as a keyboard and a pointing device, such as a mouse or trackball, by which a user may send input to a computer. Other kinds of devices may also be used to send interactions with the user; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic input, speech input, or tactile input. In addition, the computer may interact with the user by sending the document to a device used by the user and receiving the document from the device; for example, by sending a web page to a web browser on a user's client device in response to a received request from the web browser.
Claims (10)
1. The training method of the image super-resolution network comprises a shallow feature extraction sub-network, a deep feature extraction sub-network, a feature fusion sub-network and an image reconstruction sub-network, wherein the deep feature extraction sub-network comprises a plurality of mixed attention modules, the feature fusion sub-network comprises a plurality of neural window full-connection conditional random field modules, the training method comprises a training step of inputting a low-resolution image into the image super-resolution network to train the image super-resolution network, and a loss calculation step corresponding to the training step until the image super-resolution network converges, and the training step comprises the following steps of:
inputting the low-resolution image into a shallow feature extraction subnet to obtain a shallow feature map;
respectively inputting the shallow feature map into the deep feature extraction subnet and the image reconstruction subnet, wherein the shallow feature map, or the attention feature map F output by the previous mixed attention module, is input into the plurality of mixed attention modules to obtain a plurality of attention feature maps F and a mixed feature map obtained by splicing the plurality of attention feature maps F;
inputting the mixed feature map into a first convolution layer module to obtain a deep feature map X;
inputting the deep feature map X and the plurality of attention feature maps F into a feature fusion sub-network to obtain a fusion feature map;
adding the fusion feature map and the shallow feature map, and then inputting the added fusion feature map and the shallow feature map into an image reconstruction subnet to obtain a super-resolution image;
wherein the mixed attention module is configured to:
the method comprises the steps of obtaining a zero-number feature map and obtaining an attention feature map F, wherein the method specifically comprises the following steps:
inputting the attention feature map F output by the zero number feature map or the upper layer of mixed attention module into the space attention enhancement module to obtain a first feature map;
inputting the first feature map into a Swin Transformer Block module to obtain a second feature map;
inputting the second feature map into a second convolution layer module to obtain a third feature map;
inputting the third feature map into a space attention enhancing module to obtain a fourth feature map;
and inputting the fourth feature map into a channel attention module to obtain an attention feature map F.
2. The method of claim 1, wherein the spatial attention enhancement module is configured to:
acquiring a first initial feature map and obtaining a spatial feature map, wherein the method specifically comprises the following steps:
inputting the first initial feature map into a third convolution layer module to obtain a fifth feature map;
sequentially inputting the fifth characteristic diagram into a fourth convolution layer module with the step length of 2, a maximum pooling module, a convolution group module and an up-sampling module to obtain a sixth characteristic diagram, wherein the convolution group module is formed by connecting three 3X 3 convolution layer modules, and two adjacent convolution layers are connected by using a ReLU activation function;
connecting the fifth feature map and the sixth feature map through a combination residual, and sequentially inputting a sixth convolution layer module and a Sigmoid function module to obtain a seventh feature map;
and obtaining a spatial feature map according to the first initial feature map and the seventh feature map.
3. The method of claim 1, wherein the Swin Transformer Block module is configured to:
obtaining a second initial feature map and obtaining an STB feature map, wherein the method specifically comprises the following steps:
sequentially inputting the second initial feature map into a first-layer standardization module and a window multi-head self-attention module to obtain an eighth feature map, and obtaining a ninth feature map according to the second initial feature map and the eighth feature map;
inputting the ninth feature map into a second-layer standardization module and a first multi-layer perceptron module in sequence to obtain a tenth feature map, and obtaining an eleventh feature map according to the ninth feature map and the tenth feature map;
inputting the eleventh feature map into a third-layer standardization module and a shifted-window multi-head self-attention module in sequence to obtain a twelfth feature map, and obtaining a thirteenth feature map according to the eleventh feature map and the twelfth feature map;
and sequentially inputting the thirteenth feature map into a fourth-layer standardization module and a second multi-layer perceptron module to obtain a fourteenth feature map, and obtaining the STB feature map according to the thirteenth feature map and the fourteenth feature map.
4. The method of claim 1, wherein the channel attention module is configured to:
obtaining a third initial feature map and obtaining a channel feature map, wherein the method specifically comprises the following steps:
sequentially inputting the third initial feature map into a global average pooling module, a full connection layer module, a ReLU activation function module, a full connection layer module and a Sigmoid activation function module to obtain a fifteenth feature map;
and obtaining a channel characteristic diagram according to the third initial characteristic diagram and the fifteenth characteristic diagram.
5. The training method of an image super-resolution network according to claim 1, wherein inputting the deep feature map X and the plurality of attention feature maps F into a feature fusion sub-network to obtain a fused feature map comprises:
inputting the plurality of attention feature maps F into a plurality of neural window full-connection conditional random field modules correspondingly in sequence to obtain a plurality of NFC feature maps;
splicing the NFC feature images and the deep feature images X into a whole and inputting the whole into a seventh convolution layer module to obtain a convolution feature image;
inputting the convolution feature map into an eighth convolution layer module to obtain a fusion feature map;
the neural window fully-connected conditional random field module is configured to:
obtaining a Q matrix and a K matrix according to the attention characteristic diagram F;
and obtaining a V matrix according to the deep feature map X or the NFC feature map output by the previous neural window full-connection conditional random field module.
6. The training method of an image super-resolution network according to claim 1, wherein adding the fusion feature map and the shallow feature map and then inputting the added fusion feature map and the shallow feature map into an image reconstruction subnet to obtain a super-resolution image comprises:
adding the fusion feature map and the shallow feature map to obtain a sixteenth feature map;
and sequentially inputting the sixteenth characteristic diagram into a ninth convolution layer module and a sub-pixel convolution layer module to obtain a super-resolution image.
7. The training method of an image super-resolution network according to claim 1, wherein the inputting the low-resolution image into the shallow feature extraction subnet to obtain the shallow feature map comprises:
acquiring a low-resolution image and a corresponding high-resolution true value image;
and inputting the low-resolution image into a tenth convolution layer module to obtain a shallow characteristic map.
8. The training method of an image super-resolution network according to claim 7, wherein the acquiring a low-resolution image and its corresponding high-resolution ground-truth image comprises:
randomly cropping patches from the original low-resolution images to obtain a plurality of low-resolution images to be processed;
obtaining a plurality of corresponding high-resolution images according to the original high-resolution images and the plurality of low-resolution images to be processed;
and applying random horizontal flipping, random vertical flipping and random 90-degree rotation, each with a given probability, to the plurality of low-resolution images to be processed and the plurality of high-resolution images, to obtain a plurality of low-resolution images and a plurality of high-resolution ground-truth images.
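A minimal sketch of the paired preprocessing of claim 8: a random crop taken from the low-resolution image together with the matching high-resolution region, followed by random horizontal flip, vertical flip and 90-degree rotation applied identically to both with a given probability. Patch size, scale factor and probability are illustrative assumptions.

```python
import random
import torch

def paired_augment(lr, hr, patch=48, scale=4, p=0.5):
    """Random crop plus random flips/rotation applied identically to an LR/HR pair.
    lr, hr are (C, H, W) tensors with hr exactly `scale` times larger than lr."""
    _, h, w = lr.shape
    x0, y0 = random.randint(0, w - patch), random.randint(0, h - patch)
    lr = lr[:, y0:y0 + patch, x0:x0 + patch]                                       # LR crop
    hr = hr[:, y0 * scale:(y0 + patch) * scale, x0 * scale:(x0 + patch) * scale]   # matching HR crop
    if random.random() < p:                                # random horizontal flip
        lr, hr = torch.flip(lr, dims=[2]), torch.flip(hr, dims=[2])
    if random.random() < p:                                # random vertical flip
        lr, hr = torch.flip(lr, dims=[1]), torch.flip(hr, dims=[1])
    if random.random() < p:                                # random 90-degree rotation
        lr, hr = torch.rot90(lr, 1, dims=[1, 2]), torch.rot90(hr, 1, dims=[1, 2])
    return lr, hr
```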
9. The training method of an image super-resolution network according to claim 8, wherein the loss calculation step comprises:
calculating the mean absolute error loss L = ||I_SR - I_HR||_1 between the super-resolution image I_SR and the high-resolution ground-truth image I_HR;
performing back-propagation according to the loss L and updating the weights of the image super-resolution network;
repeating the above steps so that the loss L approaches zero, until the image super-resolution network converges.
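A minimal sketch of the training step of claim 9: the mean absolute error (L1) loss between the super-resolution output I_SR and the ground truth I_HR is back-propagated and the weights are updated, repeated until convergence. The optimizer, learning rate and epoch count are assumptions; the claim itself specifies only the L1 loss and the update loop.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(net, loader, epochs=100, lr=2e-4, device="cuda"):
    """L1-loss training loop: forward pass, mean absolute error against the ground
    truth, back-propagation and weight update, repeated until convergence."""
    net = net.to(device)
    optimizer = optim.Adam(net.parameters(), lr=lr)        # optimizer choice is an assumption
    criterion = nn.L1Loss()                                 # L = ||I_SR - I_HR||_1
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            sr_img = net(lr_img)                            # super-resolution image I_SR
            loss = criterion(sr_img, hr_img)                # mean absolute error vs. I_HR
            optimizer.zero_grad()
            loss.backward()                                 # back-propagation
            optimizer.step()                                # update the network weights
```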
10. A training system for an image super-resolution network, the training system comprising at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310360499.6A CN116385265B (en) | 2023-04-06 | 2023-04-06 | Training method and device for image super-resolution network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116385265A (en) | 2023-07-04 |
CN116385265B (en) | 2023-10-17 |
Family
ID=86965220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310360499.6A Active CN116385265B (en) | 2023-04-06 | 2023-04-06 | Training method and device for image super-resolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116385265B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111192200A (en) * | 2020-01-02 | 2020-05-22 | 南京邮电大学 | Image super-resolution reconstruction method based on fusion attention mechanism residual error network |
CN113409191A (en) * | 2021-06-02 | 2021-09-17 | 广东工业大学 | Lightweight image super-resolution method and system based on attention feedback mechanism |
CN115222601A (en) * | 2022-08-06 | 2022-10-21 | 福州大学 | Image super-resolution reconstruction model and method based on residual mixed attention network |
CN115496658A (en) * | 2022-09-25 | 2022-12-20 | 桂林理工大学 | Lightweight image super-resolution reconstruction method based on double attention mechanism |
Non-Patent Citations (2)
Title |
---|
Yao Qinjuan et al.: "Single image super-resolution reconstruction based on dual-channel CNN", Journal of East China University of Science and Technology (Natural Science Edition), vol. 45, no. 5, pages 801-808 *
Lei Pengcheng et al.: "Hierarchical feature fusion attention network for image super-resolution reconstruction", Journal of Image and Graphics, vol. 25, no. 9, pages 1773-1786 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117742546A (en) * | 2023-12-29 | 2024-03-22 | 广东福临门世家智能家居有限公司 | Smart home control method and system based on floating window |
Also Published As
Publication number | Publication date |
---|---|
CN116385265B (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | MICU: Image super-resolution via multi-level information compensation and U-net | |
Fan et al. | Balanced two-stage residual networks for image super-resolution | |
CN112308200B (en) | Searching method and device for neural network | |
CN113096017B (en) | Image super-resolution reconstruction method based on depth coordinate attention network model | |
CN115222601A (en) | Image super-resolution reconstruction model and method based on residual mixed attention network | |
Xie et al. | Deep coordinate attention network for single image super‐resolution | |
CN111105352A (en) | Super-resolution image reconstruction method, system, computer device and storage medium | |
CN112862690B (en) | Transformers-based low-resolution image super-resolution method and system | |
CN113657388A (en) | Image semantic segmentation method fusing image super-resolution reconstruction | |
CN116343052B (en) | Attention and multiscale-based dual-temporal remote sensing image change detection network | |
CN112446835B (en) | Image restoration method, image restoration network training method, device and storage medium | |
CN114972024B (en) | Image super-resolution reconstruction device and method based on graph representation learning | |
Zhang et al. | Dense haze removal based on dynamic collaborative inference learning for remote sensing images | |
CN113379606B (en) | Face super-resolution method based on pre-training generation model | |
CN117522694A (en) | Diffusion model-based image super-resolution reconstruction method and system | |
Wang et al. | Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions | |
CN116385265B (en) | Training method and device for image super-resolution network | |
CN116957921A (en) | Image rendering method, device, equipment and storage medium | |
Dai et al. | CFGN: A lightweight context feature guided network for image super-resolution | |
CN115713462A (en) | Super-resolution model training method, image recognition method, device and equipment | |
CN115526777A (en) | Blind over-separation network establishing method, blind over-separation method and storage medium | |
CN114862679A (en) | Single-image super-resolution reconstruction method based on residual error generation countermeasure network | |
CN113902631A (en) | Image processing method, electronic device, and storage medium | |
CN115760670B (en) | Unsupervised hyperspectral fusion method and device based on network implicit priori | |
CN118134763A (en) | Super-resolution image reconstruction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||