CN117710215A - Binocular image super-resolution method based on epipolar window attention - Google Patents

Binocular image super-resolution method based on epipolar window attention

Info

Publication number: CN117710215A
Application number: CN202410029544.4A
Authority: CN (China)
Prior art keywords: resolution, features, image, attention, feature
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117710215B
Inventors: 张红英, 李雪, 黄孝茹
Assignee (original and current): Southwest University of Science and Technology
Application filed 2024-01-09 by Southwest University of Science and Technology
Priority: CN202410029544.4A
Publication of CN117710215A: 2024-03-15
Publication of CN117710215B (grant): 2024-06-04

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The invention provides a binocular image super-resolution method based on epipolar window attention. First, to balance global image perception with attention to local information, a hybrid feature extractor is designed, consisting of a shifted-window multi-head self-attention module and a feature distillation and enhancement module; this combination effectively captures the more discriminative high-frequency features of the image. Then, to handle the offset of complementary pixels between the two views, an epipolar window attention is designed that partitions windows along the epipolar direction to promote the matching of shifted pixels; even in smooth regions, neighboring pixels within a window serve as references, enabling more accurate pixel matching. Finally, a residual epipolar window attention mechanism is designed to re-fuse the fused parallax features, while the super-resolution binocular images are reconstructed through sub-pixel convolution. The invention achieves excellent super-resolution performance with a small parameter count, recovers more discriminative image features, and shows good robustness.

Description

Binocular image super-resolution method based on epipolar window attention
Technical Field
The invention relates to image processing technology, and in particular to a binocular image super-resolution method based on epipolar window attention.
Background
Deep-learning-based binocular image super-resolution reconstruction is an important research direction in computer vision; its core idea is to infer high-resolution image information from the two views of a stereo pair. Improving the quality of low-resolution binocular images has long been a concern in the field: early binocular super-resolution methods used a binocular camera or similar device to acquire two low-resolution images and then matched and fused them to obtain high-resolution images for the two viewpoints. The role of binocular super-resolution reconstruction in 3D vision is drawing increasing attention across disciplines, and in the super-resolution track of the New Trends in Image Restoration and Enhancement (NTIRE) challenge the binocular super-resolution task has shown excellent results, demonstrating that the cross-view correspondence contained in a binocular image pair can be exploited to improve super-resolution reconstruction performance.
However, low-resolution binocular images suffer from missing high-frequency information and large parallax variation, which makes fusing binocular correspondence information in super-resolution reconstruction very challenging. Specifically, the task faces the following problems. First, recovering high-frequency characteristics: binocular super-resolution reconstruction must recover a high-resolution depth map from the low-resolution left and right views, which involves restoring image depth information such as object edges and details; a major challenge is therefore how to effectively restore and enhance the high-frequency characteristics of the image. Second, merging left- and right-view information: binocular super-resolution reconstruction requires merging information from both views to increase the accuracy of depth estimation, which calls for an efficient information fusion strategy that integrates the two views into a better depth map. Third, large model size: binocular super-resolution reconstruction usually relies on a deep neural network to learn the low-to-high-resolution mapping, and because high-resolution depth images have complex structures, capturing these features demands substantial computational resources during training and inference.
Addressing these problems, NAFSSR recently used NAFNet to extract multi-scale image features, PASSRnet and iPASSR fuse view information well via parallax attention, and SwinFSSR aligns complementary view information with residual cross-attention. Although these methods represent substantial progress, they do not balance long-range feature dependencies with attention to local information during high-frequency feature recovery, and when the parallax attention mechanism fuses complementary information it overlooks the vertical offset of complementary pixels along the epipolar line. Designing an efficient and lightweight binocular super-resolution reconstruction model is therefore of significant research interest.
Disclosure of Invention
The invention aims to solve the problems of high-frequency feature recovery and binocular complementary-information fusion in binocular image super-resolution reconstruction, and provides a binocular image super-resolution method based on epipolar window attention.
To achieve the above object, the present invention provides a binocular image super-resolution method based on epipolar window attention, consisting of seven parts: the first part preprocesses the binocular image data set; the second part extracts shallow features from the left and right low-resolution images; the third part extracts deep features from the left and right images after shallow feature extraction; the fourth part aligns the left and right feature maps after deep feature extraction using the epipolar window attention; the fifth part iterates the third and fourth parts and re-fuses the outputs of all fourth-part stages; the sixth part reconstructs the super-resolution left and right images; the seventh part trains and tests the epipolar-window-attention-based binocular super-resolution network model, finally obtaining the reconstructed high-resolution left and right views. Specifically:
the first part comprises two steps:
Step 1: download the public binocular data sets Flickr1024 and Middlebury, select 860 binocular image pairs as the high-resolution samples of the training set, and downsample the high-resolution pairs by bicubic interpolation to obtain the corresponding low-resolution training samples;
Step 2: crop the high- and low-resolution samples into one-to-one corresponding image blocks, the low-resolution patches being 30×90 and the 4× high-resolution patches 120×360, and apply rotation, translation, occlusion and channel-shuffle operations to the cropped patches to augment the training set and avoid overfitting, forming the final training samples;
the second part comprises a step of:
Step 3: pass the low-resolution training samples of step 2 through a weight-shared 3×3 convolution layer, mapping each image from the low-dimensional RGB space to a 94-channel high-dimensional space and preliminarily obtaining shallow features L1 and R1 of the left and right images;
the third part comprises a step of:
Step 4: with the shallow features L1 and R1 of the left and right images obtained in step 3 as input, use the residual Swin Transformer feature distillation and enhancement module RSTFB to extract rich semantic features of the images, obtaining deep features L2.1 and R2.1;
the fourth part comprises a step of:
Step 5: with the deep features L2.1 and R2.1 obtained in step 4 as input, perform left-right feature alignment using the epipolar window attention EWA to obtain aligned features L3.1 and R3.1;
the fifth part comprises two steps:
Step 6: feed the output of step 5 back as the input of step 4 and iterate steps 4 and 5 in sequence 7 times, obtaining intermediate features L2.n and R2.n from step 4 and intermediate features L3.n and R3.n from step 5, where n = 2, 3, 4, 5, 6, 7, 8;
Step 7: re-fuse the binocular features L3.1 and R3.1 aligned in step 5 with the intermediate features L3.n and R3.n produced during step 6 using the residual epipolar window attention REWA, obtaining fused feature maps L4 and R4 and enhancing the expressive power of the parallax features;
the sixth part includes a step of:
Step 8: map the fused feature maps L4 and R4 from step 7 to RGB space with a sub-pixel convolution layer, obtaining feature maps L5 and R5;
Step 9: upsample the input low-resolution left and right views by bicubic interpolation to obtain upsampled feature maps L6 and R6, add L6 to L5 from step 8 by matrix addition as a residual operation to reconstruct the high-resolution left view, and likewise add R6 to R5 to reconstruct the high-resolution right view;
the seventh part comprises two steps:
Step 10: feed the training samples of step 2 into the network of steps 3 to 9 and set the network hyperparameters: learning rate 2e-4, 60 epochs, batch size 8, Adam optimizer, MSE loss; train the network to obtain the final binocular image super-resolution pre-trained model;
Step 11: input the public test sets and real low-resolution binocular images into the pre-trained model obtained in step 10; the network simultaneously reconstructs the super-resolution binocular images.
The invention provides a binocular image super-resolution method based on epipolar window attention. First, for feature extraction, to balance global image perception with attention to local information, a hybrid feature extractor RSTFB is designed to extract multi-scale network features; the RSTFB consists of multi-head self-attention over shifted and non-shifted windows together with a feature distillation and enhancement module FDEB, a combination that effectively captures the more discriminative high-frequency features of the image. Then, for the offset of complementary pixels between the binocular views, an epipolar window attention mechanism EWA is designed: the EWA partitions windows along the epipolar direction to improve the matching of shifted pixels, and even in smooth regions neighboring pixels within a window serve as references for more accurate matching. Finally, a residual epipolar window attention mechanism REWA is designed to fuse all the parallax features produced by the EWA stages, while the super-resolution binocular images are reconstructed through sub-pixel convolution. The invention uses the hybrid feature extractor RSTFB, which combines the strengths of the Transformer and the convolutional neural network, to extract rich semantic features of the binocular images and establish long-range dependencies; by partitioning windows along the epipolar direction for view alignment, the network focuses more on the high-frequency features of the image, achieves excellent super-resolution performance with few parameters, recovers more discriminative image features, and shows good robustness.
Drawings
FIG. 1 is the overall network framework of the present invention;
FIG. 2 shows the hybrid feature extractor of the present invention;
FIG. 3 shows the epipolar window attention EWA of the present invention;
FIG. 4 shows the residual epipolar window attention REWA of the present invention;
FIG. 5 is a low-resolution binocular image;
FIG. 6 is the super-resolution binocular image obtained by processing FIG. 5 with the present invention.
Detailed Description
For a better understanding of the present invention, the binocular image super-resolution reconstruction method based on epipolar window attention is described in more detail below with reference to specific embodiments. In the following description, details of the prior art that might obscure the subject matter of the invention are omitted.
Step 1, downloading binocular public data sets Flickr1024 and Middlebury, selecting 860 groups of binocular image pairs as high-resolution image samples in a training set, and then performing bicubic interpolation downsampling on the binocular high-resolution image pairs to obtain low-resolution image samples in the training set;
Step 2: crop the high- and low-resolution samples into one-to-one corresponding image blocks, the low-resolution patches being 30×90 and the 4× high-resolution patches 120×360, and apply rotation, translation, occlusion and channel-shuffle operations to the cropped patches to augment the training set and avoid overfitting, forming the final training samples;
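For illustration, steps 1 and 2 can be sketched as follows: a minimal Pillow/NumPy sketch of the bicubic downsampling and one aligned random crop, with the rotation, translation, occlusion and channel-shuffle augmentations omitted; the function name make_training_pair is illustrative, not the patent's code.

```python
import numpy as np
from PIL import Image

def make_training_pair(hr_img: Image.Image, scale: int = 4, rng=None):
    """Bicubic 4x downsample, then one aligned LR/HR patch pair (30x90 / 120x360)."""
    rng = rng or np.random.default_rng()
    lr = hr_img.resize((hr_img.width // scale, hr_img.height // scale), Image.BICUBIC)
    ph, pw = 30, 90                                    # LR patch: height 30, width 90
    x = int(rng.integers(0, lr.width - pw + 1))        # random crop position
    y = int(rng.integers(0, lr.height - ph + 1))
    lr_patch = lr.crop((x, y, x + pw, y + ph))         # 30 x 90
    hr_patch = hr_img.crop((x * scale, y * scale,
                            (x + pw) * scale, (y + ph) * scale))  # 120 x 360
    return lr_patch, hr_patch
```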
FIG. 1 shows the overall network framework of the binocular image super-resolution reconstruction method based on epipolar window attention; in this embodiment, the method proceeds according to the following steps:
Step 3: pass the low-resolution training samples of step 2 through a weight-shared 3×3 convolution layer, mapping each image from the low-dimensional RGB space to a 94-channel high-dimensional space and preliminarily obtaining shallow features L1 and R1 of the left and right images;
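In framework terms, the weight sharing of step 3 amounts to applying a single convolution module to both views; a minimal PyTorch sketch (tensor sizes follow the 30×90 patches of step 2; the variable names are illustrative):

```python
import torch
import torch.nn as nn

left_lr = torch.rand(1, 3, 30, 90)    # dummy low-resolution left view (B, RGB, H, W)
right_lr = torch.rand(1, 3, 30, 90)   # dummy low-resolution right view

shallow = nn.Conv2d(3, 94, kernel_size=3, padding=1)  # RGB -> 94-channel space
L1 = shallow(left_lr)                 # reusing one module on both views is
R1 = shallow(right_lr)                # exactly what "shared weights" means here
```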
Step 4: with the shallow features L1 and R1 of the left and right images obtained in step 3 as input, extract rich semantic features of the images with the residual Swin Transformer feature distillation and enhancement module RSTFB to obtain deep features L2.1 and R2.1, implemented as follows:
Step 4.1: as shown in FIG. 2(a), the hybrid feature extractor RSTFB first extracts deep features from the shallow features L1 and R1 through 6 Swin Transformer FDEB (STFL) layers; an overlapping cross-attention module then establishes long-range dependencies among the image features; finally, a 3×3 convolution layer aggregates the features of different depths, and a residual connection is introduced to improve the expressive power of the network;
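The wiring of step 4.1 can be summarized in a short sketch; the STFL and overlapping cross-attention internals are passed in as factories, since only their arrangement is described at this point, and the class and argument names are illustrative:

```python
import torch.nn as nn

class RSTFB(nn.Module):
    """6 STFL layers -> overlapping cross-attention -> 3x3 conv, with a block residual."""
    def __init__(self, channels, make_stfl, make_oca):
        super().__init__()
        self.stfls = nn.Sequential(*[make_stfl() for _ in range(6)])
        self.oca = make_oca()                      # overlapping cross-attention module
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.conv(self.oca(self.stfls(x)))    # aggregate features of different depths
        return x + y                               # residual connection
```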
Step 4.2: to address the Transformer's lack of attention to local features, a feature distillation and enhancement module FDEB is introduced into the Swin Transformer layer to form the STFL, as shown in FIG. 2(b);
Step 4.3: as shown in FIG. 2(c), the FDEB first uses a 1×1 convolution layer and a GELU to expand the feature channels from C to 2C, giving the enhanced feature E1; a 3×3 convolution layer and a GELU then perform feature distillation on E1, reducing the channels back to C to give the distilled feature D1, while a 1×1 convolution layer and a GELU re-enhance E1 to give the enhanced feature E2; in the same way, distilled feature D2 and enhanced feature E3 are obtained from E2, and distilled feature D3 and enhanced feature E4 from E3; finally, a 1×1 convolution layer aggregates the distilled features D1, D2, D3 and the enhanced feature E4, spatial and channel attention make the network focus on important image information along both the spatial and channel dimensions, a residual connection spans the whole operation, and more discriminative image features are output;
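A sketch of the FDEB of step 4.3: the channel widths (1×1 conv + GELU enhancing C to 2C, 3×3 conv + GELU distilling back to C) and the concat-then-1×1 aggregation follow the text, while the exact forms and order of the spatial and channel attention are assumptions (a squeeze-and-excitation-style channel attention and a single-conv spatial attention are used here):

```python
import torch
import torch.nn as nn

class FDEB(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        enh = lambda cin: nn.Sequential(nn.Conv2d(cin, 2 * c, 1), nn.GELU())
        dis = lambda: nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.GELU())
        self.enh1, self.enh2 = enh(c), enh(2 * c)
        self.enh3, self.enh4 = enh(2 * c), enh(2 * c)
        self.dis1, self.dis2, self.dis3 = dis(), dis(), dis()
        self.agg = nn.Conv2d(5 * c, c, 1)      # concat(D1, D2, D3, E4): 3C + 2C -> C
        self.ca = nn.Sequential(               # channel attention (assumed SE-style)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // 4, 1), nn.GELU(),
            nn.Conv2d(c // 4, c, 1), nn.Sigmoid())
        self.sa = nn.Sequential(               # spatial attention (assumed form)
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        e1 = self.enh1(x)
        d1, e2 = self.dis1(e1), self.enh2(e1)
        d2, e3 = self.dis2(e2), self.enh3(e2)
        d3, e4 = self.dis3(e3), self.enh4(e3)
        y = self.agg(torch.cat([d1, d2, d3, e4], dim=1))
        y = y * self.ca(y)                     # channel dimension
        pooled = torch.cat([y.mean(1, keepdim=True), y.amax(1, keepdim=True)], dim=1)
        y = y * self.sa(pooled)                # spatial dimension
        return x + y                           # residual over the whole block
```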
Step 5: with the deep features L2.1 and R2.1 obtained in step 4 as input, perform left-right feature alignment using the epipolar window attention EWA to obtain aligned features L3.1 and R3.1, implemented as follows:
Step 5.1: as shown in FIG. 3, the epipolar window attention EWA divides each of the input deep features L2.1 and R2.1, of size H×W×C, into 6 windows along the W direction, each window of size H×W0×C with W0 = W/6, and aligns the left and right window features X-L2 and X-R2 of the same feature region with the cross-view attention module CAM;
Step 5.2: as shown in FIG. 3(a), so that the CAM can adaptively re-weight the feature channels and effectively exploit the complementary information between views, a channel attention convolution CAC is introduced into the parallax attention; as shown in FIG. 3(b), the CAC places a GELU between two 3×3 convolution layers, the first expanding the channels from C to 2C and the second reducing them from 2C back to C; finally, all CAM-aligned left and right window features Y-L2 and Y-R2 are concatenated along the W0 dimension to obtain the aligned binocular features L3.1 and R3.1;
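A sketch of steps 5.1 and 5.2 under stated assumptions: the split along W into 6 windows (W0 = W/6, W assumed divisible by 6) and the CAC shape (3×3 conv C to 2C, GELU, 3×3 conv 2C to C) follow the text; the CAM is written as a generic row-wise, parallax-style cross-view attention with the CAC as its query/key projection, and the per-window residual add is illustrative, since the exact CAM projections are not spelled out here:

```python
import torch
import torch.nn as nn

class CAC(nn.Module):
    """Channel attention convolution: 3x3 conv C->2C, GELU, 3x3 conv 2C->C."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, 2 * c, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(2 * c, c, 3, padding=1))

    def forward(self, x):
        return self.body(x)

class EWA(nn.Module):
    """Epipolar window attention: split along W, cross-view attention per window."""
    def __init__(self, c, n_windows=6):
        super().__init__()
        self.n = n_windows
        self.q, self.k = CAC(c), CAC(c)

    def cam(self, a, b):
        # Each pixel of view `a` attends over the same-row pixels of view `b`
        # inside the window (image rows play the role of epipolar lines).
        q = self.q(a).permute(0, 2, 3, 1)                 # B, H, w0, C
        k = self.k(b).permute(0, 2, 1, 3)                 # B, H, C, w0
        v = b.permute(0, 2, 3, 1)                         # B, H, w0, C
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).permute(0, 3, 1, 2)             # back to B, C, H, w0

    def forward(self, left, right):
        lw, rw = left.chunk(self.n, 3), right.chunk(self.n, 3)
        l_out = torch.cat([l + self.cam(l, r) for l, r in zip(lw, rw)], dim=3)
        r_out = torch.cat([r + self.cam(r, l) for l, r in zip(lw, rw)], dim=3)
        return l_out, r_out
```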
Step 6: feed the output of step 5 back as the input of step 4 and iterate steps 4 and 5 in sequence 7 times, obtaining intermediate features L2.n and R2.n from step 4 and intermediate features L3.n and R3.n from step 5, where n = 2, 3, 4, 5, 6, 7, 8;
Step 7: re-fuse the binocular features L3.1 and R3.1 aligned in step 5 with the intermediate features L3.n and R3.n produced during step 6 using the residual epipolar window attention REWA, obtaining fused feature maps L4 and R4 and enhancing the expressive power of the parallax features; as shown in FIG. 4, the REWA aggregates the aligned features of different depths by matrix addition of the outputs of all EWA stages, L4 being obtained by the REWA from L3.1 and L3.n, and R4 from R3.1 and R3.n;
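Per the text, the REWA aggregation is element-wise (matrix) addition of all EWA outputs; a short sketch with dummy tensors (any further processing inside the REWA shown in FIG. 4 is not reproduced here):

```python
import torch

# stand-ins for [L3.1, ..., L3.8] and [R3.1, ..., R3.8]
left_aligned = [torch.rand(1, 94, 30, 90) for _ in range(8)]
right_aligned = [torch.rand(1, 94, 30, 90) for _ in range(8)]

L4 = torch.stack(left_aligned).sum(dim=0)    # matrix addition across depths
R4 = torch.stack(right_aligned).sum(dim=0)
```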
Step 8: map the fused feature maps L4 and R4 from step 7 to RGB space with a sub-pixel convolution layer, obtaining feature maps L5 and R5;
Step 9: upsample the input low-resolution left and right views by bicubic interpolation to obtain upsampled feature maps L6 and R6, add L6 to L5 from step 8 by matrix addition as a residual operation to reconstruct the high-resolution left view, and likewise add R6 to R5 to reconstruct the high-resolution right view;
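Steps 8 and 9 form a standard sub-pixel reconstruction head with a bicubic global skip; in this sketch the 94-channel width and the ×4 scale come from the text, while the conv-then-PixelShuffle layout is the usual realization of a sub-pixel convolution layer and is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionHead(nn.Module):
    def __init__(self, c=94, scale=4):
        super().__init__()
        self.expand = nn.Conv2d(c, 3 * scale ** 2, 3, padding=1)  # to RGB x scale^2
        self.shuffle = nn.PixelShuffle(scale)                     # sub-pixel convolution
        self.scale = scale

    def forward(self, feat, lr_view):
        sr = self.shuffle(self.expand(feat))                      # L5 / R5
        up = F.interpolate(lr_view, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)   # L6 / R6
        return sr + up                                            # residual reconstruction
```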
Step 10: feed the training samples of step 2 into the network of steps 3 to 9 and set the network hyperparameters: learning rate 2e-4, 60 epochs, batch size 8, Adam optimizer, MSE loss; train the network to obtain the final binocular image super-resolution pre-trained model;
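The hyperparameters of step 10 translate directly into a training loop; in this sketch, `model` (the network of steps 3 to 9) and `loader` (yielding LR/HR stereo pairs in batches of 8) are assumed to already exist:

```python
import torch

# `model` and `loader` are assumed to be defined elsewhere (see steps 2-9)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # Adam, learning rate 2e-4
criterion = torch.nn.MSELoss()                             # MSE loss

for epoch in range(60):                                    # epochs = 60
    for lr_l, lr_r, hr_l, hr_r in loader:                  # batch size 8
        sr_l, sr_r = model(lr_l, lr_r)                     # both views at once
        loss = criterion(sr_l, hr_l) + criterion(sr_r, hr_r)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```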
Step 11: input the public test sets and real low-resolution binocular images into the pre-trained model obtained in step 10; the network simultaneously reconstructs the super-resolution binocular images.
The invention provides a binocular image super-resolution method based on epipolar window attention, starting from the problems of recovering image detail features and of pixel fluctuation about the epipolar line in binocular images. First, for feature extraction the method adopts a hybrid feature extractor composed of shifted- and non-shifted-window multi-head attention and the FDEB, successfully balancing global image perception with attention to local features. In addition, to handle the offset of complementary pixel information along the epipolar line in the binocular correspondence, the method introduces the epipolar window attention mechanism EWA, which partitions windows along the epipolar direction to enable efficient matching of shifted pixels. The network iterates the hybrid feature extractor RSTFB and the feature-fusion EWA several times to generate high-frequency image features; finally, the REWA fuses all the complementary parallax attention maps, while the reconstruction module produces the binocular super-resolution images. The algorithm is lightweight, delivers excellent super-resolution performance, and suits real degraded low-resolution binocular images.
While the foregoing describes illustrative embodiments of the present invention, the invention is not limited to the scope of those embodiments; various changes apparent to those skilled in the art, made within the spirit and scope of the invention as defined by the appended claims, are all intended to fall within its protection.

Claims (5)

1. A binocular image super-resolution method based on epipolar window attention, characterized in that a hybrid feature extractor is designed to extract more discriminative image features and an epipolar window attention is designed to fuse the complementary features of the binocular views, the method comprising seven parts: data set preprocessing, shallow feature extraction, deep feature extraction, feature fusion, iterated feature extraction and fusion, super-resolution reconstruction, and network model training and testing:
the first part comprises two steps:
Step 1: prepare a binocular data set and downsample the high-resolution binocular samples by bicubic interpolation to obtain the low-resolution training samples;
Step 2: crop the high- and low-resolution images, the low-resolution patches being 30×90 and the 4× high-resolution patches 120×360, and apply rotation, translation, occlusion and channel-shuffle operations to the cropped patches to form the final training samples;
the second part comprises a step of:
Step 3: map the low-resolution training samples of step 2 through a weight-shared 3×3 convolution layer into a 94-channel high-dimensional space, preliminarily obtaining shallow features L1 and R1 of the left and right images;
the third part comprises a step of:
Step 4: with the shallow features L1 and R1 of the left and right images obtained in step 3 as input, extract rich semantic features of the image with the residual Swin Transformer feature distillation and enhancement module RSTFB to obtain deep features L2.1 and R2.1, implemented as follows:
(1) The shallow features L1 and R1 first pass through 6 Swin Transformer FDEB (STFL) layers to extract deep image features; an overlapping cross-attention module then establishes long-range dependencies among the image features; finally, a 3×3 convolution layer aggregates the features of different depths, and a residual connection is introduced to improve the expressive power of the network;
(2) To address the Transformer's lack of attention to local features, a feature distillation and enhancement module FDEB is introduced into the Swin Transformer layer to form the STFL;
(3) In the FDEB, a 1×1 convolution layer and a GELU first expand the feature channels from C to 2C, giving the enhanced feature E1; a 3×3 convolution layer and a GELU perform feature distillation on E1, reducing the channels back to C to give the distilled feature D1, while a 1×1 convolution layer and a GELU re-enhance E1 to give the enhanced feature E2; in the same way, distilled feature D2 and enhanced feature E3 are obtained from E2, and distilled feature D3 and enhanced feature E4 from E3; finally, a 1×1 convolution layer aggregates the distilled features D1, D2, D3 and the enhanced feature E4, spatial and channel attention make the network focus on important image information along both the spatial and channel dimensions, a residual connection spans the whole operation, and more discriminative image features are output;
the fourth part comprises a step of:
Step 5: with the deep features L2.1 and R2.1 obtained in step 4 as input, perform left-right feature alignment using the epipolar window attention EWA to obtain aligned features L3.1 and R3.1, implemented as follows:
(1) The EWA divides each of the input deep features L2.1 and R2.1, of size H×W×C, into 6 windows along the W direction, each window of size H×W0×C with W0 = W/6, and aligns the left and right window features X-L2 and X-R2 of the same feature region with the cross-view attention module CAM;
(2) So that the network can adaptively re-weight the feature channels and the CAM can effectively exploit the complementary information between views, a channel attention convolution CAC is introduced into the parallax attention; the CAC places a GELU between two 3×3 convolution layers, the first expanding the channels from C to 2C and the second reducing them from 2C back to C; finally, all CAM-aligned left and right window features Y-L2 and Y-R2 are concatenated along the W0 dimension to obtain the aligned binocular features L3.1 and R3.1;
the fifth part comprises two steps:
Step 6: feed the output of step 5 back as the input of step 4 and iterate steps 4 and 5 in sequence 7 times, obtaining intermediate features L2.n and R2.n from step 4 and intermediate features L3.n and R3.n from step 5, where n = 2, 3, 4, 5, 6, 7, 8;
Step 7: re-fuse the binocular features L3.1 and R3.1 aligned in step 5 with the intermediate features L3.n and R3.n produced during step 6 using the residual epipolar window attention REWA, obtaining fused feature maps L4 and R4 and enhancing the expressive power of the parallax features; the REWA aggregates the aligned features of different depths by matrix addition of the outputs of all EWA stages, L4 being obtained by the REWA from L3.1 and L3.n, and R4 from R3.1 and R3.n;
the sixth part includes a step of:
Step 8: map the fused feature maps L4 and R4 from step 7 to RGB space with a sub-pixel convolution layer, obtaining feature maps L5 and R5;
Step 9: upsample the input low-resolution left and right views by bicubic interpolation to obtain upsampled feature maps L6 and R6, add L6 to L5 from step 8 by matrix addition as a residual operation to reconstruct the high-resolution left view, and likewise add R6 to R5 to reconstruct the high-resolution right view;
the seventh part comprises two steps:
Step 10: feed the training samples of step 2 into the network of steps 3 to 9 and set the network hyperparameters: learning rate 2e-4, 60 epochs, batch size 8, Adam optimizer, MSE loss; train the network to obtain the final binocular image super-resolution pre-trained model;
Step 11: input the public test sets and real low-resolution binocular images into the pre-trained model obtained in step 10; the network simultaneously reconstructs the super-resolution binocular images.
2. The binocular image super-resolution method based on epipolar window attention according to claim 1, wherein the EWA in step 5(1) divides each of the input deep features L2.1 and R2.1, of size H×W×C, into 6 windows along the W direction.
3. The method of claim 1, wherein in step 5(2) all CAM-aligned left and right window features Y-L2 and Y-R2 are concatenated along the W0 dimension to obtain the aligned binocular features L3.1 and R3.1.
4. The binocular image super-resolution method based on epipolar window attention according to claim 1, wherein in step 6 the output of step 5 is used as the input of step 4, and steps 4 and 5 are iterated in sequence 7 times.
5. The binocular image super-resolution method of claim 1, wherein in step 7 the REWA sums the outputs of all EWA stages by matrix addition to aggregate the aligned features of different depths.
CN202410029544.4A | 2024-01-09 (priority and filing date) | Binocular image super-resolution method based on epipolar window attention | Active | granted as CN117710215B

Priority Applications (1)

Application Number: CN202410029544.4A | Priority Date: 2024-01-09 | Filing Date: 2024-01-09 | Title: Binocular image super-resolution method based on epipolar window attention

Publications (2)

Publication Number | Publication Date
CN117710215A | 2024-03-15
CN117710215B | 2024-06-04

Family

ID=90160876

Family Applications (1)

Application Number: CN202410029544.4A | Title: Binocular image super-resolution method based on epipolar window attention | Priority/Filing Date: 2024-01-09 | Status: Active; granted as CN117710215B

Country Status (1)

Country: CN | CN117710215B

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114881858A * | 2022-05-17 | 2022-08-09 | 东南大学 | Lightweight binocular image super-resolution method based on multi-attention-mechanism fusion
WO2022241995A1 * | 2021-05-18 | 2022-11-24 | 广东奥普特科技股份有限公司 | Visual image enhancement generation method and system, device, and storage medium
CN116309072A * | 2023-03-29 | 2023-06-23 | 西南科技大学 | Binocular image super-resolution method with feature channel separation and fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
雷鹏程, 刘丛, 唐坚刚, 彭敦陆: "分层特征融合注意力网络图像超分辨率重建" [Image super-resolution reconstruction via a hierarchical feature fusion attention network], 中国图象图形学报 (Journal of Image and Graphics), no. 09, 16 September 2020 *

Also Published As

Publication number | Publication date
CN117710215B | 2024-06-04

Similar Documents

Liu et al. FCFR-Net: Feature fusion based coarse-to-fine residual learning for depth completion
Fang et al. A hybrid network of CNN and Transformer for lightweight image super-resolution
CN110570353A Single-image super-resolution reconstruction method using a densely connected generative adversarial network
CN111242238B RGB-D image salient object acquisition method
CN107358576A Depth map super-resolution reconstruction method based on convolutional neural networks
CN109785236B Image super-resolution method based on superpixels and convolutional neural network
CN112767253B Multi-scale feature fusion binocular image super-resolution reconstruction method
CN111626927B Binocular image super-resolution method, system and device adopting parallax constraint
Li et al. DLGSANet: Lightweight dynamic local and global self-attention networks for image super-resolution
CN112785502B Light-field image super-resolution method for a hybrid camera based on texture transfer
Zhou et al. Image super-resolution based on dense convolutional auto-encoder blocks
Zhang et al. Removing foreground occlusions in light field using micro-lens dynamic filter
CN114926337A Single-image super-resolution reconstruction method and system based on a CNN and Transformer hybrid network
Shi et al. IDPT: Interconnected dual pyramid transformer for face super-resolution
CN111080533B Digital zooming method based on self-supervised residual perception network
Zuo et al. Gradient-guided single image super-resolution based on joint trilateral feature filtering
CN112598604A Blind face restoration method and system
CN117710215B Binocular image super-resolution method based on epipolar window attention
CN116309072A Binocular image super-resolution method with feature channel separation and fusion
Han et al. Two-stage network for single image super-resolution
Liao et al. TransRef: Multi-scale reference embedding transformer for reference-guided image inpainting
CN116485654A Lightweight single-image super-resolution reconstruction method combining a convolutional neural network and a Transformer
CN116152060A Dual-feature-fusion-guided depth image super-resolution reconstruction method
CN116703719A Face super-resolution reconstruction device and method based on 3D facial prior information
Wu et al. Infrared and visible light dual-camera super-resolution imaging with texture transfer network

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant