CN116797461A - Binocular image super-resolution reconstruction method based on multistage attention-strengthening mechanism - Google Patents
- Publication number
- CN116797461A (application CN202310853109.9A)
- Authority
- CN
- China
- Prior art keywords
- attention
- resolution
- binocular
- module
- view
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides a binocular image super-resolution reconstruction method based on a multi-stage enhanced attention mechanism. Multiple attention mechanisms are used for feature enhancement during neural network training, the feature information within the views is fully exploited and fused, and a frequency-domain loss function is used to constrain the frequency domain, strengthening the retention of low-frequency information and of the overall image structure, so that the super-resolved binocular images are recovered with a better effect and with clearer textures and edge details.
Description
Technical Field
The application relates to the technical field of binocular image super-resolution, in particular to a binocular image super-resolution reconstruction method based on a multistage attention-strengthening mechanism.
Background
One straightforward way to achieve binocular image super-resolution is to apply a single-image super-resolution algorithm to the left and right images separately. Attention mechanisms are an important research direction in deep learning, and in recent years single-image super-resolution algorithms with superior performance have emerged, such as RCAN built on channel attention, PAN built on pixel attention, SwinIR built on the Transformer self-attention mechanism, and MAN built on multi-scale large-kernel attention. However, reconstructing the two views of a binocular image independently with single-image super-resolution methods exploits only the self-similarity within each image to recover details and ignores the additional information available across views, i.e., cross-view similarity, which limits further improvement in super-resolution performance. Leveraging cross-view information can therefore help reconstruct higher-quality super-resolution images, since one view may contain supplemental information about the same scene region relative to the other view. Driven by these demands, various super-resolution reconstruction technologies have been proposed, and binocular image super-resolution has become a research basis in many fields, giving it great application and research significance. Accordingly, methods such as PASSRnet built on parallax attention, iPASSR built on a bidirectional parallax attention mechanism, SwinFSR built on the Transformer self-attention mechanism, and CVHSSR built on a large-kernel convolutional attention mechanism have emerged.
Although many attempts have been made to combine various attention mechanisms with binocular image super-resolution in order to extract more features from within views and across views, most binocular image super-resolution methods still struggle to restore the natural texture and edge details of the image. Therefore, how to effectively exploit the inter-view dependencies between different attention features, on the basis of various attention mechanisms, to reconstruct a super-resolution binocular image requires further exploration.
Disclosure of Invention
In view of the above, the present application aims to provide a binocular image super-resolution reconstruction method based on a multi-stage enhanced attention mechanism, which fully utilizes the feature information within the views for fusion, uses a frequency-domain loss function to constrain the frequency domain, strengthens the retention of low-frequency information and of the overall image structure, recovers the binocular images with a better effect after super-resolution, and restores clearer textures and edge details.
In order to achieve the above purpose, the application adopts the following technical scheme: a binocular image super-resolution reconstruction method based on a multistage enhanced attention mechanism comprises the following steps:
step S1, a binocular image training set is established; dividing a binocular image super-resolution data set into a training set and a testing set, wherein a low-resolution image is generated through bicubic downsampling; in the training phase, the generated low-resolution images are cut into small blocks, and the corresponding high-resolution images are also cut, and meanwhile, the small blocks are randomly horizontally and vertically flipped to strengthen training data;
step S2, a binocular super-resolution reconstruction network model based on a multi-stage enhanced attention mechanism is established and trained; the network takes a pair of RGB binocular images with low resolution as input to generate a binocular image with super resolution;
s3, constructing a loss function; an L1 loss function combined with a frequency-domain loss function is used to enhance supervision of the high-level feature space and constrain the training of the network;
step S4, setting training parameters to perform network training;
s5, testing network performance; and taking the low-resolution binocular image pair as a test sample, inputting the test sample into a network model which is trained in the last step, obtaining the super-resolution binocular image pair, and comparing and checking the super-resolution effect by using objective evaluation indexes and visual effects.
In a preferred embodiment, the binocular super-resolution reconstruction network model based on the multi-stage enhanced attention mechanism specifically comprises two weight-sharing network branches for the left and right views; in each weight-sharing branch, mixed attention information extraction modules are stacked to extract the intra-view channel and spatial features of the left and right images; the binocular interactive view attention module is used to capture globally corresponding information and cross-view information extracted from the left and right binocular images; the model comprises three parts: intra-view feature extraction, interactive view feature fusion and binocular image reconstruction.
In a preferred embodiment, the step 2 specifically includes the following steps:
step S21, intra-view feature extraction; in the feature extraction stage, the input low-resolution binocular images (of size H×W×3) are first fed into a 3×3 convolution layer to extract shallow features and generate high-dimensional features of size H×W×C, where C is the number of feature channels; the high-dimensional features are then fed into the stacked multi-attention enhancement blocks for intra-view feature extraction, so as to acquire more local features and interaction information and restore more accurate texture details; each multi-attention enhancement block comprises a mixed attention information extraction module and a binocular interactive view attention module;
the mixed attention information extraction module is a basic module of the left and right branches of the network, and the characteristics in the view are extracted more deeply by capturing remote and local dependency relations; the mixed attention information extraction module consists of two modules which are sequentially connected; the first is a simplified channel and spatial information extraction module, and the second is a feedforward network module for residual information aggregation; the two parts of calculation process are as follows:
in the first module, after layer normalization, the channels of the input feature map are expanded using a 1×1 convolution layer; the resulting output is then passed through a 3×3 depthwise convolution to capture the local context of each channel; the cross-activation structure A unit is then used to further learn an effective representation of the spatial context; the next step is the simplified channel-spatial attention module, which makes full use of the channel attention mechanism and the spatial attention mechanism: given the original input X, the inter-channel relations of the feature maps of the given input image are first learned using average pooling and a 1×1 convolution operation, realizing global spatial information aggregation and channel information interaction, and the simplified channel attention feature X1 is output; the spatial information of the feature maps is then aggregated by average pooling and max pooling, and a simplified spatial attention map is obtained by combining a 3×3 convolution with a Sigmoid function; finally, the element-wise product of the feature map and the Sigmoid output is taken as the output X2 of the simplified spatial attention module; the simplified channel-spatial attention module is expressed as:

X1 = X Θ W_C(H_AP(X)),
X2 = X1 Θ σ(W_S(H_cat(H_AP,1(X1), H_MP,1(X1)))),

where W_C(·) and H_AP(·) denote the 1×1 convolution and the average pooling operation, respectively, W_S(·) denotes the 3×3 convolution, H_AP,1(·) and H_MP,1(·) denote the average pooling and max pooling operations along dimension 1, H_cat(·) denotes concatenation along dimension 1, σ(·) denotes the Sigmoid activation function, and Θ denotes element-wise multiplication;
after the simplified channel and spatial information extraction module, a 1×1 convolution maps the features back to the original channel dimension, producing adaptive feature refinement and giving the result of the first module; in the second module, after normalizing the output of the previous module, the local context perception capability is improved through a residual information aggregation feed-forward network containing a cross-activation structure B unit; specifically, given an input tensor X′, X′ is first expanded to a higher dimension X′1 using a 1×1 convolution layer, where k is the expansion ratio; next, a 3×3 depthwise convolution layer encodes the information of adjacent pixel positions of X′1, and the CAS-B unit is then used as the activation function of the depthwise convolution layer, halving the number of feature channels of its output; finally, a 1×1 convolution layer remaps the features to the initial input dimension, giving X′2;
the above process is expressed as:

X′1 = W_C(LN(X′)),
X′2 = W_C(CAS.B(W_D(X′1))),

where LN(·), W_C(·) and W_D(·) denote layer normalization, 1×1 convolution and 3×3 depthwise convolution, respectively, and CAS.B(·) denotes the B unit of the cross-activation structure;
finally, as in the first module, the input of this module and the output of the convolution layer are added as the final result;
step S22, cross-view feature fusion: a binocular interactive view attention module is used after the mixed attention information extraction modules of the left and right branches; the binocular interactive view attention module takes the binocular features generated in the previous step as input, performs bidirectional cross-view interaction, and generates interaction features fused with the view input features; specifically, given the input binocular view features F_L and F_R, layer normalization and a 1×1 convolution operation are applied to obtain the binocular features X_L and X_R, where W_C(·) denotes the 1×1 convolution; then, a fast 1D convolution of size k is performed to generate channel weights, where k is adaptively determined by a mapping of the channel dimension C, and the channel weights are multiplied element-wise with the binocular features to obtain the aggregated features F_L′ and F_R′, as follows:

F_L′ = X_L Θ W_k(H_MP(X_L)),  F_R′ = X_R Θ W_k(H_MP(X_R)),

where W_k(·) and H_MP(·) denote the size-k convolution layer and the max pooling operation, respectively, and Θ denotes element-wise multiplication;
a preliminary attention matrix is then computed to generate F_R→L and F_L→R, where the attention is formed from the query matrix of feature projections within the source view (e.g., the left view) and the key-value matrix of feature projections within the target view (e.g., the right view); finally, the interactive cross-view information and the intra-view information F_L, F_R are fused together by element-wise addition, expressed as follows:

F_L ← F_L + γ_L Θ F_R→L,  F_R ← F_R + γ_R Θ F_L→R,

where γ_L and γ_R are trainable channel scaling factors initialized to zero to stabilize training, and the projections above are implemented with 1×1 convolutions;
step S23, binocular image reconstruction; after the fused features are extracted, they are fed to a 3×3 convolution layer and an enhanced spatial attention module; finally, a pixel recombination (pixel shuffle) operation upsamples the output features to the high-resolution size, and a global residual path exploits the input binocular image information to further improve super-resolution performance, recovering the super-resolved left and right view images:

I_L^SR = H_↑(I_L^LR) + H_P(H_E(H_C(F_L))),  I_R^SR = H_↑(I_R^LR) + H_P(H_E(H_C(F_R))),

where H_C(·), H_E(·), H_P(·) and H_↑(·) denote the convolution operation, the enhanced spatial attention module, the pixel recombination upsampling, and the bilinear interpolation upsampling, respectively;
the enhanced spatial attention module sends a given input X″ to a 1×1 convolution layer to obtain W_C(X″), where W_C(·) is a 1×1 convolution that reduces the channel size of the input features; the module then uses a strided convolution and a strided max pooling layer to reduce the spatial size; after a group of convolutions, bilinear interpolation is performed to recover the spatial size of the extracted features; combined with a residual connection, the features are further processed and a 1×1 convolution layer recovers the channel size; finally, an attention matrix is generated by the Sigmoid function and multiplied with the original input feature X″.
In a preferred embodiment, in step S3, the total loss is written as:

L = L_SR + λ·L_FFT,  (10)

where L_SR and L_FFT denote the L1 reconstruction loss and the frequency-domain loss (frequency Charbonnier loss), respectively, and λ is a hyper-parameter controlling the weight of the frequency Charbonnier loss function; the parameter λ is set to 0.1 for all experiments;
SR reconstruction loss: the SR reconstruction loss is essentially an L1 loss function; the pixel-level L1 distance between the super-resolved and ground-truth binocular images is used, which yields a higher PSNR; it is expressed as:

L_SR = ‖I_L^SR − I_L^HR‖_1 + ‖I_R^SR − I_R^HR‖_1,

where I_L^SR and I_R^SR are the super-resolved left and right images generated by the model, and I_L^HR and I_R^HR are the corresponding high-resolution images;
frequency-domain loss: a frequency-domain loss in the form of a frequency Charbonnier loss is introduced; it is expressed as:

L_FFT = √(‖FFT(I_L^SR) − FFT(I_L^HR)‖² + ε²) + √(‖FFT(I_R^SR) − FFT(I_R^HR)‖² + ε²),

where the constant ε is experimentally set to 10^-3 and FFT(·) denotes the fast Fourier transform.
In a preferred embodiment, in step S4, AdamW is used for optimization, with β1 = 0.9, β2 = 0.9, and the weight decay defaulting to 0; the learning rate is initially set to 1×10^-3 and reduced to 1×10^-7 by a cosine annealing strategy; the model is trained on 30×90 patches; during training, each batch of 32 samples is evenly distributed over 8 parts, for 2×10^5 iterations.
Compared with the prior art, the application has the following beneficial effects: the application provides a binocular image super-resolution reconstruction method based on a multi-stage enhanced attention mechanism; by constructing a network model based on the multi-stage enhanced attention mechanism, it integrates multiple attention mechanisms, comprehensively and efficiently enhances the interaction between intra-view and cross-view information, better extracts the super-resolution information that cannot be fully utilized in the left and right views of a binocular image, enlarges the receptive field and reduces the amount of computation; a novel cross-attention module is used, and an efficient channel attention mechanism achieves a good balance in efficient interaction; the channel features and the spatial features are fused, and important information is propagated forward through the long-range dependencies among features, which effectively improves robustness and generalization capability, improves the super-resolution effect on certain edges, better recovers the natural textures of the image, and obtains a better super-resolution result with less computation.
Drawings
FIG. 1 is a flow chart of a binocular image super-resolution reconstruction method based on a multi-stage enhanced attention mechanism according to a preferred embodiment of the present application;
FIG. 2 is a diagram of a binocular super resolution image network in accordance with a preferred embodiment of the present application;
FIG. 3 is a schematic diagram showing a structure of a hybrid attention information extraction module according to a preferred embodiment of the present application;
FIG. 4 is a simplified schematic diagram of a channel space attention module according to a preferred embodiment of the present application;
FIG. 5 is a schematic diagram of a cross-activation architecture in accordance with a preferred embodiment of the present application;
FIG. 6 is a diagram of a binocular interactive view attention module in accordance with a preferred embodiment of the present application;
FIG. 7 is a diagram of the super-resolution results of a binocular image as shown in a preferred embodiment of the present application.
Detailed Description
The application will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application; as used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The application provides a binocular image super-resolution method based on a multistage enhanced attention mechanism, which comprises the following five steps:
first, a binocular image training set is established. Using the presently disclosed binocular image super-resolution dataset as a training sample, a low resolution image is generated by bicubic downsampling. In the training phase, the generated low resolution images are cropped into small blocks, and the corresponding high resolution images are cropped as well, while the small blocks are randomly flipped horizontally and vertically to enhance the training data.
Second, the network architecture is designed. The overall network consists of three parts: intra-view feature extraction, interactive view feature fusion and binocular image reconstruction. In the stage of extracting the features in the view, the input binocular image is firstly input into a convolution layer to extract shallow features, and high-dimensional features are generated. And then inputting the high-dimensional features into the stacked multi-attention enhancement blocks for intra-view feature extraction so as to acquire more local features and interaction information and restore more accurate texture details. In the cross view feature fusion stage, in order to capture cross information between left and right views, a binocular interactive view attention module is used after the left and right branched mixed attention information extraction module. The binocular interaction view attention module uses the binocular features generated by the previous step of mixed attention information extraction module as input to perform bidirectional cross-view interaction and generate interaction features fused with the view input features. In the binocular image reconstruction stage, after the fusion feature extraction, the fusion feature is output to a convolution layer and a spatial attention enhancement module, and finally, the output feature tensor is up-sampled by using a pixel recombination operation, so that the left and right view images with super resolution are restored.
Third, a loss function is constructed. In order to enhance the texture details of the binocular images and maintain parallax consistency among the viewpoints, the application adopts an L1 loss function combined with a frequency-domain loss function to enhance supervision of the high-level feature space and constrain the training of the network. The total loss can be written as:

L = L_SR + λ·L_FFT,

where L_SR and L_FFT denote the L1 reconstruction loss and the frequency Charbonnier loss, respectively, and λ is a hyper-parameter controlling the frequency Charbonnier loss function, empirically set to 0.1.
Fourth, training parameters are set for network training. Selecting a proper optimizer, setting parameters such as a loss function, a learning rate, a maximum iteration number, the size of each batch of training samples and the like, and training a network until the training is completed to obtain a final network weight model;
fifth, network performance is tested. And taking the low-resolution binocular image pair as a test sample, inputting the test sample into a network model which is trained in the last step, obtaining the super-resolution binocular image pair, and comparing and checking the super-resolution effect by using objective evaluation indexes and visual effects.
FIG. 1 is a flow chart of the method of the present application. The super resolution processing is performed on the binocular image according to the following detailed steps:
and step S1, establishing a binocular image training set. The presently disclosed binocular image super-resolution dataset is divided into a training set and a testing set, and the low resolution image is generated by bicubic downsampling. In the training phase, the generated low resolution images are cropped into small blocks, and the corresponding high resolution images are cropped as well, while the small blocks are randomly flipped horizontally and vertically to enhance the training data.
And S2, establishing and training a binocular super-resolution reconstruction network model based on a multi-stage enhanced attention mechanism. As shown in fig. 2, the network takes as input a pair of low resolution RGB binocular images, generating a super resolution binocular image. Specifically, the network includes two left and right weight sharing network branches. In each weight sharing network, the mixed attention information extraction modules are stacked to extract intra-view channels and spatial features of left and right images. The binocular interactive view attention module is used for capturing globally corresponding information and cross view information extracted from left and right binocular images. In general, it can be divided into three parts: intra-view feature extraction, interactive view feature fusion and binocular image reconstruction.
And S2.1, intra-view feature extraction. In the feature extraction stage, the input low-resolution binocular images (of size H×W×3) are first fed into a 3×3 convolution layer to extract shallow features and generate high-dimensional features of size H×W×C, where C is the number of feature channels. The high-dimensional features are then fed into the stacked multi-attention enhancement blocks for intra-view feature extraction, so as to acquire more local features and interaction information and restore more accurate texture details. Each multi-attention enhancement block comprises a mixed attention information extraction module and a binocular interactive view attention module.
As shown in fig. 3, the mixed attention information extraction module is a basic module of the left and right branches of the network, and features in the view can be extracted more deeply by capturing remote and local dependencies. The mixed attention information extraction module consists of two modules connected in sequence. The first is a simplified channel and spatial information extraction module, and the second is a feed-forward network module for residual information aggregation. The two parts of calculation process are as follows:
In the first module, after layer normalization, the channels of the input feature map are expanded using a 1×1 convolution layer. The resulting output is then passed through a 3×3 depthwise convolution to capture the local context of each channel. The cross-activation structure A unit (as shown in fig. 4) is then used to further learn an effective representation of the spatial context. The next step is the simplified channel-spatial attention module, as shown in fig. 5, which makes full use of the channel attention mechanism and the spatial attention mechanism and filters out less useful information. Given the original input X, the inter-channel relations of the feature maps of the given input image are first learned using average pooling and a 1×1 convolution operation, realizing global spatial information aggregation and channel information interaction, and the simplified channel attention feature X1 is output. The spatial information of the feature maps is then aggregated by average pooling and max pooling, and a simplified spatial attention map is obtained by combining a 3×3 convolution with a Sigmoid function. Finally, the element-wise product of the feature map and the Sigmoid output is taken as the output X2 of the simplified spatial attention module. The simplified channel-spatial attention module may be expressed as:

X1 = X Θ W_C(H_AP(X)),
X2 = X1 Θ σ(W_S(H_cat(H_AP,1(X1), H_MP,1(X1)))),

where W_C(·) and H_AP(·) denote the 1×1 convolution and the average pooling operation, respectively, W_S(·) denotes the 3×3 convolution, H_AP,1(·) and H_MP,1(·) denote the average pooling and max pooling operations along dimension 1, H_cat(·) denotes concatenation along dimension 1, σ(·) denotes the Sigmoid activation function, and Θ denotes element-wise multiplication.
After the simplified channel and spatial information extraction module, a 1×1 convolution maps the features back to the original channel dimension, producing adaptive feature refinement and giving the result of the first module.
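A minimal PyTorch sketch of the simplified channel-spatial attention described above is given below; the class name, the absence of extra non-linearities in the channel branch, and the exact pooling layout are assumptions for illustration, and the patented module may differ in detail:

```python
import torch
import torch.nn as nn


class SimplifiedChannelSpatialAttention(nn.Module):
    """Channel branch: global average pooling + 1x1 conv -> channel weights.
    Spatial branch: channel-wise avg/max pooling, concat, 3x3 conv, Sigmoid."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Conv2d(channels, channels, kernel_size=1)   # W_C
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)    # W_S

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Simplified channel attention: X1 = X * W_C(avg_pool(X)).
        w = self.channel_fc(x.mean(dim=(2, 3), keepdim=True))
        x1 = x * w
        # Simplified spatial attention over channel-wise pooled maps.
        avg_map = x1.mean(dim=1, keepdim=True)
        max_map = x1.amax(dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x1 * attn
```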
In the second module, after normalizing the output of the previous module, the local context awareness is improved through a residual information aggregation feed-forward network containing a cross-activation structure B unit (as shown in fig. 4). Specifically, given an input tensor X′, X′ is first expanded to a higher dimension X′1 using a 1×1 convolution layer, where k is the expansion ratio. Next, a 3×3 depthwise convolution layer encodes the information of adjacent pixel positions of X′1, and the CAS-B unit is then used as the activation function of the depthwise convolution layer, halving the number of feature channels of its output. Finally, a 1×1 convolution layer remaps the features to the initial input dimension, giving X′2. The above procedure can be expressed as:

X′1 = W_C(LN(X′)),
X′2 = W_C(CAS.B(W_D(X′1))),

where LN(·), W_C(·) and W_D(·) denote layer normalization, 1×1 convolution and 3×3 depthwise convolution, respectively, and CAS.B(·) denotes the B unit of the cross-activation structure.
Finally, as in the previous module, the input of this module and the output of the convolution layer are added as the final result.
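A sketch of this feed-forward block is given below. Since the exact form of the CAS-B unit is not spelled out here, the sketch assumes a gating unit that splits the feature map into two halves along the channel dimension and multiplies them, which is consistent with the stated halving of the channel count; the expansion ratio k = 2 and the use of GroupNorm as a channel-wise LayerNorm substitute are also assumptions:

```python
import torch
import torch.nn as nn


class CrossActivationGate(nn.Module):
    """Assumed CAS-B form: split channels in half and multiply the halves."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)
        return a * b


class ResidualAggregationFFN(nn.Module):
    """Norm -> 1x1 expansion -> 3x3 depthwise conv -> gate -> 1x1 projection, with a residual."""

    def __init__(self, channels: int, k: int = 2):
        super().__init__()
        hidden = channels * k
        self.norm = nn.GroupNorm(1, channels)      # channel-wise LayerNorm substitute
        self.expand = nn.Conv2d(channels, hidden, 1)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.gate = CrossActivationGate()
        self.project = nn.Conv2d(hidden // 2, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(self.norm(x))
        y = self.gate(self.dwconv(y))
        return x + self.project(y)
```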
And S2.2, cross-view feature fusion. As shown in fig. 6, in order to capture cross information between the left and right views, a binocular interactive view attention module is used after the mixed attention information extraction modules of the left and right branches. The binocular interactive view attention module takes the binocular features generated in the previous step as input, performs bidirectional cross-view interaction, and generates interaction features fused with the view input features. Specifically, given the input binocular view features F_L and F_R, layer normalization and a 1×1 convolution operation are applied to obtain the binocular features X_L and X_R, where W_C(·) denotes the 1×1 convolution. Then, a fast 1D convolution of size k is performed to generate channel weights, where k is adaptively determined by a mapping of the channel dimension C, and the channel weights are multiplied element-wise with the binocular features to obtain the aggregated features F_L′ and F_R′, as follows:

F_L′ = X_L Θ W_k(H_MP(X_L)),  F_R′ = X_R Θ W_k(H_MP(X_R)),

where W_k(·) and H_MP(·) denote the size-k convolution layer and the max pooling operation, respectively, and Θ denotes element-wise multiplication.
A preliminary attention matrix is then computed to generate F_R→L and F_L→R, where the attention is formed from the query matrix of feature projections within the source view (e.g., the left view) and the key-value matrix of feature projections within the target view (e.g., the right view). Finally, the interactive cross-view information and the intra-view information F_L, F_R are fused together by element-wise addition, which can be expressed as follows:

F_L ← F_L + γ_L Θ F_R→L,  F_R ← F_R + γ_R Θ F_L→R,

where γ_L and γ_R are trainable channel scaling factors initialized to zero to stabilize training, and the projections above are implemented with 1×1 convolutions.
And S2.3, binocular image reconstruction. After the fused features are extracted, they are fed to a 3×3 convolution layer and an enhanced spatial attention module, and the output features are finally upsampled to the high-resolution size by a pixel recombination (pixel shuffle) operation. Furthermore, to reduce the burden of feature extraction, a global residual path is used in this part to exploit the input binocular image information and further improve super-resolution performance, recovering the super-resolved left and right view images:

I_L^SR = H_↑(I_L^LR) + H_P(H_E(H_C(F_L))),  I_R^SR = H_↑(I_R^LR) + H_P(H_E(H_C(F_R))),

where H_C(·), H_E(·), H_P(·) and H_↑(·) denote the convolution operation, the enhanced spatial attention module, the pixel recombination upsampling, and the bilinear interpolation upsampling, respectively.
The enhanced spatial attention module sends a given input X″ to a 1×1 convolution layer to obtain W_C(X″), where W_C(·) is a 1×1 convolution that reduces the channel size of the input features. The module then uses a strided convolution and a strided max pooling layer to reduce the spatial size. After a group of convolutions, upsampling based on bilinear interpolation is performed to recover the spatial size of the extracted features. Combined with a residual connection, the features are further processed and a 1×1 convolution layer recovers the channel size. Finally, an attention matrix is generated by the Sigmoid function and multiplied with the original input feature X″.
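A compact sketch of the reconstruction stage is shown below; the enhanced spatial attention module is omitted for brevity, and the channel counts and class name are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReconstructionHead(nn.Module):
    """3x3 conv -> pixel shuffle upsampling, plus a bilinear global residual from the LR input."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feat: torch.Tensor, lr_img: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(lr_img, scale_factor=self.scale, mode="bilinear",
                           align_corners=False)
        return up + self.shuffle(self.conv(feat))
```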
And S3, constructing a loss function. In order to enhance the texture details of the binocular images and maintain parallax consistency between the viewpoints, the application adopts an L1 loss function combined with a frequency-domain loss function to enhance the supervision of the high-level feature space and constrain the training of the network. The total loss can be written as:

L = L_SR + λ·L_FFT,  (10)

where L_SR and L_FFT denote the L1 reconstruction loss and the frequency-domain loss (frequency Charbonnier loss), respectively, and λ is a hyper-parameter controlling the weight of the frequency Charbonnier loss function. The parameter λ is set to 0.1 for all experiments.
SR reconstruction loss. The SR reconstruction loss is essentially an L1 loss function. To achieve faster convergence, the application uses the pixel-level L1 distance between the super-resolved and ground-truth binocular images, which avoids over-smoothed textures and yields a higher PSNR. It can be expressed as follows:

L_SR = ‖I_L^SR − I_L^HR‖_1 + ‖I_R^SR − I_R^HR‖_1,

where I_L^SR and I_R^SR are the super-resolved left and right images generated by the model, and I_L^HR and I_R^HR are the corresponding high-resolution images.
Frequency-domain loss. In order to better recover high-frequency details in the image super-resolution task, the application introduces a frequency-domain loss in the form of a frequency Charbonnier loss. It can be expressed as follows:

L_FFT = √(‖FFT(I_L^SR) − FFT(I_L^HR)‖² + ε²) + √(‖FFT(I_R^SR) − FFT(I_R^HR)‖² + ε²),

where the constant ε is experimentally set to 10^-3 and FFT(·) denotes the fast Fourier transform.
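A minimal sketch of this combined loss in PyTorch is given below; the mean reduction over pixels and frequencies and the function name are assumptions for illustration:

```python
import torch


def total_loss(sr_l, sr_r, hr_l, hr_r, lam=0.1, eps=1e-3):
    """L1 reconstruction loss plus a frequency Charbonnier loss on the FFT of each view."""
    l_sr = (sr_l - hr_l).abs().mean() + (sr_r - hr_r).abs().mean()

    def freq_charbonnier(sr, hr):
        diff = torch.fft.fft2(sr) - torch.fft.fft2(hr)
        return torch.sqrt(diff.abs() ** 2 + eps ** 2).mean()

    l_fft = freq_charbonnier(sr_l, hr_l) + freq_charbonnier(sr_r, hr_r)
    return l_sr + lam * l_fft
```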
And S4, setting training parameters and performing network training. AdamW is used for optimization, with β1 = 0.9, β2 = 0.9, and the weight decay defaulting to 0. The learning rate is initially set to 1×10^-3 and reduced to 1×10^-7 by a cosine annealing strategy. The model is trained on 30×90 patches. During training, each batch of 32 samples is evenly distributed over 8 parts, for 2×10^5 iterations.
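The corresponding optimizer and scheduler configuration can be sketched as follows; the stand-in model, the per-iteration scheduler stepping, and the omission of the distribution of each batch over 8 parts are assumptions for illustration:

```python
import torch

# Stand-in model; in practice this would be the full binocular super-resolution network.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.9), weight_decay=0.0)
total_iters = int(2e5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters, eta_min=1e-7)

for it in range(total_iters):
    # ... compute the loss on a batch of 32 samples and call loss.backward() here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```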
And S5, testing the network performance. And taking the low-resolution binocular image pair as a test sample, inputting the test sample into a network model which is trained in the last step, obtaining the super-resolution binocular image pair, and comparing and checking the super-resolution effect by using objective evaluation indexes and visual effects.
In order to demonstrate the super-resolution effect, the bicubic interpolation method (Bicubic), an existing single-image super-resolution technique (EDSR) and binocular image super-resolution techniques (PASSRnet, SRResNet+SAM, iPASSR and NAFSSR-L) were compared in the experiments.
The binocular super-resolution method provided by the embodiment of the application is verified from both qualitative and quantitative perspectives.
2.1 qualitative test results
The embodiment of the application performs super-resolution operation on the images in the test set and compares the super-resolution operation with results obtained by other methods. As shown in fig. 7, a super-resolution result graph is shown. In the image of fig. 7, the full image is shown in the box in the lower right hand corner, with the remainder being a partial enlargement of a block in the image. Compared with other super-resolution methods, the method provided by the embodiment of the application recovers more details related to edges and textures, and verifies that the embodiment of the application has good effect on the binocular image super-resolution task.
2.2 quantitative analysis
The embodiment of the application carries out quantitative error analysis on the super-resolution results of the 112 binocular images in the test set; the compared methods include the interpolation method (Bicubic), EDSR, and the PASSRnet, SRResNet+SAM, iPASSR and NAFSSR-L methods. Objective quality evaluation refers to quantitatively computing a fixed mathematical formula over the target image and judging the image quality according to the computed value; the peak signal-to-noise ratio (PSNR, Peak Signal-to-Noise Ratio) and the structural similarity index measure (SSIM, Structural Similarity Index Measure) are currently the main objective evaluation indexes. The PSNR between an image I and an image K is calculated as follows:

MSE = (1 / (H·W)) Σ_{x=1..H} Σ_{y=1..W} (I(x, y) − K(x, y))²,  PSNR = 10·log10(MAX_I² / MSE),

where H and W respectively denote the height and width of the images I and K, I(x, y) and K(x, y) are the pixel values at position coordinates (x, y), MAX_I = 2^b − 1 is the pixel peak value of the image, and b is the number of bits per pixel; b = 8 is generally taken in natural image processing. The unit of PSNR is the decibel (dB), its value usually lies between 20 and 40, and a larger value indicates a smaller pixel difference between the reconstructed image and the label image, and hence better performance of the super-resolution model.
Given a label image I and a reconstructed image K, the SSIM is calculated as follows:

SSIM(I, K) = ((2·μ_I·μ_K + c1)·(2·σ_IK + c2)) / ((μ_I² + μ_K² + c1)·(σ_I² + σ_K² + c2)),

where μ_I and μ_K are the pixel means of I and K, σ_I² and σ_K² are the variances of I and K, σ_IK is the covariance of I and K, and c1 and c2 are small constants that stabilize the division. The value of SSIM lies between 0 and 1, and a value closer to 1 indicates a higher overall similarity between the two images. SSIM is typically used together with PSNR as an objective quality assessment indicator.
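For reference, a simplified sketch of both metrics is shown below; the single-window (global) SSIM and the c1/c2 constants for 8-bit images follow common practice rather than the application, and standard SSIM implementations average over local windows:

```python
import torch
import torch.nn.functional as F


def psnr(img_i: torch.Tensor, img_k: torch.Tensor, peak: float = 255.0) -> torch.Tensor:
    """PSNR in dB between a reconstructed image and its label (8-bit peak = 255)."""
    mse = F.mse_loss(img_i, img_k)
    return 10.0 * torch.log10(peak ** 2 / mse)


def ssim(img_i: torch.Tensor, img_k: torch.Tensor,
         c1: float = (0.01 * 255) ** 2, c2: float = (0.03 * 255) ** 2) -> torch.Tensor:
    """Global (single-window) SSIM between two images."""
    mu_i, mu_k = img_i.mean(), img_k.mean()
    var_i, var_k = img_i.var(unbiased=False), img_k.var(unbiased=False)
    cov = ((img_i - mu_i) * (img_k - mu_k)).mean()
    return ((2 * mu_i * mu_k + c1) * (2 * cov + c2)) / (
        (mu_i ** 2 + mu_k ** 2 + c1) * (var_i + var_k + c2))
```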
After images in the test set were tested by different methods and averaged, experimental results are shown in table 1:
TABLE 1 PSNR and SSIM contrast for different super-resolution methods on test sets
From the results in table 1, it can be seen that the binocular image super-resolution method provided by the embodiment of the application obtains an average peak signal-to-noise ratio of 24.21dB and a structural similarity of 0.7633. Compared with other methods for performing super-resolution by using a neural network, the numerical value shows that the super-resolution method provided by the embodiment of the application has better results on the test set, and improves the super-resolution effect by using the mapping relation between binocular images.
Fig. 7 compares the reconstruction effect of the present application with the super-resolution reconstruction results of Bicubic, the existing single-image super-resolution technique (EDSR) and the binocular image super-resolution techniques (PASSRnet, SRResNet+SAM, iPASSR and NAFSSR-L); the present application clearly restores the lines of the edges, the background and the texture are clearly distinguished, and the super-resolution result is good.
Claims (5)
1. The binocular image super-resolution reconstruction method based on the multistage enhanced attention mechanism is characterized by comprising the following steps of:
step S1, a binocular image training set is established; dividing a binocular image super-resolution data set into a training set and a testing set, wherein a low-resolution image is generated through bicubic downsampling; in the training phase, the generated low-resolution images are cut into small blocks, and the corresponding high-resolution images are also cut, and meanwhile, the small blocks are randomly horizontally and vertically flipped to strengthen training data;
step S2, a binocular super-resolution reconstruction network model based on a multi-stage enhanced attention mechanism is established and trained; the network takes a pair of RGB binocular images with low resolution as input to generate a binocular image with super resolution;
s3, constructing a loss function; an L1 loss function combined with a frequency-domain loss function is used to enhance supervision of the high-level feature space and constrain the training of the network;
step S4, setting training parameters to perform network training;
s5, testing network performance; and taking the low-resolution binocular image pair as a test sample, inputting the test sample into a network model which is trained in the last step, obtaining the super-resolution binocular image pair, and comparing and checking the super-resolution effect of the binocular image by using objective evaluation indexes and visual effects.
2. The method for reconstructing a binocular image based on a multi-stage enhanced attention mechanism according to claim 1, wherein the binocular super-resolution reconstruction network model based on the multi-stage enhanced attention mechanism comprises two weight-sharing network branches for the left and right views; in each weight-sharing branch, mixed attention information extraction modules are stacked to extract the intra-view channel and spatial features of the left and right images; the binocular interactive view attention module is used to capture globally corresponding information and cross-view information extracted from the left and right binocular images; the model comprises three parts: intra-view feature extraction, interactive view feature fusion and binocular image reconstruction.
3. The binocular image super-resolution reconstruction method based on the multi-stage enhanced attention mechanism according to claim 2, wherein the step 2 specifically comprises the following steps:
step S21, intra-view feature extraction; in the feature extraction stage, the input low-resolution binocular images (of size H×W×3) are first fed into a 3×3 convolution layer to extract shallow features and generate high-dimensional features of size H×W×C, where C is the number of feature channels; the high-dimensional features are then fed into the stacked multi-attention enhancement blocks for intra-view feature extraction, so as to acquire more local features and interaction information and restore more accurate texture details; each multi-attention enhancement block comprises a mixed attention information extraction module and a binocular interactive view attention module;
the mixed attention information extraction module is a basic module of the left and right branches of the network, and the characteristics in the view are extracted more deeply by capturing remote and local dependency relations; the mixed attention information extraction module consists of two modules which are sequentially connected; the first is a simplified channel and spatial information extraction module, and the second is a feedforward network module for residual information aggregation; the two parts of calculation process are as follows:
in the first module, after layer normalization, the channels of the input feature map are expanded using a 1×1 convolution layer; the resulting output is then passed through a 3×3 depthwise convolution to capture the local context of each channel; the cross-activation structure A unit is then used to further learn an effective representation of the spatial context; the next step is the simplified channel-spatial attention module, which makes full use of the channel attention mechanism and the spatial attention mechanism: given the original input X, the inter-channel relations of the feature maps of the given input image are first learned using average pooling and a 1×1 convolution operation, realizing global spatial information aggregation and channel information interaction, and the simplified channel attention feature X1 is output; the spatial information of the feature maps is then aggregated by average pooling and max pooling, and a simplified spatial attention map is obtained by combining a 3×3 convolution with a Sigmoid function; finally, the element-wise product of the feature map and the Sigmoid output is taken as the output X2 of the simplified spatial attention module; the simplified channel-spatial attention module is expressed as:

X1 = X Θ W_C(H_AP(X)),
X2 = X1 Θ σ(W_S(H_cat(H_AP,1(X1), H_MP,1(X1)))),

where W_C(·) and H_AP(·) denote the 1×1 convolution and the average pooling operation, respectively, W_S(·) denotes the 3×3 convolution, H_AP,1(·) and H_MP,1(·) denote the average pooling and max pooling operations along dimension 1, H_cat(·) denotes concatenation along dimension 1, σ(·) denotes the Sigmoid activation function, and Θ denotes element-wise multiplication;
after the simplified channel and spatial information extraction module, a 1×1 convolution maps the features back to the original channel dimension, producing adaptive feature refinement and giving the result of the first module;
in the second module, after normalizing the output of the previous module, the local context perception capability is improved through a residual information aggregation feed-forward network containing a cross-activation structure B unit; specifically, given an input tensor X′, X′ is first expanded to a higher dimension X′1 using a 1×1 convolution layer, where k is the expansion ratio; next, a 3×3 depthwise convolution layer encodes the information of adjacent pixel positions of X′1, and the CAS-B unit is then used as the activation function of the depthwise convolution layer, halving the number of feature channels of its output; finally, a 1×1 convolution layer remaps the features to the initial input dimension, giving X′2;
the above process is expressed as:

X′1 = W_C(LN(X′)),
X′2 = W_C(CAS.B(W_D(X′1))),

where LN(·), W_C(·) and W_D(·) denote layer normalization, 1×1 convolution and 3×3 depthwise convolution, respectively, and CAS.B(·) denotes the B unit of the cross-activation structure;
finally, as in the first module, the input of this module and the output of the convolution layer are added as the final result;
step S22, cross-view feature fusion: a binocular interactive view attention module is used after the mixed attention information extraction modules of the left and right branches; the binocular interactive view attention module takes the binocular features generated in the previous step as input, performs bidirectional cross-view interaction, and generates interaction features fused with the view input features; specifically, given the input binocular view features F_L and F_R, layer normalization and a 1×1 convolution operation are applied to obtain the binocular features X_L and X_R, where W_C(·) denotes the 1×1 convolution; then, a fast 1D convolution of size k is performed to generate channel weights, where k is adaptively determined by a mapping of the channel dimension C, and the channel weights are multiplied element-wise with the binocular features to obtain the aggregated features F_L′ and F_R′, as follows:

F_L′ = X_L Θ W_k(H_MP(X_L)),  F_R′ = X_R Θ W_k(H_MP(X_R)),

where W_k(·) and H_MP(·) denote the size-k convolution layer and the max pooling operation, respectively, and Θ denotes element-wise multiplication;
a preliminary attention matrix is then computed to generate F_R→L and F_L→R, where the attention is formed from the query matrix of feature projections within the source view (e.g., the left view) and the key-value matrix of feature projections within the target view (e.g., the right view); finally, the interactive cross-view information and the intra-view information F_L, F_R are fused together by element-wise addition, expressed as follows:

F_L ← F_L + γ_L Θ F_R→L,  F_R ← F_R + γ_R Θ F_L→R,

where γ_L and γ_R are trainable channel scaling factors initialized to zero to stabilize training, and the projections above are implemented with 1×1 convolutions;
step S23, binocular image reconstruction; after the fused features are extracted, they are fed to a 3×3 convolution layer and an enhanced spatial attention module; finally, a pixel recombination (pixel shuffle) operation upsamples the output features to the high-resolution size, and a global residual path exploits the input binocular image information to further improve super-resolution performance, recovering the super-resolved left and right view images:

I_L^SR = H_↑(I_L^LR) + H_P(H_E(H_C(F_L))),  I_R^SR = H_↑(I_R^LR) + H_P(H_E(H_C(F_R))),

where H_C(·), H_E(·), H_P(·) and H_↑(·) denote the convolution operation, the enhanced spatial attention module, the pixel recombination upsampling, and the bilinear interpolation upsampling, respectively;
the enhanced spatial attention module sends a given input X″ to a 1×1 convolution layer to obtain W_C(X″), where W_C(·) is a 1×1 convolution that reduces the channel size of the input features; the module then uses a strided convolution and a strided max pooling layer to reduce the spatial size; after a group of convolutions, upsampling based on bilinear interpolation is performed to recover the spatial size of the extracted features; combined with a residual connection, the features are further processed and a 1×1 convolution layer recovers the channel size; finally, an attention matrix is generated by the Sigmoid function and multiplied with the original input feature X″.
4. The binocular image super-resolution reconstruction method based on the multi-stage enhanced attention mechanism of claim 1, wherein in step S3, the total loss is written as:

L = L_SR + λ·L_FFT,

where L_SR and L_FFT denote the L1 reconstruction loss and the frequency-domain loss (frequency Charbonnier loss), respectively, and λ is a hyper-parameter controlling the weight of the frequency Charbonnier loss function; the parameter λ is set to 0.1 for all experiments;
SR reconstruction loss: the SR reconstruction loss is essentially an L1 loss function; the pixel-level L1 distance between the super-resolved and ground-truth binocular images is used, which yields a higher PSNR; it is expressed as:

L_SR = ‖I_L^SR − I_L^HR‖_1 + ‖I_R^SR − I_R^HR‖_1,

where I_L^SR and I_R^SR are the super-resolved left and right images generated by the model, and I_L^HR and I_R^HR are the corresponding high-resolution images;
frequency-domain loss: a frequency-domain loss in the form of a frequency Charbonnier loss is introduced; it is expressed as:

L_FFT = √(‖FFT(I_L^SR) − FFT(I_L^HR)‖² + ε²) + √(‖FFT(I_R^SR) − FFT(I_R^HR)‖² + ε²),

where the constant ε is experimentally set to 10^-3 and FFT(·) denotes the fast Fourier transform.
5. The binocular image super-resolution reconstruction method based on the multi-stage enhanced attention mechanism of claim 1, wherein in step S4, AdamW is adopted for optimization, with β1 = 0.9, β2 = 0.9, and the weight decay defaulting to 0; the learning rate is initially set to 1×10^-3 and reduced to 1×10^-7 by a cosine annealing strategy; the model is trained on 30×90 patches; during training, each batch of 32 samples is evenly distributed over 8 parts, for 2×10^5 iterations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310853109.9A CN116797461A (en) | 2023-07-12 | 2023-07-12 | Binocular image super-resolution reconstruction method based on multistage attention-strengthening mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310853109.9A CN116797461A (en) | 2023-07-12 | 2023-07-12 | Binocular image super-resolution reconstruction method based on multistage attention-strengthening mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116797461A true CN116797461A (en) | 2023-09-22 |
Family
ID=88036723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310853109.9A Pending CN116797461A (en) | 2023-07-12 | 2023-07-12 | Binocular image super-resolution reconstruction method based on multistage attention-strengthening mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116797461A (en) |
- 2023-07-12: CN CN202310853109.9A — patent CN116797461A (en), active, Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117274316A (en) * | 2023-10-31 | 2023-12-22 | 广东省水利水电科学研究院 | River surface flow velocity estimation method, device, equipment and storage medium |
CN117274316B (en) * | 2023-10-31 | 2024-05-03 | 广东省水利水电科学研究院 | River surface flow velocity estimation method, device, equipment and storage medium |
CN117788296A (en) * | 2024-02-23 | 2024-03-29 | 北京理工大学 | Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network |
CN117788296B (en) * | 2024-02-23 | 2024-05-07 | 北京理工大学 | Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network |
CN118212476A (en) * | 2024-05-20 | 2024-06-18 | 山东云海国创云计算装备产业创新中心有限公司 | Image classification method, product and storage medium |
CN118297808A (en) * | 2024-06-06 | 2024-07-05 | 山东大学 | Binocular image super-resolution reconstruction method and system based on parallax guidance |
CN118297808B (en) * | 2024-06-06 | 2024-08-13 | 山东大学 | Binocular image super-resolution reconstruction method and system based on parallax guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||