CN112991167A - Aerial image super-resolution reconstruction method based on layered feature fusion network - Google Patents
- Publication number
- Publication number: CN112991167A (application No. CN202110111223.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention provides an aerial image super-resolution reconstruction method based on a layered feature fusion network. Building on the layered structure of a U-shaped network, the method improves the fusion of features at different scales, reduces the loss of image feature information during convolution, and improves the model's reconstruction of complex aerial scenes. The gain in reconstruction quality is especially pronounced for tasks with larger magnification factors. Model training uses more densely connected residual blocks, so a good aerial image reconstruction effect is obtained even with a network of modest depth, which reduces the complexity of the overall model. The method also reconstructs well in a variety of environments with complex noise, such as underwater images and rain-and-fog weather images, which demonstrates the model's generality to a certain extent.
Description
Technical Field
The invention relates to the technical field of low-level computer vision, in particular to an aerial image super-resolution reconstruction method based on a hierarchical feature fusion network.
Background
During imaging, an image is often affected by various environmental factors that degrade its quality. In extreme environments such as rainy and foggy weather, underwater operation, high-speed photography, and high-altitude long-range aerial photography, imaging quality often suffers from underexposure/overexposure, motion artifacts, and noise interference, owing to limits on shooting cost and on the physical performance of the shooting equipment. Generally, upgrading the shooting equipment is the simplest and most direct way to alleviate these problems: improving the sensor manufacturing process and shrinking the per-unit-area pixel size allows more pixels to be mounted on a CMOS sensor of the same size, so that picture quality improves with the pixel count of a single image. However, even setting cost aside, image degradation cannot be completely avoided simply by improving hardware performance. To generate high-quality images at lower cost and with greater adaptability, researchers have tried to increase image resolution through image-processing techniques: the higher the resolution of an image, the richer its texture detail and the clearer the image. Image super-resolution reconstruction is a technique that reconstructs image detail in software, without adding hardware cost, and has wide application in medical image reconstruction, face/license-plate surveillance and recognition, satellite remote-sensing image reconstruction, video-stream super-resolution, and so on.
Chinese patent publication No. CN111583115A, published on 08/25/2020, discloses a single-image super-resolution reconstruction method and system based on a deep attention network, comprising: step 1: preprocessing the open-source DIV2K image training data set to obtain a training set; step 2: building a convolutional neural network capable of super-resolution image reconstruction; step 3: inputting the training set obtained in step 1 into the convolutional neural network built in step 2 for training, to obtain a super-resolution reconstruction model; step 4: inputting the low-resolution single image to be processed into the super-resolution reconstruction model obtained in step 3, and outputting the reconstructed super-resolution image. The deep-learning-based super-resolution network model of that patent has high complexity and a low feature-utilization rate. Especially in complex settings such as high-altitude shooting, the reconstruction effect is often poor because the features of the captured image are severely degraded and contaminated by environmental noise. Meanwhile, network depth is one of the keys to better performance, but raising performance by stacking network depth inflates the size of the final model.
Disclosure of Invention
The invention provides an aerial image super-resolution reconstruction method based on a layered feature fusion network: a novel method that reconstructs complex aerial images well with a lightweight model while fully extracting image features.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an aerial image super-resolution reconstruction method based on a layered feature fusion network comprises the following steps:
s1: acquiring aerial images with high resolution, wherein the aerial images are aerial images in different scenes and are divided into training data and verification data;
s2: preprocessing the aerial image;
s3: building a network model, wherein the network model adopts a layered U-shaped structure, symmetrical feature maps are fused within the U-shaped structure, and the number of input feature-map channels reaches its maximum at the bottom layer of the U-shaped structure; the network model is trained with the training data of step S1 and verified with the verification data of step S1;
s4: and reconstructing the new aerial image by using the trained network model to obtain a high-resolution image.
Preferably, in step S1, 114 high-resolution, high-quality aerial images are selected from a publicly available aerial photography dataset, of which 100 are training data and 14 are verification data.
Preferably, the scenes of the aerial images in step S1 include car parks, airports, residential areas, sports grounds, harbors, viaducts, farmlands, and highways.
Preferably, the preprocessing in step S2 specifically includes:
The 100 aerial images used as training data are cut into small pictures with a resolution of 480 × 480, 8234 in total; before input to the network model, these small pictures are further cropped to 96 × 96 and fed into the network model in batches for training.
Preferably, the U-shaped structure is specifically:
the image is converted by an initial convolution into a coarse feature map F_0; the coarse feature map passes through two downward convolution modules to obtain feature map F_1 and feature map F_2, the channel count of F_2 being four times that of the coarse feature map F_0; feature map F_2 is fed into the stacked dense residual blocks for feature extraction, and the output is fused into feature map F'_2 by a 3 × 3 convolution; feature map F'_2 is symmetrically concatenated with feature map F_1 and fed into the dense residual blocks again for high-frequency feature extraction; the final output is fused by a 3 × 3 convolution, concatenated with the coarse feature map F_0, and fed into an up-sampling reconstruction module to generate the high-resolution image.
Preferably, the mathematical representation of the network model is as follows:

F_1 = f_CB(F_0)

F_2 = f_CB(F_1)

F'_2 = f_Pixel(f_DB×N(F_2)), N = 1, 2, …, n

F'_1,2 = f_Pixel(f_DB×N(H_concat(F_1, F'_2))), N = 1, 2, …, n

F_HR = f_1×1(H_concat(F_0, F'_1,2)) + F_0

where f_CB denotes the convolution-module operation, f_DB×N denotes N stacked dense-residual-block operations, H_concat denotes the merging (concatenation) of image features, f_1×1 denotes a 1 × 1 convolution dimensionality-reduction operation, F'_1,2 is the fused feature map, and F_HR is the resulting high-resolution image.
Preferably, the convolution module includes two 3 × 3 convolution kernels and a LeakyReLU activation function; the feature map first passes through the first 3 × 3 convolution kernel, then through the LeakyReLU activation function, and then through the second 3 × 3 convolution kernel.
Preferably, the dense residual block includes four basic residual modules; the feature information of the first three basic residual modules is routed directly to the tail of the dense residual block and concatenated with the output of the last residual module, and these features are finally fused by convolution.
Preferably, the basic residual module includes two 3 × 3 convolution kernels, a LeakyReLU activation function, and a lightweight attention module; the feature map passes through the first 3 × 3 convolution kernel, then the LeakyReLU activation function, then the second 3 × 3 convolution kernel, and finally the feature information obtained by the lightweight attention module is added to the feature map for output.
Preferably, the lightweight attention module uses the ECA attention module, whose parameter matrix is the banded matrix:

W_k = [ w^(1,1) … w^(1,k)      0        …      0     ;
        0       w^(2,2) … w^(2,k+1)     …      0     ;
        …                                      …     ;
        0       …       0       w^(C,C−k+1) … w^(C,C) ]

In the formula, w^(i,j) represents the weight parameter of each channel; the matrix involves k × C parameters in total, where k is the kernel size of a learnable one-dimensional convolution, meaning that only cross-channel attention among k adjacent channels is considered. The value of k is learned adaptively from the channel number C of the input feature map:

k = ψ(C) = | (log2(C) + b) / γ |_odd

where |t|_odd denotes the odd number nearest to t. A mapping exists between the channel dimension C and the convolution kernel size k; the simplest is the linear mapping φ(k) = γ·k − b, but the capacity of a linear mapping to represent neighboring features is limited, so common practice extends it to a power-of-two nonlinear mapping: φ(k) = 2^(γ·k − b). The channel dimension C and the kernel size k are thus related through this formula, and the ECA attention module sets γ and b to the constants 2 and 1, respectively;
meanwhile, to further reduce the parameter count, the ECA module shares the learned weights across all channels, so the final number of parameters is only k.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
Based on the hierarchical structure of the U-shaped network, the method improves the fusion of features at different scales, reduces the loss of image feature information during convolution, and improves the model's reconstruction of complex aerial scenes. The gain is especially pronounced for tasks with larger magnification factors. Model training uses more densely connected residual blocks, so a good aerial image reconstruction effect is obtained even with a network of modest depth, and the complexity of the overall model is reduced. The method also reconstructs well in a variety of environments with complex noise, such as underwater images and rain-and-fog weather images, which demonstrates the model's generality to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic structural diagram of a network model with a U-shaped structure in the embodiment.
FIG. 3 is a diagram illustrating an exemplary sub-pixel convolution method.
FIG. 4 is a block diagram of a dense residual block according to an embodiment.
FIG. 5 is a diagram illustrating a basic residual module structure according to an embodiment.
Fig. 6 is a schematic structural diagram of the lightweight attention module in the embodiment.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides an aerial image super-resolution reconstruction method based on a layered feature fusion network, as shown in fig. 1, comprising the following steps:
s1: acquiring aerial images with high resolution, wherein the aerial images are aerial images in different scenes and are divided into training data and verification data;
s2: preprocessing the aerial image;
s3: building a network model, wherein the network model adopts a layered U-shaped structure, symmetrical feature maps are fused within the U-shaped structure, and the number of input feature-map channels reaches its maximum at the bottom layer of the U-shaped structure; the network model is trained with the training data of step S1 and verified with the verification data of step S1;
s4: and reconstructing the new aerial image by using the trained network model to obtain a high-resolution image.
In step S1, 114 high-resolution, high-quality aerial images are selected from a publicly available aerial photography dataset; 100 of them are training data and 14 are verification data.
The scenes of the aerial images in step S1 include car parks, airports, residential areas, sports grounds, ports, viaducts, farmlands, and highways.
The preprocessing in step S2 includes:
The 100 aerial images used as training data are cut into small pictures with a resolution of 480 × 480, 8234 in total; before input to the network model, these small pictures are further cropped to 96 × 96 and fed into the network model in batches for training.
In this embodiment, 114 selected high-quality, high-resolution aerial images are used; 100 of them are cut into 480 × 480-pixel sub-images, 8234 in total. All images are then bicubic-downsampled with MATLAB to generate the corresponding low-resolution images, which are paired with the high-resolution images to form the paired aerial photography dataset required for network training.
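As a concrete illustration of this preprocessing step, the following is a minimal NumPy sketch of the patch-pair construction. The 480 × 480 tile size and the 4× scale follow the embodiment; the image size used here and the average-pooling standing in for MATLAB's bicubic `imresize` are assumptions for illustration only.

```python
import numpy as np

def crop_tiles(img, tile=480):
    """Split an H x W x C image into non-overlapping tile x tile patches."""
    h, w = img.shape[:2]
    return [img[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def downsample(patch, scale=4):
    """Average-pooling stand-in for the bicubic downsampling done in MATLAB."""
    h, w, ch = patch.shape
    return patch.reshape(h // scale, scale, w // scale, scale, ch).mean(axis=(1, 3))

# Hypothetical 1920 x 2880 aerial image; a real one would be loaded from disk.
hr = np.random.rand(1920, 2880, 3)
tiles = crop_tiles(hr)                         # 4 x 6 = 24 HR patches of 480 x 480
pairs = [(downsample(t), t) for t in tiles]    # (LR, HR) training pairs at 4x scale
```

In the patent's pipeline this tiling runs over all 100 training images to produce the 8234 sub-images; the sketch only shows the mechanics for one image.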
As shown in fig. 2, the U-shaped structure is specifically:
the image is converted by an initial convolution into a coarse feature map F_0 with 64 feature channels. Each downward convolution module doubles the number of feature channels, so the coarse feature map successively yields feature map F_1 (128 channels) and feature map F_2 (256 channels) after the two downward convolution modules; the channel count of F_2 is thus four times that of the coarse feature map F_0. Feature map F_2 is fed into the stacked dense residual blocks for feature extraction, and the output is fused into feature map F'_2 by a 3 × 3 convolution. Feature map F'_2 is symmetrically concatenated with feature map F_1 and fed into the dense residual blocks again for high-frequency feature extraction; the final output is fused by a 3 × 3 convolution, concatenated with the coarse feature map F_0, and fed into the up-sampling reconstruction module to generate the high-resolution image. The up-sampling reconstruction module uses the sub-pixel convolution method, whose structure is shown in FIG. 3: hidden layers extract features from the low-resolution image and finally generate r² × C feature-map channels (for an upscaling factor r). The sub-pixel convolution layer then rearranges all the channels of each feature-map pixel into an r × r block of pixels in the high-resolution image space, thereby assembling the high-resolution image.
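The channel-to-space rearrangement performed by the sub-pixel convolution layer can be sketched independently of any deep-learning framework. The NumPy function below reproduces the rearrangement (equivalent to what PyTorch calls pixel shuffle) for an upscaling factor r; it is a sketch for illustration, not the patent's implementation.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r):
    the r*r channels of each low-resolution pixel become an r x r
    block of pixels in the high-resolution image space."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(4).reshape(4, 1, 1)      # one LR pixel with 4 channels, r = 2
y = pixel_shuffle(x, 2)                # -> [[0, 1], [2, 3]]
```

Note that sub-pixel convolution moves all the upsampling work into channel reordering, so the convolutions before it run entirely at the low resolution.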
The mathematical representation of the network model is as follows:

F_1 = f_CB(F_0)

F_2 = f_CB(F_1)

F'_2 = f_Pixel(f_DB×N(F_2)), N = 1, 2, …, n

F'_1,2 = f_Pixel(f_DB×N(H_concat(F_1, F'_2))), N = 1, 2, …, n

F_HR = f_1×1(H_concat(F_0, F'_1,2)) + F_0

where f_CB denotes the convolution-module operation, f_DB×N denotes N stacked dense-residual-block operations, H_concat denotes the merging (concatenation) of image features, f_1×1 denotes a 1 × 1 convolution dimensionality-reduction operation, F'_1,2 is the fused feature map, and F_HR is the resulting high-resolution image.
The convolution module includes two 3 × 3 convolution kernels and a LeakyReLU activation function. The feature map's channel count is doubled after the first 3 × 3 convolution kernel; the feature map then passes through the LeakyReLU activation function and the second 3 × 3 convolution kernel.
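The channel-doubling behavior of the downward convolution module can be checked with a small NumPy sketch. The naive 3 × 3 "same" convolution, the LeakyReLU slope, and the random weights below are illustrative assumptions; only the structure (conv, LeakyReLU, conv, with the first conv doubling the channels) follows the text.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Slope is an assumption; the patent does not specify it.
    return np.where(x > 0, x, slope * x)

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * w[o])
    return out

def conv_block(x, w1, w2):
    """Downward convolution module: first 3x3 conv doubles the channels,
    then LeakyReLU, then a second 3x3 conv."""
    return conv3x3(leaky_relu(conv3x3(x, w1)), w2)

c = 4
x = np.random.rand(c, 8, 8)
w1 = np.random.rand(2 * c, c, 3, 3) * 0.1       # doubles channels: C -> 2C
w2 = np.random.rand(2 * c, 2 * c, 3, 3) * 0.1
y = conv_block(x, w1, w2)                       # shape (2C, 8, 8)
```

Chaining two such modules takes the 64-channel coarse feature map to 128 and then 256 channels, matching the F_0 → F_1 → F_2 progression described above.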
As shown in fig. 4, the dense residual block includes four basic residual modules; the feature information of the first three basic residual modules is routed directly to the tail of the dense residual block and concatenated with the output of the last residual module, and these features are finally fused by convolution. Compared with a simply stacked residual structure, dense residual blocks improve the utilization of local residual information: feature information is directly connected inside each dense residual block, and residuals propagate through the dense residual blocks almost without loss, so the discriminative high-frequency features that benefit image reconstruction are retained to the greatest extent.
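The dense skip pattern just described, in which the first three basic residual modules feed the tail directly, can be sketched as follows. The identity-plus-tanh stand-in for the basic residual module and the random 1 × 1 fusion weights are illustrative assumptions; the real module contains two 3 × 3 convolutions, a LeakyReLU, and the attention module.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing matmul: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def residual_block(x):
    # Stand-in for the basic residual module: identity plus a small
    # bounded perturbation, purely for shape/structure illustration.
    return x + 0.1 * np.tanh(x)

def dense_residual_block(x, fuse_w):
    f1 = residual_block(x)
    f2 = residual_block(f1)
    f3 = residual_block(f2)
    f4 = residual_block(f3)
    # Dense skips: the first three outputs are routed straight to the tail
    # and concatenated with the last output along the channel axis.
    cat = np.concatenate([f1, f2, f3, f4], axis=0)
    return conv1x1(cat, fuse_w)        # fuse back to the input channel count

c, h, w = 8, 4, 4
x = np.random.rand(c, h, w)
fuse_w = np.random.rand(c, 4 * c) / (4 * c)
out = dense_residual_block(x, fuse_w)  # same (C, H, W) shape as the input
```

The point of the concatenation is that early residual features reach the fusion convolution unchanged, instead of being attenuated by passing through the later blocks.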
As shown in fig. 5, the basic residual module includes two 3 × 3 convolution kernels, a LeakyReLU activation function, and a lightweight attention module. The feature map passes through the first 3 × 3 convolution kernel, then the LeakyReLU activation function, then the second 3 × 3 convolution kernel; finally, the feature information obtained by the lightweight attention module is added to the feature map for output.
As shown in fig. 6, the lightweight attention module uses the ECA attention module, whose parameter matrix is the banded matrix:

W_k = [ w^(1,1) … w^(1,k)      0        …      0     ;
        0       w^(2,2) … w^(2,k+1)     …      0     ;
        …                                      …     ;
        0       …       0       w^(C,C−k+1) … w^(C,C) ]

In the formula, w^(i,j) represents the weight parameter of each channel; the matrix involves k × C parameters in total, where k is the kernel size of a learnable one-dimensional convolution, meaning that only cross-channel attention among k adjacent channels is considered. The value of k is learned adaptively from the channel number C of the input feature map:

k = ψ(C) = | (log2(C) + b) / γ |_odd

where |t|_odd denotes the odd number nearest to t. A mapping exists between the channel dimension C and the convolution kernel size k; the simplest is the linear mapping φ(k) = γ·k − b, but the capacity of a linear mapping to represent neighboring features is limited, so common practice extends it to a power-of-two nonlinear mapping: φ(k) = 2^(γ·k − b). The channel dimension C and the kernel size k are thus related through this formula, and the ECA attention module sets γ and b to the constants 2 and 1, respectively.
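The adaptive kernel-size rule can be written out directly. The snapping of a fractional value to an odd integer below follows the widely used ECA reference implementation, which is an assumption about the exact rounding the patent intends.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """k = |(log2(C) + b) / gamma|_odd: the 1-D convolution kernel size
    grows slowly with the channel dimension C (gamma = 2, b = 1 as in the text)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1   # keep k odd

# e.g. 64 channels -> k = 3, 256 channels -> k = 5
```

Because k grows only logarithmically in C and the weights are shared across channels, the module stays cheap enough to insert into every basic residual module.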
The traditional residual structure treats the residual information of every path equally, yet human vision is more sensitive to information such as image edges and brightness. To reconstruct such information effectively, an attention mechanism is added to each residual block. The attention module must be as lightweight as possible so that it can be inserted into every residual module; moreover, to guarantee the reconstruction effect, it needs a large receptive field. To achieve both goals, the ECA channel attention module is adopted to extract the attention information of the residual structure and is inserted into each basic residual module.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A super-resolution reconstruction method for aerial images based on a layered feature fusion network is characterized by comprising the following steps:
s1: acquiring aerial images with high resolution, wherein the aerial images are aerial images in different scenes and are divided into training data and verification data;
s2: preprocessing the aerial image;
s3: building a network model, wherein the network model adopts a layered U-shaped structure, symmetrical feature maps are fused within the U-shaped structure, and the number of input feature-map channels reaches its maximum at the bottom layer of the U-shaped structure; the network model is trained with the training data of step S1 and verified with the verification data of step S1;
s4: and reconstructing the new aerial image by using the trained network model to obtain a high-resolution image.
2. The super-resolution reconstruction method for aerial images based on the hierarchical feature fusion network as claimed in claim 1, wherein in step S1, 114 high-resolution, high-quality aerial images are selected from a publicly available aerial photography dataset, of which 100 are training data and 14 are verification data.
3. The super-resolution reconstruction method for aerial images based on the hierarchical feature fusion network as claimed in claim 2, wherein the scenes of the aerial images in step S1 include car parks, airports, residential areas, sports grounds, harbors, viaducts, farmlands, and highways.
4. The super-resolution reconstruction method for aerial images based on hierarchical feature fusion network according to claim 3, wherein the preprocessing in step S2 specifically comprises:
The 100 aerial images used as training data are cut into small pictures with a resolution of 480 × 480, 8234 in total; before input to the network model, these small pictures are further cropped to 96 × 96 and fed into the network model in batches for training.
5. The aerial image super-resolution reconstruction method based on the hierarchical feature fusion network according to claim 4, wherein the U-shaped structure specifically comprises:
the image is converted by an initial convolution into a coarse feature map F_0; the coarse feature map passes through two downward convolution modules to obtain feature map F_1 and feature map F_2, the channel count of F_2 being four times that of the coarse feature map F_0; feature map F_2 is fed into the stacked dense residual blocks for feature extraction, and the output is fused into feature map F'_2 by a 3 × 3 convolution; feature map F'_2 is symmetrically concatenated with feature map F_1 and fed into the dense residual blocks again for high-frequency feature extraction; the final output is fused by a 3 × 3 convolution, concatenated with the coarse feature map F_0, and fed into an up-sampling reconstruction module to generate the high-resolution image.
6. The super-resolution reconstruction method for aerial images based on hierarchical feature fusion network according to claim 5, characterized in that the mathematical representation of the network model is as follows:
F_1 = f_CB(F_0)

F_2 = f_CB(F_1)

F'_2 = f_Pixel(f_DB×N(F_2)), N = 1, 2, …, n

F'_1,2 = f_Pixel(f_DB×N(H_concat(F_1, F'_2))), N = 1, 2, …, n

F_HR = f_1×1(H_concat(F_0, F'_1,2)) + F_0

where f_CB denotes the convolution-module operation, f_DB×N denotes N stacked dense-residual-block operations, H_concat denotes the merging (concatenation) of image features, f_1×1 denotes a 1 × 1 convolution dimensionality-reduction operation, F'_1,2 is the fused feature map, and F_HR is the resulting high-resolution image.
7. The super-resolution reconstruction method for aerial images based on the hierarchical feature fusion network of claim 6, wherein the convolution module comprises two 3 × 3 convolution kernels and a LeakyReLU activation function; the feature map's channel count is doubled after the first 3 × 3 convolution kernel, and the feature map then passes through the LeakyReLU activation function and the second 3 × 3 convolution kernel.
8. The super-resolution reconstruction method for aerial images based on the hierarchical feature fusion network of claim 7, wherein the dense residual block comprises four basic residual modules; the feature information of the first three basic residual modules is directly input to the tail of the dense residual block and concatenated with the output of the last residual module, and these features are finally fused by convolution.
9. The aerial image super-resolution reconstruction method based on the hierarchical feature fusion network of claim 8, wherein the basic residual module comprises two 3 × 3 convolution kernels, a LeakyReLU activation function, and a lightweight attention module; the feature map passes through the first 3 × 3 convolution kernel, then the LeakyReLU activation function, then the second 3 × 3 convolution kernel, and finally the feature information obtained by the lightweight attention module is added to the feature map for output.
10. The super-resolution reconstruction method for aerial images based on the hierarchical feature fusion network of claim 9, wherein the lightweight attention module uses the ECA attention module, whose parameter matrix is the banded matrix:

W_k = [ w^(1,1) … w^(1,k)      0        …      0     ;
        0       w^(2,2) … w^(2,k+1)     …      0     ;
        …                                      …     ;
        0       …       0       w^(C,C−k+1) … w^(C,C) ]

In the formula, w^(i,j) represents the weight parameter of each channel; the matrix involves k × C parameters in total, where k is the kernel size of a learnable one-dimensional convolution, meaning that only cross-channel attention among k adjacent channels is considered. The value of k is learned adaptively from the channel number C of the input feature map:

k = ψ(C) = | (log2(C) + b) / γ |_odd

where |t|_odd denotes the odd number nearest to t. A mapping exists between the channel dimension C and the convolution kernel size k; the simplest is the linear mapping φ(k) = γ·k − b, but the capacity of a linear mapping to represent neighboring features is limited, so common practice extends it to a power-of-two nonlinear mapping: φ(k) = 2^(γ·k − b). The channel dimension C and the kernel size k are thus related through this formula, and the ECA attention module sets γ and b to the constants 2 and 1, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111223.5A CN112991167A (en) | 2021-01-27 | 2021-01-27 | Aerial image super-resolution reconstruction method based on layered feature fusion network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112991167A true CN112991167A (en) | 2021-06-18 |
Family
ID=76345738
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110111223.5A Pending CN112991167A (en) | 2021-01-27 | 2021-01-27 | Aerial image super-resolution reconstruction method based on layered feature fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112991167A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292259A (en) * | 2020-01-14 | 2020-06-16 | 西安交通大学 | Deep learning image denoising method integrating multi-scale and attention mechanism |
CN111476719A (en) * | 2020-05-06 | 2020-07-31 | Oppo广东移动通信有限公司 | Image processing method, image processing device, computer equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
JIE LIU ET AL: "Residual Feature Aggregation Network for Image Super-Resolution", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
PAK LUN KEVIN DING ET AL: "Deep residual dense U-Net for resolution enhancement in accelerated MRI acquisition", SPIE Medical Imaging * |
QILONG WANG ET AL: "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113538244A (en) * | 2021-07-23 | 2021-10-22 | 西安电子科技大学 | Lightweight super-resolution reconstruction method based on adaptive weight learning |
CN113538244B (en) * | 2021-07-23 | 2023-09-01 | 西安电子科技大学 | Lightweight super-resolution reconstruction method based on self-adaptive weight learning |
CN114004784A (en) * | 2021-08-27 | 2022-02-01 | 西安市第三医院 | Method for detecting bone condition based on CT image and electronic equipment |
CN114004784B (en) * | 2021-08-27 | 2022-06-03 | 西安市第三医院 | Method for detecting bone condition based on CT image and electronic equipment |
CN117934286A (en) * | 2024-03-21 | 2024-04-26 | 西华大学 | Lightweight image super-resolution method and device and electronic equipment thereof |
CN117934286B (en) * | 2024-03-21 | 2024-06-04 | 西华大学 | Lightweight image super-resolution method and device and electronic equipment thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111598778B (en) | Super-resolution reconstruction method for insulator image | |
CN111709895A (en) | Image blind deblurring method and system based on attention mechanism | |
CN111179167B (en) | Image super-resolution method based on multi-stage attention enhancement network | |
Zhang et al. | One-two-one networks for compression artifacts reduction in remote sensing | |
CN109064396A (en) | A kind of single image super resolution ratio reconstruction method based on depth ingredient learning network | |
CN111145097B (en) | Image processing method, device and system | |
EP4198875A1 (en) | Image fusion method, and training method and apparatus for image fusion model | |
Rivadeneira et al. | Thermal Image Super-resolution: A Novel Architecture and Dataset. | |
CN113592736B (en) | Semi-supervised image deblurring method based on fused attention mechanism | |
CN112991167A (en) | Aerial image super-resolution reconstruction method based on layered feature fusion network | |
CN113837938A (en) | Super-resolution method for reconstructing potential image based on dynamic vision sensor | |
CN113096239B (en) | Three-dimensional point cloud reconstruction method based on deep learning | |
CN113077505A (en) | Optimization method of monocular depth estimation network based on contrast learning | |
CN112001843A (en) | Infrared image super-resolution reconstruction method based on deep learning | |
CN112949636A (en) | License plate super-resolution identification method and system and computer readable medium | |
CN113538243A (en) | Super-resolution image reconstruction method based on multi-parallax attention module combination | |
CN116957931A (en) | Method for improving image quality of camera image based on nerve radiation field | |
CN116486074A (en) | Medical image segmentation method based on local and global context information coding | |
CN114140357B (en) | Multi-temporal remote sensing image cloud zone reconstruction method based on cooperative attention mechanism | |
CN115272438A (en) | High-precision monocular depth estimation system and method for three-dimensional scene reconstruction | |
CN113610707B (en) | Video super-resolution method based on time attention and cyclic feedback network | |
CN111353982B (en) | Depth camera image sequence screening method and device | |
CN117237207A (en) | Ghost-free high dynamic range light field imaging method for dynamic scene | |
CN116385265B (en) | Training method and device for image super-resolution network | |
CN107392986A (en) | A kind of image depth rendering intent based on gaussian pyramid and anisotropic filtering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||