CN111340696B - Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism - Google Patents

Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism

Info

Publication number
CN111340696B
CN111340696B
Authority
CN
China
Prior art keywords
image
super
resolution reconstruction
resolution
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010084579.XA
Other languages
Chinese (zh)
Other versions
CN111340696A (en)
Inventor
王琼
王鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202010084579.XA priority Critical patent/CN111340696B/en
Publication of CN111340696A publication Critical patent/CN111340696A/en
Application granted granted Critical
Publication of CN111340696B publication Critical patent/CN111340696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4053Super resolution, i.e. output image resolution higher than sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation

Abstract

The invention discloses a convolutional neural network image super-resolution reconstruction method fused with a bionic visual mechanism. First, salient regions of a remote sensing image are detected with a saliency detection method that simulates the human visual attention mechanism; second, super-resolution reconstruction of the salient regions is performed with an image super-resolution reconstruction method based on a convolutional neural network; finally, super-resolution reconstruction of the non-salient regions is performed by bicubic interpolation. Compared with existing convolutional-neural-network-based image super-resolution reconstruction methods, the proposed method reconstructs images quickly and is suitable for applications with strict real-time requirements.

Description

Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism
Technical Field
The invention belongs to the field of image processing, and particularly relates to a convolutional neural network image super-resolution reconstruction method fused with a bionic vision mechanism.
Background
In real life, the hardware of imaging instruments is severely constrained, and the resolution of the acquired images often fails to meet actual requirements. Achieving high resolution purely by improving the hardware is expensive, so it is necessary to develop software methods for improving resolution. Image super-resolution reconstruction means reconstructing, in software, an image whose resolution is markedly higher than that of the original from one or more lower-resolution images.
Remote sensing images cover a wide area and contain an extremely large amount of information, so directly processing an entire remote sensing image greatly reduces the processing speed.
Disclosure of Invention
The invention aims to provide a convolutional neural network image super-resolution reconstruction method fused with a bionic vision mechanism, which removes image content that the human eye does not attend to, effectively reduces the image area that needs to be reconstructed, and greatly improves algorithm efficiency.
The technical solution for realizing the purpose of the invention is as follows: a super-resolution reconstruction method of a convolutional neural network image fused with a bionic visual mechanism comprises the following steps:
carrying out significance region detection on the remote sensing image by adopting a significance detection method for simulating a human visual attention mechanism;
aiming at the salient region, performing super-resolution reconstruction by adopting an image super-resolution reconstruction method based on a convolutional neural network;
and performing super-resolution reconstruction on the non-significant region by adopting a bicubic interpolation method.
Compared with the prior art, the invention has the remarkable advantage that the reconstruction algorithm balances the reconstruction quality and the reconstruction time of remote sensing images: only the salient region is reconstructed, which speeds up reconstruction and improves efficiency, while the reconstruction quality of the salient region is preserved so that the acquisition of important information and details is not affected.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention.
Figs. 2(a)-2(e) are reconstruction results of the algorithm of the invention on visually salient regions, showing a parking lot, an airport apron, a crossroad, a large port and a road roundabout respectively, where the 1st column is the low-resolution image, the 2nd column the saliency map, the 3rd column the salient region, and the 4th column the reconstructed image.
FIG. 3 is a comparison diagram of the detail of the image before and after reconstruction.
Detailed Description
As shown in fig. 1, a convolutional neural network image super-resolution reconstruction method fused with a bionic visual mechanism includes the following steps:
(1) Carrying out significance region detection on the remote sensing image by adopting a significance detection method for simulating a human visual attention mechanism;
(1.1) color and brightness feature extraction
Assuming that r, g, b are the red, green, blue channels of the input color image, respectively, the luminance channel I is:
I=(r+g+b)/3
Because the human eye is less sensitive to color at low luminance, the three color channels r, g, b are normalized by the luminance channel I. Only pixels whose luminance I exceeds Maximum/10 are normalized; pixels at or below this value are set to zero, where Maximum is the largest of all luminance values in the image. From the normalized r, g and b values, four broadly tuned color channels, red, green, blue and yellow (R, G, B and Y), are computed; the values are constrained to be non-negative, with values below 0 replaced by 0. The calculation formulas are:
R=r-(g+b)/2
G=g-(r+b)/2
B=b-(r+g)/2
Y=(r+g)/2-|r-g|/2-b
The four color channels R, G, B, Y obtained above and the luminance I are Gaussian down-sampled to obtain the respective Gaussian pyramids I_σ, R_σ, G_σ, B_σ and Y_σ; each pyramid has 9 levels.
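As a concrete illustration of step (1.1), the sketch below computes I and the four broadly tuned color channels and builds a 9-level Gaussian pyramid for each; the smoothing parameter and the helper names (color_luminance_channels, gaussian_pyramid) are illustrative assumptions, not values specified in the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def color_luminance_channels(img):
    """Luminance I and broadly tuned color channels R, G, B, Y of step (1.1).
    img: H x W x 3 float array with r, g, b in its last axis."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    I = (r + g + b) / 3.0
    # Normalize r, g, b by I only where I > max(I)/10; set the rest to zero.
    mask = I > I.max() / 10.0
    with np.errstate(divide="ignore", invalid="ignore"):
        rn = np.where(mask, r / I, 0.0)
        gn = np.where(mask, g / I, 0.0)
        bn = np.where(mask, b / I, 0.0)
    # Opponent-style channels, clipped at zero (non-negativity constraint).
    R = np.clip(rn - (gn + bn) / 2.0, 0, None)
    G = np.clip(gn - (rn + bn) / 2.0, 0, None)
    B = np.clip(bn - (rn + gn) / 2.0, 0, None)
    Y = np.clip((rn + gn) / 2.0 - np.abs(rn - gn) / 2.0 - bn, 0, None)
    return I, R, G, B, Y

def gaussian_pyramid(channel, levels=9):
    """9-level Gaussian pyramid: smooth, then subsample by 2 at each level."""
    pyr = [channel]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

img = np.random.rand(512, 512, 3) * 255.0   # stand-in for a remote sensing image
I, R, G, B, Y = color_luminance_channels(img)
pyramids = {name: gaussian_pyramid(ch)
            for name, ch in zip("IRGBY", (I, R, G, B, Y))}
```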
(1.2) Direction feature extraction
The Gabor filter has good directional selectivity, so that the two-dimensional Gabor filter is used for extracting directional characteristics, and the two-dimensional Gabor formula is as follows:
G(x, y) = exp(−(x_θk² / (2α²) + y_θk² / (2β²))) · exp(i · 2π · x_θk / λ)
In practice the Gabor filter is a Gaussian function modulated by a complex sinusoid, where α and β are the standard deviations of the Gaussian envelope in the x and y directions, and λ and θ_k are the wavelength and orientation of the sinusoid, respectively, with:
θ_k = kπ/4, k = 0, 1, 2, 3
x_θk = x·cos(θ_k) + y·sin(θ_k)
y_θk = −x·sin(θ_k) + y·cos(θ_k)
The outputs of the Gabor filters at the four orientations θ_k ∈ {0°, 45°, 90°, 135°} are taken as the direction features, and direction feature maps are computed for the four orientations.
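A minimal sketch of step (1.2), using a hand-rolled real-valued Gabor kernel applied at the four orientations; the kernel size and the envelope and wavelength parameters (alpha, beta, lam) are placeholder choices, not values taken from the patent.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, alpha=2.0, beta=2.0, lam=4.0, size=9):
    """Real part of a 2-D Gabor filter: a Gaussian envelope modulated by a
    sinusoid.  theta: orientation (radians); alpha, beta: envelope std devs;
    lam: wavelength of the sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (x_t ** 2 / alpha ** 2 + y_t ** 2 / beta ** 2))
    return envelope * np.cos(2.0 * np.pi * x_t / lam)

def orientation_features(I, thetas_deg=(0, 45, 90, 135)):
    """Direction feature maps: filter the luminance image at four orientations."""
    return [convolve2d(I, gabor_kernel(np.deg2rad(t)), mode="same", boundary="symm")
            for t in thetas_deg]

I = np.random.rand(256, 256)        # stand-in luminance image
O_maps = orientation_features(I)    # four direction feature maps
```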
(1.3) feature map construction
When the feature mapping is obtained, a Center-Surround method is adopted, wherein the parameter c refers to a fine scale, and the parameter s refers to a coarse scale. The calculation method is as follows:
I_σ(c,s) = |I_σ(c) Θ I_σ(s)|
R_σG_σ(c,s) = |(R_σ(c) − G_σ(c)) Θ (R_σ(s) − G_σ(s))|
B_σY_σ(c,s) = |(B_σ(c) − Y_σ(c)) Θ (B_σ(s) − Y_σ(s))|
O_σ(c,s,θ) = |O_σ(c,θ) Θ O_σ(s,θ)|
where c ∈ {2,3,4}, s = c + δ, and δ ∈ {3,4}. Θ denotes center-surround subtraction between image matrices, which requires the two images to be brought to the same size. I_σ is the luminance feature map; R_σG_σ and B_σY_σ are the color feature maps, written in this opponent form because of a mechanism in the cerebral cortex known as color opponency; O_σ is the direction feature map. In total, 6 luminance feature maps, 12 color feature maps and 24 direction feature maps are obtained, i.e. 42 feature maps.
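Step (1.3) can be sketched as follows, assuming each channel (including the four Gabor orientation responses) already has its own 9-level Gaussian pyramid as computed above; the use of bilinear interpolation to match scales and the dictionary layout are implementation assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def center_surround(pyr, c, s):
    """|pyr[c] (-) pyr[s]|: the coarse level s is interpolated up to the size of
    the fine level c before the point-wise difference is taken."""
    fine, coarse = pyr[c], pyr[s]
    factors = (fine.shape[0] / coarse.shape[0], fine.shape[1] / coarse.shape[1])
    return np.abs(fine - zoom(coarse, factors, order=1))

def feature_maps(I_pyr, R_pyr, G_pyr, B_pyr, Y_pyr, O_pyrs):
    """6 luminance + 12 color + 24 direction maps = 42 feature maps in total."""
    rg = [r - g for r, g in zip(R_pyr, G_pyr)]   # red-green opponent pyramid
    by = [b - y for b, y in zip(B_pyr, Y_pyr)]   # blue-yellow opponent pyramid
    maps = {"luminance": [], "color": [], "direction": []}
    for c in (2, 3, 4):
        for delta in (3, 4):
            s = c + delta
            maps["luminance"].append(center_surround(I_pyr, c, s))
            maps["color"].append(center_surround(rg, c, s))
            maps["color"].append(center_surround(by, c, s))
            for O_pyr in O_pyrs:                 # one pyramid per Gabor orientation
                maps["direction"].append(center_surround(O_pyr, c, s))
    return maps

rand_pyr = lambda: [np.random.rand(512 // 2 ** k, 512 // 2 ** k) for k in range(9)]
maps = feature_maps(rand_pyr(), rand_pyr(), rand_pyr(), rand_pyr(), rand_pyr(),
                    [rand_pyr() for _ in range(4)])
```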
(1.4) saliency map generation
Markov chains are used to combine the generated feature maps M: [n]² → R in a standard way and construct the corresponding saliency maps A: [n]² → R, as follows:
M(i, j) and M(m, n) are the feature values at nodes (i, j) and (m, n), respectively. The dissimilarity distance between M(i, j) and M(m, n) is defined as
d((i,j), (m,n)) = | log( M(i,j) / M(m,n) ) |
Each pixel of the feature map is connected to every other pixel to form a fully connected graph G_A: each vertex of M is connected to the other n−1 vertices, and the directed edge from node (i, j) to node (m, n) is assigned the weight:
ω_1((i,j), (m,n)) = d((i,j), (m,n)) · F(i−m, j−n)
where
F(a, b) = exp(−(a² + b²) / (2σ²))
σ is a free parameter, taken as 1/15 to 1/10 of the image width. The weight ω_1((i,j), (m,n)) is proportional to both the dissimilarity and the proximity of the two nodes.
A Markov chain is constructed on the directed graph G_A, with the outgoing edge weights of each node normalized to sum to 1. The equilibrium state of this Markov chain, which reflects the time a randomly walking particle dwells at each node, is used as the measure of saliency. This yields the saliency map A: [n]² → R, in which the salient pixel values are concentrated in the key regions. A fully connected graph G_C is then defined from A, and the equilibrium state of the Markov chain on this graph is solved to obtain the final saliency map.
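The Markov-chain step (1.4) can be sketched for a single, down-sampled feature map as follows; the power-iteration count, the default σ and the small ε safeguards are implementation assumptions, and the second pass over G_C follows the same pattern.

```python
import numpy as np

def gbvs_saliency(M, sigma=None, n_iter=200):
    """Saliency of one feature map M via the equilibrium state of a Markov chain
    on a fully connected graph (the G_A pass of step 1.4)."""
    h, w = M.shape
    if sigma is None:
        sigma = w / 12.0                          # between width/15 and width/10
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    eps = 1e-12
    logf = np.log(np.abs(M.ravel()) + eps)

    # d((i,j),(m,n)) = |log(M(i,j)/M(m,n))| and F(i-m, j-n) = exp(-dist^2 / (2 sigma^2)).
    d = np.abs(logf[:, None] - logf[None, :])
    dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    F = np.exp(-dist2 / (2.0 * sigma ** 2))

    # Edge weights, with the outgoing weights of every node normalized to sum to 1.
    W = d * F
    P = W / (W.sum(axis=1, keepdims=True) + eps)

    # Equilibrium distribution by power iteration: pi = pi P.
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(n_iter):
        pi = pi @ P
    return pi.reshape(h, w)

sal = gbvs_saliency(np.random.rand(24, 24))       # use a down-sampled map: the graph is dense
```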
(2) Aiming at the salient region, performing super-resolution reconstruction by adopting an image super-resolution reconstruction method based on a convolutional neural network;
(2.1) extraction and representation of image blocks
Image blocks are first densely extracted and then represented using a set of pre-trained bases; the pre-training method may be PCA (principal component analysis, which reduces the dimensionality of the data for compression while retaining the important features), DCT (discrete cosine transform), Haar, and so on. Each basis is equivalent to a convolution kernel, so this operation amounts to convolving the image, and the optimization of the convolution kernels is subsumed in the optimization of the convolutional neural network. The first layer can be written as the operation F_1:
F_1(Y) = max(0, W_1 * Y + B_1)
where W_1 denotes the convolution kernels, B_1 denotes the biases, and '*' denotes the convolution operation. W_1 comprises n_1 convolution kernels of size c × f_1 × f_1, where c is the number of image channels and f_1 is the spatial length and width of a kernel. That is, n_1 convolutions are applied to the input image to extract features; each kernel has size c × f_1 × f_1, and the output is n_1 feature maps. B_1 is an n_1-dimensional vector, each element of which corresponds to one convolution kernel. ReLU is applied to the convolution result as the activation function.
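A naive NumPy sketch of the first-layer operation F_1(Y) = max(0, W_1*Y + B_1) follows; the sizes n_1 = 64 and f_1 = 9 are illustrative placeholders, as the patent does not fix them.

```python
import numpy as np

def conv_layer(X, W, B):
    """Valid convolution of an H x W x c input with n kernels of shape c x f x f
    (cross-correlation, as is conventional in CNNs), plus bias and ReLU."""
    n, c, f, _ = W.shape
    H, Wd, _ = X.shape
    out = np.zeros((H - f + 1, Wd - f + 1, n))
    for k in range(n):                            # one feature map per kernel
        kern = W[k].transpose(1, 2, 0)            # f x f x c
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(X[i:i + f, j:j + f, :] * kern) + B[k]
    return np.maximum(out, 0.0)                   # ReLU activation

# Illustrative sizes only: n1 = 64 kernels of spatial size f1 = 9 on a 1-channel image.
Y = np.random.rand(33, 33, 1)
W1 = np.random.randn(64, 1, 9, 9) * 0.01
B1 = np.zeros(64)
F1 = conv_layer(Y, W1, B1)                        # shape (25, 25, 64)
```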
(2.2) non-Linear mapping
Each image block has the n_1-dimensional feature extracted by the first layer. In the second-layer operation, each vector obtained from the first layer is remapped to another vector whose dimension changes from n_1 to n_2. This is equivalent to applying n_2 convolution kernels with spatial size 1 × 1. The kernel size is fixed to 1 × 1 here, but larger kernels, e.g. 3 × 3 or 5 × 5, are sometimes used, and the generalization is straightforward. Note that if the kernel size becomes 3 × 3 or 5 × 5, the non-linear mapping acts not on a single block of the input image but on a 3 × 3 or 5 × 5 block of the feature maps. The operation of the second layer is:
F_2(Y) = max(0, W_2 * F_1(Y) + B_2)
where W_2 corresponds to n_2 convolution kernels of size n_1 × f_2 × f_2, and B_2 is an n_2-dimensional vector. The high-resolution image blocks later used to reconstruct the image are all represented by the n_2-dimensional vectors output by this layer.
The number of convolutional layers is often increased to enhance the non-linearity of the network, but this increases the network complexity (each additional layer adds n_2 × f_2 × f_2 × n_2 parameters), and these parameters naturally require more training time.
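For the 1 × 1 case, the non-linear mapping layer reduces to a per-pixel matrix multiplication followed by ReLU, as in the sketch below; n_2 = 32 is an illustrative placeholder.

```python
import numpy as np

def mapping_layer(F1, W2, B2):
    """1 x 1 convolution: every n1-dimensional pixel vector of F1 is remapped to
    an n2-dimensional vector, followed by ReLU (a per-pixel matrix multiply)."""
    # F1: (H, W, n1);  W2: (n2, n1);  B2: (n2,)
    return np.maximum(np.tensordot(F1, W2.T, axes=1) + B2, 0.0)

F1 = np.random.rand(25, 25, 64)
W2 = np.random.randn(32, 64) * 0.01   # n2 = 32 is an illustrative choice
B2 = np.zeros(32)
F2 = mapping_layer(F1, W2, B2)        # shape (25, 25, 32)
```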
(2.3) reconstruction
In traditional methods, the predicted overlapping high-resolution image blocks are usually averaged to generate the final high-resolution image. This averaging can be regarded as applying a pre-defined convolution kernel to a set of feature maps, in which each position is the vector representation of a high-resolution image block. Accordingly, the last layer also uses a convolution operation to generate the reconstructed high-resolution image of the invention, defined as follows:
F(Y) = W_3 * F_2(Y) + B_3
where W_3 corresponds to c convolution kernels of size n_2 × f_3 × f_3, and B_3 is a c-dimensional vector.
If the representations of the high-resolution image blocks lie in the image domain (i.e., each representation can simply be reshaped to form its block), W_3 acts as an averaging filter; if the representations lie in some other domain (e.g., coefficients over some bases), W_3 first projects these coefficients back into the image domain and then operates as in the image-domain case. In either case, W_3 is a linear convolution kernel.
All three operations above are convolutions and are therefore implemented with convolutional layers. Connecting the three steps yields the super-resolution reconstruction network, which is naturally a convolutional neural network.
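Putting the three layers together, a forward pass of the network might look as follows; this reuses the conv_layer and mapping_layer sketches above, and the last layer is implemented inline as a linear convolution without ReLU. It is a schematic rendering of the structure, not the trained network.

```python
import numpy as np

def srcnn_forward(Y, params):
    """Three-layer forward pass: F(Y) = W3 * max(0, W2 * max(0, W1*Y + B1) + B2) + B3.
    conv_layer and mapping_layer are the sketches given in steps (2.1) and (2.2)."""
    W1, B1, W2, B2, W3, B3 = params
    F1 = conv_layer(Y, W1, B1)        # patch extraction and representation
    F2 = mapping_layer(F1, W2, B2)    # non-linear mapping with 1x1 kernels
    # Reconstruction layer: a purely linear convolution, no ReLU.
    n3, n2, f3, _ = W3.shape
    H, Wd, _ = F2.shape
    out = np.zeros((H - f3 + 1, Wd - f3 + 1, n3))
    for k in range(n3):
        kern = W3[k].transpose(1, 2, 0)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(F2[i:i + f3, j:j + f3, :] * kern) + B3[k]
    return out

# Example call, reusing Y, W1, B1, W2, B2 from the sketches above.
params = (W1, B1, W2, B2, np.random.randn(1, 32, 5, 5) * 0.01, np.zeros(1))
X_hat = srcnn_forward(Y, params)      # reconstructed single-channel image
```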
(2.4) network parameter training
Learning the end-to-end mapping function F requires estimating the network parameters θ = {W_1, W_2, W_3, B_1, B_2, B_3}. This is achieved by minimizing the error between the reconstructed high-resolution image F(Y; θ) and the corresponding original high-resolution image X. Let {X_i} denote a set of high-resolution images and {Y_i} the corresponding low-resolution images; the loss function adopted by the invention is the mean square error (MSE):
L(θ) = (1/n) · Σ_{i=1..n} || F(Y_i; θ) − X_i ||²
where n is the number of training samples. Only MSE is used as the loss function in order to obtain a high peak signal-to-noise ratio (PSNR). PSNR is the most widely used evaluation criterion and quantifies reconstruction quality well; although the MSE loss favours the PSNR value, good results are also obtained under other criteria such as SSIM and FSIM.
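Since the loss is the MSE and PSNR is the target criterion, the two are directly related; a small helper, assuming 8-bit images with peak value 255, might read:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio between a reconstruction x and a reference y."""
    mse = np.mean((x.astype(float) - y.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```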
The loss is minimized by stochastic gradient descent with standard backpropagation. The weight matrices are updated by:
Δ_{i+1} = 0.9 · Δ_i − η · ∂L/∂W_i^l ,   W_{i+1}^l = W_i^l + Δ_{i+1}

where l ∈ {1, 2, 3} indexes the layer, i indexes the iteration, η is the learning rate, and ∂L/∂W_i^l is the partial derivative of the loss with respect to the weights.
The parameters of the convolution kernels in all layers are initialized randomly, drawn from a normal distribution with mean 0 and standard deviation 0.01. The learning rate η is 10^-4 in the first and second layers and 10^-5 in the last layer; a smaller learning rate in the last layer helps the network converge.
In the training phase, the original images {X_i} are sub-images of f_sub × f_sub × c pixels randomly cropped from the training images. They are called sub-images rather than image blocks because image blocks overlap and undergo subsequent processing such as averaging, whereas these samples do not. To obtain the low-resolution set {Y_i}, each sub-image is blurred with a Gaussian kernel, downsampled by a given factor, and then enlarged by the same factor using bicubic interpolation.
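The data preparation and the weight update of step (2.4) can be sketched as follows for grayscale sub-images; the blur σ, the sub-image size f_sub = 33 and the scale factor 3 are illustrative placeholders, and the gradient computation itself (standard backpropagation) is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def make_training_pair(X, f_sub=33, scale=3, rng=np.random):
    """Randomly crop an f_sub x f_sub sub-image X_i, then blur, downsample and
    bicubically re-enlarge it to produce the low-resolution input Y_i."""
    h, w = X.shape[:2]
    i = rng.randint(0, h - f_sub + 1)
    j = rng.randint(0, w - f_sub + 1)
    x_sub = X[i:i + f_sub, j:j + f_sub]
    blurred = gaussian_filter(x_sub, sigma=1.0)       # Gaussian-kernel blur
    low = blurred[::scale, ::scale]                   # downsample by the scale factor
    y_sub = zoom(low, f_sub / low.shape[0], order=3)  # cubic interpolation back up
    return x_sub, y_sub

def momentum_update(W, delta, grad, lr, momentum=0.9):
    """Delta <- 0.9*Delta - lr*grad;  W <- W + Delta (one weight tensor, one step)."""
    delta = momentum * delta - lr * grad
    return W + delta, delta

# Per-layer learning rates as in step (2.4): 1e-4 for the first two layers, 1e-5 for the last.
learning_rates = {"W1": 1e-4, "W2": 1e-4, "W3": 1e-5}
```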
(3) Super-resolution reconstruction of the non-salient region is performed by bicubic interpolation, and the result is finally enlarged to the same size as the salient region.
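The final composition of the two regions is not spelled out in code by the patent; one plausible sketch, assuming the saliency map is thresholded into a binary mask and the CNN output is already at the upscaled size, is:

```python
import numpy as np
from scipy.ndimage import zoom

def fuse_regions(lr_image, saliency, cnn_sr, scale=3, thresh=0.5):
    """Bicubically upscale the whole low-resolution image, then overwrite the
    salient pixels with the CNN-reconstructed values.  cnn_sr is assumed to
    have the same size as the upscaled image."""
    bicubic = zoom(lr_image, scale, order=3)                 # non-salient region
    mask = zoom((saliency > thresh).astype(float), scale, order=0) > 0.5
    out = bicubic.copy()
    out[mask] = cnn_sr[mask]                                 # salient region
    return out
```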
The present invention will be described in detail with reference to the following examples and drawings.
Examples
This embodiment runs on the following configuration: a personal computer with an Intel Core i7 1050 CPU and 16 GB of memory, running Windows 10 (x64), with MATLAB R2016b as the simulation software. Figs. 2(a)-2(e) show some experimental results for a parking lot, an airport apron, a crossroad, a large port and a road roundabout respectively, where the 1st column is the low-resolution image, the 2nd column the saliency map, the 3rd column the salient region, and the 4th column the reconstructed image. For a clearer view, details of the image in Fig. 2(b) before and after reconstruction are enlarged in Fig. 3, with the left image taken before reconstruction and the right image after. The results show that the algorithm of the invention subjectively achieves a good visual effect.
The algorithm of the invention is compared with an existing convolutional-neural-network-based image super-resolution reconstruction method (denoted CNNSR). Table 1 gives the comparison on four common full-reference image quality evaluation indexes, and Table 2 compares the reconstruction times of the two algorithms.
TABLE 1 remote sensing image reconstruction quality evaluation
TABLE 2 reconstruction time comparison
In objective reconstruction quality, the four evaluation indexes of the proposed algorithm are slightly lower than those of the traditional CNNSR algorithm, but the gap is small, because the method integrates a bionic vision mechanism, ignores the non-salient region and concentrates on reconstructing the salient region. In terms of reconstruction time, on the other hand, the proposed algorithm is significantly faster than the traditional CNNSR algorithm.
In conclusion, the algorithm of the invention balances the reconstruction quality and the reconstruction time of remote sensing images: it reconstructs only the salient region, which speeds up reconstruction and improves efficiency, while the reconstruction quality of the salient region is maintained so that the acquisition of important information and details is not affected.

Claims (4)

1. A super-resolution reconstruction method of a convolutional neural network image fused with a bionic visual mechanism is characterized by comprising the following steps:
carrying out significance region detection on the remote sensing image by adopting a significance detection method for simulating a human visual attention mechanism; the specific method comprises the following steps:
step 1.1, extracting color and brightness characteristics
Assuming that r, g, b are the red, green, blue channels of the input color image, respectively, the luminance channel I is:
I=(r+g+b)/3
normalizing the r, g and b color channels according to the luminance channel I; normalizing only the pixels whose luminance I is greater than Maximum/10 and setting the pixels whose luminance I is less than or equal to Maximum/10 to zero, wherein Maximum is the largest of all luminance values in the image; calculating four color channels of red, green, blue and yellow from the normalized r, g and b values, constraining the values to be non-negative and replacing values smaller than 0 with 0; the calculation formulas are:
R=r-(g+b)/2
G=g-(r+b)/2
B=b-(r+g)/2
Y=(r+g)/2-|r-g|/2-b
performing Gaussian down-sampling on the four calculated color channels R, G, B, Y and the luminance I to obtain the respective Gaussian pyramids I_σ, R_σ, G_σ, B_σ and Y_σ, each pyramid having 9 levels;
step 1.2, directional feature extraction
And extracting the direction characteristics by adopting a Gabor filter, wherein a two-dimensional Gabor formula is as follows:
G(x, y) = exp(−(x_θk² / (2α²) + y_θk² / (2β²))) · exp(i · 2π · x_θk / λ)
the Gabor filter is a Gaussian function modulated by a complex sinusoid, where α and β are the standard deviations of the Gaussian envelope in the x and y directions, and λ and θ_k are the wavelength and orientation of the sinusoid, respectively, wherein:
θ_k = kπ/4, k = 0, 1, 2, 3
x_θk = x·cos(θ_k) + y·sin(θ_k)
y_θk = −x·sin(θ_k) + y·cos(θ_k)
taking the outputs of the Gabor filters at the four orientations θ_k ∈ {0°, 45°, 90°, 135°} as the direction features, and calculating the direction feature maps for the four orientations;
step 1.3, feature map construction
When obtaining the feature map, a Center-Surround method is adopted, and the calculation method is as follows:
I_σ(c,s) = |I_σ(c) Θ I_σ(s)|
R_σG_σ(c,s) = |(R_σ(c) − G_σ(c)) Θ (R_σ(s) − G_σ(s))|
B_σY_σ(c,s) = |(B_σ(c) − Y_σ(c)) Θ (B_σ(s) − Y_σ(s))|
O_σ(c,s,θ) = |O_σ(c,θ) Θ O_σ(s,θ)|
wherein c ∈ {2,3,4}, s = c + δ, δ ∈ {3,4}; the parameter c refers to the fine scale and the parameter s to the coarse scale; Θ denotes center-surround subtraction between image matrices; I_σ is the luminance feature map, R_σG_σ and B_σY_σ are the color feature maps, and O_σ is the direction feature map; 6 luminance feature maps, 12 color feature maps and 24 direction feature maps are obtained, 42 feature maps in total;
step 1.4, saliency map generation
using Markov chains to combine the generated feature maps M: [n]² → R in a standard way and construct the corresponding saliency maps A: [n]² → R, as follows:
M(i, j) and M(m, n) are the feature values at nodes (i, j) and (m, n), respectively; the dissimilarity distance between M(i, j) and M(m, n) is defined as
d((i,j), (m,n)) = | log( M(i,j) / M(m,n) ) |
connecting each pixel of the feature map to every other pixel to form a fully connected graph G_A: each vertex of M is connected to the other n−1 vertices, and the directed edge from node (i, j) to node (m, n) is assigned the weight:
ω_1((i,j), (m,n)) = d((i,j), (m,n)) · F(i−m, j−n)
wherein
F(a, b) = exp(−(a² + b²) / (2σ²))
σ is a free parameter; the weight ω_1((i,j), (m,n)) is proportional to both the dissimilarity and the proximity of the two nodes;
in directed graph G A Constructing a Markov chain, and setting the sum of the outgoing edge weights of each point to be 1; the salient diagram A is shown in the specification n] 2 → R, so that salient pixel values are concentrated in critical areas; then according to A defining full connected graph G C Solving the equilibrium state of the Markov chain on the graph to obtain a final saliency graph;
aiming at the salient region, performing super-resolution reconstruction by adopting an image super-resolution reconstruction method based on a convolutional neural network;
and performing super-resolution reconstruction on the non-significant region by adopting a bicubic interpolation method.
2. The convolutional neural network image super-resolution reconstruction method fused with a bionic vision mechanism according to claim 1, characterized in that the super-resolution reconstruction of the salient region with the image super-resolution reconstruction method based on the convolutional neural network comprises:
step 2.1, extraction and representation of image blocks
densely extracting image blocks and then representing the extracted image blocks using pre-trained bases, each basis being equivalent to a convolution kernel, the first layer being expressed as the operation F_1:
F_1(Y) = max(0, W_1 * Y + B_1)
wherein W_1 denotes the convolution kernels, B_1 denotes the biases, and '*' denotes the convolution operation; W_1 comprises n_1 convolution kernels of size c' × f_1 × f_1, c' being the number of image channels and f_1 the spatial length and width of a kernel; n_1 convolutions are applied to the input image to extract features, each kernel having size c' × f_1 × f_1, and the output is n_1 feature maps; B_1 is an n_1-dimensional vector; ReLU is used as the activation function on the convolution result;
step 2.2, nonlinear mapping
each image block has the n_1-dimensional feature extracted by the first layer; in the second-layer operation, each vector obtained in the first layer is remapped to another vector whose dimension changes from n_1 to n_2; the operation of the second layer is:
F_2(Y) = max(0, W_2 * F_1(Y) + B_2)
wherein W_2 corresponds to n_2 convolution kernels of size n_1 × f_2 × f_2, and B_2 is an n_2-dimensional vector; the high-resolution image blocks later used to reconstruct the image are all represented by the n_2-dimensional vectors output by this layer;
step 2.3, reconstruction
The last layer still uses convolution operations to generate a reconstructed high resolution image, defined as follows:
F(Y) = W_3 * F_2(Y) + B_3
wherein W_3 corresponds to c' convolution kernels of size n_2 × f_3 × f_3, and B_3 is a c'-dimensional vector;
if the representations of the high-resolution image blocks lie in the image domain, W_3 acts as an averaging filter; if the representations lie in other domains, W_3 first converts those domains into the image domain by means of the projection coefficients;
connecting the three steps to obtain a super-resolution reconstruction network, namely a convolutional neural network;
step 2.4, network parameter training
Learning the end-to-end mapping function F requires estimating the network parameters θ = {W_1, W_2, W_3, B_1, B_2, B_3}, which is done by minimizing the error between the reconstructed high-resolution image F(Y; θ) and the corresponding original high-resolution image X; {X_i} denotes a set of high-resolution images, and the corresponding low-resolution images are denoted by {Y_i}; the loss function used is the mean square error:
L(θ) = (1/n) · Σ_{i=1..n} || F(Y_i; θ) − X_i ||²
wherein n represents the number of training samples;
random gradient descent using standard back propagation minimizes losses; the weight matrix is updated by:
Δ_{i+1} = 0.9 · Δ_i − η · ∂L/∂W_i^l ,   W_{i+1}^l = W_i^l + Δ_{i+1}

wherein l ∈ {1, 2, 3} indexes the layer, i indexes the iteration, η represents the learning rate, and ∂L/∂W_i^l represents the partial derivative of the loss with respect to the weights;
for convolution kernels in all layers, parameters of the convolution kernels need to be initialized, the initialized parameters are random, but the sources are normal distribution with the mean value of 0 and the standard deviation of 0.01; the learning rate η is 10 in both the first layer and the second layer -4 In the last layer is 10 -5
In the training phase, the original images {X_i} are sub-images of f_sub × f_sub × c' pixels randomly cropped from the training images; each sub-image is blurred with a Gaussian kernel, downsampled by a given factor, and then magnified by the same factor using bicubic interpolation.
3. The super-resolution reconstruction method for the convolutional neural network image fused with the bionic vision mechanism as claimed in claim 2, wherein in step 2.1, the pre-training method is PCA, DCT or Haar.
4. The super-resolution reconstruction method for convolutional neural network images fused with a bionic vision mechanism as claimed in claim 1, wherein a bicubic interpolation method is used for super-resolution reconstruction of the non-salient region, and finally the super-resolution reconstruction is enlarged to the same size as the salient region.
CN202010084579.XA 2020-02-10 2020-02-10 Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism Active CN111340696B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010084579.XA CN111340696B (en) 2020-02-10 2020-02-10 Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010084579.XA CN111340696B (en) 2020-02-10 2020-02-10 Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism

Publications (2)

Publication Number Publication Date
CN111340696A CN111340696A (en) 2020-06-26
CN111340696B (en) 2022-11-04

Family

ID=71186796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010084579.XA Active CN111340696B (en) 2020-02-10 2020-02-10 Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism

Country Status (1)

Country Link
CN (1) CN111340696B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833335A (en) * 2020-07-17 2020-10-27 燕山大学 Lattice structure defect detection method and system based on super-resolution reconstruction
CN111832540B (en) * 2020-07-28 2021-01-15 吉林大学 Identity verification method based on unsteady-state iris video stream bionic neural network
RU2764395C1 (en) 2020-11-23 2022-01-17 Самсунг Электроникс Ко., Лтд. Method and apparatus for joint debayering and image noise elimination using a neural network
CN113129214A (en) * 2021-04-21 2021-07-16 北京工业大学 Super-resolution reconstruction method based on generation countermeasure network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683049A (en) * 2016-12-21 2017-05-17 华南理工大学 Reconstruction method of the image super-resolution based on the saliency map and the sparse representation
CN109272447A (en) * 2018-08-03 2019-01-25 天津大学 A kind of depth map super-resolution method
CN109961396A (en) * 2017-12-25 2019-07-02 中国科学院沈阳自动化研究所 A kind of image super-resolution rebuilding method based on convolutional neural networks


Also Published As

Publication number Publication date
CN111340696A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN111340696B (en) Convolutional neural network image super-resolution reconstruction method fused with bionic visual mechanism
Bashir et al. A comprehensive review of deep learning-based single image super-resolution
Wang et al. Deep networks for image super-resolution with sparse prior
Qin et al. Multi-scale feature fusion residual network for single image super-resolution
CN107154023B (en) Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
Dong et al. Learning a deep convolutional network for image super-resolution
CN107240066A (en) Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
CN109741256A (en) Image super-resolution rebuilding method based on rarefaction representation and deep learning
Dong et al. Laplacian pyramid dense network for hyperspectral pansharpening
CN103093444B (en) Image super-resolution reconstruction method based on self-similarity and structural information constraint
CN106952228A (en) The super resolution ratio reconstruction method of single image based on the non local self-similarity of image
CN113191953B (en) Transformer-based face image super-resolution method
Huang et al. Deep hyperspectral image fusion network with iterative spatio-spectral regularization
CN110232653A (en) The quick light-duty intensive residual error network of super-resolution rebuilding
CN106920214A (en) Spatial target images super resolution ratio reconstruction method
Chen et al. Single image super-resolution using deep CNN with dense skip connections and inception-resnet
Zhu et al. Stacked U-shape networks with channel-wise attention for image super-resolution
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
Lin et al. Deep unsupervised learning for image super-resolution with generative adversarial network
CN109325915A (en) A kind of super resolution ratio reconstruction method for low resolution monitor video
Wang et al. Dual residual attention module network for single image super resolution
CN112001843A (en) Infrared image super-resolution reconstruction method based on deep learning
Majumder et al. A tale of a deep learning approach to image forgery detection
CN114581347B (en) Optical remote sensing spatial spectrum fusion method, device, equipment and medium without reference image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Qiong

Inventor after: Wang Xin

Inventor before: Wang Xin

Inventor before: Wang Qiong

GR01 Patent grant