CN114612315A

CN114612315A - High-resolution image missing region reconstruction method based on multi-task learning

Info

Publication number: CN114612315A
Application number: CN202210008040.5A
Authority: CN
Inventors: 吴炜; 谢煜晨; 吴宁; 钟幸宇
Original assignee: Southeast Digital Economic Development Research Institute
Current assignee: Southeast Digital Economic Development Research Institute
Priority date: 2022-01-06
Filing date: 2022-01-06
Publication date: 2022-06-10

Abstract

The invention provides a high-resolution image missing region reconstruction method based on multitask learning, which takes missing region filling on a high-spatial resolution image as a conditional image generation problem and adopts a conditional countermeasure generation network model structure based on gate-controlled convolution to realize. The generator extracts features through the coding and decoding structures, performs ground feature type, boundary extraction and missing region filling by using the same features to complete three related tasks, and guides the feature extraction module to pay attention to information such as ground feature types and boundaries, so that a real image without missing details is obtained. The discriminator judges whether each image is generated or real, and guides the generator to carry out optimization according to the judgment. According to the invention, through guiding the generator to pay attention to the ground feature boundary and the ground feature type information, a seamless reconstruction result with rich details and textures can be obtained.

Description

High-resolution image missing region reconstruction method based on multi-task learning

Technical Field

The invention provides a high-resolution image missing region reconstruction method based on multi-task learning, which solves the problem of detail loss in high-resolution remote sensing image reconstruction and obtains a non-missing image with high fidelity.

Background

Cloud and cloud shadow and other factors can cause partial area loss on the remote sensing image. The satellite carrying the high spatial resolution sensor is particularly difficult to acquire the cloud-free remote sensing image due to the influences of factors such as long revisit period, small sensor width and the like. And the missing area filling restores the missing part on the image according to a certain rule to obtain a non-missing image covering the whole research area, thereby providing effective data for various applications. The method for filling the missing area is divided into two types of space dimension and spectrum dimension according to different information of the missing area filling.

And reconstructing the missing region based on the space dimension and predicting the missing region by using the effective region to obtain an image without missing. According to whether a reference image is used, the method can be divided into single-phase image filling and multi-phase image filling, wherein the single-phase image filling extracts structural information such as gradients from an effective part on the image and populates the structural information to a missing area to realize missing area reconstruction; the multi-phase image method maps the corresponding area of the reference image to the missing area according to a certain mode, thereby filling the missing area. Another method is to copy and paste the similar areas on the same image into the missing areas (Mallouf A, Carre P, Augerau Bet al,2009.A bandelet-based inpainting technique for closed removed from removed images [ J ]. IEEE Transactions on geoccience and remove Sensing,47(7): 2363-2371. DOI:10.1109/TGRS.2008.2010454.) by using the reference image as the reference for similar area selection.

The missing feature reconstruction based on the time dimension is to regard each pixel as a feature which changes along with time, such as reflectivity, DN value or NDVI, and reconstruct the missing feature according to the change rule of the time dimension. Common methods such as mean value filling, previous pixel value filling and next pixel value filling have good effects on stationary time sequences, and the effects on periodical ground objects such as bedding and the like are general due to lack of consideration on periodical change of the ground objects. Another method is to select the similar pixels according to Time sequence characteristics, namely to select the same or similar change curves according to the Time sequence characteristics and to reconstruct the Missing region by using the reference Time sequence curves, and the method can consider the change of each observation scale pixel along with Time but neglect the spatial proximity relation of each pixel (Wu Wei, Ge Luoqi, Luo Jiancheng, Huang Ruohong, Yang Yingpin.A Spectral-Temporal Patch-Based Missing Area Reconstruction for Time-Series Images [ J ] Remote Sensing,2018,10(10): 1560.).

The remote sensing image not only shows the correlation in space, but also shows the correlation in time sequence observation, namely, the occurrence and development rules of the earth surface are reflected. The reconstruction result can not only reflect the texture details of the local area, but also be integrated with the non-missing area; but also reflects the time-varying characteristics of the image. The method for fusing the time domain and the space domain simultaneously utilizes the two kinds of information to realize the reconstruction of the missing region. One method divides the image into homogeneous image blocks by multi-scale segmentation, and describes the spatial similarity of image pixels because the image blocks are formed by similar pixels. Typical methods include forming superpixels by pixel clustering, obtaining similar spectrum blocks based on the superpixels, measuring the similarity of time variation among the image blocks by a certain index, and utilizing a similar curve to reconstruct the whole of the missing region (C.H.Lin, K.H.Lai, Z.Bin Chen, and J.Y.Chen, "Patch-based information retrieval of closed-coordinated multiple images," IEEE Trans.Geoci.Remote Sens, vol.52, No.1, pp. 163 163,2014, doi: 10.1109/TGRS.2012.7408. Zhou Yak' nan, Yang SAR Xiezezezing, Feng Li, Wu Wei, Wu Anjun 1005, Luo Jnche, Zhou Xiancheng, Zhang Xun, Zhang Xue-search, coding J.2020. transform and coding data sharing, No. 2020. 8. sub-code relating to obtain the same image blocks.

In recent years, because the convolution operation can adjust the mapping relation between images with different spatial resolutions in a learning mode and describe a complex nonlinear response relation, deep learning based on a convolution neural network is widely applied to remote sensing image processing and information extraction. The generation of the countermeasure network (GAN) realizes cooperative optimization through the countermeasure training of the generator and the discriminator, and achieves good effect in various tasks. Wherein the generator generates a false-to-true image, and the discriminator determines whether the input image is generated by the generator or a true image. The generation of the antagonistic neural network achieves a reconstruction effect superior to that of the traditional method in tasks such as image reconstruction and the like.

However, unlike the common image reconstruction task that removes unwanted background terrain and obtains a seamless image that looks real, the remote sensing image reconstruction is to restore the real ground surface state. Since the high spatial resolution satellite image has abundant texture information, the reconstruction of the missing region requires the recovery of detail information inside the region, such as texture and edge of ground features. In order to solve the problem that the details of the high-spatial-resolution remote sensing image cannot be recovered when the traditional method is directly applied to reconstruction of the high-spatial-resolution remote sensing image, the method treats reconstruction of a missing area as a multi-task learning process, and takes information such as edges, categories and the like of the remote sensing image as one output of a model, so that a feature extractor can pay attention to the features of the edges and the categories of the ground objects in a training process, and a better reconstruction result is obtained.

Disclosure of Invention

A partial area on an original image I is missing, a mask M is used for identifying the effective condition of each pixel on the image, 1 represents a missing value, and 0 represents an effective value. The problem of reconstruction of the missing region is that the region marked 1 on the mask is estimated and filled to obtain an image I without missing_OUTThe process of (1). The invention can regard the missing area filling as a conditional image generation problem, adopts the condition GAN based on gate-controlled convolution, and the network structure is shown as figure 1, the network is mainly divided into two parts, a generatorFeature extraction is carried out through a coding and decoding structure, three related tasks of ground feature type identification, boundary extraction and missing region filling are carried out by using the same feature, and a feature extraction module is guided to pay attention to information such as ground feature types and boundaries, so that a real image without missing details is obtained. The discriminator judges whether each image is generated or real, and guides the generator to carry out optimization according to the judgment. The structures of the generator and the discriminator are described below.

1. Generator

The generator inputs the remote sensing image I and the missing region mask M thereof to obtain the ground object type T, the boundary B and the result I of the missing region reconstruction_OUT. The structure is composed of a convolution-based automatic coding-decoding network, and the specific structure is described according to an encoder, a decoder and an image reconstruction part as follows:

1) encoder for encoding a video signal

The encoder realizes the feature extraction of a plurality of scales and comprises the following steps:

firstly, the original image I and the mask M are processed by channel series connection (Concat), and then are processed by two-dimensional convolution (Conv2D), so as to obtain the characteristic F₁：

F₁＝Conv2D(Concat(I*(1-M)+M,M)),k＝32,s＝1 (1)

Where k denotes the number of convolution kernels, s denotes the convolution step, and + denotes the pixel-by-pixel matrix addition.

Second, output feature F was aligned using gated convolution GatedConv2D₁Encoding to obtain F₂：

F₂＝GatedConv2D(F₁),k＝64,s＝2 (2)

Again, Conv2D pair F was used₂Coding to obtain F₃：

F₃＝Conv2D(F₂),k＝64,s＝1 (3)

Then, F is mixed₃F is obtained after encoding through GatedConv2D operation with the convolution kernel number k being 128 and s being 2₄Then F is added₄Continuously combining the coded features through Conv2D operation of the convolution kernel number k being 128 and s being 1 to obtain the coded featuresF₅。

Finally, we used GatedConv2D for F₅And performing multi-scale cavity convolution to obtain the target.

F₆＝MultiDilatedGatedConv2D(F₅),k＝128,s＝1 (4)

The multidiated gatec onv2D includes four layers of gatec onv2D, where the number k of convolution kernels is 128, and each layer is subjected to cavity convolution with different expansion rates, which are 2, 4, 8, and 16, respectively. The hole convolution can enable the network to obtain multi-scale features and a larger receptive field, and provides cross-scale and cross-region information for the reconstruction of the missing region.

2) Decoder

The decoder inputs the characteristic diagram obtained by the encoder into the gating convolution to realize the amplification of the characteristic diagram, obtains the characteristics with the same size as the original image, and realizes the reconstruction of the missing area, and the realization steps are as follows.

Firstly, the characteristics F of the encoder output₅And F₆Serially connecting Concat according to channel dimension, inputting the Concat into a GatedConv2D convolutional layer for fusion and decoding to obtain a characteristic F₇And F₈：

F₇＝GatedConv2D(Concat(F₆,F₅)),k＝128,s＝1 (5)

F₈＝Conv2D(F₇),k＝128,s＝1 (6)

Input feature F using TransposeGatedConv2D₈Is amplified to obtain F₉：

F₉＝TransposeGatedConv2D(F₈),k＝64,s＝1 (7)

Where TransposeGatedConv2D is a combination of an upsampling function and GatedConv2D, the upsampling function used in the present invention is an adjacent sample interpolation.

Similarly, F₆And F of the encoder output₉Serially connecting Concat according to channel dimension, and inputting GatedConv2D with convolution kernel number k being 64 and s being 1 to obtain F₁₀F was obtained via a Conv2D with k 64 and s 1₁₁. F is to be₁₁F is obtained by amplifying transposegegrv 2D with the number of convolution kernels k 32 and s 1₁₂Then, againF is to be₁₂And F₁The GatedConv2D with the number of convolution kernels k 32 and s 1 input in series yields F₁₃。

3) Image reconstruction

The image reconstruction is a process of reconstructing a missing area to obtain an image without missing. The multi-task learning of the invention simultaneously comprises two additional tasks of surface feature type identification and edge detection, thereby guiding the feature extraction module to focus on two kinds of information of surface feature type and boundary.

In obtaining feature F₁₃Then, a layer of Conv2D operation processing is carried out to obtain a reconstructed image I_OUT：

I_OUT＝Conv2D(F₁₃),k＝3,s＝1 (8)

In the feature type identification subtask, F is₇，F₁₀Amplifying to F by using cubic interpolation function U₁₃And connecting Concat in series according to the channel number dimension, and obtaining a ground object type identification result T through one-layer two-dimensional convolution Conv2D_OUTThe process can be expressed as follows:

T_OUT＝Conv2D(Concat(U(F₇),U(F₁₀),F₁₃)),k＝C,s＝1 (9)

where k indicates that the number of classification categories C is equal.

The edge detection subtask depends on the result of ground feature class identification, and is obtained by one-layer convolution of ground feature classification results, and an edge intensity map B_OUTCan be seen as:

B_OUT＝Conv2D(T_OUT),k＝1,s＝1 (10)

the edge detection can be used for guiding the boundary details of the ground objects during image reconstruction, and the sharpening degree of the generated result is optimized; the land feature type information can be used for guiding the generator to pay attention to the type information of the land features, and the identification capability of the texture is improved.

2. Distinguishing device

The discriminator is a spectral local discriminator consisting of 6 Conv2D with convolution kernels of 5 × 5 and step sizes of 2 × 2. Each convolutional layer of the discriminator uses spectral normalization to stabilize the training.

During the discriminant training process, the present invention trains using LSGAN as a Loss function, which is the Loss of the principal discriminant Loss_DLoss of and GAN_GANTwo parts, in the form:

wherein X represents input real data I; y is the input generated data, i.e. I_OUT. D (X) and d (Y) respectively represent the discrimination results of the input real data X and the input generated data Y. Because a full convolution discriminator is used, the obtained discrimination high-dimensional matrix is smaller than the size of an original image, the discrimination probability of whether the image subregion is generated or real is represented, 1 is used for generation, 0 marks are real, a and b represent high-dimensional matrixes with the same size as the output size of the discriminator, and the value of a is 1; the values of b are all 0.

Loss of GAN Loss_GANThe method is used for calculating the difference between the result distinguished by the generator generated data discriminator and the correct result so as to optimize the generation of the generator, and the specific formula is as follows:

and judging whether the input picture is a real image or a generated image through the two functions, so as to realize the cooperative optimization of the input picture and the generated image.

The invention has the advantages that two additional tasks of ground feature type identification and boundary detection are added in the generator, so that the generator can focus on the ground feature type and boundary information simultaneously when filling the ground features, and a clearer result is obtained.

Drawings

FIG. 1 is a diagram of a network architecture of the present invention

FIGS. 2(a) and (b) are the spatial distribution diagrams of samples from Hangzhou and Panyu respectively

FIGS. 3(a) to (c) are exemplary diagrams of input data according to the present invention, (a) is an original image, (b) is a feature type label, and (c) is a boundary of different types of features

FIG. 4 shows the results of the experiment of the present invention

Detailed Description

The embodiment of the invention provides a remote sensing image missing region reconstruction method based on multitask learning, which comprises the following specific implementation modes:

two experimental areas of Hangzhou and Panyu were selected and 500 points were randomly generated, and the spatial distribution thereof is shown in FIG. 2. These points are used as the center to cut out the image block of 1000 × 1000 size. Wherein 50 image blocks are from GeoEye data acquired in 2006, 50 word view images in 2016, and 400 tile data captured from google earth, and the acquisition time of the tile data of google earth cannot be accurately determined. The 500 samples contain images acquired under different weather conditions, and the images have good conditions and also have conditions such as shadows and haze, and have good representativeness.

The data covers typical urban and rural scenes, and two labels are needed for model training: 1) the land feature types are mainly marked, the land feature types are divided into 8 types according to the requirements of research contents, namely grasslands, roads, cultivated lands, buildings, forests, bare lands, water bodies and the like, and the 8 types are mainly set according to the land feature types which can be distinguished on images and the difference of texture features of the land feature types; 2) boundary: when the ground feature type is identified, the boundary of each land parcel is accurately marked. However, during model training, we train the model using BDCN, perform boundary extraction, buffer the lines, and assign weights according to gaussian. 3) Real image: the method adopts a simulated missing region, namely an effective region on the remote sensing image is regarded as the missing region, the estimation result of the algorithm is adopted for filling, the filling result is compared with the real result, and the precision of the method is evaluated. The original image, data categories and their boundary results are shown in FIG. 3.

Randomly dividing the 500 samples into three groups, wherein the first group is 400 samples, is a training set and is mainly used for training a model; the second group of 50 sheets is mainly used for verifying the stability of the model; and the third group of 50 sheets is mainly used for evaluating the precision of the method. The following precision test and model training were performed based on 50 samples of the test set.

Because the method is a simulation experiment, missing regions with different sizes and different shapes are randomly generated on the image during each training, and then the reconstruction is carried out according to the method to obtain a reconstructed image without missing.

Based on the above configuration, a network model is constructed as described in the present invention, and the image data set is input to the model for training. The data are then trained to obtain a stable model. The invention trains 500 generations in total, the iteration size is 2, the generator and the discriminator are optimized by two Adam optimizers with the initial learning rate of 0.0001, and the learning rate of the generator and the discriminator in each 100 generations is attenuated to be half of the original learning rate.

After a stable network model is obtained, an image with a simulated missing region is input into the model to generate the missing region, and the result of reconstructing a part of the missing region is shown in fig. 4. Meanwhile, compared with the original image, the data result is real, and the detailed information such as the texture and the edge thereof is rich, so that the text method is an effective data reconstruction method.

The foregoing is merely a description of embodiments of the invention and is not intended to limit the scope of the invention to the particular forms set forth, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A high-resolution image missing region reconstruction method based on multitask learning is characterized in that partial regions of an input original image I are missing, effective conditions of pixels on the image are marked by using a mask M, 1 represents a missing value, and 0 represents an effective value, and a network adopted by the method comprises the following parts:

1. generator

The generator inputs the remote sensing image I and the missing region mask M thereof to obtain the ground object type T, the boundary B and the result I of the missing region reconstruction_OUTThe structure is composed of a convolution-based automatic coding-decoding network, and the specific structure is described according to an encoder, a decoder and an image reconstruction part as follows:

1) encoder for encoding a video signal

F₁＝Conv2D(Concat(I*(1-M)+M,M)),k＝32,s＝1 (1)

Second, output feature F was aligned using gated convolution GatedConv2D₁Coding to obtain F₂：

F₂＝GatedConv2D(F₁),k＝64,s＝2 (2)

Again, Conv2D pair F was used₂Coding to obtain F₃：

F₃＝Conv2D(F₂),k＝64,s＝1 (3)

Then, F is mixed₃F is obtained after encoding through GatedConv2D operation with the convolution kernel number k being 128 and s being 2₄Then F is added₄Continuing to combine the coded features to obtain F through Conv2D operation with the convolution kernel numbers k being 128 and s being 1₅。

F₆＝MultiDilatedGatedConv2D(F₅),k＝128,s＝1 (4)

The multidiated gatec onv2D includes four layers of gatec onv2D, where the number k of convolution kernels is 128, and each layer is subjected to cavity convolution with different expansion rates, which are 2, 4, 8, and 16, respectively. The cavity convolution can enable the network to obtain multi-scale features and a larger receptive field, and provides cross-scale and cross-region information for the reconstruction of the missing region;

2) decoder

The decoder inputs the characteristic diagram obtained by the encoder into the gating convolution to realize the amplification of the characteristic diagram, obtains the characteristic with the same size as the original image, and realizes the reconstruction of the missing area, and the realization steps are as follows:

F₇＝GatedConv2D(Concat(F₆,F₅)),k＝128,s＝1 (5)

F₈＝Conv2D(F₇),k＝128,s＝1 (6)

Input feature F using TransposeGatedConv2D₈Is amplified to obtain F₉：

F₉＝TransposeGatedConv2D(F₈),k＝64,s＝1 (7)

Where TransposeGatedConv2D is a combination of an upsampling function and GatedConv2D, the upsampling function used in the present invention is an adjacent sample interpolation,

similarly, F₆And F of the encoder output₉Serially connecting Concat according to channel dimension, and inputting GatedConv2D with convolution kernel number k being 64 and s being 1 to obtain F₁₀F was obtained via a Conv2D with k 64 and s 1₁₁. F is to be₁₁F is obtained by amplifying transposegegrv 2D with the number of convolution kernels k 32 and s 1₁₂Then F is added₁₂And F₁The GatedConv2D with the number of convolution kernels k 32 and s 1 input in series yields F₁₃；

3) Image reconstruction

The image reconstruction is a process of reconstructing a missing area to obtain a missing-free image, the multi-task learning simultaneously comprises two additional tasks of surface feature type identification and edge detection, so that the feature extraction module can be guided to pay attention to two information of surface feature types and boundaries simultaneously,

I_OUT＝Conv2D(F₁₃),k＝3,s＝1 (8)

T_OUT＝Conv2D(Concat(U(F₇),U(F₁₀),F₁₃)),k＝C,s＝1 (9)

wherein k represents that the number of classification categories C is equal,

B_OUT＝Conv2D(T_OUT),k＝1,s＝1 (10)

the edge detection is used for guiding the image to focus on the boundary details of the ground object during reconstruction, and the sharpening degree of the generated result is optimized; the land feature type information can be used for guiding the generator to pay attention to the type information of the land feature, and the identification capability of the texture is improved;

2. distinguishing device

The discriminator is a spectral local discriminator consisting of 6 Conv2D with convolution kernels of 5 × 5 and step sizes of 2 × 2. Each convolutional layer of the discriminator uses spectral normalization to stabilize the training,

wherein X represents input real data I; y is the input generated data, i.e. I_OUT. D (X) and d (Y) respectively represent the discrimination results of the input real data X and the input generated data Y. Because a full convolution discriminator is used, the obtained discrimination high-dimensional matrix is smaller than the size of the original imageThe judgment probability of whether the image subregion is generated or real is represented by 1, 0 marks the real, a and b represent high-dimensional matrixes with the same size as the output of the discriminator, and the value of a is 1; the values of b are all 0's,

2. According to the structure of the generator part of claim 1, the three tasks of boundary detection, image type identification and missing area reconstruction are performed simultaneously, so that the feature extraction can focus on the ground feature boundary, the image type and the missing area on the image at the same time.