CN115564982A

CN115564982A - Same-domain remote sensing image classification method based on adversarial learning

Info

Publication number: CN115564982A
Application number: CN202110738534.4A
Authority: CN
Inventors: 王慧; 闫科; 于克光; 杨乐; 李烁; 李靖; 蓝朝桢; 于翔舟; 李伟
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-01-03

Abstract

The invention relates to a same-domain remote sensing image classification method based on adversarial learning, and belongs to the technical field of remote sensing image data processing. The remote sensing image data are classified by a classification model composed of a generator and a discriminator, so that the model learns a data distribution as close as possible to the truth label space, which improves its overall perception of the input image and its classification accuracy. In addition, the generator adopts an encoder-decoder architecture, in which the decoder is a multilayer convolutional neural network provided with a residual part, an up-sampling part and an attention enhancement part. The output of the encoder is used as the input of the decoder, and the decoder is also connected to the corresponding layers of the encoder, so as to fuse low-level positional information with high-level semantic information, strengthen the correlation between pixels, reduce information loss during up-sampling and further improve classification accuracy.

Description

Same-domain remote sensing image classification method based on adversarial learning

Technical Field

The invention relates to a same-domain remote sensing image classification method based on adversarial learning, and belongs to the technical field of remote sensing image data processing.

Background

The better-performing backbone networks ResNet and ResNeSt do not realize their performance advantages: experimental results show that improved methods using these two backbones as encoders still lag behind the VGG network by a certain margin. There are two main reasons for this phenomenon. First, from a statistical point of view, conventional deep learning image classification assumes that the training set and the test set obey the same distribution, i.e. their feature spaces are the same or similar. In theory, the classification accuracy on the training data should then be the same as, or only slightly different from, the accuracy on the test data; in practice, the accuracy on the test data is lower than on the training data, and overfitting occurs. Second, although a complex deep learning model can learn most features in the training data well, some features that depend on relationships, such as spatial structure and color texture, cannot be learned effectively.

Particularly in remote sensing images, objects of the same ground-object class may differ greatly in structure, while different ground objects may be strongly similar in tone and texture, which reflects that pixels are related to one another at the semantic level. Traditional deep learning models are optimized with a cross-entropy loss function, whose back-propagated gradient depends only on the difference between a single pixel in the prediction and the corresponding pixel in the truth label. The correlation among neighboring pixels is thus ignored, so the classification accuracy is low, and ground-object edges tend to be discontinuous or the geometric shape of the classification result tends to differ greatly from the truth label.

Disclosure of Invention

The invention aims to provide a same-domain remote sensing image classification method based on adversarial learning, so as to solve the problem of low classification accuracy in existing remote sensing image classification methods.

To solve the above technical problem, the invention provides a same-domain remote sensing image classification method based on adversarial learning, which comprises the following steps:

acquiring remote sensing image data to be classified, and inputting the remote sensing image data into a trained classification model for classification; the classification model comprises a generator and a discriminator, wherein the generator adopts an encoder-decoder structure and is used for obtaining a pixel classification prediction result of the remote sensing image; the discriminator adopts a convolutional neural network and is used for distinguishing the real label from the prediction result generated by the generator by capturing high-order consistency between the two; the generator and the discriminator are trained in an adversarial manner.

The invention classifies the remote sensing image data with a classification model composed of a generator and a discriminator and, by exploiting the strong function-fitting capacity of generative adversarial networks, makes the classification model learn a data distribution as close as possible to the truth label space, thereby improving the model's overall perception of the input image and its classification accuracy.

Further, in order to improve the correlation between pixels and reduce information loss during up-sampling, the encoder adopts a feature extraction network for mapping the input remote sensing image data to a high-dimensional feature space; the decoder adopts a multilayer convolutional neural network comprising a plurality of convolutional neural networks of different depths, each provided with a residual part, an up-sampling part and an attention enhancement part; the output of the encoder is used as the input of the decoder, and the decoder is also connected to the corresponding layers of the encoder so as to fuse the positional information of low-level features with the semantic information of high-level features.

Furthermore, the residual part comprises two convolution modules, its input end and output end are connected across layers, and its input end receives the concatenated features of the previous convolutional neural network layer and of the encoder; the input end of the up-sampling part receives the output of the residual part, and the up-sampling part restores the feature map processed by the residual part to the size of the corresponding high-order feature map.

Furthermore, the attention enhancement part comprises a semantic information enhancement module and a position information enhancement module that operate in parallel; the input of both modules is the feature map output by the up-sampling part; the semantic information enhancement module operates on the channel dimension of the input feature map and uses the correlation between high-order and low-order channels to model the relation of specific semantic information between the high-order and low-order feature maps; the position information enhancement module is used for establishing the positional correlation between local features of the input feature map and their neighborhoods.

Further, the processing procedure of the semantic information enhancement module is as follows:

obtaining statistical information of the input characteristic diagram on the channel dimension by using global average pooling;

determining the weight of each channel dimension from the obtained channel statistics by means of linear transformations and activation functions, wherein, in the calculation formula of the weight, $g_c$ denotes the feature vector obtained through global average pooling, $W_1$ and $W_2$ (with $W_2 \in \mathbb{R}^{C/n \times C}$) respectively denote the weights of 1 × 1 convolution layers, ReLU denotes the ReLU function, $\sigma$ denotes the Sigmoid operation, and C denotes the total number of categories;

based on the weights in each channel dimension, an enhanced feature map is determined.

Further, the enhanced feature map obtained by the position information enhancement module is:

$v^{i,j} = h_{i,j} * x^{i,j}$

$q = W_s * x$

wherein $v = [v^{1,1}, v^{1,2}, \ldots, v^{W,H}]$ denotes the enhanced feature map; $x = [x^{1,1}, x^{1,2}, \ldots, x^{W,H}]$, where $x^{i,j}$ denotes a slice of the input feature map along the channel dimension and $(i, j)$ are the spatial position coordinates of the feature map, $i \in \{1, 2, \ldots, W\}$, $j \in \{1, 2, \ldots, H\}$; $W_s$ denotes the mapping matrix of a 1 × 1 convolution operation; $q$ denotes the weight map obtained by mapping $x$ through $W_s$; and $h_{i,j}$ is $q_{i,j}$ scaled to $[0, 1]$ by Sigmoid, representing the importance of the position information at position $(i, j)$ in the feature map.

Further, the loss function adopted by the generator in the training process is a weighted linear combination of two terms: an adversarial loss term, which is used to reduce the performance of the discriminator, and a multi-path fused focal loss term, which is used to generate a correct classification prediction for each pixel of the input image. $\theta_G$ denotes the parameters of the generator G, and D(·) denotes the discriminator's judgment of whether its input comes from the predicted value $G(X^{(n)})$ of the generator or from the truth label $Y^{(n)}$.

Furthermore, the discriminator is formed by connecting 8 convolutional layers in series; the convolution kernel size of each convolutional layer is 4 × 4, the stride of the last convolutional layer is no more than 1, and the strides of the first to seventh convolutional layers are all 2; the first convolutional layer uses the ReLU activation function, and the remaining convolutional layers use the LeakyReLU activation function.

Further, the loss function of the discriminator is defined in terms of the following quantities: $\theta_D$ denotes the parameters of the discriminator D; the loss term is the binary cross-entropy loss; D(·) denotes the discriminator's judgment of whether its input comes from the predicted value $G(X^{(n)})$ of the generator or from the truth label $Y^{(n)}$; y denotes the one-hot code of a given class in the truth label, and x denotes the prediction result of a given class generated by the generator.

Further, the parameters $\theta_D$ of the discriminator and the parameters $\theta_G$ of the generator are updated step by step: first, the generator parameters $\theta_G$ are fixed and the discriminator parameters $\theta_D$ are updated, so that the discriminator can tell the prediction results apart from the truth labels; then the discriminator parameters $\theta_D$ are fixed and the generator parameters $\theta_G$ are updated, so that the generator produces prediction results that the discriminator cannot distinguish as true or false.

Drawings

FIG. 1 is a schematic diagram of a network architecture of a classification model employed by the present invention;

FIG. 2 is a schematic diagram of a generator in the classification model of the present invention;

FIG. 3 is a block diagram of a VGG (VGG-19) network employed by an encoder in the classification model of the present invention;

FIG. 4 is a schematic diagram of the attention and residual structure used by the decoder in the classification model of the present invention;

FIG. 5-a is a schematic diagram of an adversarial network employing the cGAN model framework;

FIG. 5-b is a schematic diagram of an adversarial network employing the pix2pix model framework;

FIG. 6 is a schematic diagram of the structure of the discriminator used in the present invention;

FIG. 7 is a visualization of the multi-path fused focal loss function design employed by the generator of the present invention;

FIG. 8 is a diagram of partial predicted results of different model methods on a Vaihingen data set in an experimental example of the present invention;

FIG. 9-a is the original image No. 31 in the data set of Vaihingen in the experimental example of the present invention;

FIG. 9-b is a schematic diagram of the truth label of image No. 31 in the data set of Vaihingen in accordance with the present invention;

FIG. 9-c is a graph showing the classification of image number 31 in the Vaihingen dataset using the SVL-3 method in accordance with the present invention;

FIG. 9-d shows the classification result of image No. 31 in the Vaihingen dataset using the RIT_L7 method in the experimental example of the present invention;

FIG. 9-e shows the classification result of image No. 31 in the Vaihingen dataset using the DLR_8 method in the experimental example of the present invention;

FIG. 9-f shows the classification result of image No. 31 in the Vaihingen dataset using the CASIA method in the experimental example of the present invention;

FIG. 9-g shows the classification result of image No. 31 in the Vaihingen dataset using the AREANs-VGG method in the experimental example of the present invention;

FIG. 9-h shows the classification result of image No. 31 in the Vaihingen dataset using the AREANs-ResNet method in the experimental example of the present invention;

FIG. 9-i shows the classification result of image No. 31 in the Vaihingen dataset using the AREANs-ResNeSt method in the experimental example of the present invention;

FIG. 10-a is the original image No. 6_13 in the Potsdam dataset in the experimental example of the present invention;

FIG. 10-b is a schematic diagram of the truth label of image No. 6_13 in the Potsdam dataset in the experimental example of the present invention;

FIG. 10-c shows the classification result of image No. 6_13 in the Potsdam dataset using the SVL_1 method in the experimental example of the present invention;

FIG. 10-d shows the classification result of image No. 6_13 in the Potsdam dataset using the RIT_L7 method in the experimental example of the present invention;

FIG. 10-e shows the classification result of image No. 6_13 in the Potsdam dataset using the UZ_1 method in the experimental example of the present invention;

FIG. 10-f shows the classification result of image No. 6_13 in the Potsdam dataset using the DST_5 method in the experimental example of the present invention;

FIG. 10-g shows the classification result of image No. 6_13 in the Potsdam dataset using the BKHz_3 method in the experimental example of the present invention;

FIG. 10-h shows the classification result of image No. 6_13 in the Potsdam dataset using the CASIA2 method in the experimental example of the present invention;

FIG. 10-i shows the classification result of image No. 6_13 in the Potsdam dataset using the BUCTY5 method in the experimental example of the present invention;

FIG. 10-j shows the classification result of image No. 6_13 in the Potsdam dataset using the AREANs-VGG method in the experimental example of the present invention;

FIG. 10-k shows the classification result of image No. 6_13 in the Potsdam dataset using the AREANs-ResNet method in the experimental example of the present invention;

FIG. 10-l shows the classification result of image No. 6_13 in the Potsdam dataset using the AREANs-ResNeSt method in the experimental example of the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings.

The classification model adopted by the same-domain remote sensing image classification method based on adversarial learning comprises a generator and a discriminator, wherein the generator adopts an encoder-decoder structure incorporating a residual attention enhancement mechanism, and the discriminator adopts a convolutional neural network. Training of the classification model is divided into two stages. First stage: the generator is trained alone under supervision and optimized with the multi-path fused focal loss function, so that it acquires a certain classification capability and provides an initial model for the second-stage adversarial training. Second stage: on the basis of the first stage, the discriminator is added and an adversarial training strategy is introduced; the whole framework is jointly optimized with the adversarial loss combined with the multi-path fused focal loss function, improving the image classification accuracy of the generator.

1. Establishing a classification model

There are two types of conditional generative adversarial network architecture, shown in FIG. 5-a (cGAN) and FIG. 5-b (pix2pix). The generator input in cGAN consists of two parts, random noise z and a control condition c; under the influence of the control condition, the generator G completes the mapping from z to G(z). The discriminator D learns to recognize the fake G(z), as shown in equation (1):

where "1" indicates that the input of D comes from the true value x and "0" indicates that the input of D comes from the generator output G(z). The final objective function is:

pix2pix aims to map the image space to the truth label space, so the input at its generator end is only the original image x, while the random noise z is realized by the Dropout layers in the network structure. Its discriminator works on the same principle as that of the conditional generative adversarial network. To obtain a result as similar as possible to the truth label y, pix2pix also adds an $L_1$ distance constraint to the optimization objective, as shown in equation (3):

Combined with the objective function of cGAN:

the final objective function is:

where $\lambda$ is the weight of the $L_1$ distance constraint.
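For reference, the objectives described above are usually written in the following standard forms in the cGAN and pix2pix literature. This is a sketch under that assumption: the notation (x, y, z, c, $\lambda$) follows the surrounding text, and the symbols $V$, $\mathcal{L}_{cGAN}$ and $\mathcal{L}_{L_1}$ are conventional names; the exact form and numbering of equations (1) to (5) in the patent may differ.

```latex
% cGAN: minimax objective under condition c (x is the real sample)
\min_G \max_D V(D, G) = \mathbb{E}_{x}\big[\log D(x \mid c)\big]
                      + \mathbb{E}_{z}\big[\log\big(1 - D(G(z \mid c))\big)\big]

% pix2pix: L1 distance constraint between truth label y and generator output
\mathcal{L}_{L_1}(G) = \mathbb{E}_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big]

% pix2pix: conditional adversarial objective (x is the input image)
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y}\big[\log D(x, y)\big]
                         + \mathbb{E}_{x, z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

% pix2pix: final objective, with \lambda weighting the L1 term
G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L_1}(G)
```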

Based on the above, the classification model adopted in the present invention is shown in fig. 1. It includes a generator network that fuses an attention mechanism with residual modules, and a discriminator network based on image fusion. The generator is mainly used for obtaining a pixel classification prediction result of the remote sensing image. The discriminator evaluates whether the prediction result generated by the generator is reliable: it judges an input in which the original image is fused with the truth label as true (1), and an input in which the original image is fused with the generator's prediction as false (0).

The structure of the generator is shown in fig. 2 and includes two parts, an encoder and a decoder. The encoder can be built from various backbone networks, such as VGG, ResNet and ResNeSt. The decoder adopts a multilayer convolutional neural network comprising a plurality of convolutional neural networks of different depths, each provided with a residual part, an up-sampling part and an attention enhancement part, wherein a Residual block serves as the residual part, an Upsampling block as the up-sampling part, and an Attention-enhanced block as the attention enhancement part. The prediction result is output by the Softmax layer. In the present embodiment, the decoder employs five layers of convolutional neural networks, where Level-X (X = 2, 3, 4, 5) denotes the neural networks at different depths in the decoder, and the last layer is output through the Softmax layer. In the legend, A denotes convolution + activation function, B denotes the down-sampling operation, and C denotes the embedded attention and residual structure block.

For this embodiment, the encoder uses a VGG network, as shown in fig. 3. It is composed of 5 blocks, each consisting of convolution layers + Rectified Linear Units (ReLU) + a max pooling layer. Mirroring the way human vision abstracts visual information at different levels, the number of channels increases block by block while the size of the feature maps decreases block by block. The features extracted by different blocks of the VGG represent the target at different levels; the higher the level, the higher the degree of abstraction, so the output at the end of the network is a high-level abstract representation of the input image. Given an input data set $X = [x_1, x_2, \ldots, x_N]$, each $x_i$ denotes the i-th image in the data set, of size w × h and comprising c channels.

The operations performed in the n-th block of the encoder E (here, the VGG network) include a series of convolution (conv) operations, activation units (Sigmoid, tanh or ReLU functions) and pooling (pool) operations, after which the encoder produces its final output.

The encoder can learn hierarchical features containing visual representation information of the target and implicit local texture information, but classification errors easily arise because global context information encoding the spatial relationships of targets is lacking. The high-order feature map of each channel obtained by the encoder can be regarded as a response to a specific category and contains rich semantic information; however, high-order features often lack basic spatial information and cannot accurately describe the edge position of the target, whereas low-order features have complete spatial information but are limited by the receptive field and relatively lack semantic information.

The Residual block consists of two 3 × 3 convolutions, as shown in fig. 4. Its input end and output end are connected across layers (a skip connection), and its input is the concatenation of the features of the previous network layer and of the encoder. The residual part not only increases the depth of the network and improves its performance, but also effectively alleviates the model degradation caused by deepening the network and the difficulty a multilayer neural network has in fitting an identity mapping.

The feature map processed by the residual part enters the up-sampling part, which restores it to the size of the corresponding high-order feature map. This part is implemented with a transposed convolution operation; compared with interpolation methods such as nearest-neighbor, bilinear or bicubic interpolation, its advantage is that its parameters are learned and need not be preset manually.

The attention enhancement part comprises a semantic information enhancement module and a position information enhancement module. The feature map processed by the up-sampling part enters both modules simultaneously, and the feature maps processed by the two modules are fused to obtain the enhanced result; this parallel arrangement differs from the serial combination used in CCNet.

The semantic information enhancement module operates on the channel dimension of the input feature map, and uses the correlation between high-order and low-order channels to model the relation of specific semantic information between the high-order feature map and the low-order feature map. First, global average pooling (GAP) is used to obtain statistics of the input feature x in the channel dimension:

where $x_c$ denotes the feature map of the input x on the c-th channel and $g_c$ denotes the global statistic obtained from the c-th channel. Subsequently, in order to enhance the correlation of the feature maps on different channels, a combination of linear transformations and activation functions is introduced:

where $g_c$ denotes the feature vector after GAP, $W_1$ and $W_2$ (with $W_2 \in \mathbb{R}^{C/n \times C}$) each denote the weight of a 1 × 1 convolution layer, ReLU denotes the ReLU function, and $\sigma$ denotes the Sigmoid operation. It should be noted that fully connected layers are not used for the linear transformations, mainly to reduce the amount of computation and the number of model parameters. The result is a different weight for each channel dimension, and the final feature map is:

where $u = [u_1, u_2, \ldots, u_C]$ denotes the enhanced feature map and × denotes multiplication in the channel dimension.
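The channel-wise processing described above can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function name `channel_attention`, the nested-list layout (C channels of H × W values) and the weight shapes (`w1` with C/n rows of C values, `w2` with C rows of C/n values, applied as matrix-vector products) are assumptions, and a 1 × 1 convolution on a pooled vector is modelled as an ordinary linear map.

```python
import math


def channel_attention(x, w1, w2):
    """Re-weight the channels of x (C x H x W nested lists).

    w1: (C/n) x C weights of the first 1x1 convolution (assumed shape).
    w2: C x (C/n) weights of the second 1x1 convolution (assumed shape).
    Returns (s, u): per-channel weights and the enhanced feature map.
    """
    # Global average pooling: one statistic g_c per channel.
    g = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # First linear transformation followed by ReLU.
    hidden = [max(0.0, sum(w * g_c for w, g_c in zip(row, g))) for row in w1]
    # Second linear transformation followed by Sigmoid: a weight in (0, 1) per channel.
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale every channel by its weight to obtain the enhanced map u.
    u = [[[s_c * value for value in row] for row in ch] for s_c, ch in zip(s, x)]
    return s, u
```

With all-zero weights in `w1`, every channel receives the neutral weight Sigmoid(0) = 0.5, which is a convenient sanity check before plugging in learned weights.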

The position information enhancement module is used for establishing the position information correlation between the local features and other neighborhoods. The structure is relatively simple and can be expressed as follows:

$q = W_s * x$ (10)

where $x = [x^{1,1}, x^{1,2}, \ldots, x^{W,H}]$; $x^{i,j}$ denotes a slice of the input feature map along the channel dimension, and $(i, j)$ are the spatial position coordinates of the feature map, $i \in \{1, 2, \ldots, W\}$, $j \in \{1, 2, \ldots, H\}$; $W_s$ denotes the mapping matrix of a 1 × 1 convolution operation; $q$ denotes the weight map obtained by mapping $x$ through $W_s$; $h_{i,j}$ is $q_{i,j}$ scaled to $[0, 1]$ via Sigmoid and represents the importance of the position information at position $(i, j)$ in the feature map. The final enhanced feature map is therefore:

$v^{i,j} = h_{i,j} * x^{i,j}$ (12)

where $v = [v^{1,1}, v^{1,2}, \ldots, v^{W,H}]$ denotes the enhanced feature map.

Through the parallel processing of the two modules, the feature map obtains the position information and the semantic information enhancement at the same time, and the final result is as follows:

$y = u + v$ (13)
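The position information enhancement and the final fusion of equation (13) can be illustrated with a short sketch. It assumes a feature map stored as C × H × W nested lists and a 1 × 1 convolution with a single output channel, represented by one weight per input channel (`w_s`); the function names are hypothetical, and the code indexes rows first, whereas the text writes positions as (i, j) with i up to W.

```python
import math


def position_attention(x, w_s):
    """x: C x H x W nested lists; w_s: C weights of a 1x1 convolution (assumed)."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    # q = W_s * x: the channel slice at each position collapses to one score.
    q = [[sum(w_s[c] * x[c][r][k] for c in range(channels)) for k in range(width)]
         for r in range(height)]
    # h = Sigmoid(q): importance of each spatial position, in (0, 1).
    h = [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in q]
    # v = h * x: every channel at a position is scaled by the same weight.
    v = [[[h[r][k] * x[c][r][k] for k in range(width)] for r in range(height)]
         for c in range(channels)]
    return h, v


def fuse(u, v):
    """y = u + v, element-wise, as in equation (13)."""
    return [[[a + b for a, b in zip(row_u, row_v)] for row_u, row_v in zip(ch_u, ch_v)]
            for ch_u, ch_v in zip(u, v)]
```

Because the two branches are added rather than chained, either branch can be inspected or ablated independently, which matches the parallel arrangement described in the text.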

The function of the discriminator is to distinguish the real label from the prediction result generated by the generator by capturing high-order consistency between the two. The discriminator of the present invention adopts a structure similar to a Markov discriminator (PatchGAN), as shown in fig. 6. A Markov discriminator does not take the whole image as a single input to judge; instead it judges whether image patches of a specific size are real or fake, and averages the judgments over all patches as its final output. The aim is to reduce the dimensionality of the input data, reduce the number of parameters and speed up the network while maintaining accuracy. In this embodiment, the discriminator is formed by connecting 8 convolutional layers in series; the convolution kernel size is 4 × 4, and the stride is 2 for every convolutional layer except the last, whose stride is 1; the first convolutional layer uses the ReLU activation function, and the remaining convolutional layers use LeakyReLU. The last layer of the discriminator is a convolutional layer rather than the fully connected layer of a conventional discriminator, so the final output is a matrix (as shown in fig. 6, where I represents the size of the discriminator's input image); each element has a local receptive field on the input image, which better suits the semantic segmentation task. As for the input of the discriminator, instead of using the truth labels or predicted values directly, the probability map of each class is multiplied by the corresponding RGB (Red-Green-Blue) or IRRG (Infrared-Red-Green) image to obtain a new feature map as input. This feature map contains 3 × C channels (where C denotes the number of classes), which makes it easier for the discriminator to use the information in the original image to distinguish prediction results from true values.
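Two details of this paragraph lend themselves to a small sketch: how multiplying class maps by image bands yields 3 × C channels, and how 8 convolutions with 4 × 4 kernels shrink an input of size I. The function names, the channel ordering and the padding of 1 are assumptions made for illustration; the source does not state the padding.

```python
def fuse_discriminator_input(image, class_maps):
    """image: 3 x H x W bands (RGB or IRRG); class_maps: C x H x W maps.

    Each class map (probabilities or one-hot labels) is multiplied element-wise
    by each image band, giving 3 * C channels. Channel order is an assumption.
    """
    fused = []
    for class_map in class_maps:
        for band in image:
            fused.append([[p * b for p, b in zip(p_row, b_row)]
                          for p_row, b_row in zip(class_map, band)])
    return fused


def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of one convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_output_size(size, padding=1):
    """Output side length after 7 stride-2 layers and a final stride-1 layer."""
    for _ in range(7):
        size = conv_out(size, stride=2, padding=padding)
    return conv_out(size, stride=1, padding=padding)
```

Under the assumed padding, a 512 × 512 input is halved seven times to 4 × 4 and the last layer yields a 3 × 3 matrix of patch decisions.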

2. Training classification models

The invention trains the generator and the discriminator by adversarial training. The generator produces a segmentation result and the discriminator distinguishes the candidate sample from the real sample; the two compete with each other in a zero-sum game framework according to the data distribution, which can be expressed as follows:

where $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(N)}\}$ denotes the set of input images and $Y = \{Y^{(1)}, Y^{(2)}, \ldots, Y^{(N)}\}$ denotes the corresponding set of truth labels; V denotes the objective function of the minimax game and E denotes the expectation over the distribution; D(·) denotes the discriminator and $\theta_D$ its parameters; G(·) denotes the generator (here, the classification network proposed above) and $\theta_G$ its parameters.

(1) Training discriminator

In the classification model based on the generative adversarial network, the loss function of the discriminator can be defined in the following form:

where $\theta_D$ denotes the parameters of the discriminator D; the loss term is the binary cross-entropy loss (i.e. the adversarial loss) [40, 42]; D(·) denotes the discriminator's judgment of whether its input comes from the predicted value $G(X^{(n)})$ of the generator or from the truth label $Y^{(n)}$; y denotes the one-hot code of a given class in the truth label, and x denotes the prediction result of a given class generated by the generator.
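A discriminator loss built from binary cross-entropy can be sketched as below. This is an illustration under assumptions, not the patent's exact equation: the discriminator outputs are taken to be scalar probabilities, inputs fused with truth labels get target 1 and inputs fused with generator predictions get target 0 (as stated for the discriminator earlier), and averaging over the batch is a choice made here.

```python
import math


def bce(target, pred, eps=1e-12):
    """Binary cross-entropy between a 0/1 target and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def discriminator_loss(d_real, d_fake):
    """d_real: D outputs for image + truth label; d_fake: D outputs for image + prediction."""
    total = sum(bce(1.0, r) + bce(0.0, f) for r, f in zip(d_real, d_fake))
    return total / len(d_real)
```

An undecided discriminator (all outputs 0.5) gives a loss of 2 ln 2 per sample, and the loss falls as its outputs move towards 1 for truth labels and 0 for predictions.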

(2) Training generator

By training with a hybrid loss function, the generator learns to produce samples that are difficult for the discriminator to distinguish as true or false. The hybrid loss comprises two parts: an adversarial loss term, which is used to reduce the performance of the discriminator, and a multi-path fused focal loss term, which is used to generate a correct classification prediction for each pixel of the input image. Here $\theta_G$ denotes the parameters of the generator G, and the hybrid loss is a linear combination of the two terms, weighted by a penalty factor.
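One way to write such a hybrid loss is sketched below. The symbols $\mathcal{L}_{G}$, $\mathcal{L}_{bce}$ and $\lambda$ are assumed names, and the form follows the common adversarial segmentation formulation, so it may differ from the patent's own equation.

```latex
% Hybrid generator loss: multi-path fused focal loss plus a weighted adversarial term.
% Minimizing the second term pushes D's output on generated predictions towards 1.
\mathcal{L}_{G}(\theta_G) = \sum_{n=1}^{N} \Big[
    FL_{fusion}\big(Y^{(n)},\, G(X^{(n)})\big)
    + \lambda \, \mathcal{L}_{bce}\big(D(X^{(n)}, G(X^{(n)})),\, 1\big)
\Big]
```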

The multi-path fused focal loss function is defined as follows:

where, in $FL(p, q)$, $p$ denotes the one-hot encoded label of class c for the n-th image in each batch, $q$ denotes the corresponding Softmax layer output, C denotes the total number of classes, and N denotes the total number of images participating in training in each batch. $\gamma$ is called the "focusing parameter"; its role is to make the network focus on samples that are difficult to classify. In the invention, $\gamma = 2$.

In order to further improve network performance, feature maps are extracted from different layers of the decoder and combined with the Focal Loss to form a new loss function. As shown in fig. 7, the outputs of Level-5 to Level-2 and of the Softmax layer in the decoder are extracted. Because the channel dimensions of the feature maps extracted from different layers differ (the channel dimensions extracted from Level-5 to Level-2 are 256, 256, 64, respectively, and the channel dimension of the Softmax layer output is the total number of classes), the feature maps extracted from Level-5 to Level-2 are each passed through a 1 × 1 convolutional layer to make their channel dimension consistent with the Softmax layer output; the Focal Loss is then computed for each and the results are summed, in the following form:

$FL_{fusion} = \sum_{i=1}^{M} FL(p, q)_i$ (18)

where $FL(p, q)_i$ denotes the Focal Loss computed on the feature map extracted from the i-th layer, and M denotes the total number of extraction layers; here M = 5.
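The summation in equation (18) can be sketched as follows, assuming the widely used focal loss form $-p\,(1-q)^{\gamma}\log q$ for each class. The function names, the flat "samples × classes" layout (each pixel treated as one sample) and the averaging over samples are assumptions; only $\gamma = 2$ and the sum over M paths come from the text.

```python
import math


def focal_loss(labels, probs, gamma=2.0, eps=1e-12):
    """labels: one-hot rows; probs: Softmax rows of the same shape."""
    total = 0.0
    for p_row, q_row in zip(labels, probs):
        for p_c, q_c in zip(p_row, q_row):
            q_c = min(max(q_c, eps), 1.0)  # clamp to keep log() finite
            # Well-classified samples (q_c near 1) are down-weighted by (1 - q_c)^gamma.
            total -= p_c * (1.0 - q_c) ** gamma * math.log(q_c)
    return total / len(labels)


def fused_focal_loss(labels, path_probs, gamma=2.0):
    """Sum the focal loss over the M decoder paths (M = 5 in the text)."""
    return sum(focal_loss(labels, probs, gamma) for probs in path_probs)
```

Setting `gamma=0` reduces each term to ordinary cross-entropy, which makes the effect of the focusing parameter easy to compare on the same predictions.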

Adversarial training typically updates the discriminator parameters $\theta_D$ and the generator parameters $\theta_G$ step by step. The AREANs proposed by the present invention are therefore trained in two steps: first, the generator parameters $\theta_G$ are fixed and the discriminator parameters $\theta_D$ are updated, so that the discriminator can tell the prediction results apart from the truth labels; second, the discriminator parameters $\theta_D$ are fixed and the generator parameters $\theta_G$ are updated, so that the generator produces predictions that are difficult for the discriminator to distinguish as true or false. Through this adversarial training, the statistical relationship between the prediction result and the corresponding truth label can be established at the level of high-level semantics, and at the same time, in the contest with the discriminator, each layer in the generator can play its role.
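The two-step update can be written schematically as below. This is only a sketch of the update order: parameters are plain lists of floats, `grad_d` and `grad_g` are hypothetical callables that return gradients of the discriminator and generator losses, and plain gradient descent stands in for the optimizer actually used.

```python
def adversarial_step(theta_g, theta_d, grad_d, grad_g, lr_g=1e-4, lr_d=5e-4):
    """One round of alternating updates; returns the new (theta_g, theta_d)."""
    # Step 1: generator parameters are held fixed, discriminator is updated.
    theta_d = [w - lr_d * g for w, g in zip(theta_d, grad_d(theta_g, theta_d))]
    # Step 2: the updated discriminator is held fixed, generator is updated.
    theta_g = [w - lr_g * g for w, g in zip(theta_g, grad_g(theta_g, theta_d))]
    return theta_g, theta_d
```

Note that the generator gradient is evaluated against the already-updated discriminator, which is what makes the two steps sequential rather than simultaneous.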

Experimental verification

To further validate the classification performance of the classification method of the present invention, the experiments were trained on the infrared-red-green (IR-R-G) three-channel images of the Vaihingen and Potsdam datasets.

In the first training stage, the discriminator does not participate, and the generator is trained with the proposed multi-path fusion Focal Loss. The encoder weights are initialized from public pre-trained models (VGG-19, ResNet-101 and ResNeSt-101 models pre-trained on ImageNet are used in this experiment), and the decoder is initialized using the Kaiming method. The initial learning rate is set to 10^-4, and 10^5 iterations are performed in total.

In the second training stage, the discriminator is introduced for adversarial training. The generator starts from the model weights obtained in the first stage, and the discriminator is initialized using the Kaiming method. For the learning rates, the invention adopts TTUR (Two Time-Scale Update Rule) and sets the learning rates of the two networks to different values, which makes the training of GANs more stable without increasing the time cost: the initial learning rate of the generator is set to 10^-4, and the initial learning rate of the discriminator is set to 5 × 10^-4. A further hyperparameter, whose symbol is missing from the source text, is set to 0.05, and 3 × 10^5 iterations are performed in total.

The two training stages run without interruption, forming an end-to-end training procedure. Adam is used to train the AREANs framework proposed by the invention, with β₁ = 0.5, β₂ = 0.999, a weight decay of 10^-4, a learning rate that is halved every 10^5 iterations, and batch_size = 16.
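The learning-rate settings can be illustrated with a small schedule function. This sketch assumes a staircase decay (the rate drops by half exactly at each multiple of 10^5 iterations); the source states only that the rate is halved every 10^5 iterations. The names `learning_rate`, `GENERATOR_LR` and `DISCRIMINATOR_LR` are hypothetical.

```python
def learning_rate(iteration, initial_lr, halve_every=100_000):
    """Staircase decay (an assumption): halve the rate every `halve_every` iterations."""
    return initial_lr * 0.5 ** (iteration // halve_every)

GENERATOR_LR = 1e-4      # initial generator rate stated in the text
DISCRIMINATOR_LR = 5e-4  # initial discriminator rate stated in the text (TTUR)

for it in (0, 100_000, 250_000):
    print(it, learning_rate(it, GENERATOR_LR), learning_rate(it, DISCRIMINATOR_LR))
```

Under TTUR the two networks keep separate rates throughout, so the same schedule is simply applied to each initial value.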

The experiments assess the degree of performance improvement by comparing, on the two datasets, the classification accuracy of the baseline methods with that of the methods obtained after applying different improvement strategies. "Baseline1", "Baseline2" and "Baseline3" denote the reference methods with VGG-19, ResNet-101 and ResNeSt-101 as encoders, respectively, before the AR structure and adversarial learning are added; "Baseline1+AR", "Baseline2+AR" and "Baseline3+AR" denote the improved methods after the AR structure is added; "Baseline1+GAN", "Baseline2+GAN" and "Baseline3+GAN" denote the improved methods after adversarial learning is introduced; "Baseline1+AR+GAN", "Baseline2+AR+GAN" and "Baseline3+AR+GAN" denote the improved methods after both the AR structure and adversarial learning are added, i.e. the methods based on the AREANs architecture proposed by the present invention. Table 1 compares, on the Vaihingen dataset and for the three backbone encoders, the performance before and after introducing the AR structure, adversarial learning, and the proposed method. The F1 score is used as the evaluation index for each ground-object category, and mean intersection over union (mIoU) and overall accuracy (OA) are used as the overall performance indexes.

TABLE 1
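For reference, the three evaluation indexes can be computed from a confusion matrix as sketched below. The function name `metrics_from_confusion` and the example matrix are hypothetical and are not data from the experiments; the formulas are the usual definitions of OA, per-class F1 and mIoU.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of pixels of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    oa = sum(cm[i][i] for i in range(n)) / total  # overall accuracy
    f1, iou = [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted as c, true class differs
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        denom = tp + fp + fn
        f1.append(2 * tp / (2 * tp + fp + fn) if denom else 0.0)
        iou.append(tp / denom if denom else 0.0)
    return {"OA": oa, "F1": f1, "mIoU": sum(iou) / n}

# Hypothetical two-class confusion matrix, for illustration only.
print(metrics_from_confusion([[8, 2], [1, 9]]))
```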

In summary, the AREANs architecture proposed by the present invention obtains the best test results on the Vaihingen dataset, and the improved methods based on the three different backbone networks show a large performance improvement over the reference methods: mIoU improves by 4.57%, 7.22% and 10.13%, respectively, and OA improves by 1.09%, 2.53% and 3.40%, respectively.

As can be seen from Table 1, in addition to the AREANs proposed here, the reference methods that adopt only the adversarial learning strategy also show a certain performance improvement over the three reference methods. The main reason is that the GAN can learn the high-order consistency between the truth labels and the predicted labels, so that the model's predictions lie as far as possible on the same manifold as the training data, which improves the accuracy of the prediction results.

Among the three improved methods based on the AREANs framework, "Baseline3+AR+GAN" obtains the best experimental results on all indexes. The ResNeSt-101 network is a variant of the ResNet-101 network; while keeping the network depth of the latter, it improves the basic performance of the network by adding a Split-Attention module and multi-path fusion. "Baseline3+AR+GAN", which uses ResNeSt-101 as the encoder, improves substantially: mIoU increases by 10.13% and OA by 3.40%.

The AREANs methods based on the three backbone networks improve mIoU by 4.23%, 9.27% and 9.38%, respectively, and OA by 1.58%, 2.66% and 2.89%, respectively. This verifies that the method can effectively improve classification accuracy without increasing the number of iterations.

In terms of the classification of small targets (vehicles), the three AREANs-based methods improve considerably, with F1 scores increasing by 10.10%, 21.71% and 23.18%, respectively. This shows that the decoder improvement strategy combining the AR structure with adversarial learning makes the model more sensitive to tiny targets, particularly in high-resolution remote sensing images with complex texture and structure.

The AREANs proposed by the invention (AREANs-VGG, AREANs-ResNet and AREANs-ResNeSt) are compared with other classical networks on the Vaihingen dataset. FCN, UNet, SegNet, PSPNet and DeepLabv3+ are five of the most typical networks in deep learning. TreeUNet adopts adaptive Tree-CNN modules: a confusion matrix is constructed using Tree-CNN blocks, and misclassification in prediction is reduced by combining a tree pruning algorithm. HSN replaces different convolutional layers with Inception modules, which enlarges the network's multi-scale receptive field for rich texture information and strengthens its prediction capability. REMSNet improves the network's sensitivity to global texture information by constructing a parallel multi-kernel deconvolution module and adding an attention mechanism. SPNet offers a lightweight network design by introducing a strip pooling module and a mixed pooling module and combining them with an attention mechanism and multi-path fusion. All of the above networks adopt an encoder-decoder architecture, and the specific experimental results are shown in Table 3.

TABLE 3

In general, the classification results of the different models look similar, mainly because they all use an encoder-decoder architecture similar to that of FCN. On this basis, network performance is improved to different degrees through different means, such as establishing relationships between high-order and low-order features (UNet, SegNet), constructing multi-scale feature fusion (PSPNet, HSN), and adding an attention mechanism (REMSNet). Thanks to the Attention-Residual block and adversarial training, the AREANs provided by the invention obtain excellent results in terms of overall accuracy and F1 score. AREANs-ResNeSt performs best: its OA and average F1 values are 1.73% and 3.78% higher than those of SegNet, and 3.11% and 4.91% higher than those of UNet, respectively, which shows the effectiveness of the method provided by the invention.

The AREANs framework adopted by the invention focuses on extracting useful features, including global context information and local position information. The adversarial training method implicitly learns high-order structural information in the training stage and refines the prediction results in the test stage without extra time cost (as shown in Table 4).

TABLE 4

Partial prediction results of the different methods on the Vaihingen dataset are shown in fig. 8, in which the first row shows six original images, the second row shows the truth label corresponding to each original image, and the remaining rows show the classification results obtained by the different algorithms. It can be seen that ground objects with different colors, sizes and textures lead to large intra-class differences and small inter-class differences in the images, which makes the image classification task very difficult. Specifically, cars differ in color, which causes large intra-class differences; dark cars are very similar to their shadows on the road; and objects with similar textures are easily misclassified, such as buildings and impervious surfaces in the third column of fig. 8, and low vegetation and trees in the fifth and sixth columns of fig. 8. In addition, at the edge of an image, an object may not be extracted because its correlation with the surrounding environment is missing, as in the second column of the figure. FCN, UNet and SegNet perform poorly; the main reason for their misclassifications is that the network cannot effectively reuse features, which may lead to a lack of useful context information. Although DeepLabv3+ and PSPNet employ ASPP modules and pyramid pooling modules, respectively, they are not effective at pixel-level classification of high-resolution aerial images. Although SPNet introduces strip pooling and mixed pooling with an attention mechanism, it does not solve well the misclassification caused by class imbalance. In contrast, the AREANs architecture proposed by the invention outperforms the other comparison methods through the AR block and the adversarial training strategy. In addition, the generator of the AREANs architecture is optimized with the multi-path fusion Focal Loss function, which reduces the influence of the class imbalance problem to a certain extent.

In addition, the classification method of the invention is compared with some of the convolutional neural network models published on the ISPRS benchmark (details can be found at https://www2.isprs.org/communities/comm2/wg4/results/). These models mainly include:

SVL_X: Provided by the organizers of the ISPRS 2D Semantic Labeling Contest as a reference for comparison by the participants. NDVI, nDSM and SVL features are used together, and an AdaBoost classifier and a CRF model are adopted to obtain the final prediction result. In the following experiments, SVL_1 indicates that CRF post-processing was applied, and SVL_3 indicates that no CRF post-processing was applied to the prediction results.

RIT_L7: The model uses a random tree algorithm to extract structural information from the images and labels, and an FCN uses the extracted structural information for pixel-level classification. IR-R-G three-channel images and DSM data are selected as training data.

DLR_8: The model uses ensemble learning and introduces edge detection by integrating feature information from FCN, SegNet and VGG at different scales, which improves the accuracy of the segmentation results. IR-R-G three-channel images and DSM data are selected as training data.

UZ_1: The model adopts an encoder-decoder structure, learns the spatial information of the input data through the encoder, and restores the feature information using a deconvolution structure in the decoder; nDSM data are also added as training data.

DST_X: A hybrid FCN structure is adopted, with images and DSM as training data, and CRF is used for post-processing. In the experiments, the model trained on the Vaihingen dataset is called DST_2, and the model trained on the Potsdam dataset is called DST_5.

BKHN_X: A hybrid structure of FCN and ResNet-101 is adopted, with DSM and nDSM as training data in addition to the input images. In the experiments, the model trained on the Vaihingen dataset is called BKHN_5, and the model trained on the Potsdam dataset is called BKHN_3.

GSN: An FCN structure with a gating mechanism is adopted, with ResNet-101 as the encoder.

CASIA2: The model adopts a single self-cascaded network structure, and the encoder is a ResNet-101 variant. Only three-channel IRRG images are used, without the elevation data (DSM and nDSM) of the ISPRS 2D Semantic Labeling Contest and without any post-processing, which is the same as for the AREANs proposed here.

ADL_3: The method combines a CNN with hand-crafted features to achieve pixel-level classification of dense image blocks. A random forest classifier is trained on the hand-crafted features and combined with the CNN to obtain a preliminary prediction for the image; the prediction result is finally refined using a CRF.

ONE_7: The method fuses the prediction results of SegNets at two scales, and uses IRRG images together with NDVI (Normalized Difference Vegetation Index), DSM and nDSM data as training data.

BUCTY5: A tree-shaped CNN structure combined with a pruning algorithm is used, and the network is trained with IRRGB and DSM data together.

The results of the quantitative comparison of the present invention (AREANs for short) with the above methods on the ISPRS Vaihingen test set are shown in Table 5, and the results of the quantitative comparison on the ISPRS Potsdam test set are shown in Table 6.

TABLE 5

TABLE 6

Tables 5 and 6 show the results of the quantitative comparison of the AREANs proposed by the present invention with other methods published in the ISPRS 2D Semantic Labeling Contest. Overall, AREANs achieve excellent performance on both datasets: AREANs-ResNeSt achieves OA values of 91.3% and 91.9% on the Vaihingen and Potsdam datasets, respectively. Moreover, the recognition rate for small targets (the "vehicle" class) is greatly improved compared with the other methods on the benchmark, with F1 scores of 90.5% and 97.0% on the Vaihingen and Potsdam datasets, respectively.

Because high-resolution remote sensing images contain a large amount of complex texture and structure information, choosing a deeper and stronger model as the feature extraction network has become one of the ways to improve overall model performance. Following this idea, GSN, CASIA2 and BKHN adopt ResNet-101, a pre-trained network that performs well on natural image datasets, as the encoder for feature extraction from high-resolution remote sensing images. This design is mainly based on the following reasons: fine-tuning from a pre-trained model can improve the generalization ability of the network, whereas a randomly initialized network may pay more attention to the spectral information of the image targets and neglect their semantic information, which reduces the generalization ability of the network. The classification accuracy of these three methods is therefore better than that of the other methods; in particular, the classification accuracy of CASIA2 and BKHN is only slightly lower than that of the invention.

Within the AREANs architecture provided by the invention, AREANs-ResNeSt is the best on all indexes on both datasets, AREANs-ResNet is slightly inferior to AREANs-ResNeSt, and AREANs-VGG is the weakest, although it still has certain advantages over the other methods on the benchmark. Thanks to the adversarial learning training strategy, and compared with DST_X, ONE_7 and DLR_8, the framework provided by the invention effectively improves classification accuracy while using only the generator at test time, without increasing time consumption or the number of parameters. In addition, a pre-trained CNN is adopted in the encoder, which avoids the overfitting caused by strong correlation among remote sensing image blocks. To improve the stability of generative adversarial network training, the TTUR strategy is used, which reduces the training difficulty. The method provided by the invention does not use any additional data (such as DSM, nDSM, NDVI or hand-crafted features), does not adopt additional classifiers or post-processing steps (such as the CRF in RIT_L7, DST_X and ADL_3), and does not use model ensembling to improve the classification accuracy of the network.

FIGS. 9-a to 9-i and FIGS. 10-a to 10-1 show the classification results for one image randomly selected from each of the two datasets. It can be seen that embedding the Attention-Residual block effectively improves the model's ability to perceive position and semantic information, and that combining it with the pre-trained model yields better classification results.

Claims

1. A same-domain remote sensing image classification method based on adversarial learning, characterized by comprising the following steps:

acquiring remote sensing image data to be classified, and inputting the remote sensing image data into a trained classification model for classification; the classification model comprises a generator and a discriminator, wherein the generator adopts an encoder-decoder structure and is used for obtaining a pixel classification prediction result of the remote sensing image; the discriminator adopts a convolutional neural network and is used for distinguishing a real label from a prediction result generated by the generator by acquiring the high-order consistency between the real label and the prediction result; the generator and the discriminator are trained by adversarial training.

2. The same-domain remote sensing image classification method based on adversarial learning according to claim 1, wherein the encoder adopts a feature extraction network for mapping the input remote sensing image data to a high-dimensional feature space; the decoder adopts a multilayer convolutional neural network and comprises a plurality of convolutional neural networks with different depths, and a residual part, an up-sampling part and an attention enhancing part are arranged in each convolutional neural network with different depths; the output of the decoder is used as the input of the encoder, and meanwhile, the decoder is connected with the corresponding layers in the encoder so as to fuse the low-layer feature position information and the high-layer feature semantic information.

3. The same-domain remote sensing image classification method based on adversarial learning according to claim 2, wherein the residual part comprises two convolution modules, the input end and the output end of the residual part are connected across layers, and the input end of the residual part is used for receiving the concatenated features of the previous-layer convolutional neural network and the encoder; the input end of the up-sampling part is used for receiving the output of the residual part and restoring the feature map processed by the residual part to the size of the corresponding high-order feature map.

4. The same-domain remote sensing image classification method based on adversarial learning according to claim 2, wherein the attention enhancing part comprises a semantic information enhancement module and a position information enhancement module which operate in parallel, the inputs of the semantic information enhancement module and the position information enhancement module both being the feature maps output by the up-sampling part; the semantic information enhancement module is used for operating on the channel dimension of the input feature maps, and the relationship of specific semantic information between the high-order feature maps and the low-order feature maps is modeled by utilizing the correlation between the high-order channels and the low-order channels; the position information enhancement module is used for establishing the position information correlation between the local features of the input feature map and other neighborhoods.

5. The same-domain remote sensing image classification method based on adversarial learning according to claim 4, wherein the processing procedure of the semantic information enhancement module is as follows:

determining the weight of each channel dimension by applying linear transformations and activation functions to the obtained statistical information of the channel dimension, wherein the weight is calculated as follows:

[formula not reproduced in the source text]

where g_c denotes the feature vector obtained through global average pooling; two weight symbols (not reproduced in the source text) respectively denote the weights of the 1 × 1 convolutional layers; ReLU denotes the ReLU function; σ denotes the Sigmoid operation; and C denotes the total number of categories.
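By way of non-limiting illustration, the channel weighting described in claim 5 can be sketched as below. Because the formula itself is not reproduced in the text, the sketch assumes the widely used form in which global average pooling is followed by a first linear map with ReLU and a second linear map with Sigmoid; the function name `channel_weights` and the matrices `w1` and `w2` are hypothetical.

```python
import math

def channel_weights(feature_map, w1, w2):
    """Per-channel weights in (0, 1) (assumed pooling -> ReLU -> Sigmoid form).

    feature_map: list of channels, each a 2-D list (rows x columns)
    w1: hidden x channels matrix (weights of the first 1 x 1 convolution)
    w2: channels x hidden matrix (weights of the second 1 x 1 convolution)
    """
    # Global average pooling: one statistic per channel.
    g = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # First linear map followed by ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, g))) for row in w1]
    # Second linear map followed by Sigmoid.
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]

# Two channels of size 1 x 2 with hypothetical weights.
fm = [[[1.0, 3.0]], [[2.0, 2.0]]]
print(channel_weights(fm, w1=[[0.5, 0.5]], w2=[[1.0], [-1.0]]))
```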

6. The same-domain remote sensing image classification method based on adversarial learning according to claim 4, wherein the enhanced feature map obtained by the position information enhancement module is:

v^{i,j} = h_{i,j} \cdot x^{i,j}

q = W_s * x

where v = [v^{1,1}, v^{1,2}, …, v^{W,H}] denotes the enhanced feature map; x = [x^{1,1}, x^{1,2}, …, x^{W,H}] denotes the input feature map; W_s denotes the mapping matrix of a 1 × 1 convolution operation; q denotes the weight map obtained by mapping x through W_s; and h_{i,j} is q_{i,j} scaled to [0, 1] by the Sigmoid function, representing the importance of the position information at position (i, j) in the feature map.
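By way of non-limiting illustration, the position weighting of claim 6 can be sketched as follows. The sketch assumes that the 1 × 1 convolution W_s maps all channels at a position to a single value without a bias term; the function name `position_enhance` is hypothetical.

```python
import math

def position_enhance(x, w_s):
    """Scale every position of x by a Sigmoid-normalized weight.

    x:   list of channels, each a 2-D list (rows x columns)
    w_s: one weight per channel, i.e. a 1 x 1 convolution to a single map
         (single output channel and no bias are assumptions)
    """
    n_rows, n_cols = len(x[0]), len(x[0][0])
    v = [[[0.0] * n_cols for _ in range(n_rows)] for _ in x]
    for i in range(n_rows):
        for j in range(n_cols):
            q = sum(w * ch[i][j] for w, ch in zip(w_s, x))  # q = W_s * x
            h = 1.0 / (1.0 + math.exp(-q))                  # Sigmoid scaling
            for c, ch in enumerate(x):
                v[c][i][j] = h * ch[i][j]                   # v = h * x
    return v

# Two channels of size 1 x 2 with hypothetical weights.
print(position_enhance([[[1.0, 0.0]], [[2.0, 0.0]]], w_s=[0.0, 0.0]))
```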

7. The same-domain remote sensing image classification method based on adversarial learning according to claim 2, wherein the loss function adopted by the generator in the training process is as follows:

[formula not reproduced in the source text]

where one term is used to reduce the performance of the discriminator; the other term is the multi-path fusion Focal Loss function, used to generate a correct classification prediction for each pixel of the input image; θ_G denotes the parameters of the generator G; the generator loss is a linear combination of these two terms; and D(·) denotes the discriminator's judgment of whether its input x is a predicted value G(X^(n)) from the generator or a truth label Y^(n).

8. The same-domain remote sensing image classification method based on adversarial learning according to claim 1 or 2, wherein the discriminator is formed by connecting 8 convolutional layers in series, the convolution kernels of the convolutional layers are 4 × 4 in size, the stride of the last convolutional layer is not 1, and the strides of the first to seventh convolutional layers are all 2; the first convolutional layer employs a ReLU activation function, and the remaining convolutional layers employ a LeakyReLU activation function.

9. The same-domain remote sensing image classification method based on adversarial learning according to claim 7, wherein the loss function of the discriminator is defined as follows:

[formula not reproduced in the source text]

where θ_D denotes the parameters of the discriminator D; the loss is a binary cross-entropy loss; D(·) denotes the discriminator's judgment of whether its input x is a predicted value G(X^(n)) from the generator or a truth label Y^(n); y denotes the one-hot code of a certain class in the truth label; and x denotes the prediction result of a certain class generated by the generator.
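By way of non-limiting illustration, a binary cross-entropy discriminator loss can be sketched as below. Since the formula is not reproduced in the text, the sketch assumes the usual convention that truth labels are assigned target 1 and generator predictions target 0, with the loss averaged over samples; the function names `bce` and `discriminator_loss` are hypothetical.

```python
import math

def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability in [0, 1] and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(d_on_truth, d_on_pred):
    """Mean loss over both kinds of input (target convention is an assumption).

    d_on_truth: discriminator outputs for truth labels Y (target 1)
    d_on_pred:  discriminator outputs for generator predictions G(X) (target 0)
    """
    real = sum(bce(p, 1) for p in d_on_truth)
    fake = sum(bce(p, 0) for p in d_on_pred)
    return (real + fake) / (len(d_on_truth) + len(d_on_pred))

print(discriminator_loss([0.9], [0.1]), discriminator_loss([0.5], [0.5]))
```

A discriminator that cannot tell the two inputs apart outputs 0.5 for both, and the loss then equals ln 2.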

10. The same-domain remote sensing image classification method based on adversarial learning according to claim 9, wherein the discriminator parameters θ_D and the generator parameters θ_G are updated step by step: first, the generator parameters θ_G are fixed and the discriminator parameters θ_D are updated, so that the discriminator can distinguish the prediction results; then the discriminator parameters θ_D are fixed and the generator parameters θ_G are updated, so that the generator generates prediction results whose authenticity the discriminator cannot distinguish.