CN116503636A - Multi-mode remote sensing image classification method based on self-supervision pre-training - Google Patents

Multi-mode remote sensing image classification method based on self-supervision pre-training

Info

Publication number
CN116503636A
CN116503636A (application CN202211551345.7A)
Authority
CN
China
Prior art keywords
remote sensing
mode
training
self
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211551345.7A
Other languages
Chinese (zh)
Inventor
薛志祥
周嘉男
张鹏强
魏祥坡
李亚萍
谭熊
刘冰
余岸竹
郭迎钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN202211551345.7A priority Critical patent/CN116503636A/en
Publication of CN116503636A publication Critical patent/CN116503636A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-mode remote sensing image classification method based on self-supervision pre-training, and belongs to the technical field of remote sensing image classification processing. The invention pre-trains a self-supervised learning model on unlabeled multi-modal remote sensing images to obtain a trained encoder and cross-attention module, takes them as a multi-modal feature extractor to build a multi-modal remote sensing image classification model, fine-tunes the classification model with a small number of labeled multi-modal remote sensing images, and inputs the multi-modal remote sensing images to be classified into the fine-tuned model to realize remote sensing image classification. The method does not need a large number of labeled images: through pre-training, the encoder learns the key features of the remote sensing images, while the cross-attention module realizes information interaction among the multi-modal features, so that the data of all modalities are fully utilized and the classification accuracy of the images is improved.

Description

Multi-mode remote sensing image classification method based on self-supervision pre-training
Technical Field
The invention relates to a multi-mode remote sensing image classification method based on self-supervision pre-training, and belongs to the technical field of remote sensing image classification processing.
Background
With the continuous development of remote sensing technology, a large number of multi-modal remote sensing images are available for understanding the Earth environment. Passive remote sensing can obtain information on ground objects at high spatial and spectral resolution, so the resulting hyperspectral images (HSI), multispectral images (MSI) and very-high-resolution (VHR) images contain rich information and can be used to interpret observed scenes. Active remote sensing monitors targets by transmitting signals and receiving their returns; synthetic aperture radar (SAR) and airborne laser scanning (LiDAR) data record specific electromagnetic characteristics of the measured objects. These multi-modal remote sensing images contain both general and modality-specific information about the observed scene, and analyzing them jointly for land cover classification is very challenging.
Supervised classification is the most common approach to land cover classification. Early studies focused mainly on image analysis methods such as band selection, feature extraction and classifier design. Owing to the strong feature extraction capability of deep learning models, typical deep networks have been used for image classification, especially hyperspectral image classification. Convolutional neural networks (CNNs) can extract local deep features for recognition tasks, and 1D, 2D and 3D CNNs have been applied to land cover classification. Recurrent neural networks (RNNs) classify the sequence information contained in hyperspectral images, and recently Transformers have been studied to explore long-range dependencies and further improve classification performance. Besides these model-driven approaches, hyperspectral image classification has also adopted several machine learning strategies such as attention mechanisms, knowledge distillation, multi-scale learning and sparse representation. For collaborative classification of multi-modal remote sensing images, there are hyperspectral and LiDAR data classification, hyperspectral and multispectral image classification, and collaborative hyperspectral and SAR image classification. These supervised methods are data-driven, and their accuracy depends to a large extent on the number of training samples. However, labeling high-quality samples is laborious, so the scarcity of labeled samples remains an urgent problem to be solved in the field of remote sensing image classification.
The semi-supervised learning paradigm provides a viable solution to the small-sample classification problem by using both labeled and unlabeled samples. Graph-based learning is a typical semi-supervised classification model for remote sensing data, where graph convolutional neural networks and generative adversarial networks can extract deep features from labeled and unlabeled data for land cover classification, but they have shortcomings in sample generation and in classifying large images. Another important research direction for addressing the scarcity of annotated samples is the few-shot learning paradigm, but its pre-training is supervised and therefore still requires a large number of annotated samples for feature learning.
Although current supervised and semi-supervised models achieve remarkable performance, these classification schemes still fail to address the most prominent problem: the large amount of unlabeled multi-modal data versus the limited number of labeled samples. As a novel learning paradigm, self-supervised learning exploits the inherent characteristics of unlabeled data to learn salient features and uses the learned representations for downstream recognition tasks. Depending on the learning objective of the designed pretext task, self-supervised learning can be categorized into contrastive and generative methods. Contrastive learning aims to learn a latent space in which similar sample pairs are clustered together and dissimilar pairs are separated, thereby learning invariant and discriminative feature representations. Contrastive self-supervised learning typically employs a CNN as the basic feature extractor to extract high-level abstract features, which has limitations in capturing long-range dependencies and in processing multi-modal data.
Generative self-supervision achieves feature learning by recovering artificially corrupted data: once the model can recover the original signal from the corrupted data, it has learned the key features that characterize that signal. Generative self-supervised pre-training has been very successful in natural language processing but lags behind in vision, mainly because images are natural signals with a large amount of spatial redundancy, which makes high-level key information difficult to learn from them. In remote sensing in particular, multi-modal images of the same observed scene exist, and the information in these heterogeneous images is rich and complementary; however, current generative self-supervised models cannot fully exploit heterogeneous images, and their multi-modal feature learning capability is limited, resulting in low final accuracy.
Disclosure of Invention
The invention aims to provide a multi-mode remote sensing image classification method based on self-supervised pre-training, so as to solve the problem that existing self-supervised models cannot fully utilize heterogeneous images, which results in low classification accuracy.
To solve this technical problem, the invention provides a multi-mode remote sensing image classification method based on self-supervision pre-training, which comprises the following steps:
1) Acquiring remote sensing images of at least two modalities, partitioning the image of each modality into regular non-overlapping blocks, randomly masking the resulting image blocks to obtain the embedded feature information of the unmasked blocks, and recording the features of the masked blocks (see the sketch after this list);
2) Inputting the obtained embedded feature information of the unmasked image blocks into an encoder, which learns from the input embedded feature information, and exchanging the results among the modality features through a cross-attention module;
3) Inputting the learning result of the cross-attention module and the features of the masked image blocks into a decoder, which reconstructs the masked blocks of each modality's remote sensing image; taking the difference between the reconstructed blocks and the corresponding masked blocks as the loss, and training the encoder and the cross-attention module with this loss;
4) Taking the trained encoder and cross-attention module as a multi-modal feature extractor, constructing a lightweight classifier, taking the feature extractor and the lightweight classifier as the multi-modal remote sensing image classification model, and performing fine-tuning training on the classification model;
5) Inputting each modality's remote sensing image to be classified into the fine-tuned classification model, thereby realizing multi-modal remote sensing image classification.
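As an illustration of the partitioning and random masking in step 1), the following minimal PyTorch sketch (PyTorch is the framework used in the experiments below) splits one modality into regular non-overlapping blocks and randomly masks a subset. The tensor shapes, the 0.75 mask ratio and the PatchEmbed/random_mask names are illustrative assumptions, not the patent's code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into regular non-overlapping blocks and embed them."""
    def __init__(self, patch_size=4, in_chans=244, dim=128):
        super().__init__()
        # a strided convolution is an efficient per-patch linear projection
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)

def random_mask(tokens, mask_ratio=0.75):
    """Randomly mask a subset of patch tokens at ratio mask_ratio."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N).argsort(dim=1)  # a random permutation per sample
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep, ids_mask             # features of unmasked/masked blocks

x = torch.randn(2, 244, 32, 32)                    # e.g. two 244-band hyperspectral crops
visible, ids_keep, ids_mask = random_mask(PatchEmbed()(x))
```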
The invention pre-trains a self-supervised learning model on unlabeled multi-modal remote sensing images to obtain a trained encoder and cross-attention module, takes them as a multi-modal feature extractor to build a multi-modal remote sensing image classification model, fine-tunes the classification model with a small number of labeled multi-modal remote sensing images, and inputs the multi-modal remote sensing images to be classified into the fine-tuned model to realize remote sensing image classification. The method does not need a large number of labeled images: through pre-training, the encoder learns the key features of the remote sensing images, while the cross-attention module realizes information interaction among the multi-modal features, so that the data of all modalities are fully utilized and the classification accuracy of the images is improved.
Further, the fine tuning in step 4) uses the mean square error between the masked image blocks and the reconstructed blocks as the loss function.
During pre-training, the invention computes the reconstruction loss only on the masked image blocks and the corresponding reconstructed blocks, which preserves training accuracy while avoiding the heavy computation of evaluating all image blocks.
Further, the encoder and decoder adopt a Transformer structure; the encoder comprises a position encoding module and a plurality of Transformer structures; the decoder comprises N decoding units, where N is the number of modalities of the remote sensing images, and each decoding unit comprises two Transformer structures and a multi-layer perceptron.
Further, the Transformer structure comprises a multi-layer perceptron, a layer normalization module and multi-head attention.
The invention constructs the Transformer structure from a multi-layer perceptron, a layer normalization module and multi-head attention.
Further, the cross-attention module adopts a multi-head cross-attention mechanism.
After the feature learning of the encoder, the invention uses the cross-attention module to exchange information among heterogeneous features to enhance the representation capability of the learned features, and realizes cross-fusion among different modality features through a multi-head cross-attention (MCA) mechanism.
Further, the lightweight classifier is a support vector machine classifier.
The invention adopts the support vector machine classifier as the lightweight classifier, and can quickly and efficiently realize classification.
Drawings
FIG. 1 is a flow chart of a multi-modal remote sensing image classification method based on self-supervision pre-training in accordance with the present invention;
FIG. 2 is a network structure diagram of a self-supervised learning model constructed in accordance with the present invention;
FIG. 3 is a schematic diagram of a multi-modal self-supervising pre-training and fine tuning scheme employed by the present invention;
FIG. 4 is a schematic diagram of experimental data used in the experimental verification process;
FIG. 5 is a comparison of the classification maps of the present invention and existing classification methods on the benchmark dataset.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
The invention pre-trains a self-supervised learning model composed of an asymmetric encoder-decoder. The model first uses the encoder to learn the high-level general information contained in the multi-modal data, exchanges information among the modality features through the cross-attention module to learn more complementary information from the multi-modal remote sensing data, and finally uses the decoder for reconstruction, thereby training the encoder and the cross-attention module. The trained encoder and cross-attention module are then taken as a multi-modal feature extractor; a lightweight classifier is constructed, the feature extractor and the classifier together form the multi-modal remote sensing image classification model, and the classification model is fine-tuned. Finally, each modality's remote sensing image to be classified is input into the fine-tuned classification model, realizing multi-modal remote sensing image classification. The implementation principle of the method is shown in Fig. 1 and is described in detail below with a specific example.
1. And constructing a self-supervision learning model.
As shown in fig. 2, the self-supervised learning model constructed by the invention employs an asymmetric encoder-decoder structure, and both the encoder and the decoder adopt the Transformer structure. The Transformer structure comprises a layer normalization module (LayerNorm), residual connections, a multi-layer perceptron (MLP) and multi-head attention (MHA). Multi-head attention constructs multiple subspaces from the multi-modal tokens to learn complex dependencies, the residual connections comprehensively utilize features of different stages and ease the training of the model, and layer normalization normalizes the extracted features. Multi-head attention is formulated as:
MHA(Q, K, V) = Concat(h_1, h_2, …, h_h) W^O
where Q_i, K_i and V_i denote the query, key and value of the i-th head, respectively, and h is the number of heads.
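A compact sketch of this Transformer structure: the MHA formula above inside a pre-norm block with layer normalization, residual connections and an MLP. The embedding dimension, head count and MLP ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MHA(Q, K, V) = Concat(h_1, ..., h_h) W^O with h parallel heads."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)         # joint Q/K/V projection
        self.w_o = nn.Linear(dim, dim)             # the output projection W^O

    def forward(self, x):                          # x: (B, N, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]           # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.w_o(out)

class TransformerBlock(nn.Module):
    """LayerNorm -> MHA -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MultiHeadAttention(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))           # residual connection around MHA
        return x + self.mlp(self.norm2(x))         # residual connection around MLP
```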
The encoder comprises a position encoding module and a plurality of Transformer structures. The input of the Transformer encoder is the visible multi-modal tokens; a class token, which represents the information of the corresponding modality data, is added to each modality to learn general information from the multi-modal data, and relative position encoding represents the positional information of the features. T_v denotes the stacked embedded features of the unmasked image sub-blocks of the remote sensing images. After the embedding operation, the visible embedded features T_v are input to the encoder for learning, which can be expressed as:
T_e = Encoder(T_v)
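Reusing the TransformerBlock sketch above, an encoder of this form might look as follows. The depth and the learnable absolute position table are simplifying assumptions; the patent itself specifies relative position encoding.

```python
class Encoder(nn.Module):
    """T_e = Encoder(T_v): prepend a class token, add position encoding,
    then apply stacked Transformer blocks."""
    def __init__(self, dim=128, depth=6, heads=8, max_tokens=256):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, max_tokens + 1, dim))  # absolute, for brevity
        self.blocks = nn.ModuleList(TransformerBlock(dim, heads) for _ in range(depth))

    def forward(self, t_v):                        # t_v: (B, N_visible, dim)
        cls = self.cls_token.expand(t_v.size(0), -1, -1)
        x = torch.cat([cls, t_v], dim=1) + self.pos[:, : t_v.size(1) + 1]
        for blk in self.blocks:
            x = blk(x)
        return x                                   # T_e; x[:, 0] is the class token
```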
To enrich the learned multi-modal features, a cross-attention module exchanges information after the encoder. As shown in Fig. 2, it adopts a multi-head cross-attention (MCA) structure, and when the features of the three modalities are fused, the features of any two modalities are fused pairwise by cross attention: the class token of one modality serves as an agent and is concatenated with the patch tokens of another modality. The MCA operation on the agent and the concatenated tokens can be expressed as:
MCA(Q, K, V) = softmax(QK^T / √C) V
where Q, K and V represent the query, key and value in the cross-attention operation, and C is the embedded feature dimension. Pairs of heterogeneous features are integrated in a content-aware manner through the cross-attention layer, thereby enhancing the representation capability.
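A sketch of this pairwise exchange, continuing the imports above: the class token of modality i acts as the agent (query) against its concatenation with the patch tokens of modality j. Layer sizes are assumptions.

```python
class MultiHeadCrossAttention(nn.Module):
    """MCA between two modalities: Q from one class token, K/V from the
    concatenation of that token with the other modality's patch tokens."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.w_o = nn.Linear(dim, dim)

    def forward(self, cls_i, tokens_j):            # cls_i: (B, 1, dim); tokens_j: (B, N, dim)
        B, N, D = tokens_j.shape
        h, d = self.heads, D // self.heads
        q = self.q(cls_i).reshape(B, 1, h, d).transpose(1, 2)              # (B, h, 1, d)
        kv = self.kv(torch.cat([cls_i, tokens_j], dim=1))                  # agent + patches
        kv = kv.reshape(B, N + 1, 2, h, d).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                                                # (B, h, N+1, d)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)      # softmax(QK^T/sqrt(C))
        out = (attn @ v).transpose(1, 2).reshape(B, 1, D)
        return self.w_o(out)                       # updated class token of modality i
```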
The decoder reconstructs each specific modality from the multi-modal features and the mask tokens. It comprises N decoding units, where N is the number of modalities of the remote sensing images; each decoding unit contains two Transformer structures and a multi-layer perceptron. The input of the i-th decoding unit is the combination of the unmasked embedded features T_e^i and the mask features T_m^i, with position embeddings providing the location information, which can be expressed as:
D_i = DecodingUnit_i([T_e^i; T_m^i] + PE)
where T_e^i and T_m^i denote the visible and mask feature embeddings of the i-th modality data, respectively. At the last layer of each decoding unit, a prediction head reconstructs the image blocks in pixel space; a fully connected layer serves as the prediction head and projects the features into vectors of the same dimension as the input patches. A prediction block P_i is then obtained through a matrix reshaping operation, which can be expressed as:
P_i = Reshape(FC(D_i))
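One decoding unit might be sketched as follows, reusing TransformerBlock and the ids_keep/ids_mask indices from the masking sketch: trainable mask tokens fill the masked positions, position embeddings supply location information, and a fully connected prediction head projects back to pixel space. The 4 × 4 × 244 patch geometry is an illustrative assumption.

```python
class DecodingUnit(nn.Module):
    """Per-modality decoder: mask tokens as placeholders, two Transformer
    structures, and an FC prediction head into pixel space."""
    def __init__(self, dim=128, patch_pixels=4 * 4 * 244, max_tokens=256):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # trainable placeholder
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, dim))
        self.blocks = nn.Sequential(TransformerBlock(dim), TransformerBlock(dim))
        self.head = nn.Linear(dim, patch_pixels)                 # prediction head (FC layer)

    def forward(self, visible, ids_keep, ids_mask):
        B, _, D = visible.shape
        N = ids_keep.size(1) + ids_mask.size(1)
        x = self.mask_token.expand(B, N, D).clone()              # start from placeholders
        x.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, D), visible)
        x = self.blocks(x + self.pos[:, :N])                     # add location information
        return self.head(x)                                      # P_i: one pixel vector per block
```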
for the present embodiment, the acquired multi-modal data refers to images of three modes, i.e., HIS (hyperspectral image), DSM (digital earth model image) and VHR (high resolution image), and as another embodiment, images of other modes may be selected. The image data of the three modes contains rich and complementary information, can be subjected to self-supervision feature learning, and needs to be preprocessed before self-learning.
The pretreatment process is as follows: taking each pixel in the image as a center, cutting out a local image with the size of H multiplied by W as a self-supervision pre-training processing unit, and enablingRepresenting a sample representing ith modality data, wherein H, W, B i Representing the height, width and depth of the sample, respectively; dividing the samples of each modality into regular non-overlapping blocks, randomly masking a subset of these blocks at a ratio of m, embedding a trainable linear projection into the unmasked patch, and concatenating the embedded markers of all modalities as input to a unified encoder, denoted T v The method comprises the steps of carrying out a first treatment on the surface of the Each mask mark is represented by a trainable vector which is used as a placeholder for the decoder, denoted T m
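The pixel-centered cropping can be sketched as below; the odd window size and the reflect padding are assumptions, since the patent only specifies an H × W neighborhood around each pixel.

```python
import torch
import torch.nn.functional as F

def extract_patches(image, window=27):
    """Crop a window x window neighborhood around every pixel of a (C, rows, cols)
    image; returns (rows * cols, C, window, window). Materializing all crops at
    once is memory-hungry, so real pipelines usually index them lazily per batch."""
    pad = window // 2
    padded = F.pad(image.unsqueeze(0), (pad, pad, pad, pad), mode="reflect")
    patches = padded.unfold(2, window, 1).unfold(3, window, 1)   # all stride-1 windows
    C = image.size(0)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, window, window)

crops = extract_patches(torch.randn(244, 64, 64))                # (4096, 244, 27, 27)
```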
2. And pre-training the constructed self-supervision learning model.
Similar to the masked autoencoder (MAE), the invention normalizes the original blocks and the reconstructed blocks and calculates the mean square error (MSE) between them; the reconstruction loss, which is the sum of the errors of all modalities, is computed only on the masked blocks and the corresponding reconstructed blocks. The constructed self-supervised learning model is pre-trained on this total reconstruction loss, and no labeled training samples are needed during pre-training.
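A sketch of this loss under the conventions of the masking sketch above: the target blocks are normalized per block, and the MSE is averaged only over the masked positions; the caller sums the result over the N modalities.

```python
import torch

def masked_mse_loss(pred, target, ids_mask):
    """pred, target: (B, N, patch_pixels); ids_mask: (B, n_masked) indices."""
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + 1e-6).sqrt()       # normalize each block
    loss = ((pred - target) ** 2).mean(dim=-1)           # per-block MSE, shape (B, N)
    mask = torch.zeros_like(loss)
    mask.scatter_(1, ids_mask, 1.0)                      # 1 only on masked blocks
    return (loss * mask).sum() / mask.sum()

# total pre-training loss: the sum of each modality's reconstruction error
# total_loss = sum(masked_mse_loss(p, t, ids) for (p, t, ids) in per_modality_outputs)
```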
3. And constructing a multi-mode remote sensing image classification model by using the pre-trained encoder and the cross attention module.
After the pre-training process, the trained encoder and cross-attention module are taken as the multi-modal feature extractor, and the multi-modal remote sensing image classification model is constructed from this feature extractor and a lightweight classifier; in this embodiment, the lightweight classifier is a support vector machine (SVM). Because the self-supervised pre-training process converts a local remote sensing image block into a feature vector, other classifiers, such as a nearest-neighbor classification model, may also be adopted. As shown in Fig. 3, the trained encoder and cross-attention layer extract the features of the input multi-modal remote sensing images; the class tokens of the multi-modal features are normalized, concatenated with the corresponding spectral information, and input into the SVM for classification. The classification model up to the classifier is fine-tuned with a small number of labeled samples, after which the fine-tuned model realizes multi-modal remote sensing image classification.
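The classification stage can be sketched with scikit-learn as follows, assuming the frozen extractor has already produced one class-token vector per labeled sample; the feature names and SVM hyperparameters are assumptions.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def build_features(class_tokens, center_spectra):
    """class_tokens: (n, d) concatenated multi-modal class tokens;
    center_spectra: (n, b) spectrum of each sample's center pixel."""
    return np.concatenate([normalize(class_tokens), center_spectra], axis=1)

svm = SVC(kernel="rbf", C=10.0)      # hyperparameters here are assumptions
# svm.fit(build_features(tok_labeled, spec_labeled), y_labeled)   # few labeled samples
# y_pred = svm.predict(build_features(tok_scene, spec_scene))     # whole scene
```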
Because the features learned by the encoder and the cross-attention module in the pre-training stage are highly discriminative, only a small number of labeled samples are needed in the fine-tuning stage to train the lightweight classifier, which alleviates the pressing demand for annotated samples and removes the reliance on large labeled datasets.
Experiment verification
To further verify the effect of the invention, the proposed model was implemented in the PyTorch framework using the Python programming language. The main classification evaluation coefficients, namely overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ), were used to quantitatively evaluate classification performance, and classification maps were used to evaluate the experimental results qualitatively.
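For reference, the three coefficients can be computed from a confusion matrix as follows; this mirrors their standard definitions rather than any patent-specific code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                    # overall accuracy (OA)
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # average per-class accuracy (AA)
    kappa = cohen_kappa_score(y_true, y_pred)       # chance-corrected agreement (kappa)
    return oa, aa, kappa
```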
The Berlin dataset was selected as the experimental data. It comprises hyperspectral and PolSAR data of the Berlin area; the hyperspectral data consist of 244 bands covering a spectral range of 400 nm to 2500 nm, with a spatial size of 380 × 2384 pixels and a spatial resolution of 30 m. The imaged area contains 8 distinguishable land-cover classes, as shown in Fig. 4.
To evaluate the classification performance of the proposed method, it was compared with common classification methods, including supervised deep learning models (SVM, CDCNN, SSUN and SSRN), semi-supervised learning models (TSVM and CEGCN) and few-shot learning methods (DFSL and DMVL). To verify the classification accuracy of the different models under small-sample conditions, 20 samples per land-cover class were selected for training in the experiment.
The average OA, AA and kappa coefficients with the corresponding variances, and the per-class classification accuracies, of the invention (MultiSSL) and of each compared method are shown in Table 1.
TABLE 1
Table 1 reports the detailed comparative results of the different classification methods in terms of average OA, AA and kappa and the accuracy of each class, together with the root mean square errors of OA, AA and kappa. From this quantitative comparison it can be concluded that:
Semi-supervised and self-supervised methods perform better than supervised methods when classifying with a small number of labeled samples. In supervised classification, a small number of labeled samples leads to overfitting, which severely affects classification performance. Semi-supervised learning uses labeled and unlabeled samples simultaneously, and its classification accuracy is higher than that of the supervised models. Self-supervised learning learns key feature representations in a designed pretext task and effectively improves classification performance; DMVL in particular, based on the contrastive learning paradigm, can learn meaningful features from unlabeled data and achieves higher classification accuracy than the other compared methods.
The invention adopts a multi-modal self-supervised pre-training and fine-tuning scheme. With the same number of labeled samples, it achieves the best classification performance on the main evaluation coefficients and the individual class accuracies, with an overall accuracy of 70.34% on the benchmark dataset. On the Berlin dataset selected for the experiment, the existing learning paradigms achieve only low classification accuracy with a small number of labeled samples, whereas the proposed method achieves higher accuracy on this complex scene: the proposed model performs self-supervised pre-training on multi-modal remote sensing data, learning the key features for the subsequent classification task and effectively alleviating the heavy dependence on annotated samples.
Fig. 5 shows the classification maps obtained by the different methods on the benchmark dataset; from left to right they are GT, SVM, CDCNN, SSUN, SSRN, TSVM, CEGCN, 3DCAE, DFSL, DMVL and finally MultiSSL (the invention), with different land objects represented by different colors. It can be observed that the semi-supervised and self-supervised learning methods produce more homogeneous classification maps, and the maps obtained by the proposed model have fewer noisy pixels and more accurate details in the enlarged views, indicating that the MultiSSL method can learn more discriminative features from the multi-modal remote sensing images.
The invention constructs a self-supervised learning model from an asymmetric encoder-decoder, pre-trains it with unlabeled multi-modal remote sensing images to obtain a trained encoder and cross-attention module, takes them as a multi-modal feature extractor, combines the feature extractor with a lightweight classifier into a multi-modal remote sensing image classification model, and fine-tunes the classification model with a small number of labeled multi-modal remote sensing images. Comprehensive experiments on the multi-modal benchmark dataset further demonstrate the effectiveness and superiority of the proposed classification method, providing a feasible solution to the small-sample classification problem.

Claims (6)

1. A multi-mode remote sensing image classification method based on self-supervision pre-training is characterized by comprising the following steps:
1) Acquiring remote sensing images of at least two modalities, partitioning the image of each modality into regular non-overlapping blocks, randomly masking the resulting image blocks to obtain the embedded feature information of the unmasked blocks, and recording the features of the masked blocks;
2) Inputting the obtained embedded feature information of the unmasked image blocks into an encoder, which learns from the input embedded feature information, and exchanging the results among the modality features through a cross-attention module;
3) Inputting the learning result of the cross-attention module and the features of the masked image blocks into a decoder, which reconstructs the masked blocks of each modality's remote sensing image; taking the difference between the reconstructed blocks and the corresponding masked blocks as the loss, and training the encoder and the cross-attention module with this loss;
4) Taking the trained encoder and cross-attention module as a multi-modal feature extractor, constructing a lightweight classifier, taking the feature extractor and the lightweight classifier as the multi-modal remote sensing image classification model, and performing fine-tuning training on the classification model;
5) Inputting each modality's remote sensing image to be classified into the fine-tuned classification model, thereby realizing multi-modal remote sensing image classification.
2. The multi-mode remote sensing image classification method based on self-supervision pre-training according to claim 1, wherein the fine tuning in step 4) uses the mean square error between the masked image blocks and the reconstructed blocks as the loss function.
3. The multi-mode remote sensing image classification method based on self-supervision pre-training according to claim 1 or 2, wherein the encoder and the decoder adopt a Transformer structure, the encoder comprising a position encoding module and a plurality of Transformer structures, and the decoder comprising N decoding units, where N is the number of modalities of the remote sensing images, each decoding unit comprising two Transformer structures and a multi-layer perceptron.
4. The multi-mode remote sensing image classification method based on self-supervision pre-training according to claim 3, wherein the Transformer structure comprises a multi-layer perceptron, a layer normalization module and multi-head attention.
5. The multi-modal remote sensing image classification method based on self-supervision pre-training according to claim 1, wherein the cross-attention module adopts a multi-head cross-attention mechanism.
6. The method for classifying multi-modal remote sensing images based on self-supervision pre-training according to claim 1, wherein the lightweight classifier is a support vector machine classifier.
CN202211551345.7A 2022-12-05 2022-12-05 Multi-mode remote sensing image classification method based on self-supervision pre-training Pending CN116503636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211551345.7A CN116503636A (en) 2022-12-05 2022-12-05 Multi-mode remote sensing image classification method based on self-supervision pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211551345.7A CN116503636A (en) 2022-12-05 2022-12-05 Multi-mode remote sensing image classification method based on self-supervision pre-training

Publications (1)

Publication Number Publication Date
CN116503636A true CN116503636A (en) 2023-07-28

Family

ID=87319016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211551345.7A Pending CN116503636A (en) 2022-12-05 2022-12-05 Multi-mode remote sensing image classification method based on self-supervision pre-training

Country Status (1)

Country Link
CN (1) CN116503636A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095309A (en) * 2023-10-20 2023-11-21 武汉工程大学 Polarized SAR image rotation domain feature expression extraction and classification method
CN117095309B (en) * 2023-10-20 2024-01-16 武汉工程大学 Polarized SAR image rotation domain feature expression extraction and classification method

Similar Documents

Publication Publication Date Title
Zhong et al. Generalizing a person retrieval model hetero-and homogeneously
Yang et al. A survey of DNN methods for blind image quality assessment
Nandhini Abirami et al. Deep CNN and deep GAN in computational visual perception-driven image analysis
Dong et al. Local information enhanced graph-transformer for hyperspectral image change detection with limited training samples
Shi et al. Image manipulation detection and localization based on the dual-domain convolutional neural networks
CN113936339A (en) Fighting identification method and device based on double-channel cross attention mechanism
CN110046579B (en) Deep Hash pedestrian re-identification method
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
Xu et al. Dktnet: dual-key transformer network for small object detection
CN115222994A (en) Hyperspectral image classification method based on hybrid spectrum network and multi-head self-attention mechanism
Xue et al. Grafting transformer on automatically designed convolutional neural network for hyperspectral image classification
Anwer et al. Compact deep color features for remote sensing scene classification
CN116503636A (en) Multi-mode remote sensing image classification method based on self-supervision pre-training
Khurshid et al. A residual-dyad encoder discriminator network for remote sensing image matching
CN114708455A (en) Hyperspectral image and LiDAR data collaborative classification method
Li et al. Anomaly detection methods based on GAN: a survey
Xu et al. Spatial-Spectral 1DSwin Transformer with Group-wise Feature Tokenization for Hyperspectral Image Classification
Zhang et al. Temporal transformer networks with self-supervision for action recognition
Kumari et al. Deep learning techniques for remote sensing image scene classification: A comprehensive review, current challenges, and future directions
CN117036904A (en) Attention-guided semi-supervised corn hyperspectral image data expansion method
Salmi et al. Low complexity image enhancement GAN-based algorithm for improving low-resolution image crop disease recognition and diagnosis
CN116612004A (en) Double-path fusion-based hyperspectral image reconstruction method
Aziz et al. Multi-level refinement feature pyramid network for scale imbalance object detection
Juefei-Xu et al. DeepGender2: A generative approach toward occlusion and low-resolution robust facial gender classification via progressively trained attention shift convolutional neural networks (PTAS-CNN) and deep convolutional generative adversarial networks (DCGAN)
Fondje et al. Learning domain and pose invariance for thermal-to-visible face recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination