CN113011427B - Remote sensing image semantic segmentation method based on self-supervision contrast learning - Google Patents

Remote sensing image semantic segmentation method based on self-supervision contrast learning

Info

Publication number
CN113011427B
CN113011427B (application CN202110285256.1A; publication of application CN113011427A)
Authority
CN
China
Prior art keywords
local
semantic segmentation
training
matching
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110285256.1A
Other languages
Chinese (zh)
Other versions
CN113011427A (en)
Inventor
李海峰
李益
李朋龙
丁忆
马泽忠
张泽烈
胡艳
肖禾
陶超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Geographic Information And Remote Sensing Application Center
Central South University
Original Assignee
Chongqing Geographic Information And Remote Sensing Application Center
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Geographic Information And Remote Sensing Application Center, Central South University filed Critical Chongqing Geographic Information And Remote Sensing Application Center
Priority to CN202110285256.1A priority Critical patent/CN113011427B/en
Publication of CN113011427A publication Critical patent/CN113011427A/en
Priority to AU2021103625A priority patent/AU2021103625A4/en
Application granted granted Critical
Publication of CN113011427B publication Critical patent/CN113011427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image semantic segmentation method based on self-supervised contrastive learning, which comprises the following steps: constructing a semantic segmentation network model (such as Deeplab v3+); pre-training the encoder of the network model with unlabeled data; after pre-training is finished, performing supervised semantic segmentation training of the network model on labeled samples; and performing semantic segmentation of remote sensing images with the network model obtained from the supervised semantic segmentation training. During pre-training, contrastive learning is carried out by combining global style contrast with local matching contrast. The invention applies contrastive self-supervised learning to remote sensing semantic segmentation data sets and provides a global style and local matching contrastive learning framework, so that the resulting semantic segmentation method has a wider application range and a better segmentation effect.

Description

Remote sensing image semantic segmentation method based on self-supervision contrast learning
Technical Field
The invention relates to the technical field of remote sensing image semantic segmentation, and in particular to a remote sensing image semantic segmentation method based on self-supervised contrastive learning.
Background
With the development of remote sensing technology, high-resolution remote sensing images have become easier to obtain, and such images are increasingly used in urban planning, disaster monitoring, environmental protection, traffic and tourism, among other areas. Extracting and identifying the information in remote sensing images is generally the basis of all of these applications, and semantic segmentation, the technology of identifying and classifying every pixel of an image, is therefore an important and challenging research direction in the remote sensing field.
In recent years, with the development of deep learning, remote sensing image semantic segmentation has achieved impressive results and is increasingly applied to global land cover mapping, urban built-up area identification and similar tasks. However, the success of existing deep learning techniques depends heavily on large numbers of high-quality labeled samples. Because of the high labeling cost of the semantic segmentation task and the huge spatial and temporal heterogeneity of remote sensing imagery, the existing labeled data cover only a small slice of remote sensing imagery and cannot meet the requirements of sample diversity and richness.
For the problem of insufficient labeled samples, a common approach is data augmentation to generate more samples; this improves the robustness of the model to a certain extent and can be used in the general training process, but its effect is limited. Other studies attempt to exploit other labeled data through pre-training or transfer learning: model parameters trained on larger data sets, or on data sets more relevant to the current task, are transferred into the task at hand instead of random initialization (see the short illustration below), which can greatly reduce training time and to some extent make up for the shortage of data.
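For illustration only, and assuming torchvision is available (the invention itself does not prescribe a particular library), loading ImageNet-pretrained parameters instead of random initialization might look like this:

```python
import torchvision

# Random initialization (no pre-training).
random_backbone = torchvision.models.resnet50(weights=None)

# Reuse parameters learned on a larger data set (ImageNet) as the starting point.
pretrained_backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
)
```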
In fact, although large numbers of labels are not available, image data of extremely high diversity and richness are available all over the world, so fully and effectively exploiting these data is critical. Semi-supervised learning is one approach, training on a large amount of unlabeled data together with a small amount of labeled data. Self-supervised learning provides another paradigm: it does not depend on any labeled data, but designs a supervisory signal directly from the image data to guide learning. It thus avoids the problems of the supervised paradigm and can be expected to learn potentially more general knowledge, which is then transferred to specific downstream tasks.
Depending on the design of the self-supervision signal, current self-supervised learning can be broadly divided into three categories: context based, temporal based and contrast based. Recent work has shown that methods based on contrastive learning can achieve superior performance (Chen T., Kornblith S., Norouzi M., et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020). Contrastive learning constructs a representation by learning the similarity or dissimilarity of two things; its core idea is that the feature expressions of positive samples should be similar, while the feature expressions of negative samples should be dissimilar. The intuition behind the good performance of contrastive methods is that the features of different transformations of the same image should be similar to each other and dissimilar to the features of other images, and a suitable network is trained accordingly.
However, most existing contrastive learning is instance-level contrast: a single global feature is extracted from the whole image and then discriminated. This shows good performance on natural image classification data sets, where a single image is relatively pure in category or has one prominent category. In contrast, the ground object distribution within a single cropped remote sensing image may be rich, so extracting only a global feature from the whole image and discriminating it, as in the original instance-level contrastive learning methods, loses much information. Moreover, semantic segmentation differs from classification: classification only requires image-level discrimination, whereas semantic segmentation is pixel-level classification and different parts within the same image must be distinguished.
Disclosure of Invention
In view of the above, the present invention aims to learn features directly from unlabeled images to help a downstream semantic segmentation task that has only a small number of labels, while also addressing the problems that the categories within a single image are not pure and that the semantic segmentation task requires local discrimination.
In order to achieve the purpose, the invention adopts the following technical scheme:
the remote sensing image semantic segmentation method based on the self-supervision contrast learning comprises the following steps:
step 1, constructing a Deeplab v3+ network model;
step 2, pre-training the encoder of the network model with unlabeled data;
step 3, after pre-training is finished, performing supervised semantic segmentation training of the network model on labeled samples;
step 4, performing semantic segmentation of remote sensing images with the network model obtained from the supervised semantic segmentation training.
During pre-training, contrastive learning is carried out by combining global style contrast with local matching contrast, comprising the following steps:
step 201, performing random data transformation on the unlabeled data: for a given sample x_i, apply two random data transformations t'(x_i) and t''(x_i), producing two related instances x'_i and x''_i that are taken as a positive sample pair, where t' denotes random cropping and scaling, and t'' denotes, in sequence, random cropping and scaling, random flipping, random rotation, random color distortion and random Gaussian blur;
step 202, using the encoder e(·) of the Deeplab v3+ network model to extract global style features from the transformed sample instances: stylef'_i = stylef(x'_i) = cat(μ(e(x'_i)), σ(e(x'_i))), where stylef'_i denotes the global style feature, μ denotes taking the mean of each channel of the feature map, i.e. global average pooling, σ denotes taking the variance of each channel, and cat denotes channel concatenation;
step 203, processing the global style feature with a projection head g(·), where the projection head is a multi-layer perceptron with one hidden layer:
z'_i = g(stylef'_i) = W^(2) r(W^(1) stylef'_i),
wherein W^(2) denotes the second fully connected layer, W^(1) denotes the first fully connected layer, r denotes the ReLU activation function, and z'_i denotes the global style feature after processing by the projection head;
step 204, using the encoder e(·) and decoder d(·) of the Deeplab v3+ network model, extracting the feature maps d(e(x'_i)) and d(e(x''_i)) from the transformed sample instances x'_i and x''_i, and obtaining from d(e(x'_i)) and d(e(x''_i)) the features corresponding to a plurality of matched local regions, where p'_j and p''_j are the feature maps corresponding to a matched local pair; the feature maps are then globally average pooled to obtain local feature vectors, namely:
f_L(p'_j) = μ(p'_j)
wherein f_L(p'_j) is a local feature vector;
step 205, processing the local feature vectors with a projection head g_L(·), where the projection head is a multi-layer perceptron with one hidden layer:
u'_j = g_L(f_L(p'_j)) = W^(4) r(W^(3) f_L(p'_j)),
wherein W^(4) denotes the fourth fully connected layer, W^(3) denotes the third fully connected layer, and u'_j denotes the local matching feature after processing by the projection head;
step 206, training the encoder using an overall loss function, the overall loss function consisting of global style contrast loss and local matching contrast loss:
L = (1 - λ)·l_G + λ·l_L
wherein λ is an adjustable weight parameter, l_G denotes the global style contrast loss, and l_L denotes the local matching contrast loss.
Preferably, for N samples from the same batch, the global style contrast loss is defined as follows:
l_G = (1 / 2N) · Σ_{i=1}^{N} [ l(x'_i, x''_i) + l(x''_i, x'_i) ]
wherein:
l(x'_i, x''_i) = −log { exp( sim(z'_i, z''_i) / τ ) / [ exp( sim(z'_i, z''_i) / τ ) + Σ_{x ∈ Λ^−} exp( sim(z'_i, g(style(x))) / τ ) ] }
where sim () represents the similarity between the computed feature vectors, Λ-2(N-1) negative samples except for the positive sample pair are represented, and tau represents a temperature parameter; style () represents extracting a global style feature vector from the encoder-extracted features by computing the mean and variance,
stylef(x'_i) = stylef'_i = cat(μ(e(x'_i)), σ(e(x'_i)))
for N samples from the same batch, the local match contrast loss is defined as follows:
l_L = (1 / 2N_L) · Σ_{j=1}^{N_L} [ l(p'_j, p''_j) + l(p''_j, p'_j) ]
wherein:
l(p'_j, p''_j) = −log { exp( sim(u'_j, u''_j) / τ ) / [ exp( sim(u'_j, u''_j) / τ ) + Σ_{p ∈ Λ^−} exp( sim(u'_j, g_L(f_L(p))) / τ ) ] }
where N_L denotes the number of all local regions selected from the N samples of the same batch, and Λ^− is the set of feature maps corresponding to all local regions other than the matched one.
In the process of calculating the local matching contrast loss, the local regions are first selected and matched, then the local features of the corresponding local regions are extracted, and finally the local matching contrast loss is calculated;
the selection and matching of the local regions comprises: for a given sample x_i, after the random data transformations t'(x_i) and t''(x_i), two transformed versions are generated, from which a plurality of local regions are randomly selected and matched; an index tag is introduced to record pixel positions, ensuring that the center positions of matched local regions correspond to each other in the original image; a local region is randomly selected from x'_i and the index value of its center position is obtained, then the position of the matched local region in x''_i is determined from this index value; excessive overlap between local regions is avoided by excluding each local region after it has been selected, ensuring that the centers of subsequently selected local regions do not fall inside already selected ones; this step is repeated multiple times to obtain a plurality of matched local regions;
the local feature extraction comprises: using the encoder and decoder parts of the Deeplab v3+ network model, the feature maps d(e(x'_i)) and d(e(x''_i)) are extracted from the transformed sample instances x'_i and x''_i; following the idea used in the selection and matching of local regions, the features corresponding to a plurality of matched local regions are obtained from d(e(x'_i)) and d(e(x''_i)): let p'_j be selected from d(e(x'_i)) and p''_j from d(e(x''_i)), so that p'_j and p''_j are the feature maps of a matched local pair; the feature maps are then globally average pooled to obtain local feature vectors;
the local matching contrast loss updates the Deeplab v3+ network model by making the feature representations of matched local regions similar while the feature expressions of non-matched local regions in the same batch are dissimilar.
Preferably, in the pre-training stage, Adam is used as the optimizer, the weight decay is set to 1e-5, the initial learning rate is 0.01, a cosine decay strategy is used, and the model with the lowest loss is selected for the downstream task; in the fine-tuning stage, Adam is used as the optimizer, the number of epochs is 150, the batch_size is 16, and the initial learning rate is 0.001.
The invention has the beneficial effects that:
(1) the invention applies contrastive self-supervised learning to remote sensing semantic segmentation data sets, and can learn features directly from unlabeled images to help a downstream semantic segmentation task that has only a small number of labels;
(2) aiming at the problems that the categories within a single image are not pure and that the semantic segmentation task requires local discrimination, a global style and local matching contrastive learning framework is provided, so that the semantic segmentation effect is better.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a block diagram of the combined global style contrast and local matching contrast framework proposed by the present invention;
FIG. 3 is a schematic diagram of local region selection and matching according to the present invention;
FIG. 4 is a graph comparing the results of examples of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
As shown in FIG. 1, the remote sensing image semantic segmentation method based on self-supervised contrastive learning comprises the following steps:
step 1, constructing a Deeplab v3+ network model;
step 2, pre-training the encoder of the network model with unlabeled data;
step 3, after pre-training is finished, performing supervised semantic segmentation training of the network model on labeled samples;
step 4, performing semantic segmentation of remote sensing images with the network model obtained from the supervised semantic segmentation training.
Contrastive learning constructs a representation by learning the similarity or dissimilarity of two things; its core idea is that the feature expressions of positive samples should be similar while the feature expressions of negative samples should be dissimilar. Considering that the ground object distribution in a single cropped remote sensing image may be rich, and that extracting only an image-level representation for discrimination inevitably loses much detail, the invention proposes global style and local matching contrastive learning for the remote sensing semantic segmentation task. The overall framework is shown in FIG. 2 and mainly consists of two modules: 1) the global style contrast module addresses the problem that the global average pooled features used by existing contrastive learning to characterize a sample cannot adequately represent the overall characteristics of an image; it introduces style features that can represent these overall characteristics, helping the model learn better image-level features;
2) the local feature matching contrast module considers that the ground object categories within a single image of a semantic segmentation data set are rich, so extracting only global features loses much detail information; moreover, an image-level representation may be suboptimal for a semantic segmentation task that requires pixel-level discrimination, which calls for local (pixel) level discrimination.
During pre-training, contrastive learning is carried out by combining global style contrast with local matching contrast, comprising the following steps:
step 201, performing random data transformation on the unlabeled data: for a given sample x_i, apply two random data transformations t'(x_i) and t''(x_i), producing two related instances x'_i and x''_i that are taken as a positive sample pair, where t' denotes random cropping and scaling, and t'' denotes, in sequence, random cropping and scaling, random flipping, random rotation, random color distortion and random Gaussian blur. In order to prompt the model to learn general spatio-temporal invariance, this embodiment learns spatial invariance through spatial transformations such as random cropping and scaling, flipping and rotation, and expects to learn temporal invariance by simulating changes across time phases with color distortion, Gaussian blur, random noise and the like, as sketched below.
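A minimal sketch of the two transformation pipelines, assuming torchvision; the crop size, jitter strengths, rotation range and blur kernel below are illustrative choices, not values fixed by the invention:

```python
import torchvision.transforms as T

# t': random cropping and scaling only.
t_prime = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),
    T.ToTensor(),
])

# t'': random cropping and scaling, flipping, rotation, color distortion
# and Gaussian blur, applied in sequence.
t_double_prime = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=90),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    T.ToTensor(),
])

# x1, x2 = t_prime(img), t_double_prime(img)   # a positive sample pair
```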
Step 202, extracting global style features from the transformed sample instance by using an encoder e (-) in the Deeplab v3+ network model: styrene ef'i=stylef(x′i)=cat(μ(e(x′i)),σ(e(x′i) Of these, styref'iRepresenting global style features, μ represents the averaging of each channel in the feature map, i.e. global mean pooling, and σ represents the average of each channelCalculating variance, and cat represents channel splicing;
step 203, processing the global style feature with a projection head g(·), where the projection head is a multi-layer perceptron with one hidden layer:
z'_i = g(stylef'_i) = W^(2) r(W^(1) stylef'_i),
wherein W^(2) denotes the second fully connected layer, W^(1) denotes the first fully connected layer, and r denotes the ReLU activation function;
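A sketch of such a projection head; the hidden and output widths are assumed values, and the bias terms kept by nn.Linear are an implementation convenience not written in the formula:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """g(.): a multi-layer perceptron with one hidden layer, z = W2 * ReLU(W1 * stylef)."""
    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # W1
            nn.ReLU(inplace=True),           # r
            nn.Linear(hidden_dim, out_dim),  # W2
        )

    def forward(self, stylef):
        return self.net(stylef)
```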
step 204, using the encoder and decoder of the Deeplab v3+ network model, extracting the feature maps d(e(x'_i)) and d(e(x''_i)) from the transformed sample instances x'_i and x''_i, and obtaining from d(e(x'_i)) and d(e(x''_i)) the features corresponding to a plurality of matched local regions, where p'_j and p''_j are the feature maps corresponding to a matched local pair; the feature maps are then globally average pooled to obtain local feature vectors, namely:
f_L(p'_j) = μ(p'_j)
step 205, processing the local features with a projection head g_L(·), where the projection head is a multi-layer perceptron with one hidden layer:
u'_j = g_L(f_L(p'_j)) = W^(4) r(W^(3) f_L(p'_j)),
wherein W^(4) denotes the fourth fully connected layer and W^(3) denotes the third fully connected layer;
step 206, training the encoder using an overall loss function, the overall loss function consisting of global style contrast loss and local matching contrast loss:
L = (1 - λ)·l_G + λ·l_L
where λ is an adjustable weight parameter, l_G denotes the global style contrast loss, and l_L denotes the local matching contrast loss.
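Putting steps 201 to 206 together, one pre-training iteration can be sketched as below. The function reuses style_feature from the sketch above; extract_matched_locals, the two projection heads and the two loss functions are placeholders for the components described in this section, and the default weight of 0.5 for λ is only an example of the adjustable parameter:

```python
def pretrain_step(x_batch, t_prime, t_double_prime, encoder, decoder,
                  g_style, g_local, global_style_loss, local_match_loss,
                  extract_matched_locals, lam=0.5):
    """One self-supervised iteration combining global style and local matching contrast."""
    x1, x2 = t_prime(x_batch), t_double_prime(x_batch)               # step 201: positive pair

    f1, f2 = encoder(x1), encoder(x2)                                # step 202: encoder features
    z1, z2 = g_style(style_feature(f1)), g_style(style_feature(f2))  # step 203: projected style features
    l_g = global_style_loss(z1, z2)

    d1, d2 = decoder(f1), decoder(f2)                                # step 204: decoded feature maps
    p1, p2 = extract_matched_locals(d1, d2)                          # matched local feature maps p'_j, p''_j
    u1 = g_local(p1.mean(dim=(-2, -1)))                              # step 205: pooled + projected locals
    u2 = g_local(p2.mean(dim=(-2, -1)))
    l_l = local_match_loss(u1, u2)

    return (1.0 - lam) * l_g + lam * l_l                             # step 206: L = (1-λ)l_G + λl_L
```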
It is known in the prior art that the per-channel mean and variance extracted in a convolutional neural network can represent the style of a picture; therefore, in the invention, global style feature vectors are extracted from the encoder features by computing the mean and variance, as shown in the formula:
stylef'_i = stylef(x'_i) = cat(μ(e(x'_i)), σ(e(x'_i)))
therefore, for N samples from the same batch, the global style contrast loss is defined as follows:
l_G = (1 / 2N) · Σ_{i=1}^{N} [ l(x'_i, x''_i) + l(x''_i, x'_i) ]
l(x'_i, x''_i) = −log { exp( sim(z'_i, z''_i) / τ ) / [ exp( sim(z'_i, z''_i) / τ ) + Σ_{x ∈ Λ^−} exp( sim(z'_i, g(style(x))) / τ ) ] }
where sim () denotes computing the similarity between the feature vectors, Λ-2(N-1) negative samples except for the positive sample pair are represented, and tau represents a temperature parameter; style () represents the global style feature vector extracted from the encoder by computing the mean and variance of the features: styryl ef (x'i)=stylef′i=cat(μ(e(x′i)),σ(e(x′i))。
In an actually cropped image, the categories of a single image are not necessarily pure and may be fairly rich; if only a whole-image representation is extracted for measurement and discrimination, much information is inevitably lost. Moreover, unlike the image classification task, which only needs to distinguish whole images, the semantic segmentation task is pixel-level classification, and different parts within a single image also need to be distinguished. The local matching contrast therefore mainly comprises the following parts:
(1) local region selection and matching
As shown in fig. 3, for a given sample x_i, two transformed versions are generated by the random data transformations t'(x_i) and t''(x_i), and a plurality of local regions are randomly selected and matched from them. Because operations such as random scaling and rotation are carried out during the data transformation, pixel positions no longer correspond directly, so an index tag is introduced to record pixel positions, ensuring that the center positions of matched local regions correspond to each other in the original image. Specifically, a local region is first randomly selected from x'_i and the index value of its center position is obtained; the position of the matched local region in x''_i is then determined from this index value. To avoid excessive overlap between local regions, each local region is excluded after it has been selected, so that the centers of subsequently selected local regions do not fall inside already selected ones. This step is repeated multiple times to obtain a plurality of matched local regions.
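One way to realize the index-tag bookkeeping described above is to transform a coordinate grid together with the image, so that every pixel of a transformed view remembers its position in the original image, and then to match local centers by looking up shared original-image indices. The sketch below assumes such per-view index maps are available; the number of locals, the half window size and the overlap rule are illustrative:

```python
import random
import torch

def select_matching_locals(idx1, idx2, num_locals=4, half_size=8):
    """Pick matching local-region centers in two transformed views.

    idx1, idx2: (H, W) long tensors; idx*[r, c] is the flattened original-image
    index of pixel (r, c) in each view (obtained by transforming a coordinate
    grid together with the image). Returns a list of ((r1, c1), (r2, c2)) pairs.
    """
    pairs, used = [], []
    h, w = idx1.shape
    candidates = [(r, c) for r in range(half_size, h - half_size)
                  for c in range(half_size, w - half_size)]
    random.shuffle(candidates)
    for r1, c1 in candidates:
        if len(pairs) == num_locals:
            break
        # avoid excessive overlap: skip centers falling inside an already selected local
        if any(abs(r1 - r) <= half_size and abs(c1 - c) <= half_size for r, c in used):
            continue
        hits = (idx2 == idx1[r1, c1]).nonzero(as_tuple=False)   # same original pixel in view 2
        if len(hits) == 0:
            continue
        r2, c2 = hits[0].tolist()
        if not (half_size <= r2 < h - half_size and half_size <= c2 < w - half_size):
            continue
        pairs.append(((r1, c1), (r2, c2)))
        used.append((r1, c1))
    return pairs
```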
(2) Local feature extraction
Using a complete encoder-decoder network, the feature maps d(e(x'_i)) and d(e(x''_i)) are extracted from the transformed sample instances x'_i and x''_i; e and d can be the encoding and decoding parts of any semantic segmentation network, and in this embodiment e and d correspond to the encoding and decoding parts of Deeplab v3+, respectively. From d(e(x'_i)) and d(e(x''_i)), the features corresponding to a plurality of matched local regions are obtained; for example, p'_j is selected from d(e(x'_i)) and p''_j from d(e(x''_i)), so that p'_j and p''_j are the feature maps of a matched local pair. Global average pooling is then applied to obtain local feature vectors, namely:
f_L(p'_j) = μ(p'_j)
(3) Local matching contrast loss
The local contrast loss updates the complete semantic segmentation network by making the feature representations of matched local regions similar while the feature expressions of non-matched local regions in the same batch are dissimilar. For N samples from the same batch, the local matching contrast loss is defined as follows:
l_L = (1 / 2N_L) · Σ_{j=1}^{N_L} [ l(p'_j, p''_j) + l(p''_j, p'_j) ]
l(p'_j, p''_j) = −log { exp( sim(u'_j, u''_j) / τ ) / [ exp( sim(u'_j, u''_j) / τ ) + Σ_{p ∈ Λ^−} exp( sim(u'_j, g_L(f_L(p))) / τ ) ] }
where N_L denotes the number of all local regions selected from the N samples of the same batch, Λ^− is the set of feature maps corresponding to all local regions other than the matched one, and g_L(·), like g(·), is a projection head.
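Under these definitions the local matching contrast loss has the same form as the global style contrast loss, only applied to the projected features of the N_L matched local regions drawn from the batch; a minimal sketch that simply reuses the function from the global case:

```python
def local_match_contrast_loss(u1, u2, tau=0.5):
    """u1, u2: (N_L, D) projected features of the matched local regions p'_j and p''_j.
    Row j of u1 and row j of u2 are a positive pair; every other local region in the
    same batch serves as a negative, mirroring the global style contrast loss."""
    return global_style_contrast_loss(u1, u2, tau)
```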
To illustrate the effectiveness of the method, experiments were carried out on four data sets, as shown in Table 1. The ISPRS Potsdam Dataset and the DeepGlobe Land Cover Classification Dataset are public data sets whose labeling quality is relatively high; the Hubei Dataset and the Xiangtan Dataset are real land cover classification data sets whose image resolution and classification systems are basically consistent, which is convenient for subsequently studying the influence of domain differences on self-supervision. The labeling quality of the Hubei data set is uneven and the image acquisition time does not match the labeling time, whereas the labels of the Xiangtan data set have been roughly corrected manually, so their quality is relatively high.
Table 1 data set description
[Table 1 not reproduced: provided as an image in the original document]
The ISPRS Potsdam data set consists of 38 high-resolution aerial remote sensing images collected over the city of Potsdam, Germany. The images are 6000 × 6000 pixels with a spatial resolution of 5 cm and four bands (NIR, R, G, B); the labels are manually annotated pixel-level labels of high quality, with 6 categories: impervious surfaces, buildings, low vegetation, trees, cars and others. To train and evaluate the network, 24 images were selected as the training set and the remaining 14 as the test set. The data set was slide-cropped into 256 × 256 patches, resulting in 13824 images for self-supervised training; 138 of them were randomly selected as labeled training data for the downstream task, and the downstream test set contains 8064 images.
The DeepGlobe Land Cover Classification Challenge provides high-resolution (sub-meter) satellite images of 2448 × 2448 pixels, mainly covering rural areas, with a total of 8 categories: towns, rural areas, agricultural areas, pastures, forests, water bodies, wastelands and unknown land (clouds and others). In the experiments, 730 images were selected as the training set and 73 as the test set. The data set was slide-cropped into 512 × 512 image blocks, finally yielding 18248 images for self-supervised training, with a default downstream training set of 182 images and a test set of 1825 images.
The Hubei data set comes from a real project. The image data are from the Gaofen-2 satellite with a resolution of 2 m, and the labels come from existing data that were manually merged into 10 classes: background, arable land, towns, rural areas, water, forest land, grassland, other structures, traffic facilities and others. The labeling time does not necessarily correspond to the image acquisition time, the quality of the labeled data is not uniform, and the class definitions used during merging are questionable, so the data quality is not high. The whole of Hubei Province was split into panels of 13889 × 9259 pixels; due to limited resources, only 34 of them were randomly selected for training and 5 for testing. The data set was slide-cropped into 256 × 256 image blocks, finally producing 66471 images for self-supervision, with a default downstream training set of 664 images and a test set of 9211 images.
The Xiangtan data set also comes from a real project. The image data are from the Gaofen-2 satellite with a resolution of 2 m, covering Xiangtan City in Hunan Province, China, and the labels were manually merged into 8 classes: background, arable land, towns, rural areas, water, forest land, grassland and traffic facilities. Because the labels of this region were roughly corrected manually against images of the same year during production, the quality of the Xiangtan data is higher than that of Hubei. The whole of Xiangtan City was divided into panels of 4096 × 4096 pixels, 85 of which were randomly selected for training and 21 for testing. The data set was slide-cropped into 256 × 256 image blocks, yielding 16051 images for self-supervision, with a default downstream training set of 160 images and a test set of 3815 images.
The methods used for comparison are as follows. Random baseline: no pre-trained model is loaded in the fine-tuning stage and the network is randomly initialized. ImageNet pre-training: the backbone of the fine-tuning stage model is initialized with a model pre-trained on ImageNet. Jigsaw, Inpainting, MoCo v2 and SimCLR are not described in detail here.
The merits of the proposed self-supervision method need to be evaluated on a specific downstream semantic segmentation task. Specifically, the overall accuracy on the test set of the downstream labeled data is measured with OA and Kappa, where OA denotes the overall accuracy over all pixels, defined as follows:
OA = TP / N
wherein TP represents the total number of pixels predicted correctly, and N represents the total number of pixels.
Although OA directly reflects the proportion of correctly classified pixels overall, when the samples are unbalanced some classes may still be classified very poorly even when OA is high. Kappa, as an index of consistency, reflects this situation well and is defined as follows:
Kappa = (p_o − p_e) / (1 − p_e)
wherein,
p_o = OA,  p_e = ( Σ_c a_c·b_c ) / N²
where a_c denotes the number of ground-truth pixels of class c, b_c denotes the number of pixels predicted as class c, and N is the total number of pixels.
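The two metrics can be computed from per-class pixel counts, for example as follows; the variable names tp, a and b (the correctly predicted pixels, the ground-truth counts a_c and the predicted counts b_c) are assumptions about how the confusion statistics are stored:

```python
import numpy as np

def overall_accuracy(tp: int, n: int) -> float:
    """OA = TP / N."""
    return tp / n

def kappa(tp: int, a: np.ndarray, b: np.ndarray) -> float:
    """Kappa from class-wise ground-truth counts a_c and predicted counts b_c."""
    n = a.sum()
    p_o = tp / n                    # observed agreement, equal to OA
    p_e = (a * b).sum() / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)
```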
During pre-training the model adopts Deeplab v3+; the baseline methods such as SimCLR are designed to train only the encoder part. Adam is used as the optimizer, the weight decay is set to 1e-5, the initial learning rate is 0.01, and a cosine decay strategy is used to train 400 epochs; the checkpoint with the lowest loss is selected for the downstream task. The batch_size is 64. The input image size is 256 × 256, and images are randomly cropped and resized to 224 × 224.
although our method can train the decoding part of the network at the same time during the self-supervised training, since the simCLR method for comparison is only designed for training the encoder, as in the following experimental results, as not specifically stated, only the encoder part of the pre-training model is loaded during the fine tuning, then the supervised semantic segmentation training is performed on a small number of labeled samples, Adam is used as an optimizer, the epoch number is 150, the batch _ size is 16, the initial learning rate of 0.001, and each epoch decays to 0.98.
The fine-tuning effect of the proposed method on a small number of labeled samples was explored on several data sets, with the amount of labeled samples used in fine-tuning set to 1% of the self-supervision data amount. The results are shown in Table 2. It can be seen that different self-supervision schemes have a great influence on the results, and the method of the invention achieves the best effect. In addition, the method mostly exceeds the results obtained by loading ImageNet pre-trained parameters. The ImageNet parameters were obtained through supervised training on millions of images, whereas the amount of self-supervision data in these experiments is mostly only about 20,000; this shows that although loading an ImageNet pre-trained model brings a large improvement, it is not the optimal choice because of the huge difference between natural images and remote sensing images, and it is more reasonable to train a strong model directly from unlabeled remote sensing images. Furthermore, it should be noted that in the experiments the images used for self-supervision are similar to the images of the downstream tasks, both coming from the same data set; such an arrangement is realistic in practice, since a large number of images of the same origin can easily be obtained by satellite technology.
Table 2 Comparison of methods on the four data sets
[Table 2 not reproduced: provided as an image in the original document]
Since the self-supervision task needs no labels and a large amount of rich image data is available, the experiments also explore whether increasing the amount of self-supervision data brings gains. The experiments were carried out on the Potsdam and Xiangtan data sets, randomly sampling 20%, 50% and 100% of the self-supervision data respectively. The results are shown in FIG. 4, where None indicates that no self-supervised pre-training parameters are loaded and the network parameters are directly randomly initialized. On both data sets the results show an overall rising trend as the amount of self-supervision data increases, and the improvement of our method is relatively more obvious than that of the SimCLR baseline; self-supervised training with an even larger data set is therefore expected to be even more meaningful.
TABLE 3 Effect of self-supervised training of different domain datasets on results
[Table 3 not reproduced: provided as an image in the original document]
In addition, we compared the performance of models pre-trained with data sets from different domains; the results are shown in Table 3. It can be seen that a pre-trained model performs better when the data set used for pre-training is similar to the data set of the downstream task. At the same time, our method surpasses supervised learning in most cases, except when the domain difference is extremely small (HuBei → XiangTan, XiangTan → HuBei), mainly because the two domains not only share the same image resolution but are also physically close, and above all their classification systems are completely consistent, so it is difficult at present to exceed the accuracy of supervised learning there. Although model performance improves further as the amount of self-supervised training data increases, the experiments also show that mixing in images dissimilar to the downstream data set may fail to improve, or may even harm, model performance; however, since the self-supervision task needs no labels, a large amount of image data similar to the target data set can be obtained.
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent replacement or change made by a person skilled in the art according to the technical solutions and inventive concepts of the present invention shall fall within the scope of protection of the present invention.

Claims (4)

1. The remote sensing image semantic segmentation method based on the self-supervision contrast learning is characterized by comprising the following steps of:
step 1, constructing a Deeplab v3+ network model;
step 2, pre-training the encoder of the network model with unlabeled data;
step 3, after the pre-training is finished, performing supervised semantic segmentation training on the network model on a labeled sample;
step 4, performing semantic segmentation of remote sensing images with the network model obtained from the supervised semantic segmentation training;
in the pre-training process, contrastive learning is carried out by combining global style contrast with local matching contrast, comprising the following steps:
step 201, performing random data transformation on the unlabeled data: for a given sample x_i, apply two random data transformations t'(x_i) and t''(x_i), producing two related instances x'_i and x''_i that are taken as a positive sample pair, where t' denotes random cropping and scaling, and t'' denotes, in sequence, random cropping and scaling, random flipping, random rotation, random color distortion and random Gaussian blur;
step 202, using the encoder e(·) of the Deeplab v3+ network model, extracting global style features from the transformed sample instances: stylef'_i = stylef(x'_i) = cat(μ(e(x'_i)), σ(e(x'_i))), where stylef'_i is the global style feature, μ denotes taking the mean of each channel of the feature map, i.e. global average pooling, σ denotes taking the variance of each channel, and cat denotes channel concatenation;
step 203, processing the global style feature with a projection head, where the projection head g(·) is a multi-layer perceptron with one hidden layer:
z'_i = g(stylef'_i) = W^(2) r(W^(1) stylef'_i)
wherein W^(2) denotes the second fully connected layer, W^(1) denotes the first fully connected layer, r denotes the ReLU activation function, and z'_i denotes the global style feature processed by the projection head g(·);
step 204, using the encoder e(·) and decoder d(·) of the Deeplab v3+ network model, extracting the feature maps d(e(x'_i)) and d(e(x''_i)) from the transformed sample instances x'_i and x''_i, and obtaining from d(e(x'_i)) and d(e(x''_i)) the features corresponding to a plurality of matched local regions, where p'_j and p''_j are the feature maps corresponding to a matched local pair; the feature maps are then globally average pooled to obtain local feature vectors, namely:
f_L(p'_j) = μ(p'_j)
wherein f_L(p'_j) is a local feature vector;
step 205, processing the local feature vectors with a projection head g_L(·), where the projection head g_L(·) is a multi-layer perceptron with one hidden layer:
u'_j = g_L(f_L(p'_j)) = W^(4) r(W^(3) f_L(p'_j))
wherein W^(4) denotes the fourth fully connected layer, W^(3) denotes the third fully connected layer, and u'_j denotes the local matching feature processed by the projection head g_L(·);
step 206, training the encoder using an overall loss function, the overall loss function consisting of global style contrast loss and local matching contrast loss:
L = (1 - λ)·l_G + λ·l_L
wherein λ is an adjustable weight parameter, l_G denotes the global style contrast loss, and l_L denotes the local matching contrast loss.
2. The method for semantic segmentation of remote sensing images based on self-supervised contrast learning according to claim 1, wherein for N samples from the same batch, the global style contrast loss is defined as follows:
l_G = (1 / 2N) · Σ_{i=1}^{N} [ l(x'_i, x''_i) + l(x''_i, x'_i) ]
wherein:
l(x'_i, x''_i) = −log { exp( sim(z'_i, z''_i) / τ ) / [ exp( sim(z'_i, z''_i) / τ ) + Σ_{x ∈ Λ^−} exp( sim(z'_i, g(style(x))) / τ ) ] }
where sim () denotes computing the similarity between the feature vectors, Λ-2(N-1) negative samples except for the positive sample pair are represented, and tau represents a temperature parameter; style () represents extracting a global style feature vector from the encoder-extracted features by computing the mean and variance,
stylef(x'_i) = stylef'_i = cat(μ(e(x'_i)), σ(e(x'_i)))
for N samples from the same batch, the local match contrast loss is defined as follows:
l_L = (1 / 2N_L) · Σ_{j=1}^{N_L} [ l(p'_j, p''_j) + l(p''_j, p'_j) ]
wherein:
l(p'_j, p''_j) = −log { exp( sim(u'_j, u''_j) / τ ) / [ exp( sim(u'_j, u''_j) / τ ) + Σ_{p ∈ Λ^−} exp( sim(u'_j, g_L(f_L(p))) / τ ) ] }
where N_L denotes the number of all local regions selected from the N samples of the same batch, and Λ^− is the set of feature maps corresponding to all local regions other than the matched one.
3. The remote sensing image semantic segmentation method based on the self-supervision contrast learning according to claim 2, characterized in that in the process of calculating the local matching contrast loss, the local regions are selected and matched first, then the local features of the corresponding local regions are extracted, and finally the local matching contrast loss is calculated;
the selection and matching of the local regions comprises: for a given sample x_i, after the random data transformations t'(x_i) and t''(x_i), two transformed versions are generated, from which a plurality of local regions are randomly selected and matched; an index tag is introduced to record pixel positions, ensuring that the center positions of matched local regions correspond to each other in the original image; a local region is randomly selected from x'_i and the index value of its center position is obtained, then the position of the matched local region in x''_i is determined from this index value; excessive overlap between local regions is avoided by excluding each local region after it has been selected, ensuring that the centers of subsequently selected local regions do not fall inside already selected ones; this step is repeated multiple times to obtain a plurality of matched local regions;
the local feature extraction includes: using the encoder and decoder parts of the Deeplab v3+ network model, the feature maps d(e(x'_i)) and d(e(x''_i)) are extracted from the transformed sample instances x'_i and x''_i; following the idea used in the selection and matching of local regions, the features corresponding to a plurality of matched local regions are obtained from d(e(x'_i)) and d(e(x''_i)): let p'_j be selected from d(e(x'_i)) and p''_j from d(e(x''_i)), so that p'_j and p''_j are the feature maps of a matched local pair; the feature maps are then globally average pooled to obtain local feature vectors;
the local matching contrast loss updates the Deeplab v3+ network model by making the feature representations of matched local regions similar while the feature expressions of non-matched local regions in the same batch are dissimilar.
4. The remote sensing image semantic segmentation method based on self-supervised contrastive learning according to claim 1, characterized in that in the pre-training stage Adam is adopted as the optimizer, the weight decay is set to 1e-5, the initial learning rate is 0.01, and a cosine decay strategy is used, with the model of lowest loss selected for the downstream task; in the fine-tuning stage, Adam is used as the optimizer, the number of epochs is 150, the batch_size is 16, the initial learning rate is 0.001, and the learning rate decays by 0.98 per epoch.
CN202110285256.1A 2021-03-17 2021-03-17 Remote sensing image semantic segmentation method based on self-supervision contrast learning Active CN113011427B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110285256.1A CN113011427B (en) 2021-03-17 2021-03-17 Remote sensing image semantic segmentation method based on self-supervision contrast learning
AU2021103625A AU2021103625A4 (en) 2021-03-17 2021-06-25 Remote sensing image semantic segmentation method based on contrastive self-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285256.1A CN113011427B (en) 2021-03-17 2021-03-17 Remote sensing image semantic segmentation method based on self-supervision contrast learning

Publications (2)

Publication Number Publication Date
CN113011427A CN113011427A (en) 2021-06-22
CN113011427B true CN113011427B (en) 2022-06-21

Family

ID=76409098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285256.1A Active CN113011427B (en) 2021-03-17 2021-03-17 Remote sensing image semantic segmentation method based on self-supervision contrast learning

Country Status (2)

Country Link
CN (1) CN113011427B (en)
AU (1) AU2021103625A4 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554656B (en) * 2021-07-13 2022-02-11 中国科学院空间应用工程与技术中心 Optical remote sensing image example segmentation method and device based on graph neural network
CN113989582B (en) * 2021-08-26 2024-08-02 中国科学院信息工程研究所 Self-supervision visual model pre-training method based on dense semantic comparison
CN113947196A (en) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 Network model training method and device and computer readable storage medium
CN114330312B (en) * 2021-11-03 2024-06-14 腾讯科技(深圳)有限公司 Title text processing method, title text processing device, title text processing program, and recording medium
CN114240966B (en) * 2021-12-13 2024-03-15 西北工业大学 Self-supervision learning method for 3D medical image segmentation training feature extractor
CN114240958B (en) * 2021-12-23 2024-04-05 西安交通大学 Contrast learning method applied to pathological tissue segmentation
CN114266952B (en) * 2021-12-24 2024-06-14 福州大学 Real-time semantic segmentation method based on deep supervision
CN114463549A (en) * 2021-12-29 2022-05-10 广州极飞科技股份有限公司 Training method of feature extraction network model, image processing method and device thereof
CN114399731B (en) * 2021-12-31 2022-12-20 中国科学院大学 Target positioning method under supervision of single coarse point
CN114861865B (en) * 2022-03-10 2023-07-21 长江三峡技术经济发展有限公司 Self-supervision learning method, system, medium and electronic equipment of hyperspectral image classification model
CN114881917A (en) * 2022-03-17 2022-08-09 深圳大学 Thrombolytic curative effect prediction method based on self-supervision and semantic segmentation and related device
CN114926835A (en) * 2022-05-20 2022-08-19 京东科技控股股份有限公司 Text generation method and device, and model training method and device
CN115019123B (en) * 2022-05-20 2023-04-18 中南大学 Self-distillation contrast learning method for remote sensing image scene classification
CN114970716A (en) * 2022-05-26 2022-08-30 支付宝(杭州)信息技术有限公司 Method and device for training representation model, readable storage medium and computing equipment
CN114972313B (en) * 2022-06-22 2024-04-19 北京航空航天大学 Image segmentation network pre-training method and device
CN115100390B (en) * 2022-08-24 2022-11-18 华东交通大学 Image emotion prediction method combining contrast learning and self-supervision region positioning
CN115131361A (en) * 2022-09-02 2022-09-30 北方健康医疗大数据科技有限公司 Training of target segmentation model, focus segmentation method and device
CN115909045B (en) * 2022-09-23 2024-04-30 中国自然资源航空物探遥感中心 Two-stage landslide map feature intelligent recognition method based on contrast learning
CN115661460B (en) * 2022-11-03 2023-07-14 广东工业大学 Medical image segmentation method of similarity perception frame with comparison mechanism
CN115797632B (en) * 2022-12-01 2024-02-09 北京科技大学 Image segmentation method based on multi-task learning
CN115690592B (en) * 2023-01-05 2023-04-25 阿里巴巴(中国)有限公司 Image processing method and model training method
CN116109823B (en) * 2023-01-13 2024-07-30 腾讯科技(深圳)有限公司 Data processing method, apparatus, electronic device, storage medium, and program product
CN115861663B (en) * 2023-03-01 2023-05-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Document image content comparison method based on self-supervision learning model
CN116431831B (en) * 2023-04-18 2023-09-22 延边大学 Supervised relation extraction method based on label contrast learning
CN116188918B (en) * 2023-04-27 2023-07-25 上海齐感电子信息科技有限公司 Image denoising method, training method of network model, device, medium and equipment
CN116935242B (en) * 2023-07-24 2024-08-06 哈尔滨工业大学 Remote sensing image semantic segmentation method and system based on space and semantic consistency contrast learning
CN117036756B (en) * 2023-08-08 2024-04-05 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) Remote sensing image matching method and system based on variation automatic encoder

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680109A (en) * 2017-09-15 2018-02-09 盐城禅图智能科技有限公司 It is a kind of to quote inverse notice and the image, semantic dividing method of pixel similarity study
CN108549895A (en) * 2018-04-17 2018-09-18 深圳市唯特视科技有限公司 A kind of semi-supervised semantic segmentation method based on confrontation network
CN110148129A (en) * 2018-05-24 2019-08-20 深圳科亚医疗科技有限公司 Training method, dividing method, segmenting device and the medium of the segmentation learning network of 3D rendering
CN110175613A (en) * 2019-06-03 2019-08-27 常熟理工学院 Street view image semantic segmentation method based on Analysis On Multi-scale Features and codec models
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN110827505A (en) * 2019-10-29 2020-02-21 天津大学 Smoke segmentation method based on deep learning
CN111047565A (en) * 2019-11-29 2020-04-21 南京恩博科技有限公司 Method, storage medium and equipment for forest cloud image segmentation
CN111476781A (en) * 2020-04-08 2020-07-31 浙江大学 Concrete crack identification method and device based on video semantic segmentation technology
CN111860514A (en) * 2020-05-21 2020-10-30 江苏大学 Orchard scene multi-class real-time segmentation method based on improved deep Lab
CN112329680A (en) * 2020-11-13 2021-02-05 重庆邮电大学 Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN112419333A (en) * 2020-11-17 2021-02-26 武汉大学 Remote sensing image self-adaptive feature selection segmentation method and system
CN112115951B (en) * 2020-11-19 2021-03-09 之江实验室 RGB-D image semantic segmentation method based on spatial relationship
CN112508977A (en) * 2020-12-29 2021-03-16 天津科技大学 Deep learning-based semantic segmentation method for automatic driving scene

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222690B (en) * 2019-04-29 2021-08-10 浙江大学 Unsupervised domain adaptive semantic segmentation method based on maximum quadratic loss
CN111080645B (en) * 2019-11-12 2023-08-15 中国矿业大学 Remote sensing image semi-supervised semantic segmentation method based on generation type countermeasure network

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680109A (en) * 2017-09-15 2018-02-09 盐城禅图智能科技有限公司 It is a kind of to quote inverse notice and the image, semantic dividing method of pixel similarity study
CN108549895A (en) * 2018-04-17 2018-09-18 深圳市唯特视科技有限公司 A kind of semi-supervised semantic segmentation method based on confrontation network
CN110148129A (en) * 2018-05-24 2019-08-20 深圳科亚医疗科技有限公司 Training method, dividing method, segmenting device and the medium of the segmentation learning network of 3D rendering
CN110175613A (en) * 2019-06-03 2019-08-27 常熟理工学院 Street view image semantic segmentation method based on Analysis On Multi-scale Features and codec models
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN110827505A (en) * 2019-10-29 2020-02-21 天津大学 Smoke segmentation method based on deep learning
CN111047565A (en) * 2019-11-29 2020-04-21 南京恩博科技有限公司 Method, storage medium and equipment for forest cloud image segmentation
CN111476781A (en) * 2020-04-08 2020-07-31 浙江大学 Concrete crack identification method and device based on video semantic segmentation technology
CN111860514A (en) * 2020-05-21 2020-10-30 江苏大学 Orchard scene multi-class real-time segmentation method based on improved deep Lab
CN112329680A (en) * 2020-11-13 2021-02-05 重庆邮电大学 Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN112419333A (en) * 2020-11-17 2021-02-26 武汉大学 Remote sensing image self-adaptive feature selection segmentation method and system
CN112115951B (en) * 2020-11-19 2021-03-09 之江实验室 RGB-D image semantic segmentation method based on spatial relationship
CN112508977A (en) * 2020-12-29 2021-03-16 天津科技大学 Deep learning-based semantic segmentation method for automatic driving scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kaiming He et al., "Momentum Contrast for Unsupervised Visual Representation Learning", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020-12-31, pp. 9726-9735 *
Hongxing Peng et al., "Semantic Segmentation of Litchi Branches Using DeepLabV3+ Model", IEEE Access, vol. 8, 2020-09-04, pp. 164546-164555 *

Also Published As

Publication number Publication date
CN113011427A (en) 2021-06-22
AU2021103625A4 (en) 2021-08-19

Similar Documents

Publication Publication Date Title
CN113011427B (en) Remote sensing image semantic segmentation method based on self-supervision contrast learning
CN111986099B (en) Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN111311563A (en) Image tampering detection method based on multi-domain feature fusion
CN109934154B (en) Remote sensing image change detection method and detection device
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN113223042B (en) Intelligent acquisition method and equipment for remote sensing image deep learning sample
CN110675421B (en) Depth image collaborative segmentation method based on few labeling frames
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN112561876A (en) Image-based pond and reservoir water quality detection method and system
CN112508973A (en) MRI image segmentation method based on deep learning
CN111860233A (en) SAR image complex building extraction method and system based on attention network selection
CN113269224A (en) Scene image classification method, system and storage medium
CN110853053A (en) Salient object detection method taking multiple candidate objects as semantic knowledge
CN112329559A (en) Method for detecting homestead target based on deep convolutional neural network
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
Wang et al. PACCDU: Pyramid attention cross-convolutional dual UNet for infrared and visible image fusion
CN113591614B (en) Remote sensing image road extraction method based on close-proximity spatial feature learning
Thati et al. A systematic extraction of glacial lakes for satellite imagery using deep learning based technique
Li et al. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images
CN117315473A (en) Strawberry maturity detection method and system based on improved YOLOv8
CN117197462A (en) Lightweight foundation cloud segmentation method and system based on multi-scale feature fusion and alignment
CN115830322A (en) Building semantic segmentation label expansion method based on weak supervision network
CN116310628A (en) Token mask mechanism-based large-scale village-in-city extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant