CN113298815A - Semi-supervised remote sensing image semantic segmentation method and device and computer equipment

Info

Publication number
CN113298815A
Authority
CN
China
Prior art keywords
remote sensing image
semantic segmentation
attention
semi-supervised
Prior art date
2021-06-21
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110686544.8A
Other languages
Chinese (zh)
Inventor
刘明明
刘兵
李爽
王伟男
胡光喆
仇文宁
付红
戚海永
张海燕
马衍颂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Jianzhu Institute
Original Assignee
Jiangsu Jianzhu Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Jianzhu Institute filed Critical Jiangsu Jianzhu Institute
Priority to CN202110686544.8A priority Critical patent/CN113298815A/en
Publication of CN113298815A publication Critical patent/CN113298815A/en
Withdrawn legal-status Critical Current

Classifications

    • G06T 7/10: Image analysis; Segmentation; Edge detection
    • G06N 3/045: Neural networks; Architecture; Combinations of networks
    • G06N 3/08: Neural networks; Learning methods
    • G06T 3/40: Geometric image transformation in the plane of the image; Scaling the whole image or part thereof
    • G06T 2207/10032: Image acquisition modality; Satellite or aerial image; Remote sensing
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20084: Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221: Special algorithmic details; Image fusion; Image merging
    (All within G: Physics; G06: Computing, calculating or counting; G06T: Image data processing or generation, in general; G06N: Computing arrangements based on specific computational models)

Abstract

The invention discloses a semi-supervised remote sensing image semantic segmentation method, device and computer equipment. The method comprises the following steps: acquiring an original remote sensing image; scaling the original remote sensing image into 3 scaled images of different sizes; inputting the 3 scaled images into 3 criss-cross attention modules respectively to obtain 3 attention feature maps; fusing the 3 attention feature maps to obtain a multi-scale attention feature map; and inputting the multi-scale attention feature map into a deep semantic segmentation network to obtain a semantic segmentation prediction map. The semi-supervised semantic segmentation model based on multi-scale attention can train the whole model with unlabeled data and fully exploit the global context between feature maps, thereby effectively improving the edge segmentation precision between targets in remote sensing images and raising the overall accuracy.

Description

Semi-supervised remote sensing image semantic segmentation method and device and computer equipment
Technical Field
The invention relates to the field of semantic segmentation of remote sensing images, in particular to a semi-supervised remote sensing image semantic segmentation method and device based on multi-scale attention and computer equipment.
Background
In remote sensing research, semantic segmentation classifies every pixel in a remote sensing image and has long been an important research direction. Traditional approaches to remote sensing image semantic segmentation often rely on machine learning algorithms, but their classification accuracy leaves room for improvement. In recent years, with the development of deep learning, convolutional neural networks (CNNs), with their excellent feature extraction capability, have been widely applied across image processing tasks such as scene classification. Long et al. proposed the fully convolutional network (FCN), which replaces the fully connected layers of a CNN with convolutional layers; unlike conventional image classification networks, an FCN can segment images of arbitrary size. SegNet introduced a deconvolution-based encoder-decoder structure that exploits intermediate-layer features through skip connections. Gangfu et al. proposed a multi-scale network structure in which dilated convolutions replace traditional convolutions, enlarging the receptive field without reducing spatial resolution. The atrous spatial pyramid pooling (ASPP) structure provides multiple dilated convolution branches with different dilation rates to extract multi-scale features, markedly improving the segmentation precision of targets in an image. The DeepLabv3 network has been refined repeatedly and is currently among the most successful models in deep-learning semantic segmentation; its latest version, DeepLabv3+, achieves the highest accuracy on multiple public datasets. Multi-scale integration effectively addresses target segmentation: a single neural network with receptive fields of several sizes can accommodate targets of several sizes. Because fully convolutional networks outperform traditional machine learning, many researchers have applied CNNs to the semantic segmentation of remote sensing images, and deep convolutional networks play an increasingly important role in many remote sensing applications. Another proposal uses two independent fully convolutional branches, taking segmented images and height information from optical remote sensing as the inputs of the two branches; after a series of convolution operations, the predicted segmentation results of the two branches are fused. These methods achieve good results when labeled data are plentiful.
Remote sensing image semantic segmentation supports geographic surveying and plays an important role in obtaining landform information. In recent years, as remote sensing images have become easier to acquire and their quality has improved, research on them has grown. Semantic segmentation must classify every pixel of the feature map, so a labeled image likewise requires per-pixel annotation. As the resolution of acquired remote sensing images increases, annotating them for semantic segmentation becomes more difficult, and target edges are hard to segment accurately. Most mainstream research on remote sensing image semantic segmentation is currently based on deep convolutional neural networks. Li Yu proposed an image semantic segmentation method based on deep convolution fused with a conditional random field: shallow detail information and high-level semantic information are merged into the network model, conditional random field inference is embedded into the network framework as iterative layers, and the rich detail and context information of the remote sensing image is exploited during forward and backward propagation of model training, achieving end-to-end remote sensing image semantic segmentation. Another method based on connecting encoder and decoder features improves the DeconvNet model: during encoding, recording the pooling indices and reusing them during unpooling effectively preserves spatial structure; during decoding, connecting the corresponding encoder and decoder feature layers strengthens feature extraction. A further method improves DeepLabv3 for remote sensing image semantic segmentation, replacing the single upsampling layer with multi-layer upsampling using residuals from the backbone network to preserve the semantic integrity of the image at full resolution. However, existing remote sensing semantic segmentation methods cannot exploit unlabeled data well, so the segmentation effect deteriorates when labeled data are scarce; how to improve semantic segmentation of targets when annotation is insufficient remains an open problem. Current semi-supervised segmentation methods ignore long-range correlations, which leads to inaccurate segmentation of target edges in remote sensing images.
Disclosure of Invention
Based on the above, it is necessary to provide a semi-supervised remote sensing image semantic segmentation method, device and computer equipment for solving the above technical problems.
The embodiment of the invention provides a semi-supervised remote sensing image semantic segmentation method, which comprises the following steps:
acquiring an original remote sensing image;
scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; performing fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map;
and inputting the multi-scale attention feature map into a deep semantic segmentation network to obtain a semantic segmentation prediction map.
In one embodiment, the obtaining of the multi-scale attention feature map specifically includes:
inputting the original remote sensing image into a deep convolutional neural network to obtain feature maps X_1, X_2 and X_3 of different sizes;
inputting feature maps X_1, X_2 and X_3 into 3 criss-cross attention modules respectively to obtain attention feature maps C_1, C_2 and C_3;
sequentially upsampling and fusing attention feature maps C_1, C_2 and C_3 to obtain a multi-scale attention feature map.
In one embodiment, the obtaining of the attention feature map specifically includes:
for the feature map M ∈ R^{C×W×H} of the original remote sensing image, generating two feature maps, named Q and K respectively, by applying two 1×1 convolutional layers, with (Q, K) ∈ R^{C′×W×H};
sequentially applying an Affinity operation, a SoftMax operation and an Aggregation operation to the feature maps Q and K to obtain an attention map A ∈ R^{(H+W-1)×W×H};
wherein C′ is the number of channels of the feature maps Q and K, C is the number of channels of the original remote sensing image, and C′ is smaller than C; H and W are the height and width of the original remote sensing image respectively.
In one embodiment, the Affinity operation, SoftMax operation, and Aggregation operation specifically include:
for each position u of the feature map Q, a vector Q_u ∈ R^{C′} is obtained; at the same time, the set Ω_u is obtained by extracting from K the feature vectors in the same row or column as position u, with the Affinity operation:

d_{i,u} = Q_u \Omega_{i,u}^{\top}

wherein Ω_u ∈ R^{(H+W-1)×C′} and Ω_{i,u} ∈ R^{C′} is the i-th element of Ω_u; d_{i,u} ∈ D denotes the degree of correlation between Q_u and Ω_{i,u}, i = [1, ..., |Ω_u|], D ∈ R^{(H+W-1)×W×H};
a SoftMax operation is applied on D along the channel dimension, and a convolutional layer with a 1×1 filter is applied on M to generate V ∈ R^{C×W×H} for feature adaptation; for each position u in the spatial dimension of V, a vector V_u ∈ R^C and a set Φ_u ∈ R^{(H+W-1)×C} are obtained, wherein Φ_u denotes the set of feature vectors of V in the same column or row as position u, and A_{i,u} is the scalar value of A at channel i and position u;
the non-local information of the image is acquired through the Aggregation operation:

M'_u = \sum_{i=0}^{H+W-1} A_{i,u} \Phi_{i,u} + M_u

wherein M'_u denotes the feature vector at position u in the output feature map M' ∈ R^{C×W×H}.
In one embodiment, a semi-supervised remote sensing image semantic segmentation method further includes:
inputting the one-hot encoded vectors of the semantic segmentation prediction map and of the annotated image into a discriminator network to obtain a semantic segmentation confidence map; wherein the original remote sensing images include annotated images.
In one embodiment, the discriminator network comprises:
5 convolutional layers with 4×4 kernels, channel numbers [64, 128, 256, 512, 1] and stride 2; the ReLU after each convolutional layer is replaced with Leaky-ReLU; and an upsampling layer is added after the last layer.
In one embodiment, a semi-supervised remote sensing image semantic segmentation method further includes:
training the deep semantic segmentation network and the discriminator network based on the spatial multi-class cross-entropy loss L_ce, the adversarial loss L_A and the semi-supervised loss L_S.
In one embodiment, training the deep semantic segmentation network and the discriminator network specifically includes:
when labeled data are used, obtaining the multi-class cross-entropy loss L_ce as:

L_{ce} = -\sum_{h,w} \sum_{c \in C} Y_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

training the discriminator network through L_D:

L_D = -\sum_{h,w} (1 - y_n) \log\left(1 - D(G(X_n))^{(h,w)}\right) + y_n \log D(Y_n)^{(h,w)}

wherein y_n = 0 indicates that the sample is generated by the generator and y_n = 1 that the sample comes from the annotated image; D(G(X_n))^{(h,w)} is the confidence at position (h, w) for the prediction on X_n, and D(Y_n)^{(h,w)} is the confidence at position (h, w) for the annotated image Y_n; if pixel X_n^{(h,w)} belongs to class c, then Y_n^{(h,w,c)} is 1, otherwise 0;
training the generator in the adversarial learning process through the adversarial loss L_A:

L_A = -\sum_{h,w} \log D(G(X_n))^{(h,w)}

when training with unlabeled data, only L_A applies; for unlabeled data, confidence maps D(G(X_n))^{(h,w)} are generated by the trained discriminator network;
in place of the one-hot encoding Y_n of an annotated image, setting the pseudo-label Ŷ_n element by element: if c* = argmax_c G(X_n)^{(h,w,c)}, then Ŷ_n^{(h,w,c*)} = 1, otherwise 0;
setting a threshold T_{semi} to highlight the confident regions of the confidence map; L_S is defined as:

L_S = -\sum_{h,w} \sum_{c \in C} I\left(D(G(X_n))^{(h,w)} > T_{semi}\right) \hat{Y}_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

wherein I(·) is an indicator function; the sensitivity is controlled by setting the value of T_{semi}, thereby adjusting the training process of the network.
A semi-supervised remote sensing image semantic segmentation device comprises:
the image acquisition module is used for acquiring an original remote sensing image;
the multi-scale attention feature map determining module is used for scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; performing fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map;
and the semantic segmentation prediction map determining module is used for inputting the multi-scale attention feature map into the deep semantic segmentation network to obtain a semantic segmentation prediction map.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an original remote sensing image;
scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; performing fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map;
and inputting the multi-scale attention feature map into a deep semantic segmentation network to obtain a semantic segmentation prediction map.
Compared with the prior art, the semi-supervised remote sensing image semantic segmentation method, the semi-supervised remote sensing image semantic segmentation device and the computer equipment provided by the embodiment of the invention have the following beneficial effects:
The invention provides a semi-supervised remote sensing image semantic segmentation model based on multi-scale attention, which can train the whole model with unlabeled data and fully exploit the global context between feature maps, thereby effectively improving the edge segmentation precision between targets in remote sensing images and raising the overall accuracy. Specifically, to fully utilize the global context, long-range correlations between pixels are exploited to improve the segmentation precision of target edges, and a criss-cross attention network is introduced. Through multi-path inputs, image features of different sizes are extracted and receptive fields of different sizes are obtained, so the features of the training data seen from different views are fully utilized. Meanwhile, to address the difficulty of semantic annotation, a semi-supervised semantic segmentation method is applied to remote sensing images: the segmentation network serves as the generator, and with the auxiliary training of a discriminator its output is driven as close as possible to the annotated image. Because the FCN has been highly successful in semantic segmentation of natural-scene images, many researchers have applied it to remote sensing image segmentation; here a fully convolutional discriminator distinguishes annotated images from predicted images, and the semi-supervised framework can exploit both labeled and unlabeled data, making full use of the data when the amount of annotation is small and improving the segmentation effect.
Drawings
FIG. 1 is a schematic illustration of an attention mechanism provided in one embodiment;
FIG. 2 is a multi-scale attention diagram provided in one embodiment;
FIG. 3 is a schematic diagram of a network of discriminators provided in one embodiment;
FIG. 4 is a diagram of a semi-supervised semantic segmentation based on multi-scale attention as provided in one embodiment;
FIG. 5 is a segmentation result visualization of a CCF2015 data set based on multi-scale attention provided in an embodiment;
fig. 6 is a visualization of segmentation results on the US2D dataset based on multi-scale attention, provided in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, the provided semi-supervised remote sensing image semantic segmentation method specifically comprises the following steps:
attention mechanism
For a neural network to focus on the features it cares about, an attention mechanism must be introduced; attention is also very important for studying the context between feature maps in semantic segmentation. Attention does not depend on the shape of the input data, so its range of application is wide. With fixed computing resources, the attention mechanism is an effective means of addressing information overload: it removes redundant information, allocates computing resources to where they are most needed, and ignores most unimportant information, thereby achieving an effective allocation of computing resources, as shown in fig. 1.
The attention mechanism computes the degree of correlation between features. It maps a Query and Key-Value pairs {(K_i, V_i) | i = 1, 2, ..., m} to an output, where query, key and value are all vectors; the output is a weighted sum over all values in V, and the weights are obtained by comparing the query against the keys. The attention mechanism is calculated as follows:
First, the similarity between Q and each K_i is calculated, denoted by f:

f(Q, K_i), i = 1, 2, ..., m

The similarity between Q and K can be calculated in four ways: dot-product, weighted (bilinear), concatenation-weighted, and perceptron, shown respectively below:

f(Q, K_i) = Q^{\top} K_i
f(Q, K_i) = Q^{\top} W K_i
f(Q, K_i) = W [Q; K_i]
f(Q, K_i) = V^{\top} \tanh(W Q + U K_i)

The scores are then normalized with SoftMax, converting them into a probability distribution whose element weights sum to one:

\alpha_i = \mathrm{SoftMax}(f(Q, K_i)) = \frac{\exp(f(Q, K_i))}{\sum_{j=1}^{m} \exp(f(Q, K_j))}

Finally, all values in V are weighted by α_i and summed to obtain the attention vector H:

H = \sum_{i=1}^{m} \alpha_i V_i
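As a concrete illustration of these steps, the following is a minimal PyTorch sketch of the dot-product variant; the function name and tensor shapes are illustrative, not part of the invention.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    """Dot-product attention: f(Q, K_i) = Q^T K_i, alpha = SoftMax(f), H = sum_i alpha_i V_i.

    query:  tensor of shape (d,), the query vector Q
    keys:   tensor of shape (m, d), the key vectors K_i
    values: tensor of shape (m, d_v), the value vectors V_i
    """
    scores = keys @ query               # f(Q, K_i) for i = 1, ..., m
    alpha = F.softmax(scores, dim=0)    # weights summing to one
    return alpha @ values               # attention vector H
```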
multi-scale attention semi-supervised semantic segmentation model
Because the generator ignores long-range correlations, a criss-cross attention module is introduced; long-range dependencies capture useful contextual information, which aids visual understanding. In the present invention, criss-cross attention modules collect long-range contextual information in both the horizontal and vertical directions to enhance the per-pixel representation. Through multi-scale fusion, each pixel combines contextual information from three views, so its classification is more accurate.
Generator
As shown in fig. 2, the input image is scaled to 3 different sizes, each scaled image is fed to an attention module, and the resulting feature maps are merged into a multi-scale attention feature map. The input image first passes through a deep convolutional neural network based on ResNet-101, producing feature maps X_1, X_2 and X_3; the spatial size of a feature map X is H × W. X_1, X_2 and X_3 pass through the attention modules to yield C_1, C_2 and C_3 respectively, in which every pixel is contextually related to all pixels along its row and column. After upsampling, all feature maps are restored to the original size and fused to obtain the final multi-scale attention feature map.
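The following PyTorch sketch illustrates this generator structure. The scale factors (1.0, 0.75, 0.5), the 1×1 fusion convolution, the output stride of 8 and all module names are assumptions for illustration; the text itself fixes only the three scaled inputs, the ResNet-101 based feature extractor, the three criss-cross attention modules and the upsample-then-fuse step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionGenerator(nn.Module):
    """Sketch of the generator: three scaled copies of the input pass through a
    shared backbone and per-scale criss-cross attention modules; the attention
    feature maps are upsampled to a common size and fused."""

    def __init__(self, backbone, attention_modules, num_classes, channels=2048):
        super().__init__()
        self.backbone = backbone                      # e.g. a ResNet-101 feature extractor
        self.attn = nn.ModuleList(attention_modules)  # three criss-cross attention modules
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = []
        for scale, attn in zip((1.0, 0.75, 0.5), self.attn):
            xi = F.interpolate(x, scale_factor=scale, mode='bilinear',
                               align_corners=False)
            ci = attn(self.backbone(xi))              # attention feature map C_i
            # Restore every branch to a common spatial size before fusion
            # (output stride 8 is assumed for the backbone).
            feats.append(F.interpolate(ci, size=(h // 8, w // 8),
                                       mode='bilinear', align_corners=False))
        fused = self.fuse(torch.cat(feats, dim=1))    # multi-scale attention feature map
        logits = self.classifier(fused)
        # Semantic segmentation prediction map at the original resolution.
        return F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
```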
Discriminator
The discriminator network has 5 convolutional layers with 4×4 kernels, channel numbers [64, 128, 256, 512, 1] and stride 2; the ReLU after each convolutional layer is replaced with Leaky-ReLU. To restore the output to the size of the input image, an upsampling layer is added after the last layer. Because training a generative adversarial network requires a large amount of memory, no more complex discriminator structure is adopted.
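A PyTorch sketch of this discriminator follows; a padding of 1 and a Leaky-ReLU slope of 0.2 are assumptions, while the five 4×4 convolutions, channels [64, 128, 256, 512, 1], stride 2 and final upsampling follow the description.

```python
import torch.nn as nn
import torch.nn.functional as F

class FCDiscriminator(nn.Module):
    """Fully convolutional discriminator as described: five 4x4 convolutions with
    stride 2 and channels [64, 128, 256, 512, 1], Leaky-ReLU after each
    intermediate layer, and a final upsampling back to the input size."""

    def __init__(self, num_classes, ndf=64):
        super().__init__()
        self.conv1 = nn.Conv2d(num_classes, ndf, 4, stride=2, padding=1)
        self.conv2 = nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1)
        self.conv3 = nn.Conv2d(ndf * 2, ndf * 4, 4, stride=2, padding=1)
        self.conv4 = nn.Conv2d(ndf * 4, ndf * 8, 4, stride=2, padding=1)
        self.classifier = nn.Conv2d(ndf * 8, 1, 4, stride=2, padding=1)
        self.leaky_relu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):                  # x: C-channel probability/one-hot map
        h, w = x.shape[2:]
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            x = self.leaky_relu(conv(x))
        x = self.classifier(x)             # 1-channel map at 1/32 resolution
        return F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
```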
Semi-supervised remote sensing image semantic segmentation algorithm based on multi-scale attention
As shown in fig. 2, for a given feature map M ∈ R^{C×W×H}, two 1×1 convolutional layers are first applied to M, yielding two feature maps named Q and K, with (Q, K) ∈ R^{C′×W×H}. C′ is the number of channels of these feature maps; it is smaller than the channel number C of the image features, which reduces the dimensionality. After Q and K are obtained, the Affinity operation produces d_{i,u}, from which the attention map A ∈ R^{(H+W-1)×W×H} is obtained.
The Affinity operation works as follows: for each position u of the feature map Q, a vector Q_u ∈ R^{C′} is obtained; meanwhile, the set Ω_u is obtained by extracting from K the feature vectors in the same row or column as position u, and

d_{i,u} = Q_u \Omega_{i,u}^{\top}

where Ω_u ∈ R^{(H+W-1)×C′} and Ω_{i,u} ∈ R^{C′} is the i-th element of Ω_u. d_{i,u} ∈ D denotes the degree of correlation between Q_u and Ω_{i,u}, with i = [1, ..., |Ω_u|] and D ∈ R^{(H+W-1)×W×H}. Then, a SoftMax operation is applied on D along the channel dimension to compute the attention map A. Finally, a convolutional layer with a 1×1 filter is applied to M to generate V ∈ R^{C×W×H} for feature adaptation. For each position u in the spatial dimension of V, a vector V_u ∈ R^C and a set Φ_u ∈ R^{(H+W-1)×C} can be obtained, where Φ_u is the set of feature vectors of V in the same column or row as position u. The non-local information of the image is obtained through the Aggregation operation

M'_u = \sum_{i=0}^{H+W-1} A_{i,u} \Phi_{i,u} + M_u

where M'_u denotes the feature vector at position u in the output feature map M' ∈ R^{C×W×H} and A_{i,u} is the scalar value of A at channel i and position u. Because contextual information is added to the local feature M, the local features receive better attention and the pixel-level representation of the features improves. Since the attention map captures long-range correlations, the feature map has a relatively broad contextual view, so contextual information can be selectively aggregated according to the attention map. As shown in fig. 4, the attention feature maps obtained from input images of different sizes through the criss-cross attention modules are upsampled back to the size of the original input image and then fused to obtain the final multi-scale attention feature map.
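A minimal PyTorch sketch of one criss-cross attention pass follows. The channel reduction factor of 8 and the masking of the duplicated position u are implementation assumptions; the module covers the H + W - 1 row-and-column neighbours of each position, as described above.

```python
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """Sketch of criss-cross attention: for each position u, affinities are
    computed only against the H + W - 1 positions in the same row and column,
    SoftMax-normalised, and used to aggregate V back onto M (Aggregation)."""

    def __init__(self, in_channels, reduction=8):
        super().__init__()
        c_prime = in_channels // reduction                   # C' < C reduces dimensionality
        self.query = nn.Conv2d(in_channels, c_prime, 1)      # Q
        self.key = nn.Conv2d(in_channels, c_prime, 1)        # K
        self.value = nn.Conv2d(in_channels, in_channels, 1)  # V

    def forward(self, m):                             # m: (B, C, H, W)
        b, c, h, w = m.shape
        q, k, v = self.query(m), self.key(m), self.value(m)

        # Affinity d_{i,u} = Q_u . Omega_{i,u} along the row ...
        row = torch.einsum('bchw,bchv->bhwv', q, k)   # (B, H, W, W)
        # ... and along the column.
        col = torch.einsum('bchw,bcvw->bhwv', q, k)   # (B, H, W, H)
        # Mask the duplicated position u in the column branch so each of the
        # H + W - 1 neighbours is counted once (an implementation assumption).
        mask = torch.eye(h, dtype=torch.bool, device=m.device).view(1, h, 1, h)
        col = col.masked_fill(mask, float('-inf'))

        # SoftMax over the H + W - 1 affinities of each position.
        attn = torch.softmax(torch.cat([row, col], dim=-1), dim=-1)
        a_row, a_col = attn[..., :w], attn[..., w:]

        # Aggregation: M'_u = sum_i A_{i,u} Phi_{i,u} + M_u
        out = torch.einsum('bhwv,bchv->bchw', a_row, v) \
            + torch.einsum('bhwv,bcvw->bchw', a_col, v)
        return out + m
```

Each pixel here attends to one row and one column; full-image context is obtained when the module is applied recurrently or combined with the multi-scale fusion above.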
The generator's prediction map G(X_n)^{(h,w)} and the vector Y_n obtained by one-hot encoding the annotated image are input to the discriminator, which after training outputs a confidence map of size H × W × 1. The loss function L combines three terms:

L = L_{ce} + \lambda_A L_A + \lambda_S L_S

L_ce, L_A and L_S are the multi-class cross-entropy loss, the adversarial loss and the semi-supervised loss respectively, and λ_A and λ_S are two weights for minimizing the loss function L. When labeled data are used, the multi-class cross-entropy loss L_ce is obtained as:

L_{ce} = -\sum_{h,w} \sum_{c \in C} Y_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

The discriminator network is trained through L_D:

L_D = -\sum_{h,w} (1 - y_n) \log\left(1 - D(G(X_n))^{(h,w)}\right) + y_n \log D(Y_n)^{(h,w)}

Here y_n = 0 indicates that the sample was produced by the generator, and y_n = 1 that it comes from the annotated image. D(G(X_n))^{(h,w)} is the confidence for pixel X_n at position (h, w); D(Y_n)^{(h,w)} is defined similarly. To convert the discrete label map into a C-channel probability map, the annotated image is transformed by one-hot encoding: if pixel X_n^{(h,w)} belongs to class c, then Y_n^{(h,w,c)} is 1, otherwise 0. The adversarial learning process trains the generator through the loss L_A:

L_A = -\sum_{h,w} \log D(G(X_n))^{(h,w)}

The generator is trained to fool the discriminator by increasing the probability that its predictions are taken as coming from the true distribution of the annotated images. When training with unlabeled data, only L_A applies, since it requires only the discriminator network; without annotated images, the multi-class cross-entropy loss cannot be used. In addition, for unlabeled data the trained discriminator generates confidence maps D(G(X_n))^{(h,w)} that indicate which regions are sufficiently close to the true distribution of the annotated images; such regions can be treated as if they carried label information. Playing the role of the one-hot encoded annotation Y_n, a pseudo-label Ŷ_n is set element by element: if c* = argmax_c G(X_n)^{(h,w,c)}, then Ŷ_n^{(h,w,c*)} = 1, otherwise 0. L_S has the same form as L_ce. A threshold T_{semi} is then set to highlight the confident regions of the confidence map, and L_S is defined as:

L_S = -\sum_{h,w} \sum_{c \in C} I\left(D(G(X_n))^{(h,w)} > T_{semi}\right) \hat{Y}_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

where I(·) is an indicator function whose sensitivity is controlled by setting T_{semi}, thereby adjusting the training process of the network.
Example analysis
All experiments are carried out on the Ubuntu 18.04 operating system, and the multi-scale attention based semi-supervised remote sensing semantic segmentation model is trained on an RTX 2080 Ti. The code for all experiments is based on PyTorch 0.4.0 and CUDA 9; image features are extracted with a network model pre-trained with ResNet-101 on the PASCAL VOC 2012 dataset, with generative adversarial training as an aid. For the generator, i.e. the segmentation network, the optimizer used during model training is Adam, and the initial learning rate and the coefficient of the L2 regularization term are both set to 0.0001. For the discriminator network, Adam is used with an initial learning rate of 10^{-4} and an L2 regularization coefficient of 0.9. When annotation data are used, λ_A is set to 0.01; when they are not, λ_A is set to 0.001. λ_S is taken as 0.1, the threshold T_{semi} is set to 0.2, and I to 0.1. The training period is set to 20000, and since the memory required by the generative adversarial network is large, the batch size is set to 2. In the experiments, the mean Intersection over Union (MIoU) is used as the evaluation criterion for the quality of the generated segmentation images; the higher it is, the closer the generated segmentation is to the real annotation, i.e. the higher its quality. For each class, the ratio of the intersection to the union of the ground-truth and prediction sets is computed; this ratio can be written as TP (the intersection) over the sum of TP, FP and FN (the union), i.e. IoU = TP / (FP + FN + TP). Averaging over classes gives:

\mathrm{MIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}

where p_ij denotes the number of pixels whose true class is i and whose predicted class is j, and k + 1 is the number of classes (including the empty class); p_ii counts true positives, while p_ij and p_ji denote false positives and false negatives respectively.
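As a concrete check of this definition, a short NumPy sketch follows; the function name and the explicit loop over classes are illustrative.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """MIoU from the confusion matrix: p[i, j] counts pixels of true class i
    predicted as class j; per-class IoU = TP / (TP + FP + FN)."""
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i in range(num_classes):
        for j in range(num_classes):
            p[i, j] = np.sum((gt == i) & (pred == j))
    tp = np.diag(p)
    fp = p.sum(axis=0) - tp     # predicted as class i but actually another class
    fn = p.sum(axis=1) - tp     # class i pixels predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean()
```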
Example analysis on CCF2015 dataset
The CCF2015 dataset contains five classes: five large images in which four kinds of objects are annotated (vegetation, buildings, water and roads) in addition to background. The original images range in size from 3000 × 3000 to 6000 × 6000 pixels; because this resolution is too large, the images are processed to obtain 13000 images of size 256 × 256. The specific processing, shown in the sketch after this paragraph, is as follows: both the original image and the label image are rotated by 90, 180 and 270 degrees, and patches are cropped to 256 × 256 thumbnails at randomly generated x and y coordinates. A standard validation set of 1000 images is used for model evaluation.
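A sketch of this preprocessing with PIL is shown below; applying a single randomly chosen rotation per call and the function name are illustrative choices, while the rotation angles and the random 256 × 256 crop follow the description.

```python
import random
from PIL import Image

def rotate_and_crop(image, label, crop=256, angles=(90, 180, 270)):
    """Rotate image and label together by one of 90/180/270 degrees, then crop
    a crop x crop patch at randomly generated x, y coordinates."""
    angle = random.choice(angles)
    image = image.rotate(angle, expand=True)
    label = label.rotate(angle, expand=True)   # nearest-neighbour by default
    x = random.randint(0, image.width - crop)
    y = random.randint(0, image.height - crop)
    box = (x, y, x + crop, y + crop)
    return image.crop(box), label.crop(box)

# Usage (paths illustrative):
# img, lab = rotate_and_crop(Image.open('scene.png'), Image.open('scene_label.png'))
```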
The experiments on CCF2015 with multi-scale attention based semi-supervision are shown in Tables 1 and 2. To show that the method performs better than existing approaches, the multi-scale attention based semi-supervised method is further compared with the fully supervised DeepLab method and the semi-supervised method of Hung et al. As shown in Tables 1 to 4, the semi-supervised method based on multi-scale attention obtains a clear gain in MIoU over the earlier network without long-range correlation, which shows that introducing multi-scale attention and strengthening the contextual correlation between pixels is significant; the experiments on the CCF2015 and US2D datasets fully confirm this. On CCF2015 the method was tested with 1/8 and 1/2 of the data labeled, respectively. To further verify its feasibility, as shown in fig. 5, the multi-scale attention method improves both the segmentation of roads (blue) and vegetation (red) and the accuracy of building segmentation (green), and the edge information of the targets is closer to the original labels. This demonstrates the effectiveness of combining semi-supervised segmentation with attention: attending to the contextual relation between each pixel and the other pixels along its row and column captures long-range correlations better, further improving the feature extraction capability of the generator and hence the performance of the whole network.
Table 1: Experimental results of multi-scale attention on the CCF2015 dataset with 1/8 labeled data [table data not reproduced]
Table 2: Experimental results of multi-scale attention on the CCF2015 dataset with 1/2 labeled data [table data not reproduced]
Example analysis on US2D data set
The urban semantic two-dimensional (US2D) dataset from the IGARSS 2019 contest is a large public dataset containing RGB images and semantic labels, covering Jacksonville, Florida and Omaha, Nebraska. The International Geoscience and Remote Sensing Symposium (IGARSS) is an influential conference in the remote sensing field. For data processing, the remote sensing images are cropped to 512 × 512 resolution together with the semantic labels of the corresponding original images (see fig. 6); the ground sampling distance (GSD) is about 30 cm. For the experiments here, the cropping yields 13732 training images and 1720 test images.
The results on the US2D dataset are shown in Tables 3 and 4. Experiments were run with 1/8 and 1/2 of the data labeled, respectively, the remainder serving as unlabeled data. Compared with the existing fully supervised method the improvement is large, and there is also a clear gain over an existing strong semi-supervised method. The visualization on US2D is shown in fig. 6: by combining the attention mechanism with the semi-supervised method, the obtained semantic segmentation images reflect the semantic features well.
Table 3: Experimental results of multi-scale attention on the US2D dataset with 1/8 labeled data [table data not reproduced]
Table 4: Experimental results of multi-scale attention on the US2D dataset with 1/2 labeled data [table data not reproduced]
In summary, deep convolutional networks have been applied to remote sensing images ever more widely in recent years. Addressing the fact that existing semi-supervised remote sensing semantic segmentation methods ignore the long-range correlation between pixels and therefore cannot exploit the global context effectively, a criss-cross attention mechanism is introduced, a multi-scale attention module is designed and combined with a generative adversarial network, and the whole network is trained under a semi-supervised framework; the unlabeled data in the dataset then improve training and raise the semantic segmentation precision for remote sensing images. The effectiveness of the proposed method is verified by experiments on two public remote sensing datasets.
In addition, semantic segmentation is a very important research area of computer vision. Unlike image classification, which assigns one label per picture, semantic segmentation decides the category of every pixel in an image so as to segment it precisely; it therefore requires far more annotation than image classification, which makes studying semantic segmentation with small amounts of labeled data through semi-supervised learning worthwhile. Because image semantic segmentation operates at the pixel level, it can represent the exact contour of an object and indicate the target to which each specific pixel belongs, achieving accurate segmentation, which is also very helpful for remote sensing research. Remote sensing imagery currently has great research value in many fields, and realizing its semantic segmentation has far-reaching significance for making better use of it to acquire the spatial geographic information people need. Addressing the problems that training models on remote sensing images demands large amounts of labeled data, that deep convolutional models ignore long-range correlations, and that remote sensing images are hard to annotate so the amount of labeled data is small, the invention provides a semi-supervised remote sensing image semantic segmentation method based on a deep convolutional network and adversarial learning. By using input images of different scales, multi-view characteristics of the images are collected, further improving the semantic segmentation of remote sensing images. For the remote sensing semantic segmentation task, where data can be hard to acquire, labeled data are few and the manpower and material cost of annotation should be reduced as far as possible, the invention combines a deep convolutional neural network with a generative adversarial network and, targeting the weak contextual correlation of the features extracted by current methods, proposes multi-scale attention based semi-supervised remote sensing image semantic segmentation, achieving the current best performance.
In one embodiment, a semi-supervised remote sensing image semantic segmentation device is provided, which comprises:
and the image acquisition module is used for acquiring the original remote sensing image.
The multi-scale attention feature map determining module is used for scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; and carrying out fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map.
And the semantic segmentation prediction map determining module is used for inputting the multi-scale attention feature map into the deep semantic segmentation network to obtain a semantic segmentation prediction map.
The semantic segmentation confidence image determining module is used for inputting the one-hot coding vectors of the semantic segmentation prediction image and the annotation image into a discriminator network to obtain a semantic segmentation confidence image; wherein the original remote sensing image comprises: and (5) labeling the image.
A network training module for training the deep semantic segmentation network and the discriminator network based on the spatial multi-class cross-entropy loss L_ce, the adversarial loss L_A and the semi-supervised loss L_S.
The specific definition of the semi-supervised remote sensing image semantic segmentation device can refer to the definition of the semi-supervised remote sensing image semantic segmentation method in the above, and is not described herein again. All modules in the semi-supervised remote sensing image semantic segmentation device can be completely or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
and acquiring an original remote sensing image.
Scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; and carrying out fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map.
And inputting the multi-scale attention feature map into a deep semantic segmentation network to obtain a semantic segmentation prediction map.
Inputting the one-hot encoded vectors of the semantic segmentation prediction map and of the annotated image into a discriminator network to obtain a semantic segmentation confidence map; wherein the original remote sensing images include annotated images.
Training the deep semantic segmentation network and the discriminator network based on the spatial multi-class cross-entropy loss L_ce, the adversarial loss L_A and the semi-supervised loss L_S.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features. Furthermore, the above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A semi-supervised remote sensing image semantic segmentation method is characterized by comprising the following steps:
acquiring an original remote sensing image;
scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; performing fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map;
and inputting the multi-scale attention feature map into a deep semantic segmentation network to obtain a semantic segmentation prediction map.
2. The semi-supervised remote sensing image semantic segmentation method according to claim 1, wherein the obtaining of the multi-scale attention feature map specifically comprises:
inputting the original remote sensing image into a deep convolutional neural network to obtain feature maps X_1, X_2 and X_3 of different sizes;
inputting feature maps X_1, X_2 and X_3 into 3 criss-cross attention modules respectively to obtain attention feature maps C_1, C_2 and C_3;
sequentially upsampling and fusing attention feature maps C_1, C_2 and C_3 to obtain a multi-scale attention feature map.
3. The semi-supervised remote sensing image semantic segmentation method according to claim 1, wherein the obtaining of the attention feature map specifically comprises:
for the feature map M ∈ R^{C×W×H} of the original remote sensing image, generating two feature maps, named Q and K respectively, by applying two 1×1 convolutional layers, with (Q, K) ∈ R^{C′×W×H};
sequentially applying an Affinity operation, a SoftMax operation and an Aggregation operation to the feature maps Q and K to obtain an attention map A ∈ R^{(H+W-1)×W×H};
wherein C′ is the number of channels of the feature maps Q and K, C is the number of channels of the original remote sensing image, and C′ is smaller than C; H and W are the height and width of the original remote sensing image respectively.
4. The semi-supervised remote sensing image semantic segmentation method according to claim 3, wherein the Affinity operation, SoftMax operation and Aggregation operation specifically comprise:
for each position u of the feature map Q, a vector Q_u ∈ R^{C′} is obtained; at the same time, the set Ω_u is obtained by extracting from K the feature vectors in the same row or column as position u, with the Affinity operation:

d_{i,u} = Q_u \Omega_{i,u}^{\top}

wherein Ω_u ∈ R^{(H+W-1)×C′} and Ω_{i,u} ∈ R^{C′} is the i-th element of Ω_u; d_{i,u} ∈ D denotes the degree of correlation between Q_u and Ω_{i,u}, i = [1, ..., |Ω_u|], D ∈ R^{(H+W-1)×W×H};
a SoftMax operation is applied on D along the channel dimension, and a convolutional layer with a 1×1 filter is applied on M to generate V ∈ R^{C×W×H} for feature adaptation; for each position u in the spatial dimension of V, a vector V_u ∈ R^C and a set Φ_u ∈ R^{(H+W-1)×C} are obtained, wherein Φ_u denotes the set of feature vectors of V in the same column or row as position u, and A_{i,u} is the scalar value of A at channel i and position u;
the non-local information of the image is acquired through the Aggregation operation:

M'_u = \sum_{i=0}^{H+W-1} A_{i,u} \Phi_{i,u} + M_u

wherein M'_u denotes the feature vector at position u in the output feature map M' ∈ R^{C×W×H}.
5. The semi-supervised remote sensing image semantic segmentation method of claim 1, further comprising:
inputting the one-hot encoded vectors of the semantic segmentation prediction map and of the annotated image into a discriminator network to obtain a semantic segmentation confidence map; wherein the original remote sensing images include annotated images.
6. The semi-supervised remote sensing image semantic segmentation method of claim 5, wherein the discriminator network comprises:
5 convolutional layers with 4×4 kernels, channel numbers [64, 128, 256, 512, 1] and stride 2; the ReLU after each convolutional layer is replaced with Leaky-ReLU; and an upsampling layer is added after the last layer.
7. The semi-supervised remote sensing image semantic segmentation method of claim 1, further comprising:
training the deep semantic segmentation network and the discriminator network based on the spatial multi-class cross-entropy loss L_ce, the adversarial loss L_A and the semi-supervised loss L_S.
8. The semi-supervised remote sensing image semantic segmentation method according to claim 7, wherein the training of the deep semantic segmentation network and the discriminator network specifically comprises:
when labeled data are used, the multi-class cross-entropy loss L_ce is obtained as:

L_{ce} = -\sum_{h,w} \sum_{c \in C} Y_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

the discriminator network is trained through L_D:

L_D = -\sum_{h,w} (1 - y_n) \log\left(1 - D(G(X_n))^{(h,w)}\right) + y_n \log D(Y_n)^{(h,w)}

wherein y_n = 0 indicates that the sample is generated by the generator and y_n = 1 that the sample comes from the annotated image; D(G(X_n))^{(h,w)} is the confidence at position (h, w) for the prediction on X_n, and D(Y_n)^{(h,w)} is the confidence at position (h, w) for the annotated image Y_n; if pixel X_n^{(h,w)} belongs to class c, then Y_n^{(h,w,c)} is 1, otherwise 0;
the adversarial learning process trains the generator through the adversarial loss L_A:

L_A = -\sum_{h,w} \log D(G(X_n))^{(h,w)}

when training with unlabeled data, only L_A applies, and for unlabeled data confidence maps D(G(X_n))^{(h,w)} are generated by the trained discriminator network;
in place of the one-hot encoding Y_n of an annotated image, the pseudo-label Ŷ_n is set element by element: if c* = argmax_c G(X_n)^{(h,w,c)}, then Ŷ_n^{(h,w,c*)} = 1, otherwise 0;
a threshold T_{semi} is set to highlight the confident regions of the confidence map; L_S is defined as:

L_S = -\sum_{h,w} \sum_{c \in C} I\left(D(G(X_n))^{(h,w)} > T_{semi}\right) \hat{Y}_n^{(h,w,c)} \log G(X_n)^{(h,w,c)}

wherein I(·) is an indicator function; the sensitivity is controlled by setting the value of T_{semi}, thereby adjusting the training process of the network.
9. A semi-supervised remote sensing image semantic segmentation device is characterized by comprising:
the image acquisition module is used for acquiring an original remote sensing image;
the multi-scale attention feature map determining module is used for scaling the original remote sensing image into 3 scaled images with different sizes; respectively inputting the 3 scaled images into 3 criss-cross attention modules to obtain 3 attention feature maps; performing fusion processing on the 3 attention feature maps to obtain a multi-scale attention feature map;
and the semantic segmentation prediction map determining module is used for inputting the multi-scale attention feature map into the deep semantic segmentation network to obtain a semantic segmentation prediction map.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method of any of claims 1-8.
CN202110686544.8A 2021-06-21 2021-06-21 Semi-supervised remote sensing image semantic segmentation method and device and computer equipment Withdrawn CN113298815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110686544.8A CN113298815A (en) 2021-06-21 2021-06-21 Semi-supervised remote sensing image semantic segmentation method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN113298815A true CN113298815A (en) 2021-08-24

Family

ID=77329003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110686544.8A Withdrawn CN113298815A (en) 2021-06-21 2021-06-21 Semi-supervised remote sensing image semantic segmentation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN113298815A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780296A (en) * 2021-09-13 2021-12-10 山东大学 Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN113780296B (en) * 2021-09-13 2024-02-02 山东大学 Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN113989585A (en) * 2021-10-13 2022-01-28 北京科技大学 Medium-thickness plate surface defect detection method based on multi-feature fusion semantic segmentation
CN113989585B (en) * 2021-10-13 2022-08-26 北京科技大学 Medium-thickness plate surface defect detection method based on multi-feature fusion semantic segmentation
CN113989662A (en) * 2021-10-18 2022-01-28 中国电子科技集团公司第五十二研究所 Remote sensing image fine-grained target identification method based on self-supervision mechanism
CN114022762A (en) * 2021-10-26 2022-02-08 三峡大学 Unsupervised domain self-adaption method for extracting area of crop planting area
CN114972293A (en) * 2022-06-14 2022-08-30 深圳市大数据研究院 Video polyp segmentation method and device based on semi-supervised spatio-temporal attention network
CN115222629A (en) * 2022-08-08 2022-10-21 西南交通大学 Single remote sensing image cloud removing method based on cloud thickness estimation and deep learning
CN115496732A (en) * 2022-09-26 2022-12-20 电子科技大学 Semi-supervised heart semantic segmentation algorithm
CN115496732B (en) * 2022-09-26 2024-03-15 电子科技大学 Semi-supervised heart semantic segmentation algorithm
CN115375677A (en) * 2022-10-24 2022-11-22 山东省计算中心(国家超级计算济南中心) Wine bottle defect detection method and system based on multi-path and multi-scale feature fusion
CN116129117A (en) * 2023-02-03 2023-05-16 中国人民解放军海军工程大学 Sonar small target semi-supervised semantic segmentation method and system based on multi-head attention

Similar Documents

Publication Publication Date Title
CN113298815A (en) Semi-supervised remote sensing image semantic segmentation method and device and computer equipment
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111612008B (en) Image segmentation method based on convolution network
CN111860235B (en) Method and system for generating high-low-level feature fused attention remote sensing image description
CN112597941B (en) Face recognition method and device and electronic equipment
CN113780149B (en) Remote sensing image building target efficient extraction method based on attention mechanism
CN111369581A (en) Image processing method, device, equipment and storage medium
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
CN111723660A (en) Detection method for long ground target detection network
CN113505670A (en) Remote sensing image weak supervision building extraction method based on multi-scale CAM and super-pixels
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN115035599A (en) Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics
CN114550014A (en) Road segmentation method and computer device
CN117079276B (en) Semantic segmentation method, system, equipment and medium based on knowledge distillation
CN112465847A (en) Edge detection method, device and equipment based on clear boundary prediction
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN115063352A (en) Salient object detection device and method based on multi-graph neural network collaborative learning architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210824