CN117496158A - Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method - Google Patents

Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method

Info

Publication number
CN117496158A
CN117496158A
Authority
CN
China
Prior art keywords
mbi
semantic segmentation
remote sensing
pixel
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311758529.5A
Other languages
Chinese (zh)
Inventor
尹建伟
王修航
蔡钰祥
李照帅
郭玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311758529.5A priority Critical patent/CN117496158A/en
Publication of CN117496158A publication Critical patent/CN117496158A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semi-supervised scene fusion contrast learning and semantic segmentation method with an improved MBI (morphological building index). The invention pre-trains a shared encoder structure with a contrast learning task, and fine-tunes the shared encoder with an upsampling semantic segmentation task that fuses an MBI attention mechanism, obtaining the final semantic coding network and thereby improving the accuracy of remote sensing image semantic segmentation. In addition, the invention uses MBI prior knowledge to guide, respectively, the negative sampling of contrast learning and the semantic segmentation upsampling, strengthening the discrimination of positive and negative samples in contrast learning and the semantic segmentation results.

Description

Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method
Technical Field
The invention belongs to the technical field of remote sensing image semantic segmentation and contrast learning for building extraction, and in particular relates to a method that extracts buildings through semi-supervised scene fusion, improved MBI negative-sampling pixel-level contrast learning, and prior-knowledge-guided upsampling semantic segmentation of remote sensing images.
Background
With the progress of science and technology, especially radar monitoring technology, remote sensing has achieved all-weather, around-the-clock earth observation. Image data with high spatial and spectral resolution have become mainstream, ground-object texture and detail information are far richer, and remote sensing has formally entered the high-resolution era. Meanwhile, monitoring urban expansion is a topical research subject, and delimiting the urban area is the key question; building extraction is an important direction of high-resolution remote sensing technology, since it can accurately and effectively measure the distribution, density, and scale of urban buildings. A new remote sensing building-extraction approach therefore has substantial practical significance for urban planning and resource management.
Convolutional neural networks (CNNs) have been the most commonly used tool in semantic segmentation over the past few years, and CNN-based architectures such as FCN, SegNet, U-Net, PSPNet, and DeepLab have demonstrated their effectiveness on this task. These traditional semantic segmentation methods typically rely on a large amount of labeled training data in which each pixel is manually annotated with its class; manually labeling large-scale data sets, however, is expensive and time consuming. This is especially true in the high-resolution remote sensing field, where factors such as shooting angle, lighting conditions, and cloud coverage make labeling even harder. In addition, high-resolution remote sensing images typically contain a large amount of detail and complex scenes, making pixel-level labeling more time-consuming and error-prone. Against this background, contrast learning is attracting growing attention in the semantic segmentation field. Contrast learning is an unsupervised learning method that learns features from the similarities and differences between samples (Wu Z, Xiong Y, Yu S X, et al. Unsupervised feature learning via non-parametric instance discrimination [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3733-3742.). In the semantic segmentation task, contrast learning lets the model learn more discriminative feature representations, thereby improving segmentation performance.
A typical problem in the contrast learning field is the selection of positive and negative samples. Positive samples can be produced simply, e.g. by image enhancement, but selecting negative samples is more troublesome. When label information is available, pixels of the same class as the anchor pixel serve as positive samples and pixels of different classes as negative samples; in pixel-level contrast learning, however, the label information is unknown, and the selection of positive and negative samples becomes difficult.
To solve the positive/negative sample selection problem of pixel-level contrast learning, a new negative-sample sub-sampling strategy has drawn attention (Zhong Y, Yuan B, Wu H, et al. Pixel contrastive-consistent semi-supervised semantic segmentation [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 7273-7282.). Problems remain, however, when such methods or their derivatives are applied to the high-resolution remote sensing field: high-resolution remote sensing images contain a large amount of background information and ground objects at different scales, and sample selection must balance the attention paid to different ground-object categories, which a simple sub-sampling strategy struggles to achieve.
A natural idea is that, in a weakly supervised scenario, the building index image (Huang X, Zhang L. A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery [J]. Photogrammetric Engineering & Remote Sensing, 2011, 77(7): 721-732.) — the MBI feature map — can simply serve as a pseudo label to guide the selection of positive and negative samples for contrast learning. Considering, however, that the plain MBI computation mixes in a large number of roads, we borrow the affine-transformation idea of the STN spatial transformer network for image correction, and correct the coarsely extracted MBI feature map using the general semantic features extracted by FastSAM. In addition, the semantic cues contained in the building index feature map, including shape, size, and texture, can also guide the building segmentation upsampling process. How to strengthen the coupling between semantic segmentation and contrast learning, fusing MBI to guide both tasks so that they better help each other, is likewise a very important issue.
Disclosure of Invention
In view of the above, the invention provides a method for extracting buildings by upsampling semantic segmentation of remote sensing images under semi-supervised scene fusion, improved MBI (morphological building index) negative-sampling pixel-level contrast learning, and prior-knowledge guidance. Where a large amount of semi-supervised training data lacks labels, a contrast learning pre-trained model is used, and the model is then fine-tuned on a small labeled data set, so that buildings are extracted by semantic segmentation from few labeled samples. At the same time, MBI-guided upsampling of the remote sensing image generates the semantic segmentation result; through the mutual coordination and promotion of the semantic segmentation and upsampling modules, the performance and robustness of both the remote sensing semantic segmentation model and the MBI channel-attention fused upsampling model are further improved.
A contrast learning and semantic segmentation method for improving MBI through semi-supervised scene fusion is characterized by comprising the following steps:
(1) Image preparation: remote sensing images $x_s$ with semantic labels $y_s$, and unlabeled remote sensing images $x_u$;
(2) Constructing an MBI calculation module based on reference model improvement;
(3) Constructing and initializing a model, and constructing a contrast learning semantic segmentation model;
(4) Apply one round of color enhancement to the unlabeled remote sensing image $x_u$ to generate an enhanced sample $\hat{x}_u$, and select anchor pixels from $\hat{x}_u$. Input $\hat{x}_u$ into the improved MBI calculation module and take the resulting MBI feature map as a pseudo label: pixels in $x_u$ of the same class as the anchor pixel are selected as positive samples $z^+$, and pixels in $x_u$ of a different class as negative samples $z^-$ (see the sampling sketch after these steps). Feed the enhanced sample $\hat{x}_u$, positive samples $z^+$, and negative samples $z^-$ into the first module of the contrast learning semantic segmentation model obtained in step (3), and pre-train the shared encoder by maximizing positive-pair similarity and minimizing negative-pair similarity to obtain a preliminarily trained encoder;
Take the remote sensing image $x_s$ with semantic label $y_s$ as the input of the second module of the contrast learning semantic segmentation model, with the semantic label $y_s$ as the training label of the semantic segmentation module; the upsampling stage fuses MBI prior-knowledge guidance, and the preliminarily trained encoder is fine-tuned to obtain the final encoder;
(5) Use the final encoder obtained in step (4) for semi-supervised scene semantic segmentation and building extraction.
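The following is a minimal PyTorch sketch of the MBI-guided sampling in step (4). It assumes dense per-pixel embeddings and a binarized MBI pseudo-label map are already available; the function name `sample_pairs` and all tensor shapes are illustrative, not taken from the patent.

```python
import torch

def sample_pairs(feat, feat_aug, mbi_mask, num_anchors=256, num_neg=512):
    """Select anchor/positive/negative pixel embeddings with an MBI pseudo label.

    feat, feat_aug: (C, H, W) embeddings of the image and its color-enhanced view.
    mbi_mask:       (H, W) binary MBI pseudo label (1 = building, 0 = background).
    Assumes both pseudo classes are present in the mask.
    """
    C, H, W = feat.shape
    flat, flat_aug = feat.reshape(C, -1), feat_aug.reshape(C, -1)
    labels = mbi_mask.reshape(-1)

    # Anchors come from the enhanced view; the pixel at the same location in the
    # original view shares the anchor's pseudo class and serves as its positive.
    idx = torch.randperm(H * W)[:num_anchors]
    anchors = flat_aug[:, idx].t()              # (A, C)
    positives = flat[:, idx].t()                # (A, C)

    # Negatives: pixels whose MBI pseudo class differs from the anchor's.
    negatives = []
    for i in idx:
        diff = (labels != labels[i]).nonzero(as_tuple=True)[0]
        pick = diff[torch.randint(len(diff), (num_neg,))]
        negatives.append(flat[:, pick].t())     # (N, C)
    return anchors, positives, torch.stack(negatives)  # (A,C), (A,C), (A,N,C)
```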
The method extracts buildings by semantic segmentation of remote sensing images guided by improved MBI negative-sampling pixel-level contrast learning and prior knowledge under semi-supervised scene fusion, and mainly comprises a pixel-level contrast learning pre-training module with improved MBI negative sampling and a semantic segmentation fine-tuning training module with improved MBI prior-knowledge-guided upsampling. The invention pre-trains the shared encoder structure with a contrast learning task, and fine-tunes the shared encoder with an upsampling semantic segmentation task that fuses an MBI attention mechanism to obtain the final semantic coding network, thereby improving the accuracy of remote sensing image semantic segmentation. In addition, the invention uses improved MBI prior knowledge to guide, respectively, the negative sampling of contrast learning and the semantic segmentation upsampling, strengthening the discrimination of positive and negative samples in contrast learning and the semantic segmentation results.
Further, in step (2), the MBI calculation module based on reference model improvement comprises:
2.1) A RemoteCLIP pre-extraction module, for inputting the remote sensing image $x_s$ with semantic label $y_s$ and the unlabeled remote sensing image $x_u$ into the remote sensing vision reference model RemoteCLIP, obtaining general semantic features $f_g$ of the image to guide the MBI calculation;
2.2) An MBI calculation module, for inputting the remote sensing image $x_s$ with semantic label $y_s$ and the unlabeled remote sensing image $x_u$ into the MBI calculation module, obtaining a coarsely extracted building index feature map $m_c$;
2.3) An STN spatial transformer network, for inputting the general semantic features $f_g$ and the building index feature map $m_c$ into the STN spatial transformer network, obtaining a building index feature map $m_r$ with the influence of road noise removed.
Further, in step (3), the contrast learning semantic segmentation model comprises:
3.1) A first module, for comparing the similarity between anchor pixels and positive and negative samples in the unlabeled remote sensing image $x_u$, the negative samples being obtained by fused MBI negative sampling; it learns the feature representation and preliminarily trains the shared feature encoder, obtaining the preliminarily trained encoder;
3.2) A second module, for semantic segmentation training with the remote sensing image $x_s$ carrying semantic label $y_s$, the upsampling stage fusing MBI prior-knowledge guidance; it fine-tunes the preliminarily trained encoder, obtaining the final encoder.
Further, in step (3), the preliminary training of the shared feature encoder uses a contrast loss metric function $\mathcal{L}_{con}$, computed as:

$$\mathcal{L}_{con}=\sum_{i}\left[\mathcal{L}_{pix}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)+\lambda_{1}\,\mathcal{L}_{feat}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)\right]$$

where the first part represents the pixel-level difference constraint: $i$ is the pixel index, $z_{i}$ the anchor pixel, $z_{i}^{+}$ the positive contrast pixel, $z_{i}^{-}$ the negative contrast pixel, and $\lambda_{1}$ a hyperparameter; $\mathcal{L}_{pix}$ denotes the pixel-level contrast loss:

$$\mathcal{L}_{pix}=-\log\frac{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)}{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)+\sum_{z_{i}^{-}}\exp\left(z_{i}\cdot z_{i}^{-}/\tau\right)}$$

where $\tau$ is the scale-controlling temperature hyperparameter.

The second part represents the feature space constraint: $\lambda_{i}$ is a hyperparameter that adjusts the loss weight of each pixel, and $\mathcal{L}_{feat}$ denotes the feature space constraint loss:

$$\mathcal{L}_{feat}=\lambda_{i}\left[\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{+}\right)\right\|_{2}^{2}+\max\left(0,\,m-\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{-}\right)\right\|_{2}\right)^{2}\right]$$

where $\|\cdot\|_{2}$ denotes the $L_{2}$ norm; $\mu(z_{i})$, $\mu(z_{i}^{+})$, and $\mu(z_{i}^{-})$ are the means of the anchor pixel, positive contrast pixel, and negative contrast pixel in the feature space; and $m$ is the margin hyperparameter controlling the minimum distance between the anchor pixel and the negative contrast pixels.
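A minimal PyTorch sketch of this loss follows. The InfoNCE form of $\mathcal{L}_{pix}$ and the hinge form of $\mathcal{L}_{feat}$ are reconstructions inferred from the prose above, so the implementation is a sketch under those assumptions rather than the patent's definitive formula.

```python
import torch
import torch.nn.functional as F

def contrast_loss(anchor, pos, neg, tau=0.1, lambda1=1.0, margin=1.0):
    """Pixel-level InfoNCE term plus a feature-space margin term (sketch).

    anchor, pos: (A, C) anchor and positive pixel embeddings.
    neg:         (A, N, C) negative pixel embeddings per anchor.
    """
    anchor = F.normalize(anchor, dim=-1)
    pos = F.normalize(pos, dim=-1)
    neg = F.normalize(neg, dim=-1)

    # Pixel-level contrast loss: positives attract, negatives repel, scaled by tau.
    pos_sim = (anchor * pos).sum(-1, keepdim=True) / tau        # (A, 1)
    neg_sim = torch.einsum('ac,anc->an', anchor, neg) / tau     # (A, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    target = torch.zeros(len(anchor), dtype=torch.long)         # positive at index 0
    l_pix = F.cross_entropy(logits, target)

    # Feature-space constraint: L2 pull on positives, hinge push on negatives.
    d_pos = (anchor - pos).norm(dim=-1)                         # (A,)
    d_neg = (anchor.unsqueeze(1) - neg).norm(dim=-1)            # (A, N)
    l_feat = (d_pos ** 2).mean() + F.relu(margin - d_neg).pow(2).mean()

    return l_pix + lambda1 * l_feat
```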
Further, the method for extracting buildings by semantic segmentation of remote sensing images with fused MBI negative-sampling pixel-level contrast learning and prior-knowledge-guided upsampling in a semi-supervised scene comprises the following steps:
(1) Image preparation: high-resolution remote sensing images $x_s$ with semantic labels $y_s$, and unlabeled high-resolution remote sensing images $x_u$;
(2) An improved MBI calculation module is constructed, comprising the following three parts:
The MBI construction module (Huang X, Zhang L. A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery [J]. Photogrammetric Engineering & Remote Sensing, 2011, 77(7): 721-732.) highlights building features by applying morphological operations to the remote sensing image in different directions and at different scales, and differentiates the morphological profiles to obtain the building index feature map.
The FastSAM target-mask coarse extraction module (Zhao X, Ding W, An Y, et al. Fast Segment Anything [J]. arXiv preprint arXiv:2306.12156, 2023.) inputs the designated samples into the parameter-frozen FastSAM reference model and obtains the coarsely extracted building regions, i.e. the general semantic feature map;
The STN spatial transformer network (Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks [J]. Advances in Neural Information Processing Systems, 2015, 28.) constitutes the learning module: the localization network receives the general semantic feature map, learns the transformation parameters, and outputs affine transformation parameters; the sampler samples pixel values from the general semantic feature map with the transformed grid to generate a new feature map, which is fused with the MBI-computed feature map for correction.
(3) Model construction and initialization: a contrast learning semantic segmentation model is built, comprising the following two modules:
The first module (the pixel-level contrast learning pre-training module with fused improved MBI negative sampling) compares the similarity between anchor pixels and positive and negative samples of unlabeled images, where the negative samples are derived by fused MBI negative sampling; it is used to learn feature representations and to preliminarily train the shared feature encoder.
The second module (the semantic segmentation fine-tuning training module with fused improved MBI prior-knowledge-guided upsampling) fine-tunes the shared encoder structure through semantic segmentation training on labeled images, where the upsampling stage fuses the MBI prior-knowledge guidance.
(4) Take the unlabeled high-resolution image $x_u$ as the source of anchor pixel samples. Simultaneously input it into the MBI construction module to obtain the MBI feature map, and take the MBI feature map as a pseudo label: pixels of the same class as the anchor pixel are selected as positive samples $z^+$, and pixels of different classes as negative samples $z^-$. With the enhanced sample $\hat{x}_u$, positive samples $z^+$, and negative samples $z^-$ as input, the shared encoder is contrast-learning pre-trained by maximizing positive-pair similarity and minimizing negative-pair similarity.
(5) Take the high-resolution remote sensing image $x_s$ as model input and the semantic label $y_s$ as the training label of the semantic segmentation module, and train the semantic segmentation fine-tuning model with fused MBI prior-knowledge-guided upsampling.
In the method for extracting buildings by upsampling semantic segmentation of remote sensing images with semi-supervised scene fusion, improved MBI negative-sampling pixel-level contrast learning, and prior knowledge, the pixel-level contrast learning pre-training module built in step (3) is realized as follows:
2.1 A contrast learning model {E, M, C} is formed from a shared encoder E, the improved shared MBI calculation module M, and a contrast learning module C;
2.2 The enhanced sample $\hat{x}_u$ is input into the contrast learning model {E, M, C}, the pixel-level positive/negative contrast loss is computed, and training proceeds by contrast learning that maximizes positive-pair similarity, with negative-sample pixel selection depending on the MBI calculation.
In the same method, the prior-knowledge-guided upsampling semantic segmentation module built in step (3) is realized as follows:
3.1 A guided upsampling semantic segmentation model {E, M, R, U} with fused MBI prior knowledge is formed from the shared encoder E, the improved shared MBI calculation module M, an attention calculation module R, and an upsampling module U;
3.2 The labeled high-resolution remote sensing image $x_s$ is input into the model {E, M, R, U} to obtain its semantic segmentation feature map $\hat{y}$, whose resolution is consistent with that of the high-resolution remote sensing image $x_s$;
3.3 The guided upsampling module is trained with the semantic labels $y_s$ while the shared encoder E structure is fine-tuned.
In the method for extracting buildings by semantic segmentation of remote sensing images with semi-supervised scene fusion MBI negative-sampling pixel-level contrast learning and prior-knowledge-guided upsampling, the contrast loss metric function $\mathcal{L}_{con}$ in the contrast learning pre-training module is computed as:

$$\mathcal{L}_{con}=\sum_{i}\left[\mathcal{L}_{pix}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)+\lambda_{1}\,\mathcal{L}_{feat}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)\right]$$

where the first part represents the pixel-level difference constraint: $i$ is the pixel index, $z_{i}$ the anchor pixel, $z_{i}^{+}$ the positive contrast pixel, $z_{i}^{-}$ the negative contrast pixel, and $\lambda_{1}$ a hyperparameter; $\mathcal{L}_{pix}$ denotes the pixel-level contrast loss:

$$\mathcal{L}_{pix}=-\log\frac{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)}{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)+\sum_{z_{i}^{-}}\exp\left(z_{i}\cdot z_{i}^{-}/\tau\right)}$$

where $\tau$ is the scale-controlling temperature hyperparameter.

The second part represents the feature space constraint: $\lambda_{i}$ is a hyperparameter that adjusts the loss weight of each pixel, and $\mathcal{L}_{feat}$ denotes the feature space constraint loss:

$$\mathcal{L}_{feat}=\lambda_{i}\left[\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{+}\right)\right\|_{2}^{2}+\max\left(0,\,m-\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{-}\right)\right\|_{2}\right)^{2}\right]$$

where $\|\cdot\|_{2}$ denotes the $L_{2}$ norm; $\mu(z_{i})$, $\mu(z_{i}^{+})$, and $\mu(z_{i}^{-})$ are the means of the anchor pixel, positive contrast pixel, and negative contrast pixel in the feature space; and $m$ is the margin hyperparameter controlling the minimum distance between the anchor pixel and the negative contrast pixels.
The method is a complete remote sensing image contrast learning semantic segmentation framework, comprising the improved MBI calculation module, the shared encoder module, the contrast learning pre-training module based on fused MBI negative-sample selection, the upsampling semantic segmentation module guided by fused MBI prior knowledge, and the generation of remote sensing image semantic segmentation results.
The invention provides a method for extracting buildings by upsampling semantic segmentation of remote sensing images through semi-supervised scene fusion, improved MBI negative-sampling pixel-level contrast learning, and prior knowledge. The pixel-level contrast learning pre-training module adopts a shared feature encoder, a shared MBI calculation module, and an independent contrast learning module, and performs pixel-level contrast learning with fused MBI negative sampling to obtain the pre-trained model. The guided upsampling semantic segmentation module adopts the shared feature encoder, the shared MBI calculation module, and an independent upsampling decoder, performs semantic segmentation training with fused MBI upsampling, and fine-tunes the whole model to obtain the semantic segmentation result map.
Drawings
Fig. 1 is a schematic flow chart of the steps of the method for extracting buildings by upsampling semantic segmentation of remote sensing images with semi-supervised scene fusion, improved MBI negative-sampling pixel-level contrast learning, and prior-knowledge guidance.
Fig. 2 is a schematic diagram of a specific implementation process of the method.
Detailed Description
In order to more particularly describe the present invention, the following detailed description of the technical scheme of the present invention is provided with reference to the accompanying drawings and the specific embodiments.
As shown in fig. 1 and fig. 2, the method for extracting buildings by upsampling semantic segmentation of remote sensing images guided by semi-supervised scene fusion, improved MBI negative-sampling pixel-level contrast learning, and prior knowledge comprises the following steps:
(1) Image preparation: high-resolution remote sensing images $x_s$ with semantic labels $y_s$, and unlabeled high-resolution remote sensing images $x_u$.
In this embodiment, GF1 satellite images with building labels serve as the semantic segmentation module input, and unlabeled GF1B satellite images as the contrast learning module input. The 33 GF1 satellite scenes have 3 channels (RGB) and a spatial resolution of 2 m; in this embodiment the images are cut into 512×512-pixel tiles (high resolution), yielding 3039 semantic segmentation training image blocks with corresponding building labels. The 36 GF1B satellite scenes likewise have 3 channels (RGB) and a spatial resolution of 2 m, and are cut into 512×512-pixel tiles (low resolution), yielding 2932 contrast learning training image blocks.
(2) An improved MBI computing module is constructed.
The reference model used in this embodiment is the multidirectional and multiscale morphological building index calculation model: first, the maximum multiband brightness in the remote sensing image is computed as the image brightness value; morphological top-hat reconstruction and morphological profile calculation are then performed to characterize the shape, height variation, and contour features of buildings, yielding the building index features (a sketch of this computation follows below);
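A minimal sketch of that computation, assuming OpenCV and NumPy are available. Plain white top-hats with rotated line structuring elements stand in for top-hat-by-reconstruction, and the scale/angle sets are illustrative defaults, so this approximates rather than reproduces the reference model.

```python
import cv2
import numpy as np

def mbi(image, scales=(5, 9, 13, 17), angles=(0, 45, 90, 135)):
    """Morphological building index: averaged differential top-hat profiles.

    image: (H, W, B) multispectral array; returns an (H, W) building index map.
    """
    # Brightness = per-pixel maximum over the spectral bands.
    brightness = image.max(axis=2).astype(np.float32)

    profiles = []
    for s in scales:
        for a in angles:
            # Directional line structuring element of length s rotated by angle a.
            se = np.zeros((s, s), np.uint8)
            cv2.line(se, (0, s // 2), (s - 1, s // 2), 1)
            rot = cv2.getRotationMatrix2D((s / 2.0, s / 2.0), a, 1.0)
            se = cv2.warpAffine(se, rot, (s, s), flags=cv2.INTER_NEAREST)
            # The white top-hat highlights bright structures narrower than the SE.
            profiles.append(cv2.morphologyEx(brightness, cv2.MORPH_TOPHAT, se))

    # Differential morphological profile across scales, averaged over directions.
    stack = np.stack(profiles).reshape(len(scales), len(angles), *brightness.shape)
    dmp = np.abs(np.diff(stack, axis=0))
    return dmp.mean(axis=(0, 1))
```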
the adopted correction model is a visual reference model FastSAM, the model comprises two parts, one part is based on a full-instance segmentation network of YOLOv8, CSPDarknet53 is used as a backbone network, a characteristic pyramid network structure is connected to detect targets on different scales, and finally a detection head comprises a 4-layer convolution layer, a 1-layer normalization layer and an activation function to generate target frame positions and category predictions; the other part is a promt guiding selection module which comprises an image encoder, a promt encoder and a mask decoder. The FastSAM model does not need training, and adopts freezing parameters for coarse extraction.
The correction module selects an STN space transformation network, three modules of a positioning network, a grid generator and a sampler are constructed, the positioning network is a 2-layer convolutional neural network, the grid generator comprises a 1-layer convolutional layer and a 1-layer full-connection layer, and the sampler generally adopts a bilinear interpolation strategy. The method comprises the steps of taking a general semantic feature map as input, a positioning network receiving the general semantic feature map, learning transformation parameters, outputting affine transformation parameters, a grid generator generating a normalized grid by using the parameters output by the positioning network, a sampler sampling pixel values from the general semantic feature map by using the transformed grid to generate a new feature map, fusing the generated new feature map with an MBI calculation feature map, adopting self-attention mechanism operation in the fusion, taking the MBI feature map as a key, roughly extracting general semantic features as a query, and obtaining a corrected image.
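A minimal PyTorch sketch of that STN correction step under the structure just described. The layer widths are illustrative, and the final multiplicative fusion is a simplification of the self-attention fusion in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNCorrector(nn.Module):
    """Affine-warp the general semantic map, then use it to correct the MBI map."""

    def __init__(self):
        super().__init__()
        # Localization network: 2 conv layers plus a head regressing 6 affine params.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
        )
        # Start from the identity transform so early training is stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, semantic_map, mbi_map):
        # semantic_map, mbi_map: (B, 1, H, W); grid generator + bilinear sampler.
        theta = self.loc(semantic_map).view(-1, 2, 3)
        grid = F.affine_grid(theta, semantic_map.size(), align_corners=False)
        warped = F.grid_sample(semantic_map, grid, align_corners=False)
        # Suppress MBI responses (e.g. roads) outside the warped building regions.
        return mbi_map * torch.sigmoid(warped)
```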
(3) Model construction and initialization: the contrast learning semantic segmentation model is divided into the following two modules:
The first module is the pixel-level contrast learning pre-training module with fused MBI negative sampling; a contrast learning model {E, M, C} is formed from the shared encoder E, the shared MBI calculation module M, and the contrast learning module C. The unlabeled high-resolution remote sensing image $x_u$ is input, with its pixels taken as anchor pixel samples; the corresponding-position pixels obtained by color enhancement serve as positive samples $z^+$, while negative samples $z^-$ are selected depending on the MBI calculation, i.e. pixels belonging to classes different from the anchor pixel. The shared encoder structure is preliminarily obtained by training the contrast model;
The second module is the semantic segmentation fine-tuning training module with fused MBI prior-knowledge-guided upsampling; a fused-MBI channel-attention upsampling model {E, M, R, U} is formed from the shared feature encoder E, the shared MBI calculation module M, the attention calculation module R, and an independent upsampling decoder U. After the labeled high-resolution remote sensing image $x_s$ is input into the semantic segmentation model {E, M, R, U}, an image semantic segmentation result $\hat{y}$ is obtained, whose resolution is consistent with that of the high-resolution remote sensing image $x_s$.
In this embodiment a Swin Transformer encoder is used as the shared feature encoder E, with the following implementation details:
The encoder E follows the Swin-L reference model. Stage 1 downsamples the original image by 4×; the Linear Embedding layer outputs 192 channels, after which two Swin Transformer Blocks are stacked, the first using W-MSA and the second SW-MSA, with window size 7×7, 192 output feature channels, and 6 attention heads. Stage 2 first downsamples the feature map by 2× through a Patch Merging layer with 384 output channels, then stacks two Swin Transformer Blocks (W-MSA, then SW-MSA), with window size 7×7, 384 output feature channels, and 12 attention heads. Stage 3 downsamples by 2× through a Patch Merging layer with 768 output channels, then stacks two Swin Transformer Blocks (W-MSA, then SW-MSA), with window size 7×7, 768 output feature channels, and 24 attention heads. Stage 4 downsamples by 2× through a Patch Merging layer with 1536 output channels, then stacks two Swin Transformer Blocks (W-MSA, then SW-MSA), with window size 7×7, 1536 output feature channels, and 48 attention heads. The remaining Layer Norm layer, global pooling layer, and fully connected layer are not described in detail. A sketch of the Patch Merging operation follows below.
The pixel-level contrast learning pre-training model {E, M, C} with negative sampling is implemented as follows:
The unlabeled high-resolution remote sensing image $x_u$ and its enhanced image $\hat{x}_u$ are input into the shared encoder E to obtain encoded feature maps, yielding the anchor pixel samples and the corresponding positive samples (the encoder is implemented as described above). The high-resolution remote sensing image $x_u$ is then input into the shared MBI calculation module to obtain the MBI feature map, pixels belonging to classes different from the anchor pixel are selected as negative-sample pixels, and the pre-training model is trained by contrast learning with a minimized contrast loss function, giving the pre-trained shared encoder structure.
The prior-knowledge-guided upsampling semantic segmentation model {E, M, R, U} is implemented as follows. The labeled high-resolution remote sensing image $x_s$ is input into the shared encoder E to obtain the encoded feature map (the encoder is implemented as described above). Channel attention is then computed on the feature-layer image $i_{lr}$ to obtain importance weights for each channel that guide feature fusion and processing; an adaptive attention mechanism or global pooling could be employed, but here the common self-attention mechanism is used to calculate the attention weights, followed by a weighted sum, giving the attention representation $i_l$ of the feature map. Specifically, a linear transformation, generally realized by a convolutional layer, is applied to the MBI calculation result map to obtain the mapped feature map $f_l$; two independent linear transformations are applied to the encoded feature-layer image $i_{lr}$, giving the mapped feature maps $g_l$ and $h_l$. A convolution with window size 3×3 is applied to $f_l$ to obtain the query feature map Q, and convolutions are applied to $g_l$ and $h_l$ respectively to obtain the key feature map K and the value feature map V. The similarity score between Q and K is computed by dot product: the channel dimensions of Q and K are normalized, elements at corresponding positions are multiplied to obtain the attention weight matrix A, which is normalized, multiplied with the value feature map V, and summed along the channel dimension to obtain the fused feature representation.
The fused feature map is input into the upsampling decoder U to generate the final semantic segmentation result image $\hat{y}$; the upsampling decoder selects a trainable transposed convolutional layer to restore the input feature map resolution to be consistent with the original image (a sketch follows below).
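A minimal PyTorch sketch of the attention fusion and transposed-convolution decoder described above. The channel counts, the 3×3 kernels for K and V, and the sigmoid normalization of the weight map are illustrative assumptions; the MBI map is assumed already resized to the encoder feature resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MBIAttentionUpsampler(nn.Module):
    """Fuse encoder features with the MBI map via Q/K/V attention, then upsample."""

    def __init__(self, c_feat=1536, c_mid=64, num_classes=2, up_factor=32):
        super().__init__()
        self.q = nn.Conv2d(1, c_mid, 3, padding=1)       # query from the MBI map
        self.k = nn.Conv2d(c_feat, c_mid, 3, padding=1)  # key from encoder features
        self.v = nn.Conv2d(c_feat, c_mid, 3, padding=1)  # value from encoder features
        # Trainable transposed convolution restores the input resolution.
        self.up = nn.ConvTranspose2d(c_mid, num_classes,
                                     kernel_size=up_factor, stride=up_factor)

    def forward(self, feat, mbi):
        # feat: (B, c_feat, h, w); mbi: (B, 1, h, w), resized to match feat.
        q = F.normalize(self.q(mbi), dim=1)
        k = F.normalize(self.k(feat), dim=1)
        v = self.v(feat)
        # Per-position similarity of Q and K, squashed into (0, 1) weights.
        attn = torch.sigmoid((q * k).sum(dim=1, keepdim=True))  # (B, 1, h, w)
        return self.up(attn * v)                                # (B, classes, H, W)
```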
This embodiment is illustrated with the above models, but the invention is not limited to these encoder and decoder choices.
(4) Model training: take the unlabeled high-resolution image $x_u$ as the source of anchor pixel samples. Simultaneously input it into the MBI construction module to obtain the MBI feature map, and take the MBI feature map as a pseudo label: pixels of the same class as the anchor pixel are selected as positive samples $z^+$, and pixels of different classes as negative samples $z^-$. With the enhanced sample $\hat{x}_u$, positive samples $z^+$, and negative samples $z^-$ as input, the shared encoder is contrast-learning pre-trained by maximizing positive-pair similarity and minimizing negative-pair similarity, where the contrast loss metric function $\mathcal{L}_{con}$ of contrast learning is:
Wherein: the first part represents a pixel level difference constraint, i is the pixel index, z i Representing the number of anchor pixels to be anchored,representing positive sample alignment pixels,>represents a negative sample alignment pixel, lambda 1 Is super-parameter (herba Cinchi Oleracei)>Representing pixel level contrast loss, the mathematical expression is as follows:
where τ is the control scale hyper-parameter.
The second part represents the feature space constraint, i is the pixel index, z i Representing the number of anchor pixels to be anchored,representing positive sample alignment pixels,>represents a negative sample alignment pixel, lambda i For superparameter, representing adjusting the loss weight of each pixel, +.>Representing feature space constraint loss, the mathematical expression is as follows:
wherein I 2 Represents L 2 Normal, μ (z i ) Representing an anchor pixel z i The mean value in the feature space is calculated,representing positive sample alignment pixels->Mean value in feature space, < >>Represents the negative sample contrast pixel +.>And (3) controlling the minimum distance between the anchor pixel and the negative sample comparison pixel by using the average value m in the feature space as the interval hyper-parameter.
Take the high-resolution remote sensing image $x_s$ as model input and the semantic label $y_s$ as the training label of the semantic segmentation module, and train the semantic segmentation fine-tuning model with fused MBI prior-knowledge-guided upsampling, where the training function adopts a mean square error loss:

$$\mathcal{L}_{seg}=\left\|U\left(R\left(E\left(x_s\right),M\left(x_s\right)\right)\right)-y_s\right\|_{2}^{2}$$

where $x_s$ is the high-resolution remote sensing image; E is the shared encoder, M the MBI calculation module, R the attention calculation module, and U the upsampling decoder, together forming the semantic segmentation model with fused MBI prior-knowledge-guided upsampling; and $\|\cdot\|_{2}$ is the $L_2$ norm.
In this embodiment, the labeled GF1 satellite images serve as the semantic segmentation input and the unlabeled GF1B satellite images as the contrast learning input; the image sizes are 512×512 pixels with 3 channels. The training loss functions comprise a cross-entropy loss, a mean square error loss, and the contrast learning loss; the learning rate is 10^-4; the optimization algorithm is Adam; training stops after 100 epochs, after which the semantic segmentation model is trained (a sketch of the loop follows below).
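A minimal sketch of the fine-tuning loop under these settings (Adam, learning rate 1e-4, 100 epochs), assuming PyTorch; the model and data loader are placeholders, and the way the cross-entropy and mean-square-error terms are combined is an assumption.

```python
import torch

def finetune(model, loader, epochs=100, lr=1e-4, device='cuda'):
    """Fine-tune the segmentation model with cross-entropy + MSE losses (sketch)."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    mse = torch.nn.MSELoss()

    for epoch in range(epochs):
        for image, label in loader:   # image: (B, 3, 512, 512); label: (B, 512, 512)
            image, label = image.to(device), label.to(device).long()
            logits = model(image)     # (B, num_classes, 512, 512)
            prob = logits.softmax(dim=1)[:, 1]            # building probability map
            loss = ce(logits, label) + mse(prob, label.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```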
Table 1 shows the F1-score (F1) and Intersection over Union (IoU) indexes, computed from the semantic segmentation results against the label ground truth, for no contrast learning, histogram matching (traditional method), an existing contrast learning method based on a sub-sampling negative-sample selection strategy, and the contrast learning strategy of the invention, as tested in the comparison experiments.
TABLE 1

        Non-contrast learning   Histogram matching   Existing contrast learning   Contrast learning of the invention
F1      0.1447                  0.2576               0.4124                       0.5026
IoU     0.0902                  0.1783               0.3046                       0.3823
From the experimental results, compared with no contrast learning, the IoU index of semantic segmentation is effectively improved, by 0.2921. Meanwhile, compared with simple histogram matching, the IoU index of this embodiment improves by 0.204; compared with the existing contrast learning method, it improves by 0.0777, which shows that the invention can effectively improve the accuracy of negative-sample pixel sampling. The invention is therefore of great help to contrast learning training.
Table 2 shows the peak signal-to-noise ratio (PSNR) and Structural Similarity Index Measure (SSIM) indexes, computed from the upsampling results against the label ground truth, for bilinear interpolation (traditional method), the convolutional-neural-network-based super-resolution method ESPCN, and the upsampling method of the invention, as tested in the comparison experiments.
TABLE 2
From the above experimental results, compared with the traditional interpolation method, this embodiment effectively improves the upsampling PSNR index by 3.99. Meanwhile, compared with the convolutional-neural-network-based ESPCN, the PSNR index of this embodiment improves by 1.81, which shows that, assisted by the semantic segmentation task, the invention can obtain high-quality results closer to the ground truth. The invention is therefore of great help to improving remote sensing image upsampling and semantic segmentation performance.
Table 3 shows the semantic segmentation F1-score (F1) and Intersection over Union (IoU) indexes obtained with a contrast learning loss function containing only the pixel-level difference constraint versus the invention's contrast learning loss function containing both the pixel-level difference constraint and the feature space difference constraint, as tested in the comparison experiments.
TABLE 3

        Pixel-difference constraint loss   Contrast learning loss of the invention
F1      0.4428                             0.5093
IoU     0.3272                             0.3801
As the experimental results show, compared with the traditional contrast learning loss function using only the pixel difference constraint, the contrast learning loss function of this embodiment effectively improves the semantic segmentation F1 index by 0.0665 and the IoU index by 0.0529, which shows that it imposes a stronger constraint on contrast learning and can improve the performance of the pre-trained encoder. The invention is therefore of great help to improving remote sensing semantic segmentation performance.
The embodiments described above are presented to facilitate the understanding and application of the invention by those skilled in the art. Persons skilled in the art can clearly make various modifications to these embodiments and apply the general principles described herein to other embodiments without inventive effort. The invention is therefore not limited to the above embodiments, and improvements and modifications made by those skilled in the art according to this disclosure should fall within the protection scope of the invention.

Claims (4)

1. A contrast learning and semantic segmentation method for improving MBI through semi-supervised scene fusion is characterized by comprising the following steps:
(1) Image preparation: remote sensing images $x_s$ with semantic labels $y_s$, and unlabeled remote sensing images $x_u$;
(2) Constructing an MBI calculation module based on reference model improvement;
(3) Constructing and initializing a model, and constructing a contrast learning semantic segmentation model;
(4) Apply one round of color enhancement to the unlabeled remote sensing image $x_u$ to generate an enhanced sample $\hat{x}_u$, and select anchor pixels from $\hat{x}_u$. Input $\hat{x}_u$ into the improved MBI calculation module and take the resulting MBI feature map as a pseudo label: pixels in $x_u$ of the same class as the anchor pixel are selected as positive samples $z^+$, and pixels in $x_u$ of a different class as negative samples $z^-$. Input the enhanced sample $\hat{x}_u$, positive samples $z^+$, and negative samples $z^-$ into the first module of the contrast learning semantic segmentation model obtained in step (3), and pre-train the shared encoder by maximizing positive-pair similarity and minimizing negative-pair similarity to obtain a preliminarily trained encoder;
Take the remote sensing image $x_s$ with semantic label $y_s$ as the input of the second module of the contrast learning semantic segmentation model, with the semantic label $y_s$ as the training label of the semantic segmentation module; the upsampling stage fuses MBI prior-knowledge guidance, and the preliminarily trained encoder is fine-tuned to obtain the final encoder;
(5) Use the final encoder obtained in step (4) for semi-supervised scene semantic segmentation and building extraction.
2. The method for contrast learning and semantic segmentation of semi-supervised scene fusion improved MBI according to claim 1, wherein in step (2), the MBI calculation module based on reference model improvement comprises:
2.1) A FastSAM pre-extraction module, for inputting the remote sensing image $x_s$ with semantic label $y_s$ and the unlabeled remote sensing image $x_u$ into the vision reference model FastSAM, obtaining general semantic features $f_g$ of the image to guide the MBI calculation;
2.2) An MBI calculation module, for inputting the remote sensing image $x_s$ with semantic label $y_s$ and the unlabeled remote sensing image $x_u$ into the MBI calculation module, obtaining a coarsely extracted building index feature map $m_c$;
2.3) An STN spatial transformer network, for inputting the general semantic features $f_g$ and the building index feature map $m_c$ into the STN spatial transformer network, obtaining a building index feature map $m_r$ with the influence of road noise removed.
3. The method for contrast learning and semantic segmentation of semi-supervised scene fusion improved MBI according to claim 1, wherein in step (3), the contrast learning semantic segmentation model comprises:
3.1) A first module, for comparing the similarity between anchor pixels and positive and negative samples in the unlabeled remote sensing image $x_u$, the negative samples being obtained by fused MBI negative sampling; it learns the feature representation and preliminarily trains the shared feature encoder, obtaining the preliminarily trained encoder;
3.2) A second module, for semantic segmentation training with the remote sensing image $x_s$ carrying semantic label $y_s$, the upsampling stage fusing MBI prior-knowledge guidance; it fine-tunes the preliminarily trained encoder, obtaining the final encoder.
4. The method for contrast learning and semantic segmentation of semi-supervised scene fusion improved MBI according to claim 3, wherein in step 3.1), the preliminary training of the shared feature encoder uses a contrast loss metric function $\mathcal{L}_{con}$, computed as:

$$\mathcal{L}_{con}=\sum_{i}\left[\mathcal{L}_{pix}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)+\lambda_{1}\,\mathcal{L}_{feat}\left(z_{i},z_{i}^{+},z_{i}^{-}\right)\right]$$

where the first part represents the pixel-level difference constraint: $i$ is the pixel index, $z_{i}$ the anchor pixel, $z_{i}^{+}$ the positive contrast pixel, $z_{i}^{-}$ the negative contrast pixel, and $\lambda_{1}$ a hyperparameter; $\mathcal{L}_{pix}$ denotes the pixel-level contrast loss:

$$\mathcal{L}_{pix}=-\log\frac{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)}{\exp\left(z_{i}\cdot z_{i}^{+}/\tau\right)+\sum_{z_{i}^{-}}\exp\left(z_{i}\cdot z_{i}^{-}/\tau\right)}$$

where $\tau$ is the scale-controlling temperature hyperparameter;

the second part represents the feature space constraint: $\lambda_{i}$ is a hyperparameter that adjusts the loss weight of each pixel, and $\mathcal{L}_{feat}$ denotes the feature space constraint loss:

$$\mathcal{L}_{feat}=\lambda_{i}\left[\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{+}\right)\right\|_{2}^{2}+\max\left(0,\,m-\left\|\mu\left(z_{i}\right)-\mu\left(z_{i}^{-}\right)\right\|_{2}\right)^{2}\right]$$

where $\|\cdot\|_{2}$ denotes the $L_{2}$ norm; $\mu(z_{i})$, $\mu(z_{i}^{+})$, and $\mu(z_{i}^{-})$ are the means of the anchor pixel, positive contrast pixel, and negative contrast pixel in the feature space; and $m$ is the margin hyperparameter controlling the minimum distance between the anchor pixel and the negative contrast pixels.
CN202311758529.5A 2023-12-20 2023-12-20 Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method Pending CN117496158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311758529.5A CN117496158A (en) 2023-12-20 2023-12-20 Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311758529.5A CN117496158A (en) 2023-12-20 2023-12-20 Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method

Publications (1)

Publication Number Publication Date
CN117496158A true CN117496158A (en) 2024-02-02

Family

ID=89671163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311758529.5A Pending CN117496158A (en) 2023-12-20 2023-12-20 Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method

Country Status (1)

Country Link
CN (1) CN117496158A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071763A (en) * 2024-04-16 2024-05-24 浙江大学 Self-training-based semi-supervised three-dimensional shape segmentation method and device


Similar Documents

Publication Publication Date Title
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
Hou et al. SolarNet: a deep learning framework to map solar power plants in China from satellite imagery
Fu et al. Research on semantic segmentation of high-resolution remote sensing image based on full convolutional neural network
CN110675421B (en) Depth image collaborative segmentation method based on few labeling frames
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN117496158A (en) Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method
Liu et al. Survey of road extraction methods in remote sensing images based on deep learning
CN116453121B (en) Training method and device for lane line recognition model
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
Benbahrıa et al. Intelligent mapping of irrigated areas from Landsat 8 images using transfer learning
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
Gao et al. Multiscale curvelet scattering network
CN117788296B (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
CN117671509B (en) Remote sensing target detection method and device, electronic equipment and storage medium
CN114550014A (en) Road segmentation method and computer device
Gui et al. A scale transfer convolution network for small ship detection in SAR images
Yanan et al. Cloud detection for satellite imagery using deep learning
Wang [Retracted] Landscape Classification Method Using Improved U‐Net Model in Remote Sensing Image Ecological Environment Monitoring System
Yue et al. SCFNet: Semantic correction and focus network for remote sensing image object detection
CN116758419A (en) Multi-scale target detection method, device and equipment for remote sensing image
Zhang et al. Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
Jing et al. Time series land cover classification based on semi-supervised convolutional long short-term memory neural networks
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
Yuan et al. Buildings change detection using high-resolution remote sensing images with self-attention knowledge distillation and multiscale change-aware module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination