CN115965789A - Scene perception attention-based remote sensing image semantic segmentation method - Google Patents

Scene perception attention-based remote sensing image semantic segmentation method

Info

Publication number
CN115965789A
Authority
CN
China
Prior art keywords
class
remote sensing
semantic segmentation
attention
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310061100.4A
Other languages
Chinese (zh)
Inventor
冯天 (Tian Feng)
张微 (Wei Zhang)
洪廷锋 (Tingfeng Hong)
马笑文 (Xiaowen Ma)
车瑞 (Rui Che)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310061100.4A priority Critical patent/CN115965789A/en
Publication of CN115965789A publication Critical patent/CN115965789A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image semantic segmentation method based on scene-aware attention. To exploit the inherent spatial correlation of ground objects in high-resolution remote sensing images and to address problems such as complex backgrounds and large intra-class variance, a class center generation submodule produces a local class center and a global class center, and a scene-aware attention submodule embeds context information and position prior information into the feature representation of each pixel while introducing the local class center as an intermediate perceptual element that indirectly associates pixels with the global class center. The invention not only exploits the spatial correlation of ground objects in remote sensing images to strengthen context modeling, but also alleviates heavy background noise and large intra-class variance. By combining scene awareness with class-level context aggregation, the method provides a new solution for high-resolution remote sensing image segmentation and improves the accuracy of remote sensing image semantic segmentation.

Description

Scene perception attention-based remote sensing image semantic segmentation method
Technical Field
The invention draws on techniques from the fields of deep learning and computer vision, and in particular relates to a semantic segmentation method for high-resolution remote sensing images based on scene-aware class-level context aggregation.
Background
Semantic segmentation aims to predict the semantic class of each pixel in an image and is one of the fundamental yet highly challenging tasks in remote sensing image analysis. By providing semantic and localization information for ground objects of interest, it plays an important role in road extraction, urban planning, environmental monitoring, and other fields. Compared with natural images, ground objects in remote sensing images exhibit observable intrinsic spatial correlations; for example, vehicles usually stay on roads, and buildings are densely distributed along both sides of roads.
In recent years, convolutional neural networks (CNNs) have become an important driver of progress in semantic segmentation owing to their strong feature extraction capability. However, because of their fixed geometric structure, CNNs have a natural limitation: they can only effectively capture local receptive fields and short-range contextual information. Context modeling, including spatial context modeling and relational context modeling, is therefore an important choice for capturing long-range dependencies.
Spatial context modeling methods such as PSPNet and DeepLabv3+ aggregate context information using spatial pyramid pooling and multi-scale atrous convolution, respectively. These methods focus on capturing homogeneous contextual dependencies but often ignore class differences, which can introduce unreliable context when confusable classes appear in the image scene.
Relational context modeling methods adopt an attention mechanism: pixel-level similarities within the image are computed to weight and aggregate heterogeneous context information, which has achieved remarkable results in semantic segmentation. However, these methods mainly focus on relationships between pixels and neglect each pixel's perception of its scene (i.e., global context information and position priors), so the spatial correlation of ground objects in remote sensing images is not fully exploited.
On this basis, the invention first improves the spatial attention mechanism and proposes a scene-aware attention submodule that exploits the spatial correlation of ground objects in remote sensing images by embedding scene awareness into the pixels. Scene awareness has two parts. One is context information embedding, which identifies the different pairwise relations between ground objects in different scenes: in urban areas, for example, roads usually coexist with buildings, whereas in rural areas roads may be surrounded by farmland. The other is position prior embedding, which identifies the inherent spatial distribution patterns that ground objects follow: for example, pixels at similar distances usually show higher correlation, and pixels of the same object usually follow certain positional relationships.
In addition, remote sensing images are characterized by complex backgrounds and large intra-class variance. The dense affinity operations of traditional attention mechanisms introduce a large amount of background noise, and a global class representation alone is therefore ill-suited to the large intra-class variance of remote sensing images.
Disclosure of Invention
The technical problem addressed by the invention is how to fuse local and global attention on top of a position prior and a context prior constructed for the scene in which each pixel resides, improving the feature expression of each pixel through scene awareness and class-level context aggregation; to this end, a remote sensing image semantic segmentation method based on scene-aware attention is provided. By introducing local-global attention and associating pixels with global class representations through local class representations acting as intermediate perceptual elements, the method greatly reduces the required attention operations while improving model accuracy.
The invention adopts the following specific technical scheme:
A remote sensing image semantic segmentation method based on scene-aware attention comprises the following specific steps: a remote sensing image to be semantically segmented is input into a semantic segmentation model consisting of an encoder module and a decoder module to obtain the semantic segmentation result;
in the encoder module, feature extraction is first carried out by a backbone network, and the features output by the backbone network are used as the coarse feature representation;
the decoder module comprises a class center generation submodule (CCG) and a scene-aware attention submodule (SAA), and takes the coarse feature representation output by the encoder as input; when the decoder works, the coarse feature representation output by the encoder is first pre-classified to obtain a global class probability distribution; the coarse feature representation and the global class probability distribution are then input together into the class center generation submodule to obtain a global class center, which is cut along the spatial dimensions into several global class-center local blocks; meanwhile, the decoder module cuts the coarse feature representation and the global class probability distribution along the spatial dimensions in the same way, obtaining several pairs of equally sized coarse-feature local blocks and global class probability local blocks, and inputs each pair into the class center generation submodule to obtain a local class center; next, the cut coarse-feature local blocks, the cut global class-center local blocks, and the local class centers are input simultaneously into the scene-aware attention submodule to obtain enhanced feature representations, and the enhanced representations of all local blocks are stitched back according to their positions before cutting, restoring the same spatial dimensions as the coarse feature representation; finally, the coarse feature representation and the stitched enhanced feature representation are concatenated along the channel direction to obtain the output feature representation, which is up-sampled to yield the semantic segmentation result of the input remote sensing image;
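As a non-limiting illustration of this data flow, a minimal PyTorch-style sketch follows; the callables pre_cls, ccg, and saa, and the block-partition helpers, are illustrative assumptions rather than identifiers taken from the patent:

```python
import torch
import torch.nn.functional as F

def to_blocks(x, h=4, w=4):
    # cut (B, C, H', W') into non-overlapping h x w local blocks gathered
    # into the batch dimension: (B', C, h, w), B' = B * (H'/h) * (W'/w)
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // h, h, W // w, w).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, C, h, w)

def from_blocks(x, shape, h=4, w=4):
    # inverse of to_blocks: stitch local blocks back to (B, C, H', W')
    B, C, H, W = shape
    x = x.reshape(B, H // h, W // w, C, h, w).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(B, C, H, W)

def decoder_forward(R, pre_cls, ccg, saa):
    # R: coarse feature representation (B, C, H', W') from the backbone
    Y = pre_cls(R)                          # global class probability distribution
    S = ccg(R, Y)                           # global class center
    S_g, R_l, Y_l = to_blocks(S), to_blocks(R), to_blocks(Y)
    S_l = ccg(R_l, Y_l)                     # local class center per block
    R_a = from_blocks(saa(R_l, S_l, S_g), R.shape)  # stitched enhanced features
    out = torch.cat([R, R_a], dim=1)        # concatenate along channel direction
    return F.interpolate(out, scale_factor=4, mode="bilinear", align_corners=False)
```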
the class center generation submodule first performs an affinity operation on the input class probability distribution and feature representation to obtain class representation information; it then performs an Argmax operation on the class probability distribution to obtain a pre-classification mask; finally, according to the pre-classification mask, the class representation information is placed back at the corresponding pixel positions of the original coarse feature representation to obtain the class center;
the scene-aware attention submodule introduces context information embedding and position prior embedding into the attention operation to endow pixels with scene awareness; in this submodule, position prior information is first obtained from the coarse-feature local block through position prior embedding, while context information embedding constructs a context diagonal matrix from the coarse-feature local block and uses it to contextualize the block; the contextualized feature representation then aggregates the local class centers and is added element-wise to the position prior information to obtain an affinity matrix; finally, the global class centers are aggregated according to the affinity matrix to obtain the scene-aware enhanced feature representation.
Preferably, the context information embedding constructs a context diagonal matrix so that attention can be adjusted according to the given context. The specific method is as follows: the input coarse-feature local block first undergoes global average pooling and global max pooling in two parallel branches; each branch applies two feature mappings (the two branches sharing the same weights) to its pooling result to obtain a context vector; the two context vectors are added element-wise, passed through a Sigmoid function, and converted into the context diagonal matrix, which is used to contextualize the coarse features.
Preferably, the position prior embedding constructs relative position codes between pixels and embeds them into the coarse-feature local block, improving the sensitivity of attention to spatial distribution. The specific method is as follows: first compute the horizontal and vertical offsets of the relative positions between pixels; select the corresponding trainable vector from a coding bucket according to the offsets to obtain the relative position code; finally, aggregate the relative position codes with the input coarse-feature local block to obtain the position prior information.
Preferably, the specific calculation algorithm in the scene-aware attention submodule is as follows:
first, 1 × 1 convolutions are applied to the coarse-feature local block R_l, the local class center S_l, and the global class-center local block S_g input to the submodule, yielding the three matrices Q, K, and V, whose dimensions are reshaped to (B′ × hw × C), where B′ = B · (H′/h) · (W′/w), B denotes the batch size input to the semantic segmentation model, H′ and W′ are the height and width of the coarse feature representation, C is the number of feature channels of the coarse feature representation, and h and w are the height and width of each local block; a relative position code r is then constructed to embed position prior information into the matrix Q, where r has dimension (hw × hw × C) and its i-th (hw × C) matrix represents the relative position codes between pixel i and all other pixels; the i-th row of the matrix Q is multiplied by the transpose of the relative position code r_i of pixel i to obtain the position prior information of pixel i, p_i = Q_i · r_i^T; the position priors of all pixels are concatenated along the vertical direction to obtain the position prior information p of the local block, of dimension (B′ × hw × hw); meanwhile, a context diagonal matrix c of dimension (B′ × C × C) is constructed to embed context information into Q, and after Q is contextualized with c, the matrix K is aggregated to obtain the similarity matrix S = (Q · c) · K^T, of dimension (B′ × hw × hw); finally, the affinity matrix A = Softmax(S + p) is calculated from the position prior information p and the similarity matrix S, and the matrix V is aggregated according to A to obtain the scene-aware enhanced feature representation R_a = A · V, whose dimension is reshaped to (B′ × C × h × w).
Preferably, the backbone network is an HRNetv2-w32 model, and pre-training weights learned on the ImageNet data set are loaded.
Preferably, the pre-classification operation is implemented by two consecutive 1 × 1 convolutions.
Preferably, the decoder cuts the coarse feature representation and the global class center into local blocks of size 4 × 4.
Preferably, the semantic segmentation model is trained in advance by using labeled training data before being used for actual semantic segmentation.
Preferably, the training data are subjected to data enhancement, and the loss function adopted for training the semantic segmentation model is the cross-entropy loss.
Preferably, the remote sensing image is a high-resolution remote sensing image with a spatial resolution of 1m or less.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses an image semantic segmentation method based on class-level context aggregation of scene perception. The method is characterized in that the ground objects in the high-resolution remote sensing image have intrinsic spatial correlation, and scene perception is embedded in attention; and local-global class attention is introduced to the problems of complex background, large intra-class variance and the like. According to the invention, a local class center and a global class center are generated through a class center generation submodule, a scene perception attention submodule is designed, context information and position prior information are embedded for pixel feature representation, and meanwhile, the local class center is introduced as an intermediate perception element to indirectly associate with the global class center, so that not only is the spatial correlation of ground features in a remote sensing image utilized to strengthen context modeling, but also the problems of more background noise and large intra-class variance are solved. The method combines scene perception and class-level context aggregation, provides a new solution for the task of segmenting the high-resolution remote sensing image, and can improve the performance of semantic segmentation of the remote sensing image.
Drawings
FIG. 1 is a diagram of a SACANet model architecture;
FIG. 2 is a schematic diagram of a class-centric generation submodule;
FIG. 3 is a schematic illustration of embedded scene-aware local-global class attention;
FIG. 4 is a flow chart of a training and testing process of the SACANet model according to an embodiment of the present invention;
FIG. 5 shows visualized test results in an embodiment of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
The spatial attention mechanism is widely applied to semantic segmentation of remote sensing images owing to its ability to model long-range dependencies. However, many methods employing spatial attention aggregate context information using direct relationships between pixels in an image while ignoring the pixels' scene perception (i.e., perceiving the global context of the scene in which a pixel resides and its relative position). Considering that scene awareness facilitates context modeling that exploits the spatial correlation of ground objects, the invention designs a scene-aware attention submodule based on a spatial attention mechanism improved by embedding scene awareness. In addition, the invention proposes a local-global attention mechanism to address the problems that general attention mechanisms introduce excessive background noise and struggle with the large intra-class variance of remote sensing images. The core of the invention is a deep network for scene-aware class-level context aggregation, namely SACANet; notably, each module of the network is highly portable and can be applied to most networks.
The invention provides a remote sensing image semantic segmentation method based on scene-aware attention, which specifically comprises: inputting the image to be semantically segmented into the semantic segmentation model SACANet, consisting of an encoder module and a decoder module, to obtain the semantic segmentation result. The image in the invention is preferably a remote sensing image, and more preferably a high-resolution remote sensing image with a spatial resolution of 1 m or less.
The following describes the specific structure and principle of the aforementioned semantic segmentation model SACANet in detail.
In the encoder module of SACANet, feature extraction is performed by a backbone network, and the features output by the backbone are used as the coarse feature representation.
The decoder module of SACANet consists mainly of the class center generation submodule (CCG) and the scene-aware attention submodule (SAA), and takes the coarse feature representation output by the encoder as input. When the decoder works, the coarse feature representation obtained from the encoder is first pre-classified to obtain the global class probability distribution; the coarse feature representation and the global class probability distribution are then input together into the CCG to obtain the global class center, which is cut along the spatial dimensions into several global class-center local blocks. Similarly, the decoder cuts the coarse feature representation and the global class probability distribution along the spatial dimensions into several pairs of equally sized coarse-feature local blocks and global class probability local blocks, and inputs each corresponding pair into the CCG to obtain the local class centers. The cut coarse-feature local blocks, the cut global class-center local blocks, and the local class centers are then input simultaneously into the scene-aware attention submodule to obtain enhanced feature representations, and the enhanced representations of all local blocks are stitched back according to their positions before cutting, restoring the same spatial dimensions as the coarse feature representation. Finally, the coarse feature representation and the stitched enhanced feature representation are concatenated along the channel direction to obtain the output feature representation, which is up-sampled to the size of the input remote sensing image to yield its semantic segmentation result.
The specific structure of the SACANet model is described in detail below. FIG. 1 shows the overall architecture of SACANet, comprising an encoder module and a decoder module. The encoder module extracts semantic features, while the decoder enhances the semantic features obtained from the encoder through scene-aware local-global class context modeling and restores the spatial resolution of the image; it includes the CCG and the SAA.
Specifically, the input to the encoder module is the image to be segmented, of dimension (B × 3 × H × W), where B is the input batch size (determined by the number of samples per batch during training and settable to 1 in the prediction stage) and H and W are the height and width of the original image. In this embodiment, the HRNetv2-w32 model is used as the backbone network, loaded with pre-training weights learned on the ImageNet dataset. The image to be segmented is input into the backbone for feature extraction, yielding a coarse feature representation R of dimension (B × C × H′ × W′), where C is the number of feature channels and H′ = H/4 and W′ = W/4.

When the decoder works, the coarse feature representation R obtained from the encoder is first pre-classified by two consecutive 1 × 1 convolutions to obtain the global class probability distribution Y, of dimension (B × K × H′ × W′), where K is the number of classes. The coarse feature representation R and the global class probability distribution Y are then input into the CCG to obtain the global class center S, of dimension (B × C × H′ × W′), which is cut along the spatial dimensions into local blocks S_g, where h and w are the height and width of each local block. Similarly, the decoder cuts R and Y along the spatial dimensions into several equally sized local blocks R_l and Y_l and inputs each positionally corresponding pair (R_l, Y_l) into the CCG to obtain the local class centers S_l; here R_l, S_g, and S_l have dimension (B′ × C × h × w) and Y_l has dimension (B′ × K × h × w), with B′ = B · (H′/h) · (W′/w). In this embodiment, the decoder may set the local block size for cutting both the coarse feature representation and the global class center to 4 × 4. The cut coarse-feature local blocks R_l, the cut global class-center local blocks S_g, and the local class centers S_l are then input simultaneously into the SAA to obtain the enhanced feature representation R_a, which is restored to the original spatial dimension (B × C × H′ × W′). Finally, the coarse feature representation and the enhanced feature representation are concatenated to obtain the output feature representation, which is up-sampled four times to yield the semantic segmentation result of the input remote sensing image.
The class center generation submodule (CCG) of the invention is applied twice in the decoder, generating the global class center S and the local class centers S_l respectively; the two uses can share (multiplex) the same CCG module. The input to the submodule is a global or local feature representation together with its corresponding class probability distribution, and the output is the corresponding global or local class center. Whether global or local inputs are used, the flow for producing class centers inside the module is the same: an affinity operation is first performed on the input class probability distribution and feature representation to obtain class representation information; an Argmax operation on the class probability distribution then yields a pre-classification mask; finally, the class representation information is placed back at the corresponding pixel positions of the original coarse feature representation according to the mask to obtain the class center. If the inputs are a global feature representation and class probability distribution, the output is the global class center; if the inputs are local, the output is a local class center.
In this embodiment, the main purpose of the class center generation submodule CCG is to replace pixel feature representations contaminated by heavy background noise with class representations carrying richer semantic information. As shown in FIG. 2, taking the generation of the global class center as an example, the specific implementation of the CCG is as follows. The input global class probability distribution Y and coarse feature representation R are reshaped to dimensions (B × K × N) and (B × C × N), where N = H′ × W′. An affinity operation is then performed along the channel direction between Y and the transpose of R to obtain the global class representation information C_g = Y · R^T, of dimension (B × K × C); each of the K C-dimensional vectors in C_g is the feature representation of the corresponding category. To obtain the class representation to which each pixel belongs, a pre-classification mask E = Argmax(Y) is computed along the channel direction, of dimension (B × 1 × H′ × W′); the value at each pixel of the mask is the index of the class representation to which that pixel belongs. Finally, the category feature representations are placed back at each pixel position according to the mask, yielding the global class center S of dimension (B × C × H′ × W′). Local class centers are generated by the same procedure, with only the inputs, outputs, and variable dimensions changed: the global class probability distribution Y and coarse feature representation R are replaced by Y_l and R_l, the output becomes the local class center S_l, and the dimensions of the other variables change correspondingly.
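As a non-limiting illustration, the CCG flow just described can be sketched as follows, under the reading that the pre-classification mask is the Argmax of the class probability distribution (as the stated dimensions imply) and with any normalization of the affinity omitted:

```python
import torch

def class_center_generation(R, Y):
    # R: features (B, C, H, W); Y: class probability distribution (B, K, H, W).
    # Returns the class center S of shape (B, C, H, W).
    B, C, H, W = R.shape
    K = Y.shape[1]
    Rf = R.reshape(B, C, -1)                    # (B, C, N), N = H * W
    Yf = Y.reshape(B, K, -1)                    # (B, K, N)
    Cg = torch.bmm(Yf, Rf.transpose(1, 2))      # class representations C_g, (B, K, C)
    E = Yf.argmax(dim=1)                        # pre-classification mask, (B, N)
    idx = E.unsqueeze(-1).expand(B, H * W, C)   # per-pixel class index
    S = torch.gather(Cg, 1, idx)                # class vector placed at each pixel
    return S.transpose(1, 2).reshape(B, C, H, W)
```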
The scene-aware attention submodule (SAA) of the invention introduces context information embedding and position prior embedding into the conventional attention operation to endow pixels with scene awareness. Furthermore, unlike general self-attention operations, this module introduces the local class center S_l to indirectly associate the pixel feature representation R_l with the global class center S_g, alleviating the complex backgrounds and large intra-class variance of remote sensing images. In this submodule, position prior information is first obtained from the coarse-feature local block through position prior embedding, while context information embedding constructs a context diagonal matrix from the block and uses it to contextualize the block; the contextualized feature representation then aggregates the local class centers and is added element-wise to the position prior information to obtain the affinity matrix; finally, the global class centers are aggregated according to the affinity matrix to obtain the scene-aware enhanced feature representation.
In this embodiment, the scene-aware attention submodule SAA aims to embed context information and position prior information into the local feature representation R_l while introducing the local class center S_l as an intermediate perceptual element to indirectly associate the global class center S_g, yielding the scene-aware enhanced feature representation R_a. As shown in FIG. 3, the SAA uses an improved attention operation, as follows. First, 1 × 1 convolutions are applied to R_l, S_l, and S_g to obtain the matrices Q, K, and V, whose dimensions are reshaped to (B′ × hw × C), where B′ = B · (H′/h) · (W′/w). Position prior information is then embedded into Q by constructing a relative position code r of dimension (hw × hw × C), in which the i-th (hw × C) matrix represents the relative position codes between pixel i and all other pixels. The i-th row of Q is multiplied by the transpose of the corresponding code r_i to obtain the position prior of pixel i, i.e., p_i = Q_i · r_i^T; the per-pixel position priors are concatenated along the vertical direction to obtain the position prior information p of the local block, of dimension (B′ × hw × hw). Embedding context information into Q likewise requires constructing a context diagonal matrix c, of dimension (B′ × C × C). After Q is contextualized with c, K is aggregated to obtain the similarity matrix S = (Q · c) · K^T, of dimension (B′ × hw × hw). Unlike general attention, the affinity matrix in scene-aware attention accounts simultaneously for the relative positions between pixels and the similarity between pixel features, i.e., A = Softmax(S + p). Finally, V is aggregated according to A to obtain the scene-aware feature representation R_a = A · V, whose dimension is reshaped to (B′ × C × h × w). It should be noted that reshaping can be performed with functions such as reshape in the PyTorch framework.
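As a non-limiting illustration, the SAA computation can be sketched as follows; the 1 × 1 convolutions and the position-prior and context-embedding modules (sketched after this section) are passed in as assumptions:

```python
import torch

def scene_aware_attention(R_l, S_l, S_g, conv_q, conv_k, conv_v, rel_pos, ctx_embed):
    # R_l, S_l, S_g: (B', C, h, w) local blocks; conv_* are 1x1 convolutions;
    # rel_pos and ctx_embed are the position-prior and context embeddings.
    Bp, C, h, w = R_l.shape
    Q = conv_q(R_l).reshape(Bp, C, h * w).transpose(1, 2)   # queries, (B', hw, C)
    K = conv_k(S_l).reshape(Bp, C, h * w).transpose(1, 2)   # keys from local centers
    V = conv_v(S_g).reshape(Bp, C, h * w).transpose(1, 2)   # values from global centers
    p = rel_pos(Q)                               # position prior p, (B', hw, hw)
    c = ctx_embed(R_l)                           # context diagonal matrix, (B', C, C)
    S = torch.bmm(torch.bmm(Q, c), K.transpose(1, 2))   # similarity S = (Qc)K^T
    A = torch.softmax(S + p, dim=-1)             # affinity matrix A = Softmax(S + p)
    R_a = torch.bmm(A, V)                        # aggregate global class centers
    return R_a.transpose(1, 2).reshape(Bp, C, h, w)
```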
The position prior embedding used in this embodiment enables pixels to perceive the internal distribution patterns of ground objects in the remote sensing image. Its key is to construct relative position codes between pixels and embed them into the coarse-feature local block, improving the sensitivity of attention to spatial distribution. The specific method is as follows: first compute the horizontal and vertical offsets of the relative positions between pixels; select the corresponding trainable vector from a coding bucket according to the offsets to obtain the relative position code; finally, aggregate the relative position codes with the input coarse-feature local block to obtain the position prior information. The invention considers relative position in both the horizontal and vertical directions. Taking pixels i and j as an example, the relative position code between them is defined as r_ij = P[I_x(i, j), I_y(i, j)], where P is a coding bucket storing a set of trainable vectors with dimension ((2ξ + 1) × (2ξ + 1) × C), and I_x(i, j) = g(x_i − x_j) and I_y(i, j) = g(y_i − y_j) select the corresponding code vector from the bucket according to the offsets in the two directions. Meanwhile, to reduce the parameter count and computational cost required for semantic segmentation of high-resolution remote sensing images, the invention limits the offsets to a maximum distance ξ, i.e., maps them into a finite set using the clipping function g(x) = max(−ξ, min(x, ξ)). Accordingly, the relative position code r finally obtained has dimension (hw × hw × C), where the vector r_ij in row i and column j represents the relative position code between pixel i and pixel j. Aggregating the matrix Q with the relative position code r_i forms the position prior information p_i; the specific calculation formula is as given in the preceding paragraphs and is not repeated.
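A non-limiting sketch of this position prior embedding follows; the zero initialization of the bucket and the default ξ (chosen here as the maximum offset within a 4 × 4 block) are assumptions:

```python
import torch
import torch.nn as nn

class RelativePositionPrior(nn.Module):
    # Trainable coding bucket of shape (2*xi+1, 2*xi+1, C), indexed by
    # clipped horizontal/vertical offsets between pixels of an h x w block.
    def __init__(self, C, h=4, w=4, xi=3):
        super().__init__()
        self.bucket = nn.Parameter(torch.zeros(2 * xi + 1, 2 * xi + 1, C))
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (hw, 2)
        off = coords[:, None, :] - coords[None, :, :]                # pairwise offsets
        off = off.clamp(-xi, xi) + xi   # g(x) = max(-xi, min(x, xi)), shifted >= 0
        self.register_buffer("iy", off[..., 0])
        self.register_buffer("ix", off[..., 1])

    def forward(self, Q):                   # Q: (B', hw, C)
        r = self.bucket[self.iy, self.ix]   # relative position codes r, (hw, hw, C)
        # p[b, i, j] = Q[b, i] . r[i, j], i.e. p_i = Q_i r_i^T row by row
        return torch.einsum("bic,ijc->bij", Q, r)
```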
The context information embedding used in this embodiment enables pixels to perceive the pairwise relationships between ground objects under different scenes in the remote sensing image; its key is the construction of the context diagonal matrix c. The specific method is as follows: the input coarse-feature local block first undergoes global average pooling and global max pooling in two parallel branches; each branch applies two feature mappings (sharing the same weights across branches) to its pooling result to obtain a context vector; the two context vectors are added element-wise, passed through a Sigmoid function, and converted into the context diagonal matrix, thereby contextualizing the coarse features. In this embodiment, the context diagonal matrix is computed as c = diag(σ(W_1(W_0(AvgPool(Q))) + W_1(W_0(MaxPool(Q))))), where σ is the Sigmoid function, W_0 and W_1 are the two (shared) feature mappings, and diag(·) maps a one-dimensional vector onto the corresponding diagonal matrix of dimension (B′ × C × C).
It should be noted that before the semantic segmentation model SACANet is used for actual semantic segmentation, it is trained in advance with labeled training data. To enlarge the training samples, the training data may be data-enhanced. The loss function adopted for training the semantic segmentation model is the cross-entropy loss; for the specific training procedure, reference may be made to existing training schemes for semantic segmentation models, which are not repeated here.
The remote sensing image semantic segmentation method based on scene-aware attention described above is applied to a specific embodiment below to demonstrate the technical effects it can achieve.
Examples
The semantic segmentation model SACANet used in this embodiment has the specific network structure described above and is not repeated. As shown in FIG. 4, the overall process of semantic segmentation of remote sensing images can be divided into three stages: data preprocessing, model training, and image prediction.
1. Data preprocessing stage
The obtained original remote sensing images (this embodiment takes the LoveDA dataset as an example) are preprocessed: the images are cut into 512 × 512 tiles, and the cut images are then subjected to data enhancement operations such as random rotation and flipping.
2. Model training
Step 1: construct the training set data and divide the training dataset into batches of a fixed batch size, N batches in total.
Step 2: batches of training samples with index i, where i ∈ {0, 1, …, N}, are selected sequentially from the training dataset, and the semantic segmentation model SACANet is trained on each batch in turn. During training, the cross-entropy loss l_i of each training sample is computed, and the network parameters of the entire model are adjusted based on the total loss over all training samples in the batch, L = Σ_i l_i, until every batch of the training dataset has participated in model training. Once the specified number of iterations is reached, the model converges and training ends.
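As a non-limiting illustration of steps 1 and 2, a minimal training-loop sketch follows; the optimizer choice, learning rate, and epoch count are assumptions not specified by the patent:

```python
import torch
import torch.nn as nn

def train_sacanet(model, loader, epochs=50, lr=1e-3, device="cuda"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()                 # per-pixel cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, labels in loader:                 # one batch with index i
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                    # (B, K, H, W)
            loss = criterion(logits, labels)          # total loss of the batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```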
3. Image prediction
The test-set images are input directly into the trained semantic segmentation model SACANet, which predicts a class probability vector for each pixel; after an activation function such as Sigmoid, the class with the highest probability is selected as the final output, thereby realizing semantic segmentation.
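A corresponding inference sketch (with an Argmax over the predicted logits selecting the highest-probability class) might look as follows; the helper names are assumptions:

```python
import torch

@torch.no_grad()
def predict(model, image, device="cuda"):
    # image: (3, H, W) test tile; returns an (H, W) map of class indices
    model.to(device).eval()
    logits = model(image.unsqueeze(0).to(device))     # (1, K, H, W)
    return logits.argmax(dim=1).squeeze(0)            # highest-probability class
```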
In this embodiment, the visualized test results are shown in FIG. 5, and the quantitative test results are shown in Table 1:

TABLE 1 Quantitative test results on LoveDA (per-class IoU and mIoU, %)

Dataset | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU
LoveDA  |       47.6 |     59.1 | 58.4 |  80.5 |   17.8 |    46.7 |        67.1 | 53.9
As can be seen from FIG. 5 and Table 1, the semantic segmentation model SACANet of the invention handles remote sensing image segmentation well: relying on the improved attention module with embedded scene awareness, it fully exploits the spatial correlation of ground objects in remote sensing images, and by introducing local-global attention it effectively alleviates complex background noise interference and large intra-class variance, thereby improving remote sensing image segmentation performance and providing a new solution for applying context modeling to remote sensing image segmentation.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A remote sensing image semantic segmentation method based on scene-aware class attention, characterized by comprising: inputting a remote sensing image to be semantically segmented into a semantic segmentation model consisting of an encoder module and a decoder module to obtain a semantic segmentation result;
in the encoder module, feature extraction is first carried out by a backbone network, and the features output by the backbone network are used as the coarse feature representation;
the decoder module comprises a class center generation submodule and a scene-aware attention submodule, and takes the coarse feature representation output by the encoder as input; when the decoder works, the coarse feature representation output by the encoder is first pre-classified to obtain a global class probability distribution; the coarse feature representation and the global class probability distribution are then input together into the class center generation submodule to obtain a global class center, which is cut along the spatial dimensions into several global class-center local blocks; meanwhile, the decoder module cuts the coarse feature representation and the global class probability distribution along the spatial dimensions in the same way, obtaining several pairs of equally sized coarse-feature local blocks and global class probability local blocks, and inputs each pair into the class center generation submodule to obtain a local class center; next, the cut coarse-feature local blocks, the cut global class-center local blocks, and the local class centers are input simultaneously into the scene-aware attention submodule to obtain enhanced feature representations, and the enhanced representations of all local blocks are stitched back according to their positions before cutting, restoring the same spatial dimensions as the coarse feature representation; finally, the coarse feature representation and the stitched enhanced feature representation are concatenated along the channel direction to obtain the output feature representation, which is up-sampled to yield the semantic segmentation result of the input remote sensing image;
the class center generation submodule first performs an affinity operation on the input class probability distribution and feature representation to obtain class representation information; it then performs an Argmax operation on the class probability distribution to obtain a pre-classification mask; finally, according to the pre-classification mask, the class representation information is placed back at the corresponding pixel positions of the original coarse feature representation to obtain the class center;
the scene-aware attention submodule introduces context information embedding and position prior embedding into the attention operation to endow pixels with scene awareness; in this submodule, position prior information is first obtained from the coarse-feature local block through position prior embedding, while context information embedding constructs a context diagonal matrix from the coarse-feature local block and uses it to contextualize the block; the contextualized feature representation then aggregates the local class centers and is added element-wise to the position prior information to obtain an affinity matrix; finally, the global class centers are aggregated according to the affinity matrix to obtain the scene-aware enhanced feature representation.
2. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the context information embedding constructs a context diagonal matrix so that attention can be adjusted according to the given context, the specific method being as follows: the input coarse-feature local block first undergoes global average pooling and global max pooling in two parallel branches; each branch applies two feature mappings (the two branches sharing the same weights) to its pooling result to obtain a context vector; the two context vectors are added element-wise, passed through a Sigmoid function, and converted into the context diagonal matrix, thereby contextualizing the coarse features.
3. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the position prior embedding constructs relative position codes between pixels and embeds them into the coarse-feature local blocks, improving the sensitivity of attention to spatial distribution, the specific method being as follows: first compute the horizontal and vertical offsets of the relative positions between pixels; select the corresponding trainable vector from a coding bucket according to the offsets to obtain the relative position code; finally, aggregate the relative position codes with the input coarse-feature local block to obtain the position prior information.
4. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the specific calculation algorithm in the scene-aware attention submodule is as follows:
first, 1 × 1 convolutions are applied to the coarse-feature local block R_l, the local class center S_l, and the global class-center local block S_g input to the submodule, yielding the three matrices Q, K, and V, whose dimensions are reshaped to (B′ × hw × C), where B′ = B · (H′/h) · (W′/w), B denotes the batch size input to the semantic segmentation model, H′ and W′ are the height and width of the coarse feature representation, C is the number of feature channels of the coarse feature representation, and h and w are the height and width of each local block; a relative position code r is then constructed to embed position prior information into the matrix Q, where r has dimension (hw × hw × C) and its i-th (hw × C) matrix represents the relative position codes between pixel i and all other pixels; the i-th row of the matrix Q is multiplied by the transpose of the relative position code r_i of pixel i to obtain the position prior information of pixel i, p_i = Q_i · r_i^T; the position priors of all pixels are concatenated along the vertical direction to obtain the position prior information p of the local block, of dimension (B′ × hw × hw); meanwhile, a context diagonal matrix c of dimension (B′ × C × C) is constructed to embed context information into Q, and after Q is contextualized with c, the matrix K is aggregated to obtain the similarity matrix S = (Q · c) · K^T, of dimension (B′ × hw × hw); finally, the affinity matrix A = Softmax(S + p) is calculated from the position prior information p and the similarity matrix S, and the matrix V is aggregated according to A to obtain the scene-aware enhanced feature representation R_a = A · V, whose dimension is reshaped to (B′ × C × h × w).
5. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the backbone network is the HRNetv2-w32 model, loaded with pre-training weights learned on the ImageNet dataset.
6. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the pre-classification operation is implemented by two consecutive 1 × 1 convolutions.
7. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the decoder cuts the coarse feature representation and the global class center into local blocks of size 4 × 4.
8. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the semantic segmentation model is trained in advance with labeled training data before being used for actual semantic segmentation.
9. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 7, wherein the training data are subjected to data enhancement, and the loss function adopted for training the semantic segmentation model is the cross-entropy loss.
10. The remote sensing image semantic segmentation method based on scene-aware class attention according to claim 1, wherein the remote sensing image is a high-resolution remote sensing image with a spatial resolution of 1 m or less.
CN202310061100.4A 2023-01-21 2023-01-21 Scene perception attention-based remote sensing image semantic segmentation method Pending CN115965789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310061100.4A CN115965789A (en) 2023-01-21 2023-01-21 Scene perception attention-based remote sensing image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310061100.4A CN115965789A (en) 2023-01-21 2023-01-21 Scene perception attention-based remote sensing image semantic segmentation method

Publications (1)

Publication Number Publication Date
CN115965789A true CN115965789A (en) 2023-04-14

Family

ID=87354191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310061100.4A Pending CN115965789A (en) 2023-01-21 2023-01-21 Scene perception attention-based remote sensing image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN115965789A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311105A (en) * 2023-05-15 2023-06-23 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN116311105B (en) * 2023-05-15 2023-09-19 山东交通学院 Vehicle re-identification method based on inter-sample context guidance network
CN117814805A (en) * 2024-03-05 2024-04-05 自贡市第一人民医院 Intelligent processing method for data of clinical care equipment
CN117814805B (en) * 2024-03-05 2024-06-11 自贡市第一人民医院 Intelligent processing method for data of clinical care equipment

Similar Documents

Publication Publication Date Title
CN112991354B (en) High-resolution remote sensing image semantic segmentation method based on deep learning
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN114187450B (en) Remote sensing image semantic segmentation method based on deep learning
CN110853057B (en) Aerial image segmentation method based on global and multi-scale full-convolution network
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112560966B (en) Polarized SAR image classification method, medium and equipment based on scattering map convolution network
CN115222998B (en) Image classification method
CN110738663A (en) Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method
CN104376051A (en) Random structure conformal Hash information retrieval method
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN115984701A (en) Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN114241218A (en) Target significance detection method based on step-by-step attention mechanism
CN115424059A (en) Remote sensing land use classification method based on pixel level comparison learning
CN115937693A (en) Road identification method and system based on remote sensing image
CN117456182A (en) Multi-mode fusion remote sensing image semantic segmentation method based on deep learning
CN118072020A (en) DINO optimization-based weak supervision remote sensing image semantic segmentation method
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN117523333A (en) Attention mechanism-based earth surface coverage classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination