CN114359297A - Attention pyramid-based multi-resolution semantic segmentation method and device

Attention pyramid-based multi-resolution semantic segmentation method and device

Info

Publication number: CN114359297A
Application number: CN202210014091.9A
Authority: CN (China)
Prior art keywords: resolution, layer, decoder, feature map, convolution
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 冯结青, 姜丰
Assignee: Zhejiang University (ZJU)
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-15

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-resolution semantic segmentation method based on an attention pyramid, comprising the following steps: constructing a deep convolutional neural network in which a cascaded atrous spatial pyramid module fuses the outputs of dilated convolutions with different dilation rates, solving the gridding (checkerboard) effect of dilated convolution; extracting information at multiple scales with a feature pyramid attention module that provides accurate, dense pixel-level attention, overcoming the inability of existing attention mechanisms to extract multi-scale information and to provide pixel-level attention; and maintaining convolution streams of feature maps at different resolutions in the decoder through a multi-resolution fusion decoder structure that repeatedly exchanges information among the multi-resolution feature maps, addressing the insufficient use of context information in common decoder structures. The method offers stronger pixel sensitivity, obtains richer feature maps, achieves a better receptive field, and eliminates the gridding effect of dilated convolution.

Description

Attention pyramid-based multi-resolution semantic segmentation method and device
Technical Field
The invention belongs to the field of image semantic segmentation, and particularly relates to a multi-resolution semantic segmentation method and device based on an attention pyramid.
Background
Since the 1960s, image semantic segmentation has advanced rapidly over decades of development, with substantial contributions from researchers worldwide. Image semantic segmentation is defined as assigning to each pixel in an image a predefined label representing its semantic category, so that the whole image is partitioned into several non-overlapping sub-regions, each representing one semantic category. Mathematically, for an image pixel set I, I is divided into connected non-empty subsets I_1, I_2, …, I_N such that

∪_{i=1}^{N} I_i = I, I_i ∩ I_j = ∅ (i ≠ j),

and there exists a predicate P(·) judging the consistency of a set, such that

P(I_i) = True, P(I_i ∪ I_j) = False (i ≠ j)
Before deep learning was applied to image semantic segmentation, traditional unsupervised methods typically extracted low-level features such as shape, color and texture in a bottom-up manner, clustered similar pixels in some feature space to generate candidate regions, scored the candidate regions by how well they matched the features of the objects to be segmented, ranked them by score, selected a number of top-scoring valid regions, and finally computed the probability that each valid region belongs to a specific category. Because such methods extract only low-level features, lack global semantic information and depend heavily on prior knowledge, they segment complex scenes poorly and are insufficiently robust.
After the advent of deep learning, extracting features with convolutional neural networks became an effective approach. Semantic segmentation algorithms of this period generated candidate regions from an image, extracted high-level semantic features from them with a convolutional neural network, and used a classifier to estimate the probability that each region belongs to a semantic category. Although such algorithms extract high-level semantic features, obtaining candidate regions requires continuously sliding a window, which increases computation; and since a candidate region is usually smaller than the whole image, a network taking it as input can only extract local features, which reduces classification accuracy.
The fully convolutional network converts the final fully connected layers of a convolutional neural network into convolutional layers, so that the network outputs a feature map rather than a vector of class probabilities, accepts input of any size, and enlarges its receptive field through multiple pooling operations to extract global-level features. From then on, image semantic segmentation algorithms based on fully convolutional networks became mainstream. Subsequently, techniques such as dilated convolution, encoder-decoder architectures, multi-scale feature extraction and attention mechanisms greatly improved segmentation accuracy. However, current semantic segmentation networks still have problems that limit their performance.
Although existing deep-learning-based semantic segmentation networks have improved accuracy, several problems remain. Dilated convolution preserves the receptive field without reducing feature-map resolution, but suffers from the gridding (checkerboard) effect, which degrades segmentation performance. Although attention models have been widely studied, they usually extract a global receptive field through a global pooling operation and provide attention only at the channel level, the spatial level, or both; the inability to extract multi-scale information and to provide pixel-level attention limits the development of attention mechanisms. In the common encoder-decoder architecture, the decoder connects representations of different resolutions in series, and the input of each sub-module comes only from the output of the previous decoder module and the output of the corresponding encoder module, so context information is insufficiently used. There is therefore still room to improve current semantic segmentation networks.
Disclosure of Invention
The invention provides a multi-resolution semantic segmentation method based on an attention pyramid that has stronger pixel sensitivity, obtains richer feature maps, achieves a better receptive field, and solves the gridding effect of dilated convolution.
A multi-resolution semantic segmentation method based on an attention pyramid comprises the following steps:
(1) constructing a semantic segmentation training set;
(2) constructing a deep convolutional neural network, wherein the deep convolutional neural network comprises an encoder, a feature pyramid attention module, a cascaded atrous spatial pyramid module and a multi-resolution fusion decoder; inputting an initial semantic image into the encoder to obtain feature maps at a plurality of resolutions; inputting the minimum-resolution feature map into the cascaded atrous spatial pyramid module, which comprises a plurality of dilated convolution layers combined in cascade with the dilation rate increasing layer by layer; concatenating the feature map output by each dilated convolution layer with the connected feature map output by the previous dilated convolution layer, then applying a convolution to adjust the number of filters and extract semantic feature information, obtaining a first sub-fusion feature map; concatenating the first sub-fusion feature maps to obtain a first fused feature map; concatenating the first fused feature map, the minimum-resolution feature map and the global pooled feature map to obtain a first connected feature map; and inputting the first connected feature map into a first decoder;
inputting each of the other resolution feature maps into a plurality of average pooling layers of the feature pyramid attention module to obtain a first multi-resolution set of attention feature maps, then upsampling and convolving each first attention feature map to obtain a second fused feature map; inputting each of the other resolution feature maps into a plurality of max pooling layers of the feature pyramid attention module to obtain a second multi-resolution set of attention feature maps, then upsampling and convolving each second attention feature map to obtain a third fused feature map; concatenating the second and third fused feature maps and applying a convolution to obtain pixel-level weights for the other resolution feature maps, multiplying the pixel-level weights element-wise with the other resolution feature maps, and inputting the element-wise product into a second decoder of the corresponding resolution;
the multi-resolution fusion decoder comprises a plurality of resolution decoding layers, each comprising a first decoder at the current resolution and a set of second decoders at resolutions higher than the current resolution; the final semantic segmentation image, with the same resolution as the initial semantic image, is obtained through layer-by-layer convolution, upsampling and downsampling across the decoding layers;
(3) training the deep convolutional neural network on the semantic segmentation training set and optimizing its parameters to obtain a multi-resolution semantic segmentation model;
(4) in application, inputting a semantic image into the multi-resolution semantic segmentation model to obtain a semantic segmentation image.
Multi-resolution fusion decoder structure: in commonly used encoder-decoder architectures, the encoder and decoder are typically symmetric. The encoder is responsible for extracting features; after each sub-module, a max pooling layer or a stride-2 convolution reduces the resolution to enlarge the receptive field and reduce subsequent computation. The decoder generally has the same number of sub-modules as the encoder; before each sub-module, the output of the previous sub-module is upsampled once and fused with the output of the corresponding encoder sub-module to restore resolution. However, this decoder connects representations of different resolutions in series, and each sub-module uses only two feature sources (the output of the previous decoder module and the output of the corresponding encoder module), so context information is insufficiently used. The invention therefore provides a multi-resolution fusion decoder structure in which, during decoding, feature maps of different resolutions are connected in parallel and the representations at multiple resolutions are fused by repeatedly interleaving upsampling, downsampling and convolution operations, yielding richer feature-map representations and stronger pixel sensitivity.
Cascaded atrous spatial pyramid module: in the encoder, feature maps must be repeatedly downsampled to enlarge the receptive field and reduce computation, which lowers spatial resolution. Dilated convolution preserves the receptive field without reducing feature-map resolution, but because its computation pattern resembles a checkerboard, adjacent pixels in its output are computed from disjoint input subsets, weakening the dependency between neighboring pixels and losing local information. Moreover, as the dilation rate grows, the sampled input signal becomes increasingly sparse, so information gathered by long-range convolution lacks correlation, making dilated convolution poorly suited to segmenting small objects. This is the gridding (checkerboard) effect of dilated convolution. To solve it, the invention provides a cascaded atrous spatial pyramid module that arranges the dilated convolution outputs from small to large dilation rate and connects them in cascade, so that each larger-rate convolution result is fused with the smaller-rate results; this preserves the receptive-field advantage of dilated convolution while eliminating the gridding effect.
Feature pyramid attention module: attention has been studied extensively. For example, SENet provides channel-level attention, letting the network adaptively adjust inter-channel feature relationships through feature recalibration; GENet offers a more general formulation than SENet and exploits contextual information between features through spatial-level attention; CBAM is an integrated unit providing both channel-level and spatial-level attention. Although fusing such attention modules with deep networks improves feature extraction to some extent, these mechanisms share common problems: they extract global information through global pooling and cannot extract multi-scale information, and the attention they provide is at the coarser channel or spatial level rather than the more precise pixel level. Inspired by the atrous spatial pyramid, the invention provides a feature pyramid attention module to address these deficiencies.
Combining the dilated convolution layers in cascade with the dilation rate increasing layer by layer comprises:
setting the dilation rate of the first dilated convolution layer to 1, arranging the dilated convolution layers from small to large dilation rate, and inputting the minimum-resolution feature map into each dilated convolution layer to obtain the feature map of each dilated convolution layer.
Concatenating the feature map output by each dilated convolution layer with the connected feature map output by the previous dilated convolution layer and then convolving comprises:
the connected feature map output by the previous dilated convolution layer fuses the feature maps of all layers whose dilation rate is smaller than that of the current layer, and layer-by-layer concatenation yields a first sub-fusion feature map for each dilation rate.
The minimum-resolution feature map is also input into a global pooling layer to obtain a global pooled feature map, which captures image-level features.
Inputting each of the other resolution feature maps into the plurality of average pooling layers of the feature pyramid attention module to obtain the first multi-resolution set of attention feature maps comprises:
the average pooling layers comprise a global pooling layer and several average pooling layers of different sizes, arranged from small to large output resolution; each of the other resolution feature maps is downsampled by these pooling layers to obtain a first set of attention feature maps arranged from small to large resolution.
Upsampling and convolving each first attention feature map to obtain the second fused feature map comprises:
starting from the minimum-resolution first attention feature map and upsampling layer by layer, the upsampled feature map from the previous layer is added element-wise to the first attention feature map of the current layer to obtain the current layer's upsampled feature map; layer-by-layer upsampling finally yields an upsampled feature map with the same resolution as the input feature map, which is convolved to obtain the second fused feature map.
Concatenating the second and third fused feature maps and convolving to obtain the pixel-level weights of the other resolution feature maps comprises:
concatenating the second fused feature map with the third fused feature map, applying a 1x1 convolution and a sigmoid activation in sequence to obtain pixel-level weights, and applying the pixel-level weights to the other resolution feature maps by element-wise multiplication.
Obtaining the final semantic segmentation image with the same resolution as the initial semantic image through layer-by-layer convolution, upsampling and downsampling across the decoding layers comprises:
fusing feature maps of different sizes through the downsampling, upsampling and convolution operations of the first decoder and set of second decoders of the current resolution decoder layer and the next resolution decoder layer, and obtaining the final semantic segmentation image at the resolution of the initial semantic image through layer-by-layer upsampling.
Fusing feature maps of different sizes through the downsampling, upsampling and convolution operations of the first decoder and set of second decoders of the current resolution decoder layer and the next resolution decoder layer comprises:
applying downsampling, upsampling or convolution between a decoder of the current resolution decoder layer and a decoder of the next resolution decoder layer according to their resolutions:
when the two decoders have the same resolution, a convolution operation is applied;
when the resolution of the decoder of the current resolution decoder layer is higher than that of the decoder of the next resolution decoder layer, upsampling is performed;
when the resolution of the decoder of the current resolution decoder layer is lower than that of the decoder of the next resolution decoder layer, downsampling is performed;
finally, the final semantic segmentation image with the same resolution as the initial semantic image is obtained through convolution and upsampling.
An attention pyramid-based multi-resolution semantic segmentation apparatus comprises a computer memory, a computer processor, and a computer program stored in and executable on the computer memory, wherein the multi-resolution semantic segmentation model described above is employed in the computer memory;
when executing the computer program, the computer processor performs the following step: inputting a semantic image into the multi-resolution semantic segmentation model and outputting the semantic segmentation image through computation.
Compared with the prior art, the invention has the following beneficial effects:
The attention pyramid-based multi-resolution fusion network of the invention improves the accuracy of semantic segmentation. The multi-resolution fusion decoder structure repeatedly and alternately fuses multi-resolution representations, yielding feature representations with stronger pixel sensitivity; the cascaded atrous spatial pyramid module solves the gridding effect of dilated convolution; and the feature pyramid attention module extracts multi-scale information to provide pixel-level attention for the feature map produced by each encoder sub-module, recalibrating features to obtain more accurate representations.
Drawings
FIG. 1 is an overall structure diagram of the attention pyramid-based multi-resolution semantic segmentation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of depthwise separable convolution according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the cascaded atrous spatial pyramid module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the feature-extraction procedure of the feature pyramid attention module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the feature pyramid attention module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to embodiments and the accompanying drawings.
The invention provides a multi-resolution semantic segmentation method based on an attention pyramid; as shown in FIG. 1, the specific steps are as follows:
Construct a semantic segmentation training set. The attention pyramid-based multi-resolution fusion network uses a pre-trained ResNet as its backbone, with the fully connected layers removed to form a fully convolutional network; the first ResNet module downsamples by a factor of 4 and each subsequent module by a factor of 2, so the final output feature map has 1/32 the resolution of the original image. A sketch of such an encoder follows.
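For illustration only, a minimal PyTorch sketch of such a backbone encoder; the class name, the torchvision weight identifier and the stage grouping are assumptions of this sketch, not part of the claimed method:

import torch
import torchvision

class ResNetEncoder(torch.nn.Module):
    """Pre-trained ResNet backbone with the fully connected layer removed."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # The stem downsamples by a factor of 4; each later stage by a factor of 2.
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.res2, self.res3, self.res4, self.res5 = (
            r.layer1, r.layer2, r.layer3, r.layer4)

    def forward(self, x):
        x = self.stem(x)      # 1/4 of the input resolution
        c2 = self.res2(x)     # 1/4  (Res-2)
        c3 = self.res3(c2)    # 1/8  (Res-3)
        c4 = self.res4(c3)    # 1/16 (Res-4)
        c5 = self.res5(c4)    # 1/32 (Res-5), the minimum-resolution feature map
        return c2, c3, c4, c5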
Construct a deep convolutional neural network comprising an encoder, a feature pyramid attention module, a cascaded atrous spatial pyramid module and a multi-resolution fusion decoder, and input the initial semantic image into the encoder to obtain feature maps at a plurality of resolutions.
For the minimum-resolution feature map output by the Res-5 module, multi-scale information is extracted with the cascaded atrous spatial pyramid module; for the feature maps output by the Res-2, Res-3 and Res-4 modules, the feature pyramid attention module performs attention recalibration to obtain more accurate representations; the multi-resolution fusion decoder structure maintains convolution streams of feature maps at different resolutions and repeatedly and alternately fuses the multi-resolution representations through upsampling, downsampling and convolution operations to obtain richer feature-map representations; finally, the decoded feature map is upsampled back to the original resolution, completing the segmentation task.
The multi-resolution fusion decoder structure provided by the invention still starts from the lowest-resolution feature map extracted by the encoder and gradually restores the original resolution by bilinear-interpolation upsampling. Unlike a general decoder structure, it does not discard the low-resolution feature maps while restoring resolution; instead, it gradually adds convolution streams from low to high resolution, connects the multi-resolution convolution streams in parallel, maintains feature-map convolution streams at different resolutions, and repeatedly exchanges information among the multi-resolution feature maps:
the multi-resolution fusion decoder comprises a plurality of resolution decoding layers, each resolution decoding layer comprises a first decoder and a second decoder set with the current resolution and higher than the current resolution, and a final semantic segmentation image with the same resolution as the initial semantic image is obtained through the layer-by-layer convolution operation of the decoding layers and the up-down sampling; based on the resolution size, performing down-sampling, up-sampling and convolution operations on a decoder of a current resolution decoder layer and a decoder of a next resolution decoder layer;
when the resolution of the decoder of the current resolution decoder layer is the same as the resolution of the decoder of the next resolution decoder layer, the convolution operation is adopted;
when the resolution of a decoder of the current resolution decoder layer is higher than the resolution of a decoder of the next resolution decoder layer, performing up-sampling;
when the resolution of the decoder of the current resolution decoder layer is lower than the resolution of the decoder of the next resolution decoder layer, down-sampling is carried out;
and finally, obtaining a final semantic segmentation image with the same resolution as the initial semantic image through convolution operation and upsampling.
The convolution operation used here is a depthwise separable convolution (DSConv) rather than conventional convolution. Compared with conventional convolution, depthwise separable convolution has fewer parameters, less computation and higher speed without degrading network performance, and can even improve the network's fitting capability. The specific steps are as follows:
corresponding to 4 down-sampling sub-modules of a ResNet coder, a multi-resolution fusion decoder structure comprises feature maps with 4 different resolutions, and the feature maps are respectively the resolutions of 4 times down-sampling, 8 times down-sampling, 16 times down-sampling and 32 times down-sampling of an original image. The whole decoding process can be divided into 4 stages, and assuming that the current stage maintains n kinds of feature maps with different resolutions, the stage needs to complete two tasks: on one hand, a feature map with higher resolution is recovered, and the method is that the current n feature maps with different resolutions are up-sampled by different multiples and are fused with the feature map corresponding to the encoder; on the other hand, information is interleaved and fused among n different low-resolution feature maps which are currently available through upsampling, downsampling and convolution operations. Finally, n +1 feature maps are output at the stage, and each feature map fuses the information of all other feature maps.
Compared with a general decoder structure, each resolution's final representation in the multi-resolution fusion decoder fuses information from all resolutions, so the context extracted by each resolution's feature map is markedly richer and the position sensitivity is stronger, which greatly helps recover detailed regions and improves segmentation accuracy.
As shown in FIG. 2, conventional convolution considers all channels of the input feature map at once and convolves them spatially together. Depthwise separable convolution splits the convolution into two parts: a depthwise convolution followed by a pointwise convolution. The depthwise convolution performs spatial convolution on each channel separately, leaving the number of input channels unchanged and neither expanding nor shrinking the feature dimensionality; the pointwise convolution, consisting mainly of 1x1 convolutions, mixes the channels of the depthwise output at each spatial position and expands or contracts the dimensionality to project a new feature map.
The purpose of depthwise separable convolution is to learn richer feature representations with fewer parameters: by decomposing convolution into separate spatial and channel convolutions, it greatly reduces model parameters and computation. In semantic segmentation, the features in different channels of the input feature map are relatively independent while the spatial correlations are strong, so the task is particularly well suited to depthwise separable convolution.
Since batch normalization (BN) aids gradient propagation, improves the model's generalization and accelerates convergence, the convolution operation in the invention is not a plain conventional convolution layer but a module consisting of depthwise separable convolution + batch normalization + ReLU activation (DSConv + BN + ReLU), sketched below.
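A minimal sketch of this DSConv + BN + ReLU unit in PyTorch (the helper name, kernel size 3 and stride default are assumptions of the sketch):

import torch.nn as nn

def dsconv_bn_relu(in_ch, out_ch, stride=1):
    """Depthwise 3x3 conv, pointwise 1x1 conv, batch norm, ReLU."""
    return nn.Sequential(
        # depthwise: one spatial filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        # pointwise: 1x1 conv mixes channels and sets the output width
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )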
As shown in FIG. 3, with the dilation rates ordered 1 < d_1 < d_2 < … < d_n, the cascaded atrous spatial pyramid module arranges the outputs of the n+1 dilated convolutions from small to large dilation rate and connects them in sequence. To sample information between adjacent pixels, the dilation rate of the first dilated convolution is set to 1, which in fact degenerates to an ordinary 3x3 convolution. The feature map output by each dilated convolution layer is concatenated with the connected feature map of the outputs with smaller dilation rates, followed by a 1x1 convolution that adjusts the number of filters and fuses the extracted information. The connected feature map output by the previous dilated convolution layer fuses the feature maps of all layers whose dilation rate is smaller than that of the current layer, and layer-by-layer concatenation yields a first sub-fusion feature map for each dilation rate. The dilated convolution with rate d_k can therefore extract the output information of the dilated convolutions with rates 1, d_1, d_2, …, d_{k-1}. Finally, the fused outputs of all the dilated convolutions are concatenated together. This structure extracts denser and more accurate feature information, strengthens the dependency between adjacent pixels in the dilated-convolution result, and improves the segmentation of small objects. In addition, to extract image-level features, a global pooling branch is added to the module: the minimum-resolution feature map is input into a global pooling layer to obtain a global pooled feature map capturing image-level features, further improving the module's performance.
The output of the cascaded atrous spatial pyramid module thus consists of the original minimum-resolution feature map, the globally pooled and upsampled output, and the cascade-fused results of the multi-scale, multi-rate dilated convolutions, as sketched below.
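For illustration, a minimal PyTorch sketch of this cascaded structure; the dilation rates (1, 6, 12, 18), the channel widths and the final 1x1 projection are assumptions, and plain convolutions stand in for the DSConv + BN + ReLU units described above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedASPP(nn.Module):
    """Cascaded atrous spatial pyramid with a global pooling branch."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # dilated 3x3 branches ordered by increasing dilation rate
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        # 1x1 convs that adjust filter counts and fuse each branch with
        # the connected output of all smaller-rate branches
        self.fuse = nn.ModuleList(
            nn.Conv2d(out_ch if i == 0 else 2 * out_ch, out_ch, 1)
            for i in range(len(rates)))
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(len(rates) * out_ch + in_ch + out_ch, out_ch, 1)

    def forward(self, x):                  # x: minimum-resolution feature map
        prev, subs = None, []              # subs: first sub-fusion feature maps
        for branch, fuse in zip(self.branches, self.fuse):
            y = branch(x)
            cat = y if prev is None else torch.cat([y, prev], dim=1)
            prev = fuse(cat)               # fuses all smaller dilation rates
            subs.append(prev)
        g = self.pool(x)                   # image-level (global pooling) branch
        g = F.interpolate(g, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)
        # concatenate sub-fusion maps, the input map and the pooled branch
        return self.project(torch.cat(subs + [x, g], dim=1))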
The feature-extraction step of the feature pyramid attention module is shown in FIG. 4. First, pooling branches at several scales are added alongside the global pooling operation, so that multi-scale information can be extracted. Second, the feature-extraction pyramid arranges the pooled feature maps from small to large resolution, then upsamples step by step starting from the smallest pooled output, adding it element-wise to the next higher-resolution pooled output, until the original resolution is restored. Finally, the module applies a 3x3 convolution at the original resolution to sample neighboring-pixel information and refine the upsampled result, making the final output more accurate.
Earlier attention mechanisms generally use only global average pooling in the feature-extraction step. Since the mean and the maximum describe information from different angles, the invention additionally introduces max pooling. Referring to FIG. 5, in the feature pyramid attention module, feature-extraction pyramids using average pooling and max pooling run in parallel; the second and third fused feature maps output by the two pyramids are concatenated, the number of channels is adjusted by a 1x1 convolution, a sigmoid activation produces pixel-level weights for the original feature map, and the weights are finally multiplied element-wise with the original feature map, as sketched below.
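A minimal PyTorch sketch of this module under assumed pooling bin sizes (1, 2, 4, 8), where bin size 1 plays the role of the global pooling branch; the names, bin sizes and plain 3x3 convolutions (standing in for the units described above) are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidAttention(nn.Module):
    """Pixel-level attention from parallel average- and max-pooling pyramids."""
    def __init__(self, channels, bins=(1, 2, 4, 8)):
        super().__init__()
        self.bins = bins                     # arranged from small to large
        self.avg_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.max_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_weight = nn.Conv2d(2 * channels, channels, 1)

    def _pyramid(self, x, pool, conv):
        pooled = [pool(x, b) for b in self.bins]   # small to large resolution
        y = pooled[0]
        for p in pooled[1:]:                 # upsample, then add element-wise
            y = F.interpolate(y, size=p.shape[-2:], mode="bilinear",
                              align_corners=False) + p
        y = F.interpolate(y, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)
        return conv(y)                       # 3x3 conv refines the result

    def forward(self, x):
        avg = self._pyramid(x, F.adaptive_avg_pool2d, self.avg_conv)  # 2nd fused map
        mx = self._pyramid(x, F.adaptive_max_pool2d, self.max_conv)   # 3rd fused map
        w = torch.sigmoid(self.to_weight(torch.cat([avg, mx], dim=1)))
        return x * w                         # element-wise pixel-level recalibration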
In this embodiment, the test set of PASCAL VOC 2012 is used to evaluate the multi-resolution semantic segmentation model. The PASCAL VOC 2012 dataset is a benchmark for object class recognition and detection in computer vision, providing standard images, annotations and evaluation procedures for the computer vision and deep learning communities. Its semantic segmentation subset contains 20 foreground object classes and one background class; the original dataset provides 1464 training images, 1449 validation images and 1456 test images. The training set is clearly too small to train an effective network, so Hariharan et al. enhanced and augmented the original data, increasing the number of training images to 10582.
For convenient network input, images are uniformly cropped to a resolution of 384 × 384. For fair evaluation, all networks in this section use ResNet101 as the backbone. Without MS-COCO pre-training or Dense-CRF post-processing, the proposed network reaches 77.5% MIoU, outperforming the other algorithms. The quantitative comparison results are listed in Table 1.
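For reference, MIoU averages the per-class intersection over union across all classes; a minimal sketch (the function name and integer-label array inputs are assumptions):

import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes; pred/target are integer label arrays."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:              # skip classes absent from both arrays
            ious.append(inter / union)
    return float(np.mean(ious))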
Table 1. Performance comparison of semantic segmentation algorithms on the PASCAL VOC 2012 test set (provided as an image in the original publication)

Claims (10)

1. A multi-resolution semantic segmentation method based on an attention pyramid, characterized by comprising the following steps:
(1) constructing a semantic segmentation training set;
(2) constructing a deep convolutional neural network, wherein the deep convolutional neural network comprises an encoder, a feature pyramid attention module, a cascaded atrous spatial pyramid module and a multi-resolution fusion decoder; inputting an initial semantic image into the encoder to obtain feature maps at a plurality of resolutions; inputting the minimum-resolution feature map into the cascaded atrous spatial pyramid module, which comprises a plurality of dilated convolution layers combined in cascade with the dilation rate increasing layer by layer; concatenating the feature map output by each dilated convolution layer with the connected feature map output by the previous dilated convolution layer, then applying a convolution to adjust the number of filters and extract semantic feature information, obtaining a first sub-fusion feature map; concatenating the first sub-fusion feature maps to obtain a first fused feature map; concatenating the first fused feature map, the minimum-resolution feature map and the global pooled feature map to obtain a first connected feature map; and inputting the first connected feature map into a first decoder;
inputting each of the other resolution feature maps into a plurality of average pooling layers of the feature pyramid attention module to obtain a first multi-resolution set of attention feature maps, then upsampling and convolving each first attention feature map to obtain a second fused feature map; inputting each of the other resolution feature maps into a plurality of max pooling layers of the feature pyramid attention module to obtain a second multi-resolution set of attention feature maps, then upsampling and convolving each second attention feature map to obtain a third fused feature map; concatenating the second and third fused feature maps and applying a convolution to obtain pixel-level weights for the other resolution feature maps, multiplying the pixel-level weights element-wise with the other resolution feature maps, and inputting the element-wise product into a second decoder of the corresponding resolution;
the multi-resolution fusion decoder comprises a plurality of resolution decoding layers, each comprising a first decoder at the current resolution and a set of second decoders at resolutions higher than the current resolution; the final semantic segmentation image, with the same resolution as the initial semantic image, is obtained through layer-by-layer convolution, upsampling and downsampling across the decoding layers;
(3) training the deep convolutional neural network on the semantic segmentation training set and optimizing its parameters to obtain a multi-resolution semantic segmentation model;
(4) in application, inputting a semantic image into the multi-resolution semantic segmentation model to obtain a semantic segmentation image.
2. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein combining the dilated convolution layers in cascade with the dilation rate increasing layer by layer comprises:
setting the dilation rate of the first dilated convolution layer to 1, arranging the dilated convolution layers from small to large dilation rate, and inputting the minimum-resolution feature map into each dilated convolution layer to obtain the feature map of each dilated convolution layer.
3. The attention pyramid-based multi-resolution semantic segmentation method according to claim 2, wherein concatenating the feature map output by each dilated convolution layer with the connected feature map output by the previous dilated convolution layer and then convolving comprises:
the connected feature map output by the previous dilated convolution layer fuses the feature maps of all layers whose dilation rate is smaller than that of the current layer, and layer-by-layer concatenation yields a first sub-fusion feature map for each dilation rate.
4. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein the minimum-resolution feature map is input into a global pooling layer to obtain a global pooled feature map for capturing image-level features.
5. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein inputting each of the other resolution feature maps into the plurality of average pooling layers of the feature pyramid attention module to obtain the first multi-resolution set of attention feature maps comprises:
the average pooling layers comprise a global pooling layer and several average pooling layers of different sizes, arranged from small to large output resolution; each of the other resolution feature maps is downsampled by these pooling layers to obtain a first set of attention feature maps arranged from small to large resolution.
6. The attention pyramid-based multi-resolution semantic segmentation method according to claim 5, wherein upsampling and convolving each first attention feature map to obtain the second fused feature map comprises:
starting from the minimum-resolution first attention feature map and upsampling layer by layer, the upsampled feature map from the previous layer is added element-wise to the first attention feature map of the current layer to obtain the current layer's upsampled feature map; layer-by-layer upsampling finally yields an upsampled feature map with the same resolution as the input feature map, which is convolved to obtain the second fused feature map.
7. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein concatenating the second and third fused feature maps and convolving to obtain the pixel-level weights of the other resolution feature maps comprises:
concatenating the second fused feature map with the third fused feature map, applying a 1x1 convolution and a sigmoid activation in sequence to obtain pixel-level weights, and applying the pixel-level weights to the other resolution feature maps by element-wise multiplication.
8. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein obtaining the final semantic segmentation image with the same resolution as the initial semantic image through layer-by-layer convolution, upsampling and downsampling across the decoding layers comprises:
fusing feature maps of different sizes through the downsampling, upsampling and convolution operations of the first decoder and set of second decoders of the current resolution decoder layer and the next resolution decoder layer, and obtaining the final semantic segmentation image at the resolution of the initial semantic image through layer-by-layer upsampling.
9. The attention pyramid-based multi-resolution semantic segmentation method according to claim 1, wherein fusing feature maps of different sizes through the downsampling, upsampling and convolution operations of the first decoder and set of second decoders of the current resolution decoder layer and the next resolution decoder layer comprises:
applying downsampling, upsampling or convolution between a decoder of the current resolution decoder layer and a decoder of the next resolution decoder layer according to their resolutions:
when the two decoders have the same resolution, a convolution operation is applied;
when the resolution of the decoder of the current resolution decoder layer is higher than that of the decoder of the next resolution decoder layer, upsampling is performed;
when the resolution of the decoder of the current resolution decoder layer is lower than that of the decoder of the next resolution decoder layer, downsampling is performed;
finally, the final semantic segmentation image with the same resolution as the initial semantic image is obtained through convolution and upsampling.
10. An attention pyramid-based multi-resolution semantic segmentation apparatus, comprising a computer memory, a computer processor, and a computer program stored in and executable on the computer memory, wherein the multi-resolution semantic segmentation model of any one of claims 1 to 9 is employed in the computer memory;
when executing the computer program, the computer processor performs the following step: inputting a semantic image into the multi-resolution semantic segmentation model and outputting the semantic segmentation image through computation.
Priority Applications (1)

CN202210014091.9A (priority date 2022-01-04, filing date 2022-01-04): Attention pyramid-based multi-resolution semantic segmentation method and device

Publications (1)

CN114359297A, published 2022-04-15

Family ID: 81107328

Country Status (1)

CN: CN114359297A (pending)


Cited By (7)

* Cited by examiner, † Cited by third party

CN115063446A * (priority 2022-05-12, published 2022-09-16): City street view instance segmentation method for a driving assistance system
CN115082675A * (priority 2022-06-07, published 2022-09-20): Transparent object image segmentation method and system
CN115082675B * (priority 2022-06-07, published 2024-06-04): Transparent object image segmentation method and system
CN116563265A * (priority 2023-05-23, published 2023-08-08): Cardiac MRI segmentation method based on multi-scale attention and adaptive feature fusion
CN116563265B * (priority 2023-05-23, published 2024-03-01): Cardiac MRI segmentation method based on multi-scale attention and adaptive feature fusion
CN116503428A * (priority 2023-06-27, published 2023-07-28): Image feature extraction method and segmentation method based on a refined global attention mechanism
CN116503428B * (priority 2023-06-27, published 2023-09-08): Image feature extraction method and segmentation method based on a refined global attention mechanism


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination