CN111833273B - Semantic boundary enhancement method based on long-distance dependence

Info

Publication number: CN111833273B
Application number: CN202010690090.7A
Authority: CN (China)
Family ID: 72924548
Prior art keywords: feature map, boundary, feature, semantic
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111833273A
Inventors: Han Zhen (韩震), Chen Xi (陈曦), Liu Xiaoping (刘小平), Li Zhiqiang (李志强), Li Qingli (李庆利), Zhu Min (朱敏), Liu Min (刘敏)
Original and current assignee: East China Normal University
Application filed by East China Normal University; priority and filing date: 2020-07-17
Publication of CN111833273A: 2020-10-27
Publication (grant) of CN111833273B: 2021-08-13

Classifications

    • G06T 5/00: Image enhancement or restoration
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/13: Edge detection
    • G06T 2207/10004: Still image; Photographic image
    • G06T 2207/20076: Probabilistic image processing
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20172: Image enhancement details
    • G06T 2207/20192: Edge enhancement; Edge preservation


Abstract

A semantic boundary enhancement method based on long-distance dependence belongs to the technical field of image processing based on deep learning. The invention addresses the problem that existing semantic segmentation algorithms ignore the long-range dependency relationships of semantic boundaries, which leaves the boundaries blurred. The method comprises the following steps: inputting an image to be identified into a convolutional neural network and processing it with an encoder to obtain a high-level feature map X; passing the high-level feature map X through three parallel branches, where the first branch passes the high-level feature map X through directly, the second branch produces a position attention feature map, and the third branch produces a boundary attention feature map; and merging the feature maps output by the three parallel branches and decoding them with a decoder to obtain the final semantic segmentation result map. The invention effectively improves the precision of semantic segmentation.

Description

Semantic boundary enhancement method based on long-distance dependence
Technical Field
The invention relates to a semantic boundary enhancement method based on long-distance dependence, and belongs to the technical field of image processing based on deep learning.
Background
Semantic segmentation aims to assign every pixel of an image to a semantic class. As fundamental work for image understanding, semantic segmentation is becoming increasingly important in many areas, such as the analysis of medical images and urban street scenes. Its success depends on whether semantically consistent regions can be understood and segmented. It is therefore important to reduce intra-class inconsistency and to sharpen the distinction between classes, and both inter-class and intra-class discrimination can be improved by using boundary and context information.
Currently, only a few semantic segmentation methods focus on boundary detection. The underlying reason is that boundaries account for only a small part of the entire image, so they do not directly contribute to performance. However, without boundary detection, most existing semantic segmentation methods blur the boundary and limit further performance improvements. This is because: 1) existing semantic segmentation methods depend on fixed spatial support, i.e., a patch form or a convolution-kernel form; and 2) pooling and striding operations significantly reduce the spatial resolution of the resulting feature map. The resolution of the high-level feature maps of these methods is therefore relatively low, which limits their ability to extract the fine-grained details of objects.
In fact, boundary information can be used to maintain strong connections within the same object and weak connections between different objects. It can enhance segmentation performance by separating the classes on the two sides of a boundary. From low-level to high-level definitions, boundaries can be divided into simple edges, depth boundaries, object boundaries, and semantic boundaries. The Sobel and Canny algorithms can find simple edges. Depth boundaries are commonly used in indoor layout estimation to identify concave boundaries, including those of the ceiling and floor. Object boundaries may be extracted to inform a low-level boundary detection process and thereby support many high-level visual tasks. Semantic boundaries are typically formulated as binary or class-aware semantic boundary detection.
Although semantic boundary detection improves segmentation accuracy, its fixed geometric architecture can only satisfy the constraints of local receptive fields and short-range context information. Since dataset labels inevitably contain noise, boundary detection can be added as a network layer whose loss is computed to predict the maximum response along the boundary normal direction. However, these methods cannot capture long-range spatial correlations and therefore cannot improve the accuracy of semantic segmentation.
To capture dense and long-range spatial dependencies, some algorithms extract context information. Conditional random fields can be used as post-processing to combine low-level and high-level features for fine-grained prediction. Multi-scale feature maps or global feature information can be aggregated through a pyramid module to exploit global context. Although these methods detect features at different scales, they fail to capture long-range spatial dependencies.
To remedy the above deficiencies, self-attention mechanisms have been used in semantic segmentation to improve performance. One attention mechanism learns to weight multi-scale features at each pixel location by training jointly with multi-scale input images. The point-wise spatial attention network (PSANet) aggregates long-range context through adaptively learned attention maps, using position-sensitive context dependencies and bi-directional information propagation. The criss-cross network (CCNet) captures long-range context dependencies along criss-cross paths to optimize computation and storage efficiency. However, existing self-attention methods ignore the semantic boundaries in the high-level feature map, so key high-level semantic information, such as boundaries, becomes fragile in the feature map that the encoder produces for the objects in an image.
Disclosure of Invention
Aiming at the problem that existing semantic segmentation algorithms ignore the long-range dependency relationships of semantic boundaries and therefore blur the boundaries, the invention provides a semantic boundary enhancement method based on long-distance dependence.
The invention relates to a semantic boundary enhancement method based on long-distance dependency, which comprises the following steps,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X; the high-level feature map X is passed through three parallel branches and then merged; wherein,
the first branch passes the high-level feature map X directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
and merging the feature maps output by the three parallel branches, and decoding by a decoder to obtain a final semantic segmentation result map.
According to the semantic boundary enhancement method based on long-distance dependency of the invention,
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries comprises:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
the high frequency detail profile X ″ is obtained from the difference between the high level profile X and the profile X':
Figure BDA0002589054920000031
according to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
adding the high-frequency detail feature map X'' to the high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, and W represents the width of the enhanced feature map A.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
processing the enhanced feature map A through three 1 × 1 convolution layers respectively to correspondingly obtain a feature map B, a feature map C and a feature map D, with

$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically.
According to the semantic boundary enhancement method based on long-distance dependency, the position attention module is adopted to capture the remote context information of the high-level feature map X, and the process of obtaining the position attention feature map comprises the following steps:
processing the high-level feature map X through three convolution layers respectively to correspondingly obtain a feature map F, a feature map G and a feature map H, with

$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the position attention module is adopted to capture the remote context information of the high-level feature map X, and the process of obtaining the position attention feature map further comprises the following steps:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result with the high-level feature map X to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X_j$ is the j-th feature vector of the high-level feature map X; β is initialized to zero and its weight is learned automatically.
According to the long-distance dependency-based semantic boundary enhancement method, the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X+E+I,
where K comprises the semantic boundaries, positions and context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.
The invention has the beneficial effects that: the method adopts an unsharp masking algorithm to generate enhanced semantic boundaries and thereby improve the quality of semantic segmentation; it also uses an attention mechanism to enhance the context information around the semantic boundaries; finally, the output is combined with the result of the traditional attention module to produce enhanced long-range context information. By combining the long-range dependency relationships of semantic boundaries with a self-attention mechanism, the method effectively improves the precision of semantic segmentation.
Drawings
FIG. 1 is an overall flow chart of the semantic boundary enhancement method based on long-distance dependency according to the invention; CNN in the figure represents an encoder; PAM denotes the location attention module; BAM represents the boundary attention module;
FIG. 2 is a process flow diagram of a boundary attention module;
FIG. 3 is a process flow diagram of a location attention module;
FIG. 4 is an atlas of images to be identified formed from four images to be identified;
FIG. 5 is an atlas of segmentation results processed for FIG. 4 using an existing reference network;
FIG. 6 is a graph set of segmentation results processed by the method of the present invention for FIG. 4;
FIG. 7 shows the ground-truth labels corresponding to FIG. 4;
FIG. 8 is a visualization of an atlas of images to be identified on a Cityscapes validation set;
FIG. 9 is a visual representation of the feature map of FIG. 8 using an existing reference network;
FIG. 10 is a visual representation of the signature graph of FIG. 8 using the method of the present invention;
fig. 11 shows the ground-truth labels corresponding to fig. 8.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Detailed description of the invention: as shown in fig. 1, the present invention provides a semantic boundary enhancement method based on long-distance dependency, which includes,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X; the high-level feature map X is passed through three parallel branches and then merged; wherein,
the first branch passes the high-level feature map X directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map X to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
and merging the feature maps output by the three parallel branches, and decoding by a decoder to obtain a final semantic segmentation result map.
As shown in fig. 1, this embodiment constructs a new convolutional neural network, SBENet (Semantic Boundary Enhancement Network with long-range dependency for semantic segmentation). SBENet gradually reduces the resolution of the image to be recognized through the encoder (CNN) and generates the high-level feature map X with a spatial size of H × W pixels. After the feature map X is obtained, X is fed into three parallel branches of the unified network. The first branch keeps the original feature map X. In the second branch, X is sent to the position attention module to capture long-range context information. In the third branch, the semantic boundaries are extracted with an unsharp masking algorithm and then input into the boundary attention module to enhance the high-level long-range dependencies around the semantic boundaries. The feature maps from the three branches are combined and fed to the decoder, which generates the final segmentation map.
In the SBENet architecture of this embodiment, in order to retain more detail and generate dense feature maps, the last two downsampling operations in the ResNet convolutional neural network are removed and replaced with dilated convolutions. The final output feature map is therefore 1/8 of the original size of the image to be recognized.
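The patent does not publish source code; as a minimal sketch only, the dilated backbone described above can be reproduced with torchvision's ResNet-101, whose `replace_stride_with_dilation` flag performs exactly this substitution (the weight enum and the 769 × 769 input size are illustrative assumptions):

```python
import torch
import torchvision

# ResNet-101 encoder whose last two downsampling stages use dilated
# convolutions instead of stride, so the output is 1/8 of the input size.
backbone = torchvision.models.resnet101(
    weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1,  # ImageNet pre-training
    replace_stride_with_dilation=[False, True, True],            # dilate layer3 and layer4
)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc

x = torch.randn(1, 3, 769, 769)   # a Cityscapes-style 769 x 769 crop
feats = encoder(x)                # shape (1, 2048, 97, 97): roughly 1/8 resolution
```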
Some high-level feature maps X are blurred because of the reduced resolution, and a blurred feature map loses detail and produces unclear edges. To extract semantic boundaries, the input high-level feature map X is sent to two sub-branches, one of which is processed through an average pooling layer. The specific steps are as follows:
further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries includes:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$.
As an example, the average pooling layer acts as a linear shift-invariant low-pass filter.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
the high frequency detail feature map X ", i.e. the language boundary, is obtained from the difference between the high level feature map X and the feature map X', and can be represented as:
Figure BDA0002589054920000072
semantic boundaries have a global context view, which helps to refine the segmentation result along object boundaries. The semantic boundaries may be enlarged using a scaling factor and then added to the original feature map X to generate an enhanced feature map a, thereby deblurring the high-level feature map X and improving the clarity of the high-level details (i.e., remote semantic boundaries). Thus, context around semantic boundaries is selectively aggregated. The method specifically comprises the following steps:
still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
adding the high-frequency detail feature map X'' to the high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, and W represents the width of the enhanced feature map A.
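The unsharp-masking branch above reduces to a few tensor operations. The following is a hedged sketch, not the patent's code; the scale factor `lam` (the scaling factor mentioned earlier) and the bilinear re-alignment for even kernel sizes are assumptions:

```python
import torch
import torch.nn.functional as F

def unsharp_sharpen(X: torch.Tensor, p: int = 3, lam: float = 1.0) -> torch.Tensor:
    """Unsharp masking on a feature map X of shape (batch, C, H, W):
    X'  : low-pass version of X from p x p average pooling,
    X'' : high-frequency details X - X' (the semantic boundaries),
    A   : boundary-enhanced map X + lam * X''."""
    X_lp = F.avg_pool2d(X, kernel_size=p, stride=1, padding=p // 2)
    # For even p the padded output is one pixel larger; resize back to X's size.
    if X_lp.shape[-2:] != X.shape[-2:]:
        X_lp = F.interpolate(X_lp, size=X.shape[-2:], mode="bilinear", align_corners=False)
    return X + lam * (X - X_lp)
```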
After enhancement by the sharpening module, the relationships between objects should be exploited through global or long-range context information, which is crucial for semantic segmentation. To model the relationships between the objects around semantic boundaries, this embodiment designs a boundary attention module, which encodes extensive context information and thereby improves the representation capability of the feature map.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
respectively processing the enhanced feature map A (at the minimum resolution) through three 1 × 1 convolution layers to correspondingly obtain a feature map B, a feature map C and a feature map D, with

$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax (normalized exponential function) layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically. The resulting boundary attention at each location is a weighted sum over all locations. The boundary attention feature map E therefore aggregates context information based on boundary attention and promotes similar semantic features, which leads to greater semantic consistency and feature separability. The representation capability is thus enhanced by the context information encoded around the broad semantic boundary. The advantage of the boundary attention module is that it preserves high-level details, especially semantic boundaries; the module is equivalent to a global spatial filtering module and enhances consistency.
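Read as code, the boundary attention module is a self-attention block applied to the sharpened map A. Below is a minimal PyTorch sketch of the equations above; keeping the full channel width C in the three 1 × 1 convolutions is an assumption, since the patent's figures are not reproduced here:

```python
import torch
import torch.nn as nn

class BoundaryAttention(nn.Module):
    """Self-attention over the enhanced feature map A (the equations above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))  # initialized to zero, learned

    def forward(self, A: torch.Tensor) -> torch.Tensor:
        bsz, ch, H, W = A.shape
        N = H * W
        B = self.conv_b(A).view(bsz, ch, N)    # (bsz, C, N)
        Cm = self.conv_c(A).view(bsz, ch, N)   # (bsz, C, N)
        D = self.conv_d(A).view(bsz, ch, N)    # (bsz, C, N)
        # s_ji = softmax over i of B_i . C_j, so S has shape (bsz, N, N)
        S = torch.softmax(torch.bmm(Cm.transpose(1, 2), B), dim=-1)
        # E_j = alpha * sum_i s_ji D_i + A_j
        E = torch.bmm(D, S.transpose(1, 2)).view(bsz, ch, H, W)
        return self.alpha * E + A
```

The position attention module described next has the same structure, with X as input and a factor β in place of α, so the same class can serve both branches.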
The boundary attention module calculates a weighted sum of all locations of the enhanced feature map to highlight the context around the semantic boundary and high level details of the entire image. In order to further capture remote context information, the embodiment merges the original feature map X, the feature maps output by the boundary attention module and the position attention module, and finally obtains a segmentation map with enhanced semantic boundaries.
Still further, with reference to fig. 3, the process of capturing the remote context information of the advanced feature map X by using the location attention module to obtain the location attention feature map includes:
processing the high-level feature map X through three convolution layers respectively to correspondingly obtain a feature map F, a feature map G and a feature map H, with

$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$.
The position attention module uses a self-attention mechanism with X as input and outputs F, G and H.
Still further, with reference to fig. 3, the process of capturing the remote context information of the advanced feature map X by using the location attention module, and obtaining the location attention feature map further includes:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result with the high-level feature map X to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X_j$ is the j-th feature vector of the high-level feature map X; β is initialized to zero and its weight is learned automatically.
Still further, as shown in fig. 1, the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X+E+I,
where K comprises the semantic boundaries, positions and rich context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.
The dilated convolution layer can be integrated directly into an existing FCN network structure. Furthermore, the mixed attention effectively enhances the feature representation without adding too many parameters.
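Putting the three branches together, the following is a hedged sketch of the fusion K = X + E + I, reusing the `BoundaryAttention` and `unsharp_sharpen` sketches above (the channel width 2048 matches the ResNet-101 output; everything else is an assumption):

```python
import torch

pam = BoundaryAttention(channels=2048)  # position attention branch (factor beta)
bam = BoundaryAttention(channels=2048)  # boundary attention branch (factor alpha)

def fuse_branches(X: torch.Tensor) -> torch.Tensor:
    """X: encoder output of shape (batch, 2048, H/8, W/8)."""
    I = pam(X)                      # second branch: position attention feature map I
    E = bam(unsharp_sharpen(X))     # third branch: sharpen, then boundary attention map E
    return X + E + I                # K, which is fed to the decoder
```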
The beneficial effects of the invention are verified by experiments as follows:
to evaluate the method of the invention, extensive experiments were performed on the cityscaps dataset, the PASCAL VOC2012 dataset and the cammid dataset. Experimental results show that the method achieves the most advanced performance in the three data sets.
Table 1 collects the results of the methods on the Cityscapes test set. ResNet-101 is used as the backbone network to train the proposed SBENet architecture, and the test results are submitted to the official evaluation server for evaluation. When SBENet is trained only with the finely labeled training set, the performance of the model improves by about 2.3% over the existing PSPNet, and is even better than PSANet, which also uses the validation set for training. To obtain better results, training is then performed with both the training data and the validation data. As Table 1 shows, the SBENet of the invention is significantly superior to all existing methods, including DenseASPP, which uses a more powerful backbone; SBENet achieves the current best result of 82.2% mIoU (the mean, over all classes, of the intersection over union between true and predicted values).
The present invention uses a boundary and location attention module on top of the backbone to capture remote dependencies and semantic boundaries to better understand the scenario. Extensive experiments were conducted using different configurations to demonstrate the performance of the attention module in table 2.
As can be seen from Table 2, the method of the present invention significantly improves performance. On the Cityscapes validation set, 77.1% mIoU is obtained with the position attention module alone, 3.6% higher than the ResNet-50 baseline. The boundary attention module alone improves performance by 4.3%, demonstrating the validity of semantic boundary information. When the two modules are combined, the result improves to 78.4%. In addition, with the more powerful ResNet-101 as the baseline and the two modules integrated into the network, the mIoU increases to 80.6%. These results verify that the proposed method effectively captures long-range context information and semantic boundary information.
Qualitative comparisons between the SBENet of the invention and the reference network are provided in figs. 4-7. Squares mark some challenging areas that are difficult to identify. The figures show that some details and object boundaries become clearer after processing by the method of the invention, such as the "sidewalk" in the first picture and the "fence" in the second picture of figs. 4-7. The visualization results demonstrate the effectiveness of the semantic segmentation method.
To further validate SBENet, this embodiment compares it with several context aggregation methods on the Cityscapes validation set, all using ResNet-50 and ResNet-101 as backbone networks. The compared methods are the pyramid pooling module (PPM) proposed in PSPNet and the atrous spatial pyramid pooling (ASPP) module proposed in DeepLabv3. The experimental results for single-scale testing are shown in Table 3. SBENet performs better than both the PPM and ASPP modules, demonstrating the superiority of the method of the invention.
The invention also adopts a series of improvement strategies to further raise the mIoU on the Cityscapes test set; Table 4 shows the experimental results. Prediction with a single scale first reaches 79.8% mIoU. With multi-scale testing [0.75, 1.0, 1.25, 1.5] and left-right flipping, the performance improves from 79.8% to 80.7%. Finally, training on both the training set and the validation set, without the coarsely labeled data, brings the final mIoU to 82.2%.
To understand the proposed method more deeply, the feature maps of SBENet and the baseline (ResNet-101) are visualized on the Cityscapes validation set. As shown in figs. 8-11, a point is selected in each input image (marked with a cross), and the corresponding baseline and SBENet feature maps are displayed in figs. 9 and 10, respectively. SBENet captures object boundaries better than the baseline. For example, in the third photograph of figs. 8-11, where the cross is placed on a building, the method of the invention identifies its boundary very clearly. It can also be observed that SBENet in fig. 10 enhances semantic similarity and long-range dependencies.
Comparative experiments on the PASCAL VOC 2012 validation set further demonstrate the effectiveness of each part of the method; the results are shown in Table 5. With ResNet-101 as the baseline, 72.6% mIoU is obtained in the single-scale test. With only the position attention module or only the boundary attention module, the performance improves by 6.5% and 7.1%, respectively. With both modules and the multi-scale plus left-right-flip strategy, the performance reaches 80.7%. Finally, fine-tuning the model on the original training set improves the result to 81.3% mIoU, a large improvement over the baseline.
For comparison with the most advanced methods, experiments were also performed on the PASCAL VOC 2012 test set, and the results were submitted to the official server. The model is first trained with an augmented training set, and the original training and validation sets are then used to further fine-tune it. Detailed per-class results are shown in Table 6; the method of the invention is significantly better than the previous best methods on the PASCAL VOC 2012 test set and reaches the best performance of 84.6% mIoU.
To further verify the generalization of the proposed method, experiments were also carried out on the CamVid dataset. Table 7 compares the results with prior state-of-the-art methods. SBENet far exceeds methods such as DeconvNet, SegNet and Dense-Decoder, reaching 74.1% mIoU. This further demonstrates the importance of capturing long-range context information and semantic boundary information in scene segmentation.
The specific embodiment is as follows:
In the experiments of the invention, the mean Intersection over Union over all classes (mIoU) is used as the evaluation index.
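For reference, mIoU is the per-class intersection over union averaged over all classes; a small sketch follows (skipping classes absent from both maps is a common convention and an assumption here):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred and gt are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```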
Cityscapes is an urban scene dataset collected from 50 different cities. It contains 5,000 images with high-quality pixel-level labels and 20,000 coarsely labeled images. Each image is 1024 × 2048 pixels with 19 semantic categories. In the experiments, only the 5,000 finely labeled images are used; they are split into 2,975, 500, and 1,525 images for training, validation, and testing, respectively.
PASCAL VOC 2012 is a semantic segmentation benchmark dataset, initially with 1,464 images for training, 1,449 for validation and 1,456 for testing. It includes 21 object classes, one of which is the background class.
CamVid is a street scene image segmentation dataset comprising 367 training images, 101 validation images and 233 test images. The dataset provides 11 categories for semantic segmentation evaluation.
The network architecture of this embodiment is implemented in PyTorch, using a ResNet-101 pre-trained on ImageNet as the backbone network, whose last two downsampling operations are replaced by dilated convolutions with dilation rates of 2 and 4, respectively. After downsampling, the output feature map is 1/8 of the original size of the input image. In addition, the standard BatchNorm is replaced by InPlace-ABN, which synchronizes the mean and standard deviation of BatchNorm across multiple GPUs.
Training uses mini-batch stochastic gradient descent (SGD) as the optimizer, with a poly learning rate strategy in which the base learning rate is multiplied by an update factor after each iteration. The initial learning rate is 0.001 for the PASCAL VOC 2012 and CamVid datasets and 0.005 for the Cityscapes dataset. The momentum and weight decay are 0.9 and 0.0001, respectively. The data augmentation strategy includes random left-right flipping, random scaling from 0.5 to 2.0, and, for the Cityscapes dataset, random cropping of 769 × 769 image blocks. Four 1080 Ti GPUs are used for training, with a batch size of 4 for Cityscapes and 12 for the other datasets.
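The poly schedule mentioned above is commonly implemented as below; the exponent 0.9 is the value customarily used with this strategy and is an assumption here, since the patent does not state it:

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly schedule: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. on Cityscapes (base learning rate 0.005), halfway through training:
# poly_lr(0.005, cur_iter=20000, max_iter=40000) == 0.005 * 0.5 ** 0.9
```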
TABLE 1
Comparison of results on the Cityscapes test set; the marked entries use both the training set and validation set data. [Table image not reproduced.]
TABLE 2
Ablation experiments on the Cityscapes validation set; PAM denotes the position attention module and BAM the boundary attention module. [Table image not reproduced.]
TABLE 3
Comparison of context aggregation methods on the Cityscapes validation set. [Table image not reproduced.]
TABLE 4
Comparison of different strategies on the Cityscapes validation set; SS denotes single-scale testing, MS multi-scale testing, and W/val adding the validation set for training. [Table image not reproduced.]
TABLE 5
Ablation experiments on the PASCAL VOC 2012 validation set; PAM denotes the position attention module, BAM the boundary attention module, MS multi-scale testing, and FT fine-tuning of the trained model on the original training set. [Table image not reproduced.]
TABLE 6
Per-class results on the PASCAL VOC 2012 test set. [Table image not reproduced.]
TABLE 7
Comparison of results on the CamVid test set. [Table image not reproduced.]
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (2)

1. A semantic boundary enhancement method based on long-distance dependency comprises the following steps,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X0; the high-level feature map X0 is passed through three parallel branches and then merged; wherein,
the first branch passes the high level feature map X0 directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map X0 to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X0 to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
merging the feature maps output by the three parallel branches, and then decoding by a decoder to obtain a final semantic segmentation result map;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries comprises:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
obtaining a high-frequency detail feature map X ″ from the difference between the high-level feature map X after the boundary extraction and the average pooled output feature map X':
$$X'' = X - X'$$
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
adding the high-frequency detail feature map X'' to the boundary-extracted high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, W represents the width of the enhanced feature map A, and $\mathbb{R}$ denotes real space;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
processing the enhanced feature map A through three convolution layers of 1 x 1 respectively to obtain a feature map B, a feature map C and a feature map D correspondingly, and
$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically;
the remote context information of the high-level feature map X0 is captured by a position attention module, and the process of obtaining the position attention feature map comprises the following steps:
processing the high-level feature map X0 by three convolution layers respectively to obtain a feature map F, a feature map G and a feature map H, and
$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$;
capturing remote context information of the high-level feature map X0 using a location attention module, the process of obtaining the location attention feature map further comprising:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result to the high-level feature map X0 to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X0_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X0_j$ is the j-th feature vector of the high-level feature map X0; β is initialized to zero and its weight is learned automatically.
2. The long-range dependency based semantic boundary enhancement method of claim 1,
and the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X0+E+I,
where K comprises the semantic boundaries, positions and context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381057A (en) * 2020-12-03 2021-02-19 上海芯翌智能科技有限公司 Handwritten character recognition method and device, storage medium and terminal
CN113807354B (en) * 2020-12-29 2023-11-03 京东科技控股股份有限公司 Image semantic segmentation method, device, equipment and storage medium
CN112967300A (en) * 2021-02-23 2021-06-15 艾瑞迈迪医疗科技(北京)有限公司 Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network
CN113191367B (en) * 2021-05-25 2022-07-29 华东师范大学 Semantic segmentation method based on dense scale dynamic network
CN115546239B (en) * 2022-11-30 2023-04-07 珠海横琴圣澳云智科技有限公司 Target segmentation method and device based on boundary attention and distance transformation
CN116030260B (en) * 2023-03-27 2023-08-01 湖南大学 Surgical whole-scene semantic segmentation method based on long-strip convolution attention
CN117171766B (en) * 2023-07-31 2024-04-05 上海交通大学 Data protection method, system and medium based on deep neural network model


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122796A (en) * 2017-04-01 2017-09-01 中国科学院空间应用工程与技术中心 A kind of remote sensing image sorting technique based on multiple-limb network integration model
US10671083B2 (en) * 2017-09-13 2020-06-02 Tusimple, Inc. Neural network architecture system for deep odometry assisted by static scene optical flow
CN109544559A (en) * 2018-10-19 2019-03-29 深圳大学 Image, semantic dividing method, device, computer equipment and storage medium
CN109886986A (en) * 2019-01-23 2019-06-14 北京航空航天大学 A kind of skin lens image dividing method based on multiple-limb convolutional neural networks
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN111127470A (en) * 2019-12-24 2020-05-08 江西理工大学 Image semantic segmentation method based on context and shallow space coding and decoding network
CN111160311A (en) * 2020-01-02 2020-05-15 西北工业大学 Yellow river ice semantic segmentation method based on multi-attention machine system double-flow fusion network
CN111222515A (en) * 2020-01-06 2020-06-02 北方民族大学 Image translation method based on context-aware attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, Gang Wang, "Semantic Segmentation With Context Encoding and Multi-Path Decoding," IEEE Transactions on Image Processing, vol. 29, pp. 3520-3533, 2020. *

Also Published As

Publication number Publication date
CN111833273A (en) 2020-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant