CN111833273B - Semantic boundary enhancement method based on long-distance dependence

Info

Publication number: CN111833273B
Application number: CN202010690090.7A
Authority: CN (China)
Family ID: 72924548
Prior art keywords: feature map, boundary, feature, semantic
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111833273A
Inventors: Han Zhen (韩震), Chen Xi (陈曦), Liu Xiaoping (刘小平), Li Zhiqiang (李志强), Li Qingli (李庆利), Zhu Min (朱敏), Liu Min (刘敏)
Original and current assignee: East China Normal University
Application filed by East China Normal University; priority and filing date: 2020-07-17
Publication of CN111833273A: 2020-10-27
Publication (grant) of CN111833273B: 2021-08-13

Classifications

    • G06T 5/00: Image enhancement or restoration
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/13: Edge detection
    • G06T 2207/10004: Still image; Photographic image
    • G06T 2207/20076: Probabilistic image processing
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20172: Image enhancement details
    • G06T 2207/20192: Edge enhancement; Edge preservation


Abstract

A semantic boundary enhancement method based on long-distance dependence belongs to the technical field of image processing based on deep learning. The invention addresses the problem that existing semantic segmentation algorithms ignore the long-range dependency relationships of semantic boundaries, which leaves the boundaries blurred. The method comprises the following steps: inputting an image to be identified into a convolutional neural network and processing it with an encoder to obtain a high-level feature map X; passing the high-level feature map X through three parallel branches, where the first branch passes the high-level feature map X through directly, the second branch produces a position attention feature map, and the third branch produces a boundary attention feature map; and merging the feature maps output by the three parallel branches and decoding them with a decoder to obtain the final semantic segmentation result map. The invention effectively improves the precision of semantic segmentation.

Description

Semantic boundary enhancement method based on long-distance dependence
Technical Field
The invention relates to a semantic boundary enhancement method based on long-distance dependence, and belongs to the technical field of image processing based on deep learning.
Background
Semantic segmentation aims to assign every pixel of an image to a semantic class. As fundamental work for image understanding, semantic segmentation is becoming increasingly important in many areas, such as the analysis of medical images and urban street scenes. Its success depends on whether semantically consistent regions can be understood and segmented. It is therefore important to reduce intra-class inconsistency and to sharpen the distinction between classes, and both inter-class and intra-class discrimination can be improved by using boundary and context information.
Currently, only a few semantic segmentation methods focus on boundary detection. The underlying reason is that boundaries account for only a small part of the entire image, so they do not directly contribute to performance. However, without boundary detection, most existing semantic segmentation methods blur the boundary and limit further performance improvements. This is because: 1) existing semantic segmentation methods depend on fixed spatial support, i.e., a patch form or a convolution-kernel form; and 2) pooling and striding operations significantly reduce the spatial resolution of the resulting feature map. The resolution of the high-level feature maps of these methods is therefore relatively low, which limits their ability to extract the fine-grained details of objects.
In fact, boundary information can be used to maintain strong connections within the same object and weak connections between different objects. It can enhance segmentation performance by separating the classes on the two sides of a boundary. From low-level to high-level definitions, boundaries can be divided into simple edges, depth boundaries, object boundaries, and semantic boundaries. The Sobel and Canny algorithms can find simple edges. Depth boundaries are commonly used in indoor layout estimation to identify concave boundaries, including those of the ceiling and floor. Object boundaries may be extracted to inform a low-level boundary detection process and thereby support many high-level visual tasks. Semantic boundaries are typically formulated as binary or class-aware semantic boundary detection.
Although semantic boundary detection improves segmentation accuracy, its fixed geometric architecture can only satisfy the constraints of local receptive fields and short-range context information. Since dataset labels inevitably contain noise, boundary detection can be added as a network layer whose loss is computed to predict the maximum response along the boundary normal direction. However, these methods cannot capture long-range spatial correlations and therefore cannot improve the accuracy of semantic segmentation.
To capture dense and long-range spatial dependencies, some algorithms extract context information. Conditional random fields can be used as post-processing to combine low-level and high-level features for fine-grained prediction. Multi-scale feature maps or global feature information can be aggregated through a pyramid module to exploit global context. Although these methods detect features at different scales, they fail to capture long-range spatial dependencies.
To remedy the above deficiencies, self-attention mechanisms have been used in semantic segmentation to improve performance. One attention mechanism learns to weight multi-scale features at each pixel location by training jointly with multi-scale input images. The point-wise spatial attention network (PSANet) aggregates long-range context through adaptively learned attention maps, using position-sensitive context dependencies and bi-directional information propagation. The criss-cross network (CCNet) captures long-range context dependencies along criss-cross paths to optimize computation and storage efficiency. However, existing self-attention methods ignore the semantic boundaries in the high-level feature map, so key high-level semantic information, such as boundaries, becomes fragile in the feature map that the encoder produces for the objects in an image.
Disclosure of Invention
Aiming at the problem that existing semantic segmentation algorithms ignore the long-range dependency relationships of semantic boundaries and therefore blur the boundaries, the invention provides a semantic boundary enhancement method based on long-distance dependence.
The invention relates to a semantic boundary enhancement method based on long-distance dependency, which comprises the following steps,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X; the high-level feature map X is passed through three parallel branches and then merged; wherein,
the first branch passes the high-level feature map X directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
and merging the feature maps output by the three parallel branches, and decoding by a decoder to obtain a final semantic segmentation result map.
According to the semantic boundary enhancement method based on long-distance dependency of the invention,
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries comprises:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
the high frequency detail profile X ″ is obtained from the difference between the high level profile X and the profile X':
Figure BDA0002589054920000031
according to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
adding the high-frequency detail feature map X'' to the high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, and W represents the width of the enhanced feature map A.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
processing the enhanced feature map A through three 1 × 1 convolution layers respectively to correspondingly obtain a feature map B, a feature map C and a feature map D, with

$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the process of enhancing advanced remote dependency relationships around semantic boundaries by the boundary attention module further comprises the following steps:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically.
According to the semantic boundary enhancement method based on long-distance dependency, the position attention module is adopted to capture the remote context information of the high-level feature map X, and the process of obtaining the position attention feature map comprises the following steps:
processing the high-level feature map X through three convolution layers respectively to correspondingly obtain a feature map F, a feature map G and a feature map H, with

$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$.
According to the long-distance dependency-based semantic boundary enhancement method, the position attention module is adopted to capture the remote context information of the high-level feature map X, and the process of obtaining the position attention feature map further comprises the following steps:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result with the high-level feature map X to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X_j$ is the j-th feature vector of the high-level feature map X; β is initialized to zero and its weight is learned automatically.
According to the long-distance dependency-based semantic boundary enhancement method, the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X+E+I,
where K comprises the semantic boundaries, positions and context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.
The invention has the beneficial effects that: the method adopts an unsharp masking algorithm to generate enhanced semantic boundaries and thereby improve the quality of semantic segmentation; it also uses an attention mechanism to enhance the context information around the semantic boundaries; finally, the output is combined with the result of the traditional attention module to produce enhanced long-range context information. By combining the long-range dependency relationships of semantic boundaries with a self-attention mechanism, the method effectively improves the precision of semantic segmentation.
Drawings
FIG. 1 is an overall flow chart of the semantic boundary enhancement method based on long-distance dependency according to the invention; CNN in the figure represents an encoder; PAM denotes the location attention module; BAM represents the boundary attention module;
FIG. 2 is a process flow diagram of a boundary attention module;
FIG. 3 is a process flow diagram of a location attention module;
FIG. 4 is an atlas of images to be identified formed from four images to be identified;
FIG. 5 is an atlas of segmentation results processed for FIG. 4 using an existing reference network;
FIG. 6 is a graph set of segmentation results processed by the method of the present invention for FIG. 4;
FIG. 7 shows the ground-truth labels corresponding to FIG. 4;
FIG. 8 is a visualization of an atlas of images to be identified on a Cityscapes validation set;
FIG. 9 is a visual representation of the feature map of FIG. 8 using an existing reference network;
FIG. 10 is a visual representation of the signature graph of FIG. 8 using the method of the present invention;
fig. 11 shows the ground-truth labels corresponding to fig. 8.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Detailed description of the invention: as shown in fig. 1, the present invention provides a semantic boundary enhancement method based on long-distance dependency, which includes,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X; the high-level feature map X is passed through three parallel branches and then merged; wherein,
the first branch passes the high-level feature map X directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map X to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
and merging the feature maps output by the three parallel branches, and decoding by a decoder to obtain a final semantic segmentation result map.
As shown in fig. 1, this embodiment constructs a new convolutional neural network, SBENet (Semantic Boundary Enhancement Network with long-range dependency for semantic segmentation). SBENet gradually reduces the resolution of the image to be recognized through the encoder (CNN) and generates the high-level feature map X with a spatial size of H × W pixels. After the feature map X is obtained, X is fed into three parallel branches of the unified network. The first branch keeps the original feature map X. In the second branch, X is sent to the position attention module to capture long-range context information. In the third branch, the semantic boundaries are extracted with an unsharp masking algorithm and then input into the boundary attention module to enhance the high-level long-range dependencies around the semantic boundaries. The feature maps from the three branches are combined and fed to the decoder, which generates the final segmentation map.
In the SBENet architecture of this embodiment, in order to retain more detail and generate dense feature maps, the last two downsampling operations in the ResNet convolutional neural network are removed and replaced with dilated convolutions. The final output feature map is therefore 1/8 of the original size of the image to be recognized.
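The patent does not publish source code; as a minimal sketch only, the dilated backbone described above can be reproduced with torchvision's ResNet-101, whose `replace_stride_with_dilation` flag performs exactly this substitution (the weight enum and the 769 × 769 input size are illustrative assumptions):

```python
import torch
import torchvision

# ResNet-101 encoder whose last two downsampling stages use dilated
# convolutions instead of stride, so the output is 1/8 of the input size.
backbone = torchvision.models.resnet101(
    weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1,  # ImageNet pre-training
    replace_stride_with_dilation=[False, True, True],            # dilate layer3 and layer4
)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc

x = torch.randn(1, 3, 769, 769)   # a Cityscapes-style 769 x 769 crop
feats = encoder(x)                # shape (1, 2048, 97, 97): roughly 1/8 resolution
```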
Some high-level feature maps X are blurred because of the reduced resolution, and a blurred feature map loses detail and produces unclear edges. To extract semantic boundaries, the input high-level feature map X is sent to two sub-branches, one of which is processed through an average pooling layer. The specific steps are as follows:
further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries includes:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$.
As an example, the average pooling layer acts as a linear shift-invariant low-pass filter.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
the high frequency detail feature map X ", i.e. the language boundary, is obtained from the difference between the high level feature map X and the feature map X', and can be represented as:
Figure BDA0002589054920000072
semantic boundaries have a global context view, which helps to refine the segmentation result along object boundaries. The semantic boundaries may be enlarged using a scaling factor and then added to the original feature map X to generate an enhanced feature map a, thereby deblurring the high-level feature map X and improving the clarity of the high-level details (i.e., remote semantic boundaries). Thus, context around semantic boundaries is selectively aggregated. The method specifically comprises the following steps:
still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
adding the high-frequency detail feature map X'' to the high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, and W represents the width of the enhanced feature map A.
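The unsharp-masking branch above reduces to a few tensor operations. The following is a hedged sketch, not the patent's code; the scale factor `lam` (the scaling factor mentioned earlier) and the bilinear re-alignment for even kernel sizes are assumptions:

```python
import torch
import torch.nn.functional as F

def unsharp_sharpen(X: torch.Tensor, p: int = 3, lam: float = 1.0) -> torch.Tensor:
    """Unsharp masking on a feature map X of shape (batch, C, H, W):
    X'  : low-pass version of X from p x p average pooling,
    X'' : high-frequency details X - X' (the semantic boundaries),
    A   : boundary-enhanced map X + lam * X''."""
    X_lp = F.avg_pool2d(X, kernel_size=p, stride=1, padding=p // 2)
    # For even p the padded output is one pixel larger; resize back to X's size.
    if X_lp.shape[-2:] != X.shape[-2:]:
        X_lp = F.interpolate(X_lp, size=X.shape[-2:], mode="bilinear", align_corners=False)
    return X + lam * (X - X_lp)
```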
After enhancement by the sharpening module, the relationships between objects should be exploited through global or long-range context information, which is crucial for semantic segmentation. To model the relationships between the objects around semantic boundaries, this embodiment designs a boundary attention module, which encodes extensive context information and thereby improves the representation capability of the feature map.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
respectively processing the enhanced feature map A (at the minimum resolution) through three 1 × 1 convolution layers to correspondingly obtain a feature map B, a feature map C and a feature map D, with

$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax (normalized exponential function) layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$.
Still further, as shown in fig. 2, the process of the boundary attention module enhancing high-level remote dependencies around semantic boundaries further includes:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically. The resulting boundary attention at each location is a weighted sum over all locations. The boundary attention feature map E therefore aggregates context information based on boundary attention and promotes similar semantic features, which leads to greater semantic consistency and feature separability. The representation capability is thus enhanced by the context information encoded around the broad semantic boundary. The advantage of the boundary attention module is that it preserves high-level details, especially semantic boundaries; the module is equivalent to a global spatial filtering module and enhances consistency.
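Read as code, the boundary attention module is a self-attention block applied to the sharpened map A. Below is a minimal PyTorch sketch of the equations above; keeping the full channel width C in the three 1 × 1 convolutions is an assumption, since the patent's figures are not reproduced here:

```python
import torch
import torch.nn as nn

class BoundaryAttention(nn.Module):
    """Self-attention over the enhanced feature map A (the equations above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))  # initialized to zero, learned

    def forward(self, A: torch.Tensor) -> torch.Tensor:
        bsz, ch, H, W = A.shape
        N = H * W
        B = self.conv_b(A).view(bsz, ch, N)    # (bsz, C, N)
        Cm = self.conv_c(A).view(bsz, ch, N)   # (bsz, C, N)
        D = self.conv_d(A).view(bsz, ch, N)    # (bsz, C, N)
        # s_ji = softmax over i of B_i . C_j, so S has shape (bsz, N, N)
        S = torch.softmax(torch.bmm(Cm.transpose(1, 2), B), dim=-1)
        # E_j = alpha * sum_i s_ji D_i + A_j
        E = torch.bmm(D, S.transpose(1, 2)).view(bsz, ch, H, W)
        return self.alpha * E + A
```

The position attention module described next has the same structure, with X as input and a factor β in place of α, so the same class can serve both branches.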
The boundary attention module calculates a weighted sum of all locations of the enhanced feature map to highlight the context around the semantic boundary and high level details of the entire image. In order to further capture remote context information, the embodiment merges the original feature map X, the feature maps output by the boundary attention module and the position attention module, and finally obtains a segmentation map with enhanced semantic boundaries.
Still further, with reference to fig. 3, the process of capturing the remote context information of the advanced feature map X by using the location attention module to obtain the location attention feature map includes:
processing the high-level feature map X through three convolution layers respectively to correspondingly obtain a feature map F, a feature map G and a feature map H, with

$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$.
The position attention module uses a self-attention mechanism with X as input and outputs F, G and H.
Still further, with reference to fig. 3, the process of capturing the remote context information of the advanced feature map X by using the location attention module, and obtaining the location attention feature map further includes:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result with the high-level feature map X to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X_j$ is the j-th feature vector of the high-level feature map X; β is initialized to zero and its weight is learned automatically.
Still further, as shown in fig. 1, the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X+E+I,
where K comprises the semantic boundaries, positions and rich context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.
The dilated convolution layer can be integrated directly into an existing FCN network structure. Furthermore, the mixed attention effectively enhances the feature representation without adding too many parameters.
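Putting the three branches together, the following is a hedged sketch of the fusion K = X + E + I, reusing the `BoundaryAttention` and `unsharp_sharpen` sketches above (the channel width 2048 matches the ResNet-101 output; everything else is an assumption):

```python
import torch

pam = BoundaryAttention(channels=2048)  # position attention branch (factor beta)
bam = BoundaryAttention(channels=2048)  # boundary attention branch (factor alpha)

def fuse_branches(X: torch.Tensor) -> torch.Tensor:
    """X: encoder output of shape (batch, 2048, H/8, W/8)."""
    I = pam(X)                      # second branch: position attention feature map I
    E = bam(unsharp_sharpen(X))     # third branch: sharpen, then boundary attention map E
    return X + E + I                # K, which is fed to the decoder
```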
The beneficial effects of the invention are verified by experiments as follows:
to evaluate the method of the invention, extensive experiments were performed on the cityscaps dataset, the PASCAL VOC2012 dataset and the cammid dataset. Experimental results show that the method achieves the most advanced performance in the three data sets.
Table 1 collects the results of the methods on the Cityscapes test set. ResNet-101 is used as the backbone network to train the proposed SBENet architecture, and the test results are submitted to the official evaluation server for evaluation. When SBENet is trained only with the finely labeled training set, the performance of the model improves by about 2.3% over the existing PSPNet, and is even better than PSANet, which also uses the validation set for training. To obtain better results, training is then performed with both the training data and the validation data. As Table 1 shows, the SBENet of the invention is significantly superior to all existing methods, including DenseASPP, which uses a more powerful backbone; SBENet achieves the current best result of 82.2% mIoU (the mean, over all classes, of the intersection over union between true and predicted values).
The present invention uses a boundary and location attention module on top of the backbone to capture remote dependencies and semantic boundaries to better understand the scenario. Extensive experiments were conducted using different configurations to demonstrate the performance of the attention module in table 2.
As can be seen from Table 2, the method of the present invention significantly improves performance. On the Cityscapes validation set, 77.1% mIoU is obtained with the position attention module alone, 3.6% higher than the ResNet-50 baseline. The boundary attention module alone improves performance by 4.3%, demonstrating the validity of semantic boundary information. When the two modules are combined, the result improves to 78.4%. In addition, with the more powerful ResNet-101 as the baseline and the two modules integrated into the network, the mIoU increases to 80.6%. These results verify that the proposed method effectively captures long-range context information and semantic boundary information.
Qualitative comparisons between the SBENet of the invention and the reference network are provided in figs. 4-7. Squares mark some challenging areas that are difficult to identify. The figures show that some details and object boundaries become clearer after processing by the method of the invention, such as the "sidewalk" in the first picture and the "fence" in the second picture of figs. 4-7. The visualization results demonstrate the effectiveness of the semantic segmentation method.
To further validate SBENet, this embodiment compares it with several context aggregation methods on the Cityscapes validation set, all using ResNet-50 and ResNet-101 as backbone networks. The compared methods are the pyramid pooling module (PPM) proposed in PSPNet and the atrous spatial pyramid pooling (ASPP) module proposed in DeepLabv3. The experimental results for single-scale testing are shown in Table 3. SBENet performs better than both the PPM and ASPP modules, demonstrating the superiority of the method of the invention.
The invention also adopts a series of improvement strategies to further raise the mIoU on the Cityscapes test set; Table 4 shows the experimental results. Prediction with a single scale first reaches 79.8% mIoU. With multi-scale testing [0.75, 1.0, 1.25, 1.5] and left-right flipping, the performance improves from 79.8% to 80.7%. Finally, training on both the training set and the validation set, without the coarsely labeled data, brings the final mIoU to 82.2%.
To understand the proposed method more deeply, the feature maps of SBENet and the baseline (ResNet-101) are visualized on the Cityscapes validation set. As shown in figs. 8-11, a point is selected in each input image (marked with a cross), and the corresponding baseline and SBENet feature maps are displayed in figs. 9 and 10, respectively. SBENet captures object boundaries better than the baseline. For example, in the third photograph of figs. 8-11, where the cross is placed on a building, the method of the invention identifies its boundary very clearly. It can also be observed that SBENet in fig. 10 enhances semantic similarity and long-range dependencies.
Comparative experiments on the PASCAL VOC 2012 validation set further demonstrate the effectiveness of each part of the method; the results are shown in Table 5. With ResNet-101 as the baseline, 72.6% mIoU is obtained in the single-scale test. With only the position attention module or only the boundary attention module, the performance improves by 6.5% and 7.1%, respectively. With both modules and the multi-scale plus left-right-flip strategy, the performance reaches 80.7%. Finally, fine-tuning the model on the original training set improves the result to 81.3% mIoU, a large improvement over the baseline.
For comparison with the most advanced methods, experiments were also performed on the PASCAL VOC 2012 test set, and the results were submitted to the official server. The model is first trained with an augmented training set, and the original training and validation sets are then used to further fine-tune it. Detailed per-class results are shown in Table 6; the method of the invention is significantly better than the previous best methods on the PASCAL VOC 2012 test set and reaches the best performance of 84.6% mIoU.
To further verify the generalization of the proposed method, experiments were also carried out on the CamVid dataset. Table 7 compares the results with prior state-of-the-art methods. SBENet far exceeds methods such as DeconvNet, SegNet and Dense-Decoder, reaching 74.1% mIoU. This further demonstrates the importance of capturing long-range context information and semantic boundary information in scene segmentation.
The specific embodiment is as follows:
In the experiments of the invention, the mean Intersection over Union over all classes (mIoU) is used as the evaluation index.
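For reference, mIoU is the per-class intersection over union averaged over all classes; a small sketch follows (skipping classes absent from both maps is a common convention and an assumption here):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred and gt are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```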
Cityscapes is an urban scene dataset collected from 50 different cities. It contains 5,000 images with high-quality pixel-level labels and 20,000 coarsely labeled images. Each image is 1024 × 2048 pixels with 19 semantic categories. In the experiments, only the 5,000 finely labeled images are used; they are split into 2,975, 500, and 1,525 images for training, validation, and testing, respectively.
PASCAL VOC 2012 is a semantic segmentation benchmark dataset, initially with 1,464 images for training, 1,449 for validation and 1,456 for testing. It includes 21 object classes, one of which is the background class.
CamVid is a street scene image segmentation dataset comprising 367 training images, 101 validation images and 233 test images. The dataset provides 11 categories for semantic segmentation evaluation.
The network architecture of this embodiment is implemented in PyTorch, using a ResNet-101 pre-trained on ImageNet as the backbone network, whose last two downsampling operations are replaced by dilated convolutions with dilation rates of 2 and 4, respectively. After downsampling, the output feature map is 1/8 of the original size of the input image. In addition, the standard BatchNorm is replaced by InPlace-ABN, which synchronizes the mean and standard deviation of BatchNorm across multiple GPUs.
Training uses mini-batch stochastic gradient descent (SGD) as the optimizer, with a poly learning rate strategy in which the base learning rate is multiplied by an update factor after each iteration. The initial learning rate is 0.001 for the PASCAL VOC 2012 and CamVid datasets and 0.005 for the Cityscapes dataset. The momentum and weight decay are 0.9 and 0.0001, respectively. The data augmentation strategy includes random left-right flipping, random scaling from 0.5 to 2.0, and, for the Cityscapes dataset, random cropping of 769 × 769 image blocks. Four 1080 Ti GPUs are used for training, with a batch size of 4 for Cityscapes and 12 for the other datasets.
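The poly schedule mentioned above is commonly implemented as below; the exponent 0.9 is the value customarily used with this strategy and is an assumption here, since the patent does not state it:

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly schedule: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. on Cityscapes (base learning rate 0.005), halfway through training:
# poly_lr(0.005, cur_iter=20000, max_iter=40000) == 0.005 * 0.5 ** 0.9
```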
TABLE 1
Comparison of results on the Cityscapes test set; the marked entries use both the training set and validation set data. [Table image not reproduced.]
TABLE 2
Ablation experiments on the Cityscapes validation set; PAM denotes the position attention module and BAM the boundary attention module. [Table image not reproduced.]
TABLE 3
Comparison of context aggregation methods on the Cityscapes validation set. [Table image not reproduced.]
TABLE 4
Comparison of different strategies on the Cityscapes validation set; SS denotes single-scale testing, MS multi-scale testing, and W/val adding the validation set for training. [Table image not reproduced.]
TABLE 5
Ablation experiments on the PASCAL VOC 2012 validation set; PAM denotes the position attention module, BAM the boundary attention module, MS multi-scale testing, and FT fine-tuning of the trained model on the original training set. [Table image not reproduced.]
TABLE 6
Per-class results on the PASCAL VOC 2012 test set. [Table image not reproduced.]
TABLE 7
Comparison of results on the CamVid test set. [Table image not reproduced.]
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (2)

1. A semantic boundary enhancement method based on long-distance dependency comprises the following steps,
inputting an image to be identified into a convolutional neural network, and processing the image to be identified by adopting an encoder to obtain a high-level feature map X0; the high-level feature map X0 is passed through three parallel branches and then merged; wherein,
the first branch passes the high level feature map X0 directly;
the second branch adopts a position attention module to capture the remote context information of the high-level feature map X0 to obtain a position attention feature map;
the third branch adopts an unsharp masking algorithm to extract the semantic boundary of the high-level feature graph X0 to obtain the high-level feature graph X after the boundary is extracted, and then the high-level remote dependency relationship around the semantic boundary is enhanced through a boundary attention module to obtain a boundary attention feature graph;
merging the feature maps output by the three parallel branches, and then decoding by a decoder to obtain a final semantic segmentation result map;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries comprises:
calculating, through an average pooling layer, the average value of p × p contiguous pixels of the boundary-extracted high-level feature map X, where p is taken as 2, 3, 4, and 5, to obtain the average-pooled output feature map X':

$$X'_j = \frac{1}{p \times p} \sum_{i \in R_j} X_i$$

where $R_j$ denotes a pooling region, $i$ is a pixel in $R_j$, and $X'_j$ is the pooled feature over $R_j$;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
obtaining a high-frequency detail feature map X ″ from the difference between the high-level feature map X after the boundary extraction and the average pooled output feature map X':
$$X'' = X - X'$$
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
adding the high-frequency detail feature map X'' to the boundary-extracted high-level feature map X to generate an enhanced feature map A:

$$A = X + X''$$

$$A \in \mathbb{R}^{C \times H \times W}$$

where C represents the number of feature map channels, H represents the height of the enhanced feature map A, W represents the width of the enhanced feature map A, and $\mathbb{R}$ denotes real space;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
processing the enhanced feature map A through three convolution layers of 1 x 1 respectively to obtain a feature map B, a feature map C and a feature map D correspondingly, and
$$B, C, D \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map B and the feature map C into:
$$B, C \in \mathbb{R}^{C \times N}$$

where N = H × W represents the number of all pixels in the feature map;
multiplying the transformation matrices of the feature map B and the feature map C, and generating a boundary probability map S after a softmax layer:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$$S \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $B_i$ is the i-th feature vector of the feature map B, $C_j$ is the j-th feature vector of the feature map C, and $s_{ji}$, the product of $B_i$ and $C_j$, represents the correlation between the similar features of $B_i$ and $C_j$;
the process of the boundary attention module enhancing advanced remote dependencies around semantic boundaries further comprises:
transforming the feature map D into
$$D \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map D by the boundary probability map S, multiplying the obtained result by the feature factor alpha, and adding the result with the enhanced feature map A to obtain a boundary attention feature map E:
$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

$$E \in \mathbb{R}^{C \times H \times W}$$

where $E_j$ is the j-th feature vector of the boundary attention feature map E, $D_i$ is the i-th feature vector of the feature map D, and $A_j$ is the j-th feature vector of the enhanced feature map A; α is initialized to zero and its weight is learned automatically;
the remote context information of the high-level feature map X0 is captured by a position attention module, and the process of obtaining the position attention feature map comprises the following steps:
processing the high-level feature map X0 by three convolution layers respectively to obtain a feature map F, a feature map G and a feature map H, and
$$F, G, H \in \mathbb{R}^{C \times H \times W}$$
and transforming the feature map F and the feature map G into:
$$F, G \in \mathbb{R}^{C \times N}$$
multiplying the transformation matrices of the feature map F and the feature map G, and generating a boundary probability map T after a softmax layer:

$$t_{ji} = \frac{\exp(F_i \cdot G_j)}{\sum_{i=1}^{N} \exp(F_i \cdot G_j)}$$

$$T \in \mathbb{R}^{N \times N}$$

where $i, j \in [1, N]$, $F_i$ is the i-th feature vector of the feature map F, $G_j$ is the j-th feature vector of the feature map G, and $t_{ji}$, the product of $F_i$ and $G_j$, represents the correlation between the similar features of $F_i$ and $G_j$;
capturing remote context information of the high-level feature map X0 using a location attention module, the process of obtaining the location attention feature map further comprising:
transforming the feature map H into
$$H \in \mathbb{R}^{C \times N}$$
Multiplying the transformation matrix of the feature map H by the boundary probability map T, multiplying the obtained result by the feature factor beta, and adding the result to the high-level feature map X0 to obtain a position attention feature map I:
$$I_j = \beta \sum_{i=1}^{N} (t_{ji} H_i) + X0_j$$

$$I \in \mathbb{R}^{C \times H \times W}$$

where $I_j$ is the j-th feature vector of the position attention feature map I, $H_i$ is the i-th feature vector of the feature map H, and $X0_j$ is the j-th feature vector of the high-level feature map X0; β is initialized to zero and its weight is learned automatically.
2. The long-range dependency based semantic boundary enhancement method of claim 1,
and the dilated convolution layer outputs a feature map K after fusing the feature maps output by the three parallel branches:
K=X0+E+I,
where K comprises the semantic boundaries, positions and context information of the targets of the image to be recognized;
and inputting the feature graph K into a decoder to restore spatial information to obtain a final semantic segmentation result graph.

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381057A (en) * 2020-12-03 2021-02-19 上海芯翌智能科技有限公司 Handwritten character recognition method and device, storage medium and terminal
CN113807354B (en) * 2020-12-29 2023-11-03 京东科技控股股份有限公司 Image semantic segmentation method, device, equipment and storage medium
CN112967300A (en) * 2021-02-23 2021-06-15 艾瑞迈迪医疗科技(北京)有限公司 Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network
CN113191367B (en) * 2021-05-25 2022-07-29 华东师范大学 Semantic segmentation method based on dense scale dynamic network
CN115546239B (en) * 2022-11-30 2023-04-07 珠海横琴圣澳云智科技有限公司 Target segmentation method and device based on boundary attention and distance transformation
CN116030260B (en) * 2023-03-27 2023-08-01 湖南大学 Surgical whole-scene semantic segmentation method based on long-strip convolution attention
CN117171766B (en) * 2023-07-31 2024-04-05 上海交通大学 Data protection method, system and medium based on deep neural network model


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122796A (en) * 2017-04-01 2017-09-01 中国科学院空间应用工程与技术中心 A kind of remote sensing image sorting technique based on multiple-limb network integration model
US10671083B2 (en) * 2017-09-13 2020-06-02 Tusimple, Inc. Neural network architecture system for deep odometry assisted by static scene optical flow
CN109544559A (en) * 2018-10-19 2019-03-29 深圳大学 Image, semantic dividing method, device, computer equipment and storage medium
CN109886986A (en) * 2019-01-23 2019-06-14 北京航空航天大学 A kind of skin lens image dividing method based on multiple-limb convolutional neural networks
CN110443805A (en) * 2019-07-09 2019-11-12 浙江大学 A kind of semantic segmentation method spent closely based on pixel
CN111127470A (en) * 2019-12-24 2020-05-08 江西理工大学 Image semantic segmentation method based on context and shallow space coding and decoding network
CN111160311A (en) * 2020-01-02 2020-05-15 西北工业大学 Yellow river ice semantic segmentation method based on multi-attention machine system double-flow fusion network
CN111222515A (en) * 2020-01-06 2020-06-02 北方民族大学 Image translation method based on context-aware attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, Gang Wang, "Semantic Segmentation With Context Encoding and Multi-Path Decoding," IEEE Transactions on Image Processing, vol. 29, pp. 3520-3533, 2020. *

Also Published As

Publication number Publication date
CN111833273A (en) 2020-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant