CN116229056A

CN116229056A - Semantic segmentation method, device and equipment based on double-branch feature fusion

Info

Publication number: CN116229056A
Application number: CN202211623747.3A
Authority: CN
Inventors: 周书仁; 晏周荃; 朱俣键
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-06-06

Abstract

The application relates to a semantic segmentation method, a semantic segmentation device and semantic segmentation equipment based on double-branch feature fusion. The method comprises the following steps: the patent provides a two-channel semantic segmentation network, which is divided into two branches, namely a detail branch and a semantic branch, wherein the detail branch strengthens semantic information by utilizing edge characteristics, and the semantic branch extracts advanced characteristics, so that the problem of ignoring detailed information around object boundaries and small objects is solved. The space pyramid module is embedded at the tail end of the semantic branch to capture multi-scale features and further improve the semantic information extraction capability of the high-dimensional features. Notably, we have studied the way of feature fusion modules to fuse high-level semantic and detail information to enhance feature representation. Furthermore, the fusion module employs an attention mechanism for handling feature mappings from both branches to establish context dependencies of spatial and channel dimensions, which can help the network focus on more meaningful features.

Description

Semantic segmentation method, device and equipment based on double-branch feature fusion

The invention relates to the technical field of computer vision, in particular to a semantic segmentation method, a semantic segmentation device and semantic segmentation equipment based on double-branch feature fusion.

Background

Semantic segmentation is one of the key tasks in computer vision, whose purpose is to assign dense labels, i.e., a concrete to abstract process, to all pixels in an image. In recent years, the structure of convolutional neural networks has been innovated, and impressive effects are obtained. The full convolutional network proves that the end-to-end, pixel-to-pixel convolutional neural network exceeds the most advanced prior art, and comprises the steps of converting a full connection layer into a convolutional layer and up-sampling through deconvolution; the use of a jump connection structure allows semantic information to be combined with characterization information, resulting in accurate and fine segmentation. As the depth of the network increases, the receptive field of the full convolution grows slowly, and this limited receptive field cannot fully mimic the long distance relationship between pixels in the image. The Unet combines low resolution information (object type) and high resolution information (accurate segmentation and positioning), is perfectly suitable for medical image segmentation, and convolution is good in many recognition tasks, but is not applied to specific tasks all the time, the training scale is limited, the network scale is not guaranteed, and therefore the Unet is difficult to attach to importance. And the PSANT introduces an attention mechanism in the decoder, and the pixels at each position of the feature map are connected through the self-adaptive attention mechanism to promote information transfer and improve the segmentation effect in a complex scene. Deep labv3+ uses a void space pyramid pool module to extract multi-scale features of objects, explicitly preserving a high resolution representation. Nevertheless, the network merges only one layer. The Bisenet provides that a space path can encode rich space information and detail information, a feature fusion module is utilized to fuse the space information and the detail information, new thinking is provided, the space information is required to be paid attention to while the speed is improved, the design is also one-time thinking of a semantic segmentation backbone network, and the method can be applied to not only real-time semantic segmentation algorithms, but also other fields, and especially under the condition that space detail and context information are simultaneously required. But these methods tend to involve expensive computational costs. In addition, they typically take the original high resolution image as input, which further increases the computational effort, and previous network structures have easily ignored the detailed appearance around boundaries and small objects, thereby missing high resolution information.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a double-channel network which is divided into two branches: detail branches and semantic branches, which allow the network to obtain low-level detail information and high-level semantic information, respectively. Notably, the two branches in our network are not independently performed, we obtain detail edge features by extracting texture information from the input pictures, the detail branches are complements of semantic branches, the detail graphs are added to the semantic feature graphs to complement the detail features, and the two optimized features are combined into the final segmented representation.

Comprises the following steps of;

s1, constructing a network frame, wherein the network consists of two parts, namely a detail branch and a semantic branch; inputting a given image into a backbone network to extract semantic features, firstly, reducing the size of the input image by 16 times through an encoder;

s2, the extracted features in the backbone network are subjected to a cavity space convolution pooling pyramid, the core idea is to gather receptive fields with different scales, and the cavity space convolution pooling pyramid is also used for solving the problem of different scales of different segmentation targets; the method consists of 3X 3 cavity convolutions with three expansion rates of 3, 6 and 12 in sequence of 1X 1 convolution kernels;

s3, reducing the channel number by using 1 multiplied by 1 convolution, and then connecting with a BN, a ReLU activation function and a Dropout; 4 times up-sampling it using bilinear interpolation;

s4, extracting texture information from an input picture by a detail branch to obtain detail edge characteristics, wherein the purpose is to extract space detail information, then, the edge characteristics are utilized to strengthen semantic information, and the detail branch is added with a semantic feature map to supplement the detail characteristics as the supplement of the semantic branch;

s5, a fusion module (FFM) is used for fusing high-level semantic and detail information, the fusion module (FFM) is shown in FIG. 3, semantic information is introduced into the bottom-layer features, detail information is introduced into the high-level features, and subsequent fusion is enabled to be more effective, so that feature representation is enhanced;

s6, inserting the feature map part obtained through fusion into a detail header to generate a classification detail label, and guiding the detail features of the bottom learning space by using the classification detail label as the guide of the detail feature map; finally, the number of channels is reduced in the feature map part through 1X 1 convolution in sequence, and then a BN and ReLU activation function is connected;

s7, splicing the feature graphs of the S3 and the S5; then input to the decoder, which is shown in fig. 1 (b), i.e., subjected to 3×3 convolution, BN, reLU, dropout; up-sampling by 4 times to obtain a final result;

s8, combining and optimizing detail learning by Binary cross entropy and Dice Loss, focal Loss and CE Loss.

The invention provides a semantic segmentation method, device and equipment based on double-branch feature fusion. Compared with the prior art, the method has the following beneficial effects:

the actual structure network can respectively obtain low-level detail information and high-level semantic information, detail edge characteristics are obtained through texture information extracted from an input picture, detail branches are complemented with semantic branches, the detail information is obtained from a low-level network, the semantic information is obtained from a high-level network, and then fusion is carried out, so that the situation of missing certain information is avoided. The high-order semantic information has the effect of optimizing the low-order edge information, and then two optimized features are combined into the final segmentation representation. Second, we propose a fusion module (FFM) for fusing high-level semantic and detail information to enhance feature representation. Furthermore, the fusion module employs an attention mechanism for handling feature mappings from both branches to establish context dependencies of spatial and channel dimensions, which can help the network focus on more meaningful features. The frame detail extraction branch extraction feature map generates final prediction through a detail segmentation head so as to improve the performance of the frame detail extraction branch extraction feature map, and the cost is negligible. Meanwhile, the shallow layer information is guided to encode the space information by adopting the joint loss of Binary cross entropy and Dice, and the loss of the feedback iteration optimization model is carried out, so that the final loss of the model reaches the minimum value, and the accuracy and the robustness of the characteristics are improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a general network architecture diagram of a semantic segmentation method model based on dual-branch feature fusion in the patent of the invention, which comprises a decoder and a detail head.

Fig. 2 is a block diagram of the spatial and channel attention mechanism of the present invention patent.

Fig. 3 is a block diagram of a feature fusion module in the present patent.

Fig. 4 is a device of the present invention.

Fig. 5 is an apparatus in the patent of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention will be described in detail with reference to the drawings and the detailed description. The semantic segmentation method, the semantic segmentation device and the semantic segmentation equipment for the double-branch feature fusion comprise the steps S1-S8:

s5, a fusion module (FFM) is used for fusing high-level semantic and detail information, semantic information is introduced into the bottom-layer features, and detail information is introduced into the high-layer features, so that subsequent fusion is more effective, and feature representation is enhanced;

s6, inserting a detail header into the feature map part obtained through fusion to generate a classification detail label, wherein the detail header is shown in the figure 1 (c), and then guiding the detail features of the bottom learning space by using the classification detail label as the guide of the detail feature map; finally, the number of channels is reduced in the feature map part through 1X 1 convolution in sequence, and then a BN and ReLU activation function is connected;

The respective steps are described in detail below.

In step S1, a network architecture is constructed, as shown in fig. 1, and a network frame is constructed, where the network is composed of two parts, namely a semantic branch and a detail branch. The method comprises the following steps:

the backbone network used in the feature extraction section was xception, VGGNet, resnet18.xception, VGGNet, resnet18 appears in many classical network architectures, has been widely accepted and demonstrated, and we need to construct a lightweight network, so xception, VGGNet, resnet is used to feature extraction of images while demonstrating the effectiveness of our proposed method.

In step S2, features extracted from the backbone network are subjected to ASPP, and the ASPP is also proposed to solve the problem of different dimensions of different segmentation targets, where the steps specifically include:

s201, a core idea of a cavity space convolution pooling pyramid is to concentrate a multiscale receptive field, and the cavity space convolution pooling pyramid is composed of 1 piece of 1*1 convolution and three pieces of 3*3 cavity convolution with 3, 6 and 12 cavity rates in sequence.

And S202, adding the characteristic map obtained by the 1*1 convolution and the characteristics obtained by the 3*3 cavity convolution with the cavity rates of 3, 6 and 12 in sequence, wherein the obtained characteristic map is the output map of the cavity space convolution pooling pyramid.

In step S3, the number of channels is reduced by using 1X 1 convolution, and then a BN, a ReLU activation function and Dropout are connected; 4 times up-sampling it using bilinear interpolation;

in step S4, the detail branches extract texture information from the input picture to obtain detail edge features, so as to extract spatial detail information, and then use the edge features to strengthen semantic information, and the detail branches, as a supplement to the semantic branches, can be added with the semantic feature map to supplement the detail features. The method comprises the following specific steps:

in the detail branch of S401, firstly, the texture information of the picture is extracted, and learning robust texture representation is crucial to texture recognition. In this patent, one-step Sobel, laplace and local binary pattern methods are used.

S402, merging a plurality of convolution layers for texture representation, and extracting complementary relations between the convolution layers by utilizing multi-texture information.

In step S5, a fusion module (FFM) is configured to fuse high-level semantic and detail information, as shown in fig. 3, introduce semantic information into the bottom-level features, and introduce detail information into the high-level features, so that subsequent fusion is more effective, and the specific steps are as follows:

s501, attention can help the model to give different weights to each input part, more key and important information is extracted, so that the model can make more accurate judgment, and meanwhile, larger expenditure is not brought to calculation and storage of the model. For the shallow feature map S extracted by the backbone network, to further enrich the space details thereof, we weight it to obtain S' by using a space attention mechanism, the space attention mechanism is shown in fig. 2 (b), and the calculation formula is as follows:

s502, for a feature diagram D extracted by a detail branch, the attention weight is calculated by using the attention of a channel, and is multiplied by D to obtain D', so as to enhance the detail distinction between different channels, the attention mechanism of the channel is shown in fig. 2 (a), and the calculation formula is as follows:

s503, S 'and D' have the same channel number, and the fusion of the shallow layer characteristic diagram and the detail characteristic diagram can be realized by adding elements one by one. The S 'and the D' are added to further enhance the detail information contained in the shallow feature map, the fused feature map is used as the input of a backbone network, the feature map of the semantic branch and the detail feature are added to be used as the input of the semantic branch, and the whole calculation process of the module can be expressed by the following formula:

Si+1＝S’i+D’i (3)

in step S6, inserting a detail header into the feature map part obtained by fusion to generate a two-class detail label, wherein the detail header is shown in FIG. 1 (c), and then guiding the detail features of the bottom learning space by using the two-class detail label as the guide of the detail feature map; finally, the number of channels is reduced by 1×1 convolution in the feature map part, and then a BN and ReLU activation function is connected, as shown in fig. 1 (a), the specific steps are as follows:

s601, we first generate a two-class detail tag from the semantically partitioned real tag by laplace convolution. The detail head is inserted into the shallow feature part to generate a classification detail label, the detail head is shown in fig. 1 (c), and then the classification detail label is used as the guide of a detail feature diagram to guide the bottom learning space detail feature. The two-class detail label graph with detail guidance may encode more spatial detail than low-level feature results.

S602, generating binary detail background labels from semantic segmentation background labels through a detail aggregation module. The operation may be implemented by a two-dimensional convolution kernel Laplacian convolution kernel and a 1 x 1 convolution. We generate detail feature maps with different step sizes using the laplacian convolution kernel shown in fig. 1 to obtain multi-scale detail information. We then upsample the detail feature map to the original size and fuse it with a trainable 1 x 1 convolution.

And S603, finally, converting the predicted detail into a final binary detail label with boundary and corner information by adopting a threshold value of 0.1. Detail prediction is a classical balancing problem since the number of detail pixels is much smaller than non-detail pixels. Since weighted cross entropy always leads to coarse results, we use Binary cross entropy and Dice to jointly optimize detail learning. The Dice measures the degree of overlap between the prediction graph and the real labels. Furthermore, it is insensitive to the number of foreground/background pixels, which means that it can alleviate the class imbalance problem.

Thus, for a predicted detail map having a height H and a width W, detail loss L _detail The formula of (2) is as follows:

L_ _detail( p_ _d ，g_ _d) ＝L _{_bce} (p_ _d ，g_ _d )+L_dice(p_ _d ，g_ _d ) (4)

wherein p/u _d ∈R ^H×W Representing prediction details and g/u _d ∈R ^H×W Representing the corresponding real label, L\u _bce Representing binary cross entropy loss, l\u _dice Represents Dice, as follows:

where i denotes the ith pixel and e is the laplace smoothing term, we set e=1 to estimate the probability that no phenomenon has occurred. As shown in fig. 1, we use the detail header to generate a detail profile that directs shallow information to encode spatial information. The detail header includes a 3 x 3Conv and BN and ReLu followed by a 1 x 1 convolution to obtain the output detail. The detail head can effectively enhance the feature representation, and finally, the learned detail features are fused with the context features of the deep blocks of the decoder to perform segmentation prediction. But this branch is discarded during the inference phase. Therefore, the side information can easily improve the accuracy of the segmentation task without any inference cost.

In step S7, the feature maps of S3 and S5 are spliced; then input to the decoder, which is shown in fig. 1 (b), i.e., subjected to 3×3 convolution, BN, reLU, dropout; up-sampling by 4 times to obtain a final result;

in step S8, the detail learning is jointly optimized with Binary cross entropy and Dice Loss, focal Loss, CE Loss. Because the training process of the network model is a continuous loss optimizing process, the loss obtained at present is fed back to the network model for continuous iterative optimization so as to reduce the loss, thereby obtaining more robust features.

The invention improves the accuracy and the robustness of the model by improving the extraction mode of the characteristics. The actual structure network can respectively obtain low-level detail information and high-level semantic information, the detail edge characteristics are obtained through texture information extracted from the input picture, the detail branches are complemented with the semantic branches, the detail information is obtained from the low-level network, the semantic information is obtained from the high-level network, and then the semantic information is fused, so that the condition of missing certain information is avoided. The high-order semantic information has the effect of optimizing the low-order edge information, and then two optimized features are combined into the final segmentation representation. Second, we propose a fusion module (FFM) for fusing high-level semantic and detail information to enhance feature representation. Furthermore, the fusion module employs an attention mechanism for handling feature mappings from both branches to establish context dependencies of spatial and channel dimensions, which can help the network focus on more meaningful features. The frame detail extraction branch extraction feature map generates final prediction through a detail segmentation head so as to improve the performance of the frame detail extraction branch extraction feature map, and the cost is negligible. The method constructs a new and effective method for semantic segmentation, and provides a more efficient framework for semantic segmentation in practical application.

The invention also provides a device, as shown in fig. 4, comprising a semantic segmentation network training module for fusion of the double-branch features, and further used for inputting the fusion feature map into a decoder to finally obtain a sample prediction result.

The invention also proposes a computer device, as in fig. 5, comprising a processor, a memory, a network interface, a display and input means, said processor implementing the steps of the method described above when executing said computer program. Wherein the processor of the device is configured to provide computing and control capabilities. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a semantic segmentation method based on dual-branch feature fusion. The display screen of the device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the device can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer device, and can also be an external keyboard, a touch pad or a mouse and the like.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.

Claims

1. Semantic segmentation method, device and equipment based on double-branch feature fusion, and the method is characterized by comprising the following steps:

obtaining a high-resolution image, and labeling the image to obtain a training, verifying and testing sample;

constructing a network frame, wherein the network consists of two parts, namely a detail branch and a semantic branch, inputting a given image into a backbone network to extract semantic features, and firstly reducing the size of the input image by 16 times;

the method comprises the steps that extracted features in a backbone network are subjected to a cavity space convolution pooling pyramid, and the pyramid is formed by 3 multiplied by 3 cavity convolutions with three expansion rates of 3, 6 and 12 in sequence, wherein the expansion rates are 1 multiplied by 1 convolution kernel;

reducing the channel number by using 1×1 convolution, and then connecting with a BN, a ReLU activation function and a Dropout; 4 times up-sampling is carried out on the semantic branch by using bilinear interpolation to obtain a feature map of the semantic branch;

the detail branches are used for obtaining detail edge characteristics from texture information extracted from an input picture, the purpose is to extract space detail information, then the edge characteristics are utilized to strengthen semantic information, and the detail branches are added with a semantic feature map to supplement the detail characteristics as the supplement of semantic branches;

the fusion module (FFM) is used for fusing high-level semantic and detail information, semantic information is introduced into the bottom-layer features, and detail information is introduced into the high-layer features, so that subsequent fusion is more effective, and feature representation is enhanced;

inserting the feature map part obtained by fusion into a detail header to generate a classification detail label, and guiding the detail feature of the bottom learning space by using the classification detail label as the guide of the detail feature map; finally, the number of channels is reduced in the feature map part through 1X 1 convolution in sequence, and then a BN and ReLU activation function is connected;

combining the feature graphs of the fusion module and the semantic branches; then input into the decoder, i.e. subjected to 3 x 3 convolution, BN, reLU, dropout; up-sampling by 4 times to obtain a final result;

combining Binary cross entropy with the Dice Loss, the Focal Loss and the CE Loss to optimize detail learning;

training the network to obtain a trained algorithm model based on semantic segmentation of double-branch feature fusion; and obtaining an image to be tested, and inputting the image to be tested into a trained segmentation model to obtain a prediction result of the image.

2. The semantic segmentation method, device and equipment based on the dual-branch feature fusion according to claim 1, wherein the establishing an image dataset comprises:

collecting an image sample;

the collected image samples are marked by an image marking tool, and the result of semantic segmentation is that the images are changed into color blocks with certain semantic information. The semantic segmentation technology can identify the semantic category of each color block and label each pixel with the corresponding label;

and constructing a sample data set by using the marked picture sample, dividing the sample data set into a training data set, a verification set and a test data set, and preprocessing the training data set.

3. The semantic segmentation method, device and equipment based on the dual-branch feature fusion according to claim 1, wherein a backbone network used by the feature extraction part is Xception, VGGNet, resnet.

4. The semantic segmentation method, device and equipment based on the double-branch feature fusion according to claim 1, wherein the network consists of two parts, namely a semantic branch and a detail branch. The detail branches are used for obtaining detail edge characteristics from texture information extracted from an input picture, the purpose is to extract space detail information, then the edge characteristics are utilized to strengthen semantic information, and the detail branches are added with a semantic feature map to supplement the detail characteristics as the supplement of the semantic branches.

5. The semantic segmentation method, device and equipment based on the double-branch feature fusion according to claim 4, wherein in the detail branches, firstly, texture information of pictures is extracted, and robust texture representation is learned, wherein the texture representation comprises a gradient Sobel, laplacian and a local binary mode method. We fuse the convolution layers from multiple texture representations, extract the complementary relationship between them using multi-texture information, and then use the fusion module to fuse semantic information and detail information.

6. The semantic segmentation method, device and equipment based on the dual-branch feature fusion according to claim 5, wherein a fusion module (FFM) is used for fusing high-level semantic and detail information, semantic information is introduced into the bottom-level features, and detail information is introduced into the high-level features, so that subsequent fusion is more effective, and feature representation is enhanced. For the shallow feature map S extracted by the backbone network, in order to further enrich the space details thereof, a space attention mechanism is used for weighting the shallow feature map S to obtain S ', for the feature map D extracted by the detail branch, the attention weight is calculated by using the channel attention and is multiplied by D to obtain D', so as to enhance the detail distinction degree among different channels of the shallow feature map S, the S 'and the D' have the same channel number, and the fusion of the shallow feature map and the detail feature map can be realized by adding elements one by one. And inserting the fused feature map part into a detail header to generate a classification detail label, and guiding the detail features of the bottom learning space by using the classification detail label as the guide of the detail feature map.

7. The semantic segmentation method, device and equipment based on the double-branch feature fusion according to claim 6, wherein Binary cross entropy and Diceloss, focalloss are adopted in the training process to jointly optimize detail learning, the training process of the network model is a continuous loss optimizing process, and the currently obtained loss is fed back to the network model to perform continuous iterative optimization.

8. Semantic segmentation method, device and equipment based on double-branch feature fusion, and is characterized in that the device comprises the following steps:

the semantic segmentation method based on double-branch feature fusion is constructed, the segmentation network comprises a backbone network, semantic branches, detail branches and a fusion module, a given image is input into the backbone network to extract semantic features, and firstly, the size of the input image is reduced by 16 times;

reducing the channel number by using 1×1 convolution, and then connecting with a BN, a ReLU activation function and a Dropout; 4 times up-sampling it using bilinear interpolation;

the detail learning is jointly optimized with Binary cross entropy and Dice Loss, focal Loss, CE Loss.

9. An apparatus comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.