CN115601542A - Image semantic segmentation method, system and equipment based on full-scale dense connection - Google Patents


Info

Publication number
CN115601542A
Authority
CN
China
Prior art keywords
image
semantic segmentation
full
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211229781.2A
Other languages
Chinese (zh)
Other versions
CN115601542B (en)
Inventor
熊炜
田紫欣
陈奕博
强观臣
郑大定
汪锋
邹勤
王松
李利荣
宋海娜
李婕
涂静敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202211229781.2A priority Critical patent/CN115601542B/en
Publication of CN115601542A publication Critical patent/CN115601542A/en
Application granted granted Critical
Publication of CN115601542B publication Critical patent/CN115601542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method, system and equipment based on full-scale dense connection, wherein an image to be segmented is preprocessed and cut or padded to a preset size; then, semantic segmentation of the image to be segmented is realized by using an image semantic segmentation network. In the image semantic segmentation network (UNet4+) of the present invention, through full-scale and dense skip connections, each node in the encoder receives intermediate aggregated feature maps from encoders of different scales, while each node in the decoder receives intermediate aggregated feature maps not only from encoders and decoders of different scales, but also from the encoder of the same scale. Thus, the aggregation layer in the decoder can learn to use all the feature maps collected at the node. The UNet4+ of the present invention alleviates the gradient vanishing problem and maximizes information flow in the network; meanwhile, feature propagation in the network is enhanced; and the model is more compact with extreme feature reusability.

Description

Image semantic segmentation method, system and equipment based on full-scale dense connection
Technical Field
The invention belongs to the technical field of artificial intelligence, deep learning and image processing, and relates to an image semantic segmentation method, system and equipment, in particular to an image semantic segmentation method, system and equipment based on a full-scale dense connection semantic segmentation network.
Background
Image Semantic Segmentation is an important link in image processing and machine vision technology concerning image understanding, and is also an important branch of the AI field. Semantic segmentation classifies each pixel in an image and determines the category of each point (for example, belonging to the background, a person or a vehicle), thereby performing region division. At present, semantic segmentation is widely applied in scenarios such as automatic driving and unmanned aerial vehicle landing-point determination.
At present, mainstream approaches to the image semantic segmentation problem adopt the UNet architecture and its variants such as UNet^e, UNet+, UNet++ and UNet3+.
The UNet architecture (O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 2015, conference proceedings, pp. 234-241.) has become a de facto standard for various image segmentation tasks and has met with great success. It is a typical encoder-decoder cascaded architecture, where the encoder (the contracting path) performs feature extraction and the decoder (the expanding path) performs resolution restoration. The UNet architecture is most attractive for its long skip connections, which allow information of the same scale to flow directly from the encoder to the decoder, enabling the model to make better predictions.
However, such a relatively fixed structure makes it difficult for the model to balance the receptive field size and the boundary segmentation accuracy. It is now generally accepted that deeper networks have better non-linear characterizations, which can learn more complex transformations, adapting to more complex features. But deeper networks introduce the so-called gradient vanishing problem and reduce the learning power of the shallow layers. When the network depth reaches a certain level, the segmentation performance does not improve, but may decrease.
To determine the optimal depth of the UNet architecture, Zhou et al. (Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: Redesigning skip connections to exploit multiscale features in image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856-1867, 2020.) propose an ensemble architecture, UNet^e, which combines UNets of different depths into one unified architecture. Ensemble architectures benefit from knowledge sharing: all UNets within the UNet^e architecture share the encoder but have separate decoders. Since the decoders in this architecture are disconnected, a deeper UNet cannot provide a supervisory signal to its shallower counterpart. Therefore, explicit deep supervision is required in the combination.
Another solution to overcome the above limitation is to remove all long skip connections in the UNet^e structure and use short skip connections to connect each neighboring node, forming a nested structure called UNet+, so that gradient backpropagation passes from the deeper decoders to their shallower counterparts. This idea was proposed almost simultaneously by Yu et al. (F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018, conference proceedings, pp. 2403-2412.) and Zhou et al. (Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in 4th International Workshop on Deep Learning in Medical Image Analysis (DLMIA 2018) held in conjunction with MICCAI 2018, Granada, Spain, 2018, conference proceedings, pp. 3-11.), respectively.
Notably, each node in the UNet+ architecture integrates the feature maps of its neighboring ancestors on the same scale from a horizontal perspective, in conjunction with the feature maps of its neighboring ancestors on different scales from a vertical perspective. To ensure maximum information flow between UNets of all different depths within the UNet+ architecture, Zhou et al. also propose a nested UNet architecture with dense skip connections, called UNet++, whose decoders are densely connected on the same scale from a horizontal perspective. The redesigned same-scale skip connections make dense feature propagation more flexible, directly connecting all previous feature maps together.
Although convincing as a natural design, there is no solid theory to ensure that same-scale feature maps are the best match for feature fusion. To utilize full-scale features in image segmentation, Huang et al. (H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu, "UNet 3+: A full-scale connected UNet for medical image segmentation," in 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 2020, conference proceedings, pp. 1055-1059.) propose UNet3+, which combines fine-grained low-level detail feature maps with coarse-grained high-level feature maps of different scales. However, UNet3+ only partially redesigns the long skip connections between the encoder and decoder and the short skip connections within the decoder.
Although the use of different-scale feature maps in the decoder of the UNet3+ architecture is much less restrictive than the use of same-scale feature maps in the UNet, UNet+ and UNet++ architectures, there is still room for improvement.
Disclosure of Invention
In order to solve the above technical problem, the image semantic segmentation network adopted by the invention uses full-scale and dense skip connections both within and between the encoder and the decoder, thereby forming the image semantic segmentation network (UNet4+ architecture) of the invention.
The technical scheme adopted by the method is as follows: an image semantic segmentation method based on full-scale dense connection comprises the following steps:
step 1: preprocessing an image to be segmented, and cutting or filling the image to be segmented into a preset size;
Step 2: realizing semantic segmentation of the image to be segmented by using an image semantic segmentation network;
the image semantic segmentation network comprises an encoder, a decoder, full-scale dense jump connection and full-scale deep supervision; the encoder consists of 5 coding convolution blocks, the 1st to 4th coding convolution blocks respectively comprise 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence and 1 downsampling layer MaxPoint, and the 5th coding convolution block only comprises 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence; the number of output channels of each coding convolution block is C, 2C, 4C, 8C and 16C respectively, the sizes of convolution kernels are 3 multiplied by 3, and the size of a maximum pooling kernel and the pooling step length are 2 multiplied by 2; the decoder is composed of 4 decoding convolution blocks, each decoding convolution block comprises 1 upsampling layer Biliner, 1 fusion layer Conscatenate and 2 convolution layers, all encoder feature maps or decoder feature maps positioned in front of the decoding block are cascaded together through full-scale dense skip connection, and the side output of each decoding convolution block is subjected to channel number alignment by 1 x 1 convolution layer, so that subsequent full-scale deep supervision is realized.
The technical scheme adopted by the system of the invention is as follows: an image semantic segmentation system based on full-scale dense connection comprises the following modules:
the module 1 is used for preprocessing an image to be segmented and cutting or filling the image to be segmented into a preset size;
the module 2 is used for realizing semantic segmentation of the image to be segmented by using an image semantic segmentation network;
the image semantic segmentation network comprises an encoder, a decoder, full-scale dense jump connection and full-scale deep supervision; the encoder consists of 5 coding convolution blocks, the 1st to 4th coding convolution blocks respectively comprise 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence and 1 downsampling layer MaxPoint, and the 5th coding convolution block only comprises 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence; the number of output channels of each coding convolution block is C, 2C, 4C, 8C and 16C respectively, the sizes of convolution kernels are 3 multiplied by 3, and the size of a maximum pooling kernel and the pooling step length are 2 multiplied by 2; the decoder is composed of 4 decoding convolution blocks, each decoding convolution block comprises 1 upsampling layer Biliner, 1 fusion layer Conscatenate and 2 convolution layers, all encoder feature maps or decoder feature maps positioned in front of the decoding block are cascaded together through full-scale dense skip connection, and the side output of each decoding convolution block is subjected to channel number alignment by 1 x 1 convolution layer, so that subsequent full-scale deep supervision is realized.
The technical scheme adopted by the equipment of the invention is as follows: an image semantic segmentation device based on full-scale dense connection comprises:
one or more processors;
a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the full-scale dense connectivity-based image semantic segmentation method.
The image semantic segmentation network (UNet4+) provided by the invention has the following advantages:
(1) UNet4+ connects any two convolution blocks by a direct skip connection, thereby alleviating the gradient vanishing problem and maximizing information flow in the network.
(2) UNet4+ makes extensive use of feature concatenation, thereby enhancing feature propagation in the network.
(3) UNet4+ obtains a more compact model and extreme feature reusability by aggregating a large number of feature maps in the convolution blocks at the back end of the network.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an image semantic segmentation network (UNet 4 +) according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
Referring to fig. 1, the image semantic segmentation method based on full-scale dense connection provided by the present invention includes the following steps:
step 1: preprocessing an image to be segmented, and cutting or filling the image to be segmented into a preset size;
in this embodiment, the image to be segmented may be read in grayscale or color, where the number of channels of the grayscale image is 1 and the number of channels of the color image is 3. The input image resolution may be any size and is cropped into an image block of 512 x 512 size. When the image is cut, the overlapping area of the adjacent image blocks is recommended to be not less than 5% so as to avoid that the tiny targets at the edges of the image blocks cannot be completely detected. If the input image resolution is less than 512 x 512, the image block boundaries are filled with the mirror image.
Step 2: realizing semantic segmentation of an image to be segmented by using an image semantic segmentation network;
referring to fig. 2, the image semantic segmentation network of the present embodiment includes an encoder, a decoder, full-scale dense jump connection, and full-scale deep supervision; wherein the encoder is composed of 5 convolutional blocks, each of the 1st to 4th convolutional blocks includes 2 convolutional layers (Conv → InstanceNorm → leakyreu) and 1 downsampling layer (MaxPooling), and the 5th convolutional block includes only 2 convolutional layers. The number of output channels of each convolution block is C, 2C, 4C, 8C and 16C respectively, the sizes of convolution kernels are 3 multiplied by 3, and the size of the maximum pooling kernel and the pooling step length are 2 multiplied by 2. The decoder consists of 4 convolution blocks, each convolution block comprises 1 upsampling layer (upsampling Biliner), 1 fusion layer (conditioner) and 2 convolution layers, all the codec characteristic diagrams (downsampling or upsampling is needed if necessary to ensure the consistent characteristic diagram dimension) positioned in front of the decoding block are cascaded together through full-scale dense skip connection, and the side output of each decoding convolution block is subjected to channel number alignment by 1 × 1 convolution layer, so that the subsequent full-scale deep supervision is realized.
The full-scale dense skip connections are redesigned in the image semantic segmentation network (UNet4+ architecture) of this embodiment. Let node $X^i$ denote the node whose output feature map is $X^i$, where the superscript $i$ is indexed along the downsampling layers of the encoder and $N$ represents the depth of the network. The feature maps at the encoder side and the decoder side are denoted by $X_E^i$ and $X_D^i$ respectively, and can be expressed as:

$$X_E^i=\begin{cases}\mathcal{C}(X), & i=1\\ \mathcal{C}\Big(\big[\mathcal{D}(X_E^k)\big]_{k=1}^{i-1}\Big), & 1<i\le N\end{cases}$$

and

$$X_D^i=\mathcal{C}\Big(\Big[\big[\mathcal{D}(X_E^k)\big]_{k=1}^{i-1},\;X_E^i,\;\big[\mathcal{U}(X_E^k)\big]_{k=i+1}^{N},\;\big[\mathcal{U}(X_D^k)\big]_{k=i+1}^{N-1}\Big]\Big),\qquad 1\le i<N$$

where $c(\cdot)$ denotes a convolution layer, $\mathcal{C}(\cdot)$ denotes a convolution block formed of multiple successive convolution layers $c(\cdot)$, $\mathcal{D}(\cdot)$ and $\mathcal{U}(\cdot)$ denote a downsampling layer and an upsampling layer respectively (the number of output channels of the node following each sampling layer is adjusted by the convolution layer), and the symbol $[\cdot]$ denotes the cascade (concatenation) operation.

As shown in FIG. 2, only one input passes through the encoder node $X_E^1$ into the UNet4+ architecture proposed in this embodiment; every other encoder node $X_E^i$ located at layer $i>1$ receives only $i-1$ downsampled inputs from all upper nodes of the encoder. A decoder node $X_D^i$ located at layer $i<N$ receives $N-i-1$ upsampled inputs from the decoding side and $N$ inputs from the encoding side (of which $i-1$ are downsampled, 1 is of the same scale, and $N-i$ are upsampled). The main reason for designing all previous feature maps to be accumulated and cascaded to the current node is that this embodiment exploits dense skip connections both between the encoder and decoder and within each of them.
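The aggregation at one decoder node can be sketched as below. This is a hedged simplification: bilinear resizing stands in for the MaxPool/bilinear sampling layers of the actual architecture, and all names and channel counts are illustrative.

```python
# A PyTorch sketch of the full-scale dense skip connection at one decoder
# node: every incoming encoder/decoder feature map is resized to the node's
# scale and cascaded on the channel dimension.
import torch
import torch.nn.functional as F

def aggregate(node_hw, enc_maps, dec_maps):
    """Resize all incoming feature maps to node_hw and cascade them."""
    gathered = []
    for f in enc_maps + dec_maps:
        if f.shape[-2:] != node_hw:
            f = F.interpolate(f, size=node_hw, mode="bilinear",
                              align_corners=False)
        gathered.append(f)
    return torch.cat(gathered, dim=1)           # the [.] cascade operation

# Decoder node X_D^2 in a depth N=5 network with illustrative base C=8:
C = 8
enc = [torch.randn(1, C * 2 ** i, 64 // 2 ** i, 64 // 2 ** i)
       for i in range(5)]                       # X_E^1 .. X_E^5
dec = [torch.randn(1, C, 8, 8), torch.randn(1, C, 16, 16)]  # X_D^4, X_D^3
x = aggregate((32, 32), enc, dec)               # N + (N-i-1) = 7 inputs
```

The decoding block's two convolution layers then fuse the cascaded stack (264 channels at 32×32 here), and a 1×1 convolution aligns the channel number of the side output.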
The present embodiment introduces two distinct full-scale deep supervision mechanisms in the UNet4+ architecture.

Mechanism 1: Unlike UNet^e, UNet+ and UNet++, which perform deep supervision on intermediate same-scale feature maps, the proposed UNet4+ produces a side output at each decoding convolution block, similar to UNet3+, but with several subtle and important differences. In this embodiment, one bilinear-interpolation upsampling layer is appended to the side output ends of decoder nodes $X_D^2$, $X_D^3$ and $X_D^4$, so that their output feature maps have the same spatial resolution as node $X_D^1$. The 4 side outputs are then cascaded in the channel dimension or summed pixel by pixel, and a predicted image is output via one 3×3 convolution layer (Conv → Sigmoid), whose input is mapped into $[0,1]$ by the Sigmoid activation function.
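Mechanism 1 can be sketched as follows; the function name, channel counts and the use of a freshly constructed head are assumptions of this illustration, not the patented implementation.

```python
# A sketch of deep-supervision mechanism 1: the deeper side outputs are
# bilinearly upsampled to the resolution of X_D^1, cascaded on channels,
# and mapped to a prediction in [0, 1] by a 3x3 Conv -> Sigmoid head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mechanism1(sides, n_classes=1):
    """sides: side outputs [X_D^1 .. X_D^4], shallow (largest) first."""
    target = sides[0].shape[-2:]
    ups = [s if s.shape[-2:] == target else
           F.interpolate(s, size=target, mode="bilinear", align_corners=False)
           for s in sides]
    fused = torch.cat(ups, dim=1)               # channel-dimension cascade
    head = nn.Conv2d(fused.shape[1], n_classes, kernel_size=3, padding=1)
    return torch.sigmoid(head(fused))           # predicted image in [0, 1]

K = 4                                           # side-output channels
sides = [torch.randn(1, K, 64 // 2 ** i, 64 // 2 ** i) for i in range(4)]
pred = mechanism1(sides)
```

Pixel-by-pixel summation of the four upsampled side outputs would replace the `torch.cat` with a plain `sum(ups)`.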
Mechanism 2: One bilinear-interpolation upsampling layer and one 1×1 convolution layer are appended to the side output of decoder node $X_D^4$, so that the output feature map has the same spatial resolution and channel dimension as node $X_D^3$, and a pixel-by-pixel multiplication or addition operation is then performed. The fused feature map passes through one bilinear-interpolation upsampling layer and one 1×1 convolution layer, so that its output has the same spatial resolution and channel dimension as node $X_D^2$, and a pixel-by-pixel multiplication or addition operation is performed again. The fused feature map is further processed by one bilinear-interpolation upsampling layer and one 1×1 convolution layer, so that its output has the same spatial resolution and channel dimension as node $X_D^1$, followed by a pixel-by-pixel multiplication or addition operation. Finally, a predicted image is output through one 3×3 convolution layer (Conv → Sigmoid).
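Mechanism 2 can be sketched as a cascade of fusion steps; all names are assumptions of this illustration, and freshly constructed convolutions stand in for the trained 1×1 alignment layers.

```python
# A sketch of deep-supervision mechanism 2: starting from the deepest side
# output, repeatedly upsample, align channels with a 1x1 convolution, fuse
# pixel by pixel with the next node's side output, and finish with a
# 3x3 Conv -> Sigmoid head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mechanism2(sides, n_classes=1):
    """sides: side outputs [X_D^1 .. X_D^4], shallow (largest) first."""
    fused = sides[-1]                            # start at X_D^4
    for nxt in reversed(sides[:-1]):             # X_D^3, then X_D^2, X_D^1
        up = F.interpolate(fused, size=nxt.shape[-2:],
                           mode="bilinear", align_corners=False)
        align = nn.Conv2d(up.shape[1], nxt.shape[1], kernel_size=1)
        fused = align(up) + nxt                  # pixel-by-pixel addition
    head = nn.Conv2d(fused.shape[1], n_classes, kernel_size=3, padding=1)
    return torch.sigmoid(head(fused))

sides = [torch.randn(1, 4, 64 // 2 ** i, 64 // 2 ** i) for i in range(4)]
pred2 = mechanism2(sides)
```

Pixel-by-pixel multiplication would replace the `+` in the fusion step with `*`.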
The image semantic segmentation network is a trained image semantic segmentation network; this embodiment defines a mixed segmentation loss function, optimized as a weighted average of the binary cross-entropy (BCE) loss, the Dice similarity coefficient (DSC) loss, and the image average precision loss under different IoU thresholds.

The binary cross-entropy loss of this embodiment is defined as:

$$\ell_{\mathrm{BCE}}=-\frac{1}{N_p}\sum_{j=1}^{N_p}\Big[y_j\log\hat{y}_j+(1-y_j)\log\big(1-\hat{y}_j\big)\Big]$$

where $y$ and $\hat{y}$ are the GT binary label and the prediction segmentation probability map of the model, respectively, and $N_p$ is the number of pixels.

The Dice similarity coefficient loss of this embodiment is defined as:

$$\ell_{\mathrm{DSC}}=1-\frac{2\sum_{j} y_j\hat{y}_j}{\sum_{j} y_j+\sum_{j}\hat{y}_j}$$

where $y$ and $\hat{y}$ are, as above, the GT binary label and the prediction segmentation probability map of the model.

This embodiment also evaluates using image average precision values under different IoU thresholds $t$, ranging from 0.5 to 0.95 with a step size of 0.05 (i.e., 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95). For example, under a threshold of 0.5, a predicted label is considered a hit if its IoU with the GT label is greater than 0.5. The image average precision loss of this embodiment is therefore defined as:

$$\ell_{\mathrm{mAP}}=1-\frac{1}{|\mathrm{thresholds}|}\sum_{t}\mathrm{AP}\big(y,\hat{y}_t\big)$$

where $t$ runs over the different IoU thresholds, $\hat{y}_t$ denotes the prediction result of $\hat{y}$ under threshold $t$, and $|\mathrm{thresholds}|$ is the total number of different IoU thresholds.

Finally, combining all three loss terms, the mixed segmentation loss used in this embodiment is defined as:

$$\ell=\alpha_{\mathrm{BCE}}\,\ell_{\mathrm{BCE}}+\alpha_{\mathrm{DSC}}\,\ell_{\mathrm{DSC}}+\alpha_{\mathrm{mAP}}\,\ell_{\mathrm{mAP}}$$

In all experiments, the weighting coefficients $\alpha_{\mathrm{BCE}}$, $\alpha_{\mathrm{DSC}}$ and $\alpha_{\mathrm{mAP}}$ are set to 0.4, 0.2 and 0.4, respectively.
The present invention proposes to use all full-scale and dense skip connections within and between the encoder and decoder, thus forming the final UNet4+ architecture of this embodiment. With full-scale and dense skip connections, each node in the encoder receives intermediate aggregated feature maps from encoders of different scales, while each node in the decoder receives intermediate aggregated feature maps not only from encoders and decoders of different scales, but also from the encoder of the same scale. Thus, the aggregation layer in the decoder can learn to use all feature maps collected at the node. In contrast to UNet^e, none of UNet+, UNet++, UNet3+ or the proposed UNet4+ architecture requires explicit deep supervision.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A full-scale dense connection-based image semantic segmentation method is characterized by comprising the following steps:
step 1: preprocessing an image to be segmented, and cutting or filling the image to be segmented into a preset size;
step 2: realizing semantic segmentation of an image to be segmented by using an image semantic segmentation network;
the image semantic segmentation network comprises an encoder, a decoder, full-scale dense jump connection and full-scale deep supervision; the encoder consists of 5 coding convolution blocks, the 1st to 4th coding convolution blocks respectively comprise 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence and 1 downsampling layer MaxPoint, and the 5th coding convolution block only comprises 2 convolution layers consisting of Conv, instanceNorm and LeakyReLU which are connected in sequence; the number of output channels of each coding convolution block is C, 2C, 4C, 8C and 16C respectively, the sizes of convolution kernels are 3 multiplied by 3, and the size of a maximum pooling kernel and the pooling step length are 2 multiplied by 2; the decoder consists of 4 decoding convolution blocks, each decoding convolution block comprises 1 upsampling layer Biliner, 1 fusion layer Conscatenate and 2 convolution layers, all the encoder feature diagrams or decoder feature diagrams positioned in front of the decoding block are cascaded together through full-scale dense skip connection, and the side output of each decoding convolution block is subjected to channel number alignment by 1 multiplied by 1 convolution layer, so that subsequent full-scale deep supervision is realized.
2. The full-scale dense connection-based image semantic segmentation method according to claim 1, characterized in that: in step 1, if the resolution of the image to be segmented is larger than the preset size, the image to be segmented is cut into image blocks of the preset size; if the resolution of the image to be segmented is smaller than the preset size, the image block boundaries are padded by mirroring up to the preset size.
3. The full-scale dense connection-based image semantic segmentation method according to claim 1, characterized in that: in step 2, the feature maps at the encoder side and the decoder side of the image semantic segmentation network are denoted by $X_E^i$ and $X_D^i$ respectively; the input enters the image semantic segmentation network through encoder node $X_E^1$; every other encoder node $X_E^i$ located at layer $i>1$ receives only $i-1$ downsampled inputs from all upper nodes of the encoder; a decoder node $X_D^i$ located at layer $i<N$ receives $N-i-1$ upsampled inputs from the decoding side and $N$ inputs from the encoding side; wherein the superscript $i$ is indexed along the downsampling layers of the encoder, and $N$ represents the depth of the network;
the full-scale deep supervision appends one bilinear-interpolation upsampling layer to the side output ends of decoder nodes $X_D^2$, $X_D^3$ and $X_D^4$, so that their output feature maps have the same spatial resolution as node $X_D^1$; then, the 4 side outputs are cascaded in the channel dimension or added pixel by pixel, and a predicted image is output by one 3×3 convolution layer composed of Conv and Sigmoid.
4. The full-scale dense connection-based image semantic segmentation method according to claim 1, characterized in that: in step 2, the feature maps at the encoder side and the decoder side of the image semantic segmentation network are denoted by $X_E^i$ and $X_D^i$ respectively; the input enters the image semantic segmentation network through encoder node $X_E^1$; every other encoder node $X_E^i$ located at layer $i>1$ receives only $i-1$ downsampled inputs from all upper nodes of the encoder; a decoder node $X_D^i$ located at layer $i<N$ receives $N-i-1$ upsampled inputs from the decoding side and $N$ inputs from the encoding side; wherein the superscript $i$ is indexed along the downsampling layers of the encoder, and $N$ represents the depth of the network;
the full-scale deep supervision appends one bilinear-interpolation upsampling layer and one 1×1 convolution layer to the side output of decoder node $X_D^4$, so that the output feature map has the same spatial resolution and channel dimension as node $X_D^3$, and a pixel-by-pixel multiplication or addition operation is then performed; the fused feature map passes through one bilinear-interpolation upsampling layer and one 1×1 convolution layer, so that its output has the same spatial resolution and channel dimension as node $X_D^2$, and a pixel-by-pixel multiplication or addition operation is performed again; the fused feature map is further processed by one bilinear-interpolation upsampling layer and one 1×1 convolution layer, so that its output has the same spatial resolution and channel dimension as node $X_D^1$, followed by a pixel-by-pixel multiplication or addition operation; finally, a predicted image is output through one 3×3 convolution layer composed of Conv and Sigmoid.
5. The full-scale dense connection-based image semantic segmentation method according to any one of claims 1 to 4, characterized in that: the image semantic segmentation network is a trained image semantic segmentation network; the loss function adopted in training is a mixed segmentation loss, namely a weighted average of the binary cross-entropy (BCE) loss, the Dice similarity coefficient (DSC) loss, and the image average-precision loss under different IoU thresholds;
the binary cross-entropy BCE loss is defined as:

$$L_{BCE} = -\frac{1}{M}\sum_{j=1}^{M}\left[y_j \log \hat{y}_j + (1 - y_j)\log(1 - \hat{y}_j)\right]$$

wherein $y$ and $\hat{y}$ are the GT binary label and the corresponding predicted segmentation probability map of the image semantic segmentation network, respectively, and $M$ is the number of pixels;
the Dice similarity coefficient DSC loss is defined as:

$$L_{DSC} = 1 - \frac{2\sum_{j} y_j \hat{y}_j}{\sum_{j} y_j + \sum_{j} \hat{y}_j}$$
the image average-precision loss under the different IoU thresholds is defined as:

$$L_{mAP} = 1 - \frac{1}{|\text{thresholds}|}\sum_{t} \mathrm{AP}\!\left(y, \hat{y}_t\right)$$

wherein $t$ ranges over the different IoU thresholds, from 0.5 to 0.95 with a step of 0.05; $\hat{y}_t$ represents the prediction result under the threshold $t$, and $|\text{thresholds}|$ is the total number of different IoU thresholds;
finally, combining all three loss terms, the mixed segmentation loss is obtained as:

$$L_{seg} = \alpha_{BCE} L_{BCE} + \alpha_{DSC} L_{DSC} + \alpha_{mAP} L_{mAP}$$

wherein $\alpha_{BCE}$, $\alpha_{DSC}$ and $\alpha_{mAP}$ are the respective weighting coefficients.
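Under the definitions of claim 5, the three loss terms and their weighted combination can be sketched with NumPy as follows; the pixel-wise reduction and the IoU-based form of the per-threshold precision term are assumptions, since the claim fixes only the symbols and the threshold grid (0.5 to 0.95, step 0.05):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy between GT labels y and predicted probabilities y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def dsc_loss(y, y_hat, eps=1e-7):
    """Soft Dice similarity coefficient loss."""
    inter = np.sum(y * y_hat)
    return float(1.0 - 2.0 * inter / (np.sum(y) + np.sum(y_hat) + eps))

def map_loss(y, y_hat):
    """Average-precision loss over IoU thresholds 0.5:0.05:0.95 (assumed form)."""
    thresholds = np.arange(0.5, 0.96, 0.05)   # 10 thresholds
    mask = y_hat >= 0.5                        # binarized prediction
    inter = np.logical_and(y > 0, mask).sum()
    union = np.logical_or(y > 0, mask).sum()
    iou = inter / union if union else 1.0
    # the prediction counts as correct at threshold t iff its IoU exceeds t
    return float(1.0 - np.mean(iou > thresholds))

def mixed_loss(y, y_hat, a_bce=1.0, a_dsc=1.0, a_map=1.0):
    """Weighted average of the three terms, as in the mixed segmentation loss."""
    total = a_bce + a_dsc + a_map
    return (a_bce * bce_loss(y, y_hat)
            + a_dsc * dsc_loss(y, y_hat)
            + a_map * map_loss(y, y_hat)) / total
```

A near-perfect probability map drives all three terms, and hence the mixed loss, toward zero, which is the sanity check one would expect from a segmentation loss.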
6. An image semantic segmentation system based on full-scale dense connection is characterized by comprising the following modules:
module 1, configured to preprocess the image to be segmented by cropping or padding it to a preset size;
module 2, configured to perform semantic segmentation of the image to be segmented using an image semantic segmentation network;
the image semantic segmentation network comprises an encoder, a decoder, full-scale dense skip connections, and full-scale deep supervision; the encoder consists of 5 encoding convolution blocks; the 1st to 4th encoding convolution blocks each comprise 2 sequentially connected convolution layers composed of Conv, InstanceNorm and LeakyReLU, followed by 1 down-sampling layer MaxPool, while the 5th encoding convolution block comprises only the 2 sequentially connected convolution layers composed of Conv, InstanceNorm and LeakyReLU; the numbers of output channels of the encoding convolution blocks are C, 2C, 4C, 8C and 16C, respectively, all convolution kernels are 3×3, and the max-pooling kernel size and pooling stride are both 2×2; the decoder is composed of 4 decoding convolution blocks, each comprising 1 up-sampling layer Bilinear, 1 fusion layer Concatenate and 2 convolution layers; all the encoder feature maps, together with the decoder feature maps preceding the current decoding block, are concatenated through the full-scale dense skip connections, and the side output of each decoding convolution block is passed through a 1×1 convolution layer for channel-number alignment, enabling the subsequent full-scale deep supervision.
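The encoder schedule of this module (output channels C, 2C, 4C, 8C, 16C; 3×3 convolutions preserving resolution; 2×2 max pooling after the first four blocks) can be traced with a short sketch; the base width C = 32 and the 256×256 input resolution are illustrative assumptions, not values fixed by the patent:

```python
def encoder_schedule(c: int, h: int, w: int):
    """Return (block index, out_channels, height, width) for the 5 encoding blocks.

    Blocks 1-4: two Conv-InstanceNorm-LeakyReLU layers (3x3, padding 1,
    resolution preserved) followed by 2x2 max pooling with stride 2;
    block 5 has the two convolution layers only, with no final pooling.
    """
    channels = [c, 2 * c, 4 * c, 8 * c, 16 * c]
    shapes = []
    for idx, ch in enumerate(channels, start=1):
        shapes.append((idx, ch, h, w))  # resolution after the block's convolutions
        if idx < 5:                     # blocks 1-4 end with 2x2 max pooling
            h, w = h // 2, w // 2
    return shapes

# Example: base width C = 32 on a 256x256 input (illustrative values)
for block in encoder_schedule(32, 256, 256):
    print(block)
# (1, 32, 256, 256) through (5, 512, 16, 16)
```

Each halving of the spatial resolution is paired with a doubling of the channel count, the usual trade-off in U-Net-style encoders.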
7. An image semantic segmentation device based on full-scale dense connection is characterized by comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the full-scale dense-connectivity-based image semantic segmentation method according to any one of claims 1 to 5.
CN202211229781.2A 2022-10-08 2022-10-08 Image semantic segmentation method, system and equipment based on full-scale dense connection Active CN115601542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211229781.2A CN115601542B (en) 2022-10-08 2022-10-08 Image semantic segmentation method, system and equipment based on full-scale dense connection


Publications (2)

Publication Number Publication Date
CN115601542A true CN115601542A (en) 2023-01-13
CN115601542B CN115601542B (en) 2023-07-21

Family

ID=84846535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211229781.2A Active CN115601542B (en) 2022-10-08 2022-10-08 Image semantic segmentation method, system and equipment based on full-scale dense connection

Country Status (1)

Country Link
CN (1) CN115601542B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909001A (en) * 2023-03-09 2023-04-04 和普威视光电股份有限公司 Target detection method and system fusing dense nested jump connection

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490884A (en) * 2019-08-23 2019-11-22 北京工业大学 An adversarial-learning-based lightweight network semantic segmentation method
US20190385021A1 (en) * 2018-06-18 2019-12-19 Drvision Technologies Llc Optimal and efficient machine learning method for deep semantic segmentation
US20200380695A1 (en) * 2019-05-28 2020-12-03 Zongwei Zhou Methods, systems, and media for segmenting images
CN113807355A (en) * 2021-07-29 2021-12-17 北京工商大学 Image semantic segmentation method based on coding and decoding structure
CN114220098A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved multi-scale full-convolution network semantic segmentation method
CN114283164A (en) * 2022-03-02 2022-04-05 华南理工大学 Breast cancer pathological section image segmentation prediction system based on UNet3+
CN114332117A (en) * 2021-12-23 2022-04-12 杭州电子科技大学 Post-earthquake landform segmentation method based on UNET3+ and full-connection condition random field fusion
CN114677671A (en) * 2022-02-18 2022-06-28 深圳大学 Automatic identifying method for old ribs of preserved szechuan pickle based on multispectral image and deep learning
CN114863274A (en) * 2022-04-26 2022-08-05 北京市测绘设计研究院 Surface green net thatch cover extraction method based on deep learning


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HUANG et al.: "UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation", ICASSP 2020 *
JUAN WANG et al.: "Image Semantic Segmentation Algorithm Based on Self-learning Super-Pixel Feature Extraction", EIDWT 2018 *
LI Wanqi; LI Kejian; CHEN Shaobo: "Multi-modal fusion based semantic segmentation method for high-resolution remote sensing images", Journal of South-Central Minzu University (Natural Science Edition), no. 04
TIAN Qichuan; MENG Ying: "Image semantic segmentation technology based on convolutional neural networks", Journal of Chinese Computer Systems, no. 06
ZHENG Kai; LI Jiansheng: "A survey of image semantic segmentation based on deep neural networks", Geomatics & Spatial Information Technology, no. 10
MA Zhenhuan; GAO Hongju; LEI Tao: "Semantic segmentation algorithm based on enhanced feature fusion decoder", Computer Engineering, no. 05


Also Published As

Publication number Publication date
CN115601542B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111325751B (en) CT image segmentation system based on attention convolution neural network
CN110232394B (en) Multi-scale image semantic segmentation method
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN111179167B (en) Image super-resolution method based on multi-stage attention enhancement network
CN110084234B (en) Sonar image target identification method based on example segmentation
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN114283164B (en) Breast cancer pathological section image segmentation prediction system based on UNet3+
CN115457498A (en) Urban road semantic segmentation method based on double attention and dense connection
CN114549439A (en) RGB-D image semantic segmentation method based on multi-modal feature fusion
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN112883887B (en) Building instance automatic extraction method based on high spatial resolution optical remote sensing image
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN115601723A (en) Night thermal infrared image semantic segmentation enhancement method based on improved ResNet
CN115601542B (en) Image semantic segmentation method, system and equipment based on full-scale dense connection
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN117078930A (en) Medical image segmentation method based on boundary sensing and attention mechanism
CN115527096A (en) Small target detection method based on improved YOLOv5
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN116630704A (en) Ground object classification network model based on attention enhancement and intensive multiscale
CN115082928A (en) Method for asymmetric double-branch real-time semantic segmentation of network for complex scene
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN116542988A (en) Nodule segmentation method, nodule segmentation device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant