CN116129119A - Rapid semantic segmentation network and semantic segmentation method integrating local and global features - Google Patents

Rapid semantic segmentation network and semantic segmentation method integrating local and global features

Info

Publication number
CN116129119A
Authority
CN
China
Prior art keywords
branch
feature map
cnn
image
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310086646.5A
Other languages
Chinese (zh)
Inventor
徐国平
冷雪松
王霞霞
廖文涛
张炫
吴兴隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology filed Critical Wuhan Institute of Technology
Priority to CN202310086646.5A priority Critical patent/CN116129119A/en
Publication of CN116129119A publication Critical patent/CN116129119A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Ultra Sonic Diagnosis Equipment (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a rapid semantic segmentation network and a semantic segmentation method fusing local and global features. Local information and global information of an input image are extracted separately through a dual-branch structure and then fused, so that the segmentation features contain richer information; this solves the problem that local and global features are difficult to make interact, and effectively improves segmentation efficiency and overall performance. The invention also provides a novel multi-scale feature fusion module that exploits the context information of the Transformer and the local representation features of convolution; it can process local and global feature information simultaneously and shows excellent performance in rapid medical image segmentation. The invention reduces computational complexity while extracting global and local features, shortens the time required for network training while preserving the segmentation result, and improves segmentation performance while maintaining segmentation speed.

Description

Rapid semantic segmentation network and semantic segmentation method integrating local and global features
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a rapid semantic segmentation network and a semantic segmentation method integrating local and global features.
Background
Image segmentation plays an important role in medical image analysis and is widely used for the quantitative analysis of anatomical structures, particularly in clinical diagnosis. With the development of deep learning, convolutional neural networks have made substantial progress in medical image segmentation; in particular, fully convolutional networks and their variants, such as UNet and DeepLab, have become practical choices. Building on these methods, much work has made great progress in medical applications, such as chest CT vessel segmentation, MRI cardiac segmentation, and lymph node segmentation.
Early studies on object detection and image segmentation have shown the effectiveness of multi-scale feature fusion. However, it is not yet clear whether segmentation performance can be further improved by integrating features from convolutional layers and Transformer layers. We therefore propose a multi-scale feature fusion module. For low-resolution feature maps we use a linear bottleneck structure and an interpolation operation, which yield feature maps with the same dimension and resolution as the previous output. For example, an input with a resolution of 1/32 of the original image is brought to the same size as the previous output through the linear bottleneck structure and the upsampling operation.
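As an illustration of this size-matching step, the following minimal PyTorch sketch brings a 1/32-resolution feature map up to the size of the preceding 1/16-scale output using a linear bottleneck followed by bilinear interpolation. The channel widths and the exact bottleneck layout (1x1 reduce, 3x3 depthwise, 1x1 expand with a linear output) are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ch = 128  # assumed channel width of the low-resolution feature map

# Linear bottleneck: reduce -> depthwise 3x3 -> expand; no activation on the last layer
bottleneck = nn.Sequential(
    nn.Conv2d(ch, ch // 4, 1, bias=False), nn.BatchNorm2d(ch // 4), nn.ReLU(inplace=True),
    nn.Conv2d(ch // 4, ch // 4, 3, padding=1, groups=ch // 4, bias=False),
    nn.BatchNorm2d(ch // 4), nn.ReLU(inplace=True),
    nn.Conv2d(ch // 4, ch, 1, bias=False), nn.BatchNorm2d(ch),
)

low_res = torch.randn(1, ch, 7, 7)   # 1/32 of a 224x224 input
matched = F.interpolate(bottleneck(low_res), scale_factor=2,
                        mode='bilinear', align_corners=False)
# matched now has shape (1, 128, 14, 14), the same as the previous 1/16-scale output
```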
Although FCN-based methods have strong representation capabilities, their ability to capture global context information and long-range dependencies is limited by the local receptive field of the convolution operation. This limitation can lead to suboptimal segmentation of structures with variable shape and scale when multi-scale context information must be captured. Previous studies have attempted to alleviate this problem with dilated convolution in DeepLab, pyramid pooling in PSPNet, self-attention mechanisms in UNet, and the like. However, for medical image segmentation tasks there is still no study that fully extracts global context features.
Transformer-based models were proposed for sequence-to-sequence modeling in the NLP domain and have achieved state-of-the-art results in a variety of tasks. The self-attention mechanism in Transformers allows them to learn long-range dependencies and establish global relationships across sequences. In computer vision, Transformers have also achieved state-of-the-art performance in image classification. Subsequently, many semantic segmentation works were proposed on the basis of Transformers, such as SETR, Swin Transformer, TransUNet, Swin-UNet, DS-TransUNet, TransFuse, and VOLO. However, Transformer-based methods incur a very large computational cost and a very complex spatial structure when modeling long-range dependencies. This greatly impedes real-time medical diagnosis in medical image processing, for example in radiotherapy.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a rapid semantic segmentation network and a semantic segmentation method fusing local and global features, so as to improve the performance of image segmentation.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a rapid semantic segmentation network fusing local and global features comprises a first branch, a second branch and an MSFFM multi-scale fusion module. The first branch is a CNN branch comprising a plurality of convolution layers for extracting local feature information of an image. The second branch is a Transformer branch comprising an LN layer, a plurality of Outlooker attention layers and a series of Transformer MLPs, and extracts global feature information and contextual feature information of the image through downsampling and a self-attention mechanism. The feature information output by the first branch and the second branch is fused in an interactive manner through a plurality of bilateral connections; each computed output of the second branch interacts with the output of the first branch, so that the first branch can better learn global features without increasing computational complexity. Each computed output of the second branch is also connected to the MSFFM multi-scale fusion module, which fuses the interactively fused feature information output by the first branch and the second branch and performs a multi-scale fusion operation on this information.
According to the scheme, in the first branch, X is the input feature image, Conv is the convolution operation, BN is the batch normalization operation, and ReLU is the activation operation; each convolution layer comprises a Conv operation, a BN operation and a ReLU activation; the Conv operation extracts the features of the feature map; the BN operation avoids gradient explosion and gradient vanishing during gradient computation on the image; the ReLU function keeps the output of each layer within a preset range by setting values smaller than 0 to zero;
the formula of the convolution layer is:
X′=ReLU(BN(Conv(X))) (1).
further, in the second branch, LN denotes layer normalization; OutlookerAtt is the Outlooker attention layer, which generates refined representations through spatial encoding; the MLP performs information interaction between channels, aggregates global information and establishes long-range dependencies; each part is preceded by a patch embedding module that maps the input to a specified shape; the input feature image X passes sequentially through the LN layer, the Outlooker module and the MLP to extract the global features of the image:
X′=OutlookerAtt(LN(X))+X (2),
Z=MLP(LN(X′))+X′ (3).
a rapid semantic segmentation method fusing local and global features comprises the following steps:
S1: denote the input image to be semantically segmented as A; perform the dual-branch operation on image A, extracting its local features and global features through CNN operations and Transformer operations, respectively;
S2: after interaction, the outputs of the two branches are input into the MSFFM multi-scale fusion module, and the local features extracted by the CNN operations and the global features extracted by the Transformer operations are fused at different scales to obtain a feature map combining global and local feature information.
In step S1, the specific steps are as follows:
S11: image A undergoes two downsampling convolution operations to generate a feature map A1 with a resolution of 1/4 of image A, and feature map A1 is input into the CNN branch and the Transformer branch, respectively;
S12: the Transformer branch produces a feature map A2 with a resolution of 1/8 of image A; feature map A2 is input into the MSFFM multi-scale fusion module; at the same time, an upsampling convolution operation is applied to feature map A2, which is then added to feature map A1 and input into the CNN branch;
S13: the new feature map obtained after the convolution operation of the CNN branch undergoes one downsampling CNN operation and is then added to feature map A2;
S14: the Transformer branch uses a Transformer operation to fuse the summed feature maps;
S15: steps S12 to S14 are repeated three times, and the Transformer branch halves the resolution each time, generating feature maps A2, A3 and A4 with resolutions of 1/8, 1/16 and 1/32 of image A, respectively;
S16: an upsampling convolution operation is applied to feature map A4, which is then added to the feature map output by the last step of the CNN branch and input into the CNN branch; a convolution operation then outputs feature map A5, whose resolution is 1/4 of image A.
In step S2, the specific steps are as follows:
S21: feature map A5 obtained from the CNN branch and feature maps A2, A3 and A4 obtained from the Transformer branch are output and loaded into the fusion module;
S22: according to the resolution of the output feature map, an upsampling convolution operation and a linear-bottleneck interpolation operation are applied to the feature maps obtained from the Transformer branch, respectively, to obtain feature maps with the same dimension and resolution; these are then concatenated with the feature map obtained from the CNN branch and fused to obtain the final output.
In step S22, the specific steps are as follows:
S221: an upsampling CNN operation is applied to the last output feature map A4 of the Transformer branch to obtain a feature map A4' with the resolution doubled; a linear-bottleneck interpolation operation is applied to feature map A3, obtained after two Transformer operations, to obtain feature map A3'; feature maps A4' and A3', which have the same resolution, are added to obtain feature A33;
S222: a CNN operation is applied to feature map A33 to obtain a feature map A22 with the resolution doubled; a linear-bottleneck interpolation operation is applied to feature map A2, obtained after one Transformer operation, to obtain feature map A2'; feature maps A22 and A2', which have the same resolution, are added to obtain feature A11;
S223: an upsampling CNN operation is applied to A11 to obtain a feature map A00 with the resolution doubled, which is concatenated with feature map A5 obtained from the CNN branch to obtain the final output.
A computer storage medium having stored therein a computer program executable by a computer processor for performing a method of fast semantic segmentation that fuses local features with global features.
The beneficial effects of the invention are as follows:
1. The rapid semantic segmentation network and semantic segmentation method fusing local and global features extract the local information and global information of an input image separately through a dual-branch structure, so that the fused segmentation features contain richer information; this solves the problem that local and global features are difficult to make interact and effectively improves segmentation efficiency and overall performance.
2. The invention provides a novel multi-scale feature fusion module that exploits the context information of the Transformer and the local representation features of convolution; it can process local and global feature information simultaneously and shows excellent performance in rapid medical image segmentation.
3. The invention reduces computational complexity while extracting global and local features, shortens the time required for network training while preserving the segmentation result, and improves segmentation performance while maintaining segmentation speed.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a flow chart of a fast semantic segmentation network fusing local and global features according to an embodiment of the present invention.
FIG. 3 is a block diagram of a fast semantic segmentation network fusing local and global features according to an embodiment of the present invention.
FIG. 4 is a block diagram of a novel multi-scale feature fusion module in accordance with an embodiment of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description.
1. Fast semantic segmentation network integrating local and global features
Embodiments of the present invention include two branches: one branch is a convolution operation that extracts the local feature information of an image, and the other branch is a Transformer operation whose main purpose is to extract global information. The information from the two branches is fused in an interactive manner, and finally all feature information is thoroughly fused through a novel multi-scale fusion module. Referring to fig. 3, the first row is the CNN module, the second row is the Transformer module, and the third row is the novel multi-scale fusion module that connects them.
S11, in feature extraction the segmentation network mainly considers the speed and performance of semantic segmentation: it adopts a dual-branch structure to extract the local features and global features of the input image separately and then fuses and outputs them through a feature fusion module. In the feature extraction part, features are extracted by two different branches (convolution and Transformer). One branch consists of a series of convolution operations that learn local features of the feature image, such as boundaries and shapes. The other branch consists of three Transformer modules whose main function is to downsample the input features and extract rich global context information with a self-attention mechanism. For the feature fusion part, we propose a novel multi-scale fusion module that fuses the local feature information from the convolution blocks and the global feature information from the Transformer in an interactive manner through a plurality of bilateral connections, effectively fusing the context information of the Transformer and the local detail information of the convolution.
S12, the convolution branch of the feature extraction part is realized by several groups of convolution layers and extracts the local detail features of the input image. Each group of convolution layers includes one Conv operation and one BN operation and is activated with a ReLU function; the formula of the convolution layer is as follows:
X′=ReLU(BN(Conv(X))) (1)
where X represents the input feature image, Conv represents the convolution layer, BN represents the batch normalization operation, and ReLU represents the activation operation. The Conv operation extracts features from the feature map, and the BN layer mainly avoids the gradient explosion and gradient vanishing phenomena that occur when gradients are computed on the image. The main purpose of the ReLU layer is to keep the output of each layer within a controlled range by setting values smaller than 0 to zero.
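For concreteness, a minimal PyTorch sketch of one such convolution layer is given below; the channel widths in the usage example are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """One convolution layer of the CNN branch: X' = ReLU(BN(Conv(X))), Eq. (1)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # 3x3 convolution extracts local features; stride=2 gives the downsampling variant
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)  # normalization stabilizes gradients during training

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))  # ReLU zeroes out negative activations


blk = ConvBlock(64, 128, stride=2)       # illustrative channel widths
y = blk(torch.randn(1, 64, 56, 56))      # -> shape (1, 128, 28, 28)
```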
The Transformer-based module is mainly used to capture global features and contextual features during image segmentation. In contrast to traditional convolution-based dual-branch architectures, one branch of this semantic segmentation network consists mainly of a Transformer variant, the Outlooker, whose main purpose is to establish long-range dependencies. The module consists of an Outlooker attention layer for spatial encoding and an MLP for information interaction between channels. The specific formulas are as follows:
X′=OutlookerAtt(LN(X))+X, (2)
Z=MLP(LN(X′))+X′ (3)
where X represents the input and LN represents layer normalization. The input feature image X first passes through the LN layer, which behaves similarly to BN in the convolution operation and performs normalization. The image then passes through the Outlooker module, which extracts the global features of the image. The Outlooker module can be seen as a structure with two independent stages: the first part contains a stack of Outlookers that generate a refined representation, and the second part deploys a series of Transformers to aggregate global information. Before each part there is a patch embedding module that maps the input to a specified shape. The output of each Transformer stage interacts with the output of the CNN module from S12, which allows the CNN branch to learn global features better without adding significant computational complexity.
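A simplified PyTorch sketch of the residual structure in equations (2) and (3) follows. The attention module here is an ordinary multi-head self-attention used as a stand-in for the VOLO-style Outlooker attention, which is more involved; the token dimension and head count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Residual structure of Eqs. (2)-(3): X' = Attn(LN(X)) + X;  Z = MLP(LN(X')) + X'."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        # Stand-in for the Outlooker attention layer; any spatial token mixer fits here.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                     # x: (B, N, dim) token sequence
        h = self.ln1(x)
        x = self.attn(h, h, h, need_weights=False)[0] + x     # Eq. (2)
        x = self.mlp(self.ln2(x)) + x                         # Eq. (3)
        return x


blk = TransformerBlock(dim=192)
tokens = torch.randn(2, 196, 192)   # e.g. 14x14 patches of an embedded feature map
out = blk(tokens)                   # same shape: (2, 196, 192)
```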
Meanwhile, the output of each Transformer stage is also fed directly into the MSFFM module, i.e. the multi-scale fusion module. This module integrates the contents of the convolution branch and the Transformer branch and performs a multi-scale fusion operation on the information.
Assume that the image to be semantically segmented is named A. As can be seen from fig. 1 and fig. 2, image A first undergoes two downsampling convolution operations to generate a feature map A1 whose resolution is 1/4 of the original image A; feature map A1 then undergoes a CNN operation and a Transformer operation, respectively. The Transformer branch generates a first feature map A2, whose size is 1/8 of the original image A, and this map is input into the novel multi-scale fusion module. At the same time, feature map A2 is added to feature map A1; because their resolutions do not match, feature map A2 first undergoes an upsampling convolution operation so that it can be added to feature map A1 before the CNN operation, yielding a new feature map. After the convolution operation of the CNN branch, this new feature map undergoes one downsampling CNN operation and is then added to feature map A2. Unlike the CNN branch, the Transformer branch uses a Transformer operation to fuse the summed feature maps.
The above operations are repeated 3 times, and the Transformer branch produces 3 global feature maps, named A2, A3 and A4; their sizes are 1/8, 1/16 and 1/32 of the original image, respectively, and all of them are labeled in fig. 2. The final output of the CNN branch is named feature map A5, whose resolution is 1/4 of the original image.
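One interaction stage of this dual-branch flow (steps S12 to S14) can be sketched as follows. The sketch assumes both branches share the same channel width and that the Transformer block performs its own 2x downsampling; these are simplifying assumptions rather than details fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interaction_stage(cnn_feat, trans_feat, cnn_block, trans_block):
    """One bilateral interaction (S12-S14), simplified.

    cnn_feat:    local feature map from the CNN branch (twice the resolution of trans_feat)
    trans_feat:  global feature map from the Transformer branch
    cnn_block:   one downsampling CNN block (stride 2)
    trans_block: one Transformer stage that fuses the sum and halves the resolution
    """
    # Transformer -> CNN: upsample the global feature map and add it to the local one
    up = F.interpolate(trans_feat, size=cnn_feat.shape[-2:],
                       mode='bilinear', align_corners=False)
    cnn_next = cnn_block(cnn_feat + up)             # downsampled to the size of trans_feat

    # CNN -> Transformer: add the downsampled local feature, then fuse with a Transformer op
    trans_next = trans_block(trans_feat + cnn_next) # next, half-resolution global feature map
    return cnn_next, trans_next

# Example wiring (channel width 64 assumed):
# cnn_block   = nn.Sequential(nn.Conv2d(64, 64, 3, stride=2, padding=1),
#                             nn.BatchNorm2d(64), nn.ReLU())
# trans_block = a Transformer stage that embeds, attends and downsamples by 2
```

Repeating this stage three times yields the feature maps A2, A3 and A4 at 1/8, 1/16 and 1/32 of the original resolution described above.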
All these outputs are then fused in the MSFFM multi-scale fusion module to obtain a feature map that combines global feature information and local feature information.
It can be seen in fig. 3 that each convolution operation in the CNN branch will involve either an up-sampling or a down-sampling operation, which is done primarily for the purpose of matching the size of the feature map when the addition operation is performed.
2. Multi-scale feature fusion module (MSFFM)
S21, FIG. 4 is a framework diagram of the multi-scale feature fusion module. Inputs of different sizes are first processed with operations such as upsampling convolution, together with a linear bottleneck structure and an interpolation operation; compared with the previous output, these operations yield feature maps with the same dimension and resolution, so that the feature fusion operation can be performed with the input of the CNN branch.
The outputs obtained from the two branches of the feature extraction part, namely the feature map obtained from the convolution branch and the feature maps (A2, A3 and A4) obtained from the Transformer branch, are loaded into the fusion module. The feature maps obtained from the Transformer branch are first upsampled according to the output size; the uniformly sized feature maps obtained by the upsampling operation are then fused with the feature map obtained from the convolution branch to produce the final output.
The specific upsampling procedure is as follows: an upsampling CNN operation is applied to the final output feature map F of the Transformer branch to obtain a feature map F′ with the resolution doubled; a linear bottleneck operation is applied to the feature map F1 obtained after two Transformer operations to obtain F′1; the same-sized feature maps F′ and F′1 are added to obtain feature A1. The above operation is then repeated: feature map A1 is upsampled and enlarged to the same size as the feature map F2 obtained after one Transformer operation, the linear bottleneck is applied to F2 again, and the two are fused to obtain output A2. Finally, the size of A2 after upsampling still does not match that of the feature map obtained by the convolution operation; therefore A2 is upsampled by a factor of two and then concatenated with the feature map obtained by the convolution operation to obtain the final output of the fusion module.
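Putting these steps together, a minimal PyTorch sketch of the MSFFM is given below. It assumes all four input feature maps share the same channel width and uses a plain 1x1/3x3 bottleneck and bilinear interpolation as stand-ins for the exact linear bottleneck and upsampling convolutions of the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _bottleneck(ch):
    """Illustrative linear bottleneck: 1x1 reduce -> 3x3 depthwise -> 1x1 expand, linear output."""
    mid = max(ch // 4, 8)
    return nn.Sequential(
        nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
    )

class MSFFM(nn.Module):
    """Fuses A4 (1/32), A3 (1/16), A2 (1/8) from the Transformer branch with A5 (1/4) from the CNN branch."""
    def __init__(self, ch):
        super().__init__()
        self.conv43 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.conv32 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.bneck3 = _bottleneck(ch)
        self.bneck2 = _bottleneck(ch)

    @staticmethod
    def _up2(x):
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, a2, a3, a4, a5):
        a33 = self.conv43(self._up2(a4)) + self.bneck3(a3)   # S221: fuse at 1/16 resolution
        a11 = self.conv32(self._up2(a33)) + self.bneck2(a2)  # S222: fuse at 1/8 resolution
        a00 = self._up2(a11)                                 # S223: back to 1/4 resolution
        return torch.cat([a00, a5], dim=1)                   # concatenate with the CNN feature map


msffm = MSFFM(ch=96)                                         # illustrative channel width
a5 = torch.randn(1, 96, 56, 56); a2 = torch.randn(1, 96, 28, 28)
a3 = torch.randn(1, 96, 14, 14); a4 = torch.randn(1, 96, 7, 7)
out = msffm(a2, a3, a4, a5)                                  # -> (1, 192, 56, 56)
```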
S22, several convolution blocks are designed in the rapid semantic segmentation network fusing local and global features, whose main purpose is to extract features at relatively high resolution. For example, the first convolution block uses a stride of 2 and a 3 x 3 convolution kernel to learn high-resolution features; this design is mainly the result of a trade-off between accuracy and efficiency. The other convolution blocks are mainly used to extract features at 1/4 of the original resolution. Each image to be processed is convolved twice with a stride of 2 before entering the dual branch, mainly to save time when the Transformer operation is performed on the feature image. Moreover, performing only two downsampling operations ensures that not too much global feature information is lost, which would otherwise affect the result.
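A sketch of such a stem, assuming a 3-channel input and illustrative channel widths, might look like this:

```python
import torch
import torch.nn as nn

# Two stride-2 3x3 convolutions reduce the input to 1/4 resolution before the dual branch.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 224, 224)
a1 = stem(x)   # feature map A1: shape (1, 64, 56, 56), i.e. 1/4 of the input resolution
```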
S23, the main structure of the novel multi-scale fusion module proposed by the invention consists of upsampling convolution operations and linear-bottleneck interpolation operations. Each Transformer operation in S13 halves the resolution of the feature map, so the resolutions of the feature maps input into the novel multi-scale fusion module are not uniform, while the resolution of the feature map finally input into the module by the CNN branch in S12 is 1/4 of the original image. Upsampling is therefore necessary so that the global low-resolution information and the local high-resolution information of the image can be fused. The low-resolution global information is first fused through upsampling and addition operations; after the 3 global low-resolution features have been fused, the global feature information and the local feature information are fused together through a concatenation operation.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims (8)

1. A rapid semantic segmentation network fusing local features and global features, characterized in that: it comprises a first branch, a second branch and an MSFFM multi-scale fusion module;
the first branch is a CNN branch and comprises a plurality of convolution layers for extracting local feature information of an image;
the second branch is a Transformer branch and comprises an LN layer, a plurality of Outlooker attention layers and a series of Transformer MLPs, for extracting global feature information and contextual feature information of the image through downsampling and a self-attention mechanism;
the feature information output by the first branch and the second branch is fused in an interactive manner through a plurality of bilateral connections; each computed output of the second branch interacts with the output of the first branch, so that the first branch can better learn global features without increasing computational complexity;
each computed output of the second branch is connected to the MSFFM multi-scale fusion module, which fuses the interactively fused feature information output by the first branch and the second branch and performs a multi-scale fusion operation on this information.
2. A rapid semantic segmentation network fusing local features and global features as defined in claim 1, wherein: in the first branch,
X is the input feature image, Conv is the convolution operation, BN is the batch normalization operation, and ReLU is the activation operation; each convolution layer comprises a Conv operation, a BN operation and a ReLU activation; the Conv operation extracts the features of the feature map; the BN operation avoids gradient explosion and gradient vanishing during gradient computation on the image; the ReLU function keeps the output of each layer within a preset range by setting values smaller than 0 to zero;
the formula of the convolution layer is:
X′=ReLU(BN(Conv(X))) (1).
3. A rapid semantic segmentation network fusing local features and global features as claimed in claim 2, wherein: in the second branch,
LN denotes layer normalization; OutlookerAtt is the Outlooker attention layer, which generates refined representations through spatial encoding; the MLP performs information interaction between channels, aggregates global information and establishes long-range dependencies; each part is preceded by a patch embedding module that maps the input to a specified shape; the input feature image X passes sequentially through the LN layer, the Outlooker module and the MLP to extract the global features of the image:
X′=OutlookerAtt(LN(X))+X (2),
Z=MLP(LN(X′))+X′ (3).
4. a semantic segmentation method based on the rapid semantic segmentation network fusing local features and global features as claimed in any one of claims 1 to 3, characterized in that: the method comprises the following steps:
S1: denote the input image to be semantically segmented as A; perform the dual-branch operation on image A, extracting its local features and global features through CNN operations and Transformer operations, respectively;
S2: after interaction, the outputs of the two branches are input into the MSFFM multi-scale fusion module, and the local features extracted by the CNN operations and the global features extracted by the Transformer operations are fused at different scales to obtain a feature map combining global and local feature information.
5. The semantic segmentation method according to claim 4, characterized in that: in the step S1, the specific steps are as follows:
S11: image A undergoes two downsampling convolution operations to generate a feature map A1 with a resolution of 1/4 of image A, and feature map A1 is input into the CNN branch and the Transformer branch, respectively;
S12: the Transformer branch produces a feature map A2 with a resolution of 1/8 of image A; feature map A2 is input into the MSFFM multi-scale fusion module; at the same time, an upsampling convolution operation is applied to feature map A2, which is then added to feature map A1 and input into the CNN branch;
S13: the new feature map obtained after the convolution operation of the CNN branch undergoes one downsampling CNN operation and is then added to feature map A2;
S14: the Transformer branch uses a Transformer operation to fuse the summed feature maps;
S15: steps S12 to S14 are repeated three times, and the Transformer branch halves the resolution each time, generating feature maps A2, A3 and A4 with resolutions of 1/8, 1/16 and 1/32 of image A, respectively;
S16: an upsampling convolution operation is applied to feature map A4, which is then added to the feature map output by the last step of the CNN branch and input into the CNN branch; a convolution operation then outputs feature map A5, whose resolution is 1/4 of image A.
6. The semantic segmentation method according to claim 5, characterized in that: in the step S2, the specific steps are as follows:
S21: feature map A5 obtained from the CNN branch and feature maps A2, A3 and A4 obtained from the Transformer branch are output and loaded into the fusion module;
S22: according to the resolution of the output feature map, an upsampling convolution operation and a linear-bottleneck interpolation operation are applied to the feature maps obtained from the Transformer branch, respectively, to obtain feature maps with the same dimension and resolution; these are then concatenated with the feature map obtained from the CNN branch and fused to obtain the final output.
7. The semantic segmentation method according to claim 6, characterized in that: in the step S22, the specific steps are as follows:
S221: an upsampling CNN operation is applied to the last output feature map A4 of the Transformer branch to obtain a feature map A4' with the resolution doubled; a linear-bottleneck interpolation operation is applied to feature map A3, obtained after two Transformer operations, to obtain feature map A3'; feature maps A4' and A3', which have the same resolution, are added to obtain feature A33;
S222: a CNN operation is applied to feature map A33 to obtain a feature map A22 with the resolution doubled; a linear-bottleneck interpolation operation is applied to feature map A2, obtained after one Transformer operation, to obtain feature map A2'; feature maps A22 and A2', which have the same resolution, are added to obtain feature A11;
S223: an upsampling CNN operation is applied to A11 to obtain a feature map A00 with the resolution doubled, which is concatenated with feature map A5 obtained from the CNN branch to obtain the final output.
8. A computer storage medium, characterized in that: it stores a computer program executable by a computer processor, the computer program executing the semantic segmentation method according to any one of claims 4 to 7.
CN202310086646.5A 2023-01-17 2023-01-17 Rapid semantic segmentation network and semantic segmentation method integrating local and global features Pending CN116129119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310086646.5A CN116129119A (en) 2023-01-17 2023-01-17 Rapid semantic segmentation network and semantic segmentation method integrating local and global features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310086646.5A CN116129119A (en) 2023-01-17 2023-01-17 Rapid semantic segmentation network and semantic segmentation method integrating local and global features

Publications (1)

Publication Number Publication Date
CN116129119A true CN116129119A (en) 2023-05-16

Family

ID=86297096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310086646.5A Pending CN116129119A (en) 2023-01-17 2023-01-17 Rapid semantic segmentation network and semantic segmentation method integrating local and global features

Country Status (1)

Country Link
CN (1) CN116129119A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576118A (en) * 2023-12-04 2024-02-20 山东财经大学 Multi-scale multi-perception real-time image segmentation method, system, terminal and medium
CN118097158A (en) * 2024-04-29 2024-05-28 武汉纺织大学 Clothing semantic segmentation method based on coder-decoder

Similar Documents

Publication Publication Date Title
CN116129119A (en) Rapid semantic segmentation network and semantic segmentation method integrating local and global features
Ding et al. FTransCNN: Fusing Transformer and a CNN based on fuzzy logic for uncertain medical image segmentation
CN114373226B (en) Human body posture estimation method based on improved HRNet network in operating room scene
CN112132834A (en) Ventricular image segmentation method, system, device and storage medium
CN116051549B (en) Method, system, medium and equipment for dividing defects of solar cell
CN117078930A (en) Medical image segmentation method based on boundary sensing and attention mechanism
CN114596318A (en) Breast cancer magnetic resonance imaging focus segmentation method based on Transformer
CN112216371A (en) Multi-path multi-scale parallel coding and decoding network image segmentation method, system and medium
CN118134952B (en) Medical image segmentation method based on feature interaction
CN111861886A (en) Image super-resolution reconstruction method based on multi-scale feedback network
CN114972231A (en) Multi-modal MR image segmentation method based on prior-posterior probability encoder
CN117173412A (en) Medical image segmentation method based on CNN and Transformer fusion network
Wang et al. Msfnet: multistage fusion network for infrared and visible image fusion
CN115311184A (en) Remote sensing image fusion method and system based on semi-supervised deep neural network
CN117351030A (en) Medical image segmentation method based on Swin transducer and CNN parallel network
CN117058392A (en) Multi-scale Transformer image semantic segmentation method based on convolution local enhancement
Yan et al. MRSNet: Joint consistent optic disc and cup segmentation based on large kernel residual convolutional attention and self-attention
Sun et al. GLFNet: Global-local fusion network for the segmentation in ultrasound images
CN115115577A (en) Multi-stage organ segmentation method and device based on mixed perception
Wang et al. 3D hand pose estimation and reconstruction based on multi-feature fusion
CN110610459A (en) Image processing method and device
Wang et al. VAE-Driven Multimodal Fusion for Early Cardiac Disease Detection
Pan et al. HA-SNet: A Best Choice to Solve Ultrasonic Images Based on Hybrid Attention Network
CN113506307B (en) Medical image segmentation method for improving U-Net neural network based on residual connection
CN117710514B (en) Dynamic magnetic resonance imaging method, model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination