CN115984560A - Image segmentation method based on CNN and Transformer - Google Patents

Image segmentation method based on CNN and Transformer

Info

Publication number
CN115984560A
CN115984560A (application CN202211686784.9A)
Authority
CN
China
Prior art keywords
image
cnn
patch
layer
transformer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211686784.9A
Other languages
Chinese (zh)
Inventor
王兴起
王海林
魏丹
方景龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211686784.9A
Publication of CN115984560A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image segmentation method based on CNN and Transformer. The image is passed to a CNN module, which extracts local low-level features through layer-by-layer convolution and pooling; the same image is passed to a Transformer module, which divides it into suitably sized patches, performs feature extraction, and captures and stores the global information and long-range dependencies of the image. The extraction results of the two modules, which run in parallel, are then rearranged; the intermediate result of each CNN layer is passed to an up-sampling layer, and the result of the last layer is additionally processed by a Transformer layer to further strengthen the global features. Finally, the fused data and the data passed from the CNN module are handed together to the up-sampling module, which restores the image resolution using both global and local features. The invention improves the accuracy of image semantic segmentation.

Description

Image segmentation method based on CNN and Transformer
Technical Field
The invention belongs to the field of image semantic segmentation and relates to a parallel medical image segmentation method combining a convolutional neural network (CNN) and a Transformer.
Background
Medical image segmentation is a necessary prerequisite for developing healthcare systems, particularly for disease diagnosis and treatment planning. Diagnosis requires professionals to segment tumors, organs, and other structures in images, and manual segmentation consumes a large amount of human labor, so segmenting medical images with artificial intelligence is an important way to reduce the cost of medical image analysis. However, because medical images are complex and organs frequently interlace, accurate segmentation is very difficult. There is therefore a strong need in the medical field for automated, high-precision image segmentation.
Over the past decade, various technologies using artificial intelligence to help practitioners segment medical images have appeared, such as the UNet architecture based on fully convolutional networks, the DeepLab architecture, and contrastive learning; there are also Vision Transformer, SwinUNet, TransUNet, and other methods inspired by the field of natural language processing. Among them, the UNet architecture is one of the most widely used owing to its good performance and precision. Existing UNet-based techniques focus either on accurately separating tissues or organs with different semantics, or on preventing tissues or organs with the same semantic information from being over-segmented; little consideration has been given to combining the two. For semantic segmentation these two requirements pull in completely different directions, so existing methods may over-segment or incompletely segment the same or different organs. This analysis rests mainly on the following facts:
1. Techniques based on fully convolutional networks extract semantic information from an image through convolution, then compress the image and strengthen its features through down-sampling. Multi-layer down-sampling yields semantic information at different resolutions and levels, and the image resolution is then restored through up-sampling combined with the extracted multi-level semantic information. Because a CNN naturally extracts low-level image features, this kind of technique learns features that share the same semantic information.
2. Transformer-based techniques, by contrast, first divide an image into equal-sized blocks and serialize each block as a segment of semantic information. An encoder encodes the serialized information together with position information and can learn semantic features between blocks; the encoded semantic information is then passed to a decoder, which decodes and recombines the input data to restore the original image. Through the mutual learning between the serialized blocks, this kind of technique learns long-range dependencies in the image well.
However, because the two technologies have different directions and emphases, the prior art often neglects one of them while excessively pursuing good results in the other direction, overlooking the connection between the two, and has therefore not found a good balance between the two directions.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an image segmentation method based on CNN and Transformer.
An image segmentation method based on CNN and Transformer specifically comprises the following steps:
Step 1: extracting global semantic features of the image;
the image is input into the Transformer encoding part: it first passes through a patch partition layer and then through four down-sampling layers, patch merging being performed in each down-sampling layer, after which the result is input into a SwinTransformerBlock to extract global semantic features;
Step 2: extracting local detail features of the image through the CNN module;
Step 2-1: convolution and down-sampling, wherein four down-samplings are performed in the whole CNN stage and the final feature map has the format (H/32) × (W/32) × 8C;
Step 3: feature reconstruction and merging;
after the CNN module and the Transformer module have extracted features, the feature data of the convolutional network must be reconstructed because the two modules organize their feature tokens differently; the reconstructed data and the Transformer output are then merged along the channel dimension, the merged feature data passes through a Transformer block, and the output is reshaped again so that it satisfies the data format required by up-sampling (a code sketch of this reshape-and-merge step is given after step 4-2);
Step 4: decoding the feature information and obtaining the segmentation result;
Step 4-1: the decoder feeds the merged and reconstructed data into an up-sampling layer and gradually restores the resolution of the feature map; after each up-sampling the data is merged through a skip connection with the feature maps of different levels retained by the CNN, and the merged map is then convolved; after three up-samplings and three skip connections the feature map is restored to (H/4) × (W/4) × C, and finally a 4× up-sampling patch expansion layer restores the image to H × W × C.
Step 4-2: segmentation prediction, wherein the up-sampled feature tokens are passed to a linear mapping layer, which outputs the final pixel-level segmentation prediction.
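For illustration only, the following is a minimal sketch of the reshape-and-merge operation referenced in step 3; the tensor shapes and the channel-concatenation order are assumptions, as the patent text does not fix an implementation:

```python
import torch

# Assumed shapes: the CNN outputs a map (B, 8C, H/32, W/32); the Swin
# branch outputs tokens of shape (B, (H/32)*(W/32), 8C).
B, C, Hs, Ws = 2, 64, 7, 7                     # toy sizes
cnn_feat = torch.randn(B, 8 * C, Hs, Ws)       # local features (CNN)
swin_tokens = torch.randn(B, Hs * Ws, 8 * C)   # global features (Swin)

# Reconstruct the CNN map into the Transformer's token layout:
# (B, 8C, Hs, Ws) -> (B, Hs*Ws, 8C)
cnn_tokens = cnn_feat.flatten(2).transpose(1, 2)

# Merge the two branches along the channel (feature) dimension.
merged = torch.cat([cnn_tokens, swin_tokens], dim=-1)  # (B, Hs*Ws, 16C)
print(merged.shape)  # torch.Size([2, 49, 1024])
```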
Preferably, the patch partition layer is specifically:
the input RGB image is divided by a patch partition module into non-overlapping patches; each patch is treated as a token, and the feature of each patch is set as the concatenation of the raw pixel RGB values.
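A minimal sketch of this patch partition in PyTorch (the patch size p = 4 follows step 1-1 of the detailed description; the function name is illustrative):

```python
import torch

def patch_partition(img, p=4):
    """Split an RGB image (B, 3, H, W) into non-overlapping p x p patches;
    each patch becomes one token whose feature vector is the concatenation
    of its raw RGB values (p * p * 3 = 48 for p = 4)."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)          # (B, 3, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)
    return x                                          # (B, num_tokens, 48)

tokens = patch_partition(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 3136, 48])
```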
preferably, the four downsampling layers are specifically:
wherein in the first downsampled layer, the patch merge layer is replacedFor the linear embedded layer, the patch is reconstructed to dimensions
Figure BDA0004017425510000031
The structure of (1); the features are extracted via SwinTransformarmerBlock after the patch merge layer or the linear embedding layer.
Preferably, the patch merging layer serves to reduce the number of tokens and strengthen global features: H and W of the patches become 1/2 of the original, and the feature dimension becomes 2 times the original; the patch merging layer then passes the merged patches to the SwinTransformerBlock for feature extraction.
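This halving/doubling matches the patch merging of the Swin Transformer; a sketch under that assumption (layer names are illustrative):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging sketch: halves H and W, doubles channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C)
        # Gather each 2x2 neighborhood into the channel dimension.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(self.norm(x))  # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(x).shape)  # torch.Size([1, 28, 28, 192])
```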
Preferably, the SwinTransformerBlock is constructed by replacing the standard multi-head self-attention module of the original Transformer with a module based on shifted windows. Because the block has a hierarchical structure, the Transformers in each block must exist in pairs: in the window multi-head self-attention layer of the first Transformer, a patch is divided into four blocks and the global features of the patch are learned; the extracted features are then input into the second Transformer, whose shifted-window multi-head self-attention layer re-divides and re-combines the patch according to a specific rule to extract patch features. The computation costs of standard multi-head self-attention and shifted-window self-attention are:
Ω(MSA) = 4hwC² + 2(hw)²C (1)
Ω(W-MSA) = 4hwC² + 2M²hwC (2)
where h is the number of blocks divided along the image height, w is the number of blocks divided along the image width, C is the channel dimension, and M is the window size. The cost in equation (1) grows quadratically with the number of patches hw, while the cost in equation (2) is linear in hw, so the shifted-window Transformer is computationally far cheaper than the original Transformer.
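A quick numeric check of the two formulas; the parameter values below are illustrative, not taken from the patent:

```python
# Equations (1) and (2) for a typical Swin-style setting:
# h = w = 56 patch rows/columns, C = 96 channels, window size M = 7.
h, w, C, M = 56, 56, 96, 7

msa   = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C  # quadratic in hw
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # linear in hw

print(f"MSA:   {msa:.2e}")    # ~2.00e+09
print(f"W-MSA: {w_msa:.2e}")  # ~1.45e+08
```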
Preferably, the four down-samplings performed in the CNN stage are specifically:
the same image is passed to the convolutional network part at the same time. To stay synchronized with the patch size of the Transformer part, the convolutional part performs 4 down-samplings. Before the first down-sampling, image information is extracted by a convolution operation; a first pooling layer then down-samples, reducing computation and enlarging the receptive field. After pooling, H and W of the image become H/4 and W/4 and the dimension becomes C; each of the following 3 down-samplings halves H and W of the image and doubles the dimension, and after each down-sampling a copy of the extracted image feature map is retained for the skip connections of the up-sampling part of the decoding process.
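A standalone sketch of this shape bookkeeping (the patent names ResNet50 as the CNN in advantage (2) below; the plain conv/pool layers here are simplifying assumptions):

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Conv stem + 4x pooling to H/4 x W/4 x C, then three stages that
    each halve H, W and double the channels (down to H/32 x W/32 x 8C),
    retaining every stage output for the decoder's skip connections."""
    def __init__(self, C=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(4))
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(C * 2**i, C * 2**(i + 1), 3, stride=2, padding=1),
                nn.ReLU(inplace=True))
            for i in range(3))

    def forward(self, x):
        skips = []
        x = self.stem(x)
        skips.append(x)                 # H/4 x W/4 x C
        for stage in self.stages:
            x = stage(x)
            skips.append(x)             # ... down to H/32 x W/32 x 8C
        return x, skips

feats, skips = CNNEncoder()(torch.randn(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 512, 7, 7])
```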
Preferably, the up-sampling in step 4-1 is specifically:
it consists of multiple up-sampling steps that decode the hidden features and output the final segmentation mask. After the image has been encoded by the hybrid encoder, full resolution is reached through several up-sampling blocks, each consisting of a 2× up-sampling operator, a 3 × 3 convolutional layer, and a ReLU layer.
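A sketch of one such up-sampling block; the bilinear mode and the concatenation of the skip map before the convolution are assumptions:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Decoder block as described: 2x up-sampling operator, a 3x3
    convolution, and a ReLU; the retained CNN skip map is concatenated
    before the convolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # fuse with the retained CNN map
        return self.relu(self.conv(x))

x = torch.randn(1, 512, 7, 7)
skip = torch.randn(1, 256, 14, 14)
print(UpBlock(512, 256, 256)(x, skip).shape)  # torch.Size([1, 256, 14, 14])
```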
Preferably, the skip connection is specifically: the cascaded up-sampling and the hybrid encoder form a U-shaped framework. Direct up-sampling of a low-resolution patch into a high-resolution patch loses feature information, while the feature maps retained by the CNN store the low-level features of the image. Skip connections therefore realize feature aggregation at different resolution levels: the intermediate features saved after CNN down-sampling are fused with the up-sampled features to recover the original image as faithfully as possible. At the first up-sampling the overall patch features tend toward global semantic information, so in the first skip connection the feature map additionally passes through a Transformer block to further strengthen the global character of the low-level feature map retained by the CNN.
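For that first skip connection, a minimal sketch of routing the retained CNN map through a Transformer layer before fusion; the stock nn.TransformerEncoderLayer is a stand-in assumption for the patent's Transformer block:

```python
import torch
import torch.nn as nn

def enhance_first_skip(skip, nhead=8):
    """Pass the first (deepest) CNN skip map through one Transformer
    layer to strengthen its global character, then restore the map."""
    B, C, H, W = skip.shape
    layer = nn.TransformerEncoderLayer(d_model=C, nhead=nhead,
                                       batch_first=True)
    tokens = skip.flatten(2).transpose(1, 2)      # (B, H*W, C)
    tokens = layer(tokens)                        # global self-attention
    return tokens.transpose(1, 2).reshape(B, C, H, W)

out = enhance_first_skip(torch.randn(1, 512, 7, 7))
print(out.shape)  # torch.Size([1, 512, 7, 7])
```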
Compared with the prior art, the invention has the following advantages. The invention adopts a U-shaped semantic segmentation framework to realize efficient and accurate semantic segmentation.
(1) A Transformer encodes the tokenized image patches as an input sequence for extracting global context; thanks to its inherent global self-attention mechanism, it can extract more long-range dependency information, solving the problem of lost global semantic information.
(2) An encoding method mixing CNN and Transformer is provided, in which an improved Transformer, the Swin Transformer, works in parallel with a classical ResNet50; this improves the feature extraction capability of the model and successfully remedies the insufficiency of the low-level detail features extracted by a Transformer alone.
(3) A new semantic segmentation method is provided that fuses the improved encoder into a U-Net model. Through the convolution operations of the CNN and the global self-attention mechanism of the Transformer, both local detail features and global semantic information of the image are taken into account, solving the problem of incomplete feature extraction in existing methods; skip connections and up-sampling in the decoder recover the original features of the image, further improving the accuracy of semantic segmentation.
Drawings
FIG. 1 is an overall flow diagram;
FIG. 2 is a schematic diagram of a hierarchical Transformer block;
FIG. 3 is a partial schematic diagram of the decoder.
Detailed Description
The invention provides an image semantic segmentation method for the medical field, addressing the defects of the prior art. The overall flow, shown in fig. 1, is divided into two parts. The first part is a parallel hybrid encoder composed of a CNN and a Transformer, used to encode the input image: the image is input into the Transformer module, divided by a patch partition layer, and then down-sampled and encoded by patch merging layers and Transformer blocks; at the same time the CNN module extracts features from the same image to obtain its detail features, producing image feature tokens of different levels through convolution and down-sampling, and a Reshape step reconstructs the feature data so that the two sets of extracted features share the same organization. The second part is the decoding part: the feature data extracted by the CNN and the Transformer is merged and fed into the decoder, which gradually restores the original resolution of the image through up-sampling, combines the global semantic information obtained by up-sampling with the local detail information obtained through skip connections to restore the original information of the image as fully as possible, and finally produces a pixel-level segmentation prediction from the extracted image feature tokens through a linear mapping layer.
The invention will be explained in detail with reference to the accompanying drawings, and the specific steps are as follows:
step 1: and carrying out global semantic feature extraction on the image through a Transformer module.
Step 1-1: patch splitting
Taking an image of format H × W × 3 as input, the partition layer divides the image into non-overlapping 4 × 4 patches along the W and H directions and concatenates each patch's pixels along the channel dimension, so each patch has dimension 4 × 4 × 3 = 48 and the patch resolution is (H/4) × (W/4).
Step 1-2: linear embedding. The feature dimension of each patch is projected by a linear embedding layer to an arbitrary dimension C; here C = 64.
Step 1-3: transformer block coding
As shown in fig. 2, the specific flow of the hierarchical structure can be expressed as:

ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}

where z^{l-1} is the input, z^{l+1} is the output, and ẑ^l, z^l, and ẑ^{l+1} are the intermediate results of each layer.
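The four formulas translate directly into code; a sketch in which the attention and MLP modules are assumed to be supplied (a full implementation would give each sub-layer its own LayerNorm and MLP):

```python
def swin_block_pair(z, w_msa, sw_msa, mlp, ln):
    """Residual wiring of the four equations: a window-attention
    sub-block followed by a shifted-window sub-block."""
    z_hat = w_msa(ln(z)) + z        # W-MSA(LN(z^{l-1})) + z^{l-1}
    z = mlp(ln(z_hat)) + z_hat      # MLP(LN(z_hat^l)) + z_hat^l
    z_hat = sw_msa(ln(z)) + z       # SW-MSA(LN(z^l)) + z^l
    z = mlp(ln(z_hat)) + z_hat      # MLP(LN(z_hat^{l+1})) + z_hat^{l+1}
    return z
```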
Step 1-4: patch merging, which is similar to the down-sampling of a convolutional network and serves to down-sample and increase dimensionality.
Step 2: extracting local detail information of the image through the CNN module.
Step 2-1: convolution and down-sampling. Four down-samplings are performed in the whole CNN stage, and the final feature map has the format (H/32) × (W/32) × 8C.
Step 2-2: feature reconstruction and merging. The CNN output is reshaped into the same structure as the Transformer output, and the two are merged.
Step 3: decoding the feature information to obtain the segmentation result.
Step 3-1: feature strengthening and reconstruction. The merged feature data passes through a Transformer block, and the output is reshaped again so that it satisfies the data format required by up-sampling.
Step 3-2: as shown in fig. 3, the decoder feeds the merged and reconstructed data into the up-sampling layer and gradually restores the resolution of the feature map; after each up-sampling the data is merged through a skip connection with the feature maps of different levels retained by the CNN, the merged map is then convolved, and after three up-samplings and skip connections the feature map is restored to (H/4) × (W/4) × C; finally, a 4× up-sampling patch expansion layer restores the image to H × W × C.
Step 3-3: segmentation prediction. The up-sampled feature tokens are passed to a linear mapping layer, which outputs the final pixel-level segmentation prediction.
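Steps 3-2 and 3-3 together amount to a small segmentation head; a sketch under the assumption that the 4× patch expansion is realized with PixelShuffle and that the linear mapping layer is an equivalent 1 × 1 convolution (the class count is illustrative):

```python
import torch
import torch.nn as nn

class SegHead(nn.Module):
    """4x patch expansion followed by per-pixel linear classification."""
    def __init__(self, dim, num_classes=9):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(dim, dim * 16, kernel_size=1),  # 16 = 4*4 sub-pixels
            nn.PixelShuffle(4))                       # H/4, W/4 -> H, W
        # A 1x1 conv applies the same linear map to every pixel's features.
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H/4, W/4)
        return self.classify(self.expand(x))   # (B, num_classes, H, W)

logits = SegHead(64)(torch.randn(1, 64, 56, 56))
print(logits.shape)  # torch.Size([1, 9, 224, 224])
```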
Experimental results for image segmentation: the average Dice Similarity Coefficient (DSC) and the average Hausdorff Distance (HD) are used as evaluation metrics.
The experimental results for image segmentation are shown in Table 1. All experiments in the table were completed on the Synapse dataset; the present method outperforms TransUNet, which likewise mixes CNN and Transformer, in both DSC and HD, and also outperforms the more recent SwinUNet, which is based entirely on the Swin Transformer.
TABLE 1 Experimental comparison of different image segmentation methods
(Table 1 appears only as an image in the original publication.)
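For reference, a sketch of the two metrics for a single pair of binary masks, using SciPy's directed Hausdorff distance (the smoothing constant is an assumption):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred, gt, eps=1e-6):
    """Dice Similarity Coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance between the foreground point sets."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

pred = np.zeros((64, 64), bool); pred[10:30, 10:30] = True
gt   = np.zeros((64, 64), bool); gt[12:32, 12:32] = True
print(f"DSC = {dice(pred, gt):.3f}, HD = {hausdorff(pred, gt):.2f}")
```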

Claims (8)

1. An image segmentation method based on CNN and Transformer, characterized in that the method specifically comprises the following steps:
Step 1: carrying out global semantic feature extraction on the image;
the image is input into the Transformer encoding part: it first passes through a patch partition layer and then through four down-sampling layers, patch merging being performed in each down-sampling layer, after which the result is input into a SwinTransformerBlock to extract global semantic features;
Step 2: extracting local detail features of the image through a CNN module;
Step 2-1: convolution and down-sampling, wherein four down-samplings are performed in the whole CNN stage and the final feature map has the format (H/32) × (W/32) × 8C;
Step 3: feature reconstruction and merging;
after the CNN module and the Transformer module have extracted features, the feature data of the convolutional network must be reconstructed because the two modules organize their feature tokens differently, and the reconstructed data is then merged with the Transformer output along the channel dimension; the merged feature data passes through a Transformer block, after which the output is reshaped again so that it satisfies the data format required by up-sampling;
Step 4: decoding the feature information and obtaining the segmentation result;
Step 4-1: the decoder feeds the merged and reconstructed data into an up-sampling layer and gradually restores the resolution of the feature map; after each up-sampling the data is merged through a skip connection with the feature maps of different levels retained by the CNN, and the merged map is then convolved; after three consecutive up-samplings and three skip connections the feature map is restored to (H/4) × (W/4) × C, and finally a 4× up-sampling patch expansion layer restores the image to H × W × C;
Step 4-2: segmentation prediction, wherein the up-sampled feature tokens are passed to a linear mapping layer, which outputs the final pixel-level segmentation prediction.
2. The CNN and Transformer based image segmentation method according to claim 1, characterized in that the patch partition layer is specifically:
the input RGB image is divided by a patch partition module into non-overlapping patches, each of which is treated as a "token" whose feature is set as the concatenation of the raw pixel RGB values.
3. The CNN and Transformer based image segmentation method according to claim 1, characterized in that the four down-sampling layers are specifically:
in the first down-sampling layer the patch merging layer is replaced by a linear embedding layer, and the patches are reconstructed into a structure of dimension (H/4) × (W/4) × C; after the patch merging layer or the linear embedding layer, features are extracted by a SwinTransformerBlock.
4. The CNN and Transformer based image segmentation method according to claim 3, characterized in that the patch merging layer serves to reduce the number of tokens and strengthen global features: H and W of the patches become 1/2 of the original and the feature dimension becomes 2 times the original; the patch merging layer then passes the merged patches to the SwinTransformerBlock for feature extraction.
5. The CNN and Transformer based image segmentation method according to claim 3, characterized in that the SwinTransformerBlock is constructed by replacing the standard multi-head self-attention module of the original Transformer with a module based on shifted windows; because the block has a hierarchical structure, the Transformers in each block must exist in pairs: in the window multi-head self-attention layer of the first Transformer a patch is divided into four blocks and the global features of the patch are learned, the extracted features are then input into the second Transformer, and in the second, shifted-window multi-head self-attention layer the patch is re-divided and re-combined according to a specific rule to extract patch features; the computation costs of standard multi-head self-attention and shifted-window self-attention are:
Ω(MSA) = 4hwC² + 2(hw)²C (1)
Ω(W-MSA) = 4hwC² + 2M²hwC (2)
where h is the number of blocks divided along the image height, w is the number of blocks divided along the image width, and M is the window size; the cost in equation (1) grows quadratically with the number of patches hw, while the cost in equation (2) is linear in hw.
6. The CNN and Transformer based image segmentation method according to claim 1, characterized in that the four down-samplings performed in the CNN stage are specifically:
the same image is passed to the convolutional network part at the same time; to stay synchronized with the patch size of the Transformer part, the convolutional part performs 4 down-samplings; before the first down-sampling, image information is extracted by a convolution operation, after which a first pooling layer down-samples, reducing computation and enlarging the receptive field; after pooling, H and W of the image become H/4 and W/4 and the dimension becomes C; each of the following 3 down-samplings halves H and W of the image and doubles the dimension, and after each down-sampling one copy of the extracted image feature map is retained for the skip connections of the up-sampling part in the decoding process.
7. The CNN and Transformer based image segmentation method according to claim 1, characterized in that the up-sampling in step 4-1 specifically comprises:
multiple up-sampling steps that decode the hidden features and output the final segmentation mask; after the image has been encoded by the hybrid encoder, full resolution is reached through several up-sampling blocks, each consisting of a 2× up-sampling operator, a 3 × 3 convolutional layer, and a ReLU layer.
8. The CNN and Transformer based image segmentation method according to claim 1, characterized in that the skip connection specifically comprises: the cascaded up-sampling and the hybrid encoder form a U-shaped structure; direct up-sampling of a low-resolution patch into a high-resolution patch loses feature information, while the feature maps retained by the CNN store the low-level features of the image; skip connections realize feature aggregation at different resolution levels, fusing the intermediate features saved after CNN down-sampling with the up-sampled features to recover the original image as faithfully as possible; at the first up-sampling the overall patch features tend toward global semantic information, so in the first skip connection the feature map additionally passes through a Transformer block to further strengthen the global character of the low-level feature map retained by the CNN.
CN202211686784.9A 2022-12-26 2022-12-26 Image segmentation method based on CNN and Transformer Pending CN115984560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211686784.9A CN115984560A (en) 2022-12-26 2022-12-26 Image segmentation method based on CNN and Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211686784.9A CN115984560A (en) 2022-12-26 2022-12-26 Image segmentation method based on CNN and Transformer

Publications (1)

Publication Number Publication Date
CN115984560A (en) 2023-04-18

Family

ID=85971870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211686784.9A Pending CN115984560A (en) 2022-12-26 2022-12-26 Image segmentation method based on CNN and Transformer

Country Status (1)

Country Link
CN (1) CN115984560A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116206114A (en) * 2023-04-28 2023-06-02 成都云栈科技有限公司 Portrait extraction method and device under complex background

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination