CN111369563B

CN111369563B - Semantic segmentation method based on pyramid void convolutional network

Info

Publication number: CN111369563B
Application number: CN202010108637.8A
Authority: CN
Inventors: 史景伦; 张宇; 傅钎栓; 李显惠; 林阳城
Original assignee: Guangzhou Menghui Robot Co ltd; South China University of Technology SCUT
Current assignee: Guangzhou Menghui Robot Co ltd; South China University of Technology SCUT
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2023-04-07
Anticipated expiration: 2040-02-21
Also published as: CN111369563A

Abstract

The invention discloses a semantic segmentation method based on a pyramid hole convolution network, comprising the following steps: acquiring a medical image data set containing ground-truth segmentation results, and performing preprocessing operations such as data augmentation on the data set; obtaining shallow image features of the preprocessed image through residual recursive convolution modules and pooling layers; obtaining deep image features through a network formed by connecting a pyramid pooling module and a hole convolution module in parallel; decoding the deep image features through deconvolution layers, skip connections and residual recursive convolution modules; inputting the decoding result into a softmax layer to obtain the category of each pixel; training the pyramid hole convolution network, establishing a loss function, and determining the network parameters from training samples; and inputting a test image into the trained pyramid hole convolution network to obtain the semantic segmentation result of the image. By combining hole convolution with pyramid pooling, the method can effectively extract multi-scale semantic information and detail information and improve the segmentation performance of the network.

Description

Semantic segmentation method based on pyramid void convolutional network

Technical Field

The invention relates to the technical field of computer vision, and in particular to a semantic segmentation method based on a pyramid hole convolution network.

Background

In recent years, with the rapid development of deep learning, its application in medical image analysis has become increasingly widespread. Semantic segmentation in particular plays an important role in application scenarios such as treatment planning, disease diagnosis and pathological research. For medical images, accurately identifying the type of each object in an image requires specialist domain knowledge and consumes a great deal of an expert's time. Research on semantic segmentation makes it possible to segment an input medical image automatically and accurately, so that a doctor can make a more accurate judgment and design a better treatment plan.

Traditional semantic segmentation algorithms include watershed-based, clustering-based and statistical-feature-based segmentation methods. With the development of deep learning, however, semantic segmentation methods based on convolutional neural networks (CNNs) have become mainstream. In particular, the proposal of the fully convolutional network (FCN) opened the door to further development of semantic segmentation, and researchers have since proposed many improved segmentation models based on the FCN. Among them, the U-Net model still performs well when the training set is small, and is therefore widely used for semantic segmentation of medical images.

In the encoder of the U-Net model, downsampling is performed by max pooling. Pooling enlarges the receptive field, so that deeper semantic information can be obtained. After pooling, however, the resolution of the feature map is reduced accordingly, which causes a loss of detail information. Although U-Net acquires multi-scale detail information through skip connections, boundary position information is still lost and the spatial discrimination capability of the model is still reduced.

In the course of arriving at the present invention, it was found that hole convolution (also known as dilated or atrous convolution) is widely used because it can enlarge the receptive field without reducing the resolution of the feature map. Meanwhile, to further improve the performance of the U-Net model, techniques such as attention mechanisms, pyramid pooling modules, recursive convolution, residual connections and dense connections have also been combined with it.

Disclosure of Invention

The invention aims to overcome the above defects in the prior art, and provides a semantic segmentation method based on a pyramid hole convolution network, which extracts features at different scales by using a plurality of residual recursive convolution modules, a hole convolution module and a pyramid pooling module, and then restores the size of the feature map by using multi-layer up-sampling and skip connections.

The technical purpose of the invention is realized by the following technical scheme:

A semantic segmentation method based on a pyramid hole convolution network, wherein the pyramid hole convolution network comprises a first residual recursive convolution module, a second residual recursive convolution module, pooling layers, a pyramid pooling module, a hole convolution module, deconvolution layers, a third residual recursive convolution module, a fourth residual recursive convolution module and a softmax prediction layer, structurally connected as follows: the first residual recursive convolution module, a pooling layer, the second residual recursive convolution module and a pooling layer are connected in series in that order; the pyramid pooling module and the hole convolution module are connected in parallel and then connected in series with the pooling layer; and then a deconvolution layer, the third residual recursive convolution module, a deconvolution layer, the fourth residual recursive convolution module and the softmax prediction layer are connected in series in that order. The semantic segmentation method comprises the following steps:

S1, acquiring a medical image data set containing ground-truth segmentation results, and performing preprocessing operations on the data set to realize data augmentation;

S2, passing the preprocessed image sequentially through the first residual recursive convolution module, a pooling layer, the second residual recursive convolution module and a pooling layer, so as to extract the semantic information of the image at multiple scales and obtain the shallow image features F₁₁, F₁₂, F₂₁ and F₂₂ respectively;

S3, passing the image feature F₂₂ through a network in which the pyramid pooling module and the hole convolution module are connected in parallel, wherein F₂₂ passes through the pyramid pooling module to give the image feature F₃, and F₂₂ passes through the hole convolution module to give the image feature F₄; the image features F₃ and F₄ are aggregated channel by channel and then passed through a convolution layer with a 1 × 1 convolution kernel to obtain the deep image feature F₅, whereby deep semantic information is further extracted;

S4, passing the image feature F₅ through a deconvolution layer and aggregating the result channel by channel with the shallow image feature F₂₁ delivered through a skip connection to obtain the image feature F₆₁; then passing F₆₁ through the third residual recursive convolution module to obtain the image feature F₆₂, wherein the skip connection transmits the shallow feature directly and aggregates it channel by channel with the output of the deconvolution layer; by using the skip connection, more detail information of the original image is kept in the output image features, so that the boundary of the predicted segmentation is smoother;

S5, passing the image feature F₆₂ through a deconvolution layer and aggregating the result channel by channel with the shallow image feature F₁₁ delivered through a skip connection to obtain the image feature F₇₁; then passing F₇₁ through the fourth residual recursive convolution module to obtain the image feature F₇₂;

S6, inputting the image feature F₇₂ into the softmax prediction layer to obtain the category of each pixel in the original input image;

S7, training the pyramid hole convolution network, establishing a loss function, and determining the network parameters from training samples;

S8, inputting a test image to be segmented into the trained pyramid hole convolution network to obtain the semantic segmentation result of the image.
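The data flow of steps S2 to S6 can be traced in terms of spatial resolution alone. The sketch below is a minimal illustration, not the patented implementation. It assumes that each pooling layer halves the resolution (the embodiment uses max pooling with a stride of 2) and that each deconvolution layer doubles it, which is an assumption needed for the skip connections to line up. The function name `trace_resolution` is hypothetical, and channel counts are left out because the text does not give them.

```python
def trace_resolution(size):
    """Return the spatial size of each named feature for a square input of side `size`.

    Assumptions (not stated as numbers in the text): pooling halves the
    resolution, deconvolution doubles it, and every other module keeps it.
    """
    if size % 4 != 0:
        raise ValueError("size must be divisible by 4 for two 2x poolings")
    r = {}
    r["F11"] = size              # first residual recursive convolution module
    r["F12"] = r["F11"] // 2     # pooling layer
    r["F21"] = r["F12"]          # second residual recursive convolution module
    r["F22"] = r["F21"] // 2     # pooling layer
    r["F3"] = r["F22"]           # pyramid pooling module
    r["F4"] = r["F22"]           # hole convolution module
    r["F5"] = r["F22"]           # concatenate F3 and F4, then 1x1 convolution
    up1 = r["F5"] * 2            # deconvolution layer
    assert up1 == r["F21"]       # the skip connection needs equal sizes
    r["F61"] = up1               # concatenate with F21
    r["F62"] = r["F61"]          # third residual recursive convolution module
    up2 = r["F62"] * 2           # deconvolution layer
    assert up2 == r["F11"]       # the skip connection needs equal sizes
    r["F71"] = up2               # concatenate with F11
    r["F72"] = r["F71"]          # fourth residual recursive convolution module
    return r


sizes = trace_resolution(64)
print(sizes["F22"], sizes["F62"], sizes["F72"])  # prints: 16 32 64
```

Under these assumptions the softmax prediction layer receives F₇₂ at the resolution of the input image, which is what a per-pixel classification requires.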

Further, the preprocessing operation in step S1 includes rotation, slicing, normalization, and adaptive histogram equalization.

Further, the first, second, third and fourth residual recursive convolution modules have the same structure: each module passes its input through two recursive convolution layers connected in series and then adds the result to the input in a residual manner to obtain the output. Each recursive convolution layer consists of Conv, ReLU, Add, Conv and ReLU connected in sequence, where Conv is a convolution layer with a 3 × 3 convolution kernel and Add is a pixel-by-pixel addition with the input. Compared with ordinary convolution layers, residual connections help to train deeper networks, while recursive convolution layers better extract the semantic information contained in the image.
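The structure described above can be sketched on a single-channel 2D feature map in plain Python. This is a minimal illustration under stated assumptions: the text does not say whether the two Conv operations inside one recursive convolution layer share weights, so sharing one kernel here is an assumption, and zero padding is assumed so that the Add operations see equal sizes. All function names are hypothetical.

```python
def conv3x3(x, kernel):
    """3x3 convolution (cross-correlation) of a 2D list with zero padding."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def add(a, b):
    """Pixel-by-pixel addition of two equally sized 2D lists."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def recursive_conv_layer(x, kernel):
    """Conv -> ReLU -> Add (with the layer input) -> Conv -> ReLU."""
    y = relu(conv3x3(x, kernel))
    y = add(y, x)
    return relu(conv3x3(y, kernel))  # reusing `kernel` is an assumption


def residual_recursive_module(x, kernel_a, kernel_b):
    """Two recursive convolution layers in series, then a residual addition."""
    y = recursive_conv_layer(x, kernel_a)
    y = recursive_conv_layer(y, kernel_b)
    return add(y, x)


# With an identity kernel each recursive layer doubles a non-negative input,
# so the module returns 4x + x = 5x.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
x = [[1.0, 2.0], [3.0, 4.0]]
out = residual_recursive_module(x, identity, identity)
print(out)  # [[5.0, 10.0], [15.0, 20.0]]
```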

Further, the pyramid pooling module in step S3 comprises four adaptive average pooling layers with different pooling sizes, which are used to capture the image feature F₂₂ obtained in step S2 at multiple scales; the four pooling layers use pooling sizes of N, N/2, N/3 and N/6 respectively, where N denotes the resolution of the image feature F₂₂. The differently sized image features produced by the pooling layers are each passed through a convolution layer with a 1 × 1 convolution kernel and then through a transposed convolution, giving image features F₃₁, F₃₂, F₃₃ and F₃₄ of the same size as F₂₂. The up-sampled result of each scale is then aggregated with the input image feature F₂₂, and the aggregated image features are passed through a convolution layer with a 3 × 3 convolution kernel to obtain the image feature F₃, that is, F₃ = Conv(Concatenate(F₂₂, F₃₁, F₃₂, F₃₃, F₃₄)), where Concatenate is the aggregation operation and Conv is a 3 × 3 convolution. Pooling at multiple scales makes it possible to better capture both the detail information and the deeper semantic information contained in the image.
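To make the four pooling scales concrete, the sketch below pools a small single-channel feature map with windows of N, N/2, N/3 and N/6. It rests on one reading of the text, namely that the "pooling size" is the side of a non-overlapping averaging window, so the four layers produce 1 × 1, 2 × 2, 3 × 3 and 6 × 6 grids; that reading, the choice N = 6 and the function name `avg_pool` are assumptions. The 1 × 1 convolution and the transposed convolution that follow are not modelled here.

```python
def avg_pool(x, window):
    """Non-overlapping average pooling of a square 2D list (stride == window)."""
    out_size = len(x) // window
    area = window * window
    return [
        [
            sum(
                x[i * window + di][j * window + dj]
                for di in range(window)
                for dj in range(window)
            ) / area
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


N = 6  # hypothetical resolution of F22, chosen so that N/2, N/3, N/6 are integers
feature = [[float(r * N + c) for c in range(N)] for r in range(N)]

windows = [N, N // 2, N // 3, N // 6]           # 6, 3, 2, 1
pooled = [avg_pool(feature, w) for w in windows]
grid_sizes = [len(p) for p in pooled]
print(grid_sizes)  # [1, 2, 3, 6]
```

The coarsest scale summarises the whole feature map in one value, while the finest scale leaves the map unchanged, which is how the module mixes global context with local detail.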

Further, the hole convolution module in step S3 is formed by connecting in series three hole convolution units with different hole factors (dilation rates); the hole factors of the three units are 1, 2 and 4 respectively, and the hole convolution kernels are all 3 × 3. For an input image feature F₂₂, the image features produced by the three hole convolution units are F₄₁, F₄₂ and F₄₃ respectively. The hole convolution units are densely connected, meaning that the input of each hole convolution unit is added to the output of that unit to form its output. The hole convolution module thus yields an image feature F₄ with the same resolution as F₂₂, where F₄ = Add(F₂₂, F₄₁, F₄₂, F₄₃) and Add is a pixel-by-pixel addition. Using hole convolution in place of ordinary convolution and pooling enlarges the receptive field, so that deeper semantic information is acquired, while avoiding the loss of detail information caused by the resolution reduction of pooling.
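The gain in receptive field from hole factors 1, 2 and 4 can be checked with a short calculation. The sketch below uses the standard relations for stride-1 convolutions: a kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels, and each layer in a series adds (k - 1)·d pixels to the receptive field. The function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convolutions applied in series."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


dilations = [1, 2, 4]
print([effective_kernel_size(3, d) for d in dilations])  # [3, 5, 9]
print(stacked_receptive_field(3, dilations))             # 15
print(stacked_receptive_field(3, [1, 1, 1]))             # 7
```

Three ordinary 3 × 3 convolutions in series see a 7 × 7 neighbourhood, whereas the three hole convolution units see 15 × 15 with the same number of weights and no loss of resolution.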

Further, the deconvolution layer in step S4 and step S5 is a transposed convolution.

Further, in step S7, the established pyramid hole convolution network is trained end to end; the training strategy uses the stochastic gradient descent algorithm, and the loss function is categorical_crossentropy, with the following formula:

[formula not reproduced in the source text]

where l_c denotes the categorical cross-entropy loss of the segmented feature map F_s, f_s denotes a voxel of the feature map F_s, M is the number of voxels in F_s, K is the number of classes, and the two remaining symbols of the formula denote, respectively, whether voxel f_s belongs to class k and the probability that voxel f_s belongs to class k.
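For orientation, one standard form of the categorical cross-entropy that is consistent with the quantities named above is sketched below. It is an assumption, not the formula of the patent: the indicator symbol y and the probability symbol p are assumed names, since the original symbols are not reproduced in this text, and the averaging over the M voxels is likewise assumed.

```latex
l_c(F_s) = -\frac{1}{M} \sum_{f_s \in F_s} \sum_{k=1}^{K} y_{f_s}^{k} \, \log p_{f_s}^{k}
```

Here y equals 1 if voxel f_s belongs to class k and 0 otherwise, so for each voxel only the term of its true class contributes to the sum.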

Compared with the prior art, the invention has the following advantages and effects:

(1) The method uses a hole convolution module to extract deep semantic information; compared with conventional convolution and pooling, the hole convolution module enlarges the receptive field without reducing the resolution. In addition, the hole convolution module comprises three hole convolution layers with different hole factors, connected in a dense connection manner, so that semantic information can be acquired at multiple scales.

(2) The invention additionally uses a spatial pyramid pooling module to extract the information at multiple scales contained in the image, thereby effectively acquiring both the deep semantic information and the shallow detail information contained in the image.

(3) The invention uses residual recursive convolution in place of ordinary convolution, which helps to train a deeper network structure and obtain a feature representation better suited to the segmentation task.

(4) The modules contained in the method, such as residual recursive convolution, hole convolution and pyramid pooling, form an algorithm that can be trained end to end; compared with a two-stage algorithm, it has fewer parameters and is more convenient to train.

Drawings

FIG. 1 is a flow chart of a semantic segmentation method based on a pyramid hole convolution network disclosed by the invention;

FIG. 2 (a) is a schematic diagram of a residual recursive convolution module in an embodiment of the present invention, and FIG. 2 (b) is a schematic diagram of a recursive convolution unit used in FIG. 2 (a);

FIG. 3 is a schematic diagram of a spatial pyramid pooling module in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a hole convolution module according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment

As shown in FIG. 1, this embodiment provides a semantic segmentation method based on a pyramid hole convolution network, which specifically comprises the following steps:

S1, acquiring a medical image data set containing ground-truth segmentation results, and performing preprocessing operations such as data augmentation on the data set. Since most medical image data sets are small and have low contrast, the images in the data set are first rotated, sliced, standardized and subjected to adaptive histogram equalization.
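Two of the preprocessing operations named in step S1, rotation and standardization, can be sketched with the Python standard library. This is a minimal illustration on a 2D list of pixel values. The rotation angle (90 degrees) and the zero-mean, unit-variance form of standardization are assumptions, since the text does not specify them; slicing and adaptive histogram equalization are not shown. The function names are hypothetical.

```python
import statistics


def rotate90(image):
    """Rotate a 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def standardize(image):
    """Shift and scale pixel values to zero mean and unit variance."""
    pixels = [v for row in image for v in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:
        raise ValueError("a constant image cannot be standardized")
    return [[(v - mean) / std for v in row] for row in image]


image = [[1.0, 2.0], [3.0, 4.0]]
rotated = rotate90(image)        # [[3.0, 1.0], [4.0, 2.0]]
normalized = standardize(image)
print(rotated)
print(normalized)
```

When rotation is used for augmentation, the same rotation has to be applied to the ground-truth segmentation so that image and label stay aligned.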

S2, passing the preprocessed image sequentially through the first residual recursive convolution module, a pooling layer, the second residual recursive convolution module and a pooling layer, so as to extract the semantic information of the image at multiple scales and obtain the shallow image features F₁₁, F₁₂, F₂₁ and F₂₂ respectively. Specifically: as shown in FIG. 2 (a), the residual recursive convolution module passes its input through two cascaded recursive convolution layers and then adds the result to the input in a residual manner to obtain the output; as shown in FIG. 2 (b), each recursive convolution unit consists of Conv, ReLU, Add, Conv and ReLU connected in sequence, where Conv is a convolution layer with a 3 × 3 convolution kernel and Add is a pixel-by-pixel addition with the input; each pooling layer is a max pooling layer with a stride of 2.

S3, passing the image feature F₂₂ through a network in which the pyramid pooling module and the hole convolution module are connected in parallel, wherein F₂₂ passes through the pyramid pooling module to give the image feature F₃, and F₂₂ passes through the hole convolution module to give the image feature F₄; the image features F₃ and F₄ are then aggregated channel by channel and passed through a convolution layer with a 1 × 1 convolution kernel to obtain the deep image feature F₅, whereby deep semantic information is further extracted. Specifically:

As shown in FIG. 3, the pyramid pooling module comprises four adaptive average pooling layers (avgpool in FIG. 3) with different pooling sizes, which are used to capture the image feature F₂₂ obtained in step S2 at multiple scales; the four pooling layers use pooling sizes of N, N/2, N/3 and N/6 respectively, where N denotes the resolution of the image feature F₂₂. The differently sized image features produced by the pooling layers are each passed through a convolution layer with a 1 × 1 convolution kernel (conv 1 × 1 in FIG. 3) and then through a transposed convolution (up-conv in FIG. 3), giving image features F₃₁, F₃₂, F₃₃ and F₃₄ of the same size as F₂₂. The up-sampled result of each scale is then aggregated with the input image feature F₂₂, and the aggregated image features are passed through a convolution layer with a 3 × 3 convolution kernel to obtain the image feature F₃, that is, F₃ = Conv(Concatenate(F₂₂, F₃₁, F₃₂, F₃₃, F₃₄)), where Concatenate is the aggregation operation and Conv is a 3 × 3 convolution.

As shown in FIG. 4, the hole convolution module is formed by connecting in series three hole convolution units with different hole factors; the hole factors of the three units are 1, 2 and 4 respectively, and the hole convolution kernels are all 3 × 3. For an input image feature F₂₂, the image features produced by the three hole convolution units are F₄₁, F₄₂ and F₄₃ respectively. The hole convolution units are densely connected, meaning that the input of each hole convolution unit is added to the output of that unit to form its output. The hole convolution module thus yields an image feature F₄ with the same resolution as F₂₂, where F₄ = Add(F₂₂, F₄₁, F₄₂, F₄₃) and Add is a pixel-by-pixel addition.
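The data flow of the hole convolution module can be illustrated with a one-dimensional toy. The sketch below follows one reading that satisfies both statements in the text: each unit's output is added to its input and the sum is fed to the next unit, so the running sum after the third unit equals F₂₂ + F₄₁ + F₄₂ + F₄₃. That reading, the 1D simplification, the all-ones kernel and the function names are assumptions made for illustration.

```python
def dilated_conv1d(x, kernel, dilation):
    """Same-length 1D dilated convolution with zero padding."""
    n, k = len(x), len(kernel)
    out = []
    for i in range(n):
        acc = 0.0
        for t in range(k):
            j = i + (t - k // 2) * dilation
            if 0 <= j < n:
                acc += kernel[t] * x[j]
        out.append(acc)
    return out


def hole_conv_module(f22, kernel=(1.0, 1.0, 1.0), dilations=(1, 2, 4)):
    """Three dilated units in series, each added to its own input."""
    running = list(f22)                                   # starts as F22
    for d in dilations:
        unit_out = dilated_conv1d(running, kernel, d)     # F41, F42, F43 in turn
        running = [a + b for a, b in zip(running, unit_out)]
    return running                                        # F22 + F41 + F42 + F43


# A single impulse shows how far information spreads through the module.
impulse = [0.0] * 17
impulse[8] = 1.0
f4 = hole_conv_module(impulse)
support = [i for i, v in enumerate(f4) if v != 0.0]
print(len(f4), len(support))  # 17 15
```

The output keeps the length of the input, and the impulse reaches 15 positions, matching the receptive field of three 3-tap units with hole factors 1, 2 and 4.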

In this embodiment, the channel-by-channel aggregation operation means aggregation along the channel dimension: if the image feature F₃ has C₁ channels and the image feature F₄ has C₂ channels, the image feature obtained after aggregation has C₁ + C₂ channels.
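The channel arithmetic of the aggregation operation can be written as a small shape check. The sketch below works on (channels, height, width) tuples only; the channel counts 64 and 32 are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def concatenate_channels(*shapes):
    """Shape of the channel-wise concatenation of (C, H, W) feature maps."""
    _, h, w = shapes[0]
    for _, hh, ww in shapes:
        if (hh, ww) != (h, w):
            raise ValueError("spatial sizes must match before aggregation")
    return (sum(s[0] for s in shapes), h, w)


f3_shape = (64, 16, 16)  # hypothetical C1 = 64
f4_shape = (32, 16, 16)  # hypothetical C2 = 32
merged = concatenate_channels(f3_shape, f4_shape)
print(merged)  # (96, 16, 16)
```

The 1 × 1 convolution that follows the aggregation in step S3 can then map the C₁ + C₂ channels to whatever channel count the decoder expects.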

S4, passing the image feature F₅ through a deconvolution layer and aggregating the result channel by channel with the shallow image feature F₂₁ delivered through a skip connection to obtain the image feature F₆₁; then passing F₆₁ through the third residual recursive convolution module to obtain the image feature F₆₂. Specifically: the deconvolution layer is a transposed convolution; the skip connection transmits the shallow feature directly and aggregates it channel by channel with the output of the deconvolution layer, the channel-by-channel aggregation operation being as described in step S3 above.

S5, passing the image feature F₆₂ through a deconvolution layer and aggregating the result channel by channel with the shallow image feature F₁₁ delivered through a skip connection to obtain the image feature F₇₁; then passing F₇₁ through the fourth residual recursive convolution module to obtain the image feature F₇₂. Specifically: the deconvolution layer is a transposed convolution; the skip connection transmits the shallow feature directly and aggregates it channel by channel with the output of the deconvolution layer, the channel-by-channel aggregation operation being as described in step S3 above.

S6, inputting the image feature F₇₂ into the softmax prediction layer to obtain the category to which each pixel in the original input image belongs.
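The softmax prediction of step S6 amounts to a per-pixel softmax followed by choosing the most probable class. The sketch below is a minimal illustration in plain Python; the three-class scores are made-up values and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_classes(logit_map):
    """logit_map[i][j] is a list of K class scores for pixel (i, j)."""
    result = []
    for row in logit_map:
        out_row = []
        for logits in row:
            probs = softmax(logits)
            out_row.append(max(range(len(probs)), key=probs.__getitem__))
        result.append(out_row)
    return result


# One row of two pixels with K = 3 hypothetical class scores each.
logit_map = [[[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]]
labels = predict_classes(logit_map)
print(labels)  # [[0, 2]]
```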

S7, training the pyramid hole convolution network, establishing a loss function, and determining the network parameters from training samples, wherein the parameters specifically include the learning rate, weight decay, momentum term and training strategy. The established pyramid hole convolution network is trained end to end; the training strategy uses the stochastic gradient descent algorithm, with the initial learning rate set to 0.001, the weight decay set to 10⁻⁴ and a momentum term of 0.9. The loss function is the categorical_crossentropy loss function, which differs from the original cross-entropy loss function in that the loss for voxels of the k-th class is given a corresponding loss weight v^k, whose size is inversely proportional to the number of voxels belonging to the k-th class. The formula is:

[formula not reproduced in the source text]

where the two symbols of the formula defined here denote, respectively, whether voxel f_s belongs to class k and the probability that voxel f_s belongs to class k.
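A class-weighted categorical cross-entropy of the kind described in step S7 can be sketched as follows. This is an illustration under stated assumptions rather than the formula of the patent: the weight of class k is taken as total / (K × count_k), which is one common way of making weights inversely proportional to class frequency, and the loss is averaged over the voxels. Both choices and the function names are assumptions.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """One weight per class, inversely proportional to its voxel count."""
    counts = [0] * num_classes
    for k in labels:
        counts[k] += 1
    total = len(labels)
    # Classes absent from `labels` get weight 0 to avoid division by zero.
    return [total / (num_classes * c) if c else 0.0 for c in counts]


def weighted_categorical_crossentropy(probs, labels, weights):
    """probs[i][k] is the predicted probability of class k for voxel i.

    Because the indicator is one-hot, only the true-class term of each
    voxel contributes to the sum over classes.
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for p, k in zip(probs, labels):
        loss -= weights[k] * math.log(max(p[k], eps))
    return loss / len(labels)


# Four voxels, two classes; class 1 is rare and therefore weighted more.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
weights = inverse_frequency_weights(labels, 2)
loss = weighted_categorical_crossentropy(probs, labels, weights)
print(weights)
print(loss)
```

With all weights set to 1 the function reduces to the ordinary categorical cross-entropy, which makes the effect of the class weights easy to compare.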

In summary, the semantic segmentation method based on the pyramid hole convolution network disclosed in this embodiment proposes and trains a pyramid hole convolution network, establishes a loss function, and determines the network parameters from training samples; a test image is then input into the trained pyramid hole convolution network to obtain the semantic segmentation result of the image. By combining hole convolution with pyramid pooling, the method effectively extracts multi-scale semantic information and detail information and improves the segmentation performance of the network.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A semantic segmentation method based on a pyramid hole convolution network, characterized in that the pyramid hole convolution network comprises a first residual recursive convolution module, a second residual recursive convolution module, a pooling layer, a pyramid pooling module, a hole convolution module, a deconvolution layer, a third residual recursive convolution module, a fourth residual recursive convolution module and a softmax prediction layer, and the structural connection mode is as follows: the pyramid pooling module and the hole convolution module are connected in parallel and then connected in series with the pooling layer, and then the deconvolution layer, the third residual recursive convolution module, the deconvolution layer, the fourth residual recursive convolution module and the softmax prediction layer are sequentially connected in series; the semantic segmentation method comprises the following steps:

S1, acquiring a medical image data set containing ground-truth segmentation results, and preprocessing the data set to realize data augmentation;

S3, passing the image feature F₂₂ through a network in which the pyramid pooling module and the hole convolution module are connected in parallel, wherein F₂₂ passes through the pyramid pooling module to give the image feature F₃, and F₂₂ passes through the hole convolution module to give the image feature F₄; the image features F₃ and F₄ are aggregated channel by channel and then passed through a convolution layer with a 1 × 1 convolution kernel to obtain the deep image feature F₅, thereby further extracting deep semantic information;

S4, passing the image feature F₅ through a deconvolution layer and aggregating the result channel by channel with the shallow image feature F₂₁ delivered through a skip connection to obtain the image feature F₆₁; then passing F₆₁ through the third residual recursive convolution module to obtain the image feature F₆₂, wherein the skip connection transmits the shallow feature directly and aggregates it channel by channel with the output of the deconvolution layer;

2. The method for semantic segmentation based on the pyramid hole convolutional network of claim 1, wherein the preprocessing operation in step S1 includes rotation, slicing, normalization, and adaptive histogram equalization.

3. The semantic segmentation method based on the pyramid hole convolution network according to claim 1, characterized in that the first residual recursive convolution module, the second residual recursive convolution module, the third residual recursive convolution module and the fourth residual recursive convolution module have the same structure, and each residual recursive convolution module passes its input through two recursive convolution layers connected in series and then adds the result to the input in a residual manner to obtain the output; each recursive convolution layer consists of Conv, ReLU, Add, Conv and ReLU connected in sequence, wherein Conv is a convolution layer with a 3 × 3 convolution kernel, and Add is a pixel-by-pixel addition with the input.

4. The semantic segmentation method based on the pyramid hole convolution network according to claim 1, characterized in that the pyramid pooling module in step S3 comprises four adaptive average pooling layers with different pooling sizes, which are used to capture the image feature F₂₂ obtained in step S2 at multiple scales; the four pooling layers use pooling sizes of N, N/2, N/3 and N/6 respectively, where N denotes the resolution of the image feature F₂₂; the differently sized image features produced by the pooling layers are each passed through a convolution layer with a 1 × 1 convolution kernel and then through a transposed convolution, giving image features F₃₁, F₃₂, F₃₃ and F₃₄ of the same size as F₂₂; the up-sampled result of each scale is then aggregated with the input image feature F₂₂, and the aggregated image features are passed through a convolution layer with a 3 × 3 convolution kernel to obtain the image feature F₃, that is, F₃ = Conv(Concatenate(F₂₂, F₃₁, F₃₂, F₃₃, F₃₄)), where Concatenate is the aggregation operation and Conv is a 3 × 3 convolution.

5. The semantic segmentation method based on the pyramid hole convolution network according to claim 1, characterized in that in step S3 the hole convolution module is formed by connecting in series three hole convolution units with different hole factors, the hole factors of the three hole convolution units are 1, 2 and 4 respectively, and the hole convolution kernels are all 3 × 3; for an input image feature F₂₂, the image features produced by the three hole convolution units are F₄₁, F₄₂ and F₄₃ respectively; the hole convolution units are densely connected, meaning that the input of each hole convolution unit is added to the output of that unit to form its output; the hole convolution module yields an image feature F₄ with the same resolution as F₂₂, where F₄ = Add(F₂₂, F₄₁, F₄₂, F₄₃) and Add is a pixel-by-pixel addition.

6. The method as claimed in claim 1, wherein the deconvolution layer in step S4 and step S5 is a transposed convolution.

7. The semantic segmentation method based on the pyramid hole convolution network according to claim 1, characterized in that in step S7 the established pyramid hole convolution network is trained end to end, the training strategy uses the stochastic gradient descent algorithm, and the loss function is categorical_crossentropy, with the following formula:

[formula not reproduced in the source text]

where the two symbols of the formula defined here denote, respectively, whether voxel f_s belongs to class k and the probability that voxel f_s belongs to class k.