CN110223304B - Image segmentation method and device based on multipath aggregation and computer-readable storage medium - Google Patents

Image segmentation method and device based on multipath aggregation and computer-readable storage medium

Info

Publication number
CN110223304B
CN110223304B (application CN201910419055.9A)
Authority
CN
China
Prior art keywords
data
sampling
image
layer
path aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910419055.9A
Other languages
Chinese (zh)
Other versions
CN110223304A (en
Inventor
刘琚
林枫茗
吴强
孔祥茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201910419055.9A priority Critical patent/CN110223304B/en
Publication of CN110223304A publication Critical patent/CN110223304A/en
Application granted granted Critical
Publication of CN110223304B publication Critical patent/CN110223304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis > G06T 7/10 Segmentation; Edge detection > G06T 7/11 Region-based segmentation
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/10 Image acquisition modality > G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20 Special algorithmic details > G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20 Special algorithmic details > G06T 2207/20081 Training; Learning
    • G06T 2207/20 Special algorithmic details > G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention provides an image segmentation method and device based on multipath aggregation, and a computer-readable storage medium. First, a bottom-up path aggregation encoder structure enhances the whole feature hierarchy with the spatial position information carried by low-level features, shortening the information path between low-level and top-level features so that more complete low-level features are used. Second, the enhanced decoder of the invention has a greater capacity for retaining features. Third, to further improve the efficiency of mask prediction, an efficient feature pyramid method is provided that achieves the feature pyramid effect with fewer resources. The algorithm is validated on the BraTS2017 and BraTS2018 datasets, where the method outperforms traditional methods and yields better segmentation results.

Description

Image segmentation method and device based on multipath aggregation and computer readable storage medium
Technical Field
The invention belongs to the technical field of image processing and analysis, and particularly relates to an image segmentation method and device based on multipath aggregation and a computer readable storage medium.
Background
With the development of computer science and artificial intelligence, computers keep getting faster, and deep learning methods now outperform traditional algorithms on practical problems. Semantic segmentation of an image classifies every pixel and groups the regions that share the same semantics. In recent years, image segmentation has found growing use in autonomous driving, unmanned aerial vehicles, photo beautification, smart homes, smart healthcare and other industries, and ever more products and devices rely on strong image segmentation technology as a support.
Conventional automatic image segmentation algorithms include thresholding, edge detection, region growing, watershed algorithms, model-based methods (e.g., level sets), and combinations of these. Traditional algorithms run efficiently, but their accuracy still falls short of application requirements, and the segmentation process needs human intervention. Deep-learning-based image segmentation surpasses these traditional algorithms in accuracy, but still has notable shortcomings. After an image passes through a multi-layer convolutional neural network, deep features are obtained, while the shallow features produced along the way are ignored. In traditional deep-learning segmentation methods, the final segmentation map uses only deep features and not the equally important shallow ones. Feature-pyramid-based multi-layer fusion methods exist, but their use of shallow features remains insufficient. There is therefore a need for an image segmentation method with channel enhancement.
Disclosure of Invention
Traditional convolutional-neural-network-based image segmentation methods either do not use shallow features or use them insufficiently, ignoring the spatial position information of the image. To address this, the invention provides a neural network method based on multipath aggregation that makes full and effective use of the shallow features of an image, combines them with the deep features, and outputs the segmentation result jointly, thereby improving segmentation accuracy.
The technical scheme adopted by the invention is as follows:
a image segmentation method based on multi-path aggregation combines shallow features and deep features of an image by using a path aggregation structure to obtain a final segmentation result, and specifically comprises the following steps:
(I) data preprocessing: normalizing the dataset and adjusting the distribution of image gray values; if multi-modal data exist, fusing them into multi-channel data, and if the data are single-modal, proceeding directly to subsequent processing; cleaning the data and removing unlabeled images to obtain the final data;
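The preprocessing step above (gray-value normalization, fusion of modalities into channels, and label-based cleaning) can be sketched as follows. The min-max normalization, the stacking of modalities along the last axis, and the use of `None` for a missing label are assumptions for illustration; the patent does not fix these details.

```python
import numpy as np

def normalize(img):
    """Min-max normalize one modality to [0, 1] (an assumed choice of
    gray-value adjustment; the patent leaves the exact scheme open)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def fuse_modalities(modalities):
    """Stack per-modality 2-D images into one multi-channel array (H, W, C)."""
    return np.stack([normalize(m) for m in modalities], axis=-1)

def clean(images, labels):
    """Drop images whose label is missing, keeping image/label pairs aligned."""
    pairs = [(x, y) for x, y in zip(images, labels) if y is not None]
    imgs, labs = zip(*pairs)
    return list(imgs), list(labs)

# Example with two hypothetical modalities of a 4x4 scan.
t1 = np.arange(16).reshape(4, 4)
t2 = np.ones((4, 4))
fused = fuse_modalities([t1, t2])
imgs, labs = clean([t1, t2], ["mask1", None])
print(fused.shape, len(imgs))  # (4, 4, 2) 1
```

Single-modal data would simply skip `fuse_modalities` (or pass a one-element list, yielding a single-channel array).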
(II) down-sampling the data through an encoder: each down-sampling region consists of two convolution layers and a pooling layer; to prevent vanishing gradients, a batch normalization layer followed by an activation is added after each convolution layer; the data pass through the down-sampling region four times, reducing the image size, and finally through a down-sampling region without a pooling layer to obtain the final down-sampled output;
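A minimal sketch of the shape effect of the four pooled encoder regions, assuming 2x2 max pooling and batch normalization without learned scale and shift; the convolutions themselves are omitted for brevity, since they do not change the spatial size here:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: the down-sampling layer of one
    encoder region. x has shape (H, W, C) with H and W even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def bn_relu(x, eps=1e-5):
    """Per-channel batch normalization (no learned affine parameters,
    an assumption for brevity) followed by ReLU."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return np.maximum((x - mu) / np.sqrt(var + eps), 0.0)

# Four pooled regions shrink a 64x64 input to 4x4 (1/16 per side).
x = np.random.rand(64, 64, 3)
for _ in range(4):
    x = max_pool_2x2(bn_relu(x))
print(x.shape)  # (4, 4, 3)
```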
(III) up-sampling the data through an enhanced decoder: down-sampling shrinks the image scale, so up-sampling restores the image to the original scale, as follows: each up-sampling region comprises a deconvolution layer, a connection layer and two convolution layers; the connection layer concatenates the same-scale feature map from down-sampling with the feature map obtained by deconvolution, and each convolution layer is followed by an activation function; the feature map output at the end of down-sampling passes through four up-sampling regions, restoring the image to its original scale to obtain the final up-sampled output; to accommodate more feature information, the number of decoder channels is increased and differs from the number of encoder channels, as follows:
D(x_i) = D(x_{i-1}) + E(x_i)
where D(x_i) is the output feature of the i-th decoder and E(x_i) is the output feature of the i-th encoder;
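The recurrence above can be read as: up-sample the previous decoder output and concatenate it with the same-scale encoder map. A shape-level sketch, with nearest-neighbour up-sampling standing in for the deconvolution layer and the channel counts (512 at the bottleneck, 256 for the encoder map) chosen as plausible assumptions; the two 3x3 convolutions of each region are omitted:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling, a stand-in for the deconvolution
    layer (the Detailed Description also allows interpolation)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(d_prev, e_same_scale):
    """One up-sampling region: D(x_i) = D(x_{i-1}) (+) E(x_i), realised
    as 2x up-sampling followed by channel-wise concatenation with the
    same-scale encoder feature map."""
    up = upsample2x(d_prev)
    return np.concatenate([up, e_same_scale], axis=-1)

d = np.zeros((4, 4, 512))   # bottleneck output D(x_{i-1})
e = np.zeros((8, 8, 256))   # same-scale encoder feature map E(x_i)
out = decoder_step(d, e)
print(out.shape)  # (8, 8, 768)
```

The concatenated channel count (768 here) is what makes the decoder wider than the mirrored encoder stage, matching the asymmetric structure described in the Detailed Description.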
(IV) down-sampling the data through a path aggregation encoder: each path aggregation region consists of two convolution layers, a down-sampling layer and a connection layer; the connection layer concatenates the same-scale feature map from the enhanced decoder with the path aggregation layer; the feature maps pass through the path aggregation region three times in total, and, counting the input of the path aggregation region as well, four feature maps of different scales are obtained, which form the output of the path aggregation region, as follows:
A(x_i) = A(x_{i-1}) + D(x_i)
where A(x_i) is the output feature of the i-th path aggregation encoder;
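One path aggregation step can be sketched the same way: down-sample the running aggregate and concatenate it with the enhanced decoder's same-scale map. Stride-2 slicing stands in for the stride-2 convolution (its weights are omitted), and the channel counts are assumptions for illustration:

```python
import numpy as np

def downsample2x(x):
    """Stride-2 slicing as a stand-in for the stride-2 convolution used
    by the path aggregation regions."""
    return x[::2, ::2, :]

def aggregation_step(a_prev, d_same_scale):
    """A(x_i) = A(x_{i-1}) (+) D(x_i): down-sample the previous
    aggregation output and concatenate it with the enhanced decoder's
    same-scale feature map."""
    return np.concatenate([downsample2x(a_prev), d_same_scale], axis=-1)

a = np.zeros((32, 32, 64))   # A(x_1): the full-scale decoder output
d = np.zeros((16, 16, 128))  # decoder map at half scale (assumed width)
a2 = aggregation_step(a, d)
print(a2.shape)  # (16, 16, 192)
```

Iterating this three times yields the four feature maps at scales 1, 1/2, 1/4 and 1/8 with increasing channel counts, as the Detailed Description states.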
(V) fusing the output feature maps of the path aggregation regions through the efficient feature pyramid and outputting the final segmentation result: the outputs of the path aggregation region are up-sampled to the original image scale and the pixel values are added; during up-sampling, the number of channels on each path is reduced, as follows:
P(x) = Σ_{i=1}^{4} U(A(x_i)), where U(·) denotes the channel-reducing up-sampling path to the original image scale
where P(x) is the output feature of the efficient feature pyramid; finally, the multi-task segmentation result is obtained through a convolution layer and an activation function, as follows:
H(x) = σ(Conv(P(x))), where Conv is a convolution layer and σ the activation function
where H(x) is the final output of the network;
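The efficient feature pyramid's ordering (reduce channels first, then up-sample, then sum pixel-wise) is what saves memory. A sketch with a 1x1 convolution modelled as a per-pixel linear map with random weights; the input channel counts and the 32-channel reduction follow the Detailed Description, while the weights are placeholders for shape only:

```python
import numpy as np

def reduce_channels(x, out_c=32, seed=0):
    """1x1 convolution as a per-pixel linear map shrinking the channel
    dimension to 32 before up-sampling (random weights, shapes only)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], out_c))
    return x @ w

def upsample_to(x, size):
    """Nearest-neighbour up-sampling to a target square size."""
    f = size // x.shape[0]
    return x.repeat(f, axis=0).repeat(f, axis=1)

def efficient_feature_pyramid(outputs, size=32):
    """P(x): reduce channels first (cheap), then up-sample and add pixel values."""
    return sum(upsample_to(reduce_channels(o), size) for o in outputs)

# Four path aggregation outputs at scales 1, 1/2, 1/4, 1/8
# (channel counts are assumptions in the spirit of the description).
maps = [np.ones((32, 32, 64)), np.ones((16, 16, 128)),
        np.ones((8, 8, 256)), np.ones((4, 4, 512))]
p = efficient_feature_pyramid(maps)
print(p.shape)  # (32, 32, 32)
```

Reducing to 32 channels before up-sampling means the large full-scale maps never carry hundreds of channels, which is the resource saving the step claims.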
(VI) result prediction: saving the trained model and feeding the test set into it to obtain the final segmentation result.
In order to implement the above method, the present invention also provides an image processing apparatus comprising a data acquisition component, a memory, and a processor, wherein,
the data acquisition component is used for normalizing the dataset and adjusting the distribution of image gray values; if multi-modal data exist, they are fused into multi-channel data, and if the data are single-modal, subsequent processing proceeds directly; the data are cleaned and unlabeled images removed to obtain the final data;
the memory stores a computer program that, when executed by the processor, implements steps (II) to (VI) of the method described above.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth above.
In the data preprocessing for model training, the data are randomly shuffled and the dataset is divided into n parts; n-1 parts are selected as the training set and the remaining part as the validation set, yielding the final segmentation result. Compared with traditional methods, the method achieves higher segmentation accuracy and better generalization.
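The shuffle-and-split scheme above can be sketched as follows; n = 5 is an assumed value, since the patent leaves n open:

```python
import random

def split_train_val(samples, n=5, seed=42):
    """Shuffle, cut into n parts, use n-1 parts for training and the
    remaining part for validation, as the description states."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]       # n roughly equal parts
    val = folds[0]
    train = [s for fold in folds[1:] for s in fold]
    return train, val

train, val = split_train_val(range(100), n=5)
print(len(train), len(val))  # 80 20
```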
Drawings
Fig. 1 is a schematic block diagram of an image segmentation method based on multipath aggregation.
FIG. 2 is a block diagram of a segmentation model used in the present invention.
Fig. 3 is a schematic block diagram of an image processing apparatus of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail with reference to the accompanying drawings.
As shown in fig. 1, the image segmentation method based on multipath aggregation of the present invention proceeds as follows:
the method comprises the following steps: a training stage: training is performed using the labeled data set. And (3) sending the data set into a network to participate in training, taking a cross entropy function as a loss function, updating parameters of a path aggregation network by using an Adam optimizer, training for 70 times, storing the model in each iteration, verifying by using a verification set part in the data set after storing the model, and finally storing the model with the highest verification accuracy.
Step 2: testing stage: preprocess the data, crop the multi-modal data and apply the standard operation of subtracting the mean and dividing by the variance; feed the data into the best model obtained in the training stage, and compute and display the segmentation result map.
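A sketch of that standardization step. Note the translation says "dividing by the variance", but the usual z-score standardization divides by the standard deviation, which is assumed here:

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Subtract the mean and divide by the standard deviation
    (z-score standardization, assumed intent of the description)."""
    return (x - x.mean()) / (x.std() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
z = standardize(x)
# z now has (approximately) zero mean and unit standard deviation.
```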
As shown in fig. 2, the network structure and the specific method of path aggregation are as follows:
(I) data preprocessing: normalize the dataset and adjust the image distribution. If multi-modal data exist, fuse them into multi-channel data; if the data are single-modal, proceed directly to subsequent processing. The data are read in with size b × w × h × c, where b is the number of images, w the image width, h the image height and c the number of channels. The data order is shuffled and the dataset normalized. The data are cleaned and unlabeled images removed to obtain the final data. All data are divided into n parts stored as arrays; n-1 parts are read as the training set and the remaining part is used as the test set.
(II) an encoder:
Down-sample the training data. The down-sampling process contains five regions: four down-sampling regions and a final region without down-sampling. Each down-sampling region consists of two convolution layers and one down-sampling layer; the convolution kernel size is 3 with stride 1, and the down-sampling layer is a max pooling layer. The number of convolution kernels increases from the first to the fourth down-sampling region. The last region is a down-sampling region with the pooling layer removed. To prevent vanishing gradients, a batch normalization layer is added after each convolution layer, and the convolution layers in the down-sampling process use the ReLU activation function. Each pass through a down-sampling region halves the width and height; the final down-sampled output has 512 channels with width and height 1/16 of the original.
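The spatial bookkeeping above is simple to verify: four halvings give 1/16 per side. The 240x240 input is an assumption (BraTS slices are commonly 240x240), used only to make the arithmetic concrete:

```python
# Four pooled regions each halve width and height; the fifth region has
# no pooling layer, so the bottleneck map is 1/16 of the input per side.
w = h = 240  # assumed input size
for _ in range(4):
    w, h = w // 2, h // 2
print(w, h)  # 15 15
```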
(III) enhancing the decoder:
Up-sample the training data. The up-sampling process comprises four regions, each consisting of an up-sampling layer, a connection layer and two convolution layers. The up-sampling layer uses interpolation; the connection layer concatenates the same-scale feature map from down-sampling with the up-sampled feature map. Both convolution layers use kernels of size 3, and the number of kernels decreases from the first to the fourth region while remaining larger than in the encoder; this asymmetric structure strengthens the decoder's feature analysis. A batch normalization layer is added after each convolution layer, and every convolution layer in the up-sampling process is followed by an activation function. Each pass through an up-sampling region doubles the width and height, and the final up-sampled output has 64 channels.
D(x_i) = D(x_{i-1}) + E(x_i)
where D(x_i) is the output feature of the i-th decoder and E(x_i) is the output feature of the i-th encoder.
(IV) path aggregation encoder:
Down-sample the training data. The path aggregation process comprises three down-sampling regions, each consisting of one down-sampling layer, one connection layer and two convolution layers. Down-sampling uses convolution layers with stride 2; the connection layer concatenates the same-scale feature map from up-sampling with the feature map in the path aggregation process. The two convolution layers use kernels of size 2 and 3 respectively, and the numbers of kernels in the first to third down-sampling regions are 128, 256 and 512. A batch normalization layer is added after each convolution layer, and the convolution layers in the down-sampling process all use the ReLU activation function. Each pass through a path aggregation region halves the width and height. The training data pass through three path aggregation regions in total; counting the input of the path aggregation region, four feature maps with width and height 1/1, 1/2, 1/4 and 1/8 of the original image and increasing channel counts are obtained, which form the output of the path aggregation region, as follows:
A(x_i) = A(x_{i-1}) + D(x_i)
where A(x_i) is the output feature of the i-th path aggregation encoder.
(V) efficient characteristic pyramid:
Fuse the output feature maps of the path aggregation regions and output the final segmentation result. The four outputs of the path aggregation region are each up-sampled to the original image scale, their pixel values added, and the sum passed through a convolution layer. Each up-sampling path comprises a convolution layer and an up-sampling layer. The convolution layer has 32 channels with kernel size 1; its purpose is to reduce the number of channels of the feature map, since otherwise a feature map with many channels would occupy much more memory after up-sampling, hindering training. The up-sampling layer uses interpolation. The process is as follows:
P(x) = Σ_{i=1}^{4} U(A(x_i)), where U(·) denotes the channel-reducing up-sampling path to the original image scale
where P(x) is the output feature of the efficient feature pyramid; finally, the multi-task segmentation result is obtained through a convolution layer and an activation function, as follows:
H(x) = σ(Conv(P(x))), where Conv is a convolution layer and σ the activation function
where H(x) is the final output of the network.
(VI) result prediction: save the trained model and feed the test set into it to obtain the final segmentation result.
In the data preprocessing for model training, the data are randomly shuffled and the dataset is divided into n parts; n-1 parts are selected as the training set and the remaining part as the validation set, yielding the final segmentation result.
Fig. 3 shows a schematic block diagram of the image processing apparatus of the present invention. As shown in fig. 3, the apparatus includes a data acquisition component, a memory and a processor. The data acquisition component normalizes the dataset to adjust the distribution of image gray values; if multi-modal data exist, they are fused into multi-channel data, and if the data are single-modal, subsequent processing proceeds directly; the data are cleaned and unlabeled images removed to obtain the final data. The memory stores a computer program that, when executed by the processor, implements steps (II) to (VI) of the method described above.
Compared with the traditional method, the method has higher segmentation precision and better generalization capability.
The effect of the present invention can be further illustrated by the segmentation results:
to verify the performance of the invention, datasets BraTS2017 and BraTS2018 were used, containing multimodal data for 285 patients. The standard data set is divided into a training set and a verification set, and the image segmentation method of multipath aggregation is compared with other methods which do not use multipath aggregation. The dice coefficient, recall ratio and precision ratio of edema, necrosis and enhancement parts are respectively compared.
Table 1 shows the segmentation results of the invention on the BraTS2017 dataset, and Table 2 shows those on the BraTS2018 dataset. VGG, DUnet and FCNN are classical methods in deep-learning image segmentation, and PA+EFP+ED is the combination of the path aggregation encoder, the efficient feature pyramid and the enhanced decoder. The bold numbers in the tables are the column maxima, indicating the best result. Overall, the invention segments better than the classical methods.
TABLE 1
[Table 1 appears as an image in the original publication: Dice coefficient, recall and precision of each method on BraTS2017.]
TABLE 2
[Table 2 appears as an image in the original publication: Dice coefficient, recall and precision of each method on BraTS2018.]

Claims (3)

1. An image segmentation method based on multipath aggregation, which combines the shallow and deep features of an image through a path aggregation structure to obtain the final segmentation result, specifically comprising the following steps:
(I) data preprocessing: normalizing the dataset and adjusting the distribution of image gray values; if multi-modal data exist, fusing them into multi-channel data, and if the data are single-modal, proceeding directly to subsequent processing; cleaning the data and removing unlabeled images to obtain the final data;
(II) down-sampling the data through an encoder: each down-sampling region consists of two convolution layers and a pooling layer; to prevent vanishing gradients, a batch normalization layer followed by an activation is added after each convolution layer; the data pass through the down-sampling region four times, reducing the image size, and finally through a down-sampling region without a pooling layer to obtain the final down-sampled output;
(III) up-sampling the data through an enhanced decoder: down-sampling shrinks the image scale, so up-sampling restores the image to the original scale, as follows: each up-sampling region comprises a deconvolution layer, a connection layer and two convolution layers; the connection layer concatenates the same-scale feature map from down-sampling with the feature map obtained by deconvolution, and each convolution layer is followed by an activation function; the feature map output at the end of down-sampling passes through four up-sampling regions, restoring the image to its original scale to obtain the final up-sampled output; to accommodate more feature information, the number of decoder channels is increased and differs from the number of encoder channels, as follows:
D(x_i) = D(x_{i-1}) + E(x_i)
where D(x_i) is the output feature of the i-th decoder and E(x_i) is the output feature of the i-th encoder;
(IV) down-sampling the data through a path aggregation encoder: each path aggregation region consists of two convolution layers, a down-sampling layer and a connection layer; the connection layer concatenates the same-scale feature map from the enhanced decoder with the path aggregation layer; the feature maps pass through the path aggregation region three times in total, and, counting the input of the path aggregation region as well, four feature maps of different scales are obtained, which form the output of the path aggregation region, as follows:
A(x_i) = A(x_{i-1}) + D(x_i)
where A(x_i) is the output feature of the i-th path aggregation encoder;
(V) fusing the output feature maps of the path aggregation regions through the efficient feature pyramid and outputting the final segmentation result: the outputs of the path aggregation region are up-sampled to the original image scale and the pixel values are added; during up-sampling, the number of channels on each path is reduced, as follows:
P(x) = Σ_{i=1}^{4} U(A(x_i)), where U(·) denotes the channel-reducing up-sampling path to the original image scale
where P(x) is the output feature of the efficient feature pyramid; finally, the multi-task segmentation result is obtained through a convolution layer and an activation function, as follows:
H(x) = σ(Conv(P(x))), where Conv is a convolution layer and σ the activation function
where H(x) is the final output of the network;
(VI) result prediction: saving the trained model and feeding the test set into it to obtain the final segmentation result.
2. An image processing apparatus comprising a data acquisition component, a memory and a processor, wherein,
the data acquisition component is used for normalizing the dataset and adjusting the distribution of image gray values; if multi-modal data exist, they are fused into multi-channel data, and if the data are single-modal, subsequent processing proceeds directly; the data are cleaned and unlabeled images removed to obtain the final data;
the memory stores a computer program that, when executed by the processor, implements steps (II) to (VI) of the method of claim 1.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 1.
CN201910419055.9A 2019-05-20 2019-05-20 Image segmentation method and device based on multipath aggregation and computer-readable storage medium Active CN110223304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910419055.9A CN110223304B (en) 2019-05-20 2019-05-20 Image segmentation method and device based on multipath aggregation and computer-readable storage medium


Publications (2)

Publication Number Publication Date
CN110223304A CN110223304A (en) 2019-09-10
CN110223304B true CN110223304B (en) 2023-01-24

Family

ID=67821649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910419055.9A Active CN110223304B (en) 2019-05-20 2019-05-20 Image segmentation method and device based on multipath aggregation and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110223304B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827297A (en) * 2019-11-04 2020-02-21 中国科学院自动化研究所 Insulator segmentation method for generating countermeasure network based on improved conditions
CN111104962B (en) * 2019-11-05 2023-04-18 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111047602A (en) * 2019-11-26 2020-04-21 中国科学院深圳先进技术研究院 Image segmentation method and device and terminal equipment
CN111161269B (en) * 2019-12-23 2024-03-22 上海联影智能医疗科技有限公司 Image segmentation method, computer device, and readable storage medium
CN111553925B (en) * 2020-04-27 2023-06-06 南通智能感知研究院 FCN-based end-to-end crop image segmentation method and system
CN113393476B (en) * 2021-07-07 2022-03-11 山东大学 Lightweight multi-path mesh image segmentation method and system and electronic equipment
CN117789253B (en) * 2024-02-23 2024-05-03 东北大学 Video pedestrian re-identification method based on double networks

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108510502A (en) * 2018-03-08 2018-09-07 华南理工大学 Melanoma picture tissue segmentation methods based on deep neural network and system
CN109410219A (en) * 2018-10-09 2019-03-01 山东大学 A kind of image partition method, device and computer readable storage medium based on pyramid fusion study


Non-Patent Citations (3)

Title
Hybrid Pyramid U-Net Model for Brain Tumor Segmentation; Xiangmao Kong et al.; 10th IFIP TC 12 International Conference, IIP 2018; 2018-10-22; pp. 346-355 *
Path Aggregation Network for Instance Segmentation; Shu Liu et al.; arXiv:1803.01534v4 [cs.CV]; 2018-09-18; pp. 1-11 *
Outdoor scene image segmentation and depth generation using geometric complexity; Ren Yannan et al.; Signal Processing (《信号处理》); 2018-05; pp. 531-538 *

Also Published As

Publication number Publication date
CN110223304A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223304B (en) Image segmentation method and device based on multipath aggregation and computer-readable storage medium
CN111681252B (en) Medical image automatic segmentation method based on multipath attention fusion
CN111325751A (en) CT image segmentation system based on attention convolution neural network
CN110020989B (en) Depth image super-resolution reconstruction method based on deep learning
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN112116605A (en) Pancreas CT image segmentation method based on integrated depth convolution neural network
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN110675411A (en) Cervical squamous intraepithelial lesion recognition algorithm based on deep learning
CN110866938B (en) Full-automatic video moving object segmentation method
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN110599495B (en) Image segmentation method based on semantic information mining
CN113066025B (en) Image defogging method based on incremental learning and feature and attention transfer
CN110738660A (en) Spine CT image segmentation method and device based on improved U-net
CN111986092B (en) Dual-network-based image super-resolution reconstruction method and system
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN111951164A (en) Image super-resolution reconstruction network structure and image reconstruction effect analysis method
CN114187454A (en) Novel significance target detection method based on lightweight network
Jian et al. Dual-Branch-UNet: A Dual-Branch Convolutional Neural Network for Medical Image Segmentation.
CN114418987A (en) Retinal vessel segmentation method and system based on multi-stage feature fusion
CN110458849B (en) Image segmentation method based on feature correction
CN113963272A (en) Unmanned aerial vehicle image target detection method based on improved yolov3
CN112200809B (en) Adherent chromosome separation method and device based on skeleton segmentation and key point detection
TWI809957B (en) Object detection method and electronic apparatus
CN116524180A (en) Dramatic stage scene segmentation method based on lightweight backbone structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant