CN114170249A - Image semantic segmentation method based on CNUNet3+ network - Google Patents

Image semantic segmentation method based on CNUNet3+ network

Info

Publication number
CN114170249A
Authority
CN
China
Prior art keywords
node
encoder
decoder
nodes
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210118688.8A
Other languages
Chinese (zh)
Other versions
CN114170249B (en)
Inventor
Zhang Bin
Ouyang Honglin
Zhu Yingda
Liu Qisheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202210118688.8A priority Critical patent/CN114170249B/en
Publication of CN114170249A publication Critical patent/CN114170249A/en
Application granted granted Critical
Publication of CN114170249B publication Critical patent/CN114170249B/en
Priority to NL2032957A priority patent/NL2032957B1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10116X-ray image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The invention discloses an image semantic segmentation method based on a CNUNet3+ network, which comprises the following steps: constructing a CNUNet3+ network model, wherein the model has a U-shaped structure and adopts an encoder-decoder structure; its depth is N, with N ≥ 3; the left arm of the U-shaped structure is the encoder and the right arm is the decoder; the number of channels doubles each time the encoder goes one layer deeper; down-sampling is used between adjacent encoder nodes, up-sampling between adjacent decoder nodes, and dense connections between decoder nodes; a convolution operation is used when an encoder node and a central node are at the same depth, and a down-sampling operation when they are at different depths; the encoder nodes, central nodes and decoder nodes are all neuron nodes; the number of central nodes may range from 1 to N−2. The method is used for medical X-ray and CT image segmentation and can improve the segmentation accuracy of small targets.

Description

Image semantic segmentation method based on CNUNet3+ network
Technical Field
The invention belongs to the field of image segmentation, and relates to an image semantic segmentation method based on a CNUNet3+ network.
Background
Commonly used clinical medical images include X-ray, computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound images. Compared with general natural images, medical images are both specialized and complex: the contrast between different regions is often minimal, and the images are typically grayscale. In an abdominal CT slice, organs such as the liver, spleen, kidneys and stomach have almost the same intensity and are often hard to tell apart. If a tumor appears in the liver, its intensity differs only slightly from that of the surrounding tissue; such subtle differences can be recognized by medical professionals but are difficult for laypeople to see, which is what makes the segmentation of medical images complex.
Medical image diagnosis requires identifying the patient's lesion region. In liver tumor diagnosis, for example, the tumor region in the liver must be identified and labeled in the image; this is the task of medical image segmentation. Diagnosis usually depends on the radiologist's skill and experience and therefore suffers from strong subjectivity and poor repeatability. Computer image processing technology can help doctors improve both diagnostic accuracy and reading efficiency.
With the rapid development of artificial intelligence technology, represented by deep learning, medical imaging has become an important application field of artificial intelligence. In recent years, deep learning has taken the leading position in the field of image segmentation and has become the mainstream of medical image segmentation technology.
Among deep learning algorithms, convolutional neural networks (CNNs) have long played the dominant role in image processing. In 2015, Jonathan Long et al. proposed the Fully Convolutional Network (FCN). Owing to the FCN's strong performance, many FCN-based deep learning methods have since been proposed: for example, UNet and SegNet, which adopt the encoder-decoder scheme; DANet and CCNet, which incorporate attention mechanisms; and DeepLab, which employs dilated (atrous) convolution.
Among medical image segmentation methods, the UNet network is the most classical. In 2015, Olaf Ronneberger et al. proposed the UNet network. Its encoding part extracts high-level features, its decoding part restores the spatial information of the image, and skip connections, in the spirit of ResNet, combine high-level and low-level features. UNet segments images relatively accurately with a simple network structure, and it has been widely used in medical image segmentation research.
Numerous variants of the original UNet network have since emerged. Among these variants, the most important are the UNet++ and UNet3+ networks.
In 2018, Zongwei Zhou et al. proposed the UNet++ network. UNet++ extracts different features from sub-networks of different depths and combines them with skip connections of different lengths, drawing on the DenseNet design. UNet++ has also been widely used in medical image segmentation research.
In 2020, Huimin Huang et al. proposed the UNet3+ network. In the UNet3+ structure diagram, the left arm represents the encoding process and the right arm the decoding process; each circle represents a neuron node, and the number below a circle is the node's number of output channels. During encoding, the number of output channels doubles with each additional layer of depth.
UNet3+ performs better than other medical image segmentation methods. However, because it has too few neuron nodes, its segmentation of fine targets is unsatisfactory when the number of training samples is small.
Disclosure of Invention
To solve these problems, the invention provides an image semantic segmentation method based on a CNUNet3+ network. The method is intended for medical image segmentation, in particular of X-ray and CT images, and can segment tiny lesion regions, thereby assisting radiologists in diagnosing a patient's condition. It specifically comprises the following steps:
constructing a CNUNet3+ network model, wherein the CNUNet3+ network model has a U-shaped structure and adopts an encoder-decoder structure; its depth is N, where N is a positive integer greater than or equal to 3; the left arm of the U-shaped structure is the encoder and the right arm is the decoder; the number of channels doubles each time the encoder goes one layer deeper; down-sampling is used between adjacent encoder nodes, up-sampling between adjacent decoder nodes, and dense connections between decoder nodes; a central node is inserted in each of the first through L-th layers, with 1 ≤ L ≤ N−2; a convolution operation is used when an encoder node and a central node are at the same depth, and a down-sampling operation when they are at different depths; a convolution operation is used when a central node and a decoder node are at the same depth, and a down-sampling operation when they are at different depths; convolution is used between encoder and decoder nodes in layers without a central node; and the encoder nodes, central nodes and decoder nodes are all neuron nodes;
inputting training set data into the CNUNet3+ network model and training the CNUNet3+ network model, wherein the training set data comprise original images and images labeled with segmentation results as training set samples;
preprocessing the images to be segmented so that they all have the same size;
and segmenting the images to be segmented with the trained CNUNet3+ network model.
Further, the central node is computed as:

$X_{Ce}^{i} = H\left(\left[D(X_{En}^{1}), \ldots, D(X_{En}^{i-1}), C(X_{En}^{i})\right]\right)$

wherein $X_{Ce}^{i}$ represents the central node at the i-th layer; $X_{En}^{i}$ represents the encoder node at the i-th layer; M represents the number of central nodes; $D(\cdot)$ represents the down-sampling operation; $C(\cdot)$ represents the convolution operation; $[\cdot]$ represents the concatenation operation; $H(\cdot)$ represents the hybrid integration operation, which first performs a convolution, then a batch normalization operation, and then ELU activation; and in the formula $i = 1, 2, \ldots, M$.
Further, the CNUNet3+ network model employs an ELU activation function.
Further, the down-sampling operation uses max pooling.
Further, the up-sampling operation uses bilinear interpolation.
Compared with the UNet3+ network, the invention adds several central nodes, remedying UNet3+'s shortage of nodes, and substantially changes how the network nodes are connected. Compared with the UNet++ network, the number of nodes, and hence the number of parameters, is greatly reduced. When the method is used to segment medical X-ray and CT images, it still performs well when the number of samples is small.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a CNUNet3+ network model in which the encoder is five layers deep and there are three central nodes;
Fig. 2 is a CNUNet3+ network model in which the encoder is five layers deep and there are two central nodes;
Fig. 3 is a CNUNet3+ network model in which the encoder is five layers deep and there is only one central node;
Fig. 4 is a CNUNet3+ network model in which the encoder is four layers deep and there are two central nodes;
Fig. 5 is a CNUNet3+ network model in which the encoder is four layers deep and there is only one central node.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The image semantic segmentation method based on the CNUNet3+ network provided by the invention can be used when the number of samples is small, and comprises the following steps:
first, a CNUNet3+ network model is constructed.
The CNUNet3+ network model has a U-shaped structure and adopts an encoder-decoder structure; its depth is N, where N is a positive integer greater than or equal to 3. The left arm of the U-shaped structure is the encoder and the right arm is the decoder. The number of channels doubles each time the encoder goes one layer deeper. Down-sampling is used between adjacent encoder nodes, up-sampling between adjacent decoder nodes, and dense connections between decoder nodes. A central node is inserted in each of the first through L-th layers, with 1 ≤ L ≤ N−2. A convolution operation is used when an encoder node and a central node are at the same depth, and a down-sampling operation when they are at different depths; likewise, a convolution operation is used when a central node and a decoder node are at the same depth, and a down-sampling operation when they are at different depths. Convolution is used between encoder and decoder nodes in layers that have no central node. The encoder nodes, central nodes and decoder nodes are all neuron nodes.
Second, the network is trained.
Training set data are input into the CNUNet3+ network model to train it; the training set data comprise original images and images labeled with segmentation results as training set samples.
Third, the images to be segmented are preprocessed.
All images to be segmented are resized to the same size; if the number of input pictures is too small, data augmentation is performed.
Fourth, the trained CNUNet3+ network model is used to segment the images to be segmented.
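As an illustration only, the four steps might be organized as follows in PyTorch; the CNUNet3Plus class, the training hyperparameters and the 256 × 256 target size are hypothetical placeholders rather than values prescribed by the invention.

    import torch
    import torch.nn as nn
    from torchvision import transforms

    # Step 3 (preprocessing): force every image to one common size.
    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
    ])

    # Step 1: construct the model (depth N = 5 with three central nodes, as in Fig. 1).
    # CNUNet3Plus is a hypothetical implementation of the model described above.
    model = CNUNet3Plus(depth=5, num_central_nodes=3, in_channels=1, num_classes=2)

    # Step 2: train on pairs of original images and labeled segmentation masks.
    def train(model, loader, epochs=50):      # loader yields (image, mask) batches
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()       # mask holds per-pixel class labels
        for _ in range(epochs):
            for image, mask in loader:
                opt.zero_grad()
                loss_fn(model(image), mask).backward()
                opt.step()

    # Step 4: segment a new image with the trained model.
    @torch.no_grad()
    def segment(model, image):
        model.eval()
        return model(preprocess(image).unsqueeze(0)).argmax(dim=1)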
One embodiment of the invention uses the CNUNet3+ network model shown in Fig. 1. The circles in Fig. 1 represent neuron nodes, and the number below each circle is that node's number of output channels; for example, the number "64" below node $X_{En}^{1}$ indicates 64 output channels. The neuron nodes $X_{En}^{1}$, $X_{En}^{2}$, $X_{En}^{3}$, $X_{En}^{4}$ and $X_{En}^{5}$ form the left arm, and the left arm is the encoding (Encode) process. $X_{En}^{1}$ is connected to $X_{En}^{2}$, and the number of output channels doubles from 64 to 128; $X_{En}^{2}$ is connected to $X_{En}^{3}$, and the number of output channels doubles from 128 to 256; $X_{En}^{3}$ is connected to $X_{En}^{4}$, and the number of output channels doubles from 256 to 512; $X_{En}^{4}$ is connected to $X_{En}^{5}$, and the number of output channels doubles again from 512 to 1024.

The nodes $X_{En}^{5}$, $X_{De}^{4}$, $X_{De}^{3}$, $X_{De}^{2}$ and $X_{De}^{1}$ form the right arm, and the right arm is the decoding (Decode) process. $X_{En}^{5}$ is connected to $X_{De}^{4}$, and the number of output channels changes from 1024 to 320; from $X_{De}^{4}$ onward the number of output channels no longer changes and remains 320. The left arm and the right arm form the basic structure of UNet and of UNet3+, and likewise the main structure of the CNUNet3+ network of the invention.
The invention makes extensive use of down-sampling (Down Sampling) and up-sampling (Up Sampling). The arrows running downward between layers of different depths are all down-sampling operations: for example, from node $X_{En}^{1}$ to node $X_{Ce}^{2}$; from nodes $X_{En}^{1}$ and $X_{En}^{2}$ to node $X_{Ce}^{3}$; and from nodes $X_{Ce}^{1}$, $X_{Ce}^{2}$ and $X_{Ce}^{3}$ to node $X_{De}^{4}$, all of which are down-sampled.
The arrows running upward between layers of different depths are all up-sampling operations: for example, from node $X_{En}^{5}$ to node $X_{De}^{4}$; from nodes $X_{En}^{5}$ and $X_{De}^{4}$ to node $X_{De}^{3}$; from nodes $X_{En}^{5}$, $X_{De}^{4}$ and $X_{De}^{3}$ to node $X_{De}^{2}$; and from nodes $X_{En}^{5}$, $X_{De}^{4}$, $X_{De}^{3}$ and $X_{De}^{2}$ to node $X_{De}^{1}$, all of which are up-sampled.

The arrows between nodes at the same depth represent convolution operations, neither up-sampling nor down-sampling.
The central node is computed as:

$X_{Ce}^{i} = H\left(\left[D(X_{En}^{1}), \ldots, D(X_{En}^{i-1}), C(X_{En}^{i})\right]\right), \quad i = 1, 2, \ldots, M$

where $X_{Ce}^{i}$ represents the intermediate (central) neuron node at the i-th layer; $X_{En}^{i}$ represents the encoder node at the i-th layer; M is the number of central nodes; $D(\cdot)$ represents the down-sampling (Down Sampling) operation; $C(\cdot)$ represents the convolution (Convolution) operation; $[\cdot]$ represents the concatenation operation; and $H(\cdot)$ represents the hybrid integration operation, which first performs a convolution, then a batch normalization (Batch Normalization) operation, and then ELU (Exponential Linear Unit) activation.
The ReLU activation function is commonly used in convolutional neural networks, but neurons using ReLU tend to "die" during training, after which they can never be activated again. To avoid this problem, the invention preferably adopts the ELU activation function. The activation function here may be replaced by other activation functions, such as the ReLU, Leaky ReLU, PReLU, Softplus, Swish and GELU functions.
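For concreteness, the following is a minimal PyTorch sketch of the hybrid integration operation $H(\cdot)$ (convolution, then batch normalization, then ELU); the kernel size and channel arguments are illustrative assumptions, and the activation parameter shows how ELU could be swapped for one of the alternatives named above.

    import torch.nn as nn

    class HybridBlock(nn.Module):
        """Sketch of H: convolution -> batch normalization -> ELU activation."""
        def __init__(self, in_ch, out_ch, activation=nn.ELU):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = activation()  # ELU by default; ReLU, PReLU, GELU, ... also fit

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))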
When i = 3, the formula above gives the computation of central node $X_{Ce}^{3}$:

$X_{Ce}^{3} = H\left(\left[D(X_{En}^{1}), D(X_{En}^{2}), C(X_{En}^{3})\right]\right)$

When i = 2, it gives the computation of central node $X_{Ce}^{2}$:

$X_{Ce}^{2} = H\left(\left[D(X_{En}^{1}), C(X_{En}^{2})\right]\right)$

When i = 1, it gives the computation of central node $X_{Ce}^{1}$:

$X_{Ce}^{1} = H\left(C(X_{En}^{1})\right)$
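These three formulas can be assembled mechanically: pool each shallower encoder feature down to the i-th layer's resolution, convolve the same-depth feature, concatenate, and apply $H(\cdot)$. The sketch below assumes the HybridBlock defined earlier and max-pool down-sampling; the channel widths are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CentralNode(nn.Module):
        """Sketch of X_Ce^i = H([D(X_En^1), ..., D(X_En^{i-1}), C(X_En^i)])."""
        def __init__(self, encoder_channels, i, out_ch):
            super().__init__()
            self.i = i
            # C: convolution applied to the same-depth encoder feature X_En^i
            self.same_depth_conv = nn.Conv2d(encoder_channels[i - 1],
                                             encoder_channels[i - 1], 3, padding=1)
            # H: conv -> batch norm -> ELU over the concatenated branches
            self.H = HybridBlock(sum(encoder_channels[:i]), out_ch)

        def forward(self, encoder_feats):  # encoder_feats[k] holds X_En^{k+1}
            # D: each shallower feature is max-pooled down to layer i's resolution
            parts = [F.max_pool2d(encoder_feats[k], kernel_size=2 ** (self.i - 1 - k))
                     for k in range(self.i - 1)]
            parts.append(self.same_depth_conv(encoder_feats[self.i - 1]))
            return self.H(torch.cat(parts, dim=1))

    # For i = 3 with encoder widths (64, 128, 256, 512, 1024) this reproduces
    # X_Ce^3 = H([D(X_En^1), D(X_En^2), C(X_En^3)]).
    ce3 = CentralNode([64, 128, 256, 512, 1024], i=3, out_ch=256)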
in the invention, a maximum convergence method (Max clustering) is used for carrying out down-sampling operation. The maximum convergence method can be used to make the resolution of the images of adjacent layers different by 2 times, and when the resolution is reduced, the resolution is divided by 2. The difference of the image resolutions is 4 times every other layer, and the resolution is divided by 4 when the sampling is reduced; the resolution of two layers of images is different by 8 times, and the resolution is divided by 8 when the sampling is reduced.
For example, in Fig. 1, suppose the original picture size is 256 × 256; the picture size at the input node $X_{En}^{1}$ is then 256 × 256. Down-sampling from $X_{En}^{1}$ to $X_{En}^{2}$ divides the picture size by 2, so the picture size at $X_{En}^{2}$ is 128 × 128; down-sampling from $X_{En}^{2}$ to $X_{En}^{3}$ divides the picture size by 2, so the picture size at $X_{En}^{3}$ is 64 × 64; down-sampling from $X_{En}^{3}$ to $X_{En}^{4}$ divides the picture size by 2, so the picture size at $X_{En}^{4}$ is 32 × 32; and down-sampling from $X_{En}^{4}$ to $X_{En}^{5}$ divides the picture size by 2, so the picture size at $X_{En}^{5}$ is 16 × 16.
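This size chain is easy to verify with max pooling (a quick, illustrative check):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 64, 256, 256)         # feature map at X_En^1: 256 x 256
    for name in ("X_En^2", "X_En^3", "X_En^4", "X_En^5"):
        x = F.max_pool2d(x, kernel_size=2)   # each deeper layer halves the size
        print(name, tuple(x.shape[-2:]))     # (128,128), (64,64), (32,32), (16,16)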
Similarly, for the down-sampling from $X_{Ce}^{3}$ to $X_{De}^{4}$: if the picture size at $X_{Ce}^{3}$ is 64 × 64, the picture size is divided by 2 during down-sampling, giving 32 × 32 at $X_{De}^{4}$. For the down-sampling from $X_{Ce}^{2}$ to $X_{De}^{4}$: if the picture size at $X_{Ce}^{2}$ is 128 × 128, the picture size is divided by 4, giving 32 × 32 at $X_{De}^{4}$. For the down-sampling from $X_{Ce}^{1}$ to $X_{De}^{4}$: if the picture size at $X_{Ce}^{1}$ is 256 × 256, the picture size is divided by 8, giving 32 × 32 at $X_{De}^{4}$.
Pictures in the same layer must have the same size before they can be concatenated. Suppose the picture size at node $X_{De}^{4}$ is 32 × 32: up-sampling from $X_{En}^{5}$ to $X_{De}^{4}$ yields pictures of size 32 × 32; down-sampling from $X_{Ce}^{3}$ to $X_{De}^{4}$ also yields 32 × 32; down-sampling from $X_{Ce}^{2}$ to $X_{De}^{4}$ also yields 32 × 32; and down-sampling from $X_{Ce}^{1}$ to $X_{De}^{4}$ also yields 32 × 32. The features arriving at $X_{De}^{4}$ from different layers thus all have size 32 × 32, so they can be concatenated with one another.
The decoder nodes of the CNUNet3+ network are computed analogously, where $U(\cdot)$ represents the up-sampling (Up Sampling) operation: each decoder node concatenates, and then passes through $H(\cdot)$, the features of the central nodes (convolved at the same depth, down-sampled from shallower depths), the up-sampled features of all deeper decoder nodes, and the up-sampled feature of the deepest encoder node. For the embodiment of Fig. 1, for example,

$X_{De}^{4} = H\left(\left[D(X_{Ce}^{1}), D(X_{Ce}^{2}), D(X_{Ce}^{3}), U(X_{En}^{5})\right]\right)$
The invention uses bilinear interpolation (Bilinear Interpolation) for the up-sampling (Up Sampling) operation. With bilinear interpolation, the images of adjacent layers differ in resolution by a factor of 2, so up-sampling by one layer multiplies the resolution by 2; images two layers apart differ by a factor of 4, so the resolution is multiplied by 4 when up-sampling across two layers; and images three layers apart differ by a factor of 8, so the resolution is multiplied by 8 when up-sampling across three layers.
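In PyTorch, for instance, these factor-of-2, 4 and 8 bilinear up-samplings correspond to F.interpolate calls (shapes illustrative):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 320, 32, 32)   # e.g. a 32 x 32 feature map at X_De^4
    up2 = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    up4 = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
    up8 = F.interpolate(x, scale_factor=8, mode="bilinear", align_corners=False)
    print(up2.shape[-2:], up4.shape[-2:], up8.shape[-2:])  # 64x64, 128x128, 256x256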
Taking Fig. 1 as an example, suppose the picture size at node $X_{En}^{5}$ is 16 × 16. Up-sampling from $X_{En}^{5}$ to $X_{De}^{4}$ multiplies the picture size by 2, so the picture size at $X_{De}^{4}$ is 32 × 32; up-sampling from $X_{De}^{4}$ to $X_{De}^{3}$ multiplies the picture size by 2, so the picture size at $X_{De}^{3}$ is 64 × 64; up-sampling from $X_{De}^{3}$ to $X_{De}^{2}$ multiplies the picture size by 2, so the picture size at $X_{De}^{2}$ is 128 × 128; and up-sampling from $X_{De}^{2}$ to $X_{De}^{1}$ multiplies the picture size by 2, so the picture size at $X_{De}^{1}$ is 256 × 256.
Likewise, for the up-sampling from $X_{De}^{2}$ to $X_{De}^{1}$: if the picture size at $X_{De}^{2}$ is 128 × 128, the picture size is multiplied by 2 during up-sampling, giving 256 × 256 at $X_{De}^{1}$. For the up-sampling from $X_{De}^{3}$ to $X_{De}^{1}$: if the picture size at $X_{De}^{3}$ is 64 × 64, the picture size is multiplied by 4, giving 256 × 256 at $X_{De}^{1}$. For the up-sampling from $X_{De}^{4}$ to $X_{De}^{1}$: if the picture size at $X_{De}^{4}$ is 32 × 32, the picture size is multiplied by 8, giving 256 × 256 at $X_{De}^{1}$.
Pictures in the same layer must have the same size before they can be concatenated. Suppose the picture size at node $X_{De}^{1}$ is 256 × 256: up-sampling from $X_{En}^{5}$ to $X_{De}^{1}$ yields pictures of size 256 × 256; up-sampling from $X_{De}^{4}$ to $X_{De}^{1}$ also yields 256 × 256; up-sampling from $X_{De}^{3}$ to $X_{De}^{1}$ also yields 256 × 256; and up-sampling from $X_{De}^{2}$ to $X_{De}^{1}$ also yields 256 × 256. The features arriving at $X_{De}^{1}$ from different nodes thus all have size 256 × 256, so they can be concatenated with one another.
Fig. 2 shows the CNUNet3+ network model used in another embodiment. Unlike Fig. 1, which has three central nodes, Fig. 2 has two. As before, the number below each circle is the number of channels.
For this embodiment, the computation formula of the central nodes remains

$X_{Ce}^{i} = H\left(\left[D(X_{En}^{1}), \ldots, D(X_{En}^{i-1}), C(X_{En}^{i})\right]\right), \quad i = 1, 2, \ldots, M$

The computation of the decoder nodes, however, differs slightly, since only the two central nodes $X_{Ce}^{1}$ and $X_{Ce}^{2}$ now feed the decoder.
Fig. 3 is the CNUNet3+ network model diagram of another embodiment of the invention. Unlike Figs. 1 and 2, Fig. 3 has only a single central node, and the computation formula of each node changes accordingly.
The depth of the CNUNet3+ network model may also vary. When the objects being processed are not very complicated, or to increase processing speed, the encoding depth can be reduced from the five layers shown in Figs. 1, 2 and 3 to four layers.
Fig. 4 shows a CNUNet3+ network model according to another embodiment of the invention, in which the encoding depth is four layers and there are two central nodes.
Fig. 5 shows a CNUNet3+ network model according to another embodiment of the invention, in which the encoding depth is four layers and there is one central node.
The encoder may also be reduced to three layers of depth when the pictures being processed are small or fast processing is required.
When the pictures being processed are large, the encoder can be extended to six layers of depth; with a six-layer encoder, the number of central nodes may be 1, 2, 3 or 4.
When the pictures are larger still, the encoder depth can be increased to seven layers or more, up to several tens of layers. In general, assuming the encoder depth is N, the number of central nodes may range from 1 to N−2.
While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. An image semantic segmentation method based on a CNUNet3+ network, the method comprising:
constructing a CNUNet3+ network model, wherein the CNUNet3+ network model has a U-shaped structure and adopts an encoder-decoder structure; its depth is N, where N is a positive integer greater than or equal to 3; the left arm of the U-shaped structure is the encoder and the right arm is the decoder; the number of channels doubles each time the encoder goes one layer deeper; down-sampling is used between adjacent encoder nodes, up-sampling between adjacent decoder nodes, and dense connections between decoder nodes; a central node is inserted in each of the first through L-th layers, with 1 ≤ L ≤ N−2; a convolution operation is used when an encoder node and a central node are at the same depth, and a down-sampling operation when they are at different depths; a convolution operation is used when a central node and a decoder node are at the same depth, and a down-sampling operation when they are at different depths; convolution is used between encoder and decoder nodes in layers without a central node; and the encoder nodes, central nodes and decoder nodes are all neuron nodes;
inputting training set data into the CNUNet3+ network model and training the CNUNet3+ network model, wherein the training set data comprise original images and images labeled with segmentation results as training set samples;
preprocessing the images to be segmented so that they all have the same size; and
segmenting the images to be segmented with the trained CNUNet3+ network model.
2. The image semantic segmentation method according to claim 1, wherein the central node is computed as:

$X_{Ce}^{i} = H\left(\left[D(X_{En}^{1}), \ldots, D(X_{En}^{i-1}), C(X_{En}^{i})\right]\right)$

wherein $X_{Ce}^{i}$ represents the central node at the i-th layer; $X_{En}^{i}$ represents the encoder node at the i-th layer; M represents the number of central nodes; $D(\cdot)$ represents the down-sampling operation; $C(\cdot)$ represents the convolution operation; $[\cdot]$ represents the concatenation operation; $H(\cdot)$ represents the hybrid integration operation, which first performs a convolution, then a batch normalization operation, and then ELU activation; and in the formula $i = 1, 2, \ldots, M$.
3. The image semantic segmentation method according to claim 1, wherein the CNUNet3+ network model employs an ELU activation function.
4. The image semantic segmentation method according to claim 1, wherein the down-sampling operation uses max pooling.
5. The image semantic segmentation method according to claim 1, wherein the up-sampling operation uses bilinear interpolation.
CN202210118688.8A 2022-02-08 2022-02-08 Image semantic segmentation method based on CNUNet3+ network Active CN114170249B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210118688.8A CN114170249B (en) 2022-02-08 2022-02-08 Image semantic segmentation method based on CNUNet3+ network
NL2032957A NL2032957B1 (en) 2022-02-08 2022-09-05 SEMANTIC IMAGE SEGMENTATION METHOD BASED ON CNUNet3+

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210118688.8A CN114170249B (en) 2022-02-08 2022-02-08 Image semantic segmentation method based on CNUNet3+ network

Publications (2)

Publication Number Publication Date
CN114170249A true CN114170249A (en) 2022-03-11
CN114170249B CN114170249B (en) 2022-04-19

Family

ID=80489504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210118688.8A Active CN114170249B (en) 2022-02-08 2022-02-08 Image semantic segmentation method based on CNUNet3+ network

Country Status (2)

Country Link
CN (1) CN114170249B (en)
NL (1) NL2032957B1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074271A1 (en) * 2018-08-29 2020-03-05 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for implementing a multi-resolution neural network for use with imaging intensive applications including medical imaging
CN110516740A (en) * 2019-08-28 2019-11-29 电子科技大学 A kind of fault recognizing method based on Unet++ convolutional neural networks
CN113240691A (en) * 2021-06-10 2021-08-10 南京邮电大学 Medical image segmentation method based on U-shaped network
CN113762264A (en) * 2021-08-26 2021-12-07 南京航空航天大学 Multi-encoder fused multispectral image semantic segmentation method
CN113689419A (en) * 2021-09-03 2021-11-23 电子科技大学长三角研究院(衢州) Image segmentation processing method based on artificial intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUIMIN HUANG ET AL.: "UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation", ICASSP *

Also Published As

Publication number Publication date
NL2032957B1 (en) 2024-01-29
CN114170249B (en) 2022-04-19
NL2032957A (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN111627019B (en) Liver tumor segmentation method and system based on convolutional neural network
Zhao et al. SCOAT-Net: A novel network for segmenting COVID-19 lung opacification from CT images
CN113674253A Rectal cancer CT image automatic segmentation method based on U-Transformer
CN111368849A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112396605B (en) Network training method and device, image recognition method and electronic equipment
CN114066866A (en) Medical image automatic segmentation method based on deep learning
CN113205524B (en) Blood vessel image segmentation method, device and equipment based on U-Net
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN115471470A (en) Esophageal cancer CT image segmentation method
CN113436173A (en) Abdomen multi-organ segmentation modeling and segmentation method and system based on edge perception
CN112381846A (en) Ultrasonic thyroid nodule segmentation method based on asymmetric network
Sun et al. COVID-19 CT image segmentation method based on swin transformer
CN114612662A (en) Polyp image segmentation method based on boundary guidance
Ruan et al. An efficient tongue segmentation model based on u-net framework
CN114170249B (en) Image semantic segmentation method based on CNUNet3+ network
CN111507950B (en) Image segmentation method and device, electronic equipment and computer-readable storage medium
Wang et al. Accurate lung nodule segmentation with detailed representation transfer and soft mask supervision
CN113538363A (en) Lung medical image segmentation method and device based on improved U-Net
CN117934824A (en) Target region segmentation method and system for ultrasonic image and electronic equipment
CN111755131A (en) COVID-19 early screening and severity degree evaluation method and system based on attention guidance
CN116168052A (en) Gastric cancer pathological image segmentation method combining self-adaptive attention and feature pyramid
CN115908438A (en) CT image focus segmentation method, system and equipment based on deep supervised ensemble learning
CN115965785A (en) Image segmentation method, device, equipment, program product and medium
CN115690115A (en) Lung medical image segmentation method based on reconstruction pre-training
CN114119558A (en) Method for automatically generating nasopharyngeal carcinoma image diagnosis structured report

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant