CN114119474A - Method for automatically segmenting human tissues in ultrasonic image through deep learning - Google Patents

Method for automatically segmenting human tissues in ultrasonic image through deep learning Download PDF

Info

Publication number
CN114119474A
CN114119474A (application CN202111232141.2A)
Authority
CN
China
Prior art date
Legal status
Pending
Application number
CN202111232141.2A
Other languages
Chinese (zh)
Inventor
孔维真
Current Assignee
Shanghai Wumei Technology Co ltd
Original Assignee
Shanghai Wumei Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Wumei Technology Co ltd filed Critical Shanghai Wumei Technology Co ltd
Priority to CN202111232141.2A
Publication of CN114119474A

Classifications

    • G06T 7/0012: Biomedical image inspection
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Neural network learning methods
    • G06T 7/11: Region-based segmentation
    • G06T 2207/10132: Ultrasound image (acquisition modality)
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20132: Image cropping
    • G06T 2207/30004: Biomedical image processing


Abstract

The invention relates to medical image processing technology, and aims to provide a method for automatically segmenting human tissues in ultrasound images through deep learning. The method comprises the following steps: establishing training and test sample sets of human-tissue ultrasound images; constructing the encoding network of the segmentation network, with the batch-normalization-free NFNet as the base network; constructing the decoding network of the segmentation network; cross-training segmentation models; testing and evaluating the performance of the multiple models; and performing real-time segmentation after model fusion. A segmentation network without batch normalization layers is designed on the basis of NFNet-F0, avoiding the performance degradation that networks with batch normalization layers suffer when segmentation training is restricted to small batches; training on complete images lets the model exploit all context information, improving both segmentation accuracy and inference speed in application.

Description

Method for automatically segmenting human tissues in ultrasonic image through deep learning
Technical Field
The invention relates to medical image processing technology, in particular to a method for automatically segmenting human tissues (skin, fat, fascia, muscle, bone, internal organs and the like) in ultrasound images using deep learning, and particularly to the application of a convolutional neural network without batch normalization layers to automatic ultrasound image segmentation.
Background
Fast automatic segmentation of regions of interest in ultrasound images has important application value, and many research results and practical applications already exist for identifying internal organs, tumors, nodules and the like in the human body.
When applying ultrasound therapy, the positions of different human tissues must be determined promptly during treatment. If tissues such as skin, fat, fascia, muscle, bone and internal organs can be segmented simultaneously and rapidly in the ultrasound image, changes in thickness, area and the like of the various tissues during treatment can be computed in real time, which shortens the operation time, improves treatment precision, and prevents accidental injury to normal tissues. However, because ultrasound images contain abundant speckle texture and artifacts and their echoes are uneven, existing ultrasound image segmentation techniques identify only internal organs, tumors, nodules and the like reasonably well, and cannot meet the requirement of rapidly and accurately segmenting skin, fat, fascia, muscle and bone.
Existing research shows that segmentation algorithms based on convolutional neural networks work well on ultrasound images, and in general the deeper the network, the better the segmentation. Deep network structures mainly comprise convolutional layers, pooling layers, activation layers and batch normalization layers; the batch normalization layer mitigates internal covariate shift, prevents gradient vanishing, allows a larger learning rate for faster convergence, and has a regularizing effect. However, batch normalization also has drawbacks: it noticeably increases model training time, and its behavior differs between training and inference. Moreover, it is very sensitive to batch size: too small a batch degrades model performance, while limited hardware resources cap the largest usable model. When training ultrasound image segmentation with a larger model, feeding whole images lets the network learn more context information, improving segmentation accuracy and inference speed; but when both the model and the input images are large, only a tiny batch size fits in memory, so the final model performs poorly. Such models clearly cannot meet the requirement of rapid, accurate segmentation of skin, fat, fascia, muscle and bone.
To remove the batch normalization layer while still providing its function, NFNet suppresses the activation scale on the residual branch at initialization by introducing a small constant or a learnable scalar, and eliminates the mean-shift phenomenon in the hidden activations through scaled weight standardization. NFNet achieves higher accuracy than a ResNet with the same number of parameters and can train deeper network structures. At present, NFNet is mainly used for natural image recognition and has not been applied to image segmentation tasks; when conventional segmentation network construction methods are used to simultaneously recognize and segment human tissues such as skin, fat, fascia, muscle, bone and internal organs in ultrasound images, problems arise such as difficulty in designing the decoding network and incomplete or inaccurate segmentation of long, narrow tissue regions, which greatly trouble researchers.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide a method for automatically segmenting human tissues in ultrasound images through deep learning.
In order to solve the technical problem, the solution of the invention is as follows:
a method for automatically segmenting human tissues in an ultrasonic image by deep learning is provided, which comprises the following steps:
(1) establishing an ultrasonic image training and testing sample set of human tissues;
(2) taking the NFNet without the batch normalization layer as a basic network, and constructing a coding network structure of a segmentation network;
(3) constructing a decoding network structure of the segmentation network;
(4) cross training a segmentation model;
(5) testing and evaluating the performance of multiple models;
(6) and performing real-time segmentation after model fusion.
As a preferred embodiment of the present invention, the step (1) comprises:
(1.1) collecting an ultrasound image containing human tissue, which is skin, fat, fascia, muscle, bone and internal organs; cutting out an ultrasonic image area on the image, removing a non-ultrasonic image area, and renaming the image file;
(1.2) delineating a region contour of skin, fat, fascia, muscle, bone or internal organs on the ultrasonic image, and generating a mask image as a real label image of ultrasonic image segmentation;
(1.3) taking the ultrasonic images of various human tissues and the corresponding label images as units, randomly dividing all data into multiple parts, taking 1 part as a test sample set, and taking the rest as a training sample set.
As a preferable aspect of the present invention, the step (2) includes:
(2.1) selecting NFNet-F0 as the base network, with a ReLU containing a fixed scale factor as the activation function; modifying the kernel size of the last convolutional layer of NFNet-F0 to 3 × 3, adjusting the number of output channels to 2048, and keeping the output feature-map size unchanged; removing the final global pooling layer, Dropout layer and fully connected layer of NFNet-F0;
(2.2) training the modified NFNet-F0 network on a public data set, cutting an image area by adopting a random area and an aspect ratio in the training process, using random data augmentation, label smoothing, MixUp and CutMix as data input regularization strategies, and selecting a model parameter with the highest accuracy on a verification set matched with the public data set;
(2.3) using the model parameters trained in the step (2.2) to reinitialize the NFNet-F0 network modified in the step (2.1) as the coding network structure of the segmented network.
As a preferred embodiment of the present invention, in the step (2.2), the modified NFNet-F0 network is trained on the ImageNet data set of the ILSVRC2012 competition, and the accuracy of the model parameters is verified on the ImageNet verification set.
As a preferable aspect of the present invention, the step (3) includes:
(3.1) connecting 1 double-size upsampling layer after the output layer of the encoding part of the segmentation network, enlarging the feature-map size with the double-size upsampling method;
(3.2) connecting 1 3 × 3 convolutional layer after the double-size upsampling layer so that the network learns to decode the upsampled segmentation features, and adjusting the output feature-map size with 1 further double-size upsampling layer after that convolutional layer; then using 1 more 3 × 3 convolutional layer to strengthen the feature learning of the decoding network; finally enlarging the feature map with 1 double-size upsampling layer, outputting 6 feature channels with 1 convolutional layer, and producing the segmentation probability map after softmax mapping; the decoding network and the encoding network have no skip connections;
(3.3) with the decoding network structure of the segmentation network completed according to steps (3.1) and (3.2), at each training step the real label image corresponding to the input image is downsampled by a factor of 4 using nearest-neighbor interpolation, so that the downsampled label image has the same size as the probability map output by the decoding network, allowing the loss function to be computed and gradients to be back-propagated.
As a preferable aspect of the present invention, the step (4) includes:
(4.1) cross-training segmentation models on the training set: first randomly and evenly dividing the ultrasound image training set into at least 4 parts; each time, selecting 1 part as the validation set and using the remaining parts as the training set;
(4.2) because the ultrasound images are of non-uniform size, randomly setting the scale of the longest image edge at each training step; if the width or height of the image is smaller than a preset value, applying random-translation padding; randomly cropping an image region of the specified size from the padded image, augmenting the data with horizontal mirroring and random brightness, contrast and sharpness transformations, and feeding it into the network;
(4.3) using softmax cross entropy as the loss function, optimizing it by stochastic gradient descent, and training for a number of epochs on the training set; after cross-training is finished, selecting for each training run the model with the highest mean intersection-over-union on the validation set, giving at least 4 segmentation models.
As a preferable aspect of the present invention, the step (5) includes: testing the trained model on the test set, and evaluating the performance of the model; the method specifically comprises the following steps:
after the segmentation probability map is output, applying the inverse of the input-image preprocessing to the probability map to restore it to the original image size, obtaining a segmentation prediction for each pixel; on the test set, predicting each image with at least 4 models and taking the mean prediction as the final segmentation result; choosing a suitable threshold to compute the mean intersection-over-union, and selecting the training run with the highest mean intersection-over-union as the segmentation model.
As a preferable aspect of the present invention, the step (6) includes:
(6.1) after adjusting the parameters, repeating steps (4) and (5); connecting the convolutional-layer parameters of the multiple models using group convolution, keeping the total number of parameters unchanged after merging, to obtain the final segmentation model;
(6.2) acquiring ultrasound images in real time, feeding them into the final segmentation model, and obtaining real-time segmentation results for the different human tissues in the ultrasound image.
The invention also provides an implementation method for the deep-learning automatic segmentation model, which takes the batch-normalization-free NFNet as the basic network structure and, after modifying the kernel size and output channel count of its last convolutional layer, uses it as the encoding network of the model; the decoding part of the model uses repeated, alternating double-size upsampling and convolution operations without skip connections; the decoding network upsamples by a factor of 8 in total, and during model training the real label image is downsampled by a factor of 4, the loss function is computed against the decoding network's output, gradients are back-propagated, and the final segmentation model is obtained after the parameters are updated.
Description of the inventive principles:
the invention innovatively provides the following steps by researching and using NFNet and constructing an ultrasonic image segmentation network: the modified NFNet without the batch normalization layer is used as a coding part of the segmentation network, a decoding network structure without layer jump connection and the batch normalization layer is designed, the video memory consumption and the calculation amount are reduced while all coding network output information is kept, and the whole image training can be performed in a small batch. Therefore, the method can perfectly solve the problem of the batch normalization layer in the deep network model training, and improve the ultrasound image segmentation accuracy and the reasoning speed in application.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, a segmentation network structure without a batch normalization layer is designed based on NFNet-F0, the problem that the performance of a model is deteriorated due to the fact that a smaller batch is used in the segmentation training of the network structure with the batch normalization layer is solved, and the accuracy of the segmentation model and the speed of application inference are improved by using all context information through complete image training.
Drawings
Fig. 1 is a diagram of a batch-free normalization layer segmentation network structure used in an embodiment of the present invention.
FIG. 2 is an illustration of ultrasound image artwork and delineated contour images of various tissue regions used in accordance with an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The examples are intended to give those skilled in the art a more complete understanding of the invention, not to limit it in any way.
The embodiment provides an implementation method for a deep-learning automatic segmentation model, which takes the batch-normalization-free NFNet as the basic network structure and, after modifying the kernel size and output channel count of its last convolutional layer, uses it as the encoding network of the model; the decoding part of the model uses repeated, alternating double-size upsampling and convolution operations without skip connections; the decoding network upsamples by a factor of 8 in total, and during model training the real label image is downsampled by a factor of 4, the loss function is computed against the decoding network's output, gradients are back-propagated, and the final segmentation model is obtained after the parameters are updated.
Based on the deep learning automatic segmentation model, the method for deep learning automatic segmentation of human body tissues in ultrasonic images comprises the following specific steps:
Step one, establishing the ultrasound image training and test sample sets
(1.1) collecting an ultrasonic image containing human tissues, cutting out an ultrasonic image area on the image, removing a non-ultrasonic image area, and renaming an image file.
Part of an ultrasound image used in this embodiment is shown on the left of Fig. 2. The ultrasound image includes skin, fat, fascia, muscle and bone, and the delineation operations described below likewise concern these tissues. Since identification and segmentation techniques for internal organs are well established and the procedure is similar to the one described below, the invention does not illustrate them separately.
(1.2) delineating the regions of skin, fat, fascia, muscle and bone on the ultrasound image according to conventional standards, and generating a mask image as the real label image for ultrasound image segmentation. In the generated label image, the background is denoted by 0, the skin region by 1, the fat region by 2, the fascia region by 3, the muscle region by 4 and the bone region by 5, for 6 classes in total. The delineated tissue regions are shown on the right of Fig. 2: from top to bottom, skin, fat, fascia, muscle and bone, each indicated by a different color.
Delineation is performed manually according to conventional medical image recognition rules; in general, certified physicians in a hospital ultrasound department can complete this work.
(1.3) taking the ultrasonic images of various human tissues (skin, fat, fascia, muscle, bone, internal organs and the like) and the corresponding label images as units, randomly dividing all data into a plurality of parts, taking 1 part of the data as a test sample set, and taking the rest as a training sample set.
Step two, constructing the encoding network structure of the segmentation network
(2.1) NFNet-F0 is selected as the base network, with a ReLU containing a fixed scale factor as the activation function. In the classification version of NFNet-F0, the last convolutional layer has a 1 × 1 kernel and 3072 output channels; because the main function of a 1 × 1 convolution is to adjust the channel count, it has little feature-expression capability yet occupies a large amount of GPU memory in a segmentation network. The kernel size of this last convolutional layer is therefore changed to 3 × 3, the number of output channels is reduced to 2048, and the output feature-map size remains unchanged. The final global pooling layer, Dropout layer and fully connected layer of NFNet-F0 are removed.
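The head replacement in step (2.1) can be sketched in PyTorch. This is a minimal illustration, not the patent's code: the NFNet-F0 backbone is stubbed out (in practice it might come from a library such as `timm`, which is an assumption), and the 1536 input-channel count of the head is likewise assumed; the patent only fixes the 3072-channel 1 × 1 original and the 3 × 3, 2048-channel replacement.

```python
import torch
import torch.nn as nn

# Original classification head: 1x1 kernel, 3072 output channels.
# (1536 input channels is our assumption about the pre-head feature width.)
old_head = nn.Conv2d(1536, 3072, kernel_size=1)

# Modified head for segmentation: 3x3 kernel, 2048 output channels,
# padding=1 so the spatial size of the feature map is unchanged.
new_head = nn.Conv2d(1536, 2048, kernel_size=3, padding=1)

# Stand-in for the pre-head NFNet features of an 896x896 input
# (32x total downsampling gives a 28x28 map).
feat = torch.randn(1, 1536, 28, 28)
print(tuple(old_head(feat).shape))  # -> (1, 3072, 28, 28)
print(tuple(new_head(feat).shape))  # -> (1, 2048, 28, 28)
```

Both heads keep the 28 × 28 spatial size; only the kernel size and channel count differ, matching the modification described above.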
(2.2) The modified NFNet-F0 network is trained on the ImageNet data set of the ILSVRC2012 competition with a network input size of 192 × 192. During training, image regions are cropped with random area and aspect ratio, and random data augmentation, label smoothing, MixUp and CutMix serve as input regularization strategies. Training runs for 350 epochs in total, and the model parameters with the highest accuracy on the ImageNet validation set are selected.
(2.3) The model parameters trained in step (2.2) are used to re-initialize the NFNet-F0 network modified in step (2.1), which serves as the encoding network of the segmentation network. The segmentation network's input image size is 896 × 896, and the encoding network downsamples the feature map by a factor of 32 in total, i.e., the last convolutional layer outputs a 28 × 28 feature map.
Step three, constructing the decoding network structure of the segmentation network
(3.1) The feature-map size is enlarged with the double-size upsampling method. The double-size upsampling layer replaces bilinear-interpolation and deconvolution upsampling layers: it doubles the height and width of the input feature map and reduces the channel count to 1/4, keeping the total data volume of the feature map unchanged. One double-size upsampling layer is connected after the output layer of the encoding part of the segmentation network; the output feature map becomes 56 × 56 with 512 channels. Because double-size upsampling uses every feature value and introduces no duplicated information, it greatly reduces computation and GPU-memory consumption.
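The double-size upsampling described above behaves like the standard pixel-shuffle (sub-pixel) rearrangement: spatial size doubles, channels shrink to 1/4, and the total element count is preserved. A minimal numpy sketch of that interpretation (the function name is ours):

```python
import numpy as np

def double_size_upsample(x):
    """Pixel-shuffle style 2x upsampling: (N, C, H, W) -> (N, C//4, 2H, 2W).

    Every group of 4 channels is rearranged into a 2x2 spatial block, so the
    total number of feature values is unchanged and no data is duplicated.
    """
    n, c, h, w = x.shape
    assert c % 4 == 0
    x = x.reshape(n, c // 4, 2, 2, h, w)   # split channels into 2x2 blocks
    x = x.transpose(0, 1, 4, 2, 5, 3)      # interleave blocks with pixels
    return x.reshape(n, c // 4, 2 * h, 2 * w)

# Encoder output for an 896x896 input: 28x28 feature map with 2048 channels.
x = np.random.randn(1, 2048, 28, 28)
y = double_size_upsample(x)
print(y.shape)           # -> (1, 512, 56, 56): channels /4, spatial x2
assert x.size == y.size  # total data volume is preserved
```

This reproduces the 2048 × 28 × 28 → 512 × 56 × 56 transition stated above; in PyTorch the same operation is available as `nn.PixelShuffle(2)`.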
(3.2) A 3 × 3 convolutional layer is connected after the double-size upsampling layer so that the network learns to decode the upsampled segmentation features; after it, another double-size upsampling layer brings the output feature map to 112 × 112 with 128 channels. Then one more 3 × 3 convolutional layer with 128 output channels strengthens the decoding network's feature learning; finally a double-size upsampling layer enlarges the feature map to 224 × 224, a convolutional layer outputs 6 feature channels, and the segmentation probability map is produced after softmax mapping. There are no skip connections between the decoding and encoding networks.
(3.3) With the decoding network completed according to steps (3.1) and (3.2), the segmentation probability map output by the decoding network is 224 × 224, i.e., 1/4 of the input image size. At each training step the real label image corresponding to the input image is downsampled by a factor of 4 using nearest-neighbor interpolation, so that the downsampled label image matches the size of the decoding network's output probability map, allowing the loss to be computed and gradients back-propagated.
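The loss step in (3.3) can be sketched with random data: a 6-channel map at 1/4 resolution is compared against a nearest-neighbor-downsampled label. Top-left sampling is one valid nearest-neighbor convention and is an assumption here; the patent does not fix the sample position.

```python
import numpy as np

rng = np.random.default_rng(0)

logits = rng.standard_normal((6, 224, 224))       # decoder output (pre-softmax)
label_full = rng.integers(0, 6, size=(896, 896))  # classes 0..5 (background..bone)

# Nearest-neighbour 4x downsampling is just strided sampling of the label map.
label_small = label_full[::4, ::4]
assert label_small.shape == logits.shape[1:]

# Per-pixel softmax cross entropy, averaged over all pixels.
z = logits - logits.max(axis=0, keepdims=True)    # numerical stability
log_softmax = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
rows = np.arange(224)[:, None]
cols = np.arange(224)[None, :]
loss = -log_softmax[label_small, rows, cols].mean()
print(float(loss) > 0)  # -> True (cross entropy is positive)
```

The downsampled label and the probability map share the 224 × 224 grid, so the loss is a straightforward per-pixel average.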
Step four, cross training segmentation model
(4.1) Segmentation models are trained on the training set with 4-fold cross-validation: the ultrasound image training set is first divided randomly and evenly into 4 parts; each time, 3 parts are used for training and the remaining 1 part serves as the validation set.
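The 4-fold split in (4.1) amounts to shuffling once and rotating the validation part. A minimal sketch (the function name and seed are ours):

```python
import random

def four_fold_splits(items, k=4, seed=0):
    """Shuffle once, cut into k near-equal parts, and yield (train, val)
    pairs where each part serves as the validation set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, folds[i]

splits = list(four_fold_splits(range(100)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # -> 4 75 25
```

Each of the 4 runs trains on 75% of the data and validates on the remaining 25%, matching the 3-parts-train / 1-part-validate scheme above.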
(4.2) Because the ultrasound images are of non-uniform size, the scale of the longest image edge is set randomly at each training step within the range [627, 1164]. If the resulting width or height is smaller than 896, random-translation padding is applied so that neither falls below 896; an 896 × 896 region is then cropped randomly from the padded image, augmented with horizontal mirroring and random brightness, contrast and sharpness transformations, and fed into the network.
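The geometry of step (4.2) can be sketched as follows. Nearest-neighbor resizing is used here only to keep the sketch self-contained; the patent does not fix the interpolation for the image itself, and the photometric augmentations are omitted.

```python
import numpy as np

def random_scale_pad_crop(img, crop=896, lo=627, hi=1164, rng=None):
    """Random longest-edge scale in [lo, hi], random-translation padding so a
    crop x crop window fits, then a random crop. img: (H, W) array."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape
    target = int(rng.integers(lo, hi + 1))       # random longest-edge length
    scale = target / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    img = img[rows[:, None], cols[None, :]]      # nearest-neighbour resize
    ph, pw = max(nh, crop), max(nw, crop)        # pad so an 896x896 crop fits
    pad = np.zeros((ph, pw), dtype=img.dtype)
    oy = int(rng.integers(0, ph - nh + 1))       # random translation offset
    ox = int(rng.integers(0, pw - nw + 1))
    pad[oy:oy + nh, ox:ox + nw] = img
    y0 = int(rng.integers(0, ph - crop + 1))
    x0 = int(rng.integers(0, pw - crop + 1))
    return pad[y0:y0 + crop, x0:x0 + crop]

out = random_scale_pad_crop(np.ones((600, 800)))
print(out.shape)  # -> (896, 896)
```

Whatever the original size, the network always receives an 896 × 896 input, which matches the encoder's 32× downsampling to a 28 × 28 feature map.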
(4.3) Softmax cross entropy is used as the loss function and optimized by stochastic gradient descent with an initial learning rate of 0.01, polynomial decay of the learning rate, weight decay of 2e-5 and momentum of 0.9; the per-device batch size is 2 and the total batch size is 40, and training runs for 50 epochs on the training set. After cross-training is finished, the 4 models with the highest mean intersection-over-union on their validation sets are selected as the segmentation models.
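The polynomial learning-rate decay in (4.3) starts at the stated 0.01 and decays toward zero; the exponent 0.9 and the per-epoch step count are assumptions, since the patent only says "polynomial decay".

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9, end_lr=0.0):
    """Polynomial learning-rate decay as commonly used for segmentation.
    power=0.9 is an assumed value; the patent does not specify it."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr

total = 50 * 1000  # 50 epochs x (assumed) 1000 iterations per epoch
print(poly_lr(0, total))      # -> 0.01 at the start
print(poly_lr(total, total))  # -> 0.0 at the end
```

In PyTorch the remaining hyperparameters would map onto `torch.optim.SGD(..., lr=0.01, momentum=0.9, weight_decay=2e-5)` with this schedule applied per step.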
Step five, multi-model testing and performance evaluation
After model training is finished, the models are tested on the test set to evaluate their performance.
At test time the longest image edge is scaled to 896 with the aspect ratio kept unchanged, and the short side is padded equally on both ends so that the padded edge length is 896. After the segmentation probability map is output, the inverse of the input preprocessing is applied to restore it to the original image size, yielding a segmentation prediction for each pixel. On the test set each image is predicted by the 4 models and the mean prediction is taken as the final segmentation result; a suitable threshold is chosen to compute the mean intersection-over-union, and the training run with the highest mean intersection-over-union is selected as the segmentation model.
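The test-time letterboxing and its inverse can be sketched with sizes alone (function names are ours; rounding conventions are an assumption):

```python
def letterbox_params(w, h, size=896):
    """Scale the longest edge to `size`, keep aspect ratio, pad the short
    side equally on both ends. Returns (scaled_w, scaled_h, pad_left, pad_top)."""
    scale = size / max(w, h)
    sw, sh = round(w * scale), round(h * scale)
    return sw, sh, (size - sw) // 2, (size - sh) // 2

def to_original(x, y, w, h, size=896):
    """The 'inverse operation': un-pad, then un-scale, mapping a pixel of
    the network output back to original-image coordinates."""
    sw, sh, px, py = letterbox_params(w, h, size)
    scale = size / max(w, h)
    return (x - px) / scale, (y - py) / scale

sw, sh, px, py = letterbox_params(1120, 700)
print(sw, sh, px, py)                 # -> 896 560 0 168
print(to_original(0, 168, 1120, 700)) # -> (0.0, 0.0): top-left maps back exactly
```

Applying `to_original` to every output pixel restores the probability map to the source resolution so per-pixel predictions line up with the original image.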
Step six, performing real-time segmentation after model fusion
(6.1) Parameters are adjusted and steps (4) and (5) are repeated to obtain the final segmentation model.
In practical application, although a single segmentation network infers quickly, running 4 models sequentially takes too long to meet real-time requirements, so the 4 models must be fused. Because the convolutional-layer output feature maps of the models are mutually independent at inference time, the convolutional-layer parameters of the 4 models can be concatenated using group convolution; the total parameter count is unchanged after merging, and a single forward pass yields the final segmentation result.
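The group-convolution fusion can be verified on one layer: stacking four models' kernels along the output-channel axis and running a single grouped convolution on a 4-way replicated input reproduces the four separate outputs exactly. A PyTorch sketch (layer sizes are illustrative, not the patent's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_models, cin, cout = 4, 16, 6
kernels = [torch.randn(cout, cin, 3, 3) for _ in range(n_models)]
x = torch.randn(1, cin, 28, 28)

# Separate inference, then average: the slow 4-pass ensemble.
sep = torch.stack([F.conv2d(x, k, padding=1) for k in kernels]).mean(dim=0)

# Fused inference: one grouped convolution over the replicated input.
# Concatenation keeps the total parameter count equal to the sum of the
# four originals; groups=4 keeps each model on its own input slice.
fused_k = torch.cat(kernels, dim=0)              # (4*cout, cin, 3, 3)
x4 = x.repeat(1, n_models, 1, 1)                 # (1, 4*cin, 28, 28)
out = F.conv2d(x4, fused_k, padding=1, groups=n_models)
fused = out.view(1, n_models, cout, 28, 28).mean(dim=1)

print(torch.allclose(sep, fused, atol=1e-5))  # -> True
```

One forward pass through the fused layer therefore yields all four models' outputs at once, which is what makes the averaged ensemble fast enough for real-time use.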
(6.2) Ultrasound images to be segmented are acquired in real time and fed into the final segmentation model, yielding real-time segmentation results for skin, fat, fascia, muscle, bone and internal organs in the image (shown in Fig. 2).
Finally, it should be noted that the above describes only specific embodiments of the invention. The invention is obviously not limited to these embodiments; many modifications and application scenarios are possible, and all modifications that a person skilled in the art can derive or suggest from this disclosure fall within the protection scope of the invention.

Claims (9)

1. A method for automatically segmenting human tissues in an ultrasonic image by deep learning is characterized by comprising the following steps:
(1) establishing an ultrasonic image training and testing sample set of human tissues;
(2) taking the NFNet without the batch normalization layer as a basic network, and constructing a coding network structure of a segmentation network;
(3) constructing a decoding network structure of the segmentation network;
(4) cross training a segmentation model;
(5) testing and evaluating the performance of multiple models;
(6) performing real-time segmentation after model fusion.
2. The method of claim 1, wherein step (1) comprises:
(1.1) collecting an ultrasonic image of human tissue, cutting out an ultrasonic image area on the image, removing a non-ultrasonic image area, and renaming an image file; the human tissue refers to skin, fat, fascia, muscle, bone and internal organs;
(1.2) delineating a region contour of skin, fat, fascia, muscle, bone or internal organs on the ultrasonic image, and generating a mask image as a real label image of ultrasonic image segmentation;
(1.3) taking various human tissue ultrasonic images and corresponding label images as units, randomly dividing all data into multiple parts, taking 1 part as a test sample set, and taking the rest as a training sample set.
3. The method of claim 1, wherein step (2) comprises:
(2.1) selecting NFNet-F0 as the basic network, with a ReLU containing a fixed scale factor as the activation function; modifying the kernel size of the last convolutional layer of NFNet-F0 to 3 × 3 and the number of output channels to 2048, keeping the output feature-map size unchanged; and removing the final global pooling layer, Dropout layer and fully connected layer of NFNet-F0;
(2.2) training the modified NFNet-F0 network on a public data set, cropping image regions with random area and aspect ratio during training, using random data augmentation, label smoothing, MixUp and CutMix as input regularization strategies, and selecting the model parameters with the highest accuracy on the validation set associated with the public data set;
(2.3) re-initializing the NFNet-F0 network modified in step (2.1) with the model parameters trained in step (2.2), and using it as the coding network structure of the segmentation network.
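The "ReLU containing a fixed scale factor" of claim 3 refers to the scaled activations that let normalizer-free networks dispense with batch normalization. A sketch, assuming the analytic factor γ = √(2/(1 − 1/π)) ≈ 1.713 used for ReLU in the NFNet literature (the patent does not state the exact constant):

```python
import numpy as np

# fixed scale factor for ReLU, ~1.713: chosen so that a unit-variance
# Gaussian input still has (approximately) unit variance after activation
GAMMA = (2.0 / (1.0 - 1.0 / np.pi)) ** 0.5

def scaled_relu(x):
    """ReLU with a fixed scale factor, as used in normalizer-free networks."""
    return GAMMA * np.maximum(x, 0.0)

# empirical check of the variance-preservation property
rng = np.random.default_rng(0)
out = scaled_relu(rng.standard_normal(1_000_000))
```

Preserving activation variance layer by layer is what allows the batch normalization layers to be removed from the basic network in step (2).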
4. The method of claim 3, wherein in step (2.2) the modified NFNet-F0 network is trained on the ImageNet data set of the ILSVRC2012 competition, and the accuracy of the model parameters is evaluated on the ImageNet validation set.
5. The method of claim 1, wherein step (3) comprises:
(3.1) connecting 1 double-size (2×) upsampling layer after the output layer of the coding part of the segmentation network, enlarging the feature map by 2× upsampling;
(3.2) connecting 1 3 × 3 convolutional layer after the 2× upsampling layer, so that the network learns to decode the upsampled segmentation features, and adjusting the output feature-map size with another 2× upsampling layer after that convolutional layer; then using 1 further 3 × 3 convolutional layer to strengthen feature learning in the decoding network; finally enlarging the feature map with 1 more 2× upsampling layer, outputting 6 feature channels through 1 convolutional layer, and producing the segmentation probability map after softmax mapping; there are no skip connections between the decoding network and the coding network;
(3.3) completing the decoding network structure of the segmentation network according to steps (3.1) and (3.2); during each training iteration, downsampling the real label image corresponding to the input image by a factor of 4 using nearest-neighbor interpolation, so that the downsampled label image has the same size as the probability map output by the decoding network, in order to compute the loss function and perform gradient backpropagation.
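The resolution bookkeeping in claim 5 can be checked with a short sketch (NumPy; the 768-pixel crop size and the stride-32 encoder output are assumptions used only to make the arithmetic concrete):

```python
import numpy as np

def downsample_label_nn(label, factor=4):
    """Nearest-neighbor downsampling of the integer label map by `factor`,
    used to match the decoder output resolution before computing the loss."""
    return label[::factor, ::factor]

# Shape bookkeeping: assuming the encoder reduces resolution 32x (the usual
# NFNet-F0 output stride) and the decoder upsamples 8x (three 2x layers),
# the network output sits at 1/4 of the input resolution -- exactly the
# resolution of the 4x-downsampled label map.
crop = 768
encoder_out = crop // 32       # 24
decoder_out = encoder_out * 8  # 192
label = np.zeros((crop, crop), dtype=np.int64)
small = downsample_label_nn(label)
```

This is why the label image, not the network output, is resampled: the loss is computed at 1/4 resolution and gradients are backpropagated from there.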
6. The method of claim 1, wherein the step (4) comprises:
(4.1) cross-training the segmentation model on the training set: first randomly and evenly dividing the ultrasonic image training set into at least 4 parts, each time selecting 1 part as the validation set and using the remaining parts as the training set;
(4.2) because the sizes of the ultrasonic images are not uniform, randomly setting the scaling of the longest image edge at each training iteration; if the width or height of the image is smaller than a preset value, padding with random translation; randomly cropping an image region of the specified size from the padded image, applying data augmentation with horizontal mirroring and random transformations of brightness, contrast and sharpness, and feeding the result into the network;
(4.3) using softmax cross entropy as the loss function, optimizing it with stochastic gradient descent, and performing multiple rounds of training on the training set; after cross-training is complete, selecting for each training run the model with the highest mean intersection-over-union on the validation set, yielding at least 4 segmentation models.
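The model-selection criterion of claim 6, mean intersection-over-union at a chosen probability threshold, can be sketched as follows (a minimal NumPy version; skipping classes absent from both prediction and label is an assumption, since the patent does not specify how empty classes are handled):

```python
import numpy as np

def mean_iou(prob, label, n_classes=6, threshold=0.5):
    """Mean intersection-over-union over the tissue classes.
    prob:  (n_classes, H, W) segmentation probabilities
    label: (H, W) integer ground-truth map
    A pixel counts as predicted for class c where prob[c] > threshold."""
    ious = []
    for c in range(n_classes):
        pred = prob[c] > threshold
        gt = label == c
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            continue  # class absent from both prediction and label: skip it
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious)) if ious else 1.0

# toy 2-class example: a perfect one-hot prediction scores mIoU = 1.0
label = np.array([[0, 0], [1, 1]])
perfect = np.stack([(label == c).astype(float) for c in range(2)])
```

In step (4.3) this score, computed on the held-out validation part, would be used to pick the best model of each cross-training run.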
7. The method of claim 1, wherein step (5) comprises: testing the trained models on the test set and evaluating their performance; specifically:
after the segmentation probability map is output, the inverse of the input-image preprocessing is applied to the probability map, restoring it to the original image size and yielding a segmentation prediction for each pixel; on the test set, each image is predicted with at least 4 models, and the average of the predictions is taken as the final segmentation result; a suitable threshold is selected to compute the mean intersection-over-union, and the training run with the highest mean intersection-over-union is selected as the segmentation model.
8. The method of claim 1, wherein the step (6) comprises:
(6.1) after adjusting the parameters, repeating step (4) and step (5); connecting the convolutional-layer parameters of the multiple models using grouped convolution, with the total number of parameters unchanged after the models are merged, to obtain the final segmentation model;
(6.2) acquiring an ultrasonic image in real time, inputting it into the final segmentation model, and obtaining in real time the segmentation results for the different human tissues in the ultrasonic image.
9. A method for implementing a deep-learning automatic segmentation model, characterized in that an NFNet without batch normalization layers is used as the basic network structure, and the kernel size and number of output channels of its last convolutional layer are modified, forming the coding network structure of the model; the decoding part of the model uses an interleaved sequence of multiple 2× upsampling and convolution operations, without skip connections; the decoding network upsamples by a total factor of 8; during model training the real label image is downsampled by a factor of 4, the loss function is then computed against the output of the decoding network, gradient backpropagation is performed, and the final segmentation model is obtained after the parameters are updated.
CN202111232141.2A 2021-10-22 2021-10-22 Method for automatically segmenting human tissues in ultrasonic image through deep learning Pending CN114119474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111232141.2A CN114119474A (en) 2021-10-22 2021-10-22 Method for automatically segmenting human tissues in ultrasonic image through deep learning


Publications (1)

Publication Number Publication Date
CN114119474A 2022-03-01

Family

ID=80376574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111232141.2A Pending CN114119474A (en) 2021-10-22 2021-10-22 Method for automatically segmenting human tissues in ultrasonic image through deep learning

Country Status (1)

Country Link
CN (1) CN114119474A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958552A (en) * 2023-07-25 2023-10-27 强联智创(北京)科技有限公司 Blood vessel segmentation method, electronic device and storage medium
CN116934738A (en) * 2023-08-14 2023-10-24 威朋(苏州)医疗器械有限公司 Organ and nodule joint segmentation method and system based on ultrasonic image
CN116934738B (en) * 2023-08-14 2024-03-22 威朋(苏州)医疗器械有限公司 Organ and nodule joint segmentation method and system based on ultrasonic image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination