CN112465111A

CN112465111A - Three-dimensional voxel image segmentation method based on knowledge distillation and countertraining

Info

Publication number: CN112465111A
Application number: CN202011289486.7A
Authority: CN
Inventors: 李天波; 候亚庆; 周东生; 张强; 魏小鹏
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-03-09

Abstract

The invention belongs to the field of image segmentation, and provides a three-dimensional voxel image segmentation method based on knowledge distillation and confrontation training, which is divided into two parts: training a plurality of teacher networks for integrated learning, wherein the plurality of teacher networks are used as prior knowledge to obtain soft targets for training student networks; and (3) carrying out confrontation training on the student network, carrying out supervision training by using soft targets and image labels of the teacher network, carrying out supervision by using a characteristic diagram in the middle of the teacher network, and adding a discriminator to construct GANs loss supervision training. The invention expands the available data using knowledge distillation and performs knowledge migration using knowledge distillation, enabling model compression. For the deep learning model of the three-dimensional voxel image, parameters are saved, and the model is easier to deploy. By combining knowledge distillation and antagonistic training, the regularization effect is more obvious, and the overfitting problem of the three-dimensional voxel image caused by the rare data volume is solved to a certain extent.

Description

Three-dimensional voxel image segmentation method based on knowledge distillation and countertraining

Technical Field

The invention belongs to the field of image segmentation, and relates to a three-dimensional voxel image segmentation method based on knowledge distillation and confrontation training.

Background

With the rapid development of science and technology and imaging technology, three-dimensional images are more and more widely used, especially three-dimensional voxel images. The voxel is the minimum unit of digital data on three-dimensional space segmentation, and the voxel is widely applied to the fields of three-dimensional imaging, scientific data, medical images and the like. In the use scene of the three-dimensional voxel image, a plurality of basic tasks of the three-dimensional voxel image segmentation depending on the foundation are applied. But is very complex and inefficient if the segmentation is performed manually on a three-dimensional voxel image. This is not only because the volume of the three-dimensional voxel image needs to be segmented on each slice, but also the accurate segmentation of each slice requires a professional in the corresponding usage scenario, which results in a large amount of high personnel and labor costs, which is obviously not feasible.

For a three-dimensional voxel image, the characteristics of large single data and small total amount of data are generally provided, which also causes that the influence of the model volume on the three-dimensional voxel image is large, and the model is easy to be over-fitted. For traditional machine learning methods, extracting features is too complex and generally not feasible. The depth learning-based method is mostly applied to two-dimensional natural image segmentation, and if the depth learning-based two-dimensional natural image segmentation method is applied to a three-dimensional voxel image, efficiency and accuracy are low. Because the characteristics of the three-dimensional voxel image and the two-dimensional natural image are completely different, the two-dimensional natural image usually has small single data and large data total amount. Therefore, a new method is needed to be designed to be applied to the three-dimensional voxel image based on the characteristics of the three-dimensional voxel image.

Hinton et al first proposed the concept of knowledge distillation in 2015, the general framework of which mainly includes a teacher network and a student network. Soft goals for the teacher network are introduced as part of the total loss to induce training on the student network and to effect knowledge transfer. Since hard objects are single hot coded, the entropy of the information contained in hard objects is low, while the entropy of the information contained in soft objects is high, and there is a relationship between the different classes. The temperature T is used to smooth the classification results, obtain soft targets with higher entropy of information, and train the model using soft target supervision as part of the loss.

The generation of countermeasure networks (GANs) was proposed by Goodfellow et al in 2014. GANs consist of two neural networks, one being a generator and the other being a discriminator. The generator attempts to capture the data distribution and the discriminator estimates the probability that the sample is from training data rather than generated data. That is, the generator attempts to create an image that looks "natural" and requires that it be as consistent as possible with the original data distribution. The task of the discriminator is to determine whether a given image appears to be "natural". Pauline Luc et al trained semantic segmentation models for the first time in 2016 using GANs and performed pixel-level dense classification on the original images. The method uses a discriminator to distinguish the results of the segmented network from the correct labels, where the loss function combines the traditional multi-class cross-entropy loss and the GANs loss. By using the countermeasure network, the robustness of the model under small-scale data is improved, overfitting is prevented, the robustness of the model is improved, and the target image space is more continuous and smooth.

The use of GANs in semantic segmentation has a very large gap between the true tags and the output results. Since the tags are one-hot coded and the output result is non-one-hot coded, the discriminator can easily and directly judge the tags and the output result, and thus the GANs have a limited effect on semantic segmentation. Why do this result? It is theoretically analyzed that this is due to "unnatural" tags, so tags like soft objects are needed to serve as "natural" data.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a three-dimensional voxel image segmentation method based on knowledge distillation and countertraining, and aims to construct an efficient and highly robust three-dimensional voxel image segmentation method. The invention uses knowledge distillation to expand data, and can obtain a soft target aiming at the data without labels, thereby relieving the problem of less total data of the three-dimensional voxel image to a certain extent. The original label training is used, so that the final result is offset to a certain extent, and the accuracy deficiency of the teacher network can be made up. Knowledge transfer is performed by using knowledge distillation, and knowledge learned from a collection of a plurality of large models is transferred to a small model to realize model compression. Therefore, the method is beneficial to being put into practical production, and the robustness of the model is improved by using a mode of countertraining, so that the result image becomes smoother. Knowledge distillation and countermeasure training are combined, the judgment easiness of a discriminator in the countermeasure training can be improved, the robustness of a model is further improved, and the three-dimensional voxel image segmentation method model can be put into production and land more easily.

A three-dimensional voxel image segmentation method based on knowledge distillation and antagonistic training is divided into two big modules:

(1) training a plurality of teacher networks for integrated learning, wherein the plurality of teacher networks are used as prior knowledge to obtain soft targets for training student networks; the training teacher network is divided into two submodules, wherein the first submodule is a data preprocessing part and mainly used for carrying out random cutting, rotation and intensity offset on a three-dimensional voxel image to carry out data enhancement and then carrying out standardized processing on data by using a Z-Score (Z-Score); the second sub-module is used for carrying out supervision training on a teacher network, modifying a network structure by 3DUNet to serve as the teacher network, and then carrying out supervision by using image labels; saving a plurality of trained teacher models for subsequent student network training;

(2) the method comprises the following steps of training a student network in an antagonistic manner, carrying out supervision training by using a soft target and an image label of a teacher network, carrying out supervision by using a characteristic diagram in the middle of the teacher network, and adding a discriminator to construct GANs loss supervision training; the training student network is divided into four sub-modules, wherein the first sub-module is a data preprocessing part and mainly used for carrying out random cutting, rotation and intensity offset on a three-dimensional voxel image to carry out data enhancement and then carrying out standardized processing on data by using Z-Score; the second sub-module is transmitted to the bottom layer in a forward direction to carry out auxiliary supervision training; the third sub-module is used for forward propagation and finally output and jointly supervised training by using a soft target and an image label; the fourth submodule utilizes the GANs loss supervision training for constructing a discriminator; and 3DUNet is also modified as a student network structure, the number of layers of the student network structure is less than that of the teacher network structure, and compression is performed.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows: the method comprises the following specific steps:

step (1) train multiple teacher networks

(1.1) cutting the original three-dimensional voxel image to a proper size by a random center, then randomly rotating, shifting the intensity, and finally performing pretreatment by Z-Score standardization;

the formula for Z-Score is as follows:

the Z-Score is zero-mean unit variance operation, so that training can be accelerated; x is an input image, X' is an output image, mu is the mean value of all images, and sigma is the variance of all images;

(1.2) training a teacher network, and training the teacher network end to end by using the modified 3DUNet network structure and using image labels; the 3DUNet structure mainly depends on three-dimensional convolution operation and three-dimensional maximum pooling operation, and a group of standardized operation and leakage correction unit activation functions are connected after the three-dimensional convolution operation is performed each time; the reason why batch normalization is not used is that the three-dimensional voxel data is large, and the batch sent into the network each time can only be small, so that batch normalization has little effect. After the image is subjected to convolution operation for two times, the resolution of the image is reduced to be half of the original resolution by using three-dimensional maximum pooling; the up-sampling uses a trilinear interpolation method, and is connected with the same convolution operation to form a coder decoder structure; the final activation function uses a sigmoid activation function to finally output a segmentation image; performing supervised training on a teacher network by using a soft dice loss function, wherein the condition of unbalanced segmentation labels usually exists on a three-dimensional voxel image; the training effect is not good if the softmax activation function and the multi-class cross entropy loss function are used, so the method uses the soft dice loss function to train the teacher network. Finally, a plurality of trained teacher networks are stored as prior knowledge to prepare for later training student networks;

the formula of the teacher network soft dice loss function is as follows:

teacher network Soft dice loss function, L_tRepresenting the loss of the teacher network, K representing the class, t representing the segmentation result of the teacher network, and c representing the image label; ε is a constant close to 0 to avoid a denominator of 0.

Step (2) confrontation training student network, using the teacher network and the image label in the above and constructing a discriminator to supervise training;

(2.1) similarly to (1.1), the preprocessing part of the data is also performed, except that when the student network is trained, additional unlabelled data is used as an input image, because the soft loss of the unlabelled image is obtained by using the previously obtained trained teacher network, the unlabelled image is utilized, and the usable data set is expanded to a certain extent;

and (2.2) the teacher network and the student network are in a modified 3DUNet network structure, so that the intermediate results of the teacher network can be used for assisting in supervising the student network training. Because the student network is smaller than the teacher model and mainly embodied in that the number of layers is smaller than that of the teacher model, the student network can be mapped to a higher dimension through auxiliary convolution to be matched with the intermediate layer characteristic diagram of the teacher network, so that a loss function for auxiliary supervision can be obtained, and MSE loss function is used for auxiliary supervision;

the equation for MSE is as follows:

equation (3) is described as an aided supervised MSE loss function,

for the auxiliary loss of the student network, V is all voxels, as is the intermediate result of the student network, and at is the intermediate result of the teacher network;

and (2.3) propagating forward to the rearmost end, and training the student network by using the soft target of the teacher network and the image label in a joint supervision mode. The 3d onnet structure of the student network structure is very similar to that of the teacher network, the main difference being that the number of layers of the student network is less than that of the teacher network. And finally outputting the segmentation result of the student network through the encoder-decoder structure, and measuring the loss between the segmentation result of the student network and a soft target of a teacher network by using a soft dice loss function, and measuring the loss between the segmentation record of the student network and the image tag by using the soft dice loss function. The foregoing illustrates the use of unlabeled image data, where is how does unlabeled image data jointly supervise? The problem is solved by dynamically adjusting the weight of the joint loss function, joint supervision is used for data with image labels, and only soft target supervision is used for data without image labels;

the soft dice loss function for the student network is as follows:

in the formula (4), the first and second groups,

representing a loss function between the student network segmentation result and the teacher network soft target, s representing the segmentation result of the student network, and t representing the soft target of the teacher network;

in the formula (5), the first and second groups,

representing a loss function between the student network segmentation result and the image label, s representing the segmentation result of the student network, and c representing the image label;

in the formula (6), L_sRepresents the total loss of the student network, as a weighted sum between several losses; lambda [ alpha ]₁To represent

And

weight between, λ₂To represent

A proportion weight;

(2.4) constructing a discriminator to construct the GANs loss, which has the regularization effect. The original image is spliced with the student network segmentation result or the teacher network soft target and input into the discriminator. The discriminator is a full convolution network, the characteristics are extracted through an average pooling layer in the middle, and finally two classification results are output to represent whether the image is true or false;

in the formula (7), the first and second groups,

for the discriminator to be lost, the image label from the student network needs to be set to 0, and the image label from the teacher network needs to be set to 1; e is a mathematical expectation; q^tFor teacher networks, split labels, Q^sA split label for a student network; i is an original image; p is a radical of_tRepresenting teacher distribution, p_sRepresenting a student distribution; d () represents the result output by the discriminator; i represents a splicing operation;

in the formula (8), the first and second groups,

for the loss of the GANs of the student network, the image label from the student network needs to be set to 1 for the student network;

loss L of final student network_totalIs a weighted sum of student network losses and GANs losses; lambda [ alpha ]_ganTo represent

The weight of (c).

The invention has the beneficial effects that: the invention expands the available data using knowledge distillation and performs knowledge migration using knowledge distillation, enabling model compression. For the deep learning model of the three-dimensional voxel image, parameters are saved, and the model is easier to deploy. By combining knowledge distillation and antagonistic training, the regularization effect is more obvious, and the overfitting problem of the three-dimensional voxel image caused by the rare data volume is solved to a certain extent.

Drawings

Fig. 1 is an overall framework of the present invention.

FIG. 2 is a schematic diagram of a teacher model structure according to the present invention.

Fig. 3 is a schematic structural diagram of a student model of the invention.

FIG. 4 is a schematic diagram of the structure of the discriminator according to the present invention.

FIG. 5 is a flow chart of a teacher network training method of the present invention.

FIG. 6 is a flow chart of the student model and discriminator confrontation training method of the present invention.

Detailed Description

The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.

The present invention will be described in further detail with reference to the attached drawings and detailed description, so that the conception and the technical scheme of the present invention can be more completely, accurately and deeply understood by those skilled in the art.

The invention can be used for the segmentation tasks of various three-dimensional voxel images, the overall framework of the invention is shown in figure 1 and is subdivided into each network model, the structure of a teacher model is shown in figure 2, the structure of a student model is shown in figure 3, and the structure of a discriminator is shown in figure 4.

The embodiment is applied to the tumor segmentation task of the three-dimensional nuclear magnetic resonance brain imaging graph, and the specific embodiment discussed is only used for illustrating the implementation mode of the invention and does not limit the scope of the invention.

The following describes embodiments of the present invention mainly in detail with respect to a tumor segmentation task of a three-dimensional nmr brain imaging graph, which is mainly divided into training of a teacher network (as shown in fig. 5) and training of a student network (as shown in fig. 6):

(1) training of the teacher network:

and performing data preprocessing according to a module I in the invention content, performing random center cutting, random rotation and intensity offset on the three-dimensional voxel image, and performing Z-Score processing and inputting into a trained teacher model. The input three-dimensional voxel image has the dimensions of 4x128x128x128, and the foremost 4 represents the mode images of 4 nuclear magnetic resonances, which are obtained by splicing. The image is transmitted forward, a segmentation graph is obtained through a coder decoder, soft dice loss is conducted on the segmentation graph and the image label, then optimization training is conducted through an Adam optimizer, and therefore end-to-end training is achieved. And optimizing until the model converges, training 3 teacher models with the same framework by using the same process, and storing.

(2) Training of a student network:

and performing data preprocessing according to a module II in the invention content, performing random center cutting, random rotation and intensity offset on the three-dimensional voxel image, performing Z-Score processing, inputting the processed three-dimensional voxel image into a trained student model, and simultaneously using the three-dimensional voxel data which is not labeled to expand the usable data amount. The input three-dimensional voxel image has the dimensions of 4x128x128x128, and the foremost 4 represents the mode images of 4 nuclear magnetic resonances, which are obtained by splicing. The confrontation training student model, specifically the alternate training student network and the discriminator, firstly randomly obtains a training teacher model for obtaining the teacher network intermediate result and the soft target.

When a student model is trained, an original three-dimensional voxel image is input into a student model and a teacher model, forward propagation is carried out, 3 losses can be obtained in the forward propagation process, soft dice losses are carried out on the segmentation result of the student model and a soft target of the teacher model at the tail end of a decoder to obtain loss1, soft dice losses are carried out on the segmentation result of the student model and an image label at the tail end of the decoder to obtain loss2, and MSE losses are carried out on the middle result of the student model and the middle result of the teacher model at the bottom layers of the student model and the teacher model to obtain loss 3. And then splicing the student network segmentation result and the original image, inputting the splicing result and the original image into a discriminator, carrying out forward propagation to obtain a binary classification result, carrying out GANS loss to obtain a loss4, and finally carrying out optimization training by using an Adam optimizer in loss weighted fusion. The parameters in the teacher model are fixed here because the teacher model is already trained and does not need to be trained again.

When the discriminator is trained, the spliced image and the soft target are sent into the discriminator to be transmitted forward, a binary classification result is obtained, the GANs loss is made, and an Adam optimizer is used for optimization training.

Claims

1. A three-dimensional voxel image segmentation method based on knowledge distillation and antagonistic training is characterized by comprising the following steps:

step (1) train multiple teacher networks

(1.1) cutting the original three-dimensional voxel image at random center, then randomly rotating, shifting intensity, and finally carrying out pretreatment by Z-Score standardization;

the formula for Z-Score is as follows:

(1.2) training a teacher network, and training the teacher network end to end by using the modified 3DUNet network structure and using image labels; the 3DUNet structure mainly depends on three-dimensional convolution operation and three-dimensional maximum pooling operation, and a group of standardized operation and leakage correction unit activation functions are connected after the three-dimensional convolution operation is performed each time; after the image is subjected to convolution operation for two times, the resolution of the image is reduced to be half of the original resolution by using three-dimensional maximum pooling; the up-sampling uses a trilinear interpolation method, and is connected with the same convolution operation to form a coder decoder structure; the final activation function uses a sigmoid activation function to finally output a segmentation image; performing supervision training on the teacher network by using a soft dice loss function; finally, a plurality of trained teacher networks are stored as prior knowledge to prepare for later training student networks;

the formula of the teacher network soft dice loss function is as follows:

(2.1) similarly to (1.1), the preprocessing part of the data is also made, except that when the student network is trained, additional unlabelled data is used as an input image, because the soft loss of the unlabelled image is obtained by using the previously obtained teacher network with training elements, the unlabelled image is utilized, and the usable data set is enlarged;

(2.2) the teacher network and the student network are spread forward to a network bottom layer for auxiliary supervision, and as the structures of the teacher network and the student network are very similar and are modified 3DUNet network structures, the intermediate result of the teacher network is used for auxiliary supervision of student network training; because the student network is smaller than the teacher model and mainly embodied in that the number of layers is smaller than that of the teacher model, the student network is mapped to a higher dimension through auxiliary convolution to be matched with an intermediate layer characteristic diagram of the teacher network, so that an auxiliary supervision loss function is obtained, and MSE (mean square error) loss function is used for auxiliary supervision;

the equation for MSE is as follows:

equation (3) is described as an aided supervised MSE loss function,

(2.3) propagating forward to the rearmost end, and using a teacher network soft target and an image label to jointly supervise and train a student network; finally, the segmentation result of the student network is output through the encoder-decoder structure, the loss between the segmentation result of the student network and a teacher network soft target is measured by using a soft dice loss function, and the loss between the segmentation record of the student network and the image label is measured by using the soft dice loss function; the method comprises the steps that joint supervision is used for data with image labels by dynamically adjusting the weight of a joint loss function, and only soft target supervision is used for data without the image labels;

the soft dice loss function for the student network is as follows:

in the formula (4), the first and second groups,

in the formula (5), the first and second groups,

And

weight between, λ₂To represent

A proportion weight;

(2.4) constructing a discriminator to construct the GANs loss, so that the regularization effect is achieved; splicing the original image with the student network segmentation result or the teacher network soft target and inputting the spliced original image into a discriminator; the discriminator is a full convolution network, the characteristics are extracted through an average pooling layer in the middle, and finally two classification results are output to represent whether the image is true or false;

in the formula (7), the first and second groups,

in the formula (8), the first and second groups,

The weight of (c).