CN110544264B - Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism - Google Patents
Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism
- Publication number
- CN110544264B (application number CN201910799709.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- network
- data
- convolution
- adopting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/181—Segmentation; Edge detection involving edge growing; involving edge linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30008—Bone
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computer Graphics (AREA)
- Architecture (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computer Hardware Design (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
A small target segmentation method for key anatomical structures of the temporal bone based on a 3D deep supervision mechanism, belonging to the field of medical image processing. A 3D encoding-decoding network is designed: in the encoding stage, densely connected networks are adopted to extract features, enhancing feature propagation and improving feature reuse, and a transition (migration) module is designed between different densely connected network blocks; the transition module adopts a 3D multi-pooling feature fusion strategy that fuses the features obtained after max pooling and average pooling. In the decoding stage, a 3D deep supervision mechanism is introduced so that the outputs of hidden layers and of the backbone network jointly guide network training. Aiming at the problems that the key anatomical structures of the temporal bone are small in volume and yield insufficient extractable features, the 3D network makes full use of the spatial information of temporal bone CT and realizes automatic segmentation of the key anatomical structures of the temporal bone, namely the malleus, incus, cochlea outer wall, cochlea inner cavity, outer semicircular canal, posterior semicircular canal, anterior semicircular canal, vestibule and internal auditory canal.
Description
Technical Field
The invention belongs to the field of medical image processing, and particularly relates to a temporal bone key anatomical structure small target segmentation method based on a 3D deep supervision mechanism.
Background
Temporal bone computed tomography (CT) is an established ear examination used to check whether the key anatomical structures of the temporal bone show anatomical variation. With growing clinical demand, temporal bone imaging data are increasing rapidly, and doctors must observe and process ever more data, which greatly increases their workload. Automatically segmenting the key anatomical structures of interest from temporal bone CT is therefore of great significance for reducing doctors' workload and for reducing missed diagnoses and misdiagnoses. Accurate segmentation of the key anatomical structures of the temporal bone not only improves the efficiency of medical image data processing, but is also important for clinical teaching and scientific research.
Medical image segmentation methods fall into two main categories: segmentation based on hand-crafted features and segmentation based on deep learning. Before deep learning arose, many segmentation algorithms such as threshold segmentation, region growing and active contour models were applied to medical image segmentation tasks. Although segmentation methods based on hand-crafted features are relatively simple to implement, many factors affect their segmentation accuracy, and the accuracy requirements of small target segmentation in medical images are high, so traditional hand-crafted-feature methods are not suitable for this task.
In recent years, semantic segmentation of medical images has become a popular research direction in intelligent medical image analysis. Segmenting small targets in medical images is a challenging task because the target region occupies a small proportion of the image, the contrast between target and background regions is weak, boundaries are fuzzy, and shapes and sizes vary considerably between individuals. The key anatomical structures of the temporal bone are relatively small: for example, in a 512 × 512 × 199 voxel volume, the largest structure, the internal auditory canal, contains only 1298 voxels, and the smallest, the malleus, only 184 voxels. Furthermore, the variability between different anatomical structures is large. These characteristics make intelligent segmentation of the key anatomical structures of the temporal bone challenging.
The fully convolutional network, a pioneer of semantic segmentation networks, replaces the fully connected layers with convolutional layers to classify images at the pixel level and trains an end-to-end encoding-decoding network that addresses semantic-level segmentation of natural images. Its segmentation results, however, are not accurate enough, and detail information such as boundaries is easily lost. Unlike natural images, medical images are often three-dimensional volume data, whose slices contain not only in-plane feature information but also rich inter-slice spatial information. Most existing medical image segmentation methods are suited to relatively large anatomical structures such as the liver, heart and lungs, and perform poorly on small target segmentation.
The invention provides a small target segmentation method for the key anatomical structures of the temporal bone based on a 3D deep supervision mechanism. A 3D encoding-decoding network is designed: in the encoding stage, densely connected networks are adopted to extract features, enhancing feature propagation and improving feature reuse, and a transition (migration) module is designed between different densely connected network blocks; the transition module adopts a 3D multi-pooling feature fusion strategy that fuses the features obtained after max pooling and average pooling. In the decoding stage, a 3D deep supervision mechanism is introduced so that the outputs of hidden layers and of the backbone network jointly guide network training.
Disclosure of Invention
The invention aims to overcome the defects of existing segmentation methods. Aiming at the problems that the key anatomical structures of the temporal bone are small in volume and yield insufficient extractable features, it provides a segmentation network based on a 3D deep supervision mechanism and adopts a 3D network to make full use of the spatial information of temporal bone CT, realizing automatic segmentation of the key anatomical structures of the temporal bone, namely the malleus, incus, cochlea outer wall, cochlea inner cavity, outer semicircular canal, posterior semicircular canal, anterior semicircular canal, vestibule and internal auditory canal.
The invention is realized by adopting the following technical means:
a temporal bone key anatomical structure segmentation method based on a 3D deep supervision mechanism. The overall architecture of the method is mainly divided into two stages: an encoding stage and a decoding stage, as shown in figure 1.
The encoding stage comprises feature extraction with densely connected networks and multi-pooling feature fusion.
The decoding stage comprises feature recovery with long and short skip connections and a 3D deep supervision mechanism.
The method specifically comprises the following steps:
1) And (3) an encoding stage:
First, the densely connected network extracts features. The raw CT data are pre-processed and 48 × 48 × 48 cubes are extracted and fed into the network. The encoding stage designs three densely connected network blocks containing different numbers of layers, where the input of each convolutional layer is the concatenation of the outputs of all preceding layers. Connecting all preceding layers directly to subsequent layers improves feature reuse, alleviates the vanishing-gradient problem, and improves the flow of information and gradients through the whole network, which facilitates training. Dense connections require fewer parameters than traditional convolutional networks and avoid relearning redundant feature maps. Let X_l denote the output of the l-th layer and x_0, …, x_{l-1} the feature cubes output by layers 0 through l-1; the internal design of each densely connected network block can then be represented by equation (1):
X_l = H_l([x_0, x_1, …, x_{l-1}])    (1)
where [·] denotes the concatenation of the output features of the different layers, and H_l(·) comprises three successive operations: batch normalization (BN), rectified linear unit (ReLU) and 3 × 3 × 3 convolution, with growth rate k = 32. To prevent overfitting, a dropout layer with drop rate 0.5 immediately follows the 3 × 3 × 3 convolution.
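A minimal sketch of such a 3D densely connected block is given below, written in PyTorch purely for illustration; the class names, layer count and channel handling are assumptions and not the patented implementation:

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One H_l(.) unit: BN -> ReLU -> 3x3x3 convolution -> dropout."""
    def __init__(self, in_channels, growth_rate=32, drop_rate=0.5):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.drop = nn.Dropout3d(drop_rate)

    def forward(self, x):
        return self.drop(self.conv(self.relu(self.bn(x))))

class DenseBlock3D(nn.Module):
    """Each layer receives the concatenation of all previous outputs, cf. equation (1)."""
    def __init__(self, in_channels, num_layers=3, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer3D(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # [x_0, x_1, ..., x_{l-1}]
            features.append(out)
        return torch.cat(features, dim=1)
```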
And secondly, multi-pooling feature fusion. After the output of the densely connected network block at each level, BN-ReLU-Conv3D is applied, followed by a dropout layer with drop rate 0.5 to prevent overfitting; 3D max pooling and 3D average pooling are then applied simultaneously, and the pooled results are concatenated. 3D max pooling preserves the edge features of the volume data, while 3D average pooling preserves its background information. Concatenating the two provides rich feature information for the subsequent segmentation.
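The transition (migration) module between dense blocks described above can be sketched as follows; the class name and channel sizes are assumptions, and only the simplest concatenation-based fusion is shown:

```python
import torch
import torch.nn as nn

class MultiPoolTransition3D(nn.Module):
    """BN-ReLU-Conv3D with dropout, then concatenated 3D max and average pooling."""
    def __init__(self, in_channels, out_channels, drop_rate=0.5):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
        self.drop = nn.Dropout3d(drop_rate)
        self.max_pool = nn.MaxPool3d(kernel_size=2, stride=2)   # preserves edge features
        self.avg_pool = nn.AvgPool3d(kernel_size=2, stride=2)   # preserves background context

    def forward(self, x):
        x = self.drop(self.conv(self.relu(self.bn(x))))
        return torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
```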
2) And a decoding stage:
First, long and short skip connections recover low-level semantic information. The output of the bottom densely connected network block in the encoding stage is tensor feature data with a resolution of 12 × 12 × 12; it is up-sampled by transposed convolution to restore the resolution to 24 × 24 × 24 and concatenated, via a long connection, with the features output by the second-level densely connected network block of the encoding stage. One 3D convolution is applied to the concatenated features to extract features from the combined low-level and high-level semantics, and the result is up-sampled by another transposed convolution until it matches the 48 × 48 × 48 size of the input three-dimensional cube. The features output by the first densely connected network block of the encoding stage are first processed by 64 convolution kernels, and the features are concatenated using a combination of short and long connections rather than a long connection alone; this mainly narrows the semantic gap between low-level and high-level semantic features.
Secondly, the 3D deep supervision mechanism guides network training. The features output by the first densely connected network block of the encoding stage are processed by 64 convolution kernels to extract features, then by a 1 × 1 × 1 convolution followed by a softmax layer, which outputs an auxiliary segmentation result. The second layer of the decoding stage applies a convolution operation to the concatenated features to further extract features, applies a transposed convolution to raise the resolution, and then uses a 1 × 1 × 1 convolution kernel followed by a softmax layer to obtain a second auxiliary segmentation result. In the decoding stage, the last layer outputs the prediction result of the backbone network after convolution operations with different convolution kernels on the concatenated features, and the prediction of the backbone network and the predictions of the branch networks jointly guide the training of the network. During network training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function, comprising a Dice Similarity Coefficient (DSC) loss function and a cross-entropy loss function. The DSC loss function is defined as shown in equation (2):
where X and Y respectively denote the predicted voxels and the true target voxels, n denotes the number of classes of the target to be segmented (including background), and x_i and y_i respectively denote the numbers of target-labeled voxels contained in the predicted voxel data and in the true target voxel data. A weight, denoted W, is introduced for the cross-entropy loss function, as shown in equation (3):
where N_k denotes the number of target voxel labels in the voxel data to be segmented, and N_c denotes the total number of voxels in the voxel data to be segmented. The cross-entropy loss function is shown in equation (4):
The joint objective function constructed from the loss functions defined above is shown in equation (5):
where λ is the hyper-parameter weighting the branch-network loss functions. Constructing the target loss function from the loss functions of the backbone network and the branch networks to jointly guide network training alleviates gradient vanishing and accelerates network convergence.
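Equations (2)-(5) appear as images in the original filing, so the sketch below only illustrates one common formulation of a Dice loss combined with a class-weighted cross-entropy and a λ-weighted auxiliary term, written in PyTorch; the function names and the exact weighting scheme are assumptions, not the patent's own equations:

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target_onehot, eps=1e-6):
    """Soft Dice loss averaged over classes; one common formulation, not necessarily Eq. (2)."""
    dims = (0, 2, 3, 4)                                   # sum over batch and spatial dims
    inter = torch.sum(probs * target_onehot, dims)
    denom = torch.sum(probs, dims) + torch.sum(target_onehot, dims)
    return 1.0 - torch.mean((2.0 * inter + eps) / (denom + eps))

def weighted_ce_loss(logits, target, num_classes=2):
    """Cross-entropy with class weights derived from voxel counts (an N_k / N_c style weight W)."""
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()
    weights = (1.0 - counts / counts.sum()).to(logits.device)   # rarer classes weigh more
    return F.cross_entropy(logits, target, weight=weights)

def joint_objective(main_logits, aux_logits_list, target, num_classes=2, lam=0.3):
    """Backbone loss plus lambda-weighted branch (auxiliary) losses, in the spirit of Eq. (5)."""
    onehot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()
    def single(logits):
        probs = F.softmax(logits, dim=1)
        return dice_loss(probs, onehot) + weighted_ce_loss(logits, target, num_classes)
    return single(main_logits) + lam * sum(single(a) for a in aux_logits_list)
```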
To verify the effectiveness of the method, three common medical image evaluation indices are used: the Dice similarity coefficient (DSC), the average symmetric surface distance (ASD) and the average Hausdorff distance (AVD).
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
the method takes a 3D convolution neural network as a basis, fully utilizes three-dimensional volume data information, provides a new coding stage feature extraction module on the basis of the traditional 3D-Unet, adopts dense connection networks containing different layers to extract features, strengthens the spread of the features and improves the utilization rate of the features; in the decoding stage, a 3D deep supervision mechanism is changed into a mode that single supervision is adopted as the joint supervision training of a main network and an auxiliary network, so that the network is easier to train; in addition, the semantic gap existing between the high-level semantic features and the low-level semantic features is eliminated in a mode of combining long and short jump connections. The method is suitable for the small target segmentation task through the improved design of the encoding and decoding stages, and the method can effectively improve the small target segmentation precision.
The invention has the characteristics that:
1. the algorithm designs a new U-shaped 3D convolutional neural network for medical image segmentation and applies it for the first time to the task of segmenting the key anatomical structures of the temporal bone;
2. the algorithm provides a multi-pooling feature fusion strategy that makes full use of multi-scale and multi-level features to improve small target segmentation accuracy; in addition, the combination of dense connections and long and short skip connections strengthens the fusion of boundary and detail features;
3. the algorithm introduces a 3D deep supervision mechanism that guides network training and improves the robustness of the segmentation model by constructing an accompanying objective function for the hidden layers;
description of the drawings:
FIG. 1, a network overall architecture diagram;
FIG. 2 is a comparison before and after multi-plane reconstruction;
FIG. 3 is a schematic diagram of the seamless segmentation strategy;
the specific implementation mode is as follows:
the following description of the embodiments of the invention is provided in connection with the accompanying drawings:
the invention adopts temporal bone CT data set to train and test. The temporal bone CT data set comprises temporal bone CT image data of different ages and different sexes. The data set contained normal temporal bone CT data for 64 individuals. Of these, 33 were males and 31 were females, with an average age of 44 years. Each case of data was subjected to multi-plane reconstruction with a resolution of 420 x 420 containing 60 sheets. Marking the data after the multi-plane reconstruction by adopting marking software, and marking 9 key anatomical structures of the malleus, the incus, the outer wall of the cochlea, the inner cavity of the cochlea, the outer semicircular canal, the rear semicircular canal, the front semicircular canal, the vestibule and the inner auditory meatus. In the experiment, 8 persons of data are selected as a test set, and 56 persons of data are selected as a training set.
The data preprocessing adopted by the invention comprises two stages of multi-plane reconstruction and data annotation.
(1) Multi-plane reconstruction phase
The original CT imaging is influenced by the settings of scanning parameters such as collimation and helical pitch and the body position of a patient, the imaging presents different degrees of skew, and the key anatomical structures of the temporal bones on both sides are asymmetric. In order to ensure that temporal bone CT data can maintain consistent layer thickness, layer spacing and resolution under different imaging conditions, and simultaneously ensure bilateral key anatomical structures to be symmetrical, a post-processing workstation is adopted to perform multi-plane reconstruction on the original CT data, and comparison between the pre-and post-multi-plane reconstruction is shown in fig. 2. The specific operation steps are as follows:
the first step is as follows: the outer semicircular canal is symmetrical. The layer with the fullest outer semicircular canal is found at the sagittal observation site, and the reference lines are parallel and equally divide the outer semicircular canal. And switching to axial observation positions, rotating the right-side image back and forth to find the fullest layer of the outer semicircular canals, and enabling the left and right rotation axis position images to enable the outer semicircular canals on the two sides to be symmetrical.
The second step is that: and (5) carrying out normalization processing. The scale of the image is set to 1. Setting a rectangular frame with the width of 10cm and the length of the image, placing the outer semicircular tube in the rectangular frame, ensuring that the upper edge of the outer semicircular tube is 5cm away from the upper edge and the lower edge of the rectangular frame, and cutting the image.
The third step: batch processing. Taking the layer in which the outer semicircular canal appears fullest as the starting point, 44 slices upwards and 88 slices downwards are selected to obtain all reconstructed layers. Reconstruction is completed with the set layer thickness, a layer spacing of 0.7 mm and a sequence of 60 slices.
(2) Data annotation phase
The first step is as follows: importing the image after the multi-plane reconstruction into a Materialise Mimics software, building different masks for different key anatomical structures, and setting a threshold range allowing labeling for each Mask;
the second step is that: an experienced radiologist uses a painting brush to respectively mark voxels of 9 key anatomical structures of the temporal bone;
the third step: review and modification of annotated results by another experienced radiologist;
the fourth step: dicom images of each of the 9 key anatomical structures were derived.
The overall architecture of the proposed method is shown in figure 1. The algorithm is mainly divided into two stages: an encoding stage and a decoding stage.
(1) Encoding stage
The specific implementation steps of the encoding stage are as follows:
a) Dense connection feature extraction
The first step is as follows: cubes are extracted for training. A 48 x 48 original data cube and a labeled data cube were randomly drawn from a 420 x 60 voxel cube of input data. And checking whether the label in the labeling data cube contains 1, if the label does not contain 1, the extracted cube does not contain the target anatomical structure, and re-extraction is needed until the label 1 is contained in the labeling cube. In order to eliminate the influence of background pixels on the segmentation task, the threshold interval of the target anatomical structure is set to be-999 to 2347 according to the threshold range of the 9 critical anatomical structures of the temporal bone, the Hugh value smaller than-999 is set to be-999, and the Hugh value larger than 2347 is set to be 2347. The Hugh value of the cube is divided by 255 to reduce the amount of computation. It was then normalized to a mean 0 and variance 1 data distribution. The data enhancement is realized by simultaneously rotating the original data and the labeled data by an angle (-25 degrees);
The second step is as follows: feature extraction. For the extracted original data cube, a 3 × 3 × 3 convolution kernel is first used to extract features, with a stride of 1 in all three dimensions and SAME padding (zero filling), yielding 64 feature maps. The features are then fed into a 3-layer densely connected network, in which the input of each convolution operation is the concatenation of the features output by all preceding convolutions; the convolution kernel size, stride and padding mode used inside the densely connected network are the same as those of this initial convolution;
The third step: feature dimension reduction. The previously output features in the dense connection block are aggregated, and a bottleneck strategy is then adopted to reduce the number of feature cubes: batch normalization and ReLU activation are applied to the features first, and a 3 × 3 × 3 convolution kernel then outputs 4k features, where k is the growth rate.
b) Multi-pooling feature fusion
A multi-pooling feature fusion transition (migration) module is designed between the different densely connected network blocks.
The first step is as follows: the features extracted by the densely connected network block are batch-normalized and a ReLU activation function is applied to increase the nonlinearity of the network. A three-dimensional convolution kernel of size 3 × 3 × 3 is then used to extract features, and dropout with a drop rate of 0.5 is used to prevent overfitting.
The second step is as follows: 3D max pooling and 3D average pooling are performed on the features separately, with a pooling kernel of size 2 × 2 × 2 and a stride of 2 in each of the three dimensions. 3D max pooling selects the maximum value within the pooling kernel's spatial range, while 3D average pooling takes the average over that range; the former better preserves edge features and the latter preserves global background information. The features obtained after max pooling and average pooling are concatenated.
(2) Decoding stage
The specific implementation steps of the decoding stage are as follows:
a) Long and short skip connections are combined.
The first step is as follows: the output features of the first, second and third densely connected network blocks in the encoding stage are F_1, F_2 and F_3, with resolutions of 48 × 48 × 48, 24 × 24 × 24 and 12 × 12 × 12 respectively. A transposed convolution with a stride of 2 in all three dimensions, SAME padding and zero filling is applied to F_3; the feature group T_2 obtained after the transposed convolution has a resolution of 24 × 24 × 24;
The second step: the features F_2 output by the second densely connected network block of the encoding stage are concatenated with T_2 to form a new feature set D_2, and a 3D convolution is used to extract the features of D_2;
The third step: the features F_1 output by the first densely connected network block of the encoding stage are first passed through a 3D convolution to obtain 64 features M_1. A transposed convolution is applied to the feature group D_2 to restore the feature resolution to 48 × 48 × 48, recorded as T_1. The feature groups F_1, M_1 and T_1 are concatenated to obtain the feature group D_1, in which M_1 and F_1 are joined via a short connection and a long connection respectively; this concatenation eliminates, to a certain extent, the semantic gap between low-level and high-level semantic features.
b) 3D deep supervision mechanism
The first step is as follows: a transposed convolution is first applied to the feature group D_2 output by the decoding stage to restore its resolution to 48 × 48 × 48; a convolution with kernel size 1 × 1 × 1 is then applied, outputting 2 feature cubes, and softmax is used to compute the probability that each voxel belongs to the target anatomical structure, recorded as aux_pred1;
The second step is as follows: the feature group M_1 output by the encoding stage is likewise convolved with a 1 × 1 × 1 kernel, and softmax is used to compute the classification probability of each voxel, recorded as aux_pred2;
The third step: features are extracted from the feature group D_1 with successive 3 × 3 × 3 convolution kernels, outputting 128 and then 64 features; the result is then convolved with a 1 × 1 × 1 kernel, and finally softmax is used to compute the classification probability of each voxel, recorded as main_pred;
The fourth step: the predicted voxel cubes obtained in the first and second steps are auxiliary prediction results, and the predicted voxel cube obtained in the third step is the backbone network prediction result. Cross-entropy and DSC loss functions are computed for aux_pred1, aux_pred2 and main_pred against the ground truth, and the losses computed from the auxiliary prediction results together with the backbone network loss form a joint loss function that guides network training.
The following describes the process of network training and testing:
A segmentation model is trained separately for each key anatomical structure of the temporal bone to be segmented. The input data received by the network have size 48 × 48 × 48; the ground truth contains 2 labels, with 0 representing the background and 1 representing the target anatomical structure. The output of the network has the same size as the input and consists of 2 cubes, representing the segmentation results for the background and the foreground respectively.
a) Model training
During network training, the batch size is set to 1, the initial learning rate to 0.001 and the momentum coefficient to 0.5; after each batch, one sample is randomly drawn from the validation set for verification. The model is saved every 10,000 iterations, for a total of 180,000 iterations.
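A sketch of an iteration-based training loop with these hyper-parameters (batch size 1, learning rate 0.001, momentum 0.5, checkpoints every 10,000 of 180,000 iterations) follows; the optimizer choice (SGD) and all function names are assumptions:

```python
import torch

def train(model, train_loader, loss_fn, device="cuda",
          iterations=180000, save_every=10000, lr=1e-3, momentum=0.5):
    """Iteration-based training with periodic checkpointing; a sketch, not the original code."""
    model.to(device).train()
    optim = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    data_iter = iter(train_loader)
    for step in range(1, iterations + 1):
        try:
            cube, target = next(data_iter)               # batch size 1
        except StopIteration:
            data_iter = iter(train_loader)
            cube, target = next(data_iter)
        main_out, aux_outs = model(cube.to(device))      # backbone + auxiliary branch outputs
        loss = loss_fn(main_out, aux_outs, target.to(device))
        optim.zero_grad()
        loss.backward()
        optim.step()
        if step % save_every == 0:
            torch.save(model.state_dict(), f"model_{step}.pt")
            # here one validation sample would be drawn at random and evaluated
```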
b) Model testing
The CT data of each subject after multi-plane reconstruction have a size of 420 × 420 × 60 voxels. To match the input size accepted by the model, a seamless segmentation strategy is adopted in the testing stage, as shown in fig. 3. The data to be tested are first decomposed into cubes of size 48 × 48 × 48 voxels with an overlap factor of 4 according to the seamless segmentation strategy. The cubes are then fed into the trained model to obtain prediction results, and finally the predictions of the small cubes are recombined to obtain the final segmentation result of the data under test.
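The seamless segmentation test strategy can be sketched as an overlapping sliding-window inference; the stride derived from the overlap factor and the probability averaging are assumptions about how the cubes are recombined:

```python
import numpy as np

def _starts(dim, cube, step):
    s = list(range(0, max(dim - cube, 0) + 1, step))
    if s[-1] != max(dim - cube, 0):
        s.append(max(dim - cube, 0))       # ensure the last window reaches the border
    return s

def sliding_window_predict(volume, predict_fn, cube=48, overlap=4):
    """Overlapping 48^3 windows whose averaged softmax outputs are recombined (a sketch)."""
    step = cube // overlap                 # overlap factor 4 -> stride of 12 voxels
    D, H, W = volume.shape
    prob = np.zeros((2,) + volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    for z in _starts(D, cube, step):
        for y in _starts(H, cube, step):
            for x in _starts(W, cube, step):
                patch = volume[z:z+cube, y:y+cube, x:x+cube]
                p = predict_fn(patch)      # assumed to return a (2, 48, 48, 48) softmax map
                prob[:, z:z+cube, y:y+cube, x:x+cube] += p
                count[z:z+cube, y:y+cube, x:x+cube] += 1.0
    return (prob / np.maximum(count, 1.0)).argmax(axis=0)   # 0 = background, 1 = target
```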
The accuracy of the algorithm compared with different algorithms on the task of segmenting the key anatomical structure of the temporal bone is shown in the accompanying description table 1.
TABLE 1 results of segmentation of 9 key anatomical structures of temporal bone by different methods
Note: malleus, incus inculus, ECC cochlea outer wall, ICC cochlea inner cavity, LSC external semicircular canal, PSC posterior semicircular canal, SSC anterior semicircular canal, vestibule, IAM internal auditory canal
Note: malleus, incus inculus, ECC cochlea outer wall, ICC cochlea inner cavity, LSC external semicircular canal, PSC posterior semicircular canal, SSC anterior semicircular canal, vestibule, and IAM internal auditory canal.
Claims (4)
1. A small target segmentation method for a key anatomical structure of a temporal bone based on a 3D deep supervision mechanism is characterized by comprising the following steps:
1) And (3) an encoding stage:
firstly, extracting features with a densely connected network; preprocessing the original CT data and extracting 48 × 48 × 48 cubes to be fed into the network;
let X_l denote the output of the l-th layer and x_0, …, x_{l-1} the feature cubes output by layers 0 through l-1; the internal design of each densely connected network block is represented by equation (1):
X_l = H_l([x_0, x_1, …, x_{l-1}])    (1)
where [·] denotes the concatenation of the output features of the different layers, and H_l(·) comprises three successive operations: batch normalization (BN), rectified linear unit (ReLU) and 3 × 3 × 3 convolution, with growth rate k = 32; to prevent overfitting, a dropout layer with drop rate 0.5 immediately follows the 3 × 3 × 3 convolution;
Secondly, fusing multi-pooling characteristics;
after the output of the densely connected network block at each level, BN-ReLU-Conv3D is applied, followed by a dropout layer with drop rate 0.5 to prevent overfitting; 3D max pooling and 3D average pooling are then applied simultaneously, and the pooled tensor features are concatenated;
2) And a decoding stage:
firstly, long and short skip connections recover low-level semantic information; the bottom densely connected network block of the encoding stage outputs tensor features with a resolution of 12 × 12 × 12, which are up-sampled by transposed convolution to restore the resolution to 24 × 24 × 24 and concatenated, via a long connection, with the features output by the second-level densely connected network block of the encoding stage; one 3D convolution is applied to the concatenated features to extract features from the combined low-level and high-level semantics, and the result is up-sampled by a transposed convolution until it matches the 48 × 48 × 48 size of the input three-dimensional cube; the features output by the first densely connected network block of the encoding stage are processed by 64 convolution kernels, and the features are concatenated using a combination of short and long connections rather than a long connection alone;
secondly, the 3D deep supervision mechanism guides network training; in the encoding stage, the features output by the first densely connected network block are processed by 64 convolution kernels to extract features, then by a 1 × 1 × 1 convolution followed by a softmax layer, and an auxiliary segmentation result is output; the second layer of the decoding stage applies a convolution operation to the concatenated features to extract features, applies a transposed convolution to raise the resolution, and then uses a 1 × 1 × 1 convolution kernel followed by a softmax layer to obtain a second auxiliary segmentation result;
in the decoding stage, the last layer outputs the prediction result of the backbone network after convolution operations with different convolution kernels on the concatenated features, and the prediction of the backbone network and the predictions of the branch networks jointly guide the training of the network; during network training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function, comprising a Dice Similarity Coefficient (DSC) loss function and a cross-entropy loss function; the DSC loss function is defined as shown in equation (2):
where X and Y respectively denote the predicted voxels and the true target voxels, n denotes the number of classes of the target to be segmented, including background, and x_i and y_i respectively denote the numbers of target-labeled voxels contained in the predicted voxel data and in the true target voxel data; a weight, denoted W, is introduced for the cross-entropy loss function, as shown in equation (3):
where N_k denotes the number of target voxel labels in the voxel data to be segmented, and N_c denotes the total number of voxels in the voxel data to be segmented; the cross-entropy loss function is shown in equation (4):
the joint objective function constructed from the loss functions defined above is shown in equation (5):
where λ is the hyper-parameter weighting the branch-network loss functions; the target loss function constructed from the loss functions of the backbone network and the branch networks jointly guides network training.
2. A small target segmentation method for a critical anatomical structure of a temporal bone based on a 3D deep supervision mechanism is characterized by comprising the following steps:
training and testing by adopting a temporal bone CT data set; the temporal bone CT data set comprises temporal bone CT image data of different ages and different sexes; the resolution of each case of data after multi-plane reconstruction is 420 x 420; labeling data after multi-plane reconstruction by using labeling software, and labeling 9 key anatomical structures of a malleus, an incus, a cochlea outer wall, a cochlea inner cavity, an outer semicircular canal, a rear semicircular canal, a front semicircular canal, a vestibule and an inner auditory canal;
the adopted data preprocessing comprises two stages of multi-plane reconstruction and data annotation;
(1) Multi-plane reconstruction phase
The original CT imaging is influenced by scanning parameter settings such as collimation and helical pitch and by the patient's body position; the imaging presents varying degrees of skew, and the key anatomical structures of the two temporal bones are asymmetric;
adopting a post-processing workstation to carry out multi-plane reconstruction on the original CT data, and comprising the following specific operation steps:
the first step is as follows: the outer semicircular canal is symmetrical; finding the fullest layer of the outer semicircular canal in the sagittal observation site, and enabling the reference lines to be parallel and equally divide the outer semicircular canal; switching to axial position observation positions, rotating the right side image back and forth to find the layer with the fullest outer semicircular canals, and rotating the left and right axis position images to enable the outer semicircular canals on the two sides to be symmetrical;
the second step is that: carrying out normalization processing; uniformly setting the scale of the image to be 1; setting a rectangular frame with the width of 10cm and the length of the image length, placing the outer semicircular tube in the rectangular frame, ensuring that the distance between the upper edge of the outer semicircular tube and the upper edge and the lower edge of the rectangular frame is 5cm, and cutting the image;
the third step: carrying out batch treatment;
(2) Data annotation phase
The first step is as follows: importing the image after the multi-plane reconstruction into Materialise Mimics software, building different masks for different key anatomical structures, and setting a threshold range allowing labeling for each Mask;
the second step is that: respectively carrying out voxel marking on 9 key anatomical structures of the temporal bone;
the third step: auditing and modifying the marked result;
the fourth step: dicom images of each 9 key anatomical structures were derived.
3. A small target segmentation method for a critical anatomical structure of a temporal bone based on a 3D deep supervision mechanism is characterized by comprising two stages: an encoding stage and a decoding stage;
(1) Encoding stage
The specific implementation steps of the encoding stage are as follows:
a) Dense connection feature extraction
The first step is as follows: extracting cubes for training; a 48 × 48 × 48 original data cube and the corresponding labeled data cube are randomly extracted from the 420 × 420 × 60 voxel cube of input data; whether the labels in the labeled data cube contain 1 is checked, and if not, the extracted cube does not contain the target anatomical structure and a new cube is extracted until the labeled data cube contains label 1; to eliminate the influence of background pixels on the segmentation task, the threshold interval of the target anatomical structures is set to -999 to 2347 according to the threshold ranges of the 9 key anatomical structures of the temporal bone, Hounsfield values smaller than -999 are set to -999, and Hounsfield values larger than 2347 are set to 2347; the Hounsfield values of the cube are divided by 255; the data are then normalized to a distribution with mean 0 and variance 1; the original data and the labeled data are rotated simultaneously by the same angle to realize data enhancement; the angle is -25 to 25 degrees;
The second step: feature extraction; for the extracted original data cube, a 3 × 3 × 3 convolution kernel is first used to extract features, with a stride of 1 in all three dimensions and SAME padding (zero filling), yielding 64 feature maps; the features are fed into a 3-layer densely connected network, in which the input of each convolution operation is the concatenation of the features output by all preceding convolutions, and the convolution kernel size, stride and padding mode used inside the densely connected network are the same as those of this initial convolution;
The third step: feature dimension reduction; the previously output features in the dense connection block are aggregated, and a bottleneck strategy is then used to reduce the number of feature maps; batch normalization and ReLU activation are applied to the features first, and a 3 × 3 × 3 convolution kernel outputs 4k features, where k is the growth rate;
b) Multi-pooling feature fusion
A multi-pooling feature fusion transition (migration) module is designed between the different densely connected network blocks;
The first step is as follows: the features extracted by the densely connected network block are batch-normalized and a ReLU activation function is applied to increase the nonlinearity of the network; a three-dimensional convolution kernel of size 3 × 3 × 3 is then used to extract features, and dropout with a drop rate of 0.5 is used to prevent overfitting;
The second step is as follows: 3D max pooling and 3D average pooling are performed separately on the features, with a pooling kernel of size 2 × 2 × 2 and a stride of 2 in all three dimensions; 3D max pooling selects the maximum value within the pooling kernel's spatial range, and 3D average pooling selects the average value within that range; the features obtained after max pooling and average pooling are concatenated;
(2) Decoding stage
The specific implementation steps of the decoding stage are as follows:
a) Long and short skip connections are combined;
The first step is as follows: the output features of the first, second and third densely connected network blocks in the encoding stage are F_1, F_2 and F_3, with resolutions of 48 × 48 × 48, 24 × 24 × 24 and 12 × 12 × 12 respectively; a transposed convolution with a stride of 2 in all three dimensions, SAME padding and zero filling is applied to F_3, and the feature group T_2 obtained after the transposed convolution has a resolution of 24 × 24 × 24;
The second step is as follows: the features F_2 output by the second densely connected network block of the encoding stage are concatenated with T_2 to form a new feature set D_2; a 3D convolution is used to extract the features of D_2;
The third step: the features F_1 output by the first densely connected network block of the encoding stage are first passed through a 3D convolution to obtain 64 features M_1; a transposed convolution is applied to the feature group D_2 to restore the feature resolution to 48 × 48 × 48, recorded as T_1; the feature groups F_1, M_1 and T_1 are concatenated to obtain the feature set D_1, in which M_1 and F_1 are joined via a short connection and a long connection respectively;
b) 3D deep supervision mechanism
The first step is as follows: a transposed convolution is first applied to the feature group D_2 output by the decoding stage to restore its resolution to 48 × 48 × 48; a convolution with kernel size 1 × 1 × 1 is then applied, outputting 2 feature cubes, and softmax is used to compute the probability that each voxel belongs to the target anatomical structure, recorded as aux_pred1;
The second step: the feature group M_1 output by the encoding stage is likewise convolved with a 1 × 1 × 1 kernel, and softmax is used to compute the classification probability of each voxel, recorded as aux_pred2;
The third step: features are extracted from the feature group D_1 with successive 3 × 3 × 3 convolution kernels, outputting 128 and then 64 features; a 1 × 1 × 1 convolution kernel is then applied, and finally softmax is used to compute the classification probability of each voxel, recorded as main_pred;
The fourth step: the predicted voxel cubes obtained in the first and second steps are auxiliary prediction results, and the predicted voxel cube obtained in the third step is the backbone network prediction result; cross-entropy and DSC loss functions are computed for aux_pred1, aux_pred2 and main_pred against the ground truth, and the losses computed from the auxiliary prediction results together with the backbone network loss form a joint loss function that guides network training.
4. A small target segmentation method for a critical anatomical structure of a temporal bone based on a 3D deep supervision mechanism is characterized by comprising the following steps:
the following describes the process of network training and testing:
respectively training a 3D deep supervision mechanism segmentation network for each key anatomical structure of the temporal bone to be segmented; the size of input data received by the network is 48 × 48 × 48, 2 labels are contained in a real target, 0 represents a background, and 1 represents a target anatomical structure; the output of the network is the same as the input size, and 2 cubes are output, wherein the cubes respectively represent segmentation results of the background and the foreground;
a) Network training
During network training, the batch size is set to 1, the initial learning rate to 0.001 and the momentum coefficient to 0.5; after each batch, one sample is randomly drawn from the validation set for verification; the model is saved every 10,000 iterations, over more than 180,000 iterations;
b) Network testing
The CT data of each subject after multi-plane reconstruction have a size of 420 × 420 × 60 voxels; to match the input size accepted by the model, a seamless segmentation strategy is adopted in the testing stage: the data to be tested are first decomposed into a number of cubes of size 48 × 48 × 48 voxels with an overlap factor of 4 according to the seamless segmentation strategy; the cubes are then fed into the trained model to obtain prediction results, and finally the predictions of the small cubes are recombined to obtain the final segmentation result of the data under test.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910799709.5A CN110544264B (en) | 2019-08-28 | 2019-08-28 | Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910799709.5A CN110544264B (en) | 2019-08-28 | 2019-08-28 | Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110544264A CN110544264A (en) | 2019-12-06 |
CN110544264B true CN110544264B (en) | 2023-01-03 |
Family
ID=68712213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910799709.5A Active CN110544264B (en) | 2019-08-28 | 2019-08-28 | Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110544264B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969698A (en) * | 2019-12-23 | 2020-04-07 | 首都医科大学附属北京友谊医院 | Construction method of temporal bone space coordinate system, space positioning method and electronic equipment |
CN111161271A (en) * | 2019-12-31 | 2020-05-15 | 电子科技大学 | Ultrasonic image segmentation method |
CN111242952B (en) * | 2020-01-15 | 2023-06-30 | 腾讯科技(深圳)有限公司 | Image segmentation model training method, image segmentation device and computing equipment |
CN111292339B (en) * | 2020-01-22 | 2023-01-10 | 北京航空航天大学 | Clinical temporal bone CT multi-structure extraction method and device based on W-type network structure |
CN110956634A (en) * | 2020-02-26 | 2020-04-03 | 南京慧脑云计算有限公司 | Deep learning-based automatic detection method and system for cerebral microhemorrhage |
CN112288687B (en) * | 2020-09-08 | 2024-04-09 | 温州市人民医院 | Inner ear space posture analysis method and analysis system |
CN112419330B (en) * | 2020-10-16 | 2024-05-24 | 北京工业大学 | Temporal bone key anatomical structure automatic positioning method based on space relative position priori |
CN112419322A (en) * | 2020-10-16 | 2021-02-26 | 北京工业大学 | Temporal bone external semicircular canal segmentation method based on 3D multi-scale multi-pooling feature fusion network |
CN112489001B (en) * | 2020-11-23 | 2023-07-25 | 石家庄铁路职业技术学院 | Tunnel water seepage detection method based on improved deep learning |
CN112634293A (en) * | 2021-01-14 | 2021-04-09 | 北京工业大学 | Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network |
CN112907537A (en) * | 2021-02-20 | 2021-06-04 | 司法鉴定科学研究院 | Skeleton sex identification method based on deep learning and on-site virtual simulation technology |
CN113129310B (en) * | 2021-03-04 | 2023-03-31 | 同济大学 | Medical image segmentation system based on attention routing |
CN112950646A (en) * | 2021-04-06 | 2021-06-11 | 高燕军 | HRCT image ossicle automatic segmentation method based on deep learning |
CN113850818A (en) * | 2021-08-27 | 2021-12-28 | 北京工业大学 | Ear CT image vestibule segmentation method mixing 2D and 3D convolutional neural networks |
CN113850760B (en) * | 2021-08-27 | 2024-05-28 | 北京工业大学 | Ear CT image vestibule detection method |
-
2019
- 2019-08-28 CN CN201910799709.5A patent/CN110544264B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018125580A1 (en) * | 2016-12-30 | 2018-07-05 | Konica Minolta Laboratory U.S.A., Inc. | Gland segmentation with deeply-supervised multi-level deconvolution networks |
CN108510502A (en) * | 2018-03-08 | 2018-09-07 | 华南理工大学 | Melanoma picture tissue segmentation methods based on deep neural network and system |
CN109584248A (en) * | 2018-11-20 | 2019-04-05 | 西安电子科技大学 | Infrared surface object instance dividing method based on Fusion Features and dense connection network |
Non-Patent Citations (1)
Title |
---|
"基于图切割的骨组织CT图像分割方法";万林祥 等;《数字制造科学》;20180630;第16卷(第2期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN110544264A (en) | 2019-12-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110544264B (en) | Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism | |
CN110503654B (en) | Medical image segmentation method and system based on generation countermeasure network and electronic equipment | |
US11580646B2 (en) | Medical image segmentation method based on U-Net | |
CN109615636B (en) | Blood vessel tree construction method and device in lung lobe segment segmentation of CT (computed tomography) image | |
CN113012172B (en) | AS-UNet-based medical image segmentation method and system | |
CN110930416B (en) | MRI image prostate segmentation method based on U-shaped network | |
CN113077471A (en) | Medical image segmentation method based on U-shaped network | |
CN111798462A (en) | Automatic delineation method for nasopharyngeal carcinoma radiotherapy target area based on CT image | |
Li et al. | DenseX-net: an end-to-end model for lymphoma segmentation in whole-body PET/CT images | |
CN107492071A (en) | Medical image processing method and equipment | |
CN111369574B (en) | Thoracic organ segmentation method and device | |
CN114782350A (en) | Multi-modal feature fusion MRI brain tumor image segmentation method based on attention mechanism | |
CN114612408B (en) | Cardiac image processing method based on federal deep learning | |
CN110706214A (en) | Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error | |
CN110859642B (en) | Method, device, equipment and storage medium for realizing medical image auxiliary diagnosis based on AlexNet network model | |
CN111383215A (en) | Focus detection model training method based on generation of confrontation network | |
CN106157249A (en) | Based on the embedded single image super-resolution rebuilding algorithm of optical flow method and sparse neighborhood | |
CN114066729B (en) | Face super-resolution reconstruction method capable of recovering identity information | |
CN110853048A (en) | MRI image segmentation method, device and storage medium based on rough training and fine training | |
CN116563533A (en) | Medical image segmentation method and system based on target position priori information | |
CN109920512A (en) | A kind of training method and device of 3-dimensional dose distributed network model | |
Liu et al. | Tracking-based deep learning method for temporomandibular joint segmentation | |
CN114387282A (en) | Accurate automatic segmentation method and system for medical image organs | |
CN112419322A (en) | Temporal bone external semicircular canal segmentation method based on 3D multi-scale multi-pooling feature fusion network | |
CN110992320B (en) | Medical image segmentation network based on double interleaving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |