CN110544264A - Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism - Google Patents

Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism Download PDF

Info

Publication number
CN110544264A
Authority
CN
China
Prior art keywords
features
network
data
convolution
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910799709.5A
Other languages
Chinese (zh)
Other versions
CN110544264B (en)
Inventor
李晓光
弓照鹏
张辉
卓力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910799709.5A priority Critical patent/CN110544264B/en
Publication of CN110544264A publication Critical patent/CN110544264A/en
Application granted granted Critical
Publication of CN110544264B publication Critical patent/CN110544264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/181Segmentation; Edge detection involving edge growing; involving edge linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30008Bone

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Architecture (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

A small target segmentation method for the key anatomical structures of the temporal bone based on a 3D deep supervision mechanism belongs to the field of medical image processing. A 3D encoding-decoding network is designed: in the encoding stage, densely connected networks are adopted to extract features, enhancing feature propagation and improving feature reuse, and a migration module designed between the different densely connected network blocks adopts a 3D multi-pooling feature fusion strategy to fuse the features after max pooling and average pooling. In the decoding stage, a 3D deep supervision mechanism is introduced so that the output results of hidden layers and of the backbone network jointly guide network training. Aiming at the problems that the key anatomical structures of the temporal bone are small in volume and offer insufficient extractable features, the 3D network makes full use of the spatial information of temporal bone CT and realizes automatic segmentation of the key anatomical structures of the temporal bone, namely the malleus, the incus, the outer wall of the cochlea, the inner cavity of the cochlea, the external semicircular canal, the posterior semicircular canal, the anterior semicircular canal, the vestibule, and the internal auditory canal.

Description

Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism
Technical Field
The invention belongs to the field of medical image processing, and particularly relates to a temporal bone key anatomical structure small target segmentation method based on a 3D deep supervision mechanism.
Background
Temporal bone computed tomography (CT) is an established standard otologic examination for checking whether anatomical variations occur in the key anatomical structures of the temporal bone. With growing clinical demand, temporal bone imaging data are increasing rapidly, more and more data need to be observed and processed by doctors, and their workload has grown substantially. Therefore, automatically segmenting the key anatomical structures of interest from temporal bone CT is of great significance for reducing doctors' workload and for reducing missed diagnoses and misdiagnoses. Accurate segmentation of the key anatomical structures of the temporal bone not only helps improve the efficiency of medical image data processing, but is also important for clinical teaching and scientific research.
Medical image segmentation methods fall into two main categories: methods based on handcrafted features and methods based on deep learning. Before the rise of deep learning, many segmentation algorithms such as threshold segmentation, region growing, and active contour models were applied to medical image segmentation tasks. Although segmentation methods based on handcrafted features are relatively simple to implement, many factors affect their segmentation accuracy, and the accuracy requirements of small target segmentation in medical images are high, so traditional methods based on handcrafted features are not suitable for this task.
In recent years, semantic segmentation of medical images has become a popular research direction in intelligent medical image analysis. Segmenting small targets in medical images is a challenging task because the target region occupies a small proportion of the image, the contrast between target and background regions is low, boundaries are blurred, and the shapes and sizes of different individuals vary greatly. The key anatomical structures of the temporal bone are relatively small. For example, in a 512 × 512 × 199 voxel volume, the largest anatomical structure, the auditory meatus, has only 1298 voxels, and the smallest, the malleus, has only 184 voxels. Furthermore, the variability between different anatomical structures is large. These characteristics make intelligent segmentation of the key anatomical structures of the temporal bone challenging.
The fully convolutional network, a pioneer among semantic segmentation networks, replaces the fully connected layers with convolutional layers to classify images at the pixel level and trains an end-to-end encoding-decoding network, addressing semantic segmentation of natural images. However, its segmentation results are not accurate enough, and detail information such as boundaries is easily lost. Unlike natural images, medical images are often three-dimensional volume data; different slices contain not only in-plane feature information but also rich inter-slice spatial information. Most existing medical image segmentation methods are suited to relatively large anatomical structures such as the liver, heart, and lungs, and perform poorly on small target segmentation.
The invention provides a small target segmentation method for the key anatomical structures of the temporal bone based on a 3D deep supervision mechanism. A 3D encoding-decoding network is designed: densely connected networks are adopted in the encoding stage to extract features, enhancing feature propagation and improving feature reuse, and a migration module designed between the different densely connected network blocks adopts a 3D multi-pooling feature fusion strategy to fuse the features after max pooling and average pooling. In the decoding stage, a 3D deep supervision mechanism is introduced so that the output results of hidden layers and of the backbone network jointly guide network training.
Disclosure of the Invention
The invention aims to overcome the defects of existing segmentation methods. Aiming at the problems that the key anatomical structures of the temporal bone are small in volume and offer insufficient extractable features, a segmentation network based on a 3D deep supervision mechanism is provided; the 3D network makes full use of the spatial information of temporal bone CT to realize automatic segmentation of the key anatomical structures of the temporal bone, namely the malleus, the incus, the outer wall of the cochlea, the inner cavity of the cochlea, the external semicircular canal, the posterior semicircular canal, the anterior semicircular canal, the vestibule, and the internal auditory canal.
The invention is realized by adopting the following technical means:
A temporal bone key anatomical structure segmentation method based on a 3D deep supervision mechanism. The overall architecture of the method is mainly divided into two stages: an encoding stage and a decoding stage, as shown in figure 1.
The encoding stage comprises feature extraction by densely connected networks and multi-pooling feature fusion.
The decoding stage comprises feature recovery by long and short skip connections and a 3D deep supervision mechanism.
The method specifically comprises the following steps:
1) Encoding stage:
First, the densely connected network extracts features. The raw CT data is preprocessed, and 48 × 48 × 48 cubes are extracted and fed into the network. The encoding stage designs three densely connected network blocks containing different numbers of layers, where the input of each convolution layer is the concatenation of the outputs of all preceding layers. Directly connecting every preceding layer to the subsequent layers improves feature reuse, alleviates the vanishing-gradient problem, and improves the propagation of information and gradients through the whole network, which facilitates training. Dense connections require fewer parameters than traditional convolutional networks and do not need to relearn redundant feature maps. Let X_l be the output of the l-th layer and x_0, …, x_{l-1} be the feature cubes output by layers 0 to l-1; then the design inside each densely connected network block can be represented by formula (1):
X_l = H_l([x_0, x_1, …, x_{l-1}])    (1)
where [ ] denotes the concatenation of the output features of the different layers, and H_l(·) comprises three consecutive operations: batch normalization (BN), rectified linear unit (ReLU), and 3 × 3 × 3 convolution, with a growth rate k = 32. To prevent overfitting, a dropout layer with a drop rate of 0.5 is used immediately after the 3 × 3 × 3 convolution operation.
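For illustration, the following is a minimal sketch of one such densely connected 3D block, assuming a PyTorch implementation (the patent does not specify a framework): each layer applies BN, ReLU, and a 3 × 3 × 3 convolution with growth rate k = 32 followed by dropout with a rate of 0.5, and receives the concatenation of all preceding outputs as its input.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One H_l(.) unit: BN -> ReLU -> 3x3x3 Conv3d -> dropout."""
    def __init__(self, in_channels, growth_rate=32, drop_rate=0.5):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.drop = nn.Dropout3d(drop_rate)

    def forward(self, x):
        return self.drop(self.conv(self.relu(self.bn(x))))

class DenseBlock3D(nn.Module):
    """Implements X_l = H_l([x_0, x_1, ..., x_{l-1}]) by channel concatenation."""
    def __init__(self, in_channels, num_layers=3, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer3D(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # each layer sees the aggregation of all previous outputs
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```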
Secondly, multi-pooling feature fusion. After the output of the densely connected network block at each level, BN-ReLU-Conv3D is applied, followed by a dropout layer with a drop rate of 0.5 to prevent overfitting; 3D max pooling and 3D average pooling are then applied simultaneously, and the pooled results are concatenated. 3D max pooling preserves the edge features of the volume data, while 3D average pooling preserves its background information. Concatenating the two provides rich feature information for the subsequent segmentation.
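A corresponding sketch of the migration module under the same PyTorch assumption: BN-ReLU-Conv3D with dropout, followed by parallel 3D max pooling and 3D average pooling whose results are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class MultiPoolTransition3D(nn.Module):
    """Migration module between dense blocks: BN-ReLU-Conv3D-dropout, then fused pooling."""
    def __init__(self, in_channels, out_channels, drop_rate=0.5):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
        self.drop = nn.Dropout3d(drop_rate)
        self.max_pool = nn.MaxPool3d(kernel_size=2, stride=2)  # preserves edge features
        self.avg_pool = nn.AvgPool3d(kernel_size=2, stride=2)  # preserves background context

    def forward(self, x):
        x = self.drop(self.conv(self.relu(self.bn(x))))
        # concatenate the two pooled results to fuse edge and background information
        return torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
```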
2) Decoding stage:
First, long and short skip connections recover low-level semantic information. The output of the bottom densely connected network block in the encoding stage is tensor feature data with a resolution of 12 × 12 × 12; it is upsampled by transposed convolution to restore the resolution to 24 × 24 × 24 and concatenated, through a long connection, with the features output by the second-level densely connected network block of the encoding stage. One 3D convolution is applied to the concatenated features to extract features from the combined low-level and high-level semantics, and the result is upsampled by a transposed convolution to 48 × 48 × 48, the size of the input three-dimensional cube. The features output by the first densely connected network block of the encoding stage are first processed with 64 convolution kernels, and the features are concatenated using a combination of short and long connections rather than long connections alone, mainly to eliminate the semantic gap between low-level and high-level semantic features.
Secondly, the 3D deep supervision mechanism guides network training. In the encoding stage, the features output by the first densely connected network block are processed with 64 convolution kernels, followed by a 1 × 1 × 1 convolution and a softmax layer that outputs an auxiliary segmentation result. The second level of the decoding stage applies a convolution operation to the concatenated features to further extract features, then a transposed convolution to raise the resolution, followed by a 1 × 1 × 1 convolution kernel and a softmax layer, yielding the second auxiliary segmentation result. In the decoding stage, the last level applies convolution operations with different numbers of convolution kernels to the concatenated features and outputs the prediction of the backbone network; the prediction of the backbone network and the predictions of the branch networks jointly guide the training of the network. During training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function, which consists of a Dice similarity coefficient (DSC) loss function and a cross entropy loss function. The DSC loss function is defined as shown in equation (2):
where X and Y represent the predicted voxels and the real target voxels, respectively, n represents the number of classes of the target to be segmented (including the background), and x_i and y_i represent the number of target-labeled voxels contained in the predicted voxel data and in the real target voxel data, respectively. A weight, denoted W, is introduced for the cross entropy loss function, as shown in equation (3):
where N_k represents the number of target voxel labels in the voxel data to be segmented and N_c represents the total number of voxels in the voxel data to be segmented. The cross entropy loss function is shown in equation (4):
The joint objective function constructed from the loss functions defined above is shown in equation (5):
where λ is a hyper-parameter of the branch-network loss functions. A target loss function constructed from the loss functions of the backbone network and the branch networks jointly guides network training, reduces gradient vanishing, and accelerates the convergence of the network.
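Equations (2) to (5) are not reproduced in this text. The sketch below therefore shows one common formulation that is consistent with the definitions given here (a multi-class Dice loss, a cross entropy loss weighted by the target-voxel frequency derived from N_k and N_c, and a joint objective with the branch losses scaled by λ); the exact forms, the weighting scheme, and the value of λ used in the patent may differ, and the PyTorch implementation itself is an assumption.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, one_hot, eps=1e-5):
    # probs, one_hot: (N, C, D, H, W); probs are softmax outputs, one_hot is the ground truth
    dims = (0, 2, 3, 4)
    intersection = torch.sum(probs * one_hot, dims)
    union = torch.sum(probs, dims) + torch.sum(one_hot, dims)
    return 1.0 - torch.mean((2.0 * intersection + eps) / (union + eps))

def weighted_cross_entropy(logits, labels):
    # labels: (N, D, H, W) long tensor, 0 = background, 1 = target;
    # assumed weighting: the rare target class receives a weight close to 1
    n_total = float(labels.numel())
    n_target = max(float((labels > 0).sum().item()), 1.0)
    weight = torch.tensor([n_target / n_total, (n_total - n_target) / n_total],
                          dtype=logits.dtype, device=logits.device)
    return F.cross_entropy(logits, labels, weight=weight)

def joint_objective(main_logits, aux_logits_list, labels, lam=0.3):
    # lam is the branch-loss hyper-parameter; 0.3 is a placeholder, not a value from the patent
    num_classes = main_logits.shape[1]
    one_hot = F.one_hot(labels, num_classes).permute(0, 4, 1, 2, 3).float()
    def single(logits):
        return dice_loss(F.softmax(logits, dim=1), one_hot) + weighted_cross_entropy(logits, labels)
    return single(main_logits) + lam * sum(single(a) for a in aux_logits_list)
```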
To verify the effectiveness of the method, three common evaluation indexes for medical images are used: the Dice similarity coefficient (DSC), the average symmetric surface distance (ASD), and the average Hausdorff distance (AVD).
compared with the prior art, the invention has the following obvious advantages and beneficial effects:
The method takes a 3D convolutional neural network as its basis and makes full use of three-dimensional volume data information. On the basis of the traditional 3D U-Net, a new feature extraction module is proposed for the encoding stage: densely connected networks containing different numbers of layers are adopted to extract features, enhancing feature propagation and improving feature reuse. In the decoding stage, a 3D deep supervision mechanism changes single supervision into joint supervised training of the backbone network and auxiliary networks, making the network easier to train. In addition, the semantic gap between high-level and low-level semantic features is eliminated by combining long and short skip connections. Through these improvements to the encoding and decoding stages, the method is suited to small target segmentation tasks and can effectively improve small target segmentation accuracy.
The invention has the following characteristics:
1. The algorithm designs a new U-shaped 3D convolutional neural network for medical image segmentation and applies it, for the first time, to the task of segmenting the key anatomical structures of the temporal bone;
2. The algorithm proposes a multi-pooling feature fusion strategy that makes full use of multi-scale and multi-level features to improve the accuracy of small target segmentation. In addition, the combination of dense connections with long and short skip connections strengthens the fusion of boundary and detail features;
3. The algorithm introduces a 3D deep supervision mechanism, constructs a companion objective function for the hidden layers, guides network training, and improves the robustness of the segmentation model;
Description of the drawings:
FIG. 1, a network architecture diagram;
FIG. 2 is a comparison before and after multi-plane reconstruction;
FIG. 3 is a schematic diagram of a seamless split strategy;
The specific implementation is as follows:
The embodiments of the present invention are described below in conjunction with the accompanying drawings:
The invention uses a temporal bone CT data set for training and testing. The data set comprises temporal bone CT image data of different ages and sexes and contains normal temporal bone CT data of 64 individuals: 33 male and 31 female, with an average age of 44 years. Each case was subjected to multi-plane reconstruction at a resolution of 420 × 420 with 60 slices. The reconstructed data were annotated with labeling software, and 9 key anatomical structures were labeled: the malleus, incus, cochlea outer wall, cochlea inner cavity, external semicircular canal, posterior semicircular canal, anterior semicircular canal, vestibule, and internal auditory canal. In the experiments, the data of 8 subjects were used as the test set and the data of 56 subjects as the training set.
The data preprocessing adopted by the invention comprises two stages of multi-plane reconstruction and data annotation.
(1) Multi-plane reconstruction phase
Original CT imaging is affected by scanning parameter settings such as collimation and helical pitch and by the patient's position; the images exhibit varying degrees of skew, and the key anatomical structures of the two temporal bones are asymmetric. To ensure that temporal bone CT data maintain consistent layer thickness, layer spacing, and resolution under different imaging conditions, and that the bilateral key anatomical structures are symmetric, a post-processing workstation is used to perform multi-plane reconstruction on the original CT data; a comparison before and after multi-plane reconstruction is shown in fig. 2. The specific operation steps are as follows:
The first step: make the external semicircular canals symmetric. In the sagittal view, find the fullest layer of the external semicircular canal and make the reference line parallel to and bisect the external semicircular canal. Switch to the axial view, rotate the right-side image back and forth to find the fullest layer of the external semicircular canal, and rotate the left and right axial images so that the external semicircular canals on both sides are symmetric.
The second step: normalization. The scale of the image is uniformly set to 1:1 so that the size of the scanned image matches the actual size. A rectangular frame 10 cm wide and as long as the image is set, the external semicircular canal is placed inside the frame with its upper edge 5 cm from the upper and lower edges of the frame, and the image is cropped.
The third step: batch processing. Taking the layer in which the external semicircular canal is fullest as the starting point, 44 slices upward and 88 slices downward are selected to obtain all reconstructed layers. The layer thickness and layer spacing are set to 0.7 mm and the number of images in the sequence to 60 to complete the reconstruction.
(2) Data annotation phase
The first step: the multi-plane reconstructed images are imported into Materialise Mimics software, different masks are created for the different key anatomical structures, and a threshold range allowing labeling is set for each mask;
The second step: an experienced radiologist uses a brush tool to label the 9 key anatomical structures of the temporal bone voxel by voxel;
The third step: another experienced radiologist reviews and modifies the annotation results;
The fourth step: DICOM images of the 9 key anatomical structures are exported for each case.
The overall architecture of the proposed method is shown in figure 1. The algorithm is mainly divided into two stages: an encoding stage and a decoding stage.
(1) Encoding stage
The specific implementation steps of the encoding stage are as follows:
a) Feature extraction with dense connections
The first step: extract cubes for training. A 48 × 48 × 48 raw data cube and the corresponding labeled data cube are randomly extracted from the 420 × 420 × 60 voxel cube of the input data. Whether the labels in the labeled cube contain 1 is checked; if not, the extracted cube does not contain the target anatomical structure and must be re-extracted until the labeled cube contains the label 1. To eliminate the influence of background voxels on the segmentation task, the threshold interval of the target anatomical structures is set to -999 to 2347 according to the threshold ranges of the 9 key anatomical structures of the temporal bone: Hounsfield values smaller than -999 are set to -999 and values larger than 2347 are set to 2347. The Hounsfield values of the cube are divided by 255 to reduce the amount of computation and then normalized to a data distribution with a mean of 0 and a variance of 1. Data augmentation is realized by simultaneously rotating the raw data and the labeled data by an angle (-25 degrees);
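A preprocessing sketch under stated assumptions (NumPy/SciPy; the patent does not specify an implementation, and the rotation range of -25 to 25 degrees is an interpretation of the "(-25 degrees)" wording above):

```python
import numpy as np
from scipy.ndimage import rotate

def extract_training_cube(volume, label, size=48, hu_min=-999, hu_max=2347):
    # volume, label: (60, 420, 420) arrays after multi-plane reconstruction
    d, h, w = volume.shape
    while True:
        z = np.random.randint(0, d - size + 1)
        y = np.random.randint(0, h - size + 1)
        x = np.random.randint(0, w - size + 1)
        lab = label[z:z+size, y:y+size, x:x+size]
        if lab.max() == 1:                        # re-draw until the target structure is present
            img = volume[z:z+size, y:y+size, x:x+size].astype(np.float32)
            break
    img = np.clip(img, hu_min, hu_max) / 255.0    # clip Hounsfield values, shrink the range
    img = (img - img.mean()) / (img.std() + 1e-8) # zero mean, unit variance
    angle = np.random.uniform(-25, 25)            # assumed augmentation range
    img = rotate(img, angle, axes=(1, 2), reshape=False, order=1)
    lab = rotate(lab, angle, axes=(1, 2), reshape=False, order=0)
    return img, lab
```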
The second step: feature extraction. For the extracted raw data cube, features are first extracted with a 3 × 3 × 3 convolution kernel; the stride in all three dimensions is 1 and SAME padding (filled with 0) is used, yielding 64 features. These features are input to a 3-layer densely connected network, in which the input of each convolution operation is the concatenation of the features output by all preceding convolutions; the kernel size, stride, and padding mode used in the densely connected network are the same as those of this initial convolution;
The third step: feature dimension reduction. The previously output features in the dense connection block are aggregated, and a bottleneck strategy is then adopted to reduce the number of feature cubes. Batch normalization and ReLU activation are first applied to the features, and 4k features are then output using a 3 × 3 × 3 convolution kernel, where k is the growth rate.
b) Multi-pooling feature fusion
A multi-pooling feature fusion migration module is designed between the different densely connected network blocks.
The first step: batch normalization is applied to the features extracted by the densely connected network block, and a ReLU activation function increases the nonlinearity of the network. Features are then extracted with a three-dimensional convolution kernel of size 3 × 3 × 3, and dropout with a rate of 0.5 is used to prevent overfitting.
The second step: 3D max pooling and 3D average pooling are performed on the features separately; the pooling kernel size is 2 × 2 × 2 and the stride in each of the three dimensions is 2. 3D max pooling selects the maximum value within the pooling kernel, while 3D average pooling takes the average value within the pooling kernel. The former better preserves edge features, and the latter preserves global background information. The features obtained after max pooling and average pooling are concatenated together.
(2) Decoding stage
The specific implementation steps of the decoding stage are as follows:
a) Combination of long and short skip connections.
The first step: denote the features output by the first, second, and third densely connected network blocks in the encoding stage as F1, F2, and F3, with resolutions of 48 × 48 × 48, 24 × 24 × 24, and 12 × 12 × 12, respectively. A transposed convolution is applied to F3 with a stride of 2 in all three dimensions and SAME padding (filled with 0); the feature set T2 obtained after the transposed convolution has a resolution of 24 × 24 × 24;
The second step: the features F2 output by the second densely connected network block in the encoding stage are concatenated with T2 to form a new feature group D2, from which features are extracted by 3D convolution;
The third step: the features F1 output by the first densely connected network block in the encoding stage are processed by a 3D convolution to obtain 64 features, denoted M1. A transposed convolution is applied to feature group D2 to restore the feature resolution to 48 × 48 × 48, denoted T1. Feature groups F1, M1, and T1 are concatenated to obtain feature group D1, where M1 and F1 are joined through a short connection and a long connection, respectively, which eliminates to a certain extent the semantic gap between low-level and high-level semantic features.
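A decoder sketch combining the long and short skip connections described above, again assuming PyTorch; the channel counts c1, c2, c3 of the dense-block outputs are placeholders.

```python
import torch
import torch.nn as nn

class Decoder3D(nn.Module):
    def __init__(self, c1, c2, c3):
        super().__init__()
        self.up2 = nn.ConvTranspose3d(c3, c3, kernel_size=2, stride=2)    # 12^3 -> 24^3
        self.conv2 = nn.Conv3d(c3 + c2, 128, kernel_size=3, padding=1)
        self.up1 = nn.ConvTranspose3d(128, 128, kernel_size=2, stride=2)  # 24^3 -> 48^3
        self.short = nn.Conv3d(c1, 64, kernel_size=3, padding=1)          # M1 = 64-kernel conv of F1

    def forward(self, f1, f2, f3):
        t2 = self.up2(f3)
        d2 = self.conv2(torch.cat([f2, t2], dim=1))   # long skip connection F2 + T2
        t1 = self.up1(d2)
        m1 = self.short(f1)                           # short skip connection
        d1 = torch.cat([f1, m1, t1], dim=1)           # long and short skips combined
        return d1, d2, m1
```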
b) 3D deep supervision mechanism
The first step: a transposed convolution is applied to the feature group D2 output in the decoding stage to restore the resolution to 48 × 48 × 48, followed by a convolution with 1 × 1 × 1 kernels whose number of output feature cubes is 2; softmax then computes the probability of each voxel belonging to the target anatomical structure, recorded as aux_pred1;
The second step: similarly, a 1 × 1 × 1 convolution is applied to the feature group M1 output in the encoding stage, and softmax computes the classification probability of each voxel, recorded as aux_pred2;
The third step: 3 × 3 × 3 convolution kernels are successively applied to feature group D1 to extract features, outputting 128 and then 64 features; a 1 × 1 × 1 convolution is then applied, and finally softmax computes the classification probability of each voxel, recorded as main_pred;
The fourth step: the predicted voxel cubes obtained in the first and second steps are the auxiliary prediction results, and the predicted voxel cube obtained in the third step is the backbone network prediction. Cross entropy and DSC loss functions are computed between aux_pred1, aux_pred2, main_pred and the ground truth, respectively, and the losses computed from the auxiliary predictions together with the backbone network loss form a joint loss function that guides network training.
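A sketch of the three prediction heads producing main_pred, aux_pred1, and aux_pred2, under the same PyTorch assumption; the channel counts c_d1, c_d2, c_m1 are placeholders matching the feature groups D1, D2, and M1 above.

```python
import torch.nn as nn

class DeepSupervisionHeads(nn.Module):
    def __init__(self, c_d1, c_d2, c_m1, num_classes=2):
        super().__init__()
        # aux_pred1: D2 upsampled back to 48^3, then a 1x1x1 convolution
        self.aux1 = nn.Sequential(
            nn.ConvTranspose3d(c_d2, c_d2, kernel_size=2, stride=2),
            nn.Conv3d(c_d2, num_classes, kernel_size=1))
        # aux_pred2: M1 followed by a 1x1x1 convolution
        self.aux2 = nn.Conv3d(c_m1, num_classes, kernel_size=1)
        # main_pred: two successive 3x3x3 convolutions (128 and 64 kernels), then 1x1x1
        self.main = nn.Sequential(
            nn.Conv3d(c_d1, 128, kernel_size=3, padding=1),
            nn.Conv3d(128, 64, kernel_size=3, padding=1),
            nn.Conv3d(64, num_classes, kernel_size=1))

    def forward(self, d1, d2, m1):
        # softmax over the returned logits gives the per-voxel class probabilities
        return self.main(d1), self.aux1(d2), self.aux2(m1)
```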
The following describes the process of network training and testing:
A segmentation model is trained separately for each key anatomical structure of the temporal bone to be segmented. The input data received by the network has a size of 48 × 48 × 48, and the ground truth contains 2 labels: 0 represents the background and 1 represents the target anatomical structure. The output of the network has the same size as the input and consists of 2 cubes, representing the segmentation results for the background and the foreground, respectively.
a) Model training
During network training, the batch size is set to 1, the initial learning rate to 0.001, and the momentum coefficient to 0.5; after each batch, one sample is randomly drawn from the validation set for verification. The model is saved every 10000 iterations, for a total of 180000 iterations.
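A training-loop sketch reflecting these settings; the optimizer choice (SGD with momentum) is an assumption, since only the learning rate and momentum coefficient are stated, and sample_batch and joint_objective are hypothetical helpers.

```python
import torch

def train(model, sample_batch, joint_objective, iterations=180000, save_every=10000):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.5)
    for it in range(1, iterations + 1):
        image, labels = sample_batch()            # one 48x48x48 cube per step (batch size 1)
        main_pred, aux_pred1, aux_pred2 = model(image)
        loss = joint_objective(main_pred, [aux_pred1, aux_pred2], labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % save_every == 0:                  # checkpoint every 10000 iterations
            torch.save(model.state_dict(), f"model_{it}.pt")
```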
b) Model testing
The CT data of each subject after multi-plane reconstruction has a size of 420 × 420 × 60 voxels. To match the input size accepted by the model, a seamless segmentation strategy is adopted in the testing stage, as shown in fig. 3. The data to be tested are first decomposed into cubes of 48 × 48 × 48 voxels with an overlap factor of 4 according to the seamless segmentation strategy. The small cubes are then fed into the trained model to obtain predictions, and the predictions of the small cubes are finally recombined to obtain the final segmentation result of the data to be tested.
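An inference sketch of the seamless segmentation strategy, assuming that an "overlap factor of 4" means a sliding stride of 48 / 4 = 12 voxels (an interpretation, not stated explicitly): the volume is decomposed into 48 × 48 × 48 cubes, each cube is predicted, and the overlapping predictions are averaged back into the full volume.

```python
import numpy as np

def sliding_window_predict(volume, predict_fn, size=48, stride=12):
    # volume: (60, 420, 420) array; predict_fn returns (2, 48, 48, 48) softmax probabilities
    d, h, w = volume.shape
    prob = np.zeros((2, d, h, w), dtype=np.float32)
    count = np.zeros((d, h, w), dtype=np.float32)
    zs = sorted(set(list(range(0, d - size + 1, stride)) + [d - size]))
    ys = sorted(set(list(range(0, h - size + 1, stride)) + [h - size]))
    xs = sorted(set(list(range(0, w - size + 1, stride)) + [w - size]))
    for z in zs:
        for y in ys:
            for x in xs:
                cube = volume[z:z+size, y:y+size, x:x+size]
                prob[:, z:z+size, y:y+size, x:x+size] += predict_fn(cube)
                count[z:z+size, y:y+size, x:x+size] += 1.0
    return (prob / count).argmax(axis=0)           # recombined label volume (0 or 1 per voxel)
```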
A comparison of the accuracy of this algorithm with different algorithms on the task of segmenting the key anatomical structures of the temporal bone is shown in Table 1 of the description.
TABLE 1 results of segmentation of 9 key anatomical structures of temporal bone by different methods
note: malleus, incus inculus, ECC cochlea outer wall, ICC cochlea inner cavity, LSC external semicircular canal, PSC posterior semicircular canal, SSC anterior semicircular canal, vestibule, IAM internal auditory canal
note: malleus, incus inculus, ECC cochlea outer wall, ICC cochlea inner cavity, LSC external semicircular canal, PSC posterior semicircular canal, SSC anterior semicircular canal, vestibule, and IAM internal auditory canal.

Claims (4)

1. A small target segmentation method for key anatomical structures of the temporal bone based on a 3D deep supervision mechanism, characterized by comprising the following steps:
1) Encoding stage:
Firstly, extracting features with a densely connected network; preprocessing the original CT data and extracting 48 × 48 × 48 cubes to be fed into the network;
Let X_l be the output of the l-th layer and x_0, …, x_{l-1} be the feature cubes output by layers 0 to l-1; then the design inside each densely connected network block is represented by formula (1):
X_l = H_l([x_0, x_1, …, x_{l-1}])    (1)
wherein [ ] denotes the concatenation of the output features of the different layers, and H_l(·) comprises three consecutive operations: batch normalization (BN), rectified linear unit (ReLU), and 3 × 3 × 3 convolution, with a growth rate k equal to 32; to prevent overfitting, a dropout layer with a drop rate of 0.5 is used immediately after the 3 × 3 × 3 convolution operation;
secondly, multi-pooling feature fusion;
adopting BN-ReLU-Conv3D after the output of the densely connected network block at each level, followed by a dropout layer with a drop rate of 0.5 to prevent overfitting; after the dropout, simultaneously adopting 3D max pooling and 3D average pooling, and concatenating the pooled tensor features;
2) Decoding stage:
firstly, recovering low-level semantic information by long and short skip connections; tensor features with a resolution of 12 × 12 × 12 are output by the bottom densely connected network block in the encoding stage, upsampled by transposed convolution to restore the resolution to 24 × 24 × 24, and concatenated through a long connection with the features output by the second-level densely connected network block of the encoding stage; one 3D convolution is applied to the concatenated features to extract the features after combining the low-level and high-level semantics, and the result is upsampled by a transposed convolution to 48 × 48 × 48, the size of the input three-dimensional cube; the features output by the first densely connected network block of the encoding stage are processed with 64 convolution kernels, and the features are concatenated using a combination of short and long connections rather than long connections alone;
secondly, a 3D deep supervision mechanism guides network training; in the encoding stage, the features output by the first densely connected network block are processed with 64 convolution kernels, followed by a 1 × 1 × 1 convolution and a softmax layer that outputs an auxiliary segmentation result; the second level of the decoding stage applies a convolution operation to the concatenated features to further extract features, then a transposed convolution to raise the resolution, followed by a 1 × 1 × 1 convolution kernel and a softmax layer, yielding the second auxiliary segmentation result;
in the decoding stage, the last level applies convolution operations with different numbers of convolution kernels to the concatenated features and outputs the prediction of the backbone network; the prediction of the backbone network and the predictions of the branch networks jointly guide the training of the network; during training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function, which consists of a Dice similarity coefficient (DSC) loss function and a cross entropy loss function; the DSC loss function is defined as shown in equation (2):
wherein X and Y represent the predicted voxels and the real target voxels, respectively, n represents the number of classes of the target to be segmented (including the background), and x_i and y_i represent the number of target-labeled voxels contained in the predicted voxel data and in the real target voxel data, respectively; a weight, denoted W, is introduced for the cross entropy loss function, as shown in equation (3):
Wherein Nk represents the number of target voxel marks in the voxel data to be segmented, and Nc represents the number of all voxels in the voxel data to be segmented; the cross entropy loss function is shown in equation (4):
The joint objective function constructed from the loss functions defined above is shown in equation (5):
wherein λ is a hyper-parameter of the branch-network loss functions; a target loss function constructed from the loss functions of the backbone network and the branch networks jointly guides network training.
2. A small target segmentation method for key anatomical structures of the temporal bone based on a 3D deep supervision mechanism, characterized by comprising the following steps:
training and testing by adopting a temporal bone CT data set; the temporal bone CT data set comprises temporal bone CT image data of different ages and different sexes; the resolution of each case of data after multi-plane reconstruction is 420 x 420; labeling data after multi-plane reconstruction by using labeling software, and labeling 9 key anatomical structures of a malleus, an incus, a cochlea outer wall, a cochlea inner cavity, an outer semicircular canal, a rear semicircular canal, a front semicircular canal, a vestibule and an inner auditory canal;
The adopted data preprocessing comprises two stages of multi-plane reconstruction and data annotation;
(1) multi-plane reconstruction phase
original CT imaging is affected by scanning parameter settings such as collimation and helical pitch and by the patient's position; the images exhibit varying degrees of skew, and the key anatomical structures of the two temporal bones are asymmetric;
Adopting a post-processing workstation to carry out multi-plane reconstruction on the original CT data, and comprising the following specific operation steps:
The first step: making the external semicircular canals symmetric; finding the fullest layer of the external semicircular canal in the sagittal view and making the reference line parallel to and bisect the external semicircular canal; switching to the axial view, rotating the right-side image back and forth to find the fullest layer of the external semicircular canal, and rotating the left and right axial images so that the external semicircular canals on both sides are symmetric;
The second step: normalization; uniformly setting the scale of the image to 1:1 so that the size of the scanned image matches the actual size; setting a rectangular frame 10 cm wide and as long as the image, placing the external semicircular canal inside the frame with its upper edge 5 cm from the upper and lower edges of the frame, and cropping the image;
The third step: carrying out batch treatment;
(2) data annotation phase
the first step is as follows: importing the image after the multi-plane reconstruction into Materialise Mimics software, building different masks for different key anatomical structures, and setting a threshold range allowing labeling for each Mask;
the second step is that: respectively carrying out voxel marking on 9 key anatomical structures of the temporal bone;
the third step: auditing and modifying the marked result;
The fourth step: DICOM images of the 9 key anatomical structures are exported for each case.
3. A small target segmentation method for a critical anatomical structure of a temporal bone based on a 3D deep supervision mechanism is characterized by comprising two stages: an encoding stage and a decoding stage;
(1) Encoding stage
The specific implementation steps of the encoding stage are as follows:
a) Feature extraction with dense connections
The first step: extracting cubes for training; randomly extracting a 48 × 48 × 48 raw data cube and the corresponding labeled data cube from the 420 × 420 × 60 voxel cube of the input data; checking whether the labels in the labeled cube contain 1, and if not, the extracted cube does not contain the target anatomical structure and must be re-extracted until the labeled cube contains the label 1; to eliminate the influence of background voxels on the segmentation task, setting the threshold interval of the target anatomical structures to -999 to 2347 according to the threshold ranges of the 9 key anatomical structures of the temporal bone, setting Hounsfield values smaller than -999 to -999 and values larger than 2347 to 2347; dividing the Hounsfield values of the cube by 255; then normalizing the data to a distribution with a mean of 0 and a variance of 1; realizing data augmentation by simultaneously rotating the raw data and the labeled data by an angle (-25 degrees);
The second step: feature extraction; for the extracted raw data cube, first extracting features with a 3 × 3 × 3 convolution kernel, with a stride of 1 in all three dimensions and SAME padding filled with 0, to obtain 64 features; inputting these features into a 3-layer densely connected network, in which the input of each convolution operation is the concatenation of the features output by all preceding convolutions, and the kernel size, stride, and padding mode used in the densely connected network are the same as those of this initial convolution;
The third step: feature dimension reduction; aggregating the previously output features in the dense connection block and then adopting a bottleneck strategy to reduce the number of feature cubes; first applying batch normalization and ReLU activation to the features, and then outputting 4k features with a 3 × 3 × 3 convolution kernel, where k is the growth rate;
b) Multi-pooling feature fusion
A multi-pooling feature fusion migration module is designed among different densely connected network blocks;
The first step: applying batch normalization to the features extracted by the densely connected network block and increasing the nonlinearity of the network with a ReLU activation function; then extracting features with a three-dimensional convolution kernel of size 3 × 3 × 3 and preventing overfitting with dropout, where the dropout rate is 0.5;
The second step: performing 3D max pooling and 3D average pooling on the features separately, with a pooling kernel size of 2 × 2 × 2 and a stride of 2 in all three dimensions; 3D max pooling selects the maximum value within the pooling kernel, while 3D average pooling selects the average value within the pooling kernel; concatenating the features obtained after max pooling and average pooling;
(2) Decoding stage
the specific implementation steps of the decoding stage are as follows:
a) Combination of long and short skip connections;
The first step: denoting the features output by the first, second, and third densely connected network blocks in the encoding stage as F1, F2, and F3, with resolutions of 48 × 48 × 48, 24 × 24 × 24, and 12 × 12 × 12, respectively; applying a transposed convolution to F3 with a stride of 2 in all three dimensions and SAME padding filled with 0, the feature set T2 obtained after the transposed convolution having a resolution of 24 × 24 × 24;
the second step is that: splicing the characteristics F2 output by the second densely connected network block in the encoding stage with T2 to form a new characteristic group D2; extracting the characteristics of D2 by adopting 3D convolution;
The third step: the features F1 output by the first densely connected network block in the encoding stage are processed by a 3D convolution to obtain 64 features, denoted M1; a transposed convolution is applied to feature group D2 to restore the feature resolution to 48 × 48 × 48, denoted T1; feature groups F1, M1, and T1 are concatenated to obtain feature group D1, where M1 and F1 are joined through a short connection and a long connection, respectively;
b)3D deep supervision mechanism
the first step is as follows: performing transpose convolution operation on a feature group D2 output in a decoding stage to restore the resolution to 48 × 48 × 48, performing convolution with convolution kernels of 1 × 1 × 1, wherein the number of output feature cubes is 2, and calculating the probability value of each voxel as a target anatomical structure by using softmax and recording as aux _ pred 1;
the second step is that: similarly, convolution kernel convolution with the size of 1 × 1 × 1 is adopted for the feature group M1 output in the encoding stage, and softmax is adopted to calculate the classification probability of each voxel and record the classification probability as aux _ pred 2;
The third step: successively adopting convolution kernels with the size of 3 multiplied by 3 to extract features from the feature group D1, respectively outputting 128 features and 64 features, adopting convolution kernels with the size of 1 multiplied by 1 to perform convolution, and finally adopting softmax to calculate the classification probability of each voxel and recording the classification probability as main _ pred;
The fourth step: the predicted voxel cubes obtained in the first and second steps are the auxiliary prediction results, and the predicted voxel cube obtained in the third step is the backbone network prediction; cross entropy and DSC loss functions are computed between aux_pred1, aux_pred2, main_pred and the ground truth, respectively, and the losses computed from the auxiliary predictions together with the backbone network loss form a joint loss function that guides network training.
4. A small target segmentation method for key anatomical structures of the temporal bone based on a 3D deep supervision mechanism, characterized by comprising the following steps:
The following describes the process of network training and testing:
Respectively training a segmentation model for each key anatomical structure of the temporal bone to be segmented; the input data received by the network has a size of 48 × 48 × 48, and the real target contains 2 labels, where 0 represents the background and 1 represents the target anatomical structure; the output of the network has the same size as the input and consists of 2 cubes, which respectively represent the segmentation results of the background and the foreground;
a) model training
During network training, the batch size is set to 1, the initial learning rate to 0.001, and the momentum coefficient to 0.5; after each batch, one sample is randomly drawn from the validation set for verification; the model is saved every 10000 iterations, for a total of 180000 iterations;
b) model testing
The size of CT data for each person after multi-planar reconstruction is 420 x 60 voxels, and in order to meet the input data size received by the model, a seamless segmentation strategy is adopted in the testing stage: firstly, decomposing data to be tested into a plurality of cubes with the size of 48 multiplied by 48 voxels according to a seamless segmentation strategy, wherein the overlapping factor is 4; and then respectively sending the small cubes into the trained model to obtain a prediction result, and finally recombining the prediction results of the small cubes to obtain a final segmentation result of the data to be detected.
CN201910799709.5A 2019-08-28 2019-08-28 Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism Active CN110544264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910799709.5A CN110544264B (en) 2019-08-28 2019-08-28 Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910799709.5A CN110544264B (en) 2019-08-28 2019-08-28 Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism

Publications (2)

Publication Number Publication Date
CN110544264A true CN110544264A (en) 2019-12-06
CN110544264B CN110544264B (en) 2023-01-03

Family

ID=68712213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910799709.5A Active CN110544264B (en) 2019-08-28 2019-08-28 Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism

Country Status (1)

Country Link
CN (1) CN110544264B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018125580A1 (en) * 2016-12-30 2018-07-05 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN108510502A (en) * 2018-03-08 2018-09-07 华南理工大学 Melanoma picture tissue segmentation methods based on deep neural network and system
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
万林祥 et al., "Bone tissue CT image segmentation method based on graph cut", 《数字制造科学》 (Digital Manufacturing Science) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509119A (en) * 2019-12-23 2021-03-16 首都医科大学附属北京友谊医院 Spatial data processing and positioning method and device for temporal bone and electronic equipment
CN111161271A (en) * 2019-12-31 2020-05-15 电子科技大学 Ultrasonic image segmentation method
CN111242952A (en) * 2020-01-15 2020-06-05 腾讯科技(深圳)有限公司 Image segmentation model training method, image segmentation device and computing equipment
CN111242952B (en) * 2020-01-15 2023-06-30 腾讯科技(深圳)有限公司 Image segmentation model training method, image segmentation device and computing equipment
CN111292339B (en) * 2020-01-22 2023-01-10 北京航空航天大学 Clinical temporal bone CT multi-structure extraction method and device based on W-type network structure
CN111292339A (en) * 2020-01-22 2020-06-16 北京航空航天大学 Clinical temporal bone CT multi-structure extraction method and device based on W-type network structure
CN110956634A (en) * 2020-02-26 2020-04-03 南京慧脑云计算有限公司 Deep learning-based automatic detection method and system for cerebral microhemorrhage
CN112288687A (en) * 2020-09-08 2021-01-29 温州市人民医院 Inner ear space attitude analysis method and analysis system
CN112288687B (en) * 2020-09-08 2024-04-09 温州市人民医院 Inner ear space posture analysis method and analysis system
CN112419330A (en) * 2020-10-16 2021-02-26 北京工业大学 Temporal bone key anatomical structure automatic positioning method based on spatial relative position prior
CN112419322A (en) * 2020-10-16 2021-02-26 北京工业大学 Temporal bone external semicircular canal segmentation method based on 3D multi-scale multi-pooling feature fusion network
CN112419330B (en) * 2020-10-16 2024-05-24 北京工业大学 Temporal bone key anatomical structure automatic positioning method based on space relative position priori
CN112489001A (en) * 2020-11-23 2021-03-12 石家庄铁路职业技术学院 Tunnel water seepage detection method based on improved deep learning
CN112489001B (en) * 2020-11-23 2023-07-25 石家庄铁路职业技术学院 Tunnel water seepage detection method based on improved deep learning
CN112634293A (en) * 2021-01-14 2021-04-09 北京工业大学 Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network
CN112907537A (en) * 2021-02-20 2021-06-04 司法鉴定科学研究院 Skeleton sex identification method based on deep learning and on-site virtual simulation technology
CN113129310A (en) * 2021-03-04 2021-07-16 同济大学 Medical image segmentation system based on attention routing
CN112950646A (en) * 2021-04-06 2021-06-11 高燕军 HRCT image ossicle automatic segmentation method based on deep learning
CN113850818A (en) * 2021-08-27 2021-12-28 北京工业大学 Ear CT image vestibule segmentation method mixing 2D and 3D convolutional neural networks
CN113850760A (en) * 2021-08-27 2021-12-28 北京工业大学 Vestibule detection method based on ear CT (computed tomography) image
CN113850760B (en) * 2021-08-27 2024-05-28 北京工业大学 Ear CT image vestibule detection method

Also Published As

Publication number Publication date
CN110544264B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN110544264B (en) Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism
CN113077471B (en) Medical image segmentation method based on U-shaped network
CN110503654B (en) Medical image segmentation method and system based on generation countermeasure network and electronic equipment
US11580646B2 (en) Medical image segmentation method based on U-Net
CN110097550B (en) Medical image segmentation method and system based on deep learning
CN113012172B (en) AS-UNet-based medical image segmentation method and system
CN110930416B (en) MRI image prostate segmentation method based on U-shaped network
CN107492071A (en) Medical image processing method and equipment
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN111798462A (en) Automatic delineation method for nasopharyngeal carcinoma radiotherapy target area based on CT image
Li et al. DenseX-net: an end-to-end model for lymphoma segmentation in whole-body PET/CT images
CN109754403A (en) Tumour automatic division method and system in a kind of CT image
CN111968120A (en) Tooth CT image segmentation method for 3D multi-feature fusion
CN111369574B (en) Thoracic organ segmentation method and device
CN114782350A (en) Multi-modal feature fusion MRI brain tumor image segmentation method based on attention mechanism
CN110859642B (en) Method, device, equipment and storage medium for realizing medical image auxiliary diagnosis based on AlexNet network model
CN114612408B (en) Cardiac image processing method based on federal deep learning
CN110211139A (en) Automatic segmentation Radiotherapy of Esophageal Cancer target area and the method and system for jeopardizing organ
Lin et al. BATFormer: Towards boundary-aware lightweight transformer for efficient medical image segmentation
CN106157249A (en) Based on the embedded single image super-resolution rebuilding algorithm of optical flow method and sparse neighborhood
Chen et al. Generative adversarial U-Net for domain-free medical image augmentation
CN109920512A (en) A kind of training method and device of 3-dimensional dose distributed network model
CN116563533A (en) Medical image segmentation method and system based on target position priori information
Liu et al. Tracking-based deep learning method for temporomandibular joint segmentation
CN114387282A (en) Accurate automatic segmentation method and system for medical image organs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant