CN112634293A - Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network

Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network

Info

Publication number
CN112634293A
CN112634293A (application CN202110045206.6A)
Authority
CN
China
Prior art keywords: segmentation, temporal bone, convolution, bone, stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110045206.6A
Other languages
Chinese (zh)
Inventor
李晓光
伏鹏
朱梓垚
卓力
张辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110045206.6A
Publication of CN112634293A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10072 - Tomographic images
    • G06T 2207/10081 - Computed x-ray tomography [CT]
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20092 - Interactive image processing based on input by user
    • G06T 2207/20104 - Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

An automatic segmentation method for the inner ear bone cavity structures of the temporal bone, based on a coarse-to-fine dense encoding-decoding network, belonging to the field of medical imaging. The invention adopts a coarse-to-fine framework: it first coarsely segments the anatomical structures to be segmented in the temporal bone region and computes the coordinates of their center point. Around this center point, the image is expanded outward to a region that fully contains the inner ear bone cavity structures while retaining part of the background information, and this sub-region is used for further fine segmentation. In the fine segmentation stage, dense connection modules are introduced into the encoding process to extract richer features, and hole (dilated) convolutions are added inside the dense connection modules so that the segmentation algorithm obtains a larger receptive field over the target to be segmented and extracts more complete surrounding features and spatial information. In the decoding stage, the features extracted during encoding are upsampled by transposed convolutions, and a dense connection module is introduced after each transposed convolution to strengthen the reuse of decoded information. The invention achieves more accurate segmentation.

Description

Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network
Technical Field
The invention belongs to the field of medical image processing, and particularly relates to an automatic segmentation method for the inner ear bone cavity structures of the temporal bone based on a coarse-to-fine dense encoding-decoding network.
Background
Temporal bone CT is an important reference for physicians examining ear diseases. The temporal bone region is divided into three parts, the outer ear, the middle ear, and the inner ear, containing some 30 tiny anatomical structures in total. The inner ear region is one of the most important parts of the temporal bone, helping the human body hear sound and maintain balance. It mainly comprises the cochlea, the vestibule, and the lateral, posterior, and anterior semicircular canals; these structures are formed mainly of interconnected bone cavities, and each plays a different role in ensuring human hearing and balance. The cochlea acts together with the outer and middle ear: external sound waves drive the fluid within the cochlea to flow, bending the hair cells inside it, so that the motion signal of the sound wave is converted into an electrical signal and sent to the brain through the auditory nerve. The vestibule and the three semicircular canals are essential to maintaining balance: the vestibule, the junction between the cochlea and the semicircular canals, contains fluid and hair cells and senses body motion through fluid flow; the three semicircular canals sit at right angles to one another, and when the body moves, the fluid inside them flows, stimulating the internal hair cells and helping people sense the direction of motion. The inner ear region is therefore an important reference for symptoms such as hearing loss and dizziness. In recent years, with the development of medical imaging technology, temporal bone CT image data has grown rapidly, but for lack of automatic analysis tools, large amounts of this data are difficult to put to effective use in related analysis and research.
Medical image segmentation is a complex and key basic step in automatic medical image processing and analysis, medical research, and clinical diagnosis. Its aim is to segment the parts of a medical image with particular meaning and to provide a reliable reference for tasks such as clinical diagnosis, surgical planning, and clinical teaching. Accurate segmentation not only reduces physicians' workload but also helps them further understand the characteristics of anatomical structures and perform physiological analysis, such as measuring the cochlear turn count, the vestibule size, and the included angles of semicircular canal curvature.
Small-object segmentation of anatomical structures has long been a challenging task in medical image segmentation. In the temporal bone, the structures to be segmented account for less than 1% of the whole CT volume, and owing to the particularities of temporal bone anatomy there is usually no clear boundary between a structure and its surroundings, which makes segmentation of the temporal bone anatomy difficult.
Classic medical image segmentation algorithms include thresholding and region growing, but because the boundaries between anatomical structures in temporal bone CT are indistinct and the structures are fine and small in volume, such methods struggle to produce accurate results. More efficient and accurate segmentation algorithms are therefore urgently needed.
In recent years, with the development of deep learning, many medical image segmentation algorithms based on convolutional neural networks have emerged. To fully capture both in-slice and inter-slice information, these algorithms usually adopt a three-dimensional neural network with a U-shaped encoding-decoding structure. Because of computational resource limits, they typically slide a window over the complete CT volume and segment CT blocks one by one to predict voxel classes. Since the temporal bone targets to be segmented are tiny and easily disturbed by the complex background, both segmentation speed and segmentation accuracy remain unsatisfactory.
The invention designs a temporal bone CT image segmentation algorithm built on a coarse-to-fine densely connected encoding-decoding network for automatically segmenting the inner ear bone cavity structures in the temporal bone region. First, within the coarse-to-fine framework, an efficient lightweight segmentation algorithm coarsely segments the anatomical structures to be segmented in the temporal bone region, and the center point coordinates are computed from the set of foreground points predicted by the coarse result. Then, around this center point, the image is expanded outward to a region that fully contains the inner ear bone cavity structures while retaining part of the background information; this sub-region is used for further fine segmentation. In the fine segmentation stage, dense connection modules are introduced into the encoding process to extract richer features, and hole convolutions are added inside them so that the segmentation algorithm obtains a larger receptive field over the target and extracts more complete surrounding features and spatial information. In the decoding stage, the features extracted during encoding are upsampled by transposed convolutions, and a dense connection module is introduced after each transposed convolution to strengthen the reuse of decoded information. In the remaining parts, the fine segmentation network continues to use the 3D deep supervision mechanism and the 3D multi-pooling feature fusion strategy of patent CN110544264A (published December 6, 2019) to guide the training of the segmentation algorithm.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing methods for small-target segmentation in medical images. Segmenting small anatomical structures has long been challenging, and the temporal bone is no exception: in a typical temporal bone CT sequence, the voxels of the target anatomy account for less than 1% of the complete volume, so the large, complex background can negatively affect the segmentation of the target. Moreover, owing to the particularities of temporal bone CT, the boundaries between anatomical structures are not clearly delineated, which further complicates their segmentation. To address these problems, a coarse-to-fine framework for automatic segmentation of the key anatomical structures of the temporal bone is designed. The fine segmentation algorithm innovates on the small-target segmentation method based on a 3D deep supervision mechanism proposed in patent CN110544264A (published December 6, 2019): by introducing dense connection blocks and hole convolution, it obtains a larger receptive field over the target and richer features, segmenting the temporal bone anatomy automatically and more accurately.
The invention is realized by adopting the following technical means:
A method for automatic segmentation of the temporal bone inner ear bone cavity structures based on a coarse-to-fine dense network. The method is divided into 3 stages overall: coarse localization of the inner ear bone cavity structures based on coarse segmentation, candidate region extraction for the inner ear bone cavity structures, and fine segmentation of the inner ear bone cavity structures. The overall flow is shown in Figure 1 of the specification.
The method specifically comprises the following steps:
1) Coarse localization stage of the temporal bone inner ear bone cavity structures based on coarse segmentation:
The first step: the temporal bone inner ear bone cavity structures are coarsely segmented using a conventional medical image segmentation method with a lightweight network structure and high segmentation speed. In the coarse segmentation model training stage, 48 × 48 × 48 cubes are randomly extracted at the same position from a complete temporal bone CT volume and its annotation file (the actual physical extent of a CT image in the anterior-posterior and left-right directions is the number of pixels times the pixel spacing, and in the superior-inferior direction the number of slices times the slice spacing). A cube whose actual physical extent exceeds 24 mm can fully contain the inner ear bone cavity structures; since the pixel spacing and slice spacing of the data used in this method are both 0.5 mm, a 48 × 48 × 48 voxel cube corresponding to a 24 × 24 × 24 mm physical cube is extracted. If the cube contains labels of the target structure, the HU values of the extracted temporal bone CT cube are truncated: HU values below T_min are set to T_min and HU values above T_max are set to T_max, where T_min and T_max range between the lowest and the highest HU value. Generally, in temporal bone CT the HU value of air is -1024 and that of bone exceeds 300, so truncating the data at a minimum of -1024 and a maximum of 2000 or above fully preserves the temporal bone region for further temporal bone analysis (this method uses 2347 as the upper truncation limit) and reduces the influence of the irrelevant background on target structure segmentation. The truncated HU values are normalized to a data distribution with mean 0 and variance 1 and fed into the coarse segmentation network for training; otherwise, a cube of the same size is re-extracted until one containing labels of the anatomical structures is obtained. A minimal sketch of the truncation and normalization follows.
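The HU truncation and normalization step can be sketched as follows, assuming numpy volumes; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def truncate_and_normalize(ct_cube, t_min=-1024, t_max=2347):
    """Clip HU values to [t_min, t_max], then normalize to mean 0, variance 1."""
    clipped = np.clip(ct_cube, t_min, t_max).astype(np.float32)
    # z-score normalization before the cube is fed to the coarse network
    return (clipped - clipped.mean()) / (clipped.std() + 1e-8)
```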
The second step: the complete temporal bone CT volume is partitioned into groups of blocks using an overlapping sliding window, the blocks are fed in order into the trained coarse segmentation network for segmentation, and the segmentation results are stitched back in input order into a volume consistent with the full temporal bone CT size. In the coarse segmentation test, to balance coarse localization time against segmentation accuracy, an overlap rate of 2 is chosen, and the input complete temporal bone CT volume is divided into groups of overlapping blocks that are fed to the segmentation network for prediction, as in the sketch below.
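A sketch of the overlapped sliding-window tiling, under the assumption that an overlap rate of 2 means a stride of half the window size (names are illustrative):

```python
import itertools

def sliding_window_starts(vol_shape, window=48, overlap_rate=2):
    """Start corners of overlapping windows covering a volume.

    Assumes every dimension of vol_shape is at least `window`;
    overlap_rate=2 is read as stride = window // 2 = 24 voxels.
    """
    stride = window // overlap_rate
    axes = []
    for dim in vol_shape:
        starts = list(range(0, dim - window + 1, stride))
        if starts[-1] != dim - window:  # make sure the trailing border is covered
            starts.append(dim - window)
        axes.append(starts)
    return itertools.product(*axes)
```

Each block is segmented in turn and the predictions are written back (averaging the overlaps) to rebuild a mask of the full CT size.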
The third step: and removing outliers generated in the rough segmentation by adopting an absolute median difference method according to the structural voxel level label. Let the absolute median difference be calculated as shown in equation (1):
MAD=median(|Xi-median(X)|) (1)
where X is the set of all points predicted to be foreground points, XiThe epsilon X is the ith point in the point set X, the mean (-) is the median calculated for the point set, and the absolute median algorithm process is as follows:
(1) calculating median (X) of all predicted foreground point coordinates;
(2) calculating the absolute deviation value abs (X) of each predicted foreground point and the mediani-median(X));
(3) Calculating Median Absolute Deviation (MAD) of Absolute Deviation values in (2);
(4) and (3) dividing the value in the step (2) by the value in the step (3) to obtain a group of distances Dis of all the predicted foreground points from the center based on the MAD. The calculation formula is as shown in formula (2):
Figure BDA0002897096800000041
(5) and removing the point with the maximum Dis value larger than the threshold Th in the dimensions of x, y and z as an abnormal point. Th is the screening threshold value of the ratio Dis between the absolute deviation value and the median of the absolute deviation value. The selection can be performed according to the proportional relation between the real foreground point and the outlier. When the ratio of the current scenic spot to the outlier is small, the Th selection is large, which indicates that the position difference of the foreground scenic spot is more robust, more foreground points can be reserved, but the outlier cannot be completely removed; when the ratio of the current scenic spots to the outliers is large, Th selection is small, which means that the condition for screening the foreground points is stricter, and the correct foreground points can be deleted while the outliers are removed and most of the foreground points are reserved. The threshold value of the conventional MAD algorithm is between 1 and 10, when 5 inner ear bone cavity structures are removed and coarse positioning is carried out, outliers can interfere with the center coordinates of a target area, when the outliers are removed, the number of foreground points is large, therefore, the selection of Th values is small, 3.5 is adopted as a screening threshold value in the method, and the outliers far away from the target area can be removed.
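A minimal sketch of the MAD-based outlier removal on the predicted foreground coordinates, assuming numpy (the function name is illustrative):

```python
import numpy as np

def remove_outliers_mad(points, th=3.5):
    """Filter predicted foreground points by their MAD-based distance.

    points: (N, 3) array of (x, y, z) voxel coordinates predicted as foreground.
    A point is dropped if its distance exceeds th in any of x, y, z.
    """
    med = np.median(points, axis=0)            # step (1): per-axis median
    abs_dev = np.abs(points - med)             # step (2): absolute deviations
    mad = np.median(abs_dev, axis=0)           # step (3): median absolute deviation
    dis = abs_dev / (mad + 1e-8)               # step (4): MAD-based distance Dis
    keep = (dis <= th).all(axis=1)             # step (5): outliers exceed th
    return points[keep]
```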
2) Temporal bone key anatomical structure candidate region extraction stage:
First, the extraction size of the candidate region of the inner ear bone cavity structures is determined statistically. Based on the statistical voxel-level annotation data, the maxima and minima of all annotated voxel points of the structures in the x, y, and z dimensions are counted, and the voxel coordinate span of the anatomy is estimated from the differences between the maxima and minima. Because pixel spacing and slice spacing may differ between CT scans, extracting voxel regions of identical voxel size does not guarantee temporal bone regions of identical physical size; therefore, the actual physical bounding-box size of the bone cavity structures is computed for each CT from its pixel spacing and slice spacing. The maximum actual physical size over the structures is extended outward to 24 mm × 24 mm × 24 mm; taking a temporal bone CT with 0.5 mm slice and pixel spacing as an example, a 48 × 48 × 48 voxel cube is taken, which fully encloses the target segmentation structure while matching the input of the segmentation algorithm, and serves as the candidate region extraction size.
Second, the region of interest is extracted by combining the coarse localization center point of the anatomy to be segmented with the prior bounding-box size of that anatomy. The region of interest is described by a region center point and a three-dimensional size: centered on the center point of the key temporal bone structures predicted in the first stage, cube data is extracted by extending outward, with the three-dimensional size of the cube computed from the bounding-box statistics of the first step. The extracted sub-region serves as the candidate region for further fine segmentation, and its position is recorded so the result can be placed back later; a cropping sketch follows.
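A sketch of the candidate-region cropping, assuming resampled volumes with isotropic 0.5 mm spacing so that the 24 mm box is 48 voxels per side (names are illustrative):

```python
import numpy as np

def extract_candidate_region(ct_volume, center, size=48):
    """Crop a size^3 sub-volume centered on the coarse-localization point.

    center: (x, y, z) voxel coordinates from the coarse stage. Returns the
    crop and its start corner so the fine result can be pasted back later.
    Assumes every dimension of ct_volume is at least `size`.
    """
    half = size // 2
    start = np.array(center, dtype=int) - half
    # keep the box inside the volume while preserving its size
    start = np.clip(start, 0, np.array(ct_volume.shape) - size)
    x, y, z = start
    crop = ct_volume[x:x + size, y:y + size, z:z + size]
    return crop, start
```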
3) Temporal bone inner ear bone cavity structure fine segmentation stage:
The fine segmentation stage comprises two processes, encoding and decoding; the overall network architecture is shown in Figure 2 of the specification.
a) Encoding stage
The first step is as follows: data truncation and normalization. And sending the 48 x 48 voxel length sub-area of the temporal bone inner ear bone cavity structure candidate area extracted in the stage 2) into a precise segmentation algorithm. In order to reduce the influence of a complex irrelevant background on a segmented target, data truncation is carried out on the sub-region CT value of the CT image data according to the HU value distribution range of the temporal bone CT, and for the temporal bone CT, the sub-region CT value can be smaller than TminIs truncated to TminIs greater than TmaxIs truncated to Tmax,TminAnd TmaxThe selection is consistent with the coarse positioning value of the bone cavity structure of the inner ear of the temporales based on coarse segmentation in the first stage, and the coarse positioning value is normalized into data distribution with the mean value of 0 and the variance of 1.
The second step is that: and extracting features by adopting a dense connection network with cavity convolution. And (3) feeding the 48 x 48 complete sub-area containing the bone cavity structure of the inner ear of the temporal bone to be segmented after the data truncation and normalization in the first step into a network. 3 groups of dense connection modules are designed in the encoding stage, the modules can enhance the information transmission in the network, reduce gradient disappearance or gradient explosion, and simultaneously can repeatedly utilize extracted features to obtain rich semantic information. The module consists of three parts, namely batch normalization-modified linear unit-convolution layer, splicing and bottleneck layer, as shown in the description attached figure 3, wherein all convolution sizes of the densely connected module are 3 multiplied by 3. The batch normalization-correction linear unit-convolution layer in the dense connecting block consists of three operations of batch normalization, correction linear unit and convolution layer; splicing is to cascade the characteristic diagrams at the channel level; the bottleneck layer is used for reducing the number of characteristic graphs output by the dense connection block. In the third group of densely connected modules, a hole convolution module is adopted, and the schematic diagram of the module is shown in the description attached to fig. 4, so that the convolution output can increase the receptive field, namely the spatial information containing a large range around the anatomical structure. This helps to extract the spatial information of the tiny critical anatomical structures of the temporal bone and their surroundings in the three-dimensional data.
The third step: pooling features are fused. Following the multi-pooling feature fusion strategy of the patent (publication No. 110544264a, published: 2019, 12/6/10), batch normalization-modified activated cell-convolutional layer was used after the dense-connected module output at each level, after which Dropout layer was typically used to prevent overfitting, Dropout rate was set to 0.5, after which both 3D max pooling and 3D average pooling were used and the results after pooling were put together for a splice. The 3D max pooling may preserve edge features of the volumetric data and the 3D average pooling may preserve background information of the volumetric data. The splicing of the two can provide rich characteristic information for subsequent segmentation.
b) Decoding stage
The first step is as follows: the feature upsampling is performed using a transposed convolution and dense join module. In the decoding process, the transposition convolution with the dense connection module is adopted to decode the semantic information. And (3) carrying out up-sampling on the tensor feature data with the size of 12 × 12 × 12 in the last layer in the encoding stage by adopting twice transposition convolution, and restoring the tensor feature data to the original input size of 48 × 48 × 48. Different from the conventional method which adopts the common convolution to extract the features after the transposition convolution, after the two times of transposition convolution in the decoding stage, the dense connecting blocks replace the common convolution layer, so that the features after the transposition convolution can be more efficiently utilized, and the voxel type can be better predicted.
The second step: 3D deep supervision mechanism. Following the 3D deep supervision mechanism of patent CN110544264A (published December 6, 2019), in the encoding stage the features output by the first densely connected network block are extracted with 64 convolution kernels, passed through a 1 × 1 × 1 convolution, and followed by a softmax layer to output an auxiliary segmentation result. The second layer of the decoding stage applies convolutions to the concatenated features to extract them further; the obtained features are first upsampled by a transposed convolution to raise the resolution, then passed through a 1 × 1 × 1 convolution kernel and a softmax layer to obtain a second auxiliary segmentation result. The last layer of the decoding stage applies convolutions with different kernels to the concatenated features and outputs the prediction of the backbone network; this prediction, together with the predictions of the branch networks, guides the training of the network. During training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function comprising the Dice similarity coefficient (DSC) loss function and the cross-entropy loss function. The DSC loss function is defined in equation (3):
L_DSC = 1 - (1/n) Σ_{i=1..n} 2 x_i y_i / (x_i + y_i)  (3)
where X and Y denote the predicted voxels and the ground-truth target voxels respectively, n is the number of classes to be segmented (including background), and x_i and y_i are the numbers of voxels labeled as class i in the predicted and ground-truth voxel data. A weight W is introduced for the cross-entropy loss function, as in equation (4):
W = 1 - N_k / N_c  (4)
where N_k is the number of target voxel labels in the voxel data to be segmented and N_c is the total number of voxels in that data. The cross-entropy loss function is given in equation (5):
H = -Σ_i W y_i log(x_i)  (5)
The joint objective function built from the loss functions defined above is given in equation (6):
L_joint = L_0 + H_0 + Σ_{k=1..m} (λ_1k L_k + λ_2k H_k)  (6)
where m is the number of supervised hidden layers, and λ_1k and λ_2k are the hyperparameters of the k-th supervised hidden layer's loss function, with values in the range 0-1. Because the joint loss should be dominated by the loss of the network backbone and merely assisted by the hidden-layer losses, this invention takes m = 2 and sets λ_1k and λ_2k to 0.6 and 0.3 respectively; L_k and H_k are the DSC loss and the cross-entropy loss of the k-th supervised hidden layer. The objective built from the backbone and branch losses jointly guides network training, reduces gradient vanishing, and accelerates the convergence of the network. A sketch of this joint loss follows.
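A minimal PyTorch-style sketch of the joint deep-supervision loss. Equations (3) to (6) appear only as images in the source, so the exact formulas used here, and all names, are assumptions:

```python
import torch
import torch.nn.functional as F

def dsc_loss(probs, onehot, eps=1e-6):
    """Multi-class soft Dice loss over (B, C, D, H, W) tensors."""
    dims = (0, 2, 3, 4)
    inter = (probs * onehot).sum(dims)
    denom = probs.sum(dims) + onehot.sum(dims)
    return 1.0 - (2.0 * inter / (denom + eps)).mean()

def joint_loss(main_logits, aux_logits_list, labels, class_weights,
               n_classes, lam1=0.6, lam2=0.3):
    """Backbone DSC + weighted cross-entropy loss, plus down-weighted
    losses of the m=2 supervised hidden layers (lambda_1k=0.6, lambda_2k=0.3)."""
    onehot = F.one_hot(labels, n_classes).permute(0, 4, 1, 2, 3).float()

    def both(logits):
        l = dsc_loss(F.softmax(logits, dim=1), onehot)
        h = F.cross_entropy(logits, labels, weight=class_weights)
        return l, h

    l0, h0 = both(main_logits)
    total = l0 + h0
    for aux in aux_logits_list:         # the two auxiliary predictions
        lk, hk = both(aux)
        total = total + lam1 * lk + lam2 * hk
    return total
```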
The third step: and predicting a segmentation result with the resolution of 48 multiplied by 48 by adopting a precise segmentation algorithm, and reducing the extracted position of the candidate region recorded in the 2) stage to a corresponding position in the complete CT as a final segmentation result.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
the invention provides an automatic segmentation algorithm for a temporal bone inner ear bone cavity structure of a coarse-to-fine dense coding and decoding network. Combining the characteristic of tiny volume of the anatomical structure of the temporal bone, firstly adopting a light-weight and high-efficiency general medical image anatomical segmentation algorithm to carry out rough segmentation on the anatomical structure to be segmented, obtaining the spatial position distribution range of the anatomical structure, extracting small sub-regions by extending the center of the distribution range outwards, and reducing the candidate region of the structure for accurate segmentation. On the basis of reducing the area to be segmented, an accurate segmentation algorithm with dense connection and cavity convolution is adopted to automatically segment the bone cavity structure of the inner ear of the temporal bone more accurately. According to the method, a coarse-to-fine segmentation frame is adopted, a candidate region is determined by performing coarse segmentation on the anatomical structure to be segmented, and then accurate segmentation is performed in the candidate region, so that the problems that the anatomical structure target of the temporal bone region is small and the influence of a complex background is large are further solved, and the segmentation precision and the segmentation speed of the temporal bone inner ear bone cavity structure are improved. The invention can replace manual drawing and realize automatic segmentation of the bone cavity structure of the inner ear of the temporal bone by a computer.
The invention has the characteristics that:
1. Addressing the complexity and fineness of the inner ear bone cavity structures in temporal bone CT, the algorithm proposes a coarse-to-fine segmentation framework: a lightweight coarse segmentation method determines the candidate region for precise anatomical segmentation, and a more complex fine segmentation algorithm then segments precisely within that region;
2. In the encoding stage of the fine segmentation algorithm, a dense connection module with hole convolution is introduced to capture the complex spatial information around the anatomy to be segmented; in the decoding stage, combining transposed convolutions with dense connection blocks strengthens the reuse of decoded information. Together these improve segmentation performance on the tiny anatomical structures of the temporal bone.
Drawings
FIG. 1 is the overall framework diagram of the coarse-to-fine segmentation algorithm for the temporal bone inner ear bone cavity structures;
FIG. 2 is a diagram of the overall network architecture of the temporal bone inner ear bone cavity structure fine segmentation algorithm;
FIG. 3 is a schematic diagram of the standard dense connection module and the dense connection module with hole convolution;
FIG. 4 is a diagram illustrating standard convolution and hole convolution;
FIG. 5 shows the segmentation results of the temporal bone inner ear bone cavity structures.
The specific embodiments are as follows:
the following description of the embodiments of the present invention is provided in conjunction with the accompanying drawings:
We collected 64 manually standardized temporal bone CT datasets approved by the ethics committee of Beijing Friendship Hospital, affiliated with Capital Medical University. Patient information in all data was de-identified according to hospital requirements. Of the 64 normal-subject temporal bone CT datasets, 33 were male and 31 female, with an average age of 44 years. An experienced radiologist at the Friendship Hospital was invited to perform voxel-level annotation of the 5 inner ear bone cavity structures in the temporal bone CT. In the experiments, 56 cases were used as the training set and 8 cases as the test set.
The data preprocessing employed by the present invention includes resampling of the CT image and the labeled data.
To avoid inconsistent CT data distributions caused by the differing pixel spacing and slice spacing of CT acquisition equipment of different brands and parameters, the CT image data are resampled to a uniform pixel spacing and slice spacing of 0.5 mm with a B-spline interpolation algorithm, and the annotation data are resampled to the same uniform 0.5 mm spacing with a nearest-neighbor algorithm, as in the sketch below.
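A minimal resampling sketch using SimpleITK; the library choice and function name are assumptions, as the patent only names the interpolation schemes:

```python
import SimpleITK as sitk

def resample_to_spacing(image, spacing=(0.5, 0.5, 0.5), is_label=False):
    """Resample a CT (B-spline) or its label map (nearest neighbor)
    to uniform 0.5 mm pixel and slice spacing."""
    orig_spacing = image.GetSpacing()
    orig_size = image.GetSize()
    new_size = [int(round(sz * osp / nsp))
                for sz, osp, nsp in zip(orig_size, orig_spacing, spacing)]
    return sitk.Resample(
        image, new_size,
        sitk.Transform(),  # identity transform
        sitk.sitkNearestNeighbor if is_label else sitk.sitkBSpline,
        image.GetOrigin(), spacing, image.GetDirection(),
        0, image.GetPixelID())
```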
1) Coarse positioning stage of critical anatomical structures of temporal bones:
the first step is as follows: the 3D Unet algorithm which is light and is generally applied in the field of medical image segmentation is adopted to carry out rough segmentation on the 5 inner ear bone cavity. The segmentation test was performed using 56 training sets and 8 test sets. In the training stage, one example is randomly extracted from 56 complete cases of temporal bone CT image data and corresponding labeling data of the inner ear bone cavity structure, a cubic region with the actual physical size of 24mm is randomly extracted from the same position in the case of the CT image data and the labeling, and all the CT regions after resampling have uniform pixel spacing and layer spacing of 0.5mm, so that the voxel regions corresponding to the region are uniformly 48 multiplied by 48 cubes, if the labeled cube contains a labeling foreground of a target structure, namely the extracted labeled cube contains a label value of 1, the HU value of the extracted temporal bone CT image cube is cut off, and T is used for cutting off the HU value of the extracted temporal bone CT image cubeminIs set to-1024, TmaxAnd 2347, the influence of the irrelevant background on the target structure segmentation is reduced, the truncated HU values are normalized to be data distribution with the mean value of 0 and the variance of 1, and finally the data distribution is sent to a rough segmentation network for training, otherwise, the cubes with the same size are extracted again until the data distribution is labeled in the range. The net training setting batch size is 2, the initial learning rate is 0.0001, and the momentum factor is set to 0.5. A DSC loss function and a cross-entropy loss function with weights are used. After every 20 batchs are trained, randomly extracting a piece of data in a verification set for verification, adopting a DSC coefficient as an evaluation result, wherein a formula is shown as a formula (7), storing a model every 2000 times, guiding the loss convergence of the accurate segmentation model, verifying that the DSC index is close to saturation, and selecting the model with the highest verification result index as the optimal model for the accurate segmentation network training.
Figure BDA0002897096800000091
The second step is that: and in the coarse segmentation algorithm testing stage, the complete 8 cases of test data are input into a 3D Unet algorithm for segmentation testing in a sliding window mode with overlapping, the overlapping rate is 2, the cube size of each sliding window is 48 multiplied by 48.
The third step: for the rough segmentation result, removing abnormal outliers of the 3D Unet segmentation by adopting an absolute median difference algorithm and taking a threshold value as 2.5;
the third step: and taking the central point of the voxel-level segmentation result of the inner ear bone cavity structure as the central point for further candidate region extraction.
2) Temporal bone key anatomical structure candidate region extraction stage:
the first step is as follows: and counting the extraction size of the key anatomical candidate region of the temporal bone CT. According to voxel-level labeled data based on an inner ear bone cavity structure, in 56 cases of training data sets, the maximum values and the minimum values of all labeled voxel points of the inner ear bone cavity structure in the x, y and z dimensions are counted, the corresponding actual physical distances are counted and calculated according to the difference values between the maximum values of the x, y and z and the minimum values of the x, y and z dimensions, the actual physical distances are expanded outwards, and the requirement of a 3D Unet segmentation algorithm for three-time down-sampling is met. The actual physical distance of the candidate bounding box region for finally determining the inner ear bone cavity structure is 24 × 24 × 24, and the corresponding voxel size is 48 × 48 × 48.
The second step is that: and according to the structural center point of the inner ear bone cavity in the first stage, respectively extending the inner ear bone cavity in the positive and negative directions of x, y and z by 12mm, namely, 24 voxel distances, and taking the inner ear bone cavity as a candidate region for further accurate segmentation. .
3) Accurate segmentation stage of key anatomical structure of temporal bone:
the precise segmentation stage of the key anatomical structure of the temporal bone is mainly divided into an encoding stage and a decoding stage.
a) Encoding stage
The first step is as follows: sending the 48 multiplied by 48 sub-area of the inner ear bone cavity area extracted in the 2) stage into a precise segmentation algorithm. And (3) truncating a value of the CT value of the sub-region of the CT image data, which is smaller than-1024, to-1024, and truncating a value of the CT value, which is larger than 2347, to 2347, and normalizing the values into data distribution with the mean value of 0 and the variance of 1.
The second step is that: and the dense connection module is combined with the cavity convolution to extract the characteristics. The inner ear bone cavity structure has small volume, in order to more fully extract the structure to be segmented, dense connecting blocks are adopted for feature extraction, and each layer of dense connecting module consists of three parts, namely a group of batch normalization-correction linear units-convolution layer, splicing layer and bottleneck layer. In the encoding phase, we use 3 layers of densely connected modules. In order to obtain a larger receptive field through convolution output and extract a larger range of spatial information, the convolution layer of the 3 rd densely connected module is replaced by a cavity convolution with the expansion rate of 2.
The third step: pooling features are fused. The multi-pooling feature fusion strategy is used, batch normalization-correction activation unit-convolution layer is adopted after the dense connection module of each level outputs, in order to prevent overfitting in the training process, a dropout layer with a discarding rate of 0.5 is usually adopted, 3D maximum pooling and 3D average pooling are simultaneously adopted after the dropout layer, and results after pooling are spliced. The 3D max pooling may preserve edge features of the volumetric data and the 3D average pooling may preserve background information of the volumetric data. The splicing of the two can provide rich characteristic information for subsequent segmentation.
b) Decoding stage
i. Feature upsampling with transposed convolution and dense connection modules.
The first step is as follows: the characteristics of the output of the first, second and third densely connected network blocks in the encoding stage are respectively F1,F2,F3The resolutions thereof were 48X 48, 24X 2412X 12, respectively. To F3Adopting transposition convolution operation, the convolution step length is 2, the boundary filling adopts 0 to fill, and outputting characteristic T after the first group of transposition convolution2The size is 24 × 24 × 24;
the second step is that: outputting F from the second densely-connected block in the encoding stage2And T2Channel splicing is carried out to form a new characteristic group C2Set of characteristics C2Extracting features through the dense connecting blocks 4 in the decoding stage to obtain features D2
The third step: output F of the first densely packed block of the encoding stage1Passing through a 3D convolution layer to obtain 64 features M1By using a transposed convolution operation, D2Up-sampling is carried out until the sampling rate is 48 multiplied by 48, and the output characteristic is recorded as T1Will feature F1、M1、T1Performing characteristic splicing to obtain a characteristic group C1Wherein M is1And F1The splicing of the method is respectively spliced in the form of short connection and long connection, and the semantic gap between the low-level space characteristic and the high-level semantic characteristic is reduced. Will be characteristic group C1Inputting the dense connection block 5 in the decoding stage, increasing the reuse of the features and obtaining the features D1Finally, by a bottleneck convolution of 1 × 1 × 1, pair D1And (5) performing channel screening as final output.
ii. 3D deep supervision mechanism.
The first step is as follows: in the decoding stage, for feature D2Performing transposed convolution operation, up-sampling feature data to 48 × 48 × 48, performing bottleneck convolution of 1 × 1 × 1, outputting 2 feature cubes, calculating the probability that each voxel is a target anatomical structure according to softmax, and recording as auxiliary prediction aux _ pred 1;
the second step is that: encoding stage feature set M1Calculating the probability that each voxel is the target anatomical structure according to softmax by using 1 × 1 × 1 convolution, and recording as auxiliary prediction aux _ pred 2;
the third step: for feature group D1Inputting 1 × 1 × 1 bottleneck convolution for channel screening, calculating classification probability of each voxel according to softmax, and recording as trunk prediction main _ pred;
the fourth step: joint loss guides network training. And respectively comparing the network prediction results aux _ pred1, aux _ pred2 and main _ pred with a manually marked gold standard GT, calculating cross entropy loss and DSC loss, and forming joint loss guide network training by the loss obtained by the auxiliary prediction result and the loss obtained by the main prediction result.
The training and testing process of the fine segmentation network is as follows:
a) Model training
In the model training stage, the 48 × 48 × 48 sub-volumes extracted in the previous stage are input to the segmentation algorithm for training. Training uses batch size 2, initial learning rate 0.0001, and momentum coefficient 0.5. After every 20 batches, one piece of data is drawn at random from the validation set for validation, with the DSC coefficient as the evaluation measure; the model is saved every 2000 iterations until the loss of the fine segmentation model converges and the validation DSC index approaches saturation, and the model with the highest validation index is selected as the optimal model of the fine segmentation network.
b) Model testing
In the model testing stage, fine segmentation prediction is performed on the 48 × 48 × 48 sub-region extracted in the second stage; after prediction, the fine segmentation result is restored to the corresponding position according to the position recorded when the sub-region was extracted, giving the final segmentation result of the temporal bone CT image.
Subjective segmentation results of the algorithm on the inner ear bone cavity structures are shown in Figure 5.

Claims (1)

1. A method for automatic segmentation of the temporal bone inner ear bone cavity structures based on a coarse-to-fine dense network, characterized by comprising 3 stages overall: a coarse localization stage of the inner ear bone cavity structures based on coarse segmentation, candidate region extraction for the inner ear bone cavity structures, and fine segmentation of the inner ear bone cavity structures;
the method specifically comprises the following steps:
1) coarse localization stage of the temporal bone inner ear bone cavity structures based on coarse segmentation:
the first step: coarsely segmenting the temporal bone inner ear bone cavity structures; in the coarse segmentation model training stage, a 48 × 48 × 48 cube is randomly extracted at the same position from a complete temporal bone CT volume and its annotation file; if the cube contains labels of the target structure, the HU values of the extracted temporal bone CT cube are truncated, HU values below T_min being set to T_min and HU values above T_max being set to T_max, where T_min and T_max range between the lowest and the highest HU value; in temporal bone CT, the HU value of air is -1024 and that of bone exceeds 300; truncating the data at a minimum of -1024 and a maximum of 2000 or above fully preserves the temporal bone region for further temporal bone analysis; the truncated HU values are normalized to a data distribution with mean 0 and variance 1 and fed into the coarse segmentation network for training; otherwise, a cube of the same size is re-extracted until one containing labels of the anatomical structures is obtained;
the second step: partitioning the complete temporal bone CT volume into groups of blocks using an overlapping sliding window, feeding the blocks in order into the trained coarse segmentation network for segmentation, and stitching the segmentation results back in input order into a volume consistent with the full temporal bone CT size; an overlap rate of 2 is chosen, and the input complete temporal bone CT volume is divided into groups of overlapping blocks input to the segmentation network for prediction;
the third step: removing, based on the voxel-level labels of each structure, the outliers produced by the coarse segmentation with the median absolute deviation method; the median absolute deviation is computed as in equation (1):
MAD = median(|X_i - median(X)|)  (1)
where X is the set of all points predicted as foreground, X_i ∈ X is the i-th point of X, and median(·) is the median of a point set; the MAD procedure is as follows:
(1) compute the median, median(X), of the coordinates of all predicted foreground points;
(2) compute each predicted foreground point's absolute deviation from the median, abs(X_i - median(X));
(3) compute the median MAD of the absolute deviations from (2);
(4) divide the values from (2) by the value from (3) to obtain, for all predicted foreground points, a MAD-based distance Dis from the center, as in equation (2):
Dis_i = abs(X_i - median(X)) / MAD  (2)
(5) remove as outliers the points whose largest Dis value across the x, y, and z dimensions exceeds the threshold Th; 3.5 is used as the screening threshold to remove outliers far from the target region;
2) temporal bone key anatomical structure candidate region extraction stage:
first, determining the extraction size of the candidate region of the inner ear bone cavity structures: based on the statistical voxel-level annotation data, the maxima and minima of all annotated voxel points of the structures in the x, y, and z dimensions are counted and the voxel coordinate span of the anatomy is estimated from the differences between the maxima and minima; the actual physical bounding-box size of the bone cavity structures is computed for each CT from its pixel spacing and slice spacing, the maximum actual physical size of each structure is extended outward to 24 mm × 24 mm × 24 mm, and for a temporal bone CT with 0.5 mm slice and pixel spacing a 48 × 48 × 48 cube is taken, which fully encloses the target segmentation structure while matching the input of the segmentation algorithm, and serves as the candidate region extraction size;
second, extracting the region of interest by combining the coarse localization center point of the anatomy to be segmented with the prior bounding-box size of that anatomy: the region of interest is described by a region center point and a three-dimensional size, the cube data being extracted by extending outward from the center point of the key temporal bone structures predicted in the first stage, with the three-dimensional size of the cube computed from the bounding-box statistics of the first step; the extracted sub-region serves as the candidate region for further fine segmentation and its position is recorded;
3) temporal bone inner ear bone cavity structure fine segmentation stage:
the fine segmentation stage comprises two processes, encoding and decoding;
a) encoding stage
the first step: data truncation and normalization; the 48 × 48 × 48 voxel sub-region of the candidate area extracted in stage 2) is fed into the fine segmentation algorithm; the CT values of the sub-region are truncated according to the HU value distribution of temporal bone CT, values below T_min being truncated to T_min and values above T_max to T_max, with T_min and T_max chosen consistently with the coarse localization stage; the data are then normalized to a distribution with mean 0 and variance 1;
the second step: extracting features with a densely connected network containing hole convolutions; the 48 × 48 × 48 sub-region containing the inner ear bone cavity structures to be segmented, truncated and normalized in the first step, is fed into the network; 3 groups of dense connection modules are designed in the encoding stage; each module consists of three parts, batch normalization-rectified linear unit-convolution layers, concatenation, and a bottleneck layer, and all convolutions of the dense connection module are 3 × 3 × 3; the batch normalization-rectified linear unit-convolution layer consists of the operations batch normalization, rectified linear unit, and convolution; concatenation cascades the feature maps at the channel level; the bottleneck layer reduces the number of feature maps output by the dense connection blocks; the third group of dense connection modules uses a hole convolution module;
the third step: multi-pooling feature fusion; a batch normalization-rectified linear unit-convolution layer is applied after the output of the dense connection module at each level; to prevent overfitting, a Dropout layer with rate 0.5 follows, after which 3D max pooling and 3D average pooling are applied simultaneously and the pooled results are concatenated;
b) decoding stage
the first step: feature upsampling with transposed convolution and dense connection modules; during decoding, transposed convolutions with dense connection modules decode the semantic information; the 12 × 12 × 12 tensor features of the last encoding layer are upsampled by two transposed convolutions back to the original input size of 48 × 48 × 48; after the two transposed convolutions of the decoding stage, dense connection blocks replace ordinary convolution layers;
the second step: 3D deep supervision mechanism; in the encoding stage, the features output by the first densely connected network block are extracted with 64 convolution kernels, passed through a 1 × 1 × 1 convolution, and followed by a softmax layer to output an auxiliary segmentation result; the second layer of the decoding stage applies convolutions to the concatenated features to extract them further, first upsampling the obtained features with a transposed convolution to raise the resolution, then applying a 1 × 1 × 1 convolution kernel and a softmax layer to obtain a second auxiliary segmentation result; the last layer of the decoding stage applies convolutions with different kernels to the concatenated features and outputs the prediction of the backbone network; during network training, the loss function of the backbone network and the loss functions of the branch networks together form a joint objective function comprising the Dice similarity coefficient (DSC) loss function and the cross-entropy loss function; the DSC loss function is defined in equation (3):
L_DSC = 1 - (1/n) Σ_{i=1..n} 2 x_i y_i / (x_i + y_i)  (3)
where X and Y denote the predicted voxels and the ground-truth target voxels respectively, n is the number of classes to be segmented, and x_i and y_i are the numbers of voxels labeled as class i in the predicted and ground-truth voxel data; a weight W is introduced for the cross-entropy loss function, as in equation (4):
W = 1 - N_k / N_c  (4)
where N_k is the number of target voxel labels in the voxel data to be segmented and N_c is the total number of voxels in that data; the cross-entropy loss function is given in equation (5):
H = -Σ_i W y_i log(x_i)  (5)
the joint objective function built from the loss functions defined above is given in equation (6):
L_joint = L_0 + H_0 + Σ_{k=1..m} (λ_1k L_k + λ_2k H_k)  (6)
where m is the number of supervised hidden layers, λ_1k and λ_2k are the hyperparameters of the k-th supervised hidden layer's loss function, m is 2, and λ_1k and λ_2k take the values 0.6 and 0.3 respectively; L_k and H_k are the DSC loss and the cross-entropy loss of the k-th supervised hidden layer; the objective built from the backbone and branch losses jointly guides network training, reduces gradient vanishing, and accelerates the convergence of the network;
the third step: the fine segmentation algorithm predicts a segmentation result at 48 × 48 × 48 resolution, which is restored to the corresponding position in the complete CT according to the candidate-region position recorded in stage 2), as the final segmentation result.
CN202110045206.6A (filed 2021-01-14; priority 2021-01-14). Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network. Status: Pending. Publication: CN112634293A.

Priority Applications (1)

CN202110045206.6A (priority date 2021-01-14; filing date 2021-01-14): Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network

Applications Claiming Priority (1)

CN202110045206.6A (priority date 2021-01-14; filing date 2021-01-14): Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network

Publications (1)

CN112634293A, published 2021-04-09

Family

ID=75294120

Family Applications (1)

CN202110045206.6A (priority date 2021-01-14; filing date 2021-01-14): Temporal bone inner ear bone cavity structure automatic segmentation method based on coarse-to-fine dense coding and decoding network

Country status (1): CN (CN112634293A)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110544264A (en) * 2019-08-28 2019-12-06 北京工业大学 Temporal bone key anatomical structure small target segmentation method based on 3D deep supervision mechanism
CN111192245A (en) * 2019-12-26 2020-05-22 河南工业大学 Brain tumor segmentation network and method based on U-Net network
CN112116605A (en) * 2020-09-29 2020-12-22 西北工业大学深圳研究院 Pancreas CT image segmentation method based on integrated depth convolution neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850760A (en) * 2021-08-27 2021-12-28 北京工业大学 Vestibule detection method based on ear CT (computed tomography) image
CN113850760B (en) * 2021-08-27 2024-05-28 北京工业大学 Ear CT image vestibule detection method
CN117455935A (en) * 2023-12-22 2024-01-26 中国人民解放军总医院第一医学中心 Abdominal CT (computed tomography) -based medical image fusion and organ segmentation method and system
CN117455935B (en) * 2023-12-22 2024-03-19 中国人民解放军总医院第一医学中心 Abdominal CT (computed tomography) -based medical image fusion and organ segmentation method and system


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination