CN112967300A - Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network - Google Patents

Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network

Info

Publication number
CN112967300A
CN112967300A (application CN202110202637.9A)
Authority
CN
China
Prior art keywords
convolution
layer
network
map
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110202637.9A
Other languages
Chinese (zh)
Inventor
杨峰 (Yang Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ariemedi Medical Technology Beijing Co ltd
Original Assignee
Ariemedi Medical Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ariemedi Medical Technology Beijing Co ltd filed Critical Ariemedi Medical Technology Beijing Co ltd
Priority to CN202110202637.9A priority Critical patent/CN112967300A/en
Publication of CN112967300A publication Critical patent/CN112967300A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/181Segmentation; Edge detection involving edge growing; involving edge linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10132Ultrasound image
    • G06T2207/101363D ultrasound image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Ultrasonic Diagnosis Equipment (AREA)

Abstract

The three-dimensional ultrasonic thyroid segmentation method and device based on a multi-scale fusion network can accurately segment the edges of heterogeneous organ structures, avoid the final segmentation result depending on the intermediate result of distance map prediction, exploit multi-level semantic information while keeping the computational cost low, and can accurately segment the three-dimensional thyroid with very good results. The method comprises the following steps: (1) training the network to learn to predict a boundary distance map, so that the edges of heterogeneous organ structures can be segmented; (2) adding constraint-guided training to the network through deep supervision, so that the final segmentation result does not depend on the intermediate result of the distance map prediction; (3) using a CBAM attention module at multiple scales to focus on edge distance information; (4) fusing the probability maps of all levels with a dilated-convolution dense fusion module, and progressively refining the output probability maps to generate the final result of each level.

Description

Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network
Technical Field
The invention relates to the technical field of medical image processing, in particular to a three-dimensional ultrasonic thyroid gland segmentation method based on a multi-scale fusion network and a three-dimensional ultrasonic thyroid gland segmentation device based on the multi-scale fusion network.
Background
Ultrasound imaging relies on the energy of echoes traveling through tissue, which is partially attenuated, absorbed, or reflected depending on the tissue's properties. The returned signals are collected by an ultrasound probe and rendered into an image. These imaging characteristics produce random speckle noise, low contrast, and blurred organ edges. The anisotropy of tissue in ultrasound images and the echogenic differences between adjacent tissues can give the same tissue several different appearances. In addition, ultrasound imaging is often used for real-time clinical detection, and three-dimensional ultrasound volume data are typically acquired by sliding a handheld two-dimensional ultrasound probe and reconstructing the generated data, which introduces inter-frame displacement and blurring. These factors make organ segmentation from three-dimensional ultrasound a challenging problem.
There are currently many two-dimensional ultrasound segmentation methods. Some are based on region-driven active contour models and require an initialized mask. Some use fuzzy C-means clustering, histogram clustering, region growing, random walk, and the like. Some use edge-free active contours, region-based active contours, or distance-regularized level sets. Some use classical algorithms such as thresholding, region splitting and merging, watershed, and graph cut. There are multi-organ segmentation methods that use information based on speckle-correlated pixels and image artifacts. These methods rely on hand-designed features and are mostly semi-automatic.
Many methods have also been developed for segmenting three-dimensional thyroid images. Some use a 3D spring deformation model to segment the thyroid cartilage, but on CT images. Some use a geodesic active contour model to obtain the thyroid contour in a three-dimensional image, but remain semi-automatic. Some use radial basis functions for direct segmentation of ultrasound data blocks. Some compare level sets, graph cuts, and pixel classifiers, showing the effectiveness of learning-based methods, or segment the volume data frame by frame, or attempt direct thyroid segmentation with 3D UNet. Some adopt SUMNet, based on SegNet, for frame-by-frame three-dimensional thyroid segmentation.
Recently, fully convolutional neural networks have increasingly been applied to ultrasound image segmentation, such as thyroid segmentation with a feedforward neural network. One proposal, IVUS-Net, uses a multi-path multi-scale convolution-kernel segmentation network for two-dimensional intravascular ultrasound segmentation. Some Deeplab-based networks with deep supervision on boundary distance maps are used for two-dimensional ultrasound kidney segmentation. Some deeply supervised methods with multi-channel fusion improve the accuracy of the ultrasound prostate segmentation boundary. These methods are mainly used for two-dimensional segmentation, while three-dimensional segmentation networks are often applied to images acquired directly by a three-dimensional ultrasound probe.
For three-dimensional ultrasound volume data reconstructed from two-dimensional acquisition, the prior art has attempted both frame-by-frame segmentation and direct three-dimensional segmentation. Some note that three-dimensional segmentation networks are difficult to train and that two-dimensional networks focus better on organ edges, and experiments likewise show that segmenting with a two-dimensional network and then reconstructing works better than a three-dimensional segmentation network. However, the earlier methods have only been applied to healthy thyroid data; nodules in clinical thyroid images often contrast strongly with the gland and are difficult to segment as a whole, and when nodules lie at the organ edge they affect the edge segmentation.
There are currently many ways to improve edge segmentation accuracy. Some encoder-decoder structures adopt a mesh-like path to better exploit multi-level features, but in practice training is prone to instability. Some attention modules with two branches attend to the effective information of position and channel simultaneously, but the computation becomes very large on higher-resolution feature maps. Some use high-level features in adjacent levels to guide feature fusion, but do not exploit information at more scales. Some adopt a multi-scale channel-based attention module that does not use positional feature information effectively: weights are learned only for the channels at each position of the feature map, not for the positions themselves.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a three-dimensional ultrasonic thyroid segmentation method based on a multi-scale fusion network, which can accurately segment the edges of heterogeneous organ structures, avoid the final segmentation result depending on the intermediate result of distance map prediction, exploit multi-level semantic information while keeping the computational cost low, and can accurately segment the three-dimensional thyroid with very good results.
The technical scheme of the invention is as follows: the three-dimensional ultrasonic thyroid gland segmentation method based on the multi-scale fusion network comprises the following steps:
(1) training the network to learn to predict a boundary distance map, so that the edges of heterogeneous organ structures can be segmented;
(2) adding constraint-guided training to the network through deep supervision, so that the final segmentation result does not depend on the intermediate result of the distance map prediction;
(3) using a CBAM attention module at multiple scales to focus on edge distance information;
(4) fusing the probability maps of all levels with a dilated-convolution dense fusion module, and progressively refining the output probability maps to generate the final result of each level.
The invention enables the network to learn to predict the boundary distance map, so that the edges of heterogeneous organ structures can be segmented accurately. Unlike a cascaded design, the method adds constraint-guided training through deep supervision, so that the final segmentation result does not depend on the intermediate result of the distance map prediction. In addition, to exploit multi-level semantic information while keeping the computational cost low, the CBAM attention module is used at multiple scales to better focus on edge distance information. Finally, a dilated-convolution dense fusion module fuses the probability maps of all levels and progressively refines the output probability maps to generate the final result of each level. The method can therefore accurately segment the edges of heterogeneous organ structures, avoid dependence of the final result on the intermediate distance-map prediction, exploit multi-level semantic information at small computational cost, and accurately segment the three-dimensional thyroid with good results.
Also provided is a three-dimensional ultrasonic thyroid segmentation device based on a multi-scale fusion network, which comprises:
a segmentation module configured to let the network learn to predict a boundary distance map in order to segment edges of the heterogeneous organ structure;
a boundary distance map module configured to add constraint guidance training to the network in a deep supervised manner to avoid that the final segmentation result depends on intermediate results of distance map prediction;
an attention module configured to focus on edge distance information using the CBAM attention module at a multi-scale;
and a dilated dense fusion module configured to fuse the probability maps of all levels using the dilated-convolution dense fusion module, progressively refine the output probability maps, and generate the final result of each level.
Drawings
Fig. 1 is an example of generating a normalized distance map from a thyroid segmentation mask, in which (a) is the original image, (b) the thyroid mask, and (c) the normalized distance map.
Fig. 2 is a flowchart of a multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to the present invention.
Detailed Description
As shown in fig. 2, the method for three-dimensional ultrasonic thyroid segmentation based on multi-scale fusion network includes the following steps:
(1) training the network to learn to predict a boundary distance map, so that the edges of heterogeneous organ structures can be segmented;
(2) adding constraint-guided training to the network through deep supervision, so that the final segmentation result does not depend on the intermediate result of the distance map prediction;
(3) using a CBAM attention module at multiple scales to focus on edge distance information;
(4) fusing the probability maps of all levels with a dilated-convolution dense fusion module, and progressively refining the output probability maps to generate the final result of each level.
The invention enables the network to learn to predict the boundary distance map, so that the edges of heterogeneous organ structures can be segmented accurately. Unlike a cascaded design, the method adds constraint-guided training through deep supervision, so that the final segmentation result does not depend on the intermediate result of the distance map prediction. In addition, to exploit multi-level semantic information while keeping the computational cost low, the CBAM attention module is used at multiple scales to better focus on edge distance information. Finally, a dilated-convolution dense fusion module fuses the probability maps of all levels and progressively refines the output probability maps to generate the final result of each level. The method can therefore accurately segment the edges of heterogeneous organ structures, avoid dependence of the final result on the intermediate distance-map prediction, exploit multi-level semantic information at small computational cost, and accurately segment the three-dimensional thyroid with good results.
Preferably, in the step (1), the same number of convolution blocks is used in each stage of the encoder, and the max-pooling layers are replaced by strided convolutions with stride 2; each convolution layer is built with a GN-PReLU-Conv sequence, placing the normalization layer in front; all convolution layers use grouped convolution with 4 groups; dilated convolution with dilation rate 2 is used in the third and fourth stages of the encoder to enlarge the receptive field and obtain more instructive semantic information.
Preferably, in the step (2), the feature maps of subsequent levels are unified to the feature-map size of the first stage by bilinear interpolation, and the edge distance map computed from the mask is used as a constraint for deep supervision.
Preferably, in the step (2), given a thyroid mask, its edge mask is obtained, the distance Di from each pixel Pi to the thyroid edge is computed, and the normalized distance map d is obtained by formula (1),
d(Pi)=exp(-λDi) (1)
wherein Di=min_{bj∈b} dist(Pi,bj) is the distance from pixel Pi to the nearest pixel of the boundary set b={bj}, j∈J, and λ is a parameter controlling the normalization effect.
the normalized distance map deep supervised loss function is of the form:
Figure BDA0002948414330000051
wherein
Figure BDA0002948414330000061
A distance map predicted for the network.
Preferably, in the step (2), λ is set to 0.01.
Preferably, in the step (3), the resolution of each single-level feature map slf of the encoder is unified by bilinear interpolation, taking the layer-2 resolution as reference; after concatenation, a multi-level convolution feature mlf is obtained by a convolution operation; slf and mlf are then concatenated and fed into an attention module, where multi-scale features guide the learning of fusion weights; the weights are applied to mlf to extract the effective information needed to refine the level's slf, which is combined with slf to obtain the output feature map.
Preferably, in the step (3), the channel attention weight of the attention module is
Mc(F)=σ(W1W0(AvgPool(F))+W1W0(MaxPool(F))) (3)
wherein σ denotes the sigmoid function, AvgPool and MaxPool are average pooling and max pooling operations respectively, and W0∈R^(C/r×C), W1∈R^(C×C/r) are the two convolution operations for channel compression and recovery, each followed by a PReLU activation function;
the spatial attention weight is
MS(F)=σ(f^(7×7)([AvgPool(F);MaxPool(F)])) (4)
wherein σ denotes the sigmoid function, AvgPool and MaxPool are average pooling and max pooling operations respectively, and f^(7×7) denotes a convolution operation with a 7×7 kernel;
the channel and space attention modules are connected in series and constitute a residual block, the convolution in the block adopts packet convolution, the normalization layer adopts a GN layer, and the activation function adopts a PReLU.
Preferably, in the step (4), the feature maps of the several levels are successively input and fused in a densely connected manner; the fourth-level feature map carrying high-level semantic information is input first, its convolution result is concatenated with the other level feature maps one by one, and the convolution operation is iterated; as the receptive field keeps expanding and new level information is integrated, the prediction result is gradually optimized. The receptive field is computed as
Rcur=Rpre+Spre×(Kcur-1)×rate (5)
wherein Rcur is the receptive field of the current layer, Rpre is the receptive field of the previous layer, Spre is the stride of the previous layer, Kcur is the convolution kernel size of the current layer, and rate is the current dilation rate. For a convolution with kernel size 3 and dilation rate 2, assuming the previous layer has stride 1, the receptive field after n such rate-2 dilated convolutions is Rn=Rpre+4×n. Because of the way mlf is combined with each slf, the receptive fields at each level are virtually identical; after passing through the attention module with global pooling, the receptive field quickly grows to 240 and is therefore already relatively large. In the subsequent DFM, convolutions with dilation rate 2 moderately enlarge the receptive fields without widening the receptive-field gap between levels, refining the output result.
Preferably, the loss function of the method is
LDice=1-(2Σ(pred·target)+ε)/(Σpred+Σtarget+ε)
LBCE=-Σ[target·log(pred)+(1-target)·log(1-pred)]
Lhybrid=Ldistance+λ1LDice+λ2LBCE (6)
wherein LDice is a loss function evaluating the overlap of two regions, LBCE is the binary cross entropy, Lhybrid is the hybrid loss function, pred is the pixel prediction probability map, target is the ground truth, ε is a smoothing term set to 1e-8 in experiments, and λ1, λ2 are the loss weighting coefficients, experimentally set to λ1=0.5, λ2=0.5.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the methods of the above embodiments; the storage medium may be ROM/RAM, magnetic disk, optical disk, memory card, and the like. Therefore, corresponding to the method of the invention, the invention also comprises a three-dimensional ultrasonic thyroid segmentation device based on the multi-scale fusion network, generally expressed as functional modules corresponding to the steps of the method. The device includes:
a segmentation module configured to let the network learn to predict a boundary distance map in order to segment edges of the heterogeneous organ structure;
a boundary distance map module configured to add constraint guidance training to the network in a deep supervised manner to avoid that the final segmentation result depends on intermediate results of distance map prediction;
an attention module configured to focus on edge distance information using the CBAM attention module at a multi-scale;
and a dilated dense fusion module configured to fuse the probability maps of all levels using the dilated-convolution dense fusion module, progressively refine the output probability maps, and generate the final result of each level.
The present invention is described in more detail below.
Following the VNet design, the same number of convolution blocks is used in each stage of the encoder, and the max-pooling layers are replaced by strided convolutions with stride 2, so that the down-sampling process has learnable parameters and is better suited to semantic segmentation. Each convolution layer is built with a GN-PReLU-Conv sequence; placing the normalization layer in front lets the network train faster and more effectively. GN performs better than BN and is stable over a wider range of batch sizes. All convolution layers use grouped convolution with 4 groups, which reduces long-range coupling between channels, reduces network parameters, and improves parameter utilization. Dilated convolution with dilation rate 2 is used in the third and fourth stages of the encoder to enlarge the receptive field and obtain more instructive semantic information.
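As a rough illustration of the parameter saving from grouped convolution, the sketch below counts the weights of a 2-D convolution layer (bias terms ignored; the 64-channel layer sizes are made up for illustration, not taken from the patent):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k 2-D convolution: each of the `groups` groups
    maps c_in/groups input channels to c_out/groups output channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

dense = conv_params(64, 64, 3)              # ordinary convolution: 36864 weights
grouped = conv_params(64, 64, 3, groups=4)  # 4 groups, as in the encoder: 9216
```

With 4 groups the weight count drops by exactly the group factor, which is the parameter-utilization argument made above.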
So that the network attends to the organ edge information earlier, the feature maps of subsequent levels are unified to the feature-map size of the first stage by bilinear interpolation, and the edge distance map computed from the mask is used as a constraint for deep supervision. Unlike methods that use the prediction directly as an intermediate result, so that the final result is affected by it, this method adopts deep supervision, combining the learning of edge features with backward propagation of the features.
Given a thyroid mask, its edge mask is obtained, and the distance Di from each pixel Pi to the thyroid edge is computed; through
d(Pi)=exp(-λDi)
the normalized distance map d is obtained, wherein Di=min_{bj∈b} dist(Pi,bj) is the distance from pixel Pi to the nearest pixel of the boundary set b={bj}, j∈J, and λ is a parameter controlling the normalization effect, experimentally set to 0.01. Under these parameters, the distance map obtained from the thyroid mask is shown in figure 1.
The deep-supervision loss function on the normalized distance map is of the form:
Ldistance=Σi(d(Pi)-d̂(Pi))²
wherein d̂ is the distance map predicted by the network. By learning to predict the distance map, the network focuses more on learning the edge information of the thyroid.
The feature maps of the network levels have different resolutions, and after deep supervision with the edge distance map, the features contain different levels of semantic information about the organ edges. Inspired by related research, in order to combine the semantic information of each level more effectively, the resolution of each encoder level's feature map slf (single-layer feature map) is unified by bilinear interpolation, taking the layer-2 resolution as reference; after concatenation, a multi-level convolution feature mlf (multi-layer feature map) is obtained by a convolution operation. Then slf and mlf are concatenated and fed into the attention module, fusion weights are learned under multi-scale feature guidance, the weights are applied to mlf to obtain the effective information needed to refine the level's slf, and this is combined with slf to obtain the output feature map.
Effectively combining multi-level feature maps requires an attention mechanism that learns how information is combined across channels and across spatial positions. In order to effectively use the organ-edge features produced under deep supervision, the attention mechanism must be able to learn important spatial position information. Considering the computational cost at the current resolution and the good performance of serially connected channel and spatial attention, the CBAM module is adopted, which takes both channel and spatial weight learning into account at low computational cost.
The channel attention weight of the attention module is Mc(F)=σ(W1W0(AvgPool(F))+W1W0(MaxPool(F))), wherein σ denotes the sigmoid function, AvgPool and MaxPool are average pooling and max pooling operations respectively, and W0∈R^(C/r×C), W1∈R^(C×C/r) are the two convolution operations for channel compression and recovery, each followed by a PReLU activation function.
The spatial attention weight is MS(F)=σ(f^(7×7)([AvgPool(F);MaxPool(F)])), wherein σ denotes the sigmoid function, AvgPool and MaxPool are average pooling and max pooling operations respectively, and f^(7×7) denotes a convolution operation with a 7×7 kernel.
The channel and spatial attention modules are connected in series and form a residual block; the convolutions in the block use grouped convolution, the normalization layer is a GN layer, and the activation function is a PReLU.
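A minimal NumPy sketch of the channel and spatial attention computations above, assuming (as an illustration, not the patent's exact layer layout) random weights, a PReLU between the compression and recovery convolutions, and a residual connection around the serial channel-then-spatial pair; tensors are (C, H, W) and the 7×7 spatial convolution uses 'same' padding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def channel_attention(F, W0, W1):
    """M_c(F) = sigmoid(W1 W0 AvgPool(F) + W1 W0 MaxPool(F))  -- formula (3)."""
    avg = F.mean(axis=(1, 2))                       # global average pool, (C,)
    mx = F.max(axis=(1, 2))                         # global max pool, (C,)
    return sigmoid(W1 @ prelu(W0 @ avg) + W1 @ prelu(W0 @ mx))

def spatial_attention(F, W7):
    """M_s(F) = sigmoid(conv7x7([avg_c(F); max_c(F)]))  -- formula (4)."""
    x = np.stack([F.mean(axis=0), F.max(axis=0)])   # (2, H, W) pooled over channels
    pad = np.pad(x, ((0, 0), (3, 3), (3, 3)))
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):                               # naive 7x7 convolution
        for j in range(W):
            out[i, j] = (W7 * pad[:, i:i + 7, j:j + 7]).sum()
    return sigmoid(out)

def cbam(F, W0, W1, W7):
    """Serial channel-then-spatial attention wrapped in a residual connection."""
    F1 = F * channel_attention(F, W0, W1)[:, None, None]
    F2 = F1 * spatial_attention(F1, W7)[None]
    return F + F2

rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4                              # toy sizes, r = reduction ratio
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C))                # compression, R^(C/r x C)
W1 = rng.standard_normal((C, C // r))                # recovery, R^(C x C/r)
W7 = rng.standard_normal((2, 7, 7))
out = cbam(F, W0, W1, W7)
```

The 1×1 channel convolutions reduce to matrix products on the pooled vectors, which is why the weight shapes match the R^(C/r×C) and R^(C×C/r) dimensions stated above.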
Some methods borrow the idea of ensemble learning by averaging the output of each level as the final result, which helps produce more accurate segmentations, but the per-level results are not further refined or combined. The idea here is to optimize the results at multiple scales in a learnable process. Inspired by DenseASPP, the invention passes the feature maps of each level through a dilated dense fusion module and takes the average of the per-level probability maps as the final output.
The excellent semantic-segmentation performance of Deeplabv3 benefits from the parallel dilated convolutions at several rates in ASPP, which capture rich long-range semantic information. DenseASPP obtains a wider and denser multi-scale receptive field by applying the dense-connection concept of DenseNet with progressively increasing dilation rates. The invention uses this idea for result refinement: instead of processing a single feature map, the feature maps of several levels are successively input and fused in a densely connected manner. The fourth-level feature map carrying high-level semantic information is input first, its convolution result is concatenated with the other level feature maps in turn, and the convolution operation is iterated. As the receptive field keeps expanding and new level information is integrated, the prediction result is gradually optimized. Compared with the parallel arrangement of ASPP, this denser arrangement continuously enlarges the receptive field and gradually combines features with different receptive fields.
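The iterate-concatenate-convolve pattern can be sketched as below. This is only a shape-level illustration with random weights: 1×1 pointwise convolutions stand in for the module's actual dilated grouped convolutions, and the channel counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1x1(x, w):
    """Pointwise convolution: (C_in, H, W) with weights (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def dense_fusion(level_maps, c_out=4):
    """Start from the deepest (high-level-semantics) map, then repeatedly
    concatenate the next level's map and convolve -- the dense-connection
    fusion pattern described in the text."""
    x = level_maps[0]
    for f in level_maps[1:]:
        cat = np.concatenate([x, f], axis=0)        # splice result with next level
        w = rng.standard_normal((c_out, cat.shape[0]))
        x = conv1x1(cat, w)                          # iterative convolution step
    return x

maps = [rng.standard_normal((4, 8, 8)) for _ in range(4)]  # levels 4, 3, 2, 1
fused = dense_fusion(maps)
```

Each iteration folds one more level into the running result, which is what lets the receptive field grow while new level information is integrated.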
The receptive field is computed as Rcur=Rpre+Spre×(Kcur-1)×rate, wherein Rcur is the receptive field of the current layer, Rpre is the receptive field of the previous layer, Spre is the stride of the previous layer, Kcur is the convolution kernel size of the current layer, and rate is the current dilation rate. For a convolution with kernel size 3 and dilation rate 2, assuming the previous layer has stride 1, the receptive field after n such rate-2 dilated convolutions is Rn=Rpre+4×n. Because of the way mlf is combined with each slf, the receptive fields at each level are virtually identical; after passing through the attention module with global pooling, the receptive field quickly grows to 240 and is therefore already relatively large. For this reason the DFM does not use dilation rates that double geometrically, but instead uses convolutions with dilation rate 2 to moderately enlarge the receptive fields without widening the receptive-field gap between levels, refining the output result.
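The receptive-field recurrence above is easy to check numerically; the sketch below stacks three rate-2 dilated 3×3 convolutions (stride 1) on top of the 240-pixel receptive field mentioned in the text:

```python
def receptive_field(r_pre, s_pre, k_cur, rate):
    """R_cur = R_pre + S_pre * (K_cur - 1) * rate."""
    return r_pre + s_pre * (k_cur - 1) * rate

# A 3x3 convolution with dilation rate 2 after a stride-1 layer adds 4 pixels,
# so n stacked such convolutions give R_n = R_pre + 4n.
r = 240  # receptive field after the global-pooling attention stage, per the text
for _ in range(3):
    r = receptive_field(r, 1, 3, 2)  # 240 -> 244 -> 248 -> 252
```

The small, constant 4-pixel growth per layer is what keeps the per-level receptive fields close together, as argued above.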
The BCE loss, i.e., the binary cross-entropy loss, is a common two-class segmentation loss; it is suitable for large-object segmentation but performs poorly in unbalanced scenes with small targets. Dice loss is commonly used for small-object segmentation, since its computation is unaffected by the foreground/background pixel ratio. To balance, during training, the segmentation of slices with many foreground pixels in the middle of the organ against slices with few foreground pixels at the two ends of the organ, the invention combines BCE loss and Dice loss. In addition, a loss on the object edge distance map is used as deep supervision, so that the network attends to object edge information earlier. The hybrid loss function of the invention is defined as follows:
L_Dice = 1 − (2·Σ(pred·target) + ε) / (Σ pred + Σ target + ε)
L_BCE = −Σ [target·log(pred) + (1 − target)·log(1 − pred)]
L_hybrid = L_distance + λ1·L_Dice + λ2·L_BCE
wherein pred is the pixel prediction probability map, target is the ground truth, and ε is a smoothing term set to 1e-8 in the experiments. λ1, λ2 are the loss weighting coefficients, experimentally set to λ1 = 0.5, λ2 = 0.5.
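The hybrid loss can be sketched with NumPy as follows. Note this is an illustrative sketch: the patent does not give the exact form of L_distance, so the mean-squared-error form used below is our assumption.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-8):
    # L_Dice = 1 - (2*sum(pred*target) + eps) / (sum(pred) + sum(target) + eps)
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def bce_loss(pred, target, eps=1e-8):
    # binary cross-entropy; probabilities clipped for numerical stability
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def hybrid_loss(pred, target, dist_pred, dist_gt, lam1=0.5, lam2=0.5):
    # L_hybrid = L_distance + lam1 * L_Dice + lam2 * L_BCE
    l_distance = float(np.mean((dist_pred - dist_gt) ** 2))  # assumed MSE form
    return l_distance + lam1 * dice_loss(pred, target) + lam2 * bce_loss(pred, target)
```

With a perfect prediction all three terms vanish (up to the ε smoothing), so the hybrid loss is near zero.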
The method was validated on clinical ultrasound data and the public OpenCAS dataset. Experiments show that the proposed model can accurately segment the three-dimensional thyroid and achieves the current state-of-the-art performance.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; all simple modifications, equivalent variations and refinements made to the above embodiment according to the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims (10)

1. A three-dimensional ultrasonic thyroid segmentation method based on a multi-scale fusion network, characterized in that it comprises the following steps:
(1) making the network learn to predict a boundary distance map so as to segment the edges of the heterogeneous organ structure;
(2) adding constraint-guided training to the network in a deep-supervision manner, to avoid the final segmentation result depending on the intermediate result of the distance map prediction;
(3) using a CBAM attention module at multiple scales to focus on edge distance information;
(4) fusing the probability maps of all levels with a hole-convolution dense fusion module, and gradually refining the output probability map to generate the final result for each level.
2. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 1, wherein: in the step (1), each stage of the encoder part uses the same number of convolution blocks, and the maximum pooling layer is replaced by strided convolution with stride 2; a single convolution layer is built in the GN-PReLU-Conv order, i.e., the normalization-first (pre-norm) form; all convolution layers use grouped convolution with 4 groups; hole convolution with a hole rate of 2 is used in the third and fourth stages of the encoder to increase the receptive field and obtain more instructive semantic information.
3. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 2, wherein: in the step (2), the feature maps of the subsequent levels are unified to the feature-map size of the first stage by bilinear interpolation, and the edge distance map computed from the mask is used as the constraint for deep supervision.
4. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 3, wherein: in the step (2), given a thyroid mask, after the edge mask is obtained, the distance map D_i from each pixel P_i to the thyroid edge is calculated, and the normalized distance map d is obtained by formula (1),
d(P_i) = exp(−λD_i) (1)
where D_i = min_{b_j ∈ b} dist(P_i, b_j) is the minimum distance from pixel P_i to the boundary pixels b = (b_j)_{j∈J}, and λ is a parameter controlling the normalization effect,
the normalized distance map deep supervised loss function is of the form:
Figure FDA0002948414320000021
wherein
Figure FDA0002948414320000022
A distance map predicted for the network.
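A minimal NumPy sketch of formula (1) follows. The brute-force nearest-boundary search and the 4-neighbour edge-mask definition are our illustrative choices; a production implementation would use a distance transform.

```python
import numpy as np

def normalized_distance_map(mask, lam=0.01):
    # d(P_i) = exp(-lam * D_i), with D_i the minimum Euclidean distance
    # from pixel P_i to an edge pixel of the mask (formula (1)).
    h, w = mask.shape
    pad = np.pad(mask, 1)
    # edge mask: foreground pixels with at least one background 4-neighbour
    edge = (mask == 1) & ((pad[:-2, 1:-1] == 0) | (pad[2:, 1:-1] == 0) |
                          (pad[1:-1, :-2] == 0) | (pad[1:-1, 2:] == 0))
    by, bx = np.nonzero(edge)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # brute-force min distance of every pixel to every edge pixel
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(axis=-1)
    return np.exp(-lam * d)
```

Edge pixels get value exp(0) = 1, and the value decays with distance from the edge at the rate set by λ (0.01 in the patent's experiments).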
5. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 4, wherein: in the step (2), λ is set to 0.01.
6. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 5, wherein: in the step (3), the resolutions of the per-level encoder feature maps slf are unified by bilinear interpolation, taking the resolution of the 2nd level as the reference; after concatenation, a multi-level convolution feature mlf is obtained by a convolution operation; slf and mlf are concatenated and fed into the attention module; guided by the multi-scale features, the information-fusion weights are learned and applied to mlf to obtain the effective information required for refining the level's slf, which is combined with slf to obtain the output feature map.
7. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 6, wherein: in the step (3), the channel attention weight of the attention module is
Mc(F) = σ(W1W0(AvgPool(F)) + W1W0(MaxPool(F))) (3)
where σ denotes the sigmoid function, AvgPool and MaxPool are the average-pooling and max-pooling operations, and W0 ∈ R^(C/r×C) and W1 ∈ R^(C×C/r) are two convolution operations for channel compression and recovery, with a PReLU activation function following the compression;
the spatial attention weight is
MS(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) (4)
where σ denotes the sigmoid function, AvgPool and MaxPool are the average-pooling and max-pooling operations respectively, and f^(7×7) denotes a convolution with a 7×7 kernel;
the channel and space attention modules are connected in series and constitute a residual block, the convolution in the block adopts packet convolution, the normalization layer adopts a GN layer, and the activation function adopts a PReLU.
8. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 7, wherein: in the step (4), the feature maps of the several levels are input one by one and fused in a densely connected manner; the fourth-level feature map carrying high-level semantic information is input first, the convolution results are concatenated with the other levels' feature maps one by one, and the convolution operation is applied iteratively; as the receptive field continually expands and new level information is integrated, the prediction result is gradually optimized, the receptive field being calculated as
R_cur = R_pre + S_pre × (K_cur − 1) × rate (5)
where R_cur is the receptive field of the current layer, R_pre is the receptive field of the previous layer, S_pre is the stride of the previous layer, K_cur is the convolution kernel size of the current layer, and rate is the current hole rate; for a convolution with kernel size 3 and hole rate 2, assuming the previous layer has stride 1, the receptive field after n rate-2 hole convolutions is R_n = R_pre + 4 × n; because mlf and slf are combined so that the receptive fields at each level are virtually identical, after passing through the attention module with global pooling the receptive field quickly grows to 240, so a relatively large receptive field is already obtained; in the subsequent DFM, the receptive fields are moderately increased with convolutions of hole rate 2, so that the receptive-field gap between the levels is not widened, achieving the effect of refining the output result.
9. The multi-scale fusion network-based three-dimensional ultrasonic thyroid segmentation method according to claim 8, wherein: the loss function of the method is
L_Dice = 1 − (2·Σ(pred·target) + ε) / (Σ pred + Σ target + ε)
L_BCE = −Σ [target·log(pred) + (1 − target)·log(1 − pred)]
L_hybrid = L_distance + λ1·L_Dice + λ2·L_BCE (6)
wherein L_Dice is a loss function evaluating the degree of overlap between two sets, L_BCE is the binary cross-entropy, L_hybrid is the hybrid loss function, pred is the pixel prediction probability map, target is the ground truth, ε is a smoothing term set to 1e-8 in the experiments, and λ1, λ2 are the loss weighting coefficients, experimentally set to λ1 = 0.5, λ2 = 0.5.
10. A three-dimensional ultrasonic thyroid segmentation device based on a multi-scale fusion network, characterized in that it comprises:
a segmentation module configured to make the network learn to predict a boundary distance map so as to segment the edges of the heterogeneous organ structure;
a boundary distance map module configured to add constraint-guided training to the network in a deep-supervision manner, to avoid the final segmentation result depending on the intermediate result of the distance map prediction;
an attention module configured to use the CBAM attention module at multiple scales to focus on edge distance information;
and a hole dense fusion module configured to fuse the probability maps of all levels by hole convolution, gradually refining the output probability map to generate the final result for each level.
CN202110202637.9A 2021-02-23 2021-02-23 Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network Pending CN112967300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110202637.9A CN112967300A (en) 2021-02-23 2021-02-23 Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network

Publications (1)

Publication Number Publication Date
CN112967300A true CN112967300A (en) 2021-06-15

Family

ID=76285747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110202637.9A Pending CN112967300A (en) 2021-02-23 2021-02-23 Three-dimensional ultrasonic thyroid segmentation method and device based on multi-scale fusion network

Country Status (1)

Country Link
CN (1) CN112967300A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215079A (en) * 2018-07-17 2019-01-15 艾瑞迈迪医疗科技(北京)有限公司 Image processing method, operation navigation device, electronic equipment, storage medium
US20190015059A1 (en) * 2017-07-17 2019-01-17 Siemens Healthcare Gmbh Semantic segmentation for cancer detection in digital breast tomosynthesis
CN109671086A (en) * 2018-12-19 2019-04-23 深圳大学 A kind of fetus head full-automatic partition method based on three-D ultrasonic
CN111260741A (en) * 2020-02-07 2020-06-09 北京理工大学 Three-dimensional ultrasonic simulation method and device by utilizing generated countermeasure network
CN111833273A (en) * 2020-07-17 2020-10-27 华东师范大学 Semantic boundary enhancement method based on long-distance dependence
CN111968138A (en) * 2020-07-15 2020-11-20 复旦大学 Medical image segmentation method based on 3D dynamic edge insensitivity loss function


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
D. JHA et al.: "A Comprehensive Study on Colorectal Polyp Segmentation With ResUNet++, Conditional Random Field and Test-Time Augmentation", IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 6, 5 January 2021, pages 2029-2040
FENG C et al.: "MDIFNet: Multiscale Distant Information Fusion Network for Thyroid Segmentation in 3D Ultrasound Image", Proceedings of the 2021 6th International Conference on Multimedia Systems and Signal Processing, 6 September 2021, pages 22-28
MILLETARI F et al.: "V-net: Fully convolutional neural networks for volumetric medical image segmentation", arXiv:1606.04797v1, 15 June 2016, pages 1-11
YIN S et al.: "Automatic kidney segmentation in ultrasound images using subsequent boundary distance regression and pixelwise classification networks", arXiv:1811.04815v3, 30 May 2019, pages 1-22
ZHANG C et al.: "Dial/Hybrid cascade 3DResUNet for liver and tumor segmentation", Proceedings of the 2020 4th International Conference on Digital Signal Processing, 10 September 2020, pages 92-96
SHI Feifei et al.: "Saliency Detection Based on Deep Residual Network and Edge Supervised Learning", Laser & Optoelectronics Progress, vol. 56, no. 15, 31 August 2019, pages 1-9

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037714A (en) * 2021-11-02 2022-02-11 大连理工大学人工智能大连研究院 3D MR and TRUS image segmentation method for prostate system puncture
CN114565770A (en) * 2022-03-23 2022-05-31 中南大学 Image segmentation method and system based on edge auxiliary calculation and mask attention
CN114565770B (en) * 2022-03-23 2022-09-13 中南大学 Image segmentation method and system based on edge auxiliary calculation and mask attention
CN116524191A (en) * 2023-05-11 2023-08-01 山东省人工智能研究院 Blood vessel segmentation method of deep learning network integrated with geodesic voting algorithm
CN116524191B (en) * 2023-05-11 2024-01-19 山东省人工智能研究院 Blood vessel segmentation method of deep learning network integrated with geodesic voting algorithm
CN116823842A (en) * 2023-06-25 2023-09-29 山东省人工智能研究院 Vessel segmentation method of double decoder network fused with geodesic model
CN116823842B (en) * 2023-06-25 2024-02-02 山东省人工智能研究院 Vessel segmentation method of double decoder network fused with geodesic model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination