CN117557791A - Medical image segmentation method combining selective edge aggregation and deep neural network

Medical image segmentation method combining selective edge aggregation and deep neural network

Info

Publication number
CN117557791A
Authority
CN
China
Prior art keywords
encoder
transformer
medical image
decoder
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311231035.1A
Other languages
Chinese (zh)
Inventor
Zhu Min (朱敏)
Chen Jilong (陈纪龙)
Cheng Junlong (程俊龙)
Jiang Lei (姜磊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202311231035.1A priority Critical patent/CN117557791A/en
Publication of CN117557791A publication Critical patent/CN117557791A/en
Pending legal-status Critical Current

Classifications

    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G06V10/806 Fusion of extracted features at the feature extraction level
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0499 Feedforward networks
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a medical image segmentation method combining selective edge aggregation and a deep neural network. First, a Transformer-based encoder is constructed in which the MSA and MLP of the standard Transformer block are replaced with a selective edge aggregation module and a densely connected feed-forward network, realizing feature fusion and complementation. Then an encoder and a decoder based on densely connected CNNs are constructed, and the two encoders are connected in parallel so that the network can exchange information at multiple levels; the densely connected CNN decoder fuses multi-scale features from the dual encoders along a low-to-high up-sampling path, recovering the spatial resolution of the feature map in a fine-grained, deep manner. Finally, a loss function combining the target edge and region is designed, and a multi-level optimization strategy optimizes the encoder and decoder simultaneously, so that the network further learns more semantic information and boundary details and refines the segmentation result. The invention can solve the problem of medical image segmentation in real scenes.

Description

Medical image segmentation method combining selective edge aggregation and deep neural network
Technical Field
The invention relates to a medical image segmentation technology in the field of image processing, in particular to a medical image segmentation method combining selective edge aggregation and a deep neural network.
Background
Medical image segmentation is a widely studied and challenging task whose aim is to help clinicians focus on pathological areas and to extract detailed information from medical images for more accurate diagnosis and analysis. Common medical image segmentation tasks include skin lesion segmentation, gland segmentation, thyroid nodule segmentation, and the like. However, because targets in medical images vary greatly in scale, target structure boundaries are blurred, imaging modalities are numerous, and high-quality annotated images for training are scarce in practice, accurate segmentation results are very difficult to obtain.
With the rapid development of deep learning, many end-to-end automatic segmentation methods have been proposed and applied to medical image analysis. U-Net is currently one of the most widely used medical image segmentation models: it uses an encoder to learn high-level semantic representations, a decoder to recover lost spatial information, and skip connections to fuse features of different scales between the encoder and decoder to generate a more accurate segmentation mask. Many variants improving U-Net have since been proposed, but deep-learning segmentation methods based on U-Net and its variants do not explicitly consider that accurate boundary prediction can produce a higher-quality segmentation mask. To address structural boundary ambiguity, methods such as DeepLab and EANet have been reported, which recover boundary details by learning inter-pixel dependencies. However, they either require manual parameter tuning during post-processing or require a carefully designed learnable module to accomplish this labor-intensive task. In addition, because of the limited receptive field of the convolution operation, most existing CNN methods cannot establish long-range dependencies and global context relationships. Repeated striding and pooling operations inevitably reduce image resolution, which makes dense prediction tasks challenging. The advent of the Transformer, first used for natural language processing tasks and able to encode long-distance dependencies, greatly alleviates this problem. While Transformers are good at global context modeling, they lack the spatial information of images and have limitations especially when capturing image structure boundaries. These problems limit the successful application of pure Transformers to medical image datasets with smaller data volumes.
In summary, efficient boundary prediction, fusion of context and spatial features, and good performance on smaller datasets are key issues in medical image segmentation.
Disclosure of Invention
In view of the above problems, the object of the present invention is to provide a medical image segmentation method combining selective edge aggregation and deep neural networks that captures the global context features and shallow spatial features of the image, realizes the multi-scale learning capability of the network, and enables the network to select and retain edge-related features without additional learning. The technical solution is as follows:
a medical image segmentation method combining selective edge aggregation and deep neural networks, comprising the steps of:
step 1: selecting a disclosed medical image segmentation dataset, and preprocessing a training set in the dataset;
step 2: constructing a selective edge aggregation module to enable the network to pay attention to the accuracy of edge division;
step 3: constructing a densely connected feed-forward network to realize feature reuse and multi-scale learning capacity of the network;
step 4: designing a Transformer-based encoder structure comprising a selective edge aggregation module and a densely connected feed-forward network, and retaining image global context information;
step 5: designing an encoder and decoder structure based on dense connection CNN, and extracting image local information and spatial texture information;
step 6: constructing a multistage optimization strategy, and simultaneously optimizing an encoder and a decoder to learn boundary related information to generate better characteristic representation;
step 7: designing an image segmentation framework consisting of a Transformer encoder structure, an encoder and decoder structure based on densely connected CNNs and a multi-level optimization strategy, and completing the segmentation of the medical image.
Further, the medical image segmentation datasets of step 1 are: ISIC2017, PH2, TN-SCUI 2020 challenge, GlaS (Gland Segmentation) and COVID-19 Infection Segmentation; the preprocessing of the training sets in the datasets is as follows: for ISIC2017 and PH2, after normalizing the image colors, all images are resized to 224×224 pixels; for TN-SCUI 2020 challenge and GlaS, all images are resized to 224×224 pixels; and for COVID-19 Infection Segmentation, all images are resized to 352×352 pixels.
Further, the specific process of the step 2 is as follows:
step 2.1: the input feature map after activation of any convolution layer is represented as:
X_Sig ∈ R^(H×W×C)
wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 2.2: extract the edge features of the input feature map X_Sig by a max pooling operation using the edge extraction block, and output the pooling result X_EEB, expressed by the following formula:
X_EEB = Maxpooling(1 - X_Sig, K) - (1 - X_Sig)
wherein K represents a sliding window size;
step 2.3: setting a threshold using the salient feature selection block to select the salient features of the input feature map X_Sig, which is accomplished by the following three steps:
(1) depth-aggregate the channel information of the input feature map X_Sig, i.e. X_agg^(x,y) = (1/C) Σ_(c=1)^(C) X_Sig^(x,y,c);
(2) calculate the average value μ of all positions in the aggregated channel information X_agg, i.e. μ = (1/(H×W)) Σ_(x,y) X_agg^(x,y);
(3) using the average value μ as a threshold, select the salient features of the aggregated channel information X_agg and output the result X_SFS, i.e. X_SFS^(x,y) = 1 if X_agg^(x,y) > μ, and X_SFS^(x,y) = 0 otherwise;
wherein the superscript (x, y) represents the coordinates of a specific position; x ∈ [0, 1, …, H-1], y ∈ [0, 1, …, W-1], and X_SFS ∈ R^(H×W×1);
Step 2.4: x is to be EEB And X SFS Element-by-element multiplication to obtain a feature map M which simultaneously shields the background region and the target boundary 0 ∈R H×W×C
Step 2.5: preserving feature map M using channel selection algorithm 0 ∈R H×W×C The channel which accords with the expected effect is shielded, the channel which does not accord with the expected effect is shielded, and the channel selection result X is output out ∈R H×W×C
Step 2.6: the input feature map after the resolution reduction operation is mapped and expressed as:
T in ∈R H×W×C
wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 2.7: aggregate the features of the input feature map T_in using an average pooling operation, and output the result T_avg, expressed by the following formula:
T_avg = Avgpooling(T_in, K) - T_in
wherein K represents the size of a sliding window, and K is 3 in the invention;
step 2.8: activate T_avg with a Sigmoid activation function, concatenate the result with X_out, perform an average aggregation operation on the concatenated feature maps over the channel dimension, and activate the aggregated feature map to obtain the weight map T_out, with the formula:
T_out = f((1/(2C)) Σ_(c=1)^(2C) [f(T_avg) ⊕ X_out]^(c))
wherein f represents the Sigmoid activation function, C represents the number of channels, c represents the index of the image channel, and ⊕ represents the feature concatenation operation;
step 2.9: add X_EEB and T_out element by element, and multiply the result element by element with T_out to obtain the output result SEA_out of the Selective Edge Aggregation (SEA) module:
SEA_out = (X_EEB + T_out) ⊗ T_out
wherein ⊗ represents element-wise multiplication.
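As an illustration of steps 2.1 to 2.5, the following is a minimal TensorFlow sketch of the edge extraction and salient feature selection operations under the definitions above. The function names are illustrative only, and the depth aggregation is assumed to be a channel-wise mean, which the text does not spell out.

import tensorflow as tf

def edge_extraction(x_sig, k=3):
    # X_EEB = Maxpooling(1 - X_Sig, K) - (1 - X_Sig): max-pooling the inverted
    # map dilates the background, so the residual highlights boundary pixels
    inv = 1.0 - x_sig
    return tf.nn.max_pool2d(inv, ksize=k, strides=1, padding='SAME') - inv

def salient_feature_selection(x_sig):
    # (1) depth-aggregate channel information (assumed: mean over channels)
    x_agg = tf.reduce_mean(x_sig, axis=-1, keepdims=True)   # (B, H, W, 1)
    # (2) average over all spatial positions as the threshold
    mu = tf.reduce_mean(x_agg, axis=[1, 2], keepdims=True)  # (B, 1, 1, 1)
    # (3) keep the positions whose aggregated response exceeds the threshold
    return tf.cast(x_agg > mu, x_sig.dtype)                 # X_SFS, binary mask

# step 2.4: mask the background region and the target boundary simultaneously
# m0 = edge_extraction(x_sig) * salient_feature_selection(x_sig)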
Further, the specific process of constructing the densely connected feedforward network (Dense MLP) in the step 3 is as follows:
step 3.1: the feature map output by the Selective Edge Aggregation (SEA) module is represented as X_SEA ∈ R^(H×W×C), wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 3.2: reshape X_SEA into X_0 ∈ R^(S×C), wherein S = H×W;
step 3.3: take the outputs of all preceding layers as the input of the next layer, expressed by the following formula:
X_l = MLP([X_0, X_1, …, X_(l-1)])
where MLP denotes a layer in the Dense feed-forward network (Dense MLP), [·] denotes concatenation, and M denotes the growth rate of the channel, i.e. the output dimension of the MLP, which in the present invention is 16.
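As a concrete reading of step 3, the densely connected feed-forward network can be sketched as below, where each channel-direction linear layer consumes the concatenation of all earlier outputs and emits M = 16 channels; the number of layers and the GELU activation are assumptions not fixed by the text.

import tensorflow as tf
from tensorflow.keras import layers

class DenseMLP(layers.Layer):
    # densely connected feed-forward network: layer l sees [X_0, X_1, ..., X_{l-1}]
    def __init__(self, num_layers=4, growth_rate=16):
        super().__init__()
        self.mlps = [layers.Dense(growth_rate, activation='gelu')
                     for _ in range(num_layers)]

    def call(self, x):  # x: (B, S, C), with S = H * W
        feats = [x]
        for mlp in self.mlps:
            # each layer takes the channel-wise concatenation of all earlier outputs
            feats.append(mlp(tf.concat(feats, axis=-1)))
        return tf.concat(feats, axis=-1)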
Furthermore, the Transformer-based encoder in step 4 is formed by repeatedly connecting a plurality of Transformer blocks; each Transformer block comprises a normalization layer, a Selective Edge Aggregation (SEA) module and a densely connected feed-forward network (Dense MLP), and a Patch Embedding layer is added before each Transformer block to reduce the resolution of the input feature map; the processing procedure is as follows:
step 4.1: the Patch Embedding layer reduces the resolution of the feature map input to the Transformer block by the following steps:
(1) the input feature map of the Transformer block is represented as X_in ∈ R^(H×W×C), wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
(2) sample the pixels of X_in at intervals, and expand the number of channels to 4 times the original, obtaining X′ ∈ R^((H/2)×(W/2)×4C);
(3) with a convolution kernel of 1 and a grouped convolution with 4 groups, map the channels of X′ back to the same number of channels as X_in, obtaining X_emb ∈ R^((H/2)×(W/2)×C);
Step 4.2: x is to be emb A transducer block consisting of three parts, the normalization layer, SEA module and the transform MLP, is input, expressed by the following formula:
wherein,representing features from the transducer branch through Patch Embedding and features from the CNN branch, respectively, norm represents the normalization layer, SEA represents the Selective Edge Aggregation (SEA) module, and DenseMLP represents the Dense connectivity feed forward network (Dense MLP).
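Step 4.1 can be read as a pixel-unshuffle followed by a grouped pointwise convolution. The sketch below assumes the interval sampling is a space-to-depth rearrangement of 2×2 neighbourhoods and that C is divisible by 4; neither detail is stated explicitly.

import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    # halves resolution: space-to-depth (channels x4), grouped 1x1 conv back to C
    def __init__(self, channels):
        super().__init__()
        # kernel size 1, 4 groups; requires channels % 4 == 0
        self.proj = layers.Conv2D(channels, kernel_size=1, groups=4)

    def call(self, x):                               # (B, H, W, C)
        x = tf.nn.space_to_depth(x, block_size=2)    # (B, H/2, W/2, 4C)
        return self.proj(x)                          # (B, H/2, W/2, C)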
Further, the specific steps of designing the encoder and decoder structure based on densely connected CNNs in step 5 include:
step 5.1: downsample the input feature map twice in succession starting from the first convolution block of the encoder, the final resolution becoming (H/16, W/16);
step 5.2: construct the fused feature of the skip connection between the encoder and the decoder, expressed by the following formula:
F^l = DenseConv(E_c^(l-1) ⊕ E_t^(l-1))
wherein ⊕ represents element-wise addition, E_c^(l-1) and E_t^(l-1) represent the outputs of block l-1 of the CNN encoder and the Transformer encoder, respectively, and DenseConv represents densely connected convolution blocks;
step 5.3: the concatenated channels are reduced to 1/4 of the original using a standard convolution, and then the number of channels is increased to 1/2 of the original channels by a series of densely connected convolution blocks.
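A decoder stage following steps 5.2 and 5.3 might be wired as in the functional-style sketch below; the placement of the upsampling path and the dense_conv_block callable are assumptions based on the description.

import tensorflow as tf
from tensorflow.keras import layers

def decoder_stage(d_prev, e_cnn, e_trans, dense_conv_block):
    # fuse the skip connections of the two encoders with the upsampled decoder path
    skip = layers.add([e_cnn, e_trans])          # element-wise addition of encoder outputs
    d_up = layers.UpSampling2D(2)(d_prev)        # low-to-high upsampling path
    x = layers.concatenate([d_up, skip])         # concatenate skip and decoder features
    x = layers.Conv2D(x.shape[-1] // 4, 1)(x)    # standard conv: channels reduced to 1/4
    return dense_conv_block(x)                   # densely connected conv blocks grow channels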
Further, the specific steps of constructing the multi-level optimization strategy in step 6 are as follows:
step 6.1: calculate the overlap error between the prediction and the true value using the IoU loss, i.e. the target region loss l_IoU, expressed by the formula:
l_IoU = 1 - (Σ_i P_i G_i) / (Σ_i (P_i + G_i - P_i G_i))
wherein P represents the prediction result of the network, G represents the true value, and the subscript i represents the index of the pixel;
step 6.2: the boundary loss for minimizing the boundary error between P and G is calculated by:
(1) extract the boundaries P_b and G_b of P and G using max pooling operations, with the formulas:
G_b = Maxpooling(1 - G, K) - (1 - G),
P_b = Maxpooling(1 - P, K) - (1 - P)
wherein K represents the size of a sliding window, and K is 3 in the invention;
(2) construct the boundary loss l_Edge from P_b and G_b, with the formula:
l_Edge = -Σ_i [α G_b^(i) log P_b^(i) + (1 - α)(1 - G_b^(i)) log(1 - P_b^(i))]
wherein G_b^(i) and P_b^(i) respectively represent the true value and the predicted boundary probability value of the i-th position, and α is a weight coefficient for balancing the number of pixels;
step 6.3: calculate the loss function l_Seg using the target region loss l_IoU and the boundary loss l_Edge:
l_Seg = λ_1 l_IoU + λ_2 l_Edge
wherein λ_1 and λ_2 are weight coefficients to balance the target region loss l_IoU and the boundary loss l_Edge;
step 6.4: perform multi-level optimization on the probability maps P_e output by the Transformer encoder and the probability map P_d output by the decoder based on densely connected CNNs to obtain the total loss l_Total of the training stage, with the formula:
l_Total = l_Seg(P_d, G) + Σ_(n=1)^(N) l_Seg(P_e^(n), G)
where N represents the number of Transformer blocks in the Transformer encoder and n represents the index of the Transformer blocks.
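The losses of step 6 translate directly into tensor operations, as in the sketch below. It assumes the boundary loss is a class-balanced cross-entropy on the pooled boundary maps and that the encoder probability maps have already been upsampled to the ground-truth size; alpha, lam1 and lam2 are unspecified hyperparameters.

import tensorflow as tf

def iou_loss(p, g, eps=1e-7):
    # l_IoU = 1 - sum(P*G) / sum(P + G - P*G)
    inter = tf.reduce_sum(p * g)
    union = tf.reduce_sum(p + g - p * g)
    return 1.0 - inter / (union + eps)

def boundary_loss(p, g, k=3, alpha=0.9, eps=1e-7):
    # boundaries via the max-pooling trick of step 6.2 (K = 3)
    gb = tf.nn.max_pool2d(1.0 - g, k, 1, 'SAME') - (1.0 - g)
    pb = tf.nn.max_pool2d(1.0 - p, k, 1, 'SAME') - (1.0 - p)
    # assumed form: class-balanced cross-entropy on the boundary maps
    return -tf.reduce_mean(alpha * gb * tf.math.log(pb + eps)
                           + (1.0 - alpha) * (1.0 - gb) * tf.math.log(1.0 - pb + eps))

def seg_loss(p, g, lam1=1.0, lam2=1.0):
    # l_Seg = lambda_1 * l_IoU + lambda_2 * l_Edge
    return lam1 * iou_loss(p, g) + lam2 * boundary_loss(p, g)

def total_loss(p_decoder, encoder_probs, g):
    # multi-level optimization: supervise the CNN-decoder output and every
    # Transformer-encoder probability map against the same ground truth
    return seg_loss(p_decoder, g) + tf.add_n([seg_loss(pe, g) for pe in encoder_probs])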
Furthermore, step 7 designs an image segmentation framework consisting of a Transformer encoder structure, an encoder and decoder structure based on densely connected CNNs and a multi-level optimization strategy, and the specific process of completing the segmentation of the medical image is as follows:
step 7.1: inputting the original image into the Transformer-based encoder and the densely connected CNN-based encoder, capturing and retaining global context information with the Transformer branch, and extracting local information and spatial texture information with the CNN branch;
step 7.2: fusing multi-scale features from the dual encoders and the up-sampling path from low to high through the densely connected CNN decoder;
step 7.3: directly upsampling the output of the Transformer encoder to the target size and computing the loss against the true value, computing the loss of the output of the densely connected CNN decoder against the true value, and optimizing the encoder and decoder simultaneously in a multi-level optimization manner.
The beneficial effects of adopting this technical solution are as follows:
1) The invention provides a novel and effective medical image segmentation framework combining selective edge aggregation and a deep neural network to comprehensively address the medical image segmentation problem. The framework can handle the multi-scale and blurred-structure-boundary problems in medical images of different modalities, and still shows excellent segmentation performance even on smaller medical image segmentation datasets.
2) The invention designs a Selective Edge Aggregation (SEA) module that selectively aggregates edge information without additional supervision, making the network pay more attention to the accuracy of edge division. In addition, by adopting dense connections throughout, the codec has fewer parameters and multi-scale learning capability.
3) The invention constructs a loss function combining the target edge and region and simultaneously optimizes the encoder and decoder using a multi-level optimization strategy. This optimization encourages the encoder to learn more boundary-related information, yielding better feature representations.
Description of the drawings:
FIG. 1 is a schematic diagram of a selective edge aggregation module according to the present invention.
Fig. 2 is an edge extraction block of the present invention.
Fig. 3 is a salient feature selection block of the present invention.
FIG. 4 is a Transformer block of the present invention.
FIG. 5 is a flow chart of a medical image segmentation method incorporating selective edge aggregation and deep neural networks of the present invention.
Detailed Description
The technical solution of the invention is described in further detail below with reference to the accompanying drawings.
The invention designs a medical image segmentation method combining selective edge aggregation and a deep neural network. First, densely connected CNNs and Transformers with densely connected feed-forward networks (Dense MLPs) are combined in parallel to form the encoder, and dense connections are likewise used in the decoder, effectively capturing shallow texture information and global context information in medical images in a deeper and multi-scale manner. Second, a plug-and-play Selective Edge Aggregation (SEA) module is proposed that removes noisy background without supervision and selects and retains useful edge features, making the network pay more attention to information related to the target boundary. In addition, a loss function combining target content and edges is designed, and multi-level optimization strategies are used to refine fuzzy structures, helping the network learn better feature representations and produce more accurate segmentation results.
The proposed method is evaluated on a number of different and challenging medical segmentation tasks; it performs well compared with most state-of-the-art methods and has fewer parameters and GFLOPs than other methods.
Step 1: Select the public medical image segmentation datasets and preprocess them.
The specific implementation of preprocessing the training set is as follows:
the invention performs segmentation training tasks on four disclosed medical image segmentation datasets. Wherein the data sets are respectively: ISIC2017, PH2, TN-SCUI 2020challenge, GLAnd segmentation and COVID-19Infection segmentation.
The ISIC2017 dataset is provided by the International Skin Imaging Collaboration (ISIC) and includes 2000 training images, 150 validation images and 600 test images. The PH2 dataset comprises 200 dermoscopic skin images with a resolution of 765×572 pixels; 140 images are randomly selected as the training set, 20 images as the validation set, and the remaining 40 images as the test set. For these two datasets, the image colors are first normalized using the gray-world color constancy algorithm, then all images are resized to 224×224 pixels for the experiments, and finally the training data are augmented during training to improve the generalization capability of the model.
The TN-SCUI 2020 challenge dataset provides 3644 thyroid nodule images of different sizes, and the nodules have been annotated by highly experienced physicians. The dataset is first divided into training, validation and test sets in a ratio of 6:2:2. During training, data augmentation methods such as random rotation, random horizontal and vertical shifts and random flips are applied to the training set to increase the diversity of the training data, and the resolution of all images is uniformly adjusted to 224×224 pixels.
The Gland Segmentation (GlaS) dataset contains microscopy images of hematoxylin and eosin (H&E) stained slides, together with ground truth provided by expert pathologists. The dataset contains 165 images of non-uniform resolution, with a minimum resolution of 433×574 pixels and a maximum resolution of 775×522 pixels. 85 images are selected for training and 80 images are used for testing. The resolution of all images is adjusted to 224×224 pixels in the experiments.
The COVID-19 Infection Segmentation dataset contains 100 axial CT images and corresponding annotation images from more than 40 COVID-19 patients. Considering that the data volume of this dataset is very small, the experiments are performed with five-fold cross-validation (i.e. 80 images are used for training and 20 for validation each time). During training, data augmentation strategies are likewise employed to increase the diversity of the training set, and the images are uniformly scaled to 352×352 pixels.
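The common part of this preprocessing, resizing plus paired random flips, can be sketched as follows; rotations, shifts and the gray-world color step are dataset-specific and omitted. The sketch assumes an RGB image and a single-channel mask.

import tensorflow as tf

def preprocess(image, mask, size=(224, 224)):
    # resize images bilinearly; use nearest-neighbour for masks to keep labels crisp
    image = tf.image.resize(image, size)
    mask = tf.image.resize(mask, size, method='nearest')
    return image, mask

def flip_augment(image, mask):
    # stack image and mask so each random flip is applied to both identically
    both = tf.concat([image, mask], axis=-1)     # (H, W, 3 + 1)
    both = tf.image.random_flip_left_right(both)
    both = tf.image.random_flip_up_down(both)
    return both[..., :3], both[..., 3:]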
Step 2: Construct a Selective Edge Aggregation (SEA) module that receives features from both the Transformer and CNN branches and makes the network focus on the accuracy of edge division. Because the CNN better captures the spatial information of the segmentation target, the CNN branch is used to supplement the Transformer branch so that the two branches realize feature fusion and complementation; the selective edge aggregation module of the invention is shown in fig. 1. The specific construction steps are as follows:
1) The input feature map after activation of any convolution layer is represented as:
X_Sig ∈ R^(H×W×C)
where H and W are the height and width of the image, respectively, and C represents the number of channels.
2) Using the edge extraction block (EEB), extract the edge features of X_Sig in the CNN branch by a max pooling operation and output the result X_EEB; referring to fig. 2, the edge extraction block of the present invention, it is expressed by the following formula:
X_EEB = Maxpooling(1 - X_Sig, K) - (1 - X_Sig)
where K represents the max pooling sliding window size; K is 3 in the present invention.
3) Using the salient feature selection block (SFS), set a threshold to select the salient features of X_Sig in the CNN branch; referring to fig. 3, the salient feature selection block of the present invention, this is accomplished by the following three steps:
(1) depth-aggregate the channel information of X_Sig, i.e. X_agg^(x,y) = (1/C) Σ_(c=1)^(C) X_Sig^(x,y,c);
(2) calculate the average value μ of all positions in X_agg, i.e. μ = (1/(H×W)) Σ_(x,y) X_agg^(x,y);
(3) using μ as a threshold, select the salient features of X_agg and output the result X_SFS, i.e. X_SFS^(x,y) = 1 if X_agg^(x,y) > μ, and X_SFS^(x,y) = 0 otherwise;
where the superscript (x, y) represents the coordinates of a specific position; x ∈ [0, 1, …, H-1], y ∈ [0, 1, …, W-1], and X_SFS ∈ R^(H×W×1).
4) Multiply X_EEB and X_SFS element by element to obtain a feature map M_0 ∈ R^(H×W×C) that masks the background region and the target boundary simultaneously.
5) Use a channel selection algorithm to retain the channels of M_0 ∈ R^(H×W×C) that meet the expected effect and mask the channels that do not, and output the result X_out ∈ R^(H×W×C).
6) The input feature map after the resolution reduction operation is represented as:
T_in ∈ R^(H×W×C)
where H and W are the height and width of the image, respectively, and C represents the number of channels.
7) Aggregate the features of the input feature map T_in using an average pooling operation, and output the result T_avg, expressed by the following formula:
T_avg = Avgpooling(T_in, K) - T_in
where K represents the average pooling sliding window size; K is 3 in the present invention.
8) Activate T_avg with a Sigmoid activation function, concatenate the result with X_out, perform an average aggregation operation on the concatenated feature maps over the channel dimension, and activate the aggregated feature map to obtain the weight map T_out, with the formula:
T_out = f((1/(2C)) Σ_(c=1)^(2C) [f(T_avg) ⊕ X_out]^(c))
where f represents the Sigmoid activation function, C represents the number of channels, c represents the index of the image channel, and ⊕ represents the feature concatenation operation.
9) Add X_EEB and T_out element by element, and multiply the result element by element with T_out to obtain the output result SEA_out of the Selective Edge Aggregation (SEA) module:
SEA_out = (X_EEB + T_out) ⊗ T_out
where ⊗ represents element-wise multiplication.
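For completeness, steps 6) to 9) of the SEA module reduce to the sketch below, under the same assumptions as the earlier sketch; sea_weight_and_output is an illustrative name, not the patent's.

import tensorflow as tf

def sea_weight_and_output(t_in, x_eeb, x_out, k=3):
    # T_avg = Avgpooling(T_in, K) - T_in: local-mean residual of the Transformer feature
    t_avg = tf.nn.avg_pool2d(t_in, k, 1, 'SAME') - t_in
    # concatenate the Sigmoid-activated T_avg with the channel-selected CNN feature,
    # average over channels, and activate again to obtain the weight map T_out
    cat = tf.concat([tf.sigmoid(t_avg), x_out], axis=-1)
    t_out = tf.sigmoid(tf.reduce_mean(cat, axis=-1, keepdims=True))
    # SEA_out = (X_EEB + T_out) * T_out, element-wise
    return (x_eeb + t_out) * t_out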
Step 3: Construct the Transformer block. Each Transformer block contains a densely connected feed-forward network (Dense MLP), which is constructed by applying linear layers in the channel direction in a densely connected manner, further improving the information flow between channels; referring to fig. 4, the Transformer block of the present invention. The specific construction flow is as follows:
1) The feature map output by the Selective Edge Aggregation (SEA) module is represented as X_SEA ∈ R^(H×W×C), where H and W are the height and width of the image, respectively, and C represents the number of channels.
2) Reshape X_SEA into X_0 ∈ R^(S×C), where S = H×W.
3) Take the outputs of all preceding layers as the input of the next layer, expressed by the following formula:
X_l = MLP([X_0, X_1, …, X_(l-1)])
where MLP denotes a layer in the Dense feed-forward network (Dense MLP), [·] denotes concatenation, and M denotes the growth rate of the channel, i.e. the output dimension of the MLP, which in the present invention is 16.
Step 4: Construct the Transformer-based encoder, which is formed by repeatedly connecting a plurality of Transformer blocks. Each Transformer block comprises a normalization layer, a Selective Edge Aggregation (SEA) module and a densely connected feed-forward network (Dense MLP), can adapt to high-resolution images, and is complementary to the spatial features captured by the CNN. A Patch Embedding layer is added before each Transformer block to reduce the resolution of the input feature map, so that the Transformer can expand its receptive field layer by layer like a CNN; referring to fig. 5, the flow chart of the medical image segmentation method combining selective edge aggregation and a deep neural network of the present invention, the Transformer-based encoder is the 'Transformer Encoder' branch in FIG. 5. The specific implementation steps are as follows:
1) The Patch Embedding layer reduces the resolution of the feature map input to the Transformer block, which is mainly completed by the following three steps:
(1) the input feature map of the Transformer block is represented as X_in ∈ R^(H×W×C), where H and W are the height and width of the image, respectively, and C represents the number of channels;
(2) sample the pixels of X_in at intervals, and expand the number of channels to 4 times the original, obtaining X′ ∈ R^((H/2)×(W/2)×4C);
(3) with a convolution kernel of 1 and a grouped convolution with 4 groups, map the channels of X′ back to the same number of channels as X_in, obtaining X_emb ∈ R^((H/2)×(W/2)×C).
2) Input X_emb into the Transformer block consisting of three parts, the normalization layer, the SEA module and the Dense MLP, expressed by the following formulas:
z′ = z_t + SEA(Norm(z_t), z_c)
z_out = z′ + DenseMLP(Norm(z′))
where z_t and z_c represent the features from the Transformer branch through Patch Embedding and the features from the CNN branch, respectively, Norm represents the normalization layer, SEA represents the Selective Edge Aggregation (SEA) module, and DenseMLP represents the densely connected feed-forward network (Dense MLP).
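Assembling the pieces, one Transformer block with MSA replaced by SEA and MLP replaced by the Dense MLP could be sketched as follows; the residual connections and the projection back to the block width are assumptions carried over from the standard Transformer layout.

import tensorflow as tf
from tensorflow.keras import layers

class SEATransformerBlock(layers.Layer):
    # standard block layout with MSA -> SEA and MLP -> Dense MLP, residuals kept
    def __init__(self, dim, sea_module, dense_mlp):
        super().__init__()
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.sea = sea_module        # consumes Transformer- and CNN-branch features
        self.dense_mlp = dense_mlp
        # assumed: project the widened Dense MLP output back to dim for the residual
        self.proj = layers.Dense(dim)

    def call(self, x_trans, x_cnn):
        x = x_trans + self.sea(self.norm1(x_trans), x_cnn)    # SEA replaces MSA
        return x + self.proj(self.dense_mlp(self.norm2(x)))   # Dense MLP replaces MLP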
Step 5: Construct the encoder and decoder based on densely connected CNNs, which form a U-shaped network in which the encoder extracts semantic information of the medical image from shallow to deep layers and the decoder recovers the spatial resolution of the encoder's output features. In addition, skip connections are applied to acquire detailed information from the encoder and decoder to compensate for the information loss caused by downsampling and convolution operations. Referring to fig. 5, the flow chart of the medical image segmentation method combining selective edge aggregation and a deep neural network of the present invention, the encoder and decoder based on densely connected CNNs are the 'CNN Encoder' and 'CNN Decoder' branches in fig. 5. The specific steps for designing the encoder and decoder structure based on densely connected CNNs include:
1) Downsample the input feature map twice in succession starting from the first convolution block of the encoder; the final resolution becomes (H/16, W/16).
2) Construct the fused feature of the skip connection between the encoder and the decoder, expressed by the following formula:
F^l = DenseConv(E_c^(l-1) ⊕ E_t^(l-1))
where ⊕ represents element-wise addition, E_c^(l-1) and E_t^(l-1) represent the outputs of block l-1 of the CNN encoder and the Transformer encoder, respectively, and DenseConv represents densely connected convolution blocks.
3) The concatenated channels are reduced to 1/4 of the original using a standard convolution, and then the number of channels is increased to 1/2 of the original channels by a series of densely connected convolution blocks.
Step 6: To reduce the difference between the prediction and the true value, two loss functions are used here to focus on two independent aspects: the segmentation content and the segmentation boundary. The first is the IoU loss, which minimizes the overlap error between the prediction and the true value; the second is the boundary loss, which minimizes the boundary error between them. In addition, a multi-level optimization strategy is introduced to optimize the encoder and decoder simultaneously; referring to fig. 5, the flow chart of the medical image segmentation method combining selective edge aggregation and a deep neural network of the present invention, the multi-level optimization strategy is the 'MLO Strategy' branch in fig. 5. The specific steps for designing the loss function and the multi-level optimization strategy include:
1) Calculate the overlap error between the prediction and the true value using the IoU loss, i.e. the target region loss l_IoU, expressed by the formula:
l_IoU = 1 - (Σ_i P_i G_i) / (Σ_i (P_i + G_i - P_i G_i))
where P represents the predicted result of the network, G represents the true value, and i represents the index of all pixels in P and G.
2) The boundary loss for minimizing the boundary error between P and G is calculated as follows:
(1) extract the boundaries P_b and G_b of P and G using max pooling operations, with the formulas:
G_b = Maxpooling(1 - G, K) - (1 - G),
P_b = Maxpooling(1 - P, K) - (1 - P)
wherein K represents the maximum pooled sliding window size, and K is 3 in the invention;
(2) construct the boundary loss l_Edge from P_b and G_b, with the formula:
l_Edge = -Σ_i [α G_b^(i) log P_b^(i) + (1 - α)(1 - G_b^(i)) log(1 - P_b^(i))]
where G_b^(i) and P_b^(i) respectively represent the true value and the predicted boundary probability value of the i-th position, and α is a weight coefficient for balancing the number of pixels.
3) Calculate the loss function l_Seg using l_IoU and l_Edge, i.e.
l_Seg = λ_1 l_IoU + λ_2 l_Edge
where λ_1 and λ_2 are weight coefficients to balance the target region loss l_IoU and the boundary loss l_Edge.
4) Perform multi-level optimization on the probability maps P_e output by the Transformer encoder and the probability map P_d output by the decoder based on densely connected CNNs to obtain the total loss l_Total of the training stage, with the formula:
l_Total = l_Seg(P_d, G) + Σ_(n=1)^(N) l_Seg(P_e^(n), G)
where N represents the number of Transformer blocks in the Transformer encoder and n represents the index of the Transformer blocks.
Step 7: Design an image segmentation framework consisting of the Transformer encoder structure, the encoder and decoder structure based on densely connected CNNs and the multi-level optimization strategy; referring to fig. 5, the flow chart of the medical image segmentation method combining selective edge aggregation and a deep neural network of the present invention. The specific process of completing the segmentation of the medical image is as follows:
1) The framework consists of three modules:
A Transformer-based encoder consisting of a plurality of Transformer blocks is constructed to capture and retain important global context information. A Patch Embedding layer is introduced before the Transformer blocks to adapt to high-resolution images and dense prediction tasks, and the MSA and MLP in the standard Transformer block are replaced with the Selective Edge Aggregation (SEA) module and the densely connected feed-forward network (Dense MLP) constructed by the invention, so that the network can accept features from the two branches, the Transformer-based encoder and the densely connected CNN-based encoder, realizing feature fusion and complementation. Through the encoder and decoder based on densely connected CNNs, the network has a natural multi-scale feature extraction capability; the Transformer-based encoder and the densely connected CNN-based encoder are connected in parallel to exchange information at multiple levels, making full use of the local and global information of the image, and the densely connected CNN decoder fuses multi-scale features from the dual encoders and the up-sampling path from low to high, recovering the spatial resolution of the feature map in a finer-grained and deeper manner. In addition, a loss function combining the target edge and region is designed, and the encoder and decoder are optimized simultaneously by a multi-level optimization strategy, so that the network further learns more semantic information and boundary details and refines the segmentation result.
2) Model architecture and hyperparameter settings:
the Keras-based method is realized on NVIDIA RTX3090 GPU (24 g) through training. The learning rate was fixed at 1e-4 using Adam optimizer. The mini batch size was set to 16 and training was stopped using an early stop mechanism when the validation loss was stable and there was no significant change in 30 epochs. Training data was augmented by applying random rotations (+ -25 °), random horizontal and vertical shifts (15%) and random flips (horizontal and vertical). Furthermore, all comparative experiments used the same training set and validation set. After the second phase of the SEAformer CNN branch, the initial weights come from Block2, block3 and Block4 of the pre-trained DenseNet121 on ImageNet, the other layers the invention trains from scratch.
3) Model evaluation method:
Five widely used metrics are used to evaluate model performance: accuracy (Acc), sensitivity (Sens), specificity (Spec), intersection over union (IoU) and the Dice similarity coefficient (Dice). The number of parameters, GFLOPs and FPS of the invention are also reported.
4) The model is implemented as follows:
The original image is input into the Transformer-based encoder and the encoder based on densely connected CNNs; global context information is captured and retained through the Transformer branch, and local information and spatial texture information are extracted through the CNN branch. Multi-scale features from the dual encoders and the up-sampling path are fused from low to high through the densely connected CNN decoder. The output of the Transformer encoder is directly upsampled to the target size and its loss against the true value is computed, the loss of the output of the densely connected CNN decoder against the true value is computed, and the encoder and decoder are optimized simultaneously in a multi-level optimization manner.

Claims (8)

1. A medical image segmentation method combining selective edge aggregation and deep neural network, comprising the steps of:
step 1: selecting a disclosed medical image segmentation dataset, and preprocessing a training set in the dataset;
step 2: constructing a selective edge aggregation module to enable the network to pay attention to the accuracy of edge division;
step 3: constructing a densely connected feed-forward network to realize feature reuse and multi-scale learning capacity of the network;
step 4: designing a Transformer-based encoder structure comprising a selective edge aggregation module and a densely connected feed-forward network, and retaining image global context information;
step 5: designing an encoder and decoder structure based on dense connection CNN, and extracting image local information and spatial texture information;
step 6: constructing a multistage optimization strategy, and simultaneously optimizing an encoder and a decoder to learn boundary related information to generate better characteristic representation;
step 7: designing an image segmentation framework consisting of a Transformer encoder structure, an encoder and decoder structure based on densely connected CNNs and a multi-level optimization strategy, and completing the segmentation of the medical image.
2. The medical image segmentation method combining selective edge aggregation and deep neural network according to claim 1, wherein the medical image segmentation datasets of step 1 are: ISIC2017, PH2, TN-SCUI 2020 challenge, GlaS (Gland Segmentation) and COVID-19 Infection Segmentation; the preprocessing of the training sets in the datasets is as follows: for ISIC2017 and PH2, after normalizing the image colors, all images are resized to 224×224 pixels; for TN-SCUI 2020 challenge and GlaS, all images are resized to 224×224 pixels; and for COVID-19 Infection Segmentation, all images are resized to 352×352 pixels.
3. The medical image segmentation method combining selective edge aggregation and deep neural network according to claim 1, wherein the specific procedure of the step 2 is as follows:
step 2.1: the input feature map after activation of any convolution layer is represented as:
X_Sig ∈ R^(H×W×C)
wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 2.2: extract the edge features of the input feature map X_Sig by a max pooling operation using the edge extraction block, and output the pooling result X_EEB, expressed by the following formula:
X_EEB = Maxpooling(1 - X_Sig, K) - (1 - X_Sig)
wherein K represents a sliding window size;
step 2.3: setting a threshold using the salient feature selection block to select the salient features of the input feature map X_Sig, which is accomplished by the following three steps:
(1) depth-aggregate the channel information of the input feature map X_Sig, i.e. X_agg^(x,y) = (1/C) Σ_(c=1)^(C) X_Sig^(x,y,c);
(2) calculate the average value μ of all positions in the aggregated channel information X_agg, i.e. μ = (1/(H×W)) Σ_(x,y) X_agg^(x,y);
(3) using the average value μ as a threshold, select the salient features of the aggregated channel information X_agg and output the result X_SFS, i.e. X_SFS^(x,y) = 1 if X_agg^(x,y) > μ, and X_SFS^(x,y) = 0 otherwise;
wherein the superscript (x, y) represents the coordinates of a specific position; x ∈ [0, 1, …, H-1], y ∈ [0, 1, …, W-1], and X_SFS ∈ R^(H×W×1);
Step 2.4: x is to be EEB And X SFS Element-by-element multiplication to obtain a feature map M which simultaneously shields the background region and the target boundary 0 ∈R H×W×C
Step 2.5: preserving feature map M using channel selection algorithm 0 ∈R H×W×C The channel which accords with the expected effect is shielded, the channel which does not accord with the expected effect is shielded, and the channel selection result X is output out ∈R H×W×C
Step 2.6: the input feature map after the resolution reduction operation is mapped and expressed as:
T in ∈R H×W×C
wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 2.7: aggregate the features of the input feature map T_in using an average pooling operation, and output the result T_avg, expressed by the following formula:
T_avg = Avgpooling(T_in, K) - T_in
wherein K represents the sliding window size;
step 2.8: activate T_avg with a Sigmoid activation function, concatenate the result with X_out, perform an average aggregation operation on the concatenated feature maps over the channel dimension, and activate the aggregated feature map to obtain the weight map T_out, with the formula:
T_out = f((1/(2C)) Σ_(c=1)^(2C) [f(T_avg) ⊕ X_out]^(c))
wherein f represents the Sigmoid activation function, C represents the number of channels, c represents the index of the image channel, and ⊕ represents the feature concatenation operation;
step 2.9: add X_EEB and the weight map T_out element by element, and multiply the result element by element with T_out to obtain the output result SEA_out of the selective edge aggregation module:
SEA_out = (X_EEB + T_out) ⊗ T_out
wherein ⊗ represents element-wise multiplication.
4. The medical image segmentation method combining selective edge aggregation and deep neural network according to claim 3, wherein the specific process of constructing the densely connected feedforward network in the step 3 is as follows:
step 3.1: the feature map output by the selective edge aggregation module is represented as X_SEA ∈ R^(H×W×C), wherein H and W are the height and width of the image, respectively, and C represents the number of channels;
step 3.2: reshape X_SEA into X_0 ∈ R^(S×C), wherein S = H×W;
step 3.3: take the outputs of all preceding layers as the input of the next layer, expressed by the following formula:
X_l = MLP([X_0, X_1, …, X_(l-1)])
where MLP denotes a layer in the densely connected feed-forward network, [·] denotes concatenation, and M denotes the growth rate of the channel, i.e. the output dimension of the MLP.
5. The method for segmenting the medical image by combining selective edge aggregation and deep neural network according to claim 4, wherein the Transformer-based encoder in step 4 is formed by repeatedly connecting a plurality of Transformer blocks, each Transformer block comprises a normalization layer, a selective edge aggregation module and a densely connected feed-forward network, and a Patch Embedding layer is added before each Transformer block to reduce the resolution of the input feature map; the processing procedure is as follows:
step 4.1: the Patch Embedding layer reduces the resolution of the feature map input to the Transformer block by the following three steps:
(1) the input feature map of the Transformer block is represented as X_in ∈ R^(H×W×C);
(2) sample the pixels of X_in at intervals, and expand the number of channels to 4 times the original, obtaining X′ ∈ R^((H/2)×(W/2)×4C);
(3) with a convolution kernel of 1 and a grouped convolution with 4 groups, map the channels of X′ back to the same number of channels as X_in, obtaining X_emb ∈ R^((H/2)×(W/2)×C);
Step 4.2: x is to be emb A transducer block consisting of three parts, the normalization layer, SEA module and the transform MLP, is input, expressed by the following formula:
wherein,representing the features from the transducer branch through Patch Embedding and from the CNN branch, respectively, the Norm represents the normalization layer, the SEA represents the selective edge aggregation module, and the DenseMLP represents the densely connected feed forward network.
6. The method for medical image segmentation combining selective edge aggregation and deep neural networks according to claim 5, wherein the specific steps of designing the encoder and decoder structure based on densely connected CNNs in step 5 include:
step 5.1: downsample the input feature map twice in succession starting from the first convolution block of the encoder, the final resolution becoming (H/16, W/16);
step 5.2: construct the fused feature of the skip connection between the encoder and the decoder, expressed by the following formula:
F^l = DenseConv(E_c^(l-1) ⊕ E_t^(l-1))
wherein ⊕ represents element-wise addition, E_c^(l-1) and E_t^(l-1) represent the outputs of block l-1 of the CNN encoder and the Transformer encoder, respectively, and DenseConv represents densely connected convolution blocks;
step 5.3: the concatenated channels are reduced to 1/4 of the original using a standard convolution, and then the number of channels is increased to 1/2 of the original channels by a series of densely connected convolution blocks.
7. The medical image segmentation method combining selective edge aggregation and deep neural network according to claim 6, wherein the specific steps of constructing the multi-level optimization strategy in step 6 are as follows:
step 6.1: calculate the overlap error between the prediction and the true value using the IoU loss, i.e. the target region loss l_IoU, expressed by the formula:
l_IoU = 1 - (Σ_i P_i G_i) / (Σ_i (P_i + G_i - P_i G_i))
wherein P represents the prediction result of the network, G represents the true value, and the subscript i represents the index of the pixel;
step 6.2: the boundary loss for minimizing the boundary error between P and G is calculated by:
(1) extract the boundaries P_b and G_b of P and G using max pooling operations, with the formulas:
G_b = Maxpooling(1 - G, K) - (1 - G),
P_b = Maxpooling(1 - P, K) - (1 - P)
wherein K represents a sliding window size;
(2) construct the boundary loss l_Edge from P_b and G_b, with the formula:
l_Edge = -Σ_i [α G_b^(i) log P_b^(i) + (1 - α)(1 - G_b^(i)) log(1 - P_b^(i))]
wherein G_b^(i) and P_b^(i) respectively represent the true value and the predicted boundary probability value of the i-th position, and α is a weight coefficient for balancing the number of pixels;
step 6.3: calculate the loss function l_Seg using the target region loss l_IoU and the boundary loss l_Edge:
l_Seg = λ_1 l_IoU + λ_2 l_Edge
wherein λ_1 and λ_2 are weight coefficients to balance the target region loss l_IoU and the boundary loss l_Edge;
step 6.4: perform multi-level optimization on the probability maps P_e output by the Transformer encoder and the probability map P_d output by the decoder based on densely connected CNNs to obtain the total loss l_Total of the training stage, with the formula:
l_Total = l_Seg(P_d, G) + Σ_(n=1)^(N) l_Seg(P_e^(n), G)
where N represents the number of Transformer blocks in the Transformer encoder and n represents the index of the Transformer blocks.
8. The method for segmenting the medical image by combining selective edge aggregation and deep neural network according to claim 7, wherein step 7 designs an image segmentation framework consisting of a Transformer encoder structure, an encoder and decoder structure based on densely connected CNNs and a multi-level optimization strategy, and the specific process of completing the segmentation of the medical image is as follows:
step 7.1: inputting the original image into the Transformer-based encoder and the densely connected CNN-based encoder, capturing and retaining global context information with the Transformer branch, and extracting local information and spatial texture information with the CNN branch;
step 7.2: fusing multi-scale features from the dual encoders and the up-sampling path from low to high through the densely connected CNN decoder;
step 7.3: directly upsampling the output of the Transformer encoder to the target size and computing the loss against the true value, computing the loss of the output of the densely connected CNN decoder against the true value, and optimizing the encoder and decoder simultaneously in a multi-level optimization manner.
CN202311231035.1A 2023-09-22 2023-09-22 Medical image segmentation method combining selective edge aggregation and deep neural network Pending CN117557791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311231035.1A CN117557791A (en) 2023-09-22 2023-09-22 Medical image segmentation method combining selective edge aggregation and deep neural network


Publications (1)

Publication Number Publication Date
CN117557791A true CN117557791A (en) 2024-02-13

Family

ID=89817390


Country Status (1)

Country Link
CN (1) CN117557791A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118072024A (en) * 2024-04-16 2024-05-24 英瑞云医疗科技(烟台)有限公司 Fusion convolution self-adaptive network skin lesion segmentation method
CN118072024B (en) * 2024-04-16 2024-07-19 英瑞云医疗科技(烟台)有限公司 Fusion convolution self-adaptive network skin lesion segmentation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination