CN109886986B - Dermatoscope image segmentation method based on multi-branch convolutional neural network

Info

Publication number: CN109886986B
Application number: CN201910062500.0A
Authority: CN (China)
Other versions: CN109886986A
Original language: Chinese (zh)
Inventors: Xie Fengying (谢凤英), Yang Jiawen (杨加文), Jiang Zhiguo (姜志国)
Assignee (current and original): Beihang University
Legal status: Active (granted)
Prior art keywords: branch, image, layer, network, loss
Classifications: Image Analysis; Image Processing

Abstract

The invention discloses a dermatoscope image segmentation method based on a multi-branch convolutional neural network, comprising the following steps: first, collecting training samples; second, augmenting the images; third, designing a multi-branch convolutional neural network model; fourth, training the multi-branch convolutional network; fifth, generating a lesion probability map; and sixth, obtaining the segmentation result. The advantages and effects of the invention are: tailored to the characteristics of dermoscopy image data, the training data set is effectively expanded with corresponding image transformations, so that network training is effective and generalization is strong; the convolutional neural network contains multiple branches and fuses rich semantic and detail information, so that compared with an ordinary network it recovers lesion edges better and yields more accurate lesion segmentation results; and the invention is a fully automatic segmentation scheme that only requires the dermoscopy image to be segmented as input and automatically produces its segmentation result without additional processing, making it efficient and simple.

Description

Dermatoscope image segmentation method based on multi-branch convolutional neural network
Technical Field
The invention belongs to the field of computer-aided diagnosis, and particularly relates to a dermatoscope image segmentation method based on a multi-branch convolutional neural network.
Background
Melanocytic skin lesions may be benign or malignant; malignant melanoma is extremely harmful and can readily lead to death if the patient is not treated promptly at an early stage. For cutaneous melanoma, the most effective treatment is early detection followed by excision of the lesion. The dermatoscope, also called an epiluminescence microscope, produces dermoscopy images of high resolution and clarity. Automatic diagnosis of dermoscopy images avoids diagnostic errors caused by physician subjectivity. When diagnosing skin lesions, the shape of the region and its boundary are important diagnostic cues; dermoscopy image segmentation, which extracts an accurate lesion region, is therefore an important step in automatic computer-aided diagnosis. However, segmentation remains challenging, because different lesions vary greatly in shape, color, and other attributes, and because hairs, air bubbles, and other artifacts frequently interfere during image acquisition.
Current dermoscopy image segmentation methods fall mainly into edge-, threshold-, or region-based methods and supervised-learning-based methods. Edge-based methods compute the intensity gradient with classical operators such as Sobel, Laplacian, or Canny and extract regions of large gradient change as lesion boundaries; they work well on images with clear lesion boundaries and little other interference. Threshold-based methods exploit the fact that the lesion color usually differs from the surrounding skin and set one or more thresholds to separate the regions; they are computationally simple, but the thresholds are hard to choose. Region-based methods merge adjacent similar pixels or sub-regions by region growing until the final lesion region is obtained; they suit images whose lesion interior is homogeneous. Supervised-learning methods design relevant features by hand, or mine latent features automatically from data, and then train a classifier to decide whether a sub-region or pixel belongs to the lesion or to normal skin; they depend heavily on feature design and selection and adapt poorly to complex dermoscopy images.
Convolutional neural networks are widely used in image processing and perform excellently in tasks such as object classification, object detection, and object segmentation. They learn high-level features automatically from training data and are highly adaptable. The invention designs a brand-new convolutional neural network model for dermoscopy image segmentation. The model has multiple branches: the lower branches extract detail information and the higher branches extract semantic information; each branch computes its own loss and is trained by back-propagation so that every branch extracts effective features; finally the branch feature maps are fused and upsampled to obtain an accurate dermoscopy segmentation result.
Disclosure of Invention
The invention aims to provide a dermatoscope image segmentation method based on a multi-branch convolutional neural network that automatically and accurately extracts the lesion region, supports the subsequent stages of a dermoscopy diagnosis system, and improves the accuracy of lesion diagnosis. By learning high-level features automatically from the raw image data, the method is robust to interference such as hairs and bubbles in the image, and by fusing multi-branch features it outputs lesion segmentation results with accurate boundaries.
The dermatoscope image segmentation method based on a multi-branch convolutional neural network of the invention comprises the following steps:
Step one: training sample collection
The dermoscopy images used in the invention come from an internationally published data set comprising 2750 original dermoscopy images (2000 used for training, 750 for validation and testing), with lesion ground-truth maps manually annotated by professional dermatologists. Each ground truth is a binary image in which 1 denotes the lesion region and 0 denotes the healthy skin region. The lesions in the data set vary greatly in shape, color, texture, and location, and image resolutions range from 542 × 718 to 2848 × 4288. For convenience of processing, the original images and lesion ground-truth maps are uniformly resized to 512 × 512.
Step two: image augmentation
Generally, the more neurons a convolutional neural network has, the larger its capacity, meaning the model can fit more complex mapping relationships and thus perform well on complex tasks. With more neurons, however, the number of parameters in the network grows greatly, and if there is not enough training data, the subsequent training easily overfits. Therefore, to train a good network, the invention expands the training images in several ways.
Considering the different shooting angles present at acquisition time, each original training image is flipped horizontally and vertically and rotated by 90, 180, and 270 degrees. In addition, since some dermoscopy images were observed to have black borders on the top, bottom, left, and right, each original training image is also translated by 25 pixels in each of the four directions. The lesion ground-truth map undergoes the same transformations as its original image. In total, the 2000 original training images are expanded to 20000 training images.
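The tenfold expansion can be sketched as follows with Pillow; the helper name, the black fill for the translated borders, and applying the identical calls to the ground-truth mask are our assumptions rather than the patent's code.

```python
from PIL import Image

def augment(image):
    """Return the 10 variants of one 512 x 512 training image (original,
    2 flips, 3 rotations, 4 translations); the mask gets the same calls."""
    variants = [image]
    variants.append(image.transpose(Image.FLIP_LEFT_RIGHT))   # horizontal flip
    variants.append(image.transpose(Image.FLIP_TOP_BOTTOM))   # vertical flip
    for angle in (90, 180, 270):                              # rotations
        variants.append(image.rotate(angle))
    for dx, dy in ((25, 0), (-25, 0), (0, 25), (0, -25)):     # 25-pixel shifts
        # The affine maps output (x, y) to input (x - dx, y - dy);
        # pixels shifted in from outside the frame are filled with black.
        variants.append(image.transform(image.size, Image.AFFINE,
                                        (1, 0, -dx, 0, 1, -dy)))
    return variants                                           # 10 per original
```

Applied to 2000 originals, this yields the 20000 training images stated above.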
Step three: multi-branch convolutional neural network model design
In a deep neural network model, the shallow layers extract detail information such as edges and textures, so the shallow feature maps carry much of the lesion boundary information in a dermoscopy image, while the deep layers integrate the low-level features extracted by the shallow layers into semantically rich high-level features that ease classification. For the dermoscopy segmentation task, rich semantic information is needed to classify lesion and background accurately, and detail information is needed to extract the lesion boundary precisely. The invention therefore constructs a convolutional neural network with an encoder-decoder architecture, in which the encoder extracts semantic and detail information and the decoder restores the feature-map size to produce the final segmentation result. In addition, 4 branches are constructed at different layers of the encoder to extract features, and the features of the different branches are finally fused to obtain an accurate dermoscopy segmentation result. The specific design is as follows:
1. Encoder structure design: in an ordinary convolutional network, each convolutional layer usually receives only the output of the preceding layer as input, so information from earlier layers is not well exploited. In the convolutional network constructed here, the output of each convolutional layer within a convolution block is fed as input to the subsequent convolutional layers, so the low-level features already learned are fully reused later, the layers avoid learning duplicate features, information redundancy is reduced, and the network becomes easier to train. The overall framework of the encoder is: input image → convolutional layer → pooling layer → 1st convolution block → 1st pooling block → 2nd convolution block → 2nd pooling block → 3rd convolution block. The details are as follows:
① The input image first passes through 1 convolutional layer (kernel size 7 × 7, stride 2, output dimension 24) and then 1 max-pooling layer (kernel size 3 × 3, stride 2) before entering the 1st convolution block. Denoting the input image by Input, the 7 × 7 convolutional layer by Conv7×7, the max-pooling layer by MaxPool, the batch-normalization layer by BN, and the rectified-linear-unit layer by ReLU, this stem can be expressed as: Input → Conv7×7 → BN → ReLU → MaxPool. Since both Conv7×7 and MaxPool have stride 2, the resolution of the output feature map is 1/4 that of the input image; it is then fed to the 1st convolution block.
② The 1st, 2nd, and 3rd convolution blocks each consist of 6 convolutional layers. Denoting by Conv3×3 a convolutional layer with kernel size 3 × 3, stride 1, and output dimension 12, each layer in a convolution block has the structure BN → ReLU → Conv3×3. The specific structure of the convolution block designed by the invention is shown in Fig. 1.
③ The 1st and 2nd pooling blocks perform dimensionality reduction. Within a convolution block all feature maps have the same size; to shrink the feature maps so that higher-level semantic features can be extracted, and to reduce their dimension so that the number of parameters stays small, a pooling block is designed after each of the 1st and 2nd convolution blocks. Denoting the input feature map by X (with N channels), a convolutional layer with kernel size 3 × 3, stride 1, and output dimension N/2 by Conv3×3, and a max-pooling layer with kernel size 2 × 2 and stride 2 by MaxPool, the pooling block has the structure: X → BN → ReLU → Conv3×3 → MaxPool. After the pooling block, the channel count, width, and height of the feature maps output by the 1st and 2nd convolution blocks are all halved.
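To make the block structure concrete, here is a minimal PyTorch sketch of the convolution block and pooling block under the stated sizes (6 layers per block, 12 output channels per layer, channel-halving pooling block); the class names and the dense concatenation pattern are our reading of "the output of each convolutional layer ... is the input of the subsequent convolutional layer", not code from the patent.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Six BN -> ReLU -> Conv3x3 layers; each layer sees all earlier outputs."""
    def __init__(self, in_channels, growth=12, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1, bias=False),
            ))
        self.out_channels = in_channels + num_layers * growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # reuse all earlier features
        return x

class PoolBlock(nn.Module):
    """BN -> ReLU -> Conv3x3 (channels halved) -> 2x2 max pooling."""
    def __init__(self, in_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=3,
                      padding=1, bias=False),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.out_channels = in_channels // 2

    def forward(self, x):
        return self.block(x)
```

Under these sizes, the stem's 24-channel output grows to 96 channels after block 1 (24 + 6 × 12), is halved to 48 by the first pooling block, reaches 120 after block 2, is halved to 60, and ends at 132 after block 3.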
2. Multi-branch structure design: in the convolutional neural network designed here, 4 branches are constructed at different layers to extract features, and the features of the different branches are finally fused by concatenation, as shown by the 1st through 4th branches in Fig. 2. Each branch extracts information through 1 convolutional layer (kernel size 1 × 1, stride 1, output dimension 2); the lower branches mainly extract detail information and the higher branches mainly extract semantic information. Because of downsampling in the network, the branch output feature maps differ in size: the 2nd branch's output feature map has 1/2 the width and height of the 1st branch's, and the outputs of the 3rd and 4th branches have 1/4. Bilinear interpolation is therefore used to upsample the outputs of the 2nd, 3rd, and 4th branches to the size of the 1st branch's feature map before concatenation. The fused feature map is then input to the decoder on the backbone.
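The branch-and-fuse step can be sketched as below; the channel counts at the four tap points are fixed by Fig. 2 rather than the text, so the sketch leaves them as a parameter, and the module name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchFusion(nn.Module):
    """Four 1x1, 2-channel branch convolutions, bilinear upsampling of the
    smaller maps to the first branch's size, then channel concatenation."""
    def __init__(self, branch_channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, 2, kernel_size=1) for c in branch_channels])

    def forward(self, feats):
        # feats: feature maps tapped at the four encoder depths (Fig. 2)
        target = feats[0].shape[-2:]              # 1st branch spatial size
        outs = []
        for conv, f in zip(self.branches, feats):
            out = conv(f)
            if out.shape[-2:] != target:          # branches 2-4 are smaller
                out = F.interpolate(out, size=target, mode="bilinear",
                                    align_corners=False)
            outs.append(out)
        return torch.cat(outs, dim=1)             # 4 branches x 2 = 8 channels
```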
3. Decoder structure design: since the fused feature map in the encoder has 1/4 the spatial resolution of the original input image, a decoder is designed to restore the spatial size of the feature map. It consists mainly of 2 convolutional layers and 1 four-fold upsampling layer. Denoting the input fused feature map by X, the 1st convolutional layer (kernel size 1 × 1, stride 1, output dimension 4) by Conv1×1_1, the 2nd convolutional layer (kernel size 1 × 1, stride 1, output dimension 2) by Conv1×1_2, and the 4× bilinear-interpolation upsampling layer by Upsample, the decoder has the structure: X → BN → ReLU → Conv1×1_1 → ReLU → Conv1×1_2 → Upsample.
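A matching decoder sketch follows; with four 2-channel branches, the fused input has 8 channels, an assumption consistent with Conv1×1_1 reducing the dimension to 4.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """X -> BN -> ReLU -> Conv1x1_1 -> ReLU -> Conv1x1_2 -> 4x Upsample."""
    def __init__(self, in_channels=8):                  # 4 branches x 2 channels
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4, kernel_size=1),   # Conv1x1_1
            nn.ReLU(inplace=True),
            nn.Conv2d(4, 2, kernel_size=1),             # Conv1x1_2
            nn.Upsample(scale_factor=4, mode="bilinear",
                        align_corners=False),           # back to input size
        )

    def forward(self, x):
        return self.block(x)
```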
Step four: multi-branch convolutional network training
The feature map output by the decoder is passed through a Softmax function to obtain a lesion probability map, which is then compared with the ground-truth segmentation through a cross-entropy function to compute the loss. The loss is back-propagated through the network to obtain the gradients of the parameters, which are adjusted by gradient descent so that the loss decreases and the network improves. The cross-entropy loss is computed as follows:
$$\mathrm{Loss} = -\frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\Bigl[y_{ij}\log p_{ij} + (1-y_{ij})\log(1-p_{ij})\Bigr]$$
where W and H are the width and height of the segmentation ground truth, $y_{ij}$ denotes the true class of pixel $(i, j)$ (1 for lesion, 0 for background), and $p_{ij}$ denotes the probability that pixel $(i, j)$ is a lesion.
To ensure that the information extracted by the 4 branches is valid, the same method as above is applied to the output feature map of each branch to compute the branch losses $\mathrm{Loss}_1$, $\mathrm{Loss}_2$, $\mathrm{Loss}_3$, and $\mathrm{Loss}_4$. Because the branch feature maps do not directly produce the final segmentation result, each branch loss is multiplied by a coefficient of 0.5 before being added to the segmentation loss on the backbone to obtain the total loss, so as to avoid excessive influence on segmentation accuracy; the network parameters are actually obtained by back-propagation training on this total loss. Specifically:
$$\mathrm{Loss}_{\mathrm{all}} = \mathrm{Loss} + 0.5\times(\mathrm{Loss}_1 + \mathrm{Loss}_2 + \mathrm{Loss}_3 + \mathrm{Loss}_4)$$
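Under these definitions, the total loss can be sketched as below. `F.cross_entropy` applies log-Softmax internally, which matches the Softmax-plus-cross-entropy description; upsampling each branch map to the mask size before computing its loss is our assumption, since the patent only states that the branch losses use "the same method".

```python
import torch.nn.functional as F

def total_loss(main_logits, branch_logits, target):
    """main_logits: (B, 2, H, W); branch_logits: list of 4 (B, 2, h, w)
    maps; target: (B, H, W) long tensor with 1 = lesion, 0 = background."""
    loss = F.cross_entropy(main_logits, target)   # backbone segmentation loss
    branch_loss = 0.0
    for logits in branch_logits:
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode="bilinear", align_corners=False)
        branch_loss = branch_loss + F.cross_entropy(logits, target)
    return loss + 0.5 * branch_loss               # Loss_all
```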
step five: lesion distribution probability map generation
The network designed by the invention takes one dermoscopy image as input and outputs a lesion probability map. To improve robustness, before being fed into the network the image to be segmented is flipped horizontally and vertically and rotated by 90, 180, and 270 degrees; together with the original image, 6 images in total are passed through the segmentation network, yielding 6 lesion probability maps. Each probability map is then inverse-transformed according to the transformation of its input image, and finally the 6 probability maps are averaged to give the final lesion probability map.
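This test-time augmentation can be sketched as follows; the (B, C, H, W) tensor layout, the model returning (backbone logits, branch logits), and the Softmax over two output channels are assumptions consistent with the network design above.

```python
import torch

def predict_with_tta(model, image):
    """image: (1, 3, 512, 512) tensor; returns the averaged lesion probability map."""
    transforms = [
        (lambda t: t, lambda t: t),                                        # original
        (lambda t: torch.flip(t, [-1]), lambda t: torch.flip(t, [-1])),    # horizontal flip
        (lambda t: torch.flip(t, [-2]), lambda t: torch.flip(t, [-2])),    # vertical flip
        (lambda t: torch.rot90(t, 1, [-2, -1]),
         lambda t: torch.rot90(t, -1, [-2, -1])),                          # 90 degrees
        (lambda t: torch.rot90(t, 2, [-2, -1]),
         lambda t: torch.rot90(t, -2, [-2, -1])),                          # 180 degrees
        (lambda t: torch.rot90(t, 3, [-2, -1]),
         lambda t: torch.rot90(t, -3, [-2, -1])),                          # 270 degrees
    ]
    probs = []
    model.eval()
    with torch.no_grad():
        for fwd, inv in transforms:
            logits, _ = model(fwd(image))                # backbone output only
            prob = torch.softmax(logits, dim=1)[:, 1:2]  # channel 1 = lesion
            probs.append(inv(prob))                      # undo the transform
    return torch.stack(probs).mean(dim=0)

# Step six then thresholds the averaged map: mask = predict_with_tta(model, x) > 0.5
```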
Step six: segmentation result obtaining
After the lesion probability map is obtained, the threshold is set to 0.5: pixels with probability above 0.5 are taken as lesion pixels and the rest as background skin, finally yielding the segmentation result of the image to be segmented.
The dermatoscope image segmentation method based on a multi-branch convolutional neural network of the invention has the following advantages and effects:
(1) Tailored to the characteristics of dermoscopy image data, the invention expands the training data set effectively with corresponding image transformations, ensuring effective network training and strong generalization.
(2) The convolutional neural network designed by the invention contains multiple branches and fuses rich semantic and detail information, so that compared with an ordinary network it recovers lesion edges better and yields more accurate lesion segmentation results.
(3) The invention is a fully automatic segmentation scheme: only the dermoscopy image to be segmented needs to be input, and the scheme automatically produces its segmentation result without additional processing, efficiently and simply.
Drawings
Fig. 1 is a schematic diagram of a convolution block structure in a network designed by the present invention.
Fig. 2 is a schematic diagram of a network structure designed by the present invention.
Fig. 3 is a flow chart of the implementation of the present invention.
Detailed Description
For a better understanding of the technical aspects of the present invention, reference will now be made in detail to the embodiments of the present invention as illustrated in the accompanying drawings.
The invention is implemented under the PyTorch deep-learning framework; the network structure diagram and the flow chart of the invention are shown in Fig. 2 and Fig. 3, respectively. The computer configuration is: Intel Core i5-6600K processor, 16 GB memory, NVIDIA GeForce GTX 1080 graphics card, and the Ubuntu 16.04 operating system. The dermatoscope image segmentation method based on a multi-branch convolutional neural network specifically comprises the following steps:
Step 1: dermoscopy image training sample collection and processing
A dermoscopy image data set comprising 2750 original dermoscopy images (2000 for training and 750 for validation and testing), together with lesion ground-truth maps manually annotated by professional dermatologists, was downloaded from the International Skin Imaging Collaboration (ISIC) Challenge official website.
Step 2: training image processing
The original dermoscopy images and segmentation ground-truth maps are first uniformly resized to 512 × 512, and the images are then transformed. Considering the different shooting angles present at acquisition time, each original training image is flipped horizontally and vertically and rotated by 90, 180, and 270 degrees. In addition, since some dermoscopy images were observed to have black borders on the top, bottom, left, and right, each original image is also translated by 25 pixels in each of the four directions. The lesion ground-truth maps undergo the same transformations. In total, the 2000 original training images are expanded to 20000 training images.
Step 3: multi-branch convolutional network structure design
The structure of the multi-branch convolutional network designed by the invention is shown in Fig. 2, where the BN and ReLU layers are omitted for simplicity. Following this structure, the network is written as a Module under the PyTorch deep-learning framework and consists mainly of the following three parts:
1. Encoder structure: the overall framework of the encoder is: input image → convolutional layer → pooling layer → 1st convolution block → 1st pooling block → 2nd convolution block → 2nd pooling block → 3rd convolution block. The details are as follows:
① The input image first passes through 1 convolutional layer (kernel size 7 × 7, stride 2, output dimension 24) and then 1 max-pooling layer (kernel size 3 × 3, stride 2) before entering the 1st convolution block. Denoting the input image by Input, the 7 × 7 convolutional layer by Conv7×7, the max-pooling layer by MaxPool, the batch-normalization layer by BN, and the rectified-linear-unit layer by ReLU, this stem can be expressed as: Input → Conv7×7 → BN → ReLU → MaxPool. Since both Conv7×7 and MaxPool have stride 2, the resolution of the output feature map is 1/4 that of the input image; it is then fed to the 1st convolution block.
② The 1st, 2nd, and 3rd convolution blocks each consist of 6 convolutional layers. Denoting by Conv3×3 a convolutional layer with kernel size 3 × 3, stride 1, and output dimension 12, each layer in a convolution block has the structure BN → ReLU → Conv3×3. The specific structure of the convolution block designed by the invention is shown in Fig. 1.
③ The 1st and 2nd pooling blocks perform dimensionality reduction. Within a convolution block all feature maps have the same size; to shrink the feature maps so that higher-level semantic features can be extracted, and to reduce their dimension so that the number of parameters stays small, a pooling block is placed after each of the 1st and 2nd convolution blocks. Denoting the input feature map by X (with N channels), a convolutional layer with kernel size 3 × 3, stride 1, and output dimension N/2 by Conv3×3, and a max-pooling layer with kernel size 2 × 2 and stride 2 by MaxPool, the pooling block has the structure: X → BN → ReLU → Conv3×3 → MaxPool. After the pooling block, the channel count, width, and height of the feature maps output by the 1st and 2nd convolution blocks are all halved.
2. Multi-branch structure: in the convolutional neural network designed by the invention, 4 branches are constructed at different layers to extract features, and the features of the different branches are finally fused by concatenation, as shown by the 1st through 4th branches in Fig. 2. Each branch extracts information through 1 convolutional layer (kernel size 1 × 1, stride 1, output dimension 2); the lower branches mainly extract detail information and the higher branches mainly extract semantic information. Because of downsampling in the network, the branch output feature maps differ in size: the 2nd branch's output feature map has 1/2 the width and height of the 1st branch's, and the outputs of the 3rd and 4th branches have 1/4. Bilinear interpolation is therefore used to upsample the outputs of the 2nd, 3rd, and 4th branches to the size of the 1st branch's feature map before concatenation. The fused feature map is then input to the decoder on the backbone.
3. Decoder structure: since the fused feature map in the encoder has 1/4 the spatial resolution of the original input image, a decoder is designed to restore the spatial size of the feature map. It consists mainly of 2 convolutional layers and 1 four-fold upsampling layer. Denoting the input fused feature map by X, the 1st convolutional layer (kernel size 1 × 1, stride 1, output dimension 4) by Conv1×1_1, the 2nd convolutional layer (kernel size 1 × 1, stride 1, output dimension 2) by Conv1×1_2, and the 4× bilinear-interpolation upsampling layer by Upsample, the decoder has the structure: X → BN → ReLU → Conv1×1_1 → ReLU → Conv1×1_2 → Upsample.
Step 4: multi-branch convolutional network training
The feature map output by the decoder is passed through a Softmax function to obtain a lesion probability map, which is then compared with the ground-truth segmentation through a cross-entropy function to compute the loss. The loss is back-propagated through the network to obtain the gradients of the parameters, which are adjusted by gradient descent so that the loss decreases and the network improves. The cross-entropy loss is computed as follows:

$$\mathrm{Loss} = -\frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\Bigl[y_{ij}\log p_{ij} + (1-y_{ij})\log(1-p_{ij})\Bigr]$$

where W and H are the width and height of the segmentation ground truth, $y_{ij}$ denotes the true class of pixel $(i, j)$ (1 for lesion, 0 for background), and $p_{ij}$ denotes the probability that pixel $(i, j)$ is a lesion.

To ensure that the information extracted by the 4 branches is valid, the same method as above is applied to the output feature map of each branch to compute the branch losses $\mathrm{Loss}_1$, $\mathrm{Loss}_2$, $\mathrm{Loss}_3$, and $\mathrm{Loss}_4$. Because the branch feature maps do not directly produce the final segmentation result, each branch loss is multiplied by a coefficient of 0.5 before being added to the segmentation loss on the backbone to obtain the total loss, so as to avoid excessive influence on segmentation accuracy; the network parameters are actually obtained by back-propagation training on this total loss. Specifically:

$$\mathrm{Loss}_{\mathrm{all}} = \mathrm{Loss} + 0.5\times(\mathrm{Loss}_1 + \mathrm{Loss}_2 + \mathrm{Loss}_3 + \mathrm{Loss}_4)$$
in the training process, the method adopts an Adam random gradient descent method to train the network, the initial learning rate is set to be 0.0005, the number of each batch is set to be 8, and the maximum training round number is set to be 200. And when the total loss on the verification set is no longer in a continuous descending trend, stopping the network training in advance to avoid overfitting.
Step 5: multi-branch convolutional network image testing
The network designed by the invention takes one dermoscopy image as input and outputs a lesion probability map. To improve robustness, before being fed into the network the image to be segmented is flipped horizontally and vertically and rotated by 90, 180, and 270 degrees; together with the original image, 6 images in total are passed through the segmentation network, yielding 6 lesion probability maps. Each probability map is then inverse-transformed according to the transformation of its input image, and finally the 6 probability maps are averaged to give the final lesion probability map.
After the lesion probability map is obtained, the threshold is set to 0.5: pixels with probability above 0.5 are taken as lesion pixels and the rest as background skin, yielding the segmentation image of the test image.
Within the overall segmentation framework, the transformation of the test images, the corresponding inverse transformation of the probability maps, and their fusion and thresholding are all handled automatically in code. In actual testing, therefore, a dermoscopy image is input and the model directly outputs the final segmentation result image.

Claims (1)

1. A dermatoscope image segmentation method based on a multi-branch convolutional neural network, characterized by comprising the following steps:
step one: training sample collection
The dermoscopy images are derived from an internationally published data set comprising 2750 original dermoscopy images, 2000 of which are used for training and 750 for validation and testing; the lesion ground-truth maps are manually annotated by professional dermatologists, each ground truth being a binary image in which 1 denotes the lesion region and 0 denotes the healthy skin region; for convenience of processing, the original images and lesion ground-truth maps are uniformly resized to 512 × 512;
step two: image augmentation
flipping each original training image horizontally and vertically and rotating it by 90, 180, and 270 degrees; translating each original training image by 25 pixels in each of four directions; transforming the lesion ground-truth map together with its original image; and finally expanding the 2000 original training images into 20000 training images;
step three: multi-branch convolutional neural network model design
constructing a convolutional neural network with an encoder-decoder architecture, in which the encoder extracts semantic and detail information and the decoder restores the feature-map size to obtain the final segmentation result; in addition, constructing 4 branches at different layers of the encoder to extract features, and finally fusing the features of the different branches to obtain an accurate dermoscopy image segmentation result;
step four: multi-branch convolutional network training
the feature map output by the decoder is passed through a Softmax function to obtain a lesion probability map, which is then compared with the ground-truth segmentation through a cross-entropy function to compute the loss; the loss is back-propagated through the network to obtain the gradients of the parameters, which are adjusted by gradient descent so that the loss decreases and the network improves; the cross-entropy loss is computed as follows:

$$\mathrm{Loss} = -\frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\Bigl[y_{ij}\log p_{ij} + (1-y_{ij})\log(1-p_{ij})\Bigr]$$

where W and H are the width and height of the segmentation ground truth, $y_{ij}$ denotes the true class of pixel $(i, j)$ (1 for lesion, 0 for background), and $p_{ij}$ denotes the probability that pixel $(i, j)$ is a lesion;

in order to ensure that the information extracted by the 4 branches is valid, the same method as above is applied to the output feature map of each branch to compute the respective branch losses $\mathrm{Loss}_1$, $\mathrm{Loss}_2$, $\mathrm{Loss}_3$, and $\mathrm{Loss}_4$; because the branch feature maps do not directly produce the final segmentation result, each branch loss is multiplied by a coefficient of 0.5 before being added to the segmentation loss on the backbone to obtain the total loss, so as to avoid excessive influence on segmentation accuracy; the parameters in the network are finally obtained by back-propagation training on this total loss; specifically:

$$\mathrm{Loss}_{\mathrm{all}} = \mathrm{Loss} + 0.5\times(\mathrm{Loss}_1 + \mathrm{Loss}_2 + \mathrm{Loss}_3 + \mathrm{Loss}_4)$$
step five: lesion distribution probability map generation
the network designed by the invention takes one dermoscopy image as input and outputs a lesion probability map; to improve robustness, before being fed into the network, the image to be segmented is flipped horizontally and vertically and rotated by 90, 180, and 270 degrees; together with the original image, 6 images in total are passed through the segmentation network, yielding 6 lesion probability maps; each probability map is then inverse-transformed according to the transformation of its input image, and finally the 6 probability maps are averaged to give the final lesion probability map;
step six: segmentation result obtaining
after the lesion probability map is obtained, the threshold is set to 0.5: pixels with probability above 0.5 are taken as lesion pixels and the rest as background skin, finally yielding the segmentation result of the image to be segmented;
the general framework of the encoder is: input image → convolutional layer → pooling layer → 1st convolution block → 1st pooling block → 2nd convolution block → 2nd pooling block → 3rd convolution block, the details being as follows:
the image input to the encoder first passes through 1 convolutional layer with kernel size 7 × 7, stride 2, and output dimension 24, then 1 max-pooling layer with kernel size 3 × 3 and stride 2, and finally enters the 1st convolution block; denoting the input image by Input, the 7 × 7 convolutional layer by Conv7×7, the max-pooling layer by MaxPool, the batch-normalization layer by BN, and the rectified-linear-unit layer by ReLU, this can be expressed as: Input → Conv7×7 → BN → ReLU → MaxPool; since both Conv7×7 and MaxPool have stride 2, the resolution of the output feature map is 1/4 that of the input image, and it is then fed to the 1st convolution block;
the 1st, 2nd, and 3rd convolution blocks each consist of 6 convolutional layers; denoting by Conv3×3 a convolutional layer with kernel size 3 × 3, stride 1, and output dimension 12, each layer in a convolution block has the structure BN → ReLU → Conv3×3;
the 1st and 2nd pooling blocks perform dimensionality reduction; a pooling block is placed after each of the 1st and 2nd convolution blocks; denoting the input feature map by X (with N channels), a convolutional layer with kernel size 3 × 3, stride 1, and output dimension N/2 by Conv3×3, and a max-pooling layer with kernel size 2 × 2 and stride 2 by MaxPool, the pooling block has the structure: X → BN → ReLU → Conv3×3 → MaxPool; after the pooling block, the channel count, width, and height of the feature maps output by the 1st and 2nd convolution blocks are all 1/2 of their original values;
the multi-branch structure is specifically designed as follows: in the convolutional neural network, 4 branches are constructed at different layers to extract features, and the features of the different branches are finally fused by concatenation; each branch extracts information through 1 convolutional layer with kernel size 1 × 1, stride 1, and output dimension 2, the lower branches extracting detail information and the higher branches extracting semantic information; in addition, because of downsampling in the network the branch output feature maps differ in size: the 2nd branch's output feature map has 1/2 the width and height of the 1st branch's, and the outputs of the 3rd and 4th branches have 1/4; bilinear interpolation is therefore used to upsample the outputs of the 2nd, 3rd, and 4th branches to the size of the 1st branch's feature map before concatenation; the fused feature map is input to the decoder on the backbone;
the decoder comprises 2 convolutional layers and 1 four-fold upsampling layer, the input fused feature map being denoted by X; Conv1×1_1 denotes the 1st convolutional layer, with kernel size 1 × 1, stride 1, and output dimension 4; Conv1×1_2 denotes the 2nd convolutional layer, with kernel size 1 × 1, stride 1, and output dimension 2; Upsample denotes a 4× bilinear-interpolation upsampling layer; the decoder has the structure: X → BN → ReLU → Conv1×1_1 → ReLU → Conv1×1_2 → Upsample.
Priority application

Application Number  Priority Date  Filing Date  Title
CN201910062500.0A (en)  2019-01-23  2019-01-23  Dermatoscope image segmentation method based on multi-branch convolutional neural network (filed by Beihang University; granted as CN109886986B, Active)

Publications (2)

Publication Number  Publication Date
CN109886986A (en)   2019-06-14
CN109886986B (en)   2020-09-08


Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant