CN114782532A - Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation - Google Patents


Info

Publication number
CN114782532A
CN114782532A
Authority
CN
China
Prior art keywords
pet
input
channel
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210394761.4A
Other languages
Chinese (zh)
Inventor
胡战利
黄正勇
梁栋
郑海荣
杨永峰
刘新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210394761.4A
Publication of CN114782532A
Legal status: Pending

Classifications

    • G06T3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T7/12: Edge-based segmentation
    • G06T7/13: Edge detection
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T2200/32: Indexing scheme involving image mosaicing
    • G06T2207/10081: Computed x-ray tomography [CT]
    • G06T2207/10088: Magnetic resonance imaging [MRI]
    • G06T2207/10104: Positron emission tomography [PET]
    • G06T2207/20221: Image fusion; image merging
    • G06T2207/30096: Tumor; lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a spatial attention method and device for PET-CT multi-modal tumor segmentation. The method and device set up independently encoded PET and CT input channels in a multi-scale convolutional spatial attention network: the PET image of a PET-CT multi-modal image pair is fed into the PET input channel to extract PET feature information, and the CT image is fed into the CT input channel to extract CT feature information; the PET and CT decoding results are then fused. Working from PET-CT multi-modal images and using a spatial attention method, the method and device treat the characteristics of PET and CT separately: the network first extracts the PET and CT feature information independently and then fuses it. In particular, the use of multi-scale convolution lets the network highlight tumor regions and suppress non-tumor regions, yielding a more accurate tumor segmentation result.

Description

Spatial attention method and device for PET-CT multi-modal tumor segmentation
Technical Field
The invention relates to the field of medical image segmentation, in particular to a spatial attention method and a spatial attention device for PET-CT multi-modal tumor segmentation.
Background
Medical image segmentation, a fundamental task of medical image analysis, mainly consists of delineating tumors from computed tomography (CT), positron emission tomography (PET) or magnetic resonance imaging (MRI) images. CT images have high resolution and can accurately depict internal structural information, but their depiction of soft-tissue structures such as tumors is not sufficiently clear. PET is a molecular imaging technique based on radioactive tracers; it is highly sensitive to soft-tissue regions such as tumors and therefore very effective for tumor detection. However, PET has poor spatial resolution, and the detected tumor boundaries are blurred, which makes it difficult to delineate tumor boundaries in a segmentation task. Combined PET-CT imaging addresses this problem by integrating the advantages of both modalities: the high structural accuracy of CT and the high tumor-detection sensitivity of PET. At present, however, tumor segmentation is performed manually by experienced experts, a time-consuming, labor-intensive and expensive task. Developing an automatic tumor segmentation method based on PET-CT multi-modal imaging therefore has significant scientific and practical value for tumor diagnosis and for reducing physicians' workload.
Most current automatic tumor segmentation methods are based on CT, PET or MRI alone. Multi-modal methods based on PET-CT or PET-MRI also exist, but they merely fuse the two images directly and give little consideration to the differences between the modalities.
Chinese patent application No. 202210499262.X proposes "a spatial attention network for PET-CT multi-modal tumor segmentation", which extracts the feature information of PET and CT separately by designing a single spatial attention module.
Xiaohang Fu et al. published the article "Multimodal Spatial Attention Module for Targeting Multimodal PET-CT Lung Tumor Segmentation" in the IEEE Journal of Biomedical and Health Informatics in 2021. Their network is divided into a PET sub-network and a CT sub-network (the PET-channel network being the multi-modal spatial attention module proposed by the authors) and adopts a U-Net structure. The network comprises 5 layers, each containing two convolutions, with 64, 128, 256, 512 and 1024 convolution kernels per layer, respectively. A final convolution followed by a softmax activation generates the spatial attention result. The authors sample the attention map output by the PET sub-network, fuse it as weight information into the up-sampling stage of the CT sub-network, and generate the final segmentation result.
However, the above prior art has the following technical disadvantages:
1. Tumor segmentation using single-modality data performs poorly;
2. Most algorithms are designed for a specific tumor type and generalize poorly;
3. Methods that use multi-modal data do not account for the differences between modalities.
Disclosure of Invention
The embodiments of the invention provide a spatial attention method and a spatial attention device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation, which at least address the technical problem of poor segmentation accuracy in conventional medical image segmentation.
According to an embodiment of the invention, a spatial attention method for PET-CT multi-modal tumor segmentation is provided, comprising the steps of:
setting a PET input channel and a CT input channel which are independently coded in a multi-scale convolution space attention network, inputting a PET image in a PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and fusing the PET decoding result and the CT decoding result.
Further, the PET input channel and the CT input channel are two symmetrical input channels whose weights are not shared.
Further, before the PET-CT multi-modal images are fed into the input channels, the method further comprises resampling, cropping and normalizing the PET-CT multi-modal images, specifically:
the PET and CT images are first resampled to a resolution of 1 mm × 1 mm and then cropped to a size of 144 × 144; the CT images are then normalized to the range [-1, 1] using the maximum and minimum values, and the PET images are normalized using the mean and variance.
Further, before fusing the PET decoding result and the CT decoding result, the method further includes:
setting a channel weight mask module in the PET input channel and the CT input channel, wherein the channel weight mask module comprises a global average pooling layer, two fully connected layers and a sigmoid layer;
mapping an input of shape [B, C, H, W, D] to [B, C, 1, 1, 1] through global average pooling and compressing the last three dimensions to obtain a weight vector of shape [B, C, 1]; wherein B denotes the batch size, C denotes the number of input channels, and H, W, D denote the input three-dimensional image size;
placing two fully connected layers after the global average pooling, wherein the first fully connected layer has C input nodes and 2·C output nodes, and the second fully connected layer has 2·C input nodes and C output nodes; the second fully connected layer uses a sigmoid activation function to obtain the final channel weight mask;
adding a shortcut connection and multiplying the input by the channel weight mask, thereby completing the compression and expansion of the channel-dimension feature information.
Further, the input data has size [B, C, H, W, D]; the input is first squeezed and expanded along the channel dimension by the channel weight mask module to obtain a result X of size [B, C, H, W, D];
the output of the channel weight mask module then undergoes multi-scale convolution with kernels of three different sizes, 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5, with a convolution stride of 1, padding of 0, 1 and 2 respectively, and ReLU activations; the 3 × 3 × 3 and 5 × 5 × 5 convolution results are matrix-multiplied, the result U of size [B, C, H, W, D] is flattened into a one-dimensional vector S of size [B, C, (H × W × D), 1], weight information T of size [B, C, (H × W × D), 1] is generated by a softmax activation, and T is reshaped back to the original size as W of size [B, C, H, W, D]; the multi-scale spatial attention result is obtained by taking the dot product of W and the 1 × 1 × 1 convolution result; a shortcut connection concatenates the input and output of the multi-scale convolution along the channel dimension, followed by normalization;
finally, a 1 × 1 × 1 convolution adjusts the number of channels back to that of the input, giving a result V of size [B, C, H, W, D].
Furthermore, an auto-expanding residual module is built on the original residual network; the auto-expanding residual module consists of two branches: one branch has a 3 × 3 × 3 convolution kernel, a convolution stride of 1, a padding of 1 and a ReLU activation; the other branch is determined automatically from the input channel count and the configured output channel count;
if the input channel count equals the configured output channel count, this branch performs no operation and the auto-expanding residual module degenerates into a standard residual network; if they are not equal, the branch performs a convolution with a 1 × 1 × 1 kernel and a ReLU activation; the final output is the sum of the two branches;
two auto-expanding residual modules are used in series; the first changes the number of channels and the second leaves it unchanged.
Furthermore, a coding module and a decoding module are arranged in the multi-scale convolution space attention network;
the coding module comprises a residual layer, a spatial attention layer and a pooling layer connected in series; the residual layer consists of two auto-expanding residual modules; the input channels of the successive stages are 2, 16, 32, 64 and 128, and the output channels are 16, 32, 64, 128 and 256, respectively; the input and output channels of the spatial attention layer are identical and the feature map size is unchanged; the pooling layer has a 2 × 2 × 2 kernel and a stride of 2;
the decoding module comprises a deconvolution layer, a residual layer and a spatial attention layer connected in series; the deconvolution layer has a 3 × 3 × 3 kernel, a stride of 2 and a padding of 1; the residual layer consists of two auto-expanding residual modules in series, and both the residual layer and the spatial attention layer keep the input and output channels and sizes unchanged; the input channels of the successive stages are 256, 128, 64 and 32, and the output channels are 128, 64, 32 and 16, respectively.
Furthermore, the PET input channel and the CT input channel both adopt a U-Net structure and comprise 5 encoding modules and 4 decoding modules;
the first input convolution layer of the multi-scale convolutional spatial attention network has a 5 × 5 × 5 kernel, the last layer has a 1 × 1 × 1 kernel with a sigmoid activation function, and the remaining layers use ReLU activations.
Further, a Dice similarity coefficient DSC and a Hausdorff distance HD are used as evaluation indexes; wherein:
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
HD(X,Y)=max(h(X,Y),h(Y,X))
h(X, Y) = max_{x∈X} min_{y∈Y} ‖x − y‖
the loss function is the Dice loss:
Loss_D(X, Y) = 1 − 2Σ(y·ŷ) / (Σy + Σŷ)
also added to the loss function is the Focal loss function:
Loss_F(X, Y) = −α(1 − ŷ)^γ · y·log(ŷ) − (1 − α) · ŷ^γ · (1 − y)·log(1 − ŷ)
the final loss function is:
Loss(X,Y)=LossD(X,Y)+LossF(X,Y)
wherein y and ŷ respectively represent the tumor segmentation label and the segmentation prediction result of the tumor; α is a balance weight factor, set to 0.5; γ is a weight decay (focusing) factor, set to 2;
the Adam optimizer is used during training, with the two momentum factors set to 0.9 and 0.99; training runs for 300 epochs in total; the learning rate decays according to a simulated annealing schedule and is restarted every 25 epochs; the learning rate is initialized to 3e-4 with a minimum of 1e-6; finally, a neural network mapping model capable of automatic three-dimensional tumor segmentation is obtained.
According to another embodiment of the invention, there is provided a spatial attention device for PET-CT multi-modal tumor segmentation, comprising:
the characteristic information extraction unit is used for setting a PET input channel and a CT input channel which are independently coded in the multi-scale convolution space attention network, inputting a PET image in the PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and the fusion unit is used for fusing the PET decoding result and the CT decoding result.
A storage medium storing a program file capable of implementing any one of the above spatial attention methods for PET-CT multi-modal tumor segmentation.
A processor for running a program, wherein the program is run to perform any of the above spatial attention methods for PET-CT multi-modal tumor segmentation.
The spatial attention method and device for PET-CT multi-modal tumor segmentation in the embodiments of the invention are based on PET-CT multi-modal images and use a spatial attention method that treats the characteristics of PET and CT separately: the network first extracts the PET and CT feature information independently and then fuses it. In particular, the use of multi-scale convolution lets the network highlight tumor regions and suppress non-tumor regions, yielding a more accurate tumor segmentation result. The invention extracts the feature information of PET and CT separately in the encoding stage and fuses the extracted PET and CT feature information in the decoding stage, so that the tumor position is located using the high sensitivity of PET to tumor detection while the tumor boundary is determined using the high structural accuracy of CT.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a block diagram of a multi-scale convolution spatial attention network in accordance with the present invention;
FIG. 2 is an overall frame diagram of the multi-scale convolution spatial attention network of the present invention;
FIG. 3 is a graph of the segmentation results of the multi-scale convolution spatial attention network of the present invention;
FIG. 4 is a graph of the results of the boundary segmentation of a lesion with a multi-scale convolution spatial attention network in accordance with the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the invention, there is provided a spatial attention method for PET-CT multi-modal tumor segmentation, comprising the steps of:
setting a PET input channel and a CT input channel which are independently coded in a multi-scale convolution space attention network, inputting a PET image in a PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and fusing the PET decoding result and the CT decoding result.
The spatial attention method for PET-CT multi-modal tumor segmentation in the embodiment of the invention is based on PET-CT multi-modal images, utilizes the spatial attention method, respectively considers the characteristics of PET and CT, firstly utilizes the network to independently extract the characteristic information of the PET and the CT, and then fuses the characteristic information, particularly the application of multi-scale convolution enables the network to highlight the tumor region and inhibit the non-tumor region, thereby obtaining more accurate tumor segmentation results. The invention is used for extracting the characteristic information of PET and CT respectively in the encoding stage, and the invention fuses the characteristic information extracted by the PET and the CT in the decoding stage, thereby positioning the tumor position by utilizing the high sensitivity of PET to tumor detection and determining the tumor boundary by utilizing the high accuracy of CT to the structural information.
The PET input channel and the CT input channel are two symmetrical input channels which are not shared by weights.
Wherein, PET-CT multi-mode image is before the input channel, the method also includes: the method comprises the following steps of resampling, cutting and normalizing PET-CT multi-modal images, and specifically comprises the following steps:
firstly, resampling PET and CT multi-mode images to the resolution of 1mm multiplied by 1mm, and then cutting the PET and CT multi-mode images into the size of 144 multiplied by 144; then, the CT multi-modal image is normalized to the value of [ -1,1] by using the maximum and minimum values, and the PET multi-modal image is normalized by using the mean value and the variance.
Before the PET decoding result and the CT decoding result are fused, the method further includes:
setting channel weight mask modules in a PET input channel and a CT input channel, wherein the channel weight mask modules comprise a global average pooling layer, two full-link layers and a sigmoid layer;
mapping the input with the shape format of [ B, C, H, W, D ] into [ B, C,1,1,1] through global average pooling, compressing the last three dimensions, and obtaining a weight vector with the shape of [ B, C,1 ]; wherein B represents the batch size; c represents the number of input channels; H. w, D denotes the input three-dimensional image size;
two fully connected tiers follow the global average pooling, where the first fully connected tier has an input node of C and an output node of 2C, the second fully connected tier has an input node of 2C and an output node of C; the second full-connection layer uses a sigmoid activation function to obtain a final channel weight mask;
adding a short connection between encoding and decoding, multiplying the input with a channel weight mask, and compressing and expanding the channel dimension characteristic information.
The size of input data used is [ B, C, H, W, D ], the input data is firstly subjected to extrusion and expansion on a channel through a channel weight mask module to obtain a result X, and the size is [ B, C, H, W, D ];
then carrying out multi-scale convolution on the output result of the channel weight mask module, adopting convolution kernels with three different sizes, namely 1 multiplied by 1, 3 multiplied by 3 and 5 multiplied by 5, wherein the convolution step length is 1, the padding is 0, 1 and 2 respectively, and the activation functions are relu; matrix-multiplying convolution results of 3 × 3 × 3 and 5 × 5 × 5, flattening a result U with the size of [ B, C, H, W, D ] into a one-dimensional vector S with the size of [ B, C, (H × W × D),1], generating weight information T with the size of [ B, C, (H × W × D),1] through a softmax activation function, and reconstructing the weight information T into an original size W with the size of [ B, C, H, W, D ]; obtaining a multi-scale space attention result by performing dot product on the result W and a convolution result of 1 multiplied by 1; splicing the input and output of the multi-scale convolution according to channels by using a short connection and carrying out normalization;
finally, the number of channels is adjusted to the same size as the input using a 1 × 1 × 1 convolution, resulting in a size of [ B, C, H, W, D ] for V.
Wherein, an automatic residual error expanding module is arranged on the original residual error network; the automatic residual error extension module consists of two branches, wherein the convolution kernel of one branch is 3 multiplied by 3, the convolution step length is 1, the filling quantity is 1, and the activation function is relu; the other branch is determined according to the automatic change of the input channel and the set output channel;
if the input channel is equal to the set output channel, the branch does not carry out any operation, and the automatic extension residual error module degenerates into a standard residual error network; if the input channel is not equal to the set output channel, the branch circuit executes convolution operation, the size of a convolution kernel is 1 multiplied by 1, and the activation function is relu; the final output is the result of the addition of the two branches;
two automatic residual error expansion modules are used in series; the first automatic residual error expanding module changes the channel number, and the second automatic residual error expanding module does not change the channel number.
Wherein, a coding module and a decoding module are arranged in the multi-scale convolution space attention network;
the coding module comprises a residual error layer, a spatial attention layer and a pooling layer which are connected in series; the residual error layer consists of two automatic expansion residual error modules; the input channels of each stage are respectively 2, 16, 32, 64 and 128, and the output channels are respectively 16, 32, 64, 128 and 256; the input channel and the output channel of the spatial attention layer are kept consistent, and the size of the characteristic image is unchanged; the convolution kernel size of the pooling layer is 2 multiplied by 2, and the convolution step length is 2;
the decoding module comprises an deconvolution layer, a residual error layer and a space attention layer which are connected in series; wherein the convolution kernel size of the deconvolution layer is 3 × 3 × 3, the convolution step length is 2, and the padding is 1; the residual error layer and the space attention layer are both formed by connecting two automatic expansion residual error modules in series, and input and output channels and sizes are kept unchanged; the input channels of each stage are 256, 128, 64, 32, respectively, and the output channels are 128, 64, 32, 16, respectively.
The PET input channel and the CT input channel both adopt a U-Net structure and comprise 5 encoding modules and 4 decoding modules;
the convolution kernel size of the first input convolution layer of the multi-scale convolution space attention network is 5 multiplied by 5, the convolution kernel size of the last layer is 1 multiplied by 1, and the activation function is sigmoid; the remaining layer activation functions are relu.
The Dice similarity coefficient (DSC) and the Hausdorff distance (HD) are used as evaluation indexes, where:
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
HD(X,Y)=max(h(X,Y),h(Y,X))
h(X, Y) = max_{x∈X} min_{y∈Y} ‖x − y‖
the loss function is the Dice loss:
Loss_D(X, Y) = 1 − 2Σ(y·ŷ) / (Σy + Σŷ)
the loss function is also added with the following Focal loss function:
Loss_F(X, Y) = −α(1 − ŷ)^γ · y·log(ŷ) − (1 − α) · ŷ^γ · (1 − y)·log(1 − ŷ)
the final loss function is:
Loss(X,Y)=LossD(X,Y)+LossF(X,Y)
wherein y and ŷ respectively represent the tumor segmentation label and the segmentation prediction result of the tumor; α is a balance weight factor, set to 0.5; γ is a weight decay (focusing) factor, set to 2;
in the training process, an Adam optimizer is used with the two momentum factors set to 0.9 and 0.99; training runs for 300 epochs in total; the learning rate decays according to a simulated annealing schedule and is restarted every 25 epochs; the learning rate is initialized to 3e-4 with a minimum of 1e-6; finally, a neural network mapping model capable of automatic three-dimensional tumor segmentation is obtained.
The spatial attention method for PET-CT multi-modal tumor segmentation of the present invention is described in detail below with specific examples:
in order to solve the problem of poor effect of the automatic tumor segmentation algorithm, the invention considers the difference and complementarity among various modal data, distinguishes the characteristic information of different modal data by designing different input channels and combining a specific spatial attention mechanism, and obtains a final segmentation result in a characteristic fusion mode so as to improve the segmentation performance.
The invention is based on PET-CT multi-mode images, utilizes a space attention method, respectively considers the characteristics of PET and CT, firstly utilizes the network to independently extract the characteristic information of the PET and the CT, and then fuses the characteristic information, particularly the application of multi-scale convolution enables the network to highlight tumor regions and inhibit non-tumor regions, thereby obtaining more accurate tumor segmentation results. In the encoding stage, the network is divided into two symmetrical input channels without shared weight values, and the two input channels are respectively used for extracting the characteristic information of PET and CT, in the decoding stage, the characteristic information extracted by the two input channels is fused, so that the tumor position can be positioned by respectively utilizing the high sensitivity of PET to tumor detection, and the tumor boundary is determined by utilizing the high accuracy of CT to structural information. In addition, in order to establish the relation of the context information, the invention also adds a jump connection between encoding and decoding. The invention obviously improves the network performance without obviously increasing the network complexity, and greatly improves the segmentation accuracy.
The method comprises the following specific operation steps:
the method comprises the following steps: data pre-processing
The present invention resamples, crops and normalizes the input multimodal data as needed, first resampling the PET and CT to a resolution of 1mm x 1mm, and then cropping them to a size of 144 x 144. Before training the network, the CT images are normalized to [ -1,1] using the maximum and minimum values, and the PET images are normalized using the mean and variance.
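The following sketch illustrates this preprocessing step. It is a minimal example, not the patent's code: SimpleITK is assumed for the resampling, the cubic 144³ crop is one way of reading the stated 144 × 144 crop for 3-D volumes, and all function names are illustrative.

```python
import numpy as np
import SimpleITK as sitk

def resample_to_1mm(img: sitk.Image) -> sitk.Image:
    """Resample an image to 1 mm isotropic spacing with linear interpolation."""
    spacing, size = img.GetSpacing(), img.GetSize()
    new_spacing = (1.0, 1.0, 1.0)
    new_size = [int(round(sz * sp / nsp)) for sz, sp, nsp in zip(size, spacing, new_spacing)]
    return sitk.Resample(img, new_size, sitk.Transform(), sitk.sitkLinear,
                         img.GetOrigin(), new_spacing, img.GetDirection(), 0.0,
                         img.GetPixelID())

def center_crop(vol: np.ndarray, crop=(144, 144, 144)) -> np.ndarray:
    """Center-crop a 3-D volume to the target size (assumes the volume is at least that large)."""
    starts = [(s - c) // 2 for s, c in zip(vol.shape, crop)]
    return vol[tuple(slice(st, st + c) for st, c in zip(starts, crop))]

def normalize_ct(ct: np.ndarray) -> np.ndarray:
    """Min-max normalize CT intensities to [-1, 1]."""
    lo, hi = ct.min(), ct.max()
    return 2.0 * (ct - lo) / (hi - lo + 1e-8) - 1.0

def normalize_pet(pet: np.ndarray) -> np.ndarray:
    """Standardize PET intensities with mean and standard deviation (one reading of 'mean and variance')."""
    return (pet - pet.mean()) / (pet.std() + 1e-8)
```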
Step two: design channel weight mask module
The Channel Weight Mask (CWM) module comprises a global average pooling layer, two fully connected layers and a sigmoid layer. The input of shape [B, C, H, W, D] is mapped to [B, C, 1, 1, 1] by global average pooling, and the last three dimensions are compressed to obtain a weight vector of shape [B, C, 1], where B denotes the batch size, C the number of input channels, and H, W, D the input three-dimensional image size. The global average pooling is followed by two fully connected layers: the first has C input nodes and 2C output nodes, and the second has 2C input nodes and C output nodes. The second fully connected layer uses a sigmoid activation function to obtain the final channel weight mask. To apply the weight mask to the original input, the invention adds a shortcut connection and multiplies the input by the channel weight mask, completing the compression and expansion of the channel-dimension feature information.
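A minimal PyTorch sketch of such a channel weight mask block is given below. The class and variable names are illustrative, and the ReLU between the two fully connected layers is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

class ChannelWeightMask(nn.Module):
    """Channel weighting: global average pool -> FC(C -> 2C) -> FC(2C -> C) -> sigmoid -> re-weight input."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)           # [B, C, H, W, D] -> [B, C, 1, 1, 1]
        self.fc1 = nn.Linear(channels, 2 * channels)
        self.fc2 = nn.Linear(2 * channels, channels)
        self.act = nn.ReLU(inplace=True)             # activation between the FC layers (assumption)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.gap(x).view(b, c)                   # squeeze the spatial dimensions
        w = self.gate(self.fc2(self.act(self.fc1(w))))
        return x * w.view(b, c, 1, 1, 1)             # shortcut: multiply the original input by the mask
```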
Step three: designing a multi-scale convolution space attention module
The input data used in the invention is 3D, with size [B, C, H, W, D]; the corresponding spatial attention module is shown in FIG. 1. The input first passes through the Channel Weight Mask (CWM) module, which squeezes and expands it along the channel dimension to give a result X of size [B, C, H, W, D]. The output of the CWM then undergoes multi-scale convolution with kernels of three different sizes, 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5, with a convolution stride of 1, padding of 0, 1 and 2 respectively, and ReLU activations, so the convolution results keep the same size. The 3 × 3 × 3 and 5 × 5 × 5 convolution results are matrix-multiplied; the result U (size [B, C, H, W, D]) is flattened into a one-dimensional vector S (size [B, C, (H × W × D), 1]), weight information T (size [B, C, (H × W × D), 1]) is generated by a softmax activation, and T is reshaped back to the original size as W (size [B, C, H, W, D]). The multi-scale spatial attention result is obtained by taking the dot product of W and the 1 × 1 × 1 convolution result. To preserve context information, a shortcut connection concatenates the input and output of the multi-scale convolution along the channel dimension, followed by batch normalization (BN). Finally, a 1 × 1 × 1 convolution, without an activation function, adjusts the number of channels back to that of the input, giving a result V of size [B, C, H, W, D].
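The sketch below follows this description in PyTorch, reusing the ChannelWeightMask sketch from the previous step. The "matrix multiplication" of the 3 × 3 × 3 and 5 × 5 × 5 results is interpreted here as an element-wise product, and the exact placement of batch normalization is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpatialAttention(nn.Module):
    """Multi-scale convolutional spatial attention over 3-D features of shape [B, C, H, W, D]."""
    def __init__(self, channels: int):
        super().__init__()
        self.cwm = ChannelWeightMask(channels)       # sketch from the previous step
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=1, stride=1, padding=0)
        self.conv3 = nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv5 = nn.Conv3d(channels, channels, kernel_size=5, stride=1, padding=2)
        self.bn = nn.BatchNorm3d(2 * channels)       # after concatenating input and attention output
        self.proj = nn.Conv3d(2 * channels, channels, kernel_size=1)  # restore the channel count

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        x = self.cwm(inp)                                        # X: [B, C, H, W, D]
        a1 = F.relu(self.conv1(x))
        a3 = F.relu(self.conv3(x))
        a5 = F.relu(self.conv5(x))
        u = a3 * a5                                              # "matrix multiplication" read as element-wise
        b, c, h, w, d = u.shape
        t = F.softmax(u.view(b, c, h * w * d, 1), dim=2)         # flatten, softmax over spatial positions
        attn = t.view(b, c, h, w, d) * a1                        # weight the 1x1x1 branch
        out = self.bn(torch.cat([inp, attn], dim=1))             # shortcut concat + batch normalization
        return self.proj(out)                                    # back to C channels, no activation
```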
Step four: designing an auto-expanding residual module
In the designed neural network, to change the number of convolution kernels automatically at different stages and to prevent model degradation, the invention designs an auto-expanding residual module based on the original residual network. The module consists of two branches. One branch has a 3 × 3 × 3 convolution kernel, a convolution stride of 1, a padding of 1 and a ReLU activation. The other branch is determined automatically from the input channel count and the configured output channel count: if they are equal, this branch performs no operation, acting as a shortcut connection, and the module degenerates into a standard residual network; if they are not equal, the branch performs a convolution with a 1 × 1 × 1 kernel and a ReLU activation. The final output is the sum of the two branches. In the experiments, two auto-expanding residual modules are used in series: the first changes the number of channels and the second leaves it unchanged.
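A minimal PyTorch sketch of such an auto-expanding residual block, assuming 3-D convolutions; class and function names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoExpandResidual(nn.Module):
    """Residual block whose skip branch adapts automatically when input and output channels differ."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        # Skip branch: identity if channels match (standard residual), otherwise a 1x1x1 convolution.
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main = F.relu(self.conv(x))
        shortcut = x if isinstance(self.skip, nn.Identity) else F.relu(self.skip(x))
        return main + shortcut

def residual_layer(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two blocks in series: the first changes the channel count, the second keeps it."""
    return nn.Sequential(AutoExpandResidual(in_ch, out_ch), AutoExpandResidual(out_ch, out_ch))
```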
Step five: designing multi-scale spatial attention coding and decoding module
To achieve a better effect, the invention also designs dedicated coding and decoding modules. The coding module comprises a residual layer (ResNet), a spatial attention layer (ISA-Net) and a pooling layer (Pooling) connected in series (RIPM). The residual layer consists of two auto-expanding residual modules; the input channels of the successive stages are 2, 16, 32, 64 and 128, and the output channels are 16, 32, 64, 128 and 256, respectively. The input and output channels of the spatial attention layer are identical and the feature map size is unchanged; the pooling layer has a 2 × 2 × 2 kernel and a stride of 2. The decoding module comprises a deconvolution layer (DeConv), a residual layer (ResNet) and a spatial attention layer (ISA-Net), likewise connected in series (DRIM). The deconvolution layer has a 3 × 3 × 3 kernel, a stride of 2 and a padding of 1. The residual layer again consists of two auto-expanding residual modules in series, with input and output channels and sizes unchanged, and the same holds for the spatial attention layer. The input channels of the successive stages are 256, 128, 64 and 32, and the output channels are 128, 64, 32 and 16, respectively.
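The coding and decoding blocks could be sketched as follows, reusing the residual_layer and MultiScaleSpatialAttention sketches above. The use of max pooling for the "pooling layer" (the text's wording could also mean a strided convolution), the output_padding needed to exactly double the spatial size in the deconvolution, and returning the pre-pooling features for skip connections are assumptions.

```python
import torch.nn as nn

class RIPM(nn.Module):
    """Coding block: residual layer -> spatial attention -> 2x2x2 pooling with stride 2."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.res = residual_layer(in_ch, out_ch)
        self.attn = MultiScaleSpatialAttention(out_ch)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.attn(self.res(x))
        return x, self.pool(x)          # pre-pool features kept for possible skip connections (assumed)

class DRIM(nn.Module):
    """Decoding block: transposed convolution -> residual layer -> spatial attention."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                                     padding=1, output_padding=1)   # output_padding assumed
        self.res = residual_layer(out_ch, out_ch)
        self.attn = MultiScaleSpatialAttention(out_ch)

    def forward(self, x):
        return self.attn(self.res(self.up(x)))
```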
Step six: designing an integral network structure
Based on the RIPM and DRIM coding and decoding modules, the invention designs the network architecture shown in FIG. 2. The network input is divided into a PET input channel and a CT input channel. Each channel adopts a U-Net structure comprising 5 coding modules and 4 decoding modules; the two channels are structurally symmetrical but do not share weight parameters. The two channels are encoded independently, and in the decoding stage the result of each decoding module of the PET channel is fused into the CT channel to produce the final output. The last layer of the network has a 1 × 1 × 1 kernel with a sigmoid activation function; the remaining layers use ReLU activations. The first input convolution layer of the network has a 5 × 5 × 5 kernel, which facilitates a larger receptive field. The network parameter settings are listed in Table 1 below.
[Table 1: network parameter settings]
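A structural skeleton of the dual-channel network, reusing the RIPM and DRIM sketches above, is shown below. The fusion rule (addition), the 1-to-2-channel input convolution, treating the fifth coding stage as the bottleneck, and the omission of encoder-decoder skip connections are all simplifying assumptions; the text does not specify these details.

```python
import torch
import torch.nn as nn

class DualBranchNet(nn.Module):
    """Dual-channel U-Net-like skeleton: independent PET and CT encoders, fused decoding."""
    ENC = [(2, 16), (16, 32), (32, 64), (64, 128), (128, 256)]   # per-stage (in, out) channels
    DEC = [(256, 128), (128, 64), (64, 32), (32, 16)]

    def __init__(self):
        super().__init__()
        # 5x5x5 input conv; mapping 1 modality channel to the 2 channels stage 1 expects is an assumption.
        self.pet_in = nn.Conv3d(1, 2, kernel_size=5, padding=2)
        self.ct_in = nn.Conv3d(1, 2, kernel_size=5, padding=2)
        self.pet_enc = nn.ModuleList(RIPM(i, o) for i, o in self.ENC)
        self.ct_enc = nn.ModuleList(RIPM(i, o) for i, o in self.ENC)
        self.pet_dec = nn.ModuleList(DRIM(i, o) for i, o in self.DEC)
        self.ct_dec = nn.ModuleList(DRIM(i, o) for i, o in self.DEC)
        self.head = nn.Conv3d(self.DEC[-1][1], 1, kernel_size=1)   # final 1x1x1 conv, sigmoid below

    def forward(self, pet, ct):
        p, c = self.pet_in(pet), self.ct_in(ct)
        last = len(self.pet_enc) - 1
        for i, (pe, ce) in enumerate(zip(self.pet_enc, self.ct_enc)):
            p_feat, p_pool = pe(p)
            c_feat, c_pool = ce(c)
            # The fifth coding stage is used as the bottleneck: keep its pre-pool features (assumed).
            p = p_pool if i < last else p_feat
            c = c_pool if i < last else c_feat
        for pd, cd in zip(self.pet_dec, self.ct_dec):
            p = pd(p)
            c = cd(c) + p              # fuse each PET decoder output into the CT path (fusion rule assumed)
        return torch.sigmoid(self.head(c))
```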
Step seven: design loss function and evaluation index
In the invention, the Dice similarity coefficient (DSC) and the Hausdorff distance (HD) are used as evaluation indexes, where:
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
HD(X,Y)=max(h(X,Y),h(Y,X))
h(X, Y) = max_{x∈X} min_{y∈Y} ‖x − y‖
the loss function is the Dice loss:
Loss_D(X, Y) = 1 − 2Σ(y·ŷ) / (Σy + Σŷ)
in order to solve the problem of sample imbalance, the invention also adds a Focal loss function in the loss function:
Loss_F(X, Y) = −α(1 − ŷ)^γ · y·log(ŷ) − (1 − α) · ŷ^γ · (1 − y)·log(1 − ŷ)
the final loss function is:
Loss(X,Y)=LossD(X,Y)+LossF(X,Y)
where y and ŷ respectively represent the tumor segmentation label and the segmentation prediction result of the tumor; α is a balance weight factor, set to 0.5; γ is a weight decay (focusing) factor, set to 2. FIG. 3 shows segmentation results of the network: in the first row, columns 1 and 2 are the CT and PET inputs, and columns 3 and 4 are the tumor label and the segmentation result of the proposed network; the second row shows the segmentation results of four comparison algorithms. FIG. 4 shows tumor boundary segmentation results, where the contours from the outside to the inside are, in order, V-Net, 3D U-Net, SE-Net, SK-Net, the segmentation result of the proposed network, and the tumor label.
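A sketch of the loss terms and evaluation metrics as defined above. SciPy's directed Hausdorff distance is used for HD, and the smoothing constant eps is an implementation detail not given in the text.

```python
import numpy as np
import torch
from scipy.spatial.distance import directed_hausdorff

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on sigmoid probabilities: 1 - 2*sum(y*y_hat) / (sum(y) + sum(y_hat))."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred, target, alpha: float = 0.5, gamma: float = 2.0, eps: float = 1e-6):
    """Binary focal loss with balance weight alpha = 0.5 and focusing factor gamma = 2."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - pred) ** gamma * target * torch.log(pred)
    neg = -(1.0 - alpha) * pred ** gamma * (1.0 - target) * torch.log(1.0 - pred)
    return (pos + neg).mean()

def total_loss(pred, target):
    """Loss(X, Y) = Loss_D(X, Y) + Loss_F(X, Y)."""
    return dice_loss(pred, target) + focal_loss(pred, target)

def dice_coefficient(pred_mask: torch.Tensor, target_mask: torch.Tensor, eps: float = 1e-6):
    """DSC evaluation metric on binarized masks."""
    inter = (pred_mask * target_mask).sum()
    return (2.0 * inter + eps) / (pred_mask.sum() + target_mask.sum() + eps)

def hausdorff_distance(x_pts: np.ndarray, y_pts: np.ndarray) -> float:
    """HD(X, Y) = max(h(X, Y), h(Y, X)) on point sets, e.g. surface voxel coordinates."""
    return max(directed_hausdorff(x_pts, y_pts)[0], directed_hausdorff(y_pts, x_pts)[0])
```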
Step eight: training network
The Adam optimizer is used during training, with the two momentum factors set to 0.9 and 0.99; training runs for 300 epochs in total; the learning rate decays according to a simulated annealing schedule and is restarted every 25 epochs; the learning rate is initialized to 3e-4 with a minimum of 1e-6. The result is a neural network mapping model capable of automatic three-dimensional tumor segmentation.
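A training-loop sketch consistent with these settings, reusing total_loss from the previous step. The patent's "simulated annealing" decay with restarts every 25 epochs is approximated here by cosine annealing with warm restarts, and the data-loader interface is assumed.

```python
import torch

def build_training(model: torch.nn.Module):
    """Adam with momentum factors (0.9, 0.99), lr 3e-4, restart every 25 epochs, floor 1e-6."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.99))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=25, eta_min=1e-6)
    return optimizer, scheduler

def train(model, loader, epochs: int = 300, device: str = "cuda"):
    optimizer, scheduler = build_training(model)
    model.to(device).train()
    for epoch in range(epochs):
        for pet, ct, label in loader:                 # loader assumed to yield preprocessed volumes
            pet, ct, label = pet.to(device), ct.to(device), label.to(device)
            pred = model(pet, ct)
            loss = total_loss(pred, label)            # Dice + Focal loss from the previous sketch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                              # one scheduler step per epoch
```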
Example 2
According to another embodiment of the invention, there is provided a spatial attention device for PET-CT multi-modal tumor segmentation, comprising:
the characteristic information extraction unit is used for setting a PET input channel and a CT input channel which are independently coded in the multi-scale convolution space attention network, inputting a PET image in the PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and the fusion unit is used for fusing the PET decoding result and the CT decoding result.
The spatial attention device for PET-CT multi-modal tumor segmentation in the embodiment of the invention is based on PET-CT multi-modal images, utilizes a spatial attention method, respectively considers the characteristics of PET and CT, firstly utilizes a network to independently extract the characteristic information of the PET and the CT, and then fuses the characteristic information, particularly the application of multi-scale convolution enables the network to highlight a tumor region and inhibit a non-tumor region, so as to obtain a more accurate tumor segmentation result. The invention is used for extracting the characteristic information of PET and CT respectively in the encoding stage, and the invention fuses the characteristic information extracted by the PET and the CT in the decoding stage, thereby positioning the tumor position by utilizing the high sensitivity of PET to tumor detection and determining the tumor boundary by utilizing the high accuracy of CT to the structural information.
The spatial attention device for PET-CT multi-modal tumor segmentation of the present invention is described in detail below with specific embodiments:
in order to solve the problem of poor effect of the automatic tumor segmentation algorithm, the invention considers the difference and complementarity among various modal data, distinguishes the characteristic information of different modal data by designing different input channels and combining a specific spatial attention mechanism, and obtains a final segmentation result in a characteristic fusion mode so as to improve the segmentation performance.
The invention is based on PET-CT multi-mode images, utilizes a space attention method, respectively considers the characteristics of PET and CT, firstly utilizes the network to independently extract the characteristic information of the PET and the CT, and then fuses the characteristic information, particularly the application of multi-scale convolution enables the network to highlight tumor regions and inhibit non-tumor regions, thereby obtaining more accurate tumor segmentation results. In the encoding stage, the network is divided into two symmetrical input channels without shared weight values, and the two input channels are respectively used for extracting the characteristic information of PET and CT, in the decoding stage, the two extracted characteristic information are fused, so that the tumor position can be positioned by respectively utilizing the high sensitivity of PET to tumor detection, and meanwhile, the tumor boundary is determined by utilizing the high accuracy of CT to structural information. In addition, in order to establish the relation of the context information, the invention also adds a jump connection between encoding and decoding. The invention obviously improves the network performance without obviously increasing the network complexity, and greatly improves the segmentation accuracy.
The method comprises the following specific operation steps:
the method comprises the following steps: data pre-processing
The present invention resamples, crops, and normalizes the input multimodal data as needed, first resampling the PET and CT to a resolution of 1mm x 1mm, and then cropping them to a size of 144 x 144. Before training the network, the CT images are normalized to [ -1,1] using the maximum and minimum values, and the PET images are normalized using the mean and variance.
Step two: design channel weight mask module
The Channel Weight Mask (CWM) module comprises a global average pooling layer, two fully connected layers and a sigmoid layer. And mapping the input with the shape format of [ B, C, H, W, D ] into [ B, C,1,1,1] by global average pooling, compressing the last three dimensions, and obtaining a weight vector with the shape of [ B, C,1 ]. Wherein B represents a batch size; c represents the number of input channels; H. w, D denotes the input three-dimensional image size. The global average pooling is followed by two fully-connected tiers, where the first fully-connected tier has an input node of C and an output node of 2C, the second fully-connected tier has an input node of 2C and an output node of C. And the second full-connection layer uses a sigmoid activation function to obtain a final channel weight mask. In order to apply the weight mask to the original input, the invention adds a short connection, and multiplies the input by the channel weight mask, thereby completing the compression and expansion of the channel dimension characteristic information.
Step three: designing a multi-scale convolution space attention module
In the present invention, the input data used in the present invention are all 3D, with dimensions [ B, C, H, W, D ], and the corresponding designed spatial attention module is shown in fig. 1. The invention firstly inputs the data to pass through a Channel Weight Mask (CWM) module to complete the extrusion and expansion on the channel to obtain a result X (the size is [ B, C, H, W, D ]). And then carrying out multi-scale convolution on the output result of the CWM, adopting convolution kernels with three different sizes, namely 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5, wherein the convolution step length is 1, the padding step lengths are 0, 1 and 2 respectively, and the activation functions are relu, so that the size of the convolution result is not changed. Performing matrix multiplication on convolution results of 3 × 3 × 3 and 5 × 5 × 5, flattening a result U (with the size of [ B, C, H, W, D ]) into a one-dimensional vector S (with the size of [ B, C, (H × W × D),1]), generating weight information T (with the size of [ B, C, (H × W × D),1]) through a softmax activation function, and reconstructing the weight information T into an original size W (with the size of [ B, C, H, W, D ]); and performing dot product on the result W and the convolution result of 1 multiplied by 1 to obtain a multi-scale space attention result. To preserve context information, the input and output of the multiscale convolution are channel-wise spliced and normalized using a short connection (Batch Normalization, BN). Finally, the present invention uses a 1 × 1 × 1 convolution to adjust the number of channels to be the same as the input, without using the activation function. As a result, V was in the size [ B, C, H, W, D ].
Step four: designing an auto-expanding residual module
In the designed neural network, in order to automatically change the number of convolution kernels at different stages and prevent model degradation, the invention designs an automatic residual error extension module based on the original residual error network. The module consists of two branches, wherein the convolution kernel size of one branch is 3 multiplied by 3, the convolution step length is 1, the filling quantity is 1, and the activation function is relu. And the other branch is determined according to the automatic change of the input channel and the set output channel. If the input channel and the set output channel are equal, the branch does not perform any operation, corresponding to a short connection, and the module degenerates to a standard residual network. If the input channel and the set output channel are not equal, the branch performs a convolution operation, the convolution kernel size is 1 × 1 × 1, and the activation function is relu. The final output is the result of the addition of the two branches. In the experiment, the invention uses two automatic residual error expansion modules in series. The first automatic expansion residual error module changes the number of channels, and the second automatic expansion residual error module does not change the number of channels.
Step five: designing multi-scale spatial attention coding and decoding module
In order to achieve better effect, the invention also designs a corresponding coding and decoding module separately. The coding module comprises a residual layer (ResNet), a spatial attention layer (ISA-Net) and a Pooling layer (Pooling), which are connected in series (RIPM). The residual error layer consists of two automatic residual error expansion modules; the input channels of each stage are 2, 16, 32, 64, 128, respectively, and the output channels are 16, 32, 64, 128, 256, respectively. The input channel and the output channel of the spatial attention layer are kept consistent, and the size of the characteristic image is unchanged; the convolution kernel size of the pooling layer is 2 × 2 × 2, and the convolution step size is 2. The decoding module includes an deconvolution layer (DeConv), a residual layer (ResNet), and a spatial attention layer (ISA-Net), which are also in series (DRIM). The convolution kernel size of the deconvolution layer is 3 × 3 × 3, the convolution step is 2, and the padding is 1. The residual layer is also composed of two auto-expanding residual modules connected in series, the input and output channels and sizes remain unchanged, as does the spatial attention layer. The input channels of each stage are 256, 128, 64, 32, respectively, and the output channels are 128, 64, 32, 16, respectively.
Step six: designing an integral network structure
Based on the RIPM and DRIM codec modules, the present invention designs a network architecture as shown in fig. 2. The network input is divided into a PET input channel and a CT input channel. Each channel adopts a U-Net structure and comprises 5 coding modules and 4 decoding modules, wherein the two channel structures are symmetrical but the weight parameters are not shared. The two are coded independently, and the result of each decoding module of the PET channel is fused with the CT channel respectively in the decoding stage to be used as the final output result. The convolution kernel size of the last layer of the network is 1 multiplied by 1, and the activation function is sigmoid. The remaining layer activation functions are relu. The convolution kernel size of the first input convolutional layer of the network is 5 × 5 × 5, which is convenient for obtaining a larger receptive field. The network parameter settings are shown in table 1 below.
[Table 1: network parameter settings]
Step seven: design loss function and evaluation index
In the invention, the Dice similarity coefficient (DSC) and the Hausdorff distance (HD) are used as evaluation indexes, where:
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
HD(X,Y)=max(h(X,Y),h(Y,X))
h(X, Y) = max_{x∈X} min_{y∈Y} ‖x − y‖
the loss function is the Dice loss:
Loss_D(X, Y) = 1 − 2Σ(y·ŷ) / (Σy + Σŷ)
in order to solve the problem of sample imbalance, the invention also adds a Focal loss function in the loss function:
Loss_F(X, Y) = −α(1 − ŷ)^γ · y·log(ŷ) − (1 − α) · ŷ^γ · (1 − y)·log(1 − ŷ)
the final loss function is:
Loss(X,Y)=LossD(X,Y)+LossF(X,Y)
where y and ŷ respectively represent the tumor segmentation label and the segmentation prediction result of the tumor; α is a balance weight factor, set to 0.5; γ is a weight decay (focusing) factor, set to 2. Referring to FIG. 3, segmentation results of the network are shown: in the first row, columns 1 and 2 are the CT and PET inputs, and columns 3 and 4 are the tumor label and the segmentation result of the proposed network; the second row shows the segmentation results of four comparison algorithms. FIG. 4 shows tumor boundary segmentation results, where the contours from the outside to the inside are, in order, V-Net, 3D U-Net, SE-Net, SK-Net, the segmentation result of the proposed network, and the tumor label.
Step eight: training the network
The Adam optimizer is used during training, with the two momentum factors set to 0.9 and 0.99. Training runs for 300 epochs in total; the learning-rate decay adopts a simulated annealing schedule, and the learning rate is restarted every 25 epochs. The learning rate is initialized to 3e-4 with a minimum value of 1e-6. A neural network mapping model capable of automatic three-dimensional tumor segmentation is finally obtained.
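In PyTorch, this training configuration might be set up roughly as follows; CosineAnnealingWarmRestarts is used here as an assumed stand-in for the described annealing-with-restart schedule, and the one-layer model is only a placeholder for the full dual-channel network.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(2, 1, kernel_size=3, padding=1)   # placeholder for the full dual-channel network

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=25, eta_min=1e-6)

for epoch in range(300):
    # ... one pass over the training set per epoch: forward pass, Dice + Focal loss,
    # optimizer.zero_grad() / loss.backward() / optimizer.step() for each batch ...
    scheduler.step()   # anneal the learning rate, restarting every 25 epochs, floor at 1e-6
```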
Example 3
A storage medium storing a program file which, when executed, performs any one of the above spatial attention methods for PET-CT multi-modal tumor segmentation.
Example 4
A processor configured to run a program, wherein the program, when run, performs any one of the above spatial attention methods for PET-CT multi-modal tumor segmentation.
The auto-expanding residual module performs feature-channel conversion and prevents model degradation; the multi-scale convolution spatial attention module extracts feature information, enriching feature diversity and improving accuracy; the RIPM and DRIM modules perform original-feature encoding and fused-feature decoding, respectively; and the two-channel, not completely symmetric network extracts and fuses the PET and CT features separately, improving the accuracy of the segmentation result.
Considering the difference and complementarity of PET and CT images, the invention extracts features from the PET image and the CT image separately and then fuses them: the PET channel mainly extracts tumor position information, while the CT channel mainly extracts tumor boundary information. Combined with the spatial attention module based on multi-scale convolution, tumor regions are effectively highlighted and non-tumor regions suppressed, improving the accuracy of tumor segmentation. Besides PET-CT multi-modal tumor segmentation, the invention can also be applied to PET-MRI multi-modal tumor segmentation and, with appropriate modification, to tumor segmentation from CT, PET or MRI alone.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, a division of a unit may be a logical division, and an actual implementation may have another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention; these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A spatial attention method for PET-CT multi-modal tumor segmentation, comprising the steps of:
setting a PET input channel and a CT input channel which are independently coded in a multi-scale convolution space attention network, inputting a PET image in a PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and fusing the PET decoding result and the CT decoding result.
2. The spatial attention method for PET-CT multi-modal tumor segmentation as claimed in claim 1, wherein the PET input channel and the CT input channel are two structurally symmetric input channels that do not share weights.
3. The spatial attention method for PET-CT multi-modal tumor segmentation according to claim 1, wherein before the PET-CT multi-modal image is input into the input channels, the method further comprises: resampling, cropping and normalizing the PET-CT multi-modal image, specifically:
firstly, resampling the PET and CT images to a resolution of 1 mm × 1 mm and cropping them to a size of 144 × 144; then normalizing the CT image to [-1, 1] using its maximum and minimum values, and normalizing the PET image using its mean and variance.
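As a minimal NumPy sketch of the normalization step only (resampling and cropping are omitted); interpreting the mean/variance normalization of PET as a z-score and adding a small epsilon are assumptions.

```python
import numpy as np

def normalize_ct(ct: np.ndarray) -> np.ndarray:
    # min-max normalization of the CT volume to the range [-1, 1]
    cmin, cmax = ct.min(), ct.max()
    return 2.0 * (ct - cmin) / (cmax - cmin) - 1.0

def normalize_pet(pet: np.ndarray) -> np.ndarray:
    # mean/variance normalization of the PET volume, read here as a z-score
    return (pet - pet.mean()) / (pet.std() + 1e-8)
```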
4. The spatial attention method for PET-CT multi-modal tumor segmentation according to claim 3, wherein before fusing the PET and CT decoding results, the method further comprises:
setting a channel weight mask module in a PET input channel and a CT input channel, wherein the channel weight mask module comprises a global average pooling layer, two full-connection layers and a sigmoid layer;
mapping the input with the shape format [B, C, H, W, D] into [B, C, 1, 1, 1] through global average pooling, compressing the last three dimensions to obtain a weight vector of shape [B, C, 1]; wherein B denotes the batch size, C denotes the number of input channels, and H, W, D denote the input three-dimensional image size;
arranging two fully-connected layers after the global average pooling, wherein the input node count of the first fully-connected layer is C and its output node count is 2C, and the input node count of the second fully-connected layer is 2C and its output node count is C; the second fully-connected layer uses a sigmoid activation function to obtain the final channel weight mask;
adding a short connection between encoding and decoding, multiplying the input by the channel weight mask, and thereby squeezing and expanding the feature information along the channel dimension.
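A PyTorch-style sketch of this channel weight mask module; the activation after the first fully-connected layer is not specified in the claim, so the relu used here is an assumption.

```python
import torch
import torch.nn as nn

class ChannelWeightMask(nn.Module):
    """Channel weight mask: global average pooling, two fully-connected layers
    (C -> 2C -> C) and a sigmoid, producing per-channel weights that rescale the input."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)                  # [B, C, H, W, D] -> [B, C, 1, 1, 1]
        self.fc1 = nn.Linear(channels, 2 * channels)
        self.fc2 = nn.Linear(2 * channels, channels)

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.gap(x).view(b, c)                          # squeeze the three spatial dimensions
        w = torch.relu(self.fc1(w))                         # relu after the first FC is an assumption
        w = torch.sigmoid(self.fc2(w)).view(b, c, 1, 1, 1)  # channel weight mask in [0, 1]
        return x * w                                        # rescale the input channel-wise
```

For example, ChannelWeightMask(64)(torch.randn(2, 64, 8, 8, 8)) returns a tensor of the same shape with each channel rescaled by its learned weight.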
5. The spatial attention method for PET-CT multi-modal tumor segmentation according to claim 4, wherein the input data size used is [B, C, H, W, D]; the input is first passed through the channel weight mask module to complete the squeezing and expansion on the channels, obtaining a result X of size [B, C, H, W, D];
then performing multi-scale convolution on the output result of the channel weight mask module, using convolution kernels of three different sizes, namely 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5, with a convolution stride of 1, paddings of 0, 1 and 2 respectively, and relu activation functions; matrix-multiplying the convolution results of the 3 × 3 × 3 and 5 × 5 × 5 branches, flattening the result U of size [B, C, H, W, D] into a one-dimensional vector S of size [B, C, (H × W × D), 1], generating weight information T of size [B, C, (H × W × D), 1] through a softmax activation function, and reconstructing the weight information T into a tensor W of the original size [B, C, H, W, D]; obtaining the multi-scale spatial attention result as the dot product of the result W and the 1 × 1 × 1 convolution result; concatenating the input and output of the multi-scale convolution along the channel dimension through a short connection and normalizing;
finally, adjusting the number of channels back to the same size as the input using a 1 × 1 × 1 convolution, giving the output V of size [B, C, H, W, D].
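A PyTorch-style sketch of this multi-scale convolution spatial attention step, reusing the ChannelWeightMask sketch given after claim 4. Two interpretations are assumptions: the "matrix multiplication" of the 3 × 3 × 3 and 5 × 5 × 5 branches is rendered as an element-wise product, and InstanceNorm3d is used for the unspecified normalization.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    """Sketch of the multi-scale convolution spatial attention (ISA-Net) layer."""
    def __init__(self, channels):
        super().__init__()
        self.cwm = ChannelWeightMask(channels)   # see the ChannelWeightMask sketch after claim 4
        self.conv1 = nn.Conv3d(channels, channels, 1, stride=1, padding=0)
        self.conv3 = nn.Conv3d(channels, channels, 3, stride=1, padding=1)
        self.conv5 = nn.Conv3d(channels, channels, 5, stride=1, padding=2)
        self.norm = nn.InstanceNorm3d(2 * channels)      # normalization choice is an assumption
        self.out = nn.Conv3d(2 * channels, channels, 1)  # 1x1x1 conv restores the channel count

    def forward(self, x):
        b, c, h, w, d = x.shape
        xw = self.cwm(x)                                          # X: channel-reweighted input
        r1 = torch.relu(self.conv1(xw))                           # 1x1x1 branch
        u = torch.relu(self.conv3(xw)) * torch.relu(self.conv5(xw))  # combined 3x3x3 and 5x5x5 branches
        s = u.view(b, c, h * w * d, 1)                            # flatten: S
        t = torch.softmax(s, dim=2)                               # spatial weights: T
        wmap = t.view(b, c, h, w, d)                              # reshape back: W
        attn = wmap * r1                                          # dot product with the 1x1x1 branch
        y = self.norm(torch.cat([xw, attn], dim=1))               # short connection: concat and normalize
        return self.out(y)                                        # V, same size as the input
```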
6. The spatial attention method for PET-CT multi-modal tumor segmentation as claimed in claim 5, wherein an auto-expanding residual module is built on the original residual network; the auto-expanding residual module consists of two branches, wherein one branch has a 3 × 3 × 3 convolution kernel with a stride of 1, a padding of 1 and a relu activation function, and the other branch is determined automatically from the input channel count and the set output channel count;
if the input channel count is equal to the set output channel count, this branch performs no operation and the auto-expanding residual module degenerates into a standard residual network; if the input channel count is not equal to the set output channel count, this branch performs a convolution with a 1 × 1 × 1 kernel and a relu activation function; the final output is the sum of the two branches;
two auto-expanding residual modules are used in series; the first changes the number of channels, and the second keeps the number of channels unchanged.
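A PyTorch-style sketch of the auto-expanding residual module described in this claim (a reading of the claim, not the original code).

```python
import torch
import torch.nn as nn

class AutoExpandResidual(nn.Module):
    """Auto-expanding residual module: a 3x3x3 convolution branch plus a shortcut branch that is
    the identity when the channel counts match and a 1x1x1 convolution (with relu) when they do not."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, stride=1, padding=1),
                                  nn.ReLU(inplace=True))
        if in_ch == out_ch:
            self.shortcut = nn.Identity()     # degenerates to a standard residual block
        else:
            self.shortcut = nn.Sequential(nn.Conv3d(in_ch, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```

Used in series, AutoExpandResidual(16, 32) followed by AutoExpandResidual(32, 32) reproduces the arrangement in which the first module changes the channel count and the second keeps it fixed.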
7. The spatial attention method for PET-CT multimodal tumor segmentation as claimed in claim 6, characterized in that an encoding module and a decoding module are provided in the multi-scale convolution spatial attention network;
the encoding module comprises a residual layer, a spatial attention layer and a pooling layer connected in series; the residual layer consists of two auto-expanding residual modules; the input channels of the encoding stages are 2, 16, 32, 64 and 128, and the output channels are 16, 32, 64, 128 and 256, respectively; the input and output channels of the spatial attention layer are identical, and the feature map size is unchanged; the pooling layer uses a 2 × 2 × 2 kernel with a stride of 2;
the decoding module comprises a deconvolution layer, a residual layer and a spatial attention layer connected in series; the deconvolution layer uses a 3 × 3 × 3 kernel with a stride of 2 and a padding of 1; the residual layer consists of two auto-expanding residual modules connected in series, and the input and output channels and sizes of both the residual layer and the spatial attention layer remain unchanged; the input channels of the decoding stages are 256, 128, 64 and 32, and the output channels are 128, 64, 32 and 16, respectively.
8. The spatial attention method for PET-CT multimodal tumor segmentation as claimed in claim 7, wherein the PET input channel and the CT input channel both adopt a U-Net structure, comprising 5 encoding modules and 4 decoding modules;
the convolution kernel of the first input convolutional layer of the multi-scale convolution spatial attention network is 5 × 5 × 5, the convolution kernel of the last layer is 1 × 1 × 1 with a sigmoid activation function, and all remaining layers use relu activation functions.
9. The spatial attention method for PET-CT multimodal tumor segmentation according to claim 8, characterized in that a Dice similarity coefficient DSC and a Hausdorff distance HD are used as evaluation indices; wherein:
DSC(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
HD(X,Y)=max(h(X,Y),h(Y,X))
h(X, Y) = max_{x ∈ X} min_{y ∈ Y} ‖x − y‖
the loss function is the Dice loss:
LossD(X, Y) = 1 − 2 Σ_i y_i ŷ_i / (Σ_i y_i + Σ_i ŷ_i)
the following Focal loss function is further added to the loss function:
LossF(X, Y) = −Σ_i [ α (1 − ŷ_i)^γ y_i log(ŷ_i) + (1 − α) ŷ_i^γ (1 − y_i) log(1 − ŷ_i) ]
the final loss function is:
Loss(X,Y)=LossD(X,Y)+LossF(X,Y)
where y and ŷ respectively represent the tumor segmentation label and the predicted segmentation of the tumor; α is a balance weight factor, set to 0.5; γ is a weight decay factor, set to 2;
the Adam optimizer is used during training, with the two momentum factors set to 0.9 and 0.99; training runs for 300 epochs in total; the learning-rate decay adopts a simulated annealing schedule, and the learning rate is restarted every 25 epochs; the learning rate is initialized to 3e-4 with a minimum value of 1e-6; a neural network mapping model capable of automatic three-dimensional tumor segmentation is finally obtained.
10. A spatial attention device for PET-CT multi-modal tumor segmentation, comprising:
the characteristic information extraction unit is used for setting a PET input channel and a CT input channel which are independently coded in the multi-scale convolution space attention network, inputting a PET image in the PET-CT multi-modal image into the PET input channel to extract PET characteristic information, and inputting a CT image in the PET-CT multi-modal image into the CT input channel to extract CT characteristic information;
and the fusion unit is used for fusing the PET decoding result and the CT decoding result.
CN202210394761.4A 2022-04-15 2022-04-15 Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation Pending CN114782532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394761.4A CN114782532A (en) 2022-04-15 2022-04-15 Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210394761.4A CN114782532A (en) 2022-04-15 2022-04-15 Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation

Publications (1)

Publication Number Publication Date
CN114782532A true CN114782532A (en) 2022-07-22

Family

ID=82429320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394761.4A Pending CN114782532A (en) 2022-04-15 2022-04-15 Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation

Country Status (1)

Country Link
CN (1) CN114782532A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543151A (en) * 2023-05-05 2023-08-04 山东省人工智能研究院 3D medical CT image segmentation method based on deep learning

Similar Documents

Publication Publication Date Title
Sun et al. Saunet: Shape attentive u-net for interpretable medical image segmentation
CN113077471B (en) Medical image segmentation method based on U-shaped network
WO2021244661A1 (en) Method and system for determining blood vessel information in image
CN109949276B (en) Lymph node detection method for improving SegNet segmentation network
CN111145181B (en) Skeleton CT image three-dimensional segmentation method based on multi-view separation convolutional neural network
CN112364920B (en) Thyroid cancer pathological image classification method based on deep learning
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN113159056A (en) Image segmentation method, device, equipment and storage medium
CN111080657A (en) CT image organ segmentation method based on convolutional neural network multi-dimensional fusion
CN110570394A (en) medical image segmentation method, device, equipment and storage medium
CN111091010A (en) Similarity determination method, similarity determination device, network training device, network searching device and storage medium
CN111667027A (en) Multi-modal image segmentation model training method, image processing method and device
CN114202545A (en) UNet + + based low-grade glioma image segmentation method
Jia et al. 3D global convolutional adversarial network for prostate MR volume segmentation
CN115661165A (en) Glioma fusion segmentation system and method based on attention enhancement coding and decoding network
CN114782532A (en) Spatial attention method and device for PET-CT (positron emission tomography-computed tomography) multi-modal tumor segmentation
CN116665065B (en) Cross attention-based high-resolution remote sensing image change detection method
Ma et al. An iterative multi‐path fully convolutional neural network for automatic cardiac segmentation in cine MR images
CN116128876B (en) Medical image classification method and system based on heterogeneous domain
CN113177953B (en) Liver region segmentation method, liver region segmentation device, electronic equipment and storage medium
CN115984257A (en) Multi-modal medical image fusion method based on multi-scale transform
CN113744284B (en) Brain tumor image region segmentation method and device, neural network and electronic equipment
CN115761371A (en) Medical image classification method and device, storage medium and electronic equipment
CN113409324B (en) Brain segmentation method fusing differential geometric information
CN116958154A (en) Image segmentation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination