CN116258732A - Esophageal cancer tumor target region segmentation method based on cross-modal feature fusion of PET/CT images - Google Patents
- Publication number
- CN116258732A (application number CN202310109050.2A)
- Authority
- CN
- China
- Prior art keywords
- pet
- images
- segmentation
- image
- esophageal cancer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/11 — Region-based segmentation
- G06T19/20 — Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
- G06T7/33 — Determination of transform parameters for the alignment of images (image registration) using feature-based methods
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/806 — Fusion of extracted features, i.e. combining data from various sources at the feature extraction level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06T2207/10081 — Computed x-ray tomography [CT]
- G06T2207/10104 — Positron emission tomography [PET]
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30096 — Tumor; Lesion
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The invention discloses a method for segmenting the esophageal cancer tumor target region based on cross-modal feature fusion of PET/CT images. The method uses a Transformer fusing attention progressive semantically-nested network, TransAttPSNN, as a three-dimensional segmentation model of the esophageal cancer tumor target region. TransAttPSNN takes the attention progressive semantically-nested network AttPSNN as its main structure and comprises two segmentation paths, one a PET stream and the other a CT stream, with Transformer cross-modal adaptive feature fusion modules embedded at the different-scale feature levels of the two paths. Compared with the prior art, the method effectively improves the segmentation accuracy of the esophageal cancer tumor target region and achieves better segmentation performance.
Description
Technical Field
The invention belongs to the field of intelligent processing of medical images, and particularly relates to an esophageal cancer tumor target region segmentation method based on cross-modal feature fusion of PET/CT images.
Background
Esophageal cancer is asymptomatic in its early stages, so it has usually progressed to an advanced stage by the time it is diagnosed. For patients with intermediate or advanced esophageal cancer, treatment is mainly radiotherapy; esophageal squamous cell carcinoma in particular is radiosensitive, so radiotherapy is especially effective for it. The design of a radiotherapy plan depends on the delineation of the esophageal cancer tumor target region: accurate delineation helps ensure that the tumor receives a sufficient radiation dose during radiotherapy, and also prevents normal tissues and organs at risk around the tumor from being damaged by excessive radiation exposure. At present, the clinical task of delineating the esophageal cancer tumor target region is performed manually by physicians. This is a tedious, time-consuming, and laborious task that occupies a considerable amount of valuable medical resources. In addition, manual delineation relies on the subjective judgment and clinical experience of the physician, so the delineated contour of the same patient's tumor target region varies from physician to physician, causing consistency problems. Therefore, developing an effective computer-aided automatic segmentation algorithm for the esophageal cancer tumor target region has become an urgent need.
In actual clinical practice, many esophageal cancer patients scheduled to receive radiotherapy have already undergone PET/CT imaging examinations. Although some existing techniques use deep learning to segment the esophageal cancer tumor target region, they do not operate on PET/CT images, and their segmentation accuracy still leaves room for improvement.
Disclosure of Invention
In order to make full use of the complementary information of functional metabolic imaging (PET) and anatomical structure imaging (CT), the invention aims to provide a more accurate and effective esophageal cancer tumor target region segmentation method that operates on PET/CT images.
The technical scheme of the invention is as follows.
A segmentation method of esophageal cancer tumor target regions based on cross-modal feature fusion of PET/CT images specifically comprises the following steps:
s1, collecting PET/CT images of clinical esophagus cancer patients and corresponding labels thereof to form a data set;
s2, preprocessing a PET/CT image data set;
s3, establishing a three-dimensional segmentation model of the esophageal cancer tumor target area: the Transformer fusing attention progressive semantically-nested network (Transformer Fusing Attention Progressive Semantically-Nested Network, TransAttPSNN);
the TransAttPSNN takes the attention progressive semantically-nested network (Attention Progressive Semantically-Nested Network, AttPSNN) as its main structure, where AttPSNN introduces a convolutional attention mechanism into the progressive semantically-nested network (Progressive Semantically-Nested Network, PSNN); the TransAttPSNN comprises two segmentation paths, one a PET stream and the other a CT stream, with identical network structures, and Transformer cross-modal adaptive feature fusion modules embedded at the 5 different-scale feature levels of the two paths; between the PET stream and the CT stream, 5 Transformer cross-modal adaptive feature fusion modules connect the 5 pairs of different-scale PET and CT feature images and perform adaptive feature fusion on them, and the fused results are passed back to the PET and CT stream paths respectively to participate in the subsequent forward propagation of information; the outputs of the upper and lower AttPSNN decoding paths and the total output of the two paths are connected by deep supervision and then processed by a convolution layer, and finally a segmentation prediction result is obtained through a Sigmoid output layer;
s4, training the established TransAttPSNN segmentation model;
s5, using the trained TransAttPSNN segmentation model to perform segmentation prediction on the PET/CT images of unseen esophageal cancer patients, outputting the optimal segmentation accuracy, and visually displaying the segmentation results.
In the invention, in step S1, the label corresponding to a PET/CT image is determined by importing the PET/CT DICOM files into the ITK-SNAP software and manually delineating and reviewing the esophageal cancer tumor target region on the CT axial slices with reference to the corresponding PET images.
In the invention, in step S2, the data preprocessing comprises three operations: performing secondary registration on the PET/CT images to correct the positional deviation between them, performing contrast enhancement on the CT images, and cropping the region of interest from the PET/CT images with normalization. Preferably, the secondary registration of the PET/CT images adopts a multi-modal intensity three-dimensional registration algorithm, a mutual-information-based registration method, an optical-flow-field-based registration method, or a deep-learning-based registration method, and the PET and CT images output after registration have the same size; the CT images are contrast-enhanced by window-width truncation.
In the invention, in step S3, in the TransAttPSNN network both the PET stream and the CT stream adopt the AttPSNN network, and each AttPSNN network contains feature images at 5 scales. Specifically, the encoding path comprises 5 convolution levels, of which the first two each consist of 2 convolution modules and the last three each consist of 3 convolution modules; the decoding path comprises 4 convolutional attention levels, of which the first consists of a convolutional attention module + ConV layer and the last three each consist of a convolutional attention module + ConV + trilinear interpolation upsampling layer.
In the invention, in step S4, the established TransAttPSNN segmentation model is trained with four-fold cross-validation. Specifically: first, the dataset is divided into four equal parts; then each part in turn serves as the test set while the remaining three parts are combined as the training set to train a TransAttPSNN model, so that a total of 4 TransAttPSNN segmentation models are obtained.
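The four-fold cross-validation scheme described above can be sketched as follows. This is a minimal illustration in which dataset items are plain indices standing in for patient PET/CT volumes; the index-based setup is an assumption for illustration, not taken from the patent.

```python
def four_fold_splits(n_samples):
    """Divide the dataset into four equal parts; each part in turn serves as
    the test set while the remaining three parts form the training set,
    yielding 4 (train, test) splits and hence 4 trained models."""
    indices = list(range(n_samples))
    fold_size = n_samples // 4
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(4)]
    splits = []
    for k in range(4):
        test = folds[k]
        train = [i for j in range(4) if j != k for i in folds[j]]
        splits.append((train, test))
    return splits
```

Each of the 4 splits would then be used to train one TransAttPSNN model, as step S4 describes.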
Aiming at the positional deviation present in PET/CT images, the invention performs a secondary registration on them; aiming at the poor contrast of CT images, a reasonable window-width truncation threshold is selected by statistical analysis and applied to the CT images, improving their contrast; and the invention provides esophageal cancer tumor target region segmentation based on dual-modality PET/CT images. A Transformer model is introduced into the three-dimensional segmentation task of the esophageal cancer tumor target region to realize cross-modal adaptive feature fusion, and on this basis a three-dimensional segmentation model of the esophageal cancer tumor target region, the Transformer fusing attention progressive semantically-nested network TransAttPSNN, is proposed.
Compared with the prior art, the invention has the beneficial effects that:
in the TransAttPSNN segmentation model, the introduction of a convolutional attention mechanism makes the proposed AttPSNN model more effective than the original progressive semantically-nested network (PSNN) model. In addition, the complementary information of the PET and CT modalities is mined with a Transformer model, further improving segmentation performance. Therefore, compared with the most advanced methods reported in the existing literature, the three-dimensional esophageal cancer tumor segmentation model TransAttPSNN designed by the invention achieves better segmentation accuracy.
The secondary registration of the PET/CT images yields a good correction of their positional deviation. Selecting a reasonable window-width truncation threshold by statistical analysis and truncating the CT images accordingly improves their contrast. The invention creatively introduces a Transformer model and, based on it, provides an esophageal cancer tumor target region segmentation method based on cross-modal feature fusion of PET/CT images, which improves the segmentation accuracy of the esophageal cancer tumor target region and can provide technical support for the automatic segmentation task.
Drawings
FIG. 1 is a graph comparing results before and after the secondary registration of PET/CT images in the present invention. (a) and (b) are superimposed visualizations of 2 different example PET/CT images. The green plot represents PET, the purple plot represents CT, the blue outline represents the contour of the real label, and the green highlighted area inside the blue outline represents the tumor lesion region in PET.
Fig. 2 is a graph showing the contrast enhancement of CT images before and after the present invention.
Fig. 3 is a frame diagram of a three-dimensional segmentation model TransAttPSNN of an esophageal cancer tumor target area designed by the invention.
Fig. 4 is a diagram of the self-attention weight correlation matrix for the fusion of PET and CT feature images.
FIG. 5 is a three-dimensional visualization of the segmentation result obtained by the method of the present invention. (a) is a three-dimensional visualization of the esophageal cancer tumor corresponding to the real label, (b) is a three-dimensional visualization of the segmentation result obtained by the invention, and (c) is the superposition of (a) and (b).
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings and examples, to which the scope of the invention is not limited.
A segmentation method of esophageal cancer tumor target regions based on cross-modal feature fusion of PET/CT images specifically comprises the following steps:
S1, collecting PET/CT images of clinical esophageal cancer patients and their corresponding labels to form a dataset. The PET/CT images of the esophageal cancer patients are the DICOM data of whole-body 18F-FDG PET/CT examinations. The label corresponding to a PET/CT image is first produced with the agreement of two physicians in routine clinical work: the PET/CT DICOM files are imported into the ITK-SNAP software (Version 3.6, United States), and the esophageal cancer tumor target region is manually delineated on the CT axial slices with reference to the corresponding PET images; the delineated labels are then reviewed by a physician to determine the final label.
S2, preprocessing the PET/CT image data set. Specific preprocessing operations include the following 3 aspects:
S2.1, performing secondary registration on the PET/CT images. Although PET/CT scanners register the PET and CT images in hardware, the patient's involuntary respiratory motion, abdominal organ peristalsis, heartbeat, etc. during image acquisition mean that the PET/CT images are in fact not strictly aligned. Therefore, we perform a secondary registration of the PET/CT images in the data preprocessing step to correct the positional deviation between them. The registration method used is a multi-modal intensity three-dimensional registration algorithm [1]; the PET and CT images output after registration have the same size, 512×512. Note: besides the multi-modal intensity three-dimensional registration algorithm, PET/CT image registration can also be performed with methods based on mutual information, optical flow fields, deep learning, and the like [2-4]. As shown in fig. 1, which compares the results before and after the secondary registration of the PET/CT images of two embodiments, it can be observed that the positional deviation of the PET/CT images is well corrected after the secondary registration.
S2.2, performing contrast enhancement on the CT images. Specifically, window-width truncation is applied to each CT image: pixel values smaller than -150 and larger than 150 in the CT image matrix are assigned -150 and 150, respectively. As shown in fig. 2, which compares the results before and after contrast enhancement of a CT image according to an embodiment of the present invention, it can be observed that the contrast between the tumor lesion area and the surrounding tissues in the CT image is improved after window-width truncation.
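The window-width truncation in S2.2 amounts to a simple clipping of the CT intensity matrix. A minimal NumPy sketch follows; the function name is illustrative, while the [-150, 150] thresholds are the ones stated above.

```python
import numpy as np

def window_truncate(ct, lower=-150.0, upper=150.0):
    """Contrast-enhance a CT volume by window-width truncation: pixel values
    below `lower` are set to `lower` and values above `upper` to `upper`."""
    return np.clip(ct, lower, upper)
```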
S2.3, cropping the region of interest from the PET/CT images and normalizing. Constrained by the high computational and storage cost of deep neural network models, and to alleviate the severe imbalance between foreground and background data (foreground denoting the tumor region, background the non-tumor region) [5], it is necessary to crop a region of interest from the PET/CT images. Specifically, the PET/CT images of each patient in the dataset and the corresponding labels are cropped to a region of interest containing the esophageal cancer tumor with a size of at least 64×64×64, and the cropped PET/CT images are normalized to the [0,1] interval. After cropping the region of interest and normalizing, the dataset required for training the network model is obtained.
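The region-of-interest cropping and [0,1] normalization of S2.3 can be sketched as below. The cubic crop around a given center and the min-max normalization are assumptions for illustration; the patent only fixes the minimum ROI size and the target interval.

```python
import numpy as np

def crop_and_normalize(volume, center, size=64):
    """Crop a cubic region of interest of edge length `size` around `center`
    (z, y, x) and min-max normalize it to the [0, 1] interval.
    Boundary handling is omitted for brevity."""
    z, y, x = (c - size // 2 for c in center)
    roi = volume[z:z + size, y:y + size, x:x + size].astype(np.float64)
    lo, hi = roi.min(), roi.max()
    return (roi - lo) / (hi - lo) if hi > lo else np.zeros_like(roi)
```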
S3, establishing the three-dimensional segmentation model of the esophageal cancer tumor target area: the Transformer fusing attention progressive semantically-nested network, TransAttPSNN. The specific operations include the following 2 aspects:
S3.1, taking the progressive semantically-nested network PSNN, reported in the existing literature as the leading method for three-dimensional segmentation of the esophageal cancer tumor target region [6], as the main body, and introducing a convolutional attention mechanism into it to obtain the proposed attention progressive semantically-nested network AttPSNN.
S3.2, designing the Transformer cross-modal adaptive feature fusion module and embedding it at the different-scale feature levels of the two AttPSNN segmentation paths (one the PET segmentation stream, the other the CT segmentation stream), thereby building the final segmentation model TransAttPSNN.
A framework diagram of the TransAttPSNN model is shown in fig. 3. After the dual-channel PET/CT image is input into the TransAttPSNN network, it is divided into 2 paths: the upper PET stream and the lower CT stream. The PET stream and the CT stream have the same network structure, both being the proposed AttPSNN network. Each AttPSNN network contains feature images at 5 scales. Specifically, the encoding path comprises 5 convolution levels, of which the first two each consist of 2 convolution modules (ConV+BN+ReLU) and the last three each consist of 3 convolution modules; the fifth convolution level is analogous to the middle bridge of a U-shaped network structure. The decoding path comprises 4 convolutional attention levels, of which the first consists of a convolutional attention module + ConV layer and the last three each consist of a convolutional attention module + ConV + trilinear interpolation upsampling layer. Between the PET stream and the CT stream, 5 Transformer cross-modal adaptive feature fusion modules connect the 5 pairs of different-scale PET and CT feature images for adaptive feature fusion. The fused results are passed back to the PET and CT stream paths, respectively, to participate in the subsequent forward propagation of information. The outputs of the upper and lower AttPSNN decoding paths and the total output of the two paths are connected by deep supervision and then processed by a convolution layer. Finally, a segmentation prediction result is obtained through the Sigmoid output layer.
In the invention, cross-modal adaptive feature fusion with the Transformer model is realized as follows.
3.2.1 Three-dimensional Transformer model theory
Let the input image be x ∈ R^{H'×W'×D'×C}, where H', W', D', C denote the height, width, depth, and number of channels of the image, respectively. To avoid excessive computation memory, the three-dimensional adaptive average pooling function AdaptiveAvgPool3d(·) is first used to pool the input image x to x_pooling ∈ R^{H×W×D×C}, where H, W, D, C are the pooled image height, width, depth, and number of channels:
x_pooling = AdaptiveAvgPool3d(x).    (1)
Secondly, a window of size 1×1×1×C is used to flatten x_pooling into a series of patches, giving x_f ∈ R^{(HWD)×C}, where HWD is the number of patches generated and C is the dimension of each patch. x_f is input to a standard Transformer module, where it is first processed in the multi-head self-attention (Multi-Head Self-Attention, MHSA) module as follows:
q = x_f·W_q, k = x_f·W_k, v = x_f·W_v,    (2)

q_m = q·W_q^(m), k_m = k·W_k^(m), v_m = v·W_v^(m),    (3)

z^(m) = σ(q_m·k_m^T / √d)·v_m,    (4)

z = Concat(z^(1); z^(2); ...; z^(M))·W_o,    (5)

where W_q, W_k, W_v ∈ R^{C×C} are mapping matrices, and q, k, v ∈ R^{(HWD)×C} denote the query, key, and value, respectively; M is the number of parallel self-attention heads in the MHSA. Writing d = C/M for the dimension of each self-attention head, W_q^(m), W_k^(m), W_v^(m) ∈ R^{C×d} are the mapping matrices of the m-th self-attention head; accordingly, q_m, k_m, v_m ∈ R^{(HWD)×d} (m = 1, 2, ..., M) are the query, key, and value of the m-th self-attention head. σ(·) denotes the Softmax function, and z^(m) ∈ R^{(HWD)×d} is the output of the m-th self-attention head. W_o ∈ R^{Md×C} (i.e. W_o ∈ R^{C×C}) is a mapping matrix, and z ∈ R^{(HWD)×C} is the final output of the MHSA.
Third, the output of the MHSA is sent to a multi-layer perceptron (Multi-Layer Perceptron, MLP) module for processing (the MLP consists of two fully connected linear layers, two Dropout layers, and one GELU activation layer).
Fourth, summarizing the flow: with a trainable position encoding P_f added, and with LN (LayerNorm) layers and residual connections applied, the processing flow of a three-dimensional Transformer model with L Transformer modules is as follows:
z_0 = x_f + P_f,    (6)

z'_l = MHSA(LN(z_{l-1})) + z_{l-1},    (7)

z_l = MLP(LN(z'_l)) + z'_l, (l = 1, 2, ..., L).    (8)
Finally, z_L ∈ R^{(HWD)×C} is reshaped into the form R^{H×W×D×C}, and trilinear interpolation is applied to upsample it to R^{H'×W'×D'×C}, so that the output of the Transformer is restored to the same size as the original input image.
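The MHSA computation at the core of the three-dimensional Transformer model can be sketched in NumPy as below. This is a hedged illustration: the per-head mapping is realized here by channel slicing (one particular choice of the per-head mapping matrices), and the input is assumed to already be the flattened patch sequence x_f of shape (HWD, C).

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable Softmax along the given axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x_f, W_q, W_k, W_v, W_o, M):
    """Multi-head self-attention over a patch sequence x_f of shape (HWD, C):
    project to query/key/value, split the C channels into M heads of
    dimension d = C/M, apply scaled dot-product attention per head, then
    concatenate the head outputs and project with W_o."""
    n, C = x_f.shape
    d = C // M
    q, k, v = x_f @ W_q, x_f @ W_k, x_f @ W_v            # query, key, value
    heads = []
    for m in range(M):
        sl = slice(m * d, (m + 1) * d)                   # channels of the m-th head
        W_a = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(d))  # self-attention weights
        heads.append(W_a @ v[:, sl])                     # per-head output
    return np.concatenate(heads, axis=1) @ W_o           # final MHSA output, (HWD, C)
```

A full Transformer module would wrap this with the LN layers, residual connections, and the MLP described above.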
3.2.2 Cross-modal adaptive feature fusion theory based on the three-dimensional Transformer model
Based on the three-dimensional Transformer model theory, suppose the PET and CT feature images to be fused are x^PET ∈ R^{H'×W'×D'×C} and x^CT ∈ R^{H'×W'×D'×C}, where H', W', D', C denote the height, width, depth, and number of channels of the images, respectively. First, using equation (1), the input images x^PET and x^CT are pooled to x^PET_pooling ∈ R^{H×W×D×C} and x^CT_pooling ∈ R^{H×W×D×C}, where H, W, D, C are the pooled image height, width, depth, and number of channels.
Secondly, the first step of the method comprises the steps of, using a window of size 1 x CAnd->Respectively flattening into a series of patches to obtain +.>And->And then->And->Connection in patch dimension
Third, according to formulas (6)-(8), x_f is input into the Transformer model for processing, obtaining the output z_L ∈ R^((2HWD)×C). z_L is reshaped and split into two outputs z_L^PET ∈ R^((HWD)×C) and z_L^CT ∈ R^((HWD)×C). Finally, tri-linear interpolation is applied to upsample z_L^PET and z_L^CT back to the same size as the original input images, yielding z^PET ∈ R^(H'×W'×D'×C) and z^CT ∈ R^(H'×W'×D'×C). This completes the adaptive fusion of the PET and CT feature images. The fusion process is specifically explained as follows:
After x_f is input into the Transformer model, the self-attention weight W_a computed in the MHSA module according to equation (4) can be regarded as the correlation between every pair of patches of the flattened and mapped PET and CT feature images, as shown in FIG. 4, where w_ij (i, j = 1, 2, ..., 2HWD) represents the correlation between the patch at location i and the patch at location j. Therefore, during training the Transformer model can adaptively model the long-range dependencies both within the same modality and across the PET and CT modalities, thereby realizing the feature fusion function.
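The pooling, flattening, concatenation, splitting, and upsampling flow above can be sketched as follows (a minimal NumPy illustration under stated assumptions: a factor-2 pooling, nearest-neighbour resizing in place of tri-linear interpolation, and a `transformer` callable standing in for the L-layer model — all names are illustrative):

```python
import numpy as np

def fuse_pet_ct(x_pet, x_ct, transformer):
    """Cross-modal fusion flow sketch.

    x_pet, x_ct: (H', W', D', C) feature images of the two modalities
    transformer: callable mapping (2HWD, C) -> (2HWD, C), a stand-in for
                 the L-layer Transformer of equations (6)-(8)
    """
    Hp, Wp, Dp, C = x_pet.shape
    H, W, D = Hp // 2, Wp // 2, Dp // 2            # pooled size (illustrative factor 2)
    pool = lambda a: a[::2, ::2, ::2, :]           # stand-in for the pooling of eq. (1)
    xf_pet = pool(x_pet).reshape(H * W * D, C)     # flatten with a 1x1x1xC window
    xf_ct = pool(x_ct).reshape(H * W * D, C)
    xf = np.concatenate([xf_pet, xf_ct], axis=0)   # concat in patch dimension: (2HWD, C)
    zL = transformer(xf)                           # Transformer output z_L: (2HWD, C)
    z_pet, z_ct = np.split(zL, 2, axis=0)          # split back into the two modalities
    # Nearest-neighbour upsampling in place of tri-linear interpolation:
    up = lambda a: a.reshape(H, W, D, C).repeat(2, 0).repeat(2, 1).repeat(2, 2)
    return up(z_pet), up(z_ct)                     # restored to the input size

# Usage sketch (identity stands in for the trained Transformer):
rng = np.random.default_rng(1)
x_pet = rng.standard_normal((4, 4, 4, 3))
x_ct = rng.standard_normal((4, 4, 4, 3))
f_pet, f_ct = fuse_pet_ct(x_pet, x_ct, lambda z: z)
```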
S4, training the established segmentation model TransAttPSNN by adopting a four-fold cross validation mode. Specific operations include the following 4 aspects:
s4.1 divides the dataset into four equal parts.
S4.2, training and configuring a segmentation model TransAttPSNN. Specifically, the method comprises the following 2 aspects:
S4.2.1, randomly extract 16 training patches of size 64×64×64 from each region of interest and its corresponding label obtained in the data preprocessing step S2.3, and randomly perform a data enhancement operation on each patch (rotate 90°, flip left-right, flip up-down, flip left-right then rotate 90°, or leave unchanged).
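The five-way random augmentation of step S4.2.1 can be sketched as follows (assumptions: cubic patches and in-plane axes (0, 1) for the rotations/flips, which the patent does not specify; in practice the same operation must also be applied to the patch's label):

```python
import numpy as np

def random_augment(patch, rng):
    """Randomly apply one of the five enhancement operations of S4.2.1.

    patch: a (64, 64, 64) training patch
    rng:   a numpy Generator, so the same draw can be reused for the label
    """
    op = rng.integers(5)                                     # pick one of 5 operations
    if op == 0:
        return np.rot90(patch, k=1, axes=(0, 1))             # rotate 90 degrees
    if op == 1:
        return patch[:, ::-1, ...]                           # flip left-right
    if op == 2:
        return patch[::-1, ...]                              # flip up-down
    if op == 3:
        return np.rot90(patch[:, ::-1, ...], k=1, axes=(0, 1))  # flip left-right, then rotate 90
    return patch                                             # leave unchanged
```

Because the patch is cubic, every operation preserves the 64×64×64 shape, so augmented patches can be batched directly.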
S4.2.2, perform hyper-parameter configuration on the TransAttPSNN segmentation model established in step S3: number of training epochs epoch = 50, learning rate = 5e-3, small batch size mini-batch = 4, optimizer AdamW, weight decay parameter (decoupled weight decay) = 0.01, and loss function generalized Dice loss (Generalized Dice Loss, GDL). The formula of GDL is defined as follows:

GDL = 1 − 2 · (Σ_c w_c Σ_n y_cn p_cn) / (Σ_c w_c Σ_n (y_cn + p_cn) + ε), with w_c = 1 / (Σ_n y_cn)²,

wherein c indexes the categories, y_cn and p_cn respectively represent the true label value and the predicted probability value of the n-th pixel belonging to the c-th class, and w_c is the reciprocal of the square of the total pixel number of class c (the foreground representing the tumor region and the background the non-tumor region). The weight w_c thus corrects the relative contribution of foreground and background, so the problem of unbalanced foreground and background data can be alleviated. ε = 1×10⁻⁸ prevents the denominator from being zero.
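A direct NumPy transcription of the GDL definition above (a sketch, assuming one-hot labels arranged as a (classes × pixels) array):

```python
import numpy as np

def generalized_dice_loss(y, p, eps=1e-8):
    """Generalized Dice loss.

    y: (C, N) one-hot true labels, C classes over N pixels
    p: (C, N) predicted probabilities
    The class weights w_c = 1 / (sum_n y_cn)^2 down-weight the large
    background class relative to the small tumour foreground.
    """
    w = 1.0 / (y.sum(axis=1) ** 2 + eps)         # per-class weights w_c
    num = (w * (y * p).sum(axis=1)).sum()        # sum_c w_c sum_n y_cn p_cn
    den = (w * (y + p).sum(axis=1)).sum()        # sum_c w_c sum_n (y_cn + p_cn)
    return 1.0 - 2.0 * num / (den + eps)
```

A perfect prediction (p = y) drives the loss to 0, while a completely wrong prediction drives it toward 1.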
S4.3, sequentially take one part from the four equal parts of the dataset as the test set and combine the remaining three parts as the training set to train the TransAttPSNN model. When training the network, referring to FIG. 3, the data are processed in small batches of samples (mini-batch = 4); the specific operations include the following 4 aspects:
s4.3.1 a two-channel PET/CT image is input to the established TransAttPSNN model.
S4.3.2 the extracted PET channel images are sent to PET stream for processing while the extracted CT channel images are sent to CT stream for processing.
S4.3.3, for PET and CT feature images at the same level, a Transformer adaptive feature fusion module is adopted to fuse them, and the fusion results are fed back to the PET stream and CT stream respectively to participate in information feedforward.
S4.3.4, the network is continuously optimized during training until it converges, obtaining a well-performing three-dimensional segmentation model of the esophageal cancer tumor target region. It should be noted that, because of the four-fold cross-validation training mode, a total of 4 TransAttPSNN segmentation models will be trained.
S4.4, respectively inputting the four test sets into the TransAttPSNN model obtained by training the corresponding training set, and outputting average segmentation accuracy.
In step S4.4, when testing the network, referring to FIG. 3, the data are processed one sample at a time; the specific operations include the following 4 aspects:
s4.4.1 to the trained TransAttPSNN model, a two-channel PET/CT image is input.
S4.4.2 the extracted PET channel images are sent to PET stream for processing while the extracted CT channel images are sent to CT stream for processing.
S4.4.3, for PET and CT feature images at the same level, a Transformer adaptive feature fusion module fuses them and feeds the fusion results back to the PET stream and CT stream respectively to participate in subsequent information feedforward.
S4.4.4, the information feedforward continues until the corresponding segmentation result is output.
In step S4.4, segmentation accuracy is measured by 3 commonly used evaluation indexes: the Dice similarity coefficient (Dice Similarity Coefficient, DSC), the Hausdorff distance (Hausdorff Distance, HD), and the mean surface distance (Mean Surface Distance, MSD). The Hausdorff distance is also referred to as the maximum surface distance. DSC measures the degree of spatial overlap between the predicted and real labels [9,10]. The distance indexes HD and MSD measure, respectively, the maximum and the average distance between the predicted tumor region edge and the real tumor region edge [11]. Let P denote the predicted tumor region, G the real tumor region, and P_C and G_C their respective edge contours. The calculation formulas of DSC, HD, and MSD are defined as follows:

DSC = 2|P ∩ G| / (|P| + |G|),
HD = max{ max_(p∈P_C) min_(g∈G_C) d(p, g), max_(g∈G_C) min_(p∈P_C) d(p, g) },
MSD = (Σ_(p∈P_C) min_(g∈G_C) d(p, g) + Σ_(g∈G_C) min_(p∈P_C) d(p, g)) / (|P_C| + |G_C|),

wherein d(p, g) represents the Euclidean distance between pixel points p and g; |P| and |G| represent the total numbers of pixels of the predicted and real tumor regions P and G, respectively; similarly, |P_C| and |G_C| represent the total numbers of pixels of the predicted and real tumor edge contours. The DSC value lies in [0, 1], and the closer it is to 1, the better the segmentation result. The values of HD and MSD are greater than or equal to 0, and the closer they are to 0, the better the segmentation result.
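The three evaluation indexes can be sketched in NumPy as follows (a small-scale illustration: the edge contours are passed as point sets, and the exact averaging convention for MSD is an assumption, as conventions vary in the literature):

```python
import numpy as np

def dsc(P, G):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(P, G).sum()
    return 2.0 * inter / (P.sum() + G.sum())

def surface_distances(Pc, Gc):
    """Directed nearest distances between two edge point sets (N, k) and (M, k)."""
    d = np.linalg.norm(Pc[:, None, :] - Gc[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1), d.min(axis=0)      # P_C -> G_C and G_C -> P_C

def hd_msd(Pc, Gc):
    """Hausdorff (maximum surface) distance and mean surface distance."""
    dp, dg = surface_distances(Pc, Gc)
    hd = max(dp.max(), dg.max())             # symmetric maximum distance
    msd = (dp.mean() + dg.mean()) / 2.0      # symmetric average (one common convention)
    return hd, msd

# Usage sketch on a toy 2D mask:
P = np.zeros((8, 8), bool)
P[2:6, 2:6] = True
pts = np.argwhere(P).astype(float)
print(dsc(P, P), hd_msd(pts, pts))
```

The brute-force pairwise distance matrix is fine for small contours; production code typically uses a distance transform instead.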
S5, predicting PET/CT image data of unknown esophageal cancer patients by using a TransAttPSNN segmentation model obtained through training, outputting the optimal segmentation accuracy, and visually displaying the segmentation result. The specific operations include the following 4 aspects:
s5.1, obtaining PET/CT scanning DICOM data of unknown esophageal cancer patients.
S5.2, preprocessing the acquired PET/CT image by using the data preprocessing method in the step S2.
S5.3, inputting the PET/CT region-of-interest image obtained by preprocessing into the 4 TransAttPSNN segmentation models obtained by training to output the corresponding 4-group segmentation accuracy.
S5.4, select the group with the largest DSC value among the 4 groups obtained in step S5.3 as the optimal segmentation accuracy, and visually display the corresponding segmentation result. Referring to FIG. 5, FIG. 5(a) is a three-dimensional visualization of the esophageal cancer tumor corresponding to the real label, FIG. 5(b) is a three-dimensional visualization of the segmentation result obtained by the present invention, and FIG. 5(c) is an overlay of (a) and (b). Observation of the three-dimensional visualization shows that the esophageal cancer tumor predicted by the present invention is smoother than the tumor shape corresponding to the real label, which is closer to the real appearance of the lesion in clinical practice. In addition, the predicted esophageal cancer tumor shows good similarity with the real tumor.
Table 1. Comparison of the segmentation accuracy of the method of the present invention with that of other existing esophageal cancer tumor segmentation methods.
Table 1 shows the comparison of segmentation accuracy between TransAttPSNN, the three-dimensional esophageal cancer tumor target region segmentation model designed by the present invention, and other existing esophageal cancer tumor segmentation methods. In Table 1, the convolution-based reference methods are U-Net, DenseUNet, and Two-stream chained PSNN; the convolution-attention-based methods are Attention U-Net and DDAUNet; and the Transformer-based methods are UNETR, TransBTS, and CoTr. Among them, Two-stream chained PSNN, DenseUNet, and DDAUNet represent advanced methods in the current esophageal cancer GTV three-dimensional segmentation literature. As can be seen from the evaluation index values, the segmentation performance of the TransAttPSNN network exceeds that of all other competing networks, obtaining the largest DSC value and the smallest HD value; although its MSD value is slightly inferior to that of UNETR, the difference is very small. The Transformer-based methods achieve better segmentation performance than the convolutional network methods (with the exception of CoTr). Among the Transformer-based models, the TransAttPSNN designed by the present invention performs best.
Reference to the literature
[1]MUTHUKUMARAN D,SIVAKUMAR M.Medical Image Registration:A matlab based approach[J].Int J Sci Res Comput Sci,Eng Inform Technol,2017,2(2):29-34.
[2]PENNEC X,CACHIER P,AYACHE N.Understanding the Demon's Algorithm:3D Non-rigid Registration by Gradient Descent[C].In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention,1999,597–605.
[3] LUO Shuqian, LI Xiang. Multimodal medical image registration based on maximum mutual information [J]. Journal of Image and Graphics, 2000, 5(7): 551-8.
[4]HU Y,MODAT M,GIBSON E,et al.Weakly-supervised convolutional neural networks for multimodal image registration[J].Med Image Anal,2018,49:1-13.
[5] CRUM W R, CAMARA O, HILL D L. Generalized overlap measures for evaluation and validation in medical image analysis [J]. IEEE T Med Imaging, 2006, 25(11): 1451-61.
[6] JIN D, GUO D, HO T Y, et al. DeepTarget: Gross tumor and clinical target volume segmentation in esophageal cancer radiotherapy [J]. Med Image Anal, 2021, 68: 101909.
[7] RAJON D A, BOLCH W E. Marching cube algorithm: Review and trilinear interpolation adaptation for image-based dosimetric models [J]. Comput Med Imag Grap, 2003, 27(5): 411-35.
[8]HILL S.Trilinear Interpolation[J].Graphics Gems,1994:521-5.
[9]RAZZAK M I,IMRAN M,XU G.Efficient brain tumor segmentation with multiscale two-pathway-group conventional neural networks[J].IEEE J Biomed Health,2019,23(5):1911-9.
[10]CHEN G,YIN J,DAI Y,et al.A novel convolutional neural network for kidney ultrasound images segmentation[J].Comput Meth Prog Bio,2022,218:106712.
[11]FECHTER T,ADEBAHR S,BALTAS D,et al.Esophagus segmentation in CT via 3D fully convolutional neural network and random walk[J].Med Phys,2017,44(12):6341-52.
[12]IEK Z,ABDULKADIR A,LIENKAMP S S,et al.3D U-Net:Learning dense volumetric segmentation from sparse annotation[C].In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention,2016,424-32.
[13]FECHTER T,ADEBAHR S,BALTAS D,et al.A 3D fully convolutional neural network and a random walker to segment the esophagus in CT[J/OL]2017,1-23,arXiv:1704.06544.
[14]OKTAY O,SCHLEMPER J,FOLGOC L L,et al.Attention U-Net:Learning where to look for the pancreas[C].In Proceedings of the International Conference on Medical Imaging with Deep Learning,2018,1-10.
[15]YOUSEFI S,SOKOOTI H,ELMAHDY M S,et al.Esophageal gross tumor volume segmentation using a 3D convolutional neural network[C].In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention,2018,343-51.
[16]YOUSEFI S,SOKOOTI H,ELMAHDY M S,et al.Esophageal tumor segmentation in CT images using a dilated dense attention Unet(DDAUnet)[J].IEEE Access,2021,9:99235-48.
[17] HATAMIZADEH A, TANG Y, NATH V, et al. UNETR: Transformers for 3D medical image segmentation [C]. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, 1-11.
[18]WANG W,CHEN C,DING M,et al.TransBTS:Multimodal brain tumor segmentation using transformer[C].In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention,2021,1-11.
[19] XIE Y, ZHANG J, SHEN C, et al. CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation [J/OL]. 2021, 1-13, arXiv:2103.03024.
Claims (8)
1. The esophageal cancer tumor target region segmentation method based on the cross-modal feature fusion of the PET/CT images is characterized by comprising the following steps of:
s1, collecting PET/CT images and corresponding labels of clinical esophagus cancer patients to form a PET/CT image data set;
s2, preprocessing a PET/CT image data set;
s3, establishing a three-dimensional segmentation model of the esophageal cancer tumor target area: transformer fusion attention progressive semantic nesting network TransAttPSNN;
the TransAttPSNN network takes AttPSNN, an attention progressive semantic nesting network introducing a convolutional attention mechanism, as its main structure, and comprises a two-path segmentation network, one path being a PET stream and the other a CT stream, the PET stream having the same network structure as the CT stream, with Transformer cross-modal adaptive feature fusion modules embedded at 5 different-scale feature levels of the two-path segmentation network; between the PET stream and the CT stream, 5 Transformer cross-modal adaptive feature fusion modules connect the PET and CT feature images of 5 different scales and perform adaptive feature fusion on them, and the fused results are transmitted back to the PET stream and CT stream paths respectively to participate in subsequent forward information propagation; the outputs on the upper and lower AttPSNN decoding paths and the total output of the two AttPSNN decoding paths are connected through deep supervision, then processed through a convolution layer, and finally a segmentation prediction result is obtained through a Sigmoid output layer;
s4, training the established TransAttPSNN segmentation model;
s5, carrying out segmentation prediction on the PET/CT image of the unknown esophageal cancer patient by using the TransAttPSNN segmentation model obtained through training, outputting the optimal segmentation precision, and carrying out visual display on the segmentation result.
2. The method according to claim 1, wherein in step S1, the label corresponding to a PET/CT image is obtained by importing the DICOM files of the PET/CT into ITK-SNAP software and, with reference to the corresponding PET image, manually delineating and reviewing the esophageal cancer tumor target region on the CT axial slices.
3. The method according to claim 1, wherein in step S2, the data preprocessing includes three operations of performing a secondary registration on the PET/CT images to correct a positional deviation between the PET/CT images, performing contrast enhancement on the CT images, and cutting out a region of interest from the PET/CT images and normalizing.
4. The esophageal cancer tumor target region segmentation method according to claim 3, wherein the secondary registration of the PET/CT images adopts a multi-modal intensity three-dimensional registration algorithm, a registration method based on mutual information, a registration method based on an optical flow field, or a registration method based on deep learning, and the PET image and CT image output after registration have the same size; the contrast of the CT image is enhanced by window-width truncation.
5. The method according to claim 1, wherein in step S3, the PET stream and the CT stream each adopt an AttPSNN network; each AttPSNN network includes 5 scales of feature images; the encoding path includes 5 convolution levels, of which the first two each include 2 convolution modules and the last three each include 3 convolution modules; the decoding path includes 4 convolution attention levels, of which the first includes a convolution attention module + convolution layer, and the last three each include a convolution attention module + convolution layer + tri-linear interpolation upsampling layer.
6. The method according to claim 1, wherein in step S4, the established TransAttPSNN segmentation model is trained by four-fold cross-validation.
7. The method for segmenting esophageal cancer tumor target according to claim 1, wherein in step S4, the model training method is as follows:
(1) Inputting a double-channel PET/CT image into the established TransAttPSNN segmentation model;
(2) Extracting PET channel images and sending the PET channel images to a PET stream for processing, and simultaneously extracting CT channel images and sending the CT channel images to a CT stream for processing;
(3) For PET and CT feature images at the same level, a Transformer adaptive feature fusion module is adopted to fuse them, and the fusion results are fed back to the PET stream and CT stream respectively to participate in information feedforward;
(4) The network is continuously optimized during training until it converges, obtaining a well-performing three-dimensional segmentation model of the esophageal cancer tumor target region.
8. The method according to claim 1 or 7, wherein in step S4, during training the optimizer is AdamW and the loss function is the generalized Dice loss (GDL); the formula of GDL is defined as follows:

GDL = 1 − 2 · (Σ_c w_c Σ_n y_cn p_cn) / (Σ_c w_c Σ_n (y_cn + p_cn) + ε), with w_c = 1 / (Σ_n y_cn)²,

wherein c indexes the categories; y_cn and p_cn respectively represent the true label value and the predicted probability value of the n-th pixel belonging to the c-th class; w_c, the reciprocal of the square of the total pixel number of class c, is the class weight, the foreground representing the tumor region and the background the non-tumor region; and ε = 1×10⁻⁸ prevents the denominator from being zero.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310109050.2A CN116258732A (en) | 2023-02-14 | 2023-02-14 | Esophageal cancer tumor target region segmentation method based on cross-modal feature fusion of PET/CT images |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116258732A true CN116258732A (en) | 2023-06-13 |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116758048A (en) * | 2023-07-06 | 2023-09-15 | 河北大学 | PET/CT tumor periphery feature extraction system and extraction method based on transducer |
CN116758048B (en) * | 2023-07-06 | 2024-02-27 | 河北大学 | PET/CT tumor periphery feature extraction system and extraction method based on transducer |
CN117934519A (en) * | 2024-03-21 | 2024-04-26 | 安徽大学 | Self-adaptive segmentation method for esophageal tumor CT image synthesized by unpaired enhancement |
CN117934519B (en) * | 2024-03-21 | 2024-06-07 | 安徽大学 | Self-adaptive segmentation method for esophageal tumor CT image synthesized by unpaired enhancement |
CN118212238A (en) * | 2024-05-21 | 2024-06-18 | 宁德时代新能源科技股份有限公司 | Solder printing detection method, solder printing detection device, computer equipment and storage medium |
CN118351211A (en) * | 2024-06-18 | 2024-07-16 | 英瑞云医疗科技(烟台)有限公司 | Method, system and equipment for generating medical image from lung cancer CT (computed tomography) to PET (positron emission tomography) |
CN118351211B (en) * | 2024-06-18 | 2024-08-30 | 英瑞云医疗科技(烟台)有限公司 | Method, system and equipment for generating medical image from lung cancer CT (computed tomography) to PET (positron emission tomography) |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |