CN114359642A - Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer - Google Patents

Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer Download PDF

Info

Publication number
CN114359642A
CN114359642A (application CN202210030228.XA)
Authority
CN
China
Prior art keywords
image
organ
projection
model
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210030228.XA
Other languages
Chinese (zh)
Inventor
王洪凯 (Wang Hongkai)
刘林琳 (Liu Linlin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210030228.XA priority Critical patent/CN114359642A/en
Publication of CN114359642A publication Critical patent/CN114359642A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/06Topological mapping of higher dimensional structures onto lower dimensional surfaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10088Magnetic resonance imaging [MRI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10104Positron emission tomography [PET]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal medical image multi-organ localization method based on a one-to-one target query Transformer, belonging to the technical field of medical image processing. The invention uses a conditional Gaussian model together with the self-attention mechanism of a Transformer to model the correlations of position and size between organs. The one-to-one target query architecture enforces a unique target query for each target organ, and the order of the queries encodes the predicted category, so no classification is needed; this simplifies the network structure, reduces redundant computation, and speeds learning convergence. Before organ detection is performed, the 3D multi-modal images are projected onto two orthogonal 2D planes, complementary information from the multi-modal images is combined through a multi-modal fusion method, and finally the resulting 2D bounding boxes are back-projected to obtain a 3D bounding box, reducing the computational burden and yielding a more stable organ localization result.

Description

Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer
Technical Field
The invention belongs to the technical field of medical image processing, and particularly relates to a multi-modal medical image multi-organ localization method based on a one-to-one target query Transformer.
Background
Multi-modal medical images are imaging modalities commonly used in clinical medicine, such as positron emission tomography / X-ray computed tomography (PET/CT), the T1, T2, and proton-density images of magnetic resonance scanning, the multiple spectral images of spectral CT, and the like. Taking PET/CT as an example, the PET image is a functional image: it reflects the distribution of a radioactive tracer in the body and can be used to diagnose benign and malignant tumors and to quantify tissue metabolism. The CT image reflects the degree to which human organs and tissues absorb X-rays and clearly displays the human anatomy. With the popularization of multi-modal medical imaging, a large amount of image data is acquired every day, and the diagnostic burden on doctors keeps increasing. Automatic organ localization can help reduce image reading time and provide specific organ regions for subsequent computer-aided diagnosis. Quick and accurate automatic localization of multiple organs has therefore become an indispensable step in multi-modal medical image analysis.
Organ localization in an image refers to determining the three-dimensional bounding box of an organ, i.e., the upper and lower bounds of its x, y, and z coordinates: (x_min, x_max, y_min, y_max, z_min, z_max).
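For concreteness, the relation between the two box representations used throughout this document — the corner bounds above and the center/size form used later for shape vectors — can be sketched as follows (a minimal illustration, not part of the patent):

```python
import numpy as np

def bounds_to_center_size(b):
    """(x_min, x_max, y_min, y_max, z_min, z_max) -> (x, y, z, lx, ly, lz)."""
    b = np.asarray(b, dtype=float).reshape(3, 2)   # rows: x, y, z bounds
    center = b.mean(axis=1)                         # midpoints
    size = b[:, 1] - b[:, 0]                        # extents
    return np.concatenate([center, size])

def center_size_to_bounds(s):
    """(x, y, z, lx, ly, lz) -> (x_min, x_max, y_min, y_max, z_min, z_max)."""
    center, size = np.asarray(s[:3], float), np.asarray(s[3:], float)
    lo, hi = center - size / 2, center + size / 2
    return np.stack([lo, hi], axis=1).ravel()
```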
In recent years, deep convolutional neural networks (CNNs) have excelled at object detection in natural and medical images. Although CNNs have been applied to organ localization, their performance remains limited by a lack of understanding of the geometric relationships between organs in an image.
Recently, the development of Transformer networks has shown the potential of long-range target correlation for image classification and segmentation, and Transformers are increasingly used to model correlations between targets, improving the performance of medical image analysis. However, the Transformer network has a complex structure and requires more training time. It also needs a large amount of training data to perform well, yet manual annotation of large numbers of three-dimensional medical images by doctors is time-consuming and laborious. Existing three-dimensional Transformers balance accuracy against learnability simply by reducing network depth. Furthermore, no current Transformer model uses information from multi-modal medical images to improve the accuracy of organ localization.
Disclosure of Invention
To solve the above problems, the present invention provides a multi-modal medical image multi-organ localization method based on a one-to-one target query Transformer. The method fuses the two-dimensional projection views (coronal and sagittal) of the three-dimensional multi-modal image for organ detection; models the correlations between organs within the human body region with a conditional Gaussian model (CGM); extracts features from the multi-modal fused image with a convolutional neural network (CNN); feeds the resulting image feature sequence and the CGM prediction into the Transformer, whose one-to-one target queries jointly yield the positions of the multiple organs, the order of the target queries corresponding to the organ categories; and finally back-projects the detection results of the two projection views into 3D space to obtain the 3D bounding boxes, completing automatic multi-organ localization.
The technical scheme of the invention is as follows:
a multi-modal medical image multi-organ positioning method based on one-to-one object query Transformer comprises the following steps:
S1, two-dimensional projection and image fusion of three-dimensional multi-modal images
S11, preprocessing data
Carrying out normalization preprocessing on the spatial resolution and the image size of the data set, which specifically comprises the following steps:
First, resample the multi-modal images to a uniform voxel size according to the finest spatial resolution among the modalities; then choose an image size and center-crop the resampled images, removing surrounding background pixels while preserving the human body region.
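A minimal sketch of this preprocessing is given below; the target crop size and the use of scipy's zoom as the resampler are assumptions for illustration (the patent does not fix these):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volumes, spacings, target_shape=(256, 256, 256)):
    """Resample every modality to the finest voxel spacing, then center-crop."""
    finest = np.min(np.stack(spacings), axis=0)            # finest spacing per axis
    out = []
    for vol, sp in zip(volumes, spacings):
        vol = zoom(vol, np.asarray(sp) / finest, order=1)  # to uniform voxel size
        # pad first if the volume is smaller than the target, then center crop
        pad = [(max(t - s, 0) // 2, max(t - s, 0) - max(t - s, 0) // 2)
               for s, t in zip(vol.shape, target_shape)]
        vol = np.pad(vol, pad, mode="constant")
        start = [(s - t) // 2 for s, t in zip(vol.shape, target_shape)]
        vol = vol[tuple(slice(st, st + t) for st, t in zip(start, target_shape))]
        out.append(vol)
    return out
```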
S12, two-dimensional projection and gray level normalization of multi-modal image
S121, three-dimensional medical images are computationally heavy, and their projection images can represent the main information of the volume to a considerable degree, greatly reducing computation and lowering the difficulty of localization. Therefore, each modality image is projected in a manner that preserves its image characteristics, yielding a two-dimensional coronal projection and a sagittal projection. The specifics are as follows:
for functional modality images, such as PET images, hypermetabolic organ tissue is highlighted in the images, and in order to highlight hypermetabolic organ tissue, Maximum Intensity Projection (MIP) is selected, thereby ensuring good contrast of highly metabolic organs.
For anatomical mode images, such as CT and mri, anatomical structures can be displayed, and most of the clearly displayed organ tissues can be observed by human eyes. Since MIP can highlight high contrast tissue and weaken low contrast tissue, an Average Intensity Projection (AIP) is chosen to maintain tissue contrast.
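The two projection modes reduce to a maximum or mean along the axis perpendicular to the viewing plane. A minimal sketch, with assumed axis conventions (axis 0 = left-right, axis 1 = anterior-posterior, axis 2 = superior-inferior):

```python
import numpy as np

def project(volume, plane, mode):
    """MIP (max) for functional images, AIP (mean) for anatomical images."""
    axis = {"coronal": 1, "sagittal": 0}[plane]   # collapse y or x
    if mode == "MIP":
        return volume.max(axis=axis)
    if mode == "AIP":
        return volume.mean(axis=axis)
    raise ValueError(mode)

# e.g. coronal MIP of a PET volume and coronal AIP of the matching CT volume:
# pet_cor = project(pet, "coronal", "MIP"); ct_cor = project(ct, "coronal", "AIP")
```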
S122, for the projection image of each modality, select an appropriate threshold range and normalize the image gray levels to the range [0, 255]. Specifically, the threshold range uses a normalization window [g(p1), g(p2)], where g(p1) and g(p2) are gray thresholds that exclude the lowest p1 percent and the highest p2 percent of pixels in the image.
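A minimal sketch of this normalization window follows; the percentile values p1 and p2 are placeholders, since the patent's concrete values appear only in its figures:

```python
import numpy as np

def normalize_gray(img, p1=1.0, p2=99.0):
    """Clip to the [g(p1), g(p2)] gray thresholds and rescale to [0, 255]."""
    g1, g2 = np.percentile(img, [p1, p2])
    img = np.clip(img, g1, g2)
    return (255.0 * (img - g1) / max(g2 - g1, 1e-8)).astype(np.uint8)
```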
S13 image fusion of multi-modal projection
Fuse the projection views together to generate two fused images, one for the coronal view and one for the sagittal view. The fused image is computed as:

I_fused = Σ_u α_u·I_u    (1)

where α_u is a weight factor with Σ_u α_u = 1, and I_u is the projection view of the image of the u-th modality.
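A minimal sketch of equation (1); the equal-weight PET/CT example matches the embodiment described later:

```python
import numpy as np

def fuse(projections, weights):
    """I_fused = sum_u alpha_u * I_u, with the alpha_u summing to 1."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weight factors must sum to 1"
    return sum(a * p.astype(float) for a, p in zip(weights, projections))

# e.g. equal-weight PET/CT fusion as in the embodiment: fuse([pet, ct], [0.5, 0.5])
```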
S2, constructing a coarse prediction model of position correlation between organs by CGM
S21, constructing shape vector of organ
Detecting the positions of anatomical structures in multi-modal images is a key step in automatic image diagnosis. For organ localization, the organ need not be delineated precisely with contour lines; only the upper and lower coordinate bounds of its three-dimensional bounding box in the three-dimensional multi-modal image need to be determined.
The training set is formed from 6 feature values of each training sample: the center-point coordinates of the three-dimensional organ bounding box and its length, width, and height. For each training sample i, construct a shape vector s_i = (x, y, z, l_x, l_y, l_z).
S22 Generalized Procrustes transform
Before statistical modeling with the shape vectors representing organ positions and sizes, in order to eliminate the influence of other factors, the shape vector of each training sample is normalized from image space to model space through a generalized Procrustes transform; subsequent operations are then performed in model space.
S23 Principal Component Analysis (PCA)
PCA comes from linear algebra: a linear transformation ranks the variables of a data set by the proportion of variance their principal components explain, so that a selected subset represents the characteristics of the whole data set and its dimensionality is reduced. This lowers computation while retaining the main information and removing redundancy and noise in complex data analysis. Note that PCA is performed in model space.
Taking all s_i after the generalized Procrustes transform as the training set, model the variation of organ position and size with a statistical shape model:

s = s̄ + Φ_s b_s    (2)

where s̄ is the mean shape of the training set, i.e., the mean of the features of the selected training set; Φ_s is the eigenvector matrix obtained by principal component analysis (PCA) of the training set (the subscript s denotes shape), whose k eigenvectors represent k modes of variation of organ position and size learned from the training set; b_s is the shape parameter, whose k elements are the weights of the k deformation modes of Φ_s superimposed on the mean shape of the training set; and s is the resulting shape model. Adjusting the value of b_s controls the model deformation.

The shape parameter of each training sample is recovered by inverting equation (2):

b_s = Φ_s^T (s_i − s̄)    (3)

where Φ_s^T is the transpose of the eigenvector matrix obtained by PCA of the training set, s_i is the shape vector of the i-th training sample, s̄ is the mean shape of the training set, and b_s is the shape parameter of the training sample.
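A minimal sketch of equations (2) and (3), assuming the shape vectors have already been Procrustes-normalized into model space; the 90% variance threshold anticipates the embodiment below:

```python
import numpy as np

def fit_shape_model(S, variance_kept=0.9):
    """S is an (m, n) matrix of normalized shape vectors, one row per sample."""
    s_mean = S.mean(axis=0)
    X = S - s_mean
    # eigenvectors of the covariance matrix via SVD of the centered data
    _, sing, Vt = np.linalg.svd(X, full_matrices=False)
    var = sing ** 2
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), variance_kept) + 1)
    Phi = Vt[:k].T                      # (n, k) eigenvector matrix Phi_s
    return s_mean, Phi

def shape_params(s_i, s_mean, Phi):
    return Phi.T @ (s_i - s_mean)       # equation (3): b_s = Phi^T (s_i - s_mean)

def reconstruct(b_s, s_mean, Phi):
    return s_mean + Phi @ b_s           # equation (2): s = s_mean + Phi_s b_s
```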
S24 construction of conditional Gaussian model
Human torso organs are highly correlated in position and shape, so the correlations between organs can be modeled; the method used is a conditional Gaussian model. Taking the modeling of inter-organ shape correlation (the center-point coordinates and the length, width, and height of the three-dimensional organ bounding box) as an example, let b_s^A and b_s^B denote the shape parameters of two adjacent organs A and B, respectively. When b_s^A is known, the conditional Gaussian model CGM of b_s^B is described by equations (4) to (6):

p(b_s^B | b_s^A) = N(μ_{B|A}, Σ_{B|A})    (4)

μ_{B|A} = μ_B + Σ_{BA} Σ_{AA}^{-1} (b_s^A − μ_A)    (5)

Σ_{B|A} = Σ_{BB} − Σ_{BA} Σ_{AA}^{-1} Σ_{AB}    (6)

where μ_{B|A} is the mean (k × 1) of the shape parameters of the organ to be predicted, obtained from the conditional Gaussian model; Σ_{B|A} is the covariance matrix (k × k) of the shape parameters of the organ to be predicted; b_s^A and b_s^B are the shape parameters (k × 1) of each training sample, with training-set means μ_A and μ_B; Σ_{AB} and Σ_{BA} are the cross-covariance matrices (k × k) between b_s^A and b_s^B; Σ_{AA} and Σ_{BB} are the covariance matrices (k × k) of b_s^A and b_s^B over the training set; p(b_s^B | b_s^A) is the distribution of b_s^B when b_s^A is known; and N denotes a Gaussian distribution (parenthesized terms give matrix dimensions).
The conditional Gaussian model is thus built from the conditional probability relationships between the positions, lengths, widths, and heights of adjacent organs. When the bounding-box center, length, width, and height of organ A are known, it predicts the mean and standard deviation of those of the adjacent organ B, giving the approximate range of organ B.
In use, the shape parameters of the organ to be predicted are obtained from the shape parameters of the known organ through the CGM and then substituted into the shape model s = s̄ + Φ_s b_s to obtain the result in model space. Because this result lies in the normalized model space, it must also be inverse-normalized into the real physical space of the image to obtain the final prediction.
Multiple CGMs must be constructed, each modeling the position and size correlation between the torso region and one organ. Using these CGMs, the position information of the multiple organs in the torso region is coarsely predicted from the torso as the known region, and serves as the constraint for the one-to-one target queries in the subsequent Transformer network.
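A minimal sketch of equations (4)–(6) follows, fitting the covariance blocks from paired training shape parameters and predicting the conditional mean and covariance; variable names are illustrative:

```python
import numpy as np

def fit_cgm(B_A, B_B):
    """B_A, B_B: (m, k) shape parameters of organs A and B over m samples."""
    mu_A, mu_B = B_A.mean(axis=0), B_B.mean(axis=0)
    A0, B0 = B_A - mu_A, B_B - mu_B
    m = len(B_A)
    S_AA = A0.T @ A0 / (m - 1)          # Sigma_AA, (k, k)
    S_BB = B0.T @ B0 / (m - 1)          # Sigma_BB, (k, k)
    S_BA = B0.T @ A0 / (m - 1)          # cross covariance Sigma_BA, (k, k)
    return mu_A, mu_B, S_AA, S_BB, S_BA

def cgm_predict(b_A, mu_A, mu_B, S_AA, S_BB, S_BA):
    """Conditional mean and covariance of b_B given b_A, equations (5)-(6)."""
    K = S_BA @ np.linalg.pinv(S_AA)     # regression matrix Sigma_BA Sigma_AA^-1
    mu = mu_B + K @ (b_A - mu_A)
    cov = S_BB - K @ S_BA.T             # Sigma_BB - Sigma_BA Sigma_AA^-1 Sigma_AB
    return mu, cov
```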
S3 organ positioning method based on one-to-one target query Transformer network
The Transformer network is a powerful architecture originally developed for natural language processing. Thanks to its built-in self-attention mechanism, the Transformer is well suited to learning correlations between targets, improving performance in image recognition, object detection, segmentation, and related fields. The invention uses the self-attention mechanism of the Transformer network to model the correlations of position and size among different organs.
The Transformer model is constructed by improving the state-of-the-art 2D Detection Transformer (DETR) network. DETR, recently proposed as the first fully end-to-end 2D image object detector, uses a Transformer to convert CNN-extracted image features directly into localization results. However, because DETR is designed for object detection in natural images, it must handle objects that may be absent from the current image: although the number of object classes in a natural-image dataset is large, the classes and the number of objects actually present vary greatly from image to image. Given the N targets defined in the DETR architecture, the actual number of targets in an image is typically far less than N. DETR must therefore use a bipartite-graph matching strategy and identify the class labels of detected targets through a target classification network.
In contrast, the present invention uses a one-to-one target query architecture, forcing a unique target query for each target organ, thus eliminating the need for a bipartite-graph matching module and for target classification branches. Furthermore, since the positions and sizes of different organs are closely related, the invention uses learnable spatial encodings of pixel coordinates to characterize target locations. In this way, modeling the geometric correlations between organs with the self-attention mechanism of the Transformer network ensures robust detection of all target structures from the 2D fused projection images. The specific steps are as follows:
S31, feature extraction of 2D fused projection image
Send each fused image in the coronal and sagittal directions to a CNN for image feature extraction; after dimension-reduction mapping into a sequence, input the features into the Transformer together with the prediction result of the CGM.
S32, constructing a Transformer model for one-to-one target query
Send the feature sequence extracted by the CNN, together with its corresponding spatial position encoding, to the encoder module of the Transformer for global correlation computation. For the decoder module, N target queries are used for the N organs to be detected. Both the N target queries and the encoder output serve as decoder inputs, and under the combined effect of the encoder and the CGM prediction, each projection image yields N sets of 2D bounding boxes.
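A minimal sketch of this architecture under assumed hyperparameters is given below; it follows the DETR pattern with the bipartite matching and classification head removed, and omits the CGM constraint channel for brevity:

```python
import torch
import torch.nn as nn
import torchvision

class OneToOneQueryDetector(nn.Module):
    def __init__(self, n_organs, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # conv features
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # reduce dims
        self.pos_embed = nn.Parameter(torch.randn(1, d_model, 16, 16))  # learnable 2D positions
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.queries = nn.Embedding(n_organs, d_model)   # one query per organ, fixed order
        self.box_head = nn.Linear(d_model, 4)            # (cx, cy, w, h), normalized

    def forward(self, fused_image):
        # fused_image: (B, 3, H, W) fused projection, grayscale replicated to 3 channels
        f = self.input_proj(self.backbone(fused_image))  # (B, d, h, w)
        pos = nn.functional.interpolate(self.pos_embed, size=f.shape[-2:])
        src = (f + pos).flatten(2).transpose(1, 2)       # (B, h*w, d) feature sequence
        tgt = self.queries.weight.unsqueeze(0).expand(len(f), -1, -1)
        out = self.transformer(src, tgt)                 # (B, N, d)
        return self.box_head(out).sigmoid()              # N 2D boxes, one per organ
```

Because the i-th output always answers the i-th query, the query index itself names the organ, which is what removes the matching and classification machinery.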
S4, three-dimensional back projection of two-dimensional projection image detection result
Using the improved Transformer model, steps S2 and S3 are performed on the coronal and sagittal projection fusion images respectively; the output is the 2D bounding box of every target organ in each projection view, expressed as follows:
Let x, y, and z be the pixel coordinates of the three spatial dimensions. The bounding box from the coronal projection is (x_min, x_max, z_min, z_max), and the bounding box from the sagittal projection is (y_min, y_max, z_min, z_max). Finally, the coronal and sagittal bounding boxes are back-projected to obtain the three-dimensional organ bounding box (x_min, x_max, y_min, y_max, z_min, z_max).
In conclusion, the method is suitable for automatic localization and detection of multiple organs in multi-modal medical images; it mainly exploits the geometric correlations between organs, jointly modeled by the conditional Gaussian model and the Transformer, and the information complementarity of the multi-modal images to localize organs efficiently and accurately.
The beneficial effects of the invention are as follows: the method uses a conditional Gaussian model and the self-attention mechanism of a Transformer network to model the correlations of organ position and size. The one-to-one target query architecture enforces a unique target query for each target organ, with no bipartite-graph matching; the order of the queries encodes the predicted category, so no classification is needed, simplifying the network structure, reducing redundant computation, and speeding learning convergence. Before organ detection is performed, the 3D multi-modal images are projected onto two orthogonal 2D planes, complementary information from the multi-modal images is combined through a multi-modal fusion method, and finally the resulting 2D bounding boxes are back-projected to obtain the 3D bounding box, reducing the computational burden and yielding a more stable organ localization result.
Drawings
Fig. 1 is a flow chart of the positioning method of the present invention.
FIG. 2 is a schematic diagram illustrating a three-dimensional bounding box of a multi-organ, using a bimodal PET/CT image as an example; wherein (a) is a coronal plane, (b) is a sagittal plane, and (c) is a transverse plane.
FIG. 3 is a diagram illustrating a two-dimensional projection and image fusion process using a bimodal PET/CT image as an example.
FIG. 4 is a flow chart illustrating the multi-organ positioning method according to the present invention by taking a bimodal PET/CT image as an example.
Detailed Description
The multi-modal medical image multi-organ localization method based on a one-to-one target query Transformer is shown in fig. 1; its purpose is to determine the three-dimensional bounding boxes of multiple human organs as shown in fig. 2. The invention is further explained below with reference to a specific embodiment.
S1, two-dimensional projection and image fusion of three-dimensional multi-modal images
S11, preprocessing data
First, select the finest voxel size among the modality images and resample all modalities to it; then center-crop the resampled images so that spatial resolution and image size are consistent across the data set.
S12, two-dimensional projection and gray level normalization of multi-modal image
To reduce computation and lower the difficulty of localization, each modality is projected in a manner that preserves its image characteristics, yielding a two-dimensional coronal projection and a sagittal projection. Specifically:
For functional modality images (e.g., PET) among the multi-modal images, maximum intensity projection is selected to keep highly metabolic organ tissue prominent. However, a projection obtained directly by MIP is too dark overall because of strongly highlighted organs (such as the bladder); for accurate detection it must be gray-level normalized, specifically with a normalization window [g(p1), g(p2)], where g(p1) and g(p2) are gray thresholds that exclude the lowest p1 percent and the highest p2 percent of pixels in the image. Values of p1 and p2 found through experiment yield robust performance.
For anatomical modality images (e.g., CT images) in multi-modality images, a grayscale normalization window with Hounsfield Units (HU) of [ -1000,1000] is used. Experiments show that the algorithm is not sensitive to the CT gray scale window.
S13 image fusion of multi-modal projection
Normalize the projection image of each modality to the gray-level range [0, 255], then perform multi-modal information fusion according to the fused-image formula:

I_fused = Σ_u α_u·I_u    (1)

where α_u is a weight factor with Σ_u α_u = 1, and I_u is the projection view of the image of the u-th modality.
Taking a bimodal PET/CT image as an example, experiments show that the weight factor for PET yields consistent performance over the range [0.3, 0.7]; the projection views are therefore fused with equal weights of 0.5, generating fused images for the coronal and sagittal views respectively, as shown in fig. 3.
S2, constructing a coarse prediction model of position correlation between organs by CGM
A key step in organ localization in multi-modal images is statistical modeling of the correlations of position and shape between organs.
First, a data set of 3D organ bounding boxes must be acquired. The 3D bounding boxes are manually marked in the multi-modal images by a doctor; the upper and lower bounds of their x, y, and z coordinates are recorded and then converted into the center-point coordinates and the length, width, and height of each 3D bounding box, (x, y, z, l_x, l_y, l_z). The data characterizing organ position and size in image space are then normalized to model space by the generalized Procrustes transform. Principal component analysis is next performed in model space, selecting the principal components that account for more than 90% of the total variation of the data set, which completes the PCA-based statistical modeling of the organ bounding boxes.
The steps of the invention are specifically described by taking the prediction of an unknown organ from the known torso region as an example. The 6 feature values of each training sample, namely the center-point coordinates and the length, width, and height of the three-dimensional bounding box, form the training set (m × n, where m is the number of training samples and n = 6).
When predicting organs from the torso region, normalize the torso training set to obtain the normalized training set X (m × n) and perform PCA on it to obtain the eigenvector matrix Φ_s (n × k). Then, according to b_s = Φ_s^T (s_i − s̄), substitute the eigenvector matrix Φ_s obtained by PCA on the normalized training set and the normalized training set X to obtain the shape parameter b_s (k × 1) of each training sample. Next, normalize the organ bounding-box training set (for each training sample, subtract the mean of the corresponding torso region from the feature values and divide by the standard deviation of the torso region) and, as in the previous step, obtain the shape parameter b_s (k × 1) of each training sample's organ bounding box.
Substitute the shape parameters of the torso region and the organ boxes into the CGM equations (4)–(6), obtaining the shape parameter b_s of the organ bounding box predicted from the torso region. According to the shape model s = s̄ + Φ_s b_s, substitute the normalized mean and the eigenvector matrix of the organ bounding-box training set to obtain s, then inverse-normalize to physical space using the mean and standard deviation of the torso region to obtain the final prediction.
The test set does not overlap the training set, but the test data are processed with the same standardization as the training data. The torso region of each test case is used to predict the three-dimensional bounding boxes of the unknown organs, and two-dimensional projection finally yields coarse predictions of the 2D bounding boxes in the coronal and sagittal projections.
S3 organ positioning method based on one-to-one target query Transformer network
The Transformer model in the invention is constructed by improving the DETR network.
Unlike natural object detection, organs detected in medical images obey strong anatomical constraints. In most cases, all anatomical structures of interest are present in each target image, and there is only one target object per structure. We exploit this anatomical constraint to simplify the network architecture: given N organs to be detected, only N target queries are defined and each query is assigned to one target organ, so no bipartite-graph matching strategy is needed. The order of the queries naturally corresponds to the organ categories, so no additional classification is required. The specific detection process is as follows:
the fused image is firstly sent into a CNN network for image feature extraction, a ResNet-50 structure is selected, and in experiments of CNN layers with different numbers, the best performance is generated in 50 layers. The image features are subjected to dimension reduction mapping to be a sequence and then are added with corresponding position codes to be used as input of an encoder module in a Transformer part, and in addition, the organ coarse prediction result of the CGM can also be used as a channel input to be used for subsequent target position and size constraint.
The self-attention layers in the encoder aggregate information from every element of the input sequence and update each element, enabling global correlation computation; this suits long sequences and plays an important role.
In addition, at the input part of the decoder module, only N target queries are defined, each query is assigned to correspond to one target organ, the sequence of each query corresponds to the category of each organ, and under the comprehensive action of the decoder and the CGM prediction result, N outputs are obtained, corresponding to the positions of N organs.
Modeling the geometric correlation between organs jointly by means of the attention mechanism of the Transformer network and CGM ensures robust detection of all target structures from the 2D fused projection images.
S4, three-dimensional back projection of two-dimensional projection image detection result
The same detection procedure is performed on the coronal and sagittal projection fusion images; the outputs of the Transformer model are the 2D bounding boxes of all target organs in the two fused projections. The coronal projection yields the length and height of the three-dimensional bounding box, and the sagittal projection yields its width and height. The maximum extent of the heights from the two projection planes is then taken, and the detection results of the two views are back-projected to obtain the final three-dimensional organ bounding box (x_min, x_max, y_min, y_max, z_min, z_max).
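A minimal sketch of this back-projection step; the box layouts follow the coronal (x, z) and sagittal (y, z) conventions stated above:

```python
def backproject(cor_box, sag_box):
    """cor_box = (x_min, x_max, zc_min, zc_max); sag_box = (y_min, y_max, zs_min, zs_max)."""
    x_min, x_max, zc_min, zc_max = cor_box
    y_min, y_max, zs_min, zs_max = sag_box
    # take the maximum z extent seen in either projection plane
    z_min, z_max = min(zc_min, zs_min), max(zc_max, zs_max)
    return (x_min, x_max, y_min, y_max, z_min, z_max)
```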
FIG. 4 is a flowchart of a method for multi-organ localization based on a one-to-one object queried Transformer network, which is illustrated by taking a bimodal PET/CT image as an example.

Claims (5)

1. A multi-modal medical image multi-organ localization method based on one-to-one target query Transformer, characterized by comprising the following steps:
S1, two-dimensional projection and image fusion of three-dimensional multi-modal images
S11, preprocessing data
Carrying out normalization preprocessing on the spatial resolution and the image size of the data set;
S12, two-dimensional projection and gray level normalization of multi-modal image
S121, projecting each mode image in a projection mode of keeping image characteristics to obtain a two-dimensional coronal projection and a two-dimensional sagittal projection;
S122, selecting an appropriate threshold range for the projection image of each modality to normalize the image gray levels to the range [0, 255];
S13 image fusion of multi-modal projection
Fusing each projection view together to generate two fused images of coronal and sagittal views, respectively; the fused image calculation formula is as follows:
I_fused = Σ_u α_u·I_u    (1)

where α_u is a weight factor with Σ_u α_u = 1, and I_u is the projection view of the image of the u-th modality;
S2, constructing a coarse prediction model of position correlation between organs by CGM
S21, constructing shape vector of organ
The center-point coordinates and the length, width, and height of the three-dimensional organ bounding boxes form the training set; for each training sample i in the training set, construct a shape vector s_i = (x, y, z, l_x, l_y, l_z);
S22 Generalized Procrustes transform
Normalizing the shape vector of each training sample from image space to model space through a generalized Procrustes transform, and performing subsequent operations in model space;
S23 principal component analysis
Taking all s_i after the generalized Procrustes transform as the training set, model the variation of organ position and size with a statistical shape model:

s = s̄ + Φ_s b_s    (2)

where s̄ is the mean shape of the training set, i.e., the mean of the features of the selected training set; Φ_s is the eigenvector matrix obtained by principal component analysis of the training set (the subscript s denotes shape), whose k eigenvectors represent k modes of variation of organ position and size learned from the training set; b_s is the shape parameter, whose k elements are the weights of the k deformation modes of Φ_s superimposed on the mean shape of the training set; s is the resulting shape model, and adjusting the value of b_s controls the model deformation;

the shape parameter of each training sample is recovered by inverting equation (2):

b_s = Φ_s^T (s_i − s̄)    (3)

where Φ_s^T is the transpose of the eigenvector matrix obtained by principal component analysis of the training set;
s24 construction of conditional Gaussian model
Let b_s^A and b_s^B denote the shape parameters of two adjacent organs A and B, respectively; when b_s^A is known, the conditional Gaussian model CGM of b_s^B is described by equations (4) to (6):

p(b_s^B | b_s^A) = N(μ_{B|A}, Σ_{B|A})    (4)

μ_{B|A} = μ_B + Σ_{BA} Σ_{AA}^{-1} (b_s^A − μ_A)    (5)

Σ_{B|A} = Σ_{BB} − Σ_{BA} Σ_{AA}^{-1} Σ_{AB}    (6)

where μ_{B|A} is the mean of the shape parameters of the organ to be predicted, obtained from the conditional Gaussian model, with dimension k × 1; Σ_{B|A} is the covariance matrix of the shape parameters of the organ to be predicted, with dimension k × k; b_s^A and b_s^B are the shape parameters of each training sample, with dimension k × 1 and training-set means μ_A and μ_B; Σ_{AB} and Σ_{BA} are the cross-covariance matrices between b_s^A and b_s^B, with dimension k × k; Σ_{AA} and Σ_{BB} are the covariance matrices of b_s^A and b_s^B in the training set, with dimension k × k; p(b_s^B | b_s^A) is the distribution of b_s^B when b_s^A is known; and N is a Gaussian distribution;
first substituting the shape parameters of the organ to be predicted obtained by the CGM into the shape model s = s̄ + Φ_s b_s to obtain the result in model space; then performing inverse normalization to the real physical space of the image to obtain the final prediction result;
S3 organ positioning method based on one-to-one target query Transformer network
S31, feature extraction of 2D fused projection image
Sending each fusion image in the coronal and sagittal directions to a CNN network for image feature extraction, and inputting the fusion images into a Transformer together with a prediction result of CGM after dimension reduction mapping into a sequence;
S32, constructing a Transformer model for one-to-one target query
Sending the characteristic sequence extracted by the CNN and the space position code corresponding to the characteristic sequence to an encoder module in a Transformer for global correlation calculation; for the decoder module, N target queries are respectively used for N organs to be detected; the N target queries and the encoder output are used as the input of a decoder, and under the comprehensive action of the encoder and CGM prediction, each projection image obtains N groups of 2D bounding boxes;
S4, three-dimensional back projection of two-dimensional projection image detection result
The steps of S2 and S3 are performed on the coronal and sagittal projection fusion images respectively, and the output of the Transformer model is the 2D bounding box of every target organ in each projection view, expressed as follows:
let x, y, and z be the pixel coordinates of the three spatial dimensions; the bounding box of the coronal projection is (x_min, x_max, z_min, z_max), and the bounding box of the sagittal projection is (y_min, y_max, z_min, z_max); finally, the coronal and sagittal bounding boxes are back-projected to obtain the three-dimensional organ bounding box (x_min, x_max, y_min, y_max, z_min, z_max).
2. The multi-modal medical image multi-organ localization method based on one-to-one target query Transformer according to claim 1, wherein the data preprocessing in step S11 is performed as follows:
first, resample the multi-modal images to a uniform voxel size according to the finest spatial resolution among the modalities; then choose an image size and center-crop the resampled images, removing surrounding background pixels while preserving the human body region.
3. The multi-modal medical image multi-organ localization method based on one-to-one target query Transformer according to claim 1 or 2, wherein in step S121 the projection manner for the modality images is specifically as follows: for functional modality images, maximum intensity projection is selected to ensure good contrast of highly metabolic organs; for anatomical modality images, average intensity projection is selected to maintain tissue contrast.
4. The multi-modal medical image multi-organ localization method based on one-to-one target query Transformer according to claim 1 or 2, wherein in step S122 the threshold range specifically uses a normalization window [g(p1), g(p2)], where g(p1) and g(p2) are gray thresholds that exclude the lowest p1 percent and the highest p2 percent of pixels in the image.
5. The multi-modal medical image multi-organ localization method based on one-to-one target query Transformer according to claim 3, wherein in step S122 the threshold range specifically uses a normalization window [g(p1), g(p2)], where g(p1) and g(p2) are gray thresholds that exclude the lowest p1 percent and the highest p2 percent of pixels in the image.
CN202210030228.XA 2022-01-12 2022-01-12 Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer Pending CN114359642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210030228.XA CN114359642A (en) 2022-01-12 2022-01-12 Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210030228.XA CN114359642A (en) 2022-01-12 2022-01-12 Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer

Publications (1)

Publication Number Publication Date
CN114359642A true CN114359642A (en) 2022-04-15

Family

ID=81108548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210030228.XA Pending CN114359642A (en) 2022-01-12 2022-01-12 Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer

Country Status (1)

Country Link
CN (1) CN114359642A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792315A (en) * 2022-06-22 2022-07-26 浙江太美医疗科技股份有限公司 Medical image visual model training method and device, electronic equipment and storage medium
CN114792315B (en) * 2022-06-22 2022-10-11 浙江太美医疗科技股份有限公司 Medical image visual model training method and device, electronic equipment and storage medium
CN114862881A (en) * 2022-07-11 2022-08-05 四川大学 Cross-modal attention tumor segmentation method and system based on PET-CT
CN115311258A (en) * 2022-09-15 2022-11-08 佛山读图科技有限公司 Method and system for automatically segmenting organs in SPECT (single photon emission computed tomography) plane image
CN115861303A (en) * 2023-02-16 2023-03-28 四川大学 EGFR gene mutation detection method and system based on lung CT image
CN115861303B (en) * 2023-02-16 2023-04-28 四川大学 EGFR gene mutation detection method and system based on lung CT image

Similar Documents

Publication Publication Date Title
US10346986B2 (en) System and methods for image segmentation using convolutional neural network
CN111798462B (en) Automatic delineation method of nasopharyngeal carcinoma radiotherapy target area based on CT image
CN114503159A (en) Three-dimensional object segmentation of medical images localized by object detection
US9947102B2 (en) Image segmentation using neural network method
EP3365869B1 (en) System and method for image registration in medical imaging system
CN114359642A (en) Multi-modal medical image multi-organ positioning method based on one-to-one target query Transformer
US9760983B2 (en) System and method for image registration in medical imaging system
US7876938B2 (en) System and method for whole body landmark detection, segmentation and change quantification in digital images
US7916919B2 (en) System and method for segmenting chambers of a heart in a three dimensional image
CN113516659B (en) Medical image automatic segmentation method based on deep learning
CN111640120A (en) Pancreas CT automatic segmentation method based on significance dense connection expansion convolution network
JP2019114262A (en) Medical image processing apparatus, medical image processing program, learning apparatus and learning program
KR20230059799A (en) A Connected Machine Learning Model Using Collaborative Training for Lesion Detection
CN112634265B (en) Method and system for constructing and segmenting fully-automatic pancreas segmentation model based on DNN (deep neural network)
CN114693933A (en) Medical image segmentation device based on generation of confrontation network and multi-scale feature fusion
Sokooti et al. Hierarchical prediction of registration misalignment using a convolutional LSTM: Application to chest CT scans
CN115830016A (en) Medical image registration model training method and equipment
CN114693671A (en) Lung nodule semi-automatic segmentation method, device, equipment and medium based on deep learning
Gleason et al. A new deformable model for analysis of X-ray CT images in preclinical studies of mice for polycystic kidney disease
Zhou et al. Learning stochastic object models from medical imaging measurements using Progressively-Growing AmbientGANs
CN115861464A (en) Pseudo CT (computed tomography) synthesis method based on multimode MRI (magnetic resonance imaging) synchronous generation
Erdt et al. Computer aided segmentation of kidneys using locally shape constrained deformable models on CT images
Zhou et al. Learning stochastic object models from medical imaging measurements by use of advanced ambientgans
Chourak et al. Voxel-wise analysis for spatial characterisation of Pseudo-CT errors in MRI-only radiotherapy planning
CN115409837B (en) Endometrial cancer CTV automatic delineation method based on multi-modal CT image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination