CN115082293A - Image registration method based on Swin Transformer and CNN dual-branch coupling - Google Patents
- Publication number
- CN115082293A CN115082293A CN202210650873.1A CN202210650873A CN115082293A CN 115082293 A CN115082293 A CN 115082293A CN 202210650873 A CN202210650873 A CN 202210650873A CN 115082293 A CN115082293 A CN 115082293A
- Authority
- CN
- China
- Prior art keywords
- image
- swin
- cnn
- branch
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 42
- 230000008878 coupling Effects 0.000 title claims abstract description 22
- 238000010168 coupling process Methods 0.000 title claims abstract description 22
- 238000005859 coupling reaction Methods 0.000 title claims abstract description 22
- 238000013507 mapping Methods 0.000 claims abstract description 34
- 230000009466 transformation Effects 0.000 claims abstract description 10
- 230000003993 interaction Effects 0.000 claims abstract description 9
- 230000004927 fusion Effects 0.000 claims abstract description 7
- 238000010606 normalization Methods 0.000 claims abstract description 7
- 238000012952 Resampling Methods 0.000 claims abstract description 5
- 238000007781 pre-processing Methods 0.000 claims abstract description 5
- 238000004364 calculation method Methods 0.000 claims description 10
- 230000007246 mechanism Effects 0.000 claims description 4
- 230000006870 function Effects 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 2
- 125000004122 cyclic group Chemical group 0.000 claims description 2
- 239000011159 matrix material Substances 0.000 claims description 2
- 238000011176 pooling Methods 0.000 claims description 2
- 238000000605 extraction Methods 0.000 abstract description 2
- 238000013527 convolutional neural network Methods 0.000 description 28
- 238000004088 simulation Methods 0.000 description 5
- 238000013135 deep learning Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 238000004422 calculation algorithm Methods 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 230000002457 bidirectional effect Effects 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000012733 comparative method Methods 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000015654 memory Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G06T3/147—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G06T3/18—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4038—Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4046—Scaling the whole image or part thereof using neural networks
Abstract
The invention discloses an image registration method based on dual-branch coupling of a Swin Transformer and a CNN. The method comprises the following steps: 1. performing standard preprocessing steps such as gray-value normalization, center cropping and resampling on all images in the original data; 2. concatenating the floating image and the fixed image, feeding the concatenated images into the registration network, and passing them in parallel through two encoder branches, a Swin Transformer branch and a CNN branch; 3. at each stage of the Swin Transformer, performing feature interaction and fusion between the Swin Transformer feature map and the CNN feature map of the corresponding resolution through a dual-branch feature coupling module; 4. the decoder adaptively adjusts the deep features from the encoder and the features from the upper layer, and finally outputs a deformation field between the floating image and the fixed image; 5. inputting the floating image and the deformation field into a spatial transformation network to obtain a registered image; 6. calculating the similarity loss between the registered image and the fixed image and the regularization loss of the deformation field, and performing back propagation to train the network. The Swin Transformer and CNN branches are used jointly for feature extraction, fully exploiting the advantages of both branches and achieving feature complementarity.
Description
Technical Field
The invention belongs to the technical field of image registration, and particularly relates to an optimization method for effectively improving image registration performance.
Background Art
Deformable Image Registration (DIR) is a fundamental task in image processing with important clinical application value, and has recently attracted the attention of many scholars. Many conventional registration methods minimize a cost function iteratively; however, these methods involve a large number of operations, and registering a pair of images requires a large amount of time. In recent years, with the rapid development of Deep Learning (DL), deep-learning-based image registration has attracted researchers due to its short runtime and high precision. In general, deep-learning-based methods can be classified into supervised and unsupervised methods. In image registration, the true deformation field is very difficult to acquire, and manually labeled deformation fields may introduce unnecessary errors. Therefore, supervised methods generally obtain deformation-field labels from conventional algorithms or from simulated deformations; the registration accuracy of these methods, however, depends heavily on the quality of the generated deformation fields. Unsupervised methods have been increasingly pursued, because the network can be trained with guidance from the similarity between the registered and fixed images, without the need for a true deformation field. In recent years, a large number of unsupervised registration methods based on Convolutional Neural Networks (CNNs) have been proposed, all with good results. However, constrained by the limited receptive field of the convolution kernel, CNNs cannot effectively capture long-range correspondences between the moving and fixed images, and their performance is thus limited.
Recently, transform-based network architectures have been introduced into various computer vision tasks due to their powerful capabilities. Unlike convolution operations, the self-attention mechanism in the Transformer has an effective field of infinite size, which enables the Transformer to capture telespatial information. Although a general Transformer has strong long-range modeling capability and can effectively capture long-distance position corresponding relation, the number of voxels in an image registration task is too large, and a network is difficult to find a real corresponding voxel pair. Meanwhile, due to the characteristics of the convolution kernel, the capture capability of the CNN for the local detail information is far better than that of the Transformer. In addition, the Transformer divides the original image into a plurality of windows, and the windows lack interaction. In the image registration task, since the positions of the corresponding voxel pairs of the fixed image and the floating image are different, they are likely to exist in two different windows respectively, so that they are difficult to match with each other. In order to enhance the capture efficiency of the local relationship, the Swin Transformer local window is self-attentive, and the efficiency is greatly improved while the performance is improved. The Swin Transformer calculates self-attention under each window, and introduces a shift window operation for better information interaction with other windows. The shift window appears very bright in general visual tasks, which in practice is achieved indirectly by shifting the feature map. In the task of image registration, the significance of such operations may be small, and the positional relationship of corresponding points within different windows still cannot be effectively captured. 
Because the CNN convolution kernel slides over the feature map with overlap, it can effectively avoid the situation in the Transformer where corresponding points in different windows cannot be captured.
Disclosure of Invention
The invention discloses an image registration method based on Swin Transformer and CNN dual-branch coupling. The method designs a novel dual-branch coupled network structure: a U-shaped network formed by a classical encoder and decoder. The encoder consists of a Swin Transformer branch and a CNN branch, and can effectively exploit both the Transformer-based self-attention features and the CNN-based convolution features. A feature coupling module complementarily fuses the Swin Transformer feature maps with the CNN feature maps in an interactive manner, fully promoting the feature expression capability of the two encoder branches and further improving registration performance.
The technical solution for realizing the invention is as follows: an image registration method based on Swin Transformer and CNN dual-branch coupling, comprising the following steps:
the first step: performing standard preprocessing of gray-value normalization, center cropping, resampling and affine transformation on all images in the original data;
the second step: concatenating the floating image and the fixed image, feeding the concatenated images into the registration network, and passing them in parallel through two encoder branches, a Swin Transformer branch and a CNN branch;
the third step: at each stage of the Swin Transformer, performing feature interaction and fusion between the Swin Transformer feature map and the CNN feature map of the corresponding resolution through a dual-branch feature coupling module;
the fourth step: the decoder adaptively adjusts the deep features from the encoder and the features from the upper layer, and finally outputs a deformation field between the floating image and the fixed image;
the fifth step: inputting the floating image and the deformation field into a spatial transformation network to obtain a registered image;
the sixth step: calculating the similarity loss between the registered image and the fixed image and the regularization loss of the deformation field, and performing back propagation to train the network.
Compared with the prior art, the invention has the following notable features: (1) a Swin Transformer encoder and a CNN encoder are designed in parallel, simultaneously fusing the Swin-Transformer-based attention features and the CNN-based convolution features, which enhances the generalization capability of the model. (2) A bidirectional interaction mechanism promotes the feature extraction capability of the Swin Transformer and the CNN, while making their feature maps complement each other. (3) The network is an unsupervised end-to-end model in which all modules are trained and inferred in a unified manner, requiring no extra labels for training. (4) The method achieves fast registration speed together with high registration accuracy.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a diagram of a network architecture of the present invention.
FIG. 3 is a schematic of the Swin Transformer block.
Fig. 4 is a diagram of the dual-branch feature coupling module.
Fig. 5 shows the fixed and floating images and the registration results of different methods on the LPBA40 dataset.
Fig. 6 shows difference maps between the registered images of different methods and the fixed image on the LPBA40 dataset.
Detailed Description
The invention designs a registration network based on dual-branch coupling of a Swin Transformer and a CNN. The method adopts a parallel design in which the Swin-Transformer-based self-attention features and the CNN-based convolution features mutually promote each other through bidirectional interaction, enhancing their respective feature representations and thereby capturing accurate spatial correspondences between the input floating and fixed images. The network architecture of the invention is shown in Fig. 2.
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, the steps of the present invention will be described in detail.
In the first step, all images in the original data are preprocessed with the standard steps of gray-value normalization, center cropping, resampling and affine transformation. The gray-value normalization step scales the gray values of the image to the [0, 1] interval according to:

I_norm = (I − I_min) / (I_max − I_min)

where I_min and I_max represent the minimum and maximum gray values in the image, respectively.
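The normalization step above can be sketched in a few lines of NumPy; the function name and the toy volume are illustrative, not part of the patent:

```python
import numpy as np

def normalize_intensity(image: np.ndarray) -> np.ndarray:
    """Min-max normalize voxel intensities to [0, 1]:
    I_norm = (I - I_min) / (I_max - I_min)."""
    i_min, i_max = image.min(), image.max()
    return (image - i_min) / (i_max - i_min)

# Toy 3D volume with arbitrary intensity range
vol = np.random.default_rng(0).normal(100.0, 20.0, size=(8, 8, 8))
norm = normalize_intensity(vol)
```

The minimum voxel maps exactly to 0 and the maximum to 1, so downstream similarity losses operate on a common intensity scale for both input images.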
In the second step, the floating image and the fixed image are concatenated, fed into the registration network, and passed in parallel through the Swin Transformer and CNN encoder branches. The floating image and the fixed image are denoted M and F, respectively.
In the Swin Transformer branch, the input image is first divided into non-overlapping 3D patches, each of size P×P×P. Let x^i denote the i-th patch, where i ∈ {1, …, N} and N is the total number of patches. Each patch is flattened and treated as a token, and each token is then projected to a C-dimensional feature representation using a linear mapping layer:

z_0 = [x^1 E; x^2 E; …; x^N E]

where E denotes the linear mapping; the output z_0 has dimension N×C.
After the linear mapping layer, the branch has 4 consecutive stages. The 1st stage consists of a linear mapping layer and several Swin Transformer blocks; each of the other 3 stages consists of a Patch Merging layer and several Swin Transformer blocks. A Swin Transformer block outputs the same number of tokens as it receives, while the Patch Merging layer concatenates the features of each group of 2×2×2 neighboring tokens, producing an 8C-dimensional feature embedding, whose dimension is then reduced to 2C by a linear layer. In this branch, the outputs of two consecutive Swin Transformer blocks are calculated as follows:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
where W-MSA and SW-MSA are the window-based and shifted-window multi-head self-attention modules, respectively; ẑ^l and z^l denote the outputs of W-MSA and SW-MSA; MLP and LN denote the multilayer perceptron and the layer-normalization layer, respectively. The shifted-window mechanism uses a 3D cyclic shift to compute self-attention according to:

Attention(Q, K, V) = SoftMax(QK^T / √d) V

where Q, K and V denote the Query, Key and Value matrices, respectively, and d denotes the dimension of Query and Key.
the CNN branch adopts a characteristic pyramid structure, wherein the resolution of characteristic mapping is reduced along with the depth of a network, but the number of channels is increased layer by layer; 3D convolution is uniformly adopted, the convolution kernel size is 3 multiplied by 3, each convolution is followed by an LeakyReLU layer, and down-sampling operation is carried out through a maximum pooling layer.
The structure of the double branch is shown in figure 2.
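The downsampling step between pyramid levels of the CNN branch can be sketched as a 2×2×2 max pooling over a channel-first feature map; the helper name and sizes are illustrative:

```python
import numpy as np

def max_pool_3d(feat: np.ndarray, k: int = 2) -> np.ndarray:
    """2x2x2 max pooling between CNN pyramid levels.
    feat has shape (C, D, H, W); each spatial dim must be divisible by k."""
    c, d, h, w = feat.shape
    x = feat.reshape(c, d // k, k, h // k, k, w // k, k)
    return x.max(axis=(2, 4, 6))   # max over each k*k*k block

rng = np.random.default_rng(6)
level0 = rng.normal(size=(16, 8, 8, 8))
level1 = max_pool_3d(level0)       # spatial resolution halves
```

In the actual branch a convolution after pooling would also increase the channel count; this sketch shows only the resolution reduction.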
In the third step, at each stage of the Swin Transformer, feature interaction and fusion are performed between the Swin Transformer feature map and the CNN feature map of the corresponding resolution through the dual-branch feature coupling module. The CNN branch first applies a 3×3×3 convolution to extract the downsampled feature map of the upper layer, then adaptively aligns it with the Swin Transformer feature map through a 1×1×1 convolution, regularizes it with a LayerNorm module, and adds it to the Swin Transformer feature map. The Swin Transformer branch then feeds the fused features into Swin Transformer blocks to obtain a new feature representation, which is aligned by a 1×1×1 convolution with a BatchNorm module and added back to the CNN feature map. Finally, a 3×3×3 convolution adaptively adjusts the aggregated features, further improving registration accuracy. The details of the dual-branch feature coupling module are shown in Fig. 4.
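One direction of this coupling (CNN → Swin) can be sketched in NumPy: a 1×1×1 convolution is just a per-voxel linear map over channels, followed by a LayerNorm-style normalization and an element-wise add. The function names, channel counts and the simplified normalization are assumptions for illustration only:

```python
import numpy as np

def conv1x1x1(feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """A 1x1x1 convolution = per-voxel channel mixing.
    feat: (C_in, D, H, W); weight: (C_out, C_in)."""
    c, d, h, w = feat.shape
    return (weight @ feat.reshape(c, -1)).reshape(weight.shape[0], d, h, w)

def couple_into_transformer(cnn_feat, swin_feat, align_w):
    """Align the CNN feature map to the Swin branch with a 1x1x1 conv,
    normalize over channels (LayerNorm-like), and add it to the Swin
    features; the reverse (Swin -> CNN) direction is symmetric."""
    aligned = conv1x1x1(cnn_feat, align_w)
    mu = aligned.mean(axis=0, keepdims=True)
    sigma = aligned.std(axis=0, keepdims=True) + 1e-5
    return swin_feat + (aligned - mu) / sigma

rng = np.random.default_rng(2)
cnn = rng.normal(size=(8, 4, 4, 4))     # CNN feature map, C=8
swin = rng.normal(size=(16, 4, 4, 4))   # Swin feature map, C=16
w = rng.normal(size=(16, 8))            # channel-alignment weights
fused = couple_into_transformer(cnn, swin, w)
```

The additive fusion keeps each branch's resolution unchanged, so the same module can be dropped in at every encoder stage.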
In the fourth step, the decoder adaptively adjusts the deep features from the encoder and the features from the upper layer, and finally outputs the deformation field φ between the floating image and the fixed image. The encoder feature maps are connected with the upper-layer feature maps from the decoding path by skip connections, then passed through two consecutive 3×3×3 convolutional layers, and the resolution of the feature maps is doubled by an upsampling layer. Except for the last convolutional layer, each convolutional layer is followed by a LeakyReLU activation. Finally, the deformation field φ between the input image pair is obtained by a 3×3×3 convolution. The specific process can be seen in Fig. 2.
In the fifth step, the floating image and the deformation field are input into the spatial transformation network to obtain the registered image M∘φ. The deformation field φ predicted by the network is used to nonlinearly warp the floating image M. In the output image, for each voxel p, the values of the eight neighboring voxels are linearly interpolated:

M∘φ(p) = Σ_{q ∈ Z(p′)} M(q) ∏_{d ∈ {x, y, z}} (1 − |p′_d − q_d|)

where p′ = p + φ(p) is the warped position, Z(p′) is the set of its eight neighboring voxels, q is a voxel in that set, and d runs over the three spatial directions x, y and z.
In the sixth step, the similarity loss between the registered image and the fixed image and the regularization loss of the deformation field are calculated, and the network is trained by back propagation. The loss function L of the network consists of an image similarity term and a deformation-field regularization term:

L = L_sim(F, M∘φ) + λ L_reg(φ)

where L_sim denotes the image similarity loss, L_reg denotes the deformation-field regularization loss, and λ is the regularization parameter. Local normalized cross-correlation (LNCC), commonly used in the field of image registration, is adopted as the image similarity loss:

L_sim(F, M∘φ) = −Σ_{p ∈ Ω} [ Σ_{p_i} (F(p_i) − F̄(p)) ((M∘φ)(p_i) − (M∘φ)‾(p)) ]² / ( Σ_{p_i} (F(p_i) − F̄(p))² · Σ_{p_i} ((M∘φ)(p_i) − (M∘φ)‾(p))² )

where Ω denotes the spatial domain of the input images, p denotes a voxel in that domain, and F̄(p) and (M∘φ)‾(p) denote the average voxel values within the local window of size n³ centered on voxel p, over which p_i iterates. The L2 norm of the deformation-field gradient is used as the regularization loss:

L_reg(φ) = Σ_{p ∈ Ω} ||∇φ(p)||²
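A NumPy sketch of the two loss terms, under simplifying assumptions labeled in the comments: the LNCC variant here uses non-overlapping n³ windows rather than one window per voxel, and the regularizer uses forward differences for ∇φ:

```python
import numpy as np

def lncc(fixed: np.ndarray, warped: np.ndarray, n: int = 3) -> float:
    """Simplified LNCC: mean squared normalized cross-correlation over
    non-overlapping n^3 windows (the patent's version centers a window
    on every voxel; this stride-n variant is for illustration)."""
    total, count = 0.0, 0
    d, h, w = fixed.shape
    for i in range(0, d - n + 1, n):
        for j in range(0, h - n + 1, n):
            for k in range(0, w - n + 1, n):
                f = fixed[i:i+n, j:j+n, k:k+n]
                g = warped[i:i+n, j:j+n, k:k+n]
                f = f - f.mean()
                g = g - g.mean()
                num = (f * g).sum() ** 2
                den = (f * f).sum() * (g * g).sum() + 1e-9
                total += num / den
                count += 1
    return total / count

def grad_l2(flow: np.ndarray) -> float:
    """L2 norm of the deformation-field gradient via forward differences;
    flow has shape (3, D, H, W)."""
    return sum(float((np.diff(flow, axis=ax + 1) ** 2).sum()) for ax in range(3))

rng = np.random.default_rng(4)
img = rng.normal(size=(6, 6, 6))
sim_self = lncc(img, img)       # identical images -> near-perfect correlation
```

A training step would then minimize `-sim + lam * reg`, pushing the warped image toward the fixed image while keeping the deformation smooth.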
The effect of the present invention can be further illustrated by the following simulation experiments:
simulation conditions
The simulations use two three-dimensional brain datasets, Mindboggle101 and LPBA40.
Mindboggle101 and LPBA40 contain 101 and 40 T1-weighted MR images, respectively. Each Mindboggle101 image has a segmentation mask with 25 anatomical labels, and each LPBA40 image has a segmentation mask with 56 anatomical labels. For the Mindboggle101 dataset, the 42 images (1,722 image pairs) in the NKIRS-22 and NKI-TRT-20 subsets were selected for training, and the 20 images (380 pairs) in the OASIS-TRT-20 subset were selected for testing. On the LPBA40 dataset, the first 30 images (870 pairs) were taken as the training set and the remaining 10 images (90 pairs) as the test set. The registration results were evaluated with the Dice coefficient and the 95th-percentile Hausdorff distance (HD95). The larger the Dice coefficient, the larger the overlap between the two regions and the better the registration; the smaller the HD95 value, the smaller the distance between the point sets of the two regions and the better the registration.
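The Dice coefficient used for evaluation is straightforward to compute from two binary anatomical masks; a short sketch (the toy masks are illustrative):

```python
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice overlap of two binary masks: 2|A intersect B| / (|A| + |B|)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

a = np.zeros((4, 4, 4), dtype=bool); a[:2] = True   # 32 voxels
b = np.zeros((4, 4, 4), dtype=bool); b[1:3] = True  # 32 voxels, 16 shared
overlap = dice(a, b)                                # -> 0.5
```

For a multi-label evaluation such as the 25 or 56 anatomical labels here, the per-label Dice scores are typically averaged.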
The experiments were carried out under the Ubuntu 18.04 operating system on two NVIDIA GeForce RTX 2080Ti GPUs with 11 GB of memory each; the software environment is Python 3.7, and the model is implemented on the PyTorch framework. Adam is used as the optimizer, the batch size is set to 1, the learning rate is 1e-4, and the regularization parameter λ is set to 1 on the Mindboggle101 dataset and to 5 on the LPBA40 dataset.
Simulation content
To test the performance of the algorithm, the proposed image registration method based on Swin Transformer and CNN dual-branch coupling (Proposed) is compared with other currently internationally advanced registration algorithms. The comparison methods include VoxelMorph (VM), ViT-V-Net (V-V-N) and TransMorph (TM), among others. Meanwhile, to demonstrate the effectiveness of fusing the Swin Transformer and CNN encoder branches in the proposed method, VoxelMorph-Huge (VM-H, with an increased number of convolutional-layer channels) and TransMorph-Large (TM-L, with an increased embedding dimension C, number of Swin Transformer blocks and number of attention heads) are also compared. The hyperparameters of all comparison experiments were kept consistent.
Analysis of simulation experiment results
Table 1 shows the initial values of the two evaluation indices on the two datasets, the results of the various comparison methods and the results of the proposed method, together with the inference time of each method. The proposed method achieves the best registration accuracy on the test sets of both the Mindboggle101 and LPBA40 datasets compared with the other methods. Compared with VoxelMorph-Huge and TransMorph-Large, the proposed method attains higher registration accuracy with less inference time, demonstrating the effectiveness of the complementary dual-branch fusion of Swin Transformer and CNN. The results of the proposed method and the comparison methods are shown in Figs. 5-6. The simulation results on the two groups of real datasets demonstrate the effectiveness of the method.
TABLE 1
Claims (7)
1. An image registration method based on Swin Transformer and CNN dual-branch coupling, characterized by comprising the following steps:
the first step: performing standard preprocessing of gray-value normalization, center cropping, resampling and affine transformation on all images in the original data;
the second step: concatenating the floating image and the fixed image, feeding the concatenated images into the registration network, and passing them in parallel through two encoder branches, a Swin Transformer branch and a CNN branch;
the third step: at each stage of the Swin Transformer, performing feature interaction and fusion between the Swin Transformer feature map and the CNN feature map of the corresponding resolution through a dual-branch feature coupling module;
the fourth step: the decoder adaptively adjusts the deep features from the encoder and the features from the upper layer, and finally outputs a deformation field between the floating image and the fixed image;
the fifth step: inputting the floating image and the deformation field into a spatial transformation network to obtain a registered image;
the sixth step: calculating the similarity loss between the registered image and the fixed image and the regularization loss of the deformation field, and performing back propagation to train the network.
2. The image registration method based on Swin Transformer and CNN dual-branch coupling of claim 1, wherein the first step performs standard preprocessing of gray-value normalization, center cropping, resampling and affine transformation on all images in the original data;
the gray-value normalization step scales the gray values of the image to the [0, 1] interval according to:

I_norm = (I − I_min) / (I_max − I_min)

where I_min and I_max represent the minimum and maximum gray values in the image, respectively.
3. The image registration method based on Swin Transformer and CNN dual-branch coupling of claim 1, wherein the second step concatenates the floating image and the fixed image, feeds the concatenated images into the registration network, and passes them in parallel through the Swin Transformer and CNN encoder branches, implemented as follows: a floating image and a fixed image are randomly selected from the processed data, concatenated, fed into the registration network, and passed in parallel through the two encoder branches; the floating image and the fixed image are denoted M and F, respectively;
in the Swin Transformer branch, the input image is first divided into non-overlapping 3D patches, each of size P×P×P; let x^i denote the i-th patch, where i ∈ {1, …, N} and N is the total number of patches; each patch is flattened and treated as a token, and each token is then projected to a C-dimensional feature representation using a linear mapping layer:

z_0 = [x^1 E; x^2 E; …; x^N E]

where E denotes the linear mapping and the output z_0 has dimension N×C;
after the linear mapping layer, the branch has 4 consecutive stages; the 1st stage consists of a linear mapping layer and several Swin Transformer blocks; each of the other 3 stages consists of a Patch Merging layer and several Swin Transformer blocks; a Swin Transformer block outputs the same number of tokens as it receives, while the Patch Merging layer concatenates the features of each group of 2×2×2 neighboring tokens, producing an 8C-dimensional feature embedding, whose dimension is then reduced to 2C by a linear layer; in this branch, the outputs of two consecutive Swin Transformer blocks are calculated as follows:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

where W-MSA and SW-MSA are the window-based and shifted-window multi-head self-attention modules, respectively; ẑ^l and z^l denote the outputs of W-MSA and SW-MSA; MLP and LN denote the multilayer perceptron and the layer-normalization layer, respectively; the shifted-window mechanism uses a 3D cyclic shift to compute self-attention according to:

Attention(Q, K, V) = SoftMax(QK^T / √d) V

where Q, K and V denote the Query, Key and Value matrices, respectively, and d denotes the dimension of Query and Key;
the CNN branch adopts a feature pyramid structure, in which the resolution of the feature maps decreases with network depth while the number of channels increases layer by layer; 3D convolutions with kernel size 3×3×3 are used throughout, each followed by a LeakyReLU layer, and the downsampling operation is performed by a max pooling layer.
4. The image registration method based on Swin Transformer and CNN dual-branch coupling of claim 1, wherein the third step, at each stage of the Swin Transformer, performs feature interaction and fusion between the Swin Transformer feature map and the CNN feature map of the corresponding resolution through the dual-branch feature coupling module; the CNN branch first applies a 3×3×3 convolution to extract the downsampled feature map of the upper layer, then adaptively aligns it with the Swin Transformer feature map through a 1×1×1 convolution, regularizes it with a LayerNorm module, and adds it to the Swin Transformer feature map; the Swin Transformer branch then feeds the fused features into Swin Transformer blocks to obtain a new feature representation, which is aligned by a 1×1×1 convolution with a BatchNorm module and added back to the CNN feature map; finally, a 3×3×3 convolution adaptively adjusts the aggregated features.
5. The image registration method based on Swin Transformer and CNN dual-branch coupling of claim 1, wherein the fourth-step decoder adaptively adjusts the deep features from the encoder and the features from the upper layer, and finally outputs the deformation field φ between the floating image and the fixed image; the encoder feature maps are connected with the upper-layer feature maps from the decoding path by skip connections, then passed through two consecutive 3×3×3 convolutional layers, and the resolution of the feature maps is doubled by an upsampling layer; except for the last convolutional layer, each convolutional layer is followed by a LeakyReLU activation; finally, the deformation field φ between the input image pair is obtained by a 3×3×3 convolution.
6. The image registration method based on Swin Transformer and CNN dual-branch coupling of claim 1, wherein the fifth step inputs the floating image and the deformation field into the spatial transformation network to obtain the registered image M∘φ; the deformation field φ predicted by the network nonlinearly warps the floating image M; in the output image, for each voxel p, the values of the eight neighboring voxels are linearly interpolated:

M∘φ(p) = Σ_{q ∈ Z(p′)} M(q) ∏_{d ∈ {x, y, z}} (1 − |p′_d − q_d|)

where p′ = p + φ(p) is the warped position, Z(p′) is the set of its eight neighboring voxels, q is a voxel in that set, and d runs over the three spatial directions x, y and z.
7. The Swin Transformer and CNN dual-branch coupling-based image registration method of claim 1, wherein: in the sixth step, the similarity loss between the registered image and the fixed image and the regularization loss of the deformation field are calculated, and the network is trained by back propagation; the loss function L of the network consists of an image similarity term and a deformation-field regularization term:

L = L_sim(F, M∘φ) + λ · L_reg(φ)

where L_sim(F, M∘φ) represents the image similarity loss, L_reg(φ) the deformation-field regularization loss, and λ the regularization parameter; the local normalized cross-correlation (LNCC) commonly used in image registration is adopted as the image similarity loss; writing W = M∘φ for the warped image:

LNCC(F, W) = Σ_{p∈Ω} [ Σ_{p_i} (F(p_i) − F̄(p)) (W(p_i) − W̄(p)) ]² / ( Σ_{p_i} (F(p_i) − F̄(p))² · Σ_{p_i} (W(p_i) − W̄(p))² )

where Ω denotes the spatial domain of the input images, p a voxel in Ω, p_i the voxels in a local window of size n³ centred on p, and F̄(p) and W̄(p) the mean voxel values within that window; the L2 norm of the deformation-field gradient is used as the regularization loss:

L_reg(φ) = Σ_{p∈Ω} ||∇u(p)||²
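Both loss terms can be sketched in NumPy. This is an illustrative version (function names hypothetical) that evaluates LNCC over all overlapping n³ windows and approximates the deformation-field gradient with forward differences:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lncc_loss(fixed, warped, n=3, eps=1e-5):
    """Negative mean local normalized cross-correlation over n^3 windows.

    Returns a value in [-1, 0]; -1 means perfect local correlation."""
    wf = sliding_window_view(fixed, (n, n, n))
    wm = sliding_window_view(warped, (n, n, n))
    axes = (-3, -2, -1)
    f0 = wf - wf.mean(axis=axes, keepdims=True)   # F(p_i) - mean_F(p)
    m0 = wm - wm.mean(axis=axes, keepdims=True)   # W(p_i) - mean_W(p)
    cc = (f0 * m0).sum(axes) ** 2 / (
        (f0 ** 2).sum(axes) * (m0 ** 2).sum(axes) + eps)
    return -cc.mean()

def grad_l2_loss(phi):
    """L2 norm of the deformation-field gradient via forward differences.

    phi: (D, H, W, 3) displacement field."""
    loss = 0.0
    for axis in range(3):
        d = np.diff(phi, axis=axis)
        loss += (d ** 2).mean()
    return loss / 3.0
```

In training, the total loss would then be `lncc_loss(...) + lam * grad_l2_loss(...)`, matching L = L_sim + λ·L_reg above.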
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210650873.1A CN115082293A (en) | 2022-06-10 | 2022-06-10 | Image registration method based on Swin Transformer and CNN dual-branch coupling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115082293A true CN115082293A (en) | 2022-09-20 |
Family
ID=83251729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210650873.1A Pending CN115082293A (en) | 2022-06-10 | 2022-06-10 | Image registration method based on Swin Transformer and CNN dual-branch coupling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082293A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795683A (en) * | 2022-12-08 | 2023-03-14 | 四川大学 | Wing profile optimization method fusing CNN and Swin Transformer network |
CN115795683B (en) * | 2022-12-08 | 2023-07-21 | 四川大学 | Airfoil optimization method integrating CNN and Swin Transformer network |
CN116188816A (en) * | 2022-12-29 | 2023-05-30 | 广东省新黄埔中医药联合创新研究院 | Acupoint positioning method based on cyclic consistency deformation image matching network |
CN116012344A (en) * | 2023-01-29 | 2023-04-25 | 东北林业大学 | Cardiac magnetic resonance image registration method based on masked-autoencoder CNN-Transformer |
CN116012344B (en) * | 2023-01-29 | 2023-10-20 | 东北林业大学 | Cardiac magnetic resonance image registration method based on masked-autoencoder CNN-Transformer |
CN116051519A (en) * | 2023-02-02 | 2023-05-02 | 广东国地规划科技股份有限公司 | Method, device, equipment and storage medium for detecting double-time-phase image building change |
CN116051519B (en) * | 2023-02-02 | 2023-08-22 | 广东国地规划科技股份有限公司 | Method, device, equipment and storage medium for detecting double-time-phase image building change |
CN116071226A (en) * | 2023-03-06 | 2023-05-05 | 中国科学技术大学 | Electronic microscope image registration system and method based on attention network |
CN116958556A (en) * | 2023-08-01 | 2023-10-27 | 东莞理工学院 | Dual-channel complementary spine image segmentation method for vertebral body and intervertebral disc segmentation |
CN116958556B (en) * | 2023-08-01 | 2024-03-19 | 东莞理工学院 | Dual-channel complementary spine image segmentation method for vertebral body and intervertebral disc segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110738697B (en) | Monocular depth estimation method based on deep learning | |
CN115082293A (en) | Image registration method based on Swin Transformer and CNN dual-branch coupling | |
CN111339903B (en) | Multi-person human body posture estimation method | |
CN112651973B (en) | Semantic segmentation method based on cascade of feature pyramid attention and mixed attention | |
CN111612807B (en) | Small target image segmentation method based on scale and edge information | |
WO2023185243A1 (en) | Expression recognition method based on attention-modulated contextual spatial information | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN111680695A (en) | Semantic segmentation method based on reverse attention model | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN113177555B (en) | Target processing method and device based on cross-level, cross-scale and cross-attention mechanism | |
JP7337268B2 (en) | Three-dimensional edge detection method, device, computer program and computer equipment | |
CN110738663A (en) | Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method | |
CN113159232A (en) | Three-dimensional target classification and segmentation method | |
CN115731441A (en) | Target detection and attitude estimation method based on data cross-modal transfer learning | |
CN112001225A (en) | Online multi-target tracking method, system and application | |
CN110930378A (en) | Emphysema image processing method and system based on low data demand | |
CN116563682A (en) | Attention scheme and strip convolution semantic line detection method based on depth Hough network | |
CN110633706B (en) | Semantic segmentation method based on pyramid network | |
CN115457509A (en) | Traffic sign image segmentation algorithm based on improved space-time image convolution | |
Xu et al. | Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation | |
Gao et al. | Robust lane line segmentation based on group feature enhancement | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
Li et al. | A new algorithm of vehicle license plate location based on convolutional neural network | |
CN115457263A (en) | Lightweight portrait segmentation method based on deep learning | |
CN113344110B (en) | Fuzzy image classification method based on super-resolution reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||