WO2024082441A1

WO2024082441A1 - Deep learning-based multi-modal image registration method and system, and medium

Info

Publication number: WO2024082441A1
Application number: PCT/CN2022/142807
Authority: WO
Inventors: 刘洁; 王涛; 顾力栩
Original assignee: 上海精劢医疗科技有限公司; 精劢医疗科技南通有限公司; 上海偌劢机器人科技有限公司
Priority date: 2022-10-21
Filing date: 2022-12-28
Publication date: 2024-04-25
Also published as: CN115690178A

Abstract

A deep learning-based multi-modal image registration method and system, and a medium. The method comprises: acquiring three-dimensional images of different modalities, wherein the three-dimensional images comprise a reference image and a floating image; acquiring areas to be registered of the three-dimensional images, and detecting image feature points in an area to be registered of the reference image; obtaining image blocks according to the image feature points and inputting the image blocks into a similarity network to obtain a similarity graph in a corresponding range of the floating image; inputting the coordinates of the image feature points, the image blocks and the similarity graph into a displacement network to obtain a displacement vector, and performing interpolation on an area having no image feature point to obtain a displacement vector field; and performing spatial transformation on the floating image according to the displacement vector field.

Description

Multimodal image registration method, system and medium based on deep learning

Related Applications

This application claims priority to the Chinese patent application filed on October 21, 2022, with application number 202211296302.9, and invention name “Cross-module non-rigid registration method, system and medium based on deep learning”, the entire contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of image processing technology, and in particular, to a multimodal image registration method, system and medium based on deep learning.

Background technique

Modern medical diagnosis requires the support of various medical images. Common medical imaging modalities include computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound (US), each with its own imaging characteristics. CT has obvious imaging advantages for high-density tissues in the human body, such as bones; MRI has better resolution for soft tissues, etc. The fusion of multiple modal images can provide complementary information to better achieve the purpose of diagnosis, evaluation or intervention. For example, in computer-aided diagnosis, the fusion of multimodal images can fully combine the tissue characteristics reflected by different modal images to give a more accurate judgment on whether there is a lesion, the nature of the lesion, and the range. In minimally invasive surgical navigation, the fusion of preoperative images and intraoperative images can achieve the superposition of preoperative planning and intraoperative images, which can provide doctors with richer and more intuitive information, improve the quality of image guidance during intervention, thereby improving the quality of surgery and clinical outcomes. However, images of different modalities are usually acquired at different time points using different scanning instruments. This process is accompanied by changes in the patient's posture and internal anatomical structure. Therefore, the prerequisite for achieving multimodal image fusion is to perform multimodal medical image registration, and the accuracy of the registration directly determines the effect of the fusion.

Multimodal medical image registration is a challenging problem. The relationship between the grayscale distribution of medical images of different modalities is often complex and uncertain. In addition, structures and features that exist in one modality may be missing in another modality. Traditional multimodal registration methods can be roughly divided into grayscale-based registration methods and anatomical feature-based registration methods. Grayscale-based registration methods mainly use multimodal similarity measures, such as mutual information and cross-correlation; anatomical feature-based registration methods mainly rely on landmarks identified in images of different modalities. In recent years, deep learning technology has developed rapidly, and has also been increasingly studied and applied in the field of image registration, which is expected to solve the problems of slow registration speed and insufficient registration accuracy in traditional registration.

Summary of the invention

According to various embodiments of the present application, a multimodal image registration method, system and medium based on deep learning are provided.

Multimodal image registration methods based on deep learning include:

Acquire three-dimensional images of different modalities, wherein the three-dimensional images include at least one reference image and at least one floating image; acquire a region to be registered of the three-dimensional image, detect image feature points in the region to be registered of the reference image, wherein the image feature points are points that can be distinguished from image features of other points in a neighborhood; obtain image blocks of a preset size with each of the image feature points as the center; input the image blocks into a similarity network to obtain a similarity graph within a corresponding range of the floating image; input the coordinate information of the image feature points, the image blocks of the reference image and the corresponding similarity graph into a displacement network to obtain a displacement vector; interpolate the region without image feature points based on the displacement vector to obtain a displacement vector field; and perform spatial transformation on the floating image according to the displacement vector field to obtain a registration result.

In some embodiments, the region to be registered is determined through manual interaction, or is determined based on a grayscale threshold of the image, or is determined by automatically detecting and segmenting a specific structure in the image.

In some embodiments, the method of acquiring the image feature points includes:

Sampling voxel points from the to-be-registered region of the reference image, obtaining feature scores according to the grayscale variance and gradient value in the neighborhood of the sampling points, and taking points with feature scores higher than a first preset value as the image feature points;

Alternatively, a specific structure in the area to be registered of the reference image is segmented, a feature score is obtained based on the positional relationship between each boundary point of the specific structure and the boundary points around it, and the boundary points with feature scores greater than a second preset value are used as the image feature points.

In some embodiments, the feature score S(p) of the voxel point at the coordinate p in the image I is determined according to the Foerstner operator, and its expression is:

Among them, _Kσ represents the Gaussian kernel function with variance σ,

is the value of the spatial gradient of image I at coordinate p, and Tr(·) represents the trace of the matrix.

In some embodiments, the registration method further comprises:

The number and distribution of the image feature points are adjusted, and the adjustment includes any of the following:

Adjust the distribution of the image feature points: use a sampling window of a set size to scan the reference image, and when two or more image feature points appear in the sampling window, only retain the image feature point with the largest feature score;

Or, adjusting the number and distribution of the image feature points: when the number of the image feature points is greater than a third preset value, randomly selecting a point from the detected image feature points as an adjustment point set, and selecting a point farthest from the adjustment point set from the remaining image feature points each time to add to the adjustment point set, until the number of the image feature points in the adjustment point set reaches a fourth preset value, and the distance of the image feature point from the adjustment point set is the minimum value of the Euclidean distance from the point to all the image feature points in the adjustment point set;

Or, adjust the number and distribution of the image feature points: when the number of the image feature points is greater than a third preset value, use all the image feature points to construct an octree, traverse the octree according to the breadth-first principle, and if the point with the largest feature score in the current subtree is not in the adjustment point set, then add the point with the largest feature score in the current subtree to the adjustment point set until the number of the image feature points in the adjustment point set reaches a fourth preset value.

In some embodiments, the input of the similarity network is the image blocks corresponding to the reference image and the floating image, the image block of the reference image has a size of W ₁ ×H ₁ ×D ₁ , the image block of the floating image includes a specified detection range, has a size of W ₂ ×H ₂ ×D ₂ , and satisfies W ₁ ≤W ₂ ,H ₁ ≤H ₂ ,D ₁ ≤D ₂ ;

The output of the similarity network is a similarity map corresponding to the image feature points, and the size of the similarity map is [(W ₂ -W ₁ )/q+1]×[(H ₂ -H ₁ )/q+1]×[(D ₂ -D ₁ )/q+1], where q is a downsampling coefficient.

In some embodiments, the displacement network includes an encoding part, an interaction part and a decoding part; the input of the encoding part includes a similarity map of each image feature point, an image block of the corresponding reference image and coordinate information of the corresponding image feature point, the interaction part receives the encoding results of all image feature points, and encodes the interaction information between different image feature points, the decoding part receives the output of the encoding part, the interaction part and some intermediate states, and outputs a displacement vector corresponding to each image feature point; the displacement vector is obtained by first obtaining a displacement probability map, and then taking the pixel value in the displacement probability map as the weight of the pixel corresponding coordinate for weighted averaging; the encoding part and the decoding part are connected by a jump connection.

In some embodiments, the registration method further comprises:

The similarity of the local structure between the reference image and the floating image is obtained by specifying features; an objective function is constructed according to the similarity and a smooth constraint, and the displacement vector field is locally adjusted by minimizing the objective function.

The present application also provides a multimodal image registration system based on deep learning, comprising:

A module for acquiring a region to be registered: acquiring three-dimensional images of different modalities, wherein the three-dimensional images include at least one reference image and at least one floating image, and acquiring a region to be registered of the three-dimensional images;

Image feature point detection module: detects image feature points in the to-be-registered area of the reference image, wherein the image feature points are points that can be distinguished from the image features of other points in the neighborhood;

Similarity graph acquisition module: taking each of the image feature points as the center, obtaining an image block of a preset size; inputting the image block into a similarity network to obtain a similarity graph within the corresponding range of the floating image;

A displacement vector field acquisition module: inputs the coordinate information of the image feature point, the image block of the reference image and the corresponding similarity map into the displacement network to obtain a displacement vector; interpolates the area without image feature points based on the displacement vector to obtain a displacement vector field;

Registration module: performs spatial transformation on the floating image according to the displacement vector field to obtain a registration result.

The present application also provides a computer-readable storage medium, on which a deep learning-based multimodal image registration program is stored. When the deep learning-based multimodal image registration program is executed by a processor, the above-mentioned deep learning-based multimodal image registration method is implemented.

Details of one or more embodiments of the present application are set forth in the following drawings and description to make other features, objects, and advantages of the present application more readily apparent.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to better describe and illustrate the embodiments and/or examples of the present application disclosed herein, reference may be made to one or more drawings. The additional details or examples used to describe the drawings should not be considered as limiting the scope of any of the disclosed applications, the embodiments and/or examples currently described, and the best modes of these applications currently understood.

FIG1 is a flow chart of a modality image registration method based on deep learning according to an embodiment of the present application;

FIG2 is a schematic diagram of a similarity network structure according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a displacement network structure according to an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described and illustrated below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit the present application. Based on the embodiments provided in the present application, all other embodiments obtained by ordinary technicians in this field without making creative work are within the scope of protection of the present application. In addition, it can also be understood that although the efforts made in this development process may be complex and lengthy, for ordinary technicians in the field related to the contents disclosed in the present application, some changes such as design, manufacturing or production based on the technical contents disclosed in the present application are only conventional technical means, and should not be understood as insufficient contents disclosed in the present application.

Reference to "embodiments" in this application means that a particular feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor is it an independent or alternative embodiment that is mutually exclusive with other embodiments. It is explicitly and implicitly understood by those of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments without conflict.

Unless otherwise defined, the technical terms or scientific terms involved in this application should be the usual meanings understood by people with ordinary skills in the technical field to which this application belongs. The words "one", "a", "a", "the" and the like involved in this application do not indicate a quantitative limitation and may represent the singular or plural. The "multiple" involved in this application means greater than or equal to two. The terms "including", "comprising", "having" and any variations thereof involved in this application are intended to cover non-exclusive inclusions. The "first", "second", etc. involved in this application are only for distinguishing objects and are not limitations on the objects.

The present application discloses a multimodal image registration method based on deep learning, as shown in FIG1 , comprising the following steps:

Step S1: Acquire three-dimensional images of different modalities, wherein the three-dimensional images include at least one reference image and at least one floating image; and acquire areas to be registered of the two images.

By reading three-dimensional images of different modalities covering the same area of the same patient, modality 1 is designated as the reference image, modality 2 is designated as the floating image and interpolated to set the same spatial resolution as modality 1.

The three-dimensional image can be CT, MRI, ultrasound (three-dimensional ultrasound or three-dimensional ultrasound image reconstructed from a series of two-dimensional ultrasound images), etc. During the registration process, the optimal spatial transformation of the floating image is sought and mapped to the coordinate system of the reference image so that the corresponding human anatomical points in the two modal images are spatially consistent.

The area to be registered can be determined by manual interaction, or by the grayscale threshold of the image, or by automatic detection and segmentation of specific structures in the image. A special case is that the area to be registered is the entire image.

Step S2: Detect image feature points in the area to be registered of the reference image. Sample points in the area to be registered of the reference image, obtain feature scores based on the neighborhood information of the sampled points, and use points with feature scores greater than a set threshold as image feature points. The image feature points are obtained in the following way:

Grid sampling or random sampling is performed from the area to be registered in the reference image, and a three-dimensional operator constructed based on the grayscale variance, gradient value, etc. in the neighborhood of the sampling point is used to determine the feature score, and the point with a feature score higher than the first preset value is used as the image feature point. For example, the Foerstner operator is a commonly used three-dimensional feature point detection operator, which can be used to obtain the feature score S(p) of the pixel point at the coordinate p in the image I, and its expression is as follows:

Among them, _Kσ represents the Gaussian kernel function with variance σ,

Another implementation method for obtaining image feature points is to segment the specific structure in the to-be-registered region of the reference image, obtain a feature score based on the positional relationship between each boundary point of the specific structure and its surrounding boundary points, and use the boundary points with a feature score greater than a second preset value as image feature points. For example, a curvature value is obtained through the positional relationship between each boundary point and its surrounding boundary points, and a point with a curvature value greater than a set threshold is used as an image feature point.

Step 3: Adjust the data and distribution of image feature points. In order to prevent image feature points from being concentrated in the same area or being too unevenly distributed in the area to be registered, adjust the number and distribution of image feature points to prevent a large number of image feature points from being stored in the subsequent image blocks. The adjustment includes any of the following:

Adjust the distribution of image feature points: Use a sampling window of a set size to scan the reference image. When two or more image feature points appear in the sampling window, only the image feature point with the largest feature score is retained.

Or, adjust the number and distribution of image feature points: when the number of image feature points is greater than a third preset value, randomly select a point from the detected image feature points as the adjustment point set, and each time select the point farthest from the adjustment point set from the remaining image feature points to add to the adjustment point set until the number of image feature points in the adjustment point set reaches a fourth preset value, and the distance of the image feature point from the adjustment point set is the minimum value of the Euclidean distance from the point to all image feature points in the adjustment point set.

Or, adjust the number and distribution of image feature points: when the number of image feature points is greater than the third preset value, use all the image feature points to construct an octree, and traverse the octree according to the breadth-first principle. If the point with the largest feature score in the current subtree is not in the adjustment point set, then add the point with the largest feature score in the current subtree to the adjustment point set until the number of image feature points in the adjustment point set reaches a fourth preset value.

Step S4: Taking each image feature point as the center, taking an image block that contains a specified range of neighborhood with the feature point as the center, inputting the image block into a similarity network, and obtaining a similarity graph within the corresponding range of the floating image.

As shown in FIG2 , the input of the similarity network is the image blocks corresponding to the reference image and the floating image, the image block size of the reference image is W ₁ ×H ₁ ×D ₁ , the image block of the floating image includes the specified detection range, the size of which is W ₂ ×H ₂ ×D ₂ , and satisfies W ₁ ≤W ₂ , H ₁ ≤H ₂ , D ₁ ≤D ₂ . The output of the similarity network is the similarity graph of the corresponding image feature points, the size of the similarity graph is [(W ₂ -W ₁ )/q+1]×[(H ₂ -H ₁ )/q+1]×[(D ₂ -D ₁ )/q+1], where q is the downsampling coefficient. The size of the value of any point in the similarity graph indicates: the possibility that the corresponding position of the value predicted based on the local image feature in the floating image corresponds to the same anatomical point as the image feature point in the reference image.

The similarity network is a convolutional neural network based on self-supervised training. The peak value of the similarity graph is used to construct a contrast loss function to determine whether the floating image block contains the anatomical structure corresponding to the feature point of the reference image.

Specifically, the image blocks of the reference image and the image blocks of the floating image can be feature encoded respectively through two convolutional neural network branches to obtain a first feature map after the image blocks of the reference image are encoded and a second feature map after the image blocks of the floating image are encoded. A sliding window convolution operation is performed on the first feature map on the second feature map, and the similarity map is obtained after normalization.

Step S5: input the coordinate information of the image feature points, the image block of the reference image and the corresponding similarity graph into the displacement network to obtain a displacement vector.

As shown in Figure 3, the displacement network includes an encoding part, an interaction part and a decoding part. Among them, the input of the encoding part includes a similarity map of each image feature point, an image block of the corresponding reference image and the coordinate information of the corresponding image feature point. The above three input items can be encoded separately and output to the interaction part, or all or two of the above three input items can be integrated (such as splicing, addition, etc.), and then jointly encoded and output to the interaction part. The interaction part receives the encoding results of all image feature points and encodes the interaction information between different image feature points. The decoding part receives the output of the encoding part, the interaction part and some intermediate states, and outputs the displacement vector corresponding to each image feature point.

Specifically, the encoding part encodes the similarity graph of the image feature points and the image block of the corresponding reference image respectively through two convolutional neural network branches, encodes the position of the image feature points in a fixed manner, and adds it to the image block code. The interaction part can be constructed by a self-attention mechanism: the relationship between different image feature points is obtained and encoded through the self-attention layer, and the feature transformation is further performed through the feedforward layer. The structure composed of the above self-attention layer and the feedforward layer can be cascaded in multiple levels. The decoding part is constructed based on the convolutional neural network, and a normalized displacement probability map can be obtained. The size of the value of any point in the displacement probability map indicates: the possibility that the corresponding position of the value in the floating image and the feature point in the reference image correspond to the same anatomical point based on the position distribution of all image feature points, the image block of the reference image and the corresponding similarity map prediction. The pixel value in the displacement probability map is used as the weight of the corresponding coordinate of the pixel for weighted averaging to obtain a displacement vector; or, the coordinate corresponding to the maximum value in the displacement probability map is taken as the displacement vector. The encoding part and the decoding part are connected by a jump connection. In practical applications, multiple displacement network structures can also be cascaded.

Furthermore, the interaction part can also be constructed based on graph neural network.

Step S6: interpolating the region without image feature points based on the displacement vector to obtain a displacement vector field.

One way to store the displacement vector field is as a 6-dimensional matrix, where the first three dimensions are the same size as the modality 1 image, and the last three dimensions represent the displacement vectors that map the corresponding pixel points to the modality 2 image.

To ensure the smoothness of the displacement vector field, cubic linear interpolation can be used.

Step S7: locally adjust the displacement vector field to obtain the final displacement vector field.

A specific adjustment method is to obtain the similarity of the local structure between the reference image and the floating image by specifying features; construct an objective function based on the similarity and smoothness constraints, and locally adjust the displacement vector field by minimizing the objective function. For example, the modality independent neighborhood descriptor (MIND) is a common multimodal image feature, which can be used as a specified feature and extracted from two modal images respectively, and the square difference of the modality independent neighborhood descriptors of the two modal images is used to measure the similarity of the local structure. It is also possible to use a similarity network, and use its output to replace the above-mentioned specified feature similarity to construct an objective function to perform local adjustments to the displacement vector field.

Step S8: Perform spatial transformation on the floating image according to the optimized displacement vector field to obtain a registration result.

The present application also discloses a multimodal image registration system based on deep learning, comprising:

The present application also discloses a computer-readable storage medium, such as a computer hard disk, etc., on which a multimodal image registration program based on deep learning is stored. When the multimodal image registration program based on deep learning is executed by a processor, the above-mentioned multimodal image registration method based on deep learning is implemented.

Those skilled in the art know that, in addition to implementing the system and its various devices, modules, and units provided by the present application in a purely computer-readable program code, it is entirely possible to implement the same functions of the system and its various devices, modules, and units provided by the present application in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system and its various devices, modules, and units provided by the present application can be considered as a hardware component or a software module for implementing the method.

This application reduces the interference of low-information points by extracting image feature points, and improves the efficiency of registration, especially when the image size is large. Using the similarity graph of all feature points for global optimization removes the prerequisite of accurately detecting corresponding points in two modalities. At the same time, the spatial distribution information of all feature points is considered to improve the robustness of the algorithm. The introduction of deep learning is used to fully extract the corresponding information between different modalities of the same anatomical structure, and on the other hand, it can avoid the large time overhead caused by iterative solutions in traditional methods.

The technical features of the above-described embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above-described embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

The above-described embodiments only express several implementation methods of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot be understood as limiting the scope of the patent application. It should be pointed out that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the protection scope of the patent application shall be subject to the attached claims.

Claims

A multimodal image registration method based on deep learning, characterized by comprising:

Acquire three-dimensional images of different modalities, wherein the three-dimensional images include at least one reference image and at least one floating image; acquire a region to be registered of the three-dimensional image, detect image feature points in the region to be registered of the reference image, wherein the image feature points are points that can be distinguished from image features of other points in a neighborhood; obtain image blocks of a preset size with each of the image feature points as the center; input the image blocks into a similarity network to obtain a similarity graph within a corresponding range of the floating image; input the coordinate information of the image feature points, the image blocks of the reference image and the corresponding similarity graph into a displacement network to obtain a displacement vector; interpolate the region without image feature points based on the displacement vector to obtain a displacement vector field; and perform spatial transformation on the floating image according to the displacement vector field to obtain a registration result.
According to the multimodal image registration method based on deep learning according to claim 1, wherein the area to be registered is determined by manual interaction, or is determined according to the grayscale threshold of the image, or is determined by automatic detection and segmentation of specific structures in the image.
According to the multimodal image registration method based on deep learning according to claim 1, wherein the method of acquiring the image feature points includes:

Sampling voxel points from the to-be-registered region of the reference image, obtaining feature scores according to the grayscale variance and gradient value in the neighborhood of the sampling points, and taking points with feature scores higher than a first preset value as the image feature points;

Alternatively, the specific structure in the area to be registered of the reference image is segmented, a feature score is obtained according to the positional relationship between each boundary point of the specific structure and the boundary points around it, and the boundary points with feature scores greater than a second preset value are used as the image feature points.
According to the multimodal image registration method based on deep learning in claim 3, the feature score S(p) of the voxel point at the coordinate p in the image I is determined according to the Foerstner operator, and its expression is:

Among them, Kσ represents the Gaussian kernel function with variance σ,
is the value of the spatial gradient of image I at coordinate p, and Tr(·) represents the trace of the matrix.
The multimodal image registration method based on deep learning according to claim 3, further comprising:

The number and distribution of the image feature points are adjusted, and the adjustment includes any of the following:

Adjust the distribution of the image feature points: use a sampling window of a set size to scan the reference image, and when two or more image feature points appear in the sampling window, only retain the image feature point with the largest feature score;

Or, adjusting the number and distribution of the image feature points: when the number of the image feature points is greater than a third preset value, randomly selecting a point from the detected image feature points as an adjustment point set, and selecting a point farthest from the adjustment point set from the remaining image feature points each time to add to the adjustment point set, until the number of the image feature points in the adjustment point set reaches a fourth preset value, and the distance of the image feature point from the adjustment point set is the minimum value of the Euclidean distance of the point to all the image feature points in the adjustment point set;

Or, adjust the number and distribution of the image feature points: when the number of the image feature points is greater than a third preset value, use all the image feature points to construct an octree, traverse the octree according to the breadth-first principle, and if the point with the largest feature score in the current subtree is not in the adjustment point set, then add the point with the largest feature score in the current subtree to the adjustment point set until the number of the image feature points in the adjustment point set reaches a fourth preset value.
The multimodal image registration method based on deep learning according to claim 1, wherein the input of the similarity network is the image blocks corresponding to the reference image and the floating image, the image block size of the reference image is W 1 ×H 1 ×D 1 , the image block of the floating image includes a specified detection range, the size of which is W 2 ×H 2 ×D 2 , and satisfies W 1 ≤W 2 ,H 1 ≤H 2 ,D 1 ≤D 2 ;

The output of the similarity network is a similarity map corresponding to the image feature points, and the size of the similarity map is [(W 2 -W 1 )/q+1]×[(H 2 -H 1 )/q+1]×[(D 2 -D 1 )/q+1], where q is a downsampling coefficient.
According to the multimodal image registration method based on deep learning in claim 1, the displacement network includes an encoding part, an interaction part and a decoding part; the input of the encoding part includes a similarity map of each image feature point, an image block of the corresponding reference image and coordinate information of the corresponding image feature point, the interaction part receives the encoding results of all image feature points, and encodes the interaction information between different image feature points, the decoding part receives the output of the encoding part, the interaction part and some intermediate states, and outputs a displacement vector corresponding to each image feature point; the displacement vector is obtained by first obtaining a displacement probability map, and then taking the pixel value in the displacement probability map as the weight of the pixel corresponding coordinate for weighted averaging; the encoding part and the decoding part are connected by a jump connection.
The multimodal image registration method based on deep learning according to claim 1, further comprising:

The similarity of the local structure between the reference image and the floating image is obtained by specifying features; an objective function is constructed according to the similarity and a smooth constraint, and the displacement vector field is locally adjusted by minimizing the objective function.
A multimodal image registration system based on deep learning, characterized by comprising:

A module for acquiring a region to be registered: acquiring three-dimensional images of different modalities, wherein the three-dimensional images include at least one reference image and at least one floating image, and acquiring a region to be registered of the three-dimensional images;

Image feature point detection module: detects image feature points in the to-be-registered area of the reference image, wherein the image feature points are points that can be distinguished from the image features of other points in the neighborhood;

Similarity graph acquisition module: taking each of the image feature points as the center, obtaining an image block of a preset size; inputting the image block into a similarity network to obtain a similarity graph within the corresponding range of the floating image;

A displacement vector field acquisition module: inputs the coordinate information of the image feature point, the image block of the reference image and the corresponding similarity map into the displacement network to obtain a displacement vector; interpolates the area without image feature points based on the displacement vector to obtain a displacement vector field;

Registration module: performs spatial transformation on the floating image according to the displacement vector field to obtain a registration result.
A computer-readable storage medium, characterized in that: a multimodal image registration program based on deep learning is stored on the computer-readable storage medium, and when the multimodal image registration program based on deep learning is executed by a processor, the multimodal image registration method based on deep learning as described in any one of claims 1-8 is implemented.