CN114270408A - Method for controlling a display, computer program and mixed reality display device - Google Patents


Info

Publication number
CN114270408A
Authority
CN
China
Prior art keywords
point cloud
semantic segmentation
image
medical imaging
target point
Prior art date
Legal status
Pending
Application number
CN202080058704.8A
Other languages
Chinese (zh)
Inventor
西尔科·佩尔茨尔
迈克尔·沃希尼亚克
Current Assignee
Apokole Co ltd
Original Assignee
Apokole Co ltd
Priority date
2019-09-09
Filing date
2020-09-09
Publication date
2022-04-01
Application filed by Apokole Co ltd
Priority claimed from PCT/EP2020/075127 (published as WO2021048158A1)
Publication of CN114270408A


Abstract

A method for controlling a display of a mixed reality display device, wherein a source point cloud and a target point cloud representing a surface of a treatment object are generated from image data and medical imaging data of the treatment object. A plurality of segmentation masks is determined in each point cloud by applying semantic segmentation. A transformation between the source point cloud and the target point cloud is determined using the segmentation masks, and at least a portion of the medical imaging data is superimposed on the subject using the determined transformation.

Description

Method for controlling a display, computer program and mixed reality display device
The invention relates to a method for controlling a display of a mixed reality device.
Furthermore, the invention relates to a computer program having program code means adapted to perform such a method.
Further, the invention relates to a mixed reality display device having such a computer program.
The present invention relates generally to the field of visualization of virtual information in conjunction with a real-world environment. On the display of the display device, the virtual information is superimposed on the real object. This field is commonly referred to as "mixed reality".
A "virtual continuum" extends from a purely real environment to a purely virtual environment, including augmented reality and augmented virtualization between these extremes. The term "mixed reality" is generally defined as "anywhere between the extremes of a virtual continuum," i.e., mixed reality generally includes a complete virtual continuum in addition to pure reality and pure virtual. In the context of the present application, the term "mixed reality" may particularly denote "augmented reality".
Mixed reality techniques are particularly promising for medical applications, for example for medical surgery or other medical treatments. For example, medical imaging data (CT images, MRI images, etc.) visualizing anatomical and/or physiological processes of a human or animal body may be superimposed on a real-world view of the body by a mixed reality display device. In this way, for example, a surgeon may obtain support during surgery by placing such medical imaging data virtually directly on the treatment object, i.e. on the patient's body or a part of the patient's body.
One of the most important challenges in mixed reality is the registration problem, i.e. the problem of properly aligning objects in the real world and objects in the virtual world with respect to each other. Without accurate registration, the illusion of coexistence of the two worlds will suffer. More seriously, in medical applications, inaccurate registration may lead to risks affecting medical success and even patient health. Therefore, it is crucial that visualized virtual information, such as medical imaging data, exactly match the real world, i.e. the treatment object, in position, size and perspective.
From EP 2 874 556 B1, augmented reality-based methods and corresponding systems are known which enable instrument guidance in surgery and other interventional procedures. For this purpose, an interventional path for use in an interventional procedure is obtained, wherein the interventional path is planned based on 3D image data inside the patient and a camera image outside the patient is obtained during the interventional procedure. A spatial correspondence is established between the camera image and the 3D image data, and a view of the interventional path corresponding to the camera image is calculated. Finally, the view of the interventional path is combined with the camera image to obtain a composite image that is displayed on the display.
It is an object of the present invention to provide an improved technique for visualizing virtual information in medical applications, which technique enables an improved alignment of virtual and real objects.
The object of the invention is achieved by a method for controlling a display of a mixed reality display device having the features of claim 1.
According to the invention, the method comprises at least the following steps:
a) providing an image dataset comprising a plurality of images of a treatment object, wherein the treatment object is a patient body or a part of a patient body, and the images depict the treatment object from different perspectives,
b) generating a 3D target point cloud from the image dataset, wherein the target point cloud comprises a plurality of points defined in a three-dimensional coordinate system and the points represent a surface of the treatment object,
c) determining a plurality of semantic segmentation masks in the target point cloud by applying semantic segmentation,
d) providing a medical imaging data set comprising medical imaging data of a subject to be treated,
e) generating a 3D source point cloud from the medical imaging dataset, wherein the source point cloud comprises a plurality of points defined in a three-dimensional coordinate system, and the points also represent a surface of the treatment object,
f) determining a plurality of semantic segmentation masks in the source point cloud by applying semantic segmentation,
g) determining a transformation between the source point cloud and the target point cloud using the segmentation mask of the source point cloud and the segmentation mask of the target point cloud, and
h) visualizing at least a portion of the medical imaging data on the display, wherein the medical imaging data is superimposed on the treatment object and aligned with the treatment object using the transformation between the source point cloud and the target point cloud.
The steps of the method need not be performed in the order specified and therefore do not limit the invention, i.e. the alphabetical order of the letters does not imply a specific order of steps a) to h). For example, steps a) to c) may of course be performed after steps d) to f), or some of the steps of the method may be performed in parallel.
The invention thus proposes a method for controlling a display of a mixed reality device in order to visualize medical imaging data superimposed on a treatment object.
In the context of the present application, the term "subject" refers to a patient's body or a part of a patient's body. The patient may be a human or animal, i.e. the subject may be a human or animal body or a part of a human or animal body.
The terms "mixed reality display device" and "mixed reality device" may be used interchangeably. The term "computer" is used in its broadest sense, i.e., it refers to any processing device that can be instructed to perform a sequence of arithmetic and/or logical operations.
The term "2D" refers to two-dimensional coordinates. The term "3D" refers to three-dimensional coordinates. The term "4D" refers to four-dimensional coordinates.
In addition to a display, a mixed reality device may include a computer and memory. The mixed reality device may also include multiple computers. Furthermore, the mixed reality device may comprise a camera, in particular a 3D camera system, and/or a plurality of sensors, in particular at least one depth sensor, for example a time-of-flight depth sensor. The mixed reality display device may also include a positioning system and/or an inertial measurement unit.
In step a), an image dataset comprising a plurality of images of the treatment object is provided, wherein the images depict the treatment object from different perspectives. These images represent a real-world view of the treatment object. They may be generated, for example, by a camera and/or a depth sensor, in particular by a camera and/or a depth sensor of a mixed reality device.
In step d), a medical imaging data set comprising medical imaging data of the subject of treatment is provided. The medical imaging data represents virtual information to be visualized on a display. The medical imaging data may include, for example, cross-sectional images of the treatment object. For example, medical imaging data may be generated using medical imaging methods such as Magnetic Resonance Imaging (MRI). The medical imaging data may be generated prior to and/or during a medical treatment, such as preoperatively and/or intraoperatively.
In steps b) and e), a three-dimensional point cloud representing the treatment object is generated. For this purpose, a method of 3D reconstruction may be used.
In steps c) and f), semantic segmentation is applied to determine a plurality of segmentation masks in the target point cloud and a plurality of segmentation masks in the source point cloud, respectively. Semantic segmentation (also called semantic image segmentation) may be defined as the task of clustering together image parts belonging to the same object class. In the context of the present application, an object class may be, for example, a specific part of the anatomy of a treatment object. For example, if the treatment object is a human head, simple object classes may include nose, ears, mouth, eyes, eyebrows, and the like. The term "semantic segmentation mask" refers to the use of semantic segmentation to segment a portion (or fragment) of an image that has been determined to belong to the same object class. Semantic segmentation may be performed in 2D data, i.e. on a pixel basis, or in 3D data, i.e. on a voxel basis.
In step g), a transformation between the source point cloud and the target point cloud is determined using the segmentation mask of the source point cloud and the segmentation mask of the target point cloud. In this step, the transformation may be determined as: when a transformation is applied to one of the point clouds, the points of the two point clouds are aligned with each other. The transformation may include translation and/or rotation. For example, the transform may be in the form of a transform matrix, in particular a 4 × 4 transform matrix. The determined transformation may in particular approximately transform the source point cloud into the target point cloud, or may approximately transform the target point cloud into the source point cloud.
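As an illustration of such a transformation (not taken from the patent; the rotation angle, the translation values and the function names are arbitrary assumptions), the following Python sketch builds a 4 × 4 homogeneous matrix from a rotation and a translation and applies it to an (N, 3) point cloud:

```python
# Minimal sketch: build and apply a rigid 4x4 homogeneous transform to a point cloud.
import numpy as np

def make_transform(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def apply_transform(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Transform an (N, 3) point cloud with a 4x4 matrix."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    return (homogeneous @ T.T)[:, :3]

# Example: rotate 10 degrees about the z-axis and shift 5 cm along x.
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
T = make_transform(R, np.array([0.05, 0.0, 0.0]))
source = np.random.rand(1000, 3)          # stand-in for a source point cloud
aligned = apply_transform(T, source)
```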
In addition to the segmentation mask, other parameters may be used as input to determine the transformation between point clouds. In particular, a transformation between a source point cloud and a target point cloud may be determined using a segmentation mask of the source point cloud and a segmentation mask of the target point cloud, as well as coordinates of points of the source point cloud and coordinates of points of the target point cloud.
In step h), the medical imaging data is visualized on a display, wherein the medical imaging data is superimposed on and aligned with the real-world view of the treatment object using the transformation determined in step g). In this way, a virtual fusion of the visualized medical imaging data and the real world view of the treatment object is created.
The present invention is based on the following findings: by using semantic segmentation, the registration problem can be solved more accurately and reliably. This is accomplished by determining a transformation between the source point cloud and the target point cloud based on the segmentation masks of the source point cloud and the target point cloud.
As an example, the optimization function used to determine the transformation between the source point cloud and the target point cloud may be designed to facilitate a transformation that exactly matches the corresponding semantic segmentation masks in the two point clouds, i.e., a transformation that exactly matches semantic segmentation masks having the same and/or similar object classes (e.g., nose, ear, eye, eyebrow). This may be achieved, for example, by determining a transformation between the source point cloud and the target point cloud using a four-dimensional optimization algorithm, wherein, for each point of the respective point cloud, an object class (e.g., nose, ear, mouth) of the respective semantic segmentation mask is interpreted as a fourth dimension of the point (in addition to the 3D coordinates of the point). For example, a 4D variant of the Iterative Closest Point (ICP) algorithm may be used for this purpose.
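The following Python sketch illustrates one possible reading of such a 4D ICP variant: each point is augmented with its object-class label, scaled by a weight, so that nearest-neighbour matching prefers points from the same semantic segmentation mask, while the rigid update itself is estimated from the 3D coordinates only. The label weight, the iteration count and the helper names are illustrative assumptions, not taken from the patent:

```python
# Hedged sketch of a label-augmented ("4D") ICP: nearest neighbours are searched in
# (x, y, z, w * label) space, the rigid update is estimated from 3D coordinates only.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Kabsch: rigid 4x4 transform mapping src onto dst (both (N, 3))."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, dst_c - R @ src_c
    return T

def icp_4d(src_pts, src_lbl, dst_pts, dst_lbl, label_weight=0.05, iters=30):
    """Iteratively align src to dst using 4D (coordinates + class label) matching."""
    T_total = np.eye(4)
    src = src_pts.copy()
    tree = cKDTree(np.hstack([dst_pts, label_weight * dst_lbl[:, None]]))
    for _ in range(iters):
        query = np.hstack([src, label_weight * src_lbl[:, None]])
        _, idx = tree.query(query)                  # 4D nearest neighbours
        T = best_rigid_transform(src, dst_pts[idx])
        src = src @ T[:3, :3].T + T[:3, 3]          # apply the incremental update
        T_total = T @ T_total
    return T_total                                  # 4x4 source-to-target transform
```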
The invention enables a very accurate and reliable alignment of virtual information of medical imaging data with a real world view of a treatment object. Thus, an improved virtual fusion of the visualized medical imaging data and the real world view of the treatment object may be created and the risks affecting the medical success and affecting the health of the patient may be avoided.
According to an advantageous embodiment of the invention, it is proposed to design the display as an optical see-through display, in particular an optical see-through head mounted display.
These embodiments of the invention provide the advantage of a real and intuitive perception of a real-world environment for a user, e.g. a surgeon.
According to a further advantageous embodiment of the invention, it is proposed that the mixed reality display device comprises or consists of a head-mounted mixed reality display device and/or mixed reality smart glasses. The mixed reality display device may for example comprise or consist of a Microsoft HoloLens device or a Microsoft HoloLens 2 device or similar.
These embodiments of the invention provide the advantage of convenient use while at the same time providing powerful hardware for visualizing virtual information to the user. In particular, many head mounted mixed reality display devices and mixed reality smart glasses include powerful and versatile integrated hardware components in addition to the display, including high performance processors and memory, 3D camera systems and time-of-flight depth sensors, positioning systems, and/or inertial measurement units.
According to another advantageous embodiment of the invention, the mixed reality display device may comprise an additional external computer, for example an external server, connected to the display and adapted to perform at least part of the method proposed according to the invention. The mixed reality display device may, for example, include a head mounted mixed reality display device and/or mixed reality smart glasses and an additional computer, such as an external server, connected to the head mounted mixed reality display device and/or mixed reality smart glasses, respectively, by a wired connection or a wireless connection. The external server may be designed as a cloud server.
This embodiment provides the advantage of additional computing power for the complex and computationally intensive operations that are common in the related fields of computer vision and computer graphics.
According to another advantageous embodiment of the invention, all components of the mixed reality display device may be integrated in the head mounted mixed reality display device and/or the mixed reality smart glasses. This provides the advantage of a compact and therefore highly mobile mixed reality display device.
According to another advantageous embodiment of the invention, it is proposed to generate the medical imaging data using at least one of the following medical imaging methods: Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Cone Beam Computed Tomography (CBCT), Digital Volume Tomography (DVT), intraoperative fluoroscopy, X-ray imaging, radiography, ultrasound, endoscopy, and/or nuclear medicine imaging.
This provides the following advantages: the results of powerful and versatile modern medical imaging methods can be advantageously exploited by mixed reality visualization during medical treatment, e.g. surgery. This enables, for example, the visualization of internal organs and/or tumors of the patient and/or other internal defects of the patient's body.
According to a further advantageous embodiment of the invention, it is proposed to determine a semantic segmentation mask in the target point cloud and/or a semantic segmentation mask in the source point cloud using a convolutional neural network configured for semantic segmentation.
In recent years, semantic segmentation by Convolutional Neural Networks (CNNs) has made significant progress. Properly designed and trained CNNs enable reliable, accurate and fast semantic segmentation of 2D and even 3D image data. Thus, embodiments of the present invention that use a CNN for semantic segmentation provide the advantage of exploiting the powerful functionality of CNNs to improve mixed reality visualization. For example, the U-Net CNN architecture can be used for this purpose, i.e., a convolutional neural network configured for semantic segmentation can be designed as a U-Net CNN.
The CNN may be trained for semantic segmentation of the body or specific parts of the body using a suitable training data set comprising semantic segmentation masks labeled with their respective object classes (e.g. nose, ears, eyes, eyebrows in case of a training set for a human head).
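As a hedged illustration of such a CNN, the sketch below defines a compact U-Net-style encoder-decoder in PyTorch and runs one training step on dummy data; the network size, the number of classes and the names (e.g. TinyUNet) are illustrative assumptions and not part of the patent:

```python
# Compact U-Net-style segmentation network with skip connections, plus one training step.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=6):            # e.g. background, nose, ear, eye, eyebrow, mouth
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)  # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

# One illustrative training step on labelled masks (one object class per pixel).
model = TinyUNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
images = torch.rand(2, 3, 256, 256)               # dummy RGB batch
masks = torch.randint(0, 6, (2, 256, 256))        # dummy per-pixel class labels
loss = criterion(model(images), masks)
loss.backward()
optimizer.step()
```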
According to another advantageous embodiment of the invention, it is proposed that step c) comprises the following steps:
-determining a plurality of semantic segmentation masks in an image of the image dataset by applying a semantic segmentation to the image of the image dataset, in particular using a convolutional neural network configured for the semantic segmentation, and
-determining a semantic segmentation mask in the target point cloud using the semantic segmentation mask in the image of the image dataset.
According to this embodiment, it is proposed to determine a semantic segmentation mask in an image of an image dataset. The image may particularly be designed as a 2D image, particularly a 2D RGB image. Based on these semantic segmentation masks in the image, a 3D semantic segmentation mask may be determined in the target point cloud. For example, each point of the 3D target point cloud may be projected onto the 2D image to determine a semantic segmentation mask in the 2D image that corresponds to the respective point in the 3D target point cloud.
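A minimal Python sketch of this projection step is given below; the pinhole intrinsics K and the world-to-camera pose (R, t) are assumed to be known for each image, for example from the 3D reconstruction step:

```python
# Hedged sketch: transfer 2D semantic segmentation labels to a 3D point cloud by
# pinhole projection of each point into the labelled image.
import numpy as np

def project_labels(points, K, R, t, label_image):
    """Assign each 3D point the class label of the pixel it projects onto.

    points: (N, 3) world coordinates, K: (3, 3) intrinsics,
    R, t: world-to-camera rotation and translation,
    label_image: (H, W) integer class labels from 2D semantic segmentation.
    """
    cam = points @ R.T + t                    # world -> camera coordinates
    uv = cam @ K.T                            # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]               # normalise by depth
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = label_image.shape
    labels = np.full(points.shape[0], -1)     # -1 marks points not visible in this image
    valid = (cam[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[valid] = label_image[v[valid], u[valid]]
    return labels
```

Points that remain unlabelled in one image can be labelled from the other images of the dataset, since each point is typically visible from several perspectives.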
This embodiment provides the following advantages: it facilitates semantic segmentation and enables the determination of semantic segmentation masks in 2D images, in particular in 2D RGB images, using sophisticated available methods for semantic segmentation. For example, a powerful convolutional neural network and corresponding training data set may be used for semantic segmentation in 2D RGB images. Exploiting their potential, enables fast, accurate and reliable semantic segmentation of 2D images. These results may be transmitted to the 3D point cloud to use them for the purpose of the claimed method, i.e. for determining an accurate transformation between the source point cloud and the target point cloud.
According to a further advantageous embodiment of the invention, it is proposed that the image dataset comprises a plurality of visual images and/or depth images of the treatment object, and that the 3D target point cloud is generated from the visual images and/or depth images, in particular using a photogrammetric method and/or a depth fusion method.
The visual image may be generated by a camera, in particular a 3D camera system. The visual image may be designed as an RGB image and/or a grayscale image, for example. The depth image may be generated by a 3D scanner and/or a depth sensor, in particular by a 3D laser scanner and/or a time-of-flight depth sensor. The depth image may be designed as a depth map. The combination of the visual image and the depth image may be designed as RGB-D image data. Thus, the image of the image data set may be designed as an RGB-D image.
A 3D target point cloud may be generated from the visual image and/or the depth image using a 3D reconstruction method, in particular an active and/or passive 3D reconstruction method.
Photogrammetry may be used to generate a 3D target point cloud from the visual images. In particular, a 3D target point cloud may be generated from the visual images using a structure-from-motion (SfM) process and/or a multi-view stereo (MVS) process. For example, the COLMAP 3D reconstruction pipeline may be used for this purpose.
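A hedged sketch of driving such a COLMAP pipeline from Python is shown below; the directory names are placeholders, the "colmap" binary is assumed to be installed and on the PATH, and the dense steps require a CUDA-enabled COLMAP build:

```python
# Hedged sketch: sparse (SfM) and dense (MVS) reconstruction with the COLMAP command line.
import os
import subprocess

image_dir, work_dir = "images", "colmap_workspace"
os.makedirs(f"{work_dir}/sparse", exist_ok=True)
os.makedirs(f"{work_dir}/dense", exist_ok=True)

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

run(["colmap", "feature_extractor", "--database_path", f"{work_dir}/db.db",
     "--image_path", image_dir])                                          # detect image features
run(["colmap", "exhaustive_matcher", "--database_path", f"{work_dir}/db.db"])
run(["colmap", "mapper", "--database_path", f"{work_dir}/db.db",
     "--image_path", image_dir, "--output_path", f"{work_dir}/sparse"])   # structure-from-motion
run(["colmap", "image_undistorter", "--image_path", image_dir,
     "--input_path", f"{work_dir}/sparse/0", "--output_path", f"{work_dir}/dense"])
run(["colmap", "patch_match_stereo", "--workspace_path", f"{work_dir}/dense"])   # multi-view stereo
run(["colmap", "stereo_fusion", "--workspace_path", f"{work_dir}/dense",
     "--output_path", f"{work_dir}/dense/fused.ply"])                     # dense surface point cloud
```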
Depth fusion, i.e. 3D reconstruction from multiple depth images, may be used to generate a 3D target point cloud from the depth images. Depth fusion may be based on a Truncated Signed Distance Function (TSDF). For example, the Point Cloud Library (PCL) may be used to perform the depth fusion. For example, Kinect Fusion can be used as the depth fusion method. In particular, a Kinect Fusion implementation included in the PCL, such as KinFu, may be used for this purpose.
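As a hedged illustration of TSDF-based depth fusion, the following sketch uses Open3D in place of the PCL/KinFu route named above; the intrinsics, the voxel size and the frames list with per-frame camera-to-world poses are assumptions:

```python
# Hedged sketch: TSDF volume integration of RGB-D frames with Open3D, then point cloud extraction.
import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 319.5, 239.5)
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.004,                  # 4 mm voxels (illustrative)
    sdf_trunc=0.02,                      # truncation distance of the signed distance function
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# frames: assumed list of (color_path, depth_path, 4x4 camera-to-world pose), e.g. from SLAM.
frames = []
for color_path, depth_path, pose in frames:
    color = o3d.io.read_image(color_path)
    depth = o3d.io.read_image(depth_path)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=1.0, convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))  # Open3D expects world-to-camera

target_point_cloud = volume.extract_point_cloud()
```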
As described above, these embodiments of the present invention, which include generating a 3D target point cloud from a visual image and/or a depth image, provide the advantage that an accurate and detailed reconstruction of the surface of the treatment object can be achieved.
According to another advantageous embodiment of the invention, it is proposed that step c) comprises the following steps:
-determining a plurality of semantic segmentation masks in a visual image and/or a depth image of an image data set by applying a semantic segmentation to the visual image and/or the depth image of the image data set, in particular using a convolutional neural network configured for the semantic segmentation, and
-determining a semantic segmentation mask in the target point cloud using the semantic segmentation mask in the visual image and/or the depth image of the image dataset.
This embodiment provides the advantage that it facilitates semantic segmentation and enables the determination of semantic segmentation masks in 2D visual images and/or depth images using sophisticated, readily available semantic segmentation methods. The resulting semantic segmentation masks in the visual images and/or depth images may be transferred to the 3D point cloud to use them for the purpose of the claimed method, i.e. for determining the exact transformation between the source point cloud and the target point cloud. In this way, a fast, accurate and detailed reconstruction of the surface of the treatment object can be achieved.
According to another advantageous embodiment of the invention, it is proposed that the image data set comprises a plurality of visual images and depth images of the treatment object, and that step b) comprises the steps of:
-generating a first 3D point cloud from a visual image of an image dataset,
-generating a second 3D point cloud from the depth image of the image dataset, and
-generating a 3D target point cloud using the first 3D point cloud and the second 3D point cloud, in particular by merging the first 3D point cloud and the second 3D point cloud.
This embodiment provides the following advantages: the accuracy and completeness of the 3D reconstruction, and thus the accuracy and completeness of the resulting target point cloud, may be improved. This is achieved by merging a first 3D point cloud, which may be an RGB based point cloud, with a second 3D point cloud, which is a depth based point cloud.
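A minimal sketch of such a merging step, here with Open3D and illustrative file names and parameters, could look as follows:

```python
# Minimal sketch: merge an RGB-based and a depth-based point cloud into one target point cloud.
import open3d as o3d

rgb_cloud = o3d.io.read_point_cloud("first_cloud_from_photogrammetry.ply")   # placeholder file
depth_cloud = o3d.io.read_point_cloud("second_cloud_from_depth_fusion.ply")  # placeholder file
merged = rgb_cloud + depth_cloud                        # concatenate the two clouds
merged = merged.voxel_down_sample(voxel_size=0.002)     # thin out near-duplicate points
merged, _ = merged.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
o3d.io.write_point_cloud("target_point_cloud.ply", merged)
```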
According to another advantageous embodiment of the invention, it is proposed that step f) comprises:
-determining a plurality of semantic segmentation masks in medical imaging data of the medical imaging data set by applying a semantic segmentation to the medical imaging data of the medical imaging data set, in particular using a convolutional neural network configured for the semantic segmentation, and
-determining a semantic segmentation mask in the source point cloud using the semantic segmentation mask in the medical imaging data of the medical imaging dataset.
According to this embodiment, it is proposed to determine a semantic segmentation mask in medical imaging data of a medical imaging data set. The medical imaging data may comprise 2D medical images, in particular 2D sectional images, and the semantic segmentation mask may be determined in these 2D medical images. Based on these semantic segmentation masks in the medical imaging data, a 3D semantic segmentation mask may be determined in the source point cloud. For example, each point of the 3D source point cloud may be projected to the 2D medical image to determine a semantic segmentation mask in the 2D medical image that corresponds to the respective point in the 3D source point cloud.
This embodiment provides the following advantages: it facilitates semantic segmentation and enables the determination of semantic segmentation masks in 2D images using sophisticated and available medical semantic segmentation methods. For example, a powerful convolutional neural network and a corresponding training data set may be used for medical semantic segmentation in 2D medical images. Examples include U-Net convolutional neural network structures. Exploiting the potential of these CNNs enables fast and accurate semantic segmentation of 2D medical images. These results may be transmitted to the 3D point cloud to use them for the purpose of the claimed method, i.e. for determining an accurate transformation between the source point cloud and the target point cloud.
The medical imaging data may also comprise 3D medical imaging data, for example a 3D medical imaging model of the subject of treatment. The 3D medical imaging data may be reconstructed from a plurality of 2D medical images, in particular 2D cross-sectional images, of the treatment object. Medical imaging methods may be used to generate these 2D medical images. In such an embodiment where the medical imaging data further comprises 3D medical imaging data, the semantic segmentation mask may also be determined in the 3D medical imaging data. A 3D semantic segmentation method, in particular a 3D semantic segmentation method based on a convolutional neural network configured for 3D semantic segmentation, may be used for this purpose. Based on the semantic segmentation mask in the 3D medical imaging data, a 3D semantic segmentation mask may be determined in the source point cloud.
According to a further advantageous embodiment of the invention, it is proposed that:
-step c) comprises: determining a semantic segmentation mask in a target point cloud by directly applying a semantic segmentation to the target point cloud, in particular using a convolutional neural network configured for semantic segmentation, and/or
-step f) comprises: the semantic segmentation mask in the source point cloud is determined by directly applying the semantic segmentation to the source point cloud, in particular using a convolutional neural network configured for semantic segmentation.
According to another advantageous embodiment of the invention, it is proposed that in step g) the transformation between the source point cloud and the target point cloud is determined by an Iterative Closest Point (ICP) algorithm using the coordinates of the points of the source point cloud and the segmentation mask of the source point cloud and the coordinates of the points of the target point cloud and the segmentation mask of the target point cloud.
According to this embodiment, it is proposed to use an Iterative Closest Point (ICP) algorithm to determine the transformation between the source point cloud and the target point cloud. In particular, a 4D variant of the ICP algorithm may be used for this purpose. The coordinates of the points of the source and target point clouds and the segmentation mask are used as inputs to the ICP algorithm. This may be achieved, for example, by determining a transformation between a source point cloud and a target point cloud using a 4D variant of the ICP algorithm, wherein for each point of the respective point cloud, an object class (e.g., nose, ear, mouth) of the respective semantic segmentation mask is interpreted as a fourth dimension of the point (in addition to the 3D coordinates of the point). As described above, by including semantic segmentation masks in the ICP algorithm, the ICP algorithm takes into account not only the spatial information of the points of the point cloud (i.e. their coordinates), but also their semantics, i.e. their meaning.
The semantic segmentation masks may also be used to determine an initial estimate of the transformation between the source point cloud and the target point cloud for the ICP algorithm, i.e. to determine an initial alignment.
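A hedged sketch of deriving such an initial alignment from the segmentation masks is given below: the centroids of corresponding masks (nose to nose, left eye to left eye, and so on) form point pairs from which a rigid transform is estimated by the Kabsch method; at least three non-collinear mask classes are assumed, and all names are illustrative:

```python
# Hedged sketch: initial alignment from the centroids of corresponding segmentation masks.
import numpy as np

def mask_centroids(points, labels, classes):
    """Centroid of each requested class in an (N, 3) cloud with per-point labels."""
    return np.stack([points[labels == c].mean(axis=0) for c in classes])

def initial_alignment(src_pts, src_lbl, dst_pts, dst_lbl, shared_classes):
    src_c = mask_centroids(src_pts, src_lbl, shared_classes)
    dst_c = mask_centroids(dst_pts, dst_lbl, shared_classes)
    # Kabsch on the centroid correspondences (needs >= 3 non-collinear classes).
    sc, dc = src_c.mean(0), dst_c.mean(0)
    H = (src_c - sc).T @ (dst_c - dc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, dc - R @ sc
    return T                               # used as the initial estimate handed to ICP
```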
The above embodiments provide the advantage of improved guidance for the ICP algorithm in finding the best transformation between the source point cloud and the target point cloud. For example, using the semantic segmentation masks to determine the transformation can prevent the ICP algorithm from settling on a merely local optimum. Thus, the accuracy of the transformation, and hence the alignment of the medical imaging data with the treatment object, may be improved.
According to a further advantageous embodiment of the invention, it is proposed that:
-step a) comprises: removing background and/or other extraneous portions from the images of the treatment object using semantic segmentation, in particular using a convolutional neural network configured for semantic segmentation, and/or
-step d) comprises: removing background and/or other extraneous portions from the medical imaging data of the treatment object using semantic segmentation, in particular using a convolutional neural network configured for semantic segmentation.
An extraneous portion may be a part of an image or of the medical imaging data, such as the hair of a human head, that is not relevant to the purpose of the medical imaging and/or to the superimposition of the medical imaging data on the treatment object.
This embodiment provides the advantage that it facilitates generating the point clouds, determining the semantic segmentation masks, and determining the transformation between the point clouds.
According to a further advantageous embodiment of the invention, it is proposed that the medical imaging data set comprises a 3D medical imaging model of the treatment object, wherein the 3D medical imaging model is reconstructed from a plurality of 2D sectional images of the treatment object generated by the medical imaging method.
This embodiment provides the following advantages: the method can facilitate the generation of the 3D source point cloud, and can improve the accuracy of the obtained 3D source point cloud.
According to a further advantageous embodiment of the invention, it is proposed that in step a), an image data set comprising images of the treatment object is created by a camera, in particular a 3D camera system, of a mixed reality display device and/or by a depth sensor, in particular by a time-of-flight depth sensor, of the mixed reality display device.
In this way, the image dataset required for generating the target point cloud can be created and provided for the purposes of the present invention in a very simple and user-friendly manner. For example, using a camera and/or depth sensor of a mixed reality display device, an image of an image dataset may be created by automatically scanning a treatment object. In particular, if the mixed reality display device includes or consists of a head mounted mixed reality display device and/or mixed reality smart glasses, the camera and/or the depth sensor may be integrated into the head mounted mixed reality display device and/or mixed reality smart glasses, respectively. In this case, the user may simply point his head at the portion of the subject to be scanned.
According to a further advantageous embodiment of the invention, it is proposed that:
-in step a), when creating the images by means of the camera of the mixed reality display device, determining the position of the camera in the three-dimensional coordinate system for each image and storing the position of the camera as the 3D camera position, and in step b), generating the target point cloud using the 3D camera positions, and/or
-in step a), when creating the images by means of the camera of the mixed reality display device, determining the orientation of the camera in the three-dimensional coordinate system for each image and storing the orientation of the camera as the 3D camera orientation, and in step b), generating the target point cloud using the 3D camera orientations.
Such an embodiment provides the advantage of a particularly accurate and reliable generation of the target point cloud from the image dataset.
According to a further advantageous embodiment of the invention, it is proposed that any position, any orientation and any transformation determined in steps a) to h) are marker-free and/or determined using simultaneous localization and mapping (SLAM).
Such an embodiment provides the advantage of being particularly comfortable for the user, since no optical or other reference markers have to be provided.
The object of the invention is also achieved by a computer program having program code means adapted to perform the method as described above when the computer program is executed on a computer.
The object of the invention is also achieved by a mixed reality display device with a display, a computer and a memory, wherein the computer program as described above is stored in the memory and the computer is adapted to execute the computer program.
The computer program may be designed as a distributed computer program. The computer and the memory can be designed as a distributed computer and a distributed memory, respectively, i.e. a computer adapted to execute the computer program can comprise two or more computers. The distributed memory may include a plurality of memories, wherein each memory may store at least a portion of the distributed computer program, and each of the two or more computers may be adapted to execute a portion of the distributed computer program.
Further, the mixed reality display device may have an interface adapted to receive medical imaging data from an external source. The interface may be adapted for example to be connected to a medical imaging device, such as a Magnetic Resonance Imaging (MRI) device and/or a Computed Tomography (CT) device, and/or to a memory storing medical imaging data. This may include a wired connection and/or a wireless connection.
As described above, the mixed reality display device may include or consist of a head-mounted mixed reality display device and/or mixed reality smart glasses. The mixed reality display device may for example comprise or consist of a Microsoft HoloLens device or a Microsoft HoloLens 2 device or similar.
As mentioned above, the mixed reality display device may comprise an additional external computer, e.g. an external server, connected to the display and adapted to perform at least part of the method proposed according to the invention. The mixed reality display device may, for example, include a head mounted mixed reality display device and/or mixed reality smart glasses and an additional computer, such as an external server, connected to the head mounted mixed reality display device and/or mixed reality smart glasses by a wired connection and/or a wireless connection, respectively. The external server may be designed as a cloud server.
As described above, all components of the mixed reality display device may be integrated in the head mounted mixed reality display device and/or the mixed reality smart glasses.
In the following, the invention will be explained in more detail using exemplary embodiments schematically illustrated in the drawings. The drawings show the following:
FIG. 1-a schematic diagram of a mixed reality display device according to the present invention;
FIG. 2-a schematic diagram of a method for controlling a mixed reality display device according to the present invention;
FIG. 3-schematic of a 3D target point cloud with semantic segmentation mask;
FIG. 4-schematic of a 3D source point cloud with semantic segmentation mask.
Fig. 1 shows a schematic diagram of a mixed reality display device 1 comprising a head-mounted mixed reality display device in the form of a pair of mixed reality smart glasses 5. In this example embodiment, the mixed reality smart glasses 5 are of the Microsoft HoloLens type. The smart glasses 5 have a memory 13a and a computer 11a connected to the memory 13a, wherein the computer 11a includes several processing units, i.e., a CPU (central processing unit), a GPU (graphics processing unit), and an HPU (holographic processing unit).
Furthermore, the smart glasses 5 of the mixed reality display device 1 have a camera 9 in the form of a 3D camera system. The camera 9 is adapted to create visual images of the object 15 from different perspectives. Furthermore, the smart glasses 5 comprise a plurality of sensors 7, said sensors 7 comprising time-of-flight depth sensors adapted to create depth images of the therapeutic object 15 from different perspectives. In this example embodiment, the subject 15 is the head of a human patient 17.
In addition, the smart glasses 5 of the mixed reality display device 1 have a display 3, which in this example embodiment is designed as an optical see-through head mounted display. The see-through display 3 is adapted to visualize virtual information, in this example embodiment medical imaging data, by overlaying the virtual information on a real view of the therapeutic object 15.
Further, fig. 1 shows that the mixed reality display device 1 comprises a server 21, the server 21 being connected to the smart glasses 5 by a wireless connection and/or a wired connection. The server 21 includes a memory 13b and a computer 11b connected to the memory 13b, wherein the computer 11b includes a CPU and a GPU. The server 21 has an interface 19, which interface 19 is adapted to receive medical imaging data from an external source, i.e. from a memory storing medical imaging data, by a wired connection and/or a wireless connection.
FIG. 2 illustrates a schematic diagram of an example method for controlling a mixed reality display device in accordance with this invention.
In step a), an image data set comprising a plurality of images of the treatment object 15 is provided, wherein the images depict the treatment object from different perspectives. In this example embodiment, the image dataset comprises a plurality of visual images in the form of 2D RGB images and a plurality of depth images in the form of depth maps. The visual images are created by the camera 9 of the mixed reality display device 1 and the depth images are created by the depth sensor 7 of the mixed reality display device 1 (see fig. 1). In other words, an image data set containing RGB-D image data is created. To this end, a user wearing the smart glasses 5, e.g. a surgeon, may simply point his head at the treatment object 15 from different perspectives and automatically scan the treatment object 15 from these perspectives using the camera 9 and the depth sensor 7. When creating the images, the position and orientation of the camera 9 are determined in the three-dimensional coordinate system for each image and stored as the 3D camera position and 3D camera orientation, respectively. For this purpose, simultaneous localization and mapping (SLAM) is used. The 3D camera position and 3D camera orientation of each image may be referred to as extrinsic camera parameters. The image data set, including the extrinsic camera parameters, is then transmitted from the smart glasses 5 to the server 21.
In step b), the server 21 generates a 3D target point cloud from the image dataset, wherein the target point cloud comprises a plurality of points defined in a three-dimensional coordinate system and which points represent the surface of the treatment object 15. For this purpose, the extrinsic camera parameters, i.e. the 3D camera position and the 3D camera orientation, are used.
The 3D target point cloud is generated from the visual images and the depth images. To this end, a first 3D point cloud is generated from the visual images of the image dataset using a photogrammetric method, i.e. by using a COLMAP 3D reconstruction pipeline comprising a structure-from-motion (SfM) process and a multi-view stereo (MVS) process. Further, a second 3D point cloud is generated from the depth images of the image dataset using depth fusion, i.e. using 3D reconstruction from a plurality of depth images. For this purpose, KinFu, the Kinect Fusion implementation contained in the Point Cloud Library (PCL), is used. Thereafter, the 3D target point cloud is generated by merging the first 3D point cloud and the second 3D point cloud.
In step c), a plurality of semantic segmentation masks in the target point cloud is determined by applying semantic segmentation.
First, a plurality of semantic segmentation masks is determined in the 2D visual images of the image dataset by applying semantic segmentation to these 2D RGB images. The semantic segmentation masks define an object class for each pixel of each 2D visual image, wherein the object classes may include a class such as "other" or "blank" for image regions that cannot be assigned otherwise. To determine the semantic segmentation masks in the 2D visual images, a Convolutional Neural Network (CNN) configured for semantic segmentation is used, i.e. a CNN based on the U-Net architecture (U-Net CNN). The U-Net CNN is trained for semantic segmentation of the treatment object 15, i.e. for semantic segmentation of the human head, using a suitable training data set comprising semantic segmentation masks labeled with their respective object classes (e.g. nose, ears, eyes, eyebrows).
Second, a semantic segmentation mask in the 3D target point cloud is determined using a previously determined semantic segmentation mask in a 2D visual image of the image dataset. To this end, each point of the 3D target point cloud is projected to a plurality of 2D RGB images to determine a semantic segmentation mask in the respective 2D image corresponding to the point in the 3D target point cloud.
Fig. 3 schematically shows the result of determining a semantic segmentation mask in a target point cloud. The figure shows a 3D target point cloud 23 generated from an image dataset. In the 3D target point cloud 23, several semantic segmentation masks 25a, 25b, 25c have been determined. Two semantic segmentation masks 25a represent the right and left eyes of the patient, respectively. Another semantic segmentation mask 25b represents the nose of the patient and another semantic segmentation mask 25c represents the mouth of the patient.
Referring back now to fig. 2, in step d) a medical imaging data set comprising medical imaging data of the treatment object is provided. In this example embodiment, the medical imaging data includes a plurality of 2D cross-sectional images of the treatment object 15 created by Magnetic Resonance Imaging (MRI) prior to surgery. These 2D MRI images, in the form of DICOM data, are received by the server 21 through the server interface 19. The server 21 reconstructs a 3D medical imaging model of the treatment object 15 from the plurality of 2D MRI images of the treatment object 15.
The medical imaging data may also include metadata stored as tags in the DICOM data, for example in the form of attributes. Examples of such metadata include information about slice thickness of the cross-sectional image and/or information about pixel spacing. Such metadata may be used to reconstruct a 3D imaging model and/or to generate a 3D source point cloud.
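As a hedged illustration, such DICOM attributes could be read with pydicom as follows; the file name is a placeholder, and note that SliceThickness does not always equal the actual spacing between slices, for which SpacingBetweenSlices may be present instead:

```python
# Hedged sketch: read geometric DICOM metadata (pixel spacing, slice thickness, origin).
import pydicom

ds = pydicom.dcmread("slice_001.dcm")                                  # placeholder file name
row_spacing_mm, col_spacing_mm = (float(v) for v in ds.PixelSpacing)   # in-plane pixel spacing
slice_thickness_mm = float(ds.SliceThickness)                          # nominal slice thickness
origin = [float(v) for v in ds.ImagePositionPatient]                   # position of the first voxel
print(row_spacing_mm, col_spacing_mm, slice_thickness_mm, origin)
```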
In step e), a 3D source point cloud is generated from the medical imaging dataset, wherein the source point cloud comprises a plurality of points defined in a three-dimensional coordinate system and which points also represent the surface of the treatment object 15. For this purpose, a Point Cloud Library (PCL) is used.
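As an alternative, hedged illustration of generating a surface point cloud from a stacked MRI volume (using scikit-image instead of the PCL route named above), a marching-cubes extraction at an illustrative intensity threshold could be used; the file name, threshold and spacing values are assumptions:

```python
# Hedged sketch: extract the skin surface from an intensity volume by marching cubes
# and use the resulting vertices as a 3D source point cloud.
import numpy as np
from skimage import measure

volume = np.load("mri_volume.npy")             # assumed (slices, rows, cols) intensity volume
spacing = (2.0, 0.9, 0.9)                      # (slice thickness, row spacing, col spacing) in mm
verts, faces, normals, values = measure.marching_cubes(volume, level=80.0, spacing=spacing)
source_point_cloud = verts                     # (N, 3) points on the reconstructed surface, in mm
```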
In step f), a plurality of semantic segmentation masks is determined in the source point cloud by applying semantic segmentation.
First, a plurality of semantic segmentation masks is determined in the 2D MRI images of the medical imaging dataset by applying semantic segmentation to these 2D MRI images. The semantic segmentation masks define an object class for each pixel of each 2D MRI image, wherein the object classes may include a class such as "other" or "blank" for image regions that cannot be assigned otherwise. To determine the semantic segmentation masks in the 2D MRI images, a Convolutional Neural Network (CNN) configured for semantic segmentation is used, i.e. a CNN based on the U-Net architecture (U-Net CNN). The U-Net CNN is trained for semantic segmentation of the treatment object 15, i.e. for semantic segmentation of the human head, using a suitable training data set comprising semantic segmentation masks labeled with their respective object classes (e.g. nose, ears, eyes, eyebrows).
Second, a semantic segmentation mask in the 3D source point cloud is determined using a previously determined semantic segmentation mask in a 2D MRI image of the medical imaging dataset.
Fig. 4 schematically shows the result of determining a semantic segmentation mask in a source point cloud. The figure shows a 3D source point cloud 27 generated from a medical imaging data set. In the 3D source point cloud 27, several semantic segmentation masks 29a, 29b, 29c have been determined. Two semantic segmentation masks 29a represent the right and left eyes of the patient, respectively. Another semantic segmentation mask 29b represents the nose of the patient and another semantic segmentation mask 29c represents the mouth of the patient.
Referring back now to fig. 2, in step g), a transformation between the source point cloud and the target point cloud is determined using the segmentation mask of the source point cloud and the segmentation mask of the target point cloud. In this example embodiment, a 4 x 4 transformation matrix including translations and rotations is determined and points of the source point cloud are aligned with points of the target point cloud when the 4 x 4 transformation matrix is applied to the source point cloud. Thus, the determined 4 x 4 transformation matrix is used for the purpose of transforming between different perspectives (positions and orientations) from which the source point cloud and the target point cloud have been acquired.
In this example embodiment, a 4D variant of the Iterative Closest Point (ICP) algorithm is used to determine the 4 x 4 transformation matrix. The coordinates of the points of the source and target point clouds and the segmentation mask are used as input to an algorithm, wherein for each point of the respective point cloud, the object class (e.g., nose, ear, mouth) of the respective semantic segmentation mask is interpreted as the fourth dimension of the point (in addition to the 3D coordinates of the point). The optimization function of the 4D ICP is designed to support a transformation that exactly matches the corresponding semantic segmentation masks (nose and nose, ear and ear, mouth and mouth, etc.) in the two point clouds.
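For comparison, the sketch below shows how a standard, purely geometric ICP refinement could be invoked with Open3D once an initial alignment is available; the label-augmented 4D variant described here is not part of stock Open3D and would have to be implemented separately (see the sketch given earlier in the description). The file names and the correspondence distance are assumptions:

```python
# Hedged sketch: standard point-to-point ICP refinement with Open3D, starting from an
# initial alignment (e.g. derived from the mask centroids).
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("source_point_cloud.ply")   # placeholder files
target = o3d.io.read_point_cloud("target_point_cloud.ply")
init = np.eye(4)                                              # initial estimate of the transform
result = o3d.pipelines.registration.registration_icp(
    source, target, max_correspondence_distance=0.01, init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
transformation_4x4 = result.transformation                    # the 4x4 matrix applied in step h)
```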
The 4 x 4 transformation matrix and the medical imaging data are transmitted from the server 21 to the mixed reality smart glasses 5.
In step h), at least a part of the medical imaging data, i.e. the MRI imaging data, is visualized on the optical see-through display 3 of the smart glasses 5 (see fig. 1), wherein the medical imaging data is superimposed on the real view of the therapeutic object 15 and aligned with the therapeutic object 15 using the transformation between the source point cloud and the target point cloud, i.e. using the 4 × 4 transformation matrix determined in step g).
In this way, a surgeon using the mixed reality display device 1 may be supported during surgery by virtually visualizing the anatomy of the treatment object 15 shown by the MRI image in precise alignment with the real world view of the treatment object 15.
In this example embodiment, steps a) and h) are performed by the mixed reality smart glasses 5, while steps b) through g) are performed by the server 21. In other embodiments, additional or all steps of the method may be performed by the mixed reality smart glasses 5.
List of reference numerals
1 mixed reality display device
3 display
5 Mixed reality intelligent glasses
7 sensor
9 Camera
11a, 11b computer
13a, 13b memory
15 treatment object
17 patients
19 interface
21 Server
23 target point cloud
25a, 25b, 25c segmentation masks of the target point cloud
27 source point cloud
29a, 29b, 29c segmentation masks of the source point cloud

Claims (18)

1. A method for controlling a display of a mixed reality display device (1), the method comprising at least the steps of:
a) providing an image dataset comprising a plurality of images of a subject (15), wherein the subject (15) is a patient body or a part of the patient body and the images depict the subject (15) from different perspectives,
b) generating a 3D target point cloud (23) from the image dataset, wherein the target point cloud (23) comprises a plurality of points defined in a three-dimensional coordinate system and the points represent a surface of the treatment object,
c) determining a plurality of semantic segmentation masks (25a, 25b, 25c) in the target point cloud (23) by applying semantic segmentation,
d) providing a medical imaging data set comprising medical imaging data of the subject of treatment (15),
e) generating a 3D source point cloud (27) from the medical imaging dataset, wherein the source point cloud (27) comprises a plurality of points defined in a three-dimensional coordinate system and which points also represent the surface of the treatment object,
f) determining a plurality of semantic segmentation masks (29a, 29b, 29c) in the source point cloud (27) by applying semantic segmentation,
g) determining a transformation between the source point cloud (27) and the target point cloud (23) using a segmentation mask (29a, 29b, 29c) of the source point cloud (27) and a segmentation mask (25a, 25b, 25c) of the target point cloud (23), and
h) visualizing at least a portion of the medical imaging data on the display (3), wherein the medical imaging data is superimposed on the treatment object (15) and aligned with the treatment object (15) using a transformation between the source point cloud (27) and the target point cloud (23).
2. The method according to claim 1, characterized in that the display (3) is designed as an optical see-through display (3), in particular as an optical see-through head-mounted display (3), and/or in that the mixed reality display device (1) comprises or consists of a head-mounted mixed reality display device (1) and/or mixed reality smart glasses (5).
3. The method of claim 1 or 2, wherein the medical imaging data is generated using at least one of the following medical imaging methods: magnetic Resonance Imaging (MRI), Computed Tomography (CT), radiography, ultrasound, endoscopy, and/or nuclear medicine imaging.
4. The method according to any of the preceding claims, characterized in that a convolutional neural network configured for semantic segmentation is used to determine a semantic segmentation mask (25a, 25b, 25c) in the target point cloud (23) and/or a semantic segmentation mask (29a, 29b, 29c) in the source point cloud (27).
5. Method according to any of the preceding claims, wherein step c) comprises the steps of:
-determining a plurality of semantic segmentation masks in an image of the image dataset by applying a semantic segmentation to the image of the image dataset, in particular using a convolutional neural network configured for semantic segmentation, and
-determining a semantic segmentation mask (25a, 25b, 25c) in the target point cloud (23) using the semantic segmentation mask in the image of the image dataset.
6. The method according to any one of the preceding claims, characterized in that the image dataset comprises a plurality of visual images and/or depth images of the treatment object (15), and the 3D target point cloud (23) is generated from the visual images and/or the depth images, in particular using photogrammetry methods and/or depth fusion methods.
7. The method of claim 6, wherein step c) comprises the steps of:
-determining a plurality of semantic segmentation masks in a visual image and/or a depth image of the image data set by applying a semantic segmentation to the visual image and/or the depth image of the image data set, in particular using a convolutional neural network configured for semantic segmentation, and
-determining a semantic segmentation mask (25a, 25b, 25c) in the target point cloud (23) using the semantic segmentation mask in a visual image and/or a depth image of the image dataset.
8. The method according to any of the preceding claims, wherein the image data set comprises a plurality of visual images and depth images of the subject (15), and step b) comprises the steps of:
-generating a first 3D point cloud from a visual image of the image dataset,
-generating a second 3D point cloud from the depth image of the image dataset, and
-generating the 3D target point cloud (23) using the first 3D point cloud and the second 3D point cloud, in particular by merging the first 3D point cloud and the second 3D point cloud.
9. The method according to any one of the preceding claims, wherein step f) comprises:
-determining a plurality of semantic segmentation masks in medical imaging data of the medical imaging data set by applying a semantic segmentation to the medical imaging data of the medical imaging data set, in particular using a convolutional neural network configured for semantic segmentation, and
-determining a semantic segmentation mask (29a, 29b, 29c) in the source point cloud (27) using the semantic segmentation mask in the medical imaging data of the medical imaging dataset.
10. The method according to any of the preceding claims, characterized in that:
-step c) comprises: determining a semantic segmentation mask (25a, 25b, 25c) in the target point cloud (23) by directly applying a semantic segmentation to the target point cloud (23), in particular using a convolutional neural network configured for semantic segmentation, and/or
-step f) comprises: a semantic segmentation mask (29a, 29b, 29c) in the source point cloud (27) is determined by directly applying a semantic segmentation to the source point cloud (27), in particular using a convolutional neural network configured for semantic segmentation.
11. The method according to any of the preceding claims, characterized in that the transformation between the source point cloud (27) and the target point cloud (23) is determined by an Iterative Closest Point (ICP) algorithm using the coordinates of the points of the source point cloud (27) and the segmentation mask (29a, 29b, 29c) of the source point cloud (27) and the coordinates of the points of the target point cloud (23) and the segmentation mask (25a, 25b, 25c) of the target point cloud (23).
12. The method according to any of the preceding claims, characterized in that:
-step a) comprises: removing background and/or other extraneous parts from the image of the subject (15) using semantic segmentation, in particular using a convolutional neural network configured for semantic segmentation, and/or
-step d) comprises: background and/or other extraneous portions are removed from medical imaging data of the subject (15) using semantic segmentation, in particular using a convolutional neural network configured for semantic segmentation.
13. The method according to any one of the preceding claims, wherein the medical imaging data set comprises a 3D medical imaging model of the object of treatment (15), wherein the 3D medical imaging model is reconstructed from a plurality of 2D sectional images of the object of treatment (15) generated by a medical imaging method.
14. The method according to any one of the preceding claims, characterized in that in step a), the image dataset comprising images of the therapeutic object (15) is created by a camera (9), in particular a 3D camera system, of the mixed reality display device (1) and/or by a depth sensor (7), in particular by a time-of-flight depth sensor (7), of the mixed reality display device (1).
15. The method of claim 14, wherein:
-in step a), when creating the images by means of the cameras (9) of the mixed reality display device, determining the position of the camera (9) in a three-dimensional coordinate system for each image and storing the position of the camera (9) as a 3D camera position, and in step b), generating the target point cloud (23) using the 3D camera positions, and/or
-in step a), when creating the images by means of the cameras (9) of the mixed reality display device, determining for each image the orientation of the camera (9) in a three-dimensional coordinate system and storing the orientation of the camera (9) as a 3D camera orientation, and in step b), generating the target point cloud (23) using the 3D camera orientation.
16. Method according to any of the preceding claims, characterized in that any position, any orientation and any transformation determined in steps a) to h) are determined marker-free and/or determined using simultaneous localization and mapping (SLAM).
17. Computer program having program code means adapted to perform the method of any one of the preceding claims when the computer program is executed on a computer (11a, 11 b).
18. A mixed reality display device (1) having a display (3), a computer (11a, 11b) and a memory (13a, 13b), wherein a computer program according to the preceding claim is stored in the memory (13a, 13b) and the computer (11a, 11b) is adapted to execute the computer program.
Application CN202080058704.8A (priority date 2019-09-09, filing date 2020-09-09): Method for controlling a display, computer program and mixed reality display device. Status: Pending. Published as CN114270408A.

Applications Claiming Priority (3)

  • EP 19061937 (priority date 2019-09-09)
  • EP191961937 (priority date 2019-09-09)
  • PCT/EP2020/075127 (WO2021048158A1; priority date 2019-09-09, filing date 2020-09-09): Method for controlling a display, computer program and mixed reality display device

Publications (1)

CN114270408A (en), published 2022-04-01

Family ID: 80824505


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination