WO2023184526A1 - System and method of real-time stereoscopic visualization based on monocular camera - Google Patents
- Publication number
- WO2023184526A1 (PCT/CN2022/085011)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- interpolation
- depth map
- monocular
- processing device
- Prior art date
Classifications
- H04N13/128 — Stereoscopic video systems; processing of stereoscopic or multi-view image signals; adjusting depth or disparity
- H04N13/139 — Stereoscopic video systems; processing of stereoscopic or multi-view image signals; format conversion, e.g., of frame-rate or size
- H04N13/268 — Image signal generators with monoscopic-to-stereoscopic image conversion based on depth image-based rendering (DIBR)
- H04N23/555 — Camera constructional details for picking up images in sites inaccessible due to their dimensions or hazardous conditions, e.g., endoscopes or borescopes
Definitions
- the resized image is processed using a convolutional neural network yielding an estimated depth map, as described above with respect to FIG. 2.
- the depth map estimation is performed on the resized image, allowing for faster processing and generation of the depth map due to the lower resolution at which depth estimation is performed.
- the estimated depth map is resized (e.g., enlarged) to the original image size, since the estimated depth map was obtained from the smaller image.
- the second resizing operation i.e., resizing the depth map, may be implemented using any suitable interpolation technique, including, but not limited to, a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, an area interpolation, and combinations thereof.
- Two resizing operations allow for faster image generation while maintaining the quality and resolution of the generated image.
- the input image is first resized to a smaller size to perform the depth estimation portion of the algorithm. Thereafter, the estimated depth map is resized back to the original size of the input image to generate the right image. Without the resizing operations, the processing speed of the algorithm would be adversely affected.
- the image processing device 30 samples the original input image and generates the counterpart (e.g., right) image based on the resized depth map. Finally, at step 210, the left original image and the right generated image are combined as a stereoscopic image and displayed on the display 40.
- the image generation algorithm according to the present disclosure was tested to demonstrate the effect of two resizing operations on stereoscopic image generation from a single image.
- Two algorithms, one with two resizing operations, and one without resizing operations, were executed on a personal computer (PC) with an NVIDIA GTX 1070 GPU, running Windows 10, CUDA 10.2.89, cuDNN 8.0.5, and PyTorch 1.6.0 (hereinafter “Windows PC” ) .
- "Net inference" and total time were measured. "Net inference" indicates the time that the Windows PC took to calculate the estimated depth map using the convolutional neural network. Total time indicates the total processing time for one frame, i.e., from reading one input frame to generating the stereoscopic images. The statistics were averaged over 500 frames.
- the image generation algorithm according to the present disclosure was also tested using the TensorRT TM package developed by NVIDIA Corporation to increase the processing speed of the convolutional neural network.
- TensorRT TM is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
- the Windows PC was used to execute the algorithm of this disclosure without TensorRT.
- As Table 2 shows, both "net inference" and "total time" were significantly improved by TensorRT, which takes advantage of GPU processing.
- Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer) .
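The per-frame measurements described above ("net inference" time for the depth network alone versus total time per frame, averaged over many frames) can be collected with a simple harness. A minimal sketch; the stub callables passed in below are hypothetical stand-ins for the deployed depth network and the stereo-generation step:

```python
import time

import numpy as np


def benchmark(frames, estimate_depth, generate_counterpart):
    # Average "net inference" time (the depth network alone) and total
    # per-frame time (from receiving a frame to producing the counterpart),
    # mirroring the measurements described in the disclosure.
    net_time = total_time = 0.0
    for frame in frames:
        t0 = time.perf_counter()
        depth = estimate_depth(frame)       # net inference
        t1 = time.perf_counter()
        generate_counterpart(frame, depth)  # remaining per-frame work
        t2 = time.perf_counter()
        net_time += t1 - t0
        total_time += t2 - t0
    n = len(frames)
    return net_time / n, total_time / n
```

Averaging over a fixed number of frames (500 in the disclosure's tests) smooths out per-frame jitter from the OS scheduler and GPU pipeline.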
Abstract
An imaging system includes a monocular endoscope configured to capture a monocular image. The system also includes an image processing device having a processor and a memory, with instructions stored thereon, which when executed by the processor cause the image processing device to: resize the monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Description
Minimally invasive surgery has become an indispensable part of surgical practice and is performed with the aid of an endoscope, which allows for viewing of the surgical site through a natural opening, a small incision, or an access port. However, conventional minimally invasive surgeries mostly employ monocular endoscopes, which only display two-dimensional (2D) images lacking depth information. It is therefore challenging for a surgeon to accurately move surgical instruments to specific locations inside a patient's body. Surgeons usually rely on motion parallax, monocular cues, and other indirect visual feedback to perceive depth in 2D images and position instruments accurately. Stereoscopic visualization provides better imaging of the surgical site during minimally invasive surgery, giving the surgeon depth perception. Despite the advantages of depth information and stereoscopic images, dual-camera endoscopes have the drawback of being much more expensive than monocular endoscopes.
SUMMARY
The present disclosure relates to a stereoscopic visualization system for endoscopes and, more particularly, to a stereoscopic visualization system generating stereoscopic images based on monocular images.
According to one embodiment of the present disclosure, an image processing device for generating a stereoscopic image is disclosed. The image processing device may include a processor; and a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to: resize a monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the second resolution may be smaller than the first resolution. The monocular image may be a frame from a video stream. The image processing device may be configured to execute a convolutional neural network to calculate the estimated depth map. The image processing device may be further configured to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. The image processing device may also be configured to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
According to another embodiment of the present disclosure, an imaging system for generating a stereoscopic image is disclosed. The imaging system includes a monocular endoscope configured to capture a monocular image. The system also includes an image processing device having a processor and a memory, with instructions stored thereon, which when executed by the processor cause the image processing device to: resize the monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the second resolution is smaller than the first resolution. The imaging system may include a stereoscopic display configured to display the stereoscopic image. The monocular image may be a frame from a video stream. The image processing device may be configured to execute a convolutional neural network to calculate the estimated depth map. The image processing device may be further configured to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. The image processing device may also be configured to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
According to a further embodiment of the present disclosure, a method for generating a stereoscopic image is disclosed. The method includes resizing a monocular image having a first resolution yielding a resized image having a second resolution. The method also includes calculating an estimated depth map of the monocular image based on the resized image. The method further includes resizing the estimated depth map yielding a resized estimated depth map having the first resolution. The method additionally includes generating a counterpart monocular image based on the resized estimated depth map and generating a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the method may also include receiving the monocular image as a frame from a video stream. The second resolution may be smaller than the first resolution. The method may further include outputting the stereoscopic image on a stereoscopic display. Calculating the estimated depth map may further include executing a convolutional neural network. Resizing the monocular image may further include using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. Resizing the estimated depth map may further include using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
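The claimed method amounts to a five-step pipeline: downsize, estimate depth, upsize the depth map, sample a counterpart view, and combine. A minimal NumPy sketch under stated assumptions: the depth-network stub, the nearest-neighbor resizer, and the depth-to-disparity mapping are all illustrative placeholders, not the disclosed implementation:

```python
import numpy as np


def estimate_depth(image_small):
    # Hypothetical stand-in for the disclosed convolutional neural network;
    # returns a flat 0..1 depth map of the same size as its input.
    h, w = image_small.shape[:2]
    return np.full((h, w), 0.5, dtype=np.float32)


def nearest_resize(image, out_h, out_w):
    # Nearest-neighbor interpolation, one of the resizing techniques the
    # method permits for both resizing operations.
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]


def generate_stereo(left, scale=2):
    h, w = left.shape[:2]
    small = nearest_resize(left, h // scale, w // scale)  # first resize (downsize)
    depth_small = estimate_depth(small)                   # depth on the small image
    depth = nearest_resize(depth_small, h, w)             # second resize (upsize)
    # Sample the input image based on the depth map to synthesize the
    # counterpart view (depth-to-disparity mapping assumed here).
    disparity = (depth * 8).astype(int)
    cols = np.clip(np.arange(w)[None, :] - disparity, 0, w - 1)
    right = np.take_along_axis(left, cols, axis=1)
    return np.concatenate([left, right], axis=1)          # side-by-side frame
```

Because the expensive depth estimation runs on the downsized image while the final sampling uses the full-resolution input, the output keeps the original resolution at a fraction of the inference cost.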
The present disclosure may be understood by reference to the accompanying drawings, when considered in conjunction with the subsequent, detailed description, in which:
FIG. 1 is a schematic view of an imaging system according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a stereoscopic image generating algorithm according to an embodiment of the present disclosure; and
FIG. 3 is a flow chart of a stereoscopic image generating algorithm according to another embodiment of the present disclosure.
Embodiments of the presently disclosed system are described in detail with reference to the drawings, in which like reference numerals designate identical or corresponding elements in each of the several views. In the following description, well-known functions or constructions are not described in detail to avoid obscuring the present disclosure in unnecessary detail. Those skilled in the art will understand that the present disclosure may be adapted for use with any imaging system.
With reference to FIG. 1, an imaging system 10 includes a monocular endoscope 20 and an image processing device 30. The endoscope 20 is configured to capture 2D image data, which includes still images or a video stream having a plurality of monocular endoscopic images captured over a period of time. The endoscope 20 may be any device structurally configured for internally imaging an anatomical region of a body (e.g., human or animal) and may include fiber optics, lenses, miniaturized (e.g., complementary metal oxide semiconductor (CMOS) sensor) imaging systems, or the like. Suitable endoscopes 20 include, but are not limited to, any type of scope (e.g., a bronchoscope, a colonoscope, a laparoscope, etc.) and any device similar to a scope that is equipped with an imaging system (e.g., an imaging cannula).
The endoscope 20 is coupled to the image processing device 30, which is configured to receive image data from the endoscope 20 for further processing. The image processing device 30 may include a processor 32 operably connected to a memory 34, which may include one or more of volatile, non-volatile, magnetic, optical, or electrical media, such as read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM (EEPROM), non-volatile RAM (NVRAM), or flash memory. The processor 32 is configured to perform the operations, calculations, and/or sets of instructions stored in the memory 34. The processor 32 may be any suitable processor including, but not limited to, a hardware processor, a field-programmable gate array (FPGA), a digital signal processor (DSP), a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), and combinations thereof. Those skilled in the art will appreciate that the processor may be replaced by any logic processor (e.g., a control circuit) adapted to execute the algorithms, calculations, and/or sets of instructions described herein.
The image processing device 30 is also coupled to a display 40, which may be a stereoscopic monitor and is configured to display the stereoscopic images or stereoscopic video stream generated by and transmitted from the image processing device 30. The display 40 may be configured to display stereoscopic images in a side-by-side format or an interlaced format to be viewed with the aid of 3D glasses. In further embodiments, the display 40 may be an autostereoscopic display (e.g., using a parallax barrier, lenticular lens, or other display technologies) configured to display stereoscopic images without 3D glasses.
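The two packing formats mentioned above are simple array operations. A brief sketch (function names are illustrative, not from the disclosure):

```python
import numpy as np


def side_by_side(left, right):
    # Pack the two views into one double-width frame, the side-by-side
    # format accepted by many stereoscopic monitors.
    return np.concatenate([left, right], axis=1)


def row_interlaced(left, right):
    # Alternate scan lines from each view, the interlaced format used by
    # passive (polarized-glasses) 3D displays.
    out = left.copy()
    out[1::2] = right[1::2]
    return out
```

Side-by-side preserves every pixel of both views at double frame width, while row interlacing keeps the original frame size at the cost of halving each view's vertical resolution.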
The image processing device 30 receives monocular images from the endoscope 20 as input, and generates the corresponding stereoscopic images which are displayed on the display 40. Specifically, the input monocular image may be the left image or the right image in the generated stereoscopic images and the generated image is the counterpart image (e.g., left or right) .
The image processing device 30 is configured to execute a deep learning-based image generation algorithm that performs stereoscopic image generation. The algorithm is illustrated in FIG. 2 and may be embodied as a software application or instructions stored in the memory 34 and executable by the processor 32. Initially, at step 100, the image processing device 30 receives an input image (e.g., a left image), which may be a still image or a frame of a video stream, from the endoscope 20. In embodiments, the input image may be a right image.
At step 102, the image generation algorithm calculates an estimated depth map for the input image using a convolutional neural network. The convolutional neural network may have any suitable convolutional architecture, such as a U-Net architecture, which is commonly used in medical image processing. The parameters to be optimized in the algorithm include those of the convolutional neural network. There are no learnable parameters in the sampling step, which is thus excluded from optimization.
In various embodiments, training of the neural network may occur on a separate system, e.g., graphics processing unit (GPU) workstations, high-performance computing clusters, etc., and the trained algorithm may then be deployed on the image processing device 30. The stereoscopic image generation algorithm may be trained in an end-to-end manner using actual stereoscopic endoscopic images as training data. During the training phase, the algorithm receives a left image of the stereoscopic images as input and outputs one estimated right image using the process described above with respect to FIG. 2. By measuring and minimizing the differences between the estimated right images and the actual right images, the parameters of the algorithm are optimized by backpropagating the gradients with respect to the differences. Given a large enough training set and appropriate training settings, the algorithm training converges, and the differences between estimated and actual images are reduced to a locally minimal value, indicating that the stereoscopic image generation algorithm has been fully trained.
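The end-to-end optimization described above can be illustrated with a toy gradient-descent loop. This is a hypothetical single-parameter stand-in for the CNN (the function `train` and its linear "model" are illustrative assumptions, not the actual network): the parameter is updated by backpropagating the gradient of the squared difference between the estimated output and the actual target until the difference reaches a minimum.

```python
def train(inputs, targets, lr=0.1, epochs=200):
    """Toy end-to-end training loop: fit a single learnable parameter w
    so that w * x matches the target, mirroring how the generator's CNN
    weights are fit to actual right-view images by minimizing differences."""
    w = 0.0                               # single learnable parameter
    for _ in range(epochs):
        grad = 0.0
        for x, t in zip(inputs, targets):
            pred = w * x                  # "estimated right image"
            grad += 2.0 * (pred - t) * x  # gradient of the squared difference
        w -= lr * grad / len(inputs)      # backpropagated update
    return w
```

With targets generated by the rule `t = 2x`, the loop converges to `w ≈ 2`, the locally (here globally) minimal difference.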
At step 104, the image processing algorithm generates another image (e.g., right) by sampling the input image based on the estimated depth map. After a counterpart image is generated, the input image and the generated image are combined as a stereoscopic image and displayed on the display 40.
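The sampling of step 104 can be sketched as a simple horizontal warp. This is a minimal illustration, assuming the estimated depth map has already been converted to per-pixel integer horizontal disparities (nearer pixels shift more); the function name and hole-filling strategy are assumptions, not the disclosure's exact sampling scheme:

```python
def warp_right_from_left(left, disparity):
    """Generate a counterpart (right) view by shifting each pixel of the
    left image horizontally by its disparity.

    left: 2D list of pixel values; disparity: 2D list of integer shifts
    derived from the depth map. Positions that receive no sample keep the
    original left-image value (a crude hole fill)."""
    h, w = len(left), len(left[0])
    right = [row[:] for row in left]       # start from a copy (hole fill)
    for y in range(h):
        for x in range(w):
            xr = x - disparity[y][x]       # left pixel lands here in right view
            if 0 <= xr < w:
                right[y][xr] = left[y][x]
    return right
```

For example, a one-row image `[[1, 2, 3, 4]]` with uniform disparity 1 warps to `[[2, 3, 4, 4]]`.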
FIG. 3 shows a method for stereoscopic visualization using the imaging system 10, including the process and algorithm of FIG. 2. Initially, at step 200, a video stream from the endoscope 20 is received at the image processing device 30. More specifically, the image processing device 30 reads one frame (i.e., a still monocular image) at a time from the video stream. The image may be of any suitable resolution, e.g., 4K, 1080p, 720p, etc. At step 202, the single frame is resized (e.g., downsized) to a smaller size (i.e., resolution), which may be reduced by a factor of about 1.5 to about 5. Resizing may be accomplished using any suitable image resizing algorithm to reduce the resolution of the image to a desired image size. The first resizing operation, i.e., resizing the input image, may be implemented using any suitable interpolation technique, including, but not limited to, an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, and/or a bicubic interpolation.
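Of the interpolation techniques listed, nearest-neighbor is the simplest and can be sketched in a few lines (an illustrative stand-in for whichever interpolation the implementation actually uses; the function name is assumed):

```python
def resize_nearest(img, new_h, new_w):
    """Downsize (or upsize) a 2D image by nearest-neighbor interpolation:
    each output pixel copies the closest source pixel."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]
```

Downsizing a 4x4 frame to 2x2 (a factor of 2, within the stated range of about 1.5 to about 5) simply keeps every other pixel.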
At step 204, the resized image is processed using a convolutional neural network, yielding an estimated depth map as described above with respect to FIG. 2. The depth map estimation is performed on the resized image, allowing for faster processing and generation of the depth map due to the smaller resolution at which depth estimation is performed.
At step 206, the estimated depth map is resized (e.g., enlarged) to the original image size, since the estimated depth map was obtained from the smaller image. The second resizing operation, i.e., resizing the depth map, may be implemented using any suitable interpolation technique, including, but not limited to, a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, an area interpolation, and combinations thereof.
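The second resizing operation can likewise be sketched; bilinear interpolation, the first technique listed, blends the four surrounding depth values when enlarging the map back to the input resolution (an illustrative sketch; the function name and edge handling are assumptions):

```python
def resize_bilinear(depth, new_h, new_w):
    """Enlarge a 2D depth map by bilinear interpolation: each output value
    is a weighted blend of the four nearest source values."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * new_w for _ in range(new_h)]
    for y in range(new_h):
        sy = y * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(sy); y1 = min(y0 + 1, h - 1); fy = sy - y0
        for x in range(new_w):
            sx = x * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(sx); x1 = min(x0 + 1, w - 1); fx = sx - x0
            top = depth[y0][x0] * (1 - fx) + depth[y0][x1] * fx
            bot = depth[y1][x0] * (1 - fx) + depth[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bot * fy
    return out
```

Enlarging the 2x2 map `[[0, 2], [4, 6]]` to 3x3 preserves the corner depths and produces the average value 3.0 at the center.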
The two resizing operations allow for faster image generation while maintaining the quality and resolution of the generated image. As noted above, the input image is first resized to a smaller size to perform the depth estimation portion of the algorithm. Thereafter, the estimated depth map is resized back to the original size of the input image to generate the counterpart (e.g., right) image. Without the resizing operations, the processing speed of the algorithm would be adversely affected.
At step 208, the image processing device 30 samples the original input image and generates the counterpart (e.g., right) image based on the resized depth map. Finally, at step 210, the left original image and the right generated image are combined as a stereoscopic image and displayed on the display 40.
The image generation algorithm according to the present disclosure was tested to demonstrate the effect of the two resizing operations on stereoscopic image generation from a single image. Two algorithms, one with the two resizing operations and one without resizing operations, were executed on a personal computer (PC) with an NVIDIA GTX 1070 GPU, running Windows 10, CUDA 10.2.89, cuDNN 8.0.5, and PyTorch 1.6.0 (hereinafter "Windows PC"). "Net inference" and "total time" were measured. "Net inference" indicates the time that the Windows PC took to calculate the estimated depth map using the convolutional neural network. "Total time" indicates the total processing time for one frame, i.e., from reading one input frame to generating the stereoscopic images. The statistics were averaged over 500 frames. The original size of each frame was 1728 x 512. The resized smaller size of the frame was 896 x 256. The results are summarized in Table 1 below and demonstrate that the algorithm with resizing operations has a significant effect on the time to generate depth maps and obtain stereoscopic images.
| | Net inference (ms) | Total time (ms) |
|---|---|---|
| Without resizing operations | 42.535 | 77.321 |
| With resizing operations | 18.808 | 48.842 |

Table 1
The image generation algorithm according to the present disclosure was also tested using TensorRT™, an open-source package developed by the NVIDIA corporation, to increase the processing speed of the convolutional neural network. TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. The Windows PC was used to execute the algorithm of this disclosure without TensorRT. Another PC with an NVIDIA GTX 1070 GPU, running Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.5, PyTorch 1.6.0, and TensorRT 7.0.0.11, executed the algorithm with TensorRT. As the data in Table 2 shows, both "net inference" and "total time" were significantly improved by TensorRT taking advantage of the GPU processing.
| | Net inference (ms) | Total time (ms) |
|---|---|---|
| Without TensorRT | 42.535 | 77.321 |
| With TensorRT | 13.284 | 27.454 |

Table 2
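The per-frame timings reported above imply the following speedups and frame rates (a simple check of the reported figures, not additional measured data):

```python
# Per-frame timings (ms) reported in Tables 1 and 2: (net inference, total time).
timings = {
    "resizing": {"baseline": (42.535, 77.321), "optimized": (18.808, 48.842)},
    "tensorrt": {"baseline": (42.535, 77.321), "optimized": (13.284, 27.454)},
}

for name, t in timings.items():
    net_speedup = t["baseline"][0] / t["optimized"][0]
    fps = 1000.0 / t["optimized"][1]   # frames per second from total time
    print(f"{name}: net inference speedup {net_speedup:.2f}x, ~{fps:.1f} fps")
```

The resizing operations yield roughly a 2.26x net-inference speedup (about 20.5 fps end to end), and TensorRT roughly 3.20x (about 36.4 fps), consistent with real-time video rates.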
The disclosed method and techniques may be implemented in hardware, software, firmware, virtualized computer environments, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer) .
While several embodiments of the disclosure have been shown in the drawings and/or described herein, it is not intended that the disclosure be limited thereto, as it is intended that the disclosure be as broad in scope as the art will allow and that the specification be read likewise. Therefore, the above description should not be construed as limiting, but merely as exemplifications of particular embodiments. Those skilled in the art will envision other modifications within the scope of the claims appended hereto.
Claims (20)
- An image processing device for generating a stereoscopic image, the image processing device comprising:
a processor; and
a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to:
resize a monocular image having a first resolution yielding a resized image having a second resolution;
calculate an estimated depth map of the monocular image based on the resized image;
resize the estimated depth map yielding a resized estimated depth map having the first resolution;
generate a counterpart monocular image based on the resized estimated depth map; and
generate a stereoscopic image based on the monocular image and the counterpart monocular image.
- The image processing device according to claim 1, wherein the second resolution is smaller than the first resolution.
- The image processing device according to claim 1, wherein the monocular image is a frame from a video stream.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to execute a convolutional neural network to calculate the estimated depth map.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
- An imaging system for generating a stereoscopic image, the imaging system comprising:
a monocular endoscope configured to capture a monocular image;
an image processing device including:
a processor; and
a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to:
resize the monocular image having a first resolution yielding a resized image having a second resolution;
calculate an estimated depth map of the monocular image based on the resized image;
resize the estimated depth map yielding a resized estimated depth map having the first resolution;
generate a counterpart monocular image based on the resized estimated depth map; and
generate a stereoscopic image based on the monocular image and the counterpart monocular image.
- The imaging system according to claim 7, further comprising a stereoscopic display configured to display the stereoscopic image.
- The imaging system according to claim 7, wherein the second resolution is smaller than the first resolution.
- The imaging system according to claim 7, wherein the monocular image is a frame from a video stream.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to execute a convolutional neural network to calculate the estimated depth map.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
- A method for generating a stereoscopic image, the method comprising:
resizing a monocular image having a first resolution yielding a resized image having a second resolution;
calculating an estimated depth map of the monocular image based on the resized image;
resizing the estimated depth map yielding a resized estimated depth map having the first resolution;
generating a counterpart monocular image based on the resized estimated depth map; and
generating a stereoscopic image based on the monocular image and the counterpart monocular image.
- The method according to claim 14, further comprising:receiving the monocular image as a frame from a video stream.
- The method according to claim 14, wherein the second resolution is smaller than the first resolution.
- The method according to claim 14, further comprising:outputting the stereoscopic image on a stereoscopic display.
- The method according to claim 14, wherein calculating the estimated depth map further includes executing a convolutional neural network.
- The method according to claim 14, wherein resizing the monocular image further includes using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The method according to claim 14, wherein resizing the estimated depth map further includes using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/085011 WO2023184526A1 (en) | 2022-04-02 | 2022-04-02 | System and method of real-time stereoscopic visualization based on monocular camera |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023184526A1 true WO2023184526A1 (en) | 2023-10-05 |
Family
ID=88198823
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6847392B1 (en) * | 1996-10-31 | 2005-01-25 | Nec Corporation | Three-dimensional structure estimation apparatus |
US20120274629A1 (en) * | 2011-04-28 | 2012-11-01 | Baek Heumeil | Stereoscopic image display and method of adjusting stereoscopic image thereof |
US20170235277A1 (en) * | 2016-02-12 | 2017-08-17 | Samsung Electronics Co., Ltd. | Method and apparatus for processing holographic image |
US20170366795A1 (en) * | 2016-06-17 | 2017-12-21 | Altek Semiconductor Corp. | Stereo image generating method and electronic apparatus utilizing the method |
CN111179326A (en) * | 2019-12-27 | 2020-05-19 | 精英数智科技股份有限公司 | Monocular depth estimation algorithm, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22934355 Country of ref document: EP Kind code of ref document: A1 |