WO2023184526A1 - System and method of real-time stereoscopic visualization based on monocular camera - Google Patents
- Publication number
- WO2023184526A1 (PCT/CN2022/085011)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- interpolation
- depth map
- monocular
- processing device
- Prior art date
Classifications
- H04N13/128 — Stereoscopic video systems; processing of stereoscopic or multi-view image signals; adjusting depth or disparity
- H04N13/139 — Stereoscopic video systems; processing of stereoscopic or multi-view image signals; format conversion, e.g., of frame-rate or size
- H04N13/268 — Image signal generators with monoscopic-to-stereoscopic image conversion based on depth image-based rendering (DIBR)
- H04N23/555 — Camera constructional details for picking up images in sites inaccessible due to their dimensions or hazardous conditions, e.g., endoscopes or borescopes
Definitions
- the resized image is processed using a convolutional neural network yielding an estimated depth map, as described above with respect to FIG. 2.
- the depth map estimation is performed on the resized image, allowing for faster processing and generation of the depth map due to the lower resolution at which depth estimation is performed.
- the estimated depth map is resized (e.g., enlarged) to the original image size, since the estimated depth map was obtained from the smaller image.
- the second resizing operation i.e., resizing the depth map, may be implemented using any suitable interpolation technique, including, but not limited to, a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, an area interpolation, and combinations thereof.
- Two resizing operations allow for faster image generation while maintaining the quality and resolution of the generated image.
- the input image is first resized to a smaller size to perform the depth estimation portion of the algorithm. Thereafter, the estimated depth map is resized back to the original size of the input image to generate the right image. Without the resizing operations, the processing speed of the algorithm would be adversely affected.
- the image processing device 30 samples the original input image and generates the counterpart (e.g., right) image based on the resized depth map. Finally, at step 210, the left original image and the right generated image are combined as a stereoscopic image and displayed on the display 40.
- the image generation algorithm according to the present disclosure was tested to demonstrate the effect of two resizing operations on stereoscopic image generation from a single image.
- Two algorithms, one with two resizing operations, and one without resizing operations, were executed on a personal computer (PC) with an NVIDIA GTX 1070 GPU, running Windows 10, CUDA 10.2.89, cuDNN 8.0.5, and PyTorch 1.6.0 (hereinafter “Windows PC” ) .
- "Net inference" and total time were measured. "Net inference" indicates the time that the Windows PC took to calculate the estimated depth map using the convolutional neural network. Total time indicates the total processing time for one frame, i.e., from reading one input frame to generating the stereoscopic images. The statistics were averaged over 500 frames.
- the image generation algorithm according to the present disclosure was also tested using the TensorRT TM package developed by NVIDIA Corporation to increase the processing speed of the convolutional neural network.
- TensorRT TM is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
- the Windows PC was used to execute the algorithm of this disclosure without TensorRT.
- As Table 2 shows, both "net inference" and "total time" were significantly improved by TensorRT, which takes advantage of GPU processing.
- Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer) .
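The per-frame measurements described above ("net inference" time for the depth network alone versus total time per frame, averaged over many frames) can be collected with a simple harness. A minimal sketch; the stub callables passed in below are hypothetical stand-ins for the deployed depth network and the stereo-generation step:

```python
import time

import numpy as np


def benchmark(frames, estimate_depth, generate_counterpart):
    # Average "net inference" time (the depth network alone) and total
    # per-frame time (from receiving a frame to producing the counterpart),
    # mirroring the measurements described in the disclosure.
    net_time = total_time = 0.0
    for frame in frames:
        t0 = time.perf_counter()
        depth = estimate_depth(frame)       # net inference
        t1 = time.perf_counter()
        generate_counterpart(frame, depth)  # remaining per-frame work
        t2 = time.perf_counter()
        net_time += t1 - t0
        total_time += t2 - t0
    n = len(frames)
    return net_time / n, total_time / n
```

Averaging over a fixed number of frames (500 in the disclosure's tests) smooths out per-frame jitter from the OS scheduler and GPU pipeline.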
Abstract
An imaging system includes a monocular endoscope configured to capture a monocular image. The system also includes an image processing device having a processor and a memory, with instructions stored thereon, which when executed by the processor cause the image processing device to: resize the monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Description
Minimally invasive surgery has become an indispensable part of surgical practice and is performed with the aid of an endoscope, which allows for viewing of the surgical site through a natural opening, a small incision, or an access port. However, conventional minimally invasive surgeries mostly employ monocular endoscopes, which only display two-dimensional (2D) images lacking depth information. It is therefore challenging for a surgeon to accurately move surgical instruments to specific locations inside a patient's body. Surgeons usually rely on motion parallax, monocular cues, and other indirect visual feedback to perceive depth in 2D images and position instruments accurately. Stereoscopic visualization provides better imaging of the surgical site during minimally invasive surgery, giving the surgeon depth perception. Despite the advantages of depth information and stereoscopic images, dual-camera endoscopes have the drawback of being much more expensive than monocular endoscopes.
SUMMARY
The present disclosure relates to a stereoscopic visualization system for endoscopes and, more particularly, to a stereoscopic visualization system generating stereoscopic images based on monocular images.
According to one embodiment of the present disclosure, an image processing device for generating a stereoscopic image is disclosed. The image processing device may include a processor; and a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to: resize a monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the second resolution may be smaller than the first resolution. The monocular image may be a frame from a video stream. The image processing device may be configured to execute a convolutional neural network to calculate the estimated depth map. The image processing device may be further configured to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. The image processing device may also be configured to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
According to another embodiment of the present disclosure, an imaging system for generating a stereoscopic image is disclosed. The imaging system includes a monocular endoscope configured to capture a monocular image. The system also includes an image processing device having a processor and a memory, with instructions stored thereon, which when executed by the processor cause the image processing device to: resize the monocular image having a first resolution yielding a resized image having a second resolution; calculate an estimated depth map of the monocular image based on the resized image; resize the estimated depth map yielding a resized estimated depth map having the first resolution; generate a counterpart monocular image based on the resized estimated depth map; and generate a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the second resolution is smaller than the first resolution. The imaging system may include a stereoscopic display configured to display the stereoscopic image. The monocular image may be a frame from a video stream. The image processing device may be configured to execute a convolutional neural network to calculate the estimated depth map. The image processing device may be further configured to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. The image processing device may also be configured to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
According to a further embodiment of the present disclosure, a method for generating a stereoscopic image is disclosed. The method includes resizing a monocular image having a first resolution yielding a resized image having a second resolution. The method also includes calculating an estimated depth map of the monocular image based on the resized image. The method further includes resizing the estimated depth map yielding a resized estimated depth map having the first resolution. The method additionally includes generating a counterpart monocular image based on the resized estimated depth map and generating a stereoscopic image based on the monocular image and the counterpart monocular image.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the method may also include receiving the monocular image as a frame from a video stream. The second resolution may be smaller than the first resolution. The method may further include outputting the stereoscopic image on a stereoscopic display. Calculating the estimated depth map may further include executing a convolutional neural network. Resizing the monocular image may further include using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation. Resizing the estimated depth map may further include using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
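The claimed method amounts to a five-step pipeline: downsize, estimate depth, upsize the depth map, sample a counterpart view, and combine. A minimal NumPy sketch under stated assumptions: the depth-network stub, the nearest-neighbor resizer, and the depth-to-disparity mapping are all illustrative placeholders, not the disclosed implementation:

```python
import numpy as np


def estimate_depth(image_small):
    # Hypothetical stand-in for the disclosed convolutional neural network;
    # returns a flat 0..1 depth map of the same size as its input.
    h, w = image_small.shape[:2]
    return np.full((h, w), 0.5, dtype=np.float32)


def nearest_resize(image, out_h, out_w):
    # Nearest-neighbor interpolation, one of the resizing techniques the
    # method permits for both resizing operations.
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]


def generate_stereo(left, scale=2):
    h, w = left.shape[:2]
    small = nearest_resize(left, h // scale, w // scale)  # first resize (downsize)
    depth_small = estimate_depth(small)                   # depth on the small image
    depth = nearest_resize(depth_small, h, w)             # second resize (upsize)
    # Sample the input image based on the depth map to synthesize the
    # counterpart view (depth-to-disparity mapping assumed here).
    disparity = (depth * 8).astype(int)
    cols = np.clip(np.arange(w)[None, :] - disparity, 0, w - 1)
    right = np.take_along_axis(left, cols, axis=1)
    return np.concatenate([left, right], axis=1)          # side-by-side frame
```

Because the expensive depth estimation runs on the downsized image while the final sampling uses the full-resolution input, the output keeps the original resolution at a fraction of the inference cost.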
The present disclosure may be understood by reference to the accompanying drawings, when considered in conjunction with the subsequent, detailed description, in which:
FIG. 1 is a schematic view of an imaging system according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a stereoscopic image generating algorithm according to an embodiment of the present disclosure; and
FIG. 3 is a flow chart of a stereoscopic image generating algorithm according to another embodiment of the present disclosure.
Embodiments of the presently disclosed system are described in detail with reference to the drawings, in which like reference numerals designate identical or corresponding elements in each of the several views. In the following description, well-known functions or constructions are not described in detail to avoid obscuring the present disclosure in unnecessary detail. Those skilled in the art will understand that the present disclosure may be adapted for use with any imaging system.
With reference to FIG. 1, an imaging system 10 includes a monocular endoscope 20 and an image processing device 30. The endoscope 20 is configured to capture 2D image data, which includes still images or a video stream having a plurality of monocular endoscopic images captured over a period of time. The endoscope 20 may be any device structurally configured for internally imaging an anatomical region of a body (e.g., human or animal) and may include fiber optics, lenses, miniaturized (e.g., complementary metal oxide semiconductor (CMOS) sensor) imaging systems, or the like. Suitable endoscopes 20 include, but are not limited to, any type of scope (e.g., a bronchoscope, a colonoscope, a laparoscope, etc.) and any device similar to a scope that is equipped with an imaging system (e.g., an imaging cannula).
The endoscope 20 is coupled to the image processing device 30, which is configured to receive image data from the endoscope 20 for further processing. The image processing device 30 may include a processor 32 operably connected to a memory 34, which may include one or more of volatile, non-volatile, magnetic, optical, or electrical media, such as read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM (EEPROM), non-volatile RAM (NVRAM), or flash memory. The processor 32 is configured to perform the operations, calculations, and/or sets of instructions stored in the memory 34. The processor 32 may be any suitable processor including, but not limited to, a hardware processor, a field-programmable gate array (FPGA), a digital signal processor (DSP), a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), and combinations thereof. Those skilled in the art will appreciate that the processor may be replaced by any logic processor (e.g., a control circuit) adapted to execute the algorithms, calculations, and/or sets of instructions described herein.
The image processing device 30 is also coupled to a display 40, which may be a stereoscopic monitor and is configured to display the stereoscopic images or stereoscopic video stream generated by and transmitted from the image processing device 30. The display 40 may be configured to display stereoscopic images in a side-by-side format or an interlaced format to be viewed with the aid of 3D glasses. In further embodiments, the display 40 may be an autostereoscopic display (e.g., using a parallax barrier, lenticular lens, or other display technologies) configured to display stereoscopic images without 3D glasses.
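The two packing formats mentioned above are simple array operations. A brief sketch (function names are illustrative, not from the disclosure):

```python
import numpy as np


def side_by_side(left, right):
    # Pack the two views into one double-width frame, the side-by-side
    # format accepted by many stereoscopic monitors.
    return np.concatenate([left, right], axis=1)


def row_interlaced(left, right):
    # Alternate scan lines from each view, the interlaced format used by
    # passive (polarized-glasses) 3D displays.
    out = left.copy()
    out[1::2] = right[1::2]
    return out
```

Side-by-side preserves every pixel of both views at double frame width, while row interlacing keeps the original frame size at the cost of halving each view's vertical resolution.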
The image processing device 30 receives monocular images from the endoscope 20 as input, and generates the corresponding stereoscopic images which are displayed on the display 40. Specifically, the input monocular image may be the left image or the right image in the generated stereoscopic images and the generated image is the counterpart image (e.g., left or right) .
The image processing device 30 is configured to execute a deep learning-based image generation algorithm that performs stereoscopic image generation. The algorithm is illustrated in FIG. 2 and may be embodied as a software application or instructions stored in the memory 34 and executable by the processor 32. Initially, at step 100, the image processing device 30 receives an input image (e.g., a left image), which may be a still image or a frame of a video stream, from the endoscope 20. In embodiments, the input image may be a right image.
At step 102, the image generation algorithm calculates an estimated depth map for the input image using a convolutional neural network. The convolutional neural network may have any suitable convolutional architecture, such as a U-Net architecture, which is commonly used in medical image processing. The parameters to be optimized in the algorithm include those of the convolutional neural network. There are no learnable parameters in the sampling step, which is thus excluded from optimization.
In various embodiments, training of the neural network may occur on a separate system, e.g., graphics processing unit (GPU) workstations, high-performance computing clusters, etc., and the trained algorithm may then be deployed on the image processing device 30. The stereoscopic image generation algorithm may be trained in an end-to-end manner using actual stereoscopic endoscopic images as training data. During the training phase, the algorithm receives a left image of the stereoscopic images as input and outputs one estimated right image using the process described above with respect to FIG. 2. By measuring and minimizing the differences between the estimated right images and the actual right images, the parameters of the algorithm are optimized by backpropagating the gradients with respect to the differences. Given a large enough training set and appropriate training settings, the algorithm training converges, and the differences between estimated and actual images are reduced to a locally minimal value, indicating that the stereoscopic image generation algorithm has been fully trained.
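The end-to-end optimization described above can be illustrated with a toy gradient-descent loop. This is a hypothetical single-parameter stand-in for the CNN (the function `train` and its linear "model" are illustrative assumptions, not the actual network): the parameter is updated by backpropagating the gradient of the squared difference between the estimated output and the actual target until the difference reaches a minimum.

```python
def train(inputs, targets, lr=0.1, epochs=200):
    """Toy end-to-end training loop: fit a single learnable parameter w
    so that w * x matches the target, mirroring how the generator's CNN
    weights are fit to actual right-view images by minimizing differences."""
    w = 0.0                               # single learnable parameter
    for _ in range(epochs):
        grad = 0.0
        for x, t in zip(inputs, targets):
            pred = w * x                  # "estimated right image"
            grad += 2.0 * (pred - t) * x  # gradient of the squared difference
        w -= lr * grad / len(inputs)      # backpropagated update
    return w
```

With targets generated by the rule `t = 2x`, the loop converges to `w ≈ 2`, the locally (here globally) minimal difference.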
At step 104, the image processing algorithm generates another image (e.g., right) by sampling the input image based on the estimated depth map. After a counterpart image is generated, the input image and the generated image are combined as a stereoscopic image and displayed on the display 40.
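The sampling of step 104 can be sketched as a simple horizontal warp. This is a minimal illustration, assuming the estimated depth map has already been converted to per-pixel integer horizontal disparities (nearer pixels shift more); the function name and hole-filling strategy are assumptions, not the disclosure's exact sampling scheme:

```python
def warp_right_from_left(left, disparity):
    """Generate a counterpart (right) view by shifting each pixel of the
    left image horizontally by its disparity.

    left: 2D list of pixel values; disparity: 2D list of integer shifts
    derived from the depth map. Positions that receive no sample keep the
    original left-image value (a crude hole fill)."""
    h, w = len(left), len(left[0])
    right = [row[:] for row in left]       # start from a copy (hole fill)
    for y in range(h):
        for x in range(w):
            xr = x - disparity[y][x]       # left pixel lands here in right view
            if 0 <= xr < w:
                right[y][xr] = left[y][x]
    return right
```

For example, a one-row image `[[1, 2, 3, 4]]` with uniform disparity 1 warps to `[[2, 3, 4, 4]]`.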
FIG. 3 shows a method for stereoscopic visualization using the imaging system 10, including the process and algorithm of FIG. 2. Initially, at step 200, a video stream from the endoscope 20 is received at the image processing device 30. More specifically, the image processing device 30 reads one frame (i.e., a still monocular image) at a time from the video stream. The image may be of any suitable resolution, e.g., 4K, 1080p, 720p, etc. At step 202, the single frame is resized (e.g., downsized) to a smaller size (i.e., resolution), which may be reduced by a factor of about 1.5 to about 5. Resizing may be accomplished using any suitable image resizing algorithm to reduce the resolution of the image to a desired image size. The first resizing operation, i.e., resizing the input image, may be implemented using any suitable interpolation technique, including, but not limited to, an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, and/or a bicubic interpolation.
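Of the interpolation techniques listed, nearest-neighbor is the simplest and can be sketched in a few lines (an illustrative stand-in for whichever interpolation the implementation actually uses; the function name is assumed):

```python
def resize_nearest(img, new_h, new_w):
    """Downsize (or upsize) a 2D image by nearest-neighbor interpolation:
    each output pixel copies the closest source pixel."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]
```

Downsizing a 4x4 frame to 2x2 (a factor of 2, within the stated range of about 1.5 to about 5) simply keeps every other pixel.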
At step 204, the resized image is processed using a convolutional neural network, yielding an estimated depth map as described above with respect to FIG. 2. The depth map estimation is performed on the resized image, allowing for faster processing and generation of the depth map due to the smaller resolution at which depth estimation is performed.
At step 206, the estimated depth map is resized (e.g., enlarged) to the original image size, since the estimated depth map was obtained from the smaller image. The second resizing operation, i.e., resizing the depth map, may be implemented using any suitable interpolation technique, including, but not limited to, a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, an area interpolation, and combinations thereof.
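The second resizing operation can likewise be sketched; bilinear interpolation, the first technique listed, blends the four surrounding depth values when enlarging the map back to the input resolution (an illustrative sketch; the function name and edge handling are assumptions):

```python
def resize_bilinear(depth, new_h, new_w):
    """Enlarge a 2D depth map by bilinear interpolation: each output value
    is a weighted blend of the four nearest source values."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * new_w for _ in range(new_h)]
    for y in range(new_h):
        sy = y * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(sy); y1 = min(y0 + 1, h - 1); fy = sy - y0
        for x in range(new_w):
            sx = x * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(sx); x1 = min(x0 + 1, w - 1); fx = sx - x0
            top = depth[y0][x0] * (1 - fx) + depth[y0][x1] * fx
            bot = depth[y1][x0] * (1 - fx) + depth[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bot * fy
    return out
```

Enlarging the 2x2 map `[[0, 2], [4, 6]]` to 3x3 preserves the corner depths and produces the average value 3.0 at the center.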
The two resizing operations allow for faster image generation while maintaining the quality and resolution of the generated image. As noted above, the input image is first resized to a smaller size to perform the depth estimation portion of the algorithm. Thereafter, the estimated depth map is resized back to the original size of the input image to generate the counterpart (e.g., right) image. Without the resizing operations, the processing speed of the algorithm would be adversely affected.
At step 208, the image processing device 30 samples the original input image and generates the counterpart (e.g., right) image based on the resized depth map. Finally, at step 210, the left original image and the right generated image are combined as a stereoscopic image and displayed on the display 40.
The image generation algorithm according to the present disclosure was tested to demonstrate the effect of the two resizing operations on stereoscopic image generation from a single image. Two algorithms, one with the two resizing operations and one without resizing operations, were executed on a personal computer (PC) with an NVIDIA GTX 1070 GPU, running Windows 10, CUDA 10.2.89, cuDNN 8.0.5, and PyTorch 1.6.0 (hereinafter "Windows PC"). "Net inference" and "total time" were measured. "Net inference" indicates the time that the Windows PC took to calculate the estimated depth map using the convolutional neural network. "Total time" indicates the total processing time for one frame, i.e., from reading one input frame to generating the stereoscopic images. The statistics were averaged over 500 frames. The original size of each frame was 1728 x 512. The resized smaller size of the frame was 896 x 256. The results are summarized in Table 1 below and demonstrate that the algorithm with resizing operations has a significant effect on the time to generate depth maps and obtain stereoscopic images.
| | Net inference (ms) | Total time (ms) |
|---|---|---|
| Without resizing operations | 42.535 | 77.321 |
| With resizing operations | 18.808 | 48.842 |

Table 1
The image generation algorithm according to the present disclosure was also tested using TensorRT™, an open-source package developed by the NVIDIA corporation, to increase the processing speed of the convolutional neural network. TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. The Windows PC was used to execute the algorithm of this disclosure without TensorRT. Another PC with an NVIDIA GTX 1070 GPU, running Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.5, PyTorch 1.6.0, and TensorRT 7.0.0.11, executed the algorithm with TensorRT. As the data in Table 2 shows, both "net inference" and "total time" were significantly improved by TensorRT taking advantage of the GPU processing.
| | Net inference (ms) | Total time (ms) |
|---|---|---|
| Without TensorRT | 42.535 | 77.321 |
| With TensorRT | 13.284 | 27.454 |

Table 2
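The per-frame timings reported above imply the following speedups and frame rates (a simple check of the reported figures, not additional measured data):

```python
# Per-frame timings (ms) reported in Tables 1 and 2: (net inference, total time).
timings = {
    "resizing": {"baseline": (42.535, 77.321), "optimized": (18.808, 48.842)},
    "tensorrt": {"baseline": (42.535, 77.321), "optimized": (13.284, 27.454)},
}

for name, t in timings.items():
    net_speedup = t["baseline"][0] / t["optimized"][0]
    fps = 1000.0 / t["optimized"][1]   # frames per second from total time
    print(f"{name}: net inference speedup {net_speedup:.2f}x, ~{fps:.1f} fps")
```

The resizing operations yield roughly a 2.26x net-inference speedup (about 20.5 fps end to end), and TensorRT roughly 3.20x (about 36.4 fps), consistent with real-time video rates.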
The disclosed method and techniques may be implemented in hardware, software, firmware, virtualized computer environments, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer) .
While several embodiments of the disclosure have been shown in the drawings and/or described herein, it is not intended that the disclosure be limited thereto, as it is intended that the disclosure be as broad in scope as the art will allow and that the specification be read likewise. Therefore, the above description should not be construed as limiting, but merely as exemplifications of particular embodiments. Those skilled in the art will envision other modifications within the scope of the claims appended hereto.
Claims (20)
- An image processing device for generating a stereoscopic image, the image processing device comprising:
a processor; and
a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to:
resize a monocular image having a first resolution yielding a resized image having a second resolution;
calculate an estimated depth map of the monocular image based on the resized image;
resize the estimated depth map yielding a resized estimated depth map having the first resolution;
generate a counterpart monocular image based on the resized estimated depth map; and
generate a stereoscopic image based on the monocular image and the counterpart monocular image.
- The image processing device according to claim 1, wherein the second resolution is smaller than the first resolution.
- The image processing device according to claim 1, wherein the monocular image is a frame from a video stream.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to execute a convolutional neural network to calculate the estimated depth map.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The image processing device according to claim 1, wherein the instructions, when executed by the processor, further cause the image processing device to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
- An imaging system for generating a stereoscopic image, the imaging system comprising:
a monocular endoscope configured to capture a monocular image;
an image processing device including:
a processor; and
a memory, including instructions stored thereon, which when executed by the processor cause the image processing device to:
resize the monocular image having a first resolution yielding a resized image having a second resolution;
calculate an estimated depth map of the monocular image based on the resized image;
resize the estimated depth map yielding a resized estimated depth map having the first resolution;
generate a counterpart monocular image based on the resized estimated depth map; and
generate a stereoscopic image based on the monocular image and the counterpart monocular image.
- The imaging system according to claim 7, further comprising a stereoscopic display configured to display the stereoscopic image.
- The imaging system according to claim 7, wherein the second resolution is smaller than the first resolution.
- The imaging system according to claim 7, wherein the monocular image is a frame from a video stream.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to execute a convolutional neural network to calculate the estimated depth map.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to resize the monocular image using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The imaging system according to claim 7, wherein the instructions, when executed by the processor, further cause the image processing device to resize the estimated depth map using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
- A method for generating a stereoscopic image, the method comprising:
resizing a monocular image having a first resolution yielding a resized image having a second resolution;
calculating an estimated depth map of the monocular image based on the resized image;
resizing the estimated depth map yielding a resized estimated depth map having the first resolution;
generating a counterpart monocular image based on the resized estimated depth map; and
generating a stereoscopic image based on the monocular image and the counterpart monocular image.
- The method according to claim 14, further comprising:receiving the monocular image as a frame from a video stream.
- The method according to claim 14, wherein the second resolution is smaller than the first resolution.
- The method according to claim 14, further comprising:outputting the stereoscopic image on a stereoscopic display.
- The method according to claim 14, wherein calculating the estimated depth map further includes executing a convolutional neural network.
- The method according to claim 14, wherein resizing the monocular image further includes using at least one of an area interpolation, a nearest-neighbor interpolation, a bilinear interpolation, or a bicubic interpolation.
- The method according to claim 14, wherein resizing the estimated depth map further includes using at least one of a bilinear interpolation, a nearest-neighbor interpolation, a linear interpolation, a bicubic interpolation, a trilinear interpolation, or an area interpolation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/085011 WO2023184526A1 (en) | 2022-04-02 | 2022-04-02 | System and method of real-time stereoscopic visualization based on monocular camera |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023184526A1 true WO2023184526A1 (en) | 2023-10-05 |
Family
ID=88198823
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6847392B1 (en) * | 1996-10-31 | 2005-01-25 | Nec Corporation | Three-dimensional structure estimation apparatus |
US20120274629A1 (en) * | 2011-04-28 | 2012-11-01 | Baek Heumeil | Stereoscopic image display and method of adjusting stereoscopic image thereof |
US20170235277A1 (en) * | 2016-02-12 | 2017-08-17 | Samsung Electronics Co., Ltd. | Method and apparatus for processing holographic image |
US20170366795A1 (en) * | 2016-06-17 | 2017-12-21 | Altek Semiconductor Corp. | Stereo image generating method and electronic apparatus utilizing the method |
CN111179326A (en) * | 2019-12-27 | 2020-05-19 | 精英数智科技股份有限公司 | Monocular depth estimation algorithm, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22934355 Country of ref document: EP Kind code of ref document: A1 |