CN117011381A - Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision - Google Patents


Info

Publication number
CN117011381A
Authority
CN
China
Prior art keywords
pose
surgical instrument
deep learning
image
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310992271.9A
Other languages
Chinese (zh)
Inventor
陈业慧
周莹
周博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Helibai Hefei Intelligent Technology Co ltd
Original Assignee
Helibai Hefei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Helibai Hefei Intelligent Technology Co ltd filed Critical Helibai Hefei Intelligent Technology Co ltd
Priority to CN202310992271.9A
Publication of CN117011381A
Legal status: Pending

Classifications

    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06T 5/20: Image enhancement or restoration using local operators
    • G06T 7/12: Edge-based segmentation
    • G06T 7/13: Edge detection
    • G06T 7/593: Depth or shape recovery from multiple images, from stereo images
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20024: Filtering details
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20112: Image segmentation details
    • G06T 2207/30168: Image quality inspection
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of medical instrument control, and particularly discloses a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision. The method comprises the following steps: data acquisition and preprocessing, object detection and segmentation, stereo matching and depth estimation, instrument pose estimation, and pose tracking and updating. According to the invention, two or more cameras acquire images or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN then detects the position of the surgical instrument in the image; the input stereo images are fed into a trained stereo neural network to obtain a predicted depth map; and an extended Kalman filter (EKF) nonlinear optimization method tracks the pose of the surgical instrument in continuous image frames, improving the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.

Description

Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
Technical Field
The invention belongs to the technical field of medical instrument control, and particularly relates to a method and a system for estimating the pose of a real-time surgical instrument based on deep learning and stereoscopic vision.
Background
With the development of technology, surgical procedures have evolved from traditional open surgery to minimally invasive surgery and on to robotic surgery. This rapid development is inseparable from several important supporting technologies: image enhancement, intelligent instruments, medical robots, data analysis and machine learning, and cloud interconnection. At present, however, these technologies remain relatively isolated and their fields of action barely intersect. For example, image enhancement has its own independent medical systems and is used mainly in preoperative diagnosis, offering little help during surgery itself. Intelligent instruments and medical robots act mostly on the surgical procedure, while judging the difficulty of an operation, formulating the surgical plan, and even guiding the procedure still rely largely on the surgeon's experience. Existing devices therefore operate independently: no unified technology fusion and interaction platform has been formed, interconnection cannot be achieved, the data utilization rate is low, and the ecological cycle of the whole instrument industry is hindered, so that a single system or instrument is increasingly unable to meet surgical needs.
Disclosure of Invention
The invention aims to provide a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision, so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision comprises the following steps:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
Preferably, the denoising is implemented by a bilateral filter, defined as follows:
I(x) = (1 / W(x)) * Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|) * I0(y)
W(x) = Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|)
where:
I: denoised image
I0: original image
x: location of the pixel being denoised
Ω: window (neighborhood) centered at x
f_γ: range kernel (weighting differences in intensity)
G_s: spatial kernel (weighting differences in pixel position).
Preferably, the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pre-trained convolutional neural network (such as VGG or ResNet) performs a forward pass over the input image to obtain high-level features of the image;
assuming that the resulting feature map is F, a region proposal network (RPN) generates candidate regions that may contain a surgical instrument on the feature map F; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; a generated anchor box B is represented by its coordinates and dimensions, B = {bx, by, bh, bw}, where (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is its width;
RoI Pooling is performed on each anchor box to map it to a feature map of fixed size, giving the RoI features, which are then fed into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); the classification score is S = {s_0, s_1, ..., s_n}, where s_0 denotes the background class and s_1 to s_n denote the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor predicts its offset relative to the true target bounding box so as to refine the position of the anchor box; the regression output is R = {rx, ry, rh, rw}, where (rx, ry) are the coordinates of the bounding box center, rh is the height of the bounding box, and rw is its width;
non-maximum suppression (NMS) is applied to discard heavily overlapping candidate regions, keeping only those most likely to contain a surgical instrument; NMS screens the candidate regions according to the target class score and the degree of overlap (IoU) of the bounding boxes;
finally, the final surgical instrument detection results are output according to the candidate regions retained by NMS and the corresponding target class scores; each detection result contains an instrument class label, a bounding box position, and a confidence probability.
Preferably, the deep learning method adopts a stereo neural network, specifically as follows: an encoder-decoder architecture is adopted, in which the encoder extracts features from the images and the decoder decodes the feature maps into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the ground-truth depth maps are likewise normalized and preprocessed to be consistent with the network output;
training the network: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the true depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss function includes the mean squared error (MSE) and the Smooth L1 loss, both of which measure the difference between the predicted depth map and the true depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain the predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to the true physical depth range, removing noise, or further smoothing the depth map.
Preferably, the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in a tracking task, a state variable must be defined to represent the pose of the instrument; in general, the pose consists of a position vector and a rotation quaternion and is represented as a state vector x = [p, q], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made from the state estimate of the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and orientation of the instrument in the next frame;
predicted state:
x_h = f(x_p)
where f() is the motion equation and x_p is the state estimate at the previous time step;
predicted covariance matrix:
P_h = F*P_p*F^T + Q
where P_p is the state covariance matrix at the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using the observation data (typically the instrument position detected from the image); for a nonlinear system, the state is updated with the EKF;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
where z is the observation data and z_h is the predicted observation; through the above steps, the pose of the surgical instrument is tracked in continuous image frames, thereby improving the stability and accuracy of pose estimation.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, the system comprising:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the pose tracking and updating module is used for tracking the pose of the surgical instrument in continuous image frames and updating the pose according to the previous pose estimate and the new image data, so as to improve the stability and accuracy of the estimation.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, two or more cameras are used to acquire image or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN then detects the position of the surgical instrument in the image and obtains the bounding box information of the instrument, yielding surgical instrument detection results, each of which comprises an instrument class label, a bounding box position, and a confidence probability.
(2) According to the invention, the depth values of the pixels in the image are predicted with a stereo neural network; the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; and the pose of the surgical instrument is tracked in continuous image frames with the extended Kalman filter (EKF) nonlinear optimization method, improving the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
Drawings
FIG. 1 is a flow chart of a method for estimating the pose of a real-time surgical instrument based on deep learning and stereoscopic vision according to the invention;
fig. 2 is a block diagram of a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
referring to fig. 1 to 2, the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision includes:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
The denoising is implemented by a bilateral filter, defined as follows:
I(x) = (1 / W(x)) * Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|) * I0(y)
W(x) = Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|)
where:
I: denoised image
I0: original image
x: location of the pixel being denoised
Ω: window (neighborhood) centered at x
f_γ: range kernel (weighting differences in intensity)
G_s: spatial kernel (weighting differences in pixel position).
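By way of illustration only, the denoising step can be sketched with the bilateral filter provided by OpenCV; the filter diameter, the two sigma values, and the file names below are illustrative assumptions rather than parameters specified by the invention:

```python
import cv2

# Illustrative denoising of one captured frame with a bilateral filter.
# d, sigmaColor and sigmaSpace are assumed values, not values from the patent.
frame = cv2.imread("left_view.png")              # I0: original image
denoised = cv2.bilateralFilter(
    frame,
    d=9,             # diameter of the pixel neighborhood (window Omega)
    sigmaColor=75,   # range kernel f_gamma: weighting of intensity differences
    sigmaSpace=75,   # spatial kernel G_s: weighting of pixel distances
)
cv2.imwrite("left_view_denoised.png", denoised)  # I: denoised image
```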
The deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pre-trained convolutional neural network (such as VGG or ResNet) performs a forward pass over the input image to obtain high-level features of the image;
assuming that the resulting feature map is F, a region proposal network (RPN) generates candidate regions that may contain a surgical instrument on the feature map F; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; a generated anchor box B is represented by its coordinates and dimensions, B = {bx, by, bh, bw}, where (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is its width;
RoI Pooling is performed on each anchor box to map it to a feature map of fixed size, giving the RoI features, which are then fed into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); the classification score is S = {s_0, s_1, ..., s_n}, where s_0 denotes the background class and s_1 to s_n denote the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor predicts its offset relative to the true target bounding box so as to refine the position of the anchor box; the regression output is R = {rx, ry, rh, rw}, where (rx, ry) are the coordinates of the bounding box center, rh is the height of the bounding box, and rw is its width;
non-maximum suppression (NMS) is applied to discard heavily overlapping candidate regions, keeping only those most likely to contain a surgical instrument; NMS screens the candidate regions according to the target class score and the degree of overlap (IoU) of the bounding boxes;
finally, the final surgical instrument detection results are output according to the candidate regions retained by NMS and the corresponding target class scores; each detection result contains an instrument class label, a bounding box position, and a confidence probability.
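As a non-authoritative sketch of this detection step, the snippet below runs the generic Faster R-CNN implementation from torchvision; a COCO-pretrained model and a 0.8 confidence threshold stand in for the surgical-instrument detector that the invention trains but does not publish:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Hypothetical detector: a COCO-pretrained Faster R-CNN stands in for a model
# fine-tuned on surgical-instrument images.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("left_view_denoised.png").convert("RGB"))
with torch.no_grad():
    detections = model([image])[0]   # non-maximum suppression is applied internally

# Keep only confident detections; 0.8 is an illustrative threshold.
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score >= 0.8:
        print(f"class={label.item()}  score={score:.2f}  box={box.tolist()}")
```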
The deep learning method adopts a stereo neural network, specifically as follows: an encoder-decoder architecture is adopted, in which the encoder extracts features from the images and the decoder decodes the feature maps into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the ground-truth depth maps are likewise normalized and preprocessed to be consistent with the network output;
training the network: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the true depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss function includes the mean squared error (MSE) and the Smooth L1 loss, both of which measure the difference between the predicted depth map and the true depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain the predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to the true physical depth range, removing noise, or further smoothing the depth map.
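The following is a minimal sketch, in PyTorch, of an encoder-decoder stereo network and one training step with the Smooth L1 loss; the layer sizes, learning rate, and image resolution are illustrative assumptions and not details disclosed by the invention:

```python
import torch
import torch.nn as nn

class StereoDepthNet(nn.Module):
    """Toy encoder-decoder: left and right views are concatenated along the
    channel axis and decoded into a dense one-channel depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, left, right):
        x = torch.cat([left, right], dim=1)    # (B, 6, H, W)
        return self.decoder(self.encoder(x))   # (B, 1, H, W) predicted depth

# One illustrative training step against a ground-truth depth map.
net = StereoDepthNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
left, right = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
gt_depth = torch.rand(1, 1, 256, 256)

pred = net(left, right)
loss = nn.functional.smooth_l1_loss(pred, gt_depth)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```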
The optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in a tracking task, a state variable must be defined to represent the pose of the instrument; in general, the pose consists of a position vector and a rotation quaternion and is represented as a state vector x = [p, q], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made from the state estimate of the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and orientation of the instrument in the next frame;
predicted state:
x_h = f(x_p)
where f() is the motion equation and x_p is the state estimate at the previous time step;
predicted covariance matrix:
P_h = F*P_p*F^T + Q
where P_p is the state covariance matrix at the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using the observation data (typically the instrument position detected from the image); for a nonlinear system, the state is updated with the EKF;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
where z is the observation data and z_h is the predicted observation; through the above steps, the pose of the surgical instrument is tracked in continuous image frames, thereby improving the stability and accuracy of pose estimation.
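A minimal numerical sketch of the above prediction and update equations is given below; for simplicity only the three-dimensional position part of the state vector is tracked with a constant-position motion model, the quaternion component and its normalization are omitted, and the noise covariances are illustrative assumptions:

```python
import numpy as np

# With an identity motion and observation model, the EKF Jacobians F and H
# reduce to identity matrices; Q and R are assumed noise covariances.
F = np.eye(3)
H = np.eye(3)
Q = np.eye(3) * 1e-4        # process-noise covariance
R = np.eye(3) * 1e-2        # observation-noise covariance

def f(x):                   # motion equation (constant-position model)
    return x

def h(x):                   # observation equation (position measured directly)
    return x

def ekf_step(x_p, P_p, z):
    # Prediction
    x_h = f(x_p)
    P_h = F @ P_p @ F.T + Q
    # Update
    z_h = h(x_h)
    S = H @ P_h @ H.T + R
    K = P_h @ H.T @ np.linalg.inv(S)
    x_new = x_h + K @ (z - z_h)
    P_new = (np.eye(3) - K @ H) @ P_h
    return x_new, P_new

# Example: fuse one instrument-tip position recovered from the stereo depth map.
x_p, P_p = np.zeros(3), np.eye(3)
z = np.array([0.10, -0.02, 0.35])          # metres, illustrative measurement
x_new, P_new = ekf_step(x_p, P_p, z)
print(x_new)
```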
As can be seen from the above, images or video data of the surgical scene are first acquired with two or more cameras; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN then detects the position of the surgical instrument in the image and obtains the bounding box information of the instrument, yielding surgical instrument detection results, each of which comprises an instrument class label, a bounding box position, and a confidence probability;
the depth values of the pixels in the image are predicted with a stereo neural network: the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; and the pose of the surgical instrument is tracked in continuous image frames with the extended Kalman filter (EKF) nonlinear optimization method, improving the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, the system comprising:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the pose tracking and updating module is used for tracking the pose of the surgical instrument in continuous image frames and updating the pose according to the previous pose estimate and the new image data, so as to improve the stability and accuracy of the estimation.
The beneficial effects of the method are the same as those of the embodiment of the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and the detailed description is omitted.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (6)

1. The real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision is characterized by comprising the following steps of:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
2. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein the denoising is implemented by a bilateral filter defined as follows:
I(x) = (1 / W(x)) * Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|) * I0(y)
W(x) = Σ_{y ∈ Ω(x)} G_s(||y - x||) * f_γ(|I0(y) - I0(x)|)
where:
I: denoised image
I0: original image
x: location of the pixel being denoised
Ω: window (neighborhood) centered at x
f_γ: range kernel (weighting differences in intensity)
G_s: spatial kernel (weighting differences in pixel position).
3. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pre-trained convolutional neural network (such as VGG or ResNet) performs a forward pass over the input image to obtain high-level features of the image;
assuming that the resulting feature map is F, a region proposal network (RPN) generates candidate regions that may contain a surgical instrument on the feature map F; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; a generated anchor box B is represented by its coordinates and dimensions, B = {bx, by, bh, bw}, where (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is its width;
RoI Pooling is performed on each anchor box to map it to a feature map of fixed size, giving the RoI features, which are then fed into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); the classification score is S = {s_0, s_1, ..., s_n}, where s_0 denotes the background class and s_1 to s_n denote the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor predicts its offset relative to the true target bounding box so as to refine the position of the anchor box; the regression output is R = {rx, ry, rh, rw}, where (rx, ry) are the coordinates of the bounding box center, rh is the height of the bounding box, and rw is its width;
non-maximum suppression (NMS) is applied to discard heavily overlapping candidate regions, keeping only those most likely to contain a surgical instrument; NMS screens the candidate regions according to the target class score and the degree of overlap (IoU) of the bounding boxes;
finally, the final surgical instrument detection results are output according to the candidate regions retained by NMS and the corresponding target class scores; each detection result contains an instrument class label, a bounding box position, and a confidence probability.
4. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein the deep learning method adopts a stereo neural network, specifically as follows: an encoder-decoder architecture is adopted, in which the encoder extracts features from the images and the decoder decodes the feature maps into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the ground-truth depth maps are likewise normalized and preprocessed to be consistent with the network output;
training the network: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the true depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss function includes the mean squared error (MSE) and the Smooth L1 loss, both of which measure the difference between the predicted depth map and the true depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain the predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to the true physical depth range, removing noise, or further smoothing the depth map.
5. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 2, wherein: the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in a tracking task, a state variable must be defined to represent the pose of the instrument; in general, the pose consists of a position vector and a rotation quaternion and is represented as a state vector x = [p, q], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made from the state estimate of the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and orientation of the instrument in the next frame;
predicted state:
x_h = f(x_p)
where f() is the motion equation and x_p is the state estimate at the previous time step;
predicted covariance matrix:
P_h = F*P_p*F^T + Q
where P_p is the state covariance matrix at the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using the observation data (typically the instrument position detected from the image); for a nonlinear system, the state is updated with the EKF;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
where z is the observation data and z_h is the predicted observation; through the above steps, the pose of the surgical instrument is tracked in continuous image frames, thereby improving the stability and accuracy of pose estimation.
6. A real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, implementing the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision according to any one of claims 1 to 5, characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the pose tracking and updating module is used for tracking the pose of the surgical instrument in continuous image frames and updating the pose according to the previous pose estimate and the new image data, so as to improve the stability and accuracy of the estimation.
CN202310992271.9A 2023-08-08 2023-08-08 Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision Pending CN117011381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310992271.9A CN117011381A (en) 2023-08-08 2023-08-08 Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310992271.9A CN117011381A (en) 2023-08-08 2023-08-08 Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision

Publications (1)

Publication Number Publication Date
CN117011381A true CN117011381A (en) 2023-11-07

Family

ID=88572414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310992271.9A Pending CN117011381A (en) 2023-08-08 2023-08-08 Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision

Country Status (1)

Country Link
CN (1) CN117011381A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671012A (en) * 2024-01-31 2024-03-08 临沂大学 Method, device and equipment for calculating absolute and relative pose of endoscope in operation
CN117671012B (en) * 2024-01-31 2024-04-30 临沂大学 Method, device and equipment for calculating absolute and relative pose of endoscope in operation
CN118072929A (en) * 2024-04-22 2024-05-24 中国人民解放军总医院第七医学中心 Real-time data intelligent management method for portable sterile surgical instrument package storage equipment

Similar Documents

Publication Publication Date Title
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN110688965B (en) IPT simulation training gesture recognition method based on binocular vision
CN109684925B (en) Depth image-based human face living body detection method and device
CN107452015B (en) Target tracking system with re-detection mechanism
CN117011381A (en) Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
CN109685045B (en) Moving target video tracking method and system
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
JP2006209755A (en) Method for tracing moving object inside frame sequence acquired from scene
CN110555868A (en) method for detecting small moving target under complex ground background
CN113111947A (en) Image processing method, apparatus and computer-readable storage medium
CN110298248A (en) A kind of multi-object tracking method and system based on semantic segmentation
CN107194929B (en) Method for tracking region of interest of lung CT image
CN113379789B (en) Moving target tracking method in complex environment
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN115909110A (en) Lightweight infrared unmanned aerial vehicle target tracking method based on Simese network
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN113436251B (en) Pose estimation system and method based on improved YOLO6D algorithm
CN113920168A (en) Image tracking method in audio and video control equipment
CN116665097A (en) Self-adaptive target tracking method combining context awareness
CN117079313A (en) Image processing method, device, equipment and storage medium
CN114882372A (en) Target detection method and device
KR20230046818A (en) Data learning device and method for semantic image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination