CN117011381A - Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
- Publication number
- CN117011381A (application number CN202310992271.9A)
- Authority
- CN
- China
- Prior art keywords
- pose
- surgical instrument
- deep learning
- image
- estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20024—Filtering details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to the technical field of medical instrument control, and in particular discloses a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision. The method comprises the following steps: data acquisition and preprocessing, object detection and segmentation, stereo matching and depth estimation, instrument pose estimation, and pose tracking and updating. According to the invention, two or more cameras acquire image or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image; the input stereo images are fed into a trained stereo neural network to obtain a predicted depth map; and the pose of the surgical instrument is tracked across consecutive image frames with an extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
Description
Technical Field
The invention belongs to the technical field of medical instrument control, and in particular relates to a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision.
Background
With the development of technology, surgery has progressed from traditional open surgery to minimally invasive surgery and on to robotic surgery. This rapid progress is inseparable from several important supporting technologies: image enhancement, intelligent instruments, medical robots, data analysis and machine learning, and cloud interconnection. At present, however, these technologies remain relatively siloed and their fields of application barely intersect. For example, image enhancement has its own independent medical systems and is used mainly in preoperative diagnosis, offering little help during surgery itself. Intelligent instruments and medical robots act mainly on the surgical procedure, while judging the difficulty of an operation, formulating a surgical plan, and even guiding the procedure still rely largely on the surgeon's experience. Existing devices therefore operate in isolation: no unified platform for technology fusion and interaction has been formed, interconnection cannot be achieved, the utilization rate of data is low, the ecological cycle of the instrument industry is hindered, and a single system or instrument is increasingly unable to meet surgical needs.
Disclosure of Invention
The invention aims to provide a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision comprises the following steps:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
Preferably, the denoising is implemented by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
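For reference, a standard bilateral filter expressed in the symbols listed above can be written as follows; this formulation is a reconstruction under the assumption of the conventional bilateral filter and is not quoted from the original application:

```latex
I(x) = \frac{1}{W(x)} \sum_{x_i \in \Omega}
       G_s\!\bigl(\lVert x_i - x \rVert\bigr)\,
       f_r\!\bigl(\lvert I_0(x_i) - I_0(x) \rvert\bigr)\, I_0(x_i),
\qquad
W(x) = \sum_{x_i \in \Omega}
       G_s\!\bigl(\lVert x_i - x \rVert\bigr)\,
       f_r\!\bigl(\lvert I_0(x_i) - I_0(x) \rvert\bigr)
```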
Preferably, the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
Preferably, the deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
Preferably, the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and is characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, two or more cameras are used to acquire image or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image and to acquire the bounding box information of the instrument, yielding surgical instrument detection results in which each detection result comprises an instrument category label, a bounding box location and a confidence probability.
(2) According to the invention, the depth values of the pixels in the image are predicted by a stereo neural network: the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map, and the pose of the surgical instrument is tracked across consecutive image frames with the extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
Drawings
FIG. 1 is a flow chart of a method for estimating the pose of a real-time surgical instrument based on deep learning and stereoscopic vision according to the invention;
fig. 2 is a block diagram of a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
referring to fig. 1 to 2, the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision includes:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
The denoising is realized by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
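As an illustration of this preprocessing step, the short sketch below applies an off-the-shelf bilateral filter to a stereo pair with OpenCV; the function name, file paths, and the filter parameters (d, sigma_color, sigma_space) are assumptions for demonstration, not values prescribed by the invention.

```python
# Illustrative sketch (assumed parameters): bilateral denoising of the two
# views captured by the stereo camera system, using OpenCV.
import cv2

def denoise_stereo_pair(left_bgr, right_bgr, d=9, sigma_color=75, sigma_space=75):
    """Edge-preserving bilateral filtering applied to the left and right views."""
    left_dn = cv2.bilateralFilter(left_bgr, d, sigma_color, sigma_space)
    right_dn = cv2.bilateralFilter(right_bgr, d, sigma_color, sigma_space)
    return left_dn, right_dn

# Example usage with two frames read from disk (paths are placeholders):
# left, right = cv2.imread("left.png"), cv2.imread("right.png")
# left_dn, right_dn = denoise_stereo_pair(left, right)
```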
The deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
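The detection step can be illustrated with a minimal sketch built on torchvision's reference Faster R-CNN implementation (torchvision ≥ 0.13 API assumed); the pretrained weights, the fine-tuning comment, and the 0.5 confidence threshold are illustrative assumptions, since the invention does not fix a particular implementation.

```python
# Illustrative sketch (assumed weights and threshold): detecting surgical
# instruments in a single frame with torchvision's Faster R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # in practice the detection head would be fine-tuned on labelled instrument images

@torch.no_grad()
def detect_instruments(image_rgb, score_thresh=0.5):
    """Return bounding boxes, category labels and confidence scores for one frame."""
    out = model([to_tensor(image_rgb)])[0]   # backbone + RPN + RoI heads + NMS
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```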
The deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
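A minimal encoder-decoder sketch in PyTorch, consistent with the training procedure described above, is given below; the layer sizes, the two-stage down/up-sampling, and the Smooth L1 training step are illustrative assumptions rather than the network actually claimed.

```python
# Minimal sketch of an encoder-decoder stereo depth network in PyTorch.
import torch
import torch.nn as nn

class StereoDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: left and right views concatenated along the channel axis (3 + 3 = 6)
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample the features back to input resolution, one depth channel out
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, left, right):
        x = torch.cat([left, right], dim=1)
        return self.decoder(self.encoder(x))     # predicted depth map (B, 1, H, W)

# One training step with one of the losses named above (Smooth L1 here)
net, loss_fn = StereoDepthNet(), nn.SmoothL1Loss()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
left, right = torch.rand(2, 3, 128, 256), torch.rand(2, 3, 128, 256)
gt_depth = torch.rand(2, 1, 128, 256)            # ground-truth depth maps
loss = loss_fn(net(left, right), gt_depth)
opt.zero_grad(); loss.backward(); opt.step()
```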
The optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
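The prediction and update equations above translate directly into a few lines of NumPy; the sketch below is a generic EKF step in which the motion equation f, the observation equation h, and their Jacobians F and H are supplied by the caller, and the constant-position toy example at the end is an assumption for illustration only.

```python
# Direct numerical transcription of the EKF equations above, using NumPy.
import numpy as np

def ekf_step(x_p, P_p, z, f, F, h, H, Q, R):
    """One EKF predict/update cycle for instrument pose tracking."""
    # Prediction
    x_hat = f(x_p)                        # x_hat = f(x_p)
    P_hat = F @ P_p @ F.T + Q             # P_hat = F * P_p * F^T + Q
    # Update
    z_h = h(x_hat)                        # predicted observation
    S = H @ P_hat @ H.T + R               # innovation covariance
    K = P_hat @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_hat + K @ (z - z_h)         # corrected state estimate
    P_new = (np.eye(len(x_p)) - K @ H) @ P_hat
    return x_new, P_new

# Toy example: constant-position model, direct observation of instrument position
x, P = np.zeros(3), np.eye(3)
F = H = np.eye(3); Q = 1e-3 * np.eye(3); R = 1e-2 * np.eye(3)
z = np.array([0.10, -0.02, 0.35])         # detected instrument position (metres, assumed)
x, P = ekf_step(x, P, z, f=lambda s: s, F=F, h=lambda s: s, H=H, Q=Q, R=R)
```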
As can be seen from the above, images or video data of the surgical scene are first acquired by two or more cameras, and the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image and to acquire the bounding box information of the instrument, yielding surgical instrument detection results in which each detection result comprises an instrument category label, a bounding box location and a confidence probability;
the depth values of the pixels in the image are predicted by a stereo neural network: the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map, and the pose of the surgical instrument is tracked across consecutive image frames with the extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and is characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
The beneficial effects of the system are the same as those of the above embodiment of the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and are not described in detail here.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
Claims (6)
1. The real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision is characterized by comprising the following steps of:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
2. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the denoising is realized by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
3. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
4. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
5. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 2, wherein: the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
6. A real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, comprising the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision according to any one of the preceding claims 1-5, characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310992271.9A CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310992271.9A CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117011381A true CN117011381A (en) | 2023-11-07 |
Family
ID=88572414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310992271.9A Pending CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117011381A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117671012A (en) * | 2024-01-31 | 2024-03-08 | 临沂大学 | Method, device and equipment for calculating absolute and relative pose of endoscope in operation |
CN117671012B (en) * | 2024-01-31 | 2024-04-30 | 临沂大学 | Method, device and equipment for calculating absolute and relative pose of endoscope in operation |
CN118072929A (en) * | 2024-04-22 | 2024-05-24 | 中国人民解放军总医院第七医学中心 | Real-time data intelligent management method for portable sterile surgical instrument package storage equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |