CN117011381A - Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision
- Publication number
- CN117011381A (application number CN202310992271.9A)
- Authority
- CN
- China
- Prior art keywords
- pose
- surgical instrument
- deep learning
- image
- estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20024—Filtering details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to the technical field of medical instrument control, and in particular discloses a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision. The method comprises the following steps: data acquisition and preprocessing, object detection and segmentation, stereo matching and depth estimation, instrument pose estimation, and pose tracking and updating. According to the invention, two or more cameras acquire image or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image; the input stereo images are fed into a trained stereo neural network to obtain a predicted depth map; and the pose of the surgical instrument is tracked across consecutive image frames with an extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
Description
Technical Field
The invention belongs to the technical field of medical instrument control, and in particular relates to a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision.
Background
With the development of technology, surgery has progressed from traditional open surgery to minimally invasive surgery and on to robotic surgery. This rapid progress is inseparable from several important supporting technologies: image enhancement, intelligent instruments, medical robots, data analysis and machine learning, and cloud interconnection. At present, however, these technologies remain relatively siloed and their fields of application barely intersect. For example, image enhancement has its own independent medical systems and is used mainly in preoperative diagnosis, offering little help during surgery itself. Intelligent instruments and medical robots act mainly on the surgical procedure, while judging the difficulty of an operation, formulating a surgical plan, and even guiding the procedure still rely largely on the surgeon's experience. Existing devices therefore operate in isolation: no unified platform for technology fusion and interaction has been formed, interconnection cannot be achieved, the utilization rate of data is low, the ecological cycle of the instrument industry is hindered, and a single system or instrument is increasingly unable to meet surgical needs.
Disclosure of Invention
The invention aims to provide a real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision comprises the following steps:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
Preferably, the denoising is implemented by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
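For reference, a standard bilateral filter expressed in the symbols listed above can be written as follows; this formulation is a reconstruction under the assumption of the conventional bilateral filter and is not quoted from the original application:

```latex
I(x) = \frac{1}{W(x)} \sum_{x_i \in \Omega}
       G_s\!\bigl(\lVert x_i - x \rVert\bigr)\,
       f_r\!\bigl(\lvert I_0(x_i) - I_0(x) \rvert\bigr)\, I_0(x_i),
\qquad
W(x) = \sum_{x_i \in \Omega}
       G_s\!\bigl(\lVert x_i - x \rVert\bigr)\,
       f_r\!\bigl(\lvert I_0(x_i) - I_0(x) \rvert\bigr)
```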
Preferably, the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
Preferably, the deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
Preferably, the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and is characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, two or more cameras are used to acquire image or video data of the surgical scene; the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image and to acquire the bounding box information of the instrument, yielding surgical instrument detection results in which each detection result comprises an instrument category label, a bounding box location and a confidence probability.
(2) According to the invention, the depth values of the pixels in the image are predicted by a stereo neural network: the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map, and the pose of the surgical instrument is tracked across consecutive image frames with the extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
Drawings
FIG. 1 is a flow chart of a method for estimating the pose of a real-time surgical instrument based on deep learning and stereoscopic vision according to the invention;
fig. 2 is a block diagram of a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
referring to fig. 1 to 2, the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision includes:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
The denoising is realized by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
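As an illustration of this preprocessing step, the short sketch below applies an off-the-shelf bilateral filter to a stereo pair with OpenCV; the function name, file paths, and the filter parameters (d, sigma_color, sigma_space) are assumptions for demonstration, not values prescribed by the invention.

```python
# Illustrative sketch (assumed parameters): bilateral denoising of the two
# views captured by the stereo camera system, using OpenCV.
import cv2

def denoise_stereo_pair(left_bgr, right_bgr, d=9, sigma_color=75, sigma_space=75):
    """Edge-preserving bilateral filtering applied to the left and right views."""
    left_dn = cv2.bilateralFilter(left_bgr, d, sigma_color, sigma_space)
    right_dn = cv2.bilateralFilter(right_bgr, d, sigma_color, sigma_space)
    return left_dn, right_dn

# Example usage with two frames read from disk (paths are placeholders):
# left, right = cv2.imread("left.png"), cv2.imread("right.png")
# left_dn, right_dn = denoise_stereo_pair(left, right)
```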
The deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
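The detection step can be illustrated with a minimal sketch built on torchvision's reference Faster R-CNN implementation (torchvision ≥ 0.13 API assumed); the pretrained weights, the fine-tuning comment, and the 0.5 confidence threshold are illustrative assumptions, since the invention does not fix a particular implementation.

```python
# Illustrative sketch (assumed weights and threshold): detecting surgical
# instruments in a single frame with torchvision's Faster R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # in practice the detection head would be fine-tuned on labelled instrument images

@torch.no_grad()
def detect_instruments(image_rgb, score_thresh=0.5):
    """Return bounding boxes, category labels and confidence scores for one frame."""
    out = model([to_tensor(image_rgb)])[0]   # backbone + RPN + RoI heads + NMS
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```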
The deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
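A minimal encoder-decoder sketch in PyTorch, consistent with the training procedure described above, is given below; the layer sizes, the two-stage down/up-sampling, and the Smooth L1 training step are illustrative assumptions rather than the network actually claimed.

```python
# Minimal sketch of an encoder-decoder stereo depth network in PyTorch.
import torch
import torch.nn as nn

class StereoDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: left and right views concatenated along the channel axis (3 + 3 = 6)
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample the features back to input resolution, one depth channel out
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, left, right):
        x = torch.cat([left, right], dim=1)
        return self.decoder(self.encoder(x))     # predicted depth map (B, 1, H, W)

# One training step with one of the losses named above (Smooth L1 here)
net, loss_fn = StereoDepthNet(), nn.SmoothL1Loss()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
left, right = torch.rand(2, 3, 128, 256), torch.rand(2, 3, 128, 256)
gt_depth = torch.rand(2, 1, 128, 256)            # ground-truth depth maps
loss = loss_fn(net(left, right), gt_depth)
opt.zero_grad(); loss.backward(); opt.step()
```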
The optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
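The prediction and update equations above translate directly into a few lines of NumPy; the sketch below is a generic EKF step in which the motion equation f, the observation equation h, and their Jacobians F and H are supplied by the caller, and the constant-position toy example at the end is an assumption for illustration only.

```python
# Direct numerical transcription of the EKF equations above, using NumPy.
import numpy as np

def ekf_step(x_p, P_p, z, f, F, h, H, Q, R):
    """One EKF predict/update cycle for instrument pose tracking."""
    # Prediction
    x_hat = f(x_p)                        # x_hat = f(x_p)
    P_hat = F @ P_p @ F.T + Q             # P_hat = F * P_p * F^T + Q
    # Update
    z_h = h(x_hat)                        # predicted observation
    S = H @ P_hat @ H.T + R               # innovation covariance
    K = P_hat @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_hat + K @ (z - z_h)         # corrected state estimate
    P_new = (np.eye(len(x_p)) - K @ H) @ P_hat
    return x_new, P_new

# Toy example: constant-position model, direct observation of instrument position
x, P = np.zeros(3), np.eye(3)
F = H = np.eye(3); Q = 1e-3 * np.eye(3); R = 1e-2 * np.eye(3)
z = np.array([0.10, -0.02, 0.35])         # detected instrument position (metres, assumed)
x, P = ekf_step(x, P, z, f=lambda s: s, F=F, h=lambda s: s, H=H, Q=Q, R=R)
```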
As can be seen from the above, images or video data of the surgical scene are first acquired by two or more cameras, and the acquired images are denoised with a bilateral filter to reduce data noise and improve image quality; the deep learning target detection algorithm Faster R-CNN is then used to detect the position of the surgical instrument in the image and to acquire the bounding box information of the instrument, yielding surgical instrument detection results in which each detection result comprises an instrument category label, a bounding box location and a confidence probability;
the depth values of the pixels in the image are predicted by a stereo neural network: the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map, and the pose of the surgical instrument is tracked across consecutive image frames with the extended Kalman filter (EKF) nonlinear optimization method, which improves the stability and accuracy of pose estimation, so that the pose of the surgical instrument can be estimated accurately and stably in real time.
The invention also provides a real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, which implements the above real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and is characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
The beneficial effects of the system are the same as those of the above embodiment of the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision, and are not described in detail here.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
Claims (6)
1. The real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision is characterized by comprising the following steps of:
data acquisition and preprocessing, capturing images or video data in a surgical scene by using a stereoscopic camera system, acquiring depth information by using two or more cameras, and denoising the acquired images to reduce data noise and improve image quality;
object detection and segmentation, namely detecting the position of a surgical instrument in an image by using a deep learning target detection algorithm Faster R-CNN, and acquiring bounding box information of the instrument;
stereo matching and depth estimation, wherein a deep learning method, such as a stereo neural network, is adopted to predict the depth value of a pixel point in an image;
instrument pose estimation, estimating the pose of the instrument by using a deep learning method, such as a pose estimation neural network;
pose tracking and updating, tracking the pose of the surgical instrument in continuous image frames, and updating the pose according to the previous pose estimation and new image data by using an optimization method such as nonlinear optimization so as to improve the stability and accuracy of estimation.
2. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the denoising is realized by a bilateral filter, with the following notation:
I: denoised image
I_0: original image
x: location of the pixel being denoised
Ω: window of neighboring pixels centered at x
f_r: range kernel (smoothing differences in intensity)
G_s: spatial kernel (smoothing differences in pixel coordinates).
3. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the deep learning target detection algorithm Faster R-CNN specifically comprises feature extraction: a pretrained convolutional neural network (such as VGG or ResNet) performs a forward pass on the input image to obtain high-level features of the image;
assuming the feature map is F, a region proposal network (RPN) is used to generate candidate regions on the feature map F that may contain a surgical instrument; a plurality of anchor boxes are generated at each position of the feature map by means of a sliding window, each anchor box representing one candidate region; assuming a generated anchor box is B, each anchor box is represented by its coordinates and dimensions: B = {bx, by, bh, bw}, wherein (bx, by) are the coordinates of the anchor box center, bh is the height of the anchor box, and bw is the width of the anchor box;
RoI pooling is performed on each anchor box to map it to a feature map of fixed size, yielding RoI features, which are then input into two fully connected layers: one for object classification and the other for bounding box regression;
each anchor box is classified by a softmax classifier to determine whether it contains a surgical instrument and to assign it to a target category (typically surgical instrument or background); assuming the classification score is S, S = {s_0, s_1, ..., s_n}, where s_0 represents the background class and s_1 to s_n represent the surgical instrument classes;
for an anchor box classified as a surgical instrument, a regressor is used to predict its offset relative to the ground-truth target bounding box so as to accurately adjust the position of the anchor box; assuming the regression output is R, R = {rx, ry, rh, rw}, where (rx, ry) represents the coordinates of the bounding box center, rh represents the height of the bounding box, and rw represents the width of the bounding box;
non-maximum suppression (NMS) is applied to discard candidate regions that overlap heavily, keeping only the regions most likely to contain a surgical instrument; the NMS screens the candidate regions based on the target category score and the degree of overlap of the bounding boxes (IoU value);
finally, the final surgical instrument detection results are output according to the candidate regions retained by the NMS and the corresponding target category scores; each detection result contains an instrument category label, a bounding box location, and a confidence probability.
4. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 1, wherein: the deep learning method adopts a stereo neural network, and specifically comprises the following steps: an encoder-decoder architecture is adopted, wherein the encoder extracts the features of the image and the decoder decodes the feature map into a depth map; the input left and right views are preprocessed, for example by resizing the images and normalizing the pixel values, and the depth maps are likewise normalized and preprocessed to ensure consistency with the network output;
network training: the stereo neural network is trained using paired stereo images and the corresponding depth maps; during training, a loss function measures the gap between the network output and the ground-truth depth map, and the network parameters are updated by a gradient descent optimization algorithm; the loss functions include the mean squared error (MSE) and the Smooth L1 loss, which measure the difference between the predicted depth map and the ground-truth depth map;
depth prediction: in practical application, the input stereo images are fed into the trained stereo neural network to obtain a predicted depth map; the predicted depth map is post-processed, for example by mapping the depth values to a real physical depth range, removing noise, or further smoothing the depth map.
5. The deep learning and stereoscopic vision based real-time surgical instrument pose estimation method according to claim 2, wherein: the optimization method adopts an extended Kalman filter EKF nonlinear optimization method, and specifically comprises the following steps:
state representation:
in tracking tasks, state variables need to be defined to represent the pose of the instrument; in general, a pose may be composed of a position vector and a rotation quaternion, represented as a state vector x= [ p, q ], where p is the position vector and q is the rotation quaternion;
prediction step:
at each time step, a prediction is made using the state estimate from the previous time step; the prediction step updates the state estimate with the motion equation according to the dynamics model of the system; for a continuous tracking task, the dynamics model can be used to predict the position and pose of the instrument in the next frame;
prediction state:
x_hat=f(x_p)
wherein f() is the motion equation and x_p is the state estimate from the previous time step;
prediction covariance matrix:
P_hat=F*P_p*F^T+Q
wherein P_p is the state covariance matrix from the previous time step, F is the state transition matrix, and Q is the covariance matrix of the process noise;
update step:
at each time step, the state estimate is updated using observation data (typically the instrument position detected in the image); for a nonlinear system, the EKF is used to update the state;
predicting observed values:
z_h=h(x_h)
where h () is the observation equation and x_h is the predicted state estimate;
estimating a covariance matrix:
S=H*P_h*H^T+R
wherein H is an observation matrix, P_h is a predicted state covariance matrix, and R is a covariance matrix of observation noise;
calculating Kalman gain:
K=P_h*H^T*S^-1
updating the state estimation:
x_new=x_h+K*(z-z_h)
updating the covariance matrix:
P_new=(I-K*H)*P_h
wherein z is observation data, and z_h is a predicted observation value, and tracking of the pose of the surgical instrument in continuous image frames is realized through the steps, so that the stability and the accuracy of pose estimation are improved.
6. A real-time surgical instrument pose estimation system based on deep learning and stereoscopic vision, comprising the real-time surgical instrument pose estimation method based on deep learning and stereoscopic vision according to any one of the preceding claims 1-5, characterized in that the system comprises:
the data acquisition and preprocessing module is used for capturing images or video data in a surgical scene by using a stereoscopic camera system;
the object detection and segmentation module is used for detecting the position of the surgical instrument in the image by using a deep learning target detection algorithm Faster R-CNN and acquiring bounding box information of the instrument;
the stereo matching and depth estimation module adopts a deep learning method, such as a stereo neural network, to predict the depth value of a pixel point in an image;
an instrument pose estimation module that uses a deep learning method, such as a pose estimation neural network, to more accurately estimate pose;
and the gesture tracking and updating module is used for tracking the gesture of the surgical instrument in continuous image frames and updating the gesture according to the previous gesture estimation and the new image data so as to improve the stability and accuracy of the estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310992271.9A CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310992271.9A CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117011381A true CN117011381A (en) | 2023-11-07 |
Family
ID=88572414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310992271.9A Pending CN117011381A (en) | 2023-08-08 | 2023-08-08 | Real-time surgical instrument pose estimation method and system based on deep learning and stereoscopic vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117011381A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117671012A (en) * | 2024-01-31 | 2024-03-08 | 临沂大学 | Method, device and equipment for calculating absolute and relative pose of endoscope in operation |
CN117671012B (en) * | 2024-01-31 | 2024-04-30 | 临沂大学 | Method, device and equipment for calculating absolute and relative pose of endoscope in operation |
CN118072929A (en) * | 2024-04-22 | 2024-05-24 | 中国人民解放军总医院第七医学中心 | Real-time data intelligent management method for portable sterile surgical instrument package storage equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |