CN111652175A - Real-time surgical tool detection method applied to robot-assisted surgical video analysis - Google Patents

Real-time surgical tool detection method applied to robot-assisted surgical video analysis

Info

Publication number
CN111652175A
CN111652175A (application CN202010529745.2A)
Authority
CN
China
Prior art keywords
surgical tool
surgical
robot
image
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010529745.2A
Other languages
Chinese (zh)
Inventor
赵子健
刘玉莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010529745.2A priority Critical patent/CN111652175A/en
Publication of CN111652175A publication Critical patent/CN111652175A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/20: Scenes; Scene-specific elements in augmented reality scenes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03: Recognition of patterns in medical or anatomical images
    • G06V2201/034: Recognition of patterns in medical or anatomical images of medical instruments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a real-time surgical tool detection method applied to robot-assisted surgical video analysis, comprising: acquiring a robot-assisted surgery video and processing it to obtain surgical images; estimating surgical tool key points on each surgical image to obtain a heat map of surgical tool center points, and predicting the center point and the size of each surgical tool from the peaks of the heat map; and calculating the bounding box of the surgical tool from the predicted center point and size. A lightweight convolutional neural network is adopted in which a fire module replaces the residual module of the traditional hourglass network, greatly reducing the parameters required to train the network; detection speed is improved while high detection accuracy is maintained, meeting the requirement of real-time detection.

Description

Real-time surgical tool detection method applied to robot-assisted surgical video analysis
Technical Field
The disclosure belongs to the technical field of robot-assisted surgical video analysis, and particularly relates to a real-time surgical tool detection method applied to robot-assisted surgical video analysis.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Robot-assisted surgery falls mainly into two categories: robot-assisted surgical operation and robot-assisted surgical navigation. In robot-assisted surgical operation, a surgeon operates by means of a robot arm, reducing surgical accidents caused by human factors such as subjective judgment and fatigue; the surgeon can operate the robot away from the operating table to perform precise surgery. This departs completely from the traditional concept of surgery and can also reduce the probability of infection among medical staff. Robot-assisted surgical navigation accurately registers pre-operative or intra-operative image data to the anatomical structure of the patient on the operating table, tracks the surgical tool during the operation, and updates and displays its position on the patient image in real time in the form of a virtual probe, so that the doctor clearly knows where the surgical tool lies relative to the patient's anatomy, making the operation faster, more accurate and safer.
Video analysis techniques use computer vision to distinguish the background from the objects in a camera scene and then further analyze and track those objects. Modeled on how the human eye works, video analysis follows a basic pipeline of acquisition, preprocessing, processing and action. First, video data of experienced doctors performing operations is collected; then, surgical videos with clear pictures and complete procedures are split into frames; next, the appearance time, position and other information of the surgical tools in the resulting surgical images are identified; finally, the obtained surgical tool information supports applications such as novice training, surgical early warning and operating-room resource allocation.
Real-time surgical tool detection, a classic problem in computer vision, has the task of marking the exact position of a surgical tool in an image with a box and giving the tool's category. Because surgical tool detection runs on surgical video images, the detection process must be both fast and accurate; that is, detection must be real-time as well as precise.
Unlike ordinary object detection tasks, the images used for surgical tool detection often contain factors unfavorable to detection, such as blood, fog, blur and fast tool motion, which lower detection precision, disturb the surgical navigation process and can harm the patient. On the other hand, for surgical tool detection to assist navigation, real-time performance is essential: if detection is not real-time, the surgeon's view during the operation is delayed, causing unnecessary damage to the patient. However, the methods currently adopted for surgical tool detection are very time-consuming; they must generate a large number of anchor boxes as priors and even map those priors back onto the image feature map, which increases the amount of computation and cannot achieve real-time performance.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present disclosure provides a real-time surgical tool detection method applied to robot-assisted surgical video analysis, which improves both the speed and the accuracy of surgical tool detection.
In one aspect, to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
a real-time surgical tool detection method applied to robot-assisted surgical video analysis comprises the following steps:
acquiring a robot-assisted surgery video and processing the video to obtain a surgery image;
estimating key points of the surgical tool on the surgical image to obtain a heat map of the central point of the surgical tool, and predicting the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
the bounding box of the surgical tool is derived from the predicted center point of the surgical tool and its size.
In a further technical scheme, the surgical tool key points of the surgical image are estimated with a surgical tool key point estimator obtained by training the lightweight neural network framework.
In another aspect, a real-time surgical tool detection system for use in robotic-assisted surgical video analysis is disclosed, comprising:
the operation image acquisition module is configured to acquire a robot-assisted operation video and process the robot-assisted operation video to obtain an operation image;
the central point prediction module of the surgical tool is configured to preprocess the surgical image, estimate key points of the surgical tool to obtain a heat map of the central point of the surgical tool, and predict the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
and a bounding box acquisition module of the surgical tool, which is configured to obtain the bounding box of the surgical tool from the predicted central point of the surgical tool and the size thereof.
The above one or more technical solutions have the following beneficial effects:
1. The technical scheme adopts a lightweight convolutional neural network in which a fire module replaces the residual module of the traditional hourglass network, greatly reducing the parameters required to train the network; detection speed is improved while high detection accuracy is maintained, meeting the requirement of real-time detection.
2. The technical scheme extracts candidate surgical tool bounding boxes with an anchor-free method: the bounding box of the surgical tool to be detected is obtained directly from the coordinates of the tool's center point through a formula, with no post-processing step, making this one-stage detection method more efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a block diagram of the overall convolutional neural network of the present invention;
FIG. 2 is a detailed block diagram of the fire module that replaces the residual module.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The disclosed embodiments employ an anchor-box-less lightweight convolutional neural network architecture.
Interpretation of terms
Anchor-free means without anchor boxes. An anchor (also called an anchor box) is one of a group of rectangular boxes obtained before convolutional neural network training by clustering the training set with methods such as k-means; it represents the typical length and width of the objects in the data set. Object detection algorithms can generally be divided into anchor-based and anchor-free classes, the difference lying in whether anchors are used to extract candidate target boxes. An anchor-free method extracts candidate target boxes without using anchors.
The surgical tool detection methods widely adopted at present fall mainly into two types: two-stage detection methods and one-stage detection methods that require anchor boxes. Two-stage methods must map a large number of anchor boxes back onto the image feature map before classification and regression, which is very time-consuming. One-stage methods requiring anchor boxes can classify and regress directly after the anchor boxes are generated, but the large number of anchor boxes demands careful tuning of anchor parameters, the detection precision of surgical tools drops, and the speed still falls far short of the real-time requirement. The present method is a one-stage, lightweight surgical tool detection method that classifies and regresses directly without a large number of anchor boxes, meeting the real-time requirement while maintaining surgical tool detection accuracy.
Example one
This embodiment discloses a real-time surgical tool detection method applied to robot-assisted surgical video analysis; the overall concept is as follows:
acquiring a robot-assisted surgery video and processing the video to obtain a surgery image;
estimating key points of the surgical tool on the surgical image to obtain a heat map of the central point of the surgical tool, and predicting the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
the bounding box of the surgical tool is derived from the predicted center point of the surgical tool and its size.
The method comprises the following specific steps:
S1: acquiring a robot-assisted surgery video and splitting the surgical video into frames to obtain surgical images;
S2: initializing the neural network framework for training;
S3: inputting the surgical images obtained in step S1 into the neural network framework and preprocessing them;
S4: training the neural network framework to obtain the surgical tool key point estimator $\hat{Y}$ for the surgical images of step S3;
S5: obtaining the heat map of surgical tool center points from the key point estimator of step S4, regressing the size of the surgical tool bounding box, and outputting the offset values introduced by down-sampling;
S6: obtaining the bounding box of the surgical tool from the center point obtained in step S5.
In a specific implementation example, step S1 specifically includes:
S11: during the robot-assisted surgery, a camera captures video of the whole procedure at 25 fps;
S12: the video collected in step S11 is down-sampled to 5 fps with framing software and stored as surgical images (a frame-extraction sketch follows step S13). Note that in this implementation the original video is down-sampled to the frame rate at which the surgical videos were manually annotated; the down-sampling also enriches the temporal information between video segments, which benefits surgical tool detection accuracy;
S13: step S12 is repeated until all surgical videos have been converted into surgical images.
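For illustration, a minimal frame-extraction sketch is given below; it assumes OpenCV (cv2) is available, and the file names and target rate are hypothetical.

```python
import cv2
import os

def video_to_frames(video_path, out_dir, target_fps=5):
    """Down-sample a surgical video to target_fps and save the frames."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)          # e.g. 25 fps from the camera
    step = max(1, round(src_fps / target_fps))   # keep every `step`-th frame
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# hypothetical usage:
# video_to_frames("surgery_01.mp4", "frames/surgery_01", target_fps=5)
```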
The step S2 specifically includes:
S21: let the number of surgical tool categories be C, with C = 1 chosen here; C denotes the categories of surgical tools appearing in the surgical video, and C = 1 means only one kind of surgical tool appears in the processed videos. For a surgical image of size W × H with down-sampling factor R, a size of 720 × 576 and R = 4 are chosen, so that the key point estimator output has shape $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$. The image first passes through a 7 × 7 convolution module and a residual module to reduce its resolution, which is uniformly set to 512 × 512 to speed up network training;
S22: the surgical image obtained in step S21 passes through two lightweight hourglass network stacks with relay supervision layers (a 1 × 1 convolution module plus batch normalization), as shown in FIG. 1. Although two hourglass modules are used, only two symmetric down-sampling and up-sampling modules with skip-connection layers are added; no max-pooling layer is used during down-sampling, and a stride of 2 is chosen instead to reduce the resolution of the surgical image.
The lightweight hourglass network stack comprises:
A down-sampling part, which processes the image down to a suitable working size and generates a low-resolution feature map of the surgical tool image.
An up-sampling part, which magnifies the feature map back to the higher resolution.
A relay supervision part: when gradient descent is applied to the whole network and the output-layer error is back-propagated layer by layer, gradients can vanish; inserting relay supervision in the middle of the network ensures that the low-level parameters are still updated.
Training the neural network amounts to feeding a picture into this black box: the network converts the picture into different features, propagates them layer by layer, and finally outputs the required result.
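As a rough structural sketch (not the patented network itself), the two-stack hourglass with relay supervision might be arranged as below in PyTorch; the channel width, recursion depth and head layout are assumptions, and the plain conv_block stands in for the fire module introduced in the next step.

```python
import torch
import torch.nn as nn

def conv_block(ch):
    # placeholder block; in the patented design this is the fire module
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                         nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    """Symmetric down/up-sampling with a skip connection; down-sampling
    uses a stride-2 convolution instead of max pooling."""
    def __init__(self, ch, depth=2):
        super().__init__()
        self.skip = conv_block(ch)
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.inner = Hourglass(ch, depth - 1) if depth > 1 else conv_block(ch)
        self.up = nn.Upsample(scale_factor=2)

    def forward(self, x):  # input H, W must be divisible by 2**depth
        return self.skip(x) + self.up(self.inner(self.down(x)))

class RelaySupervisedStacks(nn.Module):
    """Two stacked hourglasses; each stack emits an intermediate heat map
    through a 1x1 convolution plus batch normalization, so gradients
    reach the lower layers directly (relay supervision)."""
    def __init__(self, ch=128, num_classes=1):
        super().__init__()
        self.stacks = nn.ModuleList(Hourglass(ch) for _ in range(2))
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, num_classes, 1),
                          nn.BatchNorm2d(num_classes))
            for _ in range(2))

    def forward(self, x):
        outs = []
        for hg, head in zip(self.stacks, self.heads):
            x = hg(x)
            outs.append(torch.sigmoid(head(x)))
        return outs  # supervise every element during training
```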
Referring to FIG. 2, step S2 rests on replacing the residual modules of the traditional hourglass network with fire modules; the surgical tool key point estimator is learned by training the neural network composed of fire modules. The fire module lets the network cut its parameter count while preserving surgical tool detection accuracy, so that real-time detection becomes possible. Specifically:
S23: the residual module of the traditional hourglass framework used in step S22 is replaced by a fire module, which first compresses the channels of the input image with a 1 × 1 convolution kernel, then applies a module in which 1 × 1 convolution kernels and 3 × 3 depthwise separable convolution kernels run mixed in parallel, and finally outputs the result through a rectified linear unit (also called a linear rectification function), a nonlinear activation function common in convolutional neural networks. A stack of purely linear layers can only realize a linear mapping; introducing activation functions such as the rectified linear unit adds nonlinearity so the network can represent complex functions. This design reduces the training parameters and greatly accelerates training and detection; the fire module and the depthwise separable convolution are common devices of lightweight neural networks;
S24: the depthwise separable convolution in step S23 consists of two steps: first, each of the three channels is convolved with its own filter, yielding three values per position; second, a 1 × 1 × 3 pointwise convolution combines those three values into a single number.
In the neural network framework, the position of the surgical tool in each surgical image is learned through training, and every surgical image passes through this convolution operation.
The two steps of the depthwise separable convolution play different roles:
1. the depthwise step separates the per-channel spatial information;
2. the 1 × 1 pointwise step fuses the channels and reduces the size. The computation of a depthwise separable convolution is roughly 1/9 that of a traditional 3 × 3 convolution, which greatly increases speed.
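A minimal sketch of such a fire module, assuming PyTorch; the squeeze/expand channel counts are illustrative, but the structure follows the description above: a 1 × 1 squeeze, a parallel 1 × 1 and depthwise-separable 3 × 3 expand, and rectified-linear outputs.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable3x3(nn.Module):
    """Step 1: one 3x3 filter per channel (groups=c_in);
    step 2: a 1x1 pointwise convolution that fuses the channels."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FireModule(nn.Module):
    """Squeeze with 1x1; expand with parallel 1x1 and depthwise-separable 3x3."""
    def __init__(self, c_in, squeeze, expand):
        super().__init__()
        self.squeeze = nn.Conv2d(c_in, squeeze, 1)
        self.expand1x1 = nn.Conv2d(squeeze, expand, 1)
        self.expand3x3 = DepthwiseSeparable3x3(squeeze, expand)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return self.relu(torch.cat([self.expand1x1(s),
                                    self.expand3x3(s)], dim=1))

# e.g. FireModule(128, squeeze=16, expand=64) maps 128 -> 128 channels
# with far fewer parameters than a plain 3x3 residual block
```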
The step S3 specifically includes:
S31: processing the input surgical images in batches;
S32: preprocessing each input batch, i.e., applying data enhancement (rotation, translation, scaling and similar operations on the surgical images) to enlarge the training data set (an augmentation sketch follows step S33);
S33: repeating step S32 until all batches have been processed.
The step S4 specifically includes:
S41: training the neural network framework yields the surgical tool key point estimator $\hat{Y}$ for the surgical images of step S3. When the prediction is $\hat{Y}_{xyc} = 1$, the detected key point is the center point of a surgical tool; when the prediction is $\hat{Y}_{xyc} = 0$, the detected key point is background.
The surgical tool key point estimator is obtained by training the designed one-stage, anchor-free, lightweight convolutional neural network and is expressed as

$\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$

where W and H are the width and height of the input image, respectively, and R is the down-sampling factor.
The step S5 specifically includes:
S51: suppose the center point of a surgical tool of class c in the input surgical image I has ground-truth coordinates (95, 102); this ground truth can be splatted onto the heat map and used for the subsequent computations. The surgical tool key point estimator obtained from step S4 is $\hat{Y}$. Within a given radius, detected negative samples are penalized softly rather than with a hard zero/one label: the penalty for a negative sample is reduced by an unnormalized 2D Gaussian kernel $Y_{xyc}$ centered at the positive location. Because surgical image data suffers from a severe imbalance between positive and negative samples, the terms in front of the log operator act as balancing weights. If $Y_{xyc}$ approaches 1, the point is easy to detect; if $Y_{xyc}$ approaches 0, the key point has not yet been learned and its training weight should be increased, so the bracketed term before the log adjusts the training proportion according to the magnitude of $Y_{xyc}$. An improved focal loss function is used (a code sketch of this loss follows below):

$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases}$

where the hyper-parameters are α = 2 and β = 4, and N is the number of key points in the surgical image;
S52: a loss function composed of three parts is designed: $L_{det} = L_k + \lambda_s L_s + \lambda_o L_o$, where $L_k$ is the key point estimation loss above, $L_s$ is the L1 loss that estimates the surgical tool size with constant coefficient $\lambda_s = 0.1$, and $L_o$ is the local offset L1 loss with constant coefficient $\lambda_o = 1$;
S53: step S52 is repeated, continuously learning and training the network, so that the value of the loss function gradually decreases to a certain value and then stabilizes, until the loss curve of the convolutional neural network converges.
Convergence of the loss curve indicates that training has succeeded; an annotated surgical tool picture can then be input for testing, directly yielding C + 4 values: the class C of the surgical tool key point, the bounding box sizes W and H, and the offsets x and y.
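A sketch of this modified focal loss in PyTorch, following the formulation above with α = 2 and β = 4; the (B, C, H/R, W/R) tensor layout is an assumption.

```python
import torch

def modified_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-7):
    """pred, gt: heat maps of shape (B, C, H/R, W/R); gt holds the
    Gaussian-splatted ground truth, with exactly 1 at each center."""
    pos = gt.eq(1).float()                    # positive (center) locations
    neg = 1.0 - pos                           # everything else
    pred = pred.clamp(eps, 1 - eps)           # numerical safety for log
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)
    n = pos.sum().clamp(min=1)                # N = number of key points
    return -(pos_loss.sum() + neg_loss.sum()) / n

# L_det = modified_focal_loss(...) + 0.1 * size_l1 + 1.0 * offset_l1 (per S52)
```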
The heat map is an output of the neural network, i.e., the per-class scores of the surgical tool key points.
NMS (non-maximum suppression) post-processing, which computes the IoU between bounding boxes, is commonly used to remove duplicated boxes of the same surgical tool; but it is hard to differentiate and train, which is why most current detectors are not end-to-end trainable. Here, instead, every response point on the heat map is compared with its 8 neighbors: a key point is retained if its value is greater than or equal to all 8 neighboring values, and the top 100 peak points are finally kept. The NMS post-processing step is thus eliminated and end-to-end training becomes possible.
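The 8-neighborhood comparison can be implemented with a 3 × 3 max pooling, as in this sketch (same tensor-layout assumption; top-100 as in the text).

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """Keep points >= all 8 neighbours, then take the top-k.
    heatmap: (B, C, H, W) tensor of key point scores in [0, 1]."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    keep = (pooled == heatmap).float()            # local maxima only
    scores = (heatmap * keep).flatten(1)          # (B, C*H*W)
    top_scores, top_idx = scores.topk(k, dim=1)
    c, h, w = heatmap.shape[1:]
    cls = torch.div(top_idx, h * w, rounding_mode="floor")
    ys = torch.div(top_idx % (h * w), w, rounding_mode="floor")
    xs = top_idx % w
    return top_scores, cls, ys, xs                # no NMS needed
```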
For each of the C classes in the ground truth of the surgical image, the ground-truth key points $p \in \mathbb{R}^2$ are computed for training. The key point corresponding to the down-sampled original image is

$\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$

so that $\tilde{p}$ is the low-resolution key point corresponding to the original image after down-sampling is complete.
Each down-sampled ground-truth key point $\tilde{p}$ is splatted onto the heat map through the Gaussian kernel

$Y_{xyc} = \exp\!\left( -\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2 \sigma_p^2} \right)$

where $\sigma_p$ is a standard deviation related to the surgical tool dimensions W and H, so that the key points are distributed over the surgical tool feature map; where two Gaussians overlap, the larger value is kept. Each value Y therefore ranges from 0 to 1, with 1 marking a point whose prediction must be learned.
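A sketch of this Gaussian splatting for the ground-truth heat map; the 3σ cut-off radius is an assumption (the text states only that σ_p depends on the tool size W, H).

```python
import numpy as np

def splat_gaussian(heatmap, center, sigma):
    """Write exp(-((x-px)^2 + (y-py)^2) / (2 sigma^2)) around `center`
    into `heatmap` (H x W), keeping the element-wise maximum on overlap."""
    px, py = int(center[0]), int(center[1])
    radius = int(3 * sigma)                  # kernel is ~0 beyond 3 sigma
    h, w = heatmap.shape
    for y in range(max(0, py - radius), min(h, py + radius + 1)):
        for x in range(max(0, px - radius), min(w, px + radius + 1)):
            g = np.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
            heatmap[y, x] = max(heatmap[y, x], g)   # overlap: keep the max
    return heatmap
```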
Finally the key point loss function, the improved pixel-level logistic-regression focal loss $L_k$ above, is applied, and the remaining quantities are regressed with L1 loss functions.
If the bounding box of a certain surgical tool k is represented as $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$, the center point of the surgical tool is

$p_k = \left( \frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2} \right)$

The bounding box size of each surgical tool k is computed before training:

$s_k = \left( x_2^{(k)} - x_1^{(k)},\; y_2^{(k)} - y_1^{(k)} \right)$

this size being the W, H of the box after down-sampling of the original image. The key point estimator $\hat{Y}$ is used to estimate the center points of all surgical tools. To reduce the computational effort for real-time operation, a single size prediction $\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is shared by all surgical tool categories, and regression is performed at the center point positions with an L1 loss:

$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|$
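Under the same layout assumptions, the size (and, analogously, offset) regression reduces to an L1 penalty evaluated only at the ground-truth center locations; the helper below is illustrative.

```python
import torch
import torch.nn.functional as F

def l1_at_centers(pred_map, gt_values, centers):
    """pred_map: (B, 2, H, W) size or offset head; centers: list of
    (batch, y, x) ground-truth centers; gt_values: (N, 2) targets."""
    preds = torch.stack([pred_map[b, :, y, x] for (b, y, x) in centers])
    return F.l1_loss(preds, gt_values, reduction="sum") / max(len(centers), 1)

# total loss as in step S52: L_det = L_k + 0.1 * L_size + 1.0 * L_offset
```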
the step S6 specifically includes:
s61: according to the peak value of the heat map obtained in the step S5, the responses of all the values greater than or equal to 8 connected neighborhoods are detected, the first 100 peak values are kept, and the
Figure BDA0002534941490000101
2 detected center points for class 1
Figure BDA0002534941490000102
The set of (a) and (b),
Figure BDA0002534941490000103
is a prediction of the deviation of the surgical tool,
Figure BDA0002534941490000104
size prediction of surgical tools.
S62: based on the center point of the surgical tool detected in step S5 and the predicted size of the surgical tool, the formula is finally used:
Figure BDA0002534941490000105
the bounding box of the surgical tool is obtained, the whole process belongs to the bounding box of the candidate surgical tool extracted without an anchor box, and any post-processing process is not needed.
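Putting the pieces together, a sketch of the anchor-free decoding step; it assumes the peak coordinates, offsets and sizes produced by the earlier sketches.

```python
import torch

def decode_boxes(xs, ys, offsets, sizes):
    """xs, ys: (N,) peak coordinates on the down-sampled map;
    offsets: (N, 2) predicted (dx, dy); sizes: (N, 2) predicted (w, h).
    Returns (N, 4) boxes (x1, y1, x2, y2): no anchors, no NMS."""
    cx = xs + offsets[:, 0]
    cy = ys + offsets[:, 1]
    w, h = sizes[:, 0], sizes[:, 1]
    return torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2], dim=1)
```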
Based on the same inventive concept, the present embodiment is directed to a computing device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the real-time surgical tool detection method applied to the robot-assisted surgical video analysis in the first embodiment.
Based on the same inventive concept, an object of the present embodiment is to provide a computer-readable storage medium on which a computer program is stored which, when executed by a processor, performs the steps of the real-time surgical tool detection method applied to robot-assisted surgical video analysis of the first embodiment.
Based on the same inventive concept, the present embodiment aims to provide a real-time surgical tool detection system applied to video analysis of robot-assisted surgery, comprising:
the operation image acquisition module is configured to acquire a robot-assisted operation video and process the robot-assisted operation video to obtain an operation image;
the central point prediction module of the surgical tool is configured to preprocess the surgical image, estimate key points of the surgical tool to obtain a heat map of the central point of the surgical tool, and predict the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
and a bounding box acquisition module of the surgical tool, which is configured to obtain the bounding box of the surgical tool from the predicted central point of the surgical tool and the size thereof.
The steps involved in the apparatus of the above embodiment correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present disclosure.
Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. The real-time surgical tool detection method applied to robot-assisted surgical video analysis is characterized by comprising the following steps of:
acquiring a robot-assisted surgery video and processing the video to obtain a surgery image;
estimating key points of the surgical tool on the surgical image to obtain a heat map of the central point of the surgical tool, and predicting the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
and calculating the bounding box of the surgical tool from the predicted center point of the surgical tool and the size thereof.
2. The method of claim 1, wherein the surgical tool keypoint estimator derived from training the lightweight neural network framework is used to estimate the surgical tool keypoints for the surgical image.
3. The method of claim 2, wherein the lightweight neural network framework includes a fire module configured to compress the channels of an input image using a convolution kernel, then to apply a module in which convolution kernels and depthwise separable convolution kernels are mixed in parallel, and finally to output the result through a rectified linear unit.
4. The method as claimed in claim 2, wherein the surgical tool keypoint estimator obtained after training the lightweight neural network framework outputs different values to represent whether the detected keypoint is the center point of the surgical tool or the background of the surgical tool.
5. The method as claimed in claim 1, wherein processing the robot-assisted surgery video comprises splitting the surgical video into frames, down-sampling it, and storing the result as surgical images.
6. The method of claim 1, wherein the pre-processing, i.e., data enhancement, is performed on the surgical image prior to performing the surgical tool keypoint estimation on the surgical image to increase the training dataset samples for the lightweight neural network framework.
7. The method of claim 1, wherein the step of extracting the center point of the surgical tool comprises:
constructing a loss function based on the surgical tool key point estimator, and obtaining the center point coordinates of a surgical tool in the surgical image when the loss curve of the lightweight neural network framework converges.
8. Real-time surgical tool detection system for robot-assisted surgical video analysis, characterized by comprising:
the operation image acquisition module is configured to acquire a robot-assisted operation video and process the robot-assisted operation video to obtain an operation image;
the central point prediction module of the surgical tool is configured to preprocess the surgical image, estimate key points of the surgical tool to obtain a heat map of the central point of the surgical tool, and predict the central point of the surgical tool and the size of the surgical tool according to the peak value of the heat map;
and a bounding box acquisition module of the surgical tool, which is configured to obtain the bounding box of the surgical tool from the predicted central point of the surgical tool and the size thereof.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method of any of claims 1-7 for real-time surgical tool detection for use in robot-assisted surgical video analysis.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method for real-time surgical tool detection for robot-assisted surgical video analysis according to any of the preceding claims 1-7.
CN202010529745.2A 2020-06-11 2020-06-11 Real-time surgical tool detection method applied to robot-assisted surgical video analysis Pending CN111652175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010529745.2A CN111652175A (en) 2020-06-11 2020-06-11 Real-time surgical tool detection method applied to robot-assisted surgical video analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010529745.2A CN111652175A (en) 2020-06-11 2020-06-11 Real-time surgical tool detection method applied to robot-assisted surgical video analysis

Publications (1)

Publication Number Publication Date
CN111652175A true CN111652175A (en) 2020-09-11

Family

ID=72350504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010529745.2A Pending CN111652175A (en) 2020-06-11 2020-06-11 Real-time surgical tool detection method applied to robot-assisted surgical video analysis

Country Status (1)

Country Link
CN (1) CN111652175A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037263A (en) * 2020-09-14 2020-12-04 山东大学 Operation tool tracking system based on convolutional neural network and long-short term memory network
CN112699879A (en) * 2020-12-30 2021-04-23 山东大学 Attention-guided real-time minimally invasive surgical tool detection method and system
CN112926542A (en) * 2021-04-09 2021-06-08 博众精工科技股份有限公司 Performance detection method and device, electronic equipment and storage medium
CN113569727A (en) * 2021-07-27 2021-10-29 广东电网有限责任公司 Method, system, terminal and medium for identifying construction site in remote sensing image
CN114005022A (en) * 2021-12-30 2022-02-01 四川大学华西医院 Dynamic prediction method and system for surgical instrument
CN115122342A (en) * 2022-09-02 2022-09-30 北京壹点灵动科技有限公司 Software architecture for controlling a robot and control method of a robot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105263446A (en) * 2013-03-21 2016-01-20 康复米斯公司 Systems, methods, and devices related to patient-adapted hip joint implants
CN110532961A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of semantic traffic lights detection method based on multiple dimensioned attention mechanism network model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105263446A (en) * 2013-03-21 2016-01-20 康复米斯公司 Systems, methods, and devices related to patient-adapted hip joint implants
CN110532961A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of semantic traffic lights detection method based on multiple dimensioned attention mechanism network model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHINEDU INNOCENT NWOYE et al.: "Weakly Supervised Convolutional LSTM Approach for Tool Tracking in Laparoscopic Videos" *
IRO LAINA et al.: "Concurrent Segmentation and Localization for Tracking of Surgical Instruments" *
YUYING LIU et al.: "An Anchor-Free Convolutional Neural Network for Real-Time Surgical Tool Detection in Robot-Assisted Surgery" *
LIU Yuying et al.: "A Review of Deep Learning-Based Minimally Invasive Surgical Tool Detection and Tracking" (in Chinese) *
CHEN Zhaorui et al.: "A Review of Computer-Assisted Minimally Invasive Surgical Tool Tracking Algorithms" (in Chinese) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037263A (en) * 2020-09-14 2020-12-04 山东大学 Operation tool tracking system based on convolutional neural network and long-short term memory network
CN112037263B (en) * 2020-09-14 2024-03-19 山东大学 Surgical tool tracking system based on convolutional neural network and long-term and short-term memory network
CN112699879A (en) * 2020-12-30 2021-04-23 山东大学 Attention-guided real-time minimally invasive surgical tool detection method and system
CN112926542A (en) * 2021-04-09 2021-06-08 博众精工科技股份有限公司 Performance detection method and device, electronic equipment and storage medium
CN112926542B (en) * 2021-04-09 2024-04-30 博众精工科技股份有限公司 Sex detection method and device, electronic equipment and storage medium
CN113569727A (en) * 2021-07-27 2021-10-29 广东电网有限责任公司 Method, system, terminal and medium for identifying construction site in remote sensing image
CN113569727B (en) * 2021-07-27 2022-10-21 广东电网有限责任公司 Method, system, terminal and medium for identifying construction site in remote sensing image
CN114005022A (en) * 2021-12-30 2022-02-01 四川大学华西医院 Dynamic prediction method and system for surgical instrument
CN114005022B (en) * 2021-12-30 2022-03-25 四川大学华西医院 Dynamic prediction method and system for surgical instrument
CN115122342A (en) * 2022-09-02 2022-09-30 北京壹点灵动科技有限公司 Software architecture for controlling a robot and control method of a robot
CN115122342B (en) * 2022-09-02 2022-12-09 北京壹点灵动科技有限公司 Software system for controlling robot and control method of robot

Similar Documents

Publication Publication Date Title
CN111652175A (en) Real-time surgical tool detection method applied to robot-assisted surgical video analysis
CN109166130B (en) Image processing method and image processing device
CN110399929B (en) Fundus image classification method, fundus image classification apparatus, and computer-readable storage medium
CN107818554B (en) Information processing apparatus and information processing method
US20180060652A1 (en) Unsupervised Deep Representation Learning for Fine-grained Body Part Recognition
CN113034495B (en) Spine image segmentation method, medium and electronic device
CN112200162B (en) Non-contact heart rate measuring method, system and device based on end-to-end network
CN113920571A (en) Micro-expression identification method and device based on multi-motion feature fusion
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
Han et al. Learning generative models of tissue organization with supervised GANs
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
Yang et al. An efficient one-stage detector for real-time surgical tools detection in robot-assisted surgery
CN113237881B (en) Detection method and device for specific cells and pathological section detection system
CN114372962A (en) Laparoscopic surgery stage identification method and system based on double-particle time convolution
CN117137435B (en) Rehabilitation action recognition method and system based on multi-mode information fusion
Sokolova et al. Pixel-based iris and pupil segmentation in cataract surgery videos using mask R-CNN
CN114419401B (en) Method and device for detecting and identifying leucocytes, computer storage medium and electronic equipment
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN115546491A (en) Fall alarm method, system, electronic equipment and storage medium
CN115601823A (en) Method for tracking and evaluating concentration degree of primary and secondary school students
CN112614092A (en) Spine detection method and device
CN112699879A (en) Attention-guided real-time minimally invasive surgical tool detection method and system
Ramesh et al. Hybrid U-Net and ADAM Algorithm for 3DCT Liver Segmentation
Liu et al. AFCANet: An adaptive feature concatenate attention network for multi-focus image fusion
CN116129298B (en) Thyroid video stream nodule recognition system based on space-time memory network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination