CN114463334A - Inner cavity vision SLAM method based on semantic segmentation
- Publication number: CN114463334A
- Application number: CN202111548927.5A
- Authority: CN (China)
- Prior art keywords: semantic segmentation; feature points; surgical tool; inner cavity; lumen
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/10: Segmentation; Edge detection
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06T2207/10068: Endoscopic image
- G06T2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
Abstract
The invention relates to an inner cavity vision SLAM method based on semantic segmentation, comprising the following steps: (1) acquiring an image sequence of the inner cavity environment through an endoscope and extracting feature points from the image data frame by frame; (2) performing binary semantic segmentation on the lumen images obtained in step (1) with a convolutional neural network to obtain mask information for the surgical tool; (3) removing the dynamic feature points on the surgical tool by combining the preliminarily extracted feature points with the segmentation result; (4) estimating the endoscope pose from the remaining reliable static feature points and completing three-dimensional mapping of the lumen environment. The method addresses the problems that arise when a SLAM system runs in an inner cavity scene, where a moving surgical tool reduces system robustness, introduces pose estimation errors, and corrupts the map.
Description
Technical field:
The invention relates to the technical field of computer vision, and in particular to an inner cavity vision SLAM method that eliminates dynamic feature points based on semantic segmentation.
Background art:
Simultaneous localization and mapping (SLAM) solves the problem of localizing a robot and building a map in an unknown environment, and is a basic module and prerequisite for applications such as autonomous robots and augmented reality. Visual SLAM takes its input from a camera serving as the sensor and completes the tasks of self-localization and mapping of the surrounding environment. Depending on how data association is performed, there are two main approaches: the feature method and the direct method. The feature method estimates three-dimensional geometry from a set of matched feature points across co-visible images, while the direct method estimates the three-dimensional shape directly from pixel intensities without extracting image features.
With growing attention to minimally invasive surgery and medical robots, minimally invasive surgical navigation systems are increasingly combined with computer vision techniques. A visual SLAM algorithm can perform three-dimensional localization and three-dimensional reconstruction of a lesion area from an endoscope image sequence alone, overcoming the relatively incomplete or poor visual feedback of traditional minimally invasive surgery. Traditional feature-based visual SLAM algorithms generally assume that the observed scene is static. In an inner cavity scene, however, a moving surgical tool may appear in the image, and feature points from the moving object produce wrong matches: the camera pose estimate drifts, the map becomes inaccurate, SLAM tracking may even be lost, and the robustness of the whole system is reduced. In recent years, image semantic segmentation and object detection algorithms based on deep learning have advanced greatly in both efficiency and accuracy. Combining a convolutional neural network with feature-based visual SLAM to recognize and segment moving objects, and masking the corresponding image regions so that no feature matching takes place there, can therefore improve the robustness of the SLAM system and yield a more accurate reconstruction.
Summary of the invention:
The invention aims to provide a semantic segmentation-based lumen vision SLAM method that segments the surgical tool appearing in the endoscope image of a lumen environment, eliminates the dynamic feature points on the surgical tool, and constructs a more accurate map using the static feature points of the background area.
To solve this technical problem, the invention provides an inner cavity vision SLAM method based on semantic segmentation, comprising the following steps:
Step 1: shooting the inner cavity environment with a monocular endoscope to acquire an inner cavity image sequence, then feeding the sequence into the SLAM system for frame-by-frame feature extraction and descriptor matching;
Step 2: performing semantic segmentation on the image data with a convolutional neural network, detecting the dynamic surgical tools appearing in the segmented image, and computing the corresponding binary mask;
Step 3: checking the preliminarily extracted feature point sequence against the binary mask produced by the segmentation network, and rejecting any preselected feature point that falls within the mask, thereby eliminating erroneous feature points detected on the dynamic surgical tool;
Step 4: continuing tracking with the static feature points of the remaining background areas, using them for subsequent pose estimation and environment map construction, thereby realizing a dynamic visual SLAM method for inner cavity scenes.
Preferably, the feature-based ORB-SLAM2 is adopted in step 1 as the overall SLAM framework. When the endoscope camera acquires an RGB image of the lumen, the image is passed to the SLAM system, and ORB feature points are extracted and matched for each frame by the ORB feature extraction algorithm in the SLAM tracking thread. In the ORB algorithm, the image pyramid has 8 levels; each pyramid level is divided into 30 × 30 grid cells, and corner points are extracted from each cell so that the extracted feature points are uniformly distributed, finally yielding the preselected feature point sequence.
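For illustration, this extraction step can be approximated with OpenCV's built-in ORB (a minimal sketch, not the patent's implementation: ORB-SLAM2 uses its own C++ extractor with per-cell corner selection, and the parameter values here are assumptions):

```python
# Approximate sketch of per-frame ORB extraction and matching (OpenCV stand-in
# for the ORB-SLAM2 extractor; feature count and thresholds are illustrative).
import cv2

def extract_orb_features(frame_bgr, n_features=1000):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # 8 pyramid levels with scale factor 1.2, matching the configuration above.
    orb = cv2.ORB_create(nfeatures=n_features, scaleFactor=1.2, nlevels=8)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors

def match_orb_descriptors(desc_prev, desc_curr):
    # Binary BRIEF descriptors are compared with the Hamming distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return matcher.match(desc_prev, desc_curr)
```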
Preferably, in step 2, the surgical tool moving in the intracavity scene is segmented using semantic information and a convolutional neural network; the method introduces the U-net neural network into the ORB-SLAM2 framework to realize semantic segmentation of the surgical tool. U-net is a fully convolutional network with a symmetric encoding-decoding structure, consisting mainly of two parts: a contraction path and an expansion path. In the contraction path, every pair of 3 × 3 convolutional layers is followed by a 2 × 2 max pooling layer; each convolutional layer is followed by a ReLU activation, and each downsampling step doubles the number of channels. The expansion path performs upsampling, each step comprising a 2 × 2 up-convolution layer and 3 × 3 convolutional layers, again with ReLU activations, while features from the contraction path are fused in through skip connections. The last layer of the network is a 1 × 1 convolutional layer that converts the feature maps into a binary classification result; the network has 23 convolutional layers in total. The model was trained on a dataset from MICCAI collected with the da Vinci surgical system. As its output the model produces a pixel probability map, from which a prediction mask of the surgical instrument is finally generated for the desired binary segmentation problem.
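A compact PyTorch sketch of such an encoder-decoder network follows (a generic U-net under stated assumptions: three-channel input with height and width divisible by 16, and the standard U-net channel widths, which are not necessarily the patent's exact model):

```python
# U-net style segmentation network: a contraction path of paired 3x3 convs
# plus 2x2 max pooling, an expansion path of 2x2 up-convs with skip-connection
# fusion, and a final 1x1 conv producing a per-pixel tool probability map.
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, each followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                   # 2x2 max pooling
        self.enc = nn.ModuleList()
        c_prev = in_channels
        for c in widths:                              # channels double per level
            self.enc.append(double_conv(c_prev, c))
            c_prev = c
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for c in reversed(widths[:-1]):
            self.up.append(nn.ConvTranspose2d(c_prev, c, 2, stride=2))
            self.dec.append(double_conv(2 * c, c))    # after skip concatenation
            c_prev = c
        self.head = nn.Conv2d(c_prev, 1, 1)           # final 1x1 convolution

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:                 # bottleneck feeds no skip
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([skip, up(x)], dim=1))  # skip-connection fusion
        return torch.sigmoid(self.head(x))            # pixel probability map

# Thresholding the probability map gives the binary tool mask:
# mask = (UNet()(torch.rand(1, 3, 256, 320)) > 0.5)
```

Counting the four transposed convolutions, this sketch contains 23 convolutional layers, consistent with the figure cited above.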
Preferably, in step 3, a RANSAC algorithm is used to eliminate mismatched points among the extracted ORB feature points. At each iteration, RANSAC estimates an optimal homography matrix H from a minimal sample of matched point pairs (the 3 × 3 matrix H has 8 degrees of freedom), according to:

$$s\begin{bmatrix}x'\\ y'\\ 1\end{bmatrix}=H\begin{bmatrix}x\\ y\\ 1\end{bmatrix}$$

where (x, y) are the coordinates of a feature point in the target image, (x', y') are the coordinates of the corresponding feature point in the image to be matched, and s is a scale parameter.
The RANSAC algorithm randomly draws minimal samples from the set of matched points and computes a homography from each sample; the model is then tested against all the data, and if the consensus set under a model is large enough, that model is accepted as optimal. The required number of sampling iterations K is

$$K=\frac{\log(1-p)}{\log\!\left(1-(1-\varepsilon)^{\delta}\right)}$$

where δ is the number of point pairs in each minimal sample, p is the desired probability that at least one of the K random samples contains no outliers, and ε is the fraction of outliers among all matches. The consensus-set size is thresholded by

$$T=(1-\varepsilon)\,n$$

where n is the total number of matched points.
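As a quick numeric check with illustrative values (not from the patent): taking p = 0.99, ε = 0.3 and a minimal sample of δ = 4 point pairs,

$$K=\frac{\log(1-0.99)}{\log\!\left(1-0.7^{4}\right)}=\frac{\log 0.01}{\log 0.7599}\approx 16.8\ \Rightarrow\ 17\ \text{iterations},$$

and with n = 200 matches the consensus threshold is T = 0.7 × 200 = 140 inliers.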
Since the RANSAC algorithm alone cannot guarantee that only correct point pairs are selected, feature points in a moving area may be accepted as inliers and introduce errors. Therefore, the dynamic feature points on the surgical tool are handled first: in the SLAM tracking thread, the preselected feature points are screened against the surgical tool mask information provided by the segmentation network, and the mask is used to restrict the feature detection area so that feature points do not concentrate on the surgical tool. The pixel-wise mask obtained by semantic segmentation distinguishes the surgical tool area from the background area of the image and is thus used to shield the moving surgical tool: any point of the feature point sequence that lies inside the mask area is identified as a dynamic feature point and deleted. The RANSAC algorithm is then applied to the feature points of the static area, so that the endoscope pose is estimated more stably and the constructed map is not disturbed by the motion of the surgical tool in the image.
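A minimal sketch of this two-stage outlier handling is given below (helper names are illustrative, not from the patent; it assumes a binary mask that is non-zero on tool pixels and enough static matches for homography estimation):

```python
# Stage 1: drop keypoints inside the surgical-tool mask; stage 2: run RANSAC
# on the static remainder to estimate the homography and its consensus set.
import cv2
import numpy as np

def filter_dynamic_keypoints(keypoints, descriptors, tool_mask):
    """Keep only keypoints whose pixel lies outside the binary tool mask."""
    keep = [i for i, kp in enumerate(keypoints)
            if tool_mask[int(round(kp.pt[1])), int(round(kp.pt[0]))] == 0]
    return [keypoints[i] for i in keep], descriptors[keep]

def ransac_homography(pts_prev, pts_curr, reproj_thresh=3.0):
    """pts_*: Nx2 arrays of matched static points; returns H and inlier flags."""
    H, inliers = cv2.findHomography(pts_prev, pts_curr,
                                    cv2.RANSAC, reproj_thresh)
    return H, inliers.ravel().astype(bool)
```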
Preferably, in step 4, after the dynamic feature points on the surgical tool have been removed, the static feature points of the remaining areas are used to compute the endoscope pose and build the map of the lumen scene. Specifically, the SLAM system consists mainly of a tracking thread, a local mapping thread and a loop-closure detection thread. The tracking thread extracts the static feature points of each image, performs pose estimation, tracks the reconstructed local map and decides on keyframes. The local mapping thread builds the local map and performs three-dimensional point reconstruction of the static environment. Finally, the loop-closure thread mainly carries out loop fusion and global optimization. In detail, for each new image frame the static ORB feature points are extracted and matched, the pose is predicted with a constant-velocity motion model and refined by minimizing the reprojection error, and it is decided whether to generate a keyframe. Map points are computed by triangulation between keyframes with a high degree of co-visibility, and duplicate map points between the current keyframe and its neighbours are fused. Finally, global bundle adjustment jointly optimizes the endoscope poses and the map points.
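For illustration, the triangulation step that turns matched static keypoints in two co-visible keyframes into map points could look as follows (a sketch assuming known camera intrinsics K and estimated 3 × 4 [R|t] poses; in ORB-SLAM2 this happens inside the local mapping thread in C++):

```python
# Sketch of map-point triangulation between two co-visible keyframes.
import cv2
import numpy as np

def triangulate_map_points(K, pose_a, pose_b, pts_a, pts_b):
    """pose_*: 3x4 [R|t] matrices; pts_*: Nx2 pixel coordinates of matches."""
    P_a, P_b = K @ pose_a, K @ pose_b                 # 3x4 projection matrices
    pa = np.asarray(pts_a, dtype=np.float64).T        # 2xN, as OpenCV expects
    pb = np.asarray(pts_b, dtype=np.float64).T
    pts4d = cv2.triangulatePoints(P_a, P_b, pa, pb)   # 4xN homogeneous points
    return (pts4d[:3] / pts4d[3]).T                   # Nx3 Euclidean map points
```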
Drawings
FIG. 1 is a flow chart of an intracavity vision SLAM method based on semantic segmentation according to the present invention;
FIG. 2 is a schematic diagram of a semantic segmentation network model in the present invention;
FIG. 3 is a schematic flow chart of the method for eliminating dynamic feature points on a surgical tool according to the present invention.
Detailed description of embodiments:
The principles and methods of the present invention are described in further detail below with reference to the accompanying drawings; the described embodiments are not intended to limit the invention.
As shown in FIG. 1, the invention provides a lumen vision SLAM method based on semantic segmentation that operates on the lumen image sequence captured by an endoscope camera and eliminates erroneous feature points on the surgical tool, yielding a more stable endoscope pose estimate and more accurate mapping. The method specifically comprises the following steps:
Step 1: acquiring video data of the inner cavity environment with an endoscope camera, converting the video into a sequence of RGB image frames, feeding them into the SLAM system for processing, and extracting ORB feature points.
Specifically, ORB-SLAM2 is selected as the overall SLAM framework. FAST corner points are extracted frame by frame from the cavity images by its feature extraction algorithm; scale invariance and rotation invariance are obtained by constructing an image pyramid and computing the grey centroid of each corner patch; a quadtree algorithm partitions the image into grid cells so that the feature points are uniformly distributed; and a BRIEF descriptor is then computed, giving the final ORB feature descriptors used as the preselected feature points.
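The grey-centroid orientation mentioned here can be sketched as follows (an illustrative helper, assuming the corner lies far enough from the image border for the patch to fit; the patch radius is an assumed parameter):

```python
# Intensity (grey) centroid of a patch around a FAST corner: the angle from
# the corner to the centroid gives the keypoint's orientation, which makes
# the subsequent BRIEF descriptor rotation invariant.
import numpy as np

def intensity_centroid_angle(gray, cx, cy, radius=15):
    patch = gray[cy - radius:cy + radius + 1,
                 cx - radius:cx + radius + 1].astype(np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    m01 = float((ys * patch).sum())   # first-order image moment in y
    m10 = float((xs * patch).sum())   # first-order image moment in x
    return np.arctan2(m01, m10)       # orientation theta = atan2(m01, m10)
```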
Step 2: in parallel, feeding the original RGB image into the trained semantic segmentation network to segment the surgical tool in the image, using semantic information to distinguish the surgical tool region from the background region.
Further, a U-net neural network is selected as the segmentation network of the system. The U-net architecture comprises a contraction path that captures context and a symmetric expansion path that enables precise localization. The contraction path alternates convolution and pooling operations, progressively downsampling the feature maps while increasing their number level by level. Each step in the expansion path consists of upsampling of the feature map followed by convolution, raising the output resolution; skip connections then combine low-level feature maps with high-resolution features, which helps recover object detail, achieves pixel-level localization, and performs well on segmentation tasks with limited data. The network model uses the Jaccard index as an evaluation metric; it can be interpreted as a similarity measure between finite sample sets and is defined by

$$J=\frac{1}{n}\sum_{i=1}^{n}\frac{y_i\,\hat{y}_i}{y_i+\hat{y}_i-y_i\,\hat{y}_i}$$

where $y_i$ is the binary class label of pixel i and $\hat{y}_i$ is the pixel probability predicted by the model. Combined with the classification (binary cross-entropy) loss H, the final expression of the generalized loss function is

$$L=H-\log J$$
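A sketch of this generalized loss in PyTorch (with a small smoothing constant added for numerical stability; the constant is an assumption, not from the patent):

```python
# Generalized loss L = H - log J: binary cross-entropy H plus the negative
# log of a soft Jaccard index J computed from predicted pixel probabilities.
import torch
import torch.nn.functional as F

def jaccard_bce_loss(pred_prob, target, eps=1e-7):
    """pred_prob: sigmoid outputs in [0, 1]; target: float binary mask."""
    bce = F.binary_cross_entropy(pred_prob, target)
    inter = (pred_prob * target).sum()
    union = pred_prob.sum() + target.sum() - inter
    jaccard = (inter + eps) / (union + eps)
    return bce - torch.log(jaccard)
```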
As shown in FIG. 2, image frames containing a surgical tool are semantically segmented by the U-net network, so that the surgical tool appearing in the image is segmented and the corresponding pixel-wise binary mask is computed.
Step 3: checking the preliminarily extracted feature point sequence against the binary mask produced by the segmentation network, and rejecting any preselected feature point that falls within the mask, thereby eliminating erroneous feature points detected on the dynamic surgical tool.
As shown in FIG. 3, this embodiment introduces the U-net semantic segmentation network into the SLAM system to preprocess the lumen images, and removes the dynamic feature points on the surgical tool based on the extracted ORB feature points and the semantic segmentation result. Specifically:
When a moving surgical tool appears in the inner cavity image sequence, the feature point sequence is screened in the SLAM tracking thread using the preselected feature points and the surgical tool mask information provided by the segmentation network. The binary semantic information separating background and surgical tool in the inner cavity scene distinguishes static from dynamic features in the image; dynamic feature points inside the mask area are deleted from the feature point sequence as outliers and take no part in pose estimation or mapping. The RANSAC algorithm is then applied to the static feature points to detect and delete the remaining mismatches, which improves the robustness of the SLAM system and makes the endoscope pose estimation more stable and the mapping more accurate (see the combined sketch below).
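Chaining the earlier sketches gives an end-to-end view of this per-frame preprocessing (illustrative only: unet is a trained instance of the UNet sketch above, extract_orb_features and filter_dynamic_keypoints are the assumed helpers defined earlier, and the frame size is assumed divisible by 16):

```python
# Per-frame preprocessing: segment the tool, extract ORB features, and drop
# the dynamic feature points that fall inside the predicted tool mask.
import numpy as np
import torch

def preprocess_frame(frame_bgr, unet):
    with torch.no_grad():
        rgb = frame_bgr[:, :, ::-1].copy()            # BGR -> RGB for the net
        x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        tool_mask = (unet(x)[0, 0] > 0.5).numpy().astype(np.uint8)
    keypoints, descriptors = extract_orb_features(frame_bgr)
    # Only static background keypoints survive for pose estimation and mapping.
    return filter_dynamic_keypoints(keypoints, descriptors, tool_mask)
```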
Step 4: executing the subsequent ORB-SLAM2 modules in sequence on the processed ORB features: estimating the camera pose from the matching correspondences of adjacent frames via bundle adjustment that minimizes the reprojection error, and determining keyframes; computing map points by triangulation in the local mapping thread, reconstructing three-dimensional points of the lumen environment, and locally optimizing the camera poses and map points; and finally performing loop detection and global optimization.
The method eliminates feature points on the moving surgical tool that would otherwise adversely affect the lumen SLAM reconstruction, improving the robustness and accuracy of the system. The above embodiments merely illustrate the core idea of the present invention and do not limit it; modifications or substitutions of individual techniques in the foregoing technical solutions may be made without departing from the spirit and scope of the present invention.
Claims (6)
1. An intracavity vision SLAM method based on semantic segmentation, characterized by comprising the following steps:
Step 1: shooting the inner cavity environment with a monocular endoscope to acquire an inner cavity image sequence, then feeding the sequence into the SLAM system for frame-by-frame feature extraction and descriptor matching;
Step 2: performing semantic segmentation on the image data with a convolutional neural network, detecting the dynamic surgical tools appearing in the segmented image, and computing the corresponding binary mask;
Step 3: checking the preliminarily extracted feature point sequence against the binary mask produced by the segmentation network, and rejecting any preselected feature point that falls within the mask, thereby eliminating erroneous feature points detected on the dynamic surgical tool;
Step 4: continuing tracking with the static feature points of the remaining background areas, using them for subsequent pose estimation and environment map construction, thereby realizing a dynamic visual SLAM method for inner cavity scenes.
2. The semantic segmentation based lumen vision SLAM method of claim 1, wherein in step 1: the lumen monocular vision SLAM method takes the open-source feature-based visual SLAM system ORB-SLAM2 as its basic framework; when the endoscope camera acquires RGB images of the lumen, they are passed to the SLAM system, and the ORB feature points of each frame are extracted and matched by the feature extraction algorithm in the tracking thread.
3. The semantic segmentation based lumen vision SLAM method of claim 1, wherein in step 2: the original RGB image is input into the trained semantic segmentation network, and the surgical tool moving in the cavity scene is segmented using semantic information and a convolutional neural network, wherein the U-net neural network is introduced into the ORB-SLAM2 framework to realize semantic segmentation of the surgical tool, thereby obtaining pixel-wise binary mask information of the surgical tool.
4. The semantic segmentation based lumen vision SLAM method of claim 1, wherein in step 3: a decision module that distinguishes dynamic feature points using semantic information is added to the SLAM system; the surgical tool is segmented, and the dynamic features are identified and deleted as outliers.
5. The semantic segmentation based lumen vision SLAM method of claim 4, wherein: the preliminarily extracted feature point sequence is checked against the preliminarily extracted ORB feature points and the binary mask produced by the semantic segmentation network, and any preselected feature point within the mask is rejected to eliminate erroneous feature points detected on the dynamic surgical tool; outliers are then further rejected with the RANSAC algorithm, and the most reliable feature point pairs are selected.
6. The semantic segmentation based lumen vision SLAM method of claim 1, wherein in step 4: after the dynamic feature points on the surgical tool have been removed, the correct correspondences of the reliable static feature points in the remaining areas are used to compute the endoscope pose and build the map of the lumen scene.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202111548927.5A | 2021-12-17 | 2021-12-17 | Inner cavity vision SLAM method based on semantic segmentation |

Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN114463334A | 2022-05-10 |

Family: ID=81405388

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202111548927.5A (pending) | Inner cavity vision SLAM method based on semantic segmentation | 2021-12-17 | 2021-12-17 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN114463334A (en) |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |