CN113469985A - Method for extracting characteristic points of endoscope image - Google Patents

Method for extracting characteristic points of endoscope image

Info

Publication number
CN113469985A
Authority
CN
China
Prior art keywords
image
feature points
neural network
feature
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110788779.8A
Other languages
Chinese (zh)
Inventor
熊璟
徐玉伟
夏泽洋
谢高生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110788779.8A priority Critical patent/CN113469985A/en
Publication of CN113469985A publication Critical patent/CN113469985A/en
Priority to PCT/CN2021/138116 priority patent/WO2023284246A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10068 Endoscopic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a method for extracting feature points from endoscope images and belongs to the field of image processing. For two images that share an overlapping area, an optical flow method is used to find the overlapping area, and the pixel matching relation obtained by the optical flow method serves as the reference for a convolutional neural network. The first image and the second image are then input into the convolutional neural network separately, which detects the feature points in each image together with the descriptors corresponding to those feature points. In the training stage, the finally trained model is obtained according to whether the loss function has converged or the set number of training iterations has been reached. The invention uses a convolutional neural network to extract and describe feature points of images acquired by a digestive endoscope in a learning-based manner, so as to solve the problem that traditional feature extraction methods cannot complete the pose calculation of the digestive endoscope lens because too few feature points are extracted from images of the digestive tract.

Description

Method for extracting characteristic points of endoscope image
Technical Field
The invention belongs to the field of image processing, and particularly relates to an endoscope image feature point extraction method.
Background
Digestive tract endoscopy is the most important screening means in minimally invasive surgery and in the diagnosis of digestive tract lesions. Localization of the digestive endoscope allows doctors to track the endoscope, and extraction and matching of feature points in digestive endoscopy images make it possible to calculate the pose of the endoscope lens, thereby enabling localization and navigation of the digestive endoscope.
Feature point extraction and matching for digestive endoscopy images is currently done mainly in the following ways:
for digestive endoscopy images with rich texture and good imaging quality, algorithms such as SIFT (scale-invariant feature transform) are applied directly to detect and describe the feature points in the images;
for digestive endoscopy images with weak texture, additional processing of the inner wall of the digestive tract is required. For example, laser spots are projected onto the inner wall of the digestive tract and an algorithm such as SIFT is then applied to detect the spots in the image and describe the detected feature points, or a suitable staining agent is injected into blood vessels on the surface of the digestive tract to enrich its texture before an algorithm such as SIFT is applied to detect and describe the feature points.
For scenes in which algorithms such as SIFT can be applied directly to extract and describe feature points, the computational complexity of SIFT is high and it cannot be used in scenarios with strict real-time requirements;
for digestive endoscopy images with weak texture, additional processing of the inner wall of the digestive tract or of the endoscope itself is required (for example, adding a light-source channel for emitting laser to the endoscope), which increases time and cost.
Disclosure of Invention
The aim of the invention is to use a convolutional neural network to extract and describe feature points of images acquired by a digestive endoscope in a learning-based manner, so as to solve the problem that traditional feature extraction methods cannot complete the pose calculation of the digestive endoscope lens because too few feature points are extracted from images of the digestive tract.
To solve the problems in the prior art, the invention provides a method for extracting feature points from endoscope images, which comprises the following steps:
for two images with an overlapping area, finding the overlapping area of the two images using an optical flow method, and taking the pixel matching relation obtained by the optical flow method as the reference for a convolutional neural network;
inputting the first image and the second image into the convolutional neural network separately, thereby detecting the feature points in the first image and the second image and the descriptors corresponding to those feature points;
and, in the training stage, obtaining the finally trained model according to whether the loss function has converged or the set number of training iterations has been reached.
Preferably, for the two images with an overlapping area, finding the overlapping area using the optical flow method specifically comprises finding, in the second image, the position of the image of each scene point that also appears in the first image.
Preferably, when the first image and the second image are input into the convolutional neural network to detect the feature points and their corresponding descriptors, the branch that outputs the feature points comprises two feature maps: one feature map relates to the reliability of the feature points and the other relates to the repeatability of the feature points.
Preferably, the convolutional neural network comprises 9 sequentially connected layers, and convolutional layers with 1×1 kernels, used to reduce the parameters of the network and to adjust the number of channels freely, are provided at specific layers.
Preferably, in the training stage, obtaining the finally trained model according to whether the loss function has converged or the set number of training iterations has been reached specifically comprises judging whether the matching relation between the feature points and descriptors learned by the convolutional neural network is consistent with the matching relation given by the optical flow method, and updating the parameters of the convolutional neural network if it is not.
Preferably, inputting the first image and the second image into the convolutional neural network separately, thereby detecting the feature points in the first image and the second image and the descriptors corresponding to the feature points, specifically comprises providing corresponding loss functions for extracting the feature points and learning the descriptors.
Preferably, the corresponding loss functions comprise a similarity loss function.
Preferably, the corresponding loss functions comprise a loss function for learning repeatability.
Preferably, the corresponding loss functions comprise a loss function for learning the reliability of the detected feature points.
Compared with the prior art, the method for extracting feature points from endoscope images has the following beneficial effects:
in view of the characteristics of digestive endoscopy images and in order to avoid the above drawbacks, the invention uses a convolutional neural network in a learning-based manner to extract a sufficient number of feature points from digestive endoscopy images and to describe the extracted feature points, so as to serve subsequent feature point matching and thereby enable calculation of the pose of the digestive endoscope lens.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of an endoscopic image feature point extraction method according to an embodiment of the present invention.
Fig. 2 is a training flowchart provided in an embodiment of the present invention.
Fig. 3 is a schematic diagram of a network structure according to an embodiment of the present invention.
Fig. 4(A) and (B) show the results of feature point extraction by the method of the present invention on gastric phantom images acquired by a gastroscope, and (C) and (D) show the results of feature point extraction by the SIFT method on the same two pictures.
Fig. 5(A) and (B) show the results of feature point extraction by the method of the present invention on gastric phantom images acquired by a gastroscope, and (C) and (D) show the matching results between the two images after the extracted feature points are described.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a method that uses a convolutional neural network to extract feature points from digestive endoscopy images and to describe the extracted feature points. As shown in fig. 1, the method comprises:
S1, for two images with an overlapping area, finding the overlapping area of the two images using an optical flow method, and taking the pixel matching relation obtained by the optical flow method as the reference for the convolutional neural network;
S2, inputting the first image and the second image into the convolutional neural network separately, thereby detecting the feature points in the first image and the second image and the descriptors corresponding to those feature points;
S3, in the training stage, obtaining the finally trained model according to whether the loss function has converged or the set number of training iterations has been reached.
As shown in fig. 2, the specific process of the network training stage is as follows. For two images with an overlapping area (that is, both images contain an image of the same scene), an optical flow method is first used to find the overlapping area, i.e., the position in the second image of each point that images the same scene point as in the first image. The matched pixels found by the optical flow method are used as the reference for the convolutional neural network (equivalent to using the pixel matches found by the optical flow method to supervise the training of the network). The two images are then input into the network separately and pass through each layer, and the final output consists of two branches. One branch outputs the descriptors of the feature points; since the number of output channels of the network can be controlled, the dimension of the descriptors is adjustable, and in the present invention the final number of channels is 128, so the descriptor of each feature point is a 128-dimensional vector (note that all output descriptors are normalized). The other branch outputs the feature points, and this branch produces two feature maps (each with one channel). One feature map relates to the reliability of the feature points: the value of each pixel in the reliability map lies between 0 and 1, and the larger the value, the more likely that point is a feature point and the more reliable its corresponding descriptor. The other feature map relates to the repeatability of the feature points: the value of each pixel also lies between 0 and 1, and the larger the value, the higher the probability that the pixel also appears in the other image, i.e., the more likely a matching point can be found for that position in the other image. The loss functions are defined based on the matches given by the optical flow method: two images taken of the same scene should, in theory, image each scene point identically, and the optical flow method can track where each pixel of the first image appears in the second image (pixels that move out of the image are discarded), so the matching relation of every pixel in the overlapping area of the two images can be determined by the optical flow method. If the matching relation of the feature points and their descriptors learned by the network is inconsistent with the matching relation given by the optical flow method, the parameters of the network must be updated and the loss function will not converge; in the training stage, the finally trained model is obtained only when the loss function converges or the set maximum number of training iterations is reached.
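As a concrete illustration of the supervision described above (step S1), the dense pixel correspondences between the two images can be produced with a standard optical flow routine. The sketch below uses OpenCV's Farneback optical flow; the OpenCV and NumPy calls are real, but the parameter values and the helper name flow_correspondences are illustrative assumptions rather than part of the patent.

```python
import cv2
import numpy as np

def flow_correspondences(img1_gray, img2_gray):
    """Dense optical flow from image 1 to image 2; returns matched pixel pairs.

    A minimal sketch: every pixel of image 1 is mapped to a sub-pixel
    position in image 2, and pixels whose target falls outside the second
    image are discarded, as described in the training procedure above.
    """
    flow = cv2.calcOpticalFlowFarneback(
        img1_gray, img2_gray, None,
        pyr_scale=0.5, levels=4, winsize=21,
        iterations=3, poly_n=5, poly_sigma=1.1, flags=0)

    h, w = img1_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs2 = xs + flow[..., 0]          # x position of each pixel in image 2
    ys2 = ys + flow[..., 1]          # y position of each pixel in image 2

    valid = (xs2 >= 0) & (xs2 < w) & (ys2 >= 0) & (ys2 < h)
    src = np.stack([xs[valid], ys[valid]], axis=1).astype(np.float32)
    dst = np.stack([xs2[valid], ys2[valid]], axis=1).astype(np.float32)
    return src, dst   # pixel (x, y) in image 1 and its match in image 2
```

The returned (src, dst) pixel pairs play the role of the reference matches that supervise the network during training.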
As shown in fig. 3, the convolutional neural network consists of 9 convolutional layers, which from front to back are: 2 convolutional layers with kernel size 3 and stride 1; 1 convolutional layer with kernel size 3 and stride 2; 1 convolutional layer with kernel size 1 and stride 1; 1 convolutional layer with kernel size 3 and stride 1; 1 convolutional layer with kernel size 1 and stride 1; 1 convolutional layer with kernel size 2 and stride 2; 1 convolutional layer with kernel size 1 and stride 1; and 1 convolutional layer with kernel size 2 and stride 2. The numbers of channels from front to back are 32, 32, 64, 128, 64 and 128. A convolutional layer with kernel size 1 can effectively reduce the parameters of the network and freely adjust the number of channels, and at the same time it fuses the information of each pixel along the channel dimension, thereby highlighting the characteristics of each pixel. The input of the network is a 3-channel picture, and the output of the network has two branches: one branch gives the feature points detected in the image, and the other gives the descriptor corresponding to each detected feature point, each descriptor being a 128-dimensional vector.
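A minimal PyTorch sketch of a backbone with this layer layout is shown below. The kernel sizes and strides follow the description above, and the descriptors are L2-normalized 128-dimensional vectors as stated; the exact per-layer channel widths, the ReLU activations, the padding choices and the three 1×1 output heads are assumptions made for illustration, since the text does not fully specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EndoFeatureNet(nn.Module):
    """Sketch of the 9-layer backbone with descriptor and detection branches."""

    def __init__(self):
        super().__init__()
        # (kernel size, stride, out channels); channel widths are assumed.
        cfg = [(3, 1, 32), (3, 1, 32), (3, 2, 64),
               (1, 1, 16), (3, 1, 128), (1, 1, 64),
               (2, 2, 128), (1, 1, 64), (2, 2, 128)]
        layers, in_ch = [], 3
        for k, s, out_ch in cfg:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s,
                                 padding=k // 2 if k == 3 else 0),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.backbone = nn.Sequential(*layers)
        self.desc_head = nn.Conv2d(in_ch, 128, kernel_size=1)  # descriptors
        self.rep_head = nn.Conv2d(in_ch, 1, kernel_size=1)     # repeatability
        self.rel_head = nn.Conv2d(in_ch, 1, kernel_size=1)     # reliability

    def forward(self, x):
        feat = self.backbone(x)
        desc = F.normalize(self.desc_head(feat), p=2, dim=1)   # 128-d, unit norm
        repeatability = torch.sigmoid(self.rep_head(feat))     # values in (0, 1)
        reliability = torch.sigmoid(self.rel_head(feat))       # values in (0, 1)
        return desc, repeatability, reliability
```

Passing a 3-channel frame of shape (1, 3, H, W) through the network yields the descriptor volume and the two single-channel maps from which feature points can be selected, for example by keeping local maxima of the repeatability map.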
The method for extracting feature points from endoscope images provided by the invention makes flexible use of convolutional layers with 1×1 kernels, which greatly reduces the number of network parameters while yielding good results. The network has 9 layers in total, and the convolutional layers with 1×1 kernels are the 4th, 6th and 8th layers, i.e., from the third layer onward one 1×1 convolutional layer is inserted after every two layers. The 1×1 convolutional layers serve three purposes: 1) the number of channels can be adjusted flexibly (for example, the network of the invention goes from 64 channels at layer 3 to 16 channels at layer 4, and from 128 channels at layer 7 to 64 channels at layer 8); 2) the parameters of the network are greatly reduced (in the present invention, trained on 4873 gastroscope images with a resolution of 1058×900, replacing all 1×1 convolutional layers with 3×3 convolutional layers would make the network more than 3 times larger than before the replacement); 3) because a 1×1 convolutional layer fuses the information of each pixel along the channel dimension (i.e., the pixel at each position is not mixed with pixels at other positions, but only with the values at the same position across the channels of the feature map), it preserves the detail information at each position and better highlights the characteristics of the pixel at each position. Based on this network, the task of extracting and describing feature points in digestive endoscopy images (even weak-texture digestive endoscopy images) can be completed smoothly.
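As a quick sanity check of point 2), the parameter count of a convolution scales with the kernel area, so replacing a 1×1 kernel by a 3×3 kernel multiplies that layer's weights by 9. The snippet below uses assumed channel counts (a 64-to-16 reduction layer like the one mentioned above) purely for illustration.

```python
def conv_params(in_ch, out_ch, k):
    """Weights + biases of a 2-D convolution with a k x k kernel."""
    return in_ch * out_ch * k * k + out_ch

# Assumed example: a channel-reduction layer like the ones described above.
small = conv_params(64, 16, 1)   # 1x1 kernel -> 1,040 parameters
large = conv_params(64, 16, 3)   # 3x3 kernel -> 9,232 parameters
print(small, large, large / small)  # the 3x3 version is roughly 9x heavier
```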
With the method for extracting feature points from endoscope images provided by the invention: traditional methods (such as SIFT) cannot extract enough feature points from weak-texture digestive endoscopy images, whereas the convolutional-neural-network-based method can; traditional methods (such as SIFT) have high computational complexity and cannot be applied where real-time performance is required, whereas once the model has been trained the convolutional-neural-network-based method can extract and match image feature points in real time; and traditional methods (such as SIFT) rely on the experience of researchers, i.e., only experienced researchers can devise methods for extracting and describing image feature points, whereas the convolutional-neural-network-based method does not require such experience, because it is a data-driven, learning-based method, and good results can be obtained with a reasonably designed network and loss function as long as the training set is large enough.
There are two loss functions: one is defined for the repeatability of the feature points, and the other evaluates the reliability of each feature point.
Learning the repeatability of feature points:
As shown in fig. 2, for two images with an overlapping area (denoted the first image and the second image respectively), an optical flow method is first used to find their overlapping area as the reference for image matching (i.e., if a pixel in the first image and a pixel in the second image are images of the same scene point, the two pixels should correspond, and if both pixels are detected as feature points by the above convolutional neural network, their descriptors should match, meaning that the Euclidean distance between the two descriptors should be the smallest among all descriptor distances). The first image and the second image are then input into the convolutional neural network separately, which detects the feature points in each image and their corresponding descriptors; the feature points of the two images are then matched according to the learned descriptors and compared with the matches computed by the optical flow method in order to adjust the parameters of the network. For each image block in the first image, its similarity to the corresponding image block in the second image is computed; the higher the similarity, the more likely the feature point images the same scene point in both images, and the loss function is defined so as to maximize the similarity between the image blocks.
The similarity loss function is defined as:

L_sim(I_1, I_2, H) = 1 - (1/|P|) Σ_{p∈P} sim( R_1[p], R'_H[p] )     (1)

where sim denotes the pixel-by-pixel similarity of two image blocks, P denotes the set of image blocks, p denotes an image block in that set, R_1 is the feature point repeatability map of the first image (see fig. 2), and R'_H is the feature point repeatability map of the second image warped by the matching H (for example, obtained by the optical flow method). To prevent the similarity loss L_sim from being minimized trivially by R_1 and R'_H becoming constant (i.e., taking a constant value everywhere), a regularization term is added for the repeatability map of each image, defined as:

L_peaky(I) = 1 - (1/|P|) Σ_{p∈P} ( max_{(i,j)∈p} R_ij - mean_{(i,j)∈p} R_ij )     (2)

where (i, j) ∈ p means that (i, j) is a pixel position in image block p, R_ij denotes one pixel of the repeatability map R, max denotes the maximum over the pixels in image block p, mean denotes the mean over the pixels in image block p, P denotes the set of all image blocks, |P| denotes the number of image blocks in the set, and the subscript "peaky" is only a label used to distinguish this loss from L_sim and from L_rep of equation (3).

Finally, the loss function for learning repeatability is defined as a weighted sum of equations (1) and (2):

L_rep = L_sim(I_1, I_2, H) + α ( L_peaky(I_1) + L_peaky(I_2) )     (3)

where α is a hyper-parameter, H represents the image matching (e.g., the matching obtained by the optical flow method), I_1 and I_2 denote the two input images, and the subscript "rep", like "peaky", is only a label used to distinguish the different loss functions.
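A compact PyTorch sketch of equations (1)-(3) is given below. It assumes that the patch-wise similarity sim is the cosine similarity between flattened patches of the repeatability maps (in line with the R2D2 reference cited later), and that rep2_warped is the repeatability map of the second image already warped by the matching H; the patch size and the unfold-based patch extraction are illustrative choices.

```python
import torch
import torch.nn.functional as F

def patches(x, n=16):
    """Split a (B, 1, H, W) map into flattened n x n patches: (B, N, n*n)."""
    return F.unfold(x, kernel_size=n, stride=n).transpose(1, 2)

def repeatability_loss(rep1, rep2_warped, alpha=0.5, n=16):
    """Sketch of L_rep = L_sim + alpha * (L_peaky(I1) + L_peaky(I2))."""
    p1, p2 = patches(rep1, n), patches(rep2_warped, n)
    # Equation (1): 1 - mean cosine similarity over corresponding patches.
    l_sim = 1.0 - F.cosine_similarity(p1, p2, dim=2).mean()
    # Equation (2): peakiness term, computed on each repeatability map.
    def l_peaky(p):
        return 1.0 - (p.max(dim=2).values - p.mean(dim=2)).mean()
    # Equation (3): weighted sum.
    return l_sim + alpha * (l_peaky(p1) + l_peaky(p2))
```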
Learning the reliability of detected feature points:
In order to emphasize the reliability of the detected feature points, the network not only computes the repeatability of the feature points but also predicts, for the descriptor corresponding to each feature point it generates, a confidence value R between 0 and 1: the larger R is, the more reliable the corresponding feature point and its descriptor are. The goal is to make the network select highly distinguishable feature descriptors, whose confidence values are large, while descriptors with low distinguishability, such as those generated in regions that cannot be distinguished well, receive low confidence values and are only weakly affected by the loss function. The descriptor matching problem can be viewed as a ranking optimization problem: for example, for two images (the first image and the second image), the Euclidean distances between a descriptor from the first image and all descriptors from the second image are computed and then sorted by magnitude, and corresponding loss functions are defined for extracting the feature points and learning the descriptors. In the present invention, a set of matched image blocks is first given and their descriptors are computed separately by the convolutional neural network. The Euclidean distances between all descriptors are then computed to form a Euclidean distance matrix, each row of which can be regarded as the distances between one descriptor from the first image and all descriptors from the second image. After the Euclidean distance matrix of the descriptors is obtained, all the descriptors in the image block set are optimized as follows: for each descriptor (say, from the first image), the image block set is divided by Euclidean distance into two sets, a matching set (assumed to contain K image blocks) and a non-matching set; an average matching value is then computed over the whole image block set, and the objective is to maximize this average matching value, i.e., the objective function converges and attains its optimum when the average matching value among matched image blocks exceeds that among non-matched image blocks.
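The Euclidean distance matrix and the per-descriptor ranking described above can be sketched as follows; torch.cdist and argsort are standard PyTorch operations, while the function name and the returned nearest-neighbour index are illustrative additions.

```python
import torch

def rank_descriptors(desc1, desc2):
    """desc1: (N, 128) descriptors from image 1; desc2: (M, 128) from image 2.

    Returns, for every descriptor of image 1, the indices of image-2
    descriptors sorted by ascending Euclidean distance (one row of the
    distance matrix per descriptor), plus the nearest-neighbour candidate.
    """
    dist = torch.cdist(desc1, desc2, p=2)   # (N, M) Euclidean distance matrix
    order = dist.argsort(dim=1)             # ascending ranking per row
    nearest = order[:, 0]                   # candidate match for each descriptor
    return dist, order, nearest
```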
The loss function for learning the reliability of the detected feature points is defined as:

Prec@K(d) = (1/K) Σ_{k=1..K} 1[ x_k ∈ B_d^+ ]     (4)

AP(d) = (1/|B_d^+|) Σ_{K: x_K ∈ B_d^+} Prec@K(d)     (5)

L_rel = (1/|B|) Σ_{d∈B} ( 1 - AP(d)·R_d - κ(1 - R_d) )     (6)

where B denotes the set of feature point descriptors, d denotes a descriptor in that set, B_d^+ denotes the set of descriptors that match descriptor d, and correspondingly B_d^- denotes the set of descriptors that do not match descriptor d; 1[·] denotes the indicator function; (x_1, x_2, ..., x_n) denotes the Euclidean distances between a given descriptor d and all descriptors from the other image, sorted in ascending order, i.e., one row of the Euclidean distance matrix; Prec@K is the precision among the K descriptors ranked closest to descriptor d; AP is the average of the precision values evaluated at the matching positions; R_d is the confidence value between 0 and 1 predicted by the network for descriptor d; and κ is a hyper-parameter.
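A simplified, non-differentiable sketch of equations (4)-(6) is given below to make the ranking-based computation concrete. A real training implementation would need a differentiable approximation of AP, and the weighting by the predicted confidence R with the hyper-parameter κ follows the form suggested by the cited R2D2 reference; the function names and tensor shapes are assumptions.

```python
import torch

def average_precision(dists, positives):
    """Equations (4)-(5): AP of one descriptor from its ranked distances.

    dists:     (M,) Euclidean distances to all descriptors of the other image.
    positives: (M,) boolean mask of the descriptors that truly match.
    """
    order = dists.argsort()
    hits = positives[order].float()
    prec_at_k = hits.cumsum(0) / torch.arange(1, len(hits) + 1, dtype=torch.float)
    return (prec_at_k * hits).sum() / hits.sum().clamp(min=1)

def reliability_loss(dist_matrix, match_mask, confidence, kappa=0.5):
    """Equation (6): mean of 1 - [AP(d) * R_d + kappa * (1 - R_d)]."""
    aps = torch.stack([average_precision(dist_matrix[i], match_mask[i])
                       for i in range(dist_matrix.shape[0])])
    return (1.0 - (aps * confidence + kappa * (1.0 - confidence))).mean()
```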
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to provide alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (9)

1. An endoscope image feature point extraction method, characterized by comprising the following steps:
for two images with an overlapping area, finding the overlapping area of the two images using an optical flow method, and taking the pixel matching relation obtained by the optical flow method as the reference for a convolutional neural network;
inputting the first image and the second image into the convolutional neural network separately, thereby detecting the feature points in the first image and the second image and the descriptors corresponding to the feature points;
and, in the training stage, obtaining the finally trained model according to whether the loss function has converged or the set number of training iterations has been reached.
2. The endoscope image feature point extraction method according to claim 1, wherein finding the overlapping area of the two images with an overlapping area by using an optical flow method specifically comprises finding, in the second image, the position of the image of each scene point that also appears in the first image.
3. The endoscope image feature point extraction method according to claim 1, wherein the first image and the second image are input into the convolutional neural network separately so as to detect the feature points in the first image and the second image and the descriptors corresponding to the feature points, and the branch that outputs the feature points comprises two feature maps, one feature map relating to the reliability of the feature points and the other relating to the repeatability of the feature points.
4. The endoscope image feature point extraction method according to claim 1, wherein the convolutional neural network comprises 9 sequentially connected layers, and convolutional layers whose kernels serve to reduce the parameters of the convolutional neural network and to adjust the number of channels freely are provided at specific layers.
5. The endoscope image feature point extraction method according to claim 1, wherein, in the training stage, obtaining the finally trained model according to whether the loss function has converged or the set number of training iterations has been reached specifically comprises judging whether the matching relation between the feature points learned by the convolutional neural network and their descriptors is consistent with the matching relation given by the optical flow method, and updating the parameters of the convolutional neural network if it is not.
6. The endoscope image feature point extraction method according to claim 1, wherein inputting the first image and the second image into the convolutional neural network separately, thereby detecting the feature points in the first image and the second image and the descriptors corresponding to the feature points, specifically comprises providing corresponding loss functions for extracting the feature points and learning the descriptors.
7. The endoscope image feature point extraction method according to claim 6, wherein the corresponding loss functions comprise a similarity loss function.
8. The endoscope image feature point extraction method according to claim 6, wherein the corresponding loss functions comprise a loss function for learning repeatability.
9. The endoscope image feature point extraction method according to claim 6, wherein the corresponding loss functions comprise a loss function for learning the reliability of the detected feature points.
CN202110788779.8A 2021-07-13 2021-07-13 Method for extracting characteristic points of endoscope image Pending CN113469985A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110788779.8A CN113469985A (en) 2021-07-13 2021-07-13 Method for extracting characteristic points of endoscope image
PCT/CN2021/138116 WO2023284246A1 (en) 2021-07-13 2021-12-14 Endoscopic image feature point extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110788779.8A CN113469985A (en) 2021-07-13 2021-07-13 Method for extracting characteristic points of endoscope image

Publications (1)

Publication Number Publication Date
CN113469985A (en) 2021-10-01

Family

ID=77880035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110788779.8A Pending CN113469985A (en) 2021-07-13 2021-07-13 Method for extracting characteristic points of endoscope image

Country Status (2)

Country Link
CN (1) CN113469985A (en)
WO (1) WO2023284246A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740488B (en) * 2023-05-16 2024-01-05 北京交通大学 Training method and device for feature extraction model for visual positioning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469985A (en) * 2021-07-13 2021-10-01 中国科学院深圳先进技术研究院 Method for extracting characteristic points of endoscope image

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059605A (en) * 2019-04-10 2019-07-26 厦门美图之家科技有限公司 A kind of neural network training method calculates equipment and storage medium
CN110111366A (en) * 2019-05-06 2019-08-09 北京理工大学 A kind of end-to-end light stream estimation method based on multistage loss amount
CN110728706A (en) * 2019-09-30 2020-01-24 西安电子科技大学 SAR image fine registration method based on deep learning
CN111583340A (en) * 2020-04-28 2020-08-25 西安交通大学 Method for reducing monocular camera pose estimation error rate based on convolutional neural network
CN111915484A (en) * 2020-07-06 2020-11-10 天津大学 Reference image guiding super-resolution method based on dense matching and self-adaptive fusion
CN111915573A (en) * 2020-07-14 2020-11-10 武汉楚精灵医疗科技有限公司 Digestive endoscopy focus tracking method based on time sequence feature learning
CN111814811A (en) * 2020-08-14 2020-10-23 Oppo广东移动通信有限公司 Image information extraction method, training method and device, medium and electronic equipment
CN113076914A (en) * 2021-04-16 2021-07-06 咪咕文化科技有限公司 Image processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REVAUD, JEROME et al.: "R2D2: Repeatable and Reliable Detector and Descriptor", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023284246A1 (en) * 2021-07-13 2023-01-19 中国科学院深圳先进技术研究院 Endoscopic image feature point extraction method
CN113744319A (en) * 2021-11-05 2021-12-03 广州思德医疗科技有限公司 Capsule gastroscope trajectory tracking method and device

Also Published As

Publication number Publication date
WO2023284246A1 (en) 2023-01-19

Similar Documents

Publication Publication Date Title
US10860930B2 (en) Learning method, image recognition device, and computer-readable storage medium
CN110599448B (en) Migratory learning lung lesion tissue detection system based on MaskScoring R-CNN network
Zheng et al. Pedestrian alignment network for large-scale person re-identification
CN113469985A (en) Method for extracting characteristic points of endoscope image
Guo et al. Giana polyp segmentation with fully convolutional dilation neural networks
US10360474B2 (en) Image processing device, endoscope system, and image processing method
US20220157047A1 (en) Feature Point Detection
CN109635871B (en) Capsule endoscope image classification method based on multi-feature fusion
CN111080639A (en) Multi-scene digestive tract endoscope image identification method and system based on artificial intelligence
CN112466466B (en) Digestive tract auxiliary detection method and device based on deep learning and computing equipment
Liu et al. A source-free domain adaptive polyp detection framework with style diversification flow
CN114581375A (en) Method, device and storage medium for automatically detecting focus of wireless capsule endoscope
CN114266786A (en) Gastric lesion segmentation method and system based on generation countermeasure network
CN110555340A (en) neural network computing method and system and corresponding dual neural network implementation
CN114332166A (en) Visible light infrared target tracking method and device based on modal competition cooperative network
He et al. Exploring reliable visual tracking via target embedding network
Zhang et al. Artifact detection in endoscopic video with deep convolutional neural networks
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application
CN112053399B (en) Method for positioning digestive tract organs in capsule endoscope video
Liu et al. Geometrized Transformer for Self-Supervised Homography Estimation
CN113160050A (en) Small target identification method and system based on space-time neural network
CN116740475B (en) Digestive tract image recognition method and system based on state classification
Elsaeidy et al. Infrared-to-optical image translation for keypoint-based image registration
Zhao et al. Automatic recognition and tracking of liver blood vessels in ultrasound image using deep neural networks
CN117058139B (en) Lower digestive tract focus tracking and key focus selecting method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination