CN112396593B - Closed loop detection method based on key frame selection and local features - Google Patents

Closed loop detection method based on key frame selection and local features

Info

Publication number
CN112396593B
CN112396593B
Authority
CN
China
Prior art keywords
image
input image
current input
closed
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011360902.8A
Other languages
Chinese (zh)
Other versions
CN112396593A (en)
Inventor
宋海龙
游林辉
胡峰
孙仝
陈政
张谨立
黄达文
王伟光
梁铭聪
黄志就
何彧
陈景尚
谭子毅
尤德柱
区嘉亮
陈宇婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202011360902.8A priority Critical patent/CN112396593B/en
Publication of CN112396593A publication Critical patent/CN112396593A/en
Application granted granted Critical
Publication of CN112396593B publication Critical patent/CN112396593B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20164Salient point detection; Corner detection

Abstract

The invention relates to a closed loop detection method based on key frame selection and local features. Key frames are selected through KLT sparse optical flow tracking, so the motion speed of the mobile robot does not need to be considered, images captured at turns are handled well, and the selected key frames are more representative. At the same time, selecting key frames reduces the amount of computation in the matching process and improves the detection speed of the whole method.

Description

Closed loop detection method based on key frame selection and local features
Technical Field
The invention relates to the field of positioning and navigation based on vision in autonomous inspection of unmanned aerial vehicles, in particular to a closed-loop detection method based on key frame selection and local features.
Background
In the intelligent inspection process of an unmanned aerial vehicle, the unmanned aerial vehicle needs to autonomously determine the operations to be performed according to environmental information. Therefore, autonomous positioning and the sensing and construction of an environment map are key links in autonomous inspection by unmanned aerial vehicles. In recent years, the development of visual SLAM (simultaneous localization and mapping) technology has improved the autonomous localization and mapping capability of mobile robots. Closed-loop detection is an important component of a visual SLAM system; it detects whether the mobile robot has returned to a previously visited place, and plays an extremely important role in reducing the positioning error of the mobile robot and constructing a globally consistent environment map. Closed-loop detection matches the current frame with key frames and judges whether a closed loop is formed according to the degree of matching, so the correct selection of key frames is crucial to closed-loop detection.
The Chinese patent application with publication number CN109902619A, published on June 18, 2019, discloses an image closed-loop detection method and system comprising the following steps: FAST corner points are extracted from each frame image and BRIEF descriptors are calculated; the BRIEF descriptors are mapped through a pre-established bag-of-words model to obtain the corresponding visual words; the visual words are used to establish a vector description of the image; a tracking prediction algorithm is used to judge whether the current image is likely to produce a closed loop and to predict the position where the closed loop is likely to occur, giving a closed-loop candidate set; the similarity between the current image and each image in the closed-loop candidate set is evaluated through the visual word vectors, and the image with the highest similarity in the closed-loop candidate set is taken as the candidate image; the candidate image is normalized to obtain a normalized image; and an ORB global descriptor of the normalized image is calculated to complete the structural check of the candidate image. According to that application, the method can effectively accelerate the detection algorithm and provide more accurate closed-loop detection performance.
That method belongs to closed-loop detection based on a visual bag-of-words model: local feature points and descriptors of the input image are extracted, a BoW vector representation of the input image is obtained by means of a visual dictionary, and whether a closed loop is formed is judged through a tracking prediction algorithm. Closed-loop detection based on the visual bag-of-words model is robust to changes of viewpoint, but has difficulty handling changes of appearance. Moreover, the method lacks key frame selection and only selects candidate images by similarity, so the amount of computation is large and the final detection speed is affected.
Disclosure of Invention
The invention aims to solve the problem of slow detection speed in the prior art, and provides a closed-loop detection method based on key frame selection and local features.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a closed loop detection method based on key frame selection and local features comprises the following steps:
Step one: an input image is acquired by the mobile robot; the first frame of the input image sequence is determined to be a key frame, the Shi-Tomasi corner points of the previous key frame of the current input image are extracted, the corner points are iteratively tracked in the current input image by a sparse optical flow tracking algorithm, and if the number of corner points that cannot be tracked is greater than a threshold, the current input image is determined to be a new key frame;
Step two: global features are extracted from the current input image by a convolutional neural network trained on an image classification dataset, and if the current input image is a key frame, the extracted global features are inserted into the hierarchical navigable small world (HNSW) map of an approximate nearest neighbor retrieval algorithm;
Step three: within the retrieval range of the current input image, the key frame most similar to the current input image is retrieved through HNSW as the closed-loop candidate key frame of the current image, and all images between the closed-loop candidate key frame and the key frame following it are taken as the closed-loop candidate image queue;
Step four: a geometric consistency check is introduced; ORB feature points and the corresponding local difference binary (LDB) descriptors are extracted from the input image and from the retrieved closed-loop candidate images respectively, and the descriptors of the input image are matched against those of the images in the closed-loop candidate image queue;
Step five: the closed-loop candidate image whose LDB descriptors best match those of the current input image is taken as the optimal closed-loop candidate image, and the matched feature points of the two images are input into a random sample consensus algorithm to further eliminate mismatches and solve the fundamental matrix; if the number of inliers between the two images is less than a threshold, the two images do not form a closed loop; if the number of inliers between the two images is greater than the threshold, the two images may form a closed loop;
Step six: a temporal consistency check is introduced; if the 2 consecutive frames of images following the current input image all satisfy the threshold condition of step five, the input image and the closed-loop candidate image are considered to form a closed loop.
Preferably, in the first step, the corner points are iteratively tracked in the current input image by using a sparse optical flow tracking algorithm KLT, specifically:
the previous key frame of the current input image I_i is denoted I_{k-1}; the images I_i and I_{k-1} are converted to grayscale to obtain the images G_i and G_{k-1}; the Shi-Tomasi corner points of the image G_{k-1} are extracted; assuming that the brightness of a pixel point in I_i and I_{k-1} remains constant before and after the motion, the position P(x+dx, y+dy) in the image G_i of a corner point P(x, y) of the image G_{k-1} and the optical flow (dx/dt, dy/dt) are calculated.
The specific calculation steps are as follows. The current input image I_i is converted to grayscale to obtain the image G_i; the grayscale image of the previous key frame of I_i is G_{k-1}, and the Shi-Tomasi corner points of G_{k-1} are extracted. Gaussian pyramid transforms are applied to G_{k-1} and G_i respectively to obtain L layers of images of different resolutions. In layer L_m, assume that a corner point P(x, y) of G_{k-1} moves to the point P(x+dx, y+dy) in the image G_i, taking time dt. Because the brightness of the pixel remains constant before and after the move between the two images:
I(x, y, t) = I(x+dx, y+dy, t+dt)   (1)
where I(x, y, t) represents the brightness of the pixel P(x, y) at time t, and I(x+dx, y+dy, t+dt) represents the brightness at the pixel point P(x+dx, y+dy) in the shifted image G_i. I(x+dx, y+dy, t+dt) can be expanded as a Taylor series:
I(x+dx, y+dy, t+dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt + ε   (2)
where ε is an infinitesimal higher-order term and can be ignored. Equation (1) can therefore be simplified to:
I(x, y, t) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt   (3)
(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt = 0   (4)
Dividing both sides by dt:
(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0   (5)
Let u and v be the velocity components of the optical flow along the X axis and the Y axis respectively, i.e.
u = dx/dt,  v = dy/dt   (6)
In addition, denote
I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t   (7)
Equation (5) can then be written as:
I_x·u + I_y·v + I_t = 0   (8)
Assuming that the pixel points around P(x, y) move by the same distance as P(x, y), a window of size (5, 5) is taken around P(x, y), and for the pixel points in the window:
[I_x1 I_y1; I_x2 I_y2; …; I_x25 I_y25]·[u; v] = -[I_t1; I_t2; …; I_t25]   (9)
The least squares method is used to find the optimal solution of this system of equations so as to minimize the sum of matching errors within the window. Equation (9) can be abbreviated as:
A·d = b   (10)
Multiplying both sides by A^T:
(A^T A)·d = A^T b   (11)
The velocity components u and v of the optical flow along the X axis and the Y axis are then obtained as:
d = [u, v]^T = (A^T A)^(-1) A^T b   (12)
By solving for u and v, the position P(x+dx, y+dy) in the image G_i of the corner point P(x, y) in layer L_m and the optical flow (u, v) can be calculated.
The optical flow value obtained from layer L_m is taken as the initial value for layer L_{m-1}, and the refined optical flow of layer L_{m-1} is calculated, continuing until the optical flow of the original image at the lowest layer L_0 and the tracked corner point P(x+dx, y+dy) are obtained.
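For illustration only, the per-window least-squares step of equations (9)-(12) can be written as the following minimal numerical sketch; the use of numpy.gradient for the spatial derivatives, the window size and the function name are assumptions made for this example and are not prescribed by the method above.

    import numpy as np

    def lk_flow_for_window(prev_win, curr_win):
        """Estimate the optical flow (u, v) of one 5x5 window via equations (9)-(12).

        prev_win, curr_win: 5x5 float arrays of grayscale intensities around the
        same corner point in G_{k-1} and G_i.
        """
        # Spatial gradients I_x, I_y of the previous window and temporal gradient I_t.
        I_y, I_x = np.gradient(prev_win)
        I_t = curr_win - prev_win

        # One row per pixel: A d = b with b = -I_t, as in equation (9).
        A = np.stack([I_x.ravel(), I_y.ravel()], axis=1)   # shape (25, 2)
        b = -I_t.ravel()                                    # shape (25,)

        # Least-squares solution d = (A^T A)^(-1) A^T b, equations (10)-(12).
        d, *_ = np.linalg.lstsq(A, b, rcond=None)
        u, v = d
        return u, v

In practice this per-window solve is repeated from the top pyramid layer L_m down to L_0, with each layer's result used as the initial value for the next.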
Preferably, if the number of corner points that cannot be tracked is greater than the threshold, the current input image is considered to be a new key frame, specifically:
when KLT sparse optical flow tracking is performed from the key frame image G_{k-1} to the current input image G_i, tracking of a corner point is considered to have failed if either of the following occurs:
(1) the corner point P(x, y) falls outside the image range of G_i;
(2) the sum of the matching errors in the neighborhood of the matched corner points is greater than a threshold value.
If the number of corner points that fail to be tracked is greater than the set threshold, the current input image I_i is considered to be a new key frame.
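As a non-authoritative illustration of this key frame rule, the sketch below uses OpenCV's Shi-Tomasi detector and pyramidal KLT tracker; the maximum corner count, tracking window size and failure threshold are assumed example values rather than values fixed by the method.

    import cv2
    import numpy as np

    def is_new_key_frame(keyframe_gray, current_gray, max_corners=300, fail_thresh=150):
        """Return True if the current frame should become a new key frame.

        keyframe_gray, current_gray: grayscale images G_{k-1} and G_i.
        A corner counts as a tracking failure when KLT reports status 0
        (lost or error too large) or the tracked point leaves the image.
        """
        corners = cv2.goodFeaturesToTrack(keyframe_gray, maxCorners=max_corners,
                                          qualityLevel=0.01, minDistance=7)
        if corners is None:
            return True

        tracked, status, err = cv2.calcOpticalFlowPyrLK(
            keyframe_gray, current_gray, corners, None,
            winSize=(21, 21), maxLevel=3)

        h, w = current_gray.shape
        inside = ((tracked[:, 0, 0] >= 0) & (tracked[:, 0, 0] < w) &
                  (tracked[:, 0, 1] >= 0) & (tracked[:, 0, 1] < h))
        failed = np.count_nonzero((status.ravel() == 0) | ~inside)
        return failed > fail_thresh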
Preferably, in the second step, extracting the global features of the current input image with the convolutional neural network trained on an image classification dataset is specifically: the current input image I_i is preprocessed by resizing it to the input size required by the convolutional neural network, and the output of the penultimate fully connected layer of the convolutional neural network is taken as the global feature of the image.
Preferably, in the third step, if the current input image is a key frame, the specific process of inserting the extracted global features into the hierarchical navigable small world (HNSW) map of the approximate nearest neighbor retrieval algorithm is as follows: if the current input image I_i is selected as a key frame, the highest layer number l_max of the feature node of the image I_i in the HNSW structure is randomly assigned by an exponentially decaying probability distribution function, and the feature node is inserted into every layer from l_max down to the bottom layer l_0. In each of these layers, the M nodes nearest to the new feature node are searched, and the new feature node is connected to its M nearest nodes.
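Purely as an illustrative sketch of maintaining such an HNSW index, the snippet below uses the hnswlib library, which performs the exponentially decaying layer assignment and the M nearest-neighbor linking internally; the feature dimension, M and ef parameters are assumed example values.

    import hnswlib
    import numpy as np

    DIM = 4096                      # assumed global-feature dimension
    index = hnswlib.Index(space='l2', dim=DIM)
    index.init_index(max_elements=10000, M=16, ef_construction=200)
    index.set_ef(50)                # search-time breadth

    def insert_keyframe_feature(frame_id, feature):
        """Insert the global feature of a new key frame into the HNSW graph."""
        index.add_items(np.asarray(feature, dtype=np.float32).reshape(1, -1),
                        ids=np.array([frame_id]))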
Preferably, in the second step, the retrieval range of the current input image is specifically:
U_sa = U_before - U_{fr×ct}
where U_sa denotes the retrieval range of the input image; U_before denotes the set of all images preceding the current input image; fr is the frame rate of the camera; ct is a time constant; and U_{fr×ct} is the set of the fr×ct frames of images immediately preceding the current input image.
Preferably, in the fourth step, the specific process of extracting the ORB feature points and the corresponding local difference binary (LDB) descriptors from the current input image and from the retrieved closed-loop candidate image queue is as follows:
ORB feature points are extracted from the current input image and from the images in the closed-loop candidate image queue respectively. For each ORB feature point k_ij, an image patch S_ij of size s×s centered on k_ij is cropped, and S_ij is divided into c×c grid cells of equal size. The average intensity I_avg and the gradients d_x, d_y of each grid cell are calculated. A binary test is performed on every pair of grid cells of S_ij, and the resulting binary code is the binary LDB descriptor corresponding to the feature point k_ij.
Preferably, performing the binary test on any two grid cells m and n of S_ij is specifically:
τ(m, n) = 1 if f(m) > f(n), and τ(m, n) = 0 otherwise
where f(m) and f(n) respectively represent the average intensity I_avg or gradient d_x, d_y values of the grid cells m and n.
Preferably, in the fourth step, matching the input image with the descriptors of the images in the closed-loop candidate image queue is specifically:
the Hamming distance is used to match the LDB descriptors of the input image I_i and a closed-loop candidate image I_n. For an LDB descriptor d_a of the input image I_i, the two descriptors d_b1 and d_b2 closest to d_a are searched in the candidate image I_n. If d_b1 and d_b2 satisfy the following condition, d_a and d_b1 are considered to be a pair of satisfactory feature matches:
D(d_a, d_b1) < ε_d × D(d_a, d_b2)
where D(d_a, d_b1) and D(d_a, d_b2) respectively represent the Hamming distances between the feature descriptor d_a and the descriptors d_b1 and d_b2, and ε_d is the distance scaling factor, whose value is usually less than 1.
Preferably, using the Hamming distance to match the LDB descriptors of the input image I_i and the closed-loop candidate image I_n is specifically:
D(d_1, d_2) = Σ_i (d_1,i ⊕ d_2,i)
where d_1 and d_2 represent two LDB descriptors, d_1,i and d_2,i denote the i-th bit of each descriptor, and ⊕ denotes the XOR operation.
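As a small illustrative sketch (not taken from the patent text), the Hamming distance between two binary LDB descriptors packed as byte arrays can be computed as follows; storing the descriptors as uint8 arrays is an assumption made for the example.

    import numpy as np

    def hamming_distance(d1: np.ndarray, d2: np.ndarray) -> int:
        """Number of differing bits between two descriptors stored as uint8 arrays."""
        xor = np.bitwise_xor(d1, d2)
        # unpackbits expands every byte into its 8 bits so they can be summed.
        return int(np.unpackbits(xor).sum())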
Compared with the prior art, the invention has the beneficial effects that:
1. Key frames are selected through KLT sparse optical flow tracking, so the movement speed of the mobile robot does not need to be considered, images captured at turns are handled better, and the selected key frames are more representative. In addition, selecting key frames reduces the amount of computation in the matching process and improves the detection speed of the whole method.
2. The invention checks whether two images form a closed loop through the local difference binary (LDB) descriptor, which both captures the geometric topological relation between the two images and verifies whether they form a closed loop, improving the precision of closed-loop detection.
3. The invention extracts the global features of the image with a convolutional neural network trained on an image classification dataset and uses them for nearest neighbor image retrieval, and can therefore better cope with scenes whose appearance changes.
Drawings
FIG. 1 is a flow chart of a closed loop detection method based on key frame selection and local features of the present invention;
FIG. 2 is a flowchart of key frame selection for a closed loop detection method based on key frame selection and local features according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there are terms such as "upper", "lower", "left", "right", "long", "short", etc., indicating orientations or positional relationships based on the orientations or positional relationships shown in the drawings, it is only for convenience of description and simplicity of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationships in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
The technical scheme of the invention is further described in detail by the specific embodiments and the accompanying drawings:
Examples
Fig. 1-2 show an embodiment of a closed loop detection method based on key frame selection and local features, which includes the following steps:
Step one: the first frame of the input image sequence is identified as a key frame. The previous key frame of the current input image I_i is denoted I_{k-1}; the images I_i and I_{k-1} are converted to grayscale to obtain the images G_i and G_{k-1}. The Shi-Tomasi corner points of the image G_{k-1} are extracted, and Gaussian pyramid transforms are applied to the images G_{k-1} and G_i respectively to obtain L layers of images of different resolutions.
Because the brightness of a pixel point in the images G_{k-1} and G_i remains constant before and after the motion, the position P(x+dx, y+dy) in the image G_i of a corner point P(x, y) in layer L_m and the optical flow (u, v) are calculated by solving for the velocity components u and v of the optical flow along the X axis and the Y axis.
The optical flow value obtained from layer L_m is taken as the initial value for layer L_{m-1}, and the refined optical flow of layer L_{m-1} is calculated, continuing until the optical flow of the original image at the lowest layer L_0 and the tracked corner point P(x+dx, y+dy) are obtained.
When KLT sparse optical flow tracking is performed from the image G_{k-1} to the image G_i, tracking of a corner point is considered to have failed if either of the following occurs:
(1) the corner point P(x, y) falls outside the image range of G_i;
(2) the sum of the matching errors in the neighborhood of the matched corner points in the two images is greater than a threshold.
If the number of corner points that fail to be tracked is greater than the set threshold, the current input image I_i is considered to be a new key frame.
Step two: the current input image I_i is preprocessed by resizing it to 224 × 224 pixels. The convolutional neural network VGG16 trained on the Places365-Standard dataset is used to extract features from the image I_i, and the output of the penultimate fully connected layer of the VGG16 network is taken as the global feature f_{glo,i} of the image I_i. If the current input image is a key frame, the extracted global feature is inserted into the hierarchical navigable small world (HNSW) map of the approximate nearest neighbor retrieval algorithm.
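For illustration, a minimal sketch of this global-feature extraction with torchvision is given below; because Places365 weights are not bundled with torchvision, the example loads ImageNet weights as a stand-in, which is an assumption rather than the trained model described above.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # The VGG16 classifier head has three fully connected layers; index 3 is the
    # second (penultimate) one, so truncating after it yields a 4096-dim feature.
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:4])
    vgg.eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def global_feature(image_path: str) -> torch.Tensor:
        """Return the 4096-dim global feature f_glo,i of one image."""
        img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return vgg(img).squeeze(0)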
Step three: within the retrieval range of the current input image I_i, the key frame most similar to the current input image is retrieved through HNSW as the closed-loop candidate key frame of the current image, and all images between the closed-loop candidate key frame and the key frame of the next frame are taken as the closed-loop candidate image queue. Since adjacent images in the image sequence transmitted by the mobile robot are highly similar, the retrieval range of the current input image is all key frames within U_sa:
U_sa = U_before - U_{fr×ct}
where U_before is the set of all images preceding the current input image I_i, fr is the frame rate of the camera, ct is a time constant, and U_{fr×ct} is the set of the fr×ct frames of images immediately preceding the current input image.
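A hedged sketch of this retrieval step, reusing the hnswlib index from the earlier snippet (key frames are assumed to have been added with their frame index as id), might look as follows; the frame rate, time constant and k are example values.

    import numpy as np

    def query_loop_candidate(feature, current_frame_id, fr=30, ct=10, k=5):
        """Return the most similar key frame outside the recent fr*ct-frame window.

        `index` is the hnswlib.Index built in the earlier HNSW insertion snippet;
        discarding ids newer than current_frame_id - fr * ct enforces the
        retrieval range U_sa.
        """
        labels, dists = index.knn_query(
            np.asarray(feature, dtype=np.float32).reshape(1, -1), k=k)
        for frame_id, dist in zip(labels[0], dists[0]):
            if frame_id <= current_frame_id - fr * ct:
                return int(frame_id), float(dist)
        return None, None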
Step four: a geometric consistency check is introduced, and ORB feature points are extracted from the current input image I_i and from the images in the retrieved closed-loop candidate image queue respectively. For each ORB feature point k_ij, an image patch S_ij of size s×s centered on k_ij is cropped. Next, S_ij is divided into c×c grid cells of equal size, and the average intensity I_avg and the gradients d_x, d_y of each grid cell are calculated. For any two grid cells m and n of S_ij, the binary test is performed as follows:
τ(m, n) = 1 if f(m) > f(n), and τ(m, n) = 0 otherwise
where f(m) represents the average intensity I_avg or gradient d_x, d_y value of the grid cell m, and f(n) represents the average intensity I_avg or gradient d_x, d_y value of the grid cell n. After the binary test has been performed on all pairs of the c×c grid cells of S_ij, the obtained binary code is the binary LDB descriptor corresponding to the feature point k_ij.
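The following sketch illustrates one plausible reading of this LDB construction; the patch size, grid size and the ordering of the intensity and gradient bits are assumptions made for the example and are not values fixed by the method.

    import numpy as np

    def ldb_descriptor(gray, kp_x, kp_y, s=48, c=4):
        """Binary LDB-style descriptor of the s x s patch centered on (kp_x, kp_y).

        Assumes the keypoint lies at least s/2 pixels from the image border.
        The patch is split into c x c cells; each cell contributes its mean
        intensity and mean gradients d_x, d_y, and every pair of cells (m, n)
        contributes three bits via the test f(m) > f(n).
        """
        half = s // 2
        patch = gray[kp_y - half: kp_y + half, kp_x - half: kp_x + half].astype(np.float32)
        gy, gx = np.gradient(patch)
        cell = s // c

        feats = []  # one (I_avg, d_x, d_y) triple per grid cell
        for r in range(c):
            for q in range(c):
                sl = (slice(r * cell, (r + 1) * cell), slice(q * cell, (q + 1) * cell))
                feats.append((patch[sl].mean(), gx[sl].mean(), gy[sl].mean()))

        bits = []
        for m in range(len(feats)):
            for n in range(m + 1, len(feats)):
                for fm, fn in zip(feats[m], feats[n]):
                    bits.append(1 if fm > fn else 0)
        return np.packbits(np.array(bits, dtype=np.uint8))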
After the LDB descriptors of the current input image I_i and of the closed-loop candidate image queue have been obtained, the Hamming distance is used to match the LDB descriptors of the input image I_i with those of each image I_{q,n} in the closed-loop candidate image queue. For an LDB descriptor d_a of I_i, the two LDB descriptors d_b1 and d_b2 closest to d_a are searched in I_{q,n}. If d_b1 and d_b2 satisfy the following condition, d_a and d_b1 are considered to be a good feature match:
D(d_a, d_b1) < ε_d × D(d_a, d_b2)
where D(d_a, d_b1) denotes the Hamming distance between the descriptors d_a and d_b1, D(d_a, d_b2) denotes the Hamming distance between the descriptors d_a and d_b2, and ε_d is the distance scaling factor, usually taking a value smaller than 1.
The Hamming distance used to match the LDB descriptors of the input image I_i and the closed-loop candidate image I_n is specifically:
D(d_1, d_2) = Σ_i (d_1,i ⊕ d_2,i)
where d_1 and d_2 represent two LDB descriptors, d_1,i and d_2,i denote the i-th bit of each descriptor, and ⊕ denotes the XOR operation.
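A hedged sketch of this ratio-test matching, built on the hamming_distance helper from the earlier snippet, is shown below; the value of the distance scaling factor is an assumed example.

    def match_descriptors(desc_query, desc_candidate, eps_d=0.7):
        """Return index pairs (i, j) that pass the Hamming ratio test D1 < eps_d * D2.

        desc_query, desc_candidate: lists of packed uint8 descriptor arrays for
        the input image I_i and one closed-loop candidate image.
        """
        matches = []
        for i, da in enumerate(desc_query):
            dists = sorted((hamming_distance(da, db), j)
                           for j, db in enumerate(desc_candidate))
            if len(dists) >= 2 and dists[0][0] < eps_d * dists[1][0]:
                matches.append((i, dists[0][1]))
        return matches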
Step five: the closed-loop candidate image whose LDB descriptors best match those of the current input image I_i is taken as the optimal closed-loop candidate image, and the matched feature points of the two images are input into the random sample consensus (RANSAC) algorithm to further eliminate mismatches and solve the fundamental matrix. If the number of inliers between the two images is less than the threshold, the two images do not form a closed loop; if the number of inliers between the two images is not less than the threshold, the two images may form a closed loop.
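As an illustrative sketch of this geometric check with OpenCV, where the RANSAC reprojection threshold and the minimum inlier count are assumed example values rather than the thresholds specified by the method:

    import cv2
    import numpy as np

    def passes_geometric_check(pts_query, pts_candidate, min_inliers=20):
        """Estimate the fundamental matrix with RANSAC and count the inliers.

        pts_query, pts_candidate: Nx2 float32 arrays of matched point coordinates.
        Returns (possible_loop, fundamental_matrix).
        """
        if len(pts_query) < 8:            # the 8-point algorithm needs 8 matches
            return False, None
        F, mask = cv2.findFundamentalMat(pts_query, pts_candidate,
                                         cv2.FM_RANSAC, 1.0, 0.99)
        inliers = int(mask.sum()) if mask is not None else 0
        return inliers >= min_inliers, F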
Step six: a temporal consistency check is introduced: if the 2 consecutive frames of images following the current input image I_i all satisfy the threshold condition of step five, the current input image and the optimal closed-loop candidate image are considered to form a closed loop.
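A minimal sketch of this temporal check is given below; keeping a per-candidate run counter, and requiring the current frame plus the 2 following frames (3 in total) to pass, are assumptions made for the example.

    from collections import defaultdict

    consecutive_hits = defaultdict(int)   # candidate key-frame id -> run length

    def temporal_check(candidate_id, passed_step_five, required_run=3):
        """Accept a closed loop only after 3 consecutive frames pass step five."""
        if passed_step_five:
            consecutive_hits[candidate_id] += 1
        else:
            consecutive_hits[candidate_id] = 0
        return consecutive_hits[candidate_id] >= required_run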
The beneficial effects of this example are as follows: 1. Key frames are selected through KLT sparse optical flow tracking, so the movement speed of the mobile robot does not need to be considered, images captured at turns are handled better, and the selected key frames are more representative. In addition, selecting key frames reduces the amount of computation in the matching process and improves the detection speed of the whole method. 2. Whether two images form a closed loop is checked through the local difference binary (LDB) descriptor, which both captures the geometric topological relation between the two images and verifies whether they form a closed loop, improving the precision of closed-loop detection. 3. The global features of the image are extracted with a convolutional neural network trained on an image classification dataset and used for nearest neighbor image retrieval, so scenes whose appearance changes can be handled better.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A closed loop detection method based on key frame selection and local features is characterized by comprising the following steps:
step one: an input image is acquired by a mobile robot; the first frame of the input image sequence is determined to be a key frame, the Shi-Tomasi corner points of the previous key frame of the current input image are extracted, the corner points are iteratively tracked in the current input image by a sparse optical flow tracking algorithm, and if the number of corner points that cannot be tracked is greater than a threshold, the current input image is determined to be a new key frame;
step two: global features of the current input image are extracted by a convolutional neural network trained on an image classification dataset, and if the current input image is a key frame, the extracted global features are inserted into a hierarchical navigable small world (HNSW) map of an approximate nearest neighbor retrieval algorithm;
step three: within the retrieval range of the current input image, the key frame most similar to the current input image is retrieved through HNSW as the closed-loop candidate key frame of the current image, and all images between the closed-loop candidate key frame and the key frame following it are taken as a closed-loop candidate image queue;
step four: a geometric consistency check is introduced, ORB feature points and the corresponding local difference binary (LDB) descriptors are extracted from the input image and from the retrieved closed-loop candidate images respectively, and the descriptors of the input image are matched against those of the images in the closed-loop candidate image queue;
step five: the closed-loop candidate image whose LDB descriptors best match those of the current input image is taken as the optimal closed-loop candidate image, and the matched feature points of the two images are input into a random sample consensus algorithm to further eliminate mismatches and solve the fundamental matrix; if the number of inliers between the two images is less than a threshold, the two images do not form a closed loop; if the number of inliers between the two images is greater than the threshold, the two images may form a closed loop;
step six: a temporal consistency check is introduced; if the 2 consecutive frames of images following the current input image all satisfy the threshold condition of step five, the input image and the closed-loop candidate image are considered to form a closed loop.
2. A closed-loop detection method based on key-frame selection and local features as claimed in claim 1, characterized in that in said step one, the corner points are iteratively tracked in the current input image using a sparse optical flow tracking algorithm KLT, specifically:
the previous key frame of the current input image I_i is denoted I_{k-1}; the images I_i and I_{k-1} are converted to grayscale to obtain the images G_i and G_{k-1}; the Shi-Tomasi corner points of the image G_{k-1} are extracted; assuming that the brightness of a pixel point in I_i and I_{k-1} remains constant before and after the motion, the position P(x+dx, y+dy) in the image G_i of a corner point P(x, y) of the image G_{k-1} and the optical flow (dx/dt, dy/dt) are calculated.
3. The method according to claim 2, wherein, if the number of corner points that cannot be tracked is greater than the threshold, the current input image is considered to be a new key frame, specifically:
when KLT sparse optical flow tracking is performed from the key frame image G_{k-1} to the current input image G_i, tracking of a corner point is considered to have failed if either of the following occurs:
(1) the corner point P(x, y) falls outside the image range of G_i;
(2) the sum of the matching errors in the neighborhood of the matched corner points is greater than a threshold value;
if the number of corner points that fail to be tracked is greater than the set threshold, the current input image I_i is considered to be a new key frame.
4. The method as claimed in claim 3, wherein, in the second step, extracting the global features of the current input image with the convolutional neural network trained on the image classification dataset is specifically: the current input image I_i is preprocessed by resizing it to the input size required by the convolutional neural network, and the output of the penultimate fully connected layer of the convolutional neural network is taken as the global feature of the image.
5. The method according to claim 3, wherein, in the third step, if the current input image is a key frame, the specific process of inserting the extracted global features into the hierarchical navigable small world (HNSW) map of the approximate nearest neighbor retrieval algorithm is as follows: if the current input image I_i is selected as a key frame, the highest layer number l_max of the feature node of the image I_i in the HNSW structure is randomly assigned by an exponentially decaying probability distribution function, and the feature node is inserted into every layer from l_max down to the bottom layer l_0; in each of these layers, the M nodes nearest to the new feature node are searched, and the new feature node is connected to its M nearest nodes.
6. The method according to claim 1, wherein, in the second step, the retrieval range of the current input image is specifically:
U_sa = U_before - U_{fr×ct}
where U_sa denotes the retrieval range of the input image; U_before denotes the set of all images preceding the current input image; fr is the frame rate of the camera; ct is a time constant; and U_{fr×ct} is the set of the fr×ct frames of images immediately preceding the current input image.
7. The method as claimed in claim 1, wherein, in the fourth step, the specific process of extracting the ORB feature points and the corresponding local difference binary (LDB) descriptors from the current input image and from the retrieved closed-loop candidate image queue is as follows:
ORB feature points are extracted from the current input image and from the images in the closed-loop candidate image queue respectively; for each ORB feature point k_ij, an image patch S_ij of size s×s centered on k_ij is cropped, and S_ij is divided into c×c grid cells of equal size; the average intensity I_avg and the gradients d_x, d_y of each grid cell are calculated; a binary test is performed on any two grid cells of S_ij, and the obtained binary code is the binary LDB descriptor corresponding to the feature point k_ij.
8. The method of claim 7, wherein performing the binary test on any two grid cells m and n of S_ij is specifically:
τ(m, n) = 1 if f(m) > f(n), and τ(m, n) = 0 otherwise
where f(m) and f(n) respectively represent the average intensity I_avg or gradient d_x, d_y values of the grid cells m and n.
9. The method according to claim 8, wherein, in the fourth step, matching the input image with the descriptors of the images in the closed-loop candidate image queue is specifically:
the Hamming distance is used to match the LDB descriptors of the input image I_i and the closed-loop candidate image I_n; for an LDB descriptor d_a of the input image I_i, the two descriptors d_b1 and d_b2 closest to d_a are searched in the candidate image I_n; if d_b1 and d_b2 satisfy the following condition, d_a and d_b1 are considered to be a pair of satisfactory feature matches:
D(d_a, d_b1) < ε_d × D(d_a, d_b2)
where D(d_a, d_b1) and D(d_a, d_b2) respectively represent the Hamming distances between the feature descriptor d_a and the descriptors d_b1 and d_b2, and ε_d is the distance scaling factor, whose value is usually less than 1.
10. The method of claim 9, wherein using the Hamming distance to match the LDB descriptors of the input image I_i and the closed-loop candidate image I_n is specifically:
D(d_1, d_2) = Σ_i (d_1,i ⊕ d_2,i)
where d_1 and d_2 represent two LDB descriptors, d_1,i and d_2,i denote the i-th bit of each descriptor, and ⊕ denotes the XOR operation.
CN202011360902.8A 2020-11-27 2020-11-27 Closed loop detection method based on key frame selection and local features Active CN112396593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011360902.8A CN112396593B (en) 2020-11-27 2020-11-27 Closed loop detection method based on key frame selection and local features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011360902.8A CN112396593B (en) 2020-11-27 2020-11-27 Closed loop detection method based on key frame selection and local features

Publications (2)

Publication Number Publication Date
CN112396593A CN112396593A (en) 2021-02-23
CN112396593B true CN112396593B (en) 2023-01-24

Family

ID=74604695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011360902.8A Active CN112396593B (en) 2020-11-27 2020-11-27 Closed loop detection method based on key frame selection and local features

Country Status (1)

Country Link
CN (1) CN112396593B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM
CN109902619A (en) * 2019-02-26 2019-06-18 上海大学 Image closed loop detection method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM
CN109902619A (en) * 2019-02-26 2019-06-18 上海大学 Image closed loop detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robot SLAM implementation based on ORB key frame matching algorithm (基于ORB关键帧匹配算法的机器人SLAM实现); 艾青林 (Ai Qinglin) et al.; 《机电工程》 (Journal of Mechanical & Electrical Engineering); 2016-05-20 (No. 05); full text *

Also Published As

Publication number Publication date
CN112396593A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
Ding et al. Object detection in aerial images: A large-scale benchmark and challenges
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
Chen et al. Vehicle detection in high-resolution aerial images via sparse representation and superpixels
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN114202672A (en) Small target detection method based on attention mechanism
CN110781262B (en) Semantic map construction method based on visual SLAM
CN110287826B (en) Video target detection method based on attention mechanism
CN110738673A (en) Visual SLAM method based on example segmentation
CN109785298B (en) Multi-angle object detection method and system
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
Dong et al. Learning a robust CNN-based rotation insensitive model for ship detection in VHR remote sensing images
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
CN111723660A (en) Detection method for long ground target detection network
Saleem et al. Neural network-based recent research developments in SLAM for autonomous ground vehicles: A review
CN113724388B (en) High-precision map generation method, device, equipment and storage medium
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
Ali et al. A life-long SLAM approach using adaptable local maps based on rasterized LIDAR images
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
CN111932612A (en) Intelligent vehicle vision positioning method and device based on second-order hidden Markov model
CN116721206A (en) Real-time indoor scene vision synchronous positioning and mapping method
CN112396593B (en) Closed loop detection method based on key frame selection and local features
CN115187614A (en) Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant