CN113191435B

CN113191435B - Image closed-loop detection method based on improved visual dictionary tree

Info

Publication number: CN113191435B
Application number: CN202110493714.0A
Authority: CN
Inventors: 赵勃; 杭程
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-05-07
Filing date: 2021-05-07
Publication date: 2022-08-23
Anticipated expiration: 2041-05-07
Also published as: CN113191435A

Abstract

The invention provides an image closed-loop detection method based on an improved visual dictionary tree, which comprises the following steps: step 1, establishing a visual dictionary tree by utilizing a layered K-means clustering scheme, and describing at least two images of a preset frame by using each node in the visual dictionary tree and a score vector obtained by TF-IDF entropy of each node; step 2, carrying out similarity calculation on the images in the step 1; and 3, processing the miscorrect closed loop by using the constraint relation of the image on time and space. The method effectively reduces the problem of perception ambiguity in closed-loop detection and can effectively improve the recall rate of closed-loop detection.

Description

Image closed-loop detection method based on improved visual dictionary tree

Technical Field

The invention relates to an image closed-loop detection method based on an improved visual dictionary tree, and belongs to the technical field of mobile robot navigation.

Background

With the rapid development of modern high-tech technologies, the research on mobile robot technology is also continuously advanced. The Simultaneous positioning and Mapping technology (SLAM) which receives much attention in automatic driving, unmanned aerial vehicle autonomous navigation and scene three-dimensional reconstruction is a key technology for realizing that a mobile robot can autonomously complete tasks. The simultaneous positioning and mapping technology means that when a mobile robot enters an unknown environment, a 3D environment map of the current environment of the mobile robot is constructed through sensors such as a camera, a laser radar and an IMU carried by the mobile robot, and the position of the robot in the map can be determined simultaneously. In the visual SLAM problem with a camera as a sensor, errors are accumulated continuously by a system along with the increase of tracking time, and the closed-loop detection technology has good effects on eliminating the accumulated errors of the position and posture estimation of a robot and keeping the accuracy of a map for a long time by judging whether the robot returns to an environment area visited before, and is a key link and a basic problem in the SLAM technology.

The existing closed loop detection scheme has the problems of perception ambiguity under the condition of similar scenes and low recall rate under the condition of ensuring the accuracy.

In view of the above, it is necessary to provide an improved visual dictionary tree-based image closed-loop detection method to solve the above problems.

Disclosure of Invention

The invention aims to provide an image closed-loop detection method based on an improved visual dictionary tree, so that the problem of perception ambiguity existing in closed-loop detection is effectively reduced, and the recall rate of closed-loop detection can be effectively improved.

In order to achieve the above object, the present invention provides an image closed-loop detection method based on an improved visual dictionary tree, comprising the following steps:

the image closed-loop detection method based on the improved visual dictionary tree comprises the following steps:

step 1, establishing a visual dictionary tree by utilizing a layered K-means clustering scheme, and describing at least two images of a preset frame by using each node in the visual dictionary tree and a score vector obtained by TF-IDF entropy of each node;

step 2, carrying out similarity calculation on the images in the step 1;

and 3, processing the misadjustment closed loop by using the constraint relation of the image on time and space.

As a further improvement of the present invention, step 1 comprises:

step 1-1, establishing a visual dictionary tree by adopting a hierarchical K-means clustering scheme: creating a tree with kappa branches and l layers, recursively calling a K mean value clustering method for each branch of each layer to obtain kappa finer branches of the next layer, and calling the kappa finer branches to the l layer to finish;

step 1-2, extracting image features from the collected image, and projecting the image features to a visual dictionary tree to obtain a description vector corresponding to the image;

1-3, expressing the scoring weight of the image at different nodes by using TF-IDF entropy of different nodes in the visual dictionary tree:

wherein l represents the number of levels of the visual dictionary tree, i represents the number of nodes at the l-th level,

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image feature and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with the existing image features projected on the node i, lambda _i Representing the coefficient of variation; TF represents the frequency of a certain word appearing in an image, and the higher the frequency is, the higher the word discrimination is; the IDF represents the frequency of a certain word in all words, and the lower the frequency is, the more distinguishing the classification of the image is;

1-4, expressing the score vector of the image in the whole visual dictionary tree by using the score weights of the image at different nodes as follows:

W(P)＝(W ¹ (P),W ² (P),…,W ^L (P))

wherein W (P) represents a score vector of the image P, W ¹ (P) a score vector, W, for image P at the first level ² (P) a score vector, W, for image P on the second layer ^L (P) a score vector representing picture P on level L;

1-5, the score vector of the image P at the l-th layer is represented as:

wherein, W ^l (P) represents a score vector of the picture P on the l-th layer,

a score vector representing the 1 st node of the image P on the l-th level,

a score vector representing the 2 nd node of image P on the l-th level,

indicating that the picture P is on the l-th layer ^l Score vectors for individual nodes.

As a further improvement of the invention, in step 1, the TF-IDF entropy of each node is expressed as:

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image feature and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with image features projected to the node i, lambda _i The coefficient of variation is indicated.

As a further development of the invention, the lambda _i The calculation formula of (c) is:

wherein CV is _i Coefficient of variation representing the number of words of the ith node, alpha being the coefficient of variation average scale factor, k ^ι Representing the number of words.

As a further improvement of the invention, the CV is _i The calculation formula of (c) is:

where σ denotes a standard deviation of the number of occurrences of the word of the ith node, and μ denotes an average value of the number of occurrences of the word of the ith node.

As a further improvement of the present invention, step 2 comprises:

step 2-1, representing similarity scores of words by using the minimum value of the score weights of the image P and the image Q in the same word;

step 2-2, when the image M, the image P and the image Q exist, if the similarity score of the image M and the image Q is the same as the similarity score between the image P and the image Q, the step 2-3 is carried out;

step 2-3, improving a calculation formula of the similarity score, which is as follows:

wherein the content of the first and second substances,

representing the similarity score of the ith node of image P and image Q at the ith level in the visual dictionary tree,

a score vector representing the ith node of the image P at the ith level in the visual dictionary tree,

the score vector of the ith node of the image Q on the ith layer in the visual dictionary tree;

step 2-4, when the number of words existing in each image is far less than the number of all words in the visual dictionary tree, namely the scoring weight of many words in the image is 0, the calculation formula of the improved similarity score is as follows:

wherein the content of the first and second substances,

representing the similarity score of the ith node at the ith level in the visual dictionary tree for image P and image Q,

step 2-5, based on the calculation formula of the similarity score of the ith node of the image P and the image Q on the ith layer in the visual dictionary tree, defining the calculation formula of the similarity score of the image P and the image Q on the ith layer as follows:

a similarity score representing the ith node of the image P and the image Q on the ith layer in the visual dictionary tree;

step 2-6, defining the increment of the similarity score of the image P and the image Q on the l level based on the function of the similarity score of the image P and the image Q on the l level to avoid the repeated accumulation of the similarity of the image P and the image Q in the visual dictionary tree from top to bottom, wherein the increment of the similarity score of the image P and the image Q on the l level is defined as follows:

wherein S is ^l (P, Q) represents a similarity calculation score between the image P and the image Q at the l-th layer, S ^l+1 (P, Q) represents a similarity calculation score of the image P and the image Q at the l +1 th layer;

step 2-7, based on the increment of the similarity score in step 2-6, the formula defining the similarity calculation score between P, Q two images is:

where K (P, Q) represents a similarity calculation score of the image P and the image Q, S ^l (P, Q) represents a similarity calculation score between the image P and the image Q at the l-th layer, S ^l+1 (P, Q) represents a similarity calculation score between the image P and the image Q at the l +1 th layer, S ^L (P, Q) represents a similarity calculation score of the image P and the image Q at the L-th layer,

representing the matching strength factor of the visual dictionary tree.

As a further improvement of the present invention, in step 2-1, the similarity score is calculated by the formula:

wherein the content of the first and second substances,

a score vector representing the ith node of the image P on the ith layer,

the score vector of the ith node on the ith layer of the image Q.

As a further improvement of the present invention, in step 2-2,

wherein the content of the first and second substances,

the score weight of the ith node representing the image M on the l-th level of the visual dictionary tree,

the scoring weight of the ith node representing the image P on the ith level of the visual dictionary tree,

represents the scoring weight of the ith node of image Q on the ith level of the visual dictionary tree.

As a further improvement of the invention, the step 3 comprises the following steps:

3-1, eliminating a false positive closed loop by using time consistency constraint of the image;

and 3-2, eliminating the false closed loop by using the space consistency constraint of the image.

The invention has the beneficial effects that: the invention improves the TF-IDF entropy expression method of each node by adjusting the visual dictionary tree; and a similarity calculation method between the images is improved, so that the problem of perception ambiguity existing in closed-loop detection is effectively reduced, and the recall rate of the closed-loop detection can be effectively improved.

Drawings

FIG. 1 is a flow chart of the image closed-loop detection method based on the improved visual dictionary tree according to the present invention.

Fig. 2 (a) is a graph of accuracy versus recall in a data set fr3_ long _ office _ house according to the improved visual dictionary tree-based image closed-loop detection method and IAB-MAP detection method, FAB-MAP detection method, and RTAB-MAP detection method of the present invention.

Fig. 2 (b) is a graph of accuracy versus recall in the data set fr2_ pioneer _ slam2 of the improved visual dictionary tree-based image closed-loop detection method and IAB-MAP detection method, FAB-MAP detection method, and RTAB-MAP detection method of the present invention. .

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The closed loop detection has the effects that the current position information of the mobile robot and the observed map point information are combined by utilizing a scene recognition algorithm to judge whether the mobile robot returns to a scene which is experienced once, the constraint is added between the current pose and the previous pose, and the accumulated error brought by the system is reduced. The method aims to reduce the problem of perception ambiguity existing in a traditional closed-loop detection algorithm based on a visual bag-of-words model (BOVW).

As shown in fig. 1, to solve the problems in the prior art, the present invention provides an improved image closed-loop detection method based on an improved visual dictionary tree, which includes the following steps:

step 1, establishing a visual dictionary tree by utilizing a layered K-means clustering scheme, and describing at least two images of a preset frame by using each node in the visual dictionary tree and a score vector obtained by TF-IDF entropy of each node; the TF-IDF entropy of each node is expressed as:

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image features and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with the existing image features projected on the node i, lambda _i Represents an improved coefficient of variation;

step 2, similarity calculation is carried out on the images in the step 1;

In other words, the above steps include three aspects: 1. a vector description of the improved image; 2. improved methods of similarity calculation between images; 3. closed loop posterior.

1. Improved vector description of images.

The closed-loop detection algorithm of the classic visual bag-of-word model uses a large number of image features to train to obtain a visual dictionary tree, then extracts the image features of the collected images, and projects the image features to the visual dictionary tree, so that the description vector of the images is obtained.

In the invention, the real-time calculation requirement of a large number of images in an actual scene is considered, and the visual dictionary tree is established by adopting a layered K mean value clustering scheme: creating a tree with kappa branches and l layers, recursively calling a K-means clustering method for each branch of each layer to obtain kappa finer branches of the next layer, and calling the branches to the l layer to finish.

In the prior art, the scoring weight of the defined image at different tree nodes can be represented by TF-IDF entropy of each tree node. TF represents the frequency of a certain word appearing in an image, and the higher the frequency is, the higher the word discrimination is; IDF represents the frequency with which a word appears in all words, the lower the frequency, the more discriminative the classification of the image.

Define TF-IDF entropy as:

representing the scoring weight, n, of the ith node of the image P on the l-th level of the tree _i And N represents the number of feature points projected to the node i and the total number of feature points, N and N, respectively _i Representing the total number of images that need to be processed and the number of images that present the projection of the feature onto node i, respectively.

However, consider a case for the TF-IDF entropy defined by the above formula: as shown in Table 1,. omega. ₁ ，ω ₂ ，ω ₃ The IDF values calculated according to the above formula for these three words are the same, ω from TF point of view ₁ The most frequent word occurs, with the highest scoring weight, ω ₂ Sub, omega ₃ Minimum; but actually fromFrom the word discrimination view, omega ₃ The number of times a word appears in each image is relative to ω ₁ And omega ₂ Said large span, ω ₃ The word should get the highest weight. Clearly the two are contradictory.

TABLE 1 number of words in database image

Based on this problem, the present invention introduces an improved coefficient of variation to assist the calculation of scoring weights for words. The coefficient of variation defining the number of words is:

in the above equation, σ represents the standard deviation of the number of occurrences of the word, and μ represents the average of the number of occurrences of the word. For ω in Table 1 ₁ In the case of the standard deviation of 0, the coefficient of variation is improved, and the improved coefficient of variation is defined as lambda _i ：

Wherein CV is _i Representing the coefficient of variation of the word represented by the ith node, alpha being the coefficient of variation average scale factor, kappa ^ι Representing the number of visual words.

Thus, the improved TF-IDF entropy is expressed as:

and using the improved TF-IDF entropy of different nodes in the visual dictionary tree as the score weight of different visual words, thereby obtaining a score vector of words in the image for describing the scene. The score vector of the image P in the whole visual dictionary tree is represented as: w: (W:P)＝(W ¹ (P),W ² (P),…,W ^L (P)), wherein W ^L (P) represents the score vector of the image on level L, expressed as:

specifically, the step 1 includes:

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image feature and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with image features projected to the node i, lambda _i Represents an improved coefficient of variation; TF represents the frequency of a certain word appearing in an image, and the higher the frequency is, the higher the word discrimination is; the IDF represents the frequency of a certain word in all words, and the lower the frequency is, the more distinguishing the classification of the image is;

W(P)＝(W ¹ (P),W ² (P),…,W ^L (P))

in steps 1-5, the score vector of the image P at the l-th layer is represented as:

wherein, W ^l (P) represents a score vector of the picture P on the l-th layer,

score vectors representing the 1 st node of the image P on the l-th level,

a score vector representing the 2 nd node of image P on the l-th level,

2. Improved similarity score algorithm

For the algorithm for calculating the similarity score between images, in the prior art, the BOVW scheme is a calculation formula for expressing the similarity score of a word by using P, Q minimum values of the two images in the same word score weight:

such a similarity score is expressed by using the minimum value of the score weight of the same word in the image, and although the similarity degree of a single node can be judged well, there is still a problem. According to the above formulaExpressed, if there are three images: image M, image P and image Q satisfying

Then, there is a similarity score between the image M and the image Q which is the same as the similarity score between the image P and the image Q. However, contrary to our cognitive scope, we would consider that the closer the similarity is, the higher the similarity is between the two images, and that image P should be more similar to image Q than to image M.

In order to avoid the above problem of perception ambiguity, and at the same time, considering that the number of words existing in each image is far less than the number of all words in the visual dictionary tree, the score of many words existing in the image is 0, and in order to improve the calculation efficiency of the whole algorithm, the calculation formula of the similarity score is improved as follows:

based on the improved calculation formula of the similarity score of the new single node, the similarity score function of the image at the l-th layer is defined as follows:

considering that the visual dictionary tree is built layer by layer from top to bottom, a certain layer of space of the visual dictionary tree will inevitably contain a part of the image similarity of the next layer of the layer. Therefore, in order to avoid repeated accumulation of similarity, a scheme of calculating the similarity score increment from the bottom to the top of the lowest layer is adopted, and then the similarity score increment of the ith layer can be expressed as:

therefore, the similarity calculation between two images is defined as:

wherein, the first and the second end of the pipe are connected with each other,

and representing the matching strength coefficient, and constraining the matching difference between different levels in the dictionary tree.

Specifically, step 2 includes:

step 2-1, representing the similarity score of the words by using the minimum value of the score weights of the images P and Q on the same word;

step 2-2, when the conditions are satisfied

If so, namely the similarity score of the image M and the image Q is the same as the similarity score between the image P and the image Q, the step 2-3 is carried out;

step 2-3, improving a calculation formula of the similarity score, as follows:

wherein the content of the first and second substances,

a score vector representing the ith node of image P at the l-th level in the visual dictionary tree,

a similarity score representing the ith node of image P and image Q on the l-th level in the visual dictionary tree;

wherein S is ^l (P, Q) represents a similarity calculation score of the image P and the image Q at the l-th layer, S ^l+1 (P, Q) represents a similarity calculation score of the image P and the image Q at the l +1 th layer;

where K (P, Q) represents a similarity calculation score of the image P and the image Q, S ^l (P, Q) represents a similarity calculation score between the image P and the image Q at the l-th layer, S ^l+1 (P, Q) represents a similarity calculation score between the image P and the image Q at the l +1 th layer, S ^L (P, Q) represents a similarity calculation score between the image P and the image Q at the L-th layer,

a matching strength coefficient representing a visual dictionary tree.

3. Closed loop posterior test

Due to the fact that the spatial position relation and semantic correlation among different features of the image are omitted, some closed loops obtained through similarity calculation can be mistaken for correct closed loops. The invention utilizes the constraint relation of images in time and space to process the miscorrect closed loop.

Specifically, step 3 comprises:

and 3-1, eliminating the error closed loop by using time consistency constraint. Because the robot runs in a continuous time, images collected by the robot should be continuous in time, and adjacent images should correspond to the continuous change of the same scene, so that when a closed loop exists at a certain moment, closed loops also exist at the later moments, and if the candidate closed loop does not meet the constraint of the time consistency, the candidate closed loop is rejected.

And 3-2, eliminating the error closed loop by using space consistency constraint. When a closed-loop phenomenon occurs, two images for generating the closed loop should correspond to the same scene, and only the imaging angles of the images are different, so that the two images can eliminate the error closed loop by utilizing space consistency constraint. And calculating a basic matrix by matching the pose of the two images with the feature points, comparing the number of inner points of the basic matrix with a set threshold, and if the number of inner points exceeds the threshold, keeping the two images to form a closed loop.

A plurality of experiments using the improved visual dictionary tree-based image closed-loop detection method of the present invention will be described in detail below. In the closed-loop detection problem, an important index for evaluating the closed-loop detection performance is an accuracy-recall curve, and the accuracy represents that: the proportion of the real closed loop in the closed loops detected by all algorithms, and the recall rate represents: the closed loops that the algorithm correctly detects are a percentage of all the actual closed loops. The calculation formulas of the accuracy rate and the recall rate are respectively shown as the following formulas:

wherein, TP represents the number of correct closed loops in the closed loops detected by the algorithm, FP represents the number of closed loops which are not actually detected by the algorithm, FN represents the number of closed loops which are not detected by the algorithm and the actual result is the number of closed loops.

As shown in fig. 2 (a) and fig. 2 (b), in order to demonstrate the effect of the formula for calculating the corresponding description vector and the similarity score of the image proposed by the present invention, two data sets were selected from the TUM RGB-D data set for experimental verification. The first is the data set fr3_ long _ office _ house in a complex indoor environment; the second is the data set fr2_ pioneer _ slam2, which is extremely similar in context, and prone to perceptual ambiguities. The image closed-loop detection method based on the improved visual dictionary tree is compared with a plurality of classical image closed-loop detection algorithms including IAB-MAP, FAB-MAP and RTAB-MAP in a data set fr3_ long _ office _ house hold and a data set fr2_ pioneer _ slam2 respectively. As can be seen from fig. 2 (a) and fig. 2 (b), the improved visual dictionary tree-based image closed-loop detection method of the present invention has a higher recall rate with an accuracy of 100%, so that the influence of perceptual ambiguity can be effectively reduced.

In conclusion, each node in the visual dictionary tree and TF-IDF entropy of the node are used for forming a vector to describe at least two images of a preset frame, similarity calculation is carried out on the images represented by the vector, and a constraint relation of the images on time and space is used for processing a miscorrect closed loop; further, the establishment of the visual dictionary tree is adjusted, and a TF-IDF entropy expression method of each node is improved; improving the similarity calculation method to reduce the perception ambiguity problem existing in closed loop detection; and respectively processing the error closed loops by utilizing time consistency and space consistency, eliminating the error closed loops as correct closed loops, and finally determining the correct closed loops. The method can effectively reduce the problem of perception ambiguity in closed-loop detection and can effectively improve the recall rate of the closed-loop detection.

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. An improved visual dictionary tree-based image closed-loop detection method is characterized by comprising the following steps:

step 2, carrying out similarity calculation on the images in the step 1;

step 2-3, improving a calculation formula of the similarity score, as follows:

wherein the content of the first and second substances,

representing the phase of the ith node at the l-th level in the visual dictionary tree for image P and image QThe similarity score is obtained by the similarity analysis method,

step 2-7, based on the increment of the similarity score in step 2-6, defining P, Q the formula for the similarity calculation score between the two images as:

a matching strength coefficient representing a visual dictionary tree;

2. The improved visual dictionary tree-based image closed-loop detection method according to claim 1, wherein the step 1 comprises:

wherein l represents the number of levels of the visual dictionary tree, i represents the number of nodes of the l-th level,

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image features and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with the existing image features projected on the node i, lambda _i Representing the coefficient of variation; TF represents the frequency of a certain word appearing in an image, and the higher the frequency is, the higher the word discrimination is; the IDF represents the frequency of a certain word in all words, and the lower the frequency is, the more distinguishing the classification of the image is;

W(P)＝(W ¹ (P),W ² (P),…,W ^L (P))

wherein W (P) represents a score vector of the image P, W ¹ (P) a score vector, W, for image P at the first level ² (P) score vector, W, on the second layer representing image P ^L (P) a score vector representing picture P on level L;

1-5, the score vector of the image P at the l-th layer is represented as:

wherein, W ^l (P) represents a score vector of the picture P on the l-th layer,

a score vector representing the 1 st node of the image P on the l-th level,

a score vector representing the 2 nd node of the image P on the l-th layer,

3. The improved visual dictionary tree-based image closed-loop detection method according to claim 1, wherein in step 1, the TF-IDF entropy of each node is expressed as:

representing the scoring weight, n, of the ith node of the image P on the l-th level of the visual dictionary tree _i And N represents the number of feature points projected to the node i by the image feature and the total number of the feature points, N and N _i Respectively representing the total number of images to be processed and the number of images with the existing image features projected on the node i, lambda _i The coefficient of variation is indicated.

4. The improved visual dictionary tree-based image closed-loop detection method according to claim 3, wherein λ _i The calculation formula of (2) is as follows:

5. The improved visual dictionary tree-based image closed-loop detection method according to claim 4, wherein the CV is a function of the image closed-loop detection method _i The calculation formula of (c) is:

6. The improved visual dictionary tree-based image closed-loop detection method according to claim 1, wherein in step 2-1, the similarity score is calculated by the formula:

wherein the content of the first and second substances,

a score vector representing the ith node of image P on the ith layer,

the score vector of the ith node on the ith layer of the image Q.

7. The improved visual dictionary tree-based image closed-loop detection method according to claim 1, wherein, in step 2-2,

wherein the content of the first and second substances,

8. The improved visual dictionary tree-based image closed-loop detection method according to claim 1, wherein step 3 comprises:

3-1, eliminating a false closed loop by using time consistency constraint of the image;