CN110296705B

CN110296705B - Visual SLAM loop detection method based on distance metric learning

Info

Publication number: CN110296705B
Application number: CN201910575905.4A
Authority: CN
Inventors: 高瑜; 陈良
Original assignee: Suzhou Ruijiu Intelligent Technology Co ltd
Current assignee: Suzhou Ruijiu Intelligent Technology Co ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2022-01-25
Anticipated expiration: 2039-06-28
Also published as: CN110296705A

Abstract

The invention discloses a visual SLAM loop detection method based on distance metric learning, which comprises the following steps: pre-training the CNN model, and optimizing by using a training set; averagely dividing pictures of a training set into k groups, and simultaneously inputting k CNN pre-training models which share parameters; constructing a multi-tuple for training by using a multi-tuple construction method; after constructing the multi-element groups of all scenes, optimizing a CNN model according to the distance relation among the multi-element groups; when no proper multi-tuple is constructed in the fourth step, directly finishing the training and entering a testing stage; and (5) using the optimized CNN model to perform actual robot loop detection application. The invention solves the technical problem of robustness to the appearance change and the visual angle change at the same time, and reduces the computation amount of similarity measurement.

Description

Visual SLAM loop detection method based on distance metric learning

Technical Field

The invention discloses a visual SLAM loop detection method based on distance metric learning, and relates to the technical field of robot mobile positioning.

Background

Visual SLAM (simultaneous localization and mapping) is a key technology in the field of mobile robots. In a SLAM system, the robot will model the surrounding environment and estimate its motion trajectory synchronously. A typical visual SLAM system is generally made up of several modules: visual odometry, rear-end optimization, loop detection and graph building. The loop detection module is used for automatically detecting whether the robot has arrived at a certain place or not. If the loop is successfully detected, the robot can provide an additional constraint term for the optimization of the back end and reduce the optimization error.

During the actual navigation, the environment around the robot may change. These changes can be classified into changes in appearance and changes in viewing angle. The change in appearance may be due to changes in lighting, weather, and shadows. Meanwhile, pictures shot by the robot at different angles in the same place in the moving process look different. Therefore, loop detection needs to be robust to changes in appearance and viewing angle.

In addition to being robust to changes in appearance and viewing angle, another important issue for loop detection is real-time. The robot needs to determine whether the current position has passed or not within a short time. However, loop back detection cannot occupy much computing resources, since other modules of the visual SLAM also consume a large amount of memory.

At present, most of loop detection methods for mainstream visual SLAM systems are bag-of-word models. The bag-of-words model requires manual design of features and composition of the features into a dictionary, and the distance between the target picture and the features in the dictionary is calculated, so as to judge whether a loop is formed. However, this method of artificially designing features is highly susceptible to interference from environmental factors such as lighting, shooting angles, and dynamic objects.

Recently, many researchers have used Convolutional Neural Networks (CNNs) to solve the loop detection problem. Compared with the traditional method, the CNN can extract picture features with higher quality, and the CNN model pre-trained on the ImageNet data set has strong generalization and can be used for solving various different visual tasks. Research has found that features extracted from the middle layer of the CNN are robust to appearance changes, and features extracted from the high layer of the CNN are robust to view angle changes. However, how to have robustness to both appearance change and viewing angle change remains a key technical challenge. Meanwhile, the similarity measure of pictures is usually the euclidean distance between features of the comparison pictures. However, the image feature dimension extracted by the CNN model is very high, so that the similarity measurement requires a large amount of computation, which is a great challenge to real-time performance.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the defects of the prior art, the visual SLAM loop detection method based on distance metric learning is provided, the technical problem that robustness is provided for appearance change and visual angle change at the same time is solved, and the calculation amount of similarity metric is reduced.

The invention adopts the following technical scheme for solving the technical problems:

a visual SLAM loop detection method based on distance metric learning, the method comprising the steps of:

firstly, pre-training a CNN model, and optimizing by using a training set;

averagely dividing the pictures of the training set into k groups, and simultaneously inputting k CNN pre-training models which share parameters;

step three, constructing a multi-tuple by using a multi-tuple construction method for training;

fourthly, after the multi-element groups of all scenes are constructed, optimizing the CNN model according to the distance relation among the multi-element groups;

step five, when no proper multi-tuple is constructed in the step four, directly finishing the training and entering a testing stage;

and step six, using the optimized CNN model to perform actual robot loop detection application.

As a further preferred aspect of the present invention, in the first step, the training set is a NewCollege, City Center and TUM data set; the New college and City Center datasets are collected by two cameras on the left and right sides of the mobile robot and provide GPS information and real loopback picture markers; the TUM data set is an indoor data set collected by a depth camera and provides true trajectory information.

As a further preferable scheme of the present invention, in the second step, the pre-training model is VGG-16, and the picture feature is extracted by using the first full-connected layer of the VGG-16.

As a further preferable scheme of the present invention, the step three specifically includes:

3.1, taking all picture features of the first scene, and calculating the central points of the feature vectors;

let the feature vector O belong to the ith sceneⁱComprises the following steps:

wherein n is the number of the feature vectors;

the feature vector center point of the ith scene

Comprises the following steps:

wherein a is the a-th feature vector belonging to the i-th scene;

3.2, calculating Euclidean distances from all the characteristic vectors of the scene to a central point, and if the characteristic vectors larger than a threshold value d exist, directly entering the step 3.4 to construct a multi-element group for the scene; if all the Euclidean distances from the feature vectors to the central point are smaller than the threshold value d, the step 3.3 is carried out to continue judging;

3.3, calculating Euclidean distances from all the characteristic vectors which do not belong to the scene to the central point, if the characteristic vectors smaller than a threshold value d exist, entering step 3.4, constructing a multi-element group for the scene, and if the characteristic vectors are not smaller than the threshold value d, directly entering step 3.5, and judging whether the scene is traversed completely;

3.4, constructing a multi-element group for the scene, taking A picture characteristics belonging to the scene and B picture characteristics not belonging to the scene, wherein the A picture characteristics and the B picture characteristics are from different scenes, and the multi-element group UⁱCan be expressed as:

wherein A is the number of the selected characteristic vectors belonging to the ith scene, and B is the number of the characteristic vectors not belonging to the ith scene;

and 3.5, judging whether the scene is traversed or not, if so, starting to perform reverse propagation, and otherwise, returning to the step 3.1 to take the picture characteristics of the next scene.

As a further preferred embodiment of the present invention, in step four, in the optimization process, the distance loss function L of the multi-tuple pictures of 5 scenes and the multi-tuple picture of the p-th scene are input each time_{multi-constraintunit}(O) is calculated as follows:

wherein the content of the first and second substances,

is the feature vector center point of the p-th scene,

is the feature vector in the p-th scene that is farthest from the center point,

is the feature vector of the non-p-th scene closest to the center point,

is different from random selection in the tuples

Is a predefined constant parameter, representing a

To

A distance between

To

The minimum boundary of the distance between, beta is a predefined constant parameter, representing

To

A distance between

To

The minimum boundary of the distance between.

As a further preferable scheme of the present invention, the step six specifically includes:

6.1, inputting continuous k frames of pictures into the optimized k shared CNN models simultaneously, wherein k is A + B, and extracting picture characteristics of each picture;

6.2, carrying out similarity measurement on the extracted feature vector of each frame of picture and the feature vector in the historical picture features;

6.3, supposing that the similarity measurement is carried out on the i-frame picture, selecting a T-frame picture closest to the (i-1) frame picture, taking the T-frame picture as a center, and selecting Num frame pictures before and after the T-frame picture as reference candidate frames, so that at most T (Num +1) reference candidate frames can be obtained;

comparing Euclidean distances between the ith frame picture and each reference candidate frame, and if the Euclidean distances are smaller than a threshold value theta, taking the reference candidate frame as an undetermined loop of the ith frame picture so as to reduce the similarity measurement range;

6.4, once a loop is formed between two frames, the two frames are listed as pending loops;

and further judging whether a loop exists between the front frame and the rear frame of the two frames, and if the loop is formed between the front frame and the rear frame, determining the front frame and the rear frame as loop frames.

As a further preferable aspect of the present invention, the method for determining whether a loop exists between two frames before and after the two frames includes:

calculating the Euclidean distance between the feature vector extracted from the Mth frame and the feature vector extracted from the Nth frame, and recording the 1 st result as a loop if the distance is less than a threshold value theta;

respectively calculating Euclidean distances between the feature vectors extracted from the two frames before and after the Mth frame and the feature vectors extracted from the two frames before and after the Nth frame, and recording the Euclidean distances as a W result, wherein W is greater than 1, if a loop exists in the W result, recording the Mth frame and the Nth frame as a loop group, otherwise, recording the Mth frame and the Nth frame as a non-loop group.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects: the invention solves the technical problem of robustness to the appearance change and the visual angle change at the same time, and reduces the computation amount of similarity measurement.

Drawings

FIG. 1 is a system flow diagram of the method of the present invention.

Fig. 2 is a schematic diagram of a training process in the method of the present invention.

FIG. 3 is a schematic diagram of a multi-element construction method in the method of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

The technical scheme of the invention is further explained in detail by combining the attached drawings:

the system flow chart of the method of the invention is shown in fig. 1, and the method for detecting visual SLAM loop based on distance metric learning disclosed by the invention comprises the following specific steps:

step one, on the basis of a pre-trained CNN model, a training set is used for optimization, and then picture features are extracted. The type of training set and the structure of the CNN are two important considerations for feature extraction. The New college and City Center datasets are two outdoor datasets that are widely used in the field of visual SLAM closed loop detection. They are collected by the mobile robot at the left and right two cameras and provide GPS information and real loop back picture markers. The TUM is an indoor data set acquired by a depth camera and provides true trajectory information. Therefore, the selection is trained on the NewCollege, City Center, and TUM datasets.

Secondly, VGG-16 is a classification model trained on a large object recognition dataset ImageNet. The device has simple and effective structure and strong generalization capability to other data sets. Therefore, VGG-16 is selected as the pre-training model. Meanwhile, studies have shown that features extracted from different layers of CNN exhibit different properties in image tasks. The middle layer characteristics of the CNN have better generalization capability than the high layer characteristics, and the higher the number of the CNN full-connection layers is, the worse the generalization capability is. In order to ensure that the extracted features have low dimensionality and good generalization capability, the first full-link layer of VGG-16 is selected to be used for extracting the picture features.

Step two, the training process is as shown in fig. 2, before starting the training, firstly, the pictures of the training set are averagely divided into k groups, and k CNN pre-training models are simultaneously input, and the k CNN pre-training models share parameters. In this way, the feature vector of each picture can be extracted.

And step three, constructing the multi-tuple by using a multi-tuple construction method for training, as shown in fig. 3.

And 3.1, taking all picture features of the first scene, and calculating the central points of the feature vectors. Let the feature vector belonging to the ith scene be:

n is the number of feature vectors. Then the feature vector center point of the ith scene is:

a is the a-th feature vector belonging to the i-th scene.

And 3.2, calculating Euclidean distances from all the feature vectors of the scene to a central point, and if the feature vectors larger than a threshold value d exist, directly entering the step 3.4 to construct a multi-element group for the scene. And if the Euclidean distances from all the feature vectors to the central point are smaller than the threshold value d, continuing to judge in the next step.

And 3.3, calculating Euclidean distances from all the feature vectors which do not belong to the scene to the central point, and if the feature vectors smaller than a threshold value d exist, entering the next step and constructing a multi-element group for the scene. Otherwise, directly entering the step 3.5 and judging whether the scene is traversed.

And 3.4, constructing a multi-tuple for the scene. And taking A picture features belonging to the scene and B picture features not belonging to the scene, wherein the A picture features and the B picture features are from different scenes. The multivariate group can be represented as:

and B is the number of the feature vectors which do not belong to the ith scene.

And 3.5, judging whether the scene is traversed or not, and starting to perform reverse propagation if the scene is traversed. Otherwise, returning to the step 3.1, and taking the picture characteristics of the next scene.

And fourthly, after the multi-element groups of all scenes are constructed, optimizing the CNN model according to the distance relation among the multi-element groups. In the optimization process, each time 5 scene multi-tuple pictures are input, the distance loss function of the p scene multi-tuple picture is defined as follows:

wherein the content of the first and second substances,

is the feature vector center point of the p-th scene,

is the feature vector in the p-th scene that is farthest from the center point,

is the feature vector of the non-p-th scene closest to the center point,

is different from random selection in the tuples

Is not the p-th scene. α is a predefined constant parameter, representing

To

A distance between

To

To

A distance between

To

The minimum boundary of the distance between.

And step five, if no proper multi-element group is constructed in the step four, directly finishing the training and entering a testing stage.

And step six, using the optimized CNN model to perform actual robot loop detection application, wherein the specific implementation process is as follows:

and 6.1, simultaneously inputting continuous k frames of pictures into the optimized k shared CNN models, wherein k is A + B, and extracting picture characteristics of each picture.

And 6.2, performing similarity measurement on the extracted feature vector of each frame of picture and the feature vector in the historical picture features.

And 6.3, assuming that similarity measurement is carried out on the i-frame picture, selecting a T-frame picture closest to the (i-1) frame picture, taking the T-frame picture as a center, and selecting Num frame pictures before and after the T-frame picture as reference candidate frames, so that at most T (Num +1) reference candidate frames can be obtained. And comparing Euclidean distances between the picture of the ith frame and each reference candidate frame, and if the Euclidean distances are smaller than a threshold value theta, taking the reference candidate frame as a loop to be determined of the picture of the ith frame. Thereby reducing the range of similarity metrics.

6.4, in order to avoid detecting false loops as much as possible, the following optimization strategies are adopted: once a loop is detected between two frames, they are first classified as pending loops. The robot has continuity in the process of traveling, whether a loop exists between the two frames before and after the two frames can be judged, and if the loop is formed between the two frames before and after, the two frames are determined as loop frames.

Namely: and calculating the Euclidean distance between the feature vector extracted from the Mth frame and the feature vector extracted from the Nth frame, and if the distance is smaller than a threshold value theta, marking the 1 st result as a loop. Respectively calculating Euclidean distances between the feature vectors extracted from two frames before and after the Mth frame and the feature vectors extracted from two frames before and after the Nth frame, and recording as the Wth result (wherein W is more than 1). And if the loop exists in the W results, recording the Mth frame and the Nth frame as a loop group, and otherwise, recording the Mth frame and the Nth frame as a non-loop group.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A visual SLAM loop detection method based on distance metric learning is characterized by comprising the following steps:

firstly, pre-training a CNN model, and optimizing by using a training set;

step six, using the optimized CNN model to perform actual robot loop detection application;

the third step specifically comprises:

wherein n is the number of the feature vectors;

the feature vector center point of the ith scene

Comprises the following steps:

wherein a is the a-th feature vector belonging to the i-th scene;

3.5, judging whether the scene is traversed or not, if so, starting to perform reverse propagation, otherwise, returning to the step 3.1, and taking the picture characteristics of the next scene;

in step four, in the optimization process, the distance loss function L of the multi-tuple pictures of 5 scenes and the multi-tuple picture of the p-th scene are input each time_{multi-constraintunit}(O) is calculated as follows:

wherein the content of the first and second substances,

is the feature vector center point of the p-th scene,

is the feature vector in the p-th scene that is farthest from the center point,

is the feature vector of the non-p-th scene closest to the center point,

is different from random selection in the tuples

Is a predefined constant parameter, representing a

To

A distance between

To

To

A distance between

To

The minimum boundary of the distance between.

2. The visual SLAM loop detection method based on distance metric learning of claim 1, characterized in that: in the first step, the training set is a NewCollege, City Center and TUM data set; the New college and City Center datasets are collected by two cameras on the left and right sides of the mobile robot and provide GPS information and real loopback picture markers; the TUM data set is an indoor data set collected by a depth camera and provides true trajectory information.

3. The visual SLAM loop detection method based on distance metric learning of claim 1, characterized in that: in the second step, the pre-training model is VGG-16, and the picture features are extracted by using the first full connection layer of the VGG-16.

4. The visual SLAM loop detection method based on distance metric learning of claim 1, wherein the sixth step specifically comprises:

5. The method as claimed in claim 4, wherein the method for determining whether a loop exists between two frames before and after the two frames is as follows: