CN111626417B

CN111626417B - Closed loop detection method based on unsupervised deep learning

Info

Publication number: CN111626417B
Application number: CN202010360548.2A
Authority: CN
Inventors: 石朝侠; 汪丹
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2022-10-28
Anticipated expiration: 2040-04-30
Also published as: CN111626417A

Abstract

The invention discloses a closed loop detection method based on unsupervised deep learning, which utilizes the theory that very rich semantic information is usually embedded in the last several layers of convolutional layers of a convolutional neural network, directly identifies an interested region by deep layers of convolutional layers of the convolutional neural network to generate landmarks, and then extracts convolution characteristics from each landmark to generate the final expression of an image. The novel mode has appearance invariance and viewpoint invariance, and can detect whether the current position is the position which the robot has arrived at under the condition of extreme change, thereby eliminating the accumulated error of the robot in the simultaneous positioning and mapping and the relocation after the tracking is lost. The method can be applied to the field of mobile robots, such as unmanned vehicles, unmanned planes, virtual reality and augmented reality, the positioning capability of the unmanned vehicles, the unmanned planes, the virtual reality and the augmented reality is improved, and meanwhile, a globally consistent map is constructed.

Description

Closed loop detection method based on unsupervised deep learning

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a closed loop detection method based on unsupervised deep learning.

Background

The mobile robot technology is a frontier field with wide application and great prospect in the world at present. The system integrates theoretical research results of a plurality of subjects such as artificial intelligence, sensor technology, signal processing, automatic control engineering, computer technology, industrial design and the like, is widely applied to various industries such as industry, agriculture, service industry, medical treatment, national defense and the like, can assist or replace the work of human beings, and is particularly important for application research in occasions such as space and underwater exploration and the like under the condition that the human beings cannot reach or are in danger. Therefore, the mobile robot technology is generally concerned and invested in a large amount in all countries of the world, and also becomes an important index for measuring the national scientific research technology. For a mobile robot, its autonomous navigation, path planning, relies on good positioning and mapping. Loop detection is a method for eliminating accumulated errors caused in the long-term motion process of the robot. The method has the key idea that the current position is identified to be the position where the robot has arrived, the track of the robot is pulled back to the correct position, and the robot can be repositioned under the condition that the camera is lost, so that more accurate positioning is realized, and a globally consistent map is constructed.

There are two significant challenges in the closed loop detection algorithm: 1) Appearance changes due to weather, shading, and dynamic objects; 2) A change in viewpoint due to a camera photographing position or the like. The current mainstream methods are as follows: (1) Generating an image representation by using the characteristics of local artificial design characteristics extracted from the image, and then accelerating image descriptor matching through a bag-of-words model; (2) And (4) directly extracting globally considered design features of the image and then directly matching.

Method (1) is robust to viewpoint changes, but is not suitable for handling appearance changes. Method (2) performs well in environmental changes, but they do not perform well when viewpoints and occlusions are present in the environment. And neither approach provides satisfactory performance in the presence of variations in combinations of lighting, occlusion, viewpoint, and other factors

Disclosure of Invention

The invention aims to provide a closed loop detection method based on unsupervised deep learning.

The technical scheme adopted by the invention is as follows: a closed loop detection method based on unsupervised deep learning comprises the following steps:

1) Inputting the scene query frame and the scene database image into a pre-trained vgg-16 convolutional neural network, and directly identifying an interested region from a convolutional layer of the convolutional neural network;

2) For each scene query frame and scene database image, generating 100 landmarks with the identified regions of interest;

3) Extracting a convolution feature descriptor from each landmark generated from the image by using an unsupervised deep neural network to obtain a corresponding feature vector;

4) Cross-matching landmark regions of the two frames by calculating cosine distances between landmark vectors of the scene query frames and landmark vectors of each scene database image, and reserving landmarks which are matched with each other;

5) And calculating the overall similarity between the scene query frame and each scene database image according to the matched landmark pairs to determine whether a scene similar to the scene query frame exists in the scene database, thereby judging whether a loop appears.

In addition, the closed-loop detection method based on unsupervised deep learning according to the above embodiment of the present invention may further have the following additional technical features:

further, in an embodiment of the present invention, according to the input requirement of vgg-16 network, all images are adjusted to 224 × 224 size as the input of vgg-16 network, the deep convolutional layer of the convolutional neural network is used to obtain the feature map corresponding to the image, and then all non-zero activation values and its surrounding 8 adjacent activation values are respectively gathered into a type as the region of interest of the identified image.

Further, in one embodiment of the invention, each cluster C is calculated _i Energy value E of _i ：

Wherein, | c _i | represents the size of the ith cluster,

is represented by C _i The jth activation value of (a).

The first 100 clusters with the largest energy values are then selected as landmarks for the image generation.

Further, in one embodiment of the invention, an unsupervised deep neural network pair is utilized to extract the convolution feature descriptors for each landmark. The unsupervised deep neural network is specially designed for closed-loop detection tasks and aims to train the ability of the network to learn and extract HOG features. When training is finished, the network has the capability of learning and reconstructing HOG characteristics, only three convolution layers and corresponding pooling layers are reserved, and all network layers except the three convolution layers are discarded to extract convolution characteristics of the image.

Further, in one embodiment of the present invention, all landmarks extracted from both images are cross-matched. Scene query frame I using cosine distance metric _q A landmark u and each image in the scene database frame

Similarity between one landmark v:

in the formula (d) _u,v I.e. the cosine distances of u and v. Wherein

Respectively represent a pair I _q Landmark u and

the landmark v in (1) extracts the convolved feature vector, | represents the length of the vector.

I is then determined using a simple linear search _q And

similarity between all landmarks and applying cross-checking to accept only landmarks that match each other. For each matching landmark pair (u, v), its weight is determined according to their region size, and the weight is denoted as W _u,v ：

Wherein h is _u ，h _v ，w _u ，w _v Height and width, | h, of (u, v) regions, respectively _u -h _v I and | w _u -w _v | represents the absolute value of the difference between the height and the width of the two regions, respectively

Final I _q And

global similarity score

Comprises the following steps:

further, in one embodiment of the present invention, image I is queried for each frame _q Traverse and calculate it and all images in the database

Wherein the image with the highest score is I _q The best matching:

z is represented by _q The reference frame with the highest similarity score. Thereby obtaining a scene similar to the scene query frame in the scene database.

Compared with the prior art, the invention has the following remarkable advantages:

(1) The method disclosed by the invention combines the deep learning technology and the closed-loop detection by utilizing the successful application of the deep learning in the field of scene identification, thereby greatly improving the closed-loop detection capability of the mobile robot in the environment with extreme appearance change and viewpoint change.

(2) The invention utilizes the last several layers of convolution layers of the convolution neural network to embed very rich semantic information which corresponds to some image areas meaningful for closed loop detection tasks and can directly generate landmark representation of the image.

(3) The invention extracts landmark feature descriptors by using an unsupervised deep neural network model specially designed for closed-loop detection, and the convolution features are lighter and more compact than those extracted from a general neural network.

Drawings

FIG. 1 is an overall block diagram of the method of the present invention.

Detailed Description

With the wide application of the mobile robot, the positioning and mapping capability of the mobile robot is an important factor limiting the application scenario of the mobile robot. And a good closed-loop detection algorithm can greatly improve the positioning and mapping capability of the mobile robot in an unknown environment. In order to overcome the defects in the prior art, the invention provides a closed-loop detection method based on unsupervised deep learning.

The invention is further described in the following with reference to the drawings.

The specific steps of the present invention are further described in detail with reference to fig. 1, and the present implementation takes a data set, campusLoop, as an example to illustrate the closed loop detection process.

Step 1, a CamputLoop dataset is used as input of the method.

The CamputLoop dataset is read and reset to 224 × 224, a sequence of 100 images shot in clear weather in the dataset is used as a scene query set, and another sequence of 100 images shot in snow weather is used as a scene database set. Respectively input into a pre-trained vgg-16 convolutional neural network

And 2, generating 100 landmarks for each image frame.

For each frame image, firstly, a specific convolution layer of the vgg-16 convolution neural network is utilized to obtain corresponding feature mapping, and then all non-zero activation values and 8 adjacent activation values around the activation values are respectively gathered into a class which is marked as C _i (i∈{1,2,…,N})，N(N>= 100) represents the number of clusters in one image. Each cluster C _i Energy value E of _i Can be calculated as:

wherein | c _i | represents the size of the ith cluster,

is represented by C _i The jth activation value of (b). After the energy values of N clusters are obtained, 100 clusters with the largest energy value are taken as detected landmarks and are recorded as: l is _s ,s∈[1,2,…,100]

And 3, training the capability of unsupervised deep neural network learning and HOG feature extraction.

In this network, X represents the dimension of the HOG feature,

representing the dimensions of the reconstructed feature descriptors. In the self-coding model, linear rectification function (ReLU) activation is used for three convolutional layers, and sigmoid activation is used for a fully-connected layer so that the network can better reconstruct the HOG features. When training is finished, the network has the ability of learning and reconstructing HOG features, only three convolution layers and corresponding pooling layers are reserved, and all the network layers except the convolution layers are discarded to extract convolution features of the image. Furthermore, since the dimensions of the HOG features extracted for the same size input are the same, it is possible to utilize the euclidean distance as a distance metric for the HOG descriptor, utilizing l at the loss level ₂ By comparison of X with its reconstruction by a loss function

The size of (2):

and 4, extracting convolution characteristics from each landmark.

For each landmark detected, the convolutional feature descriptors are extracted using the above trained unsupervised convolutional autoencoder network, which is specifically designed for closed-loop detection tasks. The network is fast and reliable, can realize real-time detection of closed loops without reducing the dimensionality of the extracted convolution features, and can replace a general neural network in a closed loop detection system based on the convolution features. Furthermore, the network does not require context-specific training. For any one image, the total feature dimension is 106400.

And 5, cross-matching the landmarks between frames.

By computing scene query frame I _q Landmark vectors and each database image

Cross-matching the landmark regions of two frames by cosine distance between landmark vectors, I _q A landmark u (u e i) and

the similarity between one landmark v (v ∈ i), i.e. the cosine distance, is:

d _u,v i.e. the cosine distances of u and v. Wherein

Respectively represent a pair I _q Landmark u and

the landmark v extracted convolved feature vector of (1), ii · |, represents the length of the vector.

Determining I using a simple linear search _q And

and matching all the landmarks, and keeping the landmarks which are matched with each other.

And 6, generating a final representation of the image to perform image retrieval.

For each matching landmark pair (u, v), its weight is determined according to their region size, and the weight is denoted as W _u,v ：

Wherein h is _u ，h _v ，w _u ，w _v Height and width, | h, of (u, v) region, respectively _u -h _v I and | w _u -w _v | represents the absolute value of the difference between the height and the width of the two regions, respectively

Finally, I _q And

global similarity score

Comprises the following steps:

query image I for each frame _q Traverse and calculate it and all images in the database

Wherein the image with the highest score is I _q The best matching:

z is represented by _q The reference frame with the highest similarity score.

And 7, judging whether a loop appears or not.

And 6, judging whether a loop is detected or not by combining the real scene relation corresponding to the data set according to the retrieved result in the step 6.

Claims

1. A closed loop detection method based on unsupervised deep learning is characterized by comprising the following steps: 1) Inputting the scene query frame and the scene database image into a pre-trained vgg-16 convolutional neural network, and directly identifying an interested region from a convolutional layer of the convolutional neural network; 2) Generating a landmark for each scene query frame and scene database image by using the identified region of interest; 3) Extracting a convolution feature descriptor from each landmark generated from the image by using an unsupervised deep neural network to obtain a corresponding feature vector; 4) Calculating cosine distance between landmark vectors of scene query frames and landmark vectors of each scene database image to cross-match landmark areas of the two frames and keep landmarks which are matched with each other; 5) Calculating the overall similarity between the scene query frame and each scene database image according to the matched landmark pairs to obtain the optimal matching, and judging whether the closed loop can be correctly detected according to the corresponding relation of the real scene;

the unsupervised deep neural network in the step 3) is used for training the ability of the network to learn and extract the HOG characteristics; when training is finished, the network has the capability of learning and reconstructing HOG characteristics, only three convolution layers and corresponding pooling layers are reserved, and all network layers except the three convolution layers are discarded to extract convolution characteristics of the image;

in the step 4), all landmarks extracted from the two images are matched in a cross mode, and a scene query frame I is measured by using cosine distance _q A landmark u and each image in the scene database frame

Similarity between one landmark v:

in the formula (d) _u,v I.e. the cosine distance of u and v, where

Respectively represent a pair I _q Landmark u and

the landmark v of (1) represents a convolution feature vector of (II · |) the vectorA length;

determining I using a linear search _q And

matches between all landmarks and applying cross-checking to accept only landmarks that match each other; for each matching landmark pair (u, v), its weight is determined according to their region size, and the weight is denoted as W _u,v ：

In the formula, h _u ，h _v ，w _u ，w _v Height and width, | h, of (u, v) region, respectively _u -h _v I and | w _u -w _v And | represents the absolute value of the difference between the high and wide values of the two regions, respectively.

2. The closed-loop detection method based on unsupervised deep learning of claim 1, characterized in that: in step 1, according to the input requirement of vgg-16 network, all images are adjusted to 224 × 224 size as the input of vgg-16 network, the deep convolutional layer of the convolutional neural network is used to obtain the feature mapping corresponding to the image, and then all nonzero activation values and 8 adjacent activation values around the activation values are respectively gathered into a type of region of interest as the identified image.

3. The closed-loop detection method based on unsupervised deep learning of claim 2, characterized in that: in step 2, each cluster C is calculated _i Energy value E of _i ：

Wherein, | c _i | represents the size of the ith cluster,

to representC _i The jth activation value of (a).

4. The unsupervised deep learning-based closed-loop detection method according to claim 3, characterized in that: in step 2, the first 100 clusters with the largest energy value are selected as landmarks for current image generation.

5. The unsupervised deep learning-based closed-loop detection method of claim 1, wherein final I _q And

global similarity score

Comprises the following steps:

6. the closed-loop detection method based on unsupervised deep learning of claim 1, characterized in that: in step 5, the image I is queried for each frame _q Traverse and calculate it and all images in the database

Wherein the image with the highest score is I _q The best matching of (2):

z is represented by _q The reference frame with the highest similarity score, thereby obtaining a scene similar to the scene query frame in the scene database.