CN116958057A - Strategy-guided visual loop detection method - Google Patents

Strategy-guided visual loop detection method

Info

Publication number
CN116958057A
CN116958057A (application CN202310759867.4A)
Authority
CN
China
Prior art keywords
frame
loop
candidate
picture
frames
Prior art date
Legal status
Pending
Application number
CN202310759867.4A
Other languages
Chinese (zh)
Inventor
梁玮
李佳鑫
邸慧军
Current Assignee
Yangtze River Delta Research Institute Of Beijing University Of Technology Jiaxing
Original Assignee
Yangtze River Delta Research Institute Of Beijing University Of Technology Jiaxing
Priority date
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute Of Beijing University Of Technology Jiaxing filed Critical Yangtze River Delta Research Institute Of Beijing University Of Technology Jiaxing
Priority to CN202310759867.4A
Publication of CN116958057A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/776 Validation; Performance evaluation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a strategy-guided visual loop detection method. The method first exploits the temporal correlation of the scene data stream to learn the distribution of loop frames in a specific scene, so that, given the current frame and the history frames, the most similar loop candidate frames can be obtained directly, avoiding tedious frame-by-frame comparison. A refinement strategy for the loop candidate frames is then introduced: the credibility of the selected candidate frames is computed, and the history frame with the highest probability is selected as the final loop candidate frame. The application thus provides a new method for loop detection in visual simultaneous localization and mapping: it trains a policy for selecting loop candidate frames that reacts to the current frame and the already selected frames and picks the most similar frame as the next loop candidate. The method can find the loop frame of the current frame during visual simultaneous localization and mapping and, combined with an optimization algorithm, reduces accumulated error, making localization and mapping more accurate.

Description

Strategy-guided visual loop detection method
Technical Field
The application belongs to the technical field of robot perception and localization, and in particular relates to a strategy-guided visual loop detection method.
Background
Visual loop detection is an indispensable module in Visual Simultaneous Localization and Mapping (VSLAM). VSLAM is a technique that uses camera perception data to estimate camera motion and build a map at the same time. As a simultaneous localization and mapping technique based on visual sensors, it enables autonomous navigation and localization in unknown environments. In VSLAM, the camera acquires perception data by capturing images of the surrounding environment; image processing and computer vision techniques then extract useful information such as feature points and edges and track it across previous images to determine the camera's motion. At the same time, a map is built, which may be a point cloud map, a topological map, or another form of map. The map can represent the camera's past trajectory, can be used to estimate the camera's motion, and is updated by matching previously detected feature points with new observation data.
The general VSLAM framework is divided into two parts, a front end and a back end. The front end is mainly responsible for tracking and mapping. It relies chiefly on computer vision techniques: feature points such as ORB or SIFT are extracted to represent the current video frame, and the feature points are then registered using their descriptors to achieve feature tracking. Meanwhile, multi-view geometry is used to compute the three-dimensional coordinates of the feature points and the pose of the current camera, completing feature tracking and mapping. Feature-point-based VSLAM extracts sparse feature points, so the map it builds is also sparse. In addition, direct VSLAM methods can use pixel-level information as the basis for tracking between consecutive frames and recover a point cloud representation of the scene, yielding a denser map.
The VSLAM front end solves for the camera pose and the three-dimensional coordinates of the feature points from sensor data, but the sensor input is noisy, so the pose and coordinates obtained by the front end are noisy too. As the camera moves, the noise in each result accumulates, so the estimated camera pose drifts further and further from the true value and the built map diverges further and further from the true map; this error is called accumulated error. The role of the back end is to reduce the effect of accumulated error using loop detection. When the robot revisits a previously explored position, its pose should be nearly identical in the global coordinate system, but because of accumulated error the newly estimated pose differs from the earlier one; techniques such as graph optimization can distribute the accumulated error across the poses between the two frames, making the built map more accurate. Finding the correct loop frame of the current frame through loop detection is therefore of great importance. At present, the rough framework of loop detection has two stages: the first stage finds loop candidate frames similar to the current frame among the history frames, and the second stage screens more accurate loop frames from those candidates. Current methods differ mainly in the first stage; in the second stage, geometric verification is mainly performed with local feature points. Screening methods for loop candidate frames fall into three classes: methods based on traditional feature points, methods based on deep learning, and methods based on semantics. Methods based on traditional feature points can use a bag-of-words model to obtain a bag-of-words vector of a picture as its global feature and, according to this global feature, find the most similar history frames as loop candidate frames. Methods such as ORB-SLAM2 and FAB-MAP 2.0 use visual bag-of-words models for loop candidate detection. A bag-of-words model typically performs unsupervised learning on pictures of many categories in advance to obtain a visual dictionary, from which a visual word histogram serving as a global feature can be created. The advantage of such offline training is that it works in many scenes, but it can fail in certain special scenes; some methods therefore train the dictionary online, so that the dictionaries obtained in different scenes differ and the results are more accurate. Deep-learning-based methods use a neural network to extract global or local features as the representation of the current frame and search the history frames for loop candidates. Liu et al. propose a method that extracts both local and global features for loop candidate detection. In addition, semantic information exists in most scenes and can be used to improve the model's robustness to illumination, dynamic objects, and viewpoint changes.
Li et al created a semantic map based on ORB-SLAM's framework and trained a network that could be used to infer the orientation of object blocks of a picture, which model could detect loops in a picture where two object angles were 125 degrees changed.
The main purpose of loop detection is to find a reliable loop frame and then, by solving an optimization problem, correct the camera poses and the positions of the related three-dimensional map points between the two loop frames, reducing the accumulated error. Existing loop detection methods mainly comprise two stages: the first stage screens reliable candidate frames from the history frames, and the second stage screens more accurate loop frames from those candidates using feature point information. The loop candidate frames selected in the first stage are therefore important; if the selection criterion is poor, the search space of the second stage grows and computational resources are wasted.
During loop candidate detection, methods based on neural networks or bag-of-words models can screen reliable loop candidate frames according to the global features of the current frame and the history frames. But among a large number of history frames, typically only a small fraction are reliable loop candidates, which means a large number of invalid comparisons are made to obtain them, and none of these methods exploits the timing information in the history sequence.
Disclosure of Invention
In order to solve the above problems, the present application provides a strategy-guided visual loop detection method that uses temporal context information to predict the likely positions of loop candidates, thereby avoiding inefficient frame-by-frame comparison.
A strategy-guided visual loop detection method sequentially takes each frame picture as a current frame picture to execute loop frame acquisition operation according to time sequence, and loop frames corresponding to each frame picture are obtained, wherein the loop frame acquisition operation is as follows:
s1: inputting the current frame picture and the historical frame picture before the current frame picture into a feature expression network to obtain feature vectors of the current frame picture and the historical frame picture;
s2: the method comprises the steps that pictures which are not selected to serve as loop candidate frames in historical frame pictures form a candidate set, the pictures which are selected to serve as loop candidate frames form a selected set, feature vectors corresponding to the candidate set, feature vectors corresponding to current frame pictures and feature vectors corresponding to the selected set are input into a strategy network, probability distribution that each historical frame picture in the candidate set is selected to serve as the loop candidate frame is obtained, and the historical frame picture with the largest probability is selected to serve as the loop candidate frame of the current frame picture;
s3: removing the picture currently selected as the loop candidate frame from the candidate set, adding the picture into the selected set, completing updating of the candidate set and the selected set, and repeating the step S2 by adopting the updated candidate set and the selected set until the loop candidate frames with the set number are obtained;
s4: perform a geometric check and a time consistency check on all the loop candidate frames, and select, from the loop candidate frames passing the checks, the frame with the highest inlier ratio as the loop frame of the current frame picture.
Further, the method for acquiring the characteristic expression network comprises the following steps:
s11: any two-by-two combination is carried out on all frame pictures in the training set, so that all possible picture pairs are obtained, and each picture pair is provided with a similar label or a dissimilar label;
s12: sequentially inputting the photos in each picture pair into a feature expression network to obtain feature vector pairs corresponding to each picture pair;
s13: respectively calculating the similarity between two feature vectors in each feature vector pair, acquiring a loss function of a feature expression network according to the similarity corresponding to each picture and a label, judging whether the loss function is smaller than a set value, if so, determining that the feature expression network under the current network parameters is a final feature expression network, and if not, updating the network parameters of the feature expression network based on the back propagation of the loss function;
s14: and repeating the steps S12 to S13 by adopting the updated characteristic expression network until a final characteristic expression network is obtained.
Further, the loss function $L$ is:

$$L = y \cdot \max(0,\; m - d) + (1 - y) \cdot \max(0,\; d + m)$$

where $y$ denotes the label of a picture pair: $y = 1$ for a similar label and $y = 0$ for a dissimilar label. The method for judging whether any picture pair is marked as a similar label or a dissimilar label is: if the difference between the frame numbers corresponding to the two pictures in the pair is no more than 20, the pair is marked with a similar label; otherwise it is marked with a dissimilar label. $d$ denotes the cosine similarity of the feature vectors of the two pictures in the pair, and $m$ denotes a set constant greater than 0.
Further, the method for acquiring the policy network is as follows: the feature expression network is used to obtain the feature vectors of all frame pictures in the training set, and each feature vector is taken in turn as the current frame feature vector to perform the following operations:
s21: inputting the historical frame feature vector before the current frame feature vector, the current frame feature vector and the feature vector which is selected as a loop candidate frame into a strategy network to obtain probability distribution that each historical frame feature vector is selected as the most similar feature vector, and randomly selecting a frame feature vector as a candidate frame feature vector based on the probability distribution;
s22: obtain the reward value $r_t$ corresponding to the candidate frame feature vector:

$$r_t = \frac{\mathrm{sim}(f_q, f_{a_t})}{t}$$

where $t$ denotes the step at which the candidate frame feature vector corresponding to the current frame feature vector is found, $f_q$ denotes the current frame feature vector, $f_{a_t}$ denotes the candidate frame feature vector, and $\mathrm{sim}(f_q, f_{a_t})$ denotes the similarity between the current frame feature vector and the candidate frame feature vector, computed as:

$$\mathrm{sim}(f_q, f_{a_t}) = \alpha \cos(f_q, f_{a_t}) + (1 - \alpha)\, N_m(f_q, f_{a_t})$$

where $\alpha$ denotes a set weight, $\cos(f_q, f_{a_t})$ denotes the cosine value of the current frame feature vector and the candidate frame feature vector, and $N_m(f_q, f_{a_t})$ denotes the number of feature point matches between the current frame feature vector and the candidate frame feature vector;
s23: record the reward value $r_t$, update the selected set and the candidate set, and repeat S21-S22 until N candidate frame feature vectors are selected; then compute the expected reward $J(w)$ and judge whether $J(w)$ is greater than a set value. If so, the policy network under the current network parameters is the final policy network; if not, update the network parameters of the policy network by gradient ascent on the expected reward $J(w)$;

s24: repeat steps S21-S23 with the updated policy network until the expected reward $J(w)$ is greater than the set value, obtaining the final policy network.
Further, the probability distribution $\pi_w(a_t \mid s_t)$ with which each history frame feature vector is selected as the most similar feature vector in step S21 is computed as:

$$\pi_w(a_t \mid s_t) = \frac{\exp\big(\phi(s_t, a_t)^T w\big)}{\sum_{a' \in A(s_t)} \exp\big(\phi(s_t, a')^T w\big)}$$

where $a_t$ denotes the action of selecting a candidate frame feature vector from the history frame feature vectors, $s_t$ denotes the current state, which comprises the current frame feature vector, the history frame feature vectors, and the selection status of the candidate frame feature vectors, $\phi(s_t, a)$ denotes the feature of the state-action pair, $A(s_t)$ denotes the action space of the current state, composed of the feature vectors in the history frame feature vectors not yet selected as loop candidate frames, $w$ is the network parameter of the policy network, and $T$ denotes transposition.
Further, before performing the geometric check and time consistency check on all loop candidate frames, the credibility of all loop candidate frames is first judged, and the geometric check and time consistency check are performed only on loop candidate frames whose credibility is greater than a set value, where the credibility of each loop candidate frame is determined as follows:

arrange the loop candidate frames on the original picture frame sequence according to their frame numbers; centered on the current loop candidate frame, expand forward and backward by a set number of frames to obtain an expansion window of set frame length; and take the number of loop candidate frames, the current one plus any others, that fall within the expansion window as the credibility of the current loop candidate frame.
Further, before performing the geometric check and time consistency check on all loop candidate frames, all loop candidate frames are first expanded and merged into candidate segments, and the geometric check and time consistency check are then performed on the candidate segments, where the expansion and merging method is:

expand each loop candidate frame by W frames forward and backward to form a candidate segment of length 2W;

judge whether any candidate segments overlap in frames, and if so, merge the overlapping candidate segments.
Further, the set number of loop candidate frames in step S3 is 10.
The beneficial effects are that:
1. The application provides a strategy-guided visual loop detection method. The method first exploits the temporal correlation of the scene data stream to learn the distribution of loop frames in a specific scene, so that, given the current frame and the history frames, the most similar loop candidate frames can be obtained directly, avoiding tedious frame-by-frame comparison. A refinement strategy for the loop candidate frames is then introduced: the credibility of the selected candidate frames is computed, and the history frame with the highest probability is selected as the final loop candidate frame. The application thus provides a new method for loop detection in visual simultaneous localization and mapping: it trains a policy for selecting loop candidate frames that reacts to the current frame and the already selected frames and picks the most similar frame as the next loop candidate. The method can find the loop frame of the current frame during visual simultaneous localization and mapping and, combined with an optimization algorithm, reduces accumulated error, making localization and mapping more accurate.
2. The application provides a strategy-guided visual loop detection method in which the initially screened loop candidate frames are further screened according to their credibility, and only candidates whose credibility meets the requirement proceed to the next stage, which improves the accuracy of loop detection.
3. The application provides a strategy-guided visual loop detection method that expands the loop candidate frames into candidate segments and then performs geometric verification; the frames that pass the multi-layer verification serve as loop frames, and the loop frame with the highest inlier ratio among them is selected as the final loop frame. This further improves the recall rate and balances computational efficiency against algorithm effectiveness.
4. The application provides a strategy-guided visual loop detection method in which the set number of loop candidate frames is 10. This number ensures that enough loop candidate frames remain after geometric verification to compute the relative pose and eliminate accumulated error in visual simultaneous localization and mapping, while reducing the computation of the subsequent geometric verification and improving loop detection efficiency.
Drawings
FIG. 1 is a framework for implementing the present application;
FIG. 2 is a flow chart of a method of policy-directed visual loop detection provided by the present application;
FIG. 3 is a comparison of the results and true values of the feature expression network training of the present application;
FIG. 4 is a process of loop candidate frame screening according to the present application;
fig. 5 is a process of calculating reliability of a loop candidate frame and constructing a candidate segment according to the loop candidate frame according to the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.
The application is divided into two stages in total: the first stage is the screening of loop candidate frames, and the second stage is an image-level refinement operation. The first stage gives some reliable candidate frame regions according to the learned policy and screens reliable candidate frames within those regions. The second stage performs geometric verification with feature points on the candidate frames obtained in the first stage, computing the fundamental matrix and the inlier ratio to further screen the loop candidate frames, and at the same time computes the relative pose of the current frame and the loop frame as the target of subsequent optimization.
Firstly, the problem modeling of the policy-based loop candidate frame detection of the present application is as follows:
the traditional method for screening the loop candidate frames can evaluate the similarity degree of the current frame and the historical frame one by one, and then screen the historical frame meeting certain conditions according to the similarity degree to serve as the loop candidate frames. The conventional search method does not utilize the long-standing time continuity relationship in the scene and the distribution of loops in the scene. Therefore, the distribution of loops in a scene can be integrated into a screening strategy under a specific scene, and the application regards the process of screening loop candidate frames as a Markov decision process, wherein the Markov decision process comprises five parts, namely states, actions, state transfer functions, rewards and strategies. The process of screening the loop candidate frames can be regarded as given current state, and needs to select a current most reasonable action to execute according to a strategy, namely, continuously selecting a frame of loop candidate frame from historical frames, after one action is executed, converting the state, obtaining corresponding rewards, and then selecting the action to execute according to the strategy until a certain amount of loop candidate frames are obtained.
The state consists of four parts: the current frame, the history frames, the previous selections, and the last selected candidate frame. All states constitute the state space. An action selects one frame from the history frames as a candidate frame given the current frame; in a given state, a series of actions yields a set of loop candidate frames, as shown in Fig. 1. When an action is performed in a given state, the state changes according to the state transition function and a new state is obtained; in particular, the previously selected frames and the last selected frame change. The reward reflects the value of the current action and is related to the rank of the selected candidate frame and its similarity to the current frame.
The policy specifies how to select an action given the current state. The selection space of actions consists of the history frames not yet selected; the previously selected frames are masked out of the history frames. The policy is expressed as a probability distribution over the history frames, and the frame with the highest probability is finally selected as the action to execute.
It can be seen that the first stage of the application involves two parts. The first part is feature expression: a set of pictures is mapped to a vector space such that similar pictures have similar vector representations and dissimilar pictures have clearly different ones; the application adopts a convolutional neural network and obtains this feature expression through contrastive learning. The second part is the screening policy itself: the application uses reinforcement learning to train the screening policy for loop candidate frames. The main contributions of the method are the following. (a) A new visual loop detection method for specific scenes is proposed; (b) a selection policy for loop candidate frames is trained, which predicts the likely positions of loop candidates using temporal context information, avoiding inefficient frame-by-frame comparison; (c) a refinement strategy for the loop candidate frames is introduced, which screens the candidates using temporal-domain information and expands them somewhat to ensure sufficient recall.
Based on the above problem model, the present application provides a method for policy-guided visual loop detection, as shown in fig. 2, in which each frame of picture is sequentially executed in time sequence as a current frame of picture to perform loop frame acquisition operation, so as to obtain a loop frame corresponding to each frame of picture, where the loop frame acquisition operation is as follows:
s1: inputting the current frame picture and the historical frame picture before the current frame picture into a feature expression network to obtain feature vectors of the current frame picture and the historical frame picture;
s2: the method comprises the steps that pictures which are not selected to serve as loop candidate frames in historical frame pictures form a candidate set, the pictures which are selected to serve as loop candidate frames form a selected set, feature vectors corresponding to the candidate set, feature vectors corresponding to current frame pictures and feature vectors corresponding to the selected set are input into a strategy network, probability distribution that each historical frame picture in the candidate set is selected to serve as the loop candidate frame is obtained, and the historical frame picture with the largest probability is selected to serve as the loop candidate frame of the current frame picture;
s3: removing the picture currently selected as the loop candidate frame from the candidate set, adding the picture into the selected set, completing updating of the candidate set and the selected set, and repeating the step S2 by adopting the updated candidate set and the selected set until the loop candidate frames with the set number are obtained;
it should be noted that the set number is denoted N. The value of N affects the quality of the candidate frames and the time consumed by the subsequent geometric verification: if N is too small, the selected loop candidates may not be of high quality, and the loop frames remaining after geometric verification may be insufficient for computing the relative pose to eliminate accumulated error; conversely, if N is too large, the subsequent geometric verification becomes computationally heavy and inefficient. Experiments show that N = 10 achieves a good trade-off.
S4: perform a geometric check and a time consistency check on all the loop candidate frames, and select, from the loop candidate frames passing the checks, the frame with the highest inlier ratio as the loop frame of the current frame picture.
It should be noted that, after the N most similar loop candidates are selected, the application performs a candidate frame refinement operation to screen out part of the noisy data. Because the images of the whole explored environment are temporally continuous, the loops of the current frame tend to be concentrated within certain segments, and other loop candidates should exist around each loop candidate. The application therefore counts how many of the N loop candidates fall within a window of given width around each candidate and uses this count as the candidate's credibility; only candidates whose credibility meets the requirement proceed to the next stage. Meanwhile, after the N loop candidate frames are obtained, more accurate loop frames must be screened from them, and the relative pose must be computed to provide parameters for the optimization equation. A loop candidate screened by the learned policy has a feature expression similar to the current frame's, but the loop frame actually usable for computing the relative pose may not be among the selected N candidates. Moreover, a candidate that passes the credibility screening is likely to lie near the current frame's true revisit location, so a more accurate loop frame is likely to be in its vicinity. The application therefore expands the retained loop candidate frames forward and backward into candidate segments and then searches for the loop frame within each segment.
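Putting steps S1 to S4 and the refinement together, the per-frame detection flow can be sketched as follows. This is a minimal illustrative sketch: the names feature_net, policy_net, refine_candidates, and geometric_check are stand-ins of ours for the components described in this application, not names the application itself uses.

```python
# Minimal sketch of the per-frame loop detection flow (S1-S4 plus refinement).
# feature_net, policy_net, refine_candidates, and geometric_check are assumed
# stand-ins for the components described in this application.

def detect_loop_frame(feature_net, policy_net, frames, cur_idx, n_candidates=10):
    # S1: encode the current frame and its history frames into feature vectors.
    feats = [feature_net(f) for f in frames[:cur_idx + 1]]
    f_q, history = feats[cur_idx], feats[:cur_idx]

    # S2/S3: let the policy repeatedly pick the most probable history frame,
    # moving it from the candidate set into the selected set.
    candidates, selected = list(range(len(history))), []
    while candidates and len(selected) < n_candidates:
        # probs[i] is the probability of picking candidates[i].
        probs = policy_net(f_q, history, candidates, selected)
        best = candidates[max(range(len(probs)), key=probs.__getitem__)]
        candidates.remove(best)
        selected.append(best)

    # S4 with refinement: keep credible candidates, expand and merge them into
    # segments, geometrically verify each segment frame against the current
    # frame, and return the frame with the highest inlier ratio.
    best_frame, best_ratio = None, 0.0
    for start, end in refine_candidates(selected):
        for idx in range(start, end + 1):
            ok, ratio = geometric_check(frames[cur_idx], frames[idx])
            if ok and ratio > best_ratio:
                best_frame, best_ratio = idx, ratio
    return best_frame
```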
Based on the above, before the geometric check and time consistency check are performed on all loop candidate frames, the credibility of all loop candidate frames is first judged; the loop candidate frames whose credibility is greater than a set value are then expanded and merged into candidate segments, and the geometric check and time consistency check are performed on the candidate segments. The credibility of each loop candidate frame is determined as follows:
arrange the loop candidate frames on the original picture frame sequence according to their frame numbers; centered on the current loop candidate frame, expand forward and backward by a set number of frames to obtain an expansion window of set frame length; and take the number of loop candidate frames, the current one plus any others, that fall within the expansion window as the credibility of the current loop candidate frame. Experiments show that a window width of 30 and a minimum credibility requirement of 2 give good results.
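As a concrete illustration, the credibility count over a centered window might be computed as in the sketch below, using the settings stated above (window width 30, minimum credibility 2); the function names are ours.

```python
def credibility(candidate_ids, window=30):
    # For each loop candidate frame number, count how many of the N candidates
    # (itself included) fall inside a window of `window` frames centered on it.
    half = window // 2
    return {c: sum(1 for o in candidate_ids if abs(o - c) <= half)
            for c in candidate_ids}

def filter_credible(candidate_ids, window=30, min_cred=2):
    # Keep only candidates whose credibility meets the minimum requirement.
    cred = credibility(candidate_ids, window)
    return [c for c in candidate_ids if cred[c] >= min_cred]
```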
The method for expansion and combination comprises the following steps:
expanding W frames before and after each loop candidate frame to form a candidate segment with the length of 2W;
and judging whether each candidate segment has frame overlapping or not, and if so, merging the overlapped candidate segments.
That is, after obtaining the N loop candidate frames, each loop candidate frame is expanded by W frames forward and backward to form a candidate segment of length 2W, and two candidate segments that intersect are merged into one. As shown in Fig. 5, with W = 3 the five loop candidate frames form two candidate segments: the segments formed by expanding the first four loop candidate frames intersect and are merged into a single candidate segment.
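A sketch of this expansion and merging, under our reading of the description (segments stored as inclusive frame ranges; the frame numbers in the usage comment are illustrative):

```python
def expand_and_merge(candidate_ids, w, last_frame=None):
    # Expand each loop candidate frame by W frames on both sides into a
    # candidate segment, then merge any segments that overlap.
    segments = []
    for c in sorted(candidate_ids):
        start, end = c - w, c + w
        if last_frame is not None:
            start, end = max(0, start), min(last_frame, end)
        if segments and start <= segments[-1][1]:        # overlaps the previous segment
            segments[-1][1] = max(segments[-1][1], end)  # merge into it
        else:
            segments.append([start, end])
    return [tuple(s) for s in segments]

# Mirroring Fig. 5 with W = 3: four nearby candidates merge into one segment,
# the fifth stays separate.
# expand_and_merge([10, 12, 15, 17, 40], w=3) -> [(7, 20), (37, 43)]
```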
Once each candidate segment is determined, so is the search space of the geometric check. Each frame in a candidate segment undergoes a geometric check against the current frame to screen for more accurate loop frames. ORB feature points are extracted from each segment frame and the current frame; if fewer than 8 pairs match, the frame is discarded, otherwise the fundamental matrix is computed using RANSAC. Frames for which the fundamental matrix cannot be computed, or whose inlier ratio is below the threshold, are discarded. The remaining frames must also pass a time consistency check before they can be selected as loop frames.
Since the camera's data stream is continuous in time, the loop frame of the current frame is very likely near the loop frames of the surrounding frames. Considering that the New College and City Centre data sets interleave the left and right camera views, the frame currently being checked is considered to pass the time consistency check if the loop frame of the previous frame lies in its vicinity. Finally, among the frames that pass the geometric verification, the frame with the highest inlier ratio is selected as the loop frame of the current frame.
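A sketch of the per-frame geometric check using OpenCV, under the parameters the description states (at least 8 matched pairs, RANSAC for the fundamental matrix); the inlier-ratio threshold value here is an assumption, since the text only says frames below a threshold are discarded.

```python
import cv2
import numpy as np

def geometric_check(img_cur, img_cand, min_matches=8, min_inlier_ratio=0.3):
    # Extract and match ORB features between the current frame and a segment frame.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img_cur, None)
    kp2, des2 = orb.detectAndCompute(img_cand, None)
    if des1 is None or des2 is None:
        return False, 0.0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < min_matches:          # fewer than 8 pairs: discard the frame
        return False, 0.0

    # Estimate the fundamental matrix with RANSAC and measure the inlier ratio.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    if F is None or mask is None:           # estimation failed: discard the frame
        return False, 0.0
    inlier_ratio = float(mask.sum()) / len(matches)
    return inlier_ratio >= min_inlier_ratio, inlier_ratio
```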
The following describes a method for obtaining a feature expression network, as shown in fig. 4, specifically including the following steps:
s11: any two-by-two combination is carried out on all frame pictures in the training set, so that all possible picture pairs are obtained, and each picture pair is provided with a similar label or a dissimilar label;
it should be noted that both the feature expression network and the policy network require training data. The application selects the New College and City Centre data sets, which contain 1073 and 1273 photos respectively, taken by left and right cameras while touring a scene; the photos from the left and right cameras are interleaved, and each photo is numbered starting from 1. Each data set provides a ground-truth matrix, a two-dimensional binary square matrix in which a value of 0 indicates that the two frames indexed are not a loop and a value of 1 indicates that they are. The feature expression part of the application uses 30% of the data of the two data sets as the training set, and the policy decision part uses 50% of the data as the training set.
S12: sequentially inputting the photos in each picture pair into a feature expression network to obtain feature vector pairs corresponding to each picture pair;
it should be noted that the feature expression network of the application may use the classic ResNet50 as the backbone, followed by two fully connected layers that reduce the 2048-dimensional ResNet50 output to 128 dimensions. Feature expression is learned contrastively: a pair of photos with a similar or dissimilar label is input, the output is two 128-dimensional feature vectors, a loss function is computed from the label and the similarity of the two feature vectors, and the network parameters are updated by back propagation. In addition, since the training data of the feature expression network are photo pairs, two frames are considered close if their value in the ground-truth matrix is 1 or their numbers are close (differing by less than 20); the pair is then labeled similar, and otherwise dissimilar. Equal numbers of positive and negative samples are used for training.
S13: compute the similarity between the two feature vectors in each feature vector pair, obtain the loss function of the feature expression network from each pair's similarity and label, and judge whether the loss is smaller than a set value. If so, the feature expression network under the current network parameters is the final feature expression network; if not, update the network parameters of the feature expression network by back propagation of the loss. The loss function $L$ is:

$$L = y \cdot \max(0,\; m - d) + (1 - y) \cdot \max(0,\; d + m)$$

where $y$ denotes the label of a picture pair: $y = 1$ for a similar label and $y = 0$ for a dissimilar label. A picture pair is labeled similar if the difference between the frame numbers of its two pictures is no more than 20, and dissimilar otherwise. $d$ denotes the cosine similarity of the feature vectors of the two pictures in the pair, and $m$ denotes a set constant greater than 0 that bounds the cosine similarity: the smaller the loss, the more the cosine similarity of similar pictures exceeds $m$ and the more that of dissimilar pictures falls below $-m$. Through a series of experiments, the application finally chooses m = 0.79. As shown in Fig. 3, Fig. 3(b) shows the ground truth for the City Centre data set, where a white entry indicates that two frames form a loop, i.e. are similar, and Fig. 3(a) shows the cosine similarity of the trained picture feature representations on City Centre. The learned similarity closely matches the ground truth, indicating that the trained feature representation measures picture similarity well.
S14: and repeating the steps S12 to S13 by adopting the updated characteristic expression network until a final characteristic expression network is obtained.
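A minimal PyTorch sketch of the feature expression network and the pairwise loss as reconstructed above. The backbone and output sizes (2048 to 128) follow the description; the hidden width of the two fully connected layers is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureNet(nn.Module):
    # ResNet50 backbone plus two fully connected layers reducing 2048-d to 128-d.
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the pooled 2048-d features
        self.backbone = backbone
        self.head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                                  nn.Linear(512, 128))   # 512 is an assumed width

    def forward(self, x):
        return self.head(self.backbone(x))

def pair_loss(f1, f2, y, m=0.79):
    # Margin loss on cosine similarity: similar pairs (y = 1) are pushed above m,
    # dissimilar pairs (y = 0) below -m, matching the reconstructed loss above.
    d = nn.functional.cosine_similarity(f1, f2)
    return (y * torch.clamp(m - d, min=0)
            + (1 - y) * torch.clamp(d + m, min=0)).mean()
```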
After a good feature expression is obtained, the policy decision part needs to be trained. The application's method for acquiring the policy network, trained with reinforcement learning, is specifically as follows:

the feature expression network is used to obtain the feature vectors of all frame pictures in the training set, and each feature vector is taken in turn as the current frame feature vector to perform the following operations:
s21: input the history frame feature vectors preceding the current frame feature vector, the current frame feature vector, and the feature vectors already selected as loop candidate frames into the policy network to obtain the probability distribution with which each history frame feature vector is selected as the most similar feature vector, and randomly select one frame feature vector as the candidate frame feature vector based on this distribution. In step S21, the probability distribution $\pi_w(a_t \mid s_t)$ is computed as:

$$\pi_w(a_t \mid s_t) = \frac{\exp\big(\phi(s_t, a_t)^T w\big)}{\sum_{a' \in A(s_t)} \exp\big(\phi(s_t, a')^T w\big)}$$

where $a_t$ denotes the action of selecting a candidate frame feature vector from the history frame feature vectors, $s_t$ denotes the current state, comprising the current frame feature vector, the history frame feature vectors, and the selection status of the candidate frame feature vectors, $\phi(s_t, a)$ denotes the feature of the state-action pair, $A(s_t)$ denotes the action space of the current state, composed of the history frame feature vectors not yet selected as the most similar feature vectors, $w$ is the network parameter of the policy network, and $T$ denotes transposition.
It should be noted that the application represents the policy with three fully connected layers and updates their parameters with the REINFORCE algorithm, whose objective is to maximize the cumulative reward:

$$J(w) = \mathbb{E}_{\tau \sim \pi_w}[R(\tau)]$$

where $\tau$ is the sequence of actions selected in one round and $R(\tau)$ is the round's cumulative reward:

$$R(\tau) = \sum_{t=1}^{N} \gamma^{t-1} r_t$$

The gradient of $J(w)$ is thus:

$$\nabla_w J(w) = \mathbb{E}_{\tau \sim \pi_w}\Big[\sum_{t=1}^{N} G_t \, \nabla_w \log \pi_w(a_t \mid s_t)\Big]$$

where $G_t$ is the cumulative discounted reward from time t to the end of the round. The application updates the parameters using this gradient of the round's cumulative reward and the learning rate $\eta$:

$$w \leftarrow w + \eta\, G_t \, \nabla_w \log \pi_w(a_t \mid s_t)$$

A code sketch of this policy and update appears after step S24 below.
s22: obtain the reward value $r_t$ corresponding to the candidate frame feature vector:

$$r_t = \frac{\mathrm{sim}(f_q, f_{a_t})}{t}$$

where $t$ denotes the step at which the candidate frame feature vector corresponding to the current frame feature vector is found, $f_q$ denotes the current frame feature vector, $f_{a_t}$ denotes the candidate frame feature vector, and $\mathrm{sim}(f_q, f_{a_t})$ denotes the similarity between the current frame feature vector and the candidate frame feature vector. It can be seen that the earlier a similar frame is found, the larger the reward. The similarity is computed as:

$$\mathrm{sim}(f_q, f_{a_t}) = \alpha \cos(f_q, f_{a_t}) + (1 - \alpha)\, N_m(f_q, f_{a_t})$$

where $\alpha$ denotes a set weight, $\cos(f_q, f_{a_t})$ denotes the cosine value of the current frame feature vector and the candidate frame feature vector, and $N_m(f_q, f_{a_t})$ denotes the number of feature point matches between the two frames. When two frames are similar, their feature points match; the more feature point matches can be extracted, the more reliable the similarity measure between the two frames is;
s23: record the reward value $r_t$, update the selected set and the candidate set, and repeat S21-S22 until N candidate frame feature vectors are selected; then compute the expected reward $J(w)$ and judge whether $J(w)$ is greater than a set value. If so, the policy network under the current network parameters is the final policy network; if not, update the network parameters of the policy network by gradient ascent on the expected reward $J(w)$;

s24: repeat steps S21-S23 with the updated policy network until the expected reward $J(w)$ is greater than the set value, obtaining the final policy network.
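A minimal PyTorch sketch of the policy and its REINFORCE update, as referenced after the gradient derivation above. The three fully connected layers and the REINFORCE update follow the description; the layer widths, the way the state is packed into the score input, the reward weight, and the discount factor are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Three fully connected layers score each not-yet-selected history frame;
    # a softmax over the scores gives pi_w(a_t | s_t).
    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, f_q, cand_feats, selected_feats):
        # f_q: (dim,) current frame; cand_feats: (K, dim) unselected history
        # frames; selected_feats: (S, dim) already selected candidates.
        ctx = (selected_feats.mean(dim=0) if len(selected_feats) > 0
               else torch.zeros_like(f_q))
        x = torch.cat([f_q.expand(len(cand_feats), -1), cand_feats,
                       ctx.expand(len(cand_feats), -1)], dim=1)
        return torch.softmax(self.score(x).squeeze(-1), dim=0)

def reward(f_q, f_a, t, n_matches, alpha=0.5):
    # Reconstructed reward r_t = sim / t with
    # sim = alpha * cos + (1 - alpha) * n_matches (alpha = 0.5 is assumed).
    cos = nn.functional.cosine_similarity(f_q, f_a, dim=0)
    return (alpha * cos + (1 - alpha) * n_matches) / t

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    # One REINFORCE step: weight each log pi_w(a_t | s_t) by the discounted
    # return G_t from t onward and ascend the gradient (gamma is assumed).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```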
The application uses the trained networks to detect loop candidate frames online: at each step, the probability distribution formula gives the probability of each action, and the action with the highest probability is executed until the N most similar loop candidate frames are selected. It should be noted that the application was evaluated on the two public data sets and exceeded the current best methods, demonstrating its effectiveness.
In summary, conventional visual loop detection algorithms, whether based on offline or online bags of words or on deep learning, must compare the feature representation of the current frame with those of the history frames one by one, although most of these comparisons are unnecessary, because conventional methods do not exploit information tied to the scene's data stream. The application provides a new framework for visual loop detection in a specific scene: the screening of loop candidate frames is modeled as a Markov decision process, and the temporal correlation of the scene data stream is used to learn the distribution of loop frames in the scene, so that given the current frame and the history frames, the most similar loop candidate frames can be obtained directly, avoiding tedious one-by-one comparison.
Meanwhile, weighing time efficiency against performance, the application proposes a credibility-based screening of the loop candidate frames and expands the candidates into segments to enlarge the search space. To further improve recall, the loop candidate frames are expanded into candidate segments before the geometric and other checks; the frames that pass the multi-layer checks can serve as loop frames, and the one with the highest inlier ratio among them is selected as the final loop frame. The steps that screen the loop candidate frames chiefly consider accuracy and computational efficiency, while the screening of loop frames from the candidates chiefly considers recall. Good recall and accuracy are thus obtained together.
Of course, the present application is capable of various other embodiments, and those skilled in the art can make corresponding changes and modifications according to the present application without departing from its spirit and scope as defined by the appended claims.

Claims (8)

1. The strategy-guided visual loop detection method is characterized in that each frame of picture is used as a current frame of picture to execute loop frame acquisition operation in sequence according to time sequence, loop frames corresponding to each frame of picture are obtained, and the loop frame acquisition operation is as follows:
s1: inputting the current frame picture and the historical frame picture before the current frame picture into a feature expression network to obtain feature vectors of the current frame picture and the historical frame picture;
s2: the method comprises the steps that pictures which are not selected to serve as loop candidate frames in historical frame pictures form a candidate set, the pictures which are selected to serve as loop candidate frames form a selected set, feature vectors corresponding to the candidate set, feature vectors corresponding to current frame pictures and feature vectors corresponding to the selected set are input into a strategy network, probability distribution that each historical frame picture in the candidate set is selected to serve as the loop candidate frame is obtained, and the historical frame picture with the largest probability is selected to serve as the loop candidate frame of the current frame picture;
s3: removing the picture currently selected as the loop candidate frame from the candidate set, adding the picture into the selected set, completing updating of the candidate set and the selected set, and repeating the step S2 by adopting the updated candidate set and the selected set until the loop candidate frames with the set number are obtained;
s4: perform a geometric check and a time consistency check on all the loop candidate frames, and select, from the loop candidate frames passing the checks, the frame with the highest inlier ratio as the loop frame of the current frame picture.
2. The method for policy-guided visual loop detection of claim 1, wherein the method for obtaining the feature expression network comprises:
s11: any two-by-two combination is carried out on all frame pictures in the training set, so that all possible picture pairs are obtained, and each picture pair is provided with a similar label or a dissimilar label;
s12: sequentially inputting the photos in each picture pair into a feature expression network to obtain feature vector pairs corresponding to each picture pair;
s13: respectively calculating the similarity between two feature vectors in each feature vector pair, acquiring a loss function of a feature expression network according to the similarity corresponding to each picture and a label, judging whether the loss function is smaller than a set value, if so, determining that the feature expression network under the current network parameters is a final feature expression network, and if not, updating the network parameters of the feature expression network based on the back propagation of the loss function;
s14: and repeating the steps S12 to S13 by adopting the updated characteristic expression network until a final characteristic expression network is obtained.
3. The method for policy-guided visual loop detection according to claim 2, wherein the loss function $L$ is:

$$L = y \cdot \max(0,\; m - d) + (1 - y) \cdot \max(0,\; d + m)$$

where $y$ denotes the label of a picture pair: $y = 1$ for a similar label and $y = 0$ for a dissimilar label; the method for judging whether any picture pair is marked as a similar label or a dissimilar label is: if the difference between the frame numbers corresponding to the two pictures in the pair is no more than 20, the pair is marked with a similar label, otherwise with a dissimilar label; $d$ denotes the cosine similarity of the feature vectors of the two pictures in the pair; and $m$ denotes a set constant greater than 0.
4. The method for policy-guided visual loop detection of claim 1, wherein the method for obtaining the policy network comprises: the feature expression network is adopted to obtain feature vectors of all frame pictures in the training set, and the feature vectors are used as the feature vectors of the current frame to execute the following operations:
s21: inputting the historical frame feature vector before the current frame feature vector, the current frame feature vector and the feature vector which is selected as a loop candidate frame into a strategy network to obtain probability distribution that each historical frame feature vector is selected as the most similar feature vector, and randomly selecting a frame feature vector as a candidate frame feature vector based on the probability distribution;
s22: obtain the reward value $r_t$ corresponding to the candidate frame feature vector:

$$r_t = \frac{\mathrm{sim}(f_q, f_{a_t})}{t}$$

where $t$ denotes the step at which the candidate frame feature vector corresponding to the current frame feature vector is found, $f_q$ denotes the current frame feature vector, $f_{a_t}$ denotes the candidate frame feature vector, and $\mathrm{sim}(f_q, f_{a_t})$ denotes the similarity between the current frame feature vector and the candidate frame feature vector, computed as:

$$\mathrm{sim}(f_q, f_{a_t}) = \alpha \cos(f_q, f_{a_t}) + (1 - \alpha)\, N_m(f_q, f_{a_t})$$

where $\alpha$ denotes a set weight, $\cos(f_q, f_{a_t})$ denotes the cosine value of the current frame feature vector and the candidate frame feature vector, and $N_m(f_q, f_{a_t})$ denotes the number of feature point matches in the current frame feature vector and the candidate frame feature vector;
S23: record the reward value $r_t$, update the selected set and the candidate set, and repeat S21-S22 until N candidate frame feature vectors are selected; then compute the expected reward $J(w)$ and judge whether $J(w)$ is greater than a set value. If so, the policy network under the current network parameters is the final policy network; if not, update the network parameters of the policy network by gradient ascent on the expected reward $J(w)$;

s24: repeat steps S21-S23 with the updated policy network until the expected reward $J(w)$ is greater than the set value, obtaining the final policy network.
5. The method of policy-guided visual loop detection of claim 4, wherein the probability distribution $\pi_w(a_t \mid s_t)$ with which each history frame feature vector is selected as the most similar feature vector in step S21 is computed as:

$$\pi_w(a_t \mid s_t) = \frac{\exp\big(\phi(s_t, a_t)^T w\big)}{\sum_{a' \in A(s_t)} \exp\big(\phi(s_t, a')^T w\big)}$$

where $a_t$ denotes the action of selecting a candidate frame feature vector from the history frame feature vectors, $s_t$ denotes the current state, which comprises the current frame feature vector, the history frame feature vectors, and the selection status of the candidate frame feature vectors, $\phi(s_t, a)$ denotes the feature of the state-action pair, $A(s_t)$ denotes the action space of the current state, composed of the feature vectors in the history frame feature vectors not yet selected as loop candidate frames, $w$ is the network parameter of the policy network, and $T$ denotes transposition.
6. The method for policy-guided visual loop detection according to claim 1, wherein before performing the geometric check and time consistency check on all loop candidate frames, the credibility of all loop candidate frames is first judged, and the geometric check and time consistency check are performed only on loop candidate frames whose credibility is greater than a set value, the credibility of each loop candidate frame being determined as follows:

arrange the loop candidate frames on the original picture frame sequence according to their frame numbers; centered on the current loop candidate frame, expand forward and backward by a set number of frames to obtain an expansion window of set frame length; and take the number of loop candidate frames, the current one plus any others, that fall within the expansion window as the credibility of the current loop candidate frame.
7. The method for policy-guided visual loop detection according to claim 1, wherein before performing the geometric check and time consistency check on all loop candidate frames, all loop candidate frames are expanded and merged into candidate segments, and the geometric check and time consistency check are performed on the candidate segments, the expansion and merging method being:
expanding W frames before and after each loop candidate frame to form a candidate segment with the length of 2W;
and judging whether each candidate segment has frame overlapping or not, and if so, merging the overlapped candidate segments.
8. A method of policy directed visual loop detection according to any of claims 1-7, wherein the set number of loop candidate frames in step S3 is 10.
CN202310759867.4A 2023-06-26 2023-06-26 Strategy-guided visual loop detection method Pending CN116958057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310759867.4A CN116958057A (en) 2023-06-26 2023-06-26 Strategy-guided visual loop detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310759867.4A CN116958057A (en) 2023-06-26 2023-06-26 Strategy-guided visual loop detection method

Publications (1)

Publication Number Publication Date
CN116958057A (en) 2023-10-27

Family

ID=88443601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310759867.4A Pending CN116958057A (en) 2023-06-26 2023-06-26 Strategy-guided visual loop detection method

Country Status (1)

Country Link
CN (1) CN116958057A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237858A (en) * 2023-11-15 2023-12-15 成都信息工程大学 Loop detection method
CN117237858B (en) * 2023-11-15 2024-03-12 成都信息工程大学 Loop detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination