CN102831446A - Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping) - Google Patents

Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping) Download PDF

Info

Publication number
CN102831446A
CN102831446A CN2012102951816A CN201210295181A
Authority
CN
China
Prior art keywords
image
visual
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102951816A
Other languages
Chinese (zh)
Inventor
梁志伟
陈燕燕
朱松豪
徐国政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN2012102951816A priority Critical patent/CN102831446A/en
Publication of CN102831446A publication Critical patent/CN102831446A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping). The method includes: acquiring images of the current scene with a monocular camera carried by a mobile robot as it advances, and extracting bag-of-visual-words features from the images of the current scene; preprocessing the images by measuring image similarity from the inner product of the image weight vectors and rejecting any current image that is highly similar to the previous history image; updating the posterior probability of the loop closure hypothesis state with a Bayesian filtering process to perform loop closure detection and judge whether the current image forms a loop closure; and verifying the loop closure detection result obtained in the previous step with an image reverse retrieval process. Further, in the process of establishing the visual dictionary, the number of clustering categories is adjusted dynamically according to the TSC (tightness and separation criterion) value, which serves as the evaluation criterion for the clustering result. Compared with the prior art, the loop closure detecting method has the advantages of high real-time performance and detection precision.

Description

Closed loop detection method based on image appearance in monocular vision SLAM
Technical Field
The invention provides a closed-loop detection method based on image appearance in monocular vision SLAM (simultaneous localization and mapping), aimed at the problem of closed-loop detection in SLAM, and belongs to the technical field of mobile robot navigation.
Background
Simultaneous localization and mapping (SLAM) is a fundamental problem and research hotspot in the field of mobile robot navigation, and the ability to localize and map simultaneously is widely regarded as a key precondition for a robot to achieve autonomous navigation. During SLAM the robot localizes itself while constructing an environment map; owing to the lack of prior knowledge and the uncertainty of the environment, the robot must judge during its travel whether its current position lies in an environment area it has already visited, and use that judgment as the basis for deciding whether the map needs to be updated. This is the closed-loop detection problem.
Due to the limited observation range of the vision sensor, monocular vision SLAM closed-loop detection faces many problems: uncertainty and error in the robot's motion may lead to data-association errors, and questions remain of how to detect visual features, how to characterize the visual scene model, and so on. Accurately establishing a scene model is the key to visual SLAM closed-loop detection, and at present most vision-based scene models are described directly by the observed environmental appearance characteristics. The BoVW (bag of visual words) algorithm is an effective image feature modeling method and is widely used for visual SLAM closed-loop detection. In this method, local features of an image are extracted with the SURF or SIFT operator and then classified to construct a visual dictionary; based on the created visual dictionary, any image can be represented by a set of visual words from the dictionary.
In visual SLAM closed-loop detection, Angeli et al. proposed a topological closed-loop detection method based on incremental vision, and Cummins et al. proposed a probabilistic closed-loop detection method based on topological appearance; both methods detect effectively in large-scale environments but cannot meet the efficiency and real-time requirements of closed-loop detection in the SLAM problem. RTAB-MAP is a real-time closed-loop detection method based on scene appearance whose strong memory management allows the robot to process every frame online over long periods, but its detection accuracy is low and false closed-loop detections occur easily.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a closed-loop detection method based on image appearance in monocular vision SLAM that can effectively improve the real-time performance and accuracy of online closed-loop detection.
The invention specifically adopts the following technical scheme to solve the technical problems:
a closed loop detection method based on image appearance in monocular vision SLAM comprises the following steps:
step 1, acquiring a current scene image by using a monocular camera carried by a mobile robot in the advancing process of the mobile robot, and extracting visual bag-of-word characteristics of the current scene image;
step 2, calculating the content similarity between the current scene image and the previous frame of historical image; if the maximum value of the content similarity is smaller than a preset similarity threshold, saving the current scene image and executing step 3; otherwise, deleting the current scene image and returning to step 1 to acquire a new image, wherein the content similarity S between the current scene image I_t and the previous frame of historical image I_c is calculated according to the following formula:
S = (V_{I_t}, V_{I_c}) / (|V_{I_t}| |V_{I_c}|)
where V_{I_t} denotes the visual bag-of-words feature vector of the current scene image I_t and V_{I_c} denotes the visual bag-of-words feature vector of the previous frame of historical image I_c;
and 3, continuously updating the posterior probability of the closed-loop assumed state by using a Bayesian filtering method to perform closed-loop detection, and judging whether the current scene image is closed-loop or not.
In order to improve the accuracy of closed-loop detection, the method further verifies the detection result obtained in the step 3 by using an image reverse retrieval method, and specifically, the method further comprises the following steps:
and 4, verifying the detection result of the step 3 according to the following method: when the current image and a historical image are detected to form a closed loop in the step 3, counting the frequency of the visual words in the visual word bag characteristics of the current image in the visual word bag characteristics of each historical image; selecting the previous P historical images with the highest frequency, wherein P is a natural number; if the historical image detected in the step 3 is any one of the P historical images, the closed loop detection is considered to be correct, and the closed loop is accepted; otherwise, the closed loop detection is wrong, and the closed loop is rejected.
The invention can adopt an existing method to extract the visual bag-of-words feature of an image. Existing methods usually represent an image directly by the frequency vector of its visual words; to make the visual bag-of-words feature characterize the image more accurately, the invention borrows the tf-idf weighting method from text retrieval and represents each frame of image by a weighted word-frequency vector. Specifically, the invention extracts the visual bag-of-words feature of an image according to the following method:
step 1, extracting local visual features of an image I to obtain a local visual feature vector set of the image I;
step 2, representing the image I as a K-dimensional vector V_I, namely the visual bag-of-words feature vector of the image I:
V_I = [t_1, ..., t_j, ..., t_K]^T,   j = 1, 2, ..., K
where t_j = (n_{jI} / n_I) · log(N / N_j), K is the number of visual words in the visual dictionary, n_{jI} is the frequency of occurrence of the jth visual word in the local visual feature vector set of image I, n_I is the number of visual words appearing in the local visual feature vector set of image I, N is the number of all currently saved history images, and N_j is the number of currently saved history images whose visual bag-of-words feature contains the jth visual word.
Preferably, the visual dictionary is constructed offline by the following method:
step 1, collecting a group of environment scene images in advance and extracting local visual features of the images respectively, wherein all local visual feature vectors form a training sample set;
step 2, clustering the training sample set, and constructing a visual dictionary by taking the obtained clustering centers as visual words, wherein each clustering center is a visual word; the method specifically comprises the following steps:
step 201, setting an initial clustering category number K;
step 202, performing fuzzy K-means clustering on the training sample set with the current number of cluster categories K; in each iteration step, assigning each sample to a cluster center according to the maximum-membership criterion, where the membership R_{lj} of the lth sample to the jth cluster center is given by:
R_{lj} = (1 / ||D_l − V_j||^2) / Σ_{m=1}^{K} (1 / ||D_l − V_m||^2),   1 ≤ l ≤ M, 1 ≤ j ≤ K
where D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, and K is the number of cluster categories;
and updating the clustering center according to the following formula:
V_j(t+1) = V_j(t) + [Σ_{l=1}^{M} R_{lj}(t) · (D_l − V_j(t))] / [Σ_{l=1}^{M} R_{lj}(t)]
where V_j(t) and V_j(t+1) respectively denote the jth cluster center in the tth and (t+1)th iteration steps, and R_{lj}(t) denotes the membership of the lth sample to the jth cluster center in the tth iteration step;
step 203, judging whether the TSC value of the current clustering result is within a preset range, if so, turning to step 204; if not, changing the current clustering category number K, and turning to the step 202; the TSC value is calculated according to the following formula:
TSC(V, K) = [ (1/M) Σ_{j=1}^{K} Σ_{l=1}^{M} R_{lj}^2 ||D_l − V_j||^2 ] / min_{j1,j2} ||V_{j1} − V_{j2}||^2
where min_{j1,j2} ||V_{j1} − V_{j2}||^2 is the squared distance between the two nearest cluster centers V_{j1} and V_{j2} among the K cluster centers, D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, K is the number of cluster categories, and R_{lj} is the membership of the lth sample to the jth cluster center;
and step 204, constructing a visual dictionary by taking the K clustering centers as visual words, wherein each clustering center is a visual word.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the acquired images are preprocessed using image similarity: scene images highly similar to the previous frame of historical image are rejected and only a small number of images that best represent the current environmental characteristics are retained for subsequent closed-loop detection, which greatly reduces the amount of computation and the demand on hardware storage capacity and improves the real-time performance of detection;
secondly, on the basis of the closed-loop detection result obtained by the Bayesian filtering update method, the result is verified with an image reverse retrieval method, which effectively improves the accuracy of the detection result;
thirdly, a weighted visual word-frequency vector is used as the visual bag-of-words feature to represent the image, so that the description of the image features is more accurate;
and fourthly, when the visual dictionary is constructed offline, the TSC criterion is adopted to evaluate the clustering effect, thereby obtaining a more accurate clustering result.
Drawings
FIG. 1 is a schematic flow chart of a closed-loop detection method based on image appearance in monocular vision SLAM according to the present invention;
FIG. 2 is a flow chart of the construction process of the visual dictionary in the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings:
the basic flow of the closed-loop detection method based on the image appearance in the monocular vision SLAM is shown in figure 1, and the method comprises the following steps:
step 1, the mobile robot collects a current image by using a monocular camera carried by the mobile robot, and extracts visual word bag characteristics of the current image.
In the BoVW image feature model (for details see T. Botterill, S. Mills, R. Green. Bag-of-words-driven, single-camera simultaneous localization and mapping. Journal of Field Robotics, 2011, 28(2): 204-226), a visual dictionary is constructed from a large number of local visual feature vectors of images, each local visual feature serving as a visual word in the dictionary; based on the created visual dictionary, any image can then be characterized by a set of visual words from the dictionary, namely the visual bag-of-words feature of the image. The invention adopts the following method to extract the visual bag-of-words feature of an image:
Step 1.1, extracting local visual features of an image I to obtain the local visual feature vector set of the image I. The local visual features can be extracted with the existing SURF or SIFT operators; the SURF algorithm is briefly described below, and for the SIFT algorithm see D. Lowe. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, 1999: 1150-1157.
SURF detects feature points using a basic approximation of the Hessian matrix, whose determinant value serves as the basis for scale selection. For any point (x, y) in the image I, the Hessian matrix H(x, y, σ) at (x, y) with scale σ is defined as follows:
H(x, y, σ) = [ L_xx(x, y, σ)  L_xy(x, y, σ) ; L_xy(x, y, σ)  L_yy(x, y, σ) ]   (1)
wherein L(x, y, σ) is the LoG (Laplacian of Gaussian) response of the image, i.e., the convolution of the original image I(x, y) with a variable-scale 2-dimensional Gaussian function:
L(x,y,σ)=G(x,y,σ)*I(x,y) (2)
SURF approximates the LoG operator with the DoG (Difference of Gaussians) operator:
D(x,y,σ)=[G(x,y,kσ)-G(x,y,σ)]*I(x,y)=L(x,y,kσ)-L(x,y,σ) (3)
the DoG only needs to subtract the images after adjacent scale gaussian smoothing in calculation, thereby simplifying the calculation. The box-type filter is adopted to approximate Gaussian second derivative, and the integral image is used to quickly calculate the image convolution of the average filters, so as to obtain Dxx,Dxy,DyyIs approximately Lxx,Lxy,LyyThus, the determinant of the Hessian matrix has an approximation:
det(H_approx) = D_xx · D_yy − (ω · D_xy)^2   (4)
wherein ω is an adjusting parameter used to balance the terms of the Hessian determinant expression and is generally set to the constant 0.9 in actual calculation. The SURF scale space is divided into octaves with a constant number of layers per octave; the difference is that the image size is kept unchanged while the filter size is varied to construct the scale space, and feature points are searched at different scales.
For each detected feature point, the wavelet responses in the x and y directions are computed in a circular neighborhood of radius 6s (s being the scale of the feature point) and Gaussian-weighted around the feature point to obtain the point's response description (x, y), where x denotes the response in the x direction and y the response in the y direction. The responses are summed within a sliding window to obtain local direction vectors, and the longest vector is taken as the description vector (dominant orientation) of the feature point. A square window of size 20s is then constructed around the feature point and divided into 16 (4 × 4) sub-regions, each sub-region being further divided into 4 blocks, giving 64 description primitives. For each sub-region, Σdx, Σ|dx|, Σdy and Σ|dy| are computed and the sub-region is represented by the vector v = (Σdx, Σ|dx|, Σdy, Σ|dy|); combining the 16 vectors yields a descriptor of length 64, i.e., each feature point is described by a 64-dimensional feature vector.
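By way of illustration only, a minimal Python sketch of the local feature extraction of step 1.1 is given below; the OpenCV calls (cv2.xfeatures2d.SURF_create is only present in builds with the contrib/nonfree modules) and the Hessian threshold value are assumptions of this sketch rather than part of the claimed method, and SIFT may be substituted.

```python
# Minimal sketch of step 1.1 (local feature extraction), assuming OpenCV with the
# contrib/nonfree modules (opencv-contrib-python).  If SURF is unavailable in your
# build, cv2.SIFT_create() can be substituted (it yields 128-dim descriptors).
import cv2
import numpy as np

def extract_local_features(image_path, hessian_threshold=400):
    """Return an (n, 64) array of SURF descriptors for one scene image."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    keypoints, descriptors = surf.detectAndCompute(image, None)
    if descriptors is None:                      # no feature points detected
        descriptors = np.empty((0, 64), dtype=np.float32)
    return descriptors
```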
Step 1.2, extracting the visual bag-of-words feature of the image, i.e., expressing the local visual feature vector set of the image as a set of visual words according to the visual dictionary. This embodiment does not adopt the traditional method of representing an image directly by the frequency vector of its visual words; instead, it borrows the tf-idf weighting method from text retrieval and represents each frame of image by a weighted word-frequency vector. Specifically, the image I is represented as a K-dimensional vector V_I, namely the visual bag-of-words feature vector of the image I:
V_I = [t_1, ..., t_j, ..., t_K]^T,   j = 1, 2, ..., K
where t_j = (n_{jI} / n_I) · log(N / N_j), K is the number of visual words in the visual dictionary, n_{jI} is the frequency of occurrence of the jth visual word in the local visual feature vector set of image I, n_I is the number of visual words appearing in the local visual feature vector set of image I, N is the number of all currently saved history images, and N_j is the number of currently saved history images whose visual bag-of-words feature contains the jth visual word.
The visual dictionary has a crucial influence on the accuracy of visual bag-of-words feature extraction; it is constructed offline by the following method:
Step A, acquiring a group of environment scene images in advance and extracting their local visual features respectively; all local visual feature vectors form the training sample set. For example, local visual features are extracted with SURF, a total of M visual feature vectors being extracted and denoted {D_l | 1 ≤ l ≤ M}, each visual feature vector (sample) having a fixed length T (T = 64 in this embodiment).
Step B, clustering the training sample set and constructing a visual dictionary with the obtained cluster centers as visual words, each cluster center being one visual word. In order to improve clustering accuracy, the conventional fuzzy K-means clustering is improved: the TSC (tightness and separation criterion) value is used as the evaluation standard of the clustering result and the number of cluster categories K is adjusted dynamically, so that an optimized clustering result is obtained. Specifically, this step comprises the following sub-steps:
Step B1, setting the initial clustering category number K; the initial K value may be determined empirically, and is preferably 500 in the present invention.
Step B2, performing fuzzy K-means clustering on the training sample set with the current number of cluster categories K; in each iteration step, each sample is assigned to a cluster center according to the maximum-membership criterion, i.e., each sample in the training sample set is classified into the class of the cluster center to which it has the greatest membership; the membership R_{lj} of the lth sample to the jth cluster center is given by:
R_{lj} = (1 / ||D_l − V_j||^2) / Σ_{m=1}^{K} (1 / ||D_l − V_m||^2),   1 ≤ l ≤ M, 1 ≤ j ≤ K   (5)
where D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, and K is the number of cluster categories;
and updating the clustering center according to the following formula:
V_j(t+1) = V_j(t) + [Σ_{l=1}^{M} R_{lj}(t) · (D_l − V_j(t))] / [Σ_{l=1}^{M} R_{lj}(t)]   (6)
where V_j(t) and V_j(t+1) respectively denote the jth cluster center in the tth and (t+1)th iteration steps, and R_{lj}(t) denotes the membership of the lth sample to the jth cluster center in the tth iteration step;
The iteration is repeated in this way and stops when a preset termination condition is met, for example when a preset number of iterations is reached or when the change of the cluster centers between two consecutive iterations falls below a preset distance threshold; at that point all local visual feature vectors in the training sample set have been classified into K classes. In this embodiment the latter is used as the termination condition, expressed as:
max_{1 ≤ j ≤ K} ||V_j(t+1) − V_j(t)|| < ε
i.e., the iteration stops when the maximum displacement of the cluster centers between two consecutive iterations falls below the preset distance threshold ε.
B3, judging whether the TSC value of the current clustering result is within a preset range, if so, turning to the step B4; if not, changing the current cluster category number K, and turning to the step B2; the TSC value is calculated according to the following formula:
TSC(V, K) = [ (1/M) Σ_{j=1}^{K} Σ_{l=1}^{M} R_{lj}^2 ||D_l − V_j||^2 ] / min_{j1,j2} ||V_{j1} − V_{j2}||^2   (7)
where min_{j1,j2} ||V_{j1} − V_{j2}||^2 is the squared distance between the two nearest cluster centers V_{j1} and V_{j2} among the K cluster centers, D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, K is the number of cluster categories, and R_{lj} is the membership of the lth sample to the jth cluster center;
and step B4, constructing a visual dictionary by taking K clustering centers as visual words, wherein each clustering center is a visual word.
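As an illustration of steps B1 to B4, the following sketch runs fuzzy K-means for a given K according to formulas (5) and (6), scores the result with the TSC value of formula (7), and adjusts K; the initial K of 500 follows the embodiment, while the acceptance range for the TSC value and the rule for changing K are assumptions, since the text leaves both to the practitioner.

```python
import numpy as np

def fuzzy_kmeans(samples, K, max_iter=100, eps=1e-3, seed=0):
    """One fuzzy K-means run (formulas (5) and (6)) for a modest training set."""
    rng = np.random.default_rng(seed)
    centers = samples[rng.choice(len(samples), K, replace=False)]
    R = None
    for _ in range(max_iter):
        d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        R = (1.0 / d2) / (1.0 / d2).sum(axis=1, keepdims=True)          # membership R_lj
        shift = (R[:, :, None] * (samples[:, None, :] - centers[None, :, :])).sum(axis=0)
        new_centers = centers + shift / R.sum(axis=0)[:, None]          # formula (6)
        done = np.max(np.linalg.norm(new_centers - centers, axis=1)) < eps
        centers = new_centers
        if done:                                                        # termination condition
            break
    return centers, R

def tsc(samples, centers, R):
    """TSC value (formula (7)): within-cluster compactness over minimum center separation."""
    d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    compactness = (R ** 2 * d2).sum() / len(samples)
    sep = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)
    return compactness / sep.min()

def build_dictionary(samples, k_init=500, k_step=50, tsc_range=(0.0, 1.0)):
    """Adjust K until the TSC value falls inside an (assumed) acceptance range."""
    K = k_init
    while True:
        centers, R = fuzzy_kmeans(samples, K)
        score = tsc(samples, centers, R)
        if tsc_range[0] <= score <= tsc_range[1] or K <= k_step:
            return centers                       # cluster centers = visual words
        K -= k_step                              # assumed rule for changing K
```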
The construction process of the visual dictionary is shown in FIG. 2.
Step 2, calculating the content similarity between the current scene image and the previous frame of historical image; if the maximum value of the content similarity is smaller than a preset similarity threshold, the current scene image is saved and step 3 is executed; otherwise the current scene image is deleted and the process returns to step 1 to acquire a new image. The content similarity S between the current scene image I_t and the previous frame of historical image I_c is calculated according to the following formula:
S = (V_{I_t}, V_{I_c}) / (|V_{I_t}| |V_{I_c}|)
where V_{I_t} denotes the visual bag-of-words feature vector of the current scene image I_t and V_{I_c} denotes the visual bag-of-words feature vector of the previous frame of historical image I_c.
Because the mobile robot continuously captures scene images during operation, adjacent images are highly similar, so the content similarity between a newly acquired image and the image retained at the previous moment must be judged; only an image whose similarity is smaller than a certain threshold represents a new location scene image and is processed further. The similarity between images is measured by the inner product of the image weight vectors: the larger the value of S, the more similar the images. If S is below the fixed threshold, the current image is considered to represent a new scene location; if S is above the threshold, the two frames are too similar and the current image is rejected directly rather than used for closed-loop detection. In this way a large number of similar, redundant images are removed and only a small number of images that fully reflect the environmental characteristics are retained, which reduces algorithm complexity and improves the real-time performance of detection. The setting of the similarity threshold depends on the image quality and the image acquisition rate: the smaller the value, the more scene images are rejected and the higher the precision, but if the threshold is too small the robot may fail to detect a closed loop when it returns to the starting point of a loop; in practice the threshold is set on the principle of keeping fewer images rather than keeping them indiscriminately.
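A minimal sketch of this preprocessing filter is given below; the similarity threshold value and the gating function are illustrative assumptions.

```python
import numpy as np

def content_similarity(v_t, v_c):
    """Cosine similarity S of two weighted bag-of-words vectors (formula above)."""
    denom = np.linalg.norm(v_t) * np.linalg.norm(v_c)
    return float(np.dot(v_t, v_c) / denom) if denom > 0 else 0.0

def keep_for_loop_detection(v_current, v_previous_kept, similarity_threshold=0.8):
    """Keep the new image only if it differs enough from the last retained image."""
    return content_similarity(v_current, v_previous_kept) < similarity_threshold
```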
Step 3, continuously updating the posterior probability of the closed-loop hypothesis state with a Bayesian filtering method to perform closed-loop detection, and judging whether the current scene image forms a closed loop. The method itself is prior art (see the work of M. Labbé and F. Michaud on real-time appearance-based loop closure detection). A probabilistic approach is adopted: closed-loop detection is treated as a recursive Bayesian estimation problem, and the loop is detected by estimating the posterior probability distribution of the current closed-loop hypothesis state. If this probability is greater than a given threshold, a closed loop is considered detected; otherwise the current image is added to the map as a new scene image and detection continues. The basic content is as follows:
let XtFor a random variable representing the assumed state of the closed loop at time t, XtI denotes the picture ItAnd image IiMatch, end loop, at which time ItAnd IiRepresenting the same scene location; xt0 represents ItIs a new scene image, i.e. no closed loop occurs at time t. The filtering process is carried out by calculating each time i is 0, …, t to form a closed loopProbability of occurrence to estimate the posterior probability distribution p (X) of the system integrityt/It) The whole filtering process is divided into two steps of prediction/updating and recursion:
and (3) prediction: <math> <mrow> <msup> <mi>Bel</mi> <mo>-</mo> </msup> <mrow> <mo>(</mo> <msub> <mi>X</mi> <mi>t</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <mi>p</mi> <mrow> <mo>(</mo> <msub> <mi>X</mi> <mi>t</mi> </msub> <mo>|</mo> <msub> <mi>X</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>=</mo> <mi>i</mi> <mo>)</mo> </mrow> <mi>Bel</mi> <mrow> <mo>(</mo> <msub> <mi>X</mi> <mrow> <mi>t</mi> <mo>-</mo> <mn>1</mn> </mrow> </msub> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>9</mn> <mo>)</mo> </mrow> </mrow> </math>
updating: bel (X)t)=ηp(It|Xt)Bel-(Xt) (10)
And (3) expanding the posterior probability by using a Bayes formula to obtain the posterior probability density at the moment t:
Figure BDA00002030744300092
where eta is a normalization factor, It=IO,…,ItRepresenting the sequence of images acquired at time t.
The observation model p(I_t | X_t) is evaluated with the likelihood function L(X_t | I_t). When a closed loop occurs, the likelihood function L(X_t | I_t) is calculated as:
L(X_t = i | I_t) = Σ_{i=0}^{t} p(X_t^{(i)} | I_t) = Σ_{i=0}^{t} (V_{I_t}, V_{I_c}) / (|V_{I_t}| |V_{I_c}|)   (12)
If no closed loop occurs at the current moment, the likelihood function L(X_t | I_t) is calculated by the following formula:
L(X_t = 0 | I_t) = μ / σ + 1   (13)
where μ is the mean and σ the standard deviation of the similarities obtained by comparing the current image with each frame of historical image. Therefore, when no closed loop occurs, the likelihood value of the current image being a new scene image is also large, and the probability is updated accordingly.
The motion model p(X_t | X_{t-1}) takes empirical values:
(1) p(X_t = 0 | X_{t-1} = 0) = 0.9 denotes the probability that no closed loop occurred at time t-1 and no closed loop occurs at time t;
(2) p(X_t = i | X_{t-1} = 0) = 0.1/N (i = 0, ..., t) denotes the probability that no closed loop occurred at time t-1 and a closed loop occurs at time t, where N denotes the total number of observed images;
(3) p(X_t = 0 | X_{t-1} = j) = 0.1 (j = 0, ..., t) denotes the probability that the current image had high similarity to image j and a closed loop occurred at time t-1, but no closed loop occurs at time t;
(4) p(X_t = i | X_{t-1} = j) (i, j = 0, ..., t) denotes the probability that closed loops occur at both time t-1 and time t; this probability is defined as a discrete Gaussian curve centered at j, whose non-zero values are computed over the 8-neighborhood (i = j-4, ..., j+4) such that the Gaussian coefficients of these 9 values sum to 0.9.
The posterior probability p(X_t | I^t) is continuously updated and normalized during closed-loop detection; when the probability p(X_t | I^t) is higher than a preset closed-loop threshold T_loop, a closed loop is considered to have occurred at the current time; otherwise no closed loop has occurred and detection continues.
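The prediction/update recursion of formulas (9) and (10) together with the empirical motion model can be sketched as follows; the array layout of the hypothesis states and the handling of boundary indices are assumptions of this sketch.

```python
import numpy as np

def bayes_filter_step(belief, likelihood, p_stay_new=0.9):
    """One prediction/update step of the loop-closure filter.

    belief:      array over states {0: new place, 1..t-1: loop with image i} at time t-1
    likelihood:  array over states 0..t at time t, L(X_t = i | I_t) from formulas (12)/(13)
    Returns the normalized posterior Bel(X_t).
    """
    t_now = len(likelihood)
    predicted = np.zeros(t_now)

    # Transitions from "no loop at t-1" (state 0): 0.9 stays, 0.1 spread over loop states
    # (the text uses 0.1/N with N observed images).
    predicted[0] += p_stay_new * belief[0]
    predicted[1:] += (1.0 - p_stay_new) / max(t_now - 1, 1) * belief[0]

    # Transitions from "loop with image j at t-1": 0.1 to "no loop", 0.9 spread as a
    # discrete Gaussian over the 8-neighborhood of j (Gaussian width assumed here).
    for j in range(1, len(belief)):
        predicted[0] += 0.1 * belief[j]
        window = np.arange(j - 4, j + 5)
        weights = np.exp(-0.5 * ((window - j) / 2.0) ** 2)
        weights = 0.9 * weights / weights.sum()
        for w, i in zip(weights, window):
            if 1 <= i < t_now:
                predicted[i] += w * belief[j]

    posterior = likelihood * predicted           # update step, formula (10)
    return posterior / (posterior.sum() + 1e-12)
```

A loop closure with image i is then accepted when the posterior entry for state i exceeds the threshold T_loop.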
And 4, verifying the detection result of the step 3 according to the following method: when the current image and a historical image are detected to form a closed loop in the step 3, counting the frequency of the visual words in the visual word bag characteristics of the current image in the visual word bag characteristics of each historical image; selecting the previous P historical images with the highest frequency, wherein P is a natural number (for example, 10); if the historical image detected in the step 3 is any one of the P historical images, the closed loop detection is considered to be correct, and the closed loop is accepted; otherwise, the closed loop detection is wrong, and the closed loop is rejected.
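The reverse retrieval verification of step 4 can be sketched as follows; the representation of each image as a visual-word frequency vector and the value P = 10 are taken from the description, while the scoring function is an illustrative assumption.

```python
import numpy as np

def verify_loop_closure(current_word_freq, history_word_freqs, detected_index, top_p=10):
    """Accept the loop closure only if the detected image is among the P history images
    in which the current image's visual words occur most frequently.

    current_word_freq:  length-K occurrence counts of visual words in the current image
    history_word_freqs: list of length-K occurrence-count arrays, one per history image
    """
    present = current_word_freq > 0
    scores = [freq[present].sum() for freq in history_word_freqs]
    top = np.argsort(scores)[::-1][:top_p]
    return detected_index in top
```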

Claims (5)

1. A closed loop detection method based on image appearance in monocular vision SLAM is characterized by comprising the following steps:
step 1, acquiring a current scene image by using a monocular camera carried by a mobile robot in the advancing process of the mobile robot, and extracting visual bag-of-word characteristics of the current scene image;
step 2, calculating the content similarity between the current scene image and the previous frame of historical image; if the maximum value of the content similarity is smaller than a preset similarity threshold, saving the current scene image and executing step 3; otherwise, deleting the current scene image and returning to step 1 to acquire a new image, wherein the content similarity S between the current scene image I_t and the previous frame of historical image I_c is calculated according to the following formula:
S = (V_{I_t}, V_{I_c}) / (|V_{I_t}| |V_{I_c}|)
where V_{I_t} denotes the visual bag-of-words feature vector of the current scene image I_t and V_{I_c} denotes the visual bag-of-words feature vector of the previous frame of historical image I_c;
and 3, continuously updating the posterior probability of the closed-loop assumed state by using a Bayesian filtering method to perform closed-loop detection, and judging whether the current scene image is closed-loop or not.
2. The closed-loop detection method based on image appearance in monocular vision SLAM as claimed in claim 1, further comprising:
step 4, verifying the detection result of step 3 according to the following method: when step 3 detects that the current image and a historical image form a closed loop, counting how frequently the visual words in the visual bag-of-words feature of the current image appear in the visual bag-of-words feature of each historical image; selecting the P historical images with the highest frequency, P being a natural number; if the historical image detected in step 3 is any one of these P historical images, the closed-loop detection is considered correct and the closed loop is accepted; otherwise the closed-loop detection is wrong and the closed loop is rejected.
3. The closed-loop detection method based on image appearance in monocular vision SLAM as set forth in claim 1 or 2, characterized in that the visual bag-of-words feature of the image is extracted according to the following method:
step 101, extracting local visual features of an image I to obtain the local visual feature vector set of the image I;
step 102, representing the image I as a K-dimensional vector V_I, namely the visual bag-of-words feature vector of the image I:
V_I = [t_1, ..., t_j, ..., t_K]^T,   j = 1, 2, ..., K
where t_j = (n_{jI} / n_I) · log(N / N_j), K is the number of visual words in the visual dictionary, n_{jI} is the frequency of occurrence of the jth visual word in the local visual feature vector set of image I, n_I is the number of visual words appearing in the local visual feature vector set of image I, N is the number of all currently saved history images, and N_j is the number of currently saved history images whose visual bag-of-words feature contains the jth visual word.
4. The closed-loop image appearance-based detection method in monocular vision SLAM as claimed in claim 3, wherein the visual dictionary is constructed offline by using the following method:
step 1, collecting a group of environment scene images in advance and extracting local visual features of the images respectively, wherein all local visual feature vectors form a training sample set;
step 2, clustering the training sample set, and constructing a visual dictionary by taking the obtained clustering centers as visual words, wherein each clustering center is a visual word; the method specifically comprises the following steps:
step 201, setting an initial number of cluster categories K;
step 202, performing fuzzy K-means clustering on the training sample set according to the current number of cluster categories K; in each iteration step, assigning each sample to a cluster center according to the maximum-membership criterion, where the membership R_{lj} of the lth sample to the jth cluster center is given by:
R_{lj} = (1 / ||D_l − V_j||^2) / Σ_{m=1}^{K} (1 / ||D_l − V_m||^2),   1 ≤ l ≤ M, 1 ≤ j ≤ K
where D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, and K is the number of cluster categories;
and updating the cluster centers according to the following formula:
V_j(t+1) = V_j(t) + [Σ_{l=1}^{M} R_{lj}(t) · (D_l − V_j(t))] / [Σ_{l=1}^{M} R_{lj}(t)]
where V_j(t) and V_j(t+1) respectively denote the jth cluster center in the tth and (t+1)th iteration steps, and R_{lj}(t) denotes the membership of the lth sample to the jth cluster center in the tth iteration step;
step 203, judging whether the TSC value of the current clustering result is within a preset range; if so, going to step 204; if not, changing the current number of cluster categories K and going to step 202; the TSC value is calculated according to the following formula:
TSC(V, K) = [ (1/M) Σ_{j=1}^{K} Σ_{l=1}^{M} R_{lj}^2 ||D_l − V_j||^2 ] / min_{j1,j2} ||V_{j1} − V_{j2}||^2
where min_{j1,j2} ||V_{j1} − V_{j2}||^2 is the squared distance between the two nearest cluster centers V_{j1} and V_{j2} among the K cluster centers, D_l denotes the lth sample in the training sample set, V_j denotes the jth cluster center, M is the number of samples in the training sample set, K is the number of cluster categories, and R_{lj} is the membership of the lth sample to the jth cluster center;
step 204, constructing a visual dictionary with the K cluster centers as visual words, each cluster center being one visual word.
5. The closed-loop detection method based on image appearance in monocular vision SLAM as claimed in claim 3, wherein the SURF or SIFT operator is adopted to extract the local visual features of the image.
CN2012102951816A 2012-08-20 2012-08-20 Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping) Pending CN102831446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102951816A CN102831446A (en) 2012-08-20 2012-08-20 Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102951816A CN102831446A (en) 2012-08-20 2012-08-20 Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping)

Publications (1)

Publication Number Publication Date
CN102831446A true CN102831446A (en) 2012-12-19

Family

ID=47334572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102951816A Pending CN102831446A (en) 2012-08-20 2012-08-20 Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping)

Country Status (1)

Country Link
CN (1) CN102831446A (en)

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901885A (en) * 2012-12-28 2014-07-02 联想(北京)有限公司 Information processing method and information processing device
CN104062977A (en) * 2014-06-17 2014-09-24 天津大学 Full-autonomous flight control method for quadrotor unmanned aerial vehicle based on vision SLAM
CN104200483A (en) * 2014-06-16 2014-12-10 南京邮电大学 Human body central line based target detection method under multi-camera environment
CN104374395A (en) * 2014-03-31 2015-02-25 南京邮电大学 Graph-based vision SLAM (simultaneous localization and mapping) method
CN104464733A (en) * 2014-10-28 2015-03-25 百度在线网络技术(北京)有限公司 Multi-scene managing method and device of voice conversation
CN104964683A (en) * 2015-06-04 2015-10-07 上海物景智能科技有限公司 Closed loop correction method for indoor environment map creation
CN105203092A (en) * 2014-06-30 2015-12-30 联想(北京)有限公司 Information processing method and device and electronic equipment
CN105527968A (en) * 2014-09-29 2016-04-27 联想(北京)有限公司 Information processing method and information processing device
CN105865462A (en) * 2015-01-19 2016-08-17 北京雷动云合智能技术有限公司 Three dimensional SLAM method based on events with depth enhanced vision sensor
CN106339001A (en) * 2015-07-09 2017-01-18 松下电器(美国)知识产权公司 Map Production Method, Mobile Robot, And Map Production System
CN106575280A (en) * 2014-07-22 2017-04-19 香港科技大学 System and methods for analysis of user-associated images to generate non-user generated labels and utilization of the generated labels
CN106778767A (en) * 2016-11-15 2017-05-31 电子科技大学 Visual pattern feature extraction and matching process based on ORB and active vision
CN106840148A (en) * 2017-01-24 2017-06-13 东南大学 Wearable positioning and path guide method based on binocular camera under outdoor work environment
JP2017162457A (en) * 2016-03-11 2017-09-14 株式会社東芝 Image analysis system and method
CN107529650A (en) * 2017-08-16 2018-01-02 广州视源电子科技股份有限公司 Network model construction and closed loop detection method, corresponding device and computer equipment
CN108182271A (en) * 2018-01-18 2018-06-19 维沃移动通信有限公司 A kind of photographic method, terminal and computer readable storage medium
CN108229416A (en) * 2018-01-17 2018-06-29 苏州科技大学 Robot SLAM methods based on semantic segmentation technology
CN108256563A (en) * 2018-01-09 2018-07-06 深圳市沃特沃德股份有限公司 Visual dictionary closed loop detection method and device based on distance metric
CN108287550A (en) * 2018-02-01 2018-07-17 速感科技(北京)有限公司 The method of SLAM systems and construction data correlation based on data correlation and error detection
CN108647307A (en) * 2018-05-09 2018-10-12 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and storage medium
CN109242899A (en) * 2018-09-03 2019-01-18 北京维盛泰科科技有限公司 A kind of real-time positioning and map constructing method based on online visual dictionary
CN109272021A (en) * 2018-08-22 2019-01-25 广东工业大学 A kind of intelligent mobile robot air navigation aid based on width study
CN109584302A (en) * 2018-11-27 2019-04-05 北京旷视科技有限公司 Camera pose optimization method, device, electronic equipment and computer-readable medium
CN109902619A (en) * 2019-02-26 2019-06-18 上海大学 Image closed loop detection method and system
WO2019136612A1 (en) * 2018-01-09 2019-07-18 深圳市沃特沃德股份有限公司 Distance measurement-based visual dictionary closed-loop detection method and device
CN110126846A (en) * 2019-05-24 2019-08-16 北京百度网讯科技有限公司 Representation method, device, system and the storage medium of Driving Scene
CN110165657A (en) * 2018-08-30 2019-08-23 中国南方电网有限责任公司 Consider substation's load characteristics clustering analysis method of user's industry attribute
CN110390356A (en) * 2019-07-03 2019-10-29 Oppo广东移动通信有限公司 Visual dictionary generation method and device, storage medium
CN110443263A (en) * 2018-05-02 2019-11-12 北京京东尚科信息技术有限公司 Closed loop detection method and device
WO2019233299A1 (en) * 2018-06-05 2019-12-12 杭州海康机器人技术有限公司 Mapping method and apparatus, and computer readable storage medium
CN110570465A (en) * 2018-06-05 2019-12-13 杭州海康机器人技术有限公司 real-time positioning and map construction method and device and computer readable storage medium
CN110633336A (en) * 2018-06-05 2019-12-31 杭州海康机器人技术有限公司 Method and device for determining laser data search range and storage medium
CN110781841A (en) * 2019-10-29 2020-02-11 北京影谱科技股份有限公司 Closed loop detection method and device based on SLAM space invariant information
CN110852327A (en) * 2019-11-07 2020-02-28 首都师范大学 Image processing method, image processing device, electronic equipment and storage medium
CN111812978A (en) * 2020-06-12 2020-10-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Cooperative SLAM method and system for multiple unmanned aerial vehicles
CN111882663A (en) * 2020-07-03 2020-11-03 广州万维创新科技有限公司 Visual SLAM closed-loop detection method achieved by fusing semantic information
CN112651988A (en) * 2021-01-13 2021-04-13 重庆大学 Finger-shaped image segmentation, finger-shaped plate dislocation and fastener abnormality detection method based on double-pointer positioning
CN113191435A (en) * 2021-05-07 2021-07-30 南京邮电大学 Image closed-loop detection method based on improved visual dictionary tree
CN115410140A (en) * 2022-11-02 2022-11-29 中国船舶集团有限公司第七〇七研究所 Image detection method, device, equipment and medium based on marine target
US11625870B2 (en) 2017-07-31 2023-04-11 Oxford University Innovation Limited Method of constructing a model of the motion of a mobile device and related systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Bo (李博): "Research on Visual Loop Closure Detection for Mobile Robots Based on Scene Appearance Modeling", China Doctoral Dissertations Full-text Database *
Liang Zhiwei (梁志伟) et al.: "Harmonious Navigation of Service Robots Integrating Human Motion Pattern Analysis", Journal of Southeast University (Natural Science Edition) *

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901885A (en) * 2012-12-28 2014-07-02 联想(北京)有限公司 Information processing method and information processing device
CN104374395A (en) * 2014-03-31 2015-02-25 南京邮电大学 Graph-based vision SLAM (simultaneous localization and mapping) method
CN104200483A (en) * 2014-06-16 2014-12-10 南京邮电大学 Human body central line based target detection method under multi-camera environment
CN104200483B (en) * 2014-06-16 2018-05-18 南京邮电大学 Object detection method based on human body center line in multi-cam environment
CN104062977A (en) * 2014-06-17 2014-09-24 天津大学 Full-autonomous flight control method for quadrotor unmanned aerial vehicle based on vision SLAM
CN104062977B (en) * 2014-06-17 2017-04-19 天津大学 Full-autonomous flight control method for quadrotor unmanned aerial vehicle based on vision SLAM
CN105203092B (en) * 2014-06-30 2018-12-14 联想(北京)有限公司 A kind of information processing method, device and electronic equipment
CN105203092A (en) * 2014-06-30 2015-12-30 联想(北京)有限公司 Information processing method and device and electronic equipment
CN106575280A (en) * 2014-07-22 2017-04-19 香港科技大学 System and methods for analysis of user-associated images to generate non-user generated labels and utilization of the generated labels
CN105527968A (en) * 2014-09-29 2016-04-27 联想(北京)有限公司 Information processing method and information processing device
CN104464733A (en) * 2014-10-28 2015-03-25 百度在线网络技术(北京)有限公司 Multi-scene managing method and device of voice conversation
CN104464733B (en) * 2014-10-28 2019-09-20 百度在线网络技术(北京)有限公司 A kind of more scene management method and devices of voice dialogue
CN105865462A (en) * 2015-01-19 2016-08-17 北京雷动云合智能技术有限公司 Three dimensional SLAM method based on events with depth enhanced vision sensor
CN105865462B (en) * 2015-01-19 2019-08-06 北京雷动云合智能技术有限公司 The three-dimensional S LAM method based on event with depth enhancing visual sensor
CN104964683B (en) * 2015-06-04 2018-06-01 上海物景智能科技有限公司 A kind of closed-loop corrected method of indoor environment map building
CN104964683A (en) * 2015-06-04 2015-10-07 上海物景智能科技有限公司 Closed loop correction method for indoor environment map creation
CN106339001A (en) * 2015-07-09 2017-01-18 松下电器(美国)知识产权公司 Map Production Method, Mobile Robot, And Map Production System
CN106339001B (en) * 2015-07-09 2021-01-08 松下电器(美国)知识产权公司 Map generation method, mobile robot, and map generation system
JP2017162457A (en) * 2016-03-11 2017-09-14 株式会社東芝 Image analysis system and method
CN106778767A (en) * 2016-11-15 2017-05-31 电子科技大学 Visual pattern feature extraction and matching process based on ORB and active vision
CN106778767B (en) * 2016-11-15 2020-08-11 电子科技大学 Visual image feature extraction and matching method based on ORB and active vision
CN106840148B (en) * 2017-01-24 2020-07-17 东南大学 Wearable positioning and path guiding method based on binocular camera under outdoor working environment
CN106840148A (en) * 2017-01-24 2017-06-13 东南大学 Wearable positioning and path guide method based on binocular camera under outdoor work environment
US11625870B2 (en) 2017-07-31 2023-04-11 Oxford University Innovation Limited Method of constructing a model of the motion of a mobile device and related systems
CN107529650A (en) * 2017-08-16 2018-01-02 广州视源电子科技股份有限公司 Network model construction and closed loop detection method, corresponding device and computer equipment
CN108256563A (en) * 2018-01-09 2018-07-06 深圳市沃特沃德股份有限公司 Visual dictionary closed loop detection method and device based on distance metric
WO2019136612A1 (en) * 2018-01-09 2019-07-18 深圳市沃特沃德股份有限公司 Distance measurement-based visual dictionary closed-loop detection method and device
CN108256563B (en) * 2018-01-09 2020-05-26 深圳市无限动力发展有限公司 Visual dictionary closed-loop detection method and device based on distance measurement
CN108229416A (en) * 2018-01-17 2018-06-29 苏州科技大学 Robot SLAM methods based on semantic segmentation technology
CN108229416B (en) * 2018-01-17 2021-09-10 苏州科技大学 Robot SLAM method based on semantic segmentation technology
CN108182271A (en) * 2018-01-18 2018-06-19 维沃移动通信有限公司 Photographing method, terminal and computer readable storage medium
CN108182271B (en) * 2018-01-18 2020-11-17 维沃移动通信有限公司 Photographing method, terminal and computer readable storage medium
CN108287550B (en) * 2018-02-01 2020-09-11 速感科技(北京)有限公司 SLAM system based on data association and error detection and method for constructing data association
CN108287550A (en) * 2018-02-01 2018-07-17 速感科技(北京)有限公司 SLAM system based on data correlation and error detection, and method for constructing data correlation
CN110443263A (en) * 2018-05-02 2019-11-12 北京京东尚科信息技术有限公司 Closed loop detection method and device
CN108647307A (en) * 2018-05-09 2018-10-12 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and storage medium
WO2019233299A1 (en) * 2018-06-05 2019-12-12 杭州海康机器人技术有限公司 Mapping method and apparatus, and computer readable storage medium
CN110570465A (en) * 2018-06-05 2019-12-13 杭州海康机器人技术有限公司 real-time positioning and map construction method and device and computer readable storage medium
CN110633336A (en) * 2018-06-05 2019-12-31 杭州海康机器人技术有限公司 Method and device for determining laser data search range and storage medium
CN110633336B (en) * 2018-06-05 2022-08-05 杭州海康机器人技术有限公司 Method and device for determining laser data search range and storage medium
CN110570465B (en) * 2018-06-05 2022-05-20 杭州海康机器人技术有限公司 Real-time positioning and map construction method and device and computer readable storage medium
CN109272021B (en) * 2018-08-22 2022-03-04 广东工业大学 Intelligent mobile robot navigation method based on width learning
CN109272021A (en) * 2018-08-22 2019-01-25 广东工业大学 Intelligent mobile robot navigation method based on width learning
CN110165657A (en) * 2018-08-30 2019-08-23 中国南方电网有限责任公司 Clustering analysis method for substation load characteristics considering user industry attributes
CN109242899A (en) * 2018-09-03 2019-01-18 北京维盛泰科科技有限公司 Real-time positioning and map building method based on online visual dictionary
CN109242899B (en) * 2018-09-03 2022-04-19 北京维盛泰科科技有限公司 Real-time positioning and map building method based on online visual dictionary
CN109584302B (en) * 2018-11-27 2023-12-01 北京旷视科技有限公司 Camera pose optimization method, camera pose optimization device, electronic equipment and computer readable medium
CN109584302A (en) * 2018-11-27 2019-04-05 北京旷视科技有限公司 Camera pose optimization method, device, electronic equipment and computer-readable medium
CN109902619A (en) * 2019-02-26 2019-06-18 上海大学 Image closed loop detection method and system
CN110126846A (en) * 2019-05-24 2019-08-16 北京百度网讯科技有限公司 Representation method, device, system and storage medium for driving scenes
CN110390356A (en) * 2019-07-03 2019-10-29 Oppo广东移动通信有限公司 Visual dictionary generation method and device, storage medium
CN110390356B (en) * 2019-07-03 2022-03-08 Oppo广东移动通信有限公司 Visual dictionary generation method and device and storage medium
CN110781841A (en) * 2019-10-29 2020-02-11 北京影谱科技股份有限公司 Closed loop detection method and device based on SLAM space invariant information
CN110852327A (en) * 2019-11-07 2020-02-28 首都师范大学 Image processing method, image processing device, electronic equipment and storage medium
CN111812978A (en) * 2020-06-12 2020-10-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Cooperative SLAM method and system for multiple unmanned aerial vehicles
CN111882663A (en) * 2020-07-03 2020-11-03 广州万维创新科技有限公司 Visual SLAM closed-loop detection method achieved by fusing semantic information
CN112651988A (en) * 2021-01-13 2021-04-13 重庆大学 Finger-shaped image segmentation, finger-shaped plate dislocation and fastener abnormality detection method based on double-pointer positioning
CN113191435B (en) * 2021-05-07 2022-08-23 南京邮电大学 Image closed-loop detection method based on improved visual dictionary tree
CN113191435A (en) * 2021-05-07 2021-07-30 南京邮电大学 Image closed-loop detection method based on improved visual dictionary tree
CN115410140A (en) * 2022-11-02 2022-11-29 中国船舶集团有限公司第七〇七研究所 Image detection method, device, equipment and medium based on marine target

Similar Documents

Publication Title
CN102831446A (en) Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping)
CN111553193B (en) Visual SLAM closed-loop detection method based on lightweight deep neural network
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN107515895B (en) Visual target retrieval method and system based on target detection
Lynen et al. Placeless place-recognition
CN107633226B (en) Human body motion tracking feature processing method
CN109784223B (en) Multi-temporal remote sensing image matching method and system based on convolutional neural network
Liu et al. Visual loop closure detection with a compact image descriptor
CN110209859B (en) Method and device for recognizing places and training models of places and electronic equipment
CN109919241B (en) Hyperspectral unknown class target detection method based on probability model and deep learning
CN110097060B (en) Open set identification method for trunk image
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN110084149B (en) Face verification method based on hard sample quadruple dynamic boundary loss function
CN111401144A (en) Escalator passenger behavior identification method based on video monitoring
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN107066951B (en) Face spontaneous expression recognition method and system
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN110880010A (en) Visual SLAM closed loop detection algorithm based on convolutional neural network
CN110716792B (en) Target detector and construction method and application thereof
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN109344720B (en) Emotional state detection method based on self-adaptive feature selection
Raparthi et al. Machine Learning Based Deep Cloud Model to Enhance Robustness and Noise Interference
CN111985332A (en) Gait recognition method for improving loss function based on deep learning
CN115393631A (en) Hyperspectral image classification method based on Bayesian layer graph convolution neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20121219)