CN111767905A - Improved image representation method based on landmark-convolution features - Google Patents


Info

Publication number
CN111767905A
Authority
CN
China
Prior art keywords
image
landmark
convolution
closed
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010903567.5A
Other languages
Chinese (zh)
Inventor
王燕清
王寅同
石朝侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xiaozhuang University
Original Assignee
Nanjing Xiaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xiaozhuang University filed Critical Nanjing Xiaozhuang University
Priority to CN202010903567.5A priority Critical patent/CN111767905A/en
Publication of CN111767905A publication Critical patent/CN111767905A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an improved image representation method based on landmark-convolution features, providing a robust closed-loop detection method based on scene recognition and aiming at the problems that accumulated trajectory drift and loss of camera pose tracking easily occur when a robot moves in a large-scale environment. Firstly, salient dynamic objects in a scene image frame are identified with a target detection network, and the corresponding regions are blurred so as to filter out the influence of dynamic factors. The preprocessed image is then taken as the input of a convolutional neural network, a landmark sequence of the scene frame is generated directly from the last convolutional layer, and the convolution features of the landmarks are extracted with an unsupervised deep neural network, yielding an image description based on landmark-convolution features. Accumulated trajectory errors are corrected according to whether a scene similar to the current scene frame exists in the scene database, and relocation is performed when pose tracking is lost. Experimental results show that the method performs excellently in closed-loop detection.

Description

Improved image representation method based on landmark-convolution features
Technical Field
The invention relates to the technical field of mobile robot positioning, in particular to an improved image representation method based on landmark-convolution features.
Background
Mobile robotics is currently a frontier field with wide application and great promise. It integrates theoretical results from many disciplines, such as artificial intelligence, sensor technology, signal processing, automatic control engineering, computer technology, and industrial design, and is widely applied in industry, agriculture, the service sector, medical treatment, national defense, and other fields. Mobile robots can assist or replace human labor, and their application is especially important in places humans cannot reach or where they would be in danger, such as space and underwater exploration. Before the SLAM method was proposed, localization and mapping could not be carried out simultaneously, and localization had to rely on an existing map. In most tasks, however, a mobile robot operates in an unknown environment: neither a map prepared in advance nor the current position is available. At the 1986 IEEE Robotics and Automation Conference, researchers put forward the concept of probabilistic SLAM (Simultaneous Localization and Mapping): current pose information is estimated from repeatedly observed map data, and a map is then constructed incrementally from that pose information, achieving simultaneous localization and mapping in an unknown environment. Since then, SLAM technology has held an important position in robotics research as a core link in realizing autonomous navigation of mobile robots.
A typical visual SLAM system consists of several modules: visual odometry, back-end optimization, closed-loop detection, and mapping. First, images and other information are collected by sensors mounted on the robot; the motion between adjacent images is then estimated from the read information and the local spatial structure of the scene is recovered; finally, a corresponding map is built according to the application requirements. If only visual odometry is used for localization and mapping, errors inevitably accumulate because the current position and map depend only on the previous moment. Visual SLAM therefore adopts back-end optimization to locally refine the camera poses and map estimated by the visual odometry at adjacent moments, performs global optimization according to the feedback of closed-loop detection, and finally obtains a globally consistent trajectory and map. Closed-loop detection eliminates the robot's accumulated error by detecting whether the robot has returned to a previously recognized scene; once the system detects a closed loop, this information is provided to the back end. Closed-loop detection is an essential link in SLAM for constructing globally consistent trajectories and maps: good closed-loop detection can eliminate accumulated drift of the motion trajectory, and can recognize camera tracking loss caused by weather changes, viewpoint changes, occlusion, dynamic environments, and the like, and perform relocation. Mainstream visual SLAM systems such as LSD-SLAM, ORB-SLAM, and LDSO are not robust enough under extreme appearance changes, viewpoint changes, or dynamic object disturbances in the environment. With the successful application of deep learning to visual scene recognition, generating an image representation from convolution features can eliminate closed-loop false detections caused by appearance changes due to weather, seasonal, or time-of-day variations.
Relying on landmark regions rather than whole image features to describe a scene can significantly improve robustness when there is viewpoint change or partial occlusion in the scene.
Chen et al. extract convolution features with an Overfeat network as a global image descriptor, but the descriptor is too large to detect closed loops in real time. Bai et al. propose using deep learning techniques to extract robust features that replace the original features in SeqSLAM. In both methods, the convolution features are extracted from a general-purpose neural network rather than a network dedicated to closed-loop detection. For this reason, Gomez-Ojeda et al. designed a targeted convolutional neural network for recognizing scenes. Chen et al. further trained such dedicated networks on sufficiently large and varied data sets, in which the training images were shot at thousands of different places with a large amount of appearance variation. These network architectures rely on supervised learning and require labeled images as training data. Merrill et al. constructed an unsupervised deep neural network architecture specifically for closed-loop detection; its key advantage is that the convolution features extracted from the network are lighter and more compact than all of the above. Such convolution features still do not handle viewpoint invariance well, because they resemble global features describing the entire image. Research has found that image representations generated from the convolution features of landmarks can remarkably improve the robustness of closed-loop detection under viewpoint changes; however, these methods require special landmark detectors to identify regions of interest (ROIs) in the loop detection task.
The last few convolutional layers of a convolutional neural network usually embed very rich semantic information corresponding to image regions that are meaningful for the closed-loop detection task. The invention provides a method based on landmark-convolution features that adopts a brand-new landmark generation mechanism: regions of interest (ROIs) are identified directly in the image according to the activation values of a convolutional layer, without any landmark detector. The convolution features of the landmarks are then extracted using an unsupervised deep neural network designed specifically for the closed-loop detection task. The resulting closed-loop detection enjoys both viewpoint invariance and appearance invariance, and dynamic objects clearly present in the environment are filtered out, further improving its robustness.
Disclosure of Invention
It is an object of the present invention to provide an improved image representation method based on landmark-convolution features to solve the problems set forth in the background art described above.
Closed-loop detection is essentially a visual scene recognition problem; its core is how to generate image representations so that the similarity between images can be computed to detect whether a closed loop has occurred. Closed-loop detection algorithms face two significant challenges: 1. appearance changes caused by weather, occlusion, and dynamic objects; 2. viewpoint changes caused by the camera shooting position and the like. The traditional approach is to generate an image representation from visual features extracted from the image and then speed up the matching of image descriptors with a bag-of-words model. Two types of visual features are usually used: local features such as SIFT, SURF, and ORB, and global visual features such as GIST and HOG.
The closed-loop detection module adopts SURF features combined with a bag-of-words model. On the basis of DSO, Gao et al. added closed-loop detection and pose graph optimization, turning the system into a complete direct-method visual SLAM system. Whether a closed loop occurs is detected by combining ORB features with a bag-of-words model; thanks to the closed-loop detection module, the algorithm can easily relocalize and keep operating effectively even if tracking is lost, and the localization and map reconstruction performance and precision of LDSO with closed-loop detection are clearly superior to those of pure visual odometry.
Descriptors based on local features are robust to viewpoint changes but are not suitable for handling appearance changes, while global feature descriptors perform well under environmental changes but poorly when viewpoint changes and occlusions are present. Thus, neither local nor global visual features provide satisfactory performance when lighting, occlusion, viewpoint, and other factors change in combination.
With the successful application of deep learning in robotics and computer vision, methods based on convolution features have shown clearer advantages in closed-loop detection than traditional visual features, especially in environments with illumination changes. Compared with local visual features, convolution features have better invariance to the environment; compared with global visual features, they have better semantic recognition capability.
The scene-based loop detection process can be described as follows: given a query frame I_q and a database of N images D = {I_1, I_2, …, I_N}, the purpose of closed-loop detection is to find in the database the reference frame I_m shot in the same scene as I_q.
The closed-loop detection method based on landmark-convolution features can generate landmarks directly, extract their convolution features with an unsupervised deep learning network, and take into account the influence of dynamic factors in the environment on the generated image representation. The structure of the method is shown in figure 1 and mainly comprises four parts:
a. image preprocessing: dynamic factors in the scene image frame are first identified with a target detection network, and the corresponding regions are then filtered with image filtering to remove dynamic objects from the scene;
b. landmark generation: the preprocessed image is input into a pre-trained convolutional neural network, and regions of interest are identified directly from the last convolutional layer; this is done for each query frame and each database image to generate landmark identifiers;
c. convolution feature extraction: a convolution feature descriptor is extracted from each landmark generated from the image with an unsupervised deep neural network, yielding the corresponding feature vector;
d. scene retrieval: finally, the overall similarity between the query frame and each database image is calculated from the matched landmark pairs so as to determine the best matching reference frame for the query frame.
1. Filtering out dynamic objects in a scene
In recent years, target detection network models such as the R-CNN series and YOLO have achieved excellent results in detecting and locating objects in a scene, reaching satisfactory accuracy and precision. R-CNN-based target detection first searches the image for candidate regions that may contain objects and then classifies each candidate region, which greatly improves the efficiency of object recognition and localization. YOLO is another object detection framework that creatively merges the candidate-region and object-recognition stages; in reality YOLO does not truly remove candidate regions but adopts predefined ones. YOLO-based methods have evolved through the YOLOv1, YOLOv2, YOLOv3, and YOLOv4 versions, each faster and more accurate than the last.
Dynamic objects present in a scene, such as pedestrians and automobiles, can greatly influence the representation of an image and ultimately cause erroneous loop judgments. To construct a robust and stable closed-loop detection method, the problem of dynamic objects cannot be ignored: the dynamic objects must be detected in the scene and then filtered out by image processing.
A target detection network can identify most of the dynamic objects in a scene. Considering that YOLO processes images faster than other target detection networks while still meeting the requirements of dynamic-object detection in the closed-loop detection task, YOLOv4 is used in the image preprocessing stage as the tool for detecting dynamic factors in the scene. Since the model pre-trained on the Pascal VOC dataset can correctly distinguish most dynamic objects appearing in the closed-loop detection task, the publicly provided pre-trained model is used directly without retraining.
After the regions of the image containing dynamic objects are detected, these regions are processed with image average blurring while preserving image detail as much as possible, thereby reducing or eliminating the influence of dynamic objects in the environment on the finally generated image representation. Although this idea of filtering out dynamic factors in a scene is simple, experiments verify that it is effective. Adding only an image preprocessing step, namely a fast object detection network plus simple image filtering, improves the precision of the closed-loop detection task, and constitutes a novel way of filtering out dynamics.
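A minimal sketch of this preprocessing step. The bounding boxes here would come from a YOLOv4 detector; for illustration they are hard-coded, and each dynamic-object region is flattened to its mean intensity, an extreme form of the average blur described above (a real implementation would apply a box filter inside the box):

```python
import numpy as np

def filter_dynamic_objects(image, boxes):
    """image: (H, W) float array; boxes: list of (x0, y0, x1, y1) detections.

    Replace each detected dynamic-object region by its mean intensity, so the
    object's texture no longer influences the generated image representation.
    """
    out = image.copy()
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1].mean()
    return out

img = np.arange(100, dtype=float).reshape(10, 10)
# pretend a 3x3 "pedestrian" was detected at columns 2..4, rows 2..4
clean = filter_dynamic_objects(img, [(2, 2, 5, 5)])
```

The rest of the image is left untouched, so static scene structure survives for landmark generation.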
2. Generating landmarks from the regions of interest of an image
a. Each frame of the dynamically filtered image is taken as the input of the convolutional neural network AlexNet, and the feature maps corresponding to the image are output directly by the last convolutional layer of the network;
b. each non-zero activation value of the feature maps is grouped with its 8 surrounding activation values into one cluster, recorded as C = {C_1, C_2, …, C_M}, where M denotes the number of clusters in an image; the energy value E_i of each cluster C_i can be calculated as:

    E_i = Σ_{j=1}^{|C_i|} a_{i,j}

where |C_i| denotes the size of the i-th cluster and a_{i,j} represents the j-th activation value of C_i;
c. after the energy values of the M clusters are obtained, the T clusters with the largest energy values are mapped back to the original image as the finally generated landmark set, recorded as:

    L(I) = {l_1, l_2, …, l_T}
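Steps (b) and (c) can be sketched as follows: the non-zero activations of the final conv-layer feature map are grouped into 8-connected clusters, each cluster's energy is taken as the sum of its activation values, and the T most energetic clusters become the landmarks. The feature-map values below are illustrative, not taken from a real network:

```python
import numpy as np

def generate_landmarks(fmap, T=2):
    """Cluster nonzero activations of fmap (8-connectivity) and return the
    coordinate lists of the T clusters with the largest energy E_i."""
    H, W = fmap.shape
    seen = np.zeros((H, W), dtype=bool)
    clusters = []
    for i in range(H):
        for j in range(W):
            if fmap[i, j] != 0 and not seen[i, j]:
                # flood fill over the 8-neighbourhood
                stack, members = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    members.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < H and 0 <= nx < W
                                    and fmap[ny, nx] != 0 and not seen[ny, nx]):
                                seen[ny, nx] = True
                                stack.append((ny, nx))
                clusters.append(members)
    # energy E_i = sum of the activation values a_{i,j} in cluster C_i
    energies = [sum(fmap[y, x] for y, x in c) for c in clusters]
    order = np.argsort(energies)[::-1][:T]
    return [clusters[k] for k in order]   # landmark set L(I)

fmap = np.zeros((6, 6))
fmap[0:2, 0:2] = 1.0   # 4-cell cluster, energy 4
fmap[3, 0] = 2.0       # single cell, energy 2
fmap[4, 4] = 10.0      # single cell, energy 10
landmarks = generate_landmarks(fmap, T=2)  # keeps the energy-10 and energy-4 clusters
```

In the actual method the retained clusters are then mapped back through the network's receptive field to rectangular regions of the original image.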
3. Extracting convolution features
For each generated landmark, a convolution feature descriptor is extracted using the constructed unsupervised convolutional auto-encoder network. A landmark is taken as input; X denotes its HOG feature and X̂ denotes the feature descriptor reconstructed by the auto-encoding model. When training is finished, the network has learned to reconstruct HOG features. Because HOG features extracted from inputs of the same size have the same dimensionality, the Euclidean distance can be used as the distance metric between HOG descriptors. The loss layer uses the linear rectification (ReLU) activation and compares X with its reconstruction X̂ through the loss function:

    L = ||X − X̂||₂
the parameter settings of the network are shown in fig. 2.
The network has been shown to be fast and reliable and can detect closed loops in real time without reducing the dimensionality of the extracted convolution features; experiments show that the HOG features learned by the network detect loops notably better than the original HOG features, so the network can replace a general-purpose neural network in a convolution-feature-based closed-loop detection system. Since the network does not require context-specific training, the pre-trained model can be applied directly to extract features from the dataset images used in the experiments. The feature vector extracted from any landmark generated from image I is denoted f(l_i), with a feature dimension of 1064; for any one image, the total feature dimension is therefore T × 1064.
4. Computing similarity
To calculate the similarity score between I_q and I_m, all landmarks extracted from the two images are cross-matched. The cosine distance measures the similarity between a landmark u of I_q and a landmark v of I_m:

    s(u, v) = cos(f_u, f_v) = (f_u · f_v) / (||f_u|| ||f_v||)

where cos(f_u, f_v) is the cosine distance between u and v, and ||f_u|| and ||f_v|| respectively denote the lengths of the convolution feature vectors extracted for landmark u in I_q and landmark v in I_m.
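The cosine measure above can be sketched in a few lines; the random 1064-dimensional vectors stand in for the auto-encoder descriptors:

```python
import numpy as np

def landmark_similarity(f_u, f_v):
    """Cosine similarity between two landmark feature vectors."""
    return float(np.dot(f_u, f_v) / (np.linalg.norm(f_u) * np.linalg.norm(f_v)))

rng = np.random.default_rng(0)
f_u = rng.standard_normal(1064)
s_self = landmark_similarity(f_u, 2.0 * f_u)  # invariant to vector length: 1.0
```

Because the measure normalizes by the vector lengths, it depends only on the direction of the descriptors, not their magnitude.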
A simple linear search is used to determine the matches between all landmarks of I_q and I_m, and cross-checking is applied so that only landmarks that match each other mutually are accepted.
For each matched landmark pair (u, v), a weight W_{u,v} is determined according to their region sizes:

    W_{u,v} = 1 − (|h_u − h_v| + |w_u − w_v|) / (h_u + w_u + h_v + w_v)

where h_u, w_u and h_v, w_v are respectively the heights and widths of the u and v regions, and |h_u − h_v| and |w_u − w_v| respectively represent the absolute values of the height difference and the width difference of the two regions.
Finally, the global similarity score S(I_q, I_m) between I_q and I_m is:

    S(I_q, I_m) = Σ_{(u,v)} W_{u,v} · s(u, v)

where the sum runs over all mutually matched landmark pairs (u, v).
For each query frame I_q, its similarity with every image I_m in the database is calculated; the image with the highest score is the best match of I_q:

    z = argmax_{I_m ∈ D} S(I_q, I_m)

that is, z denotes the reference frame with the highest similarity score to I_q.
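The whole scene-retrieval step can be sketched end to end. This is a hedged illustration, not the patented implementation: the linear search plus mutual cross-check follows the text, while the exact size-based weight formula is an assumption consistent with the quantities defined above (1 for identically sized regions, decreasing with height/width differences):

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_weight(size_u, size_v):
    # assumed form of W_{u,v} from the region sizes (h, w)
    (hu, wu), (hv, wv) = size_u, size_v
    return 1.0 - (abs(hu - hv) + abs(wu - wv)) / (hu + wu + hv + wv)

def global_similarity(query, ref):
    """query/ref: lists of (feature_vector, (h, w)) landmarks for one image."""
    # linear search: best match for each landmark on both sides
    best_q = [max(range(len(ref)), key=lambda j: cos_sim(f, ref[j][0]))
              for f, _ in query]
    best_r = [max(range(len(query)), key=lambda i: cos_sim(query[i][0], f))
              for f, _ in ref]
    score = 0.0
    for i, j in enumerate(best_q):
        if best_r[j] == i:  # cross-check: keep only mutual matches
            score += pair_weight(query[i][1], ref[j][1]) * cos_sim(query[i][0], ref[j][0])
    return score

def retrieve(query, database):
    """z = argmax over database images of the global similarity score."""
    return int(np.argmax([global_similarity(query, ref) for ref in database]))

rng = np.random.default_rng(1)
f1, f2, f3 = rng.standard_normal((3, 1064))
query = [(f1, (10, 10)), (f2, (8, 6))]
db = [
    [(f3, (10, 10))],                 # unrelated scene
    [(f1, (10, 10)), (f2, (9, 6))],   # same scene, slightly resized regions
]
best = retrieve(query, db)  # the second database image should win
```

The cross-check discards one-sided matches, which is what makes the score robust when a landmark appears in only one of the two images.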
Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
To address some of the drawbacks of generating image representations from traditional visual features in conventional closed-loop detection, it is proposed to represent images with landmark-convolution features. This algorithm differs from other landmark-based algorithms in that it requires no additional landmark detector: landmarks are generated directly from the deep convolutional layers of a convolutional neural network to identify salient regions. The algorithm extracts image features with an unsupervised deep neural network specially designed for closed-loop detection instead of a general-purpose network, which further improves its performance. The results show that the algorithm is highly robust whether severe viewpoint changes or extreme appearance changes exist in the environment.
Drawings
FIG. 1 is a block diagram of the closed loop detection method based on landmark-convolution features according to the present invention;
fig. 2 shows the parameter settings of the network according to the invention.
Detailed Description
1. Evaluation index
In a closed-loop detection algorithm, the accuracy and robustness of detecting closed loops are among the criteria for evaluating the algorithm: when the robot moves in an unknown environment with extreme appearance changes and viewpoint changes, a sufficiently robust closed-loop detection method largely eliminates accumulated errors and relocalizes when camera tracking is lost. To quantify the robustness of a closed-loop detection method, two representative indices, Precision and Recall, are generally adopted. Precision is the probability that the closed loops detected by the algorithm are real closed loops; recall is the probability that the real closed loops are detected by the algorithm. The corresponding formulas are:
    Precision = TP / (TP + FP),    Recall = TP / (TP + FN)
Here TP denotes True Positives, i.e. cases that are in fact closed loops and are also detected as closed loops by the algorithm; FP denotes False Positives, i.e. cases that are not closed loops but are detected as such; FN denotes False Negatives, i.e. cases that are in fact closed loops but are not detected. Correspondingly, True Negatives (TN) are cases that are neither closed loops in fact nor detected as closed loops. False positives and false negatives correspond to perceptual aliasing and perceptual variability, both of which affect the accuracy of closed-loop detection in practical applications. Ideally, a good closed-loop detection algorithm correctly detects whether a closed loop exists in both situations, which requires the values of TP and TN to be as high as possible and the values of FP and FN as low as possible.
In fact, precision and recall are a pair of contradictory statistics. When the precision of closed-loop detection is higher, the parameters for judging the existence of a closed loop are stricter, the number of closed loops detected by the algorithm decreases, and some real closed loops in the environment remain undetected, so the recall decreases; when the recall is high, the closed-loop parameters are relatively loose and the algorithm detects more closed loops, but the precision drops because some of them are not true closed loops. Common practice in closed-loop detection is to obtain the recall and precision under each setting and then draw a Precision-Recall curve. In SLAM, the precision requirement is typically the more stringent one: if precision is low, the algorithm reports closed loops that do not actually exist, which causes the optimization to give completely wrong results and the built map to fail. If recall is low, some closed loops go undetected and the constructed map suffers some accumulated error, but even two detected closed loops suffice to eliminate the error they span. Therefore, in SLAM, the highest possible precision is more desirable than recall. In the present invention, the area under the precision-recall curve (AUC), the maximum recall at 100% precision, and the precision at a relatively high recall are used as the evaluation indices of the experiments.
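The evaluation indices reduce to a few lines of arithmetic. A sketch with made-up counts and a made-up three-point precision-recall curve (the AUC is computed with the trapezoidal rule):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def pr_auc(recalls, precisions):
    """Area under a precision-recall curve; recalls sorted ascending."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area

p = precision(tp=90, fp=10)                      # 90 of 100 detections are real
r = recall(tp=90, fn=30)                         # 90 of 120 real loops found
auc = pr_auc([0.0, 0.5, 1.0], [1.0, 1.0, 0.8])   # illustrative curve
```

Raising the detection threshold moves a method up this curve (higher precision, lower recall), which is exactly the trade-off discussed above.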
2. Public data set introduction
To verify that the proposed closed-loop detection method is robust to both appearance changes and viewpoint changes, experiments are conducted on several challenging public data sets containing scene changes common in the real world, such as viewpoint, weather, light, and season. The data sets used are described in detail as follows:
(1) gardens Point dataset
The Gardens Point dataset includes three traversal tracks. One track sequence is shot at night; the other two are shot in the daytime along the left and right sides of a sidewalk, respectively, and exhibit the viewpoint change that occurs when walking on either side of the path as well as slight appearance changes mainly caused by dynamic objects such as pedestrians. The two daytime sequences are used to evaluate the robustness of the proposed unsupervised-deep-learning-based closed-loop detection method to viewpoint changes; the daytime sequence along the right side of the road and the sequence shot at night are used as test data to assess robustness to the appearance changes caused by extreme light changes; and sequences taken along the right side of the road during the day and at night are used to evaluate the robustness of the method when both viewpoint changes and drastic light changes are present in the environment.
(2) Campus Loop dataset
The Campus Loop dataset consists of two image sequences, each containing 100 frames, covering both indoor and outdoor environments. The first sequence was shot on snowy days, with the ground covered with snow in the outdoor environment; the second was shot on sunny days. These two sequences are used to verify the robustness of the proposed closed-loop detection method under the combined appearance and viewpoint changes caused by weather, illumination, and the like.
3. Results of the experiment
To demonstrate the superior performance of the proposed closed-loop detection algorithm based on unsupervised deep learning, the contribution of each component of the method is evaluated first, and the method is also compared with closed-loop detection methods commonly used in classical direct-method visual SLAM: the first comparison method is FAB-MAP, the closed-loop detection method used in LSD-SLAM, and the other is DBoW3, the open-source bag-of-words framework used in LDSO.
(1) Method evaluation
Experimental results on the Campus Loop dataset show the effect of four variants: generating the representation of the whole image with convolution features only; filtering dynamic objects from the scene first and then generating the convolution feature representation of the image; generating landmarks for the original image and then extracting the landmark-convolution feature representation; and the complete method (filtering dynamic objects first, then generating landmarks, and finally extracting convolution features for the landmarks). The corresponding curves are named DeepLC-W, DeepLC-D, DeepLC-L, and DeepLC respectively; in the remaining evaluation experiments, DeepLC denotes the experimental curve of the closed-loop detection algorithm based on unsupervised deep learning.
Comparing the DeepLC-W and DeepLC-L curves, the AUC of the latter reaches 0.94, and its precision remains clearly higher than that of the whole-image convolution feature representation even at high recall; it follows that the landmark-convolution image description is markedly superior to describing the image globally with convolution features alone. From the effects of DeepLC-W and DeepLC-D, it can be seen that filtering dynamic factors out of the scene in the image preprocessing stage helps improve closed-loop detection precision, although compared with the whole-image convolution feature representation the AUC improves by only 0.01 and the maximum recall at 100% precision is almost unchanged. The effect of DeepLC shows, however, that once these techniques are combined the closed-loop detection capability improves greatly: the AUC reaches 0.98 and the maximum recall at 100% precision reaches 70%. Both the landmark-convolution image representation and the dynamic-factor filtering in the preprocessing stage are therefore important and effective components of the closed-loop detection method based on unsupervised deep learning provided by the present invention.
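The AUC and maximum-recall-at-100%-precision figures quoted above can be computed from ranked match scores roughly as follows. This is an illustrative sketch only; the function name and the trapezoidal integration of the precision-recall curve are our own choices, not part of the invention.

```python
import numpy as np

def pr_metrics(scores, labels):
    """Given similarity scores and ground-truth loop labels (1 = true loop),
    compute the area under the precision-recall curve and the maximum recall
    attainable while precision stays at exactly 100%."""
    order = np.argsort(-np.asarray(scores, float))  # rank by descending score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                          # true positives so far
    fp = np.cumsum(1 - labels)                      # false positives so far
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    perfect = precision == 1.0                      # points with no false positive
    max_recall_at_p100 = float(recall[perfect].max()) if perfect.any() else 0.0
    auc = float(np.trapz(precision, recall))        # area under the PR curve
    return auc, max_recall_at_p100
```

Sweeping a decision threshold over the scores in this way produces the precision-recall curves that the DeepLC variants are compared on.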
(2) Viewpoint change robustness assessment
The proposed closed-loop detection method, FAB-MAP and DBoW3 are evaluated on the two daytime trajectory sequences of the Gardens Point dataset, whose image data are collected along the left and right sides of the road respectively. Experimental results show that when only viewpoint changes exist in the environment, the closed-loop detection method based on unsupervised deep learning achieves a nearly perfect effect, with an AUC value as high as 1. The effects of FAB-MAP and DBoW3 are inferior to the closed-loop detection method provided by the invention; although both comparison methods are based on local visual features and are theoretically robust to viewpoint changes in the environment, the effect of filtering targets out of the scene in the image preprocessing stage of the present method cannot be ignored, and convolution features outperform hand-crafted features in scene recognition. It can therefore be concluded that the loop detection method provided by the invention is viewpoint-invariant.
(3) Illumination change robustness assessment
The closed-loop detection method proposed by the present invention, FAB-MAP and DBoW3 are also evaluated on two image sequences with strong illumination changes in the scene. In this setting the DBoW3 method is almost ineffective and provides no convincing precision or recall. FAB-MAP fares much better, with precision above 70% at 50% recall, while the method based on unsupervised deep learning proposed by the present invention is the best of the three: it has a clear advantage in AUC value, in the recall corresponding to 100% precision, and in the precision corresponding to higher recall. Thus, even under strong illumination changes in the scene, closed-loop detection based on unsupervised deep learning remains reliable.
(4) Viewpoint and illumination change robustness assessment
The two preceding experiments prove that closed-loop detection based on unsupervised deep learning achieves satisfactory capability both under viewpoint changes and under the extreme appearance changes caused by illumination. To evaluate the method when both changes exist in the environment simultaneously, daytime trajectory images taken along the left side of the road and evening trajectory images taken along the right side of the road from the Gardens Point dataset are selected; these contain both viewpoint change and strong illumination change, and the daytime scenes additionally contain pedestrian interference. DBoW3 remains ineffective in this scenario. The effect of FAB-MAP degrades only slightly compared with illumination change alone, keeping the same AUC and the same recall at 100% precision, but its precision drops at higher recall. Loop detection based on unsupervised deep learning remains the best method, with the highest AUC, recall and precision in the comparison.
(5) Comprehensive change assessment
Experiments on the Campus Loop dataset compare the proposed method with the two comparison methods when seasonal changes, viewpoint changes, indoor-outdoor switching, slight illumination changes and dynamic objects are all present in the scene, the dataset containing all of these change conditions. The method based on unsupervised deep learning still performs very well, achieving 70% recall at 100% precision. FAB-MAP and DBoW3 perform almost equally poorly; the former is slightly better, achieving 10% recall at 100% precision and higher precision than DBoW3, but neither has satisfactory closed-loop detection capability under such comprehensive changes.
It is to be noted that, in the present invention, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. An improved landmark-convolution feature based image method, characterized in that: a landmark generation mechanism is adopted in which regions of interest (ROIs) in an image are identified directly from the activation values of a convolutional layer, and the convolution feature of each landmark is extracted by an unsupervised deep neural network specially designed for the closed-loop detection task, so that the closed-loop detection is viewpoint-invariant and appearance-invariant; dynamic objects obviously present in the environment are filtered out; the method mainly comprises four parts:
a. image preprocessing: firstly identifying dynamic factors in a scene image frame by using a target detection network, and then filtering the dynamic objects out of the scene by applying image filtering to the detected regions;
b. landmark generation: inputting the preprocessed image into a pre-trained convolutional neural network, directly identifying regions of interest from the last convolutional layer of the network, and generating landmark feature identifiers from the regions of interest of each query frame and each database image respectively;
c. convolution feature extraction: extracting a convolution feature descriptor for each landmark generated from the image by using an unsupervised deep neural network, obtaining the corresponding feature vector;
d. scene retrieval: finally, calculating the overall similarity between the query frame and each database image according to the matched landmark pairs, so as to determine the best matching reference frame of the query frame.
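The four parts above can be illustrated end-to-end with a deliberately simplified sketch. All function names and the toy shortcuts (mean-fill instead of real blurring, "landmark = strongest activation cell" instead of clustered ROIs, a one-element descriptor instead of an auto-encoder feature) are our own illustrations, not the claimed YOLOv4/AlexNet/auto-encoder pipeline:

```python
import numpy as np

def preprocess(image, boxes):
    """a. Mask each detected dynamic-object box (stand-in for detector output)
    by overwriting it with its own mean value."""
    out = image.copy()
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1].mean()
    return out

def pick_top_cells(feature_map, top_t=2):
    """b. Toy landmark generation: take the top-T strongest activation cells."""
    idx = np.argsort(-feature_map.ravel())[:top_t]
    return [tuple(int(v) for v in np.unravel_index(i, feature_map.shape))
            for i in idx]

def describe(landmark, feature_map):
    """c. Toy descriptor: the activation value at the landmark location."""
    return np.array([feature_map[landmark]])

def similarity(desc_q, desc_r):
    """d. Cosine similarity between two landmark descriptors."""
    denom = np.linalg.norm(desc_q) * np.linalg.norm(desc_r) + 1e-12
    return float(desc_q @ desc_r / denom)
```

A real implementation replaces each stub with the components named in the dependent claims, but the data flow (mask, localize, describe, match) is the same.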
2. An improved landmark-convolution based image method according to claim 1, characterized in that: in the image preprocessing stage, YOLOv4 is used as the tool for detecting dynamic factors in the scene; a model pre-trained on the Pascal VOC dataset can correctly distinguish most dynamic objects appearing in the closed-loop detection task, so the provided pre-trained model can be used directly without retraining.
3. An improved landmark-convolution based image method according to claim 1, characterized in that: after the region of a dynamic object is detected in the image, the region is processed with an image mean-blur method to mask the dynamic object information.
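A mean (box) blur restricted to a detected bounding box can be sketched as below. This is a pure-NumPy illustration under our own naming; a production implementation would typically apply cv2.blur to the ROI instead.

```python
import numpy as np

def blur_region(image, box, k=5):
    """Apply a k x k mean (box) blur only inside box = (x0, y0, x1, y1),
    masking the dynamic object while leaving the rest of the frame intact."""
    x0, y0, x1, y1 = box
    roi = image[y0:y1, x0:x1].astype(float)
    pad = k // 2
    padded = np.pad(roi, pad, mode="edge")          # replicate border pixels
    acc = np.zeros_like(roi)
    for dy in range(k):                             # sum the k*k shifted copies
        for dx in range(k):
            acc += padded[dy:dy + roi.shape[0], dx:dx + roi.shape[1]]
    out = image.astype(float).copy()
    out[y0:y1, x0:x1] = acc / (k * k)               # average -> mean blur
    return out
```

Blurring only the detected region preserves all static scene content for the later landmark-generation stage.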
4. An improved landmark-convolution based image method according to claim 1, characterized in that: identifying the regions of interest of an image and generating the landmarks specifically comprises the following steps:
a. taking each frame of dynamically filtered image as the input of the convolutional neural network AlexNet, and directly outputting the feature maps corresponding to the image from the last convolutional layer of the network;
b. grouping each non-zero activation value of the feature maps together with its 8 surrounding neighbouring activation values into one cluster, the clusters being recorded as {C_1, C_2, ..., C_M}, where M denotes the number of clusters in one image; the energy value E_i of each cluster C_i can be calculated as

E_i = Σ_{j=1}^{s_i} a_{i,j}

where s_i denotes the size of the i-th cluster and a_{i,j} denotes the j-th activation value of C_i;
c. after obtaining the energy values of the M clusters, taking the T clusters with the largest energy values and mapping them back to the original image as the finally generated landmark set, recorded as

L = {l_1, l_2, ..., l_T}.
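The clustering and energy ranking of steps b and c can be sketched with a plain connected-component pass over the feature map. Summing the activation values as the cluster energy is one natural reading of E_i in claim 4; the function name and 8-neighbour flood fill below are our own illustration.

```python
import numpy as np
from collections import deque

def top_landmark_clusters(fmap, T=2):
    """Group non-zero activations of a 2-D feature map into 8-connected
    clusters, score each cluster by the sum of its activations (its energy),
    and return the T highest-energy clusters as lists of (row, col) cells."""
    H, W = fmap.shape
    seen = np.zeros((H, W), bool)
    clusters = []
    for y in range(H):
        for x in range(W):
            if fmap[y, x] != 0 and not seen[y, x]:
                q, members = deque([(y, x)]), []
                seen[y, x] = True
                while q:                              # flood fill one cluster
                    cy, cx = q.popleft()
                    members.append((cy, cx))
                    for ny in range(cy - 1, cy + 2):  # 8-neighbourhood
                        for nx in range(cx - 1, cx + 2):
                            if (0 <= ny < H and 0 <= nx < W
                                    and fmap[ny, nx] != 0 and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
                clusters.append(members)
    energies = [sum(fmap[p] for p in c) for c in clusters]  # E_i per cluster
    order = np.argsort(energies)[::-1][:T]                  # keep top-T
    return [clusters[i] for i in order]
```

Mapping each returned cluster's cells back through the network's receptive field yields the landmark regions in the original image.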
5. An improved landmark-convolution based image method according to claim 1, characterized in that: for each generated landmark, a convolution feature descriptor is extracted by utilizing the constructed unsupervised convolutional auto-encoder network; a landmark serves as the input, X denotes its HOG feature, and X̂ denotes the reconstructed feature descriptor of the same dimension; in the auto-encoding model, linear rectification function (ReLU) activation is used for the three convolutional layers and sigmoid activation is used for the fully connected layer, so that the network reconstructs the HOG feature; when training is finished, the network has learned to reconstruct HOG features; since HOG features extracted from inputs of the same size have the same dimension, the Euclidean distance can be used as the distance measure between HOG descriptors, and the loss layer uses the Euclidean distance as the loss function to compare X with its reconstruction X̂:

L(X, X̂) = ½ ‖X − X̂‖²
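Numerically, this Euclidean reconstruction loss reduces to a squared difference over the descriptor components. The ½ factor below follows the conventional Euclidean loss layer and is our assumption; the function name is ours:

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Euclidean (L2) loss between a HOG descriptor X and its auto-encoder
    reconstruction X_hat: L = 1/2 * ||X - X_hat||^2."""
    d = np.asarray(X, float) - np.asarray(X_hat, float)
    return 0.5 * float(d @ d)
```

Minimizing this quantity over a training set is what gives the encoder its HOG-reconstructing descriptors.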
6. An improved landmark-convolution based image method according to claim 1, characterized in that: to calculate the similarity score between a query frame I_q and a reference frame I_r, all landmarks extracted from the two images are cross-matched, and the cosine distance is used to measure the similarity between the landmarks of I_q and the landmarks of I_r.
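The cross-matching of claim 6 can be sketched as below. Aggregating the best per-landmark cosine similarity by averaging is our own assumption; the claim only specifies cosine distance between landmark descriptors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def frame_similarity(query_lms, ref_lms):
    """Cross-match every landmark descriptor of the query frame I_q against
    every landmark descriptor of the reference frame I_r, keep the best match
    per query landmark, and average into an overall similarity score."""
    best = [max(cosine(q, r) for r in ref_lms) for q in query_lms]
    return sum(best) / len(best)
```

The database image with the highest overall score is then chosen as the best-matching reference frame for the query.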
CN202010903567.5A 2020-09-01 2020-09-01 Improved image method based on landmark-convolution characteristics Pending CN111767905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010903567.5A CN111767905A (en) 2020-09-01 2020-09-01 Improved image method based on landmark-convolution characteristics

Publications (1)

Publication Number Publication Date
CN111767905A true CN111767905A (en) 2020-10-13

Family

ID=72729778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010903567.5A Pending CN111767905A (en) 2020-09-01 2020-09-01 Improved image method based on landmark-convolution characteristics

Country Status (1)

Country Link
CN (1) CN111767905A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444853A * 2020-03-27 2020-07-24 Chang'an University Loop detection method of visual SLAM
CN111626417A * 2020-04-30 2020-09-04 Nanjing University of Science and Technology Closed loop detection method based on unsupervised deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NATE MERRILL et al.: "Lightweight Unsupervised Deep Loop Closure", arXiv:1805.07703v2 *
小白学视觉 (Xiaobai Learns Vision): "[Paper interpretation] Loop closure detection using supervised and unsupervised deep neural networks", OSCHINA, https://my.oschina.net/u/4581492/blog/4371406 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112461228A (en) * 2020-11-03 2021-03-09 南昌航空大学 IMU and vision-based secondary loop detection positioning method in similar environment
CN112461228B (en) * 2020-11-03 2023-05-09 南昌航空大学 IMU and vision-based secondary loop detection positioning method in similar environment
CN113011359A (en) * 2021-03-26 2021-06-22 浙江大学 Method for simultaneously detecting plane structure and generating plane description based on image and application
CN113011359B (en) * 2021-03-26 2023-10-24 浙江大学 Method for simultaneously detecting plane structure and generating plane description based on image and application
CN114018271A (en) * 2021-10-08 2022-02-08 北京控制工程研究所 Accurate fixed-point landing autonomous navigation method and system based on landmark images
CN114429192A (en) * 2022-04-02 2022-05-03 中国科学技术大学 Image matching method and device and electronic equipment
CN114429192B (en) * 2022-04-02 2022-07-15 中国科学技术大学 Image matching method and device and electronic equipment

Similar Documents

Publication Publication Date Title
Zhang et al. Visual place recognition: A survey from deep learning perspective
Garg et al. Don't look back: Robustifying place categorization for viewpoint-and condition-invariant place recognition
Tsintotas et al. The revisiting problem in simultaneous localization and mapping: A survey on visual loop closure detection
Chen et al. Only look once, mining distinctive landmarks from convnet for visual place recognition
CN111767905A (en) Improved image method based on landmark-convolution characteristics
Xin et al. Localizing discriminative visual landmarks for place recognition
Wang et al. Compressed holistic convnet representations for detecting loop closures in dynamic environments
Camara et al. Highly robust visual place recognition through spatial matching of CNN features
Chen et al. Semantic loop closure detection with instance-level inconsistency removal in dynamic industrial scenes
Zeng et al. Robust multivehicle tracking with wasserstein association metric in surveillance videos
Schubert et al. What makes visual place recognition easy or hard?
Lu et al. Pic-net: Point cloud and image collaboration network for large-scale place recognition
Tsintotas et al. Dimensionality reduction through visual data resampling for low-storage loop-closure detection
Papapetros et al. Visual loop-closure detection via prominent feature tracking
Cai et al. Patch-NetVLAD+: Learned patch descriptor and weighted matching strategy for place recognition
CN114882351A (en) Multi-target detection and tracking method based on improved YOLO-V5s
CN111626417B (en) Closed loop detection method based on unsupervised deep learning
Wu et al. Deep supervised hashing with similar hierarchy for place recognition
Hafez et al. Visual localization in highly crowded urban environments
Li et al. Deep fusion of multi-layers salient CNN features and similarity network for robust visual place recognition
Feng et al. A benchmark dataset and multi-scale attention network for semantic traffic light detection
Naseer et al. Vision-based Markov localization across large perceptual changes
Chen et al. A survey on visual place recognition for mobile robots localization
Wang et al. Two-stage vSLAM loop closure detection based on sequence node matching and semi-semantic autoencoder
Hu et al. Loop Closure Detection Algorithm Based on Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201013