CN111767905A

CN111767905A - Improved image method based on landmark-convolution characteristics

Info

Publication number: CN111767905A
Application number: CN202010903567.5A
Authority: CN
Inventors: 王燕清; 王寅同; 石朝侠
Original assignee: Nanjing Xiaozhuang University
Current assignee: Nanjing Xiaozhuang University
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2020-10-13

Abstract

The invention discloses an improved image method based on landmark-convolution characteristics, and provides a robust closed-loop detection method based on scene recognition, aiming at the problems that accumulated drift of tracks and camera pose tracking loss are easily caused when a robot moves in a large-scale environment. Firstly, identifying a remarkable dynamic object in a scene image frame by using a target detection network, carrying out image blurring processing on the region so as to filter out the influence of a dynamic factor, then taking a preprocessed image as the input of a convolutional neural network, directly generating a landmark sequence of the scene frame through a last layer of convolutional layer, and extracting the convolution characteristic of a landmark by using an unsupervised deep neural network to obtain an image description mode based on the landmark-convolution characteristic. The method has the advantages that accumulated errors of the track are corrected according to whether similar scenes exist in the current scene frame and the scene database or not, and relocation is carried out when pose tracking is lost, and experimental results show that the method has excellent performance in closed loop detection identification.

Description

Improved image method based on landmark-convolution characteristics

Technical Field

The invention relates to the technical field of mobile robot positioning, in particular to an improved image method based on landmark-convolution characteristics.

Background

The mobile robot technology is a frontier field with wide application and great prospect in the world at present. The system integrates theoretical research results of a plurality of subjects such as artificial intelligence, sensor technology, signal processing, automatic control engineering, computer technology, industrial design and the like, is widely applied to various industries such as industry, agriculture, service industry, medical treatment, national defense and the like, can assist or replace the work of human beings, and is particularly important for application research in occasions such as space and underwater exploration and the like under the condition that the human beings cannot reach or are in danger. Before the SLAM method is proposed, positioning and mapping cannot be carried out simultaneously, and positioning needs to depend on an existing map. However, in most tasks, the mobile robot is applied in an unknown environment, and neither a map prepared in advance nor a current position can be determined. On the IEEE Robotics and AutomationConference conference in 1986, researchers put forward the concept of SLAM (Simultaneous localization and mapping) for probability, namely, current pose information is estimated by using repeatedly observed map data, and then a map is constructed in an incremental manner by the pose information, so that the purpose of simultaneously positioning and mapping in an unknown environment is achieved. From this time, the SLAM technology has an important position in the robot research field as a core link for realizing autonomous navigation of the mobile robot.

A typical visual SLAM system consists of several modules, a visual odometer, back-end optimization, closed-loop detection, and mapping. Firstly, information such as images is collected through a sensor arranged on a robot, then the motion between adjacent images is estimated according to the read information, a local scene space structure is recovered, and finally a corresponding map is built according to application requirements. If only the visual odometer is used for positioning and mapping, errors inevitably occur because the current position and the map are only related to the previous moment, so a rear-end optimization mode is adopted in the visual SLAM to locally optimize the camera pose and the map estimated by the visual odometer at the adjacent moment, global optimization is carried out according to the feedback result of closed-loop detection, and finally, the globally consistent track and map can be obtained. The closed loop detection is to eliminate the accumulated error of the robot by detecting whether the robot reaches the pre-identified scene, and once the system detects the closed loop, the information is provided to the back end. Closed-loop detection is an essential link in SLAM for constructing tracks and maps with global consistency, and good closed-loop detection can eliminate accumulated drift of motion tracks, and can identify camera tracking loss caused by weather change, viewpoint change, shading, dynamic environment and the like and perform relocation. Some mainstream visual SLAMs such as LSD-SLAM, ORB-SLAM, LDSO, etc. are not robust enough in extreme appearance changes and viewpoint changes, and in the presence of dynamic object disturbances in the environment. With the successful application of deep learning in visual scene recognition, generating an image representation with convolution features can eliminate closed-loop false detections caused by appearance changes due to weather, seasonal, or time-of-day variations. Relying on landmark regions rather than whole image features to describe a scene can significantly improve robustness when there is viewpoint change or partial occlusion in the scene.

Chen et al extract convolution features as an image global descriptor using an Overfeat network, but the descriptor is too large to detect closed loops in real time. Bai et al propose the use of advanced deep learning techniques to extract robust features for replacing the original features in SeqSLAM. In both methods, the convolution features are extracted from the general neural network rather than the dedicated network for closed loop detection. For this reason, Gomez-Ojeda et al have designed a targeted convolutional neural network for identifying scenes. Chen et al further train such proprietary networks in data sets that are large enough and varied, where the training data sets are shot at thousands of different scenes, with a large amount of appearance variation. These network architectures rely on supervised learning, requiring tagged images as a training data set. Merrill et al have constructed an unsupervised deep neural network architecture specifically for closed-loop detection, with the key idea that the convolution features extracted from the network can be lighter and more compact than all of the above convolution features. The convolution feature still does not solve the viewpoint invariance well, because of similarity to the global features describing the entire image. Researches find that the robustness of closed-loop detection under the condition of viewpoint change can be remarkably improved by image representation generated based on convolution characteristics of landmarks. Both of these methods require special landmark detectors to identify regions of interest, Region-of-interest, ROIs, in the loop detection task.

The last few convolutional layers of a convolutional neural network usually embed very rich semantic information corresponding to some image regions that are meaningful for the closed-loop detection task. The invention provides a method based on landmark-convolution characteristics, which adopts a brand-new landmark generation mechanism, namely directly identifies an ROI (region of interest) in an image according to an activation value of a convolution layer without any landmark detector. The convolution features of the landmarks are then extracted using an unsupervised deep neural network designed specifically for the closed-loop detection task. The closed-loop detection has the advantages of viewpoint invariance and appearance invariance, dynamic objects obviously existing in the environment are filtered, and the robustness of the closed-loop detection is further improved.

Disclosure of Invention

It is an object of the present invention to provide an improved image method based on landmark-convolution features to solve the problems set forth in the background art described above.

Closed loop detection, namely a visual scene recognition problem, has the core of how to generate image representations so as to calculate the similarity between images to detect whether closed loop occurs. There are two significant challenges in the closed loop detection algorithm: 1. appearance changes due to weather, shading, and dynamic objects; 2. a change in viewpoint due to a camera photographing position, etc. The traditional approach is to generate an image representation using visual features extracted from the image and then speed up the matching of image descriptors by a bag-of-words model. There are usually two types of visual features, the first being local features such as SIFT, SURF, and ORB, and the second being global visual features such as GIST, HOG, etc.

The closed loop detection module adopts a mode of combining SURF characteristics with a bag-of-words model. On the basis of DSO, Gaolun et al newly add closed loop detection and pose diagram optimization, so that the system becomes a complete visual SLAM system based on the direct method. Whether closed loop occurs is detected by combining the ORB characteristics and the bag-of-words model, due to the addition of the closed loop detection module, even if tracking is lost, an algorithm can be easily relocated to effectively operate, and the performance and the precision of the LDSO with closed loop detection on positioning and map reconstruction are obviously superior to those of a pure visual odometer.

Local feature-based descriptors are robust to viewpoint changes, but are not suitable for handling appearance changes. While global feature descriptors perform well in environmental changes, they do not perform well when there are viewpoints and occlusions in the environment. Thus, neither local nor global visual features provide satisfactory performance in the event of a change in the combination of lighting, occlusion, viewpoint, and other factors.

With the successful application of deep learning in the fields of robot and computer vision, it is shown that the method based on convolution features in closed-loop detection has more obvious advantages than the traditional visual features, especially in the environment with illumination changes. Compared with local visual features, the convolution features have better environment invariance; compared with global visual features, the convolution features have better semantic recognition capability.

The scenario-based loop detection process can be described as: given a query frame

And a set of database images with N images

The purpose of closed loop detection is to find and match in the database image

Reference frame shot in same scene

。

The closed-loop detection method based on the landmark-convolution characteristics can directly generate the landmark, extract the convolution characteristics from an unsupervised deep learning network and combine the influence of dynamic factors in the environment on the representation of the generated image. The composition structure of the method is shown in figure 1 and mainly comprises four parts:

a. image preprocessing: firstly, identifying dynamic factors in a scene image frame by using a target detection network, and then filtering dynamic objects in the scene by adopting image filtering processing on the areas;

b. and land mark generation: inputting the preprocessed image into a pre-trained convolutional neural network, then directly identifying an interested region from the last layer of convolutional layer of the convolutional neural network, and respectively identifying the interested region for each query frame and database image to generate a landmark feature identifier;

c. convolution feature extraction: extracting a convolution feature descriptor from each landmark generated from the image by using an unsupervised deep neural network to obtain a corresponding feature vector;

d. scene retrieval: and finally, calculating the overall similarity between the query frame and each database image according to the matched landmark pairs so as to determine the best matching reference frame of the query frame.

1. Filtering out dynamic objects in a scene

In recent years, the use of target detection network models, such as the R-CNN series, YOLO, etc., to detect and locate objects in a scene has achieved very excellent results, and satisfactory accuracy and precision can be achieved. The target detection method based on the R-CNN is characterized in that some candidate areas where objects may exist are searched from an image, and then each candidate area is identified, so that the efficiency of object identification and positioning is greatly improved. YOLO is another framework for object detection, which creatively combines two stages of candidate area and object recognition, but in reality YOLO does not really remove the candidate area, but adopts a predefined candidate area, and the YOLO-based method has been developed into YOLOv1, YOLOv2, YOLOv3 and YOLOv4 versions, which are faster and more accurate and better.

Dynamic objects, such as pedestrians, automobiles, etc., existing in a scene can have a great influence on the representation of an image, and finally, an erroneous loop judgment is caused. In order to construct a robust and stable closed-loop detection method, the problem of dynamic objects cannot be ignored, the dynamic objects are detected from a scene, and then the dynamic objects are filtered out through a technology.

The target detection network can identify the elicitation of most dynamic objects in the scene, and considering that YOLO has faster image processing capability than other target detection networks, and can also meet the requirement of detecting dynamic objects in a closed-loop detection task, in the image preprocessing stage, YOLOv4 is first used as a tool for detecting dynamic factors in the scene. Since the pre-training model trained on the Pascal VOC dataset can correctly distinguish most dynamic objects appearing in the closed-loop detection task, the pre-training model provided by the model is directly used without retraining.

After detecting the area of the dynamic object existing in the image, processing the area by means of the image average blurring method under the condition of keeping the image details as much as possible, thereby reducing or eliminating the influence of the dynamic object existing in the environment on the finally generated image representation. Although this idea of filtering out dynamic factors in a scene is simple, experiments have verified that this method is effective. The precision of the closed-loop detection task can be improved only by adding an image preprocessing process, namely a quick object detection network and simple image filtering processing, and the method is a novel dynamic filtering mode.

2. Region of interest generating landmarks for identifying images

a. Taking each frame of dynamically filtered image as the input of a convolutional neural network AlexNet, and directly outputting the feature mapping corresponding to the image through the last layer of convolutional layer of the convolutional network;

b. all the non-zero activation values of the feature maps and 8 adjacent activation values around the non-zero activation values are respectively grouped into one type and recorded as

M denotes the number of clusters in an image, each cluster

Energy value of

Can be calculated as:

wherein

Is shown asiThe size of the individual clusters is such that,

to represent

To (1)jAn activation value;

c. after obtaining the energy values of M clusters, taking T clusters with the maximum energy value and mapping the T clusters back to the original image as the finally generated landmark set

And is recorded as:

。

3. extracting convolution features

For each generated landmark, extracting a convolution feature descriptor by utilizing the constructed unsupervised convolution automatic encoder network; as input, a road sign, X represents a dimension of the HOG feature,

representing dimensions of reconstructed feature descriptors in a self-coding modelWhen training is finished, the network has the capability of learning and reconstructing the HOG features, the dimensionality of the HOG features extracted by inputting the same size is the same, the Euclidean distance can be used as the distance metric of the HOG descriptor, and the loss layer utilizes the linear rectification function (ReLU) activation

By comparison of X with its reconstruction by a loss function

The size of (2):

the parameter settings of the network are shown in fig. 2.

The network has been proved to be fast and reliable, can realize real-time detection of closed loop without reducing the dimensionality of the extracted convolution characteristic, and experiments show that the ability of detecting the loop by the HOG characteristic learned by the network is obviously superior to the original HOG characteristic, and can replace a general neural network in a closed loop detection system based on the convolution characteristic. Since the network does not require context-specific training, the pre-trained model can be applied directly to extract features of the dataset images used in the experiment. The feature vector extracted from any landmark generated by the image I is marked as

The characteristic dimension is 1064. So for any one image, the total feature dimension is

。

4. Computing similarity

To calculate

And

similarity score between, cross-matching all landmarks extracted from the two images. Using cosine distance measures

A landmark

And

a landmark

Similarity between:

is thatuAndvthe cosine distance of (d). Wherein

，

Respectively represent pair

Chinese landmarkuAnd

chinese landmarkvThe extracted convolution feature direction represents the length of the vector.

Determining using a simple linear search

And

matches between all landmarks and cross-checking is applied to accept only landmarks that match each other.

For each matching landmark pair (u, v), its weight is determined according to their region size, with the weight being W u, v:

wherein

Respectively the height and width of the (u, v) region,

and

respectively representing the absolute value of the high difference and the absolute value of the wide difference of the two regions.

In the end of this process,

and

global similarity score

Comprises the following steps:

querying images for each frame

Traverse and calculate it and all images in the database

Wherein the score is the highestThe high image is

The best matching:

z is namely represented by

The reference frame with the highest similarity score.

Advantageous effects

Compared with the prior art, the invention has the beneficial effects that:

for some of the drawbacks of using conventional visual features to generate image representations in conventional closed-loop detection, it is proposed to represent the images with landmark-convolution features, and this algorithm differs from other landmark-based correlation algorithms, not requiring any additional landmark detectors, but rather generating landmarks directly from deep convolutional layers of convolutional neural networks to identify salient regions. The algorithm utilizes the unsupervised deep neural network specially designed for closed-loop detection to extract image features instead of extracting from a general neural network, so that the performance of the algorithm is further improved. The result shows that the algorithm has high robustness no matter severe viewpoint change or extreme appearance change exists in the environment.

Drawings

FIG. 1 is a block diagram of the closed loop detection method based on landmark-convolution features according to the present invention;

fig. 2 shows the parameter settings of the network according to the invention.

Detailed Description

1. Evaluation index

In the closed-loop detection algorithm, the accuracy and robustness of detecting the closed loop are one of the criteria for evaluating the algorithm, and when the robot moves in an unknown environment with extreme appearance change and viewpoint change, a sufficiently robust closed-loop detection method largely eliminates accumulated errors and relocates when the camera tracking is lost. How to quantify the robustness of the closed-loop detection method, in the closed-loop detection, two representative indexes of Precision (Precision) and Recall (Recall) are generally adopted for verification. The accuracy refers to the probability that all closed loops detected by the algorithm are really real closed loops; the recall ratio is the probability that all real closed loops are detected by the algorithm. The corresponding calculation formula is as follows:

wherein TP represents True Positive (True Positive), i.e. closed loop in fact, and the result detected by the algorithm is also the number of closed loops; FP indicates False Positive (False Positive), i.e. not actually closed loop, but counts as the number of closed loops detected; FN indicates False negatives (False negatives), i.e. in fact closed loops, but the result of the algorithmic detection is not the number of closed loops. Accordingly, there is also a related definition of True negatives (True negatives), i.e., not in fact closed loops, nor the number of loops detected by the algorithm, denoted by TN. False positives and false negatives are also known as perceptual bias and perceptual variation, which both affect the accuracy of closed-loop detection in practical applications. Ideally, a good closed loop detection algorithm would correctly detect whether a closed loop exists in the face of the above two situations, which requires that the values of TP and TN be as high as possible and the values of FP and FN be as low as possible during the algorithm implementation.

Actually, the accuracy and the recall rate are a pair of contradictory statistics, and when the accuracy of closed-loop detection is higher, the closed-loop detection means that the parameter setting for judging the existence of the closed loop is stricter, the number of the closed loops detected by the algorithm is reduced, and the real closed loops still exist in the environment and are not detected, so that the recall rate is reduced; when the recall rate of closed-loop detection is high, it is indicated that the setting of closed-loop parameters is relatively loose, and the algorithm can detect more closed loops, but the accuracy rate is reduced when the closed loops are not true closed loops. In closed-loop detection, it is common practice to obtain the Recall rate and accuracy in each case, and then draw an accuracy-Recall Curve (Precision-Recall Curve). In SLAM, the accuracy requirement is typically more stringent, since if the accuracy is lower, it will cause the algorithm to detect that it is a closed loop, and not actually, this will cause the optimization algorithm to give a completely wrong result, resulting in the failure of the built map. If the recall rate is low, it means that there will be some closed loops undetected, so that the constructed map is affected by some accumulated error, but it only needs two closed loops to completely eliminate the error caused by the closed loops. Therefore, in SLAM, it is more desirable to obtain the highest possible accuracy than the recall rate, and in the present invention, the area under the curve (AUC) of the accuracy-recall curve, the maximum recall rate at an accuracy of 100%, and the accuracy when there is a higher recall rate are used as evaluation indexes of the experiment.

2. Public data set introduction

In order to verify that the proposed closed-loop detection method is a robust method to both appearance changes and viewpoint changes, experimental verification will be performed using several challenging public data sets. These data sets contain scene changes that are common in the real world, such as viewpoint, weather, light, season, etc. The four data sets are described in detail as follows:

(1) gardens Point dataset

The Gardens Point dataset includes three traversal tracks. One of the track sequences is shot at night; the other two tracks are photographed in the daytime, along the left and right sides of the sidewalk, respectively, and show a change in viewpoint occurring when walking on the left and right sides of the path and a slight change in appearance mainly caused by a dynamic object such as a pedestrian. Evaluating the robustness of the proposed closed-loop detection method based on unsupervised deep learning to the change of the viewpoint by using two track sequences in the daytime; sequences of traces along the right side of the road during the day and sequences of traces taken at night are used as test data sets to assess the robustness of the appearance changes caused by extreme light changes. The robustness of the closed-loop detection method when there are both viewpoint changes and drastic light changes in the environment is evaluated using sequences of tracks taken along the right side of the road during the day and at night.

(2) Campus Loop dataset

The Campus Loop dataset consists of two image sequences, each sequence containing 100 frames of images, and contains both indoor and outdoor environments in the dataset. The first sequence is taken on snowy days, with the ground covered with snow in an outdoor environment, and the second sequence is taken on sunny days. The two image sequences are utilized to verify the robustness of the closed-loop detection method provided by the invention under the comprehensive change condition of appearance change and viewpoint change caused by weather, illumination and the like.

3. Results of the experiment

In order to prove the superior performance of the closed-loop detection algorithm based on the unsupervised deep learning, the effect of constructing the closed-loop detection composition method based on the unsupervised deep learning is firstly evaluated, and meanwhile, the closed-loop detection composition method is compared with a closed-loop detection method commonly used in a classical visual SLAM based on a direct method, wherein the first comparison method is a closed-loop detection method FABMAP used in an LSD-SLAM, and the other comparison method is an open-source framework DoW3 based on a word bag model used in an LDSO.

(1) Method evaluation

Experimental results obtained on the Campus Loop data set show the effects of four ways of generating the representation of the whole image only by using convolution characteristics, filtering dynamic objects in a scene firstly to regenerate the convolution characteristic representation of the image, generating landmarks for an original image and then extracting the image representation of the landmark-convolution characteristics of the convolution characteristics and a complete method for constructing (namely filtering the dynamic objects in the scene firstly, regenerating the landmarks and finally extracting the convolution characteristics for the landmarks). The corresponding curves are named as DeepLC-W, DeepLC-D, DeepLC-L and DeepLC respectively, and in the rest evaluation experiments, DeepLC is used for representing the experimental effect curve of the closed-loop detection algorithm based on unsupervised deep learning.

According to the comparison of the two curves of DeeplC-W and DeeplC-L, the AUC of the latter can reach 0.94, the accuracy rate of the curve is obviously higher than that of the representation method of the convolution characteristics of the whole image under the condition that the recall rate is as high as possible, and the conclusion that the image description mode adopting the landmark-convolution characteristics is remarkably superior to the global mode only adopting the convolution characteristics to describe the image can be obtained. According to the effect of deep lc-W, DeepLC-D, it can be analyzed that if dynamic factors in the scene are filtered out in the image preprocessing stage, which is helpful for improving the accuracy of closed-loop detection, compared with the convolution feature representation of the whole image, although the AUC value is only improved by 0.01, the maximum recall rate at the accuracy of 100% is almost unchanged, but the effect of deep lc can show that once these several techniques are combined, the closed-loop detection capability can be greatly improved, not only the AUC can reach 0.98, but also the maximum recall rate at the accuracy of 100% can reach 70%, so that both the image representation manner of landmark-convolution and the dynamic factor filtering processing in the preprocessing stage are important and effective components in the closed-loop detection method based on unsupervised deep learning provided by the present invention.

(2) Viewpoint change robustness assessment

The closed-loop detection method and the experiments of FAB-MAP and DBoW3 on two track sequences in the daytime of the Gardens Point data set are provided, and the image data in the two sequences are respectively collected along the left side and the right side of the road. Experimental results show that when only viewpoint changes exist in the environment, the closed-loop detection method based on unsupervised deep learning can achieve nearly perfect effect, and the AUC value is as high as 1. The effects of FABMAP and DBoW are inferior to the closed-loop detection method provided by the invention, and although both methods are based on local visual features and are theoretically robust to viewpoint changes in the environment, the effect of filtering out targets in a scene in the image preprocessing stage in the method cannot be ignored, and the convolution feature has excellent performance in scene recognition relative to artificially designed features. Therefore, the conclusion that the loop detection method provided by the invention has viewpoint invariance can be obtained.

(3) Illumination change robustness assessment

The closed loop detection method proposed by the present invention and the experimental results of FAB-MAP and DBoW3 on these two image sequences, when there are strong light changes in the scene, the method of DBoW3 is almost ineffective and does not provide convincing accuracy and recall. In contrast, the FAB-MAP is much better, the accuracy rate is over 70% when there is 50% recall rate, while the method based on unsupervised deep learning proposed by the present invention is the highest method among them, and the method of the present invention has obvious advantages no matter the AUC value, the recall rate corresponding to 100% accuracy rate, or the accuracy rate corresponding to higher recall rate, so when there is strong light change in the scene, the closed-loop detection based on unsupervised deep learning is still reliable.

(4) Viewpoint and illumination change robustness assessment

The two experiments prove that satisfactory closed-loop detection capability can be obtained by closed-loop detection based on unsupervised deep learning whether in viewpoint change or in extreme light change appearance change, and the performance of the closed-loop detection method is considered when the two changes exist in the environment at the same time, track images from the Gardens Point data set along the left side and the right side of the road in the daytime and track images from the Gardens Point data set along the right side of the road in the evening are selected, namely, the viewpoint change and the strong light change are included, and pedestrian interference exists in the scene in the daytime. DBoW3 remained ineffective in this scenario, not much less the effect of FAB-MAP than the illumination change alone, still with the same AUC and the same recall at 100% accuracy, but with a drop in accuracy when the recall was higher. Loop detection based on unsupervised deep learning is still the optimal method, and has the highest AUC, recall rate and accuracy rate in comparison.

(5) Comprehensive change assessment

Experiments on the Campus Loop data set compare the effects of the three comparison methods when seasonal changes, viewpoint changes, indoor and outdoor switching, slight light changes and dynamic object comprehensive changes exist in the scene, and the data set contains various change conditions. The method based on unsupervised deep learning still shows very good effect, and the recall rate of 70% can be achieved with 100% accuracy. The FAB-MAP and DBoW methods perform almost well, the former method has a slightly better performance, the recall rate of 10% can be realized at the accuracy rate of 100%, the accuracy rate is higher than the DBoW3, but the FAB-MAP and DBoW methods do not have satisfactory closed-loop detection capability under comprehensive change.

It is to be noted that, in the present invention, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An improved landmark-convolution feature based image method, characterized by: a landmark generation mechanism is adopted, namely ROI in an image is directly identified according to an activation value of a convolutional layer, a convolution characteristic of a landmark is extracted by utilizing an unsupervised deep neural network specially designed for a closed-loop detection task, the closed-loop detection has the characteristics of viewpoint invariance and appearance invariance, dynamic objects obviously existing in the environment are filtered, and the landmark generation mechanism mainly comprises four parts:

2. An improved landmark-convolution based image method according to claim 1, characterized in that: in the image preprocessing stage, YOLOv4 is used as a tool for detecting dynamic factors in a scene, and a pre-training model trained on a Pascal VOCdataet can correctly distinguish most dynamic objects appearing in a closed-loop detection task, and the pre-training model provided by the pre-training model can be directly used without retraining.

3. An improved landmark-convolution based image method according to claim 1, characterized in that: after detecting the area of the dynamic object in the image, processing the area by adopting an image average blurring method to cover the dynamic object information.

4. An improved landmark-convolution based image method according to claim 1, characterized in that: the method specifically comprises the following steps of identifying a region of interest of an image and generating a landmark:

taking each frame of dynamically filtered image as the input of a convolutional neural network AlexNet, and directly outputting the feature mapping corresponding to the image through the last layer of convolutional layer of the convolutional network;

all the non-zero activation values of the feature maps and 8 adjacent activation values around the non-zero activation values are respectively grouped into one type and recorded as

M represents in one imageNumber of clusters, each cluster

Energy value of

Can be calculated as:

wherein

Is shown asiThe size of the individual clusters is such that,

to represent

To (1)jAn activation value;

And is recorded as:

。

5. an improved landmark-convolution based image method according to claim 1, characterized in that: for each generated landmark, extracting a convolution feature descriptor by utilizing the constructed unsupervised convolution automatic encoder network; as input, a road sign, X represents a dimension of the HOG feature,

representing the dimension of the reconstructed feature descriptor, in the self-coding model, linear rectification function (ReLU) activation is used for three convolutional layers, sigmoid activation is used for a full-connection layer so as to reconstruct HOG features through a network, when training is finished, the network has the capability of learning and reconstructing the HOG features, the dimension of the HOG features extracted for the input with the same size is the same, the Euclidean distance can be used as the distance measure of the HOG descriptor, and the loss layer uses Euclidean distance

By comparison of X with its reconstruction by a loss function

The size of (2):

。

6. an improved landmark-convolution based image method according to claim 1, characterized in that: to calculate a query frame

And a reference frame

Similarity score between them, cross-matching all landmarks extracted from the two images, using cosine distance measure

The landmark and

similarity between landmarks.