CN111598149A - Loop detection method based on attention mechanism

Loop detection method based on attention mechanism

Info

Publication number
CN111598149A
Authority
CN
China
Prior art keywords
activation
feature
map
image frame
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010388573.1A
Other languages
Chinese (zh)
Other versions
CN111598149B (en)
Inventor
孟凡阳
任艺帆
陈俊宏
何震宇
柳伟
田第鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202010388573.1A priority Critical patent/CN111598149B/en
Publication of CN111598149A publication Critical patent/CN111598149A/en
Application granted granted Critical
Publication of CN111598149B publication Critical patent/CN111598149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data, of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Abstract

The application discloses a loop detection method based on an attention mechanism, which comprises the steps of: obtaining a target image frame and a plurality of historical image frames corresponding to the target image frame; acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each historical image frame; respectively inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each historical image frame; and determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps. Through the multi-scale feature fusion structure, the application fuses feature maps at multiple scales by means of a self-attention mechanism, so that activation maps carrying feature descriptors that adapt to complex environmental changes and have good robustness can be extracted; performing loop detection with these activation maps improves both the accuracy and the speed of loop detection.

Description

Loop detection method based on attention mechanism
Technical Field
The application relates to the technical field of loop detection, in particular to a loop detection method based on an attention mechanism.
Background
Most currently used loop detection methods are appearance-based methods that adopt a bag-of-words model (such as the visual SLAM loop detection method) and rely on hand-crafted features. The resulting image features are therefore very sensitive to illumination changes in the environment; once the scene or sensing conditions change, an efficient and robust image feature description cannot be provided, the success rate of closed-loop detection drops greatly, the mismatching rate is high, and the correct construction of the trajectory map is affected.
Disclosure of Invention
The technical problem to be solved by the application is, in view of the defects of the prior art, to provide a loop detection method based on an attention mechanism, so that loop detection is performed through a multi-scale feature fusion structure based on the attention mechanism and the accuracy of loop detection in complex scenes is improved.
In order to solve the above technical problem, a first aspect of embodiments of the present application provides an attention mechanism-based loopback detection method, including:
acquiring a target image frame and a plurality of historical image frames corresponding to the target image frame;
acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each historical image frame in a plurality of historical image frames;
inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure respectively, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of a plurality of historical image frames;
and determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps.
In an implementation manner of this embodiment, the acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of history image frames specifically includes:
inputting the target image frame and a plurality of historical image frames into a convolution structure;
and outputting a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of historical image frames through a convolution structure.
In one implementation manner of this embodiment, the multi-scale feature fusion structure includes several convolution layers and a full-connection layer; the respectively inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames specifically includes:
inputting the reference feature map into each convolution layer of the plurality of convolution layers, and outputting a target feature map corresponding to the reference feature map through each convolution layer, wherein the convolution kernels of each convolution layer of the plurality of convolution layers are different in size;
and inputting each target feature image into a full connection layer, and outputting a target activation graph through the full connection layer, wherein when the reference feature graph is a first feature graph, the target activation graph is a first activation graph, and when the reference feature graph is a second feature graph, the target activation graph is a second activation graph.
In an implementation manner of this embodiment, the training process of the multi-scale feature fusion structure includes:
acquiring a training image set, wherein the training image set comprises a plurality of groups of training image groups, each group of training image groups comprises a training image, a positive sample image corresponding to the training image and a plurality of negative sample images corresponding to the training image;
respectively obtaining a training feature map corresponding to a training image, a positive sample feature map corresponding to a positive sample image, and a negative sample feature map corresponding to each negative sample image in a plurality of negative sample images;
inputting the training feature map, the positive sample feature map and the negative sample feature map into the multi-scale feature fusion structure, and outputting a training activation map corresponding to the training feature map, a positive sample activation map corresponding to the positive sample feature map and a negative sample activation map corresponding to each negative sample feature map;
determining a loss function corresponding to the training image according to the training activation image, the positive sample activation image and each negative sample activation image;
and training the multi-scale feature fusion structure according to the loss function.
In an implementation manner of this embodiment, the expression of the loss function is:

$$L = \sum_{k=1}^{K} \max\left(0,\ \left\| F_q - F_{q^+} \right\|_2^2 - \left\| F_q - F_{q_k^-} \right\|_2^2 + m\right)$$

wherein F_q is the training activation map corresponding to the training image q; F_{q^+} is the positive sample activation map corresponding to the positive sample image q^+; F_{q_k^-} is the negative sample activation map corresponding to the k-th negative sample image q^-; K is the number of negative sample images; and m is an adjustment parameter (margin).
In an implementation manner of this embodiment, the determining, according to the first activation map and the plurality of second activation maps, a loop frame corresponding to the target image frame specifically includes:
for each second activation map, determining a hamming distance of the first activation map from the second activation map;
and determining a loop frame corresponding to the target image frame according to all the determined Hamming distances.
In an implementation manner of this embodiment, before determining, according to the first activation map and the second activation maps, a loop frame corresponding to the target image frame, the method includes:
converting the feature descriptors in the first activation graph into binary feature descriptors, and taking the converted first activation graph as a first activation graph;
and converting the feature descriptor in each second activation map in the plurality of second activation maps into a binary feature descriptor, and taking the converted second activation map as the second activation map.
A second aspect of the embodiments of the present application provides a loop detection apparatus, including:
the frame acquisition module is used for acquiring a target image frame and a plurality of historical image frames corresponding to the target image frame;
the characteristic output module is used for acquiring a first characteristic diagram corresponding to the target image frame and a second characteristic diagram corresponding to each historical image frame in a plurality of historical image frames;
the fusion module is used for inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure respectively and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of a plurality of historical image frames;
and the loop determining module is used for determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps.
A third aspect of embodiments of the present application provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the attention-based loop detection method according to the first aspect.
A fourth aspect of an embodiment of the present application provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, performs the steps in the attention mechanism based loop back detection method as described in the first aspect above.
A computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the attention-based loopback detection method as described in any one of the above.
A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the attention mechanism based loopback detection method as described in any of the above.
According to the method provided by the application, a target image frame and a plurality of historical image frames corresponding to the target image frame are obtained; a first feature map corresponding to the target image frame and a second feature map corresponding to each of the plurality of historical image frames are acquired; the first feature map and the plurality of second feature maps are respectively input into a multi-scale feature fusion structure, which outputs a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames; and a loop frame corresponding to the target image frame is determined according to the first activation map and the plurality of second activation maps. Through the multi-scale feature fusion structure, the application fuses feature maps at multiple scales by means of a self-attention mechanism, so that activation maps carrying feature descriptors that adapt to complex environmental changes and have good robustness can be extracted; performing loop detection with these activation maps improves both the accuracy and the speed of loop detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without any inventive work.
Fig. 1 is a flowchart of a loop detection method based on an attention mechanism provided in the present application.
Fig. 2 is a schematic flowchart of a loop detection method based on an attention mechanism according to the present application.
Fig. 3 is an exemplary diagram of a loop detection method based on an attention mechanism provided in the present application.
Fig. 4 is a flowchart illustrating a process of acquiring an activation map through a convolution structure and a multi-scale feature fusion structure in the method for detecting a loop based on an attention mechanism provided in the present application.
Fig. 5 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
In order to make the purpose, technical solutions, and effects of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In particular implementations, the terminal devices described in the embodiments of the present application include, but are not limited to, portable devices such as mobile phones, laptop computers, or tablet computers having touch-sensitive surfaces (e.g., touch-sensitive displays and/or touch pads). It should also be understood that in some embodiments the device may not be a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch-sensitive display screen and/or a touchpad).
In the discussion that follows, a terminal device that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal device may also include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal device supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a video conferencing application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a web browsing application, a digital music player application, and/or a digital video player application, etc.
Various applications that may be executed on the terminal device may use at least one common physical user interface device, such as the touch-sensitive surface. One or more functions of the touch-sensitive surface and the corresponding information displayed on the terminal may be adjusted and/or changed between applications and/or within a respective application. In this way, a common physical framework (e.g., the touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process is determined by its function and internal logic, and should not constitute any limitation on the implementation process of this embodiment.
The inventor has found that a mobile robot performs simultaneous localization and mapping according to data from its visual sensors, where visual SLAM technology is the key to the autonomous positioning of the mobile robot. Traditional visual SLAM comprises four parts: visual odometry, back-end optimization, loop detection, and mapping. The visual odometry is mainly responsible for estimating the motion between two adjacent visual images and a local map, and involves techniques such as feature extraction and image registration. Loop detection is mainly responsible for judging whether the robot has reached a previous position and providing the detected loop information to the back end for processing. The accuracy of loop detection directly affects the accuracy of map construction; if an incorrect loop is detected, an incorrect map is generated and the positioning of the robot is affected. In addition, the environment around the robot may change during actual navigation. These changes can be classified into changes in appearance and changes in viewing angle. Changes in appearance may be due to changes in lighting, weather, and shadows; meanwhile, pictures taken by the robot at different angles in the same place look different. Therefore, loop detection needs to be robust to changes in both appearance and viewing angle.
Traditional visual SLAM loop detection methods such as the bag-of-words method are based on manual design and are very sensitive to illumination changes in the environment; once the scene or sensing conditions change, an efficient and robust image feature description cannot be provided, so the success rate of closed-loop detection drops greatly, the mismatching rate is high, and the correct construction of the trajectory map is affected. Therefore, improving the accuracy of loop detection and its robustness in complex illumination environments has important practical significance. Recently, many researchers have used deep learning to solve the loop detection problem. Compared with traditional methods, a deep convolutional network can extract more effective image features, and a convolutional neural network (CNN) model pre-trained on a large image classification dataset generalizes well and can be used to solve various visual tasks. Research has found that features extracted from the middle layers of a CNN are robust to appearance changes, while features extracted from the higher layers of a CNN are robust to viewing angle changes. However, when both the appearance and the viewing angle change simultaneously, a CNN model alone does not achieve a high loop detection success rate.
In order to solve the above problem, in the embodiment of the present application, a target image frame and a plurality of historical image frames corresponding to the target image frame are obtained; a first feature map corresponding to the target image frame and a second feature map corresponding to each of the plurality of historical image frames are acquired; the first feature map and the plurality of second feature maps are respectively input into a multi-scale feature fusion structure, which outputs a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames; and a loop frame corresponding to the target image frame is determined according to the first activation map and the plurality of second activation maps. Through the multi-scale feature fusion structure, the application fuses feature maps at multiple scales by means of a self-attention mechanism, so that activation maps carrying feature descriptors that adapt to complex environmental changes and have good robustness can be extracted; performing loop detection with these activation maps improves both the accuracy and the speed of loop detection.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
The present embodiment provides a loop detection method based on an attention mechanism, where the loop detection method is applied to a terminal device; as shown in figs. 1 to 4, the method includes:
and S10, acquiring a target image frame and a plurality of historical image frames corresponding to the target image frame.
Specifically, the target image frame is an image frame used for loop detection. The target image frame may be any image frame acquired in the loop detection process, or may be the image frame acquired at the current time, where the current time refers to the latest time in loop detection. It can be understood that the acquisition time corresponding to the target image frame is later than the acquisition time corresponding to every image frame already acquired. The plurality of historical image frames are image frames that have undergone loop detection, and the acquisition time corresponding to each historical image frame is earlier than the acquisition time corresponding to the target image frame. For example, if three image frames have been acquired, the third frame is the target image frame, and the first and second image frames are the historical image frames.
Further, the plurality of history image frames may be all image frames that have undergone loop detection, or may be partial image frames in image frames that have undergone loop detection, for example, the plurality of history image frames are key frames in image frames that have undergone loop detection, and the like. In an implementation manner of this embodiment, the history image frames are key frames in image frames that have undergone loop detection, that is, the history image frames are history key frames, and the history key frames are consecutive history key frames, so that the amount of computation of the history image frames can be reduced, and the speed of loop detection is increased. For example, 10 video frames are acquired, wherein the first frame, the fifth frame and the ninth frame are key frames, the target image frame is a tenth frame image frame, and the historical image frames are the first frame image frame, the fifth frame image frame and the ninth frame image frame, respectively.
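As an illustration of this step, the following is a minimal Python sketch of selecting the target frame and its key-frame history; the fixed key-frame interval and the list-based data layout are assumptions for illustration, not specified by the application.

```python
def select_history_frames(frames, keyframe_interval=4):
    """Return the target frame (the latest frame) and its historical
    key frames; with interval 4 and ten frames this yields the first,
    fifth and ninth frames, matching the example above."""
    target_frame = frames[-1]                          # latest acquired frame
    history_frames = frames[:-1][::keyframe_interval]  # key frames only
    return target_frame, history_frames
```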
S20, acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of historical image frames.
Specifically, the first feature map is the feature map corresponding to the target image frame; it is an image representation of the target image frame, formed by extracting useful information from the target image frame and discarding irrelevant information. For example, when predicting the position information of an object carried in an image, edge detection can be used to reduce the image to an edge-only image, because for this task the edge information of the image is useful while the color information is not. Similarly, the second feature map corresponding to each historical image frame is an image representation of that historical image frame.
Further, the first feature map and each second feature map can be obtained through a convolution structure, where the convolution structure is a network model used for obtaining the feature map of an image: its input is the image to be detected, and its output is the feature map corresponding to that image. Based on this, as shown in fig. 4, the acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of history image frames specifically includes:
inputting the target image frame and a plurality of historical image frames into a convolution structure;
and outputting a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of historical image frames through a convolution structure.
Specifically, the target image frame and the plurality of history image frames may be input into the convolution structure sequentially, or the target image frame and each of the plurality of history image frames may be input into the convolution structure together, which is not limited herein. Accordingly, the convolution structure may output the first feature map corresponding to the target image frame and the second feature map corresponding to each historical image frame separately or together. In this embodiment, the convolution structure is a trained convolution structure, and its output is a feature map with dimensions w × h × C, representing C-dimensional local features extracted at w × h positions. It can be understood that the first feature map is a local feature map of the target image frame, and each second feature map is a local feature map of the corresponding history image frame.
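For orientation, a minimal sketch of such a convolution structure follows; the application does not name a specific backbone, so the pre-trained VGG16 and the input size are assumptions, and the output tensor plays the role of the w × h × C local feature map described above.

```python
import torch
import torchvision.models as models

# Pre-trained CNN backbone standing in for the convolution structure.
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features
backbone.eval()

with torch.no_grad():
    frame = torch.rand(1, 3, 224, 224)  # placeholder image frame
    feature_map = backbone(frame)       # (1, C=512, h=7, w=7) local features
```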
And S30, inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure respectively, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames.
Specifically, the multi-scale feature fusion structure may be a network model independent from the convolution structure, or may be formed by cascading with the convolution structure, which is not limited herein. And the input item of the multi-scale feature fusion structure is the output item of the convolution structure, and the output item of the multi-scale feature fusion structure is the activation graph corresponding to the input item.
Further, in an implementation manner of this embodiment, the multi-scale feature fusion structure includes a plurality of convolution layers and a full connection layer; the plurality of convolution layers are arranged in parallel, and each convolution layer is connected with the full connection layer. The inputs of the convolution layers are all the same, the output of each convolution layer is a feature map at a different scale corresponding to the input, the input of the full connection layer is the set of outputs of the plurality of convolution layers, and the output of the full connection layer is an activation map.
Based on this, as shown in fig. 3 and 4, the inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure, and outputting the first activation map corresponding to the target image frame and the second activation map corresponding to each of the plurality of history image frames specifically includes:
inputting the reference feature map into each convolution layer of the plurality of convolution layers, and outputting a target feature map corresponding to the reference feature map through each convolution layer, wherein the convolution kernels of each convolution layer of the plurality of convolution layers are different in size;
and inputting each target feature image into a full connection layer, and outputting a target activation graph through the full connection layer, wherein when the reference feature graph is a first feature graph, the target activation graph is a first activation graph, and when the reference feature graph is a second feature graph, the target activation graph is a second activation graph.
Specifically, the input of each convolution layer is the reference feature map, the sizes of the convolution kernels of the convolution layers differ, and the image scales of the outputs of the convolution layers are the same. It can be understood that the image scale of the target feature map output by each convolution layer is the same, and that the different convolution layers filter the reference feature map to obtain features of different granularities. In one implementation of this embodiment, the plurality of convolution layers may include a first convolution layer, a second convolution layer, and a third convolution layer that filter the feature map with convolution kernels of different sizes; for example, the convolution kernel size of the first convolution layer is 3 × 3, that of the second convolution layer is 5 × 5, and that of the third convolution layer is 7 × 7.
Further, after the target feature maps at all image scales are obtained, they are input into the full connection layer, where the target feature maps are concatenated into a feature map with dimensions w × h × D, w being the width of the target feature map and h its height; D equals the number of target feature maps (i.e., the number of convolution layers), e.g., for three convolution layers D equals 3. It should be noted that the number of channels of each target feature map is 1, that is, the dimension of a target feature map is w × h × 1. After the w × h × D feature map is obtained, all activations at each spatial position of the feature map are combined by a 1 × 1 × D convolution layer, and then ReLU activation is applied to generate the target activation map, where the ReLU operation ensures that all values of the activation map are non-negative. In this way, the target activation map output by the multi-scale feature fusion structure localizes the most discriminative regions of the feature image and defines the feature points in the reference feature map that should be focused on. It can be understood that the target activation map generated by the multi-scale feature fusion structure indicates the significance of each local feature in the reference feature map, and the higher the activation degree of a local feature in the target activation map, the more representative it is of the image description.
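To make the structure concrete, here is a minimal PyTorch sketch of the multi-scale feature fusion structure as read from the description above: three parallel convolution layers with 3 × 3, 5 × 5, and 7 × 7 kernels each emit a single-channel target feature map, the maps are concatenated into a w × h × D volume (D = 3), and a 1 × 1 × D convolution followed by ReLU combines the activations at every spatial position (this per-position combination plays the role of the full connection layer). Class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusion(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        # Parallel convolutions with different kernel sizes; "same" padding
        # keeps the image scale of every target feature map identical.
        self.conv3 = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_channels, 1, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(in_channels, 1, kernel_size=7, padding=3)
        # 1 x 1 x D convolution fusing the D = 3 maps, then ReLU so that
        # all values of the activation map are non-negative.
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, reference: torch.Tensor) -> torch.Tensor:
        # reference: (batch, C, h, w) feature map from the convolution structure
        maps = [self.conv3(reference), self.conv5(reference), self.conv7(reference)]
        stacked = torch.cat(maps, dim=1)      # (batch, 3, h, w), i.e. w x h x D
        return self.relu(self.fuse(stacked))  # (batch, 1, h, w) activation map
```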
Further, in an implementation manner of this embodiment, the training process of the multi-scale feature fusion structure includes:
acquiring a training image set;
respectively obtaining a training feature map corresponding to a training image, a positive sample feature map corresponding to a positive sample image, and a negative sample feature map corresponding to each negative sample image in a plurality of negative sample images;
inputting the training feature map, the positive sample feature map and the negative sample feature map into the multi-scale feature fusion structure, and outputting a training activation map corresponding to the training feature map, a positive sample activation map corresponding to the positive sample feature map and a negative sample activation map corresponding to each negative sample feature map;
determining a loss function corresponding to the training image according to the training activation image, the positive sample activation image and each negative sample activation image;
and training the multi-scale feature fusion structure according to the loss function.
Specifically, the training image set includes a plurality of sets of training images, each set of training images includes a training image, a positive sample image corresponding to the training image, and a plurality of negative sample images corresponding to the training image, where position information corresponding to the positive sample image is the same as position information corresponding to the training image, and position information corresponding to each negative sample image is different from position information corresponding to the training image, where the position information refers to an image acquisition position. For example, there are 10 acquisition points on a track, the training image is the image acquired at the first acquisition point, then the positive sample is the image acquired at the first acquisition point, and the negative sample is the image acquired at other acquisition points except the first acquisition point. Of course, the corresponding position information of each negative sample image may be the same or different, and is not limited herein.
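As a sketch of assembling one training image group under these rules, the following groups images by acquisition point, draws the positive sample from the same point as the training image, and draws the negative samples from other points; the dictionary layout and the sampling policy are assumptions for illustration.

```python
import random

def build_training_group(images_by_point, point, num_negatives):
    same_point = images_by_point[point]          # needs at least two images
    train_img, pos_img = random.sample(same_point, 2)    # same location
    other_imgs = [img for p, imgs in images_by_point.items()
                  if p != point for img in imgs]
    neg_imgs = random.sample(other_imgs, num_negatives)  # other locations
    return train_img, pos_img, neg_imgs
```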
Further, the training feature map corresponding to the training image, the positive sample feature map corresponding to the positive sample image, and the negative sample feature map corresponding to each negative sample image can be output by the trained convolution structure, so that transfer learning from the convolution structure can improve the training speed of the multi-scale fusion structure during its training. Alternatively, the convolution structure may be untrained and trained synchronously with the multi-scale fusion structure, so that the features learned by the convolution structure match the multi-scale fusion structure, thereby improving the accuracy of the multi-scale fusion structure. In practical applications, the convolution structure and the multi-scale fusion structure form one network model, and the first activation map corresponding to the target image frame and the second activation map corresponding to each history frame can be determined through this network model.
Furthermore, the multi-scale fusion structure adopts unsupervised learning and can be trained without labeled data, which effectively reduces the annotation workload, simplifies the training of the model, and improves the training efficiency of the multi-scale fusion structure. In addition, when the convolution structure and the multi-scale fusion structure are trained synchronously, the convolution structure also adopts unsupervised learning, which further improves its training efficiency.
Further, the purpose of training the multi-scale feature fusion structure is to ensure that the feature representation distance between the training image and the positive sample image is smaller than the feature representation distance between the training image and each negative sample image. Since the appearance variation between the training image and the positive sample image can be severe, the multi-scale feature fusion structure learns to locate the most discriminative local features to represent the image. Thus, a triplet function may be employed as the loss function, where the loss function has the expression:
$$L = \sum_{k=1}^{K} \max\left(0,\ \left\| F_q - F_{q^+} \right\|_2^2 - \left\| F_q - F_{q_k^-} \right\|_2^2 + m\right)$$

wherein F_q is the training activation map corresponding to the training image q; F_{q^+} is the positive sample activation map corresponding to the positive sample image q^+; F_{q_k^-} is the negative sample activation map corresponding to the k-th negative sample image q^-; K is the number of negative sample images; and m is an adjustment parameter (margin).
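For reference, a minimal PyTorch-style sketch of this triplet loss follows, assuming squared Euclidean distances between flattened activation maps; the function name, the batching, and the default margin value are assumptions.

```python
import torch

def triplet_loss(f_q, f_pos, f_negs, m=0.1):
    q, pos = f_q.flatten(1), f_pos.flatten(1)
    d_pos = (q - pos).pow(2).sum(dim=1)       # distance to the positive sample
    loss = q.new_zeros(())
    for f_neg in f_negs:                      # K negative sample activation maps
        d_neg = (q - f_neg.flatten(1)).pow(2).sum(dim=1)
        loss = loss + torch.clamp(d_pos - d_neg + m, min=0).sum()
    return loss
```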
And S40, determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps.
Specifically, determining the loop frame corresponding to the target image frame according to the first activation map and the second activation maps refers to determining, according to the distances between the first activation map and the second activation maps, whether any historical image frame is a loop frame for the target image frame. It can be understood that the plurality of historical image frames correspond to a plurality of distances, where each distance is calculated from the second activation map corresponding to a historical image frame and the first activation map corresponding to the target image frame.
In an implementation manner of this embodiment, the determining, according to the first activation map and the plurality of second activation maps, a loop frame corresponding to the target image frame specifically includes:
for each second activation map, determining a hamming distance of the first activation map from the second activation map;
and determining a loop frame corresponding to the target image frame according to all the determined Hamming distances.
Specifically, the Hamming distances are calculated from the first activation map and each second activation map, so the plurality of history image frames correspond to a plurality of Hamming distances. After the Hamming distance between the target image frame and each history image frame is calculated, the shortest Hamming distance is selected from the plurality of Hamming distances, and the history image frame corresponding to the shortest Hamming distance is taken as the loop frame of the target image frame. Optionally, in order to further improve the accuracy of loop detection, a distance threshold may be preset for judging whether a historical image frame and the target image frame form a loop. For example, after the shortest Hamming distance is selected, it is judged whether the shortest Hamming distance is smaller than the distance threshold; if so, it is determined that the history image frame corresponding to the shortest Hamming distance is the loop frame of the target image frame; if not, it is determined that the target image frame and that history image frame do not form a loop, that is, no image frame looping back to the target image frame exists among the plurality of history image frames.
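A minimal sketch of this selection step, assuming the activation maps have already been converted to binary (0/1) descriptors as described below; the threshold value is an illustrative assumption.

```python
import numpy as np

def find_loop_frame(target_desc, history_descs, dist_threshold=64):
    # Hamming distance = number of differing bits between binary descriptors.
    dists = [int(np.count_nonzero(target_desc != desc))
             for desc in history_descs]
    best = int(np.argmin(dists))
    if dists[best] < dist_threshold:
        return best    # index of the loop frame among the history frames
    return None        # no history frame loops back to the target frame
```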
Further, the locality-sensitive hashing (LSH) method guarantees, with a certain probability, that points which are similar (close) before hashing remain similar after hashing. That is, after two points that are close (similar) in the original space are mapped by an LSH hash function, their hashes are the same with high probability, while two points that are far apart (dissimilar) have only a small probability of sharing the same hash value after mapping. Based on this, before determining the loop frame corresponding to the target image frame according to the first activation map and the second activation maps, the floating-point feature descriptors can be converted into binary feature descriptors by means of such a hashing method. Correspondingly, before determining the loop frame corresponding to the target image frame according to the first activation map and the second activation maps, the method includes:
converting the feature descriptors in the first activation graph into binary feature descriptors, and taking the converted first activation graph as a first activation graph;
and converting the feature descriptor in each second activation map in the plurality of second activation maps into a binary feature descriptor, and taking the converted second activation map as the second activation map.
Specifically, the feature descriptors in the first activation map are floating-point feature descriptors; converting them into binary feature descriptors by hashing allows the Hamming distance between the first activation map and each second activation map to be calculated on binary feature descriptors, which speeds up the matching of the first activation map with the second activation maps and thereby achieves a real-time effect.
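A minimal sketch of such a conversion using random-projection locality-sensitive hashing, under which nearby floating-point descriptors keep the same bits with high probability; the number of bits and the seed are assumptions.

```python
import numpy as np

def lsh_binarize(descriptor, n_bits=256, seed=0):
    rng = np.random.default_rng(seed)
    # One random hyperplane per output bit; the sign of each projection
    # becomes one bit of the binary feature descriptor.
    planes = rng.standard_normal((n_bits, descriptor.size))
    return (planes @ descriptor.ravel() > 0).astype(np.uint8)
```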
In summary, the present application provides a loop detection method based on an attention mechanism, in which a target image frame and a plurality of historical image frames corresponding to the target image frame are obtained; a first feature map corresponding to the target image frame and a second feature map corresponding to each of the plurality of historical image frames are acquired; the first feature map and the plurality of second feature maps are respectively input into a multi-scale feature fusion structure, which outputs a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames; and a loop frame corresponding to the target image frame is determined according to the first activation map and the plurality of second activation maps. Through the multi-scale feature fusion structure, the application fuses feature maps at multiple scales by means of a self-attention mechanism, so that activation maps carrying feature descriptors that adapt to complex environmental changes and have good robustness can be extracted; performing loop detection with these activation maps improves both the accuracy and the speed of loop detection.
Based on the loop detection method based on attention mechanism, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the loop detection method based on attention mechanism as described in the foregoing embodiment.
Based on the above loop detection method based on an attention mechanism, the present application further provides a terminal device, as shown in fig. 5, including at least one processor 20, a display screen 21, and a memory 22, and may further include a communication interface 23 and a bus 24. The processor 20, the display screen 21, the memory 22, and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include high-speed random access memory and may also include non-volatile memory, for example, various media that can store program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; it may also be a transient storage medium.
In addition, the specific processes loaded and executed by the storage medium and by the instruction processors in the terminal device have been described in detail in the method above and are not repeated herein.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of loop detection based on an attention mechanism, the method comprising:
acquiring a target image frame and a plurality of historical image frames corresponding to the target image frame;
acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each historical image frame in a plurality of historical image frames;
inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure respectively, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of a plurality of historical image frames;
and determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps.
2. The attention mechanism-based loopback detection method as claimed in claim 1, wherein the acquiring a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of historical image frames specifically comprises:
inputting the target image frame and a plurality of historical image frames into a convolution structure;
and outputting a first feature map corresponding to the target image frame and a second feature map corresponding to each of a plurality of historical image frames through a convolution structure.
3. The attention-based mechanism loopback detection method as recited in claim 1, wherein the multi-scale feature fusion structure comprises several convolution layers and a full connection layer; the respectively inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure, and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of the plurality of historical image frames specifically includes:
inputting the reference feature map into each convolution layer of the plurality of convolution layers, and outputting a target feature map corresponding to the reference feature map through each convolution layer, wherein the convolution kernels of each convolution layer of the plurality of convolution layers are different in size;
and inputting each target feature image into a full connection layer, and outputting a target activation graph through the full connection layer, wherein when the reference feature graph is a first feature graph, the target activation graph is a first activation graph, and when the reference feature graph is a second feature graph, the target activation graph is a second activation graph.
4. The attention-based loopback detection method according to any one of claims 1-3, wherein the training process of the multi-scale feature fusion structure comprises:
acquiring a training image set, wherein the training image set comprises a plurality of groups of training image groups, each group of training image groups comprises a training image, a positive sample image corresponding to the training image and a plurality of negative sample images corresponding to the training image;
respectively obtaining a training feature map corresponding to a training image, a positive sample feature map corresponding to a positive sample image, and a negative sample feature map corresponding to each negative sample image in a plurality of negative sample images;
inputting the training feature map, the positive sample feature map and the negative sample feature map into the multi-scale feature fusion structure, and outputting a training activation map corresponding to the training feature map, a positive sample activation map corresponding to the positive sample feature map and a negative sample activation map corresponding to each negative sample feature map;
determining a loss function corresponding to the training image according to the training activation image, the positive sample activation image and each negative sample activation image;
and training the multi-scale feature fusion structure according to the loss function.
5. The attention mechanism-based loopback detection method as recited in claim 4, wherein the expression of the loss function is:

$$L = \sum_{k=1}^{K} \max\left(0,\ \left\| F_q - F_{q^+} \right\|_2^2 - \left\| F_q - F_{q_k^-} \right\|_2^2 + m\right)$$

wherein F_q is the training activation map corresponding to the training image q; F_{q^+} is the positive sample activation map corresponding to the positive sample image q^+; F_{q_k^-} is the negative sample activation map corresponding to the k-th negative sample image q^-; K is the number of negative sample images; and m is an adjustment parameter (margin).
6. The attention mechanism-based loop detection method according to claim 1, wherein the determining, according to the first activation map and a plurality of second activation maps, a loop frame corresponding to the target image frame specifically includes:
for each second activation map, determining a hamming distance of the first activation map from the second activation map;
and determining a loop frame corresponding to the target image frame according to all the determined Hamming distances.
7. The attention mechanism-based loop detection method according to claim 1 or 6, wherein before determining the loop frame corresponding to the target image frame according to the first activation map and the second activation maps, the method comprises:
converting the feature descriptors in the first activation graph into binary feature descriptors, and taking the converted first activation graph as a first activation graph;
and converting the feature descriptor in each second activation map in the plurality of second activation maps into a binary feature descriptor, and taking the converted second activation map as the second activation map.
8. A loop detection apparatus, comprising:
the frame acquisition module is used for acquiring a target image frame and a plurality of historical image frames corresponding to the target image frame;
the characteristic output module is used for acquiring a first characteristic diagram corresponding to the target image frame and a second characteristic diagram corresponding to each historical image frame in a plurality of historical image frames;
the fusion module is used for inputting the first feature map and the plurality of second feature maps into a multi-scale feature fusion structure respectively and outputting a first activation map corresponding to the target image frame and a second activation map corresponding to each of a plurality of historical image frames;
and the loop determining module is used for determining a loop frame corresponding to the target image frame according to the first activation map and the plurality of second activation maps.
9. A computer readable storage medium, storing one or more programs, which are executable by one or more processors, to implement the steps in the attention mechanism based loop back detection method according to any one of claims 1 to 7.
10. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the attention mechanism based loop back detection method of any of claims 1-7.
CN202010388573.1A 2020-05-09 2020-05-09 Loop detection method based on attention mechanism Active CN111598149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010388573.1A CN111598149B (en) 2020-05-09 2020-05-09 Loop detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010388573.1A CN111598149B (en) 2020-05-09 2020-05-09 Loop detection method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN111598149A (en) 2020-08-28
CN111598149B CN111598149B (en) 2023-10-24

Family

ID=72185487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010388573.1A Active CN111598149B (en) 2020-05-09 2020-05-09 Loop detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111598149B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270384A (en) * 2020-11-19 2021-01-26 湖南国科微电子股份有限公司 Loop detection method and device, electronic equipment and storage medium
CN113763466A (en) * 2020-10-10 2021-12-07 北京京东乾石科技有限公司 Loop detection method and device, electronic equipment and storage medium
WO2022142855A1 (en) * 2020-12-31 2022-07-07 深圳市优必选科技股份有限公司 Loop closure detection method and apparatus, terminal device, and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711365A (en) * 2018-12-29 2019-05-03 佛山科学技术学院 A kind of vision SLAM winding detection method and device merging semantic information
CN109785387A (en) * 2018-12-17 2019-05-21 中国科学院深圳先进技术研究院 Winding detection method, device and the robot of robot
CN109800692A (en) * 2019-01-07 2019-05-24 重庆邮电大学 A kind of vision SLAM winding detection method based on pre-training convolutional neural networks
CN110163095A (en) * 2019-04-16 2019-08-23 中国科学院深圳先进技术研究院 Winding detection method, winding detection device and terminal device
CN110689562A (en) * 2019-09-26 2020-01-14 深圳市唯特视科技有限公司 Trajectory loop detection optimization method based on generation of countermeasure network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785387A (en) * 2018-12-17 2019-05-21 中国科学院深圳先进技术研究院 Winding detection method, device and the robot of robot
CN109711365A (en) * 2018-12-29 2019-05-03 佛山科学技术学院 A kind of vision SLAM winding detection method and device merging semantic information
CN109800692A (en) * 2019-01-07 2019-05-24 重庆邮电大学 A kind of vision SLAM winding detection method based on pre-training convolutional neural networks
CN110163095A (en) * 2019-04-16 2019-08-23 中国科学院深圳先进技术研究院 Winding detection method, winding detection device and terminal device
CN110689562A (en) * 2019-09-26 2020-01-14 深圳市唯特视科技有限公司 Trajectory loop detection optimization method based on generation of countermeasure network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763466A (en) * 2020-10-10 2021-12-07 北京京东乾石科技有限公司 Loop detection method and device, electronic equipment and storage medium
CN112270384A (en) * 2020-11-19 2021-01-26 湖南国科微电子股份有限公司 Loop detection method and device, electronic equipment and storage medium
CN112270384B (en) * 2020-11-19 2023-06-13 湖南国科微电子股份有限公司 Loop detection method and device, electronic equipment and storage medium
WO2022142855A1 (en) * 2020-12-31 2022-07-07 深圳市优必选科技股份有限公司 Loop closure detection method and apparatus, terminal device, and readable storage medium

Also Published As

Publication number Publication date
CN111598149B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN109791625B (en) Facial recognition using artificial neural networks
US10924676B2 (en) Real-time visual effects for a live camera view
AU2018250370B2 (en) Weakly supervised model for object detection
CN111798360B (en) Watermark detection method and device, electronic equipment and storage medium
WO2020199468A1 (en) Image classification method and device, and computer readable storage medium
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN111598149B (en) Loop detection method based on attention mechanism
CN103455794B (en) A kind of dynamic gesture identification method based on frame integration technology
CN107818290B (en) Heuristic finger detection method based on depth map
WO2021164550A1 (en) Image classification method and apparatus
US11842514B1 (en) Determining a pose of an object from rgb-d images
WO2023130717A1 (en) Image positioning method and apparatus, computer device and storage medium
CN103679788A (en) 3D image generating method and device in mobile terminal
Sharma et al. Air-swipe gesture recognition using OpenCV in Android devices
CN110163095B (en) Loop detection method, loop detection device and terminal equipment
CN113127672A (en) Generation method, retrieval method, medium and terminal of quantized image retrieval model
CN109063068A (en) A kind of picture retrieval method and device
CN111444802A (en) Face recognition method and device and intelligent terminal
WO2023146470A2 (en) Dual-level model for segmentation
US20220050528A1 (en) Electronic device for simulating a mouse
CN110910478B (en) GIF map generation method and device, electronic equipment and storage medium
CN114202799A (en) Method and device for determining change speed of controlled object, electronic equipment and storage medium
CN114860060A (en) Method for hand mapping mouse pointer, electronic device and readable medium thereof
Jayasathyan et al. Implementation of Real Time Virtual Clicking using OpenCV
CN112989925B (en) Method and system for identifying hand sliding direction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant