WO2022126529A1

WO2022126529A1 - Positioning method and device, and unmanned aerial vehicle and storage medium

Info

Publication number: WO2022126529A1
Application number: PCT/CN2020/137313
Authority: WO
Inventors: 梁湘国; 杨健; 蔡剑钊
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2022-06-23
Also published as: CN114556425A

Abstract

A positioning method and device, and an unmanned aerial vehicle and a storage medium. The method is applied to a movable platform, wherein the movable platform comprises a visual sensor (403). The method comprises: acquiring first image description information and first key point description information of historical images collected by a visual sensor, and acquiring first position information corresponding to a movable platform; acquiring the current image collected by the visual sensor, and acquiring second image description information and second key point description information of the current image on the basis of a feature extraction model; on the basis of the description information corresponding to the historical images and the description information corresponding to the current image, determining a matching result between the plurality of historical images and the current image; and according to the matching result and the first position information, determining second position information of the movable platform of when the current image is collected. The efficiency of acquiring the two pieces of description information is improved, and the two pieces of description information can also be relatively accurately determined, such that positioning time is further saved, and the positioning precision is improved.

Description

Method, device, drone and storage medium for positioning

technical field

The present application relates to the field of visual return navigation, and in particular, to a positioning method, device, unmanned aerial vehicle and storage medium.

Background technique

UAVs are unmanned aircraft operated by radio remote control equipment and self-contained program control devices. Alternatively, UAVs can also be fully or intermittently operated autonomously by on-board computers. During the flight of the UAV, since many times it is in the range of over-the-horizon, in order to ensure the safety of the UAV, it is quite necessary to automatically return to the home.

In the process of automatic return of the UAV, the UAV needs to locate the current position relatively quickly and accurately. However, it is very important to quickly and accurately locate the small-sized equipment such as UAVs.

SUMMARY OF THE INVENTION

The present application provides a positioning method, device, unmanned aerial vehicle and storage medium, which can be used for faster and more accurate positioning.

A first aspect of the present application is to provide a positioning method, the method is applied to a movable platform, and the movable platform includes a vision sensor, including: acquiring first image description information of historical images collected by the vision sensor and the first key point description information, and obtain the first position information of the movable platform when collecting the historical image; obtain the current image collected by the vision sensor, and obtain the first position of the current image based on the feature extraction model two image description information and second key point description information; based on the first image description information and the first key point description information of the historical image, and the second image description information and all the current image The second key point description information is used to determine the matching results of a plurality of the historical images and the current image; according to the matching results and the first position information of the historical images, determine the second position information of the movable platform.

A second aspect of the present application is to provide a positioning device, including: a memory, a processor and a visual sensor; the memory for storing a computer program; the visual sensor for collecting historical images and current images; The processor invokes the computer program to implement the following steps: acquiring the first image description information and the first key point description information of the historical images collected by the visual sensor, and acquiring the available information when collecting the historical images. the first position information of the mobile platform; obtain the current image collected by the visual sensor, and obtain the second image description information and second key point description information of the current image based on the feature extraction model; The first image description information and the first key point description information, and the second image description information and the second key point description information of the current image, to determine a plurality of the historical images and the current image The matching result; according to the matching result and the first position information of the historical image, determine the second position information of the movable platform when the current image is collected.

A third aspect of the present application is to provide an unmanned aerial vehicle, comprising: a body and the positioning device described in the second aspect.

A fourth aspect of the present application is to provide a computer-readable storage medium, the storage medium is a computer-readable storage medium, and program instructions are stored in the computer-readable storage medium, and the program instructions are used in the first aspect. method described.

A positioning method provided by the present application, and the method is applied to a movable platform. The movable platform includes a visual sensor, including: acquiring first image description information and first key point description information of historical images collected by the visual sensor, and acquiring The first position information of the movable platform when the historical image is collected; the current image collected by the visual sensor is acquired, and the second image description information and the second key point description information of the current image are obtained based on the feature extraction model; The image description information and the first key point description information, as well as the second image description information and the second key point description information of the current image, determine the matching results of multiple historical images and the current image; according to the matching results and the first position of the historical image information to determine the second position information of the movable platform when the current image is captured. Wherein, by obtaining the second image description information and the second key point description information of the current image based on the feature extraction model, the second image description information and the second key point description information can be obtained at the same time, which improves the efficiency of obtaining the two description information. The two description information can be more accurately determined, thereby further saving the positioning time, improving the positioning accuracy, and satisfying the real-time performance of acquiring the two description information and the real-time performance of positioning. At the same time, this method can also be applied to movable platforms such as UAVs, which can help the UAVs to return home more smoothly.

In contrast, for the feature extraction model, in the process of model training, fusion training can be performed for the second image description information and the second key point description information, that is, training is performed for the one feature extraction model, that is, it can be achieved. The acquisition of the second image description information and the second key point description information improves the global performance.

In addition, the embodiments of the present application also provide a device, an unmanned aerial vehicle, and a storage medium based on the method, all of which can achieve the above effects.

Description of drawings

The drawings described herein are used to provide further understanding of the present application and constitute a part of the present application. The schematic embodiments and descriptions of the present application are used to explain the present application and do not constitute an improper limitation of the present application. In the attached image:

1 is a schematic flowchart of a positioning method provided by an embodiment of the present application;

2 is a schematic structural diagram of a feature extraction model provided by an embodiment of the present application;

3 is a schematic structural diagram of a positioning device provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a positioning device provided by an embodiment of the present application.

Detailed ways

In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments These are some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terms used herein in the description of the present invention are for the purpose of describing specific embodiments only, and are not intended to limit the present invention.

In order to facilitate the understanding of the technical solutions and technical effects of the present application, the prior art is briefly described below:

Based on the foregoing content, it can be seen that in the process of returning home, the UAV needs to locate the current position relatively quickly and accurately, especially for a small-sized mobile platform such as the UAV.

In the prior art, the key frame matching and key point matching can be divided into two tasks for visual return navigation, but considering the time efficiency, the key frame matching can use the BoW (Bag of Words, bag of words model) method. Although the time efficiency is high, the effect is not ideal. In the key point matching task, the ORB (Oriented Fast and Rotated Brief, a fast feature point extraction and description algorithm) descriptor is often used, but the ORB descriptor often appears in the visual return navigation. Large-scale and large-angle perspective Transformation is less effective. In order to further improve the time efficiency, the embodiments of the present application provide a method of generating the descriptors of key frames and key points in the same network model, and reconstructing the network structure for positioning.

Some embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following embodiments and features in the embodiments may be combined with each other without conflict between the embodiments.

1 is a schematic flowchart of a positioning method provided by an embodiment of the present invention; the method 100 provided by an embodiment of the present application can be applied to a movable platform, such as an unmanned aerial vehicle and an intelligent mobile robot, and the movable platform includes a visual sensor. The method 100 includes the following steps:

101: Acquire first image description information and first key point description information of a historical image collected by a visual sensor, and obtain first position information of the movable platform when collecting the historical image.

102: Acquire the current image collected by the visual sensor, and acquire second image description information and second key point description information of the current image based on the feature extraction model.

103: Based on the first image description information and the first key point description information of the historical image, and the second image description information and the second key point description information of the current image, determine the matching results of the multiple historical images and the current image.

104: According to the matching result and the first position information of the historical image, determine the second position information of the movable platform when the current image is collected.

Wherein, by obtaining the second image description information and the second key point description information of the current image based on the feature extraction model, the second image description information and the second key point description information can be obtained at the same time, which improves the efficiency of obtaining the two description information. The two description information can be more accurately determined, thereby further saving the positioning time, improving the positioning accuracy, and satisfying the real-time performance of acquiring the two description information and the real-time performance of positioning. At the same time, this method can also be applied to movable platforms such as UAVs, which can help the UAVs to return home smoothly and ensure the safety of UAVs.

It should be noted that the method 100 can be applied to a movable platform, and the movable platform can also involve other movable platforms besides drones, or movable devices, such as sweeping robots, etc., which can make these movable platforms Or the mobile device can automatically return to the home or automatically return to the original location, etc.

The above steps are described in detail below:

Among them, the visual sensor refers to an instrument that uses optical components and imaging devices to obtain image information of the external environment. It can be set inside the movable platform and used to obtain the external environment information of the movable platform, such as the outside of the current geographic location of the UAV. environment image.

Historical images refer to the historical images obtained by the movable platform during the moving process, such as the external environment images obtained by the UAV during the normal navigation phase. When the UAV starts to return to the flight, the external environment image obtained during the normal navigation phase can be used as a historical image for reference to determine the current geographic location of the UAV during the process of returning to the flight.

The first image description information refers to information representing the characteristics of the image, such as image descriptors. Wherein, the image can be a key frame image in the moving process, which can be called a key frame descriptor. The first key point description information refers to information representing the features of key points in the image, such as key point descriptors in the image. The key point may be a corner or edge in the image.

The first location information refers to the geographic location where the movable platform is located when the movable platform obtains the corresponding historical image. The current geographic location can be determined by a positioning device of the movable platform, such as GPS (Global Positioning System, global positioning system). In addition to obtaining the geographic location, the pose of the movable platform, which may also be called orientation information, can also be obtained, so as to determine the pose of the movable platform.

The above-mentioned first image description information and first key point description information can be obtained by the following feature extraction model. It can also be acquired by other acquisition methods, such as SIFT (scale-invariant feature transform, Scale-invariant feature transform) algorithm, or, SuperPoint (feature point) algorithm, etc. It should be noted that although the above-mentioned other acquisition methods cannot adapt to the above-mentioned complex algorithms due to their relatively complex algorithms and the mobile platform may have problems such as small size, the above-mentioned complex algorithms can still be run, but the real-time performance is not ideal. . However, for the historical images, the historical images may not be acquired in real time, and the historical images may be acquired with time intervals.

For example, when the UAV is sailing normally in the air, the UAV can obtain the image of the external environment when the UAV is in the air through the camera of the vision sensor set on the UAV from the normal navigation, that is, the historical image. The vision sensor transmits the acquired historical images to the drone for image processing. The vision sensor can acquire historical images in real time, or it can acquire historical images with time intervals. In addition, the vision sensor or UAV can also determine whether it belongs to the key frame image according to the acquired historical image, which can be determined according to the determination rules of the key frame, and then the vision sensor determines whether to send the historical image to the UAV. , or the drone processes historical images after determining keyframes. The UAV can obtain image descriptors of historical images and key point descriptors in the image, such as corner descriptors, through the following feature extraction model or SIFT algorithm.

Wherein, the current image refers to an image obtained by the movable platform at the current geographic location during the process of returning to the voyage or returning to the mobile platform. The second image description information and the second key point description information are of the same nature as the first image description information and the first key point description information in the foregoing step 101, and will not be repeated here.

It should be noted that, since the second image description information and the second key point description information are obtained based on the same model (ie, the feature extraction model), it can not only improve the efficiency of information acquisition, but also meet the requirements of the descriptors obtained in real time during model training. When the model is fused instead of separately trained, the overall situation is better.

Among them, the feature extraction model includes a feature extraction layer, an image description information generation layer and a key point information generation layer; the feature extraction layer is used to extract the common feature information of the current image based on the convolutional network; the image description information generation layer is used based on the shared feature information. generating second image description information; the key point information generation layer is used for generating second key point description information based on the common feature information.

Figure 2 shows a schematic diagram of the structure of the feature extraction model. The feature extraction model includes a feature extraction layer 201 , an image description information generation layer 203 and a key point information generation layer 202 . There may be a convolutional network in the feature extraction layer 201, the convolutional network including multiple layers of convolutional layers.

Specifically, the feature extraction layer is used to extract the common feature information of the current image based on the convolutional network, including: the feature extraction layer is used to extract the common feature information based on multiple convolutional layers in the convolutional network. In FIG. 2 , four convolution layers may be included, and the convolution layer may perform convolution on the input historical image or current image, that is, the real image 204 , to obtain convolved image feature information, that is, shared feature information.

Specifically, the key point information generation layer is used to generate the second key point description information based on the shared feature information, including: the key point information generation layer is used to extract the first key point information from the shared feature information based on a convolutional layer and bilinear upsampling. Two key point description information, the number of convolution kernels of one convolution layer is the same as the number of the second key point description information. As shown in FIG. 2 , the keypoint information generation layer 202 may include a convolutional layer, and the number of convolution kernels in the convolutional layer is the same as that of the second keypoint description information. After bilinear upsampling, the key point descriptor 2021 in the current image can be obtained.

Specifically, the image description information generation layer is configured to generate the second image description information based on the shared feature information, including: the image information generation layer is configured to extract the second image description information from the shared feature information through two convolution layers and a NetVLAD layer. As shown in FIG. 2, for the image description information generation layer 203, it may include two convolution layers and a NetVLAD (Net Vector of Local Aggregated Descriptors) layer, thereby generating the current image descriptor 2031 .

It should be noted that, in the key point descriptor generation method, although it is described above that the SIFT algorithm can be used to obtain the descriptor, and its effect is good, the SIFT algorithm has high complexity and cannot be implemented in real time on embedded devices. For the return flight, the real-time performance is poor. In addition, the SIFT algorithm has not been specially improved for large-scale and large-angle changes, and is not suitable for use in visual return. However, other traditional methods generally have poor effect or high time complexity, and cannot meet the requirements of descriptors in visual return.

In addition, for the key point descriptor generation method based on learning, SuperPoint is a better model at present. This model can obtain the position of the key point in the image and the key point descriptor at the same time, but the model is relatively large. It is difficult to run in real time on embedded devices, and because its training data is generated through homography changes, it cannot well simulate the actual usage scenarios in visual return navigation. However, in the embodiment of the present application, the above-mentioned one feature extraction model can meet the requirement of being able to run in real time on a mobile platform, that is, the embedded device. Among them, the feature extraction layer can extract the features of the current image or historical images by fast downsampling with a convolutional layer with a stride of 2, which can reduce computing power consumption. And this model structure can reduce the computational complexity of the network as much as possible while ensuring the effect of extracting the descriptors, and at the same time combine the information of graph features and point features, that is, the shared feature information, to generate key point descriptors and points in the same network model. Image descriptors not only make full use of the commonality of images and key points, but also save a lot of time for repeated feature calculation. For example, as mentioned above, when the drone is subjected to environmental factors such as weather, which causes the signal to be interrupted, or there is a problem with the GPS positioning device, it can automatically trigger the return flight. In the process of returning home, the UAV can obtain the current image in real time through the vision sensor and send it to the UAV for processing. After the drone receives the current image, the current image can be input into the feature extraction model. For any current image, the feature extraction layer in the model is first passed, and the common feature information of the current image is obtained through the convolution layer in the feature extraction layer. Then this shared feature information is sent to the image description information generation layer and the key point information generation layer respectively. After receiving the shared feature information, the image description information generation layer obtains the current image descriptor through two convolution layers and NetVLAD layers. After the key point information generation layer receives the shared feature information, it obtains the key point descriptor in the current image through a convolution layer and bilinear upsampling.

In addition, since it is necessary to obtain the image descriptor and key point descriptor of the current image in real time, this descriptor needs to be used in large quantities. Since most of the currently obtained descriptors are based on floating-point data type descriptors, not only do they take up a lot of space , and the measurement time is expensive. It will consume more resources for embedded devices. Therefore, in order to improve the utilization of resources, reduce the occupied space and memory consumption, it can be realized by converting the descriptor of the floating-point data type to the descriptor of the Boolean data type.

Specifically, the image information generation layer is used to extract the second image description information from the shared feature information through the two convolution layers and the NetVLAD layer, including: the image information generation layer is used to pass the two convolution layers and the NetVLAD layer. The feature information is extracted into the second image description information of the floating point data type; the image information generation layer is used to convert the second image description information of the floating point data type into the second image description information of the Boolean data type.

For example, according to the foregoing, as shown in FIG. 2 , after receiving the shared feature information, the image description information generation layer 203 obtains the current image descriptor 2031 through two convolutional layers and the NetVLAD layer, and the current image description obtained at this time The sub 2031 is of floating point data type, and the current image descriptor 2031 of the floating data type is converted into the current image descriptor 2032 of the Boolean data type.

On the contrary, the above problem also exists for the key point information generation layer, so the data type of the key point descriptors in this layer can also be converted from the floating point data type to the Boolean data type.

Specifically, the key point information generation layer is used to extract the second key point description information from the shared feature information through a convolution layer and bilinear upsampling, including: the key point information generation layer is used to pass a convolution layer. and bilinear upsampling, extracting the second key point description information of the floating point data type from the common feature information; the key point information generation layer is used to convert the second key point description information of the floating point data type into the Boolean data type. The second key point describes the information. For example, according to the foregoing, as shown in FIG. 2 , after receiving the shared feature information, the keypoint information generation layer 202 obtains the keypoint descriptor 2021 in the current image through a convolutional layer and bilinear upsampling. The key point descriptor 2021 obtained at this time is of the floating point data type, and then the key point descriptor 2021 of the floating point data type is converted into the key point descriptor 2022 of the Boolean data type.

It is different from the SuperPoint algorithm to finally obtain all the keypoint descriptors, and then obtain the descriptors of the keypoints according to the positions of the keypoints. In the embodiment of the present application, in the key point information generation layer, after passing through a convolution layer, the key point descriptors are directly obtained by bilinear upsampling, and by only performing bilinear upsampling on the key point positions, when using can greatly reduce the amount of computation.

Specifically, the key point information generation layer is used to extract the second key point description information from the shared feature information based on a convolution layer and bilinear upsampling, including: determining the position of the key point in the current image; generating the key point information The layer is used to obtain the down-sampling information of the shared feature information through a convolution layer; the key point information generation layer is used to directly up-sample the information of the corresponding position in the down-sampling information through bilinear up-sampling to obtain the second key point Description.

The position of the key point refers to the position of the key point in the image, that is, the size of the image in which the key point is located. If the size of the key point is 16*16 pixels, the position of the key point in the image can be determined.

For example, according to the foregoing, in order to further improve the time efficiency of the model, key points are obtained in a certain grid area in the current image, such as 16x16 pixels. After the key point information generation layer is downsampled to the original 1/16 feature map (for the current image) by a layer of convolution, the descriptor is directly obtained by bilinear upsampling, which not only reduces other The consumption of deconvolution upsampling in the learning-based method, and in the actual training and use, only upsampling for the position of the key points can greatly reduce the time consumption.

The creation of the above feature extraction model is obtained through model training. Because the model is a multi-task branch network model, it can be trained step by step during training. First, use the key point training set to train the model, and initially fix the parameters in the image description information generation layer, which are used for Determine the image descriptor (which can be the current image descriptor or the historical image descriptor). When the model is trained until the loss does not decrease significantly, the parameters of the key point information generation layer and the parameters of the feature extraction layer can be determined. Then, the obtained image matching training set is used to train the image description information generation layer, and its final parameters are determined.

It should be noted that, the image description information generation layer can also be trained first, so that the parameters of the feature extraction layer can also be determined, and then the key point information generation layer is trained. However, the model trained in this way is slightly less accurate than the model trained by the above training method.

In addition, in order to improve the training time of the entire model, the model can be trained through other training platforms, such as through a server or a computer, and then transplanted to a mobile platform after the training is completed. Of course, if the performance of the movable platform can support training the model, the model can also be trained on the movable platform.

Specifically, the initial feature extraction layer is trained through the first training data, and the trained feature extraction layer is generated as the feature extraction layer in the feature extraction model; the first training data includes image point pairs corresponding to the same spatial point, The image point pair is represented in different corresponding real images of the same visual scene; through the first training data, the initial key point information generation layer is trained, and the trained key point information generation layer is generated as the key point in the feature extraction model. Point information generation layer.

The first training data is the training data in the above-mentioned key point training set. The structure of the initial keypoint information generation layer is the same as that of the trained keypoint information generation layer. Only the parameters are different. For the initial keypoint information generation layer, the parameters are the initial parameters.

The training process is the training process of the network model, which will not be repeated here. It is only explained: the image point pair may be the image point pair corresponding to the spatial point in the same three-dimensional space. The image point pair is derived from two images that are represented as different real images of the same visual scene, like two real images of the same location but at different angles, or at different image acquisition locations.

The acquisition method of the first training data including the above-mentioned image point pairs is as follows:

Specifically, for different visual scenes, obtain real images from different angles in each visual scene; for each visual scene, build a three-dimensional spatial model according to the real images corresponding to different angles; based on the similarity between spatial points, Select spatial points from the spatial three-dimensional model, and obtain the real image point pairs corresponding to each selected spatial point in the real image; select the real image point pairs according to the similarity between the collection positions of the real image point pairs , and the selected real image point pairs are used as key point pairs to obtain the first training data.

The acquisition of real images from different angles may be: For example, according to the foregoing description, the UAV may acquire real images according to the size of the flight height and the attitude angle. For example, according to the standard of low, medium and high flight height and small, medium and large flight attitude angle, the UAV is used to collect the real image of the downward view, that is, the image data. At the same time, in order to speed up the subsequent model training and improve the balance of data distribution, the data that is too similar is eliminated according to the similarity of the collected images.

Therefore, for the application scenarios of UAV flying, the perspective and scale transformation of large data can be overcome. At the same time, real data can be provided during model training and testing. The real data can include a large number of collected real images and matching feature points in the real images, which can also be called key points.

The process of constructing a three-dimensional spatial model may be: For example, according to the foregoing, for at least two real images (which may be two, three, four, and five, etc.) of the same visual scene obtained above, using SFM (Structure from motion, three-dimensional modeling) modeling method to build a three-dimensional model of space. After the model is established, there are 2D points in at least two real images corresponding to each real 3D point in the spatial three-dimensional model, thereby forming a 2D point pair. In order to improve the generalization ability of the model, robust feature descriptions can be extracted when processing different types of key points. The embodiment of the present application can use a variety of different types of key points to construct a three-dimensional model through SFM. The key point type may include, but is not limited to, SIFT type key points (which may be key points or corner points obtained by SIFT algorithm, etc.), FAST (Features from accelerated segment test, feature point detection algorithm) type key points (which may be The key points or corner points obtained by the FAST algorithm), the key points of the ORB type (which can be the key points or corner points obtained by the ORB algorithm, etc.), the key points of the Harris type (which can be obtained by the Harris algorithm) key points or corner points, etc.). This results in more general training data.

The 3D points corresponding to the 2D point pairs that can be obtained through the above process will contain a large number of 3D points with similar distances, especially when a certain area in the image is particularly rich in texture, a large number of 3D points corresponding to this area will appear, which will affect The balanced distribution of training data needs to be filtered. The screening process is as follows:

A 3D point set S can be defined first, including the filtered 3D points. It can be traversed from the 3D points generated above, so that the similarity of any two 3D points in the filtered 3D points is less than or equal to a threshold, then the similarity of any two 3D points in the 3D points obtained in the set S is degree is less than or equal to a threshold. The similarity algorithm can be determined by the Euclidean distance algorithm.

In addition, a set P may also be set, and the set P is a set of candidate 3D points. Before screening, all the generated 3D points can be set in the set P as candidate 3D points. First, it is determined from the set P that the similarity of any two 3D points is less than or equal to a threshold value, and any two 3D points can be selected by traversing to determine their similarity. If the similarity is less than or equal to a threshold, put the any two 3D points into the set S. At this time, the similarity between the set P and each 3D point in the set S, that is, the Euclidean distance d, is calculated. If d is not greater than the set threshold α, the corresponding 3D points in the set P are added to the set S, so that The similarity of any two 3D points in the set S is less than or equal to a threshold, so that the 3D points in the set S are not overly similar and have data balance. If the set S is empty after screening, the candidate 3D point P may be added to the set S, where the candidate 3D point may be the 3D point generated above.

After the 3D points are screened, the selection of spatial points is completed, and the corresponding 2D point pairs need to be screened, that is, the corresponding real image point pairs are selected. It should be understood that, according to the foregoing, after the three-dimensional spatial model is created, the spatial points in the three-dimensional spatial model have a corresponding relationship with the real image points in the real image used to construct the three-dimensional spatial model. Afterwards, the spatial points that have been screened also have corresponding real image point pairs, that is, 2D point pairs.

Since each 3D point after screening will correspond to 2D points in multiple perspectives (real images from different perspectives used to construct the 3D model of the space), in order to increase the difficulty of the dataset, improve the accuracy and universality of the model. For each 3D point only the hardest pair of matching 2D points is kept. Through the above process, a 3D point set S is obtained, and any 3D point m in S is defined, and the corresponding 2D points under different viewing angles form a set T, and an image acquisition device (set on a movable platform) under each viewing angle in T is used. ), such as a camera, the poses constitute a set Q, and it should be understood that each pose corresponds to an image acquisition device, such as a camera. Traverse the set Q, calculate the similarity between the corresponding image acquisition devices, such as cameras, and positions, such as the Euclidean distance, to obtain the two camera positions with the largest Euclidean distance, keep the corresponding 2D points in T, and discard the remaining 2D points. Thus, the set S is traversed, and the unique 2D point pair corresponding to each 3D point in the set S is determined, and all the filtered 2D point pairs constitute the set T. It should be understood that the positions of the two cameras (ie, the image acquisition devices) with the largest Euclidean distance are the two positions that represent the least similarity. Then the 2D point pairs obtained are the most difficult. Thus, the first training data can be obtained.

However, in order to better represent the effect of the model after training, the first training data can be divided according to different difficulties, and divided into three categories: simple, general, and difficult. For the sets S and T obtained above, since any 3D point in the set S corresponds to a 2D point pair in the set T, then it can be defined that each group of corresponding 3D points m, 2D points x and 2D points y constitute a sample n , and calculate the difficulty score L of each sample n according to the following formula (1).

L=λ1 La+λ2Ld+λ3Lq (1)

Among them, La represents the angle ∠xpy formed by the 2D point pair in the sample n and the 3D point, Ld represents the 2D point x, 2D point y corresponding to the image acquisition device, such as the camera, the spatial distance between the positions, Lq represents the corresponding 2D point Image acquisition device, such as camera, quaternion angle of pose. In order to improve the rationality of the division, weight parameters λ1, λ2 and λ3 are introduced. According to the final difficulty score L, the first training data is divided into easy, normal, and difficult.

It should be noted that the difficulty level of the first training data can be known based on the division of the first training data, so that the training of subsequent models can be more accurately controlled, especially whether the model can meet many application scenarios, Whether the descriptors can be obtained more accurately in different application scenarios. At the same time, the first training data can also be adjusted according to the degree of difficulty, so that the degree of difficulty of the samples can meet the requirements and meet the requirements of model training. As can be seen from the foregoing, in order to further reduce the storage space used by the descriptors and reduce the time for measuring the distance between the descriptors. In this embodiment of the present application, the loss function of the Boolean descriptor can be added, and under the combined action of multiple loss functions, the image descriptor of the Boolean data type and the key point descriptor of the Boolean data type can finally be output. The dimension is much smaller than the traditional feature descriptor, and its effect is also better than the traditional feature descriptor. In addition, binary descriptors of Boolean data type are directly output from the feature extraction model, which is more convenient for subsequent retrieval and matching of descriptors.

Specifically, using the first training data to train the initial key point information generation layer to generate a trained key point information generation layer, including: adding Boolean to the loss function of the floating point data type in the initial key point information generation layer The loss function of the data type; through the first training data, the loss function of the floating point data type, and the loss function of the Boolean data type, the initial key point information generation layer is trained, and the trained key point information generation layer is generated.

According to the above, before training the key point information generation layer for training, the loss function of this layer can be converted from a loss function of floating point data type to a loss function of floating point data type, and a loss function of Boolean data type is added to the loss function, that is, Form multiple loss functions. It should be understood that only a loss function of floating-point data type can also implement model training, but the descriptor obtained by the model during training is a descriptor of floating-point data type. Therefore, the loss function of the boolean data type is added to the loss function of the floating point data type as the loss function of the layer, and at the same time, the layer is trained through the first training data, and the trained layer is obtained. The trained feature extraction layer can also be obtained.

Based on this, after the key point information generation layer and the feature extraction layer are trained, the image description information generation layer can be trained.

Specifically, the method 100 may further include: based on the trained feature extraction layer, training the initial image description information generation layer by using the second training data, and generating the trained image description information generation layer as the image in the feature extraction model The description information generation layer, wherein the second training data includes key-frame image matching pairs and information indicating whether each key-frame image matching pair belongs to the same visual scene.

The second training data may be acquired in the following manner: acquiring real images, determining real image matching pairs from the real images based on the classification model, and using them as key frame image matching pairs, and determining whether each real image matching pair belongs to the same vision scene, so as to obtain the second training data.

The classification model may be a model for matching real images, and the model may determine real image matching pairs that belong to the same visual scene and real image matching pairs that do not belong to the same visual scene, such as two matching pairs belonging to the same location. real image. The model can be BoW.

For example, according to the foregoing description, multiple real images in multiple different visual scenes can be obtained by the drone in the actual flight scene of the visual return flight. Then use the BoW model to input real images into the model to obtain matching pairs of real images in the same visual scene and matching pairs of real images in different scenes determined by the model. Among them, the model can determine matching pairs by scoring. The real image matching pairs with scores higher than the threshold are regarded as real image matching pairs in the same visual scene, that is, positive sample training data. The real image matching pairs whose scores are lower than the threshold are regarded as real image matching pairs that are not in the same visual scene, that is, the negative sample training data. Thus, the second training data can be obtained.

In order to improve the ability of the model and the universality of the model. After the BOW model is determined, a random candidate matching pair can be added, that is, a candidate matching pair is randomly selected from the collected real image to generate a candidate matching pair, and after the candidate matching pair is generated. Whether there are errors or problems in these matching pairs is further determined manually. When there are problems or errors, especially caused by the classification model, precious negative sample training data can be obtained to improve the model ability.

Specifically, the method 100 further includes: by displaying the real image matching pairs, in response to a user's determination operation, determining whether the real image matching pairs belong to the same visual scene, thereby acquiring the second training data.

According to the foregoing, after obtaining the real image matching pairs and candidate matching pairs, or just after obtaining the real matching pairs, the images corresponding to the matching pairs can be displayed through a display device, such as a display screen, and the matching pairs can be displayed as the matching pairs. For example, when a pair of matching pairs is displayed, in addition to displaying the corresponding two real images, the corresponding feature points between the two real images can also be displayed, and the corresponding feature points can be displayed through line to connect. Then, annotations are made by workers (i.e. users). The annotation can include the following situations: same, not same, and indeterminate. The same can be represented by "0", the difference can be represented by "1", and the uncertainty can be represented by "2". Matching pairs that are manually marked as uncertain can be eliminated and not used as the second training data. Others are used as the second training data, that is, the matching pairs marked with "0" and the matching pairs marked with "1" are used as the second training data.

Specifically, the method 100 further includes: randomly selecting real image matching pairs from real images as key frame image matching pairs; by displaying the randomly selected real image matching pairs, in response to a user's determination operation, determining the randomly selected real images Whether the matching pair belongs to the same visual scene, so as to obtain the second training data.

Since it has been described above, it will not be repeated here.

It should be noted that by using the BoW model to select matching pairs, the selected negative sample training data also has a difficult problem. By manually labeling based on the BOW model, more valuable negative sample training data can be found. (that is, the scenes are similar and are mistakenly identified by BOW as matching pairs belonging to the same visual scene), which helps to train a more robust model network. In addition, since the images of the matching pair can be obtained by the UAV from the actual flight scene of the visual return, it can fully reflect the change of the perspective scale in the visual return task.

After obtaining the second training data, the initial image description information generation layer can be trained, and the specific training process will not be repeated. Finally, the trained image description information generation layer can be obtained.

As can be seen from the foregoing, in order to further reduce the storage space used by the descriptors and reduce the time for measuring the distance between the descriptors. In this embodiment of the present application, the loss function of the Boolean descriptor can be added, and under the combined action of multiple loss functions, the image descriptor of the Boolean data type and the key point descriptor of the Boolean data type can finally be output. The dimension is much smaller than the traditional feature descriptor, and its effect is also better than the traditional feature descriptor. In addition, the binary descriptor of Boolean data type is directly output from the feature extraction model, which is more convenient for the retrieval and matching of subsequent descriptors. For example, the binary descriptor of the Boolean data type of the second image description information is output from the image description information generation layer.

Specifically, based on the feature extraction layer after training, the initial image description information generation layer is trained by the second training data, and the trained image description information generation layer is generated, including: floating point data in the initial image description information generation layer The loss function of Boolean data type is added to the loss function of the type; based on the feature extraction layer after training, the initial image description information generation layer is processed through the second training data, the loss function of the floating point data type and the loss function of the Boolean data type. Train, generate the image description information generation layer after training.

Since it has been described above, it will not be repeated here. Only note: The training of the initial image description information generation layer is based on the feature extraction layer after training and the loss function of floating point data type and the loss function of Boolean data type. Through the second training data, the initial image description information generation layer is training. Thus, the feature extraction model can be completely trained.

Because the network of this feature extraction model is a multi-task branch network, a step-by-step training method can be adopted during training. The first training data can be used to train the model, that is, the initial key point information generation layer is trained. When the loss does not decrease significantly after training, the parameters of the initial key point information generation layer and the initial feature extraction layer are fixed to obtain the key point information generation layer and the feature extraction layer. Then use the second training data to train the initial image description information generation layer to obtain the image description information generation layer.

Since the embodiment of the present application obtains the second image description information and the second key point description information simultaneously through the same feature extraction model, in the training process, the first training data can be used to firstly use the first training data to correspond to the feature extraction model, because the The first training data is obtained from the spatial three-dimensional model, which is completely correct data. After the training of the first training data, the common layer of the model, that is, the feature extraction layer, is already a good feature extraction layer.

Then use the second training data marked by the workers to train the initial image description information generation layer. After the shared feature extraction layer trained by the first training data, the second training data can be used for training to avoid the generation of worker annotations. The influence of the error on the network, so as to obtain a better image description information generation layer.

It should be noted that the above-mentioned training methods can also be reversed in order, that is, the training is performed first with the second training data, and then the training is performed with the first training data. It will not be repeated here.

To fine-tune the model more precisely, the entire network can be fine-tuned using training data that contains both keypoints and keyframe images.

Specifically, after the feature extraction model is trained, the method 100 further includes: adjusting the feature extraction layer, the image description information generation layer and/or the key point information generation layer in the feature extraction model through the third training data, and the third The three training data includes keyframe image matching pairs and keypoint matching pairs in the keyframe image matching pairs.

Wherein, the third training data can be determined in the following way: when the number of real image point pairs in the two real images is greater than the threshold, the two real images and the corresponding real image point pairs are used as the key frame image matching pair and key point matching pairs to obtain the third training data.

According to the method described above, a three-dimensional spatial model can be constructed. After the model is established, there are at least two 2D points in the real images corresponding to each real 3D point in the spatial three-dimensional model, thereby forming a 2D point pair, that is, a pair of A 2D point pair belongs to a 3D point of a 3D model of space. When there are multiple 2D point pairs in the two real images, the number of which is greater than the number threshold, the two real images and the 2D point pairs in them can be used as the third training data, and the third training data can have multiple pairs real images, and each pair of real images has corresponding 2D point pairs.

After the third training data is obtained, the model trained by the first training data and the second training data is fine-tuned by using the third training data. Fine-tune the parameters of the feature extraction layer, image description information generation layer and/or keypoint information generation layer in the trained feature extraction model. It will not be repeated here.

The fine-tuned model can be used. If the model is trained on a mobile platform, it can be used directly. If the model is trained on a terminal, such as a server or a computer, the trained final model can be transplanted to the mobile platform.

According to the foregoing, after obtaining the description information of the second key points, the corresponding information can be combined according to the order of the key points in the image, so as to perform subsequent matching.

Specifically, the method 100 further includes: combining the corresponding multiple second key point description information into a vector according to the sequence of multiple key points in the current image.

For example, according to the foregoing, after the UAV obtains multiple key point descriptors of the current image through the feature extraction model, the corresponding descriptors can be integrated into a vector according to the order of the key points in the current image. for subsequent matching.

Correspondingly, for the first key point description information, it is also possible to integrate the corresponding descriptors into a vector according to the order of the key points in the historical image. for subsequent matching.

Wherein, the image description information is used to find a first type of historical image with a scene similar to the current image in a plurality of the historical images, and the key point description information is used to find a first type of historical image in the first type of historical image A key point matching the key point of the current image is searched, and the matching result includes the matching relationship between the key point of the current image and the key point in the historical image.

In other words, the image description information is used to roughly pair the images, and based on this, one or more historical images (the first type of historical images) that more closely match the current image scene are obtained. What is related to positioning is the matching relationship of key points. Based on the description information of key points, the key points in the current image history image can be further matched to obtain the matching relationship of key points, that is, a key point in the current image, A match to a keypoint in the historical image.

The position information of the key points in the historical image can be considered accurate, because based on the position information of the key points in the historical image, the matching relationship between a key point in the current image and a key point in the historical image can be obtained. Location information of a key point in the image.

For example, according to the foregoing, the UAV obtains the image descriptor of the above-mentioned historical image and the key point descriptor or the vector composed of the key point descriptor. And the image descriptor of the current image and the keypoint descriptor or the vector composed of the keypoint descriptor are obtained. The vector composed of the image descriptor corresponding to the current image and the key point descriptor or key point descriptor can be compared with the image descriptor corresponding to multiple historical images and the vector composed of the key point descriptor or key point descriptor. The comparison result, that is, the matching result, can be determined through a similarity algorithm.

When the current image is compared with the historical image, the comparison result may be that the current image may be exactly the same as one of the historical images, or partially the same, that is, similar. When there is only similarity, the similarity can be obtained according to the similarity algorithm to determine whether the similarity is greater than the similarity threshold. When it is greater than the similarity threshold, it can be determined that the matching result is a matching. Otherwise, it is a mismatch.

It should be noted that the above similarity algorithm may include Hamming distance, Euclidean distance, and the like.

In addition, according to the foregoing, the above-mentioned image descriptors and key point descriptors can be Boolean descriptors. When the Boolean descriptor measures the distance between the corresponding descriptors through the similarity algorithm, it only needs to perform the XOR operation to obtain the similarity, such as the Hamming distance, which can greatly speed up the calculation of the distance between the corresponding descriptors. , thereby further reducing the time consumption.

For example, according to the above-mentioned matching results, the UAV determines which historical image the current image is the same as or meets the similarity threshold, so as to determine the geographic location to which the current image belongs based on the geographic location to which the historical image belongs. The determined geographic location may be an absolute geographic location of the current image, that is, a geographic location based on a geographic location coordinate system or a geographic location relative to a historical image. When the two images do not belong to exactly the same, such as there is an angle change, but belong to the two images in the same visual scene. Then, spatial 3D modeling can be performed through these two images. Then, different angles or different positions of the two images are determined according to the three-dimensional spatial model, thereby determining the position of the current image.

After determining the position of the current image, the movable platform can go back according to the position.

Specifically, the method 100 may further include: determining the posture of the movable platform according to the position deviation between the corresponding first position information and the corresponding second position information in the matching result; according to the posture, the movable platform moves from the second position to the The first position to realize the automatic return of the movable platform.

For example, as described above, the posture of the drone is determined and adjusted according to the deviation of the above two positions. So that the drone can move from the second position to the first position, so as to realize the return flight.

3 is a schematic structural diagram of a positioning device according to an embodiment of the present invention; the device 300 can be applied to a movable platform, such as an unmanned aerial vehicle, an intelligent mobile robot, etc., and the movable platform includes a visual sensor. The apparatus 300 can perform the above-mentioned positioning method. The apparatus 300 includes: a first obtaining module 301 , a second obtaining module 302 , a first determining module 303 and a second determining module 304 . The functions of each module are described in detail below:

The first obtaining module 301 is configured to obtain first image description information and first key point description information of historical images collected by the visual sensor, and obtain first position information of the movable platform when collecting the historical images.

The second obtaining module 302 is configured to obtain the current image collected by the visual sensor, and obtain second image description information and second key point description information of the current image based on the feature extraction model.

The first determination module 303 is configured to determine a plurality of historical images and the current image based on the first image description information and the first key point description information of the historical image, and the second image description information and the second key point description information of the current image. match result.

The first determination module 304 is configured to determine the second position information of the movable platform when the current image is collected according to the matching result and the first position information of the historical image.

Specifically, the feature extraction model includes a feature extraction layer, an image description information generation layer and a key point information generation layer; the feature extraction layer is used to extract the common feature information of the current image based on the convolutional network; the image description information generation layer is used based on the shared features. The information generates second image description information; the key point information generation layer is used for generating the second key point description information based on the common feature information.

Specifically, the feature extraction layer is used to extract common feature information based on multiple convolutional layers in the convolutional network.

Specifically, the image information generation layer is used to extract the second image description information from the shared feature information through the two convolution layers and the NetVLAD layer.

Specifically, the image information generation layer is used to extract the second image description information of the floating point data type from the common feature information through the two convolution layers and the NetVLAD layer; the image information generation layer is used to convert the floating point data type to the second image description information. The second image description information is converted into the second image description information of the Boolean data type.

Specifically, the key point information generation layer is used to extract the second key point description information from the shared feature information based on one convolution layer and bilinear upsampling, and the number of convolution kernels of one convolution layer is related to the second key point. The amount of descriptive information is the same.

Specifically, the key point information generation layer is used to extract the second key point description information of the floating point data type from the shared feature information through a convolution layer and bilinear upsampling; the key point information generation layer is used to convert the floating point data. The second key point description information of the point data type is converted into the second key point description information of the Boolean data type.

Specifically, the second acquisition module 302 is used to: determine the position of the key point in the current image; the key point information generation layer is used to obtain the downsampling information of the shared feature information through a layer of convolution layer; the key point information generation layer is used to The information of the corresponding position in the down-sampling information is directly up-sampled by bilinear up-sampling, so as to obtain the second key point description information.

In addition, the apparatus 300 further includes: a combining module, configured to combine the corresponding multiple second key point description information into a vector according to the sequence of the multiple key points in the current image.

In addition, the device 300 further includes: a third determining module, configured to determine the posture of the movable platform according to the position deviation between the corresponding first position information and the corresponding second position information in the matching result; a moving module, configured to determine the posture of the movable platform according to the posture , the movable platform moves from the second position to the first position to realize the automatic return of the movable platform.

In addition, the device 300 further includes: a training module for training the initial feature extraction layer through the first training data, and generating a trained feature extraction layer as the feature extraction layer in the feature extraction model; the first training data includes Image point pairs corresponding to the same spatial point, the image point pairs are represented in different corresponding real images of the same visual scene; the training module is used to train the initial key point information generation layer through the first training data, and generate training The latter key point information generation layer is used as the key point information generation layer in the feature extraction model.

In addition, the second acquisition module 302 is used for acquiring real images from different angles under each visual scene for different visual scenes; the device 300 further includes: a creation module for each visual scene, according to corresponding different angles The real image of the space is constructed, and the spatial three-dimensional model is constructed; the selection module is used to select the space point from the space three-dimensional model based on the similarity between the space points, and obtain the real image point corresponding to each selected space point in the real image. Right; the selection module is used to select the real image point pair according to the similarity between the collection positions of the real image point pair, and use the selected real image point pair as the key point pair to obtain the first training data.

Specifically, the training module includes: an adding unit for adding a loss function of Boolean data type to the loss function of floating point data type in the initial key point information generation layer; a training unit for passing the first training data, floating point data The loss function of point data type and the loss function of Boolean data type are used to train the initial key point information generation layer, and generate the key point information generation layer after training.

In addition, the training module is also used for: training the initial image description information generation layer through the second training data based on the trained feature extraction layer, and generating the trained image description information generation layer as the image description in the feature extraction model The information generation layer, wherein the second training data includes key-frame image matching pairs and information indicating whether each key-frame image matching pair belongs to the same visual scene.

In addition, the second acquisition module 302 further includes: acquiring real images, determining real image matching pairs from the real images based on the classification model, as key frame image matching pairs, and determining whether each real image matching pair belongs to the same visual scene , so as to obtain the second training data.

In addition, the apparatus 300 further includes: a third determination module, configured to determine whether the real image matching pairs belong to the same visual scene by displaying the real image matching pairs in response to the user's determination operation, thereby acquiring the second training data.

Specifically, the selection module is further configured to: randomly select real image matching pairs from the real images as key frame image matching pairs; the third determining module is configured to respond to the user's determination by displaying the randomly selected real image matching pairs operation to determine whether the randomly selected matching pairs of real images belong to the same visual scene, so as to obtain the second training data.

Specifically, the adding unit is also used for: adding a loss function of boolean data type to the loss function of floating point data type in the initial image description information generation layer; the training unit is also used to extract the layer based on the features after training, through The second training data, the loss function of the floating point data type, and the loss function of the Boolean data type are used to train the initial image description information generation layer to generate a trained image description information generation layer.

In addition, after the feature extraction model is trained, the apparatus 300 further includes: an adjustment module, configured to perform a feature extraction layer, an image description information generation layer and/or a key point information generation layer in the feature extraction model through the third training data After adjustment, the third training data includes key frame image matching pairs and key point matching pairs in the key frame image matching pairs.

In addition, the selection module is also used for: when the number of real image point pairs in the two real images is greater than the threshold, the two real images and the corresponding real image point pairs are used as key frame image matching pairs and key points Match pairs to obtain the third training data.

In a possible design, the structure of the positioning apparatus 300 shown in FIG. 3 may be implemented as an electronic device, and the electronic device may be a positioning device, such as a movable platform. As shown in FIG. 4 , the positioning device 400 may include: one or more processors 401 , one or more memories 402 and a visual sensor 403 . Among them, the visual sensor 403 is used to collect historical images and current images. The memory 402 is used to store a program that supports the electronic device to execute the positioning method provided in the embodiments shown in FIG. 1 to FIG. 2 . The processor 401 is configured to execute programs stored in the memory 402 . Specifically, the program includes one or more computer instructions, wherein the one or more computer instructions can implement the following steps when executed by the processor 401:

Running the computer program stored in the memory 402 to achieve: obtaining the first image description information and the first key point description information of the historical images collected by the vision sensor, and obtaining the first position information of the movable platform when collecting the historical images; obtaining visual the current image collected by the sensor, and obtain the second image description information and the second key point description information of the current image based on the feature extraction model; the first image description information and the first key point description information based on the historical image, and The second image description information and the second key point description information of the current image are used to determine the matching results between multiple historical images and the current image; 2. Location information.

Specifically, the image information generation layer is used to extract the second image description information of the floating point data type from the shared feature information through the two convolution layers and the NetVLAD layer; the image information generation layer is used to extract the second image description information of the floating point data type The image description information is converted into the second image description information of the Boolean data type.

In addition, the processor 401 is further configured to: determine the position of the key point in the current image; the key point information generation layer is used to obtain the down-sampling information of the common feature information through a convolution layer; the key point information generation layer is used to The information of the corresponding position in the down-sampling information is directly up-sampled by bilinear up-sampling, so as to obtain the second key point description information.

In addition, the processor 401 is further configured to: combine the corresponding multiple second key point description information into a vector according to the sequence of the multiple key points in the current image.

Specifically, the processor 401 is further configured to: determine the posture of the movable platform according to the position deviation between the corresponding first position information and the corresponding second position information in the matching result; according to the posture, the movable platform moves from the second position to the first position for automatic return of the movable platform.

In addition, the processor 401 is further configured to: train the initial feature extraction layer by using the first training data, and generate a trained feature extraction layer as the feature extraction layer in the feature extraction model; Image point pairs of spatial points, the image point pairs are represented in different corresponding real images of the same visual scene; through the first training data, the initial key point information generation layer is trained, and the trained key point information generation layer is generated, As the keypoint information generation layer in the feature extraction model.

In addition, the processor 401 is further configured to: for different visual scenes, obtain real images from different angles in each visual scene; for each visual scene, build a three-dimensional spatial model according to the real images corresponding to different angles; The similarity between the points, select the spatial point from the spatial three-dimensional model, and obtain the real image point pair corresponding to each selected spatial point in the real image; according to the similarity between the collection positions of the real image point pair, The real image point pair is selected, and the selected real image point pair is used as the key point pair to obtain the first training data.

In addition, the processor 401 is specifically configured to: add a loss function of the Boolean data type to the loss function of the floating point data type in the initial key point information generation layer; pass the first training data, the loss function of the floating point data type and the Boolean data type The loss function of the data type, the initial key point information generation layer is trained, and the trained key point information generation layer is generated.

In addition, the processor 401 is further configured to: based on the trained feature extraction layer, train the initial image description information generation layer by using the second training data, and generate the trained image description information generation layer as the image in the feature extraction model The description information generation layer, wherein the second training data includes key-frame image matching pairs and information indicating whether each key-frame image matching pair belongs to the same visual scene.

In addition, the processor 401 is further configured to: obtain a real image, determine a real image matching pair from the real image based on the classification model, use it as a key frame image matching pair, and determine whether each real image matching pair belongs to the same visual scene, Thus, the second training data is obtained.

In addition, the processor 401 is further configured to: by displaying the real image matching pairs, in response to the user's determination operation, determine whether the real image matching pairs belong to the same visual scene, thereby acquiring the second training data.

In addition, the processor 401 is further configured to: randomly select real image matching pairs from real images as key frame image matching pairs; by displaying the randomly selected real image matching pairs, in response to a user's determination operation, determine the randomly selected real image matching pairs Whether the image matching pairs belong to the same visual scene, so as to obtain the second training data.

Specifically, the processor 401 is specifically configured to: add a loss function of the Boolean data type to the loss function of the floating point data type in the initial image description information generation layer; based on the feature extraction layer after training, through the second training data, The loss function of the floating point data type and the loss function of the Boolean data type are used to train the initial image description information generation layer, and generate the trained image description information generation layer.

In addition, after training the feature extraction model, the processor 401 is further configured to: adjust the feature extraction layer, the image description information generation layer and/or the key point information generation layer in the feature extraction model through the third training data, The third training data includes key frame image matching pairs and key point matching pairs in the key frame image matching pairs.

In addition, the processor 401 is further configured to: when the number of real image point pairs in the two real images is greater than the threshold, then use the two real images and the corresponding real image point pairs as the key frame image matching pair and the key The points are matched to obtain the third training data.

In addition, an embodiment of the present invention provides a computer-readable storage medium, where the storage medium is a computer-readable storage medium, and program instructions are stored in the computer-readable storage medium, and the program instructions are used to implement the above-mentioned methods in FIG. 1 to FIG. 2 . .

An embodiment of the present invention provides an unmanned aerial vehicle; specifically, the unmanned aerial vehicle includes: a body and a positioning device as shown in FIG. 4 , and the positioning device is provided on the body.

The technical solutions and technical features in the above embodiments can be used alone or combined in the case of conflict with the present invention, as long as they do not exceed the cognitive scope of those skilled in the art, they all belong to the equivalent embodiments within the protection scope of the present application .

In the several embodiments provided by the present invention, it should be understood that the disclosed related detection apparatus (eg, IMU) and method may be implemented in other manners. For example, the embodiments of the remote control device described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods, such as multiple units or components. May be combined or may be integrated into another system, or some features may be omitted, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, and the indirect coupling or communication connection of the remote control device or unit may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention is essentially or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions for causing a computer processor (processor) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes.

The above descriptions are only the embodiments of the present invention, and are not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present invention, or directly or indirectly applied to other related technologies Fields are similarly included in the scope of patent protection of the present invention.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention. scope.

Claims

A positioning method, characterized in that the method is applied to a movable platform, and the movable platform includes a vision sensor, including:

obtaining first image description information and first key point description information of a plurality of historical images collected by the visual sensor, and obtaining first position information of the movable platform when collecting the historical images;

acquiring the current image collected by the visual sensor, and acquiring the second image description information and the second key point description information of the current image based on the feature extraction model;

Based on the first image description information and the first key point description information of the historical image, and the second image description information and the second key point description information of the current image, determine a plurality of the matching result between the historical image and the current image;

According to the matching result and the first position information of the historical image, the second position information of the movable platform when the current image is collected is determined.
The method according to claim 1, wherein the feature extraction model comprises a feature extraction layer, an image description information generation layer and a key point information generation layer;

The feature extraction layer is used to extract the common feature information of the current image based on the convolutional network;

The image description information generation layer is configured to generate the second image description information based on the common feature information;

The key point information generation layer is configured to generate second key point description information based on the common feature information.
The method according to claim 2, wherein the feature extraction layer is configured to extract common feature information of the current image based on a convolutional network, comprising:

The feature extraction layer is used for extracting the common feature information based on multiple convolutional layers in the convolutional network.
The method according to claim 2, wherein the image description information generation layer is configured to generate the second image description information based on the common feature information, comprising:

The image information generation layer is configured to extract the second image description information from the common feature information through two convolution layers and a NetVLAD layer.
The method according to claim 4, wherein the image information generation layer is configured to extract the second image description information from the common feature information through two convolution layers and a NetVLAD layer, comprising:

The image information generation layer is used to extract the second image description information of the floating point data type from the shared feature information through two layers of convolution layers and the NetVLAD layer;

The image information generation layer is used for converting the second image description information of the floating point data type into the second image description information of the Boolean data type.
The method according to claim 2, wherein the key point information generation layer is configured to generate second key point description information based on the common feature information, comprising:

The key point information generation layer is used to extract the second key point description information from the shared feature information based on one convolution layer and bilinear upsampling, and the number of convolution kernels of the one convolution layer is the same as the second key The amount of point description information is the same.
The method according to claim 6, wherein the key point information generation layer is configured to extract the second key point description information from the shared feature information through a convolution layer and bilinear upsampling, comprising:

The key point information generation layer is used to extract the second key point description information of the floating point data type from the shared feature information through a layer of convolution layer and bilinear upsampling;

The key point information generation layer is used to convert the second key point description information of the floating point data type into the second key point description information of the Boolean data type.
The method according to claim 2, wherein the key point information generation layer is configured to extract the second key point description information from the shared feature information based on one convolution layer and bilinear upsampling, comprising:

Determine the location of key points in the current image;

The key point information generation layer is used to obtain down-sampling information of the shared feature information through a convolution layer;

The key point information generation layer is used for directly up-sampling the information of the corresponding position in the down-sampling information through bilinear up-sampling to obtain the second key point description information.
The method according to claim 2, wherein the method further comprises:

According to the sequence of the multiple key points in the current image, the corresponding multiple second key point description information is combined into a vector.
The method according to claim 1, wherein the method further comprises:

Determine the posture of the movable platform according to the position deviation between the corresponding first position information and the corresponding second position information in the matching result;

Based on the gesture, the movable platform moves from the second position to the first position to achieve automatic return of the movable platform.
The method according to claim 2, wherein the method further comprises:

Through the first training data, the initial feature extraction layer is trained, and the trained feature extraction layer is generated as the feature extraction layer in the feature extraction model; the first training data includes image point pairs corresponding to the same spatial point, the Image point pairs in different corresponding real images represented as the same visual scene;

Through the first training data, the initial key point information generation layer is trained, and the trained key point information generation layer is generated as the key point information generation layer in the feature extraction model.
The method according to claim 11, wherein the method further comprises:

For different visual scenes, obtain real images from different angles under each visual scene;

For each visual scene, build a three-dimensional spatial model according to the real images corresponding to different angles;

Based on the similarity between the spatial points, selecting spatial points from the three-dimensional spatial model, and obtaining a real image point pair corresponding to each selected spatial point in the real image;

According to the similarity between the collection positions of the real image point pairs, the real image point pairs are selected, and the selected real image point pairs are used as key point pairs, thereby obtaining the first training data.
The method according to claim 11, wherein the first training data is used to train an initial key point information generation layer to generate a trained key point information generation layer, comprising:

A loss function of Boolean data type is added to the loss function of floating point data type in the initial key point information generation layer;

The initial key point information generation layer is trained by using the first training data, the loss function of the floating point data type, and the loss function of the Boolean data type, and a trained key point information generation layer is generated.
The method according to claim 2, wherein the method further comprises:

Based on the trained feature extraction layer, the initial image description information generation layer is trained by the second training data, and the trained image description information generation layer is generated as the image description information generation layer in the feature extraction model. The second training data includes key-frame image matching pairs and information indicating whether each key-frame image matching pair belongs to the same visual scene.
The method of claim 14, wherein the method further comprises:

Acquire a real image, determine a real image matching pair from the real image based on the classification model, and use it as a key frame image matching pair, and determine whether each real image matching pair belongs to the same visual scene, thereby obtaining the second training data.
The method of claim 15, wherein the method further comprises:

By displaying the real image matching pairs, in response to the user's determination operation, it is determined whether the real image matching pairs belong to the same visual scene, thereby acquiring second training data.
The method of claim 15, wherein the method further comprises:

Randomly select real image matching pairs from the real images as key frame image matching pairs;

By displaying the randomly selected real image matching pairs, in response to the user's determination operation, it is determined whether the randomly selected real image matching pairs belong to the same visual scene, thereby acquiring the second training data.
The method according to claim 14, wherein, based on the trained feature extraction layer, the initial image description information generation layer is trained by the second training data, and the trained image description information generation layer is generated, include:

adding a loss function of boolean data type to the loss function of floating point data type in the initial image description information generation layer;

Based on the trained feature extraction layer, the initial image description information generation layer is trained through the second training data, the loss function of the floating point data type and the loss function of the Boolean data type, and the trained image description information is generated Generate layers.
The method according to claim 14, wherein after training the feature extraction model, the method further comprises:

Adjust the feature extraction layer, the image description information generation layer and/or the key point information generation layer in the feature extraction model through third training data, where the third training data includes key frame image matching pairs and key frame images Keypoint matching pairs in matching pairs.
The method of claim 12, wherein the method further comprises:

When the number of real image point pairs in the two real images is greater than the threshold, the two real images and the corresponding real image point pairs are used as key frame image matching pairs and key point matching pairs, so as to obtain a third training data.
A positioning device, comprising: a memory, a processor and a vision sensor;

the memory for storing computer programs;

The visual sensor is used to collect historical images and current images;

The processor invokes the computer program to implement the following steps:

obtaining first image description information and first key point description information of a plurality of historical images collected by the visual sensor, and obtaining first position information of the movable platform when collecting the historical images;

acquiring the current image collected by the visual sensor, and acquiring the second image description information and the second key point description information of the current image based on the feature extraction model;

Based on the first image description information and the first key point description information of the historical image, and the second image description information and the second key point description information of the current image, determine a plurality of the matching result between the historical image and the current image;

According to the matching result and the first position information of the historical image, the second position information of the movable platform when the current image is collected is determined.
The device according to claim 21, wherein the feature extraction model comprises a feature extraction layer, an image description information generation layer and a key point information generation layer;

The feature extraction layer is used to extract the common feature information of the current image based on the convolutional network;

The image description information generation layer is configured to generate the second image description information based on the common feature information;

The key point information generation layer is configured to generate second key point description information based on the common feature information.
The device according to claim 22, wherein the feature extraction layer is configured to extract the common feature information based on a plurality of convolutional layers in a convolutional network.
The device according to claim 22, wherein the image information generation layer is configured to extract the second image description information from the common feature information through two convolution layers and a NetVLAD layer.
The device according to claim 24, wherein the image information generation layer is used to extract the second image description information of the floating point data type from the shared feature information through two convolution layers and a NetVLAD layer;

The image information generation layer is used for converting the second image description information of the floating point data type into the second image description information of the Boolean data type.
The device according to claim 22, wherein the key point information generation layer is configured to extract the second key point description information from the shared feature information based on one convolution layer and bilinear upsampling, and the one The number of convolution kernels in the convolutional layer is the same as the number of the second keypoint description information.
The device according to claim 26, wherein the keypoint information generation layer is configured to extract the second keypoint of the floating-point data type from the shared feature information through a convolutional layer and bilinear upsampling Description;

The key point information generation layer is used to convert the second key point description information of the floating point data type into the second key point description information of the Boolean data type.
The device according to claim 22, wherein the processor is further configured to:

Determine the position of key points in the current image;

The key point information generation layer is used to obtain down-sampling information of the shared feature information through a convolution layer;

The key point information generation layer is used for directly up-sampling the information of the corresponding position in the down-sampling information through bilinear up-sampling to obtain the second key point description information.
The device according to claim 22, wherein the processor is further configured to: combine the corresponding multiple second key point description information into one according to the sequence of multiple key points in the current image in vector.
The device according to claim 21, wherein the processor is further configured to:

Determine the posture of the movable platform according to the position deviation between the corresponding first position information and the corresponding second position information in the matching result;

Based on the gesture, the movable platform moves from the second position to the first position to achieve automatic return of the movable platform.
The device according to claim 22, wherein the processor is further configured to: train an initial feature extraction layer by using the first training data, and generate a trained feature extraction layer, which is used as a feature extraction layer in the feature extraction model. Feature extraction layer; the first training data includes image point pairs corresponding to the same spatial point, and the image point pairs are represented in different corresponding real images of the same visual scene;

Through the first training data, the initial key point information generation layer is trained, and the trained key point information generation layer is generated as the key point information generation layer in the feature extraction model.
The device according to claim 31, wherein the processor is further configured to: for different visual scenes, obtain real images from different angles in each visual scene;

For each visual scene, build a three-dimensional spatial model according to the real images corresponding to different angles;

Based on the similarity between the spatial points, selecting spatial points from the three-dimensional spatial model, and obtaining a real image point pair corresponding to each selected spatial point in the real image;

According to the similarity between the collection positions of the real image point pairs, the real image point pairs are selected, and the selected real image point pairs are used as key point pairs, thereby obtaining the first training data.
The device according to claim 31, wherein the processor is specifically configured to: add a loss function of Boolean data type to the loss function of floating point data type in the initial key point information generation layer;

The initial key point information generation layer is trained by using the first training data, the loss function of the floating point data type, and the loss function of the Boolean data type, and a trained key point information generation layer is generated.
The device according to claim 32, wherein the processor is further configured to: based on the trained feature extraction layer, perform training on the initial image description information generation layer by using the second training data, and generate a post-training The image description information generation layer is used as the image description information generation layer in the feature extraction model, wherein the second training data includes key frame image matching pairs and information indicating whether each key frame image matching pair belongs to the same visual scene.
The device according to claim 34, wherein the processor is further configured to: obtain a real image, and determine a real image matching pair from the real image based on a classification model, as a key frame image matching pair, And it is determined whether each real image matching pair belongs to the same visual scene, so as to obtain the second training data.
The device according to claim 35, wherein the processor is further configured to: by displaying the real image matching pairs, in response to a user's determination operation, determine whether the real image matching pairs belong to the same visual scene , so as to obtain the second training data.
The device according to claim 35, wherein the processor is further configured to: randomly select real image matching pairs from the real images as key frame image matching pairs;

By displaying the randomly selected real image matching pairs, in response to the user's determination operation, it is determined whether the randomly selected real image matching pairs belong to the same visual scene, thereby acquiring the second training data.
The device according to claim 34, wherein the processor is specifically configured to: add a loss function of Boolean data type to the loss function of floating point data type in the initial image description information generation layer;

Based on the trained feature extraction layer, the initial image description information generation layer is trained through the second training data, the loss function of the floating point data type and the loss function of the Boolean data type, and the trained image description information is generated Generate layers.
The device according to claim 34, wherein after the feature extraction model is trained, the processor is further configured to: perform third training data on the feature extraction layer in the feature extraction model, The image description information generation layer and/or the key point information generation layer is adjusted, and the third training data includes key frame image matching pairs and key point matching pairs in the key frame image matching pairs.
The device according to claim 32, wherein the processor is further configured to: when the number of real image point pairs in the two real images is greater than a threshold, then the two real images and the The corresponding real image point pairs are used as key frame image matching pairs and key point matching pairs, thereby obtaining third training data.
An unmanned aerial vehicle, characterized by comprising: an airframe and the device according to claims 21-40.
A computer-readable storage medium, characterized in that the storage medium is a computer-readable storage medium, and program instructions are stored in the computer-readable storage medium, and the program instructions are used to implement any one of claims 1-20 The method of positioning described in item.