WO2021035966A1 - Visual positioning method and related apparatus - Google Patents
Visual positioning method and related apparatus
- Publication number
- WO2021035966A1 (PCT/CN2019/117224; CN2019117224W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- target
- feature
- candidate
- camera
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/51—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/587—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/36—Applying a local operator, i.e. means to operate on image points situated in the vicinity of a given point; Non-linear local filtering operations, e.g. median filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/752—Contour matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/7625—Hierarchical techniques, i.e. dividing or merging patterns to obtain a tree-like representation; Dendograms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/86—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B29/00—Maps; Plans; Charts; Diagrams, e.g. route diagram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Definitions
- The present disclosure relates to, but is not limited to, the field of computer vision, and in particular to a visual positioning method and related devices.
- Positioning is very important in people's daily life. Because the Global Positioning System (GPS) relies on satellite signals for positioning, GPS positioning is mostly used outdoors. At present, indoor positioning systems are mainly implemented based on Wi-Fi signals, Bluetooth signals, and Ultra Wide Band (UWB) technology. Positioning based on Wi-Fi signals requires many wireless access points (APs) to be arranged in advance.
- Vision-based positioning technology uses visual information (images or videos) collected by image or video capture devices such as mobile phones for positioning.
- the embodiments of the present disclosure provide a visual positioning method and related devices.
- An embodiment of the present disclosure provides a visual positioning method. The method includes: determining a first candidate image sequence from an image library, where the image library is used to construct an electronic map, the frame images in the first candidate image sequence are arranged in order of their degree of matching with a first image, and the first image is an image collected by a camera; and adjusting the order of the frame images in the first candidate image sequence according to a target window to obtain a second candidate image sequence;
- the target window is a continuous multi-frame image, determined from the image library, that contains a target frame image; the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image was collected; and determining, according to the second candidate image sequence, the target pose of the camera when the first image was collected.
- The embodiments of the present disclosure utilize the continuity of image frames in time sequence to effectively improve the positioning speed for consecutive frames.
- In some embodiments, determining the target pose of the camera when acquiring the first image according to the second candidate image sequence includes: determining a first pose of the camera according to a first image sequence and the first image, where the first image sequence includes consecutive multiple frames of images adjacent to a first reference frame image in the image library, and the first reference frame image is included in the second candidate image sequence; and, if it is determined according to the first pose that the position of the camera is successfully located, determining that the first pose is the target pose.
- In some embodiments, the method further includes: after determining that the position of the camera is not successfully located according to the first pose, determining a second pose of the camera according to a second image sequence and the first image, where the second image sequence includes consecutive multiple frames of images adjacent to a second reference frame image in the image library, and the second reference frame image is the next frame image or the previous frame image of the first reference frame image in the second candidate image sequence; and, if it is determined according to the second pose that the position of the camera is successfully located, determining that the second pose is the target pose.
- In some embodiments, determining the first pose of the camera according to the first image sequence and the first image includes: determining, from the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera. The point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene where the camera collects the first image.
- In some embodiments, adjusting the order of the frame images in the first candidate image sequence according to the target window to obtain the second candidate image sequence includes: when the frame images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjusting the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence; and when the frame images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjusting the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- In some embodiments, determining the first candidate image sequence from the image library includes: determining multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity to the visual word vector corresponding to the first image; performing feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image; and obtaining, from the multiple candidate images, the M images with the most features matching the first image to obtain the first candidate image sequence.
- Any image in the image library corresponds to a visual word vector, and the images in the image library are used to construct an electronic map of the scene to be positioned when the target device collects the first image.
- In some embodiments, determining the multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity to the visual word vector corresponding to the first image includes: determining at least one image in the image library that corresponds to the same visual word as the first image to obtain multiple primary selected images, where any image in the image library corresponds to at least one visual word and the first image corresponds to at least one visual word; and determining, from the multiple primary selected images, the multiple candidate images whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image.
- In some embodiments, determining, from the multiple primary selected images, the multiple candidate images whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image includes: determining the top Q percent of the primary selected images whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image to obtain the multiple candidate images; Q is a real number greater than 0.
- In some embodiments, determining, from the multiple primary selected images, the multiple candidate images whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image includes: using a vocabulary tree to convert the features extracted from the first image into a target word vector, where the vocabulary tree is obtained by clustering features extracted from training images collected from the scene to be positioned, and the visual word vector corresponding to any primary selected image is a visual word vector obtained from the features extracted from that primary selected image using the vocabulary tree; and determining, from the multiple primary selected images, the multiple candidate images whose corresponding visual word vectors have the highest similarity to the target word vector.
- In this implementation, the features extracted from the first image are converted into a target word vector using the vocabulary tree, and multiple candidate images are obtained by calculating the similarity between the target word vector and the visual word vector corresponding to each primary selected image, so that candidate images can be filtered out quickly and accurately.
- In some embodiments, each leaf node in the vocabulary tree corresponds to a visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes. Converting the features extracted from the first image into the target word vector using the vocabulary tree includes: using the vocabulary tree to classify the features extracted from the first image to obtain intermediate features that are classified into a target leaf node, where the target leaf node is any leaf node in the vocabulary tree and corresponds to a target visual word; and calculating, according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight corresponding to the target visual word in the first image. The target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree is generated. In this way, the target word vector can be quickly calculated.
- In some embodiments, the intermediate features include at least one sub-feature; the target weight is the sum of the weight parameters corresponding to the sub-features included in the intermediate features; the weight parameter corresponding to a sub-feature is negatively related to the feature distance, and the feature distance is the Hamming distance between the sub-feature and the corresponding cluster center.
- In some embodiments, performing feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image includes: classifying a third feature extracted from the first image into a leaf node according to the vocabulary tree, where the vocabulary tree is obtained by clustering features extracted from images collected from the scene to be positioned, the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features; performing feature matching between the third feature and a fourth feature, where the fourth feature is a feature extracted from a target candidate image that falls on the same leaf node, and the target candidate image is any image included in the first candidate image sequence; and thereby obtaining the number of features of the target candidate image that match the first image.
- In some embodiments, the method further includes: determining the three-dimensional position of the camera according to a conversion matrix and the target pose; the conversion matrix is obtained by transforming the angle and position of the point cloud map so that the contour of the point cloud map is aligned with the indoor floor plan.
- In some embodiments, determining that the first pose successfully locates the position of the camera includes: determining that the positional relationship of L pairs of feature points conforms to the first pose, where one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
- In some embodiments, before determining the first pose of the camera according to the first image sequence and the first image, the method further includes: constructing the point cloud map according to multiple image sequences, where any one of the multiple image sequences is used to construct a sub-point cloud map of one or more regions, and the point cloud map includes the first electronic map and the second electronic map.
- In this implementation, the scene to be positioned is divided into multiple regions, and a sub-point cloud map is constructed for each region.
- In some embodiments, before using the vocabulary tree to convert the features extracted from the first image into the target word vector, the method further includes: clustering features extracted from training images collected from the scene to be positioned to obtain the vocabulary tree.
- In some embodiments, the visual positioning method is applied to a server. Before the first candidate image sequence is determined from the image library, the method further includes: receiving the first image from a target device, where the target device is equipped with the camera.
- The server performs positioning based on the first image from the target device, which makes full use of the server's advantages in processing speed and storage space, providing high positioning accuracy and fast positioning speed.
- the method further includes: sending location information of the camera to the target device.
- The server sends the location information of the target device to the target device, so that the target device can display the location information and the user can accurately know his or her location.
- the visual positioning method is applied to an electronic device equipped with the camera.
- The embodiments of the present disclosure provide another visual positioning method, which may include: collecting a target image through a camera; sending target information to a server, where the target information includes the target image or the feature sequence extracted from the target image, and the internal parameters of the camera; and receiving location information, where the location information is used to indicate the location and direction of the camera.
- The location information is information, determined by the server according to a second candidate image sequence, about the location of the camera when it collects the target image;
- the second candidate image sequence is obtained by the server adjusting the order of the frame images in the first candidate image sequence according to a target window, and the target window is a continuous multi-frame image, determined from an image library, that contains the target frame image;
- the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, and the second image is collected by the camera before the first image is collected;
- the frame images in the first candidate image sequence are arranged in order of their degree of matching with the first image.
- An electronic map is displayed, and the electronic map contains the location and direction of the camera.
- An embodiment of the present disclosure provides a visual positioning device, which includes:
- a screening unit, configured to determine a first candidate image sequence from an image library, where the image library is used to construct an electronic map and the frame images in the first candidate image sequence are arranged in order of their degree of matching with a first image,
- the first image being an image collected by a camera;
- the screening unit is further configured to adjust the order of each frame image in the first candidate image sequence according to a target window to obtain a second candidate image sequence;
- the target window is a continuous multi-frame image determined from the image library that contains a target frame image;
- the target frame image is an image that matches a second image in the image library, and the second image is an image collected by the camera before the first image is collected;
- the determining unit is configured to determine the target pose of the camera when acquiring the first image according to the second candidate image sequence.
- An embodiment of the present disclosure provides a terminal device, which includes:
- a camera, configured to collect a target image;
- a sending unit configured to send target information to a server, the target information including the target image or the feature sequence extracted from the target image, and the internal parameters of the camera;
- a receiving unit, configured to receive position information, where the position information is used to indicate the position and direction of the camera; the position information is the position, determined by the server according to a second candidate image sequence, of the camera when it acquires the target image;
- the second candidate image sequence is obtained by the server adjusting the order of the frame images in the first candidate image sequence according to a target window, and the target window is a continuous multi-frame image determined from the image library that contains the target frame image;
- the image library is used to construct an electronic map, and the target frame image is an image in the image library that matches a second image;
- the second image is an image collected by the camera before the first image is collected, and the frame images in the first candidate image sequence are arranged in order of their degree of matching with the first image;
- the display unit is configured to display an electronic map, and the electronic map contains the position and direction of the camera.
- an embodiment of the present disclosure provides an electronic device, the electronic device includes: a memory, configured to store a program; a processor, configured to execute the program stored in the memory, and when the program is executed, The processor is configured to execute the method of any one of the foregoing first aspect to the foregoing second aspect and any implementation manner.
- An embodiment of the present disclosure provides a visual positioning system, including a server and a terminal device; the server executes the method of the first aspect and any one of its implementation manners, and the terminal device is configured to execute the method of the second aspect described above.
- An embodiment of the present disclosure provides a computer-readable storage medium that stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of any one of the above-mentioned first aspect to second aspect and any implementation manner thereof.
- Embodiments of the present disclosure provide a computer program product, wherein the computer program product includes program instructions; when the program instructions are executed by a processor, the processor executes the visual positioning method of any of the foregoing embodiments.
- FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the disclosure
- Figure 2 is a visual positioning method provided by an embodiment of the present disclosure
- FIG. 3 is another visual positioning method provided by an embodiment of the disclosure.
- FIG. 4 is another visual positioning method provided by an embodiment of the present disclosure.
- FIG. 5 is a positioning and navigation method provided by an embodiment of the disclosure.
- FIG. 6 is a method for constructing a point cloud map provided by an embodiment of the present disclosure.
- FIG. 7 is a schematic structural diagram of a visual positioning device provided by an embodiment of the disclosure.
- FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the disclosure.
- FIG. 9 is a schematic structural diagram of another terminal provided by an embodiment of the disclosure.
- FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the disclosure.
- the positioning method based on non-visual information usually needs to arrange devices in the scene to be positioned in advance, and the positioning accuracy is not high.
- the positioning method based on visual information is the main direction of current research.
- the visual positioning method provided by the embodiments of the present disclosure can be applied to scenarios such as location recognition and positioning navigation.
- the application of the visual positioning method provided by the embodiments of the present disclosure in the location recognition scene and the positioning navigation scene will be briefly introduced below.
- Location recognition scenario: for example, in a large shopping mall, the area of the shopping mall (that is, the scene to be positioned) can be divided into regions, and structure from motion (SFM) technology can be used to construct a point cloud map of the shopping mall for each region.
- the user can start the target application on the mobile phone.
- The mobile phone uses the camera to collect surrounding images, displays an electronic map on the screen, and marks the user's current location and direction on the electronic map.
- the target application is an application specially developed to achieve accurate indoor positioning.
- Positioning and navigation scenario: for example, in a large shopping mall, the area of the shopping mall can be divided into regions, and SFM and other technologies can be used to build a point cloud map of the shopping mall for each region.
- The user starts the target application on the mobile phone and enters the destination address to be reached; the user raises the mobile phone to collect the images in front of it, and the mobile phone displays the collected images in real time together with a mark, such as an arrow, guiding the user to the destination address.
- The target application is an application specially developed to achieve accurate indoor positioning. Since the computing capability of the mobile phone is limited, the computation can be placed in the cloud, that is, the cloud performs the positioning operation. Since shopping malls often change, the point cloud map can be rebuilt only for the changed region instead of rebuilding the map of the entire mall.
- The feature points of an image can be simply understood as the more prominent points in the image, such as contour points, bright spots in darker areas, and dark spots in brighter areas. This definition is based on the gray values of the image around the feature point: the pixel values in a circle around a candidate feature point are examined, and if enough pixels in the area around the candidate point have gray values that differ sufficiently from that of the candidate point, the candidate point is considered a feature point. After the feature points are obtained, the attributes of these feature points need to be described in some way; the output of these attributes is called the feature point descriptor (Feature Descriptors).
- ORB algorithm is a fast feature point extraction and description algorithm. The ORB algorithm uses the FAST (Features from Accelerated Segment Test) algorithm to detect feature points.
- the FAST algorithm is an algorithm for corner detection.
- the principle of the algorithm is to take a detection point in an image, and use the point as the center of the circle to determine whether the detection point is a corner point.
- the ORB algorithm uses the BRIEF algorithm to calculate the descriptor of a feature point.
- the core idea of the BRIEF algorithm is to select N point pairs in a certain pattern around the key point P, and combine the comparison results of these N point pairs as a descriptor.
- The biggest feature of the ORB algorithm is its fast calculation speed. This firstly benefits from the use of FAST to detect feature points; FAST's detection speed is as famous as its name.
- Secondly, it uses the BRIEF algorithm to calculate the descriptor.
- the unique binary string representation of the descriptor not only saves storage space, but also greatly shortens the matching time.
- For example, the descriptors of feature points A and B are as follows: A: 10101011; B: 10101010.
- We set a threshold, for example 80%.
- If the similarity between the descriptors of A and B is greater than 80%, we judge that A and B are the same feature point, that is, the two points are matched successfully. In this example, only the last bit of A and B differs, so the similarity is 87.5%, which is greater than 80%; therefore A and B are matched.
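- To make the descriptor matching above concrete, the following sketch extracts ORB features with OpenCV and compares two binary descriptors by the fraction of identical bits; the file names, the 0.8 threshold, and the helper function are illustrative assumptions rather than the exact procedure of the disclosure.

```python
import cv2
import numpy as np

def hamming_similarity(d1: np.ndarray, d2: np.ndarray) -> float:
    """Fraction of identical bits between two binary descriptors stored as uint8 arrays."""
    total_bits = d1.size * 8
    differing_bits = int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
    return 1.0 - differing_bits / total_bits

# Extract ORB features from two frames (the file names are placeholders).
orb = cv2.ORB_create(nfeatures=3000)
img_a = cv2.imread("frame_a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("frame_b.jpg", cv2.IMREAD_GRAYSCALE)
kps_a, desc_a = orb.detectAndCompute(img_a, None)
kps_b, desc_b = orb.detectAndCompute(img_b, None)

# Judge two descriptors as the same feature point when the bit similarity exceeds 80%.
THRESHOLD = 0.8
is_match = hamming_similarity(desc_a[0], desc_b[0]) > THRESHOLD
```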
- The Structure From Motion (SFM) algorithm is an offline algorithm for 3D reconstruction based on a collection of unordered pictures. Before proceeding to the core Structure From Motion algorithm, some preparation is needed to select suitable pictures: first, the focal length information is extracted from each picture; then feature extraction algorithms such as SIFT (Scale-Invariant Feature Transform) are used to extract image features, and a kd-tree model is used to calculate the Euclidean distance between the feature points of two pictures for feature point matching, so as to find image pairs whose number of matching feature points meets the requirements.
- A kd-tree is developed from the BST (Binary Search Tree) and is a high-dimensional index tree data structure, mainly used for nearest neighbor search and approximate nearest neighbor search.
- For each image matching pair, the epipolar geometry is calculated, the fundamental matrix (i.e., F matrix) is estimated, and the matching pairs are optimized and refined through the RANSAC algorithm. If feature points can be chained across such matching pairs and are detected continuously, a trajectory can be formed.
- the key first step is to select a good image pair to initialize the entire Bundle Adjustment (BA) process.
- Random sample consensus (RANSAC) uses an iterative method to estimate the parameters of a mathematical model from a set of observed data containing outliers.
- the basic assumption of the RANSAC algorithm is that the sample contains correct data (inliers, data that can be described by the model) and abnormal data (outliers, data that deviates far from the normal range and cannot adapt to the mathematical model), that is, the data set contains noise. These abnormal data may be caused by wrong measurements, wrong assumptions, wrong calculations, etc.
- the input of the RANSAC algorithm is a set of observation data, a parameterized model that can be explained or adapted to the observation data, and some credible parameters. RANSAC achieves its goal by repeatedly selecting a set of random subsets of the data.
- The selected subsets are assumed to be interior points, and the following procedure is used for verification: 1. A model is fitted to the assumed interior points, that is, all unknown parameters are calculated from the assumed interior points. 2. The model obtained in step 1 is used to test all other data; if a certain point fits the estimated model, it is also considered an interior point. 3. If enough points are classified as hypothetical interior points, the estimated model is reasonable enough. 4. Then, all the assumed interior points are used to re-estimate the model, because it has only been estimated from the initial assumed interior points. 5. Finally, the model is evaluated by estimating the error rate of the interior points with respect to the model. This process is repeated a fixed number of times, and the model generated each time is either discarded because there are too few interior points, or selected because it is better than the existing model.
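- The hypothesize-and-verify loop described above can be sketched generically as follows; the callbacks `fit_model` and `point_error`, the threshold, and the iteration count are placeholders for illustration, not parameters fixed by the disclosure.

```python
import random

def ransac(data, fit_model, point_error, threshold, min_samples, iterations=100):
    """Generic RANSAC loop: repeatedly fit a model to a random minimal subset,
    count the points that agree with it, and keep the model with the most inliers."""
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = random.sample(data, min_samples)        # hypothetical interior points
        model = fit_model(sample)                        # step 1: fit model to the sample
        inliers = [p for p in data if point_error(model, p) < threshold]  # step 2: test all data
        if len(inliers) > len(best_inliers):             # steps 3 and 5: keep the better model
            best_model = fit_model(inliers)              # step 4: re-estimate from all inliers
            best_inliers = inliers
    return best_model, best_inliers
```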
- Vocabulary tree is an efficient data structure for retrieving images based on visual words (also called visual words).
- a tree structure allows keyword queries in sub-linear time instead of scanning all keywords to find matching images, which can greatly increase the retrieval speed.
- The following describes the steps of building a vocabulary tree: 1. Extract the ORB features of all training images; about 3000 features are extracted from each training image, and the training images are collected from the scene to be positioned. 2. Use K-means to cluster all the extracted features into K categories, then cluster each category into K categories in the same way, and so on until the L-th layer, retaining each cluster center in each layer and finally generating the vocabulary tree. Both K and L are integers greater than 1; for example, K is 10 and L is 6.
- FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the disclosure. As shown in Figure 1, the vocabulary tree includes a total of (L+1) layers, the first layer includes a root node, and the last layer includes multiple leaf nodes.
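- A minimal sketch of this hierarchical K-means construction is shown below. It treats descriptors as float vectors and uses scikit-learn's KMeans for clarity; binary ORB descriptors would normally be clustered with a Hamming-distance variant, so this is an illustrative simplification rather than the exact build procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(features: np.ndarray, k: int = 10, depth: int = 6):
    """Recursively cluster features into k branches per level for `depth` levels,
    keeping each cluster centre; the leaves of the tree act as visual words."""
    node = {"children": [], "center": None}
    if depth == 0 or len(features) <= k:
        return node                                      # leaf node: one visual word
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    for label in range(k):
        subset = features[km.labels_ == label]
        child = build_vocabulary_tree(subset, k, depth - 1)
        child["center"] = km.cluster_centers_[label]     # keep each cluster centre
        node["children"].append(child)
    return node

# e.g. tree = build_vocabulary_tree(all_orb_features.astype(np.float32), k=10, depth=6)
```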
- FIG. 2 is a visual positioning method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method may include:
- The visual positioning device determines a first candidate image sequence from an image library.
- the visual positioning device can be a server, or a mobile terminal that can collect images, such as a mobile phone or a tablet computer.
- This image library is used to construct electronic maps.
- the first candidate image sequence includes M images, and each frame image in the first candidate image sequence is arranged in the order of matching degree with the first image.
- the first image is an image collected by the camera of the target device, and M is an integer greater than 1. For example, M is 5, 6, or 8, etc.
- the target device can be a device that can collect images and/or videos, such as a mobile phone or a tablet.
- In this way, multiple candidate images are first selected by calculating the similarity of visual word vectors, and then the M images with the largest number of feature matches with the first image are obtained from these candidate images, so the image retrieval efficiency is high.
- In some embodiments, the first frame image in the first candidate image sequence has the largest number of feature matches with the first image, and the last frame image in the first candidate image sequence has the smallest.
- In other embodiments, the first frame image in the first candidate image sequence has the smallest number of feature matches with the first image, and the last frame image in the first candidate image sequence has the largest.
- In some embodiments, the visual positioning device is a server,
- and the first image is an image received from a mobile terminal such as a mobile phone;
- the first image may be an image collected by the mobile terminal in the scene to be positioned.
- In other embodiments, the visual positioning device is a mobile terminal capable of collecting images, such as a mobile phone or a tablet computer, and the first image is an image collected by the visual positioning device in the scene to be positioned.
- The target window is a continuous multi-frame image, determined from the image library, that contains a target frame image; the target frame image is an image in the image library that matches the second image, and the second image is an image captured by the camera before the first image.
- When the frame images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, the images located in the target window in the first candidate image sequence are adjusted to the last positions of the first candidate image sequence; when the frame images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, the images located in the target window in the first candidate image sequence are adjusted to the foremost positions of the first candidate image sequence.
- the visual positioning device may store or be associated with an image library, and the images in the image library are used to construct a point cloud map of the scene to be positioned.
- the image library includes one or more image sequences, and each image sequence includes consecutive multiple frames of images obtained by collecting an area of the scene to be located, and each image sequence can be used to construct a sub-point cloud map, That is, a point cloud map of an area. These sub-point cloud maps constitute the point cloud map.
- the images in the image library may be continuous.
- the scene to be positioned can be divided into regions, and a multi-angle image sequence is collected for each region, and each region requires at least two image sequences in the front and back directions.
- the target window may be an image sequence including the target frame image, or may be a part of the image sequence including the target frame image.
- the target window includes 61 frames of images, that is, the target frame image and 30 frames of images before and after the target frame image.
- The size of the target window is not limited. Assuming that the images in the first candidate image sequence are image 1, image 2, image 3, image 4, and image 5 in sequence, where image 3 and image 5 are images located in the target window, the images in the second candidate image sequence are, in order, image 3, image 5, image 1, image 2, and image 4. It can be understood that the method flow in FIG. 2 implements continuous frame positioning, while the visual positioning device can perform step 201, step 203, step 204, and step 205 to achieve single frame positioning.
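- The reordering in the example above (images 3 and 5 moved to the front when the sequence is sorted from best match to worst) can be sketched as follows; the function and parameter names are illustrative only.

```python
def reorder_by_target_window(candidates, window_ids, descending=True):
    """Move the candidate frames that fall inside the target window to the front
    (when candidates are sorted from best match to worst) or to the back
    (when sorted from worst to best), preserving relative order otherwise."""
    in_window = [c for c in candidates if c in window_ids]
    outside = [c for c in candidates if c not in window_ids]
    return in_window + outside if descending else outside + in_window

# Example from the text: images 1..5, with images 3 and 5 inside the target window.
second_sequence = reorder_by_target_window([1, 2, 3, 4, 5], {3, 5}, descending=True)
# -> [3, 5, 1, 2, 4]
```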
- the target pose here may include at least the position of the camera when the first image is captured; in other embodiments, the target pose may include: the position and pose of the camera when the first image is captured.
- the pose of the camera includes but is not limited to the orientation of the camera.
- the implementation of determining the target pose of the camera when acquiring the first image according to the second candidate image sequence is as follows: according to the first image sequence and the first image, determine the first image of the camera Pose; the first image sequence includes consecutive multiple frames of images adjacent to the first reference frame image in the image library, and the first reference frame image is included in the second candidate image sequence.
- If it is determined according to the first pose that the position of the camera is successfully located, the first pose is the target pose.
- Otherwise, the second pose of the camera is determined according to the second image sequence and the first image.
- the second image sequence includes consecutive multiple frames of images adjacent to the second reference frame image in the image library, and the second reference frame image is the next frame image of the first reference frame image in the second candidate image sequence Or the previous image.
- the first image sequence includes the first K1 frame image of the first reference frame image, the first reference frame image, and the last K1 frame image of the first reference frame image; K1 is an integer greater than 1, For example, K1 is 10.
- Determining the first pose of the camera according to the first image sequence and the first image may be: determining, from the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera.
- The point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene where the camera, i.e., the target device, is located when the first image is collected.
- the visual positioning device may use the PnP algorithm to determine the first pose of the camera according to the F features, the corresponding spatial coordinate points of the F features in the point cloud map, and the camera's internal parameters.
- Each of the F features corresponds to a feature point in the image. That is, each feature corresponds to a 2D reference point (that is, the two-dimensional coordinates of the feature point in the image).
- the space coordinate point corresponding to each 2D reference point can be determined, so that the one-to-one correspondence between the 2D reference point and the space coordinate point can be known.
- each 2D reference point matches a space coordinate point, so that the space coordinate point corresponding to each feature can be known.
- the visual positioning device may also use other methods to determine the spatial coordinate points corresponding to each feature in the point cloud map, which is not limited in the present disclosure.
- the spatial coordinate points corresponding to the F features in the point cloud map are 3D reference points (ie, spatial coordinate points) in the F world coordinate systems.
- Multi-point perspective imaging (Perspective-n-Point, PnP) is a method for solving 3D-to-2D point-pair motion: that is, how to solve for the pose of the camera when F 3D space points and their projections are given.
- The inputs of the PnP problem are: the coordinates of F 3D reference points in the world coordinate system, where F is an integer greater than 0; the coordinates of the 2D reference points obtained by projecting these F 3D points onto the image; and the internal parameters of the camera.
- Solving the PnP problem yields the pose of the camera.
- Typical methods for solving the PnP problem include P3P, direct linear transformation (DLT), EPnP (Efficient PnP), UPnP, and nonlinear optimization methods. Therefore, the visual positioning device can adopt any of these methods to solve the PnP problem and determine the second pose of the camera according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera.
- The RANSAC algorithm can be used for iteration here, and the number of interior points can be counted in each iteration.
- R is the rotation matrix and t is the translation vector; these are the two sets of parameters included in the pose of the camera.
- Here the camera is equivalent to a camera or any other image or video capture device.
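- As a rough illustration of the PnP solving step, the following sketch uses OpenCV's RANSAC-based PnP solver with EPnP; the function name and input layout are assumptions about how the matched 2D-3D correspondences might be fed in, not the exact implementation of the disclosure.

```python
import cv2
import numpy as np

def estimate_pose_pnp(object_points, image_points, camera_matrix, dist_coeffs):
    """Solve the camera pose from F 3D points (from the point cloud map) and their
    matched 2D feature points in the first image, using RANSAC with EPnP."""
    ok, rvec, tvec, inlier_idx = cv2.solvePnPRansac(
        np.asarray(object_points, dtype=np.float64),   # F x 3 spatial coordinates
        np.asarray(image_points, dtype=np.float64),    # F x 2 pixel coordinates
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP)                       # EPnP, one typical PnP solver
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                         # rotation matrix R
    t = tvec.reshape(3)                                # translation vector t
    inliers = 0 if inlier_idx is None else len(inlier_idx)  # interior-point count
    return R, t, inliers
```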
- The embodiment of the present disclosure provides a continuous frame positioning method, which uses the image that located the pose of the camera for a frame preceding the first image to adjust the ordering of the images in the first candidate image sequence. By taking advantage of the continuity of images in time sequence, the image most likely to match the first image is ranked at the front of the first candidate image sequence, so that an image that matches the first image can be found more quickly.
- the visual positioning device may also perform the following operations to determine the three-dimensional position of the camera: determine the three-dimensional position of the camera according to the conversion matrix and the target pose of the camera.
- the conversion matrix is obtained by transforming the angle and position of the point cloud map, and aligning the outline of the point cloud map with the indoor floor plan.
- The rotation matrix R and the translation vector t are combined into a 4*4 matrix T'.
- The conversion matrix T_i is multiplied by the matrix T' to obtain a new matrix T*; writing T* in the form [R* | t*], the translation part t* is the final three-dimensional position of the camera.
- the three-dimensional position of the camera can be accurately determined, which is simple to implement.
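- A minimal sketch of this alignment step is given below, assuming the pose (R, t) is already expressed in the point cloud map's coordinate frame; whether the camera pose must first be inverted depends on the convention used, which the text does not spell out, so the helper is illustrative.

```python
import numpy as np

def camera_position_on_floor_plan(R, t, T_i):
    """Combine the pose (R, t) into a 4x4 matrix T', left-multiply it by the
    conversion matrix T_i that aligns the point cloud map with the indoor floor
    plan, and read the translation part t* as the camera's final 3D position."""
    T_prime = np.eye(4)
    T_prime[:3, :3] = R
    T_prime[:3, 3] = t
    T_star = T_i @ T_prime
    return T_star[:3, 3]        # t*: final three-dimensional position of the camera
```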
- The embodiment of the present disclosure provides a continuous frame positioning method, which uses the image that located the pose of the camera for a frame preceding the first image to adjust the order of the images in the first candidate image sequence. This makes full use of the continuity of images in time sequence:
- the image most likely to match the first image is ranked at the top of the first candidate image sequence, so that an image that matches the first image can be found more quickly, and positioning is therefore faster.
- The situation where the position of the camera is successfully located according to the first pose may be: it is determined that the positional relationship of L pairs of feature points conforms to the first pose, where one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
- For example, the RANSAC algorithm is used to iteratively solve the PnP problem according to the first pose, and the number of interior points is counted in each iteration.
- If the visual positioning device fails to locate the position of the camera using a certain frame image in the second candidate image sequence, it uses the next frame image in the second candidate image sequence for positioning.
- The embodiment of the present disclosure provides a continuous frame positioning method: after the position of the camera is successfully located using the first image, the next frame image collected by the camera after the first image is used for positioning.
- The visual positioning device can use each frame image in turn, according to the order of the frames in the second candidate image sequence, to locate the position of the camera until the position of the camera is located. If the position of the camera cannot be successfully located using any frame of the second candidate image sequence, a positioning failure is returned. For example, the visual positioning device first uses the first frame image in the second candidate image sequence for positioning; if the positioning is successful, it stops this positioning; if the positioning is unsuccessful, it uses the second frame image in the second candidate image sequence for positioning, and so on. The method of locating the target pose of the camera from the first image and the image sequence used in each attempt may be the same.
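- This try-frames-in-order logic can be sketched as follows; `neighbouring_frames`, `estimate_pose`, and `pose_is_verified` are hypothetical callbacks standing in for the reference-frame lookup, the feature-matching-plus-PnP step, and the interior-point verification described above.

```python
def locate_with_candidates(second_candidate_sequence, first_image,
                           neighbouring_frames, estimate_pose, pose_is_verified):
    """Try the reference frames of the second candidate sequence in order; for each,
    localize the first image against the frames adjacent to it in the image library
    and stop at the first pose that passes verification."""
    for reference_frame in second_candidate_sequence:
        image_sequence = neighbouring_frames(reference_frame)  # e.g. the +/-K1 adjacent frames
        pose = estimate_pose(image_sequence, first_image)      # e.g. feature matching + PnP
        if pose is not None and pose_is_verified(pose):        # enough consistent feature pairs
            return pose
    return None  # positioning failure: no frame in the sequence located the camera
```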
- The following describes how to determine the first candidate image sequence from the image library, that is, an implementation of step 201.
- In some embodiments, the method of determining the first candidate image sequence from the image library may be as follows: use a vocabulary tree to convert the features extracted from the first image into a target word vector; calculate the similarity score between the target word vector and the word vector corresponding to each image in the image library; take the first 10 frames with the highest similarity score to the first image from each image sequence included in the image library to obtain a primary image sequence; after sorting the images in the primary image sequence in descending order of similarity score, take the top 20% of the images as the selected image sequence (if this is fewer than 10 frames, directly take the first 10 frames); perform feature matching between each frame image in the selected image sequence and the first image; and, after sorting by the number of features of each frame image matching the first image, select the first M images to obtain the first candidate image sequence.
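- A rough sketch of this coarse-to-fine retrieval is given below; `to_word_vector`, `similarity`, `count_feature_matches`, and the image objects' `word_vector` and `features` attributes are hypothetical names used only to mirror the steps just listed.

```python
def first_candidate_sequence(first_image_features, image_sequences, to_word_vector,
                             similarity, count_feature_matches, M=5):
    """Coarse-to-fine retrieval: rank by word-vector similarity first, then by the
    number of feature matches with the first image."""
    target_vec = to_word_vector(first_image_features)          # target word vector

    # From each image sequence, keep the 10 frames whose word vectors score highest.
    primary = []
    for sequence in image_sequences:
        ranked = sorted(sequence, key=lambda img: similarity(target_vec, img.word_vector),
                        reverse=True)
        primary.extend(ranked[:10])

    # Keep the top 20% of the primary images by score, but at least 10 frames.
    primary.sort(key=lambda img: similarity(target_vec, img.word_vector), reverse=True)
    selected = primary[:max(10, len(primary) // 5)]

    # Re-rank the selected frames by the number of features matching the first image.
    selected.sort(key=lambda img: count_feature_matches(first_image_features, img.features),
                  reverse=True)
    return selected[:M]
```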
- In other embodiments, the method of determining the first candidate image sequence from the image library may be as follows: determine the multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity (i.e., similarity score) to the visual word vector corresponding to the first image; perform feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image; and obtain, from the multiple candidate images, the M images with the most features matching the first image to obtain the first candidate image sequence.
- M is 5.
- Any image in the image library corresponds to a visual word vector, and the images in the image library are used to construct an electronic map of the scene to be located when the target device collects the first image.
- Determining the multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity to the visual word vector corresponding to the first image may be: determining at least one image in the image library that corresponds to the same visual word as the first image to obtain multiple primary selected images; and determining the top Q percent of the primary selected images whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image to obtain the multiple candidate images. Q is a real number greater than 0; for example, Q is 10, 15, 20, or 30. Any image in the image library corresponds to at least one visual word, and the first image corresponds to at least one visual word.
- In some embodiments, the visual positioning device obtains the multiple candidate images in the following manner: use a vocabulary tree to convert the features extracted from the first image into a target word vector; respectively calculate the similarity between the target word vector and the visual word vector corresponding to each primary selected image; and determine the top Q percent of the primary selected images whose visual word vectors have the highest similarity to the target word vector to obtain the multiple candidate images.
- the vocabulary tree is obtained by clustering the features extracted from the training images collected from the scene to be located.
- the visual word vector corresponding to any one of the plurality of primary images is a visual word vector obtained from the feature extracted from any one of the primary images using the vocabulary tree.
- Performing feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image may be: classifying, according to the vocabulary tree, a third feature extracted from the first image to a reference leaf node; and performing feature matching between the third feature and a fourth feature to obtain the features that match the third feature.
- the vocabulary tree is obtained by clustering the features extracted from the images collected from the scene to be located; the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features.
- the fourth feature is included in the reference leaf node and is a feature extracted from a target candidate image, and the target candidate image is included in the first candidate image sequence.
- the visual positioning device may pre-store the image index and feature index corresponding to each visual word (ie, leaf node).
- A corresponding image index and feature index are added to each visual word, and these indexes are used to accelerate feature matching. For example, if 100 images in the image library correspond to a certain visual word, the indexes of these 100 images (i.e., the image index) and the indexes of the features of these 100 images that fall on the leaf node corresponding to the visual word (i.e., the feature index) are added to that visual word.
- the reference feature extracted from the first image falls on the reference node.
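- One possible layout for such an index is sketched below; the class and method names are illustrative, and the point is simply that each visual word records which library images and which of their features fell on its leaf node, so that feature matching only compares features sharing a word instead of all features of an image.

```python
from collections import defaultdict

class InvertedIndex:
    """For each visual word (leaf node), store which library images contain it and
    which of their features fell on that leaf, to accelerate retrieval and matching."""
    def __init__(self):
        self.images = defaultdict(set)     # visual word id -> image ids (image index)
        self.features = defaultdict(list)  # visual word id -> (image id, feature id) pairs

    def add(self, word_id, image_id, feature_id):
        self.images[word_id].add(image_id)
        self.features[word_id].append((image_id, feature_id))

    def candidate_features(self, word_id, image_id):
        """Feature ids of one candidate image that share this visual word with a
        query feature; only these need to be compared with the query feature."""
        return [f for (img, f) in self.features[word_id] if img == image_id]
```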
- the following describes how to use the vocabulary tree to convert the features extracted from the first image into the target word vector.
- Using the vocabulary tree to convert the features extracted from the first image into the target word vector includes: calculating, based on the features extracted from the first image, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight corresponding to the target visual word in the first image; the target word vector includes the weight, in the first image, of each visual word corresponding to the vocabulary tree; and the target weight is positively related to the weight of the target visual word.
- In this way, the word vector is calculated by a residual weighting method. Taking into account the differences among the features within the same visual word increases the discrimination, and the method can easily be integrated into the TF-IDF (term frequency-inverse document frequency) framework, so that the speed of image retrieval and feature matching can be improved.
- The following formula (1), shown as an image in the original publication, is used to convert the features extracted from the first image into the target word vector using the vocabulary tree, where:
- W_i^weight is the weight of the i-th visual word itself;
- Dis(f_i, c_i) is the Hamming distance from the feature f_i to the cluster center c_i of the i-th visual word;
- n represents the number of features extracted from the first image that fall on the node corresponding to the i-th visual word;
- W_i represents the weight of the i-th visual word in the first image.
- a leaf node in the vocabulary tree corresponds to a visual word
- the target word vector includes the weight of each visual word corresponding to the vocabulary tree in the first image.
- a node of the vocabulary tree corresponds to a cluster center.
- the vocabulary tree includes 1000 leaf nodes, and each leaf node corresponds to a visual word.
- the visual positioning device needs to calculate the weight of each visual word in the first image to obtain the target word vector of the first image.
- the visual positioning device may calculate the weight of the visual word corresponding to each leaf node in the vocabulary tree in the first image; combine the weight of the visual word corresponding to each leaf node in the first image to synthesize A vector to get the target word vector.
- the word vector corresponding to each image in the image library can be calculated in the same manner to obtain the visual word vector corresponding to each primary selected image. Both i and n are integers greater than 1.
- the feature f i is any feature extracted from the first image, and any feature corresponds to a binary string, that is, f i is a binary string.
- the center of each visual word corresponds to a binary string.
- c i is a binary string.
- the Hamming distance from the feature f_i to the cluster center c_i of the i-th visual word may be calculated.
- the Hamming distance indicates the number of positions at which the corresponding bits of two words (of the same length) differ. In other words, it is the number of characters that need to be replaced to transform one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2.
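- As an illustration only (not part of the original disclosure), the following Python sketch shows how the Hamming distance between two binary ORB descriptors can be computed by counting differing bits; the 8-bit example strings are the ones used above, and packing the bits into bytes is an implementation assumption.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two binary descriptors.

    a, b: uint8 arrays of equal length (e.g. 32 bytes for a 256-bit ORB descriptor).
    """
    # XOR leaves a 1 exactly at the bit positions where a and b differ;
    # unpackbits expands each byte into individual bits so they can be counted.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Example from the text: 1011101 and 1001001 differ in 2 positions.
x = np.packbits(np.array([1, 0, 1, 1, 1, 0, 1, 0], dtype=np.uint8))
y = np.packbits(np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=np.uint8))
assert hamming_distance(x, y) == 2
```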
- the weight of each visual word in the vocabulary tree is negatively related to the number of features included in its corresponding node.
- an index of the corresponding image is added to the i-th visual word, and the index is used to speed up image retrieval.
- calculating, based on the features extracted from the first image, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image includes: using the vocabulary tree to classify the features extracted from the first image to obtain intermediate features classified into the target leaf node; and calculating, according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image.
- the target leaf node corresponds to the target visual word. It can be seen from formula (1) that the target weight is the sum of the weight parameters corresponding to the features included in the intermediate features, where the weight parameter corresponding to a feature f_i is negatively related to the Hamming distance between f_i and the corresponding cluster center.
- the intermediate features may include a first feature and a second feature; the Hamming distance between the first feature and the cluster center is a first distance, and the Hamming distance between the second feature and the cluster center is a second distance; if the first distance and the second distance are different, the first weight parameter corresponding to the first feature is different from the second weight parameter corresponding to the second feature.
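- Formula (1) itself is not reproduced in the text above, so the sketch below only illustrates one plausible residual-weighting scheme that is consistent with the description: every feature classified into a leaf contributes a weight parameter that is scaled by the visual word's own weight W_i^weight and decreases as the Hamming distance to the cluster centre grows. The 256-bit descriptor length and the linear decay are assumptions.

```python
import numpy as np

DESCRIPTOR_BITS = 256  # assumed ORB descriptor length

def hamming(a, b):
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def word_weight_in_image(features, cluster_center, word_weight):
    """Residual-weighted weight of one visual word in one image.

    features       : descriptors (uint8 arrays) of the image that fell into this leaf
    cluster_center : descriptor of the leaf's cluster centre c_i
    word_weight    : the word's own weight W_i^weight (e.g. an IDF-style weight)

    Each feature contributes more when it is closer to the cluster centre, so two
    features landing in the same leaf are no longer indistinguishable.
    """
    total = 0.0
    for f in features:
        residual = hamming(f, cluster_center) / DESCRIPTOR_BITS  # normalised to [0, 1]
        total += word_weight * (1.0 - residual)
    return total
```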
- the word vector is calculated by a residual weighting method. This takes into account the differences between features falling in the same visual word, increases their distinguishability, and fits easily into a TF-IDF (term frequency-inverse document frequency) framework, so the speed of image retrieval and feature matching can be improved.
- FIG. 3 is another visual positioning method provided by an embodiment of the present disclosure, and the method may include:
- the terminal shoots a target image.
- the terminal can be a mobile phone or another device with a photographing function and/or a video capture function.
- the terminal uses the ORB algorithm to extract the ORB feature of the target image.
- the terminal uses other feature extraction methods to extract features of the target image.
- the terminal transmits the ORB features extracted from the target image and the internal parameters of the camera to the server.
- Steps 302 to 303 can be replaced by: the terminal transmits the target image and the internal parameters of the camera to the server.
- the ORB feature of the image can be extracted by the server, so as to reduce the amount of calculation of the terminal.
- the user can start the target application on the terminal, and use the camera to collect the target image through the target application and transmit the target image to the server.
- the internal parameters of the camera may be the internal parameters of the terminal's camera.
- the server converts the ORB feature into an intermediate word vector.
- the manner in which the server converts the ORB feature into the intermediate word vector is the same as the manner in which the feature extracted from the first image is converted into the target word vector by using the vocabulary tree in the foregoing embodiment, and will not be detailed here.
- the server determines, according to the intermediate word vector, the first H images most similar to the target image in each image sequence, and obtains the similarity score corresponding to each of the first H images with the highest similarity scores to the target image in each image sequence.
- Each image sequence is contained in the image library, and each image sequence is used to construct a sub-point cloud map, and these sub-point cloud maps form a point cloud map corresponding to the scene to be located.
- Step 305 is to query the first H images most similar to the target image in each image sequence in the image library.
- H is an integer greater than 1, for example, H is 10.
- Each image sequence may be obtained by collecting one or more regions of the scene to be located.
- the server calculates the similarity score between each image in each image sequence and the target image according to the intermediate word vector.
- the similarity score formula can be as follows:
- s(v1, v2) represents the similarity score between the visual word vector v1 and the visual word vector v2.
- the visual word vector v1 can be the word vector calculated by formula (1) based on the ORB features extracted from the target image; the visual word vector v2 can be the word vector calculated by formula (1) based on the ORB features extracted from any image in the image library.
- the server may store visual word vectors (corresponding to the aforementioned reference word vectors) corresponding to each image in the image library.
- the visual word vector corresponding to each image is calculated by formula (1) from the features extracted from that image. It can be understood that the server only needs to calculate the visual word vector corresponding to the target image, and does not need to recalculate the visual word vectors corresponding to the images included in each image sequence in the image library.
- the server only queries images that share at least one visual word with the intermediate word vector, that is, it only compares similarity based on the image indexes stored in the leaf nodes corresponding to the non-zero items of the intermediate word vector. In other words, the server determines the images in the image library that correspond to at least one visual word of the target image to obtain multiple primary selected images, and then queries, according to the intermediate word vector, the first H images among these primary selected images that are most similar to the target image. For example, if the weight corresponding to the i-th visual word is non-zero both in the target image and in a certain primary selected image, the target image and that primary selected image both correspond to the i-th visual word.
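- As a rough illustration of step 305, the sketch below retrieves only the images that share at least one visual word with the query (via the image index stored on each leaf node) and then scores them. The exact similarity formula of the disclosure is not shown above, so the commonly used L1 score for normalised bag-of-words vectors is substituted here purely as an assumption.

```python
import numpy as np

def l1_similarity(v1, v2):
    """Similarity of two visual word vectors after L1 normalisation.

    A classical bag-of-words score s = 2 - |v1/|v1|_1 - v2/|v2|_1|_1; it only
    stands in for the patent's own scoring formula s(v1, v2).
    """
    v1 = v1 / np.abs(v1).sum()
    v2 = v2 / np.abs(v2).sum()
    return 2.0 - np.abs(v1 - v2).sum()

def retrieve_candidates(query_vec, image_vecs, inverted_index, top_h=10):
    """query_vec      : word vector of the target image
       image_vecs     : {image_id: word vector} for the whole image library
       inverted_index : {word_id: set of image_ids whose weight for word_id is non-zero}
    """
    # Only images sharing at least one non-zero visual word with the query are scored.
    primary = set()
    for word_id in np.nonzero(query_vec)[0]:
        primary |= inverted_index.get(int(word_id), set())
    scored = [(img_id, l1_similarity(query_vec, image_vecs[img_id])) for img_id in primary]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_h]
```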
- the server sorts, from high to low, the similarity scores corresponding to the first H images with the highest similarity scores to the target image in each image sequence, and takes the images with the higher similarity scores to the target image as candidate images.
- the image library includes F image sequences, and the top 20% of the (F×H) images with the highest similarity scores to the target image are taken as candidate images.
- the (F×H) images consist of the first H images with the highest similarity scores to the target image in each image sequence. If the number of images in the top 20% is less than 10, the top 10 images are taken directly.
- Step 306 is an operation of screening candidate images.
- the server performs feature matching between each candidate image and the target image, and determines the top G images with the largest number of matched features.
- the features of the target image are first classified, one by one, to nodes in the L-th layer of the vocabulary tree.
- the classification method is, starting from the root node, to select layer by layer the cluster center (node of the tree) with the shortest Hamming distance to the current feature. Each classified feature is matched only against features that have a feature index in the corresponding node and whose owning image is a candidate image. This can speed up feature matching.
- Step 307 performs feature matching between each candidate image and the target image; it can therefore be regarded as a process of feature matching between two images.
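- The per-leaf matching of step 307 can be sketched as follows: target-image features are grouped by the leaf node they fall into, and Hamming-distance matching is attempted only against features whose feature index is stored on the same leaf and whose owning image is one of the candidate images. The leaf-assignment representation and the distance threshold used here are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def hamming(a, b):
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_by_leaf(target_feats, target_leaves, feature_index, candidate_ids, max_dist=64):
    """target_feats  : list of target-image descriptors (uint8 arrays)
       target_leaves : leaf node id of each target descriptor (same order)
       feature_index : {leaf_id: [(image_id, descriptor), ...]} stored on the vocabulary tree
       candidate_ids : set of image ids kept after image retrieval
       Returns {image_id: number of matched features}."""
    matches = defaultdict(int)
    for feat, leaf in zip(target_feats, target_leaves):
        best_img, best_dist = None, max_dist + 1
        for image_id, cand_feat in feature_index.get(leaf, []):
            if image_id not in candidate_ids:
                continue                          # only candidate images are considered
            d = hamming(feat, cand_feat)
            if d < best_dist:
                best_img, best_dist = image_id, d
        if best_img is not None:
            matches[best_img] += 1
    return dict(matches)
```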
- the server obtains (2K+1) consecutive images in the reference image sequence.
- the images in the reference image sequence are sorted according to the sequence of acquisition.
- the reference image sequence is the image sequence that includes any one of the top G images; the (2K+1) images (corresponding to a local point cloud map) include that image, the K images preceding it, and the K images following it.
- Step 308 is an operation of determining a local point cloud map.
- the server determines multiple features that match the features extracted from the target image among the features extracted from the (2K+1) images.
- step 309 can be regarded as a matching operation between the target image and the local point cloud map, that is, the frame-local point cloud map matching in FIG. 3.
- the vocabulary tree is first used to classify the features extracted from the (2K+1) images, and the same processing is then performed on the features extracted from the target image; only features of the two parts that fall into the same node are considered for matching, which can speed up feature matching. Here, one of the two parts is the target image and the other part is the (2K+1) images.
- the server determines the pose of the camera according to the multiple features, the spatial coordinate points corresponding to the multiple features in the point cloud map, and the internal parameters of the camera.
- Step 310 is similar to step 203 in FIG. 2 and will not be described in detail here.
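- Step 310 amounts to a standard perspective-n-point problem: the 2D feature locations in the target image, the 3D points they match in the local point cloud map, and the camera intrinsics determine the pose. A minimal sketch using OpenCV's RANSAC-based PnP solver is given below; the use of cv2.solvePnPRansac and the inlier threshold are illustrative assumptions, not the solver prescribed by the disclosure.

```python
import cv2
import numpy as np

def estimate_pose(points_2d, points_3d, fx, fy, cx, cy):
    """points_2d: Nx2 pixel coordinates in the target image
       points_3d: Nx3 coordinates of the matched points in the point cloud map
       fx, fy, cx, cy: camera intrinsics transmitted by the terminal
       Returns (ok, R, t): rotation matrix and translation of the camera pose."""
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, None)                        # None: pixel coordinates assumed undistorted
    if not ok or inliers is None or len(inliers) < 12:
        return False, None, None        # too few inliers is treated as a positioning failure (assumed threshold)
    R, _ = cv2.Rodrigues(rvec)          # axis-angle -> 3x3 rotation matrix
    return True, R, tvec.reshape(3)
```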
- if the server executes step 310 and fails to determine the pose of the camera, it uses another image among the top G images to perform steps 308 to 310 again, until the pose of the camera is successfully determined. For example, the server first determines the (2K+1) images based on the first image among the top G images and uses them to determine the pose of the camera; if the pose is not successfully determined, it determines a new set of (2K+1) images based on the second image among the top G images and uses them to determine the pose of the camera; the above operations are repeated until the pose of the camera is successfully determined.
- the server sends the location information of the camera to the terminal when it successfully determines the pose of the camera.
- the position information may include the three-dimensional position of the camera and the direction of the camera.
- the server can determine the three-dimensional position of the camera according to the conversion matrix and the pose of the camera, and generate the position information.
- the server executes step 308 again if it fails to determine the pose of the camera.
- each time the server executes step 308, it needs to determine consecutive (2K+1) images based on one of the top G images. It should be understood that the consecutive (2K+1) images determined each time step 308 is executed are different.
- the terminal displays the location of the camera on the electronic map.
- the terminal displays the location and direction of the camera on the electronic map. It can be understood that the camera is installed on the terminal, so the position of the camera is the position of the terminal. Users can accurately and quickly determine their own location and direction according to the location and direction of the camera.
- the terminal and the server work together.
- the terminal collects images and extracts features.
- the server is responsible for positioning and sends the positioning result (i.e., the location information) to the terminal; the user only needs to use the terminal to send an image to the server to determine exactly where they are.
- FIG. 4 is another visual positioning method provided by an embodiment of the present disclosure. As shown in FIG. 4, the method may include:
- the server obtains continuous multiple frames of images or multiple sets of features collected by the terminal.
- Each set of features may be features extracted from one frame of image, and the multiple sets of features are in turn features extracted from multiple consecutive frames of images.
- the consecutive multiple frames of images are sorted according to the sequence of acquisition.
- the server determines the pose of the camera according to the first frame of image or the feature extracted from the first frame of image.
- the first frame of image is the first frame of images in the continuous multiple frames of images.
- Step 402 corresponds to the method of positioning based on a single image in FIG. 3.
- the server can use the method in FIG. 3 to determine the pose of the camera by using the first frame of image.
- Using the first frame of continuous images to perform positioning is the same as positioning based on a single image.
- the first frame positioning in the continuous multi-frame positioning is the same as the single-frame positioning. If the positioning is successful, it will switch to continuous frame positioning; if the positioning fails, it will continue to single-frame positioning.
- if the server successfully determined the pose of the camera according to the previous frame of image, it determines N consecutive frames of images in the target image sequence.
- the situation in which the pose of the camera is successfully determined in the previous frame of image means that the server executes step 402 to successfully determine the pose of the camera.
- the target image sequence is the image sequence to which the features used to successfully locate the pose of the camera for the previous frame of image belong.
- the server uses an image in the target image sequence, the K images preceding it, and the K images following it to perform feature matching with the previous frame of image, and uses the matched feature points to successfully locate the pose of the camera; the server then obtains the thirty images preceding that image in the target image sequence, that image, and the thirty images following it, namely N consecutive frames of images.
- the server determines the pose of the camera according to N consecutive images in the target image sequence.
- Step 404 corresponds to step 308 to step 310 in FIG. 3.
- the server determines multiple candidate images in the case that the pose of the camera is not successfully determined according to the previous frame of image.
- the multiple candidate images are candidate images determined by the server according to the previous frame of image. That is to say, in the case that the pose of the camera is not successfully determined according to the previous frame of image, the server may use the candidate image of the previous frame as the candidate image of the current frame of image. This can reduce the steps of image retrieval and save time.
- the server determines the pose of the camera according to the candidate image of the previous frame of image.
- Step 406 corresponds to step 307 to step 310 in FIG. 3.
- after the server enters continuous-frame positioning, it mainly uses the prior knowledge from the successful positioning of the previous frame: the image matching the current frame has a high probability of being near the image that was successfully matched last time. A window can therefore be opened near that image, and the frames of images falling in the window are given priority.
- the window size can be up to 61 frames, with 30 frames before and 30 frames after the matched image, truncated if fewer than 30 frames are available on either side. If the positioning succeeds, the window is passed down to the next frame; if the positioning fails, positioning is performed according to the single-frame candidate images.
- a continuous frame sliding window mechanism is adopted, and sequential information is used to effectively reduce the amount of calculation, and the positioning success rate can be improved.
- the prior knowledge of the successful positioning of the previous frame may be used to accelerate subsequent positioning operations.
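- The continuous-frame control flow of steps 402 to 406 can be summarised with the following sketch. The three callables (single-image pipeline, window matching, matching against cached candidate images) stand for the corresponding steps of FIG. 3 and FIG. 4; their names, return values, and the 30-frame half-window are assumptions made for this illustration.

```python
def locate_sequence(frames, image_sequences,
                    locate_single_frame, locate_in_window, locate_with_candidates,
                    window_half=30):
    """Sliding-window continuous-frame positioning (simplified sketch).

    frames          : consecutive frames (or their feature sets) from the terminal
    image_sequences : {sequence_id: list of library images}
    """
    poses = []
    last_hit = None         # (sequence_id, index of the last successfully matched library image)
    last_candidates = None  # candidate images computed for the previous frame

    for frame in frames:
        if last_hit is not None:
            # Previous frame succeeded: only search a window of up to
            # 2 * window_half + 1 library images centred on the matched image.
            seq_id, idx = last_hit
            seq = image_sequences[seq_id]
            lo, hi = max(0, idx - window_half), min(len(seq), idx + window_half + 1)
            pose, hit = locate_in_window(frame, seq[lo:hi])   # truncated near sequence ends
            last_hit = (seq_id, lo + hit) if pose is not None else None
        elif last_candidates is not None:
            # Previous frame failed: reuse its candidate images instead of a new retrieval.
            pose, last_hit, last_candidates = locate_with_candidates(frame, last_candidates)
        else:
            # First frame (or nothing cached): run the full single-image pipeline.
            pose, last_hit, last_candidates = locate_single_frame(frame, image_sequences)
        poses.append(pose)
    return poses
```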
- FIG. 5 is a positioning and navigation method provided by an embodiment of the present disclosure. As shown in FIG. 5, the method may include:
- the terminal starts the target application.
- the target application is an application specially developed to achieve accurate indoor positioning. In actual applications, after the user clicks the icon corresponding to the target application on the screen of the terminal, the target application is started.
- the terminal receives the destination address input by the user through the target interface.
- the target interface is the interface displayed on the screen of the terminal after the terminal starts the target application, that is, the interface of the target application.
- the destination address can be a restaurant, coffee shop, movie theater, etc.
- the terminal displays the currently collected image, and transmits the collected image or the features extracted from the collected image to the server.
- after the terminal receives the destination address input by the user, it can collect images of the surrounding environment through the camera (i.e., the camera on the terminal) in real time or near real time, and transmit the collected images to the server at a fixed interval. In some embodiments, the terminal extracts the features of the collected images and transmits the extracted features to the server at fixed intervals.
- the server determines the pose of the camera according to the received image or feature.
- Step 504 corresponds to step 401 to step 406 in FIG. 4.
- the server uses the positioning method in FIG. 4 to determine the pose of the camera according to each received frame of image or the features of each frame of image. It can be understood that the server can sequentially determine the pose of the camera according to the image sequence or feature sequence sent by the terminal, and then determine the position of the camera. In other words, the server can determine the pose of the camera in real time or near real time.
- the server determines the three-dimensional position of the camera according to the conversion matrix and the pose of the camera.
- the conversion matrix is obtained by transforming the angle and position of the point cloud map so that the outline of the point cloud map is aligned with the indoor floor plan. Specifically, the rotation matrix R and the translation vector t of the camera pose are combined into a 4×4 homogeneous matrix T′; the conversion matrix T_i is multiplied by T′ to obtain a new matrix T*, and the translation component t* of T* is the final three-dimensional position of the camera.
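- Read as reconstructed above, this step composes the camera pose (as a 4×4 homogeneous matrix) with the map-to-floor-plan conversion matrix and takes the translation component of the product as the camera's three-dimensional position. A small numpy sketch of that composition is given below; it is an illustration of the reconstructed passage, not an exact reproduction of the original notation.

```python
import numpy as np

def camera_position(T_conversion, R, t):
    """T_conversion : 4x4 matrix aligning the point cloud map with the indoor floor plan
       R, t         : rotation matrix and translation vector of the camera pose
       Returns the three-dimensional position t* of the camera in floor-plan coordinates."""
    T_prime = np.eye(4)
    T_prime[:3, :3] = R               # combine R and t into a 4x4 homogeneous matrix T'
    T_prime[:3, 3] = t
    T_star = T_conversion @ T_prime   # new matrix T* = T_i @ T'
    return T_star[:3, 3]              # its translation component t* is the final 3D position
```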
- the server sends location information to the terminal.
- the position information may include the three-dimensional position of the camera, the direction of the camera, and mark information.
- the mark information indicates the route the user needs to walk from the current location to the target address.
- the marking information only indicates the route within a target distance, where the target distance is the farthest distance along the road visible in the currently displayed image.
- the target distance may be 10 meters, 20 meters, 50 meters, and so on.
- if the server successfully determines the pose of the camera, it can determine the three-dimensional position of the camera according to the conversion matrix and the pose of the camera. Before performing step 506, the server may generate the mark information according to the position of the camera, the destination address, and the electronic map.
- the terminal displays the collected images in real time and displays a mark indicating the route for the user to reach the destination address.
- the user starts the target application on the mobile phone and enters the destination address to be reached; the user raises the mobile phone in front of them to collect images, and the mobile phone displays the collected images in real time and displays a mark, such as an arrow, indicating the route for the user to reach the destination address.
- the server can accurately locate the location of the camera and provide navigation information to the user, and the user can quickly reach the target address according to the guidance.
- Fig. 6 is a method for constructing a point cloud map provided by an embodiment of the disclosure. As shown in Figure 6, the method may include:
- the server obtains multiple video sequences.
- the user can divide the scene to be positioned into areas and collect multi-angle video sequences for each area; each area requires at least two video sequences, one taken in the forward direction and one in the backward direction.
- the multiple video sequences are video sequences obtained by shooting each area in the scene to be positioned from multiple angles.
- the server extracts images for each of the multiple video sequences according to the target frame rate to obtain multiple image sequences.
- the server extracts a video sequence according to the target frame rate to obtain an image sequence.
- the target frame rate may be 30 frames/sec.
- Each image sequence is used to construct a sub-point cloud map.
- the server uses each image sequence to construct a sub-point cloud map.
- the server may use the SFM (structure from motion) algorithm to construct a sub-point cloud map from each image sequence, and all the sub-point cloud maps form the point cloud map.
- the scene to be positioned is divided into multiple regions, and a sub-point cloud map is constructed for each region.
- the multiple image sequences can be stored in the image database, and the vocabulary tree is used to determine the visual word vector corresponding to each image in the multiple image sequences .
- the server may store the visual word vector corresponding to each image in the multiple image sequences.
- the index of the corresponding image is added to each visual word included in the vocabulary tree. For example, if the weight of a certain visual word in the vocabulary tree corresponding to a certain image in the image library is not 0, then the index of the image is added to the visual word.
- the server adds an index and a feature index of the corresponding image to each visual word included in the vocabulary tree.
- the server can use the vocabulary tree to classify each feature of each image to a leaf node, where each leaf node corresponds to a visual word. For example, if 100 of the features extracted from the images in the image sequences fall on a certain leaf node, the feature index of these 100 features is added to the visual word corresponding to that leaf node; the feature index indicates the 100 features.
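- The index construction described above attaches, to every visual word (leaf node), the indexes of the library images whose weight for that word is non-zero and the indexes of the features that fall on that leaf. A sketch of such an inverted index is given below; the classify_to_leaf callable stands in for the vocabulary-tree lookup and is an assumption.

```python
from collections import defaultdict

def build_inverted_index(library_features, classify_to_leaf):
    """library_features : {image_id: list of descriptors extracted from that image}
       classify_to_leaf : callable mapping a descriptor to its leaf node (visual word) id

       Returns
         image_index   : {word_id: set of image ids containing that word}
         feature_index : {word_id: list of (image_id, feature_id) falling on that leaf}
    """
    image_index = defaultdict(set)
    feature_index = defaultdict(list)
    for image_id, descriptors in library_features.items():
        for feature_id, desc in enumerate(descriptors):
            word_id = classify_to_leaf(desc)
            image_index[word_id].add(image_id)                       # accelerates image retrieval
            feature_index[word_id].append((image_id, feature_id))    # accelerates feature matching
    return image_index, feature_index
```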
- the following provides a specific example of locating the target pose of the camera based on an image sequence and the first image, which may include: determining, based on the image database, a sub-point cloud map established from the first image sequence, where the sub-point cloud map includes 3D coordinates and 3D descriptors corresponding to the 3D coordinates; determining the 2D coordinates of the first image and the 2D descriptors corresponding to the 2D coordinates; matching the 2D coordinates and 2D descriptors with the 3D coordinates and 3D descriptors; and determining the first pose, the second pose, or the like according to the conversion relationship between the successfully matched 2D coordinates and 2D descriptors and the 3D coordinates and 3D descriptors.
- the 3D descriptor may be description information of a 3D coordinate, including adjacent coordinates of the 3D coordinate and/or attribute information of those coordinates.
- the 2D descriptor may be description information of a 2D coordinate. For example, the PnP algorithm may be used with the above conversion relationship to determine the first pose or the second pose of the camera.
- FIG. 7 is a schematic structural diagram of a visual positioning device provided by an embodiment of the disclosure. As shown in FIG. 7, the visual positioning device may include:
- the screening unit 701 is configured to determine a first candidate image sequence from an image library; the image library is used to construct an electronic map, each frame of image in the first candidate image sequence is arranged in the order of its matching degree with the first image, and the first image is an image collected by a camera;
- the screening unit 701 is further configured to adjust the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence;
- the target window is a plurality of consecutive frames of images, determined from the image library, that contain the target frame image.
- the determining unit 702 is configured to determine the target pose of the camera when acquiring the first image according to the second candidate image sequence.
- the determining unit 702 is configured to determine the first pose of the camera according to a first image sequence and the first image; the first image sequence includes consecutive multiple frames of images in the image library adjacent to a first reference frame image, where the first reference frame image is included in the second candidate sequence;
- the first pose is the target pose.
- the determining unit 702 is configured to determine a second pose of the camera according to a second image sequence and the first image in the case that the position of the camera is not successfully located according to the first pose;
- the second image sequence includes consecutive multiple frames of images in the image library adjacent to a second reference frame image, and the second reference frame image is the next frame or the previous frame of the first reference frame image in the second candidate image sequence;
- in the case that the position of the camera is successfully located according to the second pose, the second pose is determined to be the target pose.
- the determining unit 702 is configured to determine, among the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0;
- and to determine the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera;
- the point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene where the camera was located when the first image was collected.
- the screening unit 701 is configured to, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjust the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence;
- and, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjust the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- the screening unit 701 is configured to, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjust the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence; and, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjust the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- the screening unit 701 is configured to determine the images in the image library that correspond to at least one same visual word as the first image to obtain a plurality of primary selected images, where any image in the image library corresponds to at least one visual word and the first image corresponds to at least one visual word; and to determine a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image.
- the screening unit 701 is configured to determine the top Q percent of images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image, to obtain the multiple candidate images; Q is a real number greater than 0.
- the filtering unit 701 is configured to use a vocabulary tree to convert the features extracted from the first image into a target word vector, where the vocabulary tree is obtained by clustering features extracted from training images collected from the scene to be located;
- to calculate the similarity between the target word vector and the visual word vector corresponding to each primary selected image, where the visual word vector corresponding to any primary selected image is a visual word vector obtained, by using the vocabulary tree, from the features extracted from that primary selected image;
- and to determine a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the target word vector.
- a leaf node in the vocabulary tree corresponds to a visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes;
- the screening unit 701 is configured to calculate the weight of the visual word corresponding to each leaf node in the vocabulary tree in the first image; combine the weight of the visual word corresponding to each leaf node in the first image into a vector to obtain The target word vector.
- a node of the vocabulary tree corresponds to a cluster center
- the screening unit 701 is configured to use the vocabulary tree to classify the features extracted from the first image to obtain intermediate features classified into a target leaf node;
- the target leaf node is any leaf node in the vocabulary tree, and the target leaf node corresponds to the target visual word;
- the screening unit 701 calculates, according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image; the target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree is generated.
- the filtering unit 701 is configured to classify the third feature extracted from the first image into leaf nodes according to a vocabulary tree;
- the vocabulary tree is obtained by clustering the features extracted from images collected from the scene to be located;
- the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features;
- feature matching is performed between the third feature and the fourth feature in each leaf node to obtain the fourth feature matching the third feature; the fourth feature is a feature extracted from a target candidate image, and the target candidate image is any image in the first candidate image sequence;
- the number of features matching between the target candidate image and the first image is obtained according to the fourth features matching the third features in the leaf nodes.
- the determining unit 702 is further configured to determine the three-dimensional position of the camera according to the conversion matrix and the first pose; the conversion matrix is obtained by transforming the angle and position of the point cloud map and aligning the outline of the point cloud map with the indoor floor plan.
- the determining unit 702 is configured to determine that the positional relationships of L pairs of feature points all conform to the first pose, where one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
- the device further includes:
- the first obtaining unit 703 is configured to obtain a plurality of image sequences, each image sequence being obtained by collecting one area or multiple areas in the scene to be positioned;
- the map construction unit 704 is configured to construct the point cloud map according to the multiple image sequences, where any one of the multiple image sequences is used to construct a sub-point cloud map of one or more regions; the point cloud map includes the first electronic map and the second electronic map.
- the device further includes:
- the second acquiring unit 705 is configured to acquire multiple training images obtained by shooting the scene to be positioned;
- the feature extraction unit 706 is configured to perform feature extraction on the multiple training images to obtain a training feature set
- the clustering unit 707 is configured to perform multiple clustering of the features in the training feature set to obtain the vocabulary tree.
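- The multiple rounds of clustering performed by the clustering unit 707 build the vocabulary tree layer by layer. The sketch below uses scikit-learn's KMeans recursively on float-cast descriptors purely as an illustration; the branching factor, depth, and the use of k-means rather than a binary-median variant are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(features, branching=10, depth=4):
    """features: (N, D) float array of training descriptors.
       Returns a nested dict {'center': ..., 'children': [...]}; nodes at `depth`
       (or with too few features to split) are leaf nodes, i.e. visual words."""
    def build(node_features, level):
        node = {"center": node_features.mean(axis=0), "children": []}
        if level == depth or len(node_features) <= branching:
            return node                                        # leaf node == one visual word
        km = KMeans(n_clusters=branching, n_init=5, random_state=0).fit(node_features)
        for k in range(branching):
            cluster = node_features[km.labels_ == k]
            if len(cluster) > 0:
                node["children"].append(build(cluster, level + 1))
        return node

    return build(np.asarray(features, dtype=np.float64), 0)
```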
- the second acquiring unit 705 and the first acquiring unit 703 may be the same unit or different units.
- the visual positioning device is a server, and the device further includes:
- the receiving unit 708 is configured to receive the first image from a target device that has the camera installed.
- the device further includes:
- the sending unit 709 is configured to send the location information of the camera to the target device.
- FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the present disclosure. As shown in FIG. 8, the terminal may include:
- the camera 801 is configured to collect a target image
- the sending unit 802 is configured to send target information to the server, where the target information includes the target image or the feature sequence extracted from the target image, and the internal parameters of the camera;
- the receiving unit 803 is configured to receive position information; the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when the target image was collected;
- the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window, where the target window is a plurality of consecutive frames of images, determined from an image library, that contain a target frame image, and the image library is used to construct an electronic map;
- the target frame image is an image in the image library that matches a second image;
- the second image is an image collected by the camera before the first image is collected;
- the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image;
- the display unit 804 is configured to display an electronic map including the position and direction of the camera.
- the terminal further includes: a feature extraction unit 805, configured to extract features in the target image.
- the position information may include the three-dimensional position of the camera and the direction of the camera.
- the camera 801 can be specifically used to execute the method mentioned in step 301 and the method that can be equivalently replaced;
- the feature extraction unit 805 can be specifically configured to execute the method mentioned in step 302 and the method that can be equivalently replaced;
- the sending unit 802 can be specifically used to execute the method mentioned in step 303 and the method that can be equivalently replaced;
- the display unit 804 is specifically configured to execute the method mentioned in step 313 and step 507 and the method that can be equivalently replaced. It can be understood that the terminal in FIG. 8 can implement the operations performed by the terminal in FIG. 3 and FIG. 5.
- each unit in the visual positioning device and the terminal is only a division of logical functions, and may be fully or partially integrated into a physical entity in actual implementation, or may be physically separated.
- the above units can be separately established processing elements, or they can be integrated into the same chip for implementation.
- they can also be stored in the storage element of the controller in the form of program code, which is called and combined by a certain processing element of the processor.
- each unit can be integrated together or implemented independently.
- the processing element here can be an integrated circuit chip with signal processing capabilities.
- each step of the above method or each of the above units may be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
- the processing element may be a general-purpose processor, such as a central processing unit (CPU), or one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA).
- FIG. 9 is a schematic diagram of another terminal structure provided by an embodiment of the present disclosure.
- the terminal in this embodiment as shown in FIG. 9 may include: one or more processors 901, a memory 902, a transceiver 903, a camera 904, and an input and output device 905.
- the aforementioned processor 901, transceiver 903, memory 902, camera 904, and input/output device 905 are connected via a bus 906.
- the memory 902 is used to store instructions
- the processor 901 is used to execute instructions stored in the memory 902.
- the transceiver 903 is used to receive and send data.
- the camera 904 is used to collect images.
- the processor 901 is used to control the transceiver 903, the camera 904, and the input/output device 905 to implement the operations performed by the terminal in FIG. 3 and FIG. 5.
- the processor 901 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the memory 902 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901. A part of the memory 902 may also include a non-volatile random access memory. For example, the memory 902 may also store device type information.
- the processor 901, the memory 902, the transceiver 903, the camera 904, and the input/output device 905 described in the embodiments of the present disclosure can implement the implementation of the terminal described in any of the foregoing embodiments, which will not be repeated here.
- the transceiver 903 can implement the functions of the sending unit 802 and the receiving unit 803.
- the processor 901 may implement the function of the feature extraction unit 805.
- the input and output device 905 is used to implement the function of the display unit 804, and the input and output device 905 may be a display screen.
- FIG. 10 is a schematic diagram of a server structure provided by an embodiment of the present disclosure.
- the server 1100 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors), memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044.
- the memory 1032 and the storage medium 1030 may be short-term storage or permanent storage.
- the program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
- the central processing unit 1022 may be configured to communicate with the storage medium 1030, and execute a series of instruction operations in the storage medium 1030 on the server 1100.
- the server 1100 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058, and/or one or more operating systems 1041, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
- the input/output interface 1058 can implement the functions of the receiving unit 708 and the sending unit 709.
- the central processing unit 1022 can implement the functions of the screening unit 701, the determining unit 702, the first obtaining unit 703, the map constructing unit 704, the second obtaining unit 705, the feature extraction unit 706, and the clustering unit 707.
- a computer-readable storage medium stores a computer program, and when executed by a processor, the computer program implements: determining a first candidate image sequence from an image library, where the image library is used to construct an electronic map, each frame of image in the first candidate image sequence is arranged in the order of its matching degree with a first image, and the first image is an image collected by a camera; adjusting the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence, where the target window is a plurality of consecutive frames of images, determined from the image library, that contain a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image is collected; and determining, according to the second candidate image sequence, the target pose of the camera when collecting the first image.
- the computer-readable storage medium stores a computer program. When executed by a processor, the computer program implements: collecting a target image through a camera; sending target information to a server, where the target information includes the target image or the feature sequence extracted from the target image, and the internal parameters of the camera; receiving position information, where the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when the target image was collected; the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window; the target window is a plurality of consecutive frames of images, determined from an image library, that contain a target frame image, the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, the second image is an image collected by the camera before the first image is collected, and the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Remote Sensing (AREA)
- Mathematical Physics (AREA)
- Business, Economics & Management (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- Nonlinear Science (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (46)
- A visual positioning method, comprising: determining a first candidate image sequence from an image library, wherein the frames of images in the first candidate image sequence are arranged in the order of their matching degree with a first image, and the first image is an image collected by a camera; adjusting the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence, wherein the target window is a plurality of consecutive frames of images, determined from the image library, that contain a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image is collected; and determining, according to the second candidate image sequence, a target pose of the camera when collecting the first image.
- The method according to claim 1, wherein determining, according to the second candidate image sequence, the target pose of the camera when collecting the first image comprises: determining a first pose according to a first image sequence and the first image, wherein the first image sequence comprises consecutive multiple frames of images in the image library adjacent to a first reference frame image, and the first reference frame image is included in the second candidate sequence; and in the case that the position of the camera is successfully located according to the first pose, determining the first pose to be the target pose.
- The method according to claim 2, wherein after determining the first pose according to the first image sequence and the first image, the method further comprises: in the case that the position of the camera is not successfully located according to the first pose, determining a second pose according to a second image sequence and the first image, wherein the second image sequence comprises consecutive multiple frames of images in the image library adjacent to a second reference frame image, and the second reference frame image is the next frame or the previous frame of the first reference frame image in the second candidate image sequence; and in the case that the position of the camera is successfully located according to the second pose, determining the second pose to be the target pose.
- The method according to claim 2 or 3, wherein determining the first pose according to the first image sequence and the first image comprises: determining, among the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, F being an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in a point cloud map, and the internal parameters of the camera, wherein the point cloud map is an electronic map of a scene to be positioned, and the scene to be positioned is the scene where the camera was located when the first image was collected.
- The method according to any one of claims 1 to 4, wherein adjusting the order of the frames of images in the first candidate image sequence according to the target window to obtain the second candidate image sequence comprises: in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjusting the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence; and in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjusting the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- The method according to claim 5, wherein determining the first candidate image sequence from the image library comprises: determining a plurality of candidate images in the image library whose corresponding visual word vectors have the highest similarity to the visual word vector corresponding to the first image, wherein any image in the image library corresponds to one visual word vector, and the images in the image library are used to construct the electronic map of the scene to be positioned where the target device was located when collecting the first image; performing feature matching between each of the plurality of candidate images and the first image to obtain the number of features of each candidate image that match the first image; and obtaining the M images, among the plurality of candidate images, with the largest numbers of features matching the first image, to obtain the first candidate image sequence.
- The method according to claim 6, wherein determining the plurality of candidate images in the image library whose corresponding visual word vectors have the highest similarity to the visual word vector corresponding to the first image comprises: determining the images in the image library that correspond to at least one same visual word as the first image to obtain a plurality of primary selected images, wherein any image in the image library corresponds to at least one visual word and the first image corresponds to at least one visual word; and determining a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image.
- The method according to claim 7, wherein determining the plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image comprises: determining the top Q percent of images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image, to obtain the plurality of candidate images, Q being a real number greater than 0.
- The method according to claim 7 or 8, wherein determining the plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image comprises: converting the features extracted from the first image into a target word vector by using a vocabulary tree, wherein the vocabulary tree is obtained by clustering features extracted from training images collected from the scene to be positioned; respectively calculating the similarity between the target word vector and the visual word vector corresponding to each of the plurality of primary selected images, wherein the visual word vector corresponding to any primary selected image is a visual word vector obtained, by using the vocabulary tree, from the features extracted from that primary selected image; and determining a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the target word vector.
- The method according to claim 9, wherein each leaf node in the vocabulary tree corresponds to one visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes; converting the features extracted from the first image into the target word vector by using the vocabulary tree comprises: calculating the weight, in the first image, of the visual word corresponding to each leaf node in the vocabulary tree; and combining the weights, in the first image, of the visual words corresponding to the leaf nodes into one vector to obtain the target word vector.
- The method according to claim 10, wherein each node of the vocabulary tree corresponds to one cluster center; calculating the weight, in the first image, of each visual word corresponding to the vocabulary tree comprises: classifying the features extracted from the first image by using the vocabulary tree to obtain intermediate features classified into a target leaf node, wherein the target leaf node is any leaf node in the vocabulary tree and corresponds to a target visual word; and calculating, according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image, wherein the target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree is generated.
- The method according to claim 11, wherein the intermediate features comprise at least one sub-feature; the target weight is the sum of the weight parameters corresponding to the sub-features comprised in the intermediate features; the weight parameter corresponding to a sub-feature is negatively related to a feature distance, and the feature distance is the Hamming distance between the sub-feature and the corresponding cluster center.
- The method according to any one of claims 6 to 12, wherein performing feature matching between each of the plurality of candidate images and the first image to obtain the number of features of each candidate image that match the first image comprises: classifying a third feature extracted from the first image into leaf nodes according to a vocabulary tree, wherein the vocabulary tree is obtained by clustering features extracted from images collected from the scene to be positioned, the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features; performing feature matching between the third feature and a fourth feature in each leaf node to obtain the fourth feature in each leaf node that matches the third feature, wherein the fourth feature is a feature extracted from a target candidate image, and the target candidate image is any image in the first candidate image sequence; and obtaining, according to the fourth features in the leaf nodes that match the third feature, the number of features of the target candidate image that match the first image.
- The method according to any one of claims 4 to 13, wherein after determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera, the method further comprises: determining the three-dimensional position of the camera according to a conversion matrix and the first pose, wherein the conversion matrix is obtained by transforming the angle and position of the point cloud map and aligning the outline of the point cloud map with an indoor floor plan.
- The method according to claims 1 to 14, wherein the case of determining that the first pose successfully locates the position of the camera comprises: determining that the positional relationships of L pairs of feature points all conform to the first pose, wherein one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
- The method according to claims 2 to 15, wherein before determining the first pose according to the first image sequence and the first image, the method further comprises: obtaining a plurality of image sequences, each image sequence being obtained by collecting one region or multiple regions of the scene to be positioned; and constructing the point cloud map according to the plurality of image sequences, wherein any one of the plurality of image sequences is used to construct a sub-point cloud map of one or more regions, and the point cloud map comprises the first electronic map and the second electronic map.
- The method according to any one of claims 9 to 16, wherein before converting the features extracted from the first image into the target word vector by using the vocabulary tree, the method further comprises: obtaining a plurality of training images obtained by photographing the scene to be positioned; performing feature extraction on the plurality of training images to obtain a training feature set; and clustering the features in the training feature set multiple times to obtain the vocabulary tree.
- The method according to claims 1 to 17, wherein the visual positioning method is applied to a server; before determining the first candidate image sequence from the image library, the method further comprises: receiving the first image from a target device, the target device being equipped with the camera.
- The method according to claim 18, wherein after determining that the first pose successfully locates the position of the camera, the method further comprises: sending position information of the camera to the target device.
- The method according to claims 1 to 17, wherein the visual positioning method is applied to an electronic device equipped with the camera.
- A visual positioning method, comprising: collecting a target image through a camera; sending target information to a server, the target information comprising the target image or a feature sequence extracted from the target image, and the internal parameters of the camera; receiving position information, the position information being used to indicate the position and direction of the camera, wherein the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when the target image was collected; the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window, the target window is a plurality of consecutive frames of images, determined from an image library, that contain a target frame image, the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, the second image is an image collected by the camera before the first image is collected, and the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image; and displaying an electronic map, the electronic map containing the position and direction of the camera.
- A visual positioning apparatus, comprising: a screening unit configured to determine a first candidate image sequence from an image library, wherein the frames of images in the first candidate image sequence are arranged in the order of their matching degree with a first image, and the first image is an image collected by a camera; the screening unit being further configured to adjust the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence, wherein the target window comprises a plurality of consecutive frames of images, determined from the image library, that contain a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image is collected; and a determining unit configured to determine, according to the second candidate image sequence, a target pose of the camera when collecting the first image.
- The apparatus according to claim 22, wherein the determining unit is specifically configured to determine a first pose according to a first image sequence and the first image, wherein the first image sequence comprises consecutive multiple frames of images in the image library adjacent to a first reference frame image, and the first reference frame image is included in the second candidate image sequence; and, in the case that the position of the camera is successfully located according to the first pose, determine the first pose to be the target pose.
- The apparatus according to claim 23, wherein the determining unit is further configured to, in the case that the position of the camera is not successfully located according to the first pose, determine a second pose of the camera according to a second image sequence and the first image, wherein the second image sequence comprises consecutive multiple frames of images in the image library adjacent to a second reference frame image, and the second reference frame image is the next frame or the previous frame of the first reference frame image in the second candidate image sequence; and, in the case that the position of the camera is successfully located according to the second pose, determine the second pose to be the target pose.
- The apparatus according to claim 23 or 24, wherein the determining unit is configured to determine, among the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, F being an integer greater than 0; and determine the first pose according to the F features, the spatial coordinate points corresponding to the F features in a point cloud map, and the internal parameters of the camera, wherein the point cloud map is an electronic map of a scene to be positioned, and the scene to be positioned is the scene where the camera was located when the first image was collected.
- The apparatus according to any one of claims 22 to 25, wherein the screening unit is configured to, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjust the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence; and, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjust the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- The apparatus according to claim 26, wherein the screening unit is configured to, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjust the images located in the target window in the first candidate image sequence to the last positions of the first candidate image sequence; and, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjust the images located in the target window in the first candidate image sequence to the foremost positions of the first candidate image sequence.
- The apparatus according to claim 27, wherein the screening unit is configured to determine the images in the image library that correspond to at least one same visual word as the first image to obtain a plurality of primary selected images, wherein any image in the image library corresponds to at least one visual word and the first image corresponds to at least one visual word; and determine a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image.
- The apparatus according to claim 28, wherein the screening unit is configured to determine the top Q percent of images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the visual word vector of the first image, to obtain the plurality of candidate images, Q being a real number greater than 0.
- The apparatus according to claim 28 or 29, wherein the screening unit is configured to convert the features extracted from the first image into a target word vector by using a vocabulary tree, wherein the vocabulary tree is obtained by clustering features extracted from training images collected from the scene to be positioned; respectively calculate the similarity between the target word vector and the visual word vector corresponding to each of the plurality of primary selected images, wherein the visual word vector corresponding to any primary selected image is a visual word vector obtained, by using the vocabulary tree, from the features extracted from that primary selected image; and determine a plurality of candidate images, among the plurality of primary selected images, whose corresponding visual word vectors have the highest similarity to the target word vector.
- The apparatus according to claim 30, wherein each leaf node in the vocabulary tree corresponds to one visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes; the screening unit is configured to calculate the weight, in the first image, of the visual word corresponding to each leaf node in the vocabulary tree, and combine the weights, in the first image, of the visual words corresponding to the leaf nodes into one vector to obtain the target word vector.
- The apparatus according to claim 31, wherein each node of the vocabulary tree corresponds to one cluster center; the screening unit is configured to classify the features extracted from the first image by using the vocabulary tree to obtain intermediate features classified into a target leaf node, wherein the target leaf node is any leaf node in the vocabulary tree and corresponds to a target visual word; and calculate, according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image, wherein the target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree is generated.
- The apparatus according to claim 32, wherein the intermediate features comprise at least one sub-feature; the target weight is the sum of the weight parameters corresponding to the sub-features comprised in the intermediate features; the weight parameter corresponding to a sub-feature is negatively related to a feature distance, and the feature distance is the Hamming distance between the sub-feature and the corresponding cluster center.
- The apparatus according to any one of claims 27 to 33, wherein the screening unit is configured to classify a third feature extracted from the first image into leaf nodes according to a vocabulary tree, wherein the vocabulary tree is obtained by clustering features extracted from images collected from the scene to be positioned, the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features; perform feature matching between the third feature and a fourth feature in each leaf node to obtain the fourth feature in each leaf node that matches the third feature, wherein the fourth feature is a feature extracted from a target candidate image, and the target candidate image is any image in the first candidate image sequence; and obtain, according to the fourth features in the leaf nodes that match the third feature, the number of features of the target candidate image that match the first image.
- The apparatus according to any one of claims 25 to 34, wherein the determining unit is further configured to determine the three-dimensional position of the camera according to a conversion matrix and the first pose, wherein the conversion matrix is obtained by transforming the angle and position of the point cloud map and aligning the outline of the point cloud map with an indoor floor plan.
- The apparatus according to any one of claims 22 to 35, wherein the determining unit is configured to determine that the positional relationships of L pairs of feature points all conform to the first pose, wherein one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
- The apparatus according to any one of claims 23 to 36, wherein the apparatus further comprises: a first obtaining unit configured to obtain a plurality of image sequences, each image sequence being obtained by collecting one region or multiple regions of the scene to be positioned; and a map construction unit configured to construct the point cloud map according to the plurality of image sequences, wherein any one of the plurality of image sequences is used to construct a sub-point cloud map of one or more regions, and the point cloud map comprises the first electronic map and the second electronic map.
- The apparatus according to any one of claims 30 to 37, wherein the apparatus further comprises: a second obtaining unit configured to obtain a plurality of training images obtained by photographing the scene to be positioned; a feature extraction unit configured to perform feature extraction on the plurality of training images to obtain a training feature set; and a clustering unit configured to cluster the features in the training feature set multiple times to obtain the vocabulary tree.
- The apparatus according to any one of claims 22 to 37, wherein the visual positioning apparatus is a server, and the apparatus further comprises: a receiving unit configured to receive the first image from a target device, the target device being equipped with the camera.
- The apparatus according to claim 39, wherein the apparatus further comprises: a sending unit configured to send position information of the camera to the target device.
- The apparatus according to any one of claims 22 to 38, wherein the visual positioning apparatus is an electronic device equipped with the camera.
- A terminal device, comprising: a camera configured to collect a target image; a sending unit configured to send target information to a server, the target information comprising the target image or a feature sequence extracted from the target image, and the internal parameters of the camera; a receiving unit configured to receive position information, the position information being used to indicate the position and direction of the camera, wherein the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when the target image was collected; the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window, the target window is a plurality of consecutive frames of images, determined from an image library, that contain a target frame image, the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, the second image is an image collected by the camera before the first image is collected, and the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image; and a display unit configured to display an electronic map, the electronic map containing the position and direction of the camera.
- A visual positioning system, comprising a server and a terminal device, wherein the server performs the method according to any one of claims 1 to 19, and the terminal device is configured to perform the method of claim 21.
- An electronic device, comprising: a memory configured to store a program; and a processor configured to execute the program stored in the memory, wherein when the program is executed, the processor is configured to perform the method according to any one of claims 1 to 20.
- A computer-readable storage medium, wherein the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 20.
- A computer program product, wherein the computer program product contains program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 20.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020227001898A KR20220024736A (ko) | 2019-08-30 | 2019-11-11 | 시각적 포지셔닝 방법 및 관련 장치 |
JP2022503488A JP7430243B2 (ja) | 2019-08-30 | 2019-11-11 | 視覚的測位方法及び関連装置 |
US17/585,114 US20220148302A1 (en) | 2019-08-30 | 2022-01-26 | Method for visual localization and related apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910821911.3A CN112445929B (zh) | 2019-08-30 | 2019-08-30 | 视觉定位方法及相关装置 |
CN201910821911.3 | 2019-08-30 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/585,114 Continuation US20220148302A1 (en) | 2019-08-30 | 2022-01-26 | Method for visual localization and related apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021035966A1 true WO2021035966A1 (zh) | 2021-03-04 |
Family
ID=74684964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/117224 WO2021035966A1 (zh) | 2019-08-30 | 2019-11-11 | 视觉定位方法及相关装置 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20220148302A1 (zh) |
JP (1) | JP7430243B2 (zh) |
KR (1) | KR20220024736A (zh) |
CN (1) | CN112445929B (zh) |
TW (1) | TWI745818B (zh) |
WO (1) | WO2021035966A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463429A (zh) * | 2022-04-12 | 2022-05-10 | 深圳市普渡科技有限公司 | 机器人、地图创建方法、定位方法及介质 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11620829B2 (en) * | 2020-09-30 | 2023-04-04 | Snap Inc. | Visual matching with a messaging application |
CN113177971A (zh) * | 2021-05-07 | 2021-07-27 | 中德(珠海)人工智能研究院有限公司 | 一种视觉跟踪方法、装置、计算机设备及存储介质 |
KR102366364B1 (ko) * | 2021-08-25 | 2022-02-23 | 주식회사 포스로직 | 기하학적 패턴 매칭 방법 및 이러한 방법을 수행하는 장치 |
CN118052867A (zh) * | 2022-11-15 | 2024-05-17 | 中兴通讯股份有限公司 | 定位方法、终端设备、服务器及存储介质 |
CN116659523B (zh) * | 2023-05-17 | 2024-07-23 | 深圳市保臻社区服务科技有限公司 | 一种基于社区进入车辆的位置自动定位方法及装置 |
CN117708357B (zh) * | 2023-06-16 | 2024-08-23 | 荣耀终端有限公司 | 一种图像检索方法和电子设备 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107796397A (zh) * | 2017-09-14 | 2018-03-13 | 杭州迦智科技有限公司 | 一种机器人双目视觉定位方法、装置和存储介质 |
CN108596976A (zh) * | 2018-04-27 | 2018-09-28 | 腾讯科技(深圳)有限公司 | 相机姿态追踪过程的重定位方法、装置、设备及存储介质 |
US20180297207A1 (en) * | 2017-04-14 | 2018-10-18 | TwoAntz, Inc. | Visual positioning and navigation device and method thereof |
CN109710724A (zh) * | 2019-03-27 | 2019-05-03 | 深兰人工智能芯片研究院(江苏)有限公司 | 一种构建点云地图的方法和设备 |
CN109816769A (zh) * | 2017-11-21 | 2019-05-28 | 深圳市优必选科技有限公司 | 基于深度相机的场景地图生成方法、装置及设备 |
CN110057352A (zh) * | 2018-01-19 | 2019-07-26 | 北京图森未来科技有限公司 | 一种相机姿态角确定方法及装置 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2418588A1 (en) * | 2010-08-10 | 2012-02-15 | Technische Universität München | Visual localization method |
EP2423873B1 (en) * | 2010-08-25 | 2013-12-11 | Lakeside Labs GmbH | Apparatus and Method for Generating an Overview Image of a Plurality of Images Using a Reference Plane |
US9324151B2 (en) * | 2011-12-08 | 2016-04-26 | Cornell University | System and methods for world-scale camera pose estimation |
JP5387723B2 (ja) * | 2012-04-26 | 2014-01-15 | カシオ計算機株式会社 | 画像表示装置、及び画像表示方法、画像表示プログラム |
US10121266B2 (en) * | 2014-11-25 | 2018-11-06 | Affine Technologies LLC | Mitigation of disocclusion artifacts |
CN104700402B (zh) * | 2015-02-06 | 2018-09-14 | 北京大学 | Visual positioning method and apparatus based on a scene's three-dimensional point cloud |
CN106446815B (zh) * | 2016-09-14 | 2019-08-09 | 浙江大学 | Simultaneous localization and mapping method |
CN107368614B (zh) * | 2017-09-12 | 2020-07-07 | 猪八戒股份有限公司 | Image retrieval method and apparatus based on deep learning |
CN108198145B (zh) * | 2017-12-29 | 2020-08-28 | 百度在线网络技术(北京)有限公司 | Method and device for point cloud data repair |
2019
- 2019-08-30 CN CN201910821911.3A patent/CN112445929B/zh active Active
- 2019-11-11 KR KR1020227001898A patent/KR20220024736A/ko not_active Application Discontinuation
- 2019-11-11 JP JP2022503488A patent/JP7430243B2/ja active Active
- 2019-11-11 WO PCT/CN2019/117224 patent/WO2021035966A1/zh active Application Filing
- 2019-12-30 TW TW108148436A patent/TWI745818B/zh active
2022
- 2022-01-26 US US17/585,114 patent/US20220148302A1/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180297207A1 (en) * | 2017-04-14 | 2018-10-18 | TwoAntz, Inc. | Visual positioning and navigation device and method thereof |
CN107796397A (zh) * | 2017-09-14 | 2018-03-13 | 杭州迦智科技有限公司 | Robot binocular vision positioning method, apparatus, and storage medium |
CN109816769A (zh) * | 2017-11-21 | 2019-05-28 | 深圳市优必选科技有限公司 | Scene map generation method, apparatus, and device based on a depth camera |
CN110057352A (zh) * | 2018-01-19 | 2019-07-26 | 北京图森未来科技有限公司 | Method and apparatus for determining camera attitude angles |
CN108596976A (zh) * | 2018-04-27 | 2018-09-28 | 腾讯科技(深圳)有限公司 | Relocalization method, apparatus, device, and storage medium for camera pose tracking |
CN109710724A (zh) * | 2019-03-27 | 2019-05-03 | 深兰人工智能芯片研究院(江苏)有限公司 | Method and device for constructing a point cloud map |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463429A (zh) * | 2022-04-12 | 2022-05-10 | 深圳市普渡科技有限公司 | Robot, map creation method, positioning method, and medium |
CN114463429B (zh) * | 2022-04-12 | 2022-08-16 | 深圳市普渡科技有限公司 | Robot, map creation method, positioning method, and medium |
Also Published As
Publication number | Publication date |
---|---|
JP7430243B2 (ja) | 2024-02-09 |
KR20220024736A (ko) | 2022-03-03 |
CN112445929A (zh) | 2021-03-05 |
US20220148302A1 (en) | 2022-05-12 |
JP2022541559A (ja) | 2022-09-26 |
TW202109357A (zh) | 2021-03-01 |
CN112445929B (zh) | 2022-05-17 |
TWI745818B (zh) | 2021-11-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI745818B (zh) | Visual positioning method, electronic device, and computer-readable storage medium | |
WO2021057744A1 (zh) | Positioning method and apparatus, device, and storage medium | |
WO2021057742A1 (zh) | Positioning method and apparatus, device, and storage medium | |
EP4056952A1 (en) | Map fusion method, apparatus, device, and storage medium | |
US9626585B2 (en) | Composition modeling for photo retrieval through geometric image segmentation | |
CN111323024B (zh) | Positioning method and apparatus, device, and storage medium | |
CN111652934A (zh) | Positioning method, map construction method, apparatus, device, and storage medium | |
KR20140043393A (ko) | Location-based recognition technique | |
US20230351794A1 (en) | Pedestrian tracking method and device, and computer-readable storage medium | |
WO2017114237A1 (zh) | Image query method and apparatus | |
WO2022142049A1 (zh) | Map construction method and apparatus, device, storage medium, and computer program product | |
WO2023221790A1 (zh) | Training method, apparatus, device, and medium for an image encoder | |
CN111709317A (zh) | Pedestrian re-identification method based on multi-scale features under a saliency model | |
Xue et al. | A fast visual map building method using video stream for visual-based indoor localization | |
JP7430254B2 (ja) | 場所認識のための視覚的オブジェクトインスタンス記述子 | |
Jiang et al. | Indoor localization with a signal tree | |
CN114743139A (zh) | Video scene retrieval method and apparatus, electronic device, and readable storage medium | |
Orhan et al. | Semantic pose verification for outdoor visual localization with self-supervised contrastive learning | |
US11127199B2 (en) | Scene model construction system and scene model constructing method | |
Sui et al. | An accurate indoor localization approach using cellphone camera | |
US20230281867A1 (en) | Methods performed by electronic devices, electronic devices, and storage media | |
Wu et al. | A vision-based indoor positioning method with high accuracy and efficiency based on self-optimized-ordered visual vocabulary | |
Zhang et al. | Hierarchical Image Retrieval Method Based on Bag-of-Visual-Word and Eight-point Algorithm with Feature Clouds for Visual Indoor Positioning | |
CN116843754A (zh) | Visual localization method and system based on multi-feature fusion | |
WO2022268094A1 (en) | Methods, systems, and media for image searching |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19943486; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2022503488; Country of ref document: JP; Kind code of ref document: A |
ENP | Entry into the national phase | Ref document number: 20227001898; Country of ref document: KR; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 19943486; Country of ref document: EP; Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.07.2022) |
122 | Ep: pct application non-entry in european phase | Ref document number: 19943486; Country of ref document: EP; Kind code of ref document: A1 |