WO2021035966A1 - Visual positioning method and related apparatus (视觉定位方法及相关装置) - Google Patents

Visual positioning method and related apparatus

Info

Publication number
WO2021035966A1
WO2021035966A1 · PCT/CN2019/117224 · CN2019117224W
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature
candidate
camera
Prior art date
Application number
PCT/CN2019/117224
Other languages
English (en)
French (fr)
Inventor
鲍虎军
章国锋
余海林
叶智超
盛崇山
Original Assignee
浙江商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江商汤科技开发有限公司
Priority to KR1020227001898A (published as KR20220024736A)
Priority to JP2022503488A (published as JP7430243B2)
Publication of WO2021035966A1
Priority to US17/585,114 (published as US20220148302A1)

Classifications

    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06F 16/29 Geographical information databases
    • G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G06F 16/587 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using geographical or spatial information, e.g. location
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06V 10/36 Applying a local operator, i.e. means to operate on image points situated in the vicinity of a given point; Non-linear local filtering operations, e.g. median filtering
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/752 Contour matching
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/7625 Hierarchical techniques, i.e. dividing or merging patterns to obtain a tree-like representation; Dendograms
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/86 Using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G09B 29/00 Maps; Plans; Charts; Diagrams, e.g. route diagram
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/30244 Camera pose

Definitions

  • The present disclosure relates to, but is not limited to, the field of computer vision, and in particular to a visual positioning method and related apparatus.
  • Positioning is very important in people's daily life. Because the Global Positioning System (GPS) relies on satellite signals, GPS positioning is mostly used outdoors. At present, indoor positioning systems are mainly based on Wi-Fi signals, Bluetooth signals, or Ultra Wide Band (UWB) technology. Wi-Fi-based positioning requires many wireless access points (APs) to be arranged in advance.
  • Vision-based positioning technology uses visual information (images or videos) collected by image or video capture devices such as mobile phones for positioning.
  • the embodiments of the present disclosure provide a visual positioning method and related devices.
  • In a first aspect, an embodiment of the present disclosure provides a visual positioning method. The method includes: determining a first candidate image sequence from an image library, where the image library is used to construct an electronic map, the frames in the first candidate image sequence are arranged in order of their degree of matching with a first image, and the first image is an image collected by a camera; adjusting the order of the frames in the first candidate image sequence according to a target window to obtain a second candidate image sequence, where the target window is a continuous multi-frame segment of the image library that contains a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image; and determining, according to the second candidate image sequence, the target pose of the camera when the first image was collected.
  • The embodiments of the present disclosure utilize the temporal continuity of image frames to effectively improve the positioning speed for consecutive frames.
  • In some embodiments, determining the target pose of the camera when the first image was collected according to the second candidate image sequence includes: determining a first pose of the camera according to a first image sequence and the first image, where the first image sequence includes consecutive frames adjacent to a first reference frame image in the image library and the first reference frame image is included in the second candidate image sequence; and, if it is determined that the position of the camera is successfully located according to the first pose, taking the first pose as the target pose.
  • In some embodiments, the method further includes: when it is determined that the position of the camera is not successfully located according to the first pose, determining a second pose of the camera according to a second image sequence and the first image, where the second image sequence includes consecutive frames adjacent to a second reference frame image in the image library and the second reference frame image is the next or previous frame of the first reference frame image in the second candidate image sequence; and, if it is determined that the position of the camera is successfully located according to the second pose, taking the second pose as the target pose.
  • In some embodiments, determining the first pose of the camera according to the first image sequence and the first image includes: determining, from the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in a point cloud map, and the intrinsic parameters of the camera. The point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene in which the camera collected the first image.
  • In some embodiments, adjusting the order of the frames in the first candidate image sequence according to the target window to obtain the second candidate image sequence includes: when the frames in the first candidate image sequence are arranged from low to high matching degree with the first image, moving the images that fall within the target window to the last positions of the first candidate image sequence; and when the frames are arranged from high to low matching degree with the first image, moving the images that fall within the target window to the foremost positions of the first candidate image sequence.
  • In some embodiments, determining the first candidate image sequence from the image library includes: determining the multiple candidate images whose visual word vectors in the image library have the highest similarity to the visual word vector corresponding to the first image. Any image in the image library corresponds to a visual word vector, and the images in the image library are used to construct an electronic map of the scene to be positioned when the target device collects the first image.
  • Determining the multiple candidate images whose visual word vectors in the image library have the highest similarity to the visual word vector corresponding to the first image includes: determining the images in the image library that share at least one visual word with the first image, to obtain multiple primary images (any image in the image library corresponds to at least one visual word, and the first image corresponds to at least one visual word); and selecting, from the multiple primary images, the multiple candidate images whose visual word vectors have the highest similarity to the visual word vector of the first image.
  • In some embodiments, determining, from the multiple primary images, the candidate images whose visual word vectors have the highest similarity to the visual word vector of the first image includes: taking the top Q percent of the primary images with the highest similarity between their visual word vectors and the visual word vector of the first image to obtain the multiple candidate images; Q is a real number greater than 0.
  • In some embodiments, determining, from the multiple primary images, the candidate images whose visual word vectors have the highest similarity to the visual word vector of the first image includes: converting the features extracted from the first image into a target word vector using a vocabulary tree, where the vocabulary tree is obtained by clustering features extracted from training images collected in the scene to be positioned, and the visual word vector of any primary image is obtained from the features extracted from that image using the vocabulary tree; and determining, among the multiple primary images, the candidate images whose visual word vectors have the highest similarity to the target word vector. By converting the features extracted from the first image into a target word vector and computing the similarity between the target word vector and the visual word vector of each primary image, candidate images can be filtered out quickly and accurately.
  • In some embodiments, each leaf node in the vocabulary tree corresponds to a visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes. Converting the features extracted from the first image using the vocabulary tree includes: classifying the features extracted from the first image with the vocabulary tree to obtain intermediate features that are classified into a target leaf node, where the target leaf node is any leaf node in the vocabulary tree and corresponds to a target visual word; and calculating the target weight of the target visual word in the first image according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word. In this way the target word vector can be computed quickly. The target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined by the number of features assigned to the target visual word when the vocabulary tree was generated.
  • In some embodiments, the intermediate features include at least one sub-feature; the target weight is the sum of the weight parameters corresponding to each sub-feature; the weight parameter of a sub-feature is negatively related to its feature distance, where the feature distance is the Hamming distance between the sub-feature and the corresponding cluster center.
  • In some embodiments, performing feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image includes: classifying a third feature extracted from the first image into a leaf node according to the vocabulary tree, where the vocabulary tree is obtained by clustering features extracted from images collected in the scene to be positioned, the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features; matching the third feature against a fourth feature, where the fourth feature is a feature extracted from a target candidate image and the target candidate image is any image in the first candidate image sequence; and thereby obtaining the number of features of the target candidate image that match the first image.
  • In some embodiments, the method further includes: determining the three-dimensional position of the camera according to a conversion matrix and the target pose, where the conversion matrix is obtained by transforming the angle and position of the point cloud map so as to align the contour of the point cloud map with an indoor floor plan.
  • In some embodiments, determining that the first pose successfully locates the position of the camera includes: determining that the positional relationship of L pairs of feature points is consistent with the first pose, where one feature point of each pair is extracted from the first image, the other is extracted from an image in the first image sequence, and L is an integer greater than 1.
  • In some embodiments, before determining the first pose of the camera according to the first image sequence and the first image, the method further includes: constructing the point cloud map from multiple image sequences, where any one of the multiple image sequences is used to construct a sub-point-cloud map of one or more regions, and the point cloud map includes the first electronic map and the second electronic map. The scene to be positioned is divided into multiple regions, and a sub-point-cloud map is constructed for each region.
  • In some embodiments, before the features extracted from the first image are converted into a target word vector using the vocabulary tree, the method further includes obtaining the vocabulary tree by clustering features extracted from training images collected in the scene to be positioned.
  • In some embodiments, the visual positioning method is applied to a server. Before the first candidate image sequence is determined from the image library, the method further includes: receiving the first image from a target device, where the target device is equipped with the camera.
  • The server performs positioning based on the first image received from the target device, which takes full advantage of the server's processing speed and storage space, providing high positioning accuracy and fast positioning.
  • In some embodiments, the method further includes: sending the location information of the camera to the target device.
  • The server sends the location information to the target device so that the target device can display it and the user can accurately know his or her location.
  • the visual positioning method is applied to an electronic device equipped with the camera.
  • In a second aspect, the embodiments of the present disclosure provide another visual positioning method, which may include: collecting a target image through a camera; sending target information to a server, where the target information includes the target image or the feature sequence extracted from the target image, and the intrinsic parameters of the camera; receiving location information, where the location information indicates the position and direction of the camera and is the position, determined by the server according to a second candidate image sequence, of the camera when the target image was collected; and displaying an electronic map that contains the position and direction of the camera.
  • The second candidate image sequence is obtained by the server adjusting the order of the frames in a first candidate image sequence according to a target window. The target window is a continuous multi-frame segment of an image library that contains a target frame image; the image library is used to construct the electronic map; the target frame image is an image in the image library that matches a second image; the second image is an image collected by the camera before the first image; and the frames in the first candidate image sequence are arranged in order of their degree of matching with the first image.
  • The embodiments of the present disclosure further provide a visual positioning device, which includes: a screening unit configured to determine a first candidate image sequence from an image library, where the image library is used to construct an electronic map, the frames in the first candidate image sequence are arranged in order of their degree of matching with a first image, and the first image is an image collected by a camera; the screening unit is further configured to adjust the order of the frames in the first candidate image sequence according to a target window to obtain a second candidate image sequence, where the target window is a continuous multi-frame segment of the image library that contains a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image; and a determining unit configured to determine, according to the second candidate image sequence, the target pose of the camera when the first image was collected.
  • The embodiments of the present disclosure further provide a terminal device, which includes: a camera configured to collect a target image; a sending unit configured to send target information to a server, where the target information includes the target image or the feature sequence extracted from the target image, and the intrinsic parameters of the camera; a receiving unit configured to receive position information, where the position information indicates the position and direction of the camera and is the position, determined by the server according to a second candidate image sequence, of the camera when the target image was collected; and a display unit configured to display an electronic map that contains the position and direction of the camera.
  • The second candidate image sequence is obtained by the server adjusting the order of the frames in a first candidate image sequence according to a target window. The target window is a continuous multi-frame segment of an image library that contains a target frame image; the image library is used to construct the electronic map; the target frame image is an image in the image library that matches a second image; the second image is an image collected by the camera before the first image; and the frames in the first candidate image sequence are arranged in order of their degree of matching with the first image.
  • An embodiment of the present disclosure provides an electronic device. The electronic device includes: a memory configured to store a program, and a processor configured to execute the program stored in the memory; when the program is executed, the processor is configured to perform the method of any one of the foregoing first and second aspects and any implementation thereof.
  • An embodiment of the present disclosure provides a visual positioning system, including a server and a terminal device; the server executes the method of the foregoing first aspect and any one of its implementations, and the terminal device executes the method of the foregoing second aspect.
  • An embodiment of the present disclosure provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of any one of the foregoing first and second aspects and any implementation thereof.
  • The embodiments of the present disclosure provide a computer program product. The computer program product includes program instructions; when the program instructions are executed by a processor, the processor executes the visual positioning method of any of the foregoing embodiments.
  • FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the disclosure
  • Figure 2 is a visual positioning method provided by an embodiment of the present disclosure
  • FIG. 3 is another visual positioning method provided by an embodiment of the disclosure.
  • FIG. 4 is another visual positioning method provided by an embodiment of the present disclosure.
  • FIG. 5 is a positioning and navigation method provided by an embodiment of the disclosure.
  • FIG. 6 is a method for constructing a point cloud map provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a visual positioning device provided by an embodiment of the disclosure.
  • FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the disclosure.
  • FIG. 9 is a schematic structural diagram of another terminal provided by an embodiment of the disclosure.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the disclosure.
  • the positioning method based on non-visual information usually needs to arrange devices in the scene to be positioned in advance, and the positioning accuracy is not high.
  • the positioning method based on visual information is the main direction of current research.
  • the visual positioning method provided by the embodiments of the present disclosure can be applied to scenarios such as location recognition and positioning navigation.
  • the application of the visual positioning method provided by the embodiments of the present disclosure in the location recognition scene and the positioning navigation scene will be briefly introduced below.
  • Location recognition scene: for example, in a large shopping mall (that is, the scene to be positioned), the mall can be divided into areas, and structure from motion (SFM) technology can be used to construct a point cloud map of the mall for each area.
  • When a user wants to know his or her current location, the user can start the target application on the mobile phone. The mobile phone uses the camera to collect surrounding images, displays an electronic map on the screen, and marks the user's current location and direction on the electronic map. The target application is an application specially developed to achieve accurate indoor positioning.
  • Positioning and navigation scene: for example, in a large shopping mall, the mall can be divided into areas, and SFM and other technologies can be used to build a point cloud map of the mall for each area.
  • The user starts the target application on the mobile phone and enters the destination address; the user raises the mobile phone to collect images of the scene ahead, and the mobile phone displays the collected images in real time together with a mark, such as an arrow, guiding the user to the destination address. The target application is an application specially developed to achieve accurate indoor positioning. Since the computing capability of the mobile phone is limited, the computation can be placed in the cloud, that is, the positioning operation is performed in the cloud. Since shopping malls often change, the point cloud map can be rebuilt only for the changed area instead of rebuilding the map of the entire mall.
  • Feature points of an image can be simply understood as relatively prominent points in the image, such as contour points, bright spots in darker areas, and dark spots in brighter areas. Feature detection is based on the gray values of the image around a candidate point: the pixel values on a circle around the candidate point are examined, and if enough pixels in the surrounding area differ sufficiently in gray value from the candidate point, the candidate point is considered a feature point. After the feature points are obtained, their attributes need to be described in some way; the output of these attributes is called the feature descriptor.
  • ORB algorithm is a fast feature point extraction and description algorithm. The ORB algorithm uses the FAST (Features from Accelerated Segment Test) algorithm to detect feature points.
  • the FAST algorithm is an algorithm for corner detection.
  • the principle of the algorithm is to take a detection point in an image, and use the point as the center of the circle to determine whether the detection point is a corner point.
  • the ORB algorithm uses the BRIEF algorithm to calculate the descriptor of a feature point.
  • the core idea of the BRIEF algorithm is to select N point pairs in a certain pattern around the key point P, and combine the comparison results of these N point pairs as a descriptor.
  • The biggest advantage of the ORB algorithm is its fast computation speed. This benefits firstly from the use of FAST to detect feature points, whose detection speed is as famous as its name, and secondly from the use of the BRIEF algorithm to compute descriptors. The binary string representation of the descriptor not only saves storage space but also greatly shortens the matching time.
  • For example, the descriptors of feature points A and B are as follows: A: 10101011; B: 10101010.
  • We set a threshold, such as 80%. If the similarity between the descriptors of A and B is greater than 80%, we judge that A and B are the same feature point, that is, the two points match successfully. In this example, A and B differ only in the last bit, so their similarity is 87.5%, which is greater than 80%; therefore A and B match.
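  • As a purely illustrative sketch (not code from the patent), the following Python snippet reproduces the matching rule above: it computes the bit-level similarity of two 8-bit binary descriptors and applies the 80% threshold from the example.

```python
# Illustrative sketch: Hamming-similarity matching of two binary descriptors.
import numpy as np

def hamming_similarity(desc_a: np.ndarray, desc_b: np.ndarray) -> float:
    """Fraction of identical bits between two binary descriptors of equal length."""
    assert desc_a.shape == desc_b.shape
    same_bits = np.count_nonzero(desc_a == desc_b)
    return same_bits / desc_a.size

# Descriptors A and B from the example above (8 bits each).
A = np.array([1, 0, 1, 0, 1, 0, 1, 1], dtype=np.uint8)
B = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=np.uint8)

THRESHOLD = 0.8
sim = hamming_similarity(A, B)   # 7/8 = 0.875
print(sim, sim > THRESHOLD)      # 0.875 True -> A and B are considered a match
```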
  • The Structure From Motion (SFM) algorithm is an offline algorithm for 3D reconstruction from a collection of unordered pictures. Before the core Structure From Motion step, some preparation is needed to select suitable pictures: first extract the focal length information from each picture, then use a feature extraction algorithm such as SIFT (Scale-Invariant Feature Transform) to extract image features, and use a kd-tree model to compute the Euclidean distance between the feature points of two pictures for feature point matching, so as to find image pairs whose number of matching feature points meets the requirements.
  • A kd-tree is developed from the binary search tree (BST) and is a high-dimensional index tree data structure, mainly used for nearest neighbor search and approximate nearest neighbor search.
  • For each matching image pair, the epipolar geometry is computed, the fundamental matrix (F matrix) is estimated, and the matching pairs are optimized and refined with the RANSAC algorithm. If feature points can be chained across such matching pairs and are detected consistently, a track can be formed. The key first step is to select a good image pair to initialize the whole bundle adjustment (BA) process.
  • Random sample consensus (RANSAC) uses an iterative method to estimate the parameters of a mathematical model from a set of observed data containing outliers.
  • the basic assumption of the RANSAC algorithm is that the sample contains correct data (inliers, data that can be described by the model) and abnormal data (outliers, data that deviates far from the normal range and cannot adapt to the mathematical model), that is, the data set contains noise. These abnormal data may be caused by wrong measurements, wrong assumptions, wrong calculations, etc.
  • the input of the RANSAC algorithm is a set of observation data, a parameterized model that can be explained or adapted to the observation data, and some credible parameters. RANSAC achieves its goal by repeatedly selecting a set of random subsets of the data.
  • The selected subset is assumed to consist of inliers, which is verified as follows: 1. A model is fitted to the assumed inliers, that is, all unknown parameters are computed from the assumed inliers. 2. The model obtained in step 1 is used to test all other data; if a point fits the estimated model, it is also considered an inlier. 3. If enough points are classified as inliers, the estimated model is considered reasonable. 4. The model is then re-estimated from all the assumed inliers, because it was initially estimated only from the first small set. 5. Finally, the model is evaluated by the inlier rate and the model error. This process is repeated a fixed number of times; the model produced in each iteration is either discarded because it has too few inliers or selected because it is better than the existing model.
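  • As a hedged illustration of the RANSAC steps listed above (not the patent's implementation), the following Python sketch fits a 2D line to points containing outliers; the thresholds and iteration count are arbitrary example values.

```python
# Minimal RANSAC sketch: fit a 2D line y = a*x + b to points with outliers.
import numpy as np

def ransac_line(points: np.ndarray, iters: int = 100, thresh: float = 1.0,
                min_inliers: int = 10, rng=np.random.default_rng(0)):
    best_model, best_inliers = None, 0
    for _ in range(iters):
        # 1. Randomly pick a minimal sample (2 points) and fit a candidate model.
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Test all other points against the candidate model.
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = np.count_nonzero(residuals < thresh)
        # 3./5. Keep the model if it has enough inliers and beats the best so far.
        if inliers >= min_inliers and inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    # 4. A refinement step (least squares over all inliers) could follow here.
    return best_model, best_inliers
```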
  • A vocabulary tree is an efficient data structure for retrieving images based on visual words.
  • a tree structure allows keyword queries in sub-linear time instead of scanning all keywords to find matching images, which can greatly increase the retrieval speed.
  • The steps of building a vocabulary tree are as follows: 1. Extract the ORB features of all training images, about 3000 features per image; the training images are collected from the scene to be positioned. 2. Use k-means to cluster all the extracted features into K classes, then cluster each class into K classes in the same way, and so on until layer L, keeping every cluster center at every layer; this finally produces the vocabulary tree. Both K and L are integers greater than 1; for example, K is 10 and L is 6.
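  • The following Python sketch illustrates this hierarchical k-means construction under simplifying assumptions: it uses scikit-learn's KMeans on float vectors rather than binary ORB descriptors, and small K and L so the demo runs quickly; it is not the patent's implementation.

```python
# Hedged sketch: hierarchical k-means with branching factor K and depth L.
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(features, K=10, L=6, depth=0):
    """Return a nested dict; nodes at depth L (or with too few features) are leaves."""
    node = {"centers": None, "children": []}
    if depth >= L or len(features) < K:
        return node  # leaf node: corresponds to one visual word
    kmeans = KMeans(n_clusters=K, n_init=3, random_state=0).fit(features)
    node["centers"] = kmeans.cluster_centers_
    for k in range(K):
        node["children"].append(
            build_vocab_tree(features[kmeans.labels_ == k], K, L, depth + 1))
    return node

# Example: ORB extracts ~3000 descriptors per training image; random float
# vectors stand in for the descriptors here, and K and L are kept small.
train_features = np.random.rand(5000, 32).astype(np.float32)
tree = build_vocab_tree(train_features, K=8, L=2)
```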
  • FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the disclosure. As shown in Figure 1, the vocabulary tree includes a total of (L+1) layers, the first layer includes a root node, and the last layer includes multiple leaf nodes.
  • FIG. 2 is a visual positioning method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method may include:
  • The visual positioning device determines a first candidate image sequence from an image library.
  • the visual positioning device can be a server, or a mobile terminal that can collect images, such as a mobile phone or a tablet computer.
  • This image library is used to construct electronic maps.
  • the first candidate image sequence includes M images, and each frame image in the first candidate image sequence is arranged in the order of matching degree with the first image.
  • the first image is an image collected by the camera of the target device, and M is an integer greater than 1. For example, M is 5, 6, or 8, etc.
  • the target device can be a device that can collect images and/or videos, such as a mobile phone or a tablet.
  • In this way, multiple candidate images are first selected by calculating the similarity of visual word vectors, and then the M images with the largest number of feature matches with the first image are selected from these candidate images, so image retrieval is efficient.
  • In one arrangement, the first frame in the first candidate image sequence has the largest number of feature matches with the first image and the last frame has the smallest. In another arrangement, the first frame has the smallest number of feature matches with the first image and the last frame has the largest.
  • When the visual positioning device is a server, the first image is an image received from a mobile terminal such as a mobile phone, and may be an image collected by the mobile terminal in the scene to be positioned. When the visual positioning device is a mobile terminal capable of collecting images, such as a mobile phone or a tablet computer, the first image is an image collected by the visual positioning device in the scene to be positioned.
  • The target window is a continuous multi-frame segment of the image library that contains a target frame image; the target frame image is an image in the image library that matches the second image, and the second image is an image captured by the camera before the first image.
  • When the frames in the first candidate image sequence are arranged from low to high matching degree with the first image, the images that fall within the target window are moved to the last positions of the first candidate image sequence; when the frames are arranged from high to low matching degree with the first image, the images that fall within the target window are moved to the foremost positions of the first candidate image sequence.
  • the visual positioning device may store or be associated with an image library, and the images in the image library are used to construct a point cloud map of the scene to be positioned.
  • the image library includes one or more image sequences, and each image sequence includes consecutive multiple frames of images obtained by collecting an area of the scene to be located, and each image sequence can be used to construct a sub-point cloud map, That is, a point cloud map of an area. These sub-point cloud maps constitute the point cloud map.
  • the images in the image library may be continuous.
  • the scene to be positioned can be divided into regions, and a multi-angle image sequence is collected for each region, and each region requires at least two image sequences in the front and back directions.
  • the target window may be an image sequence including the target frame image, or may be a part of the image sequence including the target frame image.
  • For example, the target window includes 61 frames: the target frame image, the 30 frames before it, and the 30 frames after it.
  • The size of the target window is not limited in the embodiments of the present disclosure. Assuming that the images in the first candidate image sequence are, in order, image 1, image 2, image 3, image 4, and image 5, and that image 3 and image 5 fall within the target window, the images in the second candidate image sequence are, in order, image 3, image 5, image 1, image 2, and image 4. It can be understood that the method flow in FIG. 2 implements continuous-frame positioning, while performing step 201, step 203, step 204, and step 205 achieves single-frame positioning.
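  • The reordering in this example can be illustrated with the following hedged Python sketch (the function and variable names are illustrative, not taken from the patent):

```python
# Candidates are ordered by matching degree (high to low); frames inside the
# target window are moved to the front while keeping their relative order.
def reorder_candidates(first_candidates, window_frame_ids):
    in_window = [f for f in first_candidates if f in window_frame_ids]
    outside = [f for f in first_candidates if f not in window_frame_ids]
    return in_window + outside  # second candidate image sequence

first_sequence = [1, 2, 3, 4, 5]   # image 1 ... image 5
target_window = {3, 5}             # frames that fall within the target window
print(reorder_candidates(first_sequence, target_window))  # [3, 5, 1, 2, 4]
```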
  • the target pose here may include at least the position of the camera when the first image is captured; in other embodiments, the target pose may include: the position and pose of the camera when the first image is captured.
  • the pose of the camera includes but is not limited to the orientation of the camera.
  • In some embodiments, determining the target pose of the camera when the first image was collected according to the second candidate image sequence is implemented as follows: a first pose of the camera is determined according to a first image sequence and the first image, where the first image sequence includes consecutive frames adjacent to a first reference frame image in the image library and the first reference frame image is included in the second candidate image sequence.
  • If the position of the camera is successfully located according to the first pose, the first pose is taken as the target pose.
  • Otherwise, a second pose of the camera is determined according to a second image sequence and the first image. The second image sequence includes consecutive frames adjacent to a second reference frame image in the image library, and the second reference frame image is the next or previous frame of the first reference frame image in the second candidate image sequence.
  • For example, the first image sequence includes the K1 frames before the first reference frame image, the first reference frame image itself, and the K1 frames after it; K1 is an integer greater than 1, for example, K1 is 10.
  • Determining the first pose of the camera according to the first image sequence and the first image may be: determining, from the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the intrinsic parameters of the camera.
  • The point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene where the camera (that is, the target device) is located when the first image is collected.
  • the visual positioning device may use the PnP algorithm to determine the first pose of the camera according to the F features, the corresponding spatial coordinate points of the F features in the point cloud map, and the camera's internal parameters.
  • Each of the F features corresponds to a feature point in the image, that is, to a 2D reference point (the two-dimensional coordinates of the feature point in the image). The spatial coordinate point corresponding to each 2D reference point can be determined, so the one-to-one correspondence between 2D reference points and spatial coordinate points is known; since each 2D reference point matches one spatial coordinate point, the spatial coordinate point corresponding to each feature is known.
  • the visual positioning device may also use other methods to determine the spatial coordinate points corresponding to each feature in the point cloud map, which is not limited in the present disclosure.
  • The spatial coordinate points corresponding to the F features in the point cloud map are F 3D reference points (spatial coordinate points) in the world coordinate system.
  • Perspective-n-Point (PnP) is a method for solving the motion of 3D-to-2D point pairs: that is, solving for the pose of the camera given F 3D space points and their projections.
  • The inputs are: the coordinates of the F 3D reference points in the world coordinate system, where F is an integer greater than 0; the coordinates of the 2D reference points obtained by projecting these F 3D points onto the image; and the intrinsic parameters of the camera.
  • Solving the PnP problem can get the pose of the camera (or camera).
  • There are typical solutions to the PnP problem, such as P3P, direct linear transformation (DLT), EPnP (Efficient PnP), UPnP, and nonlinear optimization methods. The visual positioning device can adopt any method for solving the PnP problem to determine the pose of the camera according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the intrinsic parameters of the camera.
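  • As a hedged example (the patent does not prescribe a particular solver or library), the pose can be recovered from the F 2D-3D correspondences with OpenCV's RANSAC-based PnP solver; the intrinsic matrix and point data below are placeholders.

```python
# Hedged sketch: camera pose from 2D-3D correspondences via RANSAC PnP.
import cv2
import numpy as np

object_points = np.random.rand(50, 3).astype(np.float32)  # F 3D points (world frame)
image_points = np.random.rand(50, 2).astype(np.float32)   # their 2D projections
K = np.array([[800.0, 0.0, 320.0],                         # placeholder intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)                                   # assume no distortion

ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             K, dist_coeffs)
if ok and inliers is not None:
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix R and translation vector t
    print("R =", R, "t =", tvec.ravel(), "inliers:", len(inliers))
```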
  • The RANSAC algorithm can be used to iterate here, counting the number of inliers in each iteration.
  • R is the rotation matrix and t is the translation vector; together they are the two sets of parameters that make up the pose of the camera.
  • The camera here refers generally to a camera or any other image or video capture device.
  • The embodiments of the present disclosure provide a continuous-frame positioning method, which uses the image that located the pose of the camera for a frame before the first image to adjust the ordering of the images in the first candidate image sequence. By taking advantage of the temporal continuity of the images, the image most likely to match the first image is ranked at the front of the first candidate image sequence, so that an image matching the first image can be found more quickly.
  • the visual positioning device may also perform the following operations to determine the three-dimensional position of the camera: determine the three-dimensional position of the camera according to the conversion matrix and the target pose of the camera.
  • the conversion matrix is obtained by transforming the angle and position of the point cloud map, and aligning the outline of the point cloud map with the indoor floor plan.
  • The rotation matrix R and the translation vector t are combined into a 4x4 matrix T' = [[R, t], [0, 1]].
  • The conversion matrix T_i is multiplied with the matrix T' to obtain a new matrix T = T_i * T'; writing T in the same form, its translation part t* is the final three-dimensional position of the camera.
  • the three-dimensional position of the camera can be accurately determined, which is simple to implement.
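  • A minimal Python sketch of this transformation, following the description above (T_i is assumed to be the point-cloud-to-floor-plan conversion matrix; this is not code from the patent):

```python
# Pack the pose (R, t) into a 4x4 matrix, apply the conversion matrix T_i,
# and take the translation part t* of the result as the final 3D position.
import numpy as np

def camera_position_on_floor_plan(R: np.ndarray, t: np.ndarray,
                                  T_i: np.ndarray) -> np.ndarray:
    T_prime = np.eye(4)
    T_prime[:3, :3] = R      # rotation matrix
    T_prime[:3, 3] = t       # translation vector
    T = T_i @ T_prime        # align the pose with the indoor floor plan
    return T[:3, 3]          # t*: final three-dimensional position of the camera
```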
  • The embodiments of the present disclosure provide a continuous-frame positioning method, which uses the image that located the pose of the camera for a frame before the first image to adjust the order of the images in the first candidate image sequence. This makes full use of the temporal continuity of the images: the image most likely to match the first image is ranked at the front of the first candidate image sequence, so a matching image can be found more quickly and positioning is faster.
  • The situation in which the position of the camera is successfully located according to the first pose may be: it is determined that the positional relationship of L pairs of feature points is consistent with the first pose, where one feature point of each pair is extracted from the first image, the other is extracted from an image in the first image sequence, and L is an integer greater than 1. Specifically, the RANSAC algorithm can be used to iteratively solve PnP according to the first pose, counting the number of inliers in each iteration.
  • the visual positioning device fails to locate the position of the camera by using a certain frame of image in the second candidate image sequence, it uses the next frame of image in the second candidate image sequence for positioning.
  • the embodiment of the present disclosure provides a continuous frame positioning method. After the position of the camera is successfully located using the first image, the next frame image of the first image collected by the camera is used for positioning.
  • the visual positioning device can use each frame image in sequence to locate the position of the camera according to the sequence of each frame image in the second candidate sequence until the position of the camera is located. If the position of the camera cannot be successfully located using each frame of the second candidate image sequence, then the positioning failure is returned. For example, the visual positioning device first uses the first frame image in the second candidate image sequence for positioning, if the positioning is successful, it stops this positioning; if the positioning is unsuccessful, it uses the second candidate image sequence. Position the second frame of image; and so on. The method of using the image sequence and the first image sequence for different times to locate the target pose of the camera may be the same.
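  • A hedged Python sketch of this retry loop (locate_with_reference is a hypothetical helper standing in for the pose-estimation and verification step, not a function defined by the patent):

```python
# Try each reference frame in the second candidate sequence until positioning
# succeeds; return None if every candidate fails.
def locate_camera(second_candidates, first_image, locate_with_reference):
    for reference_frame in second_candidates:
        pose, success = locate_with_reference(reference_frame, first_image)
        if success:
            return pose   # target pose found
    return None           # positioning failure
```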
  • The following describes how to determine the first candidate image sequence from the image library, that is, the implementation of step 201.
  • In some embodiments, the first candidate image sequence may be determined from the image library as follows: use a vocabulary tree to convert the features extracted from the first image into a target word vector; calculate the similarity score between the target word vector and the word vector corresponding to each image in the image library; take the 10 frames with the highest similarity score to the first image from each image sequence included in the image library to obtain a primary image sequence; after sorting the images in the primary image sequence by similarity score from high to low, take the top 20% of the images as the selected image sequence (if fewer than 10 frames remain, take the first 10 frames directly); perform feature matching between each frame in the selected image sequence and the first image; and, after sorting by the number of feature matches with the first image, select the first M images to obtain the first candidate image sequence.
  • In some embodiments, the first candidate image sequence may be determined from the image library as follows: determine the multiple candidate images whose visual word vectors in the image library have the highest similarity (similarity score) to the visual word vector of the first image; perform feature matching between these candidate images and the first image to obtain the number of features of each candidate image that match the first image; and take the M candidate images with the most feature matches with the first image to obtain the first candidate image sequence.
  • M is 5.
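  • A hedged Python sketch of this two-stage retrieval (the similarity and matching functions are simplified stand-ins, and the 20% cut and M=5 are example values consistent with the text above):

```python
# Stage 1: rank library images by word-vector similarity; Stage 2: keep the
# M candidates with the most feature matches with the query image.
import numpy as np

def word_vector_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))

def first_candidate_sequence(query_vec, query_features, library, M=5, top_fraction=0.2):
    """library: list of (word_vec, feature_set) pairs, one per library image."""
    scored = sorted(library,
                    key=lambda img: word_vector_similarity(query_vec, img[0]),
                    reverse=True)
    candidates = scored[:max(1, int(len(scored) * top_fraction))]   # e.g. top 20%
    by_matches = sorted(candidates,
                        key=lambda img: len(query_features & img[1]),  # shared descriptors
                        reverse=True)
    return by_matches[:M]
```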
  • Any image in the image library corresponds to a visual word vector, and the images in the image library are used to construct an electronic map of the scene to be located when the target device collects the first image.
  • Determining the multiple candidate images whose visual word vectors in the image library have the highest similarity to the visual word vector of the first image may be: determining the images in the image library that share at least one visual word with the first image, to obtain multiple primary images; and determining the top Q percent of the primary images with the highest similarity between their visual word vectors and the visual word vector of the first image, to obtain the multiple candidate images. Q is a real number greater than 0, for example 10, 15, 20, or 30. Any image in the image library corresponds to at least one visual word, and the first image corresponds to at least one visual word.
  • In some embodiments, the visual positioning device obtains the multiple candidate images in the following manner: use a vocabulary tree to convert the features extracted from the first image into a target word vector; calculate the similarity between the target word vector and the visual word vector corresponding to each primary image; and determine the top Q percent of the primary images with the highest similarity between their visual word vectors and the target word vector, to obtain the multiple candidate images.
  • the vocabulary tree is obtained by clustering the features extracted from the training images collected from the scene to be located.
  • the visual word vector corresponding to any one of the plurality of primary images is a visual word vector obtained from the feature extracted from any one of the primary images using the vocabulary tree.
  • Performing feature matching between the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image may be: classifying a third feature extracted from the first image to a reference leaf node according to the vocabulary tree, and matching the third feature against a fourth feature to obtain the features that match the third feature. The vocabulary tree is obtained by clustering the features extracted from images collected in the scene to be positioned; the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features. The fourth feature is contained in the reference leaf node and is a feature extracted from a target candidate image, and the target candidate image is included in the first candidate image sequence.
  • The visual positioning device may pre-store an image index and a feature index for each visual word (that is, for each leaf node). A corresponding image index and feature index are added to each visual word, and these indexes are used to accelerate feature matching, as sketched below. For example, if 100 images in the image library correspond to a certain visual word, the indexes of these 100 images (the image index) and the indexes of the features of these 100 images that fall on the leaf node corresponding to the visual word (the feature index) are added to the visual word. In the example above, the third feature extracted from the first image falls on the reference leaf node.
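  • A hedged Python sketch of such an inverted index (names and structure are illustrative only, not taken from the patent):

```python
# Each visual word (leaf node) keeps an image index and a feature index, so a
# query feature is only compared with library features on the same leaf node.
from collections import defaultdict

image_index = defaultdict(set)     # visual word id -> set of image ids
feature_index = defaultdict(list)  # visual word id -> list of (image id, descriptor)

def add_library_feature(word_id, image_id, descriptor):
    image_index[word_id].add(image_id)
    feature_index[word_id].append((image_id, descriptor))

def match_query_feature(word_id, query_descriptor, max_distance, hamming):
    """Only search the leaf node that the query feature falls on."""
    return [(img, desc) for img, desc in feature_index[word_id]
            if hamming(query_descriptor, desc) <= max_distance]
```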
  • the following describes how to use the vocabulary tree to convert the features extracted from the first image into the target word vector.
  • Using the vocabulary tree to convert the features extracted from the first image into the target word vector includes: calculating, from the features extracted from the first image, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target weight of the target visual word in the first image. The target word vector includes the weight in the first image of each visual word of the vocabulary tree, and the target weight is positively related to the weight of the target visual word.
  • In the embodiments of the present disclosure, the word vector is calculated by a residual weighting method. Taking into account the differences between the features assigned to the same visual word increases discrimination, and the method fits easily into the TF-IDF (term frequency-inverse document frequency) framework, which can improve the speed of image retrieval and feature matching.
  • the following definitions apply when a vocabulary tree is used to convert the features extracted from the first image into a target word vector: W_i represents the weight of the i-th visual word in the first image; W_i^weight is the weight of the i-th visual word itself; Dis(f_j, c_i) is the Hamming distance from the feature f_j to the cluster center c_i of the i-th visual word; n represents the number of features extracted from the first image that fall on the node corresponding to the i-th visual word. In formula (1), each of these n features contributes a weight parameter that is positively related to W_i^weight and negatively related to its residual distance, for example W_i = Σ_{j=1..n} W_i^weight · exp(−Dis(f_j, c_i)).
  • a leaf node in the vocabulary tree corresponds to a visual word
  • the target word vector includes the weight of each visual word corresponding to the vocabulary tree in the first image.
  • a node of the vocabulary tree corresponds to a cluster center.
  • the vocabulary tree includes 1000 leaf nodes, and each leaf node corresponds to a visual word.
  • the visual positioning device needs to calculate the weight of each visual word in the first image to obtain the target word vector of the first image.
  • the visual positioning device may calculate the weight, in the first image, of the visual word corresponding to each leaf node in the vocabulary tree, and combine the weights of the visual words corresponding to the leaf nodes in the first image into a vector to obtain the target word vector.
  • the word vector corresponding to each image in the image library can be calculated in the same manner to obtain the visual word vector corresponding to each primary selected image. Both i and n are integers greater than 1.
  • the feature f i is any feature extracted from the first image, and any feature corresponds to a binary string, that is, f i is a binary string.
  • the center of each visual word corresponds to a binary string.
  • c i is a binary string.
  • the Hamming distance may be calculated from the feature f_i to the cluster center c_i of the i-th visual word.
  • the Hamming distance indicates the number of different bits corresponding to two (same length) words. In other words, it is the number of characters that need to be replaced to transform one string into another. For example: The Hamming distance between 1011101 and 1001001 is 2.
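  • for binary descriptors packed into integers, the Hamming distance is simply the popcount of their XOR; a small sketch:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two equal-length binary strings,
    here packed into Python integers."""
    return bin(a ^ b).count("1")

# 1011101 and 1001001 differ in two positions
assert hamming_distance(0b1011101, 0b1001001) == 2
```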
  • the weight of each visual word in the vocabulary tree is negatively related to the number of features included in its corresponding node.
  • an index of the corresponding image is added to the i-th visual word, and the index is used to speed up image retrieval.
  • the calculation of the target weight corresponding to the target visual word in the first image based on the features extracted from the first image, the weight of the target visual word, and the cluster center corresponding to the target visual word includes: using vocabulary The tree classifies the features extracted from the first image to obtain intermediate features classified into the target leaf node; according to the intermediate feature, the weight of the target visual word, and the cluster center corresponding to the target visual word, the target visual word is calculated The target weight corresponding to the first image.
  • the target leaf node corresponds to the target visual word. It can be seen from formula (1) that the target weight is the sum of the weight parameters corresponding to the features included in the intermediate feature, where the weight parameter corresponding to a feature f_i is positively related to the weight of the target visual word and negatively related to the Hamming distance Dis(f_i, c_i).
  • the intermediate feature may include a first feature and a second feature; the Hamming distance between the first feature and the cluster center is a first distance, and the Hamming distance between the second feature and the cluster center is a second distance; if If the first distance and the second distance are different, the first weight parameter corresponding to the first feature is different from the second weight parameter corresponding to the second feature.
  • the word vector is calculated by a residual weighting method. This takes into account the differences among features falling in the same visual word, increases their distinguishability, fits easily into the TF-IDF (term frequency-inverse document frequency) framework, and improves the speed of image retrieval and feature matching.
  • FIG. 3 is another visual positioning method provided by an embodiment of the present disclosure, and the method may include:
  • the terminal shoots a target image.
  • the terminal can be a mobile phone and other devices with camera function and/or camera function.
  • the terminal uses the ORB algorithm to extract the ORB feature of the target image.
  • the terminal uses other feature extraction methods to extract features of the target image.
  • the terminal transmits the ORB features extracted from the target image and the internal parameters of the camera to the server.
  • Steps 302 to 303 can be replaced by: the terminal transmits the target image and the internal parameters of the camera to the server.
  • the ORB feature of the image can be extracted by the server, so as to reduce the amount of calculation of the terminal.
  • the user can start the target application on the terminal, and use the camera to collect the target image through the target application and transmit the target image to the server.
  • the internal reference of the camera may be the internal reference of the camera of the terminal.
  • the server converts the ORB feature into an intermediate word vector.
  • the manner in which the server converts the ORB feature into the intermediate word vector is the same as the manner in which the feature extracted from the first image is converted into the target word vector by using the vocabulary tree in the foregoing embodiment, and will not be detailed here.
  • the server determines, according to the intermediate word vector, the first H images in each image sequence that are most similar to the target image, and obtains the similarity scores of these first H images with respect to the target image.
  • Each image sequence is contained in the image library, and each image sequence is used to construct a sub-point cloud map, and these sub-point cloud maps form a point cloud map corresponding to the scene to be located.
  • Step 305 is to query the first H images most similar to the target image in each image sequence in the image library.
  • H is an integer greater than 1, for example, H is 10.
  • Each image sequence may be obtained by collecting one or more regions of the scene to be located.
  • the server calculates the similarity score between each image in each image sequence and the target image according to the intermediate word vector.
  • the similarity score can, for example, be computed from the L1-normalized difference of the two word vectors, s(v1, v2) = 1 − ½·‖v1/‖v1‖ − v2/‖v2‖‖, where s(v1, v2) represents the similarity score of the visual word vector v1 and the visual word vector v2.
  • the visual word vector v1 can be the word vector calculated with formula (1) from the ORB features extracted from the target image; the visual word vector v2 can be the word vector calculated with formula (1) from the ORB features extracted from any image in the image library.
  • the server may store visual word vectors (corresponding to the aforementioned reference word vectors) corresponding to each image in the image library.
  • the visual word vector corresponding to each image is calculated by formula (1) from the features extracted from that image. It can be understood that the server only needs to calculate the visual word vector corresponding to the target image, and does not need to recalculate the visual word vectors corresponding to the images included in each image sequence in the image library.
  • the server only queries images that share a visual word with the intermediate word vector, that is, the similarity is only compared based on the image indexes in the leaf nodes corresponding to the non-zero items in the intermediate word vector. In other words, the images in the image library corresponding to at least one visual word of the target image are determined to obtain multiple primary selected images, and the first H images among the multiple primary selected images that are most similar to the target image are queried according to the intermediate word vector. For example, if the weight corresponding to the i-th visual word in the target image and the weight corresponding to a certain primary selected image are both non-zero, the target image and that primary selected image both correspond to the i-th visual word, as sketched below.
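  • reusing the inverted-index sketch above, restricting the comparison to images that share at least one visual word with the query could look like this (illustrative only):

```python
def primary_images_sharing_a_word(target_vec, index):
    """Collect images sharing at least one visual word with the query:
    only leaf nodes with a non-zero weight in the target word vector are
    consulted, via the image index stored on each visual word."""
    primaries = set()
    for word_id, weight in enumerate(target_vec):
        if weight != 0:
            primaries |= index.image_index[word_id]
    return primaries
```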
  • the server sorts, from high to low, the similarity scores of the first H images in each image sequence with the highest similarity scores with respect to the target image, and takes out the multiple images with the highest similarity scores to the target image as candidate images.
  • the image library includes F image sequences, and the top 20% of the images with the highest similarity score to the target image among (F ⁇ H) images are taken as candidate images.
  • the (F ⁇ H) images include the first H images with the highest similarity score to the target image in each image sequence. If the number of images with the top 20% is less than 10, then the first 10 images are taken directly.
  • Step 306 is an operation of screening candidate images.
  • the server performs feature matching on each of the candidate images with the target image, and determines the top G images with the largest number of feature matching.
  • the features of the target image are first classified one by one to a node in the L-th layer according to the vocabulary tree.
  • the classification method is to start from the root node and, layer by layer, select the cluster center (node in the tree) with the shortest Hamming distance to the current feature; each classified feature is matched only with features that have a feature index in the corresponding node and whose image is a candidate image. This can speed up feature matching, as sketched below.
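  • a minimal sketch of this accelerated matching, reusing the `hamming_distance` helper sketched earlier (the node attributes `children`, `center`, and `feature_index` are assumptions, not prescribed by the disclosure):

```python
def classify_feature(descriptor, tree):
    """Descend the vocabulary tree: at every layer pick the child whose
    cluster center has the smallest Hamming distance to the feature, until
    a leaf node (visual word) is reached."""
    node = tree.root
    while node.children:
        node = min(node.children,
                   key=lambda child: hamming_distance(descriptor, child.center))
    return node

def match_in_leaf(descriptor, leaf, candidate_images, max_dist=64):
    """Match the classified feature only against features indexed in the same
    leaf whose image is one of the candidate images."""
    best, best_dist = None, max_dist
    for image_id, cand in leaf.feature_index:
        if image_id in candidate_images:
            d = hamming_distance(descriptor, cand)
            if d < best_dist:
                best, best_dist = (image_id, cand), d
    return best
```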
  • Step 307 is a process of performing feature matching between each image in the candidate image and the target image. Therefore, step 307 can be regarded as a process of feature matching between two images.
  • the server obtains (2K+1) consecutive images in the reference image sequence.
  • the images in the reference image sequence are sorted according to the sequence of acquisition.
  • the reference image sequence is the image sequence containing any one of the top G images; the (2K+1) images (corresponding to the local point cloud map) include that image, the K images before it, and the K images after it.
  • Step 308 is an operation of determining a local point cloud map.
  • the server determines multiple features that match the features extracted from the target image among the features extracted from the (2K+1) images.
  • step 309 can be regarded as a matching operation between the target image and the local point cloud map, that is, the frame-local point cloud map matching in FIG. 3.
  • the vocabulary tree is first used to classify the features extracted from the (2K+1) images, and the same processing is then performed on the features extracted from the target image; only features of the two parts that fall into the same node are considered for matching, which can speed up feature matching. Here, one of the two parts is the target image, and the other part is the (2K+1) images.
  • the server determines the pose of the camera according to the multiple features, the spatial coordinate points corresponding to the multiple features in the point cloud map, and the internal parameters of the camera.
  • Step 310 is similar to step 203 in FIG. 2 and will not be described in detail here.
  • if the server executes step 310 and fails to determine the pose of the camera, it uses another image in the top G images to perform steps 308 to 310 again until the pose of the camera is successfully determined. For example, the (2K+1) images are first determined based on the first image in the top G images, and these (2K+1) images are used to determine the pose of the camera; if the pose of the camera is not determined successfully, new (2K+1) images are determined based on the second image of the top G images, and these new (2K+1) images are used to determine the pose of the camera; the above operations are repeated until the pose of the camera is successfully determined, as sketched below.
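  • a compact sketch of this retry strategy (the helpers `neighbors`, `match_frame_to_local_map`, and `solve_pnp` are placeholders standing in for the operations of steps 308 to 310):

```python
def locate_with_fallback(target_features, intrinsics, top_g_images, library, k=30):
    """Try the top-G retrieved images in order: build the (2K+1)-frame local
    window around each, match the query against it, and stop at the first
    window for which a pose can be solved."""
    for reference in top_g_images:
        window = library.neighbors(reference, before=k, after=k)  # up to 2K+1 frames
        matches = match_frame_to_local_map(target_features, window)
        pose = solve_pnp(matches, intrinsics)
        if pose is not None:
            return pose
    return None  # all top-G candidates exhausted, positioning failed
```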
  • the server sends the location information of the camera to the terminal when it successfully determines the pose of the camera.
  • the position information may include the three-dimensional position of the camera and the direction of the camera.
  • the server can determine the three-dimensional position of the camera according to the conversion matrix and the pose of the camera, and generate the position information.
  • the server executes step 308 again if it fails to determine the pose of the camera.
  • each time the server executes step 308, it needs to determine consecutive (2K+1) images based on one of the top G images. It should be understood that the consecutive (2K+1) images determined by the server each time step 308 is executed are different.
  • the terminal displays the location of the camera on the electronic map.
  • the terminal displays the location and direction of the camera on the electronic map. It can be understood that the camera is installed on the terminal, and the position of the camera is the position of the terminal. Users can accurately and quickly determine their own location and direction according to the location and direction of the camera.
  • the terminal and the server work together.
  • the terminal collects images and extracts features.
  • the server is responsible for positioning and sending the positioning results (i.e., location information) to the terminal; the user only needs to use the terminal to send an image to the server to determine exactly where he or she is.
  • FIG. 4 is another visual positioning method provided by an embodiment of the present disclosure. As shown in FIG. 4, the method may include:
  • the server obtains continuous multiple frames of images or multiple sets of features collected by the terminal.
  • Each set of features may be features extracted from one frame of image, and the multiple sets of features are in turn features extracted from multiple consecutive frames of images.
  • the consecutive multiple frames of images are sorted according to the sequence of acquisition.
  • the server determines the pose of the camera according to the first frame of image or the feature extracted from the first frame of image.
  • the first frame of image is the first frame of images in the continuous multiple frames of images.
  • Step 402 corresponds to the method of positioning based on a single image in FIG. 3.
  • the server can use the method in FIG. 3 to determine the pose of the camera by using the first frame of image.
  • Using the first frame of continuous images to perform positioning is the same as positioning based on a single image.
  • the first frame positioning in the continuous multi-frame positioning is the same as the single-frame positioning. If the positioning is successful, it will switch to continuous frame positioning; if the positioning fails, it will continue to single-frame positioning.
  • if the server successfully determined the pose of the camera according to the previous frame of image, it determines N consecutive frames of images in the target image sequence.
  • the situation in which the pose of the camera is successfully determined according to the previous frame of image means that the server executed step 402 and successfully determined the pose of the camera.
  • the target image sequence is the image sequence to which the features used to successfully locate the pose of the camera for the previous frame of image belong.
  • the server uses an image in the target image sequence, the K images before it, and the K images after it to perform feature matching with the previous image, and uses the matching feature points to successfully locate the pose of the camera; the server then obtains the thirty images before that image in the target image sequence, that image, and the thirty images after it, that is, N consecutive frames of images.
  • the server determines the pose of the camera according to N consecutive images in the target image sequence.
  • Step 404 corresponds to step 308 to step 310 in FIG. 3.
  • the server determines multiple candidate images in the case that the pose of the camera is not successfully determined according to the previous frame of image.
  • the multiple candidate images are candidate images determined by the server according to the previous frame of image. That is to say, in the case that the pose of the camera is not successfully determined according to the previous frame of image, the server may use the candidate image of the previous frame as the candidate image of the current frame of image. This can reduce the steps of image retrieval and save time.
  • the server determines the pose of the camera according to the candidate image of the previous frame of image.
  • Step 406 corresponds to step 307 to step 310 in FIG. 3.
  • after the server enters continuous frame positioning, it mainly uses the prior knowledge of the successful positioning of the previous frame to infer that the image matching the current frame has a high probability of being near the image that was successfully matched last time. In this way, a window can be opened near the image used in the last successful positioning, and priority is given to the frames of images that fall in the window.
  • the window size can be up to 61 frames, with 30 frames before and after each, and truncated if it is less than 30 frames. If the positioning is successful, the window is passed down; if the positioning is unsuccessful, the positioning is performed according to the candidate image of a single frame.
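  • a rough sketch of this sliding-window logic (the state handling and helper names are assumptions made for illustration):

```python
def locate_next_frame(frame, state, library, k=30):
    """Continuous-frame positioning: if the previous frame was located
    successfully, first try a window of up to 2K+1 images around the
    previously matched reference frame; otherwise fall back to the previous
    frame's candidate images, as in single-frame positioning."""
    if state.last_matched_reference is not None:
        window = library.neighbors(state.last_matched_reference, before=k, after=k)
        pose = locate_against(frame, window)
        if pose is not None:
            state.last_matched_reference = state.best_match_in(window)
            return pose
    # window positioning failed or no prior success: use the previous candidates
    return locate_against(frame, state.previous_candidates)
```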
  • a continuous frame sliding window mechanism is adopted, and sequential information is used to effectively reduce the amount of calculation, and the positioning success rate can be improved.
  • the prior knowledge of the successful positioning of the previous frame may be used to accelerate subsequent positioning operations.
  • FIG. 5 is a positioning and navigation method provided by an embodiment of the present disclosure. As shown in FIG. 5, the method may include:
  • the terminal starts the target application.
  • the target application is an application specially developed to achieve accurate indoor positioning. In actual applications, after the user clicks the icon corresponding to the target application on the screen of the terminal, the target application is started.
  • the terminal receives the destination address input by the user through the target interface.
  • the target interface is the interface displayed on the screen of the terminal after the terminal starts the target application, that is, the interface of the target application.
  • the destination address can be a restaurant, coffee shop, movie theater, etc.
  • the terminal displays the currently collected image, and transmits the collected image or the features extracted from the collected image to the server.
  • after the terminal receives the destination address input by the user, it can collect images of the surrounding environment through the camera (i.e., the camera on the terminal) in real time or near real time, and transmit the collected images to the server at a fixed interval. In some embodiments, the terminal extracts the features of the collected images and transmits the extracted features to the server at fixed intervals.
  • the server determines the pose of the camera according to the received image or feature.
  • Step 504 corresponds to step 401 to step 406 in FIG. 4.
  • the server uses the positioning method in Figure 4 to determine the camera's position and posture according to each frame of image received or the characteristics of each frame of image. It can be understood that the server can sequentially determine the pose of the camera according to the image sequence or feature sequence sent by the terminal, and then determine the position of the camera. In other words, the server can determine the pose of the camera in real time or near real time.
  • the server determines the three-dimensional position of the camera according to the conversion matrix and the pose of the camera.
  • the conversion matrix is obtained by transforming the angle and position of the point cloud map so as to align the outline of the point cloud map with the indoor floor plan. Specifically, the rotation matrix R and the translation vector t are combined into a 4*4 matrix T′ = [R t; 0 1]; the conversion matrix T_i is used to left-multiply the matrix T′ to obtain a new matrix T = T_i·T′; writing T as [R* t*; 0 1], t* is the final three-dimensional position of the camera, as sketched below.
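  • a minimal sketch of this alignment step (assuming `T_align` is the precomputed conversion matrix and `(R, t)` the estimated pose):

```python
import numpy as np

def camera_position_on_floor_plan(R, t, T_align):
    """Combine the estimated pose (R, t) into a 4x4 matrix T', left-multiply
    it by the alignment matrix T_align, and read the translation part t* of
    the result as the camera's three-dimensional position."""
    T_prime = np.eye(4)
    T_prime[:3, :3] = R
    T_prime[:3, 3] = np.asarray(t).reshape(3)
    T = T_align @ T_prime
    return T[:3, 3]  # t*, the final three-dimensional position
```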
  • the server sends location information to the terminal.
  • the position information may include the three-dimensional position of the camera, the direction of the camera, and mark information.
  • the mark information indicates the route the user needs to walk from the current location to the target address.
  • the marking information only indicates the route within the target distance, and the target distance is the farthest distance from the road in the currently displayed image.
  • the target distance may be 10 meters, 20 meters, 50 meters, and so on.
  • when the server successfully determines the pose of the camera, it can determine the three-dimensional position of the camera according to the conversion matrix and the pose of the camera. Before performing step 506, the server may generate the mark information according to the location of the camera, the destination address, and the electronic map.
  • the terminal displays the collected images in real time and displays a mark indicating that the user has reached the destination address.
  • the user starts the target application on the mobile phone and enters the destination address that needs to be reached; the user raises the mobile phone toward the front to collect images, and the mobile phone displays the collected images in real time and displays a mark, such as an arrow, indicating how the user can reach the destination address.
  • the server can accurately locate the location of the camera and provide navigation information to the user, and the user can quickly reach the target address according to the guidance.
  • Fig. 6 is a method for constructing a point cloud map provided by an embodiment of the disclosure. As shown in Figure 6, the method may include:
  • the server obtains multiple video sequences.
  • the user can divide the scene to be positioned into areas and collect a multi-angle video sequence for each area; each area needs video sequences in at least the forward and reverse directions.
  • the multiple video sequences are video sequences obtained by shooting each area in the scene to be positioned from multiple angles.
  • the server extracts images for each of the multiple video sequences according to the target frame rate to obtain multiple image sequences.
  • the server extracts a video sequence according to the target frame rate to obtain an image sequence.
  • the target frame rate may be 30 frames/sec.
  • Each image sequence is used to construct a sub-point cloud map.
  • the server uses each image sequence to construct a point cloud map.
  • the server may use the SFM algorithm to construct a sub-point cloud map using each image sequence, and all the sub-point cloud maps form the point cloud map.
  • the scene to be positioned is divided into multiple regions, and the sub-point cloud map is constructed in each region.
  • the sub-point cloud map is constructed in each region.
  • the multiple image sequences can be stored in the image database, and the vocabulary tree is used to determine the visual word vector corresponding to each image in the multiple image sequences .
  • the server may store the visual word vector corresponding to each image in the multiple image sequences.
  • the index of the corresponding image is added to each visual word included in the vocabulary tree. For example, if the weight of a certain visual word in the vocabulary tree corresponding to a certain image in the image library is not 0, then the index of the image is added to the visual word.
  • the server adds an index and a feature index of the corresponding image to each visual word included in the vocabulary tree.
  • the server can use the vocabulary tree to classify each feature of each image into leaf nodes, and each leaf node corresponds to a visual word. For example, if, among the features extracted from the images in each image sequence, 100 features fall on a certain leaf node, then the feature index of these 100 features is added to the visual word corresponding to that leaf node; the feature index indicates the 100 features.
  • the following provides a specific example of locating the target pose of the camera based on the image sequence and the first image, which may include: determining, based on the image database, a sub-point cloud map established from the first image sequence, where the sub-point cloud map includes 3D coordinates and 3D descriptors corresponding to the 3D coordinates; determining the 2D coordinates of the first image and the 2D descriptors corresponding to the 2D coordinates; matching the 2D coordinates and 2D descriptors with the 3D coordinates and 3D descriptors; and determining the first pose or the second pose, etc., according to the conversion relationship between the successfully matched 2D coordinates and 2D descriptors and the 3D coordinates and 3D descriptors.
  • the 3D descriptor may be description information of the 3D coordinates, including the adjacent coordinates of the 3D coordinates and/or attribute information of those adjacent coordinates.
  • the 2D descriptor may be description information of the 2D coordinates. For example, the PnP algorithm may be used with the above conversion relationship to determine the first pose or the second pose of the camera, as sketched below.
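  • a minimal sketch of solving the pose from such 2D-3D correspondences with OpenCV's PnP solver (the input arrays are assumed to come from the descriptor matching described above):

```python
import numpy as np
import cv2

def estimate_pose(points_2d, points_3d, camera_matrix, dist_coeffs=None):
    """Given matched 2D image points and 3D map points, solve the PnP
    problem with RANSAC to reject mismatched pairs and return (R, t)."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(4)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float32),
        np.asarray(points_2d, dtype=np.float32),
        camera_matrix, dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> rotation matrix
    return R, tvec.reshape(3)         # camera pose (R, t)
```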
  • Figure 7 is a schematic structural diagram of a visual positioning device provided by an embodiment of the disclosure. As shown in Figure 7, the visual positioning device may include:
  • the screening unit 701 is configured to determine a first candidate image sequence from an image library; the image library is used to construct an electronic map, and each frame image in the first candidate image sequence is arranged in the order of matching degree with the first image,
  • the first image is an image collected by a camera;
  • the screening unit 701 is further configured to adjust the order of each frame image in the first candidate image sequence according to the target window to obtain a second candidate image sequence;
  • the target window is a continuous multi-frame image, determined from the image library, that contains the target frame image.
  • the determining unit 702 is configured to determine the target pose of the camera when acquiring the first image according to the second candidate image sequence.
  • the determining unit 702 is configured to determine the first pose of the camera according to the first image sequence and the first image; the first image sequence includes consecutive multiple frames of images in the image library adjacent to a first reference frame image, where the first reference frame image is included in the second candidate image sequence;
  • the first pose is the target pose.
  • the determining unit 702 is configured to determine the second pose of the camera according to the second image sequence and the first image in the case that it is determined that the position of the camera is not successfully located according to the first pose;
  • the second image sequence includes consecutive multiple frames of images adjacent to a second reference frame image in the image library, and the second reference frame image is the next frame or the previous frame of the first reference frame image in the second candidate image sequence;
  • in the case where it is determined that the position of the camera is successfully located according to the second pose, the second pose is determined to be the target pose.
  • the determining unit 702 is configured to determine, among the features extracted from each image in the first image sequence, F features that match the features extracted from the first image, where F is an integer greater than 0;
  • the first pose is determined according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera;
  • the point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene where the camera was located when the first image was collected.
  • the screening unit 701 is configured to, in the case that the frames of images in the first candidate image sequence are arranged in ascending order of matching degree with the first image, adjust the images located in the target window in the first candidate image sequence to the last position of the first candidate image sequence;
  • in the case that the frames of images in the first candidate image sequence are arranged in descending order of matching degree with the first image, the images located in the target window in the first candidate image sequence are adjusted to the foremost position of the first candidate image sequence.
  • the screening unit 701 is configured to determine an image in the image library corresponding to at least one same visual word as the first image to obtain a plurality of primary selected images; any image in the image library Corresponding to at least one visual word, the first image corresponds to at least one visual word; determining a plurality of candidate images with the highest similarity between the corresponding visual word vector in the plurality of primary selected images and the visual word vector of the first image.
  • the screening unit 701 is configured to determine the top Q percent of the primary selected images whose corresponding visual word vectors have the highest similarity with the visual word vector of the first image, to obtain the multiple candidate images; Q is a real number greater than 0.
  • the filtering unit 701 is configured to use a vocabulary tree to convert the features extracted from the first image into a target word vector; the vocabulary tree is obtained by clustering the features extracted from training images collected from the scene to be located;
  • the visual word vector corresponding to any primary selected image in the multiple primary selected images is the visual word vector obtained, using the vocabulary tree, from the features extracted from that primary selected image;
  • multiple candidate images whose corresponding visual word vectors in the multiple primary selected images have the highest similarity with the target word vector are determined.
  • a leaf node in the vocabulary tree corresponds to a visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes;
  • the screening unit 701 is configured to calculate the weight of the visual word corresponding to each leaf node in the vocabulary tree in the first image; combine the weight of the visual word corresponding to each leaf node in the first image into a vector to obtain The target word vector.
  • a node of the vocabulary tree corresponds to a cluster center
  • the screening unit 701 is configured to use the vocabulary tree to classify the features extracted from the first image to obtain intermediate features classified into a target leaf node;
  • the target leaf node is any leaf node in the vocabulary tree, and the target leaf node corresponds to the target visual word;
  • the target weight of the target visual word in the first image is calculated according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word; the target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree is generated.
  • the filtering unit 701 is configured to classify the third feature extracted from the first image into leaf nodes according to a vocabulary tree;
  • the vocabulary tree is obtained by clustering the features extracted from images collected from the scene to be located;
  • the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features;
  • the fourth feature is a feature extracted from the target candidate image, and the target candidate image is any image included in the first candidate image sequence;
  • the number of features of the target candidate image matching the first image is obtained.
  • the determining unit 702 is further configured to determine the three-dimensional position of the camera according to the conversion matrix and the first pose; the conversion matrix is obtained by transforming the angle and position of the point cloud map and aligning the outline of the point cloud map with the indoor floor plan.
  • the determining unit 702 is configured to determine that the positional relationships of L pairs of feature points all conform to the first pose, where one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.
  • the device further includes:
  • the first obtaining unit 703 is configured to obtain a plurality of image sequences, each image sequence being obtained by collecting one area or multiple areas in the scene to be positioned;
  • the map construction unit 704 is configured to construct the point cloud map according to the multiple image sequences; wherein any one of the multiple image sequences is used to construct a sub-point cloud map of one or more regions; the point cloud map Including the first electronic map and the second electronic map.
  • the device further includes:
  • the second acquiring unit 705 is configured to acquire multiple training images obtained by shooting the scene to be positioned;
  • the feature extraction unit 706 is configured to perform feature extraction on the multiple training images to obtain a training feature set
  • the clustering unit 707 is configured to perform multiple clustering of the features in the training feature set to obtain the vocabulary tree.
  • the second acquiring unit 705 and the first acquiring unit 703 may be the same unit or different units.
  • the visual positioning device is a server, and the device further includes:
  • the receiving unit 708 is configured to receive the first image from a target device that has the camera installed.
  • the device further includes:
  • the sending unit 709 is configured to send the location information of the camera to the target device.
  • Figure 8 is a schematic structural diagram of a terminal provided by an embodiment of the present disclosure. As shown in Figure 8, the terminal may include:
  • the camera 801 is configured to collect a target image
  • the sending unit 802 is configured to send target information to the server, where the target information includes the target image or the feature sequence extracted from the target image, and the internal parameters of the camera;
  • the receiving unit 803 is configured to receive position information; the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to the second candidate image sequence, about the position of the camera when the target image was collected;
  • the second candidate image sequence is obtained by the server adjusting the order of each frame image in the first candidate image sequence according to a target window, the target window being a continuous multi-frame image, determined from the image library, that contains the target frame image; the image library is used to construct an electronic map.
  • the target frame image is an image in the image library that matches the second image.
  • the second image is an image collected by the camera before the first image is collected.
  • the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image;
  • the display unit 804 is configured to display an electronic map including the position and direction of the camera.
  • the terminal further includes: a feature extraction unit 805, configured to extract features in the target image.
  • the position information may include the three-dimensional position of the camera and the direction of the camera.
  • the camera 801 can be specifically used to execute the method mentioned in step 301 and the method that can be equivalently replaced;
  • the feature extraction unit 805 can be specifically configured to execute the method mentioned in step 302 and the method that can be equivalently replaced;
  • the sending unit 802 can be specifically used to execute the method mentioned in step 303 and the method that can be equivalently replaced;
  • the display unit 804 is specifically configured to execute the method mentioned in step 313 and step 507 and the method that can be equivalently replaced. It can be understood that the terminal in FIG. 8 can implement the operations performed by the terminal in FIG. 3 and FIG. 5.
  • each unit in the visual positioning device and the terminal is only a division of logical functions, and may be fully or partially integrated into a physical entity in actual implementation, or may be physically separated.
  • the above units can be separately established processing elements, or they can be integrated into the same chip for implementation.
  • they can also be stored in the storage element of the controller in the form of program code, which is called and combined by a certain processing element of the processor.
  • each unit can be integrated together or implemented independently.
  • the processing element here can be an integrated circuit chip with signal processing capabilities.
  • each step of the above method or each of the above units may be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
  • the processing element may be a general-purpose processor, such as a central processing unit (English: central processing unit, CPU for short), or one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA).
  • FIG. 9 is a schematic diagram of another terminal structure provided by an embodiment of the present disclosure.
  • the terminal in this embodiment as shown in FIG. 9 may include: one or more processors 901, a memory 902, a transceiver 903, a camera 904, and an input and output device 905.
  • the aforementioned processor 901, transceiver 903, memory 902, camera 904, and input/output device 905 are connected via a bus 906.
  • the memory 902 is used to store instructions
  • the processor 901 is used to execute instructions stored in the memory 902.
  • the transceiver 903 is used to receive and send data.
  • the camera 904 is used to collect images.
  • the processor 901 is used to control the transceiver 903, the camera 904, and the input/output device 905 to implement the operations performed by the terminal in FIG. 3 and FIG. 5.
  • the processor 901 may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 902 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901. A part of the memory 902 may also include a non-volatile random access memory. For example, the memory 902 may also store device type information.
  • the processor 901, the memory 902, the transceiver 903, the camera 904, and the input/output device 905 described in the embodiments of the present disclosure can implement the implementation of the terminal described in any of the foregoing embodiments, which will not be repeated here.
  • the transceiver 903 can implement the functions of the sending unit 802 and the receiving unit 803.
  • the processor 901 may implement the function of the feature extraction unit 805.
  • the input and output device 905 is used to implement the function of the display unit 804, and the input and output device 905 may be a display screen.
  • FIG. 10 is a schematic diagram of a server structure provided by an embodiment of the present disclosure.
  • the server 1100 may differ considerably due to different configurations or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors) and memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) for storing application programs 1042 or data 1044.
  • the memory 1032 and the storage medium 1030 may be short-term storage or permanent storage.
  • the program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
  • the central processing unit 1022 may be configured to communicate with the storage medium 1030, and execute a series of instruction operations in the storage medium 1030 on the server 1100.
  • the server 1100 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058, and/or one or more operating systems 1041, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • operating systems 1041 such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the input/output interface 1058 can implement the functions of the receiving unit 708 and the sending unit 709.
  • the central processing unit 1022 can implement the functions of the screening unit 701, the determining unit 702, the first obtaining unit 703, the map constructing unit 704, the second obtaining unit 705, the feature extraction unit 706, and the clustering unit 707.
  • a computer-readable storage medium stores a computer program, and when executed by a processor the computer program implements the following: a first candidate image sequence is determined from an image library; the image library is used to construct an electronic map, each frame image in the first candidate image sequence is arranged in the order of its matching degree with the first image, and the first image is an image collected by a camera; the order of each frame image in the first candidate image sequence is adjusted according to a target window to obtain a second candidate image sequence; the target window is a continuous multi-frame image, determined from the image library, that contains the target frame image, the target frame image is the image in the image library matching the second image, and the second image is the image collected by the camera before the first image is collected; according to the second candidate image sequence, the target pose of the camera when collecting the first image is determined.
  • another computer-readable storage medium stores a computer program. When executed by a processor, the computer program implements the following: a target image is collected through a camera; target information is sent to a server, where the target information includes the target image or the feature sequence extracted from the target image, and the internal parameters of the camera; position information is received, where the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to the second candidate image sequence, about the position of the camera when the target image was collected; the second candidate image sequence is obtained by the server adjusting the order of each frame image in the first candidate image sequence according to a target window, where the target window is a continuous multi-frame image, determined from an image library, that contains a target frame image, the image library is used to construct an electronic map, the target frame image is the image in the image library that matches the second image, the second image is the image collected by the camera before the first image is collected, and each frame of image in the first candidate image sequence is arranged in the order of its matching degree with the first image.


Abstract

A visual positioning method and related apparatus, relating to the field of computer vision. The method includes: a visual positioning device determines a first candidate image sequence from an image library (201); the image library is used to construct an electronic map, the frames of images in the first candidate image sequence are arranged in the order of their matching degree with a first image, and the first image is an image collected by a camera; the order of the frames of images in the first candidate image sequence is adjusted according to a target window to obtain a second candidate image sequence (202); the target window is a continuous multi-frame image, determined from the image library, that contains a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image was collected; the target pose of the camera when collecting the first image is determined according to the second candidate image sequence (203).

Description

Visual positioning method and related apparatus

Cross-reference to related applications

The present disclosure is based on, and claims priority from, the Chinese patent application with application number 201910821911.3 filed on August 30, 2019, the entire content of which is incorporated into the present disclosure by reference.

Technical Field

The present disclosure relates to, but is not limited to, the field of computer vision, and in particular to a visual positioning method and related apparatus.

Background

Positioning technology is very important in people's daily life. Positioning can be performed with the Global Positioning System (GPS), but GPS positioning is mostly used outdoors. At present, indoor positioning systems are mainly implemented based on Wi-Fi signals, Bluetooth signals, Ultra Wide Band (UWB) technology, and the like. Positioning based on Wi-Fi signals requires many wireless Access Points (AP) to be deployed in advance.

Visual information is simple and convenient to acquire and does not require modifying the scene: rich visual information about the surroundings can be obtained simply by taking images with a device such as a mobile phone. Vision-based positioning technology performs positioning using the visual information (images or videos) collected by an image or video capture device such as a mobile phone.

Summary

The embodiments of the present disclosure provide a visual positioning method and related apparatus.

In a first aspect, an embodiment of the present disclosure provides a visual positioning method, including: determining a first candidate image sequence from an image library, where the image library is used to construct an electronic map, the frames of images in the first candidate image sequence are arranged in the order of their matching degree with a first image, and the first image is an image collected by a camera; adjusting the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence, where the target window is a continuous multi-frame image, determined from the image library, that contains a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image was collected; and determining, according to the second candidate image sequence, the target pose of the camera when collecting the first image.

The embodiments of the present disclosure use the temporal continuity of image frames to effectively improve the positioning speed for continuous frames.
In some embodiments, determining the target pose of the camera when collecting the first image according to the second candidate image sequence includes: determining a first pose of the camera according to a first image sequence and the first image, where the first image sequence includes consecutive multiple frames of images in the image library adjacent to a first reference frame image, and the first reference frame image is included in the second candidate image sequence; and, in the case that it is determined that the position of the camera is successfully located according to the first pose, determining the first pose to be the target pose.

In some embodiments, after determining the first pose of the camera according to the first image sequence and the first image, the method further includes: in the case that it is determined that the position of the camera is not successfully located according to the first pose, determining a second pose of the camera according to a second image sequence and the first image, where the second image sequence includes consecutive multiple frames of images in the image library adjacent to a second reference frame image, and the second reference frame image is the next frame or the previous frame of the first reference frame image in the second candidate image sequence; and, in the case that it is determined that the position of the camera is successfully located according to the second pose, determining the second pose to be the target pose.

In some embodiments, determining the first pose of the camera according to the first image sequence and the first image includes: determining, among the features extracted from the images in the first image sequence, F features that match the features extracted from the first image, F being an integer greater than 0; and determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in a point cloud map, and the internal parameters of the camera, where the point cloud map is an electronic map of the scene to be positioned, and the scene to be positioned is the scene in which the camera was located when collecting the first image.

In some embodiments, adjusting the order of the frames of images in the first candidate image sequence according to the target window to obtain the second candidate image sequence includes: in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from low to high, adjusting the images in the first candidate image sequence that are located in the target window to the last position of the first candidate image sequence; and, in the case that the frames of images in the first candidate image sequence are arranged in order of matching degree with the first image from high to low, adjusting the images in the first candidate image sequence that are located in the target window to the foremost position of the first candidate image sequence.

In some embodiments, determining the first candidate image sequence from the image library includes:

determining multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity with the visual word vector corresponding to the first image, where each image in the image library corresponds to a visual word vector, and the images in the image library are used to construct an electronic map of the scene to be positioned in which the target device was located when collecting the first image;

performing feature matching between each of the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image;

obtaining the M images among the multiple candidate images with the largest numbers of features matching the first image, to obtain the first candidate image sequence.
In some embodiments, determining the multiple candidate images whose corresponding visual word vectors in the image library have the highest similarity with the visual word vector corresponding to the first image includes: determining the images in the image library that share at least one visual word with the first image, to obtain multiple primary selected images, where each image in the image library corresponds to at least one visual word and the first image corresponds to at least one visual word; and determining the multiple candidate images whose corresponding visual word vectors among the multiple primary selected images have the highest similarity with the visual word vector of the first image.

In some embodiments, determining the multiple candidate images whose corresponding visual word vectors among the multiple primary selected images have the highest similarity with the visual word vector of the first image includes: determining the top Q percent of the primary selected images whose corresponding visual word vectors have the highest similarity with the visual word vector of the first image, to obtain the multiple candidate images, Q being a real number greater than 0.

In some embodiments, determining the multiple candidate images whose corresponding visual word vectors among the multiple primary selected images have the highest similarity with the visual word vector of the first image includes:

converting the features extracted from the first image into a target word vector by using a vocabulary tree, where the vocabulary tree is obtained by clustering the features extracted from training images collected from the scene to be positioned;

respectively calculating the similarity between the target word vector and the visual word vector corresponding to each of the multiple primary selected images, where the visual word vector corresponding to any primary selected image is a visual word vector obtained with the vocabulary tree from the features extracted from that primary selected image;

determining the multiple candidate images whose corresponding visual word vectors among the multiple primary selected images have the highest similarity with the target word vector.

In this implementation, the features extracted from the first image are converted into a target word vector by using a vocabulary tree, and the multiple candidate images are obtained by calculating the similarity between the target word vector and the visual word vector corresponding to each primary selected image; candidate images can thus be screened out quickly and accurately.

In some embodiments, each leaf node in the vocabulary tree corresponds to a visual word, and the nodes in the last layer of the vocabulary tree are leaf nodes; converting the features extracted from the first image into the target word vector by using the vocabulary tree includes:

calculating the weight, in the first image, of the visual word corresponding to each leaf node in the vocabulary tree;

combining the weights, in the first image, of the visual words corresponding to the leaf nodes into a vector to obtain the target word vector.
In this implementation, the target word vector can be calculated quickly.

In some embodiments, each node of the vocabulary tree corresponds to a cluster center; calculating the weight, in the first image, of each visual word corresponding to the vocabulary tree includes:

classifying the features extracted from the first image by using the vocabulary tree, to obtain intermediate features classified to a target leaf node, where the target leaf node is any leaf node in the vocabulary tree and corresponds to a target visual word;

calculating the target weight of the target visual word in the first image according to the intermediate features, the weight of the target visual word, and the cluster center corresponding to the target visual word, where the target weight is positively related to the weight of the target visual word, and the weight of the target visual word is determined according to the number of features corresponding to the target visual word when the vocabulary tree was generated.

In some embodiments, the intermediate features include at least one sub-feature; the target weight is the sum of the weight parameters corresponding to the sub-features included in the intermediate features; the weight parameter corresponding to a sub-feature is negatively related to the feature distance, and the feature distance is the Hamming distance between the sub-feature and the corresponding cluster center.

This implementation takes into account the differences among the features falling in the same visual word.

In some embodiments, performing feature matching between each of the multiple candidate images and the first image to obtain the number of features of each candidate image that match the first image includes:

classifying a third feature extracted from the first image to a leaf node according to the vocabulary tree, where the vocabulary tree is obtained by clustering the features extracted from the images collected from the scene to be positioned, the nodes in the last layer of the vocabulary tree are leaf nodes, and each leaf node contains multiple features;

performing feature matching between the third feature and a fourth feature in each leaf node to obtain the fourth features in each leaf node that match the third feature, where the fourth feature is a feature extracted from a target candidate image, and the target candidate image is any image included in the first candidate image sequence;

obtaining, according to the fourth features in each leaf node that match the third feature, the number of features of the target candidate image that match the first image.

In this way, the amount of computation for feature matching can be reduced, and the speed of feature matching can be greatly improved.
In some embodiments, after determining the first pose according to the F features, the spatial coordinate points corresponding to the F features in the point cloud map, and the internal parameters of the camera, the method further includes:

determining the three-dimensional position of the camera according to a conversion matrix and the first pose, where the conversion matrix is obtained by transforming the angle and position of the point cloud map so as to align the outline of the point cloud map with the indoor floor plan.

In some embodiments, determining that the first pose successfully locates the position of the camera includes: determining that the positional relationships of L pairs of feature points all conform to the first pose, where one feature point in each pair is extracted from the first image, the other feature point is extracted from an image in the first image sequence, and L is an integer greater than 1.

In this implementation, whether the second pose can successfully locate the position of the target device can be determined accurately and quickly.

In some embodiments, before determining the first pose of the camera according to the first image sequence and the first image, the method further includes:

obtaining multiple image sequences, each image sequence being obtained by collecting one or more regions of the scene to be positioned;

constructing the point cloud map according to the multiple image sequences, where any one of the multiple image sequences is used to construct a sub-point cloud map of one or more regions, and the point cloud map includes the first electronic map and the second electronic map.

In this implementation, the scene to be positioned is divided into multiple regions, and sub-point cloud maps are constructed region by region. In this way, when a region of the scene to be positioned changes, only a video sequence of that region needs to be collected to rebuild the sub-point cloud map of that region, instead of rebuilding the point cloud map of the entire scene to be positioned; this can effectively reduce the workload.
In some embodiments, before converting the features extracted from the first image into the target word vector by using the vocabulary tree, the method further includes:

obtaining multiple training images obtained by photographing the scene to be positioned;

performing feature extraction on the multiple training images to obtain a training feature set;

performing multiple rounds of clustering on the features in the training feature set to obtain the vocabulary tree.

In some embodiments, the visual positioning method is applied to a server; before determining the first candidate image sequence from the image library, the method further includes: receiving the first image from a target device on which the camera is installed.

In this implementation, the server performs positioning according to the first image from the target device, which makes full use of the server's advantages in processing speed and storage space, providing high positioning accuracy and fast positioning.

In some embodiments, after determining that the second pose successfully locates the position of the target device, the method further includes: sending the position information of the camera to the target device.

In this implementation, the server sends the position information of the target device to the target device so that the target device can display the position information, allowing the user to know exactly where he or she is.

In some embodiments, the visual positioning method is applied to an electronic device on which the camera is installed.
In a second aspect, an embodiment of the present disclosure provides another visual positioning method, which may include: collecting a target image through a camera;

sending target information to a server, where the target information includes the target image or a feature sequence extracted from the target image, and the internal parameters of the camera;

receiving position information, where the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when collecting the target image; the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window, the target window is a continuous multi-frame image, determined from an image library, that contains a target frame image, the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, the second image is an image collected by the camera before the first image was collected, and the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image;

displaying an electronic map that contains the position and direction of the camera.

In a third aspect, an embodiment of the present disclosure provides a visual positioning apparatus, including:

a screening unit, configured to determine a first candidate image sequence from an image library, where the image library is used to construct an electronic map, the frames of images in the first candidate image sequence are arranged in the order of their matching degree with a first image, and the first image is an image collected by a camera;

the screening unit being further configured to adjust the order of the frames of images in the first candidate image sequence according to a target window to obtain a second candidate image sequence, where the target window is a continuous multi-frame image, determined from the image library, that contains a target frame image, the target frame image is an image in the image library that matches a second image, and the second image is an image collected by the camera before the first image was collected;

a determining unit, configured to determine, according to the second candidate image sequence, the target pose of the camera when collecting the first image.

In a fourth aspect, an embodiment of the present disclosure provides a terminal device, including:

a camera, configured to collect a target image;

a sending unit, configured to send target information to a server, where the target information includes the target image or a feature sequence extracted from the target image, and the internal parameters of the camera;

a receiving unit, configured to receive position information, where the position information is used to indicate the position and direction of the camera; the position information is information, determined by the server according to a second candidate image sequence, about the position of the camera when collecting the target image; the second candidate image sequence is obtained by the server adjusting the order of the frames of images in a first candidate image sequence according to a target window, the target window is a continuous multi-frame image, determined from an image library, that contains a target frame image, the image library is used to construct an electronic map, the target frame image is an image in the image library that matches a second image, the second image is an image collected by the camera before the first image was collected, and the frames of images in the first candidate image sequence are arranged in the order of their matching degree with the first image;

a display unit, configured to display an electronic map that contains the position and direction of the camera.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing a program; and a processor, configured to execute the program stored in the memory, where, when the program is executed, the processor is configured to perform the method of the first aspect, the second aspect, or any implementation thereof.

In a sixth aspect, an embodiment of the present disclosure provides a visual positioning system, including a server and a terminal device, where the server performs the method of the first aspect or any implementation thereof, and the terminal device is configured to perform the method of the second aspect.

In a seventh aspect, an embodiment of the present disclosure provides a computer-readable storage medium that stores a computer program, the computer program including program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect, the second aspect, or any implementation thereof.

In an eighth aspect, an embodiment of the present disclosure provides a computer program product, where the computer program product contains program instructions which, when executed by a processor, cause the processor to perform the visual positioning method provided by any of the foregoing embodiments.
附图说明
In order to explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings needed in the embodiments or the background art are described below.

FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the present disclosure;

FIG. 2 is a visual positioning method provided by an embodiment of the present disclosure;

FIG. 3 is another visual positioning method provided by an embodiment of the present disclosure;

FIG. 4 is yet another visual positioning method provided by an embodiment of the present disclosure;

FIG. 5 is a positioning and navigation method provided by an embodiment of the present disclosure;

FIG. 6 is a method for constructing a point cloud map provided by an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a visual positioning apparatus provided by an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of another terminal provided by an embodiment of the present disclosure;

FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present disclosure.
具体实施方式
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them.

The terms "first", "second", "third", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion, for example, the inclusion of a series of steps or units. A method, system, product, or device is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

Positioning methods based on non-visual information usually require equipment to be deployed in the scene to be positioned in advance, and their positioning accuracy is not high. Positioning methods based on visual information are therefore the main direction of current research. The visual positioning method provided by the embodiments of the present disclosure can be applied to scenarios such as location recognition and positioning and navigation. The application of the visual positioning method provided by the embodiments of the present disclosure in the location recognition scenario and the positioning and navigation scenario is briefly introduced below.

Location recognition scenario: for example, in a large shopping mall, the mall (i.e., the scene to be positioned) can be divided into regions, and a point cloud map of the mall can be constructed for each region using techniques such as Structure from Motion (SFM). When a user in the mall wants to determine his or her own position and/or direction, the user can start a target application on a mobile phone; the phone collects images of the surroundings with its camera, displays an electronic map on the screen, and marks the user's current position and direction on the electronic map. The target application is an application specially developed to achieve accurate indoor positioning.

Positioning and navigation scenario: for example, in a large shopping mall, the mall can be divided into regions, and a point cloud map of the mall can be constructed for each region using techniques such as SFM. When a user is lost in the mall or wants to go to a certain shop, the user starts the target application on the mobile phone and enters the destination address to be reached; the user raises the phone toward the front to collect images, and the phone displays the collected images in real time and displays a mark, such as an arrow, indicating how the user can reach the destination address. The target application is an application specially developed to achieve accurate indoor positioning. Because the computing power of a mobile phone is small, the computation needs to be placed in the cloud, that is, the positioning operation is implemented by the cloud. Because shopping malls change frequently, only the point cloud map of the changed region needs to be rebuilt, rather than rebuilding that of the whole mall.
Since the embodiments of the present disclosure involve image feature extraction, the SFM algorithm, pose estimation, and so on, for ease of understanding, the relevant terms and concepts involved in the embodiments of the present disclosure are first introduced below.

(1) Feature points, descriptors, and the Oriented FAST and Rotated BRIEF (ORB) algorithm

A feature point of an image can be simply understood as a relatively salient point in the image, such as a contour point, a bright point in a darker region, or a dark point in a brighter region. This definition is based on the gray values of the image around the feature point: the pixel values on a circle around a candidate feature point are checked, and if enough pixels in the neighborhood differ sufficiently in gray value from the candidate point, the candidate point is considered a feature point. After the feature points are obtained, their attributes need to be described in some way; the output of these attributes is called the descriptor of the feature point (feature descriptors). The ORB algorithm is a fast algorithm for feature point extraction and description. ORB uses the FAST (Features from Accelerated Segment Test) algorithm to detect feature points. FAST is a corner detection algorithm whose principle is to take a detection point in the image and use the 16 pixels on a circle around that point to judge whether the detection point is a corner. ORB uses the BRIEF algorithm to compute the descriptor of a feature point. The core idea of BRIEF is to select N point pairs around the key point P in a certain pattern and combine the comparison results of these N point pairs into the descriptor.

The biggest characteristic of the ORB algorithm is its computation speed. This benefits first from using FAST to detect feature points, whose detection speed is, as its name suggests, famously fast. It further benefits from using BRIEF to compute the descriptor: the binary-string representation of this descriptor not only saves storage space but also greatly shortens the matching time. For example, the descriptors of feature points A and B are as follows: A: 10101011; B: 10101010. A threshold is set, for example 80%. When the similarity of the descriptors of A and B is greater than the threshold, A and B are judged to be the same feature point, that is, the two points match successfully. In this example A and B differ only in the last bit, and their similarity is 87.5%, which is greater than 80%; therefore A and B match.

(2) SFM algorithm

The Structure From Motion (SFM) algorithm is an offline algorithm for three-dimensional reconstruction based on various collected unordered pictures. Before the core Structure From Motion step, some preparation is needed to pick out suitable pictures. First, focal length information is extracted from the pictures, then feature extraction algorithms such as SIFT are used to extract image features, and a kd-tree model is used to compute the Euclidean distance between the feature points of two pictures for feature point matching, so as to find image pairs whose number of matched feature points meets the requirement. SIFT (Scale-Invariant Feature Transform) is an algorithm for detecting local features. The kd-tree, developed from the BST (Binary Search Tree), is a high-dimensional index tree data structure commonly used for dense lookup and comparison of large-scale high-dimensional data, mainly nearest neighbor search and approximate nearest neighbor search; in computer vision it is mainly used for lookup and comparison of high-dimensional feature vectors in image retrieval and recognition. For each image matching pair, the epipolar geometry is computed, the fundamental matrix (the F matrix) is estimated, and the matching pairs are refined with the RANSAC algorithm. If a feature point can be passed along chain-wise in such matched pairs and is detected continuously, a track can be formed. The Structure From Motion part then begins: the key first step is to select a good image pair to initialize the whole Bundle Adjustment (BA) process. First, a first BA is performed on the two initially selected pictures; then new pictures are added in a loop for new BAs, until there are no more suitable pictures to add and the BA ends. This yields the estimated camera parameters and the geometric information of the scene, that is, a sparse 3D point cloud (the point cloud map).

(3) RANSAC algorithm

The random sample consensus (RANSAC) algorithm iteratively estimates the parameters of a mathematical model from a set of observed data that contains outliers. The basic assumption of RANSAC is that the samples contain correct data (inliers, data that can be described by the model) as well as abnormal data (outliers, data that deviate far from the normal range and cannot fit the mathematical model), that is, the data set contains noise. These abnormal data may be caused by erroneous measurements, erroneous assumptions, erroneous calculations, and so on. The input of RANSAC is a set of observed data, a parameterized model that can explain or fit the observed data, and some credible parameters. RANSAC achieves its goal by repeatedly selecting a random subset of the data. The selected subset is assumed to consist of inliers and is verified as follows: 1. A model is fitted to the assumed inliers, that is, all unknown parameters are computed from the assumed inliers. 2. The model obtained in step 1 is used to test all other data; if a point fits the estimated model, it is also considered an inlier. 3. If enough points are classified as assumed inliers, the estimated model is reasonable enough. 4. The model is then re-estimated with all the assumed inliers, because it had only been estimated from the initial assumed inliers. 5. Finally, the model is evaluated by estimating the error rate of the inliers with respect to the model. This process is repeated a fixed number of times; the model produced in each iteration is either discarded because it has too few inliers, or selected because it is better than the existing model.

(4) Vocabulary tree

A vocabulary tree is an efficient data structure for retrieving images based on visual vocabulary (also called visual words). Facing a massive image library, a tree structure allows keyword queries in sub-linear time instead of scanning all keywords to find matching images, which can greatly improve retrieval speed. The steps of building a vocabulary tree are as follows: 1. Extract the ORB features of all training images, about 3000 features per training image; the training images are collected from the scene to be positioned. 2. Cluster all extracted features into K classes with K-means, then cluster each class into K classes again in the same way, down to layer L, keeping the cluster centers of each layer, finally generating the vocabulary tree. K and L are both integers greater than 1, for example K is 10 and L is 6. The leaf nodes, i.e., the nodes of the L-th layer, are the final visual words. A node of the vocabulary tree is a cluster center. FIG. 1 is a schematic diagram of a vocabulary tree provided by an embodiment of the present disclosure. As shown in FIG. 1, the vocabulary tree includes (L+1) layers in total, the first layer includes a root node, and the last layer includes multiple leaf nodes.
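A minimal sketch of this hierarchical clustering is given below. ORB descriptors are binary, so production implementations cluster them in Hamming space (k-majority); treating the bits as 0/1 vectors and applying ordinary k-means, as here, is a simplification for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(descriptors, k=10, depth=6):
    """Recursively cluster training descriptors into K branches per layer,
    down to L layers; the leaves are the visual words."""
    def cluster(desc, level):
        node = {"center": desc.mean(axis=0), "children": [], "is_leaf": False}
        if level == depth or len(desc) <= k:
            node["is_leaf"] = True      # this node is a visual word
            return node
        labels = KMeans(n_clusters=k, n_init=3).fit_predict(desc)
        for c in range(k):
            subset = desc[labels == c]
            if len(subset):
                node["children"].append(cluster(subset, level + 1))
        return node
    return cluster(np.asarray(descriptors, dtype=np.float32), 0)
```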
图2为本公开实施例提供的一种视觉定位方法,如图2所示,该方法可包括:
201、视觉定位装置从图像库中确定第一备选图像序列。
该视觉定位装置可以是服务器,也可以是手机、平板电脑等可采集图像的移动终端。该图像库用于构建电子地图。该第一备选图像序列包括M个图像,该第一备选图像序列中的各帧图像按照与第一图像的匹配度顺序排列。该第一图像为目标设备的相机采集的图像,M为大于1的整数。例如,M为5、6或8等。该目标设备可以是手机、平板电脑等可采集图像和/或视频的设备。在该实现中,先通过计算视觉词向量的相似度来选出多个备选图像,再从该多个备选图像中获取与第一图像的特征匹配数量最多的M个图像,图像检索效率高。
在一些实施例中,该第一备选图像序列中第一帧图像与该第一图像的特征匹配数量最多,该第一备选图像序列中最后一帧图像与该第一图像的特征匹配数量最少。
在一些实施例中,该第一备选图像序列中第一帧图像与该第一图像的特征匹配数量最少,该第一备选图像序列中最后一帧图像与该第一图像的特征匹配数量最多。
在一些实施例中,视觉定位装置为服务器,第一图像为接收的来自手机等移动终端的图像,该第一图像可以是移动终端在待定位场景采集的图像。
在一些实施例中,视觉定位装置为手机、平板电脑等可采集图像的移动终端,第一图像为该视觉定位装置在待定位场景提取的图像。
采用这种方式可以从图像库中初步筛选出一些图像,再从这些图像中选择出对应的视觉词向量与该第一图像的视觉词向量相似度最高的多个备选图像;可以大幅度提高图像检索的效率。
202、根据目标窗口调整第一备选图像序列中各帧图像的顺序,得到第二备选图像序列。该目标窗口包含从图像库中确定的包含目标帧图像的连续多帧图像,该目标帧图像为该图像库中与第二图像相匹配的图像,该第二图像为该相机在采集到第一图像之前所采集的图像。
在一些实施例中,根据目标窗口调整该第一备选图像序列中各帧图像的顺序,得到第二备选图像序列的实现方式如下:在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从低到高的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最后位置;在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从高到低的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最前位置。视觉定位装置可存储或关联有图像库,该图像库中的图像用于构建待定位场景的点云地图。
在一些实施例中,该图像库包括一个或多个图像序列,每个图像序列包括采集该待定位场景的一个区域得到的连续多帧图像,每个图像序列可用于构建一个子点云地图,即一个区域的点云地图。这些子点云地图构成该点云地图。可以理解,该图像库中的图像可以是连续的。在实际应用中,可对待定位场景划分区域,对每个区域采集多角度的图像序列,每个区域至少需要正反两个方向的图像序列。
该目标窗口可以是包括该目标帧图像的一个图像序列,也可以是包括该目标帧图像的图像序列的一部分。举例来说,该目标窗口包括61帧图像,即目标帧图像以及该目标帧图像的前后各三十帧图像。本公开实施例中,目标窗口的大小不作限定。假定第一备选图像序列中的图像依次为图像1、图像2、图像3、图像4以及图像5,其中,图像3和图像5为位于该目标窗口中的图像,则该第二备选图像序列中的图像依次为图像3、图像5、图像1、图像2以及图像4。可以理解,图2中的方法流程实现的是连续帧定位,视觉定位装置执行步骤201、步骤203、步骤204以及步骤205可实现单帧定位。
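下面用一段示例性的Python代码草图示意步骤202中根据目标窗口调整备选图像顺序的逻辑(其中的图像编号与窗口内容与上文图像1至图像5的例子一致,函数与参数命名均为示例性假设):

def reorder_by_window(candidates, window, high_to_low=True):
    """candidates:第一备选图像序列(图像编号列表);window:位于目标窗口中的图像编号集合"""
    in_window = [img for img in candidates if img in window]
    out_window = [img for img in candidates if img not in window]
    # 按匹配度从高到低排列时,将窗口内图像调整至最前;从低到高排列时则调整至最后
    return in_window + out_window if high_to_low else out_window + in_window

# 用法示意
print(reorder_by_window([1, 2, 3, 4, 5], {3, 5}))   # 输出 [3, 5, 1, 2, 4]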
203、根据该第二备选图像序列确定该相机在采集该第一图像时的目标位姿。
此处的目标位姿至少可包括相机在采集第一图像时的位置;在另一些实施例中,该目标位姿可包括:相机在采集第一图像时的位置和姿态。该相机的姿态包括但不限于相机的朝向。
在一些实施例中,根据该第二备选图像序列确定该相机在采集该第一图像时的目标位姿的实现方式如下:根据第一图像序列和该第一图像,确定该相机的第一位姿;该第一图像序列包括该图像库中与第一参考帧图像相邻的连续多帧图像,该第一参考帧图像包含于该第二备选图像序列。在确定根据该第一位姿成功定位该相机的位置的情况下,确定该第一位姿为该目标位姿。在确定根据该第一位姿未成功定位该相机的位置的情况下,根据第二图像序列和该第一图像,确定该相机的第二位姿。该第二图像序列包括该图像库中与第二参考帧图像相邻的连续多帧图像,该第二参考帧图像为该第二备选图像序列中该第一参考帧图像的后一帧图像或前一帧图像。
在一些实施例中,该第一图像序列包括该第一参考帧图像的前K1帧图像、该第一参考帧图像以及该第一参考帧图像的后K1帧图像;K1为大于1的整数,例如K1为10。
在一些实施例中,该根据第一图像序列和该第一图像,确定该相机的第一位姿可以是:从该第一图像序列中各图像提取的特征中,确定与从该第一图像提取的特征相匹配的F个特征,F为大于0的整数;根据该F个特征、该F个特征在点云地图中对应的空间坐标点以及该相机的内参,确定该第一位姿。该点云地图为待定位场景的电子地图,该待定位场景为该相机(即该目标设备)采集该第一图像时所处的场景。
例如,视觉定位装置可采用PnP算法根据该F个特征、该F个特征在该点云地图中对应的空间坐标点以及相机的内参,确定该相机的第一位姿。该F个特征中每个特征对应图像中的一个特征点。也就是说,每个特征对应的一个2D参考点(即特征点在图像中的二维坐标)。通过匹配2D参考点和空间坐标点(即3D参考点)可以确定每个2D参考点对应的空间坐标点,这样就可以知道2D参考点与空间坐标点的一一对应关系。由于每个特征对应一个2D参考点,每个2D参考点匹配一个空间坐标点,这样就可以知道每个特征对应的空间坐标点。视觉定位装置也可以采用其他方式确定各特征在点云地图中对应的空间坐标点,本公开不作限定。该F个特征在点云地图中对应的空间坐标点即为F个世界坐标系中的3D参考点(即空间坐标点)。多点透视成像(Perspective-n-Point,PnP)是求解3D到2D点对的运动的方法:即给出F个3D空间点时,如何求解相机的位姿。PnP问题的已知条件:F个世界坐标系中的3D参考点(3D reference points)坐标,F为大于0的整数;与这F个3D点对应的、投影在图像上的2D参考点(2D reference points)坐标;摄像头的内参。求解PnP问题可以得到相机(也可以是摄像头)的位姿。典型的PnP问题求解方式有很多种,例如P3P,直接线性变换(DLT),EPnP(Efficient PnP),UPnP,还有非线性优化方法等。因此,视觉定位装置可以采用任一种求解PnP问题的方式,根据F个特征、该F个特征在点云地图中对应的空间坐标点以及相机的内参,确定相机的第二位姿。另外,考虑到存在特征误匹配的情况,这里可以使用Ransac算法进行迭代,每轮迭代统计出内点个数。当内点数满足某个比例或是迭代固定轮数后,停止迭代,把内点数最大的解(R和t)返回。其中,R为旋转矩阵,t为平移向量,即该相机的位姿包括的两组参数。本公开实施例中,相机等同于摄像头以及其他图像或视频采集装置。
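作为上述根据F个特征、对应的空间坐标点以及相机内参求解位姿的示意,下面给出一段示例性的Python代码草图(假设使用OpenCV的solvePnPRansac接口;重投影误差阈值4.0、内点数阈值12等均为示例性假设):

import cv2
import numpy as np

def estimate_pose(pts_3d, pts_2d, K, dist=None, min_inliers=12):
    """pts_3d:F×3空间坐标点;pts_2d:F×2图像特征点坐标;K:相机内参矩阵"""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts_3d), np.float32(pts_2d), K, dist,
        reprojectionError=4.0, iterationsCount=100, flags=cv2.SOLVEPNP_EPNP)
    if not ok or inliers is None or len(inliers) <= min_inliers:
        return None                      # 内点数不足,视为本次定位失败
    R, _ = cv2.Rodrigues(rvec)           # 旋转向量转换为旋转矩阵R
    return R, tvec                       # 旋转矩阵R与平移向量t即为相机位姿

# 用法示意:K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])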
本公开实施例提供的是一种连续帧定位方法,利用第一图像之前的一帧(即第二图像)定位出相机位姿时所匹配的图像来调整第一备选图像序列中各图像的排序,能够充分利用图像在时序上的连贯性,将最可能与该第一图像相匹配的图像排在该第一备选图像序列的最前面,这样就可以更快地找到与该第一图像相匹配的图像。
在一些实施例中,视觉定位装置在执行步骤203之后,还可以执行如下操作确定相机的三维位置:根据转换矩阵和该相机的目标位姿,确定该相机的三维位置。其中,该转换矩阵为通过变换点云地图的角度和位置,将该点云地图的轮廓和室内平面图对齐得到的。具体的,将旋转矩阵R和平移向量t拼成4*4的矩阵

$$T'=\begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix}$$

采用转换矩阵T_i左乘该矩阵T′得到新矩阵

$$T=T_{i}\,T'$$

将T表示为

$$T=\begin{bmatrix} R^{*} & t^{*} \\ 0^{T} & 1 \end{bmatrix}$$

t*即为相机最后的三维位置。在该实现方式中,可以准确地确定相机的三维位置,实现简单。
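下面用一段示例性的Python代码草图示意由位姿(R,t)与转换矩阵T_i计算相机三维位置t*的过程(假设使用numpy,T_i为已通过点云地图与室内平面图对齐得到的4*4矩阵):

import numpy as np

def camera_position(R, t, T_i):
    T_prime = np.eye(4)
    T_prime[:3, :3] = R                  # 将旋转矩阵R与平移向量t拼成4*4矩阵T'
    T_prime[:3, 3] = t.reshape(3)
    T = T_i @ T_prime                    # 转换矩阵T_i左乘T'得到新矩阵T
    return T[:3, 3]                      # T的平移部分t*即为相机最后的三维位置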
本公开实施例提供的是一种连续帧定位方法,利用第一图像之前的一帧(即第二图像)定位出相机位姿时所匹配的图像来调整第一备选图像序列中各图像的排序,能够充分利用图像在时序上的连贯性,将最可能与该第一图像相匹配的图像排在该第一备选图像序列的最前面,这样就可以更快地找到与该第一图像相匹配的图像,进而更快地定位。
在一个实现方式中,根据第一位姿成功定位相机的位置的情况可以是:确定L对特征点的位置关系均符合该第一位姿,每对特征点中的一个特征点是从该第一图像提取的,另一个特征点是从该第一图像序列中的图像提取的,L为大于1的整数。示例性的,根据该第一位姿采用RANSAC算法迭代求解PnP,每轮迭代统计出内点个数。当内点个数大于目标阈值(例如12)时,确定根据该第一位姿成功定位该相机的位置;当内点个数不大于该目标阈值(例如12)时,确定根据该第一位姿未成功定位该相机的位置。在实际应用中,视觉定位装置如果利用第二备选图像序列中的某帧图像未成功定位该相机的位置,则利用该第二备选图像序列中该帧图像的下一帧图像进行定位。
如果使用该第二备选图像序列中的每帧图像都不能成功定位该相机的位置,则返回定位失败。本公开实施例提供的是连续帧定位方法,当利用第一图像成功定位相机的位置后,继续采用相机采集的该第一图像的下一帧图像进行定位。
在实际应用中,视觉定位装置可按照第二备选图像序列中各帧图像的先后顺序依次使用各帧图像来定位相机的位置,直到定位出该相机的位置。如果使用该第二备选图像序列中的每帧图像都不能成功定位该相机的位置,则返回定位失败。举例来说,视觉定位装置先使用第二备选图像序列中的第一帧图像进行定位,如果定位成功,则停止本次定位;如果定位未成功,则使用该第二备选图像序列中的第二帧图像进行定位;依次类推。不同次定位中,使用其它图像序列和该第一图像进行相机的目标位姿定位的方法,与使用第一图像序列和该第一图像进行定位的方法可相同。
下面介绍如何从图像库中确定第一备选图像序列的方式,即步骤201的实现方式。
在一个实现方式中,该从图像库中确定第一备选图像序列的方式可以如下:利用词汇树将从该第一图像提取的特征转换为目标词向量;计算该目标词向量与图像库中各图像对应的词向量的相似性评分;获取该图像库包括的每个图像序列中与该第一图像的相似性评分最高的前10帧图像,得到初选图像序列;按照相似性评分由高到低的顺序对该初选图像序列中的各图像进行排序之后,取出前20%的图像作为中选图像序列,如果前20%的图像少于10帧,则直接取前10帧;对该中选图像序列中的每一帧图像与该第一图像进行特征匹配;按照该中选图像序列中各帧图像与该第一图像的特征匹配的数量由多到少排序之后,选取前M个图像,得到第一备选图像序列。
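下面用一段示例性的Python代码草图示意上述由相似性评分到特征匹配数量的逐级筛选流程(其中similarity、match_count等函数以及前10帧、前20%、M=5等参数均沿用上文描述,仅作示意):

def select_candidates(query_vec, sequences, similarity, match_count, M=5):
    """sequences:{序列编号: [图像, ...]};similarity/match_count分别计算相似性评分与特征匹配数量"""
    # 1) 每个图像序列中取与第一图像相似性评分最高的前10帧,得到初选图像序列
    primary = []
    for seq in sequences.values():
        scored = sorted(seq, key=lambda img: similarity(query_vec, img), reverse=True)
        primary.extend(scored[:10])
    # 2) 按相似性评分由高到低排序后取前20%(不足10帧则直接取前10帧),得到中选图像序列
    primary.sort(key=lambda img: similarity(query_vec, img), reverse=True)
    n = max(len(primary) // 5, 10)
    shortlist = primary[:n]
    # 3) 与第一图像逐帧做特征匹配,按匹配数量由多到少取前M个图像
    shortlist.sort(key=lambda img: match_count(img), reverse=True)
    return shortlist[:M]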
在一个实现方式中,该从图像库中确定第一备选图像序列的方式可以如下:确定图像库中对应的视觉词向量与该第一图像对应的视觉词向量相似度(即相似性评分)最高的多个备选图像;将该多个备选图像分别与该第一图像做特征匹配,得到各备选图像与该第一图像相匹配的特征的数量;获取该多个备选图像中与该第一图像的特征匹配数量最多的该M个图像,得到该第一备选图像序列。
在一些实施例中,M为5。该图像库中任一图像对应一个视觉词向量,该图像库中的图像用于构建该目标设备采集该第一图像时所处的待定位场景的电子地图。
在一些实施例中,该确定该图像库中对应的视觉词向量与该第一图像对应的视觉词向量相似度最高的多个备选图像可以是:确定该图像库中与该第一图像对应至少一个相同的视觉单词的图像,得到多个初选图像;确定该多个初选图像中对应的视觉词向量与该第一图像的视觉词向量相似度最高的前百分之Q的图像,得到该多个备选图像;Q为大于0的实数。例如Q为10、15、20、30等。该图像库中任一图像对应至少一个视觉单词,该第一图像对应至少一个视觉单词。
在一些实施例中,视觉定位装置采用如下方式得到多个备选图像:利用词汇树将从该第一图像提取的特征转换为目标词向量;分别计算该目标词向量与该多个初选图像中各初选图像对应的视觉词向量的相似度;确定该多个初选图像中对应的视觉词向量与该目标词向量相似度最高的前百分之Q的图像,得到该多个备选图像。该词汇树为将从该待定位场景采集的训练图像中提取的特征进行聚类得到的。该多个初选图像中任一初选图像对应的视觉词向量为利用该词汇树由从该任一初选图像提取的特征得到的视觉词向量。
在一些实施例中,该将该多个备选图像分别与该第一图像做特征匹配,得到各备选图像与该第一图像相匹配的特征的数量可以是:根据词汇树将从该第一图像提取的第三特征分类至参考叶子节点;对该第三特征和第四特征做特征匹配,以得到与该第三特征相匹配的特征。该词汇树为将从该待定位场景采集的图像中提取的特征进行聚类得到的;该词汇树的最后一层的节点为叶子节点,每个叶子节点包含多个特征。该第四特征包含于该参考叶子节点且为从目标备选图像提取的特征,该目标备选图像包含于该第一备选图像序列。可以理解,若从第一图像提取的某个特征对应参考叶子节点(词汇树中任一叶子节点),视觉定位装置对该特征和从某个备选图像提取出的特征做特征匹配时,仅需对该特征和从该备选图像提取出的特征中对应该参考叶子节点的特征做特征匹配,而不需要对该特征与其他特征做特征匹配。
视觉定位装置可以预先存储有各视觉单词(即叶子节点)对应的图像索引以及特征索引。在一些实施例中,在每个视觉单词中添加相应的图像索引以及特征索引,这些索引用来加速特征匹配。举例来说,图像库中100个图像均对应某个视觉单词,则在该视觉单词中添加这100个图像的索引(即图像索引)以及这100图像中落在该视觉单词对应的叶子节点的特征的索引(即特征索引)。又举例来说,从第一图像提取出的参考特征落在参考节点,在对该参考特征和从多个备选图像提取出的特征做特征匹配时,先确定该多个备选图像中该参考节点的图像索引所指示的目标备选图像,根据特征索引确定该目标备选图像落在该参考节点的特征,对该参考特征与该目标备选图像中落在该参考节点的特征做匹配。采用这种方式减少特征匹配的运算量,大幅度提高特征匹配的速度。
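下面用一段示例性的Python代码草图示意在各视觉单词(叶子节点)中维护图像索引与特征索引、并利用这些索引加速特征匹配的做法(其中classify函数表示将一个特征沿词汇树分类到某个叶子节点,为示例性假设):

from collections import defaultdict

# word_id -> {image_id: [该图像中落在该视觉单词上的特征编号, ...]}
inverted_index = defaultdict(lambda: defaultdict(list))

def add_image(image_id, features, classify):
    """建库:为每个特征所属的视觉单词添加图像索引与特征索引"""
    for feat_id, feat in enumerate(features):
        inverted_index[classify(feat)][image_id].append(feat_id)

def candidate_matches(query_feat, candidate_ids, classify):
    """查询:仅取同一叶子节点中、且属于备选图像的特征作为待匹配对象,减少匹配运算量"""
    word_id = classify(query_feat)
    return {img: feats for img, feats in inverted_index[word_id].items()
            if img in candidate_ids}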
下面介绍如何利用词汇树将从第一图像提取的特征转换为目标词向量的方式。
该利用词汇树将从第一图像提取的特征转换为目标词向量包括:根据从第一图像提取的特征、目标视觉单词的权重以及该目标视觉单词对应的聚类中心,计算该目标视觉单词在该第一图像对应的目标权重;该目标词向量包括词汇树对应的各视觉单词在该第一图像对应的权重;该目标权重与该目标视觉单词的权重正相关。在该实现方式中,采用残差加权的方式计算词向量,考虑到落在同一视觉单词当中的特征的差异性,增加了区分性,很容易地接入TF-IDF(term frequency–inverse document frequency)框架中,能够提高图像检索以及特征匹配的速度。
在一些实施例中,采用如下公式利用词汇树将从该第一图像提取的特征转换为目标词向量:
$$W_{i}=\sum_{j=1}^{n} w_{j} \qquad 公式(1)$$
其中,W_i表示第i个视觉单词在该第一图像对应的权重;n表示从该第一图像提取出的特征中落在第i个视觉单词对应的节点上的特征的数量;w_j为其中第j个特征f_j对应的权重参数,该权重参数与第i个视觉单词本身的权重W_i^weight正相关,与特征f_j到第i个视觉单词的聚类中心c_i的汉明距离Dis(f_j,c_i)负相关。词汇树中的一个叶子节点对应一个视觉单词,该目标词向量包括该词汇树对应的各视觉单词在该第一图像对应的权重。该词汇树的一个节点对应一个聚类中心。举例来说,词汇树包括1000个叶子节点,每个叶子节点对应一个视觉单词,视觉定位装置需要计算每个视觉单词在该第一图像对应的权重,以得到该第一图像的目标词向量。在一些实施例中,视觉定位装置可计算该词汇树中各叶子节点对应的视觉单词在该第一图像对应的权重;将由该各叶子节点对应的视觉单词在该第一图像对应的权重组合成一个向量,得到该目标词向量。可以理解,可以采用相同的方式计算图像库中各图像对应的词向量,以得到上述各初选图像对应的视觉词向量。i和n均为大于1的整数。特征f_j为从该第一图像提取的任一特征,任一特征对应一个二进制串,即f_j为一个二进制字符串。每个视觉单词的聚类中心也对应一个二进制串,也就是说,c_i为一个二进制串。因此,可以计算特征f_j到第i个视觉单词的聚类中心c_i的汉明距离。汉明距离表示两个(相同长度)字对应位不同的数量。换句话说,它就是将一个字符串变换成另外一个字符串所需要替换的字符个数。例如:1011101与1001001之间的汉明距离是2。在一些实施例中,词汇树中各视觉单词本身的权重与其对应的节点包括的特征的数量负相关。在一些实施例中,若W_i不为0,则在第i个视觉单词中添加对应图像的索引,该索引用来加速图像的检索。
在一些实施例中,该根据从第一图像提取的特征、目标视觉单词的权重以及该目标视觉单词对应的聚类中心,计算该目标视觉单词在该第一图像对应的目标权重包括:利用词汇树对从该第一图像提取的特征进行分类,得到分类到目标叶子节点的中间特征;根据该中间特征、该目标视觉单词的权重以及该目标视觉单词对应的聚类中心,计算该目标视觉单词在该第一图像对应的该目标权重。其中,该目标叶子节点与该目标视觉单词相对应。从公式(1)可以看出,该目标权重为该中间特征包括的各特征对应的权重参数之和。举例来说,特征f_j对应的权重参数由该目标视觉单词本身的权重以及f_j到对应聚类中心的汉明距离共同确定:该权重参数与该目标视觉单词的权重正相关,与该汉明距离负相关。
该中间特征可以包括第一特征和第二特征;该第一特征与该聚类中心的汉明距离为第一距离,该第二特征与该聚类中心的汉明距离为第二距离;若该第一距离和该第二距离不同,则该第一特征对应的第一权重参数与该第二特征对应的第二权重参数不同。
在该实现方式中,采用残差加权的方式计算词向量,考虑到落在同一视觉单词当中的特征的差异性,增加了区分性,很容易的接入TF-IDF(term frequency–inverse document frequency)框架中,能够提高图像检索以及特征匹配的速度。
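下面用一段示例性的Python代码草图示意残差加权词向量的计算思路。需要说明的是,代码中采用的(1-汉明距离/256)乘以视觉单词权重这一具体加权形式仅为示例性假设,用于体现“权重参数与视觉单词本身的权重正相关、与特征到聚类中心的汉明距离负相关”的关系,并非对公式(1)具体形式的限定:

import numpy as np

def hamming(a, b):
    return bin(a ^ b).count("1")        # a、b为描述子二进制串的整数表示

def residual_weighted_vector(features, classify, centers, word_weights, num_words):
    vec = np.zeros(num_words)
    for f in features:
        i = classify(f)                                 # 特征落入的视觉单词编号
        residual = hamming(f, centers[i]) / 256.0       # 归一化汉明距离(假设256位描述子)
        vec[i] += word_weights[i] * (1.0 - residual)    # 示例性假设的权重参数形式
    return vec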
下面介绍基于单张图像进行定位的具体示例。图3为本公开实施例提供的另一种视觉定位方法,该方法可包括:
301、终端拍摄一幅目标图像。
该终端可以是手机以及其他具有摄像功能和/或拍照功能的设备。
302、终端采用ORB算法提取目标图像的ORB特征。
在一些实施例中,终端采用其他特征提取方式提取该目标图像的特征。
303、终端将从目标图像提取的ORB特征以及相机的内参传输至服务器。
步骤302至步骤303可以替代为:终端将目标图像以及相机的内参传输至服务器。这样可以由服务器提取该图像的ORB特征,以便于减少终端的计算量。在实际应用中,用户可以启动终端上的目标应用,通过该目标应用利用相机采集目标图像并将该目标图像传输至服务器。相机的内参可以是该终端的摄像头的内参。
304、服务器将ORB特征转换为中间词向量。
服务器将ORB特征转换为中间词向量的方式与前述实施例中利用词汇树将从第一图像提取的特征转换为目标词向量的方式相同,这里不再详述。
305、服务器根据中间词向量确定每个图像序列中与目标图像最相似的前H张图像,并得到每个图像序列中与该目标图像的相似性评分最高的前H张图像对应的相似性评分。
每个图像序列均包含于图像库,每个图像序列用于构建一个子点云地图,这些子点云地图组成待定位场景对应的点云地图。步骤305为查询图像库的每个图像序列中与目标图像最相似的前H张图像。H为大于1的整数,例如H为10。每个图像序列可以是采集该待定位场景的一个或多个区域得到的。服务器根据中间词向量计算每个图像序列中各图像与目标图像的相似性评分。相似性评分公式可以如下:
$$s(v_{1},v_{2})=1-\frac{1}{2}\left\|\frac{v_{1}}{\left\|v_{1}\right\|}-\frac{v_{2}}{\left\|v_{2}\right\|}\right\|_{1}$$
其中,s(v1,v2)表示视觉词向量v1和视觉词向量v2的相似性评分。视觉词向量v1可以是根据从目标图像提取的ORB特征,采用公式(1)计算得到的词向量;视觉词向量v2可以是根据从图像库中任一图像提取的ORB特征,采用公式(1)计算得到的词向量。假定词汇树包括L个叶子节点,每个叶子节点对应一个视觉单词,则v1=[W_1 W_2 ... W_L],其中,W_L表示第L个视觉单词在该目标图像对应的权重,L为大于1的整数。可以理解,视觉词向量v1和视觉词向量v2的维度相同。服务器可以存储有图像库中各图像对应的视觉词向量(对应于上述参考词向量)。每个图像对应的视觉词向量为根据该图像提取出的特征,采用公式(1)计算得到。可以理解,服务器仅需计算目标图像对应的视觉词向量,而不需要计算图像库中各图像序列包括的图像对应的视觉词向量。
在一些实施例中,服务器只查询和中间词向量有共同视觉单词的图像,即只根据中间词向量中非零项对应的叶子节点中的图像索引来比较相似度。也就是说,确定图像库中与目标图像对应至少一个相同的视觉单词的图像,得到多个初选图像;根据中间词向量查询该多个初选图像中与该目标图像最相似的前H帧图像。举例来说,若第i个视觉单词在目标图像对应的权重以及在某个初选图像对应的权重均不为0,则该目标图像与该初选图像均对应该第i个视觉单词。
306、服务器按照每个图像序列中与目标图像的相似性评分最高的前H张图像对应的相似性评分由高到低的排序,取出与该目标图像的相似性评分较高的多张图像作为备选图像。
在一些实施例中,图像库包括F个图像序列,取出(F×H)张图像中与该目标图像的相似性评分最高的前20%的图像作为备选图像。该(F×H)张图像包括每个图像序列中与该目标图像的相似性评分最高的前H张图像。如果该前20%的图像的个数小于10张,则直接取前10张图像。步骤306为筛选备选图像的操作。
307、服务器对备选图像中每一张图像与目标图像做特征匹配,并确定特征匹配的数量最多的前G张图像。
G为大于1的整数,例如G为5。在一些实施例中,先把目标图像的特征根据词汇树逐一分类到L层某个节点,分类方式为从根节点开始逐层选择与当前特征距离(汉明距离)最短的聚类中心点(树中的节点),对每个分类后的特征只与对应节点中存在着特征索引且其所属的图像为备选图像的特征进行匹配。这样可以加速特征匹配。步骤307是备选图像中每一张图像与目标图像做特征匹配的过程。因此,步骤307可以看作是两张图像做特征匹配的过程。
308、服务器获取参考图像序列中连续(2K+1)张图像。
该参考图像序列中的图像按照采集得到的先后顺序排序。该参考图像序列包括该前G张图像中的任一张图像,该(2K+1)张图像(对应局部点云地图)包括该任一张图像、该任一张图像的前K张图像以及该任一张图像的后K张图像。步骤308为确定局部点云地图的操作。
309、服务器确定从(2K+1)张图像提取的特征中与从目标图像提取的特征相匹配的多个特征。
该参考图像序列中连续(2K+1)张图像对应一个局部点云地图。因此,步骤309可以看作是目标图像与该局部点云地图的匹配操作,即图3中帧-局部点云地图匹配。在一些实施例中,先利用词汇树对从该(2K+1)张图像提取的特征进行分类,然后对从目标图像提取的特征进行相同的处理,只考虑落在同一个节点中的两部分的特征的匹配,这样可以加速特征匹配。其中,该两部分中一部分为从该目标图像提取的特征,另一部分为从该(2K+1)张图像提取的特征。
310、服务器根据多个特征、该多个特征在点云地图中对应的空间坐标点以及相机的内参,确定相机的位姿。
步骤310与图2中的步骤203相似,这里不再详述。在服务器执行步骤310,未成功确定相机的位姿的情况下,利用前G张图像中另一张图像重新执行步骤308至步骤310,直至成功确定该相机的位姿。举例来说,先根据前G张图像中第一张图像确定(2K+1)张图像,再利用该(2K+1)张图像确定相机的位姿;若未成功确定相机的位姿的情况下,根据前G张图像中第二张图像确定新的(2K+1)张图像,再利用新的(2K+1)张图像确定相机的位姿;重复执行上述操作,直至成功确定该相机的位姿。
311、服务器在成功确定相机的位姿的情况下,向终端发送相机的位置信息。
该位置信息可以包括该相机的三维位置以及该相机的方向。服务器在成功确定相机的位姿的情况下,可以根据转换矩阵与该相机的位姿,确定该相机的三维位置,并生成该位置信息。
312、服务器在未成功确定相机的位姿的情况下,执行步骤308。
服务器每次执行步骤308都需要根据前G张图像中的一张图像,确定连续(2K+1)张图像。应理解,服务器每次执行步骤308确定的连续(2K+1)张图像不同。
313、终端在电子地图中显示相机的位置。
在一些实施例中,终端在电子地图中显示相机的位置和方向。可以理解,相机(即摄像头)安装在终端上,该相机的位置即为该终端的位置。用户根据该相机的位置和方向,可以准确地、快速地确定自身所在的位置和方向。
本公开实施例中,终端和服务器协同工作,该终端采集图像以及提取特征,该服务器负责定位并向该终端发送定位结果(即位置信息);用户仅需利用终端向服务器发送一张图像就可以准确地确定自身所在的位置。
图3介绍了基于单张图像进行定位的具体示例。在实际应用中,服务器也可以根据终端发送的连续多帧图像或者连续多帧图像的特征进行定位。下面介绍基于连续多帧图像进行定位的具体示例。图4为本公开实施例提供的另一种视觉定位方法,如图4所示,该方法可包括:
401、服务器获得终端采集的连续多帧图像或者多组特征。
每组特征可以为从一帧图像提取出的特征,该多组特征依次为从连续多帧图像提取出的特征。该连续多帧图像按照采集得到的先后顺序排序。
402、服务器根据第一帧图像或者从该第一帧图像提取的特征,确定相机的位姿。
该第一帧图像为该连续多帧图像中的第一帧图像。步骤402对应于图3中基于单张图像进行定位的方法。也就是说,服务器可以采用图3中的方法,利用该第一帧图像确定相机的位姿。利用连续多帧图像中的第一帧图像进行定位和基于单张图像进行定位是一样的。也就是说,连续多帧定位中的第一帧定位和单张定位是一样的。若定位成功,则转入连续帧定位;若定位失败,则继续单张定位。
403、服务器在根据前一帧图像成功确定相机的位姿的情况下,确定目标图像序列中N帧连续的图像。
前一帧图像成功确定相机的位姿的情况指的是服务器执行步骤402成功确定该相机的位姿。该目标图像序列为前一帧图像成功定位出相机的位姿所使用的特征属于的图像序列。举例来说,服务器利用目标图像序列中某张图像的前K张图像、该张图像以及该张图像的后K张图像与前一帧图像做特征匹配,并利用相匹配的特征点成功定位相机的位姿;则服务器获取该目标图像序列中该张图像的前三十张图像、该张图像以及该张图像的后三十张图像,即连续的N帧图像。
404、服务器根据目标图像序列中N帧连续的图像,确定相机的位姿。
步骤404对应于图3中的步骤308至步骤310。
405、服务器在根据前一帧图像未成功确定相机的位姿的情况下,确定多张备选图像。
该多张备选图像为服务器根据前一帧图像确定的备选图像。也就是说,在根据前一帧图像未成功确定相机的位姿的情况下,服务器可以将前一帧的备选图像作为当前帧图像的备选图像。这样可以减少图像检索的步骤,节省时间。
406、服务器根据前一帧图像的备选图像,确定相机的位姿。
步骤406对应于图3中的步骤307至步骤310。
服务器进入连续帧定位后,主要是利用前一帧定位成功的先验知识,推导出与当前帧相匹配的图像有大概率是在上一次定位成功的图像附近。这样就可以在上一次定位成功的图像附近开启一个窗口,优先考虑落在该窗口中的那些帧图像。窗口大小可以至多为61帧,前后各三十帧,不足三十帧的则截断。若定位成功,则将窗口传递下去;若定位不成功,则按照单帧的备选图像进行定位。本公开实施例中,采用连续帧滑动窗口机制,利用时序上的连贯信息,有效地减少计算量,可以提升定位成功率。
本公开实施例中,服务器进行连续帧定位时,可以利用前一帧定位成功的先验知识,来加速后续的定位操作。
图4介绍了连续帧定位,下面介绍连续帧定位的一种应用实施例。图5为本公开实施例提供的一种定位导航方法,如图5所示,该方法可包括:
501、终端启动目标应用。
该目标应用为实现室内的精确定位专门开发的应用。在实际应用中,用户点击目标应用在终端的屏幕上对应的图标后,启动该目标应用。
502、终端通过目标界面接收用户输入的目的地址。
该目标界面为终端启动该目标应用后,该终端的屏幕显示的界面,即该目标应用的界面。该目的地址可以是餐馆、咖啡厅、电影院等。
503、终端显示当前采集到的图像,并将采集到的图像或从采集到的图像提取的特征传输至服务器。
终端接收用户输入的目的地址后,可以实时或接近实时的通过相机(即该终端上的摄像头)采集周围环境的图像,并按照固定间隔将采集的图像传输至服务器。在一些实施例中,终端提取采集的图像的特征,并按照固定间隔将提取的特征传输至服务器。
504、服务器根据接收到的图像或特征,确定相机的位姿。
步骤504对应于图4中的步骤401至步骤406。也就是说,服务器采用图4中的定位方法,根据接收到每一帧图像或每一帧图像的特征,确定相机的位姿。可以理解,服务器可以根据终端发送的图像序列或特征序列,依次确定相机的位姿,进而确定该相机的位置。也就是说,服务器可以实时或接近实时的确定相机的位姿。
505、服务器根据转换矩阵和相机的位姿,确定该相机的三维位置。
其中,该转换矩阵为通过变换点云地图的角度和位置,将该点云地图的轮廓和室内平面图对齐得到的。具体的,将旋转矩阵R和平移向量t拼成4*4的矩阵

$$T'=\begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix}$$

采用转换矩阵T_i左乘该矩阵T′得到新矩阵

$$T=T_{i}\,T'$$

将T表示为

$$T=\begin{bmatrix} R^{*} & t^{*} \\ 0^{T} & 1 \end{bmatrix}$$

t*即为相机最后的三维位置。
506、服务器向终端发送位置信息。
该位置信息可以包括该相机的三维位置、该相机的方向以及标记信息。该标记信息指示用户从当前位置达到目标地址所需行走的路线。在一些实施例中,标记信息仅指示目标距离内的路线,该目标距离为与当前显示图像中道路的最远距离,该目标距离可以是10米、20米、50米等。服务器在成功确定相机的位姿的情况下,可以根据转换矩阵与该相机的位姿,确定该相机的三维位置。服务器在执行步骤506之前,可以根据该相机的位置、目的地址以及电子地图,生成该标记信息。
507、终端实时显示采集的图像,并显示指示用户达到目的地址的标记。
举例来说,用户在商场中迷路或者想要去某个店,该用户启动手机上的目标应用,并输入需要到达的目的地址;该用户举起手机对着前方采集图像,该手机实时显示采集的图像,并显示指示该用户达到目的地址的标记,例如箭头。
本公开实施例中,服务器可以准确地定位相机的位置,并向用户提供导航信息,该用户可以根据指引,快速地达到目标地址。
前述实施例中,服务器确定相机的位姿需要用到点云地图。下面介绍一种构建点云地图的具体举例。图6为本公开实施例提供的一种构建点云地图的方法。如图6所示,该方法可包括:
601、服务器获取多个视频序列。
用户可以对待定位场景划分区域,对每个区域采集多角度的视频序列,每个区域至少需要正反两个方向的视频序列。该多个视频序列为对待定位场景中每个区域从多角度进行拍摄得到的视频序列。
602、服务器对多个视频序列中每个视频序列按照目标帧率提取图像,以得到多个图像序列。
服务器按照目标帧率提取一个视频序列可以得到一个图像序列。该目标帧率可以是30帧/秒。每个图像序列用于构建一个子点云地图。
603、服务器利用各图像序列构建出点云地图。
服务器可以采用SFM算法利用每个图像序列构建一个子点云地图,所有的子点云地图组成该点云地图。
本公开实施例中,将待定位场景划分为多个区域,分区域构建子点云地图。这样当待定位场景中某个区域发生变动后,仅需采集该区域的视频序列来构建该区域的子点云地图,而不用重新构建整个待定位场景的点云地图;可以有效减少工作量。
服务器在获得用于构建待定位场景的点云地图的多个图像序列后,可以将该多个图像序列存储至图像库,并利用词汇树确定该多个图像序列中各图像对应的视觉词向量。服务器可以存储该多个图像序列中各图像对应的视觉词向量。在一些实施例中,在词汇树包括的各视觉单词中添加对应图像的索引。举例来说,词汇树中某个视觉单词在图像库中的某个图像对应的权重不为0,则在该视觉单词中添加该图像的索引。在一些实施例中,服务器在词汇树包括的各视觉单词中添加对应图像的索引以及特征索引。服务器可以利用词汇树将每个图像的每个特征分类至叶子节点,每个叶子节点对应一个视觉单词。举例来说,从各图像序列中的图像提取的特征中有100个特征落在某个叶子节点,则在该叶子节点对应的视觉单词中添加该100个特征的特征索引。该特征索引指示该100个特征。
以下提供一种基于图像序列和第一图像定位相机的目标位姿的具体示例,可包括:基于所述图像库,确定基于所述第一图像序列建立的子点云地图,其中,子点云地图包括:3D坐标及与所述3D坐标对应的3D描述子;确定所述第一图像的2D坐标及所述2D坐标对应的2D描述子;将所述2D坐标和所述2D描述子,与所述3D坐标和3D描述子进行匹配;根据匹配成功的所述2D坐标和2D描述子与3D坐标和3D描述子之间的转换关系,确定出第一位姿或第二位姿等,可用于定位相机的位姿。该3D描述子可为3D坐标的描述信息,包括:该3D坐标相邻的坐标和/或相邻坐标的属性信息。2D描述子可为2D坐标的描述信息。例如,使用PnP算法利用上述转换关系,确定出相机的第一位姿或第二位姿。
图7为本公开实施例提供的一种视觉定位装置的结构示意图,如图7所示,该视觉定位装置可包括:
筛选单元701,配置为从图像库中确定第一备选图像序列;该图像库用于构建电子地图,该第一备选图像序列中的各帧图像按照与第一图像的匹配度顺序排列,该第一图像为相机采集的图像;
筛选单元701,还配置为根据目标窗口调整该第一备选图像序列中各帧图像的顺序,得到第二备选图像序列;该目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,该目标帧图像为该图像库中与第二图像相匹配的图像,该第二图像为该相机在采集到第一图像之前所采集的图像;
确定单元702,配置为根据该第二备选图像序列确定该相机在采集该第一图像时的目标位姿。
在一些实施例中,确定单元702,配置为根据第一图像序列和该第一图像,确定该相机的第一位姿;该第一图像序列包括该图像库中与第一参考帧图像相邻的连续多帧图像,该第一参考帧图像包含于该第二备选图像序列;
在确定根据该第一位姿成功定位该相机的位置的情况下,确定该第一位姿为该目标位姿。
在一些实施例中,确定单元702,配置为在确定根据该第一位姿未成功定位该相机的位置的情况下,根据第二图像序列和该第一图像,确定该相机的第二位姿;该第二图像序列包括该图像库中与第二参考帧图像相邻的连续多帧图像,该第二参考帧图像为该第二备选图像序列中该第一参考帧图像的后一帧图像或前一帧图像;在确定根据该第二位姿成功定位该相机的位置的情况下,确定该第二位姿为该目标位姿。
在一些实施例中,确定单元702,配置为从该第一图像序列中各图像提取的特征中,确定与从该第一图像提取的特征相匹配的F个特征,F为大于0的整数;
根据该F个特征、该F个特征在点云地图中对应的空间坐标点以及该相机的内参,确定该第一位姿;该点云地图为待定位场景的电子地图,该待定位场景为该相机采集该第一图像时所处的场景。
在一些实施例中,筛选单元701,配置为在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从低到高的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最后位置;
在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从高到低的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最前位置。
在一些实施例中,筛选单元701,配置为在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从低到高的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最后位置;在该第一备选图像序列中的各帧图像按照与该第一图像的匹配度从高到低的顺序排列的情况下,将该第一备选图像序列中位于该目标窗口的图像调整至该第一备选图像序列最前位置。
在一些实施例中,筛选单元701,配置为确定该图像库中与该第一图像对应至少一个相同视觉单词的图像,得到多个初选图像;该图像库中任一图像对应至少一个视觉单词,该第一图像对应至少一个视觉单词;确定该多个初选图像中对应的视觉词向量与该第一图像的视觉词向量相似度最高的多个备选图像。
在一些实施例中,筛选单元701,配置为确定该多个初选图像中对应的视觉词向量与该第一图像的视觉词向量相似度最高的前百分之Q的图像,得到该多个备选图像;Q为大于0的实数。
在一些实施例中,筛选单元701,配置为利用词汇树将从该第一图像提取的特征转换为目标词向量;该词汇树为将从该待定位场景采集的训练图像中提取的特征进行聚类得到的;
分别计算该目标词向量与该多个初选图像中各初选图像对应的视觉词向量的相似度;该多个初选图像中任一初选图像对应的视觉词向量为利用该词汇树由从该任一初选图像提取的特征得到的视觉词向量;
确定该多个初选图像中对应的视觉词向量与该目标词向量相似度最高的多个备选图像。
在一些实施例中,该词汇树中的一个叶子节点对应一个视觉单词,该词汇树中最后一层的节点为叶子节点;
筛选单元701,配置为计算该词汇树中各叶子节点对应的视觉单词在该第一图像对应的权重;将由该各叶子节点对应的视觉单词在该第一图像对应的权重组合成一个向量,得到该目标词向量。
在一些实施例中,该词汇树的一个节点对应一个聚类中心;
筛选单元701,配置为利用该词汇树对从该第一图像提取的特征进行分类,得到分类到目标叶子节点的中间特征;该目标叶子节点为该词汇树中的任意一个叶子节点,该目标叶子节点与目标视觉单词相对应;
根据该中间特征、该目标视觉单词的权重以及该目标视觉单词对应的聚类中心,计算该目标视觉单词在该第一图像对应的目标权重;该目标权重与该目标视觉单词的权重正相关,该目标视觉单词的权重为根据生成该词汇树时该目标视觉单词对应的特征数量确定的。
在一些实施例中,筛选单元701,配置为根据词汇树将从该第一图像提取的第三特征分类至叶子节点;该词汇树为将从该待定位场景采集的图像中提取的特征进行聚类得到的;该词汇树的最后一层的节点为叶子节点,每个叶子节点包含多个特征;
对各该叶子节点中的该第三特征和第四特征做特征匹配,以得到各该叶子节点中与该第三特征相匹配的第四特征;该第四特征为从目标备选图像提取的特征,该目标备选图像包含于该第一备选图像序列中的任一图像;
根据各该叶子节点中与该第三特征相匹配的第四特征,得到该目标备选图像与该第一图像相匹配的特征的数量。
在一些实施例中,确定单元702,还配置为根据转换矩阵和该第一位姿,确定该相机的三维位置;该转换矩阵为通过变换该点云地图的角度和位置,将该点云地图的轮廓和室内平面图对齐得到的。
在一些实施例中,确定单元702,配置为确定L对特征点的位置关系均符合该第一位姿,每对特征点中的一个特征点是从该第一图像提取的,另一个特征点是从该第一图像序列中的图像提取的,L为大于1的整数。
在一些实施例中,该装置还包括:
第一获取单元703,配置为获得多个图像序列,每个图像序列为采集待定位场景中的一个区域或多个区域得到的;
地图构建单元704,配置为根据该多个图像序列,构建该点云地图;其中,该多个图像序列中任一图像序列用于构建一个或多个区域的子点云地图;该点云地图包括该第一电子地图和该第二电子地图。
在一些实施例中,该装置还包括:
第二获取单元705,配置为获得拍摄该待定位场景得到的多张训练图像;
特征提取单元706,配置为对该多张训练图像进行特征提取,以得到训练特征集;
聚类单元707,配置为对该训练特征集中的特征进行多次聚类,得到该词汇树。第二获取单元705 和第一获取单元703可以是同一单元,也可以是不同的单元。
在一些实施例中,该视觉定位装置为服务器,该装置还包括:
接收单元708,配置为接收来自目标设备的该第一图像,该目标设备安装有该相机。
在一些实施例中,该装置还包括:
发送单元709,配置为将该相机的位置信息发送至该目标设备。
图8为本公开实施例提供的一种终端的结构示意图,如图8所示,该终端可包括:
摄像头801,配置为采集目标图像;
发送单元802,配置为向服务器发送目标信息,该目标信息包括该目标图像或从该目标图像提取出的特征序列,以及该摄像头的内参;
接收单元803,配置为接收位置信息;该位置信息用于指示该相机的位置和方向;该位置信息为该服务器根据第二备选图像序列确定的该相机采集该目标图像时的位置的信息;该第二备选图像序列为该服务器根据目标窗口调整第一备选图像序列中各帧图像的顺序得到的,该目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,该图像库用于构建电子地图,该目标帧图像为该图像库中与第二图像相匹配的图像,该第二图像为该相机在采集到第一图像之前所采集的图像,该第一备选图像序列中的各帧图像按照与该第一图像的匹配度顺序排列;
显示单元804,配置为显示电子地图,该电子地图中包含摄像头的位置和方向。
在一些实施例中,该终端还包括:特征提取单元805,用于提取该目标图像中的特征。
该位置信息可以包括该摄像头的三维位置以及该摄像头的方向。摄像头801可具体用于执行步骤301中所提到的方法以及可以等同替换的方法;特征提取单元805可具体用于执行步骤302中所提到的方法以及可以等同替换的方法;发送单元802可具体用于执行步骤303中所提到的方法以及可以等同替换的方法;显示单元804具体用于执行步骤313和步骤507中所提到的方法以及可以等同替换的方法。可以理解,图8中的终端可以实现图3以及图5中的终端所执行的操作。
应理解以上视觉定位装置和终端中的各个单元的划分仅仅是一种逻辑功能的划分,实际实现时可以全部或部分集成到一个物理实体上,也可以物理上分开。例如,以上各个单元可以为单独设立的处理元件,也可以集成同一个芯片中实现,此外,也可以以程序代码的形式存储于控制器的存储元件中,由处理器的某一个处理元件调用并执行以上各个单元的功能。此外各个单元可以集成在一起,也可以独立实现。这里的处理元件可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤或以上各个单元可以通过处理器元件中的硬件的集成逻辑电路或者软件形式的指令完成。该处理元件可以是通用处理器,例如中央处理器(英文:central processing unit,简称:CPU),还可以是被配置成实施以上方法的一个或多个集成电路,例如:一个或多个特定集成电路(英文:application-specific integrated circuit,简称:ASIC),或,一个或多个微处理器(英文:digital signal processor,简称:DSP),或,一个或者多个现场可编程门阵列(英文:field-programmable gate array,简称:FPGA)等。
参见图9,是本公开实施例提供的另一种终端结构示意图。如图9所示的本实施例中的终端可以包括:一个或多个处理器901、存储器902、收发器903、摄像头904以及输入输出设备905。上述处理器901、收发器903、存储器902、摄像头904以及输入输出设备905通过总线906连接。存储器902用于存储指令,处理器901用于执行存储器902存储的指令。收发器903用于接收和发送数据。摄像头904用于采集图像。其中,处理器901用于控制收发器903、摄像头904以及输入输出设备905,来 实现图3以及图5中的终端所执行的操作。
应当理解,在本公开实施例中,所称处理器901可以是中央处理单元(Central Processing Unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
该存储器902可以包括只读存储器和随机存取存储器,并向处理器901提供指令和数据。存储器902的一部分还可以包括非易失性随机存取存储器。例如,存储器902还可以存储设备类型的信息。
具体实现中,本公开实施例中所描述的处理器901、存储器902、收发器903、摄像头904以及输入输出设备905可执行前述任一实施例所描述的终端的实现方式,在此不再赘述。具体的,收发器903可实现发送单元802和接收单元803的功能。处理器901可实现特征提取单元805的功能。输入输出设备905用于实现显示单元804的功能,输入输出设备905可以是显示屏。
图10是本公开实施例提供的一种服务器结构示意图,该服务器1100可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1022(例如,一个或一个以上处理器)和存储器1032,一个或一个以上存储应用程序1042或数据1044的存储介质1030(例如一个或一个以上海量存储设备)。其中,存储器1032和存储介质1030可以是短暂存储或持久存储。存储在存储介质1030的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1022可以设置为与存储介质1030通信,在服务器1100上执行存储介质1030中的一系列指令操作。
服务器1100还可以包括一个或一个以上电源1026,一个或一个以上有线或无线网络接口1050,一个或一个以上输入输出接口1058,和/或,一个或一个以上操作系统1041,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的步骤可以基于该图10所示的服务器结构。具体的,输入输出接口1058可实现接收单元708以及发送单元709的功能。中央处理器1022可实现筛选单元701、确定单元702、第一获取单元703、地图构建单元704、第二获取单元705、特征提取单元706、聚类单元707的功能。
在本公开的实施例中提供一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序被处理器执行时实现:从图像库中确定第一备选图像序列;所述图像库用于构建电子地图,所述第一备选图像序列中的各帧图像按照与第一图像的匹配度顺序排列,所述第一图像为相机采集的图像;根据目标窗口调整所述第一备选图像序列中各帧图像的顺序,得到第二备选图像序列;所述目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像;根据所述第二备选图像序列确定所述相机在采集所述第一图像时的目标位姿。
在本公开的实施例中提供了另一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序被处理器执行时实现:通过相机采集目标图像;向服务器发送目标信息,所述目标信息包括所述目标图像或从所述目标图像提取出的特征序列,以及所述相机的内参;接收位置信息,所述位置信息用于指示所述相机的位置和方向;所述位置信息为所述服务器根据第二备选图像序列确定的所述相机采集所述目标图像时的位置的信息;所述第二备选图像序列为所述服务器根据目标 窗口调整第一备选图像序列中各帧图像的顺序得到的,所述目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,所述图像库用于构建电子地图,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像,所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度顺序排列;显示电子地图,所述电子地图中包含所述相机的位置和方向。以上所述,仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以权利要求的保护范围为准。

Claims (46)

  1. 一种视觉定位方法,包括:
    从图像库中确定第一备选图像序列;所述第一备选图像序列中的各帧图像按照与第一图像的匹配度顺序排列,所述第一图像为相机采集的图像;
    根据目标窗口调整所述第一备选图像序列中各帧图像的顺序,得到第二备选图像序列;所述目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像;
    根据所述第二备选图像序列确定所述相机在采集所述第一图像时的目标位姿。
  2. 根据权利要求1所述的方法,其中,所述根据所述第二备选图像序列确定所述相机在采集所述第一图像时的目标位姿包括:
    根据第一图像序列和所述第一图像,确定第一位姿;所述第一图像序列包括所述图像库中与第一参考帧图像相邻的连续多帧图像,所述第一参考帧图像包含于所述第二备选序列;
    在根据所述第一位姿成功定位所述相机的位置的情况下,确定所述第一位姿为所述目标位姿。
  3. 根据权利要求2所述的方法,其中,所述根据第一图像序列和所述第一图像,确定第一位姿之后,所述方法还包括:
    在根据所述第一位姿未成功定位所述相机的位置的情况下,根据第二图像序列和所述第一图像,确定第二位姿;所述第二图像序列包括所述图像库中与第二参考帧图像相邻的连续多帧图像,所述第二参考帧图像为所述第二备选图像序列中所述第一参考帧图像的后一帧图像或前一帧图像;
    在根据所述第二位姿成功定位所述相机的位置的情况下,确定所述第二位姿为所述目标位姿。
  4. 根据权利要求2或3所述的方法,其中,所述根据第一图像序列和所述第一图像,确定第一位姿包括:
    从所述第一图像序列中各图像提取的特征中,确定与从所述第一图像提取的特征相匹配的F个特征,F为大于0的整数;
    根据所述F个特征、所述F个特征在点云地图中对应的空间坐标点以及所述相机的内参,确定所述第一位姿;所述点云地图为待定位场景的电子地图,所述待定位场景为所述相机采集所述第一图像时所处的场景。
  5. 根据权利要求1至4任一项所述的方法,其中,所述根据目标窗口调整第一备选图像序列中各帧图像的顺序,得到第二备选图像序列包括:
    在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从低到高的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最后位置;
    在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从高到低的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最前位置。
  6. 根据权利要求5所述的方法,其中,所述从图像库中确定第一备选图像序列包括:
    确定所述图像库中对应的视觉词向量与所述第一图像对应的视觉词向量相似度最高的多个备选 图像;所述图像库中任一图像对应一个视觉词向量,所述图像库中的图像用于构建所述目标设备采集所述第一图像时所处的待定位场景的电子地图;
    将所述多个备选图像分别与所述第一图像做特征匹配,得到各备选图像与所述第一图像相匹配的特征的数量;
    获取所述多个备选图像中与所述第一图像的特征匹配数量最多的M个图像,得到所述第一备选图像序列。
  7. 根据权利要求6所述的方法,其中,所述确定所述图像库中对应的视觉词向量与所述第一图像对应的视觉词向量相似度最高的多个备选图像包括:
    确定所述图像库中与所述第一图像对应至少一个相同视觉单词的图像,得到多个初选图像;所述图像库中任一图像对应至少一个视觉单词,所述第一图像对应至少一个视觉单词;
    确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的多个备选图像。
  8. 根据权利要求7所述的方法,其中,所述确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的多个备选图像包括:
    确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的前百分之Q的图像,得到所述多个备选图像;Q为大于0的实数。
  9. 根据权利要求7或8所述的方法,其中,所述确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的多个备选图像包括:
    利用词汇树将从所述第一图像提取的特征转换为目标词向量;所述词汇树为将从所述待定位场景采集的训练图像中提取的特征进行聚类得到的;
    分别计算所述目标词向量与所述多个初选图像中各初选图像对应的视觉词向量的相似度;所述多个初选图像中任一初选图像对应的视觉词向量为利用所述词汇树由从所述任一初选图像提取的特征得到的视觉词向量;
    确定所述多个初选图像中对应的视觉词向量与所述目标词向量相似度最高的多个备选图像。
  10. 根据权利要求9所述的方法,其中,所述词汇树中的每一个叶子节点对应一个视觉单词,所述词汇树中最后一层的节点为叶子节点;所述利用词汇树将从所述第一图像提取的特征转换为目标词向量包括:
    计算所述词汇树中各叶子节点对应的视觉单词在所述第一图像对应的权重;
    将由所述各叶子节点对应的视觉单词在所述第一图像对应的权重组合成一个向量,得到所述目标词向量。
  11. 根据权利要求10所述的方法,其中,所述词汇树的每一个节点对应一个聚类中心;所述计算所述词汇树对应的各视觉单词在所述第一图像对应的权重包括:
    利用所述词汇树对从所述第一图像提取的特征进行分类,得到分类到目标叶子节点的中间特征;所述目标叶子节点为所述词汇树中的任意一个叶子节点,所述目标叶子节点与目标视觉单词相对应;
    根据所述中间特征、所述目标视觉单词的权重以及所述目标视觉单词对应的聚类中心,计算所 述目标视觉单词在所述第一图像对应的目标权重;所述目标权重与所述目标视觉单词的权重正相关,所述目标视觉单词的权重为根据生成所述词汇树时所述目标视觉单词对应的特征数量确定的。
  12. 根据权利要求11所述的方法,其中,所述中间特征包括至少一个子特征;所述目标权重为所述中间特征包括的各子特征对应的权重参数之和;所述子特征对应的权重参数与特征距离负相关,所述特征距离为所述子特征与对应的聚类中心的汉明距离。
  13. 根据权利要求6至12任一项所述的方法,其中,所述将所述多个备选图像分别与所述第一图像做特征匹配,得到各备选图像与所述第一图像相匹配的特征的数量包括:
    根据词汇树将从所述第一图像提取的第三特征分类至叶子节点;所述词汇树为将从所述待定位场景采集的图像中提取的特征进行聚类得到的;所述词汇树的最后一层的节点为叶子节点,每个叶子节点包含多个特征;
    对各所述叶子节点中的所述第三特征和第四特征做特征匹配,以得到各所述叶子节点中与所述第三特征相匹配的第四特征;所述第四特征为从目标备选图像提取的特征,所述目标备选图像包含于所述第一备选图像序列中的任一图像;
    根据各所述叶子节点中与所述第三特征相匹配的第四特征,得到所述目标备选图像与所述第一图像相匹配的特征的数量。
  14. 根据权利要求4至13任一项所述的方法,其中,所述根据所述F个特征、所述F个特征在点云地图中对应的空间坐标点以及所述相机的内参,确定所述第一位姿之后,所述方法还包括:
    根据转换矩阵和所述第一位姿,确定所述相机的三维位置;所述转换矩阵为通过变换所述点云地图的角度和位置,将所述点云地图的轮廓和室内平面图对齐得到的。
  15. 根据权利要求1至14所述的方法,其中,所述确定所述第一位姿成功定位所述相机的位置的情况包括:确定L对特征点的位置关系均符合所述第一位姿,每对特征点中的一个特征点是从所述第一图像提取的,另一个特征点是从所述第一图像序列中的图像提取的,L为大于1的整数。
  16. 根据权利要求2至15所述的方法,其中,所述根据第一图像序列和所述第一图像,确定第一位姿之前,所述方法还包括:
    获得多个图像序列,每个图像序列为采集待定位场景中的一个区域或多个区域得到的;
    根据所述多个图像序列,构建所述点云地图;其中,所述多个图像序列中任一图像序列用于构建一个或多个区域的子点云地图;所述点云地图包括所述第一电子地图和所述第二电子地图。
  17. 根据权利要求9至16任一项所述的方法,其中,所述利用词汇树将从所述第一图像提取的特征转换为目标词向量之前,所述方法还包括:
    获得拍摄所述待定位场景得到的多张训练图像;
    对所述多张训练图像进行特征提取,以得到训练特征集;
    对所述训练特征集中的特征进行多次聚类,得到所述词汇树。
  18. 根据权利要求1至17所述的方法,其中,所述视觉定位方法应用于服务器;所述从图像库中确定第一备选图像序列之前,所述方法还包括:
    接收来自目标设备的所述第一图像,所述目标设备安装有所述相机。
  19. 根据权利要求18所述的方法,其中,所述确定所述第一位姿成功定位所述相机的位置的情况之后,所述方法还包括:
    将所述相机的位置信息发送至目标设备。
  20. 根据权利要求1至17所述的方法,其中,所述视觉定位方法应用于安装有所述相机的电子设备。
  21. 一种视觉定位方法,包括:
    通过相机采集目标图像;
    向服务器发送目标信息,所述目标信息包括所述目标图像或从所述目标图像提取出的特征序列,以及所述相机的内参;
    接收位置信息,所述位置信息用于指示所述相机的位置和方向;所述位置信息为所述服务器根据第二备选图像序列确定的所述相机采集所述目标图像时的位置的信息;所述第二备选图像序列为所述服务器根据目标窗口调整第一备选图像序列中各帧图像的顺序得到的,所述目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,所述图像库用于构建电子地图,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像,所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度顺序排列;
    显示电子地图,所述电子地图中包含所述相机的位置和方向。
  22. 一种视觉定位装置,其中,包括:
    筛选单元,配置为从图像库中确定第一备选图像序列;所述第一备选图像序列中的各帧图像按照与第一图像的匹配度顺序排列,所述第一图像为相机采集的图像;
    所述筛选单元,还配置为根据目标窗口调整所述第一备选图像序列中各帧图像的顺序,得到第二备选图像序列;所述目标窗口包括从图像库中确定的包含目标帧图像的连续多帧图像,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像;
    确定单元,用于根据所述第二备选图像序列确定所述相机在采集所述第一图像时的目标位姿。
  23. 根据权利要求22所述的装置,其中,所述确定单元,具体用于根据第一图像序列和所述第一图像,确定第一位姿;所述第一图像序列包括所述图像库中与第一参考帧图像相邻的连续多帧图像,所述第一参考帧图像包含于所述第二备选图像序列;
    在根据所述第一位姿成功定位所述相机的位置的情况下,确定所述第一位姿为所述目标位姿。
  24. 根据权利要求23所述的装置,其中,所述确定单元,还配置为在根据所述第一位姿未成功定位所述相机的位置的情况下,根据第二图像序列和所述第一图像,确定所述相机的第二位姿;所述第二图像序列包括所述图像库中与第二参考帧图像相邻的连续多帧图像,所述第二参考帧图像为所述第二备选图像序列中所述第一参考帧图像的后一帧图像或前一帧图像;在根据所述第二位姿成功定位所述相机的位置的情况下,确定所述第二位姿为所述目标位姿。
  25. 根据权利要求23或24所述的装置,其中,所述确定单元,配置为从所述第一图像序列中各图像提取的特征中,确定与从所述第一图像提取的特征相匹配的F个特征,F为大于0的整数;根 据所述F个特征、所述F个特征在点云地图中对应的空间坐标点以及所述相机的内参,确定所述第一位姿;所述点云地图为待定位场景的电子地图,所述待定位场景为所述相机采集所述第一图像时所处的场景。
  26. 根据权利要求22至25任一项所述的装置,其中,所述筛选单元,配置为在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从低到高的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最后位置;在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从高到低的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最前位置。
  27. 根据权利要求26所述的装置,其中,
    所述筛选单元,配置为在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从低到高的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最后位置;
    在所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度从高到低的顺序排列的情况下,将所述第一备选图像序列中位于所述目标窗口的图像调整至所述第一备选图像序列最前位置。
  28. 根据权利要求27所述的装置,其中,所述筛选单元,配置为确定所述图像库中与所述第一图像对应至少一个相同视觉单词的图像,得到多个初选图像;所述图像库中任一图像对应至少一个视觉单词,所述第一图像对应至少一个视觉单词;
    确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的多个备选图像。
  29. 根据权利要求28所述的装置,其中,所述筛选单元,配置为确定所述多个初选图像中对应的视觉词向量与所述第一图像的视觉词向量相似度最高的前百分之Q的图像,得到所述多个备选图像;Q为大于0的实数。
  30. 根据权利要求28或29所述的装置,其中,所述筛选单元,配置为利用词汇树将从所述第一图像提取的特征转换为目标词向量;所述词汇树为将从所述待定位场景采集的训练图像中提取的特征进行聚类得到的;
    分别计算所述目标词向量与所述多个初选图像中各初选图像对应的视觉词向量的相似度;所述多个初选图像中任一初选图像对应的视觉词向量为利用所述词汇树由从所述任一初选图像提取的特征得到的视觉词向量;
    确定所述多个初选图像中对应的视觉词向量与所述目标词向量相似度最高的多个备选图像。
  31. 根据权利要求30所述的装置,其中,所述词汇树中的每一个叶子节点对应一个视觉单词,所述词汇树中最后一层的节点为叶子节点;
    所述筛选单元,配置为计算所述词汇树中各叶子节点对应的视觉单词在所述第一图像对应的权重;
    将由所述各叶子节点对应的视觉单词在所述第一图像对应的权重组合成一个向量,得到所述目标词向量。
  32. 根据权利要求31所述的装置,其中,所述词汇树的一个节点对应一个聚类中心;
    所述筛选单元,配置为利用所述词汇树对从所述第一图像提取的特征进行分类,得到分类到目标叶子节点的中间特征;所述目标叶子节点为所述词汇树中的任意一个叶子节点,所述目标叶子节点与目标视觉单词相对应;
    根据所述中间特征、所述目标视觉单词的权重以及所述目标视觉单词对应的聚类中心,计算所述目标视觉单词在所述第一图像对应的目标权重;所述目标权重与所述目标视觉单词的权重正相关,所述目标视觉单词的权重为根据生成所述词汇树时所述目标视觉单词对应的特征数量确定的。
  33. 根据权利要求32所述的装置,其中,所述中间特征包括至少一个子特征;所述目标权重为所述中间特征包括的各子特征对应的权重参数之和;所述子特征对应的权重参数与特征距离负相关,所述特征距离为所述子特征与对应的聚类中心的汉明距离。
  34. 根据权利要求27至33任一项所述的装置,其中,所述筛选单元,配置为根据词汇树将从所述第一图像提取的第三特征分类至叶子节点;所述词汇树为将从所述待定位场景采集的图像中提取的特征进行聚类得到的;所述词汇树的最后一层的节点为叶子节点,每个叶子节点包含多个特征;
    对各所述叶子节点中的所述第三特征和第四特征做特征匹配,以得到各所述叶子节点中与所述第三特征相匹配的第四特征;所述第四特征为从目标备选图像提取的特征,所述目标备选图像包含于所述第一备选图像序列中的任一图像;
    根据各所述叶子节点中与所述第三特征相匹配的第四特征,得到所述目标备选图像与所述第一图像相匹配的特征的数量。
  35. 根据权利要求25至34任一项所述的装置,其中,所述确定单元,还配置为根据转换矩阵和所述第一位姿,确定所述相机的三维位置;所述转换矩阵为通过变换所述点云地图的角度和位置,将所述点云地图的轮廓和室内平面图对齐得到的。
  36. 根据权利要求22至35任一项所述的装置,其中,
    所述确定单元,配置为确定L对特征点的位置关系均符合所述第一位姿,每对特征点中的一个特征点是从所述第一图像提取的,另一个特征点是从所述第一图像序列中的图像提取的,L为大于1的整数。
  37. 根据权利要求23至36任一项所述的装置,其中,所述装置还包括:
    第一获取单元,配置为获得多个图像序列,每个图像序列为采集待定位场景中的一个区域或多个区域得到的;
    地图构建单元,配置为根据所述多个图像序列,构建所述点云地图;其中,所述多个图像序列中任一图像序列用于构建一个或多个区域的子点云地图;所述点云地图包括所述第一电子地图和所述第二电子地图。
  38. 根据权利要求30至37任一项所述的装置,其中,所述装置还包括:
    第二获取单元,配置为获得拍摄所述待定位场景得到的多张训练图像;
    特征提取单元,用于对所述多张训练图像进行特征提取,以得到训练特征集;
    聚类单元,配置为对所述训练特征集中的特征进行多次聚类,得到所述词汇树。
  39. 根据权利要求22至37任一项所述的装置,其中,所述视觉定位装置为服务器,所述装置还包括:
    接收单元,配置为接收来自目标设备的所述第一图像,所述目标设备安装有所述相机。
  40. 根据权利要求39所述的装置,其中,所述装置还包括:
    发送单元,配置为将所述相机的位置信息发送至所述目标设备。
  41. 根据权利要求22至38任一项所述的装置,其中,所述视觉定位装置为安装有所述相机的电子设备。
  42. 一种终端设备,其中,包括:
    相机,配置为采集目标图像;
    发送单元,配置为向服务器发送目标信息,所述目标信息包括所述目标图像或从所述目标图像提取出的特征序列,以及所述相机的内参;
    接收单元,配置为接收位置信息,所述位置信息用于指示所述相机的位置和方向;所述位置信息为所述服务器根据第二备选图像序列确定的所述相机采集所述目标图像时的位置的信息;所述第二备选图像序列为所述服务器根据目标窗口调整第一备选图像序列中各帧图像的顺序得到的,所述目标窗口为从图像库中确定的包含目标帧图像的连续多帧图像,所述图像库用于构建电子地图,所述目标帧图像为所述图像库中与第二图像相匹配的图像,所述第二图像为所述相机在采集到第一图像之前所采集的图像,所述第一备选图像序列中的各帧图像按照与所述第一图像的匹配度顺序排列;;
    显示单元,配置为显示电子地图,所述电子地图中包含所述相机的位置和方向。
  43. 一种视觉定位系统,其中,包括服务器和终端设备,所述服务器执行如权利要求1-19中任一所述的方法,所述终端设备用于执行权利要求21中的方法。
  44. 一种电子设备,其中,包括:
    存储器,配置为存储程序;
    处理器,配置为执行所述存储器存储的所述程序,当所述程序被执行时,所述处理器用于执行如权利要求1-20中任一所述的方法。
  45. 一种计算机可读存储介质,其中,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如权利要求1-20任一项所述的方法。
  46. 一种计算机程序产品,其中,所述计算机程序产品包含有程序指令;其中,所述程序指令当被处理器执行时使所述处理器执行如权利要求1-20任一项所述的方法。
PCT/CN2019/117224 2019-08-30 2019-11-11 视觉定位方法及相关装置 WO2021035966A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020227001898A KR20220024736A (ko) 2019-08-30 2019-11-11 시각적 포지셔닝 방법 및 관련 장치
JP2022503488A JP7430243B2 (ja) 2019-08-30 2019-11-11 視覚的測位方法及び関連装置
US17/585,114 US20220148302A1 (en) 2019-08-30 2022-01-26 Method for visual localization and related apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910821911.3A CN112445929B (zh) 2019-08-30 2019-08-30 视觉定位方法及相关装置
CN201910821911.3 2019-08-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/585,114 Continuation US20220148302A1 (en) 2019-08-30 2022-01-26 Method for visual localization and related apparatus

Publications (1)

Publication Number Publication Date
WO2021035966A1 true WO2021035966A1 (zh) 2021-03-04

Family

ID=74684964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117224 WO2021035966A1 (zh) 2019-08-30 2019-11-11 视觉定位方法及相关装置

Country Status (6)

Country Link
US (1) US20220148302A1 (zh)
JP (1) JP7430243B2 (zh)
KR (1) KR20220024736A (zh)
CN (1) CN112445929B (zh)
TW (1) TWI745818B (zh)
WO (1) WO2021035966A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463429A (zh) * 2022-04-12 2022-05-10 深圳市普渡科技有限公司 机器人、地图创建方法、定位方法及介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11620829B2 (en) * 2020-09-30 2023-04-04 Snap Inc. Visual matching with a messaging application
CN113177971A (zh) * 2021-05-07 2021-07-27 中德(珠海)人工智能研究院有限公司 一种视觉跟踪方法、装置、计算机设备及存储介质
KR102366364B1 (ko) * 2021-08-25 2022-02-23 주식회사 포스로직 기하학적 패턴 매칭 방법 및 이러한 방법을 수행하는 장치
CN118052867A (zh) * 2022-11-15 2024-05-17 中兴通讯股份有限公司 定位方法、终端设备、服务器及存储介质
CN116659523B (zh) * 2023-05-17 2024-07-23 深圳市保臻社区服务科技有限公司 一种基于社区进入车辆的位置自动定位方法及装置
CN117708357B (zh) * 2023-06-16 2024-08-23 荣耀终端有限公司 一种图像检索方法和电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107796397A (zh) * 2017-09-14 2018-03-13 杭州迦智科技有限公司 一种机器人双目视觉定位方法、装置和存储介质
CN108596976A (zh) * 2018-04-27 2018-09-28 腾讯科技(深圳)有限公司 相机姿态追踪过程的重定位方法、装置、设备及存储介质
US20180297207A1 (en) * 2017-04-14 2018-10-18 TwoAntz, Inc. Visual positioning and navigation device and method thereof
CN109710724A (zh) * 2019-03-27 2019-05-03 深兰人工智能芯片研究院(江苏)有限公司 一种构建点云地图的方法和设备
CN109816769A (zh) * 2017-11-21 2019-05-28 深圳市优必选科技有限公司 基于深度相机的场景地图生成方法、装置及设备
CN110057352A (zh) * 2018-01-19 2019-07-26 北京图森未来科技有限公司 一种相机姿态角确定方法及装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2418588A1 (en) * 2010-08-10 2012-02-15 Technische Universität München Visual localization method
EP2423873B1 (en) * 2010-08-25 2013-12-11 Lakeside Labs GmbH Apparatus and Method for Generating an Overview Image of a Plurality of Images Using a Reference Plane
US9324151B2 (en) * 2011-12-08 2016-04-26 Cornell University System and methods for world-scale camera pose estimation
JP5387723B2 (ja) * 2012-04-26 2014-01-15 カシオ計算機株式会社 画像表示装置、及び画像表示方法、画像表示プログラム
US10121266B2 (en) * 2014-11-25 2018-11-06 Affine Technologies LLC Mitigation of disocclusion artifacts
CN104700402B (zh) * 2015-02-06 2018-09-14 北京大学 基于场景三维点云的视觉定位方法及装置
CN106446815B (zh) * 2016-09-14 2019-08-09 浙江大学 一种同时定位与地图构建方法
CN107368614B (zh) * 2017-09-12 2020-07-07 猪八戒股份有限公司 基于深度学习的图像检索方法及装置
CN108198145B (zh) * 2017-12-29 2020-08-28 百度在线网络技术(北京)有限公司 用于点云数据修复的方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180297207A1 (en) * 2017-04-14 2018-10-18 TwoAntz, Inc. Visual positioning and navigation device and method thereof
CN107796397A (zh) * 2017-09-14 2018-03-13 杭州迦智科技有限公司 一种机器人双目视觉定位方法、装置和存储介质
CN109816769A (zh) * 2017-11-21 2019-05-28 深圳市优必选科技有限公司 基于深度相机的场景地图生成方法、装置及设备
CN110057352A (zh) * 2018-01-19 2019-07-26 北京图森未来科技有限公司 一种相机姿态角确定方法及装置
CN108596976A (zh) * 2018-04-27 2018-09-28 腾讯科技(深圳)有限公司 相机姿态追踪过程的重定位方法、装置、设备及存储介质
CN109710724A (zh) * 2019-03-27 2019-05-03 深兰人工智能芯片研究院(江苏)有限公司 一种构建点云地图的方法和设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463429A (zh) * 2022-04-12 2022-05-10 深圳市普渡科技有限公司 机器人、地图创建方法、定位方法及介质
CN114463429B (zh) * 2022-04-12 2022-08-16 深圳市普渡科技有限公司 机器人、地图创建方法、定位方法及介质

Also Published As

Publication number Publication date
JP7430243B2 (ja) 2024-02-09
KR20220024736A (ko) 2022-03-03
CN112445929A (zh) 2021-03-05
US20220148302A1 (en) 2022-05-12
JP2022541559A (ja) 2022-09-26
TW202109357A (zh) 2021-03-01
CN112445929B (zh) 2022-05-17
TWI745818B (zh) 2021-11-11

Similar Documents

Publication Publication Date Title
TWI745818B (zh) 視覺定位方法、電子設備及電腦可讀儲存介質
WO2021057744A1 (zh) 定位方法及装置、设备、存储介质
WO2021057742A1 (zh) 定位方法及装置、设备、存储介质
EP4056952A1 (en) Map fusion method, apparatus, device, and storage medium
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
CN111323024B (zh) 定位方法及装置、设备、存储介质
CN111652934A (zh) 定位方法及地图构建方法、装置、设备、存储介质
KR20140043393A (ko) 위치 기반 인식 기법
US20230351794A1 (en) Pedestrian tracking method and device, and computer-readable storage medium
WO2017114237A1 (zh) 一种图像查询方法和装置
WO2022142049A1 (zh) 地图构建方法及装置、设备、存储介质、计算机程序产品
WO2023221790A1 (zh) 图像编码器的训练方法、装置、设备及介质
CN111709317A (zh) 一种基于显著性模型下多尺度特征的行人重识别方法
Xue et al. A fast visual map building method using video stream for visual-based indoor localization
JP7430254B2 (ja) 場所認識のための視覚的オブジェクトインスタンス記述子
Jiang et al. Indoor localization with a signal tree
CN114743139A (zh) 视频场景检索方法、装置、电子设备及可读存储介质
Orhan et al. Semantic pose verification for outdoor visual localization with self-supervised contrastive learning
US11127199B2 (en) Scene model construction system and scene model constructing method
Sui et al. An accurate indoor localization approach using cellphone camera
US20230281867A1 (en) Methods performed by electronic devices, electronic devices, and storage media
Wu et al. A vision-based indoor positioning method with high accuracy and efficiency based on self-optimized-ordered visual vocabulary
Zhang et al. Hierarchical Image Retrieval Method Based on Bag-of-Visual-Word and Eight-point Algorithm with Feature Clouds for Visual Indoor Positioning
CN116843754A (zh) 一种基于多特征融合的视觉定位方法及系统
WO2022268094A1 (en) Methods, systems, and media for image searching

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943486

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022503488

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20227001898

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943486

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.07.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19943486

Country of ref document: EP

Kind code of ref document: A1