CN112132900A - Visual repositioning method and system - Google Patents

Visual repositioning method and system

Info

Publication number
CN112132900A
Authority
CN
China
Prior art keywords
image
data
point cloud
dimensional
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011052456.4A
Other languages
Chinese (zh)
Inventor
邓秋平
王锐
刘剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingmeixin Beijing Technology Co ltd
Original Assignee
Lingmeixin Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lingmeixin Beijing Technology Co ltd filed Critical Lingmeixin Beijing Technology Co ltd
Priority to CN202011052456.4A priority Critical patent/CN112132900A/en
Publication of CN112132900A publication Critical patent/CN112132900A/en
Pending legal-status Critical Current

Classifications

    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T2207/10028: Range image; Depth image; 3D point clouds
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a visual repositioning method, which comprises the following steps: matching an image to be positioned, and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image; determining reference position and posture information of the image to be positioned based on the position and posture information of the candidate image group, and determining the same characteristic points between the candidate image and the image to be positioned; and determining the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image. The invention also discloses a vision repositioning system. The invention can solve the problems of insufficient positioning precision and robustness of the existing visual positioning mode.

Description

Visual repositioning method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a vision repositioning method and a vision repositioning system.
Background
Visual positioning is one form of indoor positioning. The current visual positioning methods mainly include the following:
(1) A laser-SLAM-like scheme, in which the distance between the user and obstacles is calculated directly from collected spatial point cloud data, so that real-time spatial map information is constructed or positioning is performed in a known map.
(2) A VSLAM scheme based on a monocular or fisheye camera, which estimates its own pose change from multiple frames of images and accumulates the pose changes to calculate the distance between the user and objects, thereby performing positioning and map construction.
(3) An optical flow positioning scheme, which measures relative position change from the change of light in each frame entering the camera, thereby obtaining accurate information on the relative motion of the object.
These visual positioning methods usually adopt a relative positioning mode: after an initial pose is given, the distance and direction relative to that initial pose are measured by an algorithm to determine the current pose information. However, in a digital twin environment, and especially when a low-cost, low-resolution camera is used, such relative positioning modes suffer from insufficient positioning precision and low robustness.
Disclosure of Invention
In order to solve the problems of insufficient positioning precision and insufficient robustness, the invention aims to provide a visual repositioning method and a visual repositioning system.
The invention provides a visual repositioning method, which comprises the following steps:
matching an image to be positioned, and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image;
determining reference position and posture information of the image to be positioned based on the position and posture information of the candidate image group, and determining the same characteristic points between the candidate image and the image to be positioned;
and determining the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
As a further improvement of the present invention, the matching of the image to be positioned and the acquisition of the candidate image group matched with the image to be positioned are implemented by a neural network, and the method further includes: training the neural network according to a data set.
As a further improvement of the present invention, the neural network comprises:
the first network is used for carrying out feature extraction on the input image to obtain a first feature map;
the second network is used for extracting the features of the first feature map to obtain a second feature map;
the third network is used for carrying out local feature extraction on the first feature map to obtain key points and local feature descriptors;
and the fourth network is used for carrying out global feature extraction on the second feature map to obtain a global feature descriptor.
As a further refinement of the invention, the data set comprises a plurality of sets of sample images, each of the sets of sample images comprising position and orientation information, the method further comprising: acquiring the plurality of sample image groups,
wherein the obtaining the plurality of sample image groups comprises:
acquiring three-dimensional point cloud data of a laser radar and RGB image data of a camera in a three-dimensional scene to be reconstructed;
determining the position and attitude information of each frame of image in a point cloud space;
acquiring complete point cloud of the three-dimensional scene to be reconstructed;
mapping the complete point cloud into an image space based on the position and posture information of each frame of image in the point cloud space, and constructing a dense depth map of each frame of image;
and storing each frame of image and the dense depth map pair corresponding to each frame of image as a sample image group, and using the position and posture information of each frame of image in a point cloud space as the position and posture information of the sample image group.
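For illustration, a minimal Python sketch of how one such sample image group could be represented is given below; the field names and the 4x4 pose-matrix layout are assumptions, not taken from the patent text.

```python
# Hypothetical data layout for one sample image group: an RGB frame, its dense
# depth map, and the pose of that frame in the point-cloud space.
from dataclasses import dataclass
import numpy as np

@dataclass
class SampleImageGroup:
    rgb: np.ndarray          # H x W x 3 camera image
    dense_depth: np.ndarray  # H x W depth map built from the lidar point cloud
    pose: np.ndarray         # assumed 4 x 4 position/attitude matrix of the frame in point-cloud space

def make_sample_group(rgb, dense_depth, pose):
    """Pair a frame with its dense depth map and reuse the frame's pose as the group pose."""
    assert rgb.shape[:2] == dense_depth.shape[:2], "depth map must align with the image"
    return SampleImageGroup(rgb=rgb, dense_depth=dense_depth, pose=pose)
```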
As a further improvement of the present invention, the acquiring a complete point cloud of the three-dimensional scene to be reconstructed includes:
registering two adjacent frames of point clouds based on normal distribution transformation to obtain a relative position and attitude matrix M between the two adjacent frames of point clouds;
and based on the relative position attitude matrix M, converting each frame of point cloud in the three-dimensional point cloud data of the laser radar to obtain complete point cloud data of the three-dimensional scene to be reconstructed.
As a further improvement of the invention, the three-dimensional point cloud data of the laser radar comprises scanning data of a plurality of points,
wherein the mapping the complete point cloud into an image space to construct a dense depth map for each frame of image comprises:
respectively converting the scanning data of the plurality of points into a Cartesian coordinate system to obtain first data, wherein the first data comprises a plurality of three-dimensional coordinate points P;
mapping the first data into the two-dimensional image space according to the calibration data to obtain second data, wherein the second data comprises a plurality of mapping points P′;
And constructing a sparse depth map according to the second data.
As a further improvement of the present invention, the determining, in the dense depth map corresponding to the candidate image, the target position and posture information of the image to be positioned according to the three-dimensional coordinate point corresponding to each feature point includes:
matching and searching three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image;
and determining the target position posture information of the image to be positioned by a PnP method according to each three-dimensional coordinate point and the projection position of each three-dimensional point coordinate in an image space.
The present invention also provides a visual repositioning system, characterized in that the system comprises:
the matching module is used for matching an image to be positioned and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image;
the first positioning module is used for determining reference position posture information of the image to be positioned based on the position posture information of the candidate image group and determining the same characteristic points between the candidate image and the image to be positioned;
and the second positioning module is used for determining the target position posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
As a further improvement of the present invention, the matching module is implemented by a neural network, and the system further includes: training the neural network according to a data set.
As a further improvement of the present invention, the neural network comprises:
the first network is used for carrying out feature extraction on the input image to obtain a first feature map;
the second network is used for extracting the features of the first feature map to obtain a second feature map;
the third network is used for carrying out local feature extraction on the first feature map to obtain key points and local feature descriptors;
and the fourth network is used for carrying out global feature extraction on the second feature map to obtain a global feature descriptor.
As a further refinement of the invention, the data set comprises a plurality of sets of sample images, each of the sets of sample images comprising position and orientation information, the method further comprising: acquiring the plurality of sample image groups,
wherein the obtaining the plurality of sample image groups comprises:
acquiring three-dimensional point cloud data of a laser radar and RGB image data of a camera in a three-dimensional scene to be reconstructed;
determining the position and attitude information of each frame of image in a point cloud space;
acquiring complete point cloud of the three-dimensional scene to be reconstructed;
mapping the complete point cloud into an image space based on the position and posture information of each frame of image in the point cloud space, and constructing a dense depth map of each frame of image;
and storing each frame of image and the dense depth map pair corresponding to each frame of image as a sample image group, and using the position and posture information of each frame of image in a point cloud space as the position and posture information of the sample image group.
As a further improvement of the present invention, the acquiring a complete point cloud of the three-dimensional scene to be reconstructed includes:
registering two adjacent frames of point clouds based on normal distribution transformation to obtain a relative position and attitude matrix M between the two adjacent frames of point clouds;
and based on the relative position attitude matrix M, converting each frame of point cloud in the three-dimensional point cloud data of the laser radar to obtain complete point cloud data of the three-dimensional scene to be reconstructed.
As a further improvement of the invention, the three-dimensional point cloud data of the laser radar comprises scanning data of a plurality of points,
wherein the mapping the complete point cloud into an image space to construct a dense depth map for each frame of image comprises:
respectively converting the scanning data of the plurality of points into a Cartesian coordinate system to obtain first data, wherein the first data comprises a plurality of three-dimensional coordinate points P;
mapping the first data into the two-dimensional image space according to the calibration data to obtain second data, wherein the second data comprises a plurality of mapping points P′;
And constructing a sparse depth map according to the second data.
As a further refinement of the invention, the second positioning module is configured to: matching and searching three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image;
and determining the target position posture information of the image to be positioned by a PnP method according to each three-dimensional coordinate point and the projection position of each three-dimensional point coordinate in an image space.
The invention also provides an electronic device comprising a memory and a processor, the memory storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method.
The invention also provides a computer-readable storage medium having stored thereon a computer program for execution by a processor to perform the method.
The invention has the beneficial effects that: the two progressive methods of image retrieval coarse matching positioning and point cloud precise matching positioning are adopted, so that the running time can be greatly saved, and the method is suitable for real-time operation. The high-precision digital twin environment is used as a source database, a terminal user only needs to use cheap camera equipment, and the relative positioning result is corrected by using an absolute space coordinate as a reference value, so that the problems of insufficient positioning precision and insufficient robustness of the existing visual positioning mode are solved, and the method has high practical value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a flow chart illustrating a method of visual repositioning according to an exemplary embodiment of the present invention;
fig. 2 is a schematic diagram of a neural network according to an exemplary embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
In addition, in the description of the present invention, the terms used are for illustrative purposes only and are not intended to limit the scope of the present invention. The terms "comprises" and/or "comprising" are used to specify the presence of stated elements, steps, operations, and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations, and/or components. The terms "first," "second," and the like may be used to describe various elements, not necessarily order, and not necessarily limit the elements. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. These terms are only used to distinguish one element from another. These and/or other aspects will become apparent to those of ordinary skill in the art in view of the following drawings, and the description of the embodiments of the present invention will be more readily understood by those of ordinary skill in the art. The drawings are only for purposes of illustrating the described embodiments of the invention. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated in the present application may be employed without departing from the principles described in the present application.
A visual repositioning method according to an embodiment of the present invention, as shown in fig. 1, includes:
s1, matching images to be positioned, and acquiring a candidate image group matched with the images to be positioned, wherein the candidate image group comprises candidate images and dense depth maps corresponding to the candidate images;
s2, determining reference position and posture information of the image to be positioned based on the position and posture information of the candidate image group, and determining the same characteristic points between the candidate image and the image to be positioned;
and S3, determining the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
The method can be applied to a digital twin environment synthesized by laser and imaging equipment, and adopts a combination of a laser radar and a visible light camera, for example a Velodyne-16 multi-line laser radar and a Flir industrial camera with a 190-degree lens; the two devices are mounted on the same rigid structure, so that the three-dimensional point cloud data of the laser radar and the RGB image data of the camera can be obtained simultaneously. It can be understood that, because the lidar and the camera are mounted on the same rigid structure, are initially calibrated, and have their position and angle relationship registered, the three-dimensional point cloud data of the lidar can be correctly mapped onto the RGB image shot by the camera and aligned with the pixels at the corresponding positions in that image. After the two are calibrated, calibration data can be obtained, and the calibration data are used for mapping between the three-dimensional point cloud data obtained in real time and the RGB image data of the camera in the three-dimensional scene to be reconstructed.
The calibration data can be obtained after calibrating the laser radar and the camera, and need to comprise at least an offset matrix M_o and an intrinsic (internal reference) matrix M_i. The offset matrix M_o is obtained by aligning the three-dimensional coordinate points P in the three-dimensional point cloud data with the pixels at the corresponding positions in the RGB image and computing the relative position deviation between the coordinates of the mapped points P′ and the pixel coordinates; the intrinsic matrix M_i is obtained by calibrating the internal distortion parameters of the camera.
The image to be positioned may be an RGB image captured by a user using any camera device (including a low-pixel camera), the image to be positioned is positioned twice, the reference position and posture information obtained for the first time is used as a coarse positioning result, and the target position and posture information obtained for the second time is transmitted to a terminal user as an accurate positioning result (i.e., a final visual positioning result).
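For illustration, the two-pass flow (coarse retrieval followed by precise 2D-3D solving) could be organized as in the following Python sketch; the helper names (retrieve_candidates, match_features, solve_pnp, lookup_3d) and the candidate-group attributes are hypothetical, not the patent's implementation.

```python
import numpy as np

def relocalize(query_image, database, retrieve_candidates, match_features, solve_pnp):
    """Two-stage visual relocalization: coarse retrieval, then fine 2D-3D pose solving."""
    # Coarse pass: retrieve the best-matching candidate group (image + dense depth map + pose).
    group = retrieve_candidates(query_image, database)   # assumed retrieval helper
    reference_pose = group.pose                          # coarse positioning result

    # Fine pass: match 2D feature points, look up their 3D points in the dense depth map,
    # and solve a PnP problem for the exact pose.
    pts2d_query, pts2d_cand = match_features(query_image, group.rgb)
    pts3d = [group.lookup_3d(p) for p in pts2d_cand]     # assumed depth-map lookup
    pairs = [(p3, p2) for p3, p2 in zip(pts3d, pts2d_query) if p3 is not None]
    if len(pairs) < 6:                                   # not enough 2D-3D pairs:
        return reference_pose                            # fall back to the coarse pose
    object_points = np.array([p3 for p3, _ in pairs])
    image_points = np.array([p2 for _, p2 in pairs])
    return solve_pnp(object_points, image_points)        # target position/attitude information
```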
In an optional embodiment, the matching is performed on the image to be positioned, a candidate image group matched with the image to be positioned is obtained, and the matching is implemented by a neural network, and the method further includes: training the neural network according to a data set.
The method comprises the steps of constructing a data set with position and posture information based on three-dimensional point cloud data of a laser radar and RGB image data of a camera which are well registered in a deep learning mode and used for training of a neural network and matching and recognizing of an image to be positioned, bringing characteristic information and position and posture information of the image in the data set into the neural network for end-to-end training, and obtaining the trained neural network. After the neural network is trained, when a new image to be positioned is input, an image group matched with the image to be positioned can be quickly searched, and accurate position and posture information is calculated and obtained in a characteristic point matching mode.
In an alternative embodiment, the neural network comprises:
the first network is used for carrying out feature extraction on the input image to obtain a first feature map;
the second network is used for extracting the features of the first feature map to obtain a second feature map;
the third network is used for carrying out local feature extraction on the first feature map to obtain key points and local feature descriptors;
and the fourth network is used for carrying out global feature extraction on the second feature map to obtain a global feature descriptor.
As shown in fig. 2, the first network employs a MobileNet (labeled MobileNet(1)), the second network employs a MobileNet (labeled MobileNet(2)), the third network employs a SuperPoint decoder, and the fourth network employs NetVLAD. The local feature head branches off the MobileNet(1) encoder earlier than the global feature head, so spatially discriminative features are retained at a higher spatial resolution; the SuperPoint decoder scheme is adopted for the keypoints and local feature descriptors, whose semantic level can be lower than that of the global feature descriptor. This gives the coarse positioning higher robustness, accuracy and real-time performance.
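A rough PyTorch-style sketch of this layout is given below. It is a simplified stand-in under assumed layer sizes (plain convolution blocks in place of real MobileNet stages, and an average-pooled projection in place of a full NetVLAD layer); it only illustrates how the two encoders and the local and global heads are wired.

```python
import torch
import torch.nn as nn

class CoarseMatchingNet(nn.Module):
    """Stand-in for the Fig. 2 layout: MobileNet(1) -> local head, MobileNet(2) -> global head."""
    def __init__(self, desc_dim=128, global_dim=256):
        super().__init__()
        # MobileNet(1): early encoder kept at higher resolution (stand-in conv blocks).
        self.encoder1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # MobileNet(2): deeper encoder feeding the global branch.
        self.encoder2 = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # SuperPoint-style local head: keypoint logits (cell + "no keypoint" bin assumed) and descriptors.
        self.keypoint_head = nn.Conv2d(128, 65, 1)
        self.local_desc_head = nn.Conv2d(128, desc_dim, 1)
        # Global head: pooled projection standing in for NetVLAD.
        self.global_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(256, global_dim))

    def forward(self, image):
        f1 = self.encoder1(image)           # first feature map
        f2 = self.encoder2(f1)              # second feature map
        keypoints = self.keypoint_head(f1)  # local branch taken from the earlier, finer features
        local_desc = self.local_desc_head(f1)
        global_desc = self.global_head(f2)  # global branch taken from the deeper features
        return keypoints, local_desc, global_desc
```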
In an alternative embodiment, the data set includes a plurality of sets of sample images, each of the sets of sample images including position and orientation information, and the method further includes: acquiring the plurality of sample image groups,
wherein the obtaining the plurality of sample image groups comprises:
acquiring three-dimensional point cloud data of a laser radar and RGB image data of a camera in a three-dimensional scene to be reconstructed;
determining the position and attitude information of each frame of image in a point cloud space;
acquiring complete point cloud of the three-dimensional scene to be reconstructed;
mapping the complete point cloud into an image space based on the position and posture information of each frame of image in the point cloud space, and constructing a dense depth map of each frame of image;
and storing each frame of image and the dense depth map pair corresponding to each frame of image as a sample image group, and using the position and posture information of each frame of image in a point cloud space as the position and posture information of the sample image group.
In an alternative embodiment, the acquiring a complete point cloud of the three-dimensional scene to be reconstructed includes:
registering two adjacent frames of point clouds based on normal distribution transformation to obtain a relative position and attitude matrix M between the two adjacent frames of point clouds;
and based on the relative position attitude matrix M, converting each frame of point cloud in the three-dimensional point cloud data of the laser radar to obtain complete point cloud data of the three-dimensional scene to be reconstructed.
The method of the invention uses the Normal Distributions Transform (NDT): the space occupied by the point cloud of the reference frame is divided into grids of a specified size, and the multidimensional normal distribution parameters of each grid are calculated. Each point of the current frame's point cloud is then transformed according to a transfer matrix T, the grid each transformed point falls into is determined, and a probability density function p(x) is evaluated for that grid. Because a point may contribute to several grids, the probability densities are computed per grid and summed to obtain an objective function s(p), the total registration score; optimizing s(p) over all points yields the optimal match between the two adjacent frames of point clouds. This registration process does not rely on feature computation and matching of corresponding points, so a high real-time speed can be ensured.
Because the points in each grid approximately follow a Gaussian distribution, the mean q and the covariance matrix Σ of the point set in each grid can be calculated:

$q = \frac{1}{n}\sum_{i=1}^{n} x_i$

$\Sigma = \frac{1}{n}\sum_{i=1}^{n} (x_i - q)(x_i - q)^{T}$

where $x_i$ denotes a point in the grid, i denotes the index of the point, and n denotes the number of points in the grid.
The probability density of a grid and the total registration score are then:

$p(x) \propto \exp\!\left(-\tfrac{1}{2}(x-q)^{T}\Sigma^{-1}(x-q)\right)$

$s(p) = \sum_{i}\exp\!\left(-\tfrac{1}{2}(x_i' - q_i)^{T}\Sigma_i^{-1}(x_i' - q_i)\right)$

where x denotes a point in a grid, $x_i'$ denotes the i-th transformed point of the current frame falling in one of the target grids, and $q_i$ and $\Sigma_i$ denote the mean and covariance computed from the point set of that grid.
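A minimal numpy sketch of these per-grid statistics and the summed registration score follows; the variable names are illustrative, and the Newton-style optimization of the transfer matrix T that a full NDT implementation performs is omitted.

```python
import numpy as np

def grid_stats(points):
    """Mean q and covariance Sigma of the points falling in one grid cell."""
    q = points.mean(axis=0)
    diff = points - q
    sigma = diff.T @ diff / len(points)
    return q, sigma

def ndt_score(transformed_points, cell_stats):
    """Sum of per-point Gaussian scores s(p); cell_stats maps each point to its cell's (q, Sigma)."""
    score = 0.0
    for x, (q, sigma) in zip(transformed_points, cell_stats):
        d = x - q
        # Small diagonal term keeps the covariance invertible for nearly planar cells.
        score += np.exp(-0.5 * d @ np.linalg.solve(sigma + 1e-9 * np.eye(3), d))
    return score
```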
In an alternative embodiment, the three-dimensional point cloud data of the lidar includes scanning data of a plurality of points, wherein the mapping the complete point cloud into an image space to construct a dense depth map of each frame of the image includes:
respectively converting the scanning data of the plurality of points into a Cartesian coordinate system to obtain first data, wherein the first data comprises a plurality of three-dimensional coordinate points P;
the first calibration data is used for calibrating the first calibration dataMapping the data into the two-dimensional image space to obtain second data, wherein the second data comprises a plurality of mapping points P/
And constructing a sparse depth map according to the second data.
It is understood that the three-dimensional point cloud data is recorded as points when scanned, i.e., it comprises scanning data of a plurality of points, each corresponding to one three-dimensional coordinate in a Cartesian coordinate system; the three-dimensional point cloud data can therefore be converted into first data comprising a plurality of three-dimensional coordinate points P. During mapping, each three-dimensional coordinate point P is mapped to a mapping point P′ in the two-dimensional image space, where the two-dimensional image space can be understood as the two-dimensional space defined by the width and height of the image. When the distance between the three-dimensional coordinate point P corresponding to a mapping point P′ and the origin of the camera coordinate system does not exceed the maximum usable distance (which can be preset according to the use requirement), the mapping point P′ in the two-dimensional image space is saved to the dense depth map; otherwise the mapping point P′ is discarded. When saving the dense depth map, each frame of image and its corresponding dense depth map need to be stored as a pair.
The coordinates of each three-dimensional coordinate point P can be obtained from P = (r·cosα·cosθ, r·sinα·cosθ, r·sinθ), where α denotes the horizontal angle of the point P relative to the origin in the Cartesian coordinate system, θ denotes the vertical angle of the point P relative to the origin, and r denotes the distance of the point P from the origin. As mentioned above, the calibration data comprise the offset matrix M_o between the three-dimensional point cloud data and the RGB image data and the intrinsic matrix M_i of the camera; during mapping, the following mapping can be adopted:
$P' = M_i \, M_o \, P$
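A small numpy sketch of this conversion and mapping is shown below; the exact shapes of M_o (treated here as a 4x4 rigid transform) and M_i (a 3x3 intrinsic matrix), and the in-image bounds check, are assumptions based on the description.

```python
import numpy as np

def polar_to_cartesian(r, alpha, theta):
    """Convert one lidar return (range r, horizontal angle alpha, vertical angle theta) to a 3D point P."""
    return np.array([r * np.cos(alpha) * np.cos(theta),
                     r * np.sin(alpha) * np.cos(theta),
                     r * np.sin(theta)])

def project_point(P, M_o, M_i, max_range, width, height):
    """Map a 3D point into the image plane using the offset matrix M_o and intrinsic matrix M_i."""
    P_cam = (M_o @ np.append(P, 1.0))[:3]        # into the camera coordinate system
    if np.linalg.norm(P_cam) > max_range:        # beyond the maximum usable distance: discard
        return None
    uvw = M_i @ P_cam                            # apply the camera intrinsics
    if uvw[2] <= 0:
        return None
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]      # perspective division -> mapping point P'
    if 0 <= u < width and 0 <= v < height:
        return u, v, np.linalg.norm(P_cam)       # pixel position plus depth for the depth map
    return None
```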
An optional implementation manner, where, in the dense depth map corresponding to the candidate image, the determining, according to the three-dimensional coordinate point corresponding to each feature point, the target position and posture information of the image to be positioned includes:
matching and searching three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image;
and determining the target position posture information of the image to be positioned by a PnP method according to each three-dimensional coordinate point and the projection position of each three-dimensional point coordinate in an image space.
It can be understood that when a candidate image group can be screened out through the neural network, matching is further performed in the dense depth map, matching and searching are performed on known 2D feature points (each feature point) of the image and three-dimensional coordinate points in the dense depth map, and then accurate position and posture information (target position and posture information) of an image to be positioned is solved based on 3D space points (each three-dimensional coordinate point) and a projection position thereof according to a PnP method, so that the positioning method can achieve positioning accuracy of a centimeter level. And when the matched three-dimensional coordinate point cannot be found in the dense depth map by each feature point, performing coarse positioning on the image to be positioned again, wherein the positioning process is as described above and is not repeated here.
When solving by the PnP method, the three-dimensional coordinate point P = (X, Y, Z) is projected to a projection point (u, v, 1) in the image space, and the augmented matrix [R | t] is defined, which yields the following equation:

$s\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = [\,R \mid t\,]\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} t_1^{T} \\ t_2^{T} \\ t_3^{T} \end{pmatrix}\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$

where s is the depth of the point and $t_1^{T}$, $t_2^{T}$, $t_3^{T}$ are the three rows of [R | t].
wherein each feature point pair (three-dimensional coordinate point and projection point) provides two linear constraints with respect to t:

$t_1^{T}\tilde{P} - u\,t_3^{T}\tilde{P} = 0, \qquad t_2^{T}\tilde{P} - v\,t_3^{T}\tilde{P} = 0$

with $\tilde{P} = (X, Y, Z, 1)^{T}$.
The unknown vector t formed by stacking the 12 entries of [R | t] therefore has 12 dimensions in total, and at least 6 feature point pairs are needed for a linear solution of t.
Solving this system in the least-squares sense via SVD yields the target position and attitude information of the image to be positioned, namely R (attitude) and t (position).
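A compact numpy sketch of this linear (DLT) step is given below: each 2D-3D pair contributes two rows, the 12 entries of [R | t] are solved with SVD, and at least 6 pairs are required. A practical implementation would additionally apply RANSAC and non-linear refinement (for example via OpenCV's solvePnP), which are omitted here.

```python
import numpy as np

def pnp_dlt(points_3d, points_2d):
    """Solve the 12 entries of the augmented matrix [R | t] from >= 6 point pairs via SVD."""
    assert len(points_3d) >= 6, "at least 6 feature point pairs are needed"
    A = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        P = [X, Y, Z, 1.0]
        # Two linear constraints per pair: t1.P - u*t3.P = 0 and t2.P - v*t3.P = 0
        A.append(P + [0, 0, 0, 0] + [-u * p for p in P])
        A.append([0, 0, 0, 0] + P + [-v * p for p in P])
    A = np.array(A)
    _, _, Vt = np.linalg.svd(A)    # least-squares solution = right singular vector
    Rt = Vt[-1].reshape(3, 4)      # of the smallest singular value, reshaped to [R | t]
    R, t = Rt[:, :3], Rt[:, 3]
    return R, t                    # attitude R and position t (up to scale; refine afterwards)
```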
The method disclosed by the invention has the advantages that the flows of preprocessing (calibrating three-dimensional point cloud data of a laser radar and RGB image data of a camera) and deep learning training of the digital twin data and the flows of retrieving and accurately matching a new image (an image to be positioned) are all executed on a cloud server, and the new image can be obtained from any camera equipment, such as a mobile phone, a network camera and the like. The method comprises the steps that three-dimensional point cloud data obtained in the digital twin data acquisition process and a large number of pre-shot RGB images are used as input sources, and an image retrieval database (data set) is formed by utilizing the input sources; after uploading new images shot by any camera equipment, a terminal user can quickly retrieve the closest preset image (candidate image group) in the image retrieval database, and then the accurate position and posture information is calculated in a characteristic point matching mode. The two progressive methods of image retrieval coarse matching positioning and point cloud precise matching positioning are adopted, so that the running time can be greatly saved, and the system is suitable for real-time operation. The method adopted by the invention takes the high-precision digital twin environment as the source database, the terminal user only needs to use cheap camera equipment, and the relative positioning result is corrected by using an absolute space coordinate as a reference value, so that the problems of insufficient positioning precision and insufficient robustness of the existing visual positioning mode are solved, and the method has high practical value.
The vision repositioning system of the embodiment of the invention comprises:
the matching module is used for matching an image to be positioned and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image;
the first positioning module is used for determining reference position posture information of the image to be positioned based on the position posture information of the candidate image group and determining the same characteristic points between the candidate image and the image to be positioned;
and the second positioning module is used for determining the target position posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
The system can be applied to a digital twin environment synthesized by laser and imaging equipment, and adopts a combination of a laser radar and a visible light camera, for example a Velodyne-16 multi-line laser radar and a Flir industrial camera with a 190-degree lens; the two devices are mounted on the same rigid structure, so that the three-dimensional point cloud data of the laser radar and the RGB image data of the camera can be obtained simultaneously. It can be understood that, because the lidar and the camera are mounted on the same rigid structure, are initially calibrated, and have their position and angle relationship registered, the three-dimensional point cloud data of the lidar can be correctly mapped onto the RGB image shot by the camera and aligned with the pixels at the corresponding positions in that image. After the two are calibrated, calibration data can be obtained, and the calibration data are used for mapping between the three-dimensional point cloud data obtained in real time and the RGB image data of the camera in the three-dimensional scene to be reconstructed.
The calibration data can be obtained after calibrating the laser radar and the camera, and need to comprise at least an offset matrix M_o and an intrinsic (internal reference) matrix M_i. The offset matrix M_o is obtained by aligning the three-dimensional coordinate points P in the three-dimensional point cloud data with the pixels at the corresponding positions in the RGB image and computing the relative position deviation between the coordinates of the mapped points P′ and the pixel coordinates; the intrinsic matrix M_i is obtained by calibrating the internal distortion parameters of the camera.
The image to be positioned may be an RGB image captured by a user using any camera device (including a low-pixel camera), the image to be positioned is positioned twice, the reference position and posture information obtained for the first time is used as a coarse positioning result, and the target position and posture information obtained for the second time is transmitted to a terminal user as an accurate positioning result (i.e., a final visual positioning result).
In an alternative embodiment, the matching module is implemented by a neural network, and the system further includes: training the neural network according to a data set.
The system provided by the invention constructs a data set with position and posture information based on the three-dimensional point cloud data of the well-registered laser radar and the RGB image data of the camera in a deep learning mode, is used for training the neural network and matching and identifying the image to be positioned, and brings the characteristic information and the position and posture information of the image in the data set into the neural network for end-to-end training to obtain the trained neural network. After the neural network is trained, when a new image to be positioned is input, an image group matched with the image to be positioned can be quickly searched, and accurate position and posture information is calculated and obtained in a characteristic point matching mode.
In an alternative embodiment, the neural network comprises:
the first network is used for carrying out feature extraction on the input image to obtain a first feature map;
the second network is used for extracting the features of the first feature map to obtain a second feature map;
the third network is used for carrying out local feature extraction on the first feature map to obtain key points and local feature descriptors;
and the fourth network is used for carrying out global feature extraction on the second feature map to obtain a global feature descriptor.
As shown in fig. 2, the first network employs a MobileNet (labeled MobileNet(1)), the second network employs a MobileNet (labeled MobileNet(2)), the third network employs a SuperPoint decoder, and the fourth network employs NetVLAD. The local feature head branches off the MobileNet(1) encoder earlier than the global feature head, so spatially discriminative features are retained at a higher spatial resolution; the SuperPoint decoder scheme is adopted for the keypoints and local feature descriptors, whose semantic level can be lower than that of the global feature descriptor. This gives the coarse positioning higher robustness, accuracy and real-time performance.
In an alternative embodiment, the data set includes a plurality of sets of sample images, each of the sets of sample images including position and orientation information, and the method further includes: acquiring the plurality of sample image groups,
wherein the obtaining the plurality of sample image groups comprises:
acquiring three-dimensional point cloud data of a laser radar and RGB image data of a camera in a three-dimensional scene to be reconstructed;
determining the position and attitude information of each frame of image in a point cloud space;
acquiring complete point cloud of the three-dimensional scene to be reconstructed;
mapping the complete point cloud into an image space based on the position and posture information of each frame of image in the point cloud space, and constructing a dense depth map of each frame of image;
and storing each frame of image and the dense depth map pair corresponding to each frame of image as a sample image group, and using the position and posture information of each frame of image in a point cloud space as the position and posture information of the sample image group.
In an alternative embodiment, the acquiring a complete point cloud of the three-dimensional scene to be reconstructed includes:
registering two adjacent frames of point clouds based on normal distribution transformation to obtain a relative position and attitude matrix M between the two adjacent frames of point clouds;
and based on the relative position attitude matrix M, converting each frame of point cloud in the three-dimensional point cloud data of the laser radar to obtain complete point cloud data of the three-dimensional scene to be reconstructed.
The system of the invention uses the Normal Distributions Transform (NDT): the space occupied by the point cloud of the reference frame is divided into grids of a specified size, and the multidimensional normal distribution parameters of each grid are calculated. Each point of the current frame's point cloud is then transformed according to a transfer matrix T, the grid each transformed point falls into is determined, and a probability density function p(x) is evaluated for that grid. Because a point may contribute to several grids, the probability densities are computed per grid and summed to obtain an objective function s(p), the total registration score; optimizing s(p) over all points yields the optimal match between the two adjacent frames of point clouds. This registration process does not rely on feature computation and matching of corresponding points, so a high real-time speed can be ensured.
Because the points in each grid approximately follow a Gaussian distribution, the mean q and the covariance matrix Σ of the point set in each grid can be calculated:

$q = \frac{1}{n}\sum_{i=1}^{n} x_i$

$\Sigma = \frac{1}{n}\sum_{i=1}^{n} (x_i - q)(x_i - q)^{T}$

where $x_i$ denotes a point in the grid, i denotes the index of the point, and n denotes the number of points in the grid.
The probability density of a grid and the total registration score are then:

$p(x) \propto \exp\!\left(-\tfrac{1}{2}(x-q)^{T}\Sigma^{-1}(x-q)\right)$

$s(p) = \sum_{i}\exp\!\left(-\tfrac{1}{2}(x_i' - q_i)^{T}\Sigma_i^{-1}(x_i' - q_i)\right)$

where x denotes a point in a grid, $x_i'$ denotes the i-th transformed point of the current frame falling in one of the target grids, and $q_i$ and $\Sigma_i$ denote the mean and covariance computed from the point set of that grid.
Wherein the mapping the complete point cloud into an image space to construct a dense depth map for each frame of image comprises:
respectively converting the scanning data of the plurality of points into a Cartesian coordinate system to obtain first data, wherein the first data comprises a plurality of three-dimensional coordinate points P;
mapping the first data into the two-dimensional image space according to the calibration data to obtain second data, wherein the second data comprises a plurality of mapping points P′;
And constructing a sparse depth map according to the second data.
It will be appreciated that the three-dimensional point cloud data is recorded as points when scanned, i.e., it comprises scanning data of a plurality of points, each corresponding to one three-dimensional coordinate in a Cartesian coordinate system; the three-dimensional point cloud data can therefore be converted into first data comprising a plurality of three-dimensional coordinate points P. During mapping, each three-dimensional coordinate point P is mapped to a mapping point P′ in the two-dimensional image space, where the two-dimensional image space can be understood as the two-dimensional space defined by the width and height of the image. When the distance between the three-dimensional coordinate point P corresponding to a mapping point P′ and the origin of the camera coordinate system does not exceed the maximum usable distance (which can be preset according to the use requirement), the mapping point P′ in the two-dimensional image space is saved to the dense depth map; otherwise the mapping point P′ is discarded. When saving the dense depth map, each frame of image and its corresponding dense depth map need to be stored as a pair.
The coordinates of each three-dimensional coordinate point P can be obtained from P = (r·cosα·cosθ, r·sinα·cosθ, r·sinθ), where α denotes the horizontal angle of the point P relative to the origin in the Cartesian coordinate system, θ denotes the vertical angle of the point P relative to the origin, and r denotes the distance of the point P from the origin. As mentioned above, the calibration data comprise the offset matrix M_o between the three-dimensional point cloud data and the RGB image data and the intrinsic matrix M_i of the camera; during mapping, the following mapping can be adopted:
$P' = M_i \, M_o \, P$
In an alternative embodiment, the second positioning module is configured to: matching and searching three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image;
and determining the target position posture information of the image to be positioned by a PnP method according to each three-dimensional coordinate point and the projection position of each three-dimensional point coordinate in an image space.
It can be understood that when a candidate image group can be screened out through the neural network, matching is further performed in the dense depth map, matching and searching are performed on known 2D feature points (each feature point) of the image and three-dimensional coordinate points in the dense depth map, and then accurate position and posture information (target position and posture information) of an image to be positioned is solved based on 3D space points (each three-dimensional coordinate point) and a projection position thereof according to a PnP method, so that the positioning method can achieve positioning accuracy of a centimeter level. And when the matched three-dimensional coordinate point cannot be found in the dense depth map by each feature point, performing coarse positioning on the image to be positioned again, wherein the positioning process is as described above and is not repeated here.
When solving by the PnP method, the three-dimensional coordinate point P = (X, Y, Z) is projected to a projection point (u, v, 1) in the image space, and the augmented matrix [R | t] is defined, which yields the following equation:

$s\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = [\,R \mid t\,]\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} t_1^{T} \\ t_2^{T} \\ t_3^{T} \end{pmatrix}\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$

where s is the depth of the point and $t_1^{T}$, $t_2^{T}$, $t_3^{T}$ are the three rows of [R | t].
wherein each feature point pair (three-dimensional coordinate point and projection point) provides two linear constraints with respect to t:

$t_1^{T}\tilde{P} - u\,t_3^{T}\tilde{P} = 0, \qquad t_2^{T}\tilde{P} - v\,t_3^{T}\tilde{P} = 0$

with $\tilde{P} = (X, Y, Z, 1)^{T}$.
The unknown vector t formed by stacking the 12 entries of [R | t] therefore has 12 dimensions in total, and at least 6 feature point pairs are needed for a linear solution of t.
Solving this system in the least-squares sense via SVD yields the target position and attitude information of the image to be positioned, namely R (attitude) and t (position).
The system disclosed by the invention performs the processes of preprocessing (calibrating three-dimensional point cloud data of a laser radar and RGB image data of a camera) and deep learning training on the digital twin data and performing retrieval and accurate matching on a new image (an image to be positioned) on a cloud server, and the new image can be obtained from any camera equipment, such as a mobile phone, a network camera and the like. The system takes three-dimensional point cloud data obtained in the digital twin data acquisition process and a large number of pre-shot RGB images as input sources, and an image retrieval database (data set) is formed by utilizing the input sources; after uploading new images shot by any camera equipment, a terminal user can quickly retrieve the closest preset image (candidate image group) in the image retrieval database, and then the accurate position and posture information is calculated in a characteristic point matching mode.
The two progressive methods of image retrieval coarse matching positioning and point cloud precise matching positioning are adopted, so that the running time can be greatly saved, and the system is suitable for real-time operation. The method adopted by the invention takes the high-precision digital twin environment as the source database, the terminal user only needs to use cheap camera equipment, and the relative positioning result is corrected by using an absolute space coordinate as a reference value, so that the problems of insufficient positioning precision and insufficient robustness of the existing visual positioning mode are solved, and the method has high practical value.
The disclosure also relates to an electronic device comprising a server, a terminal and the like. The electronic device includes: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to the storage medium, the communication component receiving and transmitting data under control of the processor; wherein the memory stores instructions executable by the at least one processor to implement the method of the above embodiments.
In an alternative embodiment, the memory is used as a non-volatile computer-readable storage medium for storing non-volatile software programs, non-volatile computer-executable programs, and modules. The processor executes various functional applications of the device and data processing, i.e., implements the method, by executing nonvolatile software programs, instructions, and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and, when executed by the one or more processors, perform the methods of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, has corresponding functional modules and beneficial effects of the execution method, and can refer to the method provided by the embodiment of the application without detailed technical details in the embodiment.
The present disclosure also relates to a computer-readable storage medium for storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Furthermore, those of ordinary skill in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It will be understood by those skilled in the art that while the present invention has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A method of visual repositioning, the method comprising:
matching an image to be positioned, and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image;
determining reference position and posture information of the image to be positioned based on the position and posture information of the candidate image group, and determining the same characteristic points between the candidate image and the image to be positioned;
and determining the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
2. The method of claim 1, wherein the matching of the image to be located, the acquisition of the candidate image group matched with the image to be located, is realized by a neural network, and the method further comprises: training the neural network according to a data set.
3. The method of claim 2, wherein the neural network comprises:
the first network is used for carrying out feature extraction on the input image to obtain a first feature map;
the second network is used for extracting the features of the first feature map to obtain a second feature map;
the third network is used for carrying out local feature extraction on the first feature map to obtain key points and local feature descriptors;
and the fourth network is used for carrying out global feature extraction on the second feature map to obtain a global feature descriptor.
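The four networks of claims 2 and 3 can be pictured as a shared backbone with a local head and a global head. The PyTorch sketch below is only an assumed arrangement: the layer sizes, descriptor dimensions and the average pooling used to obtain the global descriptor are not specified by the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelocNet(nn.Module):
    """Assumed arrangement of the four networks in claims 2-3."""
    def __init__(self, local_dim=128, global_dim=256):
        super().__init__()
        # first network: feature extraction on the input image -> first feature map
        self.first = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # second network: further feature extraction -> second feature map
        self.second = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # third network: local head -> key points and local feature descriptors
        self.score_head = nn.Conv2d(128, 1, 1)
        self.local_head = nn.Conv2d(128, local_dim, 1)
        # fourth network: global head -> global feature descriptor
        self.global_head = nn.Linear(256, global_dim)

    def forward(self, image):
        f1 = self.first(image)                           # first feature map
        f2 = self.second(f1)                             # second feature map
        key_points = torch.sigmoid(self.score_head(f1))  # key point score map
        local_desc = F.normalize(self.local_head(f1), dim=1)
        pooled = F.adaptive_avg_pool2d(f2, 1).flatten(1)
        global_desc = F.normalize(self.global_head(pooled), dim=1)
        return key_points, local_desc, global_desc
```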
4. The method of claim 2, wherein the data set comprises a plurality of sample image groups, each of the sample image groups comprising position and posture information, the method further comprising: acquiring the plurality of sample image groups,
wherein the obtaining of the plurality of sample image groups comprises:
acquiring three-dimensional point cloud data of a laser radar and RGB image data of a camera in a three-dimensional scene to be reconstructed;
determining the position and posture information of each frame of image in a point cloud space;
acquiring complete point cloud of the three-dimensional scene to be reconstructed;
mapping the complete point cloud into an image space based on the position and posture information of each frame of image in the point cloud space, and constructing a dense depth map of each frame of image;
and storing each frame of image and the dense depth map pair corresponding to each frame of image as a sample image group, and using the position and posture information of each frame of image in a point cloud space as the position and posture information of the sample image group.
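A sample image group of claim 4 thus pairs a frame with its dense depth map and its position and posture information in the point cloud space. The sketch below shows one assumed way to store such a group with NumPy; the compressed-archive layout and field names are illustrative assumptions.

```python
import numpy as np

def save_sample_group(path, rgb_image, dense_depth, pose_in_cloud):
    """rgb_image: HxWx3 uint8; dense_depth: HxW float32 depth map;
    pose_in_cloud: 4x4 transform of the frame in the point cloud space."""
    np.savez_compressed(
        path,
        image=rgb_image,
        dense_depth=dense_depth.astype(np.float32),
        pose=np.asarray(pose_in_cloud, dtype=np.float32),
    )

def load_sample_group(path):
    data = np.load(path)
    return data["image"], data["dense_depth"], data["pose"]
```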
5. The method of claim 4, wherein the acquiring of the complete point cloud of the three-dimensional scene to be reconstructed comprises:
registering two adjacent frames of point clouds based on normal distribution transformation to obtain a relative position and posture matrix M between the two adjacent frames of point clouds;
and based on the relative position and posture matrix M, converting each frame of point cloud in the three-dimensional point cloud data of the laser radar to obtain complete point cloud data of the three-dimensional scene to be reconstructed.
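The second step of claim 5 chains the relative position and posture matrices M and transforms every frame into a common coordinate system. In the sketch below the matrices are assumed to come from an external normal distribution transform (NDT) registration routine (for example PCL's NormalDistributionsTransform); only the chaining and accumulation are shown, and the frame and matrix conventions are assumptions.

```python
import numpy as np

def accumulate_point_cloud(frames, relative_poses):
    """frames: list of (N_i, 3) point arrays, one per lidar frame.
    relative_poses: list of 4x4 matrices; relative_poses[i] maps points of
    frame i+1 into the coordinate system of frame i (output of NDT)."""
    complete = [frames[0]]
    to_first = np.eye(4)
    for cloud, M in zip(frames[1:], relative_poses):
        to_first = to_first @ M  # compose transforms back to the first frame
        homogeneous = np.hstack([cloud, np.ones((cloud.shape[0], 1))])
        complete.append((homogeneous @ to_first.T)[:, :3])
    return np.vstack(complete)   # complete point cloud of the scene
```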
6. The method of claim 4, wherein the lidar three-dimensional point cloud data comprises scan data for a plurality of points,
wherein the mapping of the complete point cloud into an image space to construct a dense depth map for each frame of image comprises:
respectively converting the scanning data of the plurality of points into a Cartesian coordinate system to obtain first data, wherein the first data comprises a plurality of three-dimensional coordinate points P;
mapping the first data into the two-dimensional image space according to the calibration data to obtain second data, wherein the second data comprises a plurality of mapping points P′;
And constructing a sparse depth map according to the second data.
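Claim 6 converts each scanned point to Cartesian coordinates, maps it into the image with the calibration data, and writes the result into a depth map. The sketch below assumes the scan data are (range, azimuth, elevation) triples in radians and that the calibration data consist of a 4x4 lidar-to-camera extrinsic matrix and a 3x3 intrinsic matrix; these conventions are assumptions, not taken from the claim.

```python
import numpy as np

def build_sparse_depth(scan_rae, T_cam_lidar, K, height, width):
    """scan_rae: (N, 3) array of (range, azimuth, elevation) per scanned point;
    T_cam_lidar: 4x4 lidar-to-camera extrinsic matrix; K: 3x3 intrinsics."""
    r, az, el = scan_rae[:, 0], scan_rae[:, 1], scan_rae[:, 2]
    # first data: three-dimensional coordinate points P in the lidar frame
    P = np.stack([r * np.cos(el) * np.cos(az),
                  r * np.cos(el) * np.sin(az),
                  r * np.sin(el)], axis=1)
    # transform into the camera frame using the calibration data
    P_h = np.hstack([P, np.ones((P.shape[0], 1))])
    P_cam = (P_h @ T_cam_lidar.T)[:, :3]
    P_cam = P_cam[P_cam[:, 2] > 0]          # keep points in front of the camera
    # second data: mapping points P' in the two-dimensional image space
    proj = P_cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)
    depth[v[valid], u[valid]] = P_cam[valid, 2]   # sparse depth map
    return depth
```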
7. The method of claim 1, wherein the determining of the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image comprises:
matching and searching three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image;
and determining the target position and posture information of the image to be positioned by a PnP method according to each three-dimensional coordinate point and the projection position of each three-dimensional coordinate point in an image space.
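For the PnP step of claim 7, a common realisation is OpenCV's RANSAC-based solver, which recovers the camera pose from the 2D feature points and the 3D coordinate points looked up in the dense depth map. The use of solvePnPRansac and the variable names below are assumptions; any PnP solver would fit the claim.

```python
import cv2
import numpy as np

def estimate_target_pose(points_3d, points_2d, K):
    """points_3d: (N, 3) coordinates taken from the dense depth map;
    points_2d: (N, 2) feature point positions in the image to be positioned;
    K: 3x3 camera intrinsic matrix."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32),
        K.astype(np.float32), None)
    if not ok:
        raise RuntimeError("PnP failed to recover a pose")
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> rotation matrix
    # R, tvec map world (point cloud) coordinates into the camera frame
    return R, tvec
```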
8. A visual repositioning system, wherein the system comprises:
the matching module is used for matching an image to be positioned and acquiring a candidate image group matched with the image to be positioned, wherein the candidate image group comprises a candidate image and a dense depth map corresponding to the candidate image;
the first positioning module is used for determining reference position and posture information of the image to be positioned based on the position and posture information of the candidate image group and determining the same characteristic points between the candidate image and the image to be positioned;
and the second positioning module is used for determining the target position and posture information of the image to be positioned according to the three-dimensional coordinate points corresponding to the feature points in the dense depth map corresponding to the candidate image.
9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor for implementing the method according to any one of claims 1-7.
CN202011052456.4A 2020-09-29 2020-09-29 Visual repositioning method and system Pending CN112132900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052456.4A CN112132900A (en) 2020-09-29 2020-09-29 Visual repositioning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052456.4A CN112132900A (en) 2020-09-29 2020-09-29 Visual repositioning method and system

Publications (1)

Publication Number Publication Date
CN112132900A true CN112132900A (en) 2020-12-25

Family

ID=73844895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052456.4A Pending CN112132900A (en) 2020-09-29 2020-09-29 Visual repositioning method and system

Country Status (1)

Country Link
CN (1) CN112132900A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989604A (en) * 2016-02-18 2016-10-05 合肥工业大学 Target object three-dimensional color point cloud generation method based on KINECT
CN107680102A (en) * 2017-08-28 2018-02-09 国网甘肃省电力公司电力科学研究院 A kind of airborne cloud data electric force pole tower extraction method based on space constraint
JP2020035216A (en) * 2018-08-30 2020-03-05 Kddi株式会社 Image processing device, method, and program
CN109300190A (en) * 2018-09-06 2019-02-01 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the storage medium of three-dimensional data
CN110853075A (en) * 2019-11-05 2020-02-28 北京理工大学 Visual tracking positioning method based on dense point cloud and synthetic view
CN111161398A (en) * 2019-12-06 2020-05-15 苏州智加科技有限公司 Image generation method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686962A (en) * 2021-01-21 2021-04-20 中国科学院空天信息创新研究院 Indoor visual positioning method and device and electronic equipment
WO2022177505A1 (en) * 2021-02-17 2022-08-25 National University Of Singapore Methods relating to virtual reality systems and interactive objects
CN117011685A (en) * 2023-09-27 2023-11-07 之江实验室 Scene recognition method and device and electronic device
CN117011685B (en) * 2023-09-27 2024-01-09 之江实验室 Scene recognition method and device and electronic device

Similar Documents

Publication Publication Date Title
CN112132972B (en) Three-dimensional reconstruction method and system for fusing laser and image data
CN110568447B (en) Visual positioning method, device and computer readable medium
CN109615652B (en) Depth information acquisition method and device
CN110097553B (en) Semantic mapping system based on instant positioning mapping and three-dimensional semantic segmentation
CN105758426B (en) The combined calibrating method of the multisensor of mobile robot
CN110176032B (en) Three-dimensional reconstruction method and device
CN109947097B (en) Robot positioning method based on vision and laser fusion and navigation application
CN112132900A (en) Visual repositioning method and system
CN108537844B (en) Visual SLAM loop detection method fusing geometric information
CN109118544B (en) Synthetic aperture imaging method based on perspective transformation
DE112011102132T5 (en) Method and device for image-based positioning
CN111260539B (en) Fish eye pattern target identification method and system thereof
CN109709977B (en) Method and device for planning movement track and moving object
CN112207821B (en) Target searching method of visual robot and robot
Cvišić et al. Recalibrating the KITTI dataset camera setup for improved odometry accuracy
CN110634138A (en) Bridge deformation monitoring method, device and equipment based on visual perception
CN115201883B (en) Moving target video positioning and speed measuring system and method
CN113066112B (en) Indoor and outdoor fusion method and device based on three-dimensional model data
CN117036300A (en) Road surface crack identification method based on point cloud-RGB heterogeneous image multistage registration mapping
CN110991306B (en) Self-adaptive wide-field high-resolution intelligent sensing method and system
CN112017238A (en) Method and device for determining spatial position information of linear object
CN111198563B (en) Terrain identification method and system for dynamic motion of foot type robot
CN109883400B (en) Automatic target detection and space positioning method for fixed station based on YOLO-SITCOL
CN111724432A (en) Object three-dimensional detection method and device
CN114092388A (en) Obstacle detection method based on monocular camera and odometer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination