CN112132888B - Monocular camera positioning in large-scale indoor sparse laser radar point clouds - Google Patents

Monocular camera positioning in large-scale indoor sparse laser radar point clouds

Info

Publication number
CN112132888B
Authority
CN
China
Prior art keywords
camera
similarity
pose
initial
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010597347.4A
Other languages
Chinese (zh)
Other versions
CN112132888A (en)
Inventor
陈钰
王冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Black Sesame Intelligent Technology Chongqing Co Ltd
Original Assignee
Black Sesame Intelligent Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/533,389 external-priority patent/US11380003B2/en
Application filed by Black Sesame Intelligent Technology Chongqing Co Ltd filed Critical Black Sesame Intelligent Technology Chongqing Co Ltd
Publication of CN112132888A publication Critical patent/CN112132888A/en
Application granted granted Critical
Publication of CN112132888B publication Critical patent/CN112132888B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • G01C21/206Instruments for performing navigational calculations specially adapted for indoor navigation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/02Systems using the reflection of electromagnetic waves other than radio waves
    • G01S17/06Systems determining position data of a target
    • G01S17/42Simultaneous measurement of distance and other co-ordinates
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

A method of camera positioning includes receiving a camera image, receiving a lidar point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the lidar point cloud, measuring a similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the measured similarity.

Description

Monocular camera positioning in large-scale indoor sparse laser radar point clouds
Cross reference to related applications
The present application claims the benefit of U.S. provisional application No. 62/866,509, filed on June 25, 2019, and U.S. patent application No. 16/533,389, filed on August 6, 2019, the entire contents of both of which are incorporated herein by reference.
Technical Field
The present invention relates to autonomous driving systems, and in particular, to camera positioning using LiDAR point clouds.
Background
For a robot to navigate autonomously in space, it must localize itself precisely within a map of the environment. Six-degree-of-freedom pose estimation of the positioning sensor is therefore one of the enabling techniques for robots such as autonomous cars. Currently, the most commonly used positioning sensors are cameras and LiDAR (Light Detection and Ranging). The LiDAR generates a point cloud of the surroundings as a map, and the positioning system finds an optimal registration between the live point clouds and a sub-region of the map to infer the pose of the LiDAR.
Currently, a guess for the initial position can be obtained from a high-definition Global Positioning System (GPS). Possible problems with this mass-production approach include the cost of high-definition GPS, and the fact that GPS signals may be unavailable where the sky is obscured (e.g., in indoor parking lots). If a guess for the initial position cannot be provided, a LiDAR localization system needs significantly more computing resources or may fail altogether.
Cameras are inexpensive and ubiquitous, and are already used for Visual Odometry (VO) and visual Simultaneous Localization and Mapping (vSLAM). The pose of the camera can be determined from visual input alone, without using GPS at the same time. However, disadvantages of camera-based positioning include its sensitivity to lighting conditions and to the structure of the visual scene, and its inaccurate perception of scene depth. Under uneven illumination, such as a sudden transition from dark shadow into bright light, direct visual odometry or vSLAM may fail. Visual methods typically rely on the existence of structure so that the algorithms can find many features to track across frames. A typical parking garage, for example, contains large white walls and repeated columns, which makes VO/vSLAM ineffective. Furthermore, when an accurate depth of the scene is not available, visual methods can quickly accumulate scale drift, which results in large cumulative positioning errors.
Therefore, to achieve better localization, a method is proposed that effectively positions a camera image by using a sparse LiDAR point cloud and the camera image together.
Disclosure of Invention
A first example of camera positioning, comprising at least one of the following steps: the method includes receiving a camera image, receiving a LiDAR point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the LiDAR point cloud, measuring a similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the measured similarity.
A second example of camera positioning, comprising at least one of the following steps: receiving a camera image, receiving a LiDAR point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the LiDAR point cloud, measuring a similarity of the initial camera pose to the initial depth projection set as a state value, measuring a regression of the similarity of the initial camera pose to the initial depth projection set, measuring a gradient of the similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the gradient of the similarity.
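The steps of the two example methods can be wired together as in the following skeleton, shown here as a minimal Python sketch; every helper passed into the function (pose estimation, projection sampling, similarity, regression, gradient, derivation) is a hypothetical placeholder standing in for the components described in the detailed description below, not an API defined by this disclosure.

```python
# Hypothetical skeleton of the second example's step order; all injected
# callables are placeholders for the components described later.
import numpy as np

def localize_camera(camera_image: np.ndarray,
                    lidar_points: np.ndarray,
                    estimate_initial_pose,
                    sample_depth_projections,
                    measure_similarity,
                    measure_similarity_regression,
                    measure_similarity_gradient,
                    derive_next_projections):
    initial_pose = estimate_initial_pose(camera_image)                  # rough 6-DoF guess
    projections = sample_depth_projections(lidar_points, initial_pose)  # initial depth projection set
    state_value = measure_similarity(camera_image, initial_pose, projections)
    regression = measure_similarity_regression(camera_image, initial_pose, projections)
    gradient = measure_similarity_gradient(camera_image, initial_pose, projections)
    # derive the subsequent depth projection set from the similarity gradient
    return derive_next_projections(projections, gradient), state_value, regression
```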
Drawings
In the drawings:
FIG. 1 is a system diagram of a first example according to one embodiment of the present disclosure;
FIG. 2 is a system diagram of a second example according to one embodiment of the present disclosure;
FIG. 3 is an example depiction of a data collection vehicle according to one embodiment of the present disclosure;
FIG. 4 is an example camera image map according to one embodiment of this disclosure;
FIG. 5 is an example of a depth map according to one embodiment of the present disclosure;
FIG. 6 is an example logic flow of the EnforceNet network according to one embodiment of the present disclosure;
FIG. 7 is an example logic flow of a pose regression network according to an embodiment of the disclosure;
FIG. 8 is an example convolutional neural network layout of EnforceNet, according to one embodiment of the present disclosure;
FIG. 9 is an example data output for different training settings of a system according to one embodiment of the present disclosure;
FIG. 10 is a first example method of camera positioning according to one embodiment of the disclosure; and
Fig. 11 is a second example method of camera positioning according to one embodiment of the disclosure.
Detailed Description
The following examples are presented to illustrate the application of the apparatus and method and are not intended to limit the scope. Modifications to the equivalent of the apparatus and method are intended to be within the scope of the claims.
Certain terms are used throughout the following description and claims to refer to particular system components. Those skilled in the art will appreciate that different companies may refer to a component and/or a method by different names. It is not intended to distinguish between components and/or methods that differ in name but not function.
In the following discussion and claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus may be interpreted to mean "including, but not limited to,". Likewise, the terms "coupled" or "coupled" are intended to mean either a direct or an indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
Estimating the camera pose from a camera image within a prior 3D LiDAR point cloud map can provide a possible solution. Cameras capture many times more environmental semantics than LiDAR and allow edge-based localization.
Mainstream positioning methods currently operate within a single sensor modality, and LiDAR and cameras belong to different sensor modalities. LiDAR SLAM compares point cloud structures, while vSLAM matches camera image features. Because dense LiDAR scanning is time-consuming and expensive, LiDAR scans are typically sparse, which makes the camera more suitable for positioning within a sparse point cloud.
In the present disclosure, a novel end-to-end neural network architecture provides a solution for camera pose estimation within a LiDAR point cloud. The method uses a monocular camera. Given a single camera image and an initial rough pose estimate, the method generates a set of depth projections from the point cloud and uses pairs of the camera image and depth projections to infer the pose of the camera. The solution uses a state value constraint in the network (called a resistance module) to quantify the fit of the pose estimate and back-propagate it through the network. The resistance module yields faster network convergence.
Existing camera positioning with LiDAR point clouds renders synthetic views from the 3D map and compares them to camera images. This approach works best when the LiDAR map is dense and the pose transformation between the current camera view and the rendered image is small. However, under changing lighting conditions and scale drift, the method may suffer from pose jumps and cumulative positioning errors. The disclosed method lets the state value prediction and the pose regression communicate, which yields improved inference speed.
Fig. 1 depicts an example of an automated parking assistance system 100 that may be used to implement a deep neural network associated with the operation of one or more portions or steps in processes 700 and 800. In this example, the processors associated with the hybrid system include a Field Programmable Gate Array (FPGA) 122, a Graphics Processing Unit (GPU) 120, and a Central Processing Unit (CPU) 118.
The processing units 118, 120, and 122 have the capability of providing deep neural networks. A CPU is a general-purpose processor that can perform many different functions; its versatility enables it to handle many different tasks, but its processing of multiple data streams is limited and its functionality with respect to neural networks is very limited. A GPU is an image processor that has many small processing cores capable of processing tasks in parallel. An FPGA is a field-programmable device that has the ability to be reconfigured and can perform, in hardwired circuitry, any function that could be programmed into a CPU or GPU. Because the FPGA is programmed in circuit form, it is many times faster than a CPU and significantly faster than a GPU.
There are other types of processors that the system may include, for example, an Accelerated Processing Unit (APU), which includes a CPU with on-chip GPU elements, and a Digital Signal Processor (DSP), which is dedicated to performing high-speed digital data processing. An Application Specific Integrated Circuit (ASIC) may also perform the hardwired functions of an FPGA; however, the lead time for designing and producing an ASIC is on the order of several quarters of a year, rather than the fast turnaround available when programming an FPGA.
The graphics processing unit 120, the central processing unit 118, and the field programmable gate array 122 are connected to each other and to the memory interface controller 112. The FPGA is connected to the memory interface through the programmable logic circuit to memory interconnect 130. This additional device is employed because the FPGA operates at a very large bandwidth, and it minimizes the circuitry the FPGA uses to perform memory tasks. The memory interface controller 112 is additionally coupled to a persistent storage disk 110, system memory 114, and read-only memory (ROM) 116.
The system of fig. 2 can be used to program and train the FPGA. The GPU works well with unstructured data and can be used for training; once the data has been trained and a deterministic inference model has been found, the CPU can program the FPGA with the model data determined by the GPU.
The memory interface controller is connected to a central interconnect 124, which is additionally connected to the GPU 120, the CPU 118, and the FPGA 122. The central interconnect 124 is additionally connected to an input and output interface 128 and a network interface 126, the input and output interface 128 being connected to a forward-facing camera 132, a left side camera 134, and a LiDAR 136.
Fig. 2 depicts a second example hybrid computing system 200 that may be used to implement a neural network associated with the operation of one or more portions or steps of flowchart 500. In this example, the processor associated with the system includes a Field Programmable Gate Array (FPGA) 210 and a Central Processing Unit (CPU) 220.
The FPGA is electrically connected to the FPGA controller 212, which FPGA controller 212 is connected to a Direct Memory Access (DMA) 218. The DMA is connected to an input buffer 214 and an output buffer 216, which are coupled to the FPGA to buffer data into and out of the FPGA, respectively. The DMA 218 has two first-in-first-out (FIFO) buffers, one for the main CPU and the other for the FPGA, which allows data to be written to and read from the appropriate buffers.
The main switch 228 is on the CPU side of the DMA and transfers data and commands to the DMA. The DMA is also connected to a Synchronous Dynamic Random Access Memory (SDRAM) controller 224, which allows data to be transferred between the FPGA, an external SDRAM 226, and the CPU 220. The main switch 228 is connected to a peripheral interface 230, which peripheral interface 230 is connected to a forward-facing camera 232, a left side camera 234, and a LiDAR 236. A flash controller 222 controls persistent memory and is connected to the CPU 220.
Fig. 3 discloses a bird's eye view 300 of a stitched view from the system of the present disclosure. The forward camera 132 and the left camera 134 are red, green, blue (RGB) monocular cameras and provide camera images to the system. Cameras 132 and 134 may also be stereo, infrared, black-and-white cameras, and the like. The LiDAR in this example is mounted overhead and provides a sparse LiDAR point cloud map of the vehicle surroundings.
Method
The disclosed method compares the camera image to a set of depth projections rendered from multiple pose guesses and quantifies the proximity of each guess to the actual camera pose in order to determine the sample positions for the next iteration of pose guesses. The disclosed method operates in camera image feature space rather than inferring a depth map from the camera image and operating in point cloud space.
Description of the problem
Formally, a six-degree-of-freedom camera pose is defined as P = [R, t] ∈ SE(3), where R ∈ SO(3) is the rotation and t ∈ R³ is the translation. The LiDAR map is a collection of points in space, M = {m_i | m_i ∈ R³, i = 1, 2, 3, …, |M|}. The depth projection D of the LiDAR map from a viewing pose V_p is defined as:
D_Vp = proj(M, V_p, F, K)
where F is the set of projection parameters of the clipping planes, and K is the set of camera intrinsic parameters.
The pair of a camera image I (taken at pose P_i) and a depth map D (projected at pose P_d) is defined as
H_Pi,Pd = I_Pi ⊕ D_Pd
where the ⊕ operator computes a representation vector of the pair. The pose difference between I_Pi and D_Pd is defined as
ΔP_i,d = P_i − P_d, ΔP_i,d ∈ SE(3)
Thus, the pose estimation problem can be expressed as
P_i = argmin_Pd E(H_Pi,Pd)
where E is a function that quantifies the similarity between I_Pi and D_Pd. For convenience of presentation, the pose of the depth projection is denoted P_d, and the pose at which the camera image was taken is denoted P_i.
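As a concrete illustration of the projection proj(M, V_p, F, K) defined above, the following is a minimal Python sketch assuming a standard pinhole camera model; the intrinsic-matrix layout, the near/far clipping values, and the closest-point-per-pixel handling are illustrative assumptions rather than details taken from the disclosure.

```python
# Sketch: render a sparse depth map D_Vp of the LiDAR map M from a viewing
# pose V_p = [R, t], under an assumed pinhole model.
import numpy as np

def project_depth(points, R, t, K, width, height, near=0.1, far=100.0):
    """points: (N, 3) LiDAR map points in world coordinates.
    R: (3, 3) world-to-camera rotation; t: (3,) world-to-camera translation.
    K: (3, 3) camera intrinsic matrix."""
    cam = points @ R.T + t                      # transform into the camera frame
    z = cam[:, 2]
    keep = (z > near) & (z < far)               # clip against the near/far planes (F)
    cam, z = cam[keep], z[keep]
    uv = cam @ K.T                              # perspective projection
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)   # 0 means no LiDAR return
    order = np.argsort(-z[inside])              # write far points first, near points last
    depth[v[inside][order], u[inside][order]] = z[inside][order]
    return depth
```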
Because of the sparsity of the LiDAR point cloud, the information shared between I_Pi and D_Pd is significantly insufficient to quantify, even when both are taken at the same pose. Heuristic post-processing and similarity measurement on sparse depth maps are therefore inefficient, which motivates a learning method of the EnforceNet type for the network design.
Fig. 4 is an example camera image map, in this example 410 is a complete camera image, 412 is a filtered view of the camera image and 414 is an edge of the camera image.
Fig. 5 is an example depth map, in which 510 is a full depth map, 512 is a filtered view of the depth map and 514 is an edge of the depth map.
EnforceNet designs
Fig. 6 is an example logic flow of the EnforceNet network. Since camera pose estimation is typically a real-time task that must deliver reliable results under various lighting conditions, the network is designed to be fast, accurate, and generalizable. Ideally, the system should operate at high frequency on embedded hardware with limited computing resources, should require only limited retraining/fine-tuning when applied to different scenes, and should remain stable in the face of changing lighting conditions.
The system leverages the prior 3D point cloud M and the many depth projections {D_j | j ∈ [1, ∞)} that can be generated from sampled poses P_d. The system has the ability to explore training pairs H_i,d with large or small pose differences and to gradually derive a reasonable pose approaching P_i.
To infer P_i for one camera image, a set of poses P_d within M can be randomly sampled and E(H_Pi,Pd) measured to derive a subsequent set of P_d, repeating until convergence is complete. The process may be described as a Markov Decision Process (MDP), in which future pose estimates are independent of past estimates given the current estimate, which motivates a reinforcement-learning-based framework.
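The sampling loop described above might look like the following minimal sketch, assuming a render_depth(points, R, t) callable (for example, project_depth above with the intrinsics and image size bound) and an energy function standing in for E that is minimized; the Gaussian perturbations and the shrinking search radius are illustrative heuristics, not details specified by the disclosure.

```python
# Sketch: randomly sample projection poses P_d around the current best guess
# and keep the candidate that minimizes E(H_Pi,Pd), repeating until the
# search region has shrunk (a stand-in for convergence).
import numpy as np

def _small_rotation(sigma_deg, rng):
    # random small rotation about a random axis (Rodrigues' formula)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.normal(scale=sigma_deg))
    S = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * S + (1.0 - np.cos(angle)) * (S @ S)

def infer_pose(image, points, init_R, init_t, render_depth, energy,
               n_samples=32, iters=10, trans_sigma=1.0, rot_sigma_deg=5.0, seed=0):
    rng = np.random.default_rng(seed)
    best = (init_R, init_t)
    best_cost = energy(image, render_depth(points, *best))
    for _ in range(iters):
        for _ in range(n_samples):
            dR = _small_rotation(rot_sigma_deg, rng)
            dt = rng.normal(scale=trans_sigma, size=3)
            cand = (dR @ best[0], best[1] + dt)                  # candidate pose P_d
            cost = energy(image, render_depth(points, *cand))    # E(H_Pi,Pd)
            if cost < best_cost:
                best, best_cost = cand, cost
        trans_sigma *= 0.5                                       # tighten the search region
        rot_sigma_deg *= 0.5
    return best
```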
A virtual agent explores the space for the best pose, and H_Pi,Pd is considered the state of the virtual agent. The state value function F(ΔP_i,d) is defined as a monotonically decreasing function: if ΔP_i,d is low, the state has a high value; when ΔP_i,d is high, the state value is low.
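One illustrative (assumed) choice of such a monotonically decreasing state value function, mapping small pose differences to values near 1 and large ones to values near 0, is sketched below; the exponential form and the weights are not prescribed by the disclosure.

```python
# Sketch: an example state value F(ΔP_i,d) that decreases monotonically with
# the translation distance and the rotation angle of the pose difference.
import numpy as np

def state_value(delta_t, delta_R, w_t=1.0, w_r=1.0):
    """delta_t: (3,) translation difference; delta_R: (3, 3) rotation difference."""
    angle = np.arccos(np.clip((np.trace(delta_R) - 1.0) / 2.0, -1.0, 1.0))
    return np.exp(-(w_t * np.linalg.norm(delta_t) + w_r * angle))
```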
In one example, a first network quantifies the state value and a second network regresses ΔP_i,d. The input to the second network is also the camera image and depth projection pair H_Pi,Pd, and its labels are the respective ground truth pose differences.
Fig. 6 depicts two sets of camera images 610 and 616 and two sets of depth maps 614 and 620 being compared at 612 and 618, respectively. One set of camera images and depth maps is routed to the state value network 622 and the other set is routed to the pose regression network 624, where the state value network 622 and the pose regression network 624 share weights 626. The state value network 622 and the pose regression network 624 are connected by a resistance module 628 that constrains the state values. The output of the state value network is 630 ΔP_i,d and the output of the pose regression network is 632 ΔP_i,d. Fed back into the state value network and into the pose regression network are ground truth values 634 and gradients 636.
Pose regression network
FIG. 7 is an example logic flow of a pose regression network. The pose regression network and the state value network are connected by a resistance module, where the ΔP_i,d prediction is used as the state value label. In this example, when ΔP_i,d is large the depth projection is far from the image and the state value is small; when the Δpose is small, the state value is large. The resistance module forces the state value network to learn the ground truth ΔP_i,d. Weights are shared between the pose regression network and the state value network. The shared weights strengthen the adjustment of the ΔP_i,d regression network, because the state value gradients and the ground truth ΔP_i,d back-propagate through the regression network. The state value network additionally learns the ground truth information from the shared weights more effectively, which benefits both training and inference speed.
FIG. 7 depicts an example of a pose regression network in which the current pose P_i 712 from a camera image 720 and a guessed pose P_d 710 from the LiDAR point cloud 718 are fed into a pose convolutional neural network 722. The pose convolutional neural network 722 outputs a Δpose 724, which is fed into a pose loss module 716, and the current and guessed poses are combined into ΔP_i,d, which is also fed into the pose loss module 716.
EnforceNet layout
Fig. 8 is an example convolutional neural network layout of EnforceNet. In one example network architecture for implementing the backbone network of the systems, methods, and computer-readable media, a 7-layer convolutional neural network (ConvNet) is used, as shown in FIG. 8. The camera image and depth projection pair H_Pi,Pd is the network input. In the present example, the ⊕ operator is a stack operator. This example architecture is lightweight and accurate for the camera positioning task.
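A sketch of what such a network could look like in PyTorch is given below; the channel widths, kernel sizes, and head layouts are illustrative guesses rather than the architecture of the disclosure. The input stacks the camera image with the depth projection, mirroring the stack operator, and a single 7-layer convolutional backbone is shared by a pose-regression head and a state-value head, which is one way to realize the weight sharing described above.

```python
# Sketch of an EnforceNet-style backbone with shared weights (assumed sizes).
import torch
import torch.nn as nn

class EnforceNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [4, 32, 64, 64, 128, 128, 256, 256]          # 7 conv layers
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(inplace=True)]
        self.backbone = nn.Sequential(*layers)               # shared by both heads
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.pose_head = nn.Linear(256, 6)                   # ΔP_i,d: 3 rotation + 3 translation
        self.value_head = nn.Linear(256, 1)                  # state value F

    def forward(self, image, depth):
        h = torch.cat([image, depth], dim=1)                 # stack H_Pi,Pd = I ⊕ D
        feat = self.pool(self.backbone(h)).flatten(1)
        return self.pose_head(feat), self.value_head(feat)
```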
For training the network, a simple Euclidean error combining translation and rotation is used. The ground truth transformation ΔP_t,i,d (with rotation ΔR_t,i,d and translation Δt_t,i,d) at time stamp t between the depth projection and the camera image capture can be described as:
ΔP_t,i,d = [ΔR_t,i,d, Δt_t,i,d]
The predicted transformation ΔP̂_t,i,d can be written in the same form:
ΔP̂_t,i,d = [ΔR̂_t,i,d, Δt̂_t,i,d]
The pose loss can be defined as a weighted sum of the errors of the rotation and translation components (α1 weights the rotation loss within the pose loss, and α2 weights the translation loss within the pose loss), as follows:
Loss_ΔPt = α1·‖ΔR_t,i,d − ΔR̂_t,i,d‖ + α2·‖Δt_t,i,d − Δt̂_t,i,d‖
In addition to the pose loss, a state value loss is added. The state value has the ability to evaluate the current pose prediction. To train the state value function, the negative pose loss is taken as the ground truth S_t, and the state value prediction can be expressed as F(ΔP_i,d). Thus, the state value loss Loss_F(ΔP_t,i,d) can be expressed as follows (α3 is the weight of the state value loss):
Loss_F(ΔP_t,i,d) = α3·‖S_t − F(ΔP_t,i,d)‖, with S_t = −Loss_ΔPt
In summary, the total loss Loss in the network is:
Loss = Loss_ΔPt + Loss_F(ΔP_t,i,d)
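The three loss terms can be combined as in the following sketch, under the stated definitions: the pose loss is the weighted rotation/translation error above, and the resistance-style constraint supervises the state value with S_t = −Loss_ΔPt. Detaching the target from the gradient is an implementation assumption, not something specified by the disclosure.

```python
# Sketch: pose loss, state value loss, and total loss (assumed PyTorch code).
import torch

def enforcenet_loss(pred_rot, pred_trans, pred_value, gt_rot, gt_trans,
                    a1=1.0, a2=1.0, a3=1.0):
    rot_err = torch.norm(pred_rot - gt_rot, dim=-1)            # ‖ΔR − ΔR̂‖
    trans_err = torch.norm(pred_trans - gt_trans, dim=-1)      # ‖Δt − Δt̂‖
    pose_loss = a1 * rot_err + a2 * trans_err                  # Loss_ΔPt
    s_t = -pose_loss.detach()                                  # state value ground truth S_t
    value_loss = a3 * torch.abs(pred_value.squeeze(-1) - s_t)  # Loss_F
    return (pose_loss + value_loss).mean()                     # total Loss
```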
To effectively render the depth maps, an agent performs depth projection and depth map enhancement. To generate additional pairs of camera images and depth projections, the pose of the camera is perturbed with random noise to render additional depth projections. The perturbation produces pairs of images with known ground truth ΔP. The rotational perturbation of the pose is ±5° and the translational perturbation is ±1 m. In one example, key frames of the camera images are obtained using LiDAR SLAM and sensor synchronization. For the parking garages {G_ijk | i ∈ [1,2], j ∈ [1,2,3], k ∈ [1,2]}, many image pairs with key frames having ground truth pose differences are generated. For training the model, the dataset is divided into 60% for training, 30% for validation, and 10% for testing.
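A minimal sketch of this training-pair generation and the 60/30/10 split is shown below, assuming uniform perturbation noise and SciPy's rotation utilities; the helper names and the exact noise distribution are assumptions.

```python
# Sketch: perturb each key-frame camera pose by up to ±5° and ±1 m to render
# an extra depth projection with a known ground-truth ΔP, then split the pairs.
import numpy as np
from scipy.spatial.transform import Rotation

def make_training_pairs(keyframe_poses, rng=None):
    """keyframe_poses: list of (R, t) camera poses with synchronized images."""
    rng = rng or np.random.default_rng(0)
    pairs = []
    for R, t in keyframe_poses:
        angles_deg = rng.uniform(-5.0, 5.0, size=3)               # ±5° rotation
        dt = rng.uniform(-1.0, 1.0, size=3)                       # ±1 m translation
        dR = Rotation.from_euler("zyx", angles_deg, degrees=True).as_matrix()
        pairs.append({"image_pose": (R, t),
                      "projection_pose": (dR @ R, t + dt),
                      "gt_delta": (dR, dt)})                      # known ground truth ΔP
    rng.shuffle(pairs)
    n = len(pairs)
    return (pairs[:int(0.6 * n)],                                 # 60% training
            pairs[int(0.6 * n):int(0.9 * n)],                     # 30% validation
            pairs[int(0.9 * n):])                                 # 10% testing
```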
Fig. 8 depicts an example layout of the EnforceNet convolutional neural network (CNN). The network takes a camera image 814 and a depth projection 812 to create ΔP_i,d 810. The camera image and depth projection are fed into the modules of networks 816-820, which output a state value 822 and a Δpose 826. ΔP_i,d 810 is fed into the Loss_Δt module 828 along with the Δpose 826. The Loss_Δt module 828 outputs a signal 830 to the -Loss_ΔΔt module 824. The -Loss_ΔΔt module 824 receives the signal 830 and the state value 822.
Positioning accuracy
Positioning accuracy is evaluated using data from different parking garages at different times to show both accuracy and general applicability. Table 910 in Fig. 9 summarizes some combinations of garage, time, and camera orientation.
The visual appearance of the same garage may vary significantly with parking conditions and time of day. To verify that the model remains accurate in the face of visual appearance changes, tests are run over several passes with different camera image trajectories. Table 912 in Fig. 9 contains details of the positioning accuracy for one garage. The data indicate that the translational error is less than 10 cm for the general-case test dataset, while the rotational error hovers around 0 degrees.
Once trained with data from a single garage, the model remains effective in validation. Table 914 in Fig. 9 shows that a mixed version of the training data increases the convergence speed. Thus, the data indicate similar results in the same-garage scenario and across garage scenarios.
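For reference, the translation (centimeter) and rotation (degree) errors of the kind reported in Fig. 9 could be computed from a predicted pose and a ground-truth pose as in the sketch below; these are standard metric definitions assumed for illustration, not formulas quoted from the disclosure.

```python
# Sketch: translation error in cm and rotation error in degrees between a
# predicted pose (pred_R, pred_t) and a ground-truth pose (gt_R, gt_t).
import numpy as np

def pose_errors(pred_R, pred_t, gt_R, gt_t):
    trans_err_cm = 100.0 * np.linalg.norm(pred_t - gt_t)          # meters -> centimeters
    dR = pred_R @ gt_R.T                                          # relative rotation
    angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    return trans_err_cm, np.degrees(angle)
```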
A method of camera positioning is depicted in fig. 10, including receiving (1010) a camera image, receiving (1012) a LIDAR point cloud, estimating (1014) an initial pose of the camera image, sampling (1016) an initial depth projection set within the LIDAR point cloud, measuring (1018) a similarity of the initial camera pose to the initial depth projection set, and deriving (1020) a subsequent depth projection set based on the measured similarity.
The method of fig. 10 may additionally provide for deriving the subsequent depth projection set based on a minimization of the measured similarity, and the measured similarity may be a function based on state values.
The method of fig. 10 may further comprise the steps of: back-propagating the derivation of the subsequent depth projection set, measuring the pose loss, subsequently deriving the subsequent depth projection set based on the measured similarity until convergence, regressing the ground truth pose difference based on the measured similarity, and back-propagating the state-value-based function as a gradient together with the ground truth pose difference.
A method of camera positioning is depicted in fig. 11, including receiving (1110) a camera image, receiving (1112) a LIDAR point cloud, estimating (1114) an initial pose of the camera image, sampling (1116) an initial depth projection set within the LIDAR point cloud, measuring (1118) a similarity of the initial camera pose to the initial depth projection set as a state value, measuring (1120) a regression of the similarity of the initial camera pose to the initial depth projection set, measuring (1122) a similarity gradient of the initial camera pose to the initial depth projection set, and deriving (1124) a subsequent depth projection set based on the similarity gradient.
The method of fig. 11 may additionally provide for performing the measurement of the similarity by a first network, performing the measurement of the regression of the similarity by a second network, and performing, by a resistance module, a constraint on the state values of the measured similarity fed back to the first and second networks.
The method of fig. 11 may further include sharing weights between the first network and the second network, constraining the state values of the measured similarities fed back to the first network and the second network, and subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
The present disclosure proposes a network, EnforceNet, that provides an end-to-end solution for camera pose localization within a large-scale sparse 3D LiDAR point cloud. The EnforceNet network has a resistance module and a weight-sharing scheme. Experiments on real datasets from large indoor parking garages show that EnforceNet reaches state-of-the-art positioning accuracy with excellent generalization performance.
Those of skill in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations thereof. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged in different ways (e.g., in different orders, or partitioned in different ways) without departing from the scope of the subject technology.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an illustration of example approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged. Some steps may be performed simultaneously. The accompanying method claims present elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The foregoing description provides various examples of the subject technology, but the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the claim language, wherein the use of elements in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more". The term "some" means one or more unless stated otherwise. Pronouns in males (e.g., his) include females and neutrality (e.g., she and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the invention. Predicates "configured to", "operable to", and "programmed to" do not imply any particular tangible or intangible modification to the subject, but are intended to be used interchangeably. For example, a processor configured to monitor and control operations or components may also mean that the processor is programmed to monitor and control operations, or that the processor is operable to monitor and control operations. Likewise, a processor configured to execute code may be interpreted as a processor programmed to execute code or a processor operable to execute code.
Phrases such as "an aspect" do not imply that such aspect is essential to the present technology or that such aspect applies to all configurations of the subject technology. The disclosure relating to an aspect may apply to all configurations, or one or more configurations. One aspect may provide one or more examples. A phrase such as an "aspect" may refer to one or more aspects and vice versa. Phrases such as "an embodiment" do not imply that such an embodiment is essential to the subject technology or that such an embodiment applies to all configurations of the subject technology. The disclosure directed to one embodiment may be applicable to all embodiments, or one or more embodiments. One embodiment may provide one or more examples. A phrase such as an "embodiment" may refer to one or more embodiments and vice versa. Phrases such as "configuration" do not imply that such configuration is essential to the subject technology, or that such configuration applies to all configurations of the subject technology. The disclosure relating to one configuration may apply to all configurations, or one or more configurations. One or more examples may be provided for one configuration. A phrase such as "configured" may refer to one or more configurations and vice versa.
The word "example" is used herein to mean "serving as an example or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public, regardless of whether such disclosure is explicitly recited in the claims. Furthermore, to the extent that the terms "includes," "including," "has," or similar terms are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim.
Reference to "one embodiment," "an embodiment," "some embodiments," "various embodiments," or similar language means that a particular element or feature is included in at least one embodiment of the present invention. Although a phrase may appear in multiple places, the phrase does not necessarily refer to the same embodiment. Those of skill in the art will be able to devise and incorporate any of a variety of mechanisms suitable for achieving the above-described functionality in conjunction with the present disclosure.
It should be understood that the present disclosure teaches merely examples of illustrative embodiments, that many variations of the invention can be readily devised by those skilled in the art after reading this disclosure, and that the scope of the invention is determined by the claims that follow.

Claims (16)

1. A method of camera positioning, comprising:
receiving a camera image;
Receiving a laser radar point cloud;
Estimating an initial camera pose of the camera image;
sampling an initial depth projection set within the lidar point cloud;
Measuring a similarity of the initial camera pose to the initial depth projection set; and
A subsequent depth projection set is derived based on the measured similarity in the camera image feature space.
2. The method of camera positioning of claim 1, wherein the deriving a subsequent depth projection set is based on a minimization of measured similarity.
3. The method of camera positioning of claim 1, further comprising back-propagating a derivation of the subsequent depth projection set.
4. The method of camera positioning according to claim 1, wherein the measured similarity is a function based on a state value.
5. The method of camera positioning of claim 1, further comprising measuring a pose loss.
6. The method of camera positioning according to claim 1, wherein the lidar point cloud is a sparse dataset.
7. The method of camera positioning of claim 1, further comprising subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
8. The method of camera positioning according to claim 4, further comprising regressing the ground truth pose difference based on the measured similarity.
9. The method of camera positioning according to claim 8, further comprising back-propagating the state value based function as a gradient and the ground truth pose difference.
10. A method of camera positioning, comprising:
receiving a camera image;
Receiving a laser radar point cloud;
Estimating an initial camera pose of the camera image;
sampling an initial depth projection set within the lidar point cloud;
Measuring similarity of the initial camera pose and the initial depth projection set as a state value;
measuring a regression of the initial camera pose and the similarity of the initial depth projection set;
measuring a gradient of the similarity of the initial camera pose to the initial depth projection set; and
A subsequent depth projection set is derived based on the gradient of the similarity in camera image feature space.
11. The method of camera positioning according to claim 10, wherein the measuring of the similarity is performed through a first network.
12. The method of camera positioning according to claim 11, characterized in that the measurement of the regression of the similarity is performed through a second network.
13. The method of camera positioning of claim 12, further comprising weight sharing between the first network and the second network.
14. The method of camera positioning according to claim 13, further comprising constraining state values of measured similarities fed back to the first network and the second network.
15. The method of camera positioning according to claim 14, wherein the constraining of the state values of the measured similarities fed back to the first network and the second network is performed by a resistance module.
16. The method of camera positioning of claim 15, further comprising subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
CN202010597347.4A 2019-06-25 2020-06-28 Monocular camera positioning in large-scale indoor sparse laser radar point clouds Active CN112132888B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962866509P 2019-06-25 2019-06-25
US62/866,509 2019-06-25
US16/533,389 2019-08-06
US16/533,389 US11380003B2 (en) 2019-06-25 2019-08-06 Monocular camera localization in large scale indoor sparse LiDAR point cloud

Publications (2)

Publication Number Publication Date
CN112132888A CN112132888A (en) 2020-12-25
CN112132888B true CN112132888B (en) 2024-04-26

Family

ID=73851141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597347.4A Active CN112132888B (en) 2019-06-25 2020-06-28 Monocular camera positioning in large-scale indoor sparse laser radar point clouds

Country Status (1)

Country Link
CN (1) CN112132888B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10848731B2 (en) * 2012-02-24 2020-11-24 Matterport, Inc. Capturing and aligning panoramic image and depth data
US10866101B2 (en) * 2017-06-13 2020-12-15 Tusimple, Inc. Sensor calibration and time system for ground truth static scene sparse flow generation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934827A (en) * 2015-12-31 2017-07-07 杭州华为数字技术有限公司 The method for reconstructing and device of three-dimensional scenic
CN108509820A (en) * 2017-02-23 2018-09-07 百度在线网络技术(北京)有限公司 Method for obstacle segmentation and device, computer equipment and readable medium
WO2019060125A1 (en) * 2017-09-22 2019-03-28 Zoox, Inc. Three-dimensional bounding box from two-dimensional image and point cloud data
CN108648240A (en) * 2018-05-11 2018-10-12 东南大学 Based on a non-overlapping visual field camera posture scaling method for cloud characteristics map registration
CN108717712A (en) * 2018-05-29 2018-10-30 东北大学 A kind of vision inertial navigation SLAM methods assumed based on ground level
CN109887057A (en) * 2019-01-30 2019-06-14 杭州飞步科技有限公司 The method and apparatus for generating high-precision map

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Automotive radar and camera fusion using Generative Adversarial Networks; Vladimir Lekic et al.; Computer Vision and Image Understanding; 1-8 *
Localization in unstructured environments based on the fusion of a camera and a swinging LiDAR (基于相机与摇摆激光雷达融合的非结构化环境定位); 俞毓锋 et al.; Acta Automatica Sinica (自动化学报); 1791-1798 *
Research on RGB-D SLAM algorithms for indoor mobile robots (室内移动机器人RGB-D SLAM算法研究); 张米令; China Masters' Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); I140-1534 *

Also Published As

Publication number Publication date
CN112132888A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
JP6862409B2 (en) Map generation and moving subject positioning methods and devices
US11064178B2 (en) Deep virtual stereo odometry
WO2018177159A1 (en) Method and system for determining position of moving object
CN107808407A (en) Unmanned plane vision SLAM methods, unmanned plane and storage medium based on binocular camera
WO2019196476A1 (en) Laser sensor-based map generation
WO2019104571A1 (en) Image processing method and device
Eynard et al. Real time UAV altitude, attitude and motion estimation from hybrid stereovision
Dusha et al. Fixed-wing attitude estimation using computer vision based horizon detection
CN110853085B (en) Semantic SLAM-based mapping method and device and electronic equipment
Wen et al. Hybrid semi-dense 3D semantic-topological mapping from stereo visual-inertial odometry SLAM with loop closure detection
CN113903011A (en) Semantic map construction and positioning method suitable for indoor parking lot
WO2021081774A1 (en) Parameter optimization method and apparatus, control device, and aircraft
WO2022062480A1 (en) Positioning method and positioning apparatus of mobile device
KR20200056905A (en) Method and apparatus for aligning 3d model
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN114943757A (en) Unmanned aerial vehicle forest exploration system based on monocular depth of field prediction and depth reinforcement learning
John et al. Automatic calibration and registration of lidar and stereo camera without calibration objects
CN112132888B (en) Monocular camera positioning in large-scale indoor sparse laser radar point clouds
CN107610224A (en) It is a kind of that algorithm is represented based on the Weakly supervised 3D automotive subjects class with clear and definite occlusion modeling
US11380003B2 (en) Monocular camera localization in large scale indoor sparse LiDAR point cloud
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN115655291A (en) Laser SLAM closed-loop mapping method and device, mobile robot, equipment and medium
CN115661341A (en) Real-time dynamic semantic mapping method and system based on multi-sensor fusion
CN114648639A (en) Target vehicle detection method, system and device
Mei et al. A Novel scene matching navigation system for UAVs based on vision/inertial fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant