CN112132888B - Monocular camera positioning in large-scale indoor sparse laser radar point clouds - Google Patents

Monocular camera positioning in large-scale indoor sparse laser radar point clouds

Info

Publication number
CN112132888B
Authority
CN
China
Prior art keywords
camera
similarity
pose
initial
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010597347.4A
Other languages
Chinese (zh)
Other versions
CN112132888A (en)
Inventor
陈钰
王冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Black Sesame Intelligent Technology Chongqing Co Ltd
Original Assignee
Black Sesame Intelligent Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/533,389 external-priority patent/US11380003B2/en
Application filed by Black Sesame Intelligent Technology Chongqing Co Ltd filed Critical Black Sesame Intelligent Technology Chongqing Co Ltd
Publication of CN112132888A publication Critical patent/CN112132888A/en
Application granted granted Critical
Publication of CN112132888B publication Critical patent/CN112132888B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • G01C21/206Instruments for performing navigational calculations specially adapted for indoor navigation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/02Systems using the reflection of electromagnetic waves other than radio waves
    • G01S17/06Systems determining position data of a target
    • G01S17/42Simultaneous measurement of distance and other co-ordinates
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

A method of camera positioning includes receiving a camera image, receiving a lidar point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the lidar point cloud, measuring a similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the measured similarity.

Description

Monocular camera positioning in large-scale indoor sparse laser radar point clouds
Cross reference to related applications
The present application claims the benefit of U.S. provisional application No. 62/866,509, filed on June 25, 2019, and U.S. patent application No. 16/533,389, filed on August 6, 2019, the entire contents of both of which are incorporated herein by reference.
Technical Field
The present invention relates to autonomous driving systems, and in particular, to camera positioning using LiDAR point clouds.
Background
For a robot to navigate autonomously in space, it must localize itself precisely within a map of the environment. Six-degree-of-freedom pose estimation of the positioning sensor is therefore one of the enabling techniques for robots such as autonomous cars. Currently, the most commonly used positioning sensors are cameras and LiDAR (Light Detection and Ranging). The LiDAR generates a point cloud of the surroundings as a map, and the positioning system finds an optimal registration between the live point clouds and a sub-region of the map to infer the pose of the LiDAR.
Currently, a guess for the initial position can be obtained from a high-definition Global Positioning System (GPS). Possible problems with this mass-production approach include the cost of high-definition GPS, and the fact that GPS signals may be unavailable where the sky is obscured (e.g., in indoor parking lots). If a guess for the initial position cannot be provided, a LiDAR localization system needs significantly more computing resources or may fail altogether.
Cameras are inexpensive and ubiquitous, and are already used for Visual Odometry (VO) and visual Simultaneous Localization and Mapping (vSLAM). The pose of the camera can be determined from visual input alone, without using GPS at the same time. However, disadvantages of camera-based positioning include its sensitivity to lighting conditions and to the structure of the visual scene, and its inaccurate perception of scene depth. Under uneven illumination, such as a sudden transition from dark shadow into bright light, direct visual odometry or vSLAM may fail. Visual methods typically rely on the existence of structure so that the algorithms can find many features to track across frames. A typical parking garage, for example, contains large white walls and repeated columns, which makes VO/vSLAM ineffective. Furthermore, when an accurate depth of the scene is not available, visual methods can quickly accumulate scale drift, which results in large cumulative positioning errors.
Therefore, to achieve better localization, a method is proposed that effectively positions a camera image by using a sparse LiDAR point cloud and the camera image together.
Disclosure of Invention
A first example of camera positioning, comprising at least one of the following steps: the method includes receiving a camera image, receiving a LiDAR point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the LiDAR point cloud, measuring a similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the measured similarity.
A second example of camera positioning, comprising at least one of the following steps: receiving a camera image, receiving a LiDAR point cloud, estimating an initial camera pose of the camera image, sampling an initial depth projection set within the LiDAR point cloud, measuring a similarity of the initial camera pose to the initial depth projection set as a state value, measuring a regression of the similarity of the initial camera pose to the initial depth projection set, measuring a gradient of the similarity of the initial camera pose to the initial depth projection set, and deriving a subsequent depth projection set based on the gradient of the similarity.
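The steps of the two example methods can be wired together as in the following skeleton, shown here as a minimal Python sketch; every helper passed into the function (pose estimation, projection sampling, similarity, regression, gradient, derivation) is a hypothetical placeholder standing in for the components described in the detailed description below, not an API defined by this disclosure.

```python
# Hypothetical skeleton of the second example's step order; all injected
# callables are placeholders for the components described later.
import numpy as np

def localize_camera(camera_image: np.ndarray,
                    lidar_points: np.ndarray,
                    estimate_initial_pose,
                    sample_depth_projections,
                    measure_similarity,
                    measure_similarity_regression,
                    measure_similarity_gradient,
                    derive_next_projections):
    initial_pose = estimate_initial_pose(camera_image)                  # rough 6-DoF guess
    projections = sample_depth_projections(lidar_points, initial_pose)  # initial depth projection set
    state_value = measure_similarity(camera_image, initial_pose, projections)
    regression = measure_similarity_regression(camera_image, initial_pose, projections)
    gradient = measure_similarity_gradient(camera_image, initial_pose, projections)
    # derive the subsequent depth projection set from the similarity gradient
    return derive_next_projections(projections, gradient), state_value, regression
```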
Drawings
In the drawings:
FIG. 1 is a system diagram of a first example according to one embodiment of the present disclosure;
FIG. 2 is a system diagram of a second example according to one embodiment of the present disclosure;
FIG. 3 is an example depiction of a data collection vehicle according to one embodiment of the present disclosure;
FIG. 4 is an example camera image map according to one embodiment of this disclosure;
FIG. 5 is an example of a depth map according to one embodiment of the present disclosure;
FIG. 6 is an example logic flow of the EnforceNet network according to one embodiment of the present disclosure;
FIG. 7 is an example logic flow of a pose regression network according to an embodiment of the disclosure;
FIG. 8 is an example convolutional neural network layout of EnforceNet, according to one embodiment of the present disclosure;
FIG. 9 is an example data output for different training settings of a system according to one embodiment of the present disclosure;
FIG. 10 is a first example method of camera positioning according to one embodiment of the disclosure; and
Fig. 11 is a second example method of camera positioning according to one embodiment of the disclosure.
Detailed Description
The following examples are presented to illustrate the application of the apparatus and method and are not intended to limit the scope. Modifications to the equivalent of the apparatus and method are intended to be within the scope of the claims.
Certain terms are used throughout the following description and claims to refer to particular system components. Those skilled in the art will appreciate that different companies may refer to a component and/or a method by different names. It is not intended to distinguish between components and/or methods that differ in name but not function.
In the following discussion and claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus may be interpreted to mean "including, but not limited to,". Likewise, the terms "coupled" or "coupled" are intended to mean either a direct or an indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
Estimating the camera pose from a camera image within a prior 3D LiDAR point cloud map can provide a possible solution. Cameras capture many times more environmental semantics than LiDAR and allow edge-based localization.
Mainstream positioning methods currently operate within a single sensor modality, and LiDAR and cameras belong to different sensor modalities. LiDAR SLAM compares point cloud structures, while vSLAM matches camera image features. Because dense LiDAR scanning is time-consuming and expensive, LiDAR scans are typically sparse, which makes the camera more suitable for positioning within a sparse point cloud.
In the present disclosure, a novel end-to-end neural network architecture provides a solution for camera pose estimation within a LiDAR point cloud. The method uses a monocular camera. Given a single camera image and an initial rough pose estimate, the method generates a set of depth projections from the point cloud and uses pairs of the camera image and depth projections to infer the pose of the camera. The solution uses a state value constraint in the network (called a resistance module) to quantify the fit of the pose estimate and back-propagate it through the network. The resistance module yields faster network convergence.
Existing camera positioning with LiDAR point clouds renders synthetic views from the 3D map and compares them to camera images. This approach works best when the LiDAR map is dense and the pose transformation between the current camera view and the rendered image is small. However, under changing lighting conditions and scale drift, the method may suffer from pose jumps and cumulative positioning errors. The disclosed method lets the state value prediction and the pose regression communicate, which yields improved inference speed.
Fig. 1 depicts an example of an automated parking assistance system 100 that may be used to implement a deep neural network associated with the operation of one or more portions or steps in processes 700 and 800. In this example, the processors associated with the hybrid system include a Field Programmable Gate Array (FPGA) 122, a Graphics Processing Unit (GPU) 120, and a Central Processing Unit (CPU) 118.
The processing units 118, 120, and 122 have the capability of providing deep neural networks. A CPU is a general-purpose processor that can perform many different functions; its versatility enables it to handle many different tasks, but its processing of multiple data streams is limited and its functionality with respect to neural networks is very limited. A GPU is an image processor that has many small processing cores capable of processing tasks in parallel. An FPGA is a field-programmable device that has the ability to be reconfigured and can perform, in hardwired circuitry, any function that could be programmed into a CPU or GPU. Because the FPGA is programmed in circuit form, it is many times faster than a CPU and significantly faster than a GPU.
There are other types of processors that the system may include, for example, an Accelerated Processing Unit (APU), which includes a CPU with on-chip GPU elements, and a Digital Signal Processor (DSP), which is dedicated to performing high-speed digital data processing. An Application Specific Integrated Circuit (ASIC) may also perform the hardwired functions of an FPGA; however, the lead time for designing and producing an ASIC is on the order of several quarters of a year, rather than the fast turnaround available when programming an FPGA.
The graphics processing unit 120, the central processing unit 118, and the field programmable gate array 122 are connected to each other and to the memory interface controller 112. The FPGA is connected to the memory interface through the programmable logic circuit to memory interconnect 130. This additional device is employed because the FPGA operates at a very large bandwidth, and it minimizes the circuitry the FPGA uses to perform memory tasks. The memory interface controller 112 is additionally coupled to a persistent storage disk 110, system memory 114, and read-only memory (ROM) 116.
The system of fig. 2 can be used to program and train the FPGA. The GPU works well with unstructured data and can be used for training; once the data has been trained and a deterministic inference model has been found, the CPU can program the FPGA with the model data determined by the GPU.
The memory interface controller is connected to a central interconnect 124, which is additionally connected to the GPU 120, the CPU 118, and the FPGA 122. The central interconnect 124 is additionally connected to an input and output interface 128 and a network interface 126, the input and output interface 128 being connected to a forward-facing camera 132, a left side camera 134, and a LiDAR 136.
Fig. 2 depicts a second example hybrid computing system 200 that may be used to implement a neural network associated with the operation of one or more portions or steps of flowchart 500. In this example, the processor associated with the system includes a Field Programmable Gate Array (FPGA) 210 and a Central Processing Unit (CPU) 220.
The FPGA is electrically connected to the FPGA controller 212, which FPGA controller 212 is connected to a Direct Memory Access (DMA) 218. The DMA is connected to an input buffer 214 and an output buffer 216, which are coupled to the FPGA to buffer data into and out of the FPGA, respectively. The DMA 218 has two first-in-first-out (FIFO) buffers, one for the main CPU and the other for the FPGA, which allows data to be written to and read from the appropriate buffers.
The main switch 228 is on the CPU side of the DMA and transfers data and commands to the DMA. The DMA is also connected to a Synchronous Dynamic Random Access Memory (SDRAM) controller 224, which allows data to be transferred between the FPGA, an external SDRAM 226, and the CPU 220. The main switch 228 is connected to a peripheral interface 230, which peripheral interface 230 is connected to a forward-facing camera 232, a left side camera 234, and a LiDAR 236. A flash controller 222 controls persistent memory and is connected to the CPU 220.
Fig. 3 discloses a bird's eye view 300 of a stitched view from the system of the present disclosure. The forward camera 132 and the left camera 134 are red, green, blue (RGB) monocular cameras and provide camera images to the system. Cameras 132 and 134 may also be stereo, infrared, black-and-white cameras, and the like. The LiDAR in this example is mounted overhead and provides a sparse LiDAR point cloud map of the vehicle surroundings.
Method
The disclosed method compares the camera image to a set of depth projections rendered from multiple pose guesses and quantifies the proximity of each guess to the actual camera pose in order to determine the sample positions for the next iteration of pose guesses. The disclosed method operates in camera image feature space rather than inferring a depth map from the camera image and operating in point cloud space.
Description of the problem
Formally, a six-degree-of-freedom camera pose is defined as P = [R, t] ∈ SE(3), where R ∈ SO(3) is the rotation and t ∈ R³ is the translation. The LiDAR map is a collection of points in space, M = {m_i | m_i ∈ R³, i = 1, 2, 3, …, |M|}. The depth projection D of the LiDAR map from a viewing pose V_p is defined as:
D_Vp = proj(M, V_p, F, K)
where F is the set of projection parameters of the clipping planes, and K is the set of camera intrinsic parameters.
The pair of a camera image I (taken at pose P_i) and a depth map D (projected at pose P_d) is defined as
H_Pi,Pd = I_Pi ⊕ D_Pd
where the ⊕ operator computes a representation vector of the pair. The pose difference between I_Pi and D_Pd is defined as
ΔP_i,d = P_i − P_d, ΔP_i,d ∈ SE(3)
Thus, the pose estimation problem can be expressed as
P_i = argmin_Pd E(H_Pi,Pd)
where E is a function that quantifies the similarity between I_Pi and D_Pd. For convenience of presentation, the pose of the depth projection is denoted P_d, and the pose at which the camera image was taken is denoted P_i.
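As a concrete illustration of the projection proj(M, V_p, F, K) defined above, the following is a minimal Python sketch assuming a standard pinhole camera model; the intrinsic-matrix layout, the near/far clipping values, and the closest-point-per-pixel handling are illustrative assumptions rather than details taken from the disclosure.

```python
# Sketch: render a sparse depth map D_Vp of the LiDAR map M from a viewing
# pose V_p = [R, t], under an assumed pinhole model.
import numpy as np

def project_depth(points, R, t, K, width, height, near=0.1, far=100.0):
    """points: (N, 3) LiDAR map points in world coordinates.
    R: (3, 3) world-to-camera rotation; t: (3,) world-to-camera translation.
    K: (3, 3) camera intrinsic matrix."""
    cam = points @ R.T + t                      # transform into the camera frame
    z = cam[:, 2]
    keep = (z > near) & (z < far)               # clip against the near/far planes (F)
    cam, z = cam[keep], z[keep]
    uv = cam @ K.T                              # perspective projection
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)   # 0 means no LiDAR return
    order = np.argsort(-z[inside])              # write far points first, near points last
    depth[v[inside][order], u[inside][order]] = z[inside][order]
    return depth
```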
Because of the sparsity of the LiDAR point cloud, the information shared between I_Pi and D_Pd is significantly insufficient to quantify, even when both are taken at the same pose. Heuristic post-processing and similarity measurement on sparse depth maps are therefore inefficient, which motivates a learning method of the EnforceNet type for the network design.
Fig. 4 is an example camera image map, in this example 410 is a complete camera image, 412 is a filtered view of the camera image and 414 is an edge of the camera image.
Fig. 5 is an example depth map, in which 510 is a full depth map, 512 is a filtered view of the depth map and 514 is an edge of the depth map.
EnforceNet designs
Fig. 6 is an example logic flow of the EnforceNet network. Since camera pose estimation is typically a real-time task that must deliver reliable results under various lighting conditions, the network is designed to be fast, accurate, and generalizable. Ideally, the system should operate at high frequency on embedded hardware with limited computing resources, should require only limited retraining/fine-tuning when applied to different scenes, and should remain stable in the face of changing lighting conditions.
The system leverages the prior 3D point cloud M and the many depth projections {D_j | j ∈ [1, ∞)} that can be generated from sampled poses P_d. The system has the ability to explore training pairs H_i,d with large or small pose differences and to gradually derive a reasonable pose approaching P_i.
To infer P_i for one camera image, a set of poses P_d within M can be randomly sampled and E(H_Pi,Pd) measured to derive a subsequent set of P_d, repeating until convergence is complete. The process may be described as a Markov Decision Process (MDP), in which future pose estimates are independent of past estimates given the current estimate, which motivates a reinforcement-learning-based framework.
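The sampling loop described above might look like the following minimal sketch, assuming a render_depth(points, R, t) callable (for example, project_depth above with the intrinsics and image size bound) and an energy function standing in for E that is minimized; the Gaussian perturbations and the shrinking search radius are illustrative heuristics, not details specified by the disclosure.

```python
# Sketch: randomly sample projection poses P_d around the current best guess
# and keep the candidate that minimizes E(H_Pi,Pd), repeating until the
# search region has shrunk (a stand-in for convergence).
import numpy as np

def _small_rotation(sigma_deg, rng):
    # random small rotation about a random axis (Rodrigues' formula)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.normal(scale=sigma_deg))
    S = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * S + (1.0 - np.cos(angle)) * (S @ S)

def infer_pose(image, points, init_R, init_t, render_depth, energy,
               n_samples=32, iters=10, trans_sigma=1.0, rot_sigma_deg=5.0, seed=0):
    rng = np.random.default_rng(seed)
    best = (init_R, init_t)
    best_cost = energy(image, render_depth(points, *best))
    for _ in range(iters):
        for _ in range(n_samples):
            dR = _small_rotation(rot_sigma_deg, rng)
            dt = rng.normal(scale=trans_sigma, size=3)
            cand = (dR @ best[0], best[1] + dt)                  # candidate pose P_d
            cost = energy(image, render_depth(points, *cand))    # E(H_Pi,Pd)
            if cost < best_cost:
                best, best_cost = cand, cost
        trans_sigma *= 0.5                                       # tighten the search region
        rot_sigma_deg *= 0.5
    return best
```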
A virtual agent explores the space for the best pose, and H_Pi,Pd is considered the state of the virtual agent. The state value function F(ΔP_i,d) is defined as a monotonically decreasing function: if ΔP_i,d is low, the state has a high value; when ΔP_i,d is high, the state value is low.
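One illustrative (assumed) choice of such a monotonically decreasing state value function, mapping small pose differences to values near 1 and large ones to values near 0, is sketched below; the exponential form and the weights are not prescribed by the disclosure.

```python
# Sketch: an example state value F(ΔP_i,d) that decreases monotonically with
# the translation distance and the rotation angle of the pose difference.
import numpy as np

def state_value(delta_t, delta_R, w_t=1.0, w_r=1.0):
    """delta_t: (3,) translation difference; delta_R: (3, 3) rotation difference."""
    angle = np.arccos(np.clip((np.trace(delta_R) - 1.0) / 2.0, -1.0, 1.0))
    return np.exp(-(w_t * np.linalg.norm(delta_t) + w_r * angle))
```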
In one example, a first network quantifies the state value and a second network regresses ΔP_i,d. The input to the second network is also the camera image and depth projection pair H_Pi,Pd, and its labels are the respective ground truth pose differences.
Fig. 6 depicts two sets of camera images 610 and 616 and two sets of depth maps 614 and 620 being compared at 612 and 618, respectively. One set of camera images and depth maps is routed to the state value network 622 and the other set is routed to the pose regression network 624, where the state value network 622 and the pose regression network 624 share weights 626. The state value network 622 and the pose regression network 624 are connected by a resistance module 628 that constrains the state values. The output of the state value network is 630 ΔP_i,d and the output of the pose regression network is 632 ΔP_i,d. Fed back into the state value network and into the pose regression network are ground truth values 634 and gradients 636.
Pose regression network
FIG. 7 is an example logic flow of a pose regression network. The pose regression network and the state value network are connected by a resistance module, where the ΔP_i,d prediction is used as the state value label. In this example, when ΔP_i,d is large the depth projection is far from the image and the state value is small; when the Δpose is small, the state value is large. The resistance module forces the state value network to learn the ground truth ΔP_i,d. Weights are shared between the pose regression network and the state value network. The shared weights strengthen the adjustment of the ΔP_i,d regression network, because the state value gradients and the ground truth ΔP_i,d back-propagate through the regression network. The state value network additionally learns the ground truth information from the shared weights more effectively, which benefits both training and inference speed.
FIG. 7 depicts an example of a pose regression network in which the current pose P_i 712 from a camera image 720 and a guessed pose P_d 710 from the LiDAR point cloud 718 are fed into a pose convolutional neural network 722. The pose convolutional neural network 722 outputs a Δpose 724, which is fed into a pose loss module 716, and the current and guessed poses are combined into ΔP_i,d, which is also fed into the pose loss module 716.
EnforceNet layout
Fig. 8 is an example convolutional neural network layout of EnforceNet. In one example network architecture for implementing the backbone network of the systems, methods, and computer-readable media, a 7-layer convolutional neural network (ConvNet) is used, as shown in FIG. 8. The camera image and depth projection pair H_Pi,Pd is the network input. In the present example, the ⊕ operator is a stack operator. This example architecture is lightweight and accurate for the camera positioning task.
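A sketch of what such a network could look like in PyTorch is given below; the channel widths, kernel sizes, and head layouts are illustrative guesses rather than the architecture of the disclosure. The input stacks the camera image with the depth projection, mirroring the stack operator, and a single 7-layer convolutional backbone is shared by a pose-regression head and a state-value head, which is one way to realize the weight sharing described above.

```python
# Sketch of an EnforceNet-style backbone with shared weights (assumed sizes).
import torch
import torch.nn as nn

class EnforceNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [4, 32, 64, 64, 128, 128, 256, 256]          # 7 conv layers
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(inplace=True)]
        self.backbone = nn.Sequential(*layers)               # shared by both heads
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.pose_head = nn.Linear(256, 6)                   # ΔP_i,d: 3 rotation + 3 translation
        self.value_head = nn.Linear(256, 1)                  # state value F

    def forward(self, image, depth):
        h = torch.cat([image, depth], dim=1)                 # stack H_Pi,Pd = I ⊕ D
        feat = self.pool(self.backbone(h)).flatten(1)
        return self.pose_head(feat), self.value_head(feat)
```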
For training the network, a simple Euclidean error combining translation and rotation is used. The ground truth transformation ΔP_t,i,d (with rotation ΔR_t,i,d and translation Δt_t,i,d) at time stamp t between the depth projection and the camera image capture can be described as:
ΔP_t,i,d = [ΔR_t,i,d, Δt_t,i,d]
The predicted transformation ΔP̂_t,i,d can be written in the same form:
ΔP̂_t,i,d = [ΔR̂_t,i,d, Δt̂_t,i,d]
The pose loss can be defined as a weighted sum of the errors of the rotation and translation components (α1 weights the rotation loss within the pose loss, and α2 weights the translation loss within the pose loss), as follows:
Loss_ΔPt = α1·‖ΔR_t,i,d − ΔR̂_t,i,d‖ + α2·‖Δt_t,i,d − Δt̂_t,i,d‖
In addition to the pose loss, a state value loss is added. The state value has the ability to evaluate the current pose prediction. To train the state value function, the negative pose loss is taken as the ground truth S_t, and the state value prediction can be expressed as F(ΔP_i,d). Thus, the state value loss Loss_F(ΔP_t,i,d) can be expressed as follows (α3 is the weight of the state value loss):
Loss_F(ΔP_t,i,d) = α3·‖S_t − F(ΔP_t,i,d)‖, with S_t = −Loss_ΔPt
In summary, the total loss Loss in the network is:
Loss = Loss_ΔPt + Loss_F(ΔP_t,i,d)
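The three loss terms can be combined as in the following sketch, under the stated definitions: the pose loss is the weighted rotation/translation error above, and the resistance-style constraint supervises the state value with S_t = −Loss_ΔPt. Detaching the target from the gradient is an implementation assumption, not something specified by the disclosure.

```python
# Sketch: pose loss, state value loss, and total loss (assumed PyTorch code).
import torch

def enforcenet_loss(pred_rot, pred_trans, pred_value, gt_rot, gt_trans,
                    a1=1.0, a2=1.0, a3=1.0):
    rot_err = torch.norm(pred_rot - gt_rot, dim=-1)            # ‖ΔR − ΔR̂‖
    trans_err = torch.norm(pred_trans - gt_trans, dim=-1)      # ‖Δt − Δt̂‖
    pose_loss = a1 * rot_err + a2 * trans_err                  # Loss_ΔPt
    s_t = -pose_loss.detach()                                  # state value ground truth S_t
    value_loss = a3 * torch.abs(pred_value.squeeze(-1) - s_t)  # Loss_F
    return (pose_loss + value_loss).mean()                     # total Loss
```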
To effectively render the depth maps, an agent performs depth projection and depth map enhancement. To generate additional pairs of camera images and depth projections, the pose of the camera is perturbed with random noise to render additional depth projections. The perturbation produces pairs of images with known ground truth ΔP. The rotational perturbation of the pose is ±5° and the translational perturbation is ±1 m. In one example, key frames of the camera images are obtained using LiDAR SLAM and sensor synchronization. For the parking garages {G_ijk | i ∈ [1,2], j ∈ [1,2,3], k ∈ [1,2]}, many image pairs with key frames having ground truth pose differences are generated. For training the model, the dataset is divided into 60% for training, 30% for validation, and 10% for testing.
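A minimal sketch of this training-pair generation and the 60/30/10 split is shown below, assuming uniform perturbation noise and SciPy's rotation utilities; the helper names and the exact noise distribution are assumptions.

```python
# Sketch: perturb each key-frame camera pose by up to ±5° and ±1 m to render
# an extra depth projection with a known ground-truth ΔP, then split the pairs.
import numpy as np
from scipy.spatial.transform import Rotation

def make_training_pairs(keyframe_poses, rng=None):
    """keyframe_poses: list of (R, t) camera poses with synchronized images."""
    rng = rng or np.random.default_rng(0)
    pairs = []
    for R, t in keyframe_poses:
        angles_deg = rng.uniform(-5.0, 5.0, size=3)               # ±5° rotation
        dt = rng.uniform(-1.0, 1.0, size=3)                       # ±1 m translation
        dR = Rotation.from_euler("zyx", angles_deg, degrees=True).as_matrix()
        pairs.append({"image_pose": (R, t),
                      "projection_pose": (dR @ R, t + dt),
                      "gt_delta": (dR, dt)})                      # known ground truth ΔP
    rng.shuffle(pairs)
    n = len(pairs)
    return (pairs[:int(0.6 * n)],                                 # 60% training
            pairs[int(0.6 * n):int(0.9 * n)],                     # 30% validation
            pairs[int(0.9 * n):])                                 # 10% testing
```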
Fig. 8 depicts an example layout of the EnforceNet convolutional neural network (CNN). The network takes a camera image 814 and a depth projection 812 to create ΔP_i,d 810. The camera image and depth projection are fed into the modules of networks 816-820, which output a state value 822 and a Δpose 826. ΔP_i,d 810 is fed into the Loss_Δt module 828 along with the Δpose 826. The Loss_Δt module 828 outputs a signal 830 to the -Loss_ΔΔt module 824. The -Loss_ΔΔt module 824 receives the signal 830 and the state value 822.
Positioning accuracy
Positioning accuracy is evaluated using data from different parking garages at different times to show both accuracy and general applicability. Table 910 in Fig. 9 summarizes some combinations of garage, time, and camera orientation.
The visual appearance of the same garage may vary significantly with parking conditions and time of day. To verify that the model remains accurate in the face of visual appearance changes, tests are run over several passes with different camera image trajectories. Table 912 in Fig. 9 contains details of the positioning accuracy for one garage. The data indicate that the translational error is less than 10 cm for the general-case test dataset, while the rotational error hovers around 0 degrees.
Once trained with data from a single garage, the model remains effective in validation. Table 914 in Fig. 9 shows that a mixed version of the training data increases the convergence speed. Thus, the data indicate similar results in the same-garage scenario and across garage scenarios.
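For reference, the translation (centimeter) and rotation (degree) errors of the kind reported in Fig. 9 could be computed from a predicted pose and a ground-truth pose as in the sketch below; these are standard metric definitions assumed for illustration, not formulas quoted from the disclosure.

```python
# Sketch: translation error in cm and rotation error in degrees between a
# predicted pose (pred_R, pred_t) and a ground-truth pose (gt_R, gt_t).
import numpy as np

def pose_errors(pred_R, pred_t, gt_R, gt_t):
    trans_err_cm = 100.0 * np.linalg.norm(pred_t - gt_t)          # meters -> centimeters
    dR = pred_R @ gt_R.T                                          # relative rotation
    angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    return trans_err_cm, np.degrees(angle)
```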
A method of camera positioning is depicted in fig. 10, including receiving (1010) a camera image, receiving (1012) a LIDAR point cloud, estimating (1014) an initial pose of the camera image, sampling (1016) an initial depth projection set within the LIDAR point cloud, measuring (1018) a similarity of the initial camera pose to the initial depth projection set, and deriving (1020) a subsequent depth projection set based on the measured similarity.
The method of fig. 10 may additionally provide for deriving the subsequent depth projection set based on a minimization of the measured similarity, and the measured similarity may be a function based on state values.
The method of fig. 10 may further comprise the steps of: back-propagating the derivation of the subsequent depth projection set, measuring the pose loss, subsequently deriving the subsequent depth projection set based on the measured similarity until convergence, regressing the ground truth pose difference based on the measured similarity, and back-propagating the state-value-based function as a gradient together with the ground truth pose difference.
A method of camera positioning is depicted in fig. 11, including receiving (1110) a camera image, receiving (1112) a LIDAR point cloud, estimating (1114) an initial pose of the camera image, sampling (1116) an initial depth projection set within the LIDAR point cloud, measuring (1118) a similarity of the initial camera pose to the initial depth projection set as a state value, measuring (1120) a regression of the similarity of the initial camera pose to the initial depth projection set, measuring (1122) a similarity gradient of the initial camera pose to the initial depth projection set, and deriving (1124) a subsequent depth projection set based on the similarity gradient.
The method of fig. 11 may additionally provide for performing the measurement of the similarity by a first network, performing the measurement of the regression of the similarity by a second network, and performing, by a resistance module, a constraint on the state values of the measured similarity fed back to the first and second networks.
The method of fig. 11 may further include sharing weights between the first network and the second network, constraining the state values of the measured similarities fed back to the first network and the second network, and subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
The present disclosure proposes a network, EnforceNet, that provides an end-to-end solution for camera pose localization within a large-scale sparse 3D LiDAR point cloud. The EnforceNet network has a resistance module and a weight-sharing scheme. Experiments on real datasets from large indoor parking garages show that EnforceNet reaches state-of-the-art positioning accuracy with excellent generalization performance.
Those of skill in the art will appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations thereof. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. The various components and blocks may be arranged in different ways (e.g., in different orders, or partitioned in different ways) without departing from the scope of the subject technology.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an illustration of example approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged. Some steps may be performed simultaneously. The accompanying method claims present elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The foregoing description provides various examples of the subject technology, but the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the claim language, wherein the use of elements in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more". The term "some" means one or more unless stated otherwise. Pronouns in males (e.g., his) include females and neutrality (e.g., she and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the invention. Predicates "configured to", "operable to", and "programmed to" do not imply any particular tangible or intangible modification to the subject, but are intended to be used interchangeably. For example, a processor configured to monitor and control operations or components may also mean that the processor is programmed to monitor and control operations, or that the processor is operable to monitor and control operations. Likewise, a processor configured to execute code may be interpreted as a processor programmed to execute code or a processor operable to execute code.
Phrases such as "an aspect" do not imply that such aspect is essential to the present technology or that such aspect applies to all configurations of the subject technology. The disclosure relating to an aspect may apply to all configurations, or one or more configurations. One aspect may provide one or more examples. A phrase such as an "aspect" may refer to one or more aspects and vice versa. Phrases such as "an embodiment" do not imply that such an embodiment is essential to the subject technology or that such an embodiment applies to all configurations of the subject technology. The disclosure directed to one embodiment may be applicable to all embodiments, or one or more embodiments. One embodiment may provide one or more examples. A phrase such as an "embodiment" may refer to one or more embodiments and vice versa. Phrases such as "configuration" do not imply that such configuration is essential to the subject technology, or that such configuration applies to all configurations of the subject technology. The disclosure relating to one configuration may apply to all configurations, or one or more configurations. One or more examples may be provided for one configuration. A phrase such as "configured" may refer to one or more configurations and vice versa.
The word "example" is used herein to mean "serving as an example or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public, regardless of whether such disclosure is explicitly recited in the claims. Furthermore, to the extent that the terms "includes," "including," "has," or similar terms are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim.
Reference to "one embodiment," "an embodiment," "some embodiments," "various embodiments," or similar language means that a particular element or feature is included in at least one embodiment of the present invention. Although a phrase may appear in multiple places, the phrase does not necessarily refer to the same embodiment. Those of skill in the art will be able to devise and incorporate any of a variety of mechanisms suitable for achieving the above-described functionality in conjunction with the present disclosure.
It should be understood that the present disclosure teaches merely examples of illustrative embodiments, that many variations of the invention can be readily devised by those skilled in the art after reading this disclosure, and that the scope of the invention is determined by the claims that follow.

Claims (16)

1. A method of camera positioning, comprising:
receiving a camera image;
Receiving a laser radar point cloud;
Estimating an initial camera pose of the camera image;
sampling an initial depth projection set within the lidar point cloud;
Measuring a similarity of the initial camera pose to the initial depth projection set; and
A subsequent depth projection set is derived based on the measured similarity in the camera image feature space.
2. The method of camera positioning of claim 1, wherein the deriving a subsequent depth projection set is based on a minimization of measured similarity.
3. The method of camera positioning of claim 1, further comprising back-propagating a derivation of the subsequent depth projection set.
4. The method of camera positioning according to claim 1, wherein the measured similarity is a function based on a state value.
5. The method of camera positioning of claim 1, further comprising measuring a pose loss.
6. The method of camera positioning according to claim 1, wherein the lidar point cloud is a sparse dataset.
7. The method of camera positioning of claim 1, further comprising subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
8. The method of camera positioning according to claim 4, further comprising regressing the ground truth pose difference based on the measured similarity.
9. The method of camera positioning according to claim 8, further comprising back-propagating the state value based function as a gradient and the ground truth pose difference.
10. A method of camera positioning, comprising:
receiving a camera image;
Receiving a laser radar point cloud;
Estimating an initial camera pose of the camera image;
sampling an initial depth projection set within the lidar point cloud;
Measuring similarity of the initial camera pose and the initial depth projection set as a state value;
measuring a regression of the initial camera pose and the similarity of the initial depth projection set;
measuring a gradient of the similarity of the initial camera pose to the initial depth projection set; and
A subsequent depth projection set is derived based on the gradient of the similarity in camera image feature space.
11. The method of camera positioning according to claim 10, wherein the measuring of the similarity is performed through a first network.
12. The method of camera positioning according to claim 11, characterized in that the measurement of the regression of the similarity is performed through a second network.
13. The method of camera positioning of claim 12, further comprising weight sharing between the first network and the second network.
14. The method of camera positioning according to claim 13, further comprising constraining state values of measured similarities fed back to the first network and the second network.
15. The method of camera positioning according to claim 14, wherein the constraining of the state values of the measured similarities fed back to the first network and the second network is performed by a resistance module.
16. The method of camera positioning of claim 15, further comprising subsequently deriving the subsequent set of depth projections based on the measured similarity until convergence.
CN202010597347.4A 2019-06-25 2020-06-28 Monocular camera positioning in large-scale indoor sparse laser radar point clouds Active CN112132888B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962866509P 2019-06-25 2019-06-25
US62/866,509 2019-06-25
US16/533,389 2019-08-06
US16/533,389 US11380003B2 (en) 2019-06-25 2019-08-06 Monocular camera localization in large scale indoor sparse LiDAR point cloud

Publications (2)

Publication Number Publication Date
CN112132888A CN112132888A (en) 2020-12-25
CN112132888B true CN112132888B (en) 2024-04-26

Family

ID=73851141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597347.4A Active CN112132888B (en) 2019-06-25 2020-06-28 Monocular camera positioning in large-scale indoor sparse laser radar point clouds

Country Status (1)

Country Link
CN (1) CN112132888B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10848731B2 (en) * 2012-02-24 2020-11-24 Matterport, Inc. Capturing and aligning panoramic image and depth data
US10866101B2 (en) * 2017-06-13 2020-12-15 Tusimple, Inc. Sensor calibration and time system for ground truth static scene sparse flow generation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934827A (en) * 2015-12-31 2017-07-07 杭州华为数字技术有限公司 The method for reconstructing and device of three-dimensional scenic
CN108509820A (en) * 2017-02-23 2018-09-07 百度在线网络技术(北京)有限公司 Method for obstacle segmentation and device, computer equipment and readable medium
WO2019060125A1 (en) * 2017-09-22 2019-03-28 Zoox, Inc. Three-dimensional bounding box from two-dimensional image and point cloud data
CN108648240A (en) * 2018-05-11 2018-10-12 东南大学 Based on a non-overlapping visual field camera posture scaling method for cloud characteristics map registration
CN108717712A (en) * 2018-05-29 2018-10-30 东北大学 A kind of vision inertial navigation SLAM methods assumed based on ground level
CN109887057A (en) * 2019-01-30 2019-06-14 杭州飞步科技有限公司 The method and apparatus for generating high-precision map

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Automotive radar and camera fusion using Generative Adversarial Networks; Vladimir Lekic et al.; Computer Vision and Image Understanding; 1-8 *
Localization in unstructured environments based on the fusion of a camera and a swinging LiDAR (基于相机与摇摆激光雷达融合的非结构化环境定位); 俞毓锋 et al.; Acta Automatica Sinica (自动化学报); 1791-1798 *
Research on RGB-D SLAM algorithms for indoor mobile robots (室内移动机器人RGB-D SLAM算法研究); 张米令; China Masters' Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); I140-1534 *

Also Published As

Publication number Publication date
CN112132888A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
JP6862409B2 (en) Map generation and moving subject positioning methods and devices
US11064178B2 (en) Deep virtual stereo odometry
WO2018177159A1 (en) Method and system for determining position of moving object
CN107808407A (en) Unmanned plane vision SLAM methods, unmanned plane and storage medium based on binocular camera
WO2019196476A1 (en) Laser sensor-based map generation
WO2019104571A1 (en) Image processing method and device
Eynard et al. Real time UAV altitude, attitude and motion estimation from hybrid stereovision
Dusha et al. Fixed-wing attitude estimation using computer vision based horizon detection
CN110853085B (en) Semantic SLAM-based mapping method and device and electronic equipment
Wen et al. Hybrid semi-dense 3D semantic-topological mapping from stereo visual-inertial odometry SLAM with loop closure detection
CN113903011A (en) Semantic map construction and positioning method suitable for indoor parking lot
WO2021081774A1 (en) Parameter optimization method and apparatus, control device, and aircraft
WO2022062480A1 (en) Positioning method and positioning apparatus of mobile device
KR20200056905A (en) Method and apparatus for aligning 3d model
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN114943757A (en) Unmanned aerial vehicle forest exploration system based on monocular depth of field prediction and depth reinforcement learning
John et al. Automatic calibration and registration of lidar and stereo camera without calibration objects
CN112132888B (en) Monocular camera positioning in large-scale indoor sparse laser radar point clouds
CN107610224A (en) It is a kind of that algorithm is represented based on the Weakly supervised 3D automotive subjects class with clear and definite occlusion modeling
US11380003B2 (en) Monocular camera localization in large scale indoor sparse LiDAR point cloud
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN115655291A (en) Laser SLAM closed-loop mapping method and device, mobile robot, equipment and medium
CN115661341A (en) Real-time dynamic semantic mapping method and system based on multi-sensor fusion
CN114648639A (en) Target vehicle detection method, system and device
Mei et al. A Novel scene matching navigation system for UAVs based on vision/inertial fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant