CN117953157A - Pure vision self-supervision three-dimensional prediction model based on two-dimensional video in automatic driving field


Publication number
CN117953157A
Authority
CN
China
Prior art keywords
image
representation
self
supervision
dimensional
Prior art date
Legal status
Pending
Application number
CN202410121222.2A
Other languages
Chinese (zh)
Inventor
堵炜炜
孙盛婷
郭佳珺
石翼华
郑学锋
于帅
桂敏
Current Assignee
China Unicom Shanghai Industrial Internet Co Ltd
Original Assignee
China Unicom Shanghai Industrial Internet Co Ltd
Priority date
Filing date
Publication date
Application filed by China Unicom Shanghai Industrial Internet Co Ltd filed Critical China Unicom Shanghai Industrial Internet Co Ltd
Priority to CN202410121222.2A
Publication of CN117953157A


Abstract

The invention relates to the technical field of automatic driving, in particular to a pure vision self-supervision three-dimensional prediction model based on two-dimensional video in the automatic driving field, which comprises the following steps: the image is first converted into 3D space to obtain a 3D representation of the scene; constraints are then imposed directly on the 3D representation by treating the converted 3D scene as a signed-distance neural radiance field; the previously selected image and the 2D image of a future frame are then rendered as self-supervision signals for learning the 3D representation. The proposed algorithm model thereby achieves better results on monocular depth estimation and surround-view depth estimation, solves the problem that 3D data requires a large amount of manpower and material resources for manual labeling, predicts the 3D scene from a 2D video sequence, and effectively expands the range of data acquisition and use. The method requires no data labeling, freeing a large amount of manpower and material resources, is better suited to rapid updating and scene deployment, and further promotes the purely visual direction.

Description

Pure vision self-supervision three-dimensional prediction model based on two-dimensional video in automatic driving field
Technical Field
The invention relates to the technical field of automatic driving, in particular to a pure vision self-supervision three-dimensional prediction model based on a two-dimensional video in the automatic driving field.
Background
In automatic driving, in order to prevent an autonomous vehicle from striking surrounding vehicles and pedestrians, it is necessary to predict the future motion trajectories of those vehicles and pedestrians and thereby plan a safe route. Environmental perception is the first link of autonomous driving and the tie between the vehicle and its environment. The overall performance of an autonomous driving system depends largely on the perception system. Currently, there are two main technical routes for environmental perception:
① A vision-dominated multi-sensor fusion scheme, typically represented by Tesla;
② A LiDAR-dominated scheme with other sensors as auxiliaries, typically represented by Google, Baidu and others.
A driver's subjective information acquisition relies on high-precision maps and the human eye; the most competitive offering among current mainstream automakers, Tesla FSD, is a purely visual scheme that does not rely on high-precision maps.
There are three challenges with purely visual solutions:
1. Purely visual object recognition is particularly challenging in the depth direction: humans can use binocular vision, estimate depth from the known size of an object and how much it is scaled on the retina, and also infer depth from cast shadows, planar assumptions, focusing and the like. However, the real environment is very complex, and such cues are difficult for a deep learning model to capture.
2. Deviation between the 2D and 3D coordinate systems: visual perception is planar imaging. Taking the human eye as an example, the retinal plane is analogous to the camera's optical-center coordinate system, and its transformation to the coordinate system used in conventional decision planning is very complex, so visual perception results are difficult to use directly for downstream planning and control. Humans can translate field-of-view information directly into the 3D coordinate system of the surrounding environment, an ability that AI still needs to acquire.
3. Complex environmental information strongly affects the final result: real-time perception is disturbed by occlusion, strong light, interference and other multidimensional factors, which greatly affect real-time decisions. When the model is built, the input factors are not weighted uniformly, and a large amount of raw data from complex scenes is needed to enrich the base database.
To meet these three challenges, the AI algorithm model needs depth estimation, the ability to perceive unknown obstacles, and the ability to use high-precision map priors. Tesla has published solutions to these three challenges: 1. an Occupancy Network, a grid-based perception network in the vehicle coordinate system; 2. a Lane Network, a lane-line perception network that fuses map priors; 3. a visual crowdsourced mapping capability that enriches standard high-precision map information and generates ground truth for training.
Aiming at 3D occupancy prediction research, this patent explores 3D occupancy prediction with a self-supervised method based on the video sequence of a real-time camera. The image is first converted into 3D space to obtain a 3D representation of the scene; constraints are then imposed directly on the 3D representation by treating the converted 3D scene as a signed-distance neural radiance field; finally, the previously selected image and the 2D image of a future frame are rendered as self-supervision signals for learning the 3D representation. The algorithm model proposed in this patent achieves better results on monocular depth estimation and surround-view depth estimation, and solves the problem that 3D data requires a large amount of manpower and material resources for manual labeling. Tests show that the model trains quickly, is accurate and robust, and can be widely applied to vision-centric self-driving tasks.
In summary, the present invention solves the existing problems by designing a pure vision self-supervision three-dimensional prediction model based on two-dimensional video for the automatic driving field.
Disclosure of Invention
The invention aims to provide a pure visual self-supervision three-dimensional prediction model based on two-dimensional video in the field of automatic driving so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a pure vision self-supervision three-dimensional prediction model based on two-dimensional video in the automatic driving field, characterized in that a self-supervision method is used to explore 3D occupancy prediction based on a video sequence captured by a real-time camera, with the following steps:
Step 1, first convert the image into 3D space to acquire a 3D representation of the scene;
Step 2, then impose constraints directly on the 3D representation by treating the converted 3D scene as a signed-distance neural radiance field;
Step 3, finally, render the previously selected image and the 2D image of a future frame as self-supervision signals for learning the 3D representation.
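The three steps above can be read as one self-supervised training loop. A minimal sketch is given below (an assumption for illustration only: PyTorch-style pseudocode in which lift_to_3d, sdf_decoder, render_views and photometric_loss are placeholder names for the modules described in steps 1-3, not the patent's actual components).

```python
# Sketch of one self-supervised training iteration covering steps 1-3.
# Assumption: `model` exposes the placeholder sub-modules named below.
def training_step(images_t, images_other, cam_params, model, optimizer):
    R = model.lift_to_3d(images_t)                          # step 1: image -> 3D representation
    sdf, color = model.sdf_decoder(R)                       # step 2: scene as a signed-distance field
    rendered = model.render_views(sdf, color, cam_params)   # step 3: render selected past/future views
    loss = model.photometric_loss(rendered, images_other)   # 2D images act as self-supervision signals
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```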
As a preferred embodiment of the present invention, converting the image into 3D space to obtain the 3D representation of the scene in step 1 comprises the conversion from image to occupancy, the conversion from occupancy to image, and occupancy-oriented supervision method optimization.
As a preferred aspect of the present invention, the conversion from the image to the occupancy includes:
Converting the image into a BEV (bird's-eye view) or TPV (tri-perspective view) space to obtain a 3D representation of the scene, so that 3D feature interaction can be realized even in a single-camera setting while the ambiguity introduced by multiple cameras is avoided as far as possible;
Adaptively aggregating information from the image features F using a deformable cross-attention (CA) mechanism, wherein a set of learnable 3D tokens Q are used as queries and the corresponding local image features as keys and values; each learnable token represents a pillar region in the BEV or TPV representation;
Furthermore, the correspondence between the 3D tokens and the image features is determined by projection matrices T = {T_n | n = 1, ..., N}, which transform coordinates from the ego-vehicle to the pixel coordinate system, and deformable self-attention (SA), deformable cross-attention (CA) and a feed-forward network (FFN) are interleaved to construct the 3D encoder F, as follows:
F_l = FFN_l(CA_l(SA_l(Q_l); F, T))
where the subscript l denotes the l-th block.
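A minimal sketch of one such encoder block is given below, assuming PyTorch; standard multi-head attention is used as a stand-in for the deformable attention described above, and the projection matrices T (which would restrict each token's keys and values to its projected image region) are omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One block of the 3D encoder: F_l = FFN_l(CA_l(SA_l(Q_l); F, T))."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # SA over 3D tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # CA: tokens query image features
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, q, img_feats):
        # q: (B, N_tokens, C) learnable BEV/TPV tokens; img_feats: (B, N_pixels, C) image features F.
        q = self.norm1(q + self.self_attn(q, q, q)[0])                    # SA_l(Q_l)
        q = self.norm2(q + self.cross_attn(q, img_feats, img_feats)[0])   # CA_l(.; F)
        q = self.norm3(q + self.ffn(q))                                   # FFN_l(.)
        return q

# Usage: stack L blocks to form the encoder F (a small 20x20 BEV grid for illustration).
blocks = nn.ModuleList([EncoderBlock() for _ in range(3)])
tokens = torch.randn(1, 20 * 20, 256)      # learnable BEV queries Q
feats = torch.randn(1, 6 * 100, 256)       # flattened multi-camera image features F
for blk in blocks:
    tokens = blk(tokens, feats)
```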
As a preferred aspect of the present invention, the conversion from occupancy to image includes:
Optimizing the self-supervised vision-based occupancy prediction system using the multi-view consistency inherent in video sequences, employing differentiable volume rendering to synthesize color and depth views, which allows this information to be integrated seamlessly into the rendering pipeline and supervision from multiple viewpoints to be used efficiently;
First converting the 3D representation R into a signed distance function (SDF) field S ∈ R^(H×W×D), which stores the distance from each voxel center to the nearest object surface; the decoder network D is implemented with MLPs, using different MLPs for different heights of the BEV representation, while for the TPV representation a 3D feature volume is constructed by broadcasting and accumulation and then fed through a separate MLP;
For a continuous 3D coordinate p, bicubic interpolation (BI) is used to predict the SDF value s_p, and its occupancy state o_p is determined from the sign of s_p, with the following formula:
o_p = sgn(s_p),  s_p = BI(S, p) = BI(D(R), p)
Preferring the signed distance function (SDF) field over a density field has two advantages: first, the SDF field has a clearer physical meaning than a density field and inherently carries an assumption on the gradient magnitude, which makes regularization and optimization easier;
second, the signed nature of the SDF makes it possible to determine directly whether a point lies inside or outside a surface, so that complex geometries can be distinguished accurately; corresponding to the SDF value, a separate MLP D_c is used as a decoder to obtain the color c from R, with the following formula:
c_p = BI(C, p) = BI(D_c(R), p).
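A minimal sketch of this decoding step is given below, assuming PyTorch; SDFDecoder and query_point are illustrative names, and trilinear grid_sample is used as a stand-in for the bicubic interpolation BI described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDFDecoder(nn.Module):
    """Decode the 3D representation R into an SDF field S and a color field C."""
    def __init__(self, in_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.sdf_head = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.color_head = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, R):
        # R: (B, H, W, D, C) voxel features lifted from the BEV/TPV representation.
        S = self.sdf_head(R).squeeze(-1)     # (B, H, W, D) signed distances to the nearest surface
        C = self.color_head(R)               # (B, H, W, D, 3) colors
        return S, C

def query_point(S, C, p):
    """Interpolate SDF and color at continuous coordinates p in [-1, 1]^3 (grid_sample convention),
    then derive the occupancy state from the sign of the SDF."""
    grid = p.view(p.shape[0], 1, 1, -1, 3)                                     # (B, 1, 1, N, 3)
    s_p = F.grid_sample(S.unsqueeze(1), grid, align_corners=True).view(p.shape[0], -1)
    c_vol = C.permute(0, 4, 1, 2, 3)                                           # (B, 3, H, W, D)
    c_p = F.grid_sample(c_vol, grid, align_corners=True).view(p.shape[0], 3, -1).transpose(1, 2)
    o_p = torch.sign(s_p)                                                      # o_p = sgn(s_p)
    return s_p, c_p, o_p

# Usage on a toy volume.
R = torch.randn(1, 16, 16, 8, 256)
S, C = SDFDecoder()(R)
s_p, c_p, o_p = query_point(S, C, torch.rand(1, 10, 3) * 2 - 1)
```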
As a preferred scheme of the invention, the specific steps of the occupancy-oriented supervision method optimization are as follows:
First, training speed is increased by a distributed training depth prediction module; second, sample likelihood and data augmentation are increased by adding LiDAR data to obtain additional depth supervision; finally, an MVS embedding method is added when training the neural radiance fields (NeRFs) to address the problem that the extra degrees of freedom of volume rendering and the local receptive field of bilinear interpolation may hinder optimization of the depth model.
Compared with the prior art, the invention has the beneficial effects that:
According to the invention, aiming at 3D occupancy prediction research, 3D occupancy prediction is explored with a self-supervised method based on the video sequence of a real-time camera. The image is first converted into 3D space to obtain a 3D representation of the scene; the converted 3D scene is then treated as a signed-distance neural radiance field so that constraints can be imposed directly on the 3D representation; the previously selected image and the 2D image of a future frame are then rendered as self-supervision signals for learning the 3D representation. The proposed algorithm model thus achieves better results on monocular depth estimation and surround-view depth estimation, solves the problem that 3D data requires a large amount of manpower, material resources and manual labeling, predicts the 3D scene from a 2D video sequence, and effectively expands the range of data acquisition and application. The method requires no data labeling, freeing a large amount of manpower and material resources, and is better suited to rapid updating and scene deployment. The proposed algorithm provides a self-supervised vision-based 3D occupancy prediction method for automatic driving, promotes the purely visual direction, and the optimized model further improves sparse-view reconstruction and object-level data generalization, trains and renders faster, and is better suited to specific scene use.
Drawings
Fig. 1 is the overall framework of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
In order that the invention may be readily understood, several embodiments of the invention will be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown, but in which the invention may be embodied in many different forms and is not limited to the embodiments described herein, but instead is provided to provide a more thorough and complete disclosure of the invention.
It will be understood that when an element is referred to as being "mounted" on another element, it can be directly on the other element or intervening elements may also be present, and when an element is referred to as being "connected" to another element, it may be directly connected to the other element or intervening elements may also be present, the terms "vertical", "horizontal", "left", "right" and the like are used herein for the purpose of illustration only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, and the terms used herein in this description of the invention are for the purpose of describing particular embodiments only and are not intended to be limiting of the invention, with the term "and/or" as used herein including any and all combinations of one or more of the associated listed items.
Referring to fig. 1, the present invention provides a technical solution:
a pure vision self-supervision three-dimensional prediction model based on two-dimensional video in the automatic driving field, characterized in that a self-supervision method is used to explore 3D occupancy prediction based on a video sequence captured by a real-time camera, with the following steps:
Step 1, first convert the image into 3D space to acquire a 3D representation of the scene;
Step 2, then impose constraints directly on the 3D representation by treating the converted 3D scene as a signed-distance neural radiance field;
Step 3, finally, render the previously selected image and the 2D image of a future frame as self-supervision signals for learning the 3D representation.
The technical scheme adopted by the invention to solve the technical problems comprises the following parts: 3D occupancy prediction, neural radiance fields, and self-supervised depth prediction.
1. 3D occupancy prediction: detection, tracking, prediction and planning are the core tasks of automatic driving and together form part of the standard autonomous driving pipeline. 3D occupancy prediction, i.e. scene completion, is one of the basic tasks. This task involves the finest-granularity prediction of the occupancy and semantics of the environment. Pioneering works rely on 3D inputs such as depth, occupancy grids, point clouds or a Truncated Signed Distance Function (TSDF). Methods that predict semantic occupancy from image input alone include modality fusion, multi-task learning and end-to-end autonomous driving. While promising, these methods all require 3D ground truth from heavy labeling work for supervision. There are indeed some automated tools, such as BTS and SceneRF, that learn 3D occupancy in a self-supervised manner in a monocular setting, but adaptation to a surround-view setting remains challenging.
2. Neural radiance field: in self-supervised 3D reconstruction, a neural radiance field is a learnable multi-layer perceptron (MLP), typically one per scene, that maps spatial locations to radiance and density values. Volume rendering is usually used to synthesize the acquired information into a new view: the basic principle is to query the MLP at multiple positions along a ray and integrate the radiance weighted by density. NeRFs depend on a large number of images of the same scene with ground-truth poses. To promote sparse-view reconstruction and object-level data generalization, this patent conditions the field on image features, and the optimized model is better suited to real-time scene-level reconstruction.
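A minimal sketch of the ray-query-and-integrate principle described above is given below, assuming PyTorch; the tiny MLP stands in for the per-scene radiance field and is not the network actually used by the patent.

```python
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))   # xyz -> (rgb, density)

def render_ray(origin, direction, n_samples=64, near=0.5, far=40.0):
    """Query the field at samples along one ray and alpha-composite color and depth."""
    t = torch.linspace(near, far, n_samples)                  # sample depths along the ray
    pts = origin + t[:, None] * direction                     # (n_samples, 3) query positions
    out = field(pts)
    rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3])
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])]) # distances between samples
    alpha = 1.0 - torch.exp(-sigma * delta)                   # opacity of each ray segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    w = alpha * trans                                         # radiance-density rendering weights
    color = (w[:, None] * rgb).sum(dim=0)                     # composited pixel color
    depth = (w * t).sum(dim=0)                                # expected depth along the ray
    return color, depth

# Usage: render one ray from the origin looking along +z.
color, depth = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```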
3. Self-supervised depth prediction: training neural radiance fields (NeRFs) takes a long time and is prone to over-fitting because of geometry-appearance coupling. To solve this problem, the present patent increases training speed with a distributed training depth prediction module and increases sample likelihood and data augmentation by adding LiDAR data to obtain additional depth supervision. Finally, this patent adds an MVS embedding method to the NeRFs to address the problem that the extra degrees of freedom of volume rendering and the local receptive field of bilinear interpolation may hinder optimization of the depth model.
The following describes the specific implementation of the invention in detail, in the order of the image-to-occupancy conversion, the occupancy-to-image conversion, and the occupancy-oriented supervision method optimization.
1. Conversion from image to occupancy
In real life, the interaction between a vehicle driver and the surroundings through which the vehicle travels is particularly complex. In an autonomous driving system, the three-dimensional structural accuracy of the reconstructed scene is therefore of great importance. 3D occupancy prediction of the surrounding scene serves as a general proxy task for this reconstruction. Given its fine granularity and comprehensiveness, its goal is to generate a voxelized prediction that encodes the occupancy and semantic information of each voxel:
O ∈ C^(H×W×D)
where H, W, D represent the resolution of the occupancy grid and C the set of predefined categories. Vision-based methods have become a promising approach to three-dimensional occupancy prediction because of their lower cost and strong performance compared with 3D-input-based methods. Vision-based methods typically learn a generic mapping M from RGB images: for the images there is I = {I_n | n = 1, ..., N}, where N is the number of cameras from which pictures are acquired, and the 3D occupancy O is computed as:
O = M(I) = D(F(ε(I_1), ..., ε(I_N)))
In the formula, a two-dimensional backbone network first encodes the N input images into multi-scale image features F = ε(I); a three-dimensional encoder then lifts the two-dimensional features into a three-dimensional representation R = F(F) through an attention mechanism (for a monocular scene, transformations such as Lift-Splat-Shoot (LSS) are adopted); a decoder D then generates the final occupancy prediction O = D(R);
Although strongly supervised vision-based approaches achieve impressive 3D occupancy prediction given only visual input, the models still rely on 3D semantic supervision such as LiDAR point clouds or dense occupancy annotations. Self-supervised counterparts, by contrast, use only the temporal correspondence inherent in video sequences as supervision to predict meaningful occupancy, so a large number of unlabeled driving images can be exploited easily. However, existing self-supervised methods consider only a single monocular camera and decode the occupancy O_front directly from 2D image features by instantiating the 3D encoder F as an identity transform;
O_front = D(F(ε(I_mono))) = D(ε(I_mono))
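A minimal end-to-end sketch of the mapping O = D(F(ε(I))) is given below, assuming PyTorch; the backbone, encoder and decoder are deliberately simplistic placeholders (a strided convolution, a linear lift and per-voxel classification over a small grid) rather than the patent's actual modules.

```python
import torch
import torch.nn as nn

class OccupancyPipeline(nn.Module):
    def __init__(self, feat_dim=256, n_classes=18, grid=(50, 50, 8)):
        super().__init__()
        self.grid = grid
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)  # epsilon: image -> 2D features
        self.encoder = nn.Linear(feat_dim, feat_dim)                       # F: lift to a 3D representation (stand-in)
        self.decoder = nn.Linear(feat_dim, n_classes)                      # D: per-voxel class logits

    def forward(self, images):
        # images: (B, N_cam, 3, H, W) pictures acquired from N cameras.
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))                # F = epsilon(I)
        feats = feats.flatten(2).mean(-1).view(b, n, -1).mean(1)   # pool cameras (placeholder for attention lifting)
        gh, gw, gd = self.grid
        voxels = self.encoder(feats)[:, None, :].expand(b, gh * gw * gd, -1)  # R = F(F)
        logits = self.decoder(voxels).view(b, gh, gw, gd, -1)      # O = D(R)
        return logits.argmax(-1)                                   # voxelized occupancy/semantic prediction

# Usage with six cameras on a small illustrative grid.
occ = OccupancyPipeline()(torch.randn(1, 6, 3, 224, 224))          # -> (1, 50, 50, 8) class indices
```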
Based on this logic, the present method converts the image into a BEV (bird's-eye view) or TPV (tri-perspective view) space to obtain a 3D representation of the scene, so that 3D feature interaction can be realized even in a single-camera setting while the ambiguity introduced by multiple cameras is avoided as far as possible.
Information is adaptively aggregated from the image features F using a deformable cross-attention (CA) mechanism, in which a set of learnable 3D tokens Q are used as queries and the corresponding local image features as keys and values. Each learnable token represents a pillar region in the BEV or TPV representation. Furthermore, the correspondence between the 3D tokens and the image features is determined by projection matrices T = {T_n | n = 1, ..., N}, which transform coordinates from the ego-vehicle to the pixel coordinate system. This patent interleaves deformable self-attention (SA), deformable cross-attention (CA) and a feed-forward network (FFN) to construct the 3D encoder F:
F_l = FFN_l(CA_l(SA_l(Q_l); F, T))
where the subscript l denotes the l-th block.
2. Conversion from occupancy to image
This patent exploits the multi-view consistency inherent in video sequences to optimize the self-supervised vision-based occupancy prediction system, employing differentiable volume rendering to synthesize color and depth views, which allows this information to be integrated seamlessly into the rendering pipeline and supervision from multiple viewpoints to be used efficiently. We first convert the 3D representation R into a signed distance function (SDF) field S ∈ R^(H×W×D), which represents the distance from each voxel center to the nearest object surface. By implementing the decoder network D with MLPs, we use different MLPs for different heights of the BEV representation, while for the TPV representation we construct a 3D feature volume by broadcasting and accumulation [31] and then feed it through a separate MLP. For a continuous 3D coordinate p, we use bicubic interpolation (BI) to predict the SDF value s_p and determine its occupancy state o_p from the sign of s_p:
o_p = sgn(s_p),  s_p = BI(S, p) = BI(D(R), p)
Preferring the signed distance function (SDF) field over a density field has two advantages. First, the SDF field has a clearer physical meaning than a density field and inherently carries an assumption on the gradient magnitude, which makes regularization and optimization easier.
Second, the signed nature of the SDF makes it possible to determine directly whether a point lies inside or outside a surface, so that complex geometries can be distinguished accurately. Corresponding to the SDF value, a separate MLP D_c is used as a decoder to obtain the color c from R:
c_p = BI(C, p) = BI(D_c(R), p)
3. Occupancy-oriented supervision method optimization
This section elaborates the supervision formulation of self-supervised vision-based 3D occupancy prediction for autonomous driving. Its core is the addition of an MVS-embedded deep learning module. Depth supervision enhances the convergence of the neural radiance field, making the learning of exact geometry faster and more effective, especially with sparse views. Self-supervised depth estimation in NeRFs typically relies on a photometric reprojection loss L_rpj, which maximizes the similarity between the target image I_t and pixels sampled from the source image I_s by optimizing the depth prediction D(I_t; θ).
In the formula, θ, x, x' and π denote, respectively, the learnable parameters, a random pixel on the target image, the warped pixel on the source image, and the projection matrix from the target to the source image coordinates. I⟨x⟩ denotes bilinear interpolation of the corresponding 2D tensor at x, and x' = proj(x, d_x, π) maps a pixel x to the other image according to its depth d_x and the projection matrix π. Bilinear interpolation adversely affects the optimization properties of L_rpj: the color differences of the four neighbouring grid pixels are the only reference information for optimizing θ, which is easily misled by improper initial values or low-texture regions.
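Since the loss itself is only described in prose here, the sketch below follows the standard photometric reprojection loss of self-supervised depth estimation, L_rpj = |I_t(x) - I_s⟨proj(x, d_x, π)⟩|, as an assumption of the intended form; it warps the source image into the target view with the predicted depth and compares the two (PyTorch, illustrative parameter names).

```python
import torch
import torch.nn.functional as F

def reprojection_loss(I_t, I_s, depth, K, K_inv, T_ts):
    """Warp source image I_s into the target view with predicted depth and compare to I_t.
    I_t, I_s: (B, 3, H, W); depth: (B, 1, H, W); K, K_inv: (B, 3, 3); T_ts: (B, 4, 4)."""
    b, _, h, w = I_t.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(b, -1, -1)  # pixels x
    cam = K_inv @ pix * depth.view(b, 1, -1)                       # back-project with depth d_x
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)       # homogeneous 3D points
    src = (T_ts @ cam_h)[:, :3]                                    # transform into the source frame
    src_pix = K @ src
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)      # x' = proj(x, d_x, pi)
    gx = 2.0 * src_pix[:, 0] / (w - 1) - 1.0                       # normalize to [-1, 1] for sampling
    gy = 2.0 * src_pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(b, h, w, 2)
    I_warp = F.grid_sample(I_s, grid, align_corners=True)          # bilinear interpolation I_s<x'>
    return (I_t - I_warp).abs().mean()                             # L1 photometric error
```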
Summarizing: the framework used in this patent is for 3D occupancy prediction on a self-supervising visual basis. The framework uses an image backbone and a 3D encoder to generate a 3D representation, equivalent to BEVFormer or TPVFormer. To render a new view, the framework applies a lightweight multi-layer perceptron (MLP) to predict SDF (symbolic distance function) values, colors and semantic vectors. The framework then performs volume rendering to synthesize color, depth, and semantic views. Finally, a simple threshold is used to predict the occupied volume.
The training speed is increased by the distributed training depth prediction module; sample likelihood and data augmentation are increased by adding LiDAR data to obtain additional depth supervision; finally, an MVS embedding method is added when training the neural radiance fields (NeRFs) to address the problem that the extra degrees of freedom of volume rendering and the local receptive field of bilinear interpolation may hinder optimization of the depth model.
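The exact form of the LiDAR depth supervision is not specified here, so the sketch below assumes a common choice: an L1 loss between the rendered depth and the depth of projected LiDAR points at pixels with a valid return.

```python
import torch

def lidar_depth_loss(rendered_depth, lidar_depth, valid_mask):
    """rendered_depth, lidar_depth: (B, H, W); valid_mask: (B, H, W) marks pixels hit by a LiDAR return."""
    diff = (rendered_depth - lidar_depth).abs()
    return diff[valid_mask].mean() if valid_mask.any() else rendered_depth.new_zeros(())

# Usage: total_loss = photometric_loss + lambda_depth * lidar_depth_loss(d_hat, d_lidar, mask)
```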
The overall framework diagram is shown in Fig. 1.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A pure vision self-supervision three-dimensional prediction model based on two-dimensional video in the automatic driving field, characterized in that a self-supervision method is used to explore 3D occupancy prediction based on a video sequence captured by a real-time camera, with the following steps:
Step 1, first convert the image into 3D space to acquire a 3D representation of the scene;
Step 2, then impose constraints directly on the 3D representation by treating the converted 3D scene as a signed-distance neural radiance field;
Step 3, finally, render the previously selected image and the 2D image of a future frame as self-supervision signals for learning the 3D representation.
2. The two-dimensional video-based purely visual self-supervising three-dimensional prediction model for the autopilot field according to claim 1, wherein: converting the image into 3D space to obtain a 3D representation of the scene in said step 1 comprises the conversion from image to occupancy, the conversion from occupancy to image, and occupancy-oriented supervision method optimization.
3. A two-dimensional video-based purely visual self-supervising three-dimensional prediction model of the autopilot domain according to claim 2, characterized in that: the conversion from image to occupancy comprises:
Converting the image into a BEV (bird's-eye view) or TPV (tri-perspective view) space to obtain a 3D representation of the scene, so that 3D feature interaction can be realized even in a single-camera setting while the ambiguity introduced by multiple cameras is avoided as far as possible;
Adaptively aggregating information from the image features F using a deformable cross-attention (CA) mechanism, wherein a set of learnable 3D tokens Q are used as queries and the corresponding local image features as keys and values; each learnable token represents a pillar region in the BEV or TPV representation;
Further, the correspondence between the 3D tokens and the image features is determined by projection matrices T = {T_n | n = 1, ..., N}, which transform coordinates from the ego-vehicle to the pixel coordinate system, and deformable self-attention (SA), deformable cross-attention (CA) and a feed-forward network (FFN) are interleaved to construct the 3D encoder F, as follows:
F_l = FFN_l(CA_l(SA_l(Q_l); F, T))
where the subscript l denotes the l-th block.
4. A two-dimensional video-based purely visual self-supervising three-dimensional prediction model of the autopilot domain according to claim 3, wherein: the converting from occupancy to image comprises:
Optimizing the self-supervised vision-based occupancy prediction system using the multi-view consistency inherent in video sequences, employing differentiable volume rendering to synthesize color and depth views, which allows this information to be integrated seamlessly into the rendering pipeline and supervision from multiple viewpoints to be used efficiently;
First converting the 3D representation R into a signed distance function (SDF) field S ∈ R^(H×W×D), which stores the distance from each voxel center to the nearest object surface; the decoder network D is implemented with MLPs, using different MLPs for different heights of the BEV representation, while for the TPV representation a 3D feature volume is constructed by broadcasting and accumulation and then fed through a separate MLP;
For a continuous 3D coordinate p, bicubic interpolation (BI) is used to predict the SDF value s_p, and its occupancy state o_p is determined from the sign of s_p, with the following formula:
o_p = sgn(s_p),  s_p = BI(S, p) = BI(D(R), p)
Preferring the signed distance function (SDF) field over a density field has two advantages: first, the SDF field has a clearer physical meaning than a density field and inherently carries an assumption on the gradient magnitude, which makes regularization and optimization easier;
second, the signed nature of the SDF makes it possible to determine directly whether a point lies inside or outside a surface, so that complex geometries can be distinguished accurately; corresponding to the SDF value, a separate MLP D_c is used as a decoder to obtain the color c from R, with the following formula:
c_p = BI(C, p) = BI(D_c(R), p).
5. A two-dimensional video-based purely visual self-supervising three-dimensional prediction model for the autopilot domain as claimed in claim 4, wherein: the specific steps of the occupation-oriented supervision method optimization are as follows:
First, training speed is increased by a distributed training depth prediction module; second, sample likelihood and data augmentation are increased by adding LiDAR data to obtain additional depth supervision; finally, an MVS embedding method is added when training the neural radiance fields (NeRFs) to address the problem that the extra degrees of freedom of volume rendering and the local receptive field of bilinear interpolation may hinder optimization of the depth model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410121222.2A CN117953157A (en) 2024-01-29 2024-01-29 Pure vision self-supervision three-dimensional prediction model based on two-dimensional video in automatic driving field


Publications (1)

Publication Number Publication Date
CN117953157A (en) 2024-04-30

Family

ID=90803241




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination