CN112800822A - 3D automatic tagging with structural and physical constraints - Google Patents

3D automatic tagging with structural and physical constraints

Info

Publication number
CN112800822A
CN112800822A (Application CN202011267490.3A)
Authority
CN
China
Prior art keywords
vehicle
scene
module
trajectory
program code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011267490.3A
Other languages
Chinese (zh)
Inventor
W. Kehl
S. Zakharov
A. D. Gaidon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toyota Research Institute Inc
Original Assignee
Toyota Research Institute Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/026,057 (US 11,482,014 B2)
Application filed by Toyota Research Institute Inc
Publication of CN112800822A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Abstract

The application relates to 3D automatic labeling with structural and physical constraints. A method for 3D automatic labeling of objects with predetermined structural and physical constraints includes identifying initial object seeds for all frames of a given sequence of frames of a scene. The method also includes refining each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene. The method also includes linking the automatically tagged 3D object vehicles into a trajectory over time while complying with the predetermined structural and physical constraints.

Description

3D automatic tagging with structural and physical constraints
Cross Reference to Related Applications
This application claims the benefit of U.S. provisional patent application No. 62/935,246, entitled "AUTOLABELING 3D OBJECTS WITH DIFFERENTIABLE RENDERING OF SDF SHAPE PRIORS," filed on November 14, 2019, the disclosure of which is expressly incorporated by reference in its entirety.
Technical Field
Certain aspects of the present disclosure relate generally to machine learning and, more particularly, to 3D automatic labeling of objects subject to structural and physical constraints.
Background
Autonomous agents (e.g., vehicles, robots, etc.) rely on machine vision for sensing the surrounding environment by analyzing regions of interest in images of the surrounding environment. Although scientists have spent decades studying the human visual system, solutions for achieving equivalent machine vision remain elusive. Achieving equivalent machine vision is a goal to enable truly autonomous agents. Machine vision differs from the field of digital image processing because it is desirable to recover the three-dimensional (3D) structure of the world from an image and use the 3D structure to fully understand a scene. That is, machine vision efforts provide a high-level understanding of the surrounding environment, as performed by the human vision system.
In operation, the autonomous agent may rely on a trained Deep Neural Network (DNN) to identify objects within a region of interest in an image of a surrounding scene of the autonomous agent. For example, the DNN may be trained to recognize and track objects captured by one or more sensors, such as light detection and ranging (LIDAR) sensors, sonar sensors, red-green-blue (RGB) cameras, RGB-depth (RGB-D) cameras, and so forth. In particular, the DNN may be trained to understand a scene from video input based on annotations of cars within the scene. Unfortunately, annotating videos is a challenging task that involves a deep understanding of the visual scene.
Disclosure of Invention
A method for 3D automatic labeling of objects with predetermined structural and physical constraints includes identifying initial object seeds for all frames of a given sequence of frames of a scene. The method further includes refining each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene. The method also includes linking the automatically tagged 3D object vehicles into a trajectory over time while complying with the predetermined structural and physical constraints.
A non-transitory computer readable medium includes program code recorded thereon for 3D automatic marking of objects with predetermined structural and physical constraints, wherein the program code is executed by a processor. The non-transitory computer-readable medium includes program code to identify an initial object seed from all frames of a given sequence of frames of a scene. The non-transitory computer-readable medium further includes program code to refine each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagging 3D object vehicles within the scene. The non-transitory computer readable medium further includes program code for linking the automatically tagged 3D object vehicle into a trajectory over time while complying with the predetermined structural and physical constraints.
A system for 3D automatic labeling of objects with predetermined structural and physical constraints includes an object seed detection module. The object seed detection module is trained to identify initial object seeds for all frames from a given sequence of frames of a scene. The system also includes an object seed refinement module. The object seed refinement module is trained to refine each of the initial object seeds over the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene. The system also includes a 3D auto-tagging module. The 3D autotagging module is trained to link the autotagged 3D object vehicle into a trajectory over time while complying with the predetermined structural and physical constraints.
This has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description that follows may be better understood. Additional features and advantages of the disclosure will be described below. It should be appreciated by those skilled in the art that this disclosure may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the teachings of the disclosure as set forth in the appended claims. The novel features which are believed to be characteristic of the disclosure, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.
Drawings
The features, nature, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify correspondingly throughout.
FIG. 1 illustrates an example implementation of a system for 3D automatic labeling with structural and physical constraints designed using a system on a chip (SOC), according to aspects of the present disclosure.
FIG. 2 is a block diagram illustrating a software architecture that may modularize functionality for 3D automatic tagging with structural and physical constraints in accordance with aspects of the present disclosure.
Fig. 3 is a diagram illustrating an example of a hardware implementation for a 3D automatic tagging system that utilizes structural constraints and physical constraints, according to aspects of the present disclosure.
Fig. 4 is a block diagram of a 3D auto-labeling pipeline for the 3D automatic labeling system of fig. 3, according to aspects of the present disclosure.
Figures 5A-5C illustrate surface projection of an object using a signed distance field (SDF), according to aspects of the disclosure.
Fig. 6 is a diagram of an initialization portion of a 3D auto-labeling pipeline for the 3D automatic labeling system of fig. 3, according to aspects of the present disclosure.
Fig. 7 shows an example of 3D labels output by a 3D auto-labeling pipeline for the 3D automatic labeling system of fig. 3, according to an illustrative configuration of the present disclosure.
FIG. 8 illustrates a system architecture for a 3D auto-labeling pipeline for the 3D automatic labeling system of FIG. 3, according to aspects of the present disclosure.
FIG. 9 is a flow diagram illustrating a method for 3D automatic tagging of objects with predetermined structural and physical constraints, according to aspects of the present disclosure.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. It will be apparent, however, to one skilled in the art that these concepts may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring such concepts.
Based on the teachings, one skilled in the art should appreciate that the scope of the present disclosure is intended to encompass any aspect of the present disclosure, whether implemented independently of or in combination with any other aspect of the present disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth. In addition, the scope of the present disclosure is intended to cover such an apparatus or method practiced using other structure, functionality, or structure and functionality in addition to or other than the illustrated aspects of the present disclosure. It should be understood that any aspect of the disclosed disclosure may be embodied by one or more elements of a claim.
Although specific aspects are described herein, many variations and permutations of these aspects fall within the scope of the disclosure. Although some benefits and advantages of the preferred aspects are mentioned, the scope of the present disclosure is not intended to be limited to a particular benefit, use, or purpose. Rather, aspects of the present disclosure are intended to be broadly applicable to different technologies, system configurations, networks, and protocols, some of which are illustrated by way of example in the figures and the following description of the preferred aspects. The detailed description and drawings are merely illustrative of the disclosure rather than limiting, the scope of the disclosure being defined by the appended claims and equivalents thereof.
Deep learning typically requires large labeled datasets to achieve state-of-the-art performance. In the context of three-dimensional (3D) object detection for own-vehicle and other robotic applications, 3D cuboids are the annotation of choice because they allow for proper reasoning over all nine degrees of freedom (three each for the position, orientation, and metric extent of each instance). Unfortunately, acquiring enough labels to train a 3D object detector can be laborious and expensive, as it relies primarily on a large number of manual annotators. Conventional methods of scaling up annotation pipelines include better tooling, active learning, or a combination thereof. These methods generally rely on heuristics and involve humans in the loop to correct semi-automatic labels, especially for difficult edge cases.
In particular, conventional approaches in the field of deep learning rely strongly on supervised training. While supervised training can directly learn mappings from input to output, supervision requires large annotated datasets to accomplish the task. Unfortunately, acquiring these annotated datasets is laborious and expensive. In addition, the cost of annotation varies greatly with annotation type, as 2D bounding boxes are cheaper and faster to annotate than, for example, segmentation masks or cuboids.
Aspects of the present disclosure provide improvements over conventional annotation methods by using several different priors to automatically label objects (e.g., vehicles and non-vehicles). These priors include things such as the vehicle being on the ground, the vehicle not being able to penetrate other vehicles, the vehicle having four wheels, etc. This aspect of the disclosure effectively uses shape priors to automatically label objects. The shape prior includes certain constraints, including that the vehicle is rigid, that the vehicle should be located on the ground, and that the vehicle cannot penetrate another vehicle.
One aspect of the present disclosure provides an improved three-dimensional (3D) annotation and object detection system based on differentiable rendering of shape priors. In this aspect of the disclosure, differentiable rendering of shape priors enables recovery of the scale, pose, and shape of objects (e.g., vehicles in the case of an autonomous driving system) in the wild. In one configuration, the 3D automatic labeling pipeline takes 2D detections (e.g., bounding boxes or instance masks) and a sparse LIDAR point cloud as inputs. LIDAR point clouds are now ubiquitous in 3D robotic applications. In practice, the object detections themselves may even be generated by an off-the-shelf 2D object detector. These configurations demonstrate that differentiable visual alignment (e.g., also referred to as "analysis-by-synthesis" or "render-and-compare") is an effective method for automatic labeling. That is, differentiable visual alignment provides an effective method for applications such as autonomous driving and other 3D robotic applications, including humanoid robots.
One configuration of the 3D annotation and object detection system includes a continuously traversable Coordinate Shape Space (CSS), which combines a signed distance field (SDF) shape space (e.g., a "DeepSDF" shape space framework) with Normalized Object Coordinates (NOCS). This combination puts object shapes into correspondence, which facilitates deformable shape matching. The 3D annotation and object detection system employs a differentiable SDF renderer for comparative scene analysis over the defined shape space. Further, the 3D annotation and object detection system includes a learning curriculum for the automatic labeling pipeline that begins with synthetic data (e.g., in an autonomous driving environment, Computer Aided Design (CAD) models of vehicles and driving scenes). In one configuration, the auto-labeling pipeline mixes the synthetic data and real data in subsequent training cycles and gradually increases the difficulty level of the input data throughout the training cycles.
In some configurations, the auto-labeling pipeline begins with a CSS neural network trained to predict a 2D NOCS map and a shape vector from image patches. To bootstrap this initial version, the CSS network is trained using synthetic data, for which ground-truth NOCS and shape vector targets are readily available, and augmentations are also applied to minimize the domain gap (e.g., sim2real). In these configurations, the auto-labeling loop includes (1) locating instances with 2D annotations, (2) running the CSS network on the extracted input image patches, (3) re-projecting the NOCS into the scene via LIDAR, (4) recovering object models from the CSS, (5) computing approximate poses via 3D-3D correspondences, and (6) running projective and geometric alignment to refine the initial estimates.
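As a structural illustration only, this loop can be sketched as follows. The sketch below is a minimal Python outline: each stage is passed in as a callable, and every name is a hypothetical placeholder rather than an API defined by this disclosure.

```python
from typing import Any, Callable, List, Sequence

def autolabel_frame(
    image: Any,
    lidar_points: Any,
    boxes_2d: Sequence[Any],
    crop_patch: Callable,       # step (1): extract the image patch for a 2D annotation
    css_network: Callable,      # step (2): patch -> (NOCS map, shape vector)
    lift_nocs: Callable,        # step (3): re-project the NOCS map onto frustum LIDAR points
    decode_shape: Callable,     # step (4): shape vector -> object surface points from the CSS
    pose_from_3d3d: Callable,   # step (5): 3D-3D correspondences -> approximate pose and scale
    refine: Callable,           # step (6): projective and geometric alignment refinement
) -> List[Any]:
    """One pass of the auto-labeling loop over the 2D-annotated objects of a frame."""
    labels = []
    for box in boxes_2d:
        patch = crop_patch(image, box)
        nocs_map, shape_vec = css_network(patch)
        frustum_pts, pts_nocs = lift_nocs(nocs_map, lidar_points, box)
        model_pts = decode_shape(shape_vec)
        pose, scale = pose_from_3d3d(model_pts, frustum_pts, pts_nocs)
        labels.append(refine(pose, scale, shape_vec, nocs_map, frustum_pts))
    return labels
```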
After processing the images in the training set, the recovered auto-labels are collected and the CSS prediction network is retrained to gradually expand to the new domain. The process is then repeated to obtain better and better CSS predictions and, in turn, better automatic labels for the objects (e.g., 3D cuboid bounding boxes). To avoid drift due to noisy automatic labels, a training curriculum is employed that first focuses on easy samples and increases the difficulty level with each pass through the training cycle. In aspects of the present disclosure, annotation (e.g., automatic labeling) of vehicles is obtained essentially for free by utilizing strong priors such as car shape, metric size, road topology, maps, and other similar shape prior information.
The present disclosure extends to using shape priors to perform automatic labeling. As described above, "shape prior" is information on the shape of an object that is known in advance. For example, the shape prior information may identify that the vehicle should have a rigid shape. The shape prior information can be extended to improve the accuracy of the automatic labeling. For example, the shape prior information may include information such as that the vehicle should have four or more wheels, that the vehicle should be located on the ground, and that the vehicles should not penetrate each other.
In one aspect of the disclosure, a method for automatically labeling 3D objects includes identifying, by an object detector, initial object seeds for all frames of a given sequence of frames of a scene using 2D/3D data. An object seed is a candidate object that may be a vehicle, but may also be a non-vehicle object. Once the seeds are identified, an optimization process refines each initial seed over the 2D/3D information while complying with map and road constraints. This part of the method involves the "shape prior." In this part of the process, additional shape prior information is used, including that the vehicle should have wheels, must be on the ground, and must not penetrate other vehicles. Another optimization links the 3D objects over time while following road and physical boundaries, creating a smooth trajectory.
FIG. 1 illustrates an example implementation of the aforementioned systems and methods for 3D automatic labeling with structural and physical constraints using a system on a chip (SOC)100 of an own vehicle 150. According to certain aspects of the present disclosure, SOC 100 may include a single processor or a multi-core processor (e.g., a central processing unit). Variables (e.g., neural signals and synaptic weights), system parameters associated with a computing device (e.g., a neural network with weights), delays, frequency bin (bin) information, and task information may be stored in a memory block. The memory blocks may be associated with a Neural Processing Unit (NPU)108, a CPU 102, a Graphics Processing Unit (GPU)104, a Digital Signal Processor (DSP)106, a dedicated memory block 118, or may be distributed across multiple blocks. Instructions executed at a processor (e.g., CPU 102) may be loaded from a program memory associated with CPU 102, or may be loaded from a dedicated memory block 118.
The SOC 100 may also include additional processing blocks configured to perform specific functions, such as the GPU 104, the DSP 106, and a connectivity block 110, which may include fourth generation long term evolution (4G LTE) connectivity, unlicensed Wi-Fi connectivity, USB connectivity, Bluetooth® connectivity, and the like. Further, the multimedia processor 112 in combination with the display 130 may classify and categorize the pose of the object in the region of interest, for example, according to the display 130 illustrating the view of the vehicle. In some aspects, the NPU 108 may be implemented in the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, an Image Signal Processor (ISP) 116, and/or a navigation 120, which may include, for example, a global positioning system.
The SOC 100 may be based on an Advanced RISC Machine (ARM) instruction set or the like. In another aspect of the present disclosure, the SOC 100 may be a server computer in communication with the own vehicle 150. In such an arrangement, the own vehicle 150 may include a processor and other features of the SOC 100. In this aspect of the disclosure, the instructions loaded into the processor (e.g., CPU 102) or NPU 108 of the own vehicle 150 may include code for 3D automatic labeling, with structural and physical constraints, of objects (e.g., vehicle and non-vehicle objects) within the images captured by the sensor processor 114. The instructions loaded into the processor (e.g., CPU 102) may also include code for planning and controlling (e.g., of the own vehicle) in response to linking the 3D objects over time, creating a smooth trajectory while conforming to road and physical boundaries, from the images captured by the sensor processor 114.
The instructions loaded into the processor (e.g., CPU 102) may also include code for identifying an initial object seed from all frames of a given sequence of frames of a scene. The instructions loaded into the processor (e.g., CPU 102) may also include code for refining each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene. The instructions loaded into the processor (e.g., CPU 102) may also include code for linking the automatically tagged 3D object vehicle into the trajectory over time while complying with the predetermined structural and physical constraints.
Fig. 2 is a block diagram illustrating a software architecture 200 to be used for planning and controlling the functional modularity of an own vehicle using a 3D automatic tagging system with structural and physical constraints, according to aspects of the present disclosure. Using the architecture, the controller application 202 may be designed such that it may cause various processing blocks of the SOC 220 (e.g., the CPU 222, the DSP 224, the GPU 226, and/or the NPU 228) to perform support calculations during runtime operation of the controller application 202.
The controller application 202 may be configured to invoke a function defined in the user space 204 that may analyze a scene in video captured by a monocular camera of the own vehicle, for example, based on 3D automatic labeling of objects in the scene. In aspects of the present disclosure, 3D automatic labeling of objects (e.g., vehicular and non-vehicular objects) of a video is improved by using structural and physical constraints as shape priors. The controller application 202 may make requests to compile program code associated with libraries defined in a 3D automatic labeling Application Programming Interface (API)206 to label vehicles within a scene of video captured by the monocular camera of the own vehicle using structural constraints and physical constraints as shape priors.
The runtime engine 208, which may be compiled code of a runtime framework, may be further accessed by the controller application 202. The controller application 202 may cause the runtime engine 208 to perform monocular (single camera) 3D detection and automatic tagging, for example. When an object is detected within a predetermined distance of the own vehicle, the runtime engine 208 may in turn send a signal to an operating system 210, such as a Linux kernel 212, running on the SOC 220. The operating system 210, in turn, may cause computations to be performed on the CPU 222, the DSP 224, the GPU 226, the NPU 228, or some combination thereof. The CPU 222 is directly accessible by the operating system 210 and may access other processing blocks through drivers, such as the driver 214 for the DSP 224, for the GPU 226, or for the NPU 228. In the illustrated example, the deep neural network may be configured to run on a combination of processing blocks, such as the CPU 222 and GPU 226, or if present, the NPU 228.
Fig. 3 is a diagram illustrating an example of a hardware implementation of a 3D automatic labeling system 300 using structural and physical constraints as shape priors, according to aspects of the present disclosure. The 3D automatic tagging system 300 may be configured to understand a scene so as to be able to plan and control the own vehicle in response to images from video captured by a camera during operation of the automobile 350. The 3D automatic marking system 300 may be a component of a vehicle, robotic device, or other device. For example, as shown in FIG. 3, the 3D automated marking system 300 is a component of an automobile 350. Aspects of the present disclosure are not limited to the 3D automated marking system 300 as a component of the automobile 350, as other devices, such as buses, motorcycles, or other similar vehicles, are also contemplated to use the 3D automated marking system 300. The automobile 350 may be autonomous or semi-autonomous.
The 3D automatic tagging system 300 may be implemented with an interconnect architecture, represented generally by interconnect 308. Interconnect 308 may include any number of point-to-point interconnects, buses, and/or bridges depending on the specific application of 3D automated marking system 300 and the overall design constraints of automobile 350. The interconnect 308 links together various circuits, including one or more processors and/or hardware modules represented by the sensor module 302, the vehicle awareness module 310, the processor 320, the computer-readable medium 322, the communication module 324, the motion module 326, the positioning module 328, the planner module 330, and the controller module 340. Interconnect 308 may also link various other circuits such as timing sources, peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further.
The 3D automated tagging system 300 includes a transceiver 332 coupled to the sensor module 302, the vehicle perception module 310, the processor 320, the computer readable medium 322, the communication module 324, the motion module 326, the positioning module 328, the planner module 330, and the controller module 340. The transceiver 332 is coupled to an antenna 334. The transceiver 332 communicates with various other devices over a transmission medium. For example, the transceiver 332 may receive commands via transmissions from a user or a remote device. As discussed herein, the user may be at a location remote from the location of the automobile 350. As another example, transceiver 332 may transmit automatically tagged 3D objects within the video and/or the planned action from vehicle awareness module 310 to a server (not shown).
The 3D automatic tagging system 300 includes a processor 320 coupled to a computer-readable medium 322. In accordance with the present disclosure, the processor 320 performs processing, including executing software stored on the computer-readable medium 322 to provide functionality. When executed by processor 320, the software causes 3D automated tagging system 300 to perform various functions described for the own vehicle to perceive an automated tagging scene within video captured by a single camera or any module (e.g., 302, 310, 324, 326, 328, 330, and/or 340) of the own vehicle (e.g., automobile 350). The computer-readable medium 322 may also be used for storing data that is manipulated by the processor 320 when executing software.
The sensor module 302 may obtain images via different sensors, such as a first sensor 304 and a second sensor 306. The first sensor 304 may be a visual sensor (e.g., a stereo camera or a red-green-blue (RGB) camera) for capturing 2D RGB images. The second sensor 306 may be a ranging sensor, such as a light detection and ranging (LIDAR) sensor or a radio detection and ranging (RADAR) sensor. Of course, aspects of the present disclosure are not limited to the foregoing sensors, as other types of sensors (e.g., thermal, sonar, and/or laser) are also contemplated as either of the first sensor 304 or the second sensor 306.
The images of the first sensor 304 and/or the second sensor 306 may be processed by the processor 320, the sensor module 302, the vehicle perception module 310, the communication module 324, the motion module 326, the positioning module 328, and the controller module 340. In conjunction with the computer-readable medium 322, the images from the first sensor 304 and/or the second sensor 306 are processed to implement the functionality described herein. In one configuration, the detected 3D object information captured by first sensor 304 and/or second sensor 306 may be transmitted via transceiver 332. The first sensor 304 and the second sensor 306 may be coupled to the automobile 350 or may be in communication with the automobile 350.
Understanding a scene from video input based on automatic labeling of 3D objects within the scene is an important perceptual task in the field of autonomous driving, such as the automobile 350. The present disclosure extends the use of shape priors to perform automatic labeling. As described above, "shape prior" is information on the shape of an object that is known in advance. For example, the shape prior information may identify that the vehicle should have a rigid shape. The shape prior information can be extended to improve the accuracy of the automatic labeling. For example, the shape prior information may include information such as that the vehicle should have four or more wheels, that the vehicle should be located on the ground, and that the vehicles should not penetrate each other. In aspects of the present disclosure, annotation (e.g., automatic tagging) of vehicles is performed without cost by utilizing strong priors such as car shape, dimensions, road topology, maps, and other similar structural and physical shape prior constraints.
The location module 328 may determine the location of the automobile 350. For example, the location module 328 may use the Global Positioning System (GPS) to determine the location of the automobile 350. The positioning module 328 may implement a Dedicated Short Range Communication (DSRC) compatible GPS unit. The DSRC-compliant GPS unit includes hardware and software to enable the automobile 350 and/or the positioning module 328 to comply with one or more of the following DSRC standards, including any derivatives or branches thereof: EN 12253:2004 dedicated short-range communication-the physical layer using 5.9GHz microwave (reviewed); EN 12795:2002 Dedicated Short Range Communications (DSRC) -DSRC data link layer: media access and logical link control (overview); EN 12834:2002 proprietary short-range communication-application layer (overview); EN 13372:2004 Dedicated Short Range Communications (DSRC) -DSRC profiles for RTTT applications (reviewed); and EN ISO 14906:2004 electronic toll-applications interface.
The DSRC-compatible GPS unit within the positioning module 328 is operable to provide GPS data describing the location of the automobile 350 with spatial level accuracy in order to accurately guide the automobile 350 to a desired location. For example, the automobile 350 travels to a predetermined location and a portion of the sensor data is desired. Spatial level accuracy means that the position of the car 350 is described by GPS data sufficient to confirm the position of the car 350 parking space. That is, the location of the automobile 350 is accurately determined with spatial level accuracy based on the GPS data from the automobile 350.
The communication module 324 may facilitate communication via the transceiver 332. For example, the communication module 324 may be configured to provide communication capabilities via different wireless protocols, such as Wi-Fi, Long Term Evolution (LTE), 3G, and so forth. The communication module 324 may also communicate with other components of the automobile 350 that are not modules of the 3D automated marking system 300. Transceiver 332 may be a communication channel through network access point 360. The communication channel may include DSRC, LTE-D2D, mmWave, Wi-Fi (infrastructure mode), Wi-Fi (ad hoc mode), visible light communication, TV white space communication, satellite communication, full duplex wireless communication, or any other wireless communication protocol such as those mentioned herein.
In some configurations, the network access point 360 includes a bluetooth communication network or a cellular communication network for sending and receiving data, including via Short Message Service (SMS), Multimedia Message Service (MMS), hypertext transfer protocol (HTTP), direct data connection, Wireless Application Protocol (WAP), email, DSRC, full-duplex wireless communication, mmWave, WiFi (infrastructure mode), Wi-Fi (ad hoc mode), visible light communication, TV white space communication, and satellite communication. The network access point 360 may also include a mobile data network, which may include a 3G, 4G, 5G, LTE-V2X, LTE-D2D, VoLTE, or any other mobile data network or combination of mobile data networks. Further, network access point 360 may include one or more IEEE 802.11 wireless networks.
The 3D automatic marking system 300 also includes a planner module 330 for planning the selected trajectory to perform a route/action (e.g., avoid a collision) of the car 350 and a controller module 340 for controlling the motion of the car 350. The controller module 340 may perform the selected action via the motion module 326 for autonomously operating the automobile 350 along, for example, a selected route. In one configuration, the planner module 330 and the controller module 340 may collectively override the user input when the user input is expected (e.g., predicted) to cause a collision according to the level of autonomy of the automobile 350. These modules may be software modules running in the processor 320, resident/stored in the computer readable medium 322, and/or hardware modules coupled to the processor 320, or some combination thereof.
The National Highway Traffic Safety Administration (NHTSA) has defined different "levels" (e.g., level 0, level 1, level 2, level 3, level 4, and level 5) of autonomous vehicles. For example, if an autonomous vehicle has a higher number of levels than another autonomous vehicle (e.g., level 3 is a higher number of levels than level 2 or 1), then an autonomous vehicle with a higher number of levels provides a greater combination and number of autonomous features relative to a vehicle with a lower number of levels. These different levels of autonomous vehicles are briefly described below.
Level 0: in a class 0 vehicle, an Advanced Driver Assistance System (ADAS) feature set installed in the vehicle does not provide vehicle control, but may alert the driver of the vehicle. The level 0 vehicle is not an autonomous or semi-autonomous vehicle.
Level 1: in a class 1 vehicle, the driver is ready to take driving control of the autonomous vehicle at any time. The set of ADAS features installed in an autonomous vehicle may provide autonomous features, such as: adaptive Cruise Control (ACC); parking assist with automatic steering; and Lane Keeping Assist (LKA) type II.
Level 2: in a level 2 vehicle, the driver is obligated to detect objects and events in the road environment and respond if the ADAS feature set installed in the autonomous vehicle fails to respond correctly (based on the driver's subjective judgment). The ADAS feature set installed in an autonomous vehicle may include acceleration, braking, and steering. In a class 2 vehicle, the ADAS feature set installed in the autonomous vehicle may be deactivated immediately when the driver takes over.
Level 3: in class 3ADAS vehicles, within known limited environments (e.g., motorways), drivers may safely divert their attention from driving tasks, but still must prepare to control autonomous vehicles when needed.
Level 4: in a class 4 vehicle, the ADAS feature set installed in the autonomous vehicle may control the autonomous vehicle in all but a few environments (e.g., severe weather). The driver of a class 4 vehicle enables the automated system only when it is safe (which includes the ADAS feature set installed in the vehicle). When an automated level 4 vehicle is enabled, driver attention is not required to have the autonomous vehicle operate safely and consistently within accepted norms.
Level 5: in a class 5 vehicle, no human intervention is involved other than setting the destination and starting the system. The automated system can travel to any legal travel location and make its own decisions (which may vary based on the jurisdiction in which the vehicle is located).
The Highly Autonomous Vehicle (HAV) is a level 3 or higher autonomous vehicle. Thus, in some configurations, the automobile 350 is one of: a level 0 non-autonomous vehicle; a level 1 autonomous vehicle; a level 2 autonomous vehicle; a level 3 autonomous vehicle; a level 4 autonomous vehicle; a level 5 autonomous vehicle; and HAV.
The vehicle awareness module 310 may be in communication with the sensor module 302, the processor 320, the computer-readable medium 322, the communication module 324, the motion module 326, the positioning module 328, the planner module 330, the transceiver 332, and the controller module 340. In one configuration, the vehicle sensing module 310 receives sensor data from the sensor module 302. The sensor module 302 may receive sensor data from a first sensor 304 and a second sensor 306. According to aspects of the present disclosure, the vehicle perception module 310 may receive sensor data directly from the first sensor 304 or the second sensor 306 to perform 3D automatic labeling of vehicular and non-vehicular objects from images captured by the first sensor 304 or the second sensor 306 of the automobile 350.
As shown in fig. 3, the vehicle perception module 310 includes an object seed detection module 312, an object seed refinement module 314, a 3D automatic labeling module 316, and a vehicle trajectory module 318 (e.g., video-based automatic labeling). The object seed detection module 312, the object seed refinement module 314, the 3D automatic labeling module 316, and the vehicle trajectory module 318 may be components of the same or different artificial neural networks, such as a Deep Neural Network (DNN). The object seed model of the object seed detection module 312 and/or the object seed refinement module 314 is not limited to a deep neural network. In operation, the vehicle sensing module 310 receives a data stream from the first sensor 304 and/or the second sensor 306. The data stream may include a 2D RGB image from the first sensor 304 and LIDAR data points from the second sensor 306. The data stream may include a plurality of frames, such as image frames. In this configuration, the first sensor 304 captures monocular (single camera) 2D RGB images.
The vehicle perception module 310 is configured to interpret a scene from a video input (e.g., a sensor module) as a perception task during autonomous driving of the automobile 350 based on 3D automatic markers describing objects (e.g., vehicles) within the scene. Aspects of the present disclosure relate to a method for automatically labeling 3D objects, including identifying, by an object seed detection module 312, initial object seeds for all frames from a given sequence of frames of a scene. For example, the object seed is an object that may be a vehicle, but may also be a non-vehicle object. Once identified, the object seed refinement module 314 performs an optimization process to refine each initial seed over the 2D/3D information while respecting physical and structural constraints (e.g., map and road constraints).
In aspects of the present disclosure, this part of the 3D automatic labeling approach involves physical and structural shape prior constraints. In this part of the process, additional shape prior information is applied, including that the vehicle should have wheels, must be on the ground, and must not penetrate other vehicles. The 3D auto-labeling module 316 completes the labeling of 3D vehicle and non-vehicle objects. The vehicle trajectory module 318 provides another optimization that links the 3D objects over time while following road and physical boundaries, creating smooth trajectories, as also shown in fig. 4.
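As a loose illustration of this temporal linking step, the sketch below greedily associates per-frame 3D object centers into trajectories, with a simple displacement gate standing in for the road and physical constraints; it is an assumption-laden simplification, not the optimization described in this disclosure.

```python
import numpy as np

def link_into_tracks(frames, max_jump=2.0):
    """Greedy temporal association of per-frame 3D object centers into trajectories.

    frames: list of (N_t, 3) arrays of object centers, one array per frame.
    max_jump: maximum allowed center displacement (meters) between consecutive frames,
              a crude stand-in for the physical/road constraints described above.
    """
    if not frames:
        return []
    tracks = [[c] for c in frames[0]]                 # start one track per first-frame object
    for centers in frames[1:]:
        unmatched = list(range(len(centers)))
        for track in tracks:
            if not unmatched:
                break
            dists = [np.linalg.norm(centers[j] - track[-1]) for j in unmatched]
            j_best = int(np.argmin(dists))
            if dists[j_best] < max_jump:              # smoothness gate
                track.append(centers[unmatched.pop(j_best)])
        tracks.extend([centers[j]] for j in unmatched)  # leftover detections start new tracks
    return tracks
```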
Overview of the 3D auto-labeling pipeline
Fig. 4 is a block diagram of a 3D auto-labeling pipeline 400 for the 3D automatic labeling system 300, according to an illustrative configuration of the present disclosure. In fig. 4, the data set 410 potentially includes real image and LIDAR data as well as synthetic input data. As described, the synthetic input data may include computer-rendered driving scenes and CAD models of different types of vehicles with ground-truth annotations. In these configurations, the synthetic input data is used to train a Coordinate Shape Space (CSS) network 430. In the example of fig. 4, the 2D object detector (e.g., the vehicle perception module 310 of fig. 3) has detected three vehicles in the input image 405 and has marked them with 2D labels 420a, 420b, and 420c, respectively. In this example, each 2D label is a 2D bounding box. The vehicle perception module 310 inputs the 2D labels 420a, 420b, and 420c to the CSS network 430.
In this aspect of the disclosure, for each 2D-labeled object, the vehicle perception module 310 generates a 2D Normalized Object Coordinates (NOCS) image and a shape vector. The vehicle perception module 310 decodes the 2D NOCS images and shape vectors into object models in the CSS network 430 (e.g., a continuously traversable CSS network). The vehicle perception module 310 then back-projects the 2D NOCS image within the viewing frustum onto the corresponding LIDAR point cloud. The vehicle perception module 310 also identifies one or more correspondences between the LIDAR point cloud and the object model to produce an initial estimate of the affine transformation between the LIDAR point cloud and the object model.
In this aspect of the disclosure, the object seed refinement module 314 performs an optimization process 440 that includes iteratively refining the estimate of the affine transformation via differentiable geometric and visual alignment using a differentiable signed distance field (SDF) renderer. The 3D auto-labeling module 316 may then perform an auto-label verification process 450 to discard clearly incorrect auto-labels before collecting the remaining auto-labels into the CSS label pool 460. Once all frames have been processed in a particular training cycle, the CSS network 430 may be retrained (retraining 470 in FIG. 4) and the next training cycle on the data set 410 may begin. Aspects of the 3D auto-labeling pipeline 400 are discussed in more detail below in conjunction with figs. 5A-8.
Coordinate Shape Space (CSS)
These configurations employ a coordinate space framework, called "DeepSDF" in the literature, to embed (watertight) vehicle models into a joint, compact shape space representation using a single neural network, such as the CSS network 430. The idea is to transform each input model into an SDF, where each value represents the distance to the closest surface, with positive and negative values representing exterior and interior regions, respectively. The SDF representation is desirable because it is generally easy for neural networks to learn. Finally, DeepSDF forms a shape space of implicit surfaces with a decoder that, given a latent code z (the shape vector mentioned above), can be queried at continuous 3D locations x = {x_1, …, x_N} to retrieve the corresponding SDF values s = {s_1, …, s_N}: f(x; z) = s.
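As a toy illustration of the f(x; z) = s query, the sketch below uses a small PyTorch multilayer perceptron as a stand-in for a DeepSDF-style decoder; the layer sizes and latent dimensionality are illustrative assumptions, not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class TinySDFDecoder(nn.Module):
    """Toy stand-in for a DeepSDF-style decoder: f(x; z) = s."""
    def __init__(self, latent_dim: int = 3, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # signed distance to the closest surface
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) query locations; z: (latent_dim,) shape vector shared by all queries
        z_tiled = z.expand(x.shape[0], -1)
        return self.net(torch.cat([x, z_tiled], dim=-1)).squeeze(-1)

# Query the decoder at N continuous 3D locations for one shape code.
decoder = TinySDFDecoder()
x = torch.rand(1024, 3) - 0.5                               # query points in a unit cube
z = torch.nn.functional.normalize(torch.randn(3), dim=0)    # shape vector on the unit sphere
s = decoder(x, z)                                           # SDF values: negative inside, positive outside
```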
To facilitate approximate deformable shape matching, these configurations combine the shape space with NOCS to form a continuously traversable CSS, as discussed above. To do so, these configurations scale each model to unit diameter and interpret the 3D coordinates of the 0-level set as a dense surface description.
To train the function f, these configurations use synthetic input data, including multiple CAD models of vehicles and rendered traffic scenes with accompanying ground-truth labels. These configurations train following the original DeepSDF method, but after each iteration (e.g., after each pass through the training loop) the latent vectors (e.g., shape vectors) are projected onto the unit sphere. In the CSS, each vehicle corresponds to a single shape vector. For example, (0, 1, 1) may be an SUV, (0, 1, 0) may be a dual-purpose vehicle, and (0, 0, 1) may be a Porsche®. The vectors are continuous, meaning that the CSS can be traversed continuously from one vehicle to another (as if one vehicle "morphs" into another as the shape space is traversed). In these configurations, the CSS is three-dimensional, but in other configurations the shape space may have a different dimensionality.
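Because the shape vectors lie on the unit sphere and the space is continuous, traversing from one vehicle to another can be illustrated by interpolating between two codes. A minimal sketch, assuming spherical linear interpolation (which this disclosure does not prescribe), using the illustrative codes mentioned above:

```python
import numpy as np

def slerp(z0: np.ndarray, z1: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two unit-norm shape vectors."""
    z0 = z0 / np.linalg.norm(z0)
    z1 = z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0, z1), -1.0, 1.0))
    if omega < 1e-6:                       # nearly identical codes
        return z0
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

z_a, z_b = np.array([0.0, 1.0, 1.0]), np.array([0.0, 0.0, 1.0])
# Walking t from 0 to 1 "morphs" one vehicle shape into the other.
waypoints = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 5)]
```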
Differentiable SDF rendering
One component of the 3D auto-labeling pipeline 400 is the ability to optimize objects with respect to pose, scale, and shape. As described above, these functions are performed by the object seed refinement module 314. To this end, these configurations include a differentiable SDF renderer. Such a differentiable SDF renderer avoids mesh-related problems such as connectivity or self-intersections, and allows different ways of sampling the representation. These configurations also employ an alternative formulation for rendering implicit surfaces that supports backpropagation.
One aspect of the differentiable SDF renderer is projection onto the 0 iso-surface. Given query points x_i and associated signed distance values s_i, these configurations include a differentiable way to access the implicit surface. Simply selecting query points based on their signed distance values does not yield derivatives with respect to the latent vector. Furthermore, regularly sampled locations only approximate positions on the surface. These configurations exploit the property that differentiating the SDF with respect to its position yields the surface normal at that point, computed in practice with a backward pass:

n_i = ∇_x f(x_i; z) / ||∇_x f(x_i; z)||

Because the normal provides the direction to the closest surface and the signed distance value provides the exact distance to it, each query location can be projected to a 3D surface location p_i as follows:

p_i = x_i − s_i · n_i

To obtain a clean surface projection, these configurations disregard all points x_i outside a narrow band of the surface (|s_i| > 0.03). A schematic explanation is provided in figs. 5A-5C.
Figs. 5A-5C show a surface projection 500 of an object using an SDF, according to an illustrative configuration of the disclosure. FIG. 5A illustrates an object surface 510 within a query grid 520. As shown in FIG. 5A, locations inside the object surface 510 have negative SDF values, while those outside the object surface 510 have positive SDF values. Fig. 5B shows the normal at a point 530 outside the object surface 510. Fig. 5C shows a projected object surface point 540 located on the object surface 510.
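A minimal sketch of this projection step, using PyTorch autograd for the backward-pass normal and the p_i = x_i − s_i · n_i update; the decoder argument is any f(x; z) decoder, such as the toy one sketched earlier.

```python
import torch

def project_to_surface(decoder, x: torch.Tensor, z: torch.Tensor, band: float = 0.03):
    """Project query points x onto the 0 iso-surface of f(.; z).

    Keeps only points inside a narrow band |s_i| < band around the surface.
    """
    x = x.clone().requires_grad_(True)
    s = decoder(x, z)                                   # signed distances s_i
    grad = torch.autograd.grad(s.sum(), x)[0]           # backward pass: d f / d x_i
    n = torch.nn.functional.normalize(grad, dim=-1)     # unit normals point toward the surface
    p = x - s.unsqueeze(-1) * n                         # p_i = x_i - s_i * n_i
    keep = s.abs() < band                               # discard points far from the surface
    return p[keep].detach(), n[keep].detach()
```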
Another aspect of the differentiable SDF renderer is the use of surface tangent disks. In the field of computer graphics, the concept of surface elements (surfels) is a mature alternative to connected triangle primitives. The differentiable SDF representation in these configurations yields oriented points and can be used immediately to render surface disks. To obtain a watertight surface, a disk diameter is chosen that is large enough to close holes. The surface disks may be constructed as follows:

1. Given a projected point p_i and its normal n_i, the 3D coordinates of the resulting tangent plane visible on the screen are estimated. By solving a system of linear equations for the plane and the camera projection, the distance d of the plane to each 2D pixel (u, v) can be calculated, resulting in the following solution:

d = (p_i · n_i) / (n_i · K^{-1} · (u, v, 1)^T)

where K^{-1} is the inverse camera matrix, followed by back-projection to obtain the final 3D plane coordinates: P = K^{-1} · (u·d, v·d, d)^T.

2. The distance between each plane point P and the surface point p_i is estimated and clamped if it exceeds the disk diameter: M = max(diam − ||p_i − P||_2, 0). To ensure watertightness, the diameter diam is computed from the density of the query locations.

The above computation is performed for each pixel to obtain, for each point p_i, a depth map D_i and a tangent-disk mask M_i.
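A minimal sketch of the per-pixel tangent-plane computation under the standard ray-plane intersection assumption; the disk diameter is taken as an input here because the exact density-based diameter rule is not reproduced above.

```python
import numpy as np

def tangent_disk_depth(K, p_i, n_i, height, width, diam):
    """Per-pixel depth of the tangent plane through surface point p_i with normal n_i,
    plus a mask limiting the plane to a disk of the given diameter."""
    K_inv = np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ K_inv.T   # back-projected pixel rays
    denom = rays @ n_i
    denom = np.where(np.abs(denom) < 1e-8, 1e-8, denom)           # avoid division by zero
    d = (p_i @ n_i) / denom                                       # plane distance per pixel
    P = rays * d[..., None]                                       # 3D plane coordinates per pixel
    m = np.maximum(diam - np.linalg.norm(P - p_i, axis=-1), 0.0)  # tangent-disk mask M_i
    return d, m
```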
Another aspect of the differentiable SDF renderer is the rendering function. To generate the final rendering, these configurations employ a function that composites the layers of disks for the 2D projection onto the image plane. This may include combining the colors of different point primitives based on their depth values: the closer a primitive is to the camera, the stronger its contribution. These configurations use a softmax to ensure that all primitive contributions sum to 1 at each pixel. More specifically, the rendering function is

R = Σ_i W_i ⊙ NOCS(p_i)

where NOCS(p_i) returns the coordinate coloring for the resulting image and W_i is a weighting mask defining the contribution of each disk:

W_i = (M_i ⊙ exp(D̄_i · σ)) / Σ_j (M_j ⊙ exp(D̄_j · σ))

where D̄_i is the normalized depth and σ is a transparency constant (as σ → ∞ the rendering becomes completely opaque because only the nearest primitive is rendered). The foregoing formulation enables gradient flow from pixels to surface points and thus allows image-based optimization. The aforementioned optimization functions, after the initialization phase, may be performed by the object seed refinement module 314.
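A minimal sketch of depth-weighted softmax compositing consistent with the description above (closer disks contribute more and weights sum to one per pixel); the exact weighting and normalization used by the disclosure may differ.

```python
import torch

def composite_disks(depths, masks, colors, sigma: float = 50.0):
    """Blend per-disk NOCS colors with a depth-weighted softmax.

    depths: (K, H, W) per-disk depth maps; masks: (K, H, W) tangent-disk masks;
    colors: (K, 3) NOCS color of each disk.
    """
    # Normalize depth so that larger values mean closer to the camera.
    d_norm = 1.0 - depths / (depths.amax() + 1e-8)
    logits = d_norm * sigma + torch.log(masks.clamp_min(1e-30))  # masked-out pixels are suppressed
    w = torch.softmax(logits, dim=0)                             # contributions sum to 1 per pixel
    return torch.einsum('khw,kc->chw', w, colors)                # (3, H, W) rendered NOCS image
```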
3D object detection
One rationale for the automatic labeling approach in these configurations is to recover higher-complexity labels using weak labels and strong, differentiable priors. Although this idea has wide applicability, these configurations focus particularly on cuboid auto-labeling of driving scenes. As discussed above in connection with figs. 3 and 4, the 3D auto-labeling module 316 may run multiple cycles (iterations) of the 3D auto-labeling pipeline 400 during the training phase. In the first training cycle, the CSS label pool 460 consists entirely of synthetic labels, and the CSS network 430 (e.g., a trained CSS network) has not yet adapted well to real images. The result can be noisy NOCS predictions that are reliable only for object instances that are well captured in the scene.
In one aspect of the present disclosure, the vehicle perception module 310 directs a predetermined training curriculum in which the CSS network 430 is first exposed to easy annotations, with the vehicle perception module 310 adding difficulty on subsequent training cycles. In these configurations, the difficulty of an annotation can be defined by measuring the pixel size of the 2D label, the amount of intersection with other 2D labels, and whether the 2D label touches the boundary of the image (typically indicating object truncation). The vehicle perception module 310 includes thresholds for these criteria to define increasingly difficult curriculum stages.
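A minimal sketch of the kind of difficulty gating described here; the thresholds below are illustrative assumptions, not values from this disclosure, and would be relaxed on later training cycles.

```python
def is_easy_enough(box, other_boxes, image_w, image_h,
                   min_pixels=4000, max_iou=0.1, allow_truncated=False):
    """Decide whether a 2D annotation is admitted in the current curriculum stage.

    box and other_boxes are (x1, y1, x2, y2) pixel rectangles.
    """
    x1, y1, x2, y2 = box
    area = max(0, x2 - x1) * max(0, y2 - y1)
    if area < min_pixels:                       # too small to label reliably
        return False
    if not allow_truncated and (x1 <= 0 or y1 <= 0 or x2 >= image_w or y2 >= image_h):
        return False                            # touches the image border: likely truncated
    for ob in other_boxes:
        ix = max(0, min(x2, ob[2]) - max(x1, ob[0]))
        iy = max(0, min(y2, ob[3]) - max(y1, ob[1]))
        inter = ix * iy
        union = area + max(0, ob[2] - ob[0]) * max(0, ob[3] - ob[1]) - inter
        if union > 0 and inter / union > max_iou:
            return False                        # heavily overlapped: likely occluded
    return True
```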
For example, the CSS network 430 derives from a ResNet18 backbone and, following an encoder-decoder structure, processes 128 × 128 input patches to output a NOCS map of the same size and a 3D shape vector. Additional details regarding the structure of the CSS network 430 are provided below in conjunction with the discussion of FIG. 8. Prior to the first annotation cycle, the vehicle perception module 310 trains the CSS network 430 to infer the 2D NOCS map and shape vector from the patches. As mentioned above, this mapping may be bootstrapped from the synthetic input data.
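A minimal PyTorch skeleton consistent with this description (ResNet18 encoder, 128 × 128 patch in, NOCS map and shape vector out), assuming a recent torchvision is available; the decoder head layout is an assumption and not the architecture of FIG. 8.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CSSNet(nn.Module):
    """128x128 patch -> (128x128 NOCS map, 3D shape vector). Illustrative layout only."""
    def __init__(self, shape_dim: int = 3):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, 4, 4)
        self.shape_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(512, shape_dim))
        up, ch = [], [512, 256, 128, 64, 32]
        for c_in, c_out in zip(ch[:-1], ch[1:]):
            up += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()]
        up += [nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid()]  # NOCS in [0, 1]
        self.nocs_head = nn.Sequential(*up)

    def forward(self, patch: torch.Tensor):
        feat = self.encoder(patch)                                   # (B, 512, 4, 4) for 128x128 input
        nocs = self.nocs_head(feat)                                  # (B, 3, 128, 128)
        z = nn.functional.normalize(self.shape_head(feat), dim=-1)   # unit-sphere shape vector
        return nocs, z
```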
Fig. 6 is a diagram of the initialization portion of a 3D auto-labeling pipeline 600 for the 3D automatic labeling system 300 of fig. 3, according to an illustrative configuration of the present disclosure. For a given image patch (see the patch from the input image 660 defined by the 2D label 420 in FIG. 6), the vehicle perception module 310 uses the CSS network 430 to infer a 2D NOCS map (denoted Ĉ) and a shape vector z (620). The vehicle perception module 310 decodes z into an SDF and retrieves the 3D surface points p = {p_1, …, p_N} of the object model 630 in its local frame, computing their NOCS coordinates. The vehicle perception module 310 also projects (650) the 3D LIDAR points l = {l_1, …, l_K} inside the viewing frustum onto the patch and collects the corresponding NOCS coordinates l^c.
To estimate the initial pose and scale, in this configuration, the vehicle perception module 310 establishes 3D-3D correspondences between p and l to estimate an initial affine transformation between the points l of the LIDAR point cloud and the points p of the object model. To this end, for each p_i the vehicle perception module 310 finds its nearest neighbor in terms of NOCS distance and retains the correspondence if that distance is below a threshold. Finally, the vehicle perception module 310 may run a procedure known in the literature as Procrustes analysis in conjunction with a random sample consensus (RANSAC) algorithm to estimate the pose (R, t) and the scale s. These operations are represented in fig. 6 by the 3D-3D RANSAC 640.
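The following Python sketch shows one way to realize this 3D-3D alignment step, with a Procrustes (Umeyama) similarity estimate inside a RANSAC loop. The helper names, the use of numpy, and the default iteration count are assumptions for illustration rather than the disclosed implementation.

    import numpy as np

    def procrustes_similarity(src, dst):
        # Umeyama / Procrustes: least-squares similarity (s, R, t) with dst ≈ s * R @ src + t.
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        xs, xd = src - mu_s, dst - mu_d
        cov = xd.T @ xs / len(src)
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                      # guard against reflections
        R = U @ S @ Vt
        s = np.trace(np.diag(D) @ S) / (xs ** 2).sum(axis=1).mean()
        t = mu_d - s * R @ mu_s
        return s, R, t

    def ransac_pose_scale(model_pts, lidar_pts, iters=9, inlier_thresh=0.2):
        # model_pts[i] and lidar_pts[i] are the NOCS nearest-neighbor correspondences.
        best_inliers = None
        for _ in range(iters):
            idx = np.random.choice(len(model_pts), 4, replace=False)   # n = 4 samples
            s, R, t = procrustes_similarity(model_pts[idx], lidar_pts[idx])
            resid = np.linalg.norm(s * model_pts @ R.T + t - lidar_pts, axis=1)
            inliers = resid < inlier_thresh                            # 0.2 m inlier band
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # refit on the best-fit inliers for the final pose and scale
        return procrustes_similarity(model_pts[best_inliers], lidar_pts[best_inliers])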
At this point, the vehicle perception module 310 may begin a differentiable optimization over the complementary 2D and 3D evidence. While projected 2D information provides a strong cue for orientation and shape, the 3D points allow reasoning about scale and translation. In each iteration, the vehicle perception module 310 decodes the current shape vector estimate, extracts the surface points p_i, and transforms them into the scene frame with the current estimates of pose and scale: p_i' = s R p_i + t. This process results in a refined or optimized affine transformation between the points l of the LIDAR point cloud and the points p of the object model.
Given the surface model points in the scene frame, separate 2D and 3D losses are calculated as follows. For the 2D loss, the optimization process 440 employs the differentiable SDF renderer to generate a NOCS rendering that seeks maximum alignment with the predicted NOCS map. Because the prediction may be noisy (especially during the first training cycles), directly minimizing a dense pixel-wise dissimilarity between the rendering and the prediction may lead to an unsatisfactory solution. Instead, for each rendered spatial pixel r_i, the optimization process 440 determines its nearest NOCS-space neighbor in the predicted map, called m_j, and sets them as corresponding if their NOCS distance is below a threshold. To allow for gradient flow, the object seed refinement module 314 uses their spatial indices to locally resample the image. Thus, the 2D loss is the average distance in NOCS space over all such correspondences C_2D:

loss_2D = (1 / |C_2D|) Σ_{(i,j) ∈ C_2D} ||r_i − m_j||
For the 3D loss, for each transformed surface point p_i' the vehicle perception module 310 determines its nearest neighbor among the LIDAR points l and keeps the pair if it is closer than 0.25 m. Because the vehicle perception module 310 typically produces good initializations, outliers in the optimization can be avoided by using this strict threshold. The 3D loss is the average distance over all such correspondences C_3D:

loss_3D = (1 / |C_3D|) Σ_{(i,j) ∈ C_3D} ||p_i' − l_j||
In summary, in these configurations, the final criterion is the sum of the two losses: loss = loss_2D + loss_3D. In these configurations, the two terms are left unweighted because they operate at similar magnitudes. Although described with reference to the 2D and 3D losses, additional losses may be derived from the structural and physical constraints. For example, the object seed refinement module 314 is configured to access vehicle shape prior information about roads and physical boundaries. In this example, the object seed refinement module 314 is configured to adjust the linking of the 3D object vehicles over time by applying the roads and physical boundaries to the trajectory, which may be optimized based on the additive losses from the applied structural and physical constraints.
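To make the two loss terms concrete, the following numpy sketch computes non-differentiable versions of loss_2D and loss_3D from nearest-neighbor correspondences. In the disclosure these quantities are computed with differentiable rendering and resampling so that gradients can flow; the NOCS threshold value and function names here are illustrative assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def loss_3d(surface_pts_scene, lidar_pts, max_dist=0.25):
        # Average distance from each transformed surface point to its nearest
        # LIDAR frustum point, keeping only pairs closer than 0.25 m.
        dist, _ = cKDTree(lidar_pts).query(surface_pts_scene)
        kept = dist < max_dist
        return dist[kept].mean() if kept.any() else 0.0

    def loss_2d(rendered_nocs, predicted_nocs, mask, nocs_thresh=0.1):
        # rendered_nocs / predicted_nocs: (H, W, 3) NOCS images. For each rendered
        # foreground pixel, find the nearest predicted NOCS value and average the
        # distances of the kept correspondences.
        r = rendered_nocs[mask].reshape(-1, 3)
        m = predicted_nocs[mask].reshape(-1, 3)
        dist, _ = cKDTree(m).query(r)
        kept = dist < nocs_thresh
        return dist[kept].mean() if kept.any() else 0.0

    # total = loss_2d(...) + loss_3d(...)   # unweighted sum of the two terms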
Referring again to the automatic mark verification process 450 in FIG. 4, the optimization framework may sometimes produce incorrect results, so the impact of incorrectly inferred automatic marks should be reduced. To this end, in these configurations, the object seed refinement module 314 performs geometric and projective verification to remove the worst automatic marks (e.g., cuboids). The object seed refinement module 314 measures the fraction of LIDAR points lying in a narrow band (0.2 m) around the automatically marked surface and rejects the automatic mark if fewer than 60% of the points fall inside the band. Further, the object seed refinement module 314 defines a projection constraint in which an automatic mark is rejected if the intersection-over-union (IoU) between the rendered mask and the provided 2D mark is below 70%.
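A minimal sketch of these two verification checks, assuming numpy arrays for the point sets and boolean arrays for the masks; the function and parameter names are illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    def verify_autolabel(surface_pts, lidar_pts, rendered_mask, label_mask_2d,
                         band=0.2, min_inside=0.6, min_iou=0.7):
        # Geometric check: at least 60% of the frustum LIDAR points must lie
        # within a narrow 0.2 m band around the auto-labeled surface.
        dist, _ = cKDTree(surface_pts).query(lidar_pts)
        if (dist < band).mean() < min_inside:
            return False
        # Projective check: the rendered mask and the provided 2D label must
        # overlap with an intersection-over-union of at least 70%.
        inter = np.logical_and(rendered_mask, label_mask_2d).sum()
        union = np.logical_or(rendered_mask, label_mask_2d).sum()
        return union > 0 and inter / union >= min_iou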
In these configurations, the automatic marks that survive the automatic mark verification process 450 are collected and added to the CSS label pool 460. After the first training cycle, subsequent training cycles therefore use a mix of synthetic and real samples to retrain the CSS network 430. The CSS network 430 is retrained over multiple self-improving training cycles, resulting in better initializations and more accurate automatic labels.
Fig. 7 shows an example of a 3D mark output by the 3D automatic marking pipeline 400 of fig. 4 for the 3D automatic marking system 300 of fig. 3, according to an illustrative configuration of the present disclosure. The scene depicted in the input image 710 includes an object 720 (a vehicle) and an object 730 (another vehicle). In this example, 3D auto-mark module 316 has extracted 3D mark 740a of object 720 and 3D mark 740b of object 730. Fig. 7 shows that the 3D marker output to the 3D object detector (e.g., object seed detection module 312 and object seed refinement module 314) in this configuration is a cuboid (e.g., a 3D bounding box).
3D object detection
In these configurations, 3D cuboid auto-labeling is not the final goal, but a means to an end, namely 3D object detection. As known to those skilled in the art, once the 3D auto-labeling module 316 automatically extracts a 3D label (e.g., a cuboid) of an object, it is a relatively simple matter for the vehicle perception module 310 to perform 3D object detection of the object based at least in part on the extracted 3D label. In aspects of the present disclosure, the vehicle trajectory module 318 is trained to plan the trajectory of the own vehicle from the linked trajectories of the automatically labeled 3D object vehicles while conforming to the road and physical boundaries.
Additional implementation details regarding pipeline components
Fig. 8 shows a system architecture of a 3D auto-labeling pipeline 800 for the 3D auto-labeling system 300 of fig. 3, according to an illustrative configuration of the present disclosure. As described above, in these configurations, the CSS network 430 includes a ResNet18 backbone architecture. In these configurations, the decoder uses bilinear interpolation as the upsampling operation instead of deconvolution, to reduce the number of parameters and computations. Each upsampling is followed by a concatenation of the output feature map with the feature maps from the previous stage, and one convolutional layer. Because the CSS network 430 is trained on synthetic input data, it can be initialized with ImageNet weights, and the top five layers are frozen to prevent overfitting to the peculiarities of the rendered data. In this configuration, the five heads 805 of the CSS network 430 are responsible for outputting the U, V, and W channels of the NOCS, as well as the mask (610) of the object and its latent vector (e.g., shape vector 620) encoding its DeepSDF shape.
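The following PyTorch sketch illustrates the decoder pattern described above (bilinear upsampling, skip concatenation, one convolution) and the five output heads. The channel counts, the latent dimension, and the global pooling used for the shape head are assumptions for illustration, not values taken from the disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UpBlock(nn.Module):
        # Bilinear upsampling followed by skip concatenation and one convolution,
        # used in place of deconvolution to keep the parameter count low.
        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)

        def forward(self, x, skip):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            return F.relu(self.conv(torch.cat([x, skip], dim=1)))

    class CSSHeads(nn.Module):
        # Five output heads: U, V, W NOCS channels, object mask, and a latent
        # shape vector; channel sizes and latent dimension are illustrative.
        def __init__(self, feat_ch=64, latent_dim=256):
            super().__init__()
            self.u = nn.Conv2d(feat_ch, 1, 1)
            self.v = nn.Conv2d(feat_ch, 1, 1)
            self.w = nn.Conv2d(feat_ch, 1, 1)
            self.mask = nn.Conv2d(feat_ch, 1, 1)
            self.shape = nn.Linear(feat_ch, latent_dim)

        def forward(self, feat):
            pooled = feat.mean(dim=(2, 3))          # global average for the shape head
            return (torch.sigmoid(self.u(feat)), torch.sigmoid(self.v(feat)),
                    torch.sigmoid(self.w(feat)), torch.sigmoid(self.mask(feat)),
                    self.shape(pooled))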
The pose estimation block 810 is based on 3D-3D correspondence estimation. In one aspect of the disclosure, the process is defined as follows: the CSS network 430 outputs the NOCS, mapping each RGB pixel to a 3D location on the object surface. The NOCS are back-projected onto the LIDAR frustum points (650) using the provided camera parameters. In addition, the CSS network 430 outputs a latent vector (e.g., shape vector 620), which is then fed to the DeepSDF network 820 (DSDF) and transformed into a surface point cloud using 0-isosurface projection, as described above. Because the DeepSDF network 820 is trained to output a normalized model placed at the origin, each point on the resulting model surface represents its NOCS. At this point, the system is ready to proceed with pose estimation.
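One common way to realize the 0-isosurface projection mentioned above is to step each query point along the normalized SDF gradient by its signed distance, which also yields the surface normals used later for back-face culling. The sketch below assumes a deepsdf callable mapping (latent vector, points) to signed distances; it is an illustration, not the disclosed implementation.

    import torch
    import torch.nn.functional as F

    def zero_isosurface_projection(deepsdf, z, query_points):
        # Move each query point onto the SDF zero level set along the gradient
        # direction; the normalized gradient doubles as the surface normal.
        pts = query_points.clone().requires_grad_(True)
        sdf = deepsdf(z, pts)                                   # (N, 1) signed distances
        grad, = torch.autograd.grad(sdf.sum(), pts, create_graph=True)
        normals = F.normalize(grad, dim=-1)
        surface_pts = pts - sdf * normals                       # projection step
        return surface_pts, normals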
The NOCS are then used to establish correspondences between the frustum points and the model points. The back-projected frustum NOCS are compared to the predicted model coordinates, and the nearest neighbor of each frustum point is estimated. RANSAC may be used for robust outlier rejection. In each iteration, four random points (n = 4) are selected from the correspondence set and fed to the Procrustes algorithm, providing an initial estimate of the pose and scale of the model (i.e., an initial estimate of the affine transformation). In these configurations, the following RANSAC parameters may be used: the number of iterations k follows the standard formula based on the expected success probability p:

k = log(1 − p) / log(1 − w^n)

where w is the inlier probability and n is the number of independently selected data points. In one configuration, p is 0.9 and w is 0.7.
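With the stated values (p = 0.9, w = 0.7, n = 4), the formula gives roughly nine iterations, as the short computation below shows.

    import math

    p, w, n = 0.9, 0.7, 4
    k = math.log(1 - p) / math.log(1 - w ** n)
    print(round(k, 1))   # ≈ 8.4, so about 9 iterations are sufficient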
In these configurations, a threshold of 0.2 m is used to identify the inliers and select the best fit. The final pose and scale of the initial affine transformation are then calculated based on the best-fit inliers.
Given the output of the CSS network 430 and the pose initialization, the optimization process 440 continues with the optimization phase (refer again to FIG. 8). The input to the DeepSDF network 820 is formed by concatenating the latent vector z (620) with the 3D query grid x (520). The DeepSDF network 820 outputs an SDF value for each query point on the query grid 520, which is used for 0-isosurface projection, providing a dense surface point cloud. The resulting point cloud is then transformed using the estimated pose and scale from the pose estimation block 810. Points that are not visible from the given camera view can be filtered using simple back-face culling, since the surface normals have already been calculated for the 0-isosurface projection. At this stage, the vehicle perception module 310 may apply the 3D loss between the resulting transformed point cloud and the input LIDAR frustum points. The surface point cloud is also used as input to a differentiable renderer 860, which renders the NOCS to RGB, and the 2D loss is applied between the NOCS prediction of the CSS network 430 and the output NOCS of the differentiable renderer 860. The latent vector (e.g., shape vector 620) and pose 830 are then updated, and the process is repeated until termination.
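The overall optimization loop can be sketched as follows in PyTorch. Every callable (the DeepSDF decoder, isosurface projection, back-face culling, the differentiable renderer, and the two losses) is passed in by the caller, and optimizing the rotation matrix directly is a simplification of whatever pose parameterization is actually used; the structure, not the specifics, is the point.

    import torch

    def optimize_pose_shape(z, R, t, s, deepsdf, project_isosurface, backface_cull,
                            renderer, loss_2d, loss_3d, query_grid,
                            lidar_frustum, predicted_nocs, steps=100, lr=1e-2):
        # Jointly refine the latent shape vector, pose, and scale by gradient
        # descent on the summed 2D and 3D losses.
        params = [p.requires_grad_() for p in (z, R, t, s)]
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(steps):
            sdf = deepsdf(z, query_grid)                        # SDF per query point
            pts, normals = project_isosurface(sdf, query_grid)  # 0-isosurface projection
            pts = s * (pts @ R.T) + t                           # model frame -> scene frame
            keep = backface_cull(pts, normals)                  # drop points not visible
            l3d = loss_3d(pts[keep], lidar_frustum)
            rendered = renderer(pts[keep])                      # differentiable NOCS render
            l2d = loss_2d(rendered, predicted_nocs)
            loss = l2d + l3d                                    # unweighted sum
            opt.zero_grad()
            loss.backward()
            opt.step()
        return z, R, t, s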
The 3D loss described above supports obtaining an accurate pose/shape alignment with the frustum points. However, in some cases only a few points are available, which leads to poor alignment results. The 2D loss, on the other hand, enables precise alignment in screen space over dense pixels, but is generally less suited to optimizing 3D scale and translation and relies heavily on its initial estimate. The combination of the two losses (2D and 3D) provides the best of both: dense 2D alignment and robust scale/translation estimation.
FIG. 9 is a flow diagram illustrating a method for 3D automatic tagging of objects with structural and physical constraints in accordance with aspects of the present disclosure. The method 900 begins at block 902, where an object detector identifies initial object seeds for all frames from a given sequence of frames of a scene using 2D/3D data. For example, as shown in fig. 3, the object seed detection module 312 is trained to identify initial object seeds for all frames of a given sequence of frames of the scene from the sensor module 302. An object seed may correspond to a vehicle, but may also correspond to a non-vehicle object, as shown in fig. 4.
At block 904, each of the initial object seeds is refined over the 2D/3D data while respecting predetermined physical and structural constraints on the automatically tagged 3D object vehicles within the scene. For example, as shown in fig. 3, the object seed refinement module 314 is trained to refine the initial object seeds from the object seed detection module 312 by applying predetermined shape prior information (e.g., map and road constraints). Once the seeds are identified, the optimization process of the object seed refinement module refines each initial object seed over the 2D/3D information while complying with the map and road constraints. This portion of the method 900 involves shape priors. In this portion of the method 900, additional shape prior information (e.g., a vehicle should have wheels, must rest on the ground, and must not penetrate other objects) is applied to constrain 3D object detection and enable the 3D auto-labeling module 316 to automatically label the 3D object vehicles in the scene, as shown in fig. 4.
At block 906, the automatically tagged 3D object vehicles are linked into trajectories over time while complying with the predetermined structural and physical constraints. For example, as shown in FIG. 3, the vehicle trajectory module 318 is trained to link the automatically tagged 3D object vehicles into trajectories over time while complying with the predetermined structural and physical constraints. The method 900 also includes accessing vehicle shape prior information about the road and the physical boundaries. The method 900 further includes adjusting the linking of the 3D object vehicles over time by applying the road and the physical boundaries to the trajectory. This enables the method 900 to perform the linking of the automatically tagged 3D object vehicles at block 906.
At block 908, the trajectory of the host vehicle is planned according to the linked trajectory of the automatically tagged 3D object vehicle while following the road and physical boundaries. For example, as shown in fig. 3, the vehicle trajectory module 318 is configured to plan the trajectory of the own vehicle (e.g., the automobile 350) from the linked trajectory of the automatically tagged 3D object vehicle while following the road and physical boundaries. Further, the controller module 340 is configured to select a vehicle control action (e.g., acceleration, braking, steering, etc.). The method 900 further includes performing three-dimensional object detection of automatically tagged 3D vehicle objects within the scene. The method 900 further includes performing three-dimensional pose detection of automatically tagged 3D vehicle objects within the scene.
In some aspects of the disclosure, method 900 may be performed by SOC 100 (fig. 1) or software architecture 200 (fig. 2) of host vehicle 150 (fig. 1). That is, each element of method 900 may be performed, for example, but not limited to, by SOC 100 of host vehicle 150, software architecture 200, or a processor (e.g., CPU 102), and/or other components included therein.
The various operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The apparatus may include various hardware and/or software components and/or modules including, but not limited to, a circuit, an Application Specific Integrated Circuit (ASIC), or a processor. In general, where there are operations illustrated in the figures, those operations may have corresponding means plus function elements with similar numbering.
As used herein, the term "determining" includes a wide variety of actions. For example, "determining" can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Additionally, "determining" may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Further, "determining" may include resolving, selecting, choosing, establishing, and the like.
As used herein, a phrase referring to "at least one of" a list of items refers to any combination of those items, including a single member. As an example, "at least one of: a, b, or c "is intended to encompass: a, b, c, a-b, a-c, b-c, and a-b-c.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a processor, a Digital Signal Processor (DSP), an ASIC, a field programmable gate array signal (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein configured in accordance with the disclosure. The processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine that is specially configured for the purposes of this disclosure. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of storage medium known in the art. Some examples of a storage medium may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, and so forth. A software module may comprise a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
The functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in hardware, an example hardware configuration may include a processing system in a device. The processing system may be implemented with a bus architecture. The bus may include any number of interconnecting buses and bridges depending on the specific application of the processing system and the overall design constraints. The bus may link together various circuits including the processor, the machine-readable medium, and the bus interface. A bus interface may connect a network adapter or the like to the processing system via the bus. The network adapter may implement signal processing functions. For certain aspects, a user interface (e.g., keypad, display, mouse, joystick, etc.) may also be connected to the bus. The bus may also link various other circuits such as timing sources, peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further.
The processor may be responsible for managing the bus and processing, including the execution of software stored on the machine-readable medium. Examples of processors that may be specially configured according to the present disclosure include microprocessors, microcontrollers, DSP processors, and other circuits that may execute software. Software should be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), registers, a magnetic disk, an optical disk, a hard drive, or any other suitable storage medium, or any combination thereof. The machine-readable medium may be embodied in a computer program product. The computer program product may include packaging materials.
In a hardware implementation, the machine-readable medium may be part of a processing system that is separate from the processor. However, as those skilled in the art will readily appreciate, the machine-readable medium, or any portion thereof, may be external to the processing system. By way of example, a machine-readable medium may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the device, all of which may be accessed by a processor through a bus interface. Alternatively or in addition, the machine-readable medium or any portion thereof may be integrated into a processor, such as in the case of a cache memory and/or a special purpose register file. Although the various components discussed may be described as having particular locations, such as local components, they may also be configured in various ways, such as with certain components configured as part of a distributed computing system.
The processing system may be configured with one or more microprocessors providing processor functionality and an external memory providing at least a portion of the machine readable medium, all microprocessors being linked together with other supporting circuitry through an external bus architecture. Alternatively, the processing system may include one or more neuromorphic processors for implementing the neuron and nervous system models described herein. As another alternative, the processing system may be implemented with an ASIC having at least a portion of the processor, bus interface, user interface, support circuitry, and machine-readable medium integrated into a single chip, or having one or more PGAs, PLDs, controllers, state machines, gating logic, discrete hardware components, or any other suitable circuitry, or any combination of circuitry that can perform the various functions described throughout this disclosure. Those skilled in the art will recognize how best to implement the described functionality of a processing system, depending on the particular application and the overall design constraints imposed on the overall system.
The machine-readable medium may include a plurality of software modules. The software modules include instructions that, when executed by the processor, cause the processing system to perform various functions. The software modules may include a transmission module and a reception module. Each software module may reside on a single storage device or be distributed across multiple storage devices. By way of example, the software module may be loaded into RAM from a hard disk drive when a triggering event occurs. During execution of the software module, the processor may load some instructions into the cache to increase access speed. One or more cache lines may then be loaded into the special register file for execution by the processor. When referring to the functionality of a software module in the following, it will be understood that such functionality is carried out by a processor when executing instructions from the software module. Further, it should be appreciated that aspects of the present disclosure result in improvements to the functionality of processors, computers, machines, or other systems implementing these aspects.
If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a non-transitory computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as Infrared (IR), radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc; where a disk typically reproduces data magnetically, and a disk reproduces data optically with a laser. Thus, in some aspects, computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). Additionally, for other aspects, the computer-readable medium may comprise a transitory computer-readable medium (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media.
Accordingly, certain aspects may include a computer program product for performing the operations presented herein. For example, such a computer program product may include a computer-readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. For certain aspects, a computer program product may include packaging materials.
Further, it should be understood that modules and/or other suitable means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station, as applicable. For example, such a device may be coupled to a server to facilitate the transmission of means for performing the methods described herein. Alternatively, the various methods described herein may be provided via storage (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk, etc.), such that a user terminal and/or base station may obtain the various methods upon coupling or providing the storage to the apparatus. Further, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
It is to be understood that the claims are not limited to the precise configuration and components described above. Various modifications, changes and variations may be made in the arrangement, operation and details of the methods and apparatus described above without departing from the scope of the claims.

Claims (15)

1. A method for 3D automatic labeling of objects with predetermined structural and physical constraints, comprising:
identifying initial object seeds for all frames from a given sequence of frames of a scene;
refining each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on the automatically tagged 3D object vehicles within the scene; and
linking the automatically tagged 3D object vehicle into a trajectory over time while complying with the predetermined structural and physical constraints.
2. The method of claim 1, further comprising: while following the road and the physical boundary, planning a trajectory of the own vehicle according to the link trajectory of the automatically labeled 3D object vehicle.
3. The method of claim 1, wherein identifying an initial object seed is performed by a vehicle perception module using 2D/3D data.
4. The method of claim 1, wherein refining the initial object seed comprises:
accessing vehicle shape prior information; and
discarding an incorrect automatic marking of the initial object seed when the initial object seed is identified as inconsistent with the vehicle shape prior information.
5. The method of claim 1, wherein linking the automatically tagged 3D object vehicle comprises:
accessing vehicle shape prior information about roads and physical boundaries; and
adjusting the linking of the 3D object vehicle over time by applying the road and physical boundaries to the trajectory.
6. The method of claim 1, further comprising:
planning a trajectory of the own vehicle according to perception of a scene from a video captured by the own vehicle;
performing three-dimensional object detection of automatically tagged 3D vehicle objects within the scene; or
Performing three-dimensional pose detection of automatically tagged 3D vehicle objects within the scene.
7. A non-transitory computer readable medium having recorded thereon program code for 3D automatic marking of objects with predetermined structural and physical constraints, the program code being executed by a processor and comprising:
program code to identify an initial object seed from all frames of a given sequence of frames of a scene;
program code for refining each of the initial object seeds on the 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene; and
program code for linking the automatically tagged 3D object vehicle into a trajectory over time while complying with the predetermined structural and physical constraints.
8. The non-transitory computer-readable medium of claim 7, further comprising: program code for planning a trajectory of the host vehicle based on the automatically marked linked trajectory of the 3D object vehicle while conforming to the road and the physical boundary.
9. The non-transitory computer-readable medium of claim 7, wherein the program code to identify the initial object seed is executed by a vehicle awareness module using 2D/3D data.
10. The non-transitory computer-readable medium of claim 7, wherein the program code for refining the initial object seed comprises:
program code for accessing vehicle shape prior information; and
program code for discarding an incorrect automatic marking of the initial object seed when the initial object seed is identified as inconsistent with the vehicle shape prior information.
11. The non-transitory computer-readable medium of claim 7, wherein the program code for linking the automatically tagged 3D object vehicle comprises:
program code for accessing vehicle shape prior information about the road and the physical boundary; and
program code for adjusting the linking of the 3D object vehicle over time by applying the road and physical boundaries to the trajectory.
12. The non-transitory computer-readable medium of claim 7, further comprising:
program code for planning a trajectory of the own vehicle in accordance with perception of a scene from video captured by the own vehicle;
program code for performing three-dimensional object detection of automatically tagged 3D vehicle objects within the scene; or
Program code for performing three-dimensional pose detection of automatically tagged 3D vehicle objects within the scene.
13. A system for 3D automatic labeling of objects with predetermined structural and physical constraints, the system comprising:
an object seed detection module trained to identify initial object seeds for all frames from a given sequence of frames of a scene;
an object seed refinement module trained to refine each of the initial object seeds over 2D/3D data while respecting predetermined structural and physical constraints on automatically tagged 3D object vehicles within the scene; and
a 3D auto-tagging module trained to link the auto-tagged 3D object vehicle into a trajectory over time while complying with the predetermined structural and physical constraints.
14. The system of claim 13, further comprising:
a vehicle trajectory module trained to plan a trajectory of a self vehicle from a linked trajectory of the automatically tagged 3D object vehicle while conforming to roads and physical boundaries; and
a vehicle perception module trained to perform identifying an initial object seed using 2D/3D data.
15. The system of claim 13, wherein the object seed refinement module is further trained to:
accessing vehicle shape prior information; and
discarding an incorrect automatic marking of the initial object seed when the initial object seed is identified as inconsistent with the vehicle shape prior information.
CN202011267490.3A 2019-11-14 2020-11-13 3D automatic tagging with structural and physical constraints Pending CN112800822A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962935246P 2019-11-14 2019-11-14
US62/935,246 2019-11-14
US17/026,057 US11482014B2 (en) 2019-11-14 2020-09-18 3D auto-labeling with structural and physical constraints
US17/026,057 2020-09-18

Publications (1)

Publication Number Publication Date
CN112800822A true CN112800822A (en) 2021-05-14

Family

ID=75807468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267490.3A Pending CN112800822A (en) 2019-11-14 2020-11-13 3D automatic tagging with structural and physical constraints

Country Status (1)

Country Link
CN (1) CN112800822A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505929A (en) * 2021-07-16 2021-10-15 中国人民解放军军事科学院国防科技创新研究院 Topological optimal structure prediction method based on embedded physical constraint deep learning technology
CN113505929B (en) * 2021-07-16 2024-04-16 中国人民解放军军事科学院国防科技创新研究院 Topological optimal structure prediction method based on embedded physical constraint deep learning technology
CN115115700A (en) * 2022-05-17 2022-09-27 清华大学 Object attitude estimation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11482014B2 (en) 3D auto-labeling with structural and physical constraints
US10991156B2 (en) Multi-modal data fusion for enhanced 3D perception for platforms
Menze et al. Object scene flow
Dong et al. Towards real-time monocular depth estimation for robotics: A survey
WO2019153245A1 (en) Systems and methods for deep localization and segmentation with 3d semantic map
JP7148718B2 (en) Parametric top-view representation of the scene
US11064178B2 (en) Deep virtual stereo odometry
CN114667437A (en) Map creation and localization for autonomous driving applications
CN112740268B (en) Target detection method and device
US11475628B2 (en) Monocular 3D vehicle modeling and auto-labeling using semantic keypoints
CN112991413A (en) Self-supervision depth estimation method and system
JP2018530825A (en) System and method for non-obstacle area detection
US20220148206A1 (en) Camera agnostic depth network
EP3980969A1 (en) Cross-modal sensor data alignment
CN111091038A (en) Training method, computer readable medium, and method and apparatus for detecting vanishing points
CN112800822A (en) 3D automatic tagging with structural and physical constraints
Zhao et al. Jperceiver: Joint perception network for depth, pose and layout estimation in driving scenes
CN116051779A (en) 3D surface reconstruction using point cloud densification for autonomous systems and applications using deep neural networks
CN116051780A (en) 3D surface reconstruction using artificial intelligence with point cloud densification for autonomous systems and applications
Wang et al. Lane detection algorithm based on temporal–spatial information matching and fusion
Holder et al. Learning to drive: End-to-end off-road path prediction
CN116189150A (en) Monocular 3D target detection method, device, equipment and medium based on fusion output
US20230135234A1 (en) Using neural networks for 3d surface structure estimation based on real-world data for autonomous systems and applications
CN116868239A (en) Static occupancy tracking
Zhang et al. 3D car-detection based on a Mobile Deep Sensor Fusion Model and real-scene applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination