CN109215067B - High-resolution 3-D point cloud generation based on CNN and CRF models

Info

Publication number: CN109215067B
Application number: CN201810695220.9A
Authority: CN
Inventors: 黄玉; 郑先廷; 朱俊; 张伟德
Original assignee: Baidu USA LLC
Current assignee: Baidu USA LLC
Priority date: 2017-07-03
Filing date: 2018-06-29
Publication date: 2023-03-10
Anticipated expiration: 2038-06-29
Also published as: CN109215067A

Abstract

In one embodiment, a method or system generates a high resolution 3-D point cloud from a low resolution 3-D point cloud and a camera-captured image to operate an autonomous driving vehicle (ADV). The system receives a first image, captured by a camera, of a driving environment. The system receives a second image representing a first depth map of a first point cloud corresponding to the driving environment. The system determines a second depth map by applying a convolutional neural network (CNN) model to the first image. The system generates a third depth map by applying a conditional random field (CRF) model to the first image, the second image, and the second depth map, the third depth map having a higher resolution than the first depth map such that the third depth map represents a second point cloud for perceiving the driving environment surrounding the ADV.

Description

High-resolution 3-D point cloud generation based on CNN and CRF models

Technical Field

Embodiments of the present disclosure generally relate to operating an autonomous vehicle. More particularly, embodiments of the present disclosure relate to generating high resolution three-dimensional (3-D) point clouds based on convolutional neural network (CNN) and conditional random field (CRF) models.

Background

Vehicles operating in an autonomous driving mode (e.g., driverless) may relieve occupants, particularly the driver, from some driving-related duties. When operating in an autonomous driving mode, the vehicle may be navigated to various locations using onboard sensors, allowing the vehicle to travel with minimal human interaction or in some cases without any passengers.

High resolution LIDAR data is important for enabling real-time 3-D scene reconstruction for autonomous vehicle (ADV) applications such as object segmentation, detection, tracking, and classification. However, high resolution LIDAR devices are typically expensive and not necessarily available.

Disclosure of Invention

In one aspect of the disclosure, there is provided a method of generating a high resolution three dimensional point cloud, the method comprising:

receiving a first image captured by a first camera, the first image capturing a portion of a driving environment of the autonomous vehicle;

receiving a second image representing a first depth map of a first point cloud generated by a lidar device corresponding to a portion of the driving environment;

determining a second depth map by applying a convolutional neural network model to the first image; and

generating a third depth map by applying a conditional random field (CRF) model to the first image, the second image and the second depth map, the third depth map having a higher resolution than the first depth map, wherein the third depth map represents a second point cloud for perceiving the driving environment surrounding the autonomous vehicle.

In another aspect of the disclosure, a non-transitory machine-readable medium is provided having instructions stored thereon, which when executed by a processor, cause the processor to perform operations comprising:

receiving a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a lidar device;

In yet another aspect of the present disclosure, there is provided a data processing system including:

a processor; and

a memory coupled to the processor to store instructions that, when executed by the processor, cause the processor to perform operations comprising:

Drawings

Embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.

FIG. 1 is a block diagram illustrating a networked system according to one embodiment.

FIG. 2 is a block diagram illustrating an example of an autonomous vehicle according to one embodiment.

FIG. 3 is a block diagram illustrating an example of a perception and planning system for use with an autonomous vehicle, according to one embodiment.

FIG. 4 is a block diagram illustrating an example of a high resolution point cloud module for use with an autonomous vehicle, according to one embodiment.

FIG. 5A is a diagram illustrating an exemplary ADV according to one embodiment.

Fig. 5B and 5C illustrate top and side views of a LIDAR/panoramic camera configuration for use with an autonomous vehicle, in accordance with some embodiments.

Fig. 5D-5F illustrate examples of monocular/stereo panoramic camera configurations according to some embodiments.

Fig. 6A and 6B show a flow diagram of an inference mode and a training mode, respectively, according to an embodiment.

Fig. 6C and 6D show a flow diagram of an inference mode and a training mode, respectively, according to an embodiment.

Fig. 7A and 7B are block diagrams illustrating examples of depth map generation according to some embodiments.

FIG. 8 is a diagram illustrating a contraction layer and an expansion layer of a convolutional neural network model, according to one embodiment.

Fig. 9A and 9B are block diagrams illustrating an example of high resolution depth map generation according to some embodiments.

FIG. 10 is a flow chart illustrating a method according to one embodiment.

Fig. 11A and 11B are block diagrams illustrating examples of depth map generation according to some embodiments.

Fig. 12 is a diagram illustrating a contraction (e.g., encoder/convolution) layer and an expansion (e.g., decoder/deconvolution) layer of a convolutional neural network model according to one embodiment.

FIG. 13 is a flow chart illustrating a method according to one embodiment.

Fig. 14A and 14B are block diagrams illustrating examples of depth map generation according to some embodiments.

FIG. 15 is a flow chart illustrating a method according to one embodiment.

FIG. 16 is a block diagram illustrating a data processing system according to one embodiment.

Detailed Description

Various embodiments and aspects of the disclosure will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of embodiments of the present disclosure.

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.

According to some embodiments, a method or system generates a high resolution 3-D point cloud from a low resolution 3-D point cloud and a camera-captured image to operate an autonomous driving vehicle (ADV). Machine learning (deep learning) techniques are used to combine a low resolution LIDAR unit with a calibrated multi-camera system to achieve the functional equivalent of a high resolution LIDAR unit for generating 3-D point clouds. The multi-camera system is designed to output a wide-angle (e.g., 360 degree) monocular or stereo color (e.g., red, green, and blue, or RGB) panoramic image. An end-to-end deep neural network is then trained with reliable data and applied, based on offline calibration parameters, to produce a wide-angle panoramic depth map from input signals comprising the wide-angle monocular or stereo panoramic images and a depth grid of a 3-D point cloud from a low-cost LIDAR projected onto those panoramic images. Finally, a high resolution 3-D point cloud may be generated from the wide-angle panoramic depth map. The same process applies to configurations with a narrower view angle (e.g., a limited range of angles) stereo camera and a narrower view angle low resolution LIDAR.
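The two geometric steps mentioned above (projecting a LIDAR point cloud onto an image to obtain a depth grid, and recovering a point cloud from a depth map) can be illustrated with a minimal sketch. It assumes a simple pinhole camera model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and points already expressed in the camera frame; the disclosure itself concerns panoramic images and calibrated multi-camera systems, whose projection model would differ.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward
DepthMap = List[List[Optional[float]]]  # None marks a pixel with no LIDAR return


def project_to_depth_map(points: List[Point], fx: float, fy: float,
                         cx: float, cy: float, width: int, height: int) -> DepthMap:
    """Project 3-D points onto the image plane, keeping the nearest depth per pixel."""
    depth: DepthMap = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:  # point lies behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            if current is None or z < current:
                depth[v][u] = z
    return depth


def depth_map_to_points(depth: DepthMap, fx: float, fy: float,
                        cx: float, cy: float) -> List[Point]:
    """Back-project every pixel that has a depth value into a 3-D point."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is not None:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

With a sparse LIDAR most pixels stay `None`; a denser depth map, once inferred, would yield a correspondingly denser point cloud through the same back-projection.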

According to one aspect, the system receives a first image captured by a first camera, the first image capturing a portion of the driving environment of the ADV. The system receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a light detection and ranging (LIDAR) device. The system downsamples the second image by a predetermined scale factor until the resolution of the second image reaches a predetermined threshold. The system generates a second depth map by applying a convolutional neural network (CNN) model to the first image and the downsampled second image, the second depth map having a higher resolution than the first depth map such that the second depth map represents a second point cloud for perceiving the driving environment surrounding the ADV.
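The repeated downsampling step can be sketched as follows. This is only an illustration: the decimation scheme (keeping every `factor`-th pixel) and the threshold expressed as a maximum width `max_width` are assumptions, since the text above does not specify how the scale factor or threshold are defined.

```python
from typing import List

Grid = List[List[float]]


def downsample_once(depth: Grid, factor: int) -> Grid:
    """Keep every `factor`-th pixel along both axes (nearest-neighbour decimation)."""
    return [row[::factor] for row in depth[::factor]]


def downsample_until(depth: Grid, factor: int, max_width: int) -> Grid:
    """Downsample repeatedly until the map is at most `max_width` pixels wide."""
    if factor < 2:
        raise ValueError("factor must be at least 2, otherwise the loop never ends")
    while len(depth[0]) > max_width:
        depth = downsample_once(depth, factor)
    return depth
```

For example, a 16-pixel-wide map with `factor=2` and `max_width=4` is halved twice.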

According to another aspect, the system receives a first image captured by a first camera, the first image capturing a portion of the driving environment of the ADV. The system receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a light detection and ranging (LIDAR) device. The system upsamples the second image by a predetermined scale factor to match the image scale of the first image. The system generates a second depth map by applying a convolutional neural network (CNN) model to the first image and the upsampled second image, the second depth map having a higher resolution than the first depth map, such that the second depth map represents a second point cloud for perceiving the driving environment around the ADV.
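A minimal sketch of the upsampling step, assuming nearest-neighbour replication by an integer scale factor; the interpolation scheme is an assumption, as the text only states that the second image is upsampled to match the scale of the first image.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(depth: Grid, factor: int) -> Grid:
    """Replicate every pixel `factor` times horizontally and vertically."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out: Grid = []
    for row in depth:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

After this step the depth image and the camera image share the same pixel grid, so they can be stacked as input channels of the same network.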

According to another aspect, the system receives a first image captured by a first camera, the first image capturing a portion of the driving environment of the ADV. The system receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a light detection and ranging (LIDAR) device. The system determines a second depth map by applying a convolutional neural network (CNN) model to the first image. The system generates a third depth map by applying a conditional random field (CRF) model to the first image, the second image, and the second depth map, the third depth map having a higher resolution than the first depth map such that the third depth map represents a second point cloud for perceiving the driving environment surrounding the ADV.
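To give a feel for how a CRF can fuse the three inputs, the sketch below minimizes a simple Gaussian (quadratic) CRF energy by coordinate-wise updates. This is an illustrative stand-in, not the model of the disclosure: the energy form, the weights `w_lidar` and `w_cnn`, the intensity bandwidth `sigma`, and the 4-neighbour grid are all assumptions. Unary terms pull each pixel toward the CNN estimate and, where available, the LIDAR measurement; pairwise terms smooth depth between neighbours whose image intensities are similar.

```python
import math
from typing import List, Optional

Grid = List[List[float]]
SparseGrid = List[List[Optional[float]]]  # None where the LIDAR has no return


def refine_depth(image: Grid, lidar: SparseGrid, cnn_depth: Grid,
                 w_lidar: float = 10.0, w_cnn: float = 1.0,
                 sigma: float = 0.1, iterations: int = 50) -> Grid:
    """Gauss-Seidel minimization of a quadratic CRF energy on a 4-connected grid."""
    height, width = len(image), len(image[0])
    depth = [row[:] for row in cnn_depth]  # start from the CNN estimate
    for _ in range(iterations):
        for v in range(height):
            for u in range(width):
                num = w_cnn * cnn_depth[v][u]
                den = w_cnn
                measured = lidar[v][u]
                if measured is not None:
                    num += w_lidar * measured
                    den += w_lidar
                for dv, du in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nv, nu = v + dv, u + du
                    if 0 <= nv < height and 0 <= nu < width:
                        diff = image[v][u] - image[nv][nu]
                        # Similar intensities give a strong smoothing weight.
                        weight = math.exp(-diff * diff / (2.0 * sigma * sigma))
                        num += weight * depth[nv][nu]
                        den += weight
                depth[v][u] = num / den
    return depth
```

Because each update is a weighted average, a single LIDAR return propagates to nearby pixels of similar appearance while image edges limit the spread.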

Fig. 1 is a block diagram illustrating an autonomous vehicle network configuration according to one embodiment of the present disclosure. Referring to fig. 1, a network configuration 100 includes an autonomous vehicle 101 that may be communicatively coupled to one or more servers 103-104 through a network 102. Although one autonomous vehicle is shown, multiple autonomous vehicles may be coupled to each other and/or to servers 103-104 through network 102. The network 102 may be any type of network, such as a wired or wireless Local Area Network (LAN), a Wide Area Network (WAN) such as the Internet, a cellular network, a satellite network, or a combination thereof. The servers 103-104 may be any type of server or cluster of servers, such as a network or cloud server, an application server, a backend server, or a combination thereof. The servers 103 to 104 may be data analysis servers, content servers, traffic information servers, map and point of interest (MPOI) servers or location servers, etc.

Autonomous vehicles refer to vehicles that may be configured to be in an autonomous driving mode in which the vehicle navigates through the environment with little or no input from the driver. Such autonomous vehicles may include a sensor system having one or more sensors configured to detect information related to the operating environment of the vehicle. The vehicle and its associated controller use the detected information to navigate through the environment. Autonomous vehicle 101 may operate in a manual mode, in a fully autonomous mode, or in a partially autonomous mode.

In one embodiment, autonomous vehicle 101 includes, but is not limited to, a perception and planning system 110, a vehicle control system 111, a wireless communication system 112, a user interface system 113, and a sensor system 115. Autonomous vehicle 101 may also include certain common components found in ordinary vehicles, such as engines, wheels, steering wheels, transmissions, etc., which may be controlled by the vehicle control system 111 and/or the perception and planning system 110 using a variety of communication signals and/or commands, such as acceleration signals or commands, deceleration signals or commands, steering signals or commands, braking signals or commands, etc.

The components 110-115 may be communicatively coupled to each other via an interconnect, bus, network, or combination thereof. For example, the components 110-115 may be communicatively coupled to one another via a Controller Area Network (CAN) bus. The CAN bus is a vehicle bus standard designed to allow microcontrollers and devices to communicate with each other in applications without a host. It is a message-based protocol originally designed for multiplexed electrical wiring within automobiles, but is also used in many other environments.

Referring now to FIG. 2, in one embodiment, the sensor system 115 includes, but is not limited to, one or more cameras 211, a Global Positioning System (GPS) unit 212, an Inertial Measurement Unit (IMU) 213, a radar unit 214, and a light detection and ranging (LIDAR) unit 215. The GPS unit 212 may include a transceiver operable to provide information regarding the location of the autonomous vehicle. The IMU 213 may sense position and orientation changes of the autonomous vehicle based on inertial acceleration. Radar unit 214 may represent a system that utilizes radio signals to sense objects within the local environment of an autonomous vehicle. In some embodiments, in addition to sensing the object, radar unit 214 may additionally sense the speed and/or heading of the object. The LIDAR unit 215 may use lasers to sense objects in the environment in which the autonomous vehicle is located. The LIDAR unit 215 may include one or more laser sources, a laser scanner, and one or more detectors, among other system components. The cameras 211 may include one or more devices used to capture images of the environment surrounding the autonomous vehicle. The cameras 211 may be still cameras and/or video cameras. A camera may be mechanically movable, for example, by mounting the camera on a rotating and/or tilting platform.

The sensor system 115 may also include other sensors, such as sonar sensors, infrared sensors, steering sensors, throttle sensors, brake sensors, and audio sensors (e.g., microphones). The audio sensor may be configured to collect sound from the environment surrounding the autonomous vehicle. The steering sensor may be configured to sense the steering angle of the steering wheel, the wheels of the vehicle, or a combination thereof. The throttle sensor and the brake sensor sense the throttle position and the brake position of the vehicle, respectively. In some cases, the throttle sensor and the brake sensor may be combined into an integrated throttle/brake sensor.

In one embodiment, the vehicle control system 111 includes, but is not limited to, a steering unit 201, a throttle unit 202 (also referred to as an acceleration unit), and a brake unit 203. The steering unit 201 is used to adjust the direction or heading of the vehicle. The throttle unit 202 is used to control the speed of the motor or engine, which in turn controls the speed and acceleration of the vehicle. The brake unit 203 decelerates the vehicle by providing friction to slow the wheels or tires of the vehicle. It should be noted that the components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.

Returning to fig. 1, wireless communication system 112 allows communication between autonomous vehicle 101 and external systems such as devices, sensors, other vehicles, and the like. For example, the wireless communication system 112 may be in direct wireless communication with one or more devices, or in wireless communication via a communication network, such as with the servers 103-104 through the network 102. The wireless communication system 112 may use any cellular communication network or Wireless Local Area Network (WLAN), for example, using WiFi, to communicate with another component or system. The wireless communication system 112 may communicate directly with devices (e.g., passenger's mobile device, display device, speaker within the vehicle 101), for example, using infrared links, bluetooth, etc. The user interface system 113 may be part of a peripheral device implemented within the vehicle 101, including, for example, a keypad, a touch screen display device, a microphone, and speakers, among others.

Some or all of the functions of the autonomous vehicle 101 may be controlled or managed by the perception and planning system 110, particularly when operating in an autonomous mode. The perception and planning system 110 includes the necessary hardware (e.g., processors, memory, storage devices) and software (e.g., operating systems, planning and routing programs) to receive information from the sensor system 115, the control system 111, the wireless communication system 112, and/or the user interface system 113, process the received information, plan a route or path from the origin to the destination, and then drive the vehicle 101 based on the planning and control information. Alternatively, the perception and planning system 110 may be integrated with the vehicle control system 111.

For example, a user who is a passenger may specify a start location and a destination of a trip, e.g., via a user interface. The perception and planning system 110 obtains trip-related data. For example, the perception and planning system 110 may obtain location and route information from an MPOI server, which may be part of the servers 103-104. The location server provides location services and the MPOI server provides map services and POIs for certain locations. Alternatively, such location and MPOI information may be cached locally in persistent storage of the perception and planning system 110.

The perception and planning system 110 may also obtain real-time traffic information from a traffic information system or server (TIS) as the autonomous vehicle 101 moves along the route. It should be noted that the servers 103 to 104 may be operated by third party entities. Alternatively, the functionality of the servers 103-104 may be integrated with the perception and planning system 110. Based on the real-time traffic information, MPOI information, and location information, as well as real-time local environmental data (e.g., obstacles, objects, nearby vehicles) detected or sensed by sensor system 115, perception and planning system 110 may plan an optimal route and drive vehicle 101, e.g., via control system 111, according to the planned route to safely and efficiently reach a designated destination.

Server 103 may be a data analysis system to perform data analysis services for various customers. In one embodiment, the data analysis system 103 includes a data collector 121, a machine learning engine 122, a neural network model generator 123, and a neural network/CRF model 124. The data collector 121 may collect different training data from various vehicles equipped with LIDAR sensors/cameras communicatively coupled to the server 103, the various vehicles being autonomous vehicles or ordinary vehicles driven by human drivers. Examples of training data may be depth/image data for image recognition functions such as object segmentation, detection, tracking and classification. The training data may be compiled into categories and associated with ground truth labels. In another embodiment, the data collector 121 may download the training data set from an online archive of the world wide web.

Based on the training data collected by the data collector, the machine learning engine 122 may generate or train a set of neural network/CRF models 124 for various purposes. For example, the machine learning engine 122 may perform end-to-end training on the CNN model as part of the neural network/CRF model 124 using training data, such as RGB images/3-D low resolution point clouds and 3-D high resolution point cloud input/output pairs.

A CNN is a type of feed-forward artificial neural network (ANN) in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Individual cortical neurons respond to stimuli in a restricted region of space known as the "receptive field". The receptive fields of different neurons partially overlap such that they tile the visual field. The response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation. A deep CNN is a CNN with multiple inner layers. An "inner layer" of a neural network refers to a layer between the input layer and the output layer of the neural network.
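The convolution operation referred to above can be sketched in a few lines. The function below computes a "valid" 2-D sliding-window product (the cross-correlation form commonly used in CNN layers, without kernel flipping); each output value depends only on the small input patch covered by the kernel, which plays the role of the receptive field. The function name and the plain nested-list representation are illustrative choices.

```python
from typing import List

Grid = List[List[float]]


def conv2d_valid(image: Grid, kernel: Grid) -> Grid:
    """Slide `kernel` over `image` and sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]
```

Adjacent output positions read overlapping input patches, mirroring the partially overlapping receptive fields described in the text.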

An ANN is a computational approach based on a large collection of neural units, or neurons, loosely modeling a biological brain with large clusters of neurons connected by axons. Each neuron is connected to many others, and through learning or training these connections can enhance or inhibit the activation state of the connected neurons. Each individual neuron may have a function that combines the values of all its inputs. There may be a threshold function or limiting function on each connection and on the neuron itself, such that a signal must exceed the limit before propagating to other neurons. These systems are self-learning and trained rather than explicitly programmed.

Training a CNN involves iteratively feeding inputs to the input layer of the CNN and comparing the desired outputs with the actual outputs at the output layer to calculate error terms. These error terms are used to adjust the weights and biases in the hidden layers of the CNN so that the next output values are closer to the "correct" values. The distribution of inputs to each layer can slow down training (i.e., a lower learning rate is required for convergence) and requires careful parameter initialization, i.e., setting the initial weights and biases of the inner-layer activations to specific ranges for convergence. "Convergence" refers to the point at which the error term reaches a minimum.
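The iterative procedure described above (compare expected and actual outputs, compute an error term, adjust weights and biases) can be illustrated on the smallest possible model. The sketch below trains a single linear unit with gradient descent on a mean squared error; a real CNN would backpropagate the same kind of error through many layers. The function name, learning rate, and epoch count are illustrative assumptions.

```python
from typing import List, Tuple


def train_linear(samples: List[Tuple[float, float]],
                 learning_rate: float = 0.05,
                 epochs: int = 2000) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, target in samples:
            error = (w * x + b) - target  # actual output minus expected output
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # Move the parameters against the gradient so the error shrinks.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

For data generated from y = 2x + 1, the parameters approach w = 2 and b = 1 as the error term approaches its minimum; a learning rate that is too large would make the updates diverge instead.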

Once the CNN model is trained, the model may be uploaded into an ADV, such as ADV 101, to generate high resolution 3-D point clouds in real time. The high resolution 3-D point cloud may be generated in real time by inferring a depth map from an optical image captured by a camera and a low resolution 3-D point cloud captured by a low cost RADAR and/or LIDAR unit. It should be noted that the neural network/CRF model 124 is not limited to convolutional neural network and conditional random field (CRF) models, but may include radial basis function network models, recurrent neural network models, Kohonen self-organizing network models, and the like. The neural network/CRF model 124 may include different deep CNN models, such as LeNet™, AlexNet™, ZFNet™, GoogLeNet™, VGGNet™, or a combination thereof. In addition, normalization layers may be introduced at the activation layers to reduce training time and increase the convergence rate. In addition, dropout layers may be introduced to drop nodes at random, removing their contribution to the activation layers and thereby preventing overfitting to the training data.
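The two regularization ideas mentioned above can be sketched on a flat list of activations. The normalization shown is a plain zero-mean, unit-variance rescaling without learned scale and shift, and the dropout uses the common "inverted" scaling; both are simplified assumptions rather than the specific layers of the disclosure.

```python
import math
import random
from typing import List


def normalize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Rescale activations to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def dropout(values: List[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Dropout is applied only during training; at inference time the activations pass through unchanged, which the inverted scaling makes consistent in expectation.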

FIG. 3 is a block diagram illustrating an example of a perception and planning system for use with an autonomous vehicle, according to one embodiment. The system 300 may be implemented as part of the autonomous vehicle 101 of fig. 1, including but not limited to the perception and planning system 110, the control system 111, and the sensor system 115. Referring to fig. 3, the perception and planning system 110 includes, but is not limited to, a localization module 301, a perception module 302, a prediction module 303, a decision module 304, a planning module 305, a control module 306, and a high resolution point cloud module 307.

Some or all of modules 301 through 307 may be implemented in software, hardware, or a combination thereof. For example, the modules may be installed in persistent storage 352, loaded into memory 351, and executed by one or more processors (not shown). It should be noted that some or all of these modules may be communicatively coupled to or integrated with some or all of the modules of the vehicle control system 111 of fig. 2. Some of modules 301 to 307 may be integrated together into an integrated module.

The localization module 301 determines the current location of the autonomous vehicle 300 (e.g., using the GPS unit 212). The localization module 301 (also referred to as a map and route module) manages any data related to the user's trip or route. The user may, for example, log in via a user interface and specify a starting location and a destination for the trip. The localization module 301 communicates with other components of the autonomous vehicle 300, such as map and route information 311, to obtain trip-related data. For example, the localization module 301 may obtain location and route information from a location server and a map and POI (MPOI) server. The location server provides location services and the MPOI server provides map services and POIs for certain locations, which may be cached as part of the map and route information 311. The localization module 301 may also obtain real-time traffic information from a traffic information system or server as the autonomous vehicle 300 moves along the route.

Based on the sensor data provided by sensor system 115 and the localization information obtained by localization module 301, perception module 302 determines a perception of the surrounding environment. The perception information may represent what an ordinary driver would perceive around the vehicle being driven. The perception may include, for example in the form of objects, the lane configuration (e.g., straight or curved lanes), traffic light signals, the relative position of another vehicle, a pedestrian, a building, a crosswalk, or other traffic-related signs (e.g., stop signs, yield signs), and so forth.

The perception module 302 may include a computer vision system or functionality of a computer vision system to process and analyze images captured by one or more cameras to identify objects and/or features in an autonomous vehicle environment. The objects may include traffic signals, road boundaries, other vehicles, pedestrians, and/or obstacles, etc. Computer vision systems may use object recognition algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system may map the environment, track objects, and estimate the speed of objects, among other things. The perception module 302 may also detect objects based on other sensor data provided by other sensors, such as radar and/or LIDAR.

For each object, the prediction module 303 predicts how the object will behave under the circumstances. The prediction is based on perception data describing the driving environment at that point in time, in view of map/route information 311 and traffic rules 312. For example, if the object is a vehicle traveling in the opposite direction and the current driving environment includes an intersection, the prediction module 303 will predict whether the vehicle is likely to move straight ahead or turn. If the perception data indicates that the intersection has no traffic lights, the prediction module 303 may predict that the vehicle may need to come to a complete stop before entering the intersection. If the perception data indicates that the vehicle is currently in a left-turn only lane or a right-turn only lane, the prediction module 303 may predict that the vehicle will be more likely to turn left or right, respectively.

For each object, the decision module 304 makes a decision on how to handle the object. For example, for a particular object (e.g., another vehicle in a crossing route) and metadata describing the object (e.g., speed, direction, turn angle), the decision module 304 decides how to encounter the object (e.g., overtake, yield, stop, pass). The decision module 304 may make such a decision based on a rule set, such as traffic rules or driving rules 312, which may be stored in persistent storage 352.

Based on the decisions for each of the perceived objects, the planning module 305 plans a path or route and driving parameters (e.g., distance, speed, and/or turn angle) for the autonomous vehicle. In other words, for a given object, the decision module 304 decides what to do with the object, and the planning module 305 determines how to do it. For example, for a given object, the decision module 304 may decide to pass the object, while the planning module 305 may determine whether to pass on the left or right side of the object. Planning and control data is generated by the planning module 305, including information describing how the vehicle 300 will move in the next movement cycle (e.g., the next route/path segment). For example, the planning and control data may instruct the vehicle 300 to move 10 meters at a speed of 30 miles per hour (mph), and then change to the right lane at a speed of 25 mph.

Based on the planning and control data, the control module 306 controls and drives the autonomous vehicle by sending appropriate commands or signals to the vehicle control system 111 according to the route or path defined by the planning and control data. The planning and control data includes sufficient information to drive the vehicle from a first point to a second point of the route or path at different points in time along the route or path using appropriate vehicle settings or driving parameters (e.g., throttle, brake, and turn commands).

In one embodiment, the planning phase is performed in multiple planning cycles (also referred to as command cycles), for example, in cycles of 100 milliseconds (ms) each. For each planning or command cycle, one or more control commands are issued based on the planning and control data. That is, every 100 ms, the planning module 305 plans the next route segment or path segment, e.g., including the target location and the time required for the ADV to reach the target location. Alternatively, the planning module 305 may also specify a particular speed, direction, and/or steering angle, etc. In one embodiment, the planning module 305 plans a route segment or a path segment for the next predetermined period of time (such as 5 seconds). For each planning cycle, the planning module 305 plans a target location for the current cycle (e.g., the next 5 seconds) based on the target location planned in the previous cycle. The control module 306 then generates one or more control commands (e.g., throttle, brake, steering control commands) based on the planning and control data of the current cycle.
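The figures quoted in the planning discussion can be related with a short worked example: how many 100 ms planning cycles fit into a 5 second horizon, and how long a 10 meter segment takes at 30 mph. The helper names are hypothetical and serve only to make the arithmetic explicit.

```python
MPH_TO_MPS = 0.44704  # exact conversion: 1609.344 m per mile / 3600 s per hour


def travel_time_s(distance_m: float, speed_mph: float) -> float:
    """Time in seconds needed to cover `distance_m` at a constant `speed_mph`."""
    return distance_m / (speed_mph * MPH_TO_MPS)


def cycles_in_horizon(horizon_s: float, cycle_ms: float) -> int:
    """Number of planning cycles of `cycle_ms` milliseconds within `horizon_s` seconds."""
    return round(horizon_s * 1000.0 / cycle_ms)


segment_time = travel_time_s(10.0, 30.0)         # roughly 0.75 s
cycles_per_horizon = cycles_in_horizon(5.0, 100.0)  # 50 cycles
```

Under these numbers, a 10 meter segment at 30 mph spans about seven to eight planning cycles, so the plan is revised several times before the segment is completed.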

It should be noted that the decision module 304 and the planning module 305 may be integrated as an integrated module. The decision module 304/planning module 305 may include a navigation system or functionality of a navigation system to determine a driving path of an autonomous vehicle. For example, the navigation system may determine a range of speeds and heading directions for enabling the autonomous vehicle to move along the following path: the path substantially avoids perceived obstacles while advancing the autonomous vehicle along a roadway-based path to a final destination. The destination may be set based on user input via the user interface system 113. The navigation system may dynamically update the driving path while the autonomous vehicle is running. The navigation system may combine data from the GPS system and one or more maps to determine a driving path for the autonomous vehicle.

The decision module 304/planning module 305 may also include a collision avoidance system or the functionality of a collision avoidance system to identify, assess, and avoid or otherwise overcome potential obstacles in the environment of the autonomous vehicle. For example, a collision avoidance system may implement a change in navigation of an autonomous vehicle by operating one or more subsystems in the control system 111 to take a turning maneuver, a braking maneuver, etc. The collision avoidance system may automatically determine a feasible obstacle avoidance maneuver based on surrounding traffic patterns, road conditions, and the like. The collision avoidance system may be configured such that no turning maneuver is taken when other sensor systems detect vehicles, building obstacles, etc. located in adjacent areas into which the autonomous vehicle would turn. The collision avoidance system may automatically select maneuvers that are both available and that maximize the safety of the occupants of the autonomous vehicle. The collision avoidance system may select an avoidance maneuver that is predicted to cause a minimum amount of acceleration to occur in a passenger compartment of the autonomous vehicle.

The high resolution point cloud module 307 generates a high resolution 3-D point cloud based on the image captured by the camera and the low resolution 3-D point cloud captured by the radar and/or LIDAR unit. The high resolution 3-D point cloud may be used by the perception module 302 to perceive the driving environment of the ADV. Such images/3-D point clouds may be gathered by the sensor system 115. The point cloud module 307 may apply one or more CNN models (as part of the neural network/CRF model 313) to the camera-captured images and low-resolution LIDAR data to generate a higher-resolution LIDAR point cloud. It should be noted that point cloud module 307 and perception module 302 may be integrated as an integrated module.

FIG. 4 is a block diagram illustrating an example of a high resolution point cloud generator for use with an autonomous vehicle, according to one embodiment. The high resolution point cloud module 307 includes an upsampling and/or inpainting module 401, a downsampling module 402, a panorama module 403, a conditional random field (CRF) module 404, and a high resolution depth map module 405. The upsampling and/or inpainting module 401 may upsample the input image, i.e., increase the image size by a factor. The inpainting module may apply an inpainting algorithm to recover or reconstruct missing or degraded portions of the image, such as dark spots in the depth map introduced by black objects. The downsampling module 402 may downsample the image, i.e., reduce the image size by a factor. The panorama module 403 may convert narrower angle images into wider angle view (e.g., 360 degree view) panoramic images, or vice versa. For example, the panorama module 403 may first map the overlapping fields of view of the perspective images into cylindrical coordinates or spherical coordinates. The mapped images are then blended and/or stitched together. Here, the stitched image shows a wider horizontal field of view and a limited vertical field of view for cylindrical coordinates, or a 180 degree vertical field of view for spherical coordinates. A panorama in this projection is intended to be viewed as if the image were rolled into a cylinder/sphere and viewed from the inside. When viewed in a 2-D plane, horizontal lines appear curved, while vertical lines remain vertical. The CRF module 404 may apply a CRF model (e.g., an optimization model) to the output of the CNN model and the low resolution depth map to further refine the estimation of the depth map. Finally, the high resolution depth map module 405 applies a CNN model to the RGB image/LIDAR depth image input to generate a high resolution LIDAR depth image.
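As a non-limiting illustration of the kind of repair an inpainting module may perform on a depth map, the following minimal sketch fills empty pixels from their valid neighbors. The function name `inpaint_depth`, the use of 0.0 to mark a missing pixel, and the neighbor-averaging rule are assumptions made for illustration only; the embodiments do not specify a particular inpainting algorithm.

```python
def inpaint_depth(depth, max_passes=10):
    """Fill empty (0.0) pixels with the mean of their valid 4-neighbours.

    Hypothetical stand-in for an inpainting step; the input is not modified.
    """
    height, width = len(depth), len(depth[0])
    filled = [row[:] for row in depth]
    for _ in range(max_passes):
        updates = {}
        for r in range(height):
            for c in range(width):
                if filled[r][c] > 0.0:
                    continue  # pixel already has a depth value
                neighbours = [
                    filled[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < height and 0 <= cc < width and filled[rr][cc] > 0.0
                ]
                if neighbours:
                    updates[(r, c)] = sum(neighbours) / len(neighbours)
        if not updates:
            break  # nothing left that can be filled
        for (r, c), value in updates.items():
            filled[r][c] = value
    return filled
```

Larger holes are closed from their borders inward over successive passes, which is why the sketch iterates rather than filling once.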

Some or all of the modules 401 to 405 may be implemented in software, hardware, or a combination thereof. For example, the modules may be installed in persistent storage 352, loaded into memory 351, and executed by one or more processors (not shown). It should be noted that some or all of these modules may be communicatively coupled to or integrated with some or all of the modules of the vehicle control system 111 of fig. 2. Some of the modules 401-405 may be integrated together as an integrated module. For example, the upsampling module 401 and the downsampling module 402 may be integrated with the high resolution depth map module 405.

FIG. 5A is a diagram illustrating an exemplary ADV according to one embodiment. Referring to fig. 5A, ADV 101 includes a top-mounted LIDAR/panoramic camera configuration 501. In another embodiment, the LIDAR/panoramic camera configuration 501 may be mounted on the hood or cabin of the ADV 101, or anywhere on the ADV suitable for placement of such sensor units.

Fig. 5B and 5C illustrate top and side views of a LIDAR/panoramic camera configuration, according to some embodiments. Referring to fig. 5B, in one embodiment, a configuration 501 includes a low-definition or low-resolution LIDAR unit 502 and a stereoscopic panoramic camera 504 (e.g., multiple cameras). In one embodiment, the LIDAR unit 502 may be placed on top of the camera unit 504. The units may be calibrated to have a similar reference point, such as a central vertical reference line (not shown), so that the LIDAR and panoramic camera rotate about the reference line. Referring to fig. 5C, in one embodiment, the configuration 501 includes a low resolution LIDAR unit 502 with a monochrome panoramic camera 506. Similarly, LIDAR unit 502 may be placed on top of camera unit 506, and these units may be calibrated to have a similar reference point, such as a central vertical reference line (not shown), so that the LIDAR and panoramic camera rotate about the reference line. It should be noted that a low resolution or low definition LIDAR unit refers to a LIDAR unit that captures sparse 3-D point clouds or point clouds with fewer points than a high resolution LIDAR unit. Sparse 3-D point clouds contain less depth data or information than dense 3-D point clouds. As an exemplary comparison, a LIDAR unit having 16 channels or fewer that captures a wider angle view at 300,000 points per second may be a low resolution unit as compared to a LIDAR unit having a greater number of channels (e.g., 64 channels) that captures a wider angle view at two million points per second.

FIG. 5D shows top and side views of a monochrome panoramic camera configuration, according to one embodiment. In one embodiment, the monochrome panoramic camera configuration 506 includes six cameras placed in a hexagonal shape. The center of the hexagon may be the central reference point for determining camera focus, field of view and viewing angle. Each camera and its neighboring cameras may be placed about 60 degrees apart in horizontal viewing angle to obtain a completely wider horizontal viewing angle (e.g., 360 degree view). In one embodiment, each of the six cameras may capture images at a viewing angle of about 120 degrees horizontal, such that there is about 30 degrees of overlap between the images captured by the left camera and the right adjacent camera. The overlap may be used to blend and/or stitch the captured images together to generate a panoramic image.

Having generated a cylindrical or spherical panoramic image (e.g., a panoramic RGB image), the 3-D point cloud may be projected onto a (2-D) cylindrical or spherical image plane so as to be aligned with the cylindrical or spherical panoramic RGB image. For example, the 3-D points may be projected onto a 2-D cylindrical (or warped) image plane as shown below. Let (u, v) be the location of the pixel on the warped image plane. The pixel location on the 2-D cylinder will then be (r, h), where

Or

And f is the camera focal length. The same 3-D points may be projected onto a 2-D spherical image plane as shown below.

Let (u, v) be the location of the pixel on the warped image plane. The pixel location on the 2-D sphere would then be (r, h), where

Or

And f is the camera focal length. To reconstruct the point cloud from the depth map, a reverse transformation may be performed by back-projecting the 2-D panoramic depth map onto the 3-D space. Triangulation may be performed based on pixels of the panoramic surface. In one embodiment, triangulation may be performed directly from the location of those pixels on the camera image plane. In some embodiments, more cameras (such as three to eight cameras) may be used for the panoramic camera configuration 506. The cameras may be arranged in the shape of a triangle, rectangle, pentagon or octagon, respectively.
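As a non-limiting illustration of the cylindrical and spherical mappings discussed above, the following sketch uses the commonly used textbook warps. The exact expressions of the embodiment are not reproduced in this text, so the formulas below, the function names, and the convention that (u, v) is measured from the image center in the same units as the focal length f are all assumptions.

```python
import math

def to_cylinder(u, v, f):
    """Map image-plane (u, v) to cylindrical (r, h); assumed textbook warp."""
    r = f * math.atan2(u, f)          # arc length along the cylinder
    h = f * v / math.hypot(u, f)      # height on the cylinder
    return r, h

def from_cylinder(r, h, f):
    """Inverse of to_cylinder, valid for |r / f| < pi / 2."""
    u = f * math.tan(r / f)
    v = h * math.hypot(u, f) / f
    return u, v

def to_sphere(u, v, f):
    """Map image-plane (u, v) to spherical (r, h); assumed textbook warp."""
    r = f * math.atan2(u, f)                      # longitude scaled by f
    h = f * math.atan2(v, math.hypot(u, f))       # latitude scaled by f
    return r, h

def from_sphere(r, h, f):
    """Inverse of to_sphere, valid for |r / f| and |h / f| below pi / 2."""
    u = f * math.tan(r / f)
    v = math.tan(h / f) * math.hypot(u, f)
    return u, v
```

The inverse functions correspond to the reverse transformation used when a panoramic depth map is back-projected; a pixel on the optical axis (u = 0, v = 0) maps to (0, 0) under both warps.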

Fig. 5E and 5F illustrate examples of stereoscopic panoramic camera configurations according to some embodiments. Referring to fig. 5E, in one embodiment, a stereoscopic panoramic camera configuration 514 (e.g., camera configuration 504 of fig. 5B) includes twelve cameras positioned in a hexagonal shape. The center of the hexagonal shape may be the center reference point for determining the camera view angle and serves as a baseline for the stereo camera pair to create the stereo panoramic image. Each stereo pair of cameras and its adjacent stereo cameras (left and right) may be 60 degrees apart.

Referring to fig. 5F, in one embodiment, a stereoscopic panoramic camera configuration 524 (e.g., camera configuration 504 of fig. 5B) includes two monochromatic panoramic camera configurations, each having six cameras positioned in a hexagonal shape. Here, the stereoscopic panorama camera configuration is not a left-right stereoscopic pair, but a vertical top and bottom stereoscopic pair. The captured stereoscopic panoramic image may be projected onto a cylindrical or spherical surface as shown above. The images captured by the stereo pair of cameras are then used as input to a high resolution depth map module (along with the low resolution LIDAR image), such as the high resolution depth map module 405 of fig. 4, to generate a high resolution depth map or LIDAR image.

Fig. 6A and 6B show a flow diagram of an inference mode and a training mode, respectively, according to an embodiment. Fig. 6C and 6D show a flow diagram of an inference mode and a training mode, respectively, according to an embodiment. Fig. 6A and 6B involve constructing a monochrome or stereoscopic panoramic image from a camera image (via image blending and/or stitching techniques) and then fusing the panoramic image with a LIDAR image to generate a high resolution panoramic depth/disparity map. Fig. 6C and 6D involve fusing the camera image with the LIDAR image to generate a high resolution depth/disparity map, and then blending and/or stitching the depth maps together to generate a panoramic depth map.

Referring to FIG. 6A, an inference mode in accordance with one embodiment is described by process 600. Process 600 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, process 600 may be performed by a point cloud module of an autonomous vehicle, such as point cloud module 307 of fig. 3. Referring to fig. 6A, at block 601, processing logic calibrates or configures the camera device (e.g., determines a center of reference for panoramic configuration, determines and/or adjusts a focal length of the camera). At block 603, processing logic generates a wide-angle (e.g., 360 degrees) stereoscopic or monochrome, cylindrical, or spherical panoramic image. At block 605, processing logic projects the LIDAR 3-D point cloud onto the panoramic image to generate a depth grid or depth map. At block 607, based on the depth grid and the mono/stereo panoramic image, processing logic performs inference using an encoder-decoder network 611 (e.g., a trained CNN/CNN + CRF model) to generate a panoramic depth map. At block 609, processing logic backprojects the panoramic depth map back into 3-D space to generate a high resolution point cloud.
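As a non-limiting illustration of blocks 605 and 609, the following sketch projects a 3-D point cloud onto a cylindrical panoramic depth grid and back-projects a depth grid into 3-D space. The axis convention (x forward, y left, z up), the vertical range `h_min`/`h_max`, the use of 0.0 for an empty pixel, and the function names are assumptions made for illustration; the inference step of block 607 is omitted.

```python
import math

def project_to_panorama(points, width, height, h_min=-0.5, h_max=0.5):
    """Project (x, y, z) points onto a cylindrical depth grid (0.0 = empty)."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        rho = math.hypot(x, y)            # horizontal range to the cylinder axis
        if rho == 0.0:
            continue
        h = z / rho                       # height on a unit-radius cylinder
        if not (h_min < h <= h_max):
            continue                      # outside the vertical field of view
        theta = math.atan2(y, x)          # azimuth in (-pi, pi]
        col = int((theta + math.pi) / (2 * math.pi) * width) % width
        row = min(int((h_max - h) / (h_max - h_min) * height), height - 1)
        if depth[row][col] == 0.0 or rho < depth[row][col]:
            depth[row][col] = rho         # keep the nearest return per pixel
    return depth

def back_project(depth, h_min=-0.5, h_max=0.5):
    """Rebuild 3-D points from the pixel centers of a cylindrical depth grid."""
    height, width = len(depth), len(depth[0])
    points = []
    for row in range(height):
        for col in range(width):
            rho = depth[row][col]
            if rho == 0.0:
                continue
            theta = (col + 0.5) / width * 2 * math.pi - math.pi
            h = h_max - (row + 0.5) / height * (h_max - h_min)
            points.append((rho * math.cos(theta), rho * math.sin(theta), rho * h))
    return points
```

Because each pixel stores a single range value, back-projection recovers a point at the pixel center, so the density of the reconstructed cloud is bounded by the resolution of the depth grid.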

Referring to FIG. 6B, a training mode in accordance with one embodiment is described by process 620. Process 620 may be performed by processing logic that may include software, hardware, or a combination thereof. For example, process 620 may be performed by a machine learning engine, such as machine learning engine 122 of server 103 of fig. 1. For this training mode, training data such as high resolution LIDAR point clouds and monochrome/stereo RGB images are collected. At block 621, processing logic calibrates or at least determines a camera focal length for the camera image according to the data image source. At block 623, processing logic generates a panoramic image based on the monochrome/stereo image. At block 625, processing logic projects the LIDAR point cloud onto the image plane and/or upsamples the LIDAR image to an RGB image scale. For a monochrome panorama, the encoder-decoder network 627 learns to infer a high resolution depth panorama from a low resolution depth panorama. For stereo panoramas, the encoder-decoder network 627 learns to refine the stereo panoramas that match the low resolution depth panoramas projected from the low resolution LIDAR 3-D point cloud.

Referring to FIG. 6C, an inference mode in accordance with one embodiment is described by process 640. Process 640 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, process 640 may be performed by a point cloud module of an autonomous vehicle, such as point cloud module 307 of fig. 3. Referring to fig. 6C, at block 641, processing logic calibrates or configures the camera device (e.g., determines a center of reference for panoramic configuration, determines and/or adjusts a focal length of the camera). At block 643, processing logic pre-processes the camera view, such as warping the camera view into a stereo view or a non-panoramic cylindrical/spherical view. At block 645, processing logic projects the low resolution LIDAR 3-D point cloud onto the camera image to generate a low resolution depth grid or depth map. At block 647, based on the depth grid and the mono/stereo panoramic image, processing logic performs inference using an encoder-decoder network 649 (e.g., a trained CNN/CNN + CRF model) to generate a high resolution depth map. At block 653, processing logic generates a wider angle panoramic depth map based on the calibration information 651 (such as the calibration information of block 641). At block 655, processing logic backprojects the panoramic depth map back into 3-D space to generate a high resolution point cloud.

Referring to FIG. 6D, a training mode in accordance with one embodiment is described by process 660. Process 660 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, process 660 may be performed by a machine learning engine, such as machine learning engine 122 of server 103 of fig. 1. For this training mode, a set of known training data is collected, such as a high resolution LIDAR point cloud and a monochrome/stereo RGB image. At block 661, processing logic calibrates or at least determines a camera focal length for the camera image according to the data image source. At block 663, processing logic prepares camera images for training based on the monochrome/stereo images. At block 665, processing logic projects the LIDAR point cloud onto an image plane of the RGB image and/or upsamples the LIDAR image to an RGB image scale. For monochrome camera images, the encoder-decoder network 667 learns to infer high resolution depth panoramas from low resolution depth panoramas. For stereo camera images, the encoder-decoder network 667 learns to refine the stereo panorama matched to the low resolution depth panorama projected from the low resolution LIDAR 3-D point cloud.

The output of the encoder/decoder network 627 (e.g., CNN model) is compared to the expected result to determine whether the difference between the output of the encoder/decoder network 627 and the expected result is below a predetermined threshold. If the difference exceeds a predetermined threshold, the above process may be performed iteratively by modifying certain parameters or coefficients of the model. An iterative process may be performed until the difference falls below a predetermined threshold, at which point the final product of the model is considered complete. The model is then used in real-time in an ADV to generate a high resolution point cloud based on the low resolution point cloud and the images captured by the one or more cameras.
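As a non-limiting illustration of the iterative procedure described above, the following sketch repeatedly adjusts model parameters until the difference between the output and the expected result falls below a threshold. A two-parameter linear model stands in for the encoder/decoder network, and the mean squared error, the learning rate, and all names are assumptions made for illustration only.

```python
def train_until_converged(inputs, targets, threshold=1e-4, lr=0.05, max_iters=10_000):
    """Gradient descent on y = w * x + b until the loss drops below threshold.

    Returns (w, b, loss, steps). The linear model is a hypothetical stand-in
    for the CNN; only the stopping rule mirrors the described process.
    """
    w, b = 0.0, 0.0
    n = len(inputs)
    loss = float("inf")
    for step in range(max_iters):
        preds = [w * x + b for x in inputs]
        errors = [p - t for p, t in zip(preds, targets)]
        loss = sum(e * e for e in errors) / n      # difference from expected result
        if loss < threshold:
            return w, b, loss, step                # model is considered complete
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, inputs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w                           # modify the model parameters
        b -= lr * grad_b
    return w, b, loss, max_iters
```

The `max_iters` bound is a practical safeguard added in this sketch so that the loop terminates even if the threshold is never reached.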

Fig. 7A and 7B are block diagrams illustrating examples of depth map generation according to some embodiments. Referring to fig. 7A, in one embodiment, a depth map generator 700 may include a downsampling module 402 and a CNN model 701. CNN model 701 (as part of neural network/CRF model 313) may include a contracting layer (or encoder or convolutional layer) 713 and an expanding layer (or decoder or deconvolution layer) 715. Fig. 7B illustrates another exemplary embodiment of a depth map generator 720.

Depth map generators 700 and 720 may be executed by depth map module 405 of fig. 4.

Referring to fig. 4 and 7B, the generator 720 receives a first image captured by a first camera (e.g., the camera-captured image 703), which captures a portion of the driving environment of the ADV. The first image may be an RGB image captured by the camera device. Generator 720 receives a second image, such as a low resolution LIDAR image 707, that represents a first depth map of a first point cloud corresponding to a portion of the driving environment produced by a light detection and ranging (LIDAR) device. The downsampling module 402 downsamples the second image (e.g., the image 707) at a predetermined scale factor until the resolution of the second image reaches a predetermined threshold. In one embodiment, the second image is downsampled until it is dense, i.e., until the amount of overlap or "gap" between any two adjacent cloud points of the second image falls below a predetermined threshold. The generator 720 generates a second depth map (e.g., a high resolution depth map 709) by applying the CNN model 701 to the first image (e.g., the image 703) and the downsampled second image, the second depth map (e.g., the image 709) having a higher resolution than the first depth map (e.g., the image 707) such that the second depth map (e.g., the image 709) represents a second point cloud for perceiving the driving environment surrounding the ADV. It should be noted that the term "image" generally refers to an RGB image or a LIDAR image. The term "depth map" or "LIDAR image" refers to a 2-D image of a 3-D point cloud mapped onto a perspective image plane or panoramic image plane. An "image captured by a camera" refers to an optical image captured by a pinhole camera device.
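As a non-limiting illustration of downsampling a sparse LIDAR image until it is dense, the following sketch repeatedly reduces a depth image by a scale factor until the fraction of empty pixels falls below a threshold. Treating 0.0 as an empty pixel, averaging only the valid pixels of each block, and measuring the "gap" as the fraction of empty pixels are assumptions made for illustration, as are all function names.

```python
def gap_fraction(depth):
    """Fraction of pixels that carry no depth value (0.0 marks empty)."""
    total = sum(len(row) for row in depth)
    empty = sum(1 for row in depth for d in row if d == 0.0)
    return empty / total

def downsample_once(depth, factor=2):
    """Shrink the image by `factor`, averaging the valid pixels of each block."""
    height, width = len(depth) // factor, len(depth[0]) // factor
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            block = [
                depth[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            valid = [d for d in block if d > 0.0]
            row.append(sum(valid) / len(valid) if valid else 0.0)
        out.append(row)
    return out

def downsample_until_dense(depth, max_gap=0.05, factor=2):
    """Downsample at a fixed scale factor until the gap fraction is small."""
    while (gap_fraction(depth) > max_gap
           and len(depth) >= factor and len(depth[0]) >= factor):
        depth = downsample_once(depth, factor)
    return depth
```

For example, an 8 by 8 image with a valid return at every second row and column becomes fully dense after a single pass at a factor of 2.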

In one embodiment, the camera-captured image 703 and the LIDAR image 707 are non-panoramic images that are warped or projected onto a cylindrical or spherical image plane. In another embodiment, the camera-captured image 703 and the LIDAR image 707 are panoramic images, such as cylindrical or spherical panoramic images. In another embodiment, the camera-captured image 703 and the LIDAR image 707 are perspective images. Here, for this camera configuration, the perspective images may be generated from a single camera set or any single camera from a monochrome/stereoscopic panoramic camera configuration. For a monochrome panoramic camera configuration, the configuration may include multiple cameras capturing multiple images at approximately the same time, such as configuration 506 of fig. 5C. The images will be warped and blended and/or stitched together by a panorama module (such as panorama module 403 of fig. 4) to generate a cylindrical or spherical panoramic image.

For the LIDAR configuration, the LIDAR image 707 is generated by mapping a 3-D point cloud captured by a LIDAR detector from a 3-D space/plane to a 2-D image plane. Here, the 2-D image plane of image 707 may be the same image plane as image 703. In another embodiment, the LIDAR image 707 may be a perspective LIDAR image corresponding to the camera-captured perspective image 703. Here, the CNN model 701 may be applied successively to pairs of images 703 and 707 of several perspectives to generate a perspective LIDAR image. The generated perspective LIDAR images may then be stitched or blended together by a panorama module (such as panorama module 403 of fig. 4) to generate a panoramic LIDAR image. In another embodiment, the generator 720 may include multiple CNN models, and these models may be applied simultaneously to the paired images 703 and 707 of multiple perspectives to generate multiple perspective LIDAR images for panoramic image generation.

Referring to fig. 4 and 7A, in another embodiment, the generator 700 receives a third image, such as a camera-captured image 705, the camera-captured image 705 being captured by a second camera. A high resolution depth map module of generator 700 (such as high resolution depth map module 405 of fig. 4) generates a second depth map by applying a CNN model to the first image, the third image, and the downsampled second image. Here, images 703 and 705 may be left and right stereo images (e.g., images captured by configuration 514 of fig. 5E), or vertical top and bottom stereo images (e.g., images captured by configuration 524 of fig. 5F). Although only two images captured by the cameras are shown, more images captured by more cameras may also be used as input to the CNN model.

Fig. 8 is a diagram illustrating contracting (e.g., encoder/convolutional) layers and expanding (e.g., decoder/deconvolution) layers of a CNN model, according to one embodiment. The CNN model 800 receives the camera image 801 and the low resolution depth image 803, and outputs a high resolution depth image 825. For purposes of illustration, a single RGB image 801 is used herein. However, multiple images captured from multiple cameras may also be applied, for example, in a stereoscopic configuration. Note that in this application, an RGB image refers to a color image. Referring to fig. 8, a camera image 801 and a low resolution depth image 803 may represent the image 703 and the image 707 of fig. 7B, respectively. The high resolution image 825 may represent the image 709 of fig. 7B. The CNN model 800 may include different layers, such as a downsampling layer 805, convolutional layers (807, 809), deconvolution layers (811, 817), prediction layers (813, 819, 823), and concatenation layers (808, 815, 821).

The convolutional layers (as part of the contracting layers 713 of FIG. 7A) and the deconvolution layers (as part of the expanding layers 715 of FIG. 7A) may be connected in a single pipeline. Each of the convolutional or contracting layers may downsample the previous input layer, and each of the expanding or deconvolution layers may upsample the previous input layer. The last of the contracting layers 713 (e.g., layer 809) is connected to the first of the expanding layers 715 (e.g., layer 811) to form a single pipeline. The prediction layers (813, 819, 823) perform single-channel depth map prediction and feed the prediction forward to the next layer.

The prediction layers help to minimize the estimation error of the final CNN output by reducing the error that propagates during the training process. A prediction layer can be implemented as a convolutional layer having the following characteristic: the output image has one output channel and the same image size as the input image. However, the prediction layer may include an upsampling function to upsample the output image size to match the image size of the next layer. The concatenation layers (808, 815, 821) perform a combining function that combines one or more images, such as the output images of the deconvolution, convolution, and/or prediction layers. The convolutional/deconvolution layers enable the CNN to perform image classification by looking for low-level features, such as edges and curvatures, to build higher-level features. Downsampling refers to dividing the height and/or width of an image by a factor, such as a factor of 2 (i.e., the image area is reduced to one quarter). Upsampling refers to multiplying the height and/or width of an image by a factor, such as a factor of 2 (i.e., the image area is increased four times).

Referring to fig. 8, for illustrative purposes, in one embodiment, the image 801 may comprise a single (mono) RGB camera image (e.g., a combined 3-channel, 192 pixel by 96 pixel image) or multiple RGB images in a stereoscopic configuration. The low resolution depth image 803 may comprise a single channel (i.e., grayscale) 48 pixel by 24 pixel LIDAR image (i.e., the image 803 is one-fourth of the scale of the image 801). The convolutional layer 807 may receive the image 801 and downsample the image 801 by a factor of 2, thereby outputting a 64-channel, 96 pixel by 48 pixel image. Subsequent convolutional layers may downsample the image from the corresponding input by a factor, such as a factor of 2.

The input LIDAR image 803 may be downsampled by the downsampling layer 805 until it is dense. For example, an image is dense if there are no or few gaps between its pixels; the output may be, e.g., a 512 channel, 24 pixel by 12 pixel image. The concatenation layer 808 may combine the corresponding output of the convolutional layer (e.g., a 512 channel, 24 pixel by 12 pixel image) and the output of the downsampling layer 805 (e.g., a 512 channel, 24 pixel by 12 pixel image) to produce a combined image with more channels (e.g., a 1024 channel, 24 pixel by 12 pixel image). It should be noted that in order to combine the downsampled camera image with the downsampled depth image or depth map, the size or dimensions of the two images must match. Depending on the size or dimensions of the downsampled depth image, the output of the convolutional layer whose size matches that of the depth image is used for the combination. Convolutional layer 809 may have, for example, a 1024 channel, 24 pixel by 12 pixel image as input and a 2048 channel, 12 pixel by 6 pixel image as output.

Deconvolution layer 811 may have as input a 2048 channel, 12 pixel by 6 pixel image and as output a 1024 channel, 24 pixel by 12 pixel image. The prediction layer 813 may upsample the input by a factor of 2 and may have as input a 2048 channel, 12 pixel by 6 pixel image and as output a 1 channel, 24 pixel by 12 pixel image. Concatenation layer 815 may have three inputs with matching image sizes, such as an input from convolutional layer 809 (e.g., a 1024 channel, 24 pixel by 12 pixel image), an output from prediction layer 813 (e.g., a 1 channel, 24 pixel by 12 pixel image), and an output from deconvolution layer 811 (e.g., a 1024 channel, 24 pixel by 12 pixel image). Thus, concatenation layer 815 may output a 2049 channel, 24 pixel by 12 pixel image.

Deconvolution layer 817 may have as input a 1024 channel, 48 pixel by 24 pixel image and as output a 512 channel, 96 pixel by 48 pixel image. The prediction layer 819 may upsample the previous input by a factor of 2 and may have as input a 1024-channel, 48 pixel by 24 pixel image and as output a 1-channel, 96 pixel by 48 pixel image. The concatenation layer 821 may have three inputs: a feed forward from the convolutional layer (e.g., a 64-channel, 96 pixel by 48 pixel image), an output from the prediction layer 819 (e.g., a 1-channel, 96 pixel by 48 pixel image), and an output from the deconvolution layer 817 (e.g., a 512-channel, 96 pixel by 48 pixel image). The concatenation layer 821 then combines these inputs and outputs a 577 channel, 96 pixel by 48 pixel image. The prediction layer 823 may upsample the input by a factor of 2 and may have as input a 577 channel, 96 pixel by 48 pixel image, and output as output 825 a 1 channel, 96 pixel by 48 pixel depth image. It should be noted that in some embodiments, the convolutional layers may be configured to feed forward at arbitrary layers. In some embodiments, a pooling layer is interposed between the convolutional layers, and an unpooling layer is interposed between the deconvolution layers. It should be noted that fig. 8 illustrates one CNN model implementation, but should not be construed as limiting. For example, in some implementations, the CNN model may include different activation functions (e.g., ReLU, sigmoid, step, hyperbolic tangent, etc.), dropout layers, normalization layers, and so on.
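As a non-limiting illustration of the channel arithmetic in the example above, the following sketch tracks only image shapes, written as (channels, width, height), through the downsampling, upsampling, and concatenation steps. The helper names are hypothetical and no actual convolution is performed; the sketch merely shows that concatenation sums the channel counts of inputs whose image sizes match.

```python
def downsample(shape, out_channels, factor=2):
    """Shape after a contracting layer: width and height divided by factor."""
    _, width, height = shape
    return (out_channels, width // factor, height // factor)

def upsample(shape, out_channels, factor=2):
    """Shape after an expanding or prediction layer: width and height multiplied."""
    _, width, height = shape
    return (out_channels, width * factor, height * factor)

def concat(*shapes):
    """Shape after concatenation: channels add up, image sizes must match."""
    sizes = {(width, height) for _, width, height in shapes}
    if len(sizes) != 1:
        raise ValueError("image sizes must match before concatenation")
    (width, height), = sizes
    return (sum(channels for channels, _, _ in shapes), width, height)

# Shapes quoted in the example, written as (channels, width, height).
conv_807 = downsample((3, 192, 96), 64)                 # (64, 96, 48)
concat_808 = concat((512, 24, 12), (512, 24, 12))       # (1024, 24, 12)
conv_809 = downsample(concat_808, 2048)                 # (2048, 12, 6)
deconv_811 = upsample(conv_809, 1024)                   # (1024, 24, 12)
pred_813 = upsample(conv_809, 1)                        # (1, 24, 12)
concat_815 = concat(concat_808, pred_813, deconv_811)   # (2049, 24, 12)
concat_821 = concat((64, 96, 48), (1, 96, 48), (512, 96, 48))  # (577, 96, 48)
```

The 2049 and 577 channel counts follow directly from 1024 + 1 + 1024 and 64 + 1 + 512, respectively.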

Fig. 9A and 9B are block diagrams illustrating an example of high resolution depth map generation according to some embodiments. The panorama converter 903 and map generator 905 of fig. 9A may represent the panorama generation 603 and the encoder-decoder network 611 of fig. 6A, respectively. The panorama converter 903 and map generator 905 of fig. 9B may represent the panorama generation 653 and the encoder-decoder network 649 of fig. 6C, respectively. The high resolution depth map generator 905 may be performed by the high resolution depth map module 405, and the panorama converter 903 may be performed by the panorama module 403 of fig. 4. Referring to fig. 9A, an input of the high resolution depth map generator 905 is coupled to an output of the panorama converter 903. Here, inputs 901, such as the images 703 and 705 captured by the camera of fig. 7A and the low resolution LIDAR image 707 of fig. 7A, may be converted to a panoramic image by the panorama converter 903. The generator 905 receives the panoramic image and generates an output 907, e.g., a high resolution depth map, such as the LIDAR image 709 of fig. 7A. In this configuration, the input images are combined by blending to generate a panoramic image before being fed to the CNN model, which then generates a high resolution depth map.

Referring to fig. 9B, in one embodiment, the output of the high resolution depth map generator 905 is coupled to the input of the panorama converter 903. Here, inputs 901, such as the camera-captured images 703 and 705 of fig. 7A and the low resolution LIDAR image 707 of fig. 7A, may be processed by the generator 905 via the CNN model (as part of the high resolution depth map generator 905). The output depth map is received by the panorama converter 903. Converter 903 converts the output of generator 905 to a panoramic depth map, e.g., output 907. In this example, the raw images captured by the cameras are each fed into the CNN model to generate separate high resolution depth maps. The individual depth maps are then combined into a high resolution panoramic depth map by blending.
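As a non-limiting illustration of the two orderings of figs. 9A and 9B, the following sketch expresses each configuration as a composition of a panorama converter and a depth map generator. The stand-in functions `stitch` and `fill_from_camera` are hypothetical placeholders (side-by-side joining without blending, and a pixelwise fill rule) and do not represent the trained CNN model or the blending performed by the panorama module.

```python
def stitch(views):
    """Stand-in panorama converter: place equally tall views side by side."""
    return [[px for view in views for px in view[r]] for r in range(len(views[0]))]

def fill_from_camera(camera, lidar):
    """Stand-in depth generator: keep LIDAR depth, else a pixelwise guess."""
    return [
        [d if d > 0.0 else 8.0 * c for c, d in zip(c_row, d_row)]
        for c_row, d_row in zip(camera, lidar)
    ]

def pipeline_9a(camera_views, lidar_views, to_panorama, generate_depth):
    """Fig. 9A ordering: convert to panoramas first, then generate one depth map."""
    return generate_depth(to_panorama(camera_views), to_panorama(lidar_views))

def pipeline_9b(camera_views, lidar_views, to_panorama, generate_depth):
    """Fig. 9B ordering: generate a depth map per view, then convert to a panorama."""
    per_view = [generate_depth(c, l) for c, l in zip(camera_views, lidar_views)]
    return to_panorama(per_view)
```

With these pixelwise stand-ins the two orderings yield the same result; a model with spatial context, such as a CNN, would in general produce different outputs for the two orderings, for example near the seams between views.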

FIG. 10 is a flow chart illustrating a method according to one embodiment. Process 1000 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, process 1000 may be performed by a point cloud module of an autonomous vehicle, such as point cloud module 307 of fig. 3. Referring to fig. 10, at block 1001, processing logic receives a first image captured by a first camera, the first image capturing a portion of the driving environment of an ADV. At block 1002, processing logic receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a LIDAR device. At block 1003, processing logic downsamples the second image at the predetermined scale factor until the resolution of the second image reaches a predetermined threshold. At block 1004, the processing logic generates a second depth map by applying a Convolutional Neural Network (CNN) model to the first image and the downsampled second image, the second depth map having a higher resolution than the first depth map such that the second depth map represents a second point cloud for perceiving the driving environment surrounding the ADV.

In one embodiment, the processing logic receives a third image captured by the second camera and generates a second depth map by applying a CNN model to the first image, the third image, and the downsampled second image. In one embodiment, the first image comprises a cylindrical panoramic image or a spherical panoramic image. In another embodiment, a cylindrical panoramic image or a spherical panoramic image is generated based on several images captured by several camera devices. In another embodiment, the processing logic reconstructs the second point cloud by projecting the second depth map into a 3-D space based on a cylindrical panoramic image or a spherical panoramic image.
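As a rough illustration of reconstructing a point cloud from a spherical panoramic depth map, the sketch below back-projects every pixel along its viewing ray. It assumes an equirectangular layout (columns map linearly to azimuth, rows to elevation), that each pixel stores the range along the ray, and that 0 marks an invalid pixel. None of these conventions is stated in the text, and `spherical_depth_to_points` is a hypothetical name.

```python
import numpy as np

def spherical_depth_to_points(depth):
    """Convert an equirectangular range image (H x W) to an (N, 3) point array."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Pixel centers: azimuth covers [-pi, pi), elevation runs from +pi/2 to -pi/2.
    azimuth = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    elevation = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    az, el = np.meshgrid(azimuth, elevation)
    x = depth * np.cos(el) * np.cos(az)
    y = depth * np.cos(el) * np.sin(az)
    z = depth * np.sin(el)
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels without a valid depth value (assumed to be encoded as 0).
    return points[depth.reshape(-1) > 0]

points = spherical_depth_to_points(np.full((4, 8), 10.0))
print(points.shape)  # (32, 3)
```

A cylindrical panorama would use the same azimuth mapping but a linear height coordinate instead of an elevation angle.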

In one embodiment, the processing logic maps the downsampled second image onto an image plane of the first image. In one embodiment, the second depth map is generated by blending one or more generated depth maps such that the second depth map is a panoramic map.

In one embodiment, the CNN model includes contracting layers and expanding layers, wherein each contracting layer includes an encoder to downsample a respective input, the expanding layers are coupled to the contracting layers, and each expanding layer includes a decoder to upsample a respective input. In one embodiment, information from the contracting layers is fed forward to the expanding layers, e.g., the output of a contracting layer is fed forward to the input of an expanding layer with a matching image size or dimension. In one embodiment, each of the expanding layers includes a prediction layer to predict a depth map for a subsequent layer.
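The feed-forward connection between layers of matching size can be illustrated with a shape-only sketch. It contains no learned weights: average pooling and pixel repetition merely stand in for the encoder and decoder, and the function names and the depth of two levels are assumptions made for illustration.

```python
import numpy as np

def contract(x):
    """Halve height and width with 2x2 average pooling (stand-in for an encoder)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def expand(x):
    """Double height and width by pixel repetition (stand-in for a decoder)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def forward(image, depth=2):
    skips, x = [], image
    for _ in range(depth):          # contracting path
        skips.append(x)             # remember the features at this resolution
        x = contract(x)
    for _ in range(depth):          # expanding path
        x = expand(x)
        skip = skips.pop()          # has the same height and width as x
        x = np.concatenate([x, skip], axis=-1)
    return x

out = forward(np.zeros((96, 192, 3), dtype=np.float32))
print(out.shape)  # (96, 192, 9)
```

The channel count grows at each expanding step because the forwarded features are concatenated rather than added; a trained model would follow each concatenation with convolutions.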

Fig. 11A and 11B are block diagrams illustrating examples of depth map generation according to some embodiments. Referring to fig. 11A, in one embodiment, a depth map generator 1100 may include an upsampling/inpainting module 401 and a CNN model 701. The CNN model 701 (as part of neural network/CRF model 313) may include contracting layers (or encoders or convolutional layers) 713 and expanding layers (or decoders or deconvolution layers) 715. Fig. 11B illustrates a depth map generator 1120 of another exemplary embodiment. The functions of depth map generators 1100 and 1120 may be performed by the depth map module 405 of fig. 4.

Referring to fig. 4 and 11B, the generator 1120 receives a first image captured by a first camera (e.g., camera-captured image 703) that captures a portion of the driving environment of the ADV. The generator 1120 receives a second image, such as a low resolution LIDAR image 707, that represents a first depth map of a first point cloud corresponding to a portion of the driving environment produced by a light detection and ranging (LIDAR) device. The upsampling/inpainting module 401 upsamples the second image (e.g., image 707) by a predetermined scale factor so that the image 707 matches the image scale of the image 703. In one embodiment, an inpainting algorithm is applied to the upsampled second image to recover any missing portions of the image, e.g., to repair background portions of the upsampled image. Inpainting is the process of recovering or reconstructing lost or degraded portions of an image. In another embodiment, the inpainting algorithm may include comparing the LIDAR captured image to a LIDAR image captured in a previous time frame. The generator 1120 generates a second depth map (e.g., high resolution depth map 709) by applying the CNN model 701 to the first image (e.g., image 703) and the upsampled and/or inpainted second image, wherein the second depth map (e.g., image 709) has a higher resolution than the first depth map (e.g., image 707) such that the second depth map (e.g., image 709) represents a second point cloud for perceiving the driving environment surrounding the ADV.
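The text does not specify which algorithm recovers the missing portions of the upsampled LIDAR image. As a simple stand-in, the sketch below repeatedly fills each missing pixel with the mean of its valid 4-neighbors until no holes remain. The choice of 0 as the missing-value marker and the name `fill_missing` are assumptions.

```python
import numpy as np

def fill_missing(depth, missing=0.0, max_iters=100):
    """Fill missing pixels from valid 4-neighbors (illustrative stand-in only)."""
    out = np.asarray(depth, dtype=np.float64).copy()
    valid = out != missing
    for _ in range(max_iters):
        if valid.all() or not valid.any():
            break
        # Sum and count of valid neighbors, with zero padding at the borders.
        vals = np.pad(np.where(valid, out, 0.0), 1)
        cnt = np.pad(valid.astype(np.float64), 1)
        nsum = vals[:-2, 1:-1] + vals[2:, 1:-1] + vals[1:-1, :-2] + vals[1:-1, 2:]
        ncnt = cnt[:-2, 1:-1] + cnt[2:, 1:-1] + cnt[1:-1, :-2] + cnt[1:-1, 2:]
        fillable = ~valid & (ncnt > 0)
        out[fillable] = nsum[fillable] / ncnt[fillable]
        valid |= fillable
    return out

sparse = np.array([[5.0, 0.0, 5.0],
                   [0.0, 0.0, 0.0],
                   [5.0, 0.0, 5.0]])
filled = fill_missing(sparse)
```

In this example the four edge pixels are filled in the first pass and the center pixel in the second, so every entry of `filled` ends up equal to 5.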

In one embodiment, the camera-captured image 703 and the LIDAR image 707 are panoramic images, such as cylindrical or spherical panoramic images. In another embodiment, the camera-captured image 703 and the LIDAR image 707 are perspective images. Here, the perspective images may be generated by a single camera set or a single camera of a monocular/stereo panoramic camera configuration. A monocular panoramic camera configuration may include multiple perspective cameras that capture multiple images at approximately the same time, such as configuration 506 of fig. 5C. The images are blended or stitched together by a panorama module, such as panorama module 403 of fig. 4, to generate a panoramic image.

For a LIDAR configuration, the LIDAR image 707 is generated by: the 3-D point cloud captured by the LIDAR detector is mapped from a 3-D space/plane, followed by a conversion of the 3-D point cloud to a 2-D image plane. Here, the 2-D image plane of image 707 may be the same image plane as image 703. In another implementation, the LIDAR image 707 may be a perspective LIDAR image corresponding to the perspective image 703 captured by the camera. Here, the CNN model 701 may be applied successively to pairs of images 703 and 707 of several perspectives to generate a perspective LIDAR image. The generated perspective LIDAR images may then be stitched or blended together by a panorama module (such as panorama module 403 of fig. 4) to generate a panoramic LIDAR image. In another embodiment, the generator 1120 may include multiple CNN models, and these models may be applied simultaneously to the multiple perspective paired images 703 and images 707 to generate multiple perspective LIDAR images for panoramic image generation.
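The conversion of a 3-D point cloud to a 2-D depth image on the camera's image plane can be sketched with a pinhole projection. This is only one possible reading for the perspective case: it assumes the points are already expressed in the camera frame (x right, y down, z forward), and the intrinsics `fx`, `fy`, `cx`, `cy` and the use of 0 for empty pixels are illustrative assumptions.

```python
import numpy as np

def project_to_depth_image(points_cam, fx, fy, cx, cy, height, width):
    """Render (N, 3) camera-frame points into a sparse depth image."""
    pts = np.asarray(points_cam, dtype=np.float64)
    pts = pts[pts[:, 2] > 0]              # keep points in front of the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], pts[inside, 2]
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)       # the nearest point wins per pixel
    depth[np.isinf(depth)] = 0.0          # 0 marks pixels with no LIDAR return
    return depth

cloud = np.array([[0.0, 0.0, 10.0],   # lands on pixel (row 1, col 2)
                  [0.0, 0.0, 5.0],    # same pixel, closer, so it is kept
                  [1.0, 0.0, 10.0],   # lands on pixel (row 1, col 3)
                  [0.0, 0.0, -1.0]])  # behind the camera, dropped
depth_image = project_to_depth_image(cloud, fx=10, fy=10, cx=2, cy=1,
                                     height=3, width=5)
```

Because a LIDAR returns far fewer points than a camera has pixels, most entries of such an image stay empty, which is why the result is low resolution compared with the camera image.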

Referring to fig. 4 and 11A, in another embodiment, the generator 1100 receives a third image, such as a camera-captured image 705, the camera-captured image 705 being captured by a second camera. The generator 1100 generates a second depth map by applying the CNN model to the first image, the third image and the upsampled and/or repaired second image. Here, images 703 and 705 may be left and right stereo images (e.g., images captured by configuration 514 of fig. 5E), or vertical top and bottom stereo images (e.g., images captured by configuration 524 of fig. 5F).

Fig. 12 is a diagram illustrating a contracted (e.g., encoder/convolutional) layer and an expanded (e.g., decoder/deconvolution) layer of a CNN model, according to one embodiment. The CNN model 1200 receives the camera image 801, the low resolution depth image 803, and outputs a high resolution depth image 825. The camera image 801 and the low resolution depth image 803 may be the image 703 and the image 707 of fig. 11B, respectively. The high resolution depth image 825 may be the image 709 of fig. 11B. CNN model 1200 may include different layers, such as upsampling layer 1203, convolutional layers (807, 809), deconvolution layers (811, 817), prediction layers (813, 819, 823), and concatenation layers (815, 821). Fig. 12 is similar to fig. 8 in most respects, except that a LIDAR image (e.g., low resolution depth image 803) is applied at the input layer of the CNN model and the concatenation layer (e.g., layer 808 of fig. 8) may be omitted.

Referring to fig. 12, for example, the camera image 801 may include a monocular RGB camera image (e.g., 3 channels, 192 pixels by 96 pixels). The low resolution depth image 803 may comprise a single-channel (i.e., grayscale) 48 pixel by 24 pixel LIDAR image (i.e., the image 803 is one-fourth of the scale of the image 801). The upsampling layer 1203 upsamples the image 803 by a scale factor (i.e., four) to match the image scale of the image 801 and outputs a one-channel, 192 pixel by 96 pixel image. The upsampling layer 1203 may include an inpainting layer such that an inpainting algorithm may be applied to reconstruct missing pixels, where the missing pixels may be introduced by dark spots/artifacts perceived by the LIDAR detector, such as pits, shadows, and/or weather phenomena. The upsampled/inpainted image is combined with the monocular RGB camera image (by concatenating their image channels) before being received by the convolutional layer 807. For example, the input image of the layer 807 may be a 4-channel image having a size of 192 pixels by 96 pixels.
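The input preparation for layer 807 can be checked with a short shape sketch using the sizes given above. The combining step is read here as channel concatenation, since a 3-channel image plus a 1-channel image yields the stated 4 channels. Nearest-neighbor repetition is an assumed upsampling method, and `build_cnn_input` is a hypothetical name.

```python
import numpy as np

def build_cnn_input(rgb, lidar_depth, scale=4):
    """Upsample the LIDAR image and append it to the RGB image as a 4th channel."""
    # Nearest-neighbor upsampling: repeat every pixel `scale` times per axis.
    up = np.repeat(np.repeat(lidar_depth, scale, axis=0), scale, axis=1)
    if up.shape != rgb.shape[:2]:
        raise ValueError("upsampled LIDAR image does not match the camera image size")
    # Stack depth as a fourth channel behind R, G, B.
    return np.concatenate([rgb, up[..., None]], axis=-1)

rgb = np.zeros((96, 192, 3), dtype=np.float32)   # 192 x 96 pixels, 3 channels
lidar = np.ones((24, 48), dtype=np.float32)      # 48 x 24 pixels, 1 channel
cnn_input = build_cnn_input(rgb, lidar)
print(cnn_input.shape)  # (96, 192, 4)
```

Arrays are indexed as (rows, columns, channels), so the 192 pixel by 96 pixel image appears as shape (96, 192, ...).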

FIG. 13 is a flow chart illustrating a method according to one embodiment. Process 1300 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, process 1300 may be performed by a point cloud module of an autonomous vehicle, such as point cloud module 307 of fig. 3. Referring to fig. 13, at block 1301, processing logic receives a first image captured by a first camera, the first image capturing a portion of the driving environment of the ADV. At block 1302, processing logic receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a laser radar (LIDAR) device. At block 1303, processing logic upsamples the second image by a predetermined scaling factor to match the image scale of the first image. At block 1304, processing logic generates a second depth map by applying a Convolutional Neural Network (CNN) model to the first image and the upsampled second image, the second depth map having a higher resolution than the first depth map such that the second depth map represents a second point cloud for perceiving the driving environment around the ADV.

In one embodiment, the processing logic receives a third image captured by the second camera and generates a second depth map by applying the CNN model to the first image, the third image, and the upsampled second image. In one embodiment, the first image comprises a cylindrical panoramic image or a spherical panoramic image. In another embodiment, a cylindrical panoramic image or a spherical panoramic image is generated based on several images captured by several camera devices. In another embodiment, the processing logic reconstructs the second point cloud by projecting the second depth map into a 3-D space based on a cylindrical panoramic image or a spherical panoramic image.

In one embodiment, the processing logic maps the upsampled second image onto an image plane of the first image. In one embodiment, the second depth map is generated by blending one or more generated depth maps such that the second depth map is a panoramic map.

In one embodiment, the CNN model includes contracting layers and expanding layers, wherein each contracting layer includes an encoder to downsample a corresponding input, the expanding layers are coupled to the contracting layers, and each expanding layer includes a decoder to upsample a corresponding input. In one embodiment, information from the contracting layers is fed forward to the expanding layers. In one embodiment, each of the expanding layers includes a prediction layer to predict a depth map for a subsequent layer. In one embodiment, upsampling the second image comprises inpainting the second image.

Fig. 14A and 14B are block diagrams illustrating examples of convolutional neural network models, according to some embodiments. Referring to fig. 14A, in one embodiment, a depth map generator 1400 may include an upsampling module 1401 and a CNN model 701. The CNN model 701 (as part of the neural network/CRF model 313) may include contracting layers (or encoders or convolutional layers) 713 and expanding layers (or decoders or deconvolution layers) 715. Fig. 14B illustrates a depth map generator 1420 of another exemplary embodiment. The functions of depth map generators 1400 and 1420 may be performed by depth map module 405 of fig. 4.

Referring to fig. 4 and 14B, the generator 1420 receives a first image captured by a first camera (e.g., the camera-captured image 703), which captures a portion of the driving environment of the ADV. The generator 1420 receives a second image, such as a low resolution LIDAR image 707, that represents a first depth map of a first point cloud corresponding to a portion of the driving environment produced by a light detection and ranging (LIDAR) device. The upsampling module 1401 upsamples the second image (e.g., image 707) by a predetermined scale factor to match the image scale of the output image of the CNN model 701. The generator 1420 determines a second depth map (e.g., an output image of the CNN model 701) by applying the CNN model 701 to the first image (e.g., image 703). The generator 1420 generates a third depth map by applying a conditional random field (CRF) model (performed by the CRF 1403, i.e., the CRF 404 of fig. 4) to the first image (e.g., the image 703), the second image (e.g., the image 707), and the second depth map (e.g., the output image of the CNN model 701), the third depth map having a higher resolution than the first depth map, such that the third depth map represents a second point cloud for perceiving the driving environment around the ADV.

An optimization model such as CRF may be used to refine the estimate of depth/disparity. According to one aspect, the end-to-end CNN model comprises a CRF model that includes three cost terms to optimize (or minimize) the total cost function. For example, the CRF cost function may be:

$$\mathrm{CRF}(x) = \sum_{i \in V} f_i(x_i) + \sum_{ij \in U} f_{ij}(x_{ij}) + \sum_{k \in W} g_k(x_k),$$

where $x_i$ is the disparity value of the i-th pixel, $V$ is the set of all pixels, $U$ is the set of image edges, and $W$ is the set of grid points of the LIDAR image. The first two terms, $f_i(x_i)$ and $f_{ij}(x_{ij})$, may be, respectively, a unary term for the stereo matching cost and a pairwise smoothness term that estimates contrast-sensitive edge weights (i.e., image pixel smoothness/discontinuity).
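How the three sums are assembled can be sketched as follows. The cost functions passed in are hypothetical placeholders chosen only to make the example runnable; they are not the cost terms of the cited works. The sketch also assumes that the pairwise term depends on the two disparities of an edge and that each LIDAR grid point is associated with one pixel index.

```python
def crf_cost(x, edges, lidar_points, unary, pairwise, lidar_term):
    """Sum the three CRF cost terms over the sets V, U and W."""
    total = sum(unary(i, x[i]) for i in range(len(x)))          # i in V
    total += sum(pairwise(x[i], x[j]) for i, j in edges)        # ij in U
    total += sum(lidar_term(x[i], d) for i, d in lidar_points)  # k in W
    return total

# Toy 1-D "image" with three pixels; all values below are made up.
x = [1.0, 2.0, 4.0]                 # candidate disparities
cnn_estimate = [1.0, 2.5, 4.0]      # stand-in for a CNN output
edges = [(0, 1), (1, 2)]            # neighboring pixel pairs
lidar_points = [(2, 3.0)]           # (pixel index, LIDAR disparity)

cost = crf_cost(
    x, edges, lidar_points,
    unary=lambda i, xi: (xi - cnn_estimate[i]) ** 2,   # placeholder unary cost
    pairwise=lambda a, b: abs(a - b),                  # placeholder smoothness
    lidar_term=lambda xi, d: abs(xi - d),              # placeholder LIDAR error
)
print(cost)  # 4.25
```

Here the unary terms contribute 0.25, the pairwise terms 3.0, and the LIDAR term 1.0. Minimizing this total over x is what yields the refined disparity estimate.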

For example, the CNN-CRF model may be configured such that the unary term is determined based on the correlation of stereo left and right RGB images (such as images 703 and 705 of fig. 14A), i.e., the stereo matching cost (e.g., based on the output of the CNN model 701 of fig. 14A). Alternatively, the CNN-CRF model may be configured such that the unary term is determined based on the "information gain" of the disparity value of the i-th pixel (i.e., based on the output of the CNN model 701 of fig. 14B), where the information gain of the disparity value of the i-th pixel has contributions from all other disparity values, as applied to a monocular RGB image such as the image 703 of fig. 14B.

The pairwise smoothness term may be determined based on the disparity values (e.g., based on the output of the CNN model 701) of any pair of pixels and represents the smoothness/discontinuity of the estimated depth map. An example of such a cost term is defined in Knobelreiter et al., "End-to-End Training of Hybrid CNN-CRF Models for Stereo" (November 2016), the contents of which are incorporated herein by reference in their entirety. In an alternative embodiment, the cost term may be the information gain defined in Cao et al., "Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks" (May 2016), the contents of which are incorporated herein by reference in their entirety. The third term (e.g., g(x)) may be a cost term representing the error of the estimated LIDAR image relative to the low resolution LIDAR image (i.e., based on the output of the CNN model 701 and the output of the upsampling module 1401 of fig. 14A-14B).

In one embodiment, g (x) may be defined as:

where threshold is a predetermined threshold such as 1.0 or 2.0, $x_i$ is the disparity value of the i-th pixel, and $d_k$ is the disparity value of the low resolution LIDAR image. It should be noted that the f(x) and g(x) terms may include weight terms based on the input image 703 that are applied on a pixel-by-pixel basis to highlight the contrast of the image. For example, the CRF 1403 may apply a weight term to f(x) and/or g(x) based on the input RGB image 703 of fig. 14A to 14B.
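The defining formula for g(x) does not appear in this text, so the following is not the disclosed definition. It is one common robust error built from the same three ingredients (an estimated disparity, a LIDAR disparity, and a threshold): an absolute difference truncated at the threshold. Both the formula and the name `lidar_error` are assumptions.

```python
import numpy as np

def lidar_error(x, d, threshold=1.0):
    """Truncated absolute difference between estimated and LIDAR disparities.

    ASSUMPTION: this form is an illustrative guess, not taken from the source.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(d, dtype=float))
    return np.minimum(diff, threshold)

errors = lidar_error([3.0, 3.5, 9.0], [3.0, 3.0, 3.0], threshold=2.0)
# errors holds 0.0, 0.5 and 2.0: the outlier at 9.0 is capped at the threshold.
```

Capping the error limits the influence of pixels where the LIDAR measurement and the estimate disagree strongly, for example at occlusion boundaries.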

In one embodiment, the camera-captured image 703 and the LIDAR image 707 are panoramic images, such as cylindrical or spherical panoramic images. In another embodiment, the camera-captured image 703 and the LIDAR image 707 are perspective images. The camera configuration to capture the image 703 may include any of the camera configurations of fig. 5D-5F.

For a LIDAR configuration, a LIDAR image 707 is generated by: the 3-D point cloud captured by the LIDAR detector is mapped from a 3-D space/plane, followed by a conversion of the 3-D point cloud to a 2-D image plane. Here, the 2-D image plane of image 707 may be the same image plane as image 703. In another implementation, the LIDAR image 707 may be a perspective LIDAR image corresponding to the perspective image 703 captured by the camera. As previously described, the CNN model 701 may be applied successively to pairs of images 703 and 707 for several perspectives to generate a perspective LIDAR image. In another embodiment, several CNN models may be applied simultaneously to the multiple perspective paired images 703 and 707 to generate multiple perspective LIDAR images for panoramic image generation.

Referring to fig. 4 and 14A, in another embodiment, the generator 1400 receives a third image, such as the camera-captured image 705, which is captured by a second camera. The generator 1400 determines a second depth map by applying the CNN model to the first image and the third image. The CRF 1403 then applies a CRF model to the second depth map to generate a third depth map. Here, images 703 and 705 may be left and right stereo images (e.g., images captured by configuration 514 of fig. 5E), or vertical top and bottom stereo images (e.g., images captured by configuration 524 of fig. 5F).

FIG. 15 is a flow chart illustrating a method according to one embodiment. Process 1550 may be performed by processing logic that may include software, hardware, or a combination thereof. For example, process 1550 may be performed by a point cloud module of an autonomous vehicle, such as point cloud module 307 of fig. 3. Referring to fig. 15, at block 1551, processing logic receives a first image captured by a first camera that captures a portion of the driving environment of the ADV. At block 1552, processing logic receives a second image representing a first depth map of a first point cloud corresponding to a portion of the driving environment generated by a light detection and ranging (LIDAR) device. At block 1553, processing logic determines a second depth map by applying a convolutional neural network (CNN) model to the first image. At block 1554, processing logic generates a third depth map by applying a conditional random field (CRF) model to the first image, the second image, and the second depth map, the third depth map having a higher resolution than the first depth map such that the third depth map represents a second point cloud for perceiving the driving environment around the ADV.

In one embodiment, the processing logic receives a third image captured by a second camera and determines the second depth map by applying the CNN model to the first image and the third image. In one embodiment, the first image comprises a cylindrical panoramic image or a spherical panoramic image. In another embodiment, a cylindrical panoramic image or a spherical panoramic image is generated based on several images captured by several camera devices. In another embodiment, the processing logic reconstructs the second point cloud by projecting the third depth map into a 3-D space based on a cylindrical panoramic image or a spherical panoramic image.

In one embodiment, the processing logic maps the third image onto an image plane of the first image. In one embodiment, the third depth map is generated by blending one or more generated depth maps such that the third depth map is a panorama.

In one embodiment, the CNN model includes contracting layers and expanding layers, wherein each contracting layer includes an encoder to downsample a respective input, the expanding layers are coupled to the contracting layers, and each expanding layer includes a decoder to upsample a respective input. In one embodiment, information from the contracting layers is fed forward to the expanding layers. In one embodiment, each of the expanding layers includes a prediction layer to predict a depth map for a subsequent layer.

It should be noted that some or all of the components as shown and described above may be implemented in software, hardware, or a combination thereof. For example, such components may be implemented as software installed and stored in a persistent storage device, which may be loaded into and executed by a processor (not shown) to perform the processes or operations described throughout this application. Alternatively, such components may be implemented as executable code programmed or embedded into dedicated hardware, such as an integrated circuit (e.g., an application specific integrated circuit or ASIC), a Digital Signal Processor (DSP) or Field Programmable Gate Array (FPGA), which is accessible via a respective driver and/or operating system from an application. Further, such components may be implemented as specific hardware logic within a processor or processor core as part of an instruction set accessible by software components through one or more specific instructions.

FIG. 16 is a block diagram illustrating an example of a data processing system that may be used with one embodiment of the present disclosure. For example, system 1500 may represent any of the data processing systems described above that perform any of the processes or methods described above, such as, for example, any of sensing and planning systems 110 or servers 103-104 of fig. 1. System 1500 may include many different components. These components may be implemented as Integrated Circuits (ICs), portions of integrated circuits, discrete electronic devices or other modules adapted for a circuit board, such as a motherboard or add-in card of a computer system, or as components otherwise incorporated within a chassis of a computer system.

It should also be noted that system 1500 is intended to illustrate a high-level view of many components of a computer system. However, it is to be understood that some embodiments may have additional components and, further, other embodiments may have different arrangements of the components shown. System 1500 may represent a desktop computer, a laptop computer, a tablet computer, a server, a mobile phone, a media player, a Personal Digital Assistant (PDA), a smart watch, a personal communicator, a gaming device, a network router or hub, a wireless Access Point (AP) or repeater, a set-top box, or a combination thereof. Further, while only a single machine or system is illustrated, the term "machine" or "system" shall also be taken to include any collection of machines or systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

In one embodiment, system 1500 includes a processor 1501, memory 1503, and devices 1505-1508 connected by a bus or interconnect 1510. Processor 1501 may represent a single processor or multiple processors including a single processor core or multiple processor cores. Processor 1501 may represent one or more general-purpose processors, such as a microprocessor, a central processing unit (CPU), or the like. More specifically, processor 1501 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processor 1501 may also be one or more special-purpose processors, such as an application specific integrated circuit (ASIC), a cellular or baseband processor, a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, a graphics processor, a communications processor, a cryptographic processor, a coprocessor, an embedded processor, or any other type of logic capable of processing instructions.

Processor 1501 (which may be a low-power multi-core processor socket such as an ultra-low voltage processor) may serve as a main processing unit and central hub for communicating with the various components of the system. Such a processor may be implemented as a system on a chip (SoC). Processor 1501 is configured to execute instructions for performing the operations and steps discussed herein. The system 1500 may also include a graphics interface to communicate with an optional graphics subsystem 1504, which may include a display controller, a graphics processor, and/or a display device.

Processor 1501 may be in communication with memory 1503, which in one embodiment may be implemented via multiple memory devices to provide a given amount of system storage. Memory 1503 may include one or more volatile storage (or memory) devices, such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or other types of storage devices. Memory 1503 may store information including sequences of instructions that are executed by processor 1501 or any other device. For example, executable code and/or data for various operating systems, device drivers, firmware (e.g., a basic input/output system or BIOS), and/or applications may be loaded into memory 1503 and executed by processor 1501. The operating system may be any type of operating system, for example, a Robot Operating System (ROS), a Mac operating system from Apple Inc., LINUX, UNIX, or other real-time or embedded operating systems.

System 1500 may also include I/O devices such as devices 1505 through 1508, including network interface device 1505, optional input device 1506, and other optional I/O devices 1507. Network interface device 1505 may include a wireless transceiver and/or a Network Interface Card (NIC). The wireless transceiver may be a WiFi transceiver, an infrared transceiver, a bluetooth transceiver, a WiMax transceiver, a wireless cellular telephone transceiver, a satellite transceiver (e.g., a Global Positioning System (GPS) transceiver), or other Radio Frequency (RF) transceiver, or a combination thereof. The NIC may be an ethernet card.

The input device 1506 may include a mouse, a touch pad, a touch-sensitive screen (which may be integrated with the display device 1504), a pointing device (such as a stylus) and/or a keyboard (e.g., a physical keyboard or a virtual keyboard displayed as part of the touch-sensitive screen). For example, the input device 1506 may include a touch screen controller coupled to a touch screen. Touch screens and touch screen controllers, for example, may detect contact and movement or discontinuities thereof using any of a variety of touch sensitive technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies, as well as other proximity sensor arrays or other elements for determining one or more points of contact with the touch screen.

The I/O devices 1507 may include audio devices. An audio device may include a speaker and/or a microphone to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and/or telephony functions. Other I/O devices 1507 may also include Universal Serial Bus (USB) ports, parallel ports, serial ports, printers, network interfaces, bus bridges (e.g., PCI-PCI bridges), sensors (e.g., motion sensors such as accelerometers, gyroscopes, magnetometers, light sensors, compasses, proximity sensors, etc.), or combinations thereof. The devices 1507 may also include an imaging processing subsystem (e.g., a camera) that may include an optical sensor, such as a charge coupled device (CCD) or complementary metal oxide semiconductor (CMOS) optical sensor, for facilitating camera functions, such as recording photographs and video clips. Certain sensors can be coupled to interconnect 1510 via a sensor hub (not shown), while other devices, such as a keyboard or thermal sensors, can be controlled by an embedded controller (not shown) depending on the particular configuration or design of system 1500.

To provide persistent storage for information such as data, applications, one or more operating systems, etc., a mass storage device (not shown) may also be coupled to processor 1501. In various embodiments, such mass storage devices may be implemented via Solid State Devices (SSDs) in order to achieve thinner and lighter system designs and improve system responsiveness. However, in other embodiments, the mass storage device may be implemented primarily using a Hard Disk Drive (HDD), with a smaller amount of the SSD storage device acting as an SSD cache to enable non-volatile storage of context state and other such information during a power down event, enabling fast power up upon a system activity restart. Additionally, a flash device may be coupled to processor 1501, for example, via a Serial Peripheral Interface (SPI). Such flash memory devices may provide non-volatile storage of system software, including the BIOS and other firmware of the system.

Storage 1508 may include a computer-accessible storage medium 1509 (also referred to as a machine-readable storage medium or a computer-readable medium) on which is stored one or more sets of instructions or software (e.g., modules, units, and/or logic 1528) embodying any one or more of the methodologies or functions described herein. The processing module/unit/logic 1528 may represent any of the components described above, such as the planning module 305, the control module 306, and the high resolution point cloud module 307. Processing module/unit/logic 1528 may also reside, completely or at least partially, within memory 1503 and/or within processor 1501 during execution thereof by data processing system 1500, memory 1503 and processor 1501 also constituting machine-accessible storage media. Processing module/unit/logic 1528 may also be transmitted or received over a network via network interface device 1505.

The computer-readable storage medium 1509 may also be used to permanently store some of the software functions described above. While the computer-readable storage medium 1509 is shown in an exemplary embodiment to be a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable storage medium" shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term "computer-readable storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, or any other non-transitory machine-readable medium.

The processing module/unit/logic 1528, components, and other features described herein may be implemented as discrete hardware components or integrated within the functionality of a hardware component, such as an ASIC, FPGA, DSP, or similar device. Further, the processing module/unit/logic 1528 may be implemented as firmware or functional circuitry within a hardware device. Further, the processing module/unit/logic 1528 may be implemented in any combination of hardware devices and software components.

It should be noted that while system 1500 is illustrated with various components of a data processing system, it is not intended to represent any particular architecture or manner of interconnecting the components; as such details are not germane to embodiments of the present disclosure. It will also be appreciated that network computers, hand-held computers, mobile telephones, servers, and/or other data processing systems which have fewer components or perhaps more components may also be used with embodiments of the present disclosure.

Some portions of the foregoing detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the appended claims refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Embodiments of the present disclosure also relate to apparatuses for performing the operations herein. A computer program for performing these operations is stored in a non-transitory computer-readable medium. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., computer) readable storage medium (e.g., read only memory ("ROM"), random access memory ("RAM"), magnetic disk storage media, optical storage media, flash memory devices).

The processes or methods depicted in the foregoing figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (e.g., embodied on a non-transitory computer readable medium), or a combination of both. Although the processes or methods are described above in terms of some sequential operations, it should be appreciated that some of the operations may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.

Embodiments of the present disclosure are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the disclosure as described herein.

In the foregoing specification, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Claims

1. A method of generating a high-resolution three-dimensional point cloud, the method comprising:

receiving a first image captured by a first camera, the first image capturing a portion of a driving environment of an autonomous vehicle;

2. The method of claim 1, further comprising:

receiving a third image captured by a second camera; and

determining the second depth map by applying the convolutional neural network model to the first image and the third image.

3. The method of claim 1, wherein the first image comprises a cylindrical panoramic image or a spherical panoramic image.

4. The method of claim 3, wherein the cylindrical panoramic image or the spherical panoramic image is generated based on a plurality of images captured by a plurality of camera devices.

5. The method of claim 3, further comprising:

reconstructing the second point cloud by projecting the second depth map into 3-D space based on the cylindrical panoramic image or the spherical panoramic image.
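By way of a non-limiting illustration, the projection of a spherical panoramic depth map into 3-D space may be sketched as follows. The sketch assumes an equirectangular layout in which columns span azimuth and rows span elevation, and in which each pixel stores a radial distance; the function name `spherical_depth_to_points`, the parameter `min_depth`, and the angle conventions are hypothetical and are not taken from the disclosure.

```python
import math


def spherical_depth_to_points(depth, min_depth=0.0):
    """Back-project an equirectangular depth map (a list of rows) to 3-D points.

    Assumptions (illustrative only, not from the source): every pixel holds a
    radial distance, columns cover azimuth [-pi, pi), and rows cover elevation
    from +pi/2 (top) to -pi/2 (bottom).
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v, row in enumerate(depth):
        # Elevation at the pixel center of row v.
        elevation = math.pi / 2 - math.pi * (v + 0.5) / height
        for u, r in enumerate(row):
            if r <= min_depth:
                continue  # skip pixels that carry no valid depth
            # Azimuth at the pixel center of column u.
            azimuth = 2 * math.pi * (u + 0.5) / width - math.pi
            x = r * math.cos(elevation) * math.cos(azimuth)
            y = r * math.cos(elevation) * math.sin(azimuth)
            z = r * math.sin(elevation)
            points.append((x, y, z))
    return points
```

Under these assumptions every returned point lies at its stored radial distance from the sensor origin. A cylindrical panorama would use a linear height coordinate in place of the elevation angle.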

6. The method of claim 2, further comprising:

mapping the third image onto an image plane of the first image.

7. The method of claim 6, wherein the third depth map is generated by blending one or more generated depth maps, wherein the third depth map is a panoramic map.

8. The method of claim 1, wherein the convolutional neural network model comprises:

a plurality of contracting layers, wherein each contracting layer includes an encoder to downsample a respective input; and

a plurality of expansion layers coupled to the plurality of contracting layers, wherein each expansion layer includes a decoder to upsample a respective input.

9. The method of claim 8, wherein information of the plurality of contracting layers is fed forward to the plurality of expansion layers.

10. The method of claim 8, wherein each of the plurality of expansion layers comprises a prediction layer to predict a depth map for a subsequent layer.
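By way of a non-limiting illustration, the data flow of the encoder/decoder structure recited in claims 8 to 10 may be sketched as follows. The sketch is structural only: 2x2 average pooling stands in for a learned encoder, nearest-neighbor repetition stands in for a learned decoder, and a simple average stands in for merging the fed-forward information. All function names and the parameter `levels` are hypothetical, and no trained weights are involved.

```python
def downsample(grid):
    """Encoder stand-in: halve the resolution by 2x2 average pooling."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def upsample(grid):
    """Decoder stand-in: double the resolution by nearest-neighbor repetition."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(upsampled, skip):
    """Combine decoder output with the fed-forward encoder features (average)."""
    return [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(upsampled, skip)]


def encoder_decoder(image, levels=2):
    """Return one depth prediction per scale, from coarsest to finest.

    The image height and width must be divisible by 2 ** levels.
    """
    skips = []
    x = image
    for _ in range(levels):
        skips.append(x)  # kept so it can be fed forward to the decoder side
        x = downsample(x)
    predictions = [x]  # coarsest prediction
    for skip in reversed(skips):
        # Each stage upsamples the previous prediction and refines it with
        # the matching encoder features; the result feeds the next stage.
        x = merge(upsample(x), skip)
        predictions.append(x)
    return predictions
```

Calling `encoder_decoder` on a 4x4 input with `levels=2` yields predictions at 1x1, 2x2, and 4x4 resolution, the last matching the input size. In a trained network the pooling, repetition, and averaging steps would be replaced by learned convolutional operations.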

11. A non-transitory machine-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:

12. The non-transitory machine-readable medium of claim 11, the operations further comprising:

receiving a third image captured by a second camera; and

13. The non-transitory machine-readable medium of claim 11, wherein the first image comprises a cylindrical panoramic image or a spherical panoramic image.

14. The non-transitory machine-readable medium of claim 13, wherein the cylindrical panoramic image or the spherical panoramic image is generated based on a plurality of images captured by a plurality of camera devices.

15. The non-transitory machine-readable medium of claim 13, the operations further comprising:

16. A data processing system, comprising:

a processor; and

generating a third depth map by applying a conditional random field model to the first image, the second image, and the second depth map, the third depth map having a higher resolution than the first depth map, wherein the third depth map represents a second point cloud for perceiving the driving environment surrounding the autonomous vehicle.
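By way of a non-limiting illustration, one way to combine a camera image, a sparse depth map, and a predicted dense depth map in a conditional random field (CRF) style refinement may be sketched as follows. The quadratic energy, the Gaussian image-similarity weight, the Jacobi-style update, and every name and default value (`refine_depth`, `w_cnn`, `w_lidar`, `w_smooth`, `sigma`, `iterations`) are assumptions made for illustration; the disclosure does not specify this particular formulation here.

```python
import math


def refine_depth(image, cnn_depth, lidar_depth, w_cnn=1.0, w_lidar=10.0,
                 w_smooth=1.0, sigma=0.1, iterations=50):
    """Refine a dense depth map using sparse measurements and image edges.

    image:       2-D list of grayscale intensities (assumed in [0, 1]).
    cnn_depth:   2-D list of dense depth predictions.
    lidar_depth: 2-D list with a measured depth or None at each pixel.
    """
    h, w = len(image), len(image[0])
    depth = [row[:] for row in cnn_depth]
    for _ in range(iterations):
        updated = [row[:] for row in depth]
        for i in range(h):
            for j in range(w):
                # Unary terms: stay close to the prediction and, where
                # available, to the sparse measurement.
                num = w_cnn * cnn_depth[i][j]
                den = w_cnn
                if lidar_depth[i][j] is not None:
                    num += w_lidar * lidar_depth[i][j]
                    den += w_lidar
                # Pairwise terms: smooth toward 4-neighbors, weighted down
                # across strong intensity edges in the image.
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        diff = image[i][j] - image[ni][nj]
                        weight = w_smooth * math.exp(
                            -diff * diff / (2 * sigma * sigma))
                        num += weight * depth[ni][nj]
                        den += weight
                updated[i][j] = num / den
        depth = updated
    return depth
```

In this sketch each pixel is repeatedly set to a weighted average of its prediction, its measurement if any, and its neighbors, so a single measurement propagates to nearby pixels that look similar in the image while the prediction fills regions without measurements.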

17. The system of claim 16, the operations further comprising:

receiving a third image captured by a second camera; and

18. The system of claim 16, wherein the first image comprises a cylindrical panoramic image or a spherical panoramic image.

19. The system of claim 18, wherein the cylindrical panoramic image or the spherical panoramic image is generated based on a plurality of images captured by a plurality of camera devices.

20. The system of claim 18, the operations further comprising: