WO2020199072A1 - Autonomous driving dataset generation with automatic object labelling methods and apparatuses - Google Patents

Autonomous driving dataset generation with automatic object labelling methods and apparatuses

Info

Publication number
WO2020199072A1
Authority
WO
WIPO (PCT)
Prior art keywords
vehicles
images
object detection
roadways
vehicle
Application number
PCT/CN2019/080776
Other languages
French (fr)
Inventor
Yimin Zhang
Haibing Ren
Xiangbin WU
Ignacio Alvarez
Original Assignee
Intel Corporation
Application filed by Intel Corporation filed Critical Intel Corporation
Priority to CN201980090668.0A priority Critical patent/CN113366488A/en
Priority to PCT/CN2019/080776 priority patent/WO2020199072A1/en
Priority to EP19922546.7A priority patent/EP3948647A4/en
Publication of WO2020199072A1 publication Critical patent/WO2020199072A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0062Adapting control system settings
    • B60W2050/0075Automatic parameter input, automatic initialising or calibrating means
    • B60W2050/0083Setting, resetting, calibration
    • B60W2050/0088Adaptive recalibration
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • The present disclosure relates to the field of computer-assisted or autonomous driving (CA/AD). More particularly, the present disclosure relates to the generation of CA/AD training or reference datasets, including automatic object labelling.
  • With camera inputs, the autonomous driving vehicle is to recognize the road, traffic signs, cars, trucks, pedestrians and other objects on the roadway. Probably the most popular approach to address this challenge is a data-driven machine learning solution. Extremely large training datasets with labeled ground truth are crucial to train object detectors, in order to provide the required robustness and accuracy. But real public roads are very complex, and the captured training images are affected by many factors, including season, weather, illumination, viewpoint, occlusion, etc.
  • The Tesla Model S car misrecognized the truck as a brightly lit sky.
  • One possible reason for the misrecognition is that this type of scenario had never appeared in the training dataset, which suggests Tesla's training dataset may be insufficient.
  • A benchmark training dataset may also be referred to as a benchmark reference dataset, or simply a training or reference dataset. It might also be referred to simply as a "dataset."
  • Figure 1 illustrates an overview of an environment for incorporating and using the autonomous driving dataset generation with automatic object labelling technology of the present disclosure, in accordance with various embodiments.
  • Figure 2 illustrates an overview of the generation of autonomous driving dataset with automatic object labelling, according to various embodiments.
  • Figure 3 illustrates multi-view capturing of images of roadways of the present disclosure, according to various embodiments.
  • Figure 4 illustrates a component view of an example computer-assisted/autonomous driving system, according to various embodiments.
  • Figure 5 illustrates an example process for generating an autonomous driving dataset with automatic object labelling, according to various embodiments.
  • Figure 6 illustrates an example process for calibrating image sensors of the data capturing CA/AD vehicles, according to various embodiments.
  • Figures 7A-7C illustrate respective example processes for real time and local collecting of images and detecting of objects in roadways, single camera motion based object detection and multi-view object detection, according to various embodiments.
  • Figure 8 illustrates an example process for merging the results of object detection of various methods, according to various embodiments.
  • Figure 9 illustrates an example neural network suitable for use by the object detection subsystem, according to various embodiments.
  • Figure 10 illustrates a software component view of the in-vehicle (CA/AD) system, according to various embodiments.
  • Figure 11 illustrates a hardware component view of a computing platform, suitable for use as an in-vehicle (CA/AD) system or a cloud server, according to various embodiments.
  • Figure 12 illustrates a storage medium having instructions for practicing aspects of the methods described with references to Figures 1-8, according to various embodiments.
  • multiple methods are applied to detect and automatically label objects.
  • One of the methods is based on real time local object detection by the data capturing vehicles themselves.
  • Another method is based on single camera motion based object detection analysis.
  • A third method is multi-view object detection using sequences of images collectively captured by a multi-view vision system formed by the proximally operated data collection or capturing vehicles.
  • The results of object detection are merged together to provide the automatic object labelling in the generated autonomous driving dataset. By merging the redundant results from the multiple methods, high accuracy can be achieved. Experience has shown that the approach provides much better performance than traditional approaches.
  • a process for generating an autonomous driving dataset for training computer-assisted or autonomous driving (CA/AD) systems of CA/AD vehicles comprises proximally operating a plurality of CA/AD vehicles on a plurality of roadways; and collecting a plurality of sequences of images of the plurality of roadways with image sensors disposed in the plurality of proximally operated CA/AD vehicles, including synchronously collecting some of the images by the image sensors.
  • The process includes correspondingly processing the plurality of sequences of images collected by the CA/AD systems of the CA/AD vehicles to detect objects on the plurality of roadways; individually processing the sequences of images collected to detect objects on the plurality of roadways via single camera motion based object detection analysis; and collectively processing the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis. Further, the process includes generating the autonomous driving dataset with automatic object labelling based at least in part on the object detection results of the corresponding, individual and collective processing of the sequences of images.
  • 2D projections onto the original images are also generated as automatic 2D ground truth, which is very convenient for manual checking or later processing.
  • a computer-assisted or autonomous driving (CA/AD) system for a CA/AD vehicle comprises: a sensor interface and an input/output (I/O) interface; and an autonomous driving dataset generator (ADDG) agent coupled with the sensor interface, and the I/O interface.
  • the ADDG agent via the sensor interface, is to forward synchronization signals to an image sensor of the CA/AD vehicle, and to receive a sequence of images of a plurality of roadways collected by the image sensor, at least some of received images being collected synchronously with image collections on one or more other proximally operated CA/AD vehicles, based at least in part on the synchronization signals.
  • the ADDG agent via the I/O interface, is to output the received sequence of images to an ADDG to process the sequence of images to detect objects on the plurality of roadways under a plurality of manners, and to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the plurality of manners of processing.
  • At least one computer-readable medium having instructions stored therein to cause a computing system (e.g., a server) , in response to execution of the instructions by a processor of the computing system, to operate an autonomous driving dataset generator (ADDG) to: individually process a plurality of sequences of images collected by image sensors of a plurality of proximally operated computer-assisted or autonomous driving (CA/AD) vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis, including individual calibration of the image sensors and detection of moving areas within the images; and collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, including cross calibration of the image sensors and reconstruction of 3D scenes within the images.
  • the computing system is caused to operate the ADDG to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the individual and collective processing of the sequence of images.
  • the phrase “A and/or B” means (A) , (B) , or (A and B) .
  • the phrase “A, B, and/or C” means (A) , (B) , (C) , (A and B) , (A and C) , (B and C) , or (A, B and C) .
  • the description may use the phrases “in an embodiment, ” or “In some embodiments, ” which may each refer to one or more of the same or different embodiments.
  • the terms “comprising, ” “including, ” “having, ” and the like, as used with respect to embodiments of the present disclosure are synonymous.
  • module or “engine” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC) , an electronic circuit, a processor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
  • example environment 50 includes vehicle 52.
  • Vehicle 52 includes an engine, transmission, axles, wheels and so forth (not shown) .
  • vehicle 52 includes an in-vehicle system (IVS) (also referred to as computer-assisted or autonomous driving (CA/AD) system) 100, sensors 110, and driving control units (DCU) 120.
  • IVS or CA/AD system 100 includes in particular, navigation subsystem 130, object detection subsystem 140, and autonomous driving dataset generator (ADDG) agent 150.
  • ADDG agent 150 is configured to complement an ADDG, e.g., ADDG 85, disposed e.g., in server 60, to generate autonomous driving datasets to train CA/AD systems of CA/AD vehicles, e.g., object detection subsystems of the CA/AD systems (such as object detection subsystem 140 of CA/AD system 100) .
  • ADDG agent 150 and ADDG 85 incorporate the technology of the present disclosure to enable the autonomous driving datasets to be generated with automatic object labelling, described more fully below.
  • Navigation subsystem 130 may be configured to provide navigation guidance or control, depending on whether CA/AD vehicle 52 is a computer-assisted vehicle or a partially or fully autonomous driving vehicle.
  • Object detection subsystem 140 may be configured with computer vision to recognize stationary or moving objects 70 (such as travelers, other vehicles, bicycles, street signs, traffic lights, and so forth) in a rolling area surrounding CA/AD vehicle 52, based at least in part on sensor data collected by sensors 110, as it travels enroute on the roadways to its destination.
  • CA/AD system 100, in response to the stationary or moving objects recognized in the rolling area surrounding CA/AD vehicle 52, makes decisions in guiding or controlling the DCUs of CA/AD vehicle 52 to drive, or assist in driving, CA/AD vehicle 52 to its destination.
  • Sensors 110 include one or more high resolution red/green/blue (RGB) and light detection and ranging (LiDAR) image sensors (cameras) (not shown) to capture a plurality of sequences of images of the rolling surrounding area of CA/AD vehicle 52, as it travels enroute on the roadways to its destination.
  • Sensors 110 may also include accelerometers, gyroscopes, Global Positioning System (GPS) circuitry, Globalnaya Navigazionnaya Sputnikovaya Sistema, or Global Navigation Satellite System (GLONASS), circuitry, and so forth.
  • Examples of driving control units may include control units for controlling the engine, transmission and brakes of CA/AD vehicle 52.
  • IVS or CA/AD system 100 may further include a number of infotainment subsystems/applications, e.g., instrument cluster subsystem/applications, front-seat infotainment subsystems/applications, such as a navigation subsystem/application, a media subsystem/application, a vehicle status subsystem/application and so forth, and a number of rear-seat entertainment subsystems/applications (not shown).
  • IVS or CA/AD 100 on its own or in response to user interactions, communicates or interacts 54 with one or more remote/cloud servers 60.
  • Remote/cloud servers 60 may include any one of a number of driving assistance (such as map) or content provision (such as multi-media infotainment) services 80.
  • Remote/cloud servers 60 include, in particular, ADDG 85 to generate autonomous driving datasets with automatic object labelling. Except for ADDG 85, the driving assistance (such as map) or content provision (such as multi-media infotainment) services 80 may be any one or more of such services known in the art.
  • IVS or CA/AD 100 communicates 54 with server 60 via cellular communication, e.g., via a wireless signal repeater or base station on transmission tower 56 near vehicle 52.
  • Examples of private and/or public wired and/or wireless networks 58 may include the Internet, the network of a cellular service provider, and so forth. It is to be understood that transmission tower 56 may be different towers at different times/locations, as vehicle 52 travels enroute to its destination or personal system 150 moves around.
  • IVS or CA/AD 100 communicates with servers 60 via wired communication, such as Ethernet, or removable storage medium, such as solid state drives, disks or tapes.
  • IVS or CA/AD system 100, CA/AD vehicle 52, servers 60 and driving assistance and/or content services 80 otherwise may be any one of a number of in-vehicle systems, CA/AD vehicles, from computer-assisted to partially or fully autonomous vehicles, servers, and driving assistance/content services known in the art.
  • The final object detection results 208 included in the generated autonomous driving dataset are a merger of the results of different object detection methods.
  • The final object detection results 208 included in the generated autonomous driving dataset are a merger of the results of three object detection methods.
  • The final object detection results 208 included in the generated autonomous driving dataset may be a merger of the results of more or fewer object detection methods.
  • The final object detection results 208 included in the generated autonomous driving dataset will be more accurate if they are a merger of the results of more object detection methods, as opposed to fewer.
  • the results of the different object detection methods include:
  • the results of the single camera motion based object detection analysis are obtained by individually processing the sequence of images captured by the image sensors of data capturing CA/AD vehicles. They may be the same sequence of images used by the object detection subsystems of the data capturing CA/AD vehicles to perform real time local object detections, as the CA/AD vehicles travel enroute to their destinations.
  • the results of the multi-view object detection analysis are obtained by collectively processing the sequence of images synchronously captured by the image sensors of the data capturing CA/AD vehicles.
  • The combined image sensors of the data capturing CA/AD vehicles may collaboratively provide point clouds with a large field of view, less occlusion and high resolution. This may greatly improve object detection because of several merits, discussed below.
  • each of CA/AD vehicles 352a-352c may be an instance of CA/AD vehicle 52 of Figure 1.
  • proximally operated CA/AD vehicles 352a-352c are equipped with inter-vehicle communication, e.g., WiFi.
  • proximally operated CA/AD vehicles 352a-352c are further equipped with intelligence to negotiate with each other, and elect one of the proximally operated CA/AD vehicles 352a-352c as a master vehicle, for coordinating the capturing of the roadway images.
  • The master vehicle sends synchronization signals 354a-354b to the other proximally operated CA/AD vehicles.
  • The centrally disposed one of the proximally operated CA/AD vehicles 352a-352c, such as CA/AD vehicle 352b, may be elected as the master vehicle.
  • the capturing of the multi-view images may be coordinated or synchronized in other manners.
  • The proximally operated CA/AD vehicles 352a-352c may negotiate an image capturing frequency (e.g., one image per second), and synchronize their start time at the beginning.
  • the synchronous capturing of the multi-view roadway images may be coordinated by a remote server, e.g., remote server 60 of Figure 1.
  • Three proximally operated CA/AD vehicles 352a-352c are shown in Figure 3. However, the disclosure is not so limited. In alternate embodiments, the disclosure may be practiced with more or fewer proximally operated CA/AD vehicles. A sketch of the capture synchronization scheme follows.
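  • The master election and capture synchronization described above can be sketched as follows. This is a minimal illustration in Python, assuming hypothetical vehicle identifiers, 2D positions and a broadcast() helper that stands in for the inter-vehicle (e.g., WiFi) communication subsystem; none of these names come from the disclosure itself.

```python
# Illustrative sketch only: elect a master (e.g., the centrally located vehicle)
# and have it broadcast a common capture schedule to the others.
import time

def broadcast(vehicle_id, message):
    # Placeholder for the inter-vehicle (e.g., WiFi) communication subsystem.
    print(f"to {vehicle_id}: {message}")

def elect_master(vehicle_positions):
    """Pick the vehicle closest to the centroid of the group (e.g., 352b)."""
    n = len(vehicle_positions)
    cx = sum(x for x, _ in vehicle_positions.values()) / n
    cy = sum(y for _, y in vehicle_positions.values()) / n
    return min(vehicle_positions,
               key=lambda vid: (vehicle_positions[vid][0] - cx) ** 2 +
                               (vehicle_positions[vid][1] - cy) ** 2)

def schedule_synchronous_capture(master_id, vehicle_ids, period_s=1.0):
    """Master broadcasts a start time and capture period, so that every
    vehicle captures on the same schedule (the negotiated-frequency scheme)."""
    start = time.time() + 2.0            # small lead time for message delivery
    for vid in vehicle_ids:
        broadcast(vid, {"type": "sync", "from": master_id,
                        "start": start, "period": period_s})
    return start

positions = {"352a": (0.0, 0.0), "352b": (3.5, 0.2), "352c": (7.0, -0.1)}
master = elect_master(positions)
schedule_synchronous_capture(master, list(positions), period_s=1.0)
```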
  • CA/AD system 400 which may be IVS or CA/AD system 100 of Figure 1, includes main system controller 402, navigation subsystem 404, object detection subsystem 406, ADDG Agent 408, intra-vehicle communication subsystem 410, inter-vehicle communication subsystem 412 and remote communication subsystem 414.
  • CA/AD system 400 may include more or fewer subsystems.
  • main system controller 402 is configured to control overall operation of CA/AD system 400, including controlling DCU 420 of the host vehicle of CA/AD system 400, via intra-vehicle communication subsystem 410.
  • Main system controller 402 may control DCU 420 based at least in part on the sensor data provided by various sensors 430, via intra-vehicle communication subsystem 410, as well as the results of object detection provided by object detection subsystem 406.
  • Object detection subsystem 406, which may be object detection subsystem 140 of Figure 1, is configured to recognize stationary or moving objects 70 (such as travelers, other vehicles, bicycles, street signs, traffic lights, and so forth) in a rolling area surrounding the host vehicle of CA/AD system 400, based at least in part on sensor data collected by sensors 430, as it travels enroute on the roadways to its destination.
  • Object detection subsystem 406 may employ a neural network to detect objects within the rolling area surrounding the host vehicle.
  • Figure 9 illustrates an example neural network that may be employed for real time local object detection, to be described in more detail below.
  • Navigation subsystem 404, which may be navigation subsystem 130, may be configured to provide navigation guidance or control, depending on whether the host vehicle of CA/AD system 400 is a computer-assisted vehicle or a partially or fully autonomous driving vehicle. Navigation subsystem 404 may provide navigation guidance or control based at least in part on sensor data provided by other sensors, such as GPS/GLONASS sensors, via intra-vehicle communication subsystem 410. Navigation subsystem 404 may be any one of such subsystems known in the art.
  • ADDG agent 408 is configured to complement an offline ADDG, e.g., ADDG 85 of Figure 1, in the generation of autonomous driving datasets with automatic object labelling.
  • ADDG agent 408 is configured to cooperate with the proximally operated vehicles in the collection of multi-view images of the roadways.
  • ADDG agent 408 is configured to negotiate with the proximally operated vehicles in selecting the master vehicle among the proximally operated vehicles.
  • ADDG agent 408 is further configured to send or receive the synchronization signals to synchronize the taking of the multi-view roadway images, depending on whether the host vehicle of CA/AD system 400 is selected as the master vehicle.
  • ADDG agent 408 is configured to output the roadway images captured by image sensors 430, including the roadway images taken synchronously with the proximally operated vehicles, and the results of the object detection by object detection subsystem 406, for the offline ADDG, via remote communication subsystem 414.
  • The sensor data may include, but are not limited to, sensor data (images) from one or more cameras of the host vehicle providing frontal, rearward and/or side world views looking out from the host vehicle; sensor data from accelerometers, inertial measurement units (IMUs), and/or gyroscopes of the vehicle, providing speed and/or deceleration data; and so forth.
  • main system controller 402, navigation subsystem 404, object detection subsystem 406 and ADDG agent 408 may be implemented in hardware and/or software, with or without the employment of hardware accelerators.
  • Figures 10-11 illustrate example hardware and/or software implementations of CA/AD system 400, to be described in more detail later.
  • Intra-vehicle communication subsystem 410 may be coupled with sensors 430 and driving control units 420 via a vehicle bus. Intra-vehicle communication subsystem 410 may communicate with sensors 430 and driving control units 420 in accordance with the Controller Area Network communication protocol. In some embodiments, intra-vehicle communication subsystem 410 may be communicatively coupled with sensors 430 via a wireless network, and communicate in accordance with a wireless network protocol, such as Near Field Communication (NFC), WiFi and so forth. By virtue of its inter-operation with sensors 430, intra-vehicle communication subsystem 410 may also be referred to as a sensor interface.
  • inter-vehicle communication subsystem 412 is configured to facilitate communication with proximally operated CA/AD vehicles.
  • inter-vehicle communication subsystem 412 is configured to support inter-vehicle communication in accordance with one or more industry accepted practices.
  • inter-vehicle communication subsystem 412 may be configured to communicate with communication subsystems of the other vehicles via WiFi or cellular, such as LTE 4G/5G.
  • remote communication subsystem 414 is configured to facilitate communication with one or more remote/offline servers, which may be server 60 of Figure 1.
  • remote communication subsystem 414 may be configured to communicate with the remote/offline servers wirelessly, via a wide area network, such as the Internet. Wireless communication may be WiFi or cellular, such as LTE 4G/5G.
  • remote communication subsystem 414 may be configured to communicate with the remote/offline servers via wired communication, such as Ethernet, or through portable storage medium, such as removable solid state drives, disks or tapes.
  • Remote communication subsystem 414 may also be referred to as an input/output (I/O) interface of CA/AD system 400.
  • process 500 includes operations performed at blocks 502-512.
  • Operations at blocks 502-512 may be performed by a provider of autonomous driving datasets, using in particular, ADDG 85 of Figure 1, complemented by ADDG agent 150 of Figure 1 or 408 of Figure 4.
  • Process 500 may include more or fewer operations.
  • Process 500 starts at block 502.
  • The image sensors of a plurality of CA/AD vehicles, which are to be proximally operated to capture images of roadways for generation of an autonomous driving dataset with automatic object labelling, are calibrated.
  • the image sensors of the CA/AD vehicles include RGB and LiDAR cameras.
  • the calibrations include 2D and 3D calibration of the RGB and LiDAR cameras, as well as cross calibration of the image sensors for multi-view image processing. The calibrations will be further described later with references to Figure 6.
  • the plurality of CA/AD vehicles having various calibrated sensors (including the image sensors) and object detection capabilities are proximally operated on a plurality of roadways to collect data (including images) of a plurality of roadways.
  • Sensor data, including images of the roadways, are individually collected, as well as cooperatively collected, to detect objects on the roadways. That is, the sensors (including image sensor(s)) of a CA/AD vehicle may collect sensor data (including images) of the roadways continuously, with at least a subset of the images being collected synchronously in coordination among the plurality of CA/AD vehicles, as earlier described. The operations of collecting images with image sensors will be further described later with references to Figure 7A.
  • process 500 proceeds to blocks 506-510.
  • Each of the CA/AD vehicles individually detects objects on the roadways, based at least in part on the sensor data (including images) collected, using the corresponding object detection subsystems of the CA/AD vehicles.
  • the results of the object detection are accumulated and later outputted for the operations at block 512.
  • an object detection subsystem of a CA/AD vehicle may employ a neural network in making the detection. An example neural network is later described with references to Figure 9.
  • the images collected by the image sensors of the CA/AD vehicles may be correspondingly processed to perform single camera motion based object detection.
  • the results of the single camera motion based object detection are outputted for the operations at block 512.
  • the operations of single camera motion based object detection will be further described later with references to Figure 7B.
  • the images collected by the image sensors of the CA/AD vehicles may be collectively processed to perform multi-view object detection.
  • the results of the multi-view object detection are outputted for the operations at block 512.
  • the operations of multi-view object detection will be further described later with references to Figure 7C.
  • process 500 proceeds to block 512.
  • the results of the real time object detections by the object detection subsystems of the CA/AD vehicles, the results of the single camera motion based object detection analysis, and the results of the multi-view object detection analysis are merged together to provide the automatic object labelling for the autonomous driving dataset being generated.
  • the operations of merging the various object detection results will be further described later with references to Figure 8.
  • Process 600 for correspondingly calibrating and cross calibrating the image sensors of the CA/AD vehicles includes operations performed at blocks 602-606.
  • the operations may be performed by the provider of the autonomous driving datasets, e.g., using ADDG 85 of Figure 1.
  • Process 600 may be practiced with more or fewer operations.
  • Process 600 starts at block 602.
  • the 3D LiDAR cameras and the 2D RGB cameras of the CA/AD vehicles are correspondingly calibrated.
  • The combination of 3D LiDAR and 2D RGB cameras provides better results for the outdoor environment.
  • The combination senses over much longer distances and has better depth accuracy.
  • The 3D LiDAR cameras are used to sense the depth information.
  • The 2D RGB cameras are used to sense the color information.
  • the intrinsic and extrinsic parameters of each pair of two imaging systems are calibrated with the method described in Jesse Levinson, Sebastian Thrun. Automatic Online Calibration of Cameras and Lasers. Robotics: Science and Systems, 2013.
  • the extrinsic parameters represent a rigid transformation from 3-D world coordinate system to the 3-D camera’s coordinate system.
  • the intrinsic parameters represent a projective transformation from the 3-D camera’s coordinates into the 2-D image coordinates.
  • In alternate embodiments, other calibration methods may be practiced. Upon calibration, the captured depth images will be aligned to the RGB images.
  • process 600 proceeds to block 604.
  • The 3D point cloud with RGB color is generated in the same 3D coordinate system as the 2D cameras. These 3D point clouds from multiple vehicles are used for vehicle calibration and 3D object detection later. As the 3D LiDAR and 2D RGB cameras are typically fixed in the CA/AD vehicles, the calibration typically only needs to be done once, or infrequently after multiple repeated operations. A sketch of generating such a colored point cloud follows.
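  • As a minimal illustration of the alignment just described, the sketch below projects LiDAR points into the RGB image using already-calibrated extrinsic parameters (R, t) and an intrinsic matrix K, and samples pixel colors to build a colored point cloud. The array names and shapes are assumptions for illustration, not the disclosure's notation, and the calibration itself (e.g., the cited Levinson and Thrun method) is assumed to have been done beforehand.

```python
# Sketch: color a LiDAR point cloud from an aligned RGB image, given a
# calibrated LiDAR-to-camera extrinsic (R, t) and camera intrinsics K.
import numpy as np

def colorize_point_cloud(points_lidar, rgb_image, R, t, K):
    """points_lidar: (N, 3) points in the LiDAR frame.
    R (3x3), t (3,): rigid transform from the LiDAR frame to the camera frame.
    K (3x3): camera intrinsic matrix. Returns (M, 6) rows of [x, y, z, r, g, b]."""
    pts_cam = points_lidar @ R.T + t               # rigid (extrinsic) transform
    in_front = pts_cam[:, 2] > 0.1                 # keep points in front of the camera
    pts_cam = pts_cam[in_front]
    proj = pts_cam @ K.T                           # projective (intrinsic) transform
    uv = proj[:, :2] / proj[:, 2:3]                # perspective divide -> pixel coords
    h, w = rgb_image.shape[:2]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    colors = rgb_image[v[valid], u[valid]]         # sample RGB at projected pixels
    return np.hstack([pts_cam[valid], colors.astype(float)])
```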
  • The 3D LiDAR and 2D RGB cameras of the CA/AD vehicles are cross calibrated to enable subsequent multi-view analysis of the images they capture.
  • Each vehicle is considered as a 3D vision system which outputs a 3D point cloud with RGB information.
  • The extrinsic parameters of the multiple vehicle cameras are cross calibrated in order to combine all the point clouds together.
  • The cross calibration is performed in two stages:
  • Stage 1: Estimate the rotation and translation between two neighboring 3D vision systems.
  • A neighboring 3D vision system is the one that is closest to the current system of interest in physical distance. In various embodiments, only the closest pair of vision systems is calibrated, because they may share the largest field of view. Further, the iterative closest point (ICP) method is used to estimate the rotation and translation between the 3D vision systems via the registration of two 3D point clouds. It is known that ICP may converge to a local minimum if its initialization parameters are not well set. Experience has shown that, with the coarse 2D location and pose of each vehicle, good initialization of the ICP translation and rotation is achieved with this method. Further, very accurate extrinsic parameters between neighboring vehicle vision systems can be estimated.
  • Stage 2: Set the 3D coordinate system of the 3D vision system on the CA/AD vehicle operated substantially at the center of the proximally operated CA/AD vehicles as the world coordinate system. Transfer the coordinate systems of all other 3D vision systems on the other CA/AD vehicles to the world coordinate system one by one.
  • R1, T1 are the rotation and translation between C1 and Cw.
  • R2, T2 are the rotation and translation between C1 and C2.
  • 3D point clouds are used for extrinsic parameter calibration.
  • 3D registration of point clouds is much more robust than traditional 2D camera calibration.
  • Good initialization parameters for ICP are estimated to guarantee its convergence. These initialization parameters are based on the vehicles' coarse positions and orientations.
  • The 3D point clouds from these vehicles may be merged into a larger one with a larger field of view, much less occlusion and higher resolution.
  • 3D object detection is done on the final merged point cloud, and the 3D poses of the objects may also be obtained during the 3D object detection. The coordinate-system chaining of Stage 2 is sketched below.
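  • The following is a minimal Python sketch of Stage 2, assuming the pairwise rotations and translations between neighboring 3D vision systems have already been estimated (e.g., by ICP). The helper names and the neighbor ordering are illustrative assumptions, not the patent's notation.

```python
# Sketch: compose pairwise (R, T) estimates into transforms to the world frame
# anchored at the center vehicle's vision system Cw.
import numpy as np

def to_homogeneous(R, T):
    """Pack a rotation (3x3) and translation (3,) into a 4x4 transform."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = T
    return M

def chain_to_world(pairwise):
    """pairwise: list of (R, T) where element i maps vision system C(i+1)
    into C(i), and C0 is the world (center-vehicle) system Cw.
    Returns 4x4 transforms mapping each Ci into Cw, one per system."""
    transforms = [np.eye(4)]                      # Cw -> Cw
    for R, T in pairwise:
        transforms.append(transforms[-1] @ to_homogeneous(R, T))
    return transforms

def apply(transform, points):
    """Apply a 4x4 transform to an (N, 3) point cloud."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (homo @ transform.T)[:, :3]
```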
  • Figure 7A illustrates an example process for collecting images of roadways, according to various embodiments.
  • Process 700 for collecting images of roadways, which is performed on each proximally operated CA/AD vehicle, includes operations at blocks 702-708.
  • operations at blocks 702-708 may be performed by components of a CA/AD system, e.g., CA/AD system 400 of Figure 4.
  • Process 700 may have more or fewer operations.
  • Process 700 starts at block 702.
  • the CA/AD vehicle is self-localized.
  • Self-localization of a CA/AD vehicle may be performed using a combination of sensor data from GPS/GLONASS and an IMU.
  • With GPS and GLONASS data, a CA/AD vehicle may locate itself very accurately and robustly on the roadways. But it may fail occasionally, when both GPS and GLONASS signals are very seriously occluded. Under that circumstance, IMU data are used for short-term continuous self-localization.
  • the CA/AD vehicle performs camera coarse three-dimensional (3D) position and orientation estimation.
  • the offsets between vehicle camera position/orientation and vehicle’s position/orientation are fixed and can be measured before data capture.
  • only the vehicle’s 3D position and orientation are estimated.
  • Only the coarse 3D position and orientation are estimated, assuming the other proximally operated vehicles are on the same horizontal plane (ground plane). Therefore, only the 2D position and orientation on the ground plane are estimated.
  • the position from vehicle self-location is used as the coarse position.
  • The vehicle self-localization error is generally within 1 m. Though this seems quite good for general vehicle navigation applications, the error is still considered a little large for extrinsic parameter calibration between different vehicle cameras, or the same vehicle camera at different times. Therefore, for the vehicle orientation, the vehicle velocity vector is estimated from the vehicle motion trajectory via a trajectory differencing operation. The velocity direction is taken as the vehicle's coarse orientation. In this way, the vehicle camera's coarse 3D position and orientation are obtained (a minimal sketch of this estimate follows below). In various embodiments, the coarse 3D position and orientation will also be used later, during the offline processing, as initialization parameters to estimate the fine extrinsic parameters.
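  • A minimal sketch of the trajectory-based coarse orientation estimate, assuming ground-plane positions from the self-localization step; the timestamps and trajectory values are illustrative placeholders only.

```python
# Sketch: take the direction of the velocity vector, obtained by differencing
# the self-localization trajectory, as the vehicle's coarse heading.
import numpy as np

def coarse_pose_from_trajectory(positions, timestamps):
    """positions: (N, 2) ground-plane positions from GPS/GLONASS (+IMU).
    timestamps: (N,) seconds. Returns (position, heading_rad) at the last fix."""
    dt = timestamps[-1] - timestamps[-2]
    velocity = (positions[-1] - positions[-2]) / dt    # trajectory differencing
    heading = np.arctan2(velocity[1], velocity[0])     # direction of travel
    return positions[-1], heading

traj = np.array([[0.0, 0.0], [8.3, 0.4], [16.5, 1.1]])   # ~1 s apart, ~30 km/h
pos, heading = coarse_pose_from_trajectory(traj, np.array([0.0, 1.0, 2.0]))
```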
  • process 700 proceeds to blocks 706 and 708.
  • Synchronization signals are sent to, or received from, the other proximally operated CA/AD vehicles to synchronize the capturing of images of the roadways.
  • RGB and LiDAR images of the roadways are captured continuously, with some of the images captured synchronously with the other proximally operated CA/AD vehicles, in response to the synchronization signals.
  • The captured RGB and LiDAR images are outputted and used in real time to detect objects as the CA/AD vehicle travels enroute on the roadways to its destination. Further, the captured RGB and LiDAR images are also outputted for subsequent offline single camera motion based object detection analysis, as well as multi-view object detection analysis.
  • Figure 7B illustrates an example process for single camera motion based object detection.
  • process 720 for single camera motion based object detection includes operations performed at blocks 724-726.
  • the operations may be performed by e.g., ADDG 85 of Figure 1.
  • Process 720 is correspondingly performed on each of the sequences of images collected by the proximally operated CA/AD vehicles.
  • process 720 starts at block 724.
  • moving area detection is performed on a sequence of images captured by an image sensor of a CA/AD vehicle.
  • A point cloud is generated for each frame of a sequence of consecutively captured images. The moving vehicle is thus considered to be a different viewpoint at each different time.
  • The 3D scene is reconstructed. (Construction of the 3D scene will be described more fully below when multi-view object detection is described.)
  • The area with large registration error will be the moving area (including the moving object and background region); a minimal sketch of this moving-area thresholding appears after this discussion.
  • detection of moving object is performed.
  • Three categories are detected: pedestrian, cyclist and vehicle.
  • Vehicle is a broad category which includes sub-categories, such as car, truck, bus, etc.
  • The detection method may be any object detection method known in the art, but specially trained for these three categories of interest. Therefore, it will have higher accuracy and faster speed in detecting moving objects of the three categories of interest. In alternate embodiments, additional categories may be detected.
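  • The moving-area step above can be sketched as follows, assuming the static scene has already been reconstructed and the current frame's points are expressed in the same coordinate system. The threshold value and the use of a KD-tree for the per-point registration error are illustrative choices, not specified by the disclosure.

```python
# Sketch: flag points whose registration error against the reconstructed
# static scene is large as belonging to the moving area.
import numpy as np
from scipy.spatial import cKDTree

def detect_moving_points(frame_points, scene_points, error_threshold=0.5):
    """frame_points: (N, 3) points of the current frame, world coordinates.
    scene_points: (M, 3) reconstructed static scene. Returns a boolean mask of
    points whose nearest-scene-point distance exceeds the threshold."""
    tree = cKDTree(scene_points)
    dist, _ = tree.query(frame_points, k=1)       # per-point registration error
    return dist > error_threshold                 # large error => moving area
```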
  • Figure 7C illustrates an example process for multi-view object detection.
  • process 740 for multi-view object detection includes operations performed at blocks 742-746.
  • the operations may be performed by e.g., ADDG 85 of Figure 1.
  • Process 740 may include more or fewer operations.
  • Process 740 starts at blocks 742 and 746.
  • 3D scenes are reconstructed.
  • all point clouds are also transferred to the world coordinate system and merged together.
  • The point cloud from a single vehicle is sparse, and its individual field of view is small.
  • The merged 3D point cloud has a larger field of view, less occlusion and higher resolution.
  • the merged point clouds are processed to remove the redundant points and keep the details.
  • The redundant points may be removed, with the details kept, in accordance with the method described in H. Pfister, M. Zwicker, J. van Baar and M. Gross, "Surfels: Surface elements as rendering primitives," Proc. of SIGGRAPH, 2000.
  • other redundant point removal methods may be employed instead.
  • Upon removal of the redundant points, a 3D point cloud of the whole environment is obtained. A sketch of the merging and redundancy removal is shown below.
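  • A minimal sketch of merging the per-vehicle point clouds in the world frame and removing redundant points. The patent cites the surfel-based method of Pfister et al.; the voxel-grid deduplication below is a simpler stand-in used purely for illustration, and the voxel size is an assumed parameter.

```python
# Sketch: merge point clouds already transformed to the world coordinate
# system, then keep one point per occupied voxel to drop redundant points.
import numpy as np

def merge_and_deduplicate(clouds_world, voxel_size=0.05):
    """clouds_world: list of (Ni, 3) point clouds in the world frame.
    Returns a merged cloud with a larger field of view and fewer duplicates."""
    merged = np.vstack(clouds_world)
    keys = np.floor(merged / voxel_size).astype(np.int64)   # voxel index per point
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(first_idx)]
```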
  • process 740 may proceed to block 746.
  • 3D object detection is performed.
  • the 3D objects are detected in the merged point cloud.
  • the 3D object is represented as a 3D bounding box with facing orientation.
  • a deep learning based method is used to detect the vehicles, pedestrians, cyclists, traffic signs and signals in the 3D space of the merged point cloud.
  • The deep learning based method may be the method described in Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong and Ingmar Posner, "Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks," 2017 IEEE International Conference on Robotics and Automation (ICRA 2017).
  • In alternate embodiments, other 3D object detection methods may be employed instead.
  • 3D vehicle projection is performed.
  • Some of the data capturing vehicles are in the field of view of the merged point cloud.
  • their 3D position and orientation in the world coordinate system are known.
  • Information such as the model, size and even 3D shape of these CA/AD vehicles is also known, so it is not necessary to detect them.
  • Their known positions, sizes and facing orientations are directly added to the ground-truth list of the autonomous driving dataset being generated.
  • process 800 for merging the results of object detection from different methods includes the operations performed at blocks 802-804.
  • the operations may be performed by e.g., ADDG 85 of Figure 1.
  • Process 800 may include more or fewer operations.
  • Process 800 starts at block 802.
  • the results of the real time local object detection by the CA/AD vehicles, the results of the single camera motion based object detection analysis, and the results of the multi-view object detection analysis are merged together.
  • The results from the three methods are merged together with a non-maximal suppression method.
  • The non-maximal suppression method is used to transform a smooth response map that triggers many imprecise object window hypotheses into, ideally, a single bounding box for each detected object.
  • The results of the real time local object detection are considered to have the highest confidence.
  • The results of the motion based object detection analysis have a middle level of confidence, while the results of the multi-view object detection analysis have the lowest confidence. A sketch of this merging is shown below.
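  • A minimal sketch of merging detections from the three methods with non-maximal suppression, ranking the sources as described (real-time local detection highest, motion-based middle, multi-view lowest). The box format, dictionary keys and IoU threshold are illustrative assumptions, not part of the disclosure.

```python
# Sketch: source-ranked non-maximal suppression over 2D boxes.
import numpy as np

SOURCE_CONFIDENCE = {"local_realtime": 3, "motion_based": 2, "multi_view": 1}

def iou(a, b):
    """a, b: [x1, y1, x2, y2] axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(detections, iou_threshold=0.5):
    """detections: list of dicts {"box": [...], "label": str, "source": str}.
    Keeps the highest-confidence hypothesis among overlapping detections."""
    ordered = sorted(detections,
                     key=lambda d: SOURCE_CONFIDENCE[d["source"]], reverse=True)
    kept = []
    for det in ordered:
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```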
  • The detected 3D objects are back projected to the vehicles' coordinate systems.
  • The target of these operations is to get the 3D object location and orientation in each vehicle's coordinate system, and to back-project the 3D detection results to the original vehicle camera coordinate system.
  • the 3D object coordinates and orientation are transferred to the original vehicle camera’s 3D coordinate system via the rotation and translation.
  • The rotation and translation matrix is the inverse of the equation from the multiple vehicle camera calibration.
  • the 2D ground-truth of the original 3D vision coordinate system is calculated.
  • The vertices and edges of the 3D objects are projected onto the 2D image plane via a perspective projection model, as sketched below.
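  • A minimal sketch of back-projecting a detected 3D bounding box into one vehicle camera's coordinate system (using the inverse of its camera-to-world calibration transform) and then onto the 2D image plane with intrinsics K. The box parameterization (center, size, yaw) is an illustrative assumption.

```python
# Sketch: project the 8 vertices of a 3D box into a camera's 2D image plane.
import numpy as np

def box_corners(center, size, yaw):
    """8 corners of a 3D box in world coordinates (center (3,), size (l, w, h), yaw)."""
    l, w, h = size
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    z = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2
    Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                   [np.sin(yaw),  np.cos(yaw), 0],
                   [0,            0,           1]])
    return (np.stack([x, y, z], axis=1) @ Rz.T) + center

def project_to_image(corners_world, world_to_cam, K):
    """world_to_cam: 4x4 inverse of the camera-to-world calibration transform.
    Returns (8, 2) pixel coordinates of the box vertices (the 2D ground truth)."""
    homo = np.hstack([corners_world, np.ones((8, 1))])
    cam = (homo @ world_to_cam.T)[:, :3]           # into the camera's 3D frame
    uv = cam @ K.T                                 # pinhole projection
    return uv[:, :2] / uv[:, 2:3]                  # perspective divide
```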
  • Example neural network 900 may be suitable for use e.g., by object detection subsystem 140 of Figure 1 or object detection subsystem 406 of Figure 4.
  • example neural network 900 may be a multilayer feedforward neural network (FNN) comprising an input layer 912, one or more hidden layers 914 and an output layer 916.
  • Input layer 912 receives data of input variables (x i ) 902.
  • Hidden layer(s) 914 process the inputs, and eventually, output layer 916 outputs the determinations or assessments (y i) 904.
  • The input variables (x i) 902 of the neural network are set as a vector containing the relevant variable data, while the output determination or assessment (y i) 904 of the neural network is likewise a vector.
  • Multilayer feedforward neural network may be expressed through the following equations:
  • ho i and y i are the hidden layer variables and the final outputs, respectively.
  • f() is typically a non-linear function, such as the sigmoid function or the rectified linear unit (ReLU) function, that mimics the neurons of the human brain.
  • R is the number of inputs.
  • N is the size of the hidden layer, or the number of neurons.
  • S is the number of the outputs.
  • the goal of the FNN is to minimize an error function E between the network outputs and the desired targets, by adapting the network variables iw, hw, hb, and ob, via training, as follows:
  • y kp and t kp are the predicted and the target values of pth output unit for sample k, respectively, and m is the number of samples.
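  • The equations referred to above can be written, consistent with the variable definitions given here (input weights iw, hidden weights hw, hidden biases hb, output biases ob), in the standard feedforward form below; this is offered as a reconstruction for readability, not the verbatim equations of the disclosure.

$$ho_i = f\left(\sum_{k=1}^{R} iw_{i,k}\, x_k + hb_i\right), \quad i = 1, \ldots, N$$

$$y_i = f\left(\sum_{k=1}^{N} hw_{i,k}\, ho_k + ob_i\right), \quad i = 1, \ldots, S$$

$$E = \sum_{k=1}^{m} E_k, \qquad E_k = \sum_{p=1}^{S} \left(t_{kp} - y_{kp}\right)^2$$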
  • input variables (x i ) 902 may include various sensor data collected by various vehicles sensors, as well as data describing relevant factors to object detection.
  • The output variables (y i) 904 may include the objects detected, e.g., a pedestrian, a vehicle, a bicyclist, a traffic sign, a traffic light, and so forth.
  • the network variables of the hidden layer (s) for the neural network are determined by the training data.
  • The neural network can be of some other type of topology, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and so forth.
  • IVS or CA/AD system 1000 which could be IVS or CA/AD system 100 or 400, includes hardware 1002 and software 1010.
  • Software 1010 includes hypervisor 1012 hosting a number of virtual machines (VMs) 1022 -1028.
  • Hypervisor 1012 is configured to host execution of VMs 1022-1028.
  • the VMs 1022-1028 include a service VM 1022 and a number of user VMs 1024-1028.
  • Service VM 1022 includes a service OS hosting execution of a number of instrument cluster applications 1032.
  • User VMs 1024-1028 may include a first user VM 1024 having a first user OS hosting execution of front seat infotainment applications 1034, a second user VM 1026 having a second user OS hosting execution of rear seat infotainment applications 1036, a third user VM 1028 having a third user OS hosting execution of navigation and object detection subsystems and ADDG Agent 1038, and so forth.
  • hypervisor 1012 may be any one of a number of hypervisors known in the art, such as KVM, an open source hypervisor, Xen, available from Citrix Inc, of Fort Lauderdale, FL., or VMware, available from VMware Inc of Palo Alto, CA, and so forth.
  • service OS of service VM 1022 and user OS of user VMs 1024-1028 may be any one of a number of OS known in the art, such as Linux, available e.g., from Red Hat Enterprise of Raleigh, NC, or Android, available from Google of Mountain View, CA.
  • computing platform 1100 which may be hardware 1002 of Figure 10, or a computing platform of one of the servers 60 of Figure 1.
  • computing platform 1100 includes one or more system-on-chips (SoCs) 1102, ROM 1103 and system memory 1104.
  • SoCs 1102 may include one or more processor cores (CPUs) , one or more graphics processor units (GPUs) , one or more accelerators, such as computer vision (CV) and/or deep learning (DL) accelerators.
  • ROM 1103 may include basic input/output system services (BIOS) 1105.
  • CPUs, GPUs, and CV/DL accelerators may be any one of a number of these elements known in the art.
  • ROM 1103 and BIOS 1105 may be any one of a number of ROMs and BIOSes known in the art.
  • system memory 1104 may be any one of a number of volatile storage devices known in the art.
  • one of the CV/DL accelerators may be used to implement the object detection subsystem of a CA/AD system.
  • computing platform 1100 may include persistent storage devices 1106.
  • Example of persistent storage devices 1106 may include, but are not limited to, flash drives, hard drives, compact disc read-only memory (CD-ROM) and so forth.
  • computing platform 1100 may include one or more input/output (I/O) interfaces 1108 to interface with one or more I/O devices, such as sensors 1120.
  • I/O devices may include, but are not limited to, display, keyboard, cursor control and so forth.
  • Computing platform 1100 may also include one or more communication interfaces 1110 (such as network interface cards, modems and so forth) . Communication devices may include any number of communication and I/O devices known in the art.
  • Examples of communication devices may include, but are not limited to, networking interfaces for Near Field Communication (NFC) , WiFi, Cellular communication (such as LTE 4G/5G) and so forth.
  • system bus 1111 may represent one or more buses. In the case of multiple buses, they may be bridged by one or more bus bridges (not shown) .
  • ROM 1103 may include BIOS 1105 having a boot loader.
  • System memory 1104 and mass storage devices 1106 may be employed to store a working copy and a permanent copy of the programming instructions implementing the operations associated with hypervisor 1012 (including, for some embodiments, functions associated with ADDG 85 or the ADDG agent 150/408), the service/user OS of service/user VMs 1022-1028, and components of navigation subsystem 1038, collectively referred to as computational logic 1122.
  • the various elements may be implemented by assembler instructions supported by processor core (s) of SoCs 1102 or high-level languages, such as, for example, C, that can be compiled into such instructions.
  • the present disclosure may be embodied as methods or computer program products. Accordingly, the present disclosure, in addition to being embodied in hardware as earlier described, may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to as a “circuit, ” “module” or “system. ” Furthermore, the present disclosure may take the form of a computer program product embodied in any tangible or non-transitory medium of expression having computer-usable program code embodied in the medium.
  • Non-transitory computer-readable storage medium 1202 may include a number of programming instructions 1204.
  • Programming instructions 1204 may be configured to enable a device, e.g., computing platform 1100, in response to execution of the programming instructions, to implement (aspects of) hypervisor 1012 (including, for some embodiments, functions associated with the ADDG or the ADDG agent), the service/user OS of service/user VMs 1022-1028, or components of navigation subsystem 1038.
  • programming instructions 1204 may be disposed on multiple computer-readable non-transitory storage media 1202 instead.
  • programming instructions 1204 may be disposed on computer-readable transitory storage media 1202, such as, signals.
  • the computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium.
  • the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM) , a read-only memory (ROM) , an erasable programmable read-only memory (EPROM or Flash memory) , an optical fiber, a portable compact disc read-only memory (CD-ROM) , an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device.
  • a computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
  • a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave.
  • the computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
  • Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) .
  • LAN local area network
  • WAN wide area network
  • These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function (s) .
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Embodiments may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product of computer readable media.
  • the computer program product may be a computer storage medium readable by a computer system and encoding computer program instructions for executing a computer process.
  • Example 1 is a method for generating an autonomous driving dataset for training computer-assisted or autonomous driving (CA/AD) systems of CA/AD vehicles, comprising: proximally operating a plurality of CA/AD vehicles on a plurality of roadways; collecting a plurality of sequences of images of the plurality of roadways with image sensors disposed in the plurality of proximally operated CA/AD vehicles, including synchronously collecting some of the images by the image sensors; correspondingly processing the plurality of sequences of images collected by the CA/AD systems of the CA/AD vehicles to detect objects on the plurality of roadways; individually processing the sequences of images collected to detect objects on the plurality of roadways via single camera motion based object detection analysis; collectively processing the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis; and generating the autonomous driving dataset based at least in part on the object detection results of the corresponding, individual and collective processing of the sequence of images.
  • CA/AD computer-assisted or autonomous driving
  • Example 2 is example 1, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises establishing inter-vehicle communication between the proximally operated plurality of CA/AD vehicles, and dynamically selecting one of the plurality of CA/AD vehicles as a master vehicle among the plurality of CA/AD vehicles to coordinate at least in part the collecting of the plurality of sequences of images of the plurality of roadways with the image sensors disposed in the plurality of proximally operated CA/AD vehicles.
  • Example 3 is example 2, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises the master vehicle sending synchronization signals to the other CA/AD vehicles to synchronize at least in part the collection of images for the multi-view object detection analysis.
  • Example 4 is example 1, wherein generating comprises merging the object detection results of the corresponding, individual and collective processing of the sequence of images with a non-maximal suppression method.
  • Example 5 is example 4, wherein generating further comprises back projecting objects in the merged object detection results into respective coordinate systems of the CA/AD vehicles.
  • Example 6 is any one of examples 1-5, further comprising correspondingly calibrating the image sensors of the plurality of CA/AD vehicles, as well as cross calibrating the image sensors of pairs of neighboring CA/AD vehicles.
  • Example 7 is example 6, further comprising, on correspondingly calibrating the image sensors of the plurality of CA/AD vehicles, generating independent three dimensional (3D) point cloud coordinate systems with the 3D coordinate systems of the two dimensional image sensors.
  • Example 8 is example 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles comprises estimating rotation and translation between the image sensors of the pair of neighboring CA/AD vehicles.
  • Example 9 is example 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles further comprises setting 3D coordinates of the image sensor of the CA/AD vehicle to be operated in substantially the center among the proximally operated CA/AD vehicles, as the world coordinate system.
  • Example 10 is example 9, wherein if Cw represents the world coordinate system, and C1 and Cw, C1 and C2 are neighboring coordinate systems, the relationship of the extrinsic parameter calibration is governed by the following equations: Cw = R1*C1 + T1 and C1 = R2*C2 + T2, where (R1, T1) is the rotation and translation between C1 and Cw, and (R2, T2) is the rotation and translation between C1 and C2.
  • Example 11 is a computer-assisted or autonomous driving (CA/AD) system for a CA/AD vehicle comprising: a sensor interface and an input/output (I/O) interface; and an autonomous driving dataset generator (ADDG) agent coupled with the sensor interface, and the I/O interface; wherein the ADDG agent, via the sensor interface, is to forward synchronization signals to an image sensor of the CA/AD vehicle, and to receive a sequence of images of a plurality of roadways collected by the image sensor, at least some of the received images being collected synchronously with image collections on one or more other proximally operated CA/AD vehicles, based at least in part on the synchronization signals; and wherein the ADDG agent, via the I/O interface, is to output the received sequence of images to an ADDG to process the sequence of images to detect objects on the plurality of roadways in a plurality of manners, and to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the plurality of manners of processing.
  • Example 12 is example 11, further comprising an inter-vehicle communication interface coupled to the ADDG agent, wherein the ADDG agent, via the inter-vehicle communication interface, is to send or receive the synchronization signals to or from the one or more other proximally operated CA/AD vehicles, to synchronize collections of some of the images among the CA/AD vehicle and the one or more other proximally operated CA/AD vehicles.
  • Example 13 is example 11, further comprising an object detection subsystem coupled to the sensor interface; wherein the object detection subsystem, via the sensor interface, is also to receive the sequence of images of the plurality of roadways collected by the image sensor, and locally detect objects on the plurality of roadways based at least in part on the images; wherein the ADDG agent is to further output the results of the local detection, via the I/O interface, to the ADDG, which further bases its generation of an autonomous driving dataset with automated object labelling on the results of the local detection of objects on the plurality of roadways.
  • Example 14 is example 11, wherein the ADDG agent is further arranged to determine a geographic location of the CA/AD vehicle based on geolocation data provided by a global positioning system disposed on the CA/AD vehicle or motion data provided by an inertial measurement unit of the CA/AD vehicle.
  • Example 15 is any one of examples 11-14, wherein the ADDG agent is further arranged to estimate a three dimensional (3D) location and orientation of the image sensor of the CA/AD vehicle via coarse estimation of a 3D location and orientation of the CA/AD vehicle that includes estimation of a two dimensional (2D) location and orientation of the CA/AD vehicle on a ground plane.
  • 3D three dimension
  • 2D two dimension
  • Example 16 is at least one computer-readable medium (CRM) having instructions stored therein to cause a computing system, in response to execution of the instructions by a processor of the computing system, to operate an autonomous driving dataset generator (ADDG) to: individually process a plurality of sequences of images collected by image sensors of a plurality of proximally operated computer-assisted or autonomous driving (CA/AD) vehicles to detect objects on a plurality of roadways via single camera motion based object detection analysis, including detection of moving areas within the images; collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, including reconstruction of three dimensional (3D) scenes within the images; and generate an autonomous driving dataset with automated object labelling, based at least in part on results of the individual and collective processing of the sequence of images.
  • CRM computer-readable medium
  • Example 17 is example 16, wherein the computing system is further caused to operate the ADDG to generate a plurality of independent 3D point cloud coordinate systems corresponding to the image sensors of the proximally operated CA/AD vehicles for use to cross calibrate image sensors of pairs of neighboring CA/AD vehicles.
  • Example 18 is example 16, wherein to individually process the plurality of sequences of images collected by image sensors of the plurality of proximally operated CA/AD vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis includes detection of pedestrians, cyclists and vehicles within the detected moving areas of the images.
  • Example 19 is example 16, wherein to collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, further includes to represent detected objects with 3D bounding boxes having facing orientations.
  • Example 20 is example 16, wherein reconstruction of 3D scenes within the images comprises transfer of all coordinate systems of the image sensors of the CA/AD vehicles to a world coordinate system, as well as transfer of all point clouds of the image sensors of the CA/AD vehicles to the world coordinate system, and merger of the transferred point clouds.
  • Example 21 is example 16, wherein the computing system is further caused to operate the ADDG to perform 3D projections of the CA/AD vehicles including position, size and facing orientation of the CA/AD vehicles.
  • Example 22 is any one of examples 16-21, wherein to generate the autonomous driving dataset with automated object labelling includes to merge 3D object detection results of the individual and collective processing of the sequence of images.
  • Example 23 is example 22, wherein the computing system is further caused to operate the ADDG to receive local object detection results for the plurality of roadways by the CA/AD vehicles; and wherein to merge further includes to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images.
  • Example 24 is example 23, wherein to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images comprises to merge the local object detection results, and the 3D object detection results of the individual and collective processing of the sequence of images using a non-maximal suppression method.
  • Example 25 is example 23, wherein to generate the autonomous driving dataset with automated object labelling further includes to back project the merged 3D object detection results to 3D ground truth in each CA/AD vehicle’s coordinate system.

Abstract

Apparatuses, storage media and methods associated with computer assisted or autonomous driving (CA/AD) are disclosed herein. A method comprises correspondingly processing a plurality of sequences of images collected by the CA/AD systems (100, 400) of the CA/AD vehicles (352a-352c) to detect objects (70) on a plurality of roadways; individually processing the sequences of images collected to detect objects (70) on the plurality of roadways via single camera motion based object detection analysis; collectively processing the sequences of images collected to detect objects (70) on the plurality of roadways via multi-view object detection analysis; and generating the autonomous driving dataset based at least in part on the object detection results of the corresponding, individual and collective processing of the sequences of images.

Description

AUTONOMOUS DRIVING DATASET GENERATION WITH AUTOMATIC OBJECT LABELLING METHODS AND APPARATUSES
Technical Field
The present disclosure relates to the field of computer-assisted or autonomous driving (CA/AD) . More particularly, the present disclosure relates to generation of CA/AD training or reference datasets, including automatic object labelling.
Background
Autonomous driving has been researched for many years. Besides the traditional car manufacturing companies, high-tech companies have taken a strong interest in developing autonomous driving solutions, including Waymo (Google) , Uber, NVidia, and Intel. The most famous project is probably Google's self-driving project, which began in 2009 and has recently released a driver-less taxi service for the Phoenix residential area.
Among the important technologies for autonomous driving is vision based environment perception. With camera inputs, the autonomous driving vehicle is to recognize the road, traffic signs, cars, trucks, pedestrians and other objects on the roadway. The most popular approach to address this challenge is probably the data-driven machine learning solution. Extremely large training datasets with labeled ground-truth are crucial to train object detectors, in order to provide the required robustness and accuracy. But real public roads are very complex, and the captured training images are affected by many factors, including seasons, weather, illumination, view-points, occlusion, etc.
Currently, the most popular public benchmark training dataset for autonomous driving is KITTI, a project of the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, which was captured in 5 days and comprises 389 sequences covering a distance of 39.2 km. More than 200k three dimensional (3D) object annotations were labeled manually. For Mobileye, the ground-truth was also labeled manually. The Oxford RobotCar Dataset has 20TB of driving data over 1000 km, much larger than KITTI, but there is no ground-truth information for 3D objects. Thus, the amount of data in these or other widely used training datasets still appears not to be enough to guarantee robust perception algorithms. For example, a Tesla driver was killed in a crash with Autopilot active on May 7, 2016. According to the American National Highway Traffic Safety Administration, the Tesla Model S misrecognized a truck as a brightly lit sky. One of the possible reasons for the misrecognition is that this type of scenario had never appeared in the training dataset, which suggests Tesla's training dataset may be insufficient.
In order to collect broad enough training datasets, a range of data collection development vehicles have begun to operate on real public roads. Real image sequences covering millions of miles have been captured. But the main limit for extremely large training datasets is the manual labeling of ground-truth. Labeling 3D objects for these large training datasets, including roads, road markings, signals, pedestrians and other objects, is very time-consuming and expensive. According to Amnon Shashua, 800 persons were labeling image data for Mobileye in 2016. More than 200k 3D object annotations were labeled manually. Even this massive investment in labelling effort can only process a very small part of the captured images. Some training datasets therefore do not provide 3D object ground-truth information, such as the Oxford RobotCar Dataset, which has 20TB of driving data over 1000 km, much larger than KITTI, but no ground-truth information for 3D objects.
Note that a benchmark training dataset may also be referred to as a benchmark reference dataset, or simply a training or reference dataset. Hereinafter, it may also simply be referred to as a “dataset. ”
Brief Description of the Drawings
Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Figure 1 illustrates an overview of an environment for incorporating and using the autonomous driving dataset generation with automatic object labelling technology of the present disclosure, in accordance with various embodiments.
Figure 2 illustrates an overview of the generation of autonomous driving dataset with automatic object labelling, according to various embodiments.
Figure 3 illustrates multi-view capturing of images of roadways of the present disclosure, according to various embodiments.
Figure 4 illustrates a component view of an example computer-assisted/autonomous driving system, according to various embodiments.
Figure 5 illustrates an example process for generating an autonomous driving dataset with automatic object labelling, according to various embodiments.
Figure 6 illustrates an example process for calibrating image sensors of the data capturing CA/AD vehicles, according to various embodiments.
Figures 7A-7C illustrate respective example processes for real time and local collecting of images and detecting of objects in roadways, single camera motion based object detection and multi-view object detection, according to various embodiments.
Figure 8 illustrates an example process for merging the results of object detection of various methods, according to various embodiments.
Figure 9 illustrates an example neural network suitable for use by the object detection subsystem, according to various embodiments.
Figure 10 illustrates a software component view of the in-vehicle (CA/AD) system, according to various embodiments.
Figure 11 illustrates a hardware component view of a computing platform, suitable for use as an in-vehicle (CA/AD) system or a cloud server, according to various embodiments.
Figure 12 illustrates a storage medium having instructions for practicing aspects of the methods described with references to Figures 1-8, according to various embodiments.
Detailed Description
Disclosed herein are novel methods, apparatuses and computer-readable storage media (CRM) associated with generation of autonomous driving datasets, including automatic labelling of 3D objects, to address the challenges discussed in the background section. In various embodiments, multiple methods are applied to detect and automatically label objects. One of the methods is based on real time local object detection by the data capturing vehicles themselves. Another method is based on single camera motion based object detection analysis. A third method is multi-view object detection using sequences of images collectively captured by a multi-view vision system constituted by the proximally operated data collection or capturing vehicles. The results of object detection are merged together to provide the automatic object labelling in the generated autonomous driving dataset. With the merger of redundant results from multiple methods, high accuracy can be achieved. Experience has shown that the approach provides much better performance than traditional approaches.
More specifically, in various embodiments, a process for generating an autonomous driving dataset for training computer-assisted or autonomous driving (CA/AD) systems of CA/AD vehicles, comprises proximally operating a plurality of CA/AD vehicles on a plurality of roadways; and collecting a plurality of sequences of images of the plurality of roadways with image sensors disposed in the plurality of proximally operated CA/AD vehicles, including synchronously collecting some of the images by the image sensors. Additionally, the process includes correspondingly processing the plurality of sequences of images collected by the CA/AD systems of the CA/AD vehicles to detect objects on the plurality of roadways; individually processing the sequences of images collected to detect objects on the plurality of roadways via single camera motion based object detection analysis; and collectively processing the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis. Further, the process includes generating the autonomous driving dataset with automatic object labelling based at least in part on the object detection results of the corresponding, individual and collective processing of the sequence of images.
In various embodiments, according to the final result, 2D projections onto the original images are also generated as automatic 2D ground-truth, which is very convenient for manual checking or later processing.
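By way of illustration only, the following simplified sketch shows how such a 2D projection may be computed from a labelled 3D bounding box; the function name, and the assumption that the calibration supplies a pinhole intrinsic matrix K and a world-to-camera rotation/translation (R, T), are illustrative and not mandated by the present disclosure.

```python
import numpy as np

def project_box_to_image(corners_3d, K, R, T):
    """Project the 8 corners of a 3D bounding box (world coordinates, shape
    (8, 3)) into pixel coordinates and return the enclosing 2D box.

    K: 3x3 camera intrinsic matrix; R, T: extrinsic rotation (3x3) and
    translation (3,) from world coordinates to camera coordinates."""
    cam = (R @ corners_3d.T) + T.reshape(3, 1)   # world -> camera coordinates
    uv = K @ cam                                  # camera -> image plane
    uv = uv[:2] / uv[2]                           # perspective divide
    x_min, y_min = uv.min(axis=1)
    x_max, y_max = uv.max(axis=1)
    return x_min, y_min, x_max, y_max             # automatic 2D ground-truth box
```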
In various embodiments, a computer-assisted or autonomous driving (CA/AD) system for a CA/AD vehicle comprises: a sensor interface and an input/output (I/O) interface; and an autonomous driving dataset generator (ADDG) agent coupled with the sensor interface, and the I/O interface. The ADDG agent, via the sensor interface, is to forward synchronization signals to an image sensor of the CA/AD vehicle, and to receive a sequence of images of a plurality of roadways collected by the image sensor, at least some of received images being collected synchronously with image collections on one or more other proximally operated CA/AD vehicles, based at least in part on the synchronization signals. Further, the ADDG agent, via the I/O interface, is to output the received sequence of images to an ADDG to process the sequence of images to detect objects on the plurality of roadways under a plurality of manners, and to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the plurality of manners of processing.
In various embodiments, at least one computer-readable medium (CRM) having instructions stored therein to cause a computing system (e.g., a server) , in response to execution of the instructions by a processor of the computing system, to operate an autonomous driving dataset generator (ADDG) to: individually process a plurality of sequences of images collected by image sensors of a plurality of proximally operated computer-assisted or autonomous driving (CA/AD) vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis, including individual calibration of the image sensors and detection of moving areas within the images; and collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, including cross calibration of the image sensors and reconstruction of 3D scenes within the images. Further, the computing system is caused to operate the ADDG to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the individual and collective processing of the sequence of images.
Though fully-automatic methods may not achieve 100% recall and precision rates, they can still save much effort. As the size of the datasets is extremely large, even a very small portion of automation will have a significant impact on cost and effort.
In the following detailed description, these and other aspects of the autonomous driving dataset generation, including automatic labelling of 3D objects technology will be further described. References will be made to the accompanying drawings which form a part hereof wherein like numerals designate like parts throughout, and in which is shown by way of illustration embodiments that may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and their equivalents.
Aspects of the disclosure are disclosed in the accompanying description. Alternate embodiments of the present disclosure and their equivalents may be devised without parting from the spirit or scope of the present disclosure. It should be noted that like elements disclosed below are indicated by like reference numbers in the drawings.
Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be  performed in the order of presentation. Operations described may be performed in a different order than the described embodiment. Various additional operations may be performed and/or described operations may be omitted in additional embodiments.
For the purposes of the present disclosure, the phrase “A and/or B” means (A) , (B) , or (A and B) . For the purposes of the present disclosure, the phrase “A, B, and/or C” means (A) , (B) , (C) , (A and B) , (A and C) , (B and C) , or (A, B and C) . The description may use the phrases “in an embodiment, ” or “In some embodiments, ” which may each refer to one or more of the same or different embodiments. Furthermore, the terms “comprising, ” “including, ” “having, ” and the like, as used with respect to embodiments of the present disclosure, are synonymous.
As used herein, the term “module” or “engine” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC) , an electronic circuit, a processor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
Referring now to Figure 1, wherein an overview of an environment incorporating and using the autonomous driving dataset generation with automatic object labelling technology of the present disclosure, in accordance with various embodiments, is illustrated. As shown, for the illustrated embodiments, example environment 50 includes vehicle 52. Vehicle 52 includes an engine, transmission, axles, wheels and so forth (not shown) . Further, vehicle 52 includes an in-vehicle system (IVS) (also referred to as computer-assisted or autonomous driving (CA/AD) system) 100, sensors 110, and driving control units (DCU) 120. In various embodiments, IVS or CA/AD system 100 includes, in particular, navigation subsystem 130, object detection subsystem 140, and autonomous driving dataset generator (ADDG) agent 150. ADDG agent 150 is configured to complement an ADDG, e.g., ADDG 85, disposed e.g., in server 60, to generate autonomous driving datasets to train CA/AD systems of CA/AD vehicles, e.g., object detection subsystems of the CA/AD systems (such as object detection subsystem 140 of CA/AD system 100) . ADDG agent 150 and ADDG 85 are incorporated with technology of the present disclosure to enable the autonomous driving datasets to be generated with automatic object labelling, described more fully below.
In various embodiments, navigation subsystem 130 may be configured to provide navigation guidance or control, depending on whether CA/AD vehicle 52 is a computer-assisted vehicle, or a partially or fully autonomous driving vehicle. Object detection subsystem 140 may be configured with computer vision to recognize stationary or moving objects 70 (such as travelers, other vehicles, bicycles, street signs, traffic lights, and so forth) in a rolling area surrounding CA/AD vehicle 52, based at least in part on sensor data collected by sensors 110, as it travels enroute on the roadways to its destination. In various embodiments, in response to the stationary or moving objects recognized in the rolling area surrounding CA/AD vehicle 52, CA/AD system 100 makes decisions in guiding or controlling the DCUs of CA/AD vehicle 52 to drive or assist in driving CA/AD vehicle 52 to its destination.
In various embodiments, sensors 110 include one or more high resolution red/green/blue (RGB) and light detection and ranging (LiDAR) image sensors (cameras) (not shown) to capture a plurality of sequences of images of the rolling surrounding area of CA/AD vehicle 52, as it travels enroute on the roadways to its destination. In various embodiments, sensors 110 may also include accelerometers, gyroscopes, Global Positioning System (GPS) circuitry, Globalnaya Navigazionnaya Sputnikovaya Sistema, or Global Navigation Satellite System (GLONASS) circuitry, and so forth.
Examples of driving control units (DCU) may include control units for controlling the engine, transmission, and brakes of CA/AD vehicle 52. In various embodiments, in addition to navigation subsystem 130, object detection subsystem 140, and ADDG agent 150, IVS or CA/AD system 100 may further include a number of infotainment subsystems/applications, e.g., instrument cluster subsystem/applications, front-seat infotainment subsystems/applications, such as a navigation subsystem/application, a media subsystem/application, a vehicle status subsystem/application and so forth, and a number of rear seat entertainment subsystems/applications (not shown) .
In various embodiments, IVS or CA/AD 100, on its own or in response to user interactions, communicates or interacts 54 with one or more remote/cloud servers 60. Remote/cloud servers 60 may include any one of a number of driving assistance (such as map) or content provision (such as multi-media infotainment) services 80. In various embodiments, as described earlier, remote/cloud servers 60 include, in particular, ADDG 85 to generate autonomous driving datasets with automatic object labelling. Except for ADDG 85, driving assistance (such as map) or content provision (such as multi-media infotainment) services 80 may be any one of a number of such services known in the art.
In various embodiments, IVS or CA/AD 100 communicates 54 with server 60 via cellular communication, e.g., via a wireless signal repeater or base station on transmission tower 56 near vehicle 52. Examples of private and/or public wired and/or wireless networks 58 may include the Internet, the network of a cellular service provider, and so forth. It is to be understood that transmission tower 56 may be different towers at different times/locations, as vehicle 52 travels enroute to its destination. In various embodiments, IVS or CA/AD 100 communicates with servers 60 via wired communication, such as Ethernet, or removable storage media, such as solid state drives, disks or tapes.
Except for the autonomous driving dataset generation with automatic object labelling technology of the present disclosure, IVS or CA/AD system 100, CA/AD vehicle 52, servers 60 and driving assistance and/or content services 80 may otherwise be any one of a number of in-vehicle systems, CA/AD vehicles (from computer-assisted to partially or fully autonomous vehicles), servers, and driving assistance/content services known in the art. These and other aspects of the autonomous driving dataset generation with automatic object labelling technology will be further described with references to the remaining Figures.
Referring now to Figure 2, wherein an overview of the generation of an autonomous driving dataset with automatic object labelling, according to various embodiments, is illustrated. As shown, for the illustrated embodiments, the final object detection results 208 included in the generated autonomous driving dataset are a merger of the results of different object detection methods. In the case of the illustrated embodiments, the final object detection results 208 are a merger of the results of three object detection methods. In alternate embodiments, the final object detection results 208 may be a merger of the results of more or fewer object detection methods. Generally, the final object detection results 208 will be more accurate if they are a merger of the results of more, as opposed to fewer, object detection methods.
For the illustrated embodiments, the results of the different object detection methods include:
-the results of real time local object detection 202 made by the object detection subsystems of the data capturing CA/AD vehicles themselves;
-the results of object detection 204 obtained via offline single camera motion based object detection analysis; and
-the results of object detection 206 obtained via offline multi-view object detection analysis.
The results of the single camera motion based object detection analysis are obtained by individually processing the sequence of images captured by the image sensors of data capturing CA/AD vehicles. They may be the same sequence of images used by the object detection subsystems of the data capturing CA/AD vehicles to perform real time local object detections, as the CA/AD vehicles travel enroute to their destinations.
The results of the multi-view object detection analysis are obtained by collectively processing the sequences of images synchronously captured by the image sensors of the data capturing CA/AD vehicles. The combined image sensors of the data capturing CA/AD vehicles may collaboratively provide point clouds with a large view field, less occlusion and high resolution. This may improve object detection greatly because of the following merits:
● With a large view field and less occlusion, more parts of the objects will be observed. For partial object detection, the detection rate on a large portion of an object is much higher than on a small portion.
● In 3D space, it is very easy to remove cluttered background and segment the object itself.
● For 3D object detection, it is known that the 3D shape information will compensate for insufficient texture, increasing the detection rate and decreasing the false alarm rate simultaneously.
● High resolution is very helpful for small object detection.
Referring now to Figure 3, wherein multi-view capturing of images of roadways of the present disclosure, according to various embodiments, is illustrated. As shown, images for multi-view object detection analysis are collected using image sensors 356a-356c correspondingly disposed in a number of proximally operated CA/AD vehicles 352a-352c. Image sensors 356a-356c of proximally operated CA/AD vehicles 352a-352c periodically capture images of the roadways synchronously. In various embodiments, each of CA/AD vehicles 352a-352c may be an instance of CA/AD vehicle 52 of Figure 1.
In various embodiments, proximally operated CA/AD vehicles 352a-352c are equipped with inter-vehicle communication, e.g., WiFi. For these embodiments, proximally operated CA/AD vehicles 352a-352c are further equipped with intelligence to negotiate with each other, and elect one of the proximally operated CA/AD vehicles 352a-352c as a master vehicle, for coordinating the capturing of the roadway images. In various embodiments, when it is time to take an image, the master vehicle sends synchronization signals 354a-354b to the other proximally operated CA/AD vehicles 352a-352c. For these embodiments, the centrally disposed one of the proximally operated CA/AD vehicles 352a-352c, such as CA/AD vehicle 352b, may be elected as the master vehicle.
In alternate embodiments, the capturing of the multi-view images may be coordinated or synchronized in other manners. For example, the proximally operated CA/AD vehicles 352a-352c may negotiate an image capturing frequency (e.g., every sec) , and synchronize their start time at the beginning. In still other embodiments, the synchronous capturing of the multi-view roadway images may be coordinated by a remote server, e.g., remote server 60 of Figure 1.
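As a simplified, non-limiting sketch of the alternate scheme (a negotiated capture frequency and a common start time), each vehicle may compute an identical capture schedule independently; the function name and the values below are illustrative only.

```python
def capture_timestamps(shared_start_time, period_s, num_frames):
    """Deterministic capture schedule each vehicle can compute on its own once
    a common start time and capture period have been negotiated."""
    return [shared_start_time + i * period_s for i in range(num_frames)]

# Example: all vehicles capture one frame per second for one minute.
schedule = capture_timestamps(shared_start_time=1_700_000_000.0,
                              period_s=1.0, num_frames=60)
```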
For ease of understanding, only three proximally operated CA/AD vehicles 352a-352c are shown in Figure 3. However, the disclosure is not so limited. In alternate embodiments, the disclosure may be practiced with more or fewer proximally operated CA/AD vehicles.
Referring now to Figure 4, wherein a component view of an example computer-assisted/autonomous driving system, according to various embodiments, is illustrated. As shown, for the illustrated embodiments, CA/AD system 400, which may be IVS or CA/AD system 100 of Figure 1, includes main system controller 402, navigation subsystem 404, object detection subsystem 406, ADDG Agent 408, intra-vehicle communication subsystem 410, inter-vehicle communication subsystem 412 and remote communication subsystem 414. In other embodiments, CA/AD system 400 may include more or fewer subsystems.
In various embodiments, main system controller 402 is configured to control overall operation of CA/AD system 400, including controlling DCU 420 of the host vehicle of CA/AD system 400, via intra-vehicle communication subsystem 410. Main system controller 402 may control DCU 420 based at least in part on the sensor data provided by various sensors 430, via intra-vehicle communication subsystem 410, as well as the results of object detection provided by object detection subsystem 406.
Object detection subsystem 406, which may be object detection subsystem 140 of Figure 1, is configured to recognize stationary or moving objects 70 (such as travelers, other vehicles, bicycles, street signs, traffic lights, and so forth) in a rolling area surrounding the host vehicle of CA/AD system 400, based at least in part on sensor data collected by sensors 430, as it travels enroute on the roadways to its destination. In various embodiments, object detection subsystem 406 may include a neural network for detecting objects within the rolling area surrounding the host vehicle. Figure 9 illustrates an example neural network that may be employed for real time local object detection, to be described in more detail below.
Navigation subsystem 404, which may be navigation subsystem 130, may be configured to provide navigation guidance or control, depending on whether the host vehicle of CA/AD system 400 is a computer-assisted vehicle, or a partially or fully autonomous driving vehicle. Navigation subsystem 404 may provide navigation guidance or control based at least in part on sensor data provided by other sensors, such as GPS/GLONASS sensors, via intra-vehicle communication subsystem 410. Navigation subsystem 404 may be any one of such subsystems known in the art.
ADDG agent 408 is configured to complement an offline ADDG, e.g., ADDG 85 of Figure 1, in the generation of autonomous driving datasets with automatic object labelling. In various embodiments, ADDG agent 408 is configured to cooperate with the proximally operated vehicles in the collection of multi-view images of the roadways. In particular, in various embodiments, ADDG agent 408 is configured to negotiate with the proximally operated vehicles in selecting the master vehicle among the proximally operated vehicles. For these embodiments, ADDG agent 408 is further configured to send or receive the synchronization signals to synchronize the taking of the multi-view roadway images, depending on whether the host vehicle of CA/AD system 400 is selected as the master vehicle. Further, ADDG agent 408 is configured to output the roadway images captured by image sensors 430, including the roadway images taken synchronously with the proximally operated vehicles, and the results of the object detection by object detection subsystem 406, for the offline ADDG, via remote communication subsystem 414.
The sensor data may include, but are not limited to, sensor data (images) from one or more cameras of the host vehicle providing frontal, rearward and/or side world views looking out of the host vehicle; sensor data from accelerometers, inertial measurement units (IMU) , and/or gyroscopes of the vehicle, providing speed and/or deceleration data, and so forth.
In various embodiments, main system controller 402, navigation subsystem 404, object detection subsystem 406 and ADDG agent 408 may be implemented in hardware and/or software, with or without the employment of hardware accelerators. Figures 10-11 illustrate example hardware and/or software implementations of CA/AD system 400, to be described in more detail later.
In some embodiments, intra-vehicle communication subsystem 410 may be coupled with sensors 430 and driving control units 420 via a vehicle bus. Intra-vehicle communication subsystem 410 may communicate with sensors 430 and driving control units 420 in accordance with the Controller Area Network communication protocol. In some embodiments, intra-vehicle communication subsystem 410 may be communicatively coupled with sensors 430 via a wireless network, and communicate in accordance with a wireless network protocol, such as Near Field Communication (NFC) , WiFi and so forth. By virtue of its inter-operation with sensors 430, intra-vehicle communication subsystem 410 may also be referred to as a sensor interface.
As alluded to earlier, inter-vehicle communication subsystem 412 is configured to facilitate communication with proximally operated CA/AD vehicles. In some embodiments, inter-vehicle communication subsystem 412 is configured to support inter-vehicle communication in accordance with one or more industry accepted practices. In some embodiments, inter-vehicle communication subsystem 412 may be configured to communicate with communication subsystems of the other vehicles via WiFi or cellular, such as LTE 4G/5G.
As alluded to earlier, remote communication subsystem 414 is configured to facilitate communication with one or more remote/offline servers, which may be server 60 of Figure 1. In some embodiments, remote communication subsystem 414 may be configured to communicate with the remote/offline servers wirelessly, via a wide area network, such as the Internet. Wireless communication may be WiFi or cellular, such as LTE 4G/5G. In other embodiments, remote communication subsystem 414 may be configured to communicate with the remote/offline servers via wired communication, such as Ethernet, or through portable storage media, such as removable solid state drives, disks or tapes. By virtue of the nature of its inter-operation with remote servers, remote communication subsystem 414 may also be referred to as an input/output (I/O) interface of CA/AD system 400.
Referring now to Figure 5, wherein an example process for generating an autonomous driving dataset with automatic object labelling, according to various embodiments, is illustrated. As shown, process 500 includes operations performed at blocks 502-512. Operations at blocks 502-512 may be performed by a provider of autonomous driving datasets, using, in particular, ADDG 85 of Figure 1, complemented by ADDG agent 150 of Figure 1 or 408 of Figure 4. In alternate embodiments, process 500 may include more or fewer operations.
Process 500 starts at block 502. At block 502, image sensors of a plurality of CA/AD vehicles to be proximally operated to capture images of roadways for generation of an autonomous driving dataset with automatic object labelling, are calibrated. In various embodiments, as described earlier, the image sensors of the CA/AD vehicles include RGB and LiDAR cameras. For these embodiments, the calibrations include 2D and 3D calibration of the RGB and LiDAR cameras, as well as cross calibration of the image sensors for multi-view image processing. The calibrations will be further described later with references to Figure 6.
Next, at block 503, the plurality of CA/AD vehicles having various calibrated sensors (including the image sensors) and object detection capabilities are proximally operated on a plurality of roadways to collect data (including images) of a plurality of roadways.
At block 504, while operating on the plurality of roadways, sensor data (including images) of the roadways are individually collected, as well as cooperatively collected, to detect objects on the roadways. That is, sensors (including image sensor (s) ) of a CA/AD vehicle may collect sensor data (including images) of the roadways continuously, with at least a subset of the images being collected synchronously in coordination among the plurality of CA/AD vehicles, as earlier described. The operations of collecting images with image sensors will be further described later with references to Figure 7A.
From block 504, process 500 proceeds to blocks 506-510.
At block 506, while operating on the plurality of roadways and collecting the sensor data (including images) , each of the CA/AD vehicles individually detects objects on the roadways, based at least in part on the sensor data (including images) collected, using the corresponding object detection subsystems of the CA/AD vehicles. The results of the object detection are accumulated and later outputted for the operations at block 512. As described earlier, in various embodiments, an object detection subsystem of a CA/AD vehicle may employ a neural network in making the detection. An example neural network is later described with references to Figure 9.
At block 508, after operating on the plurality of roadways and collecting the sensor data (including images) , the images collected by the image sensors of the CA/AD vehicles may be correspondingly processed to perform single camera motion based object detection. Similarly, the results of the single camera motion based object detection are outputted for the operations at block 512. The operations of single camera motion based object detection will be further described later with references to Figure 7B.
At block 510, after operating on the plurality of roadways and collecting the sensor data (including images) , the images collected by the image sensors of the CA/AD vehicles  may be collectively processed to perform multi-view object detection. Similarly, the results of the multi-view object detection are outputted for the operations at block 512. The operations of multi-view object detection will be further described later with references to Figure 7C.
From blocks 506, 508 and 510, process 500 proceeds to block 512. At block 512, the results of the real time object detections by the object detection subsystems of the CA/AD vehicles, the results of the single camera motion based object detection analysis, and the results of the multi-view object detection analysis are merged together to provide the automatic object labelling for the autonomous driving dataset being generated. The operations of merging the various object detection results will be further described later with references to Figure 8.
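While the merge operations are described further with references to Figure 8, the following simplified sketch illustrates one way a non-maximal suppression style merge over the pooled detections could look; the bird's-eye-view boxes, scores and threshold are assumptions for illustration and do not limit the disclosure, which merges full 3D results.

```python
import numpy as np

def bev_iou(a, b):
    """IoU of two axis-aligned bird's-eye-view boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(boxes, scores, iou_threshold=0.5):
    """Non-maximal suppression over detections pooled from the local,
    single-camera-motion and multi-view methods; returns kept indices."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        # Suppress remaining boxes that overlap the kept box too much.
        remaining = [i for i in order[1:]
                     if bev_iou(boxes[best], boxes[i]) < iou_threshold]
        order = np.array(remaining, dtype=int)
    return keep
```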
Referring now to Figure 6, wherein an example process for correspondingly calibrating and cross calibrating the image sensors of the CA/AD vehicles, according to various embodiments, is illustrated. As shown, process 600 for correspondingly calibrating and cross calibrating the image sensors of the CA/AD vehicles includes operations performed at blocks 602-606. In various embodiments, the operations may be performed by the provider of the autonomous driving datasets, e.g., using ADDG 85 of Figure 1. In alternate embodiments, process 600 may be practiced with more or fewer operations.
Process 600 starts at block 602. At block 602, the 3D LiDAR cameras and the 2D RGB cameras of the CA/AD vehicles are correspondingly calibrated. Compared with a general RGBD camera, experience has shown that the combination of 3D LiDAR and 2D RGB cameras provides better results for outdoor environments. The combination senses at much longer distances and has better depth accuracy. The 3D LiDAR cameras are used to sense the depth information, while the 2D RGB cameras are used to sense the color information. In various embodiments, the intrinsic and extrinsic parameters of each pair of the two imaging systems are calibrated with the method described in Jesse Levinson, Sebastian Thrun, Automatic Online Calibration of Cameras and Lasers, Robotics: Science and Systems, 2013. In general, the extrinsic parameters represent a rigid transformation from the 3-D world coordinate system to the 3-D camera's coordinate system. The intrinsic parameters represent a projective transformation from the 3-D camera's coordinates into the 2-D image coordinates. In alternate embodiments, other calibration methods may be practiced. On calibration, the captured depth images will be aligned to the RGB images.
From block 602, process 600 proceeds to block 604. At block 604, after smoothing and interpolation, the 3D point cloud with RGB color is generated in the same 3D coordinate system as the 2D cameras. These 3D point clouds from multiple vehicles are used later for vehicle calibration and 3D object detection. As the 3D LiDAR and 2D RGB cameras are typically fixed in the CA/AD vehicles, the calibration typically only needs to be done once, or infrequently after multiple repeated operations.
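By way of illustration only, the following sketch shows how an RGB-colored point cloud may be generated from a depth image already aligned to the RGB image, assuming a standard pinhole intrinsic model (fx, fy, cx, cy); the helper name is illustrative and not part of the disclosure.

```python
import numpy as np

def colored_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project an aligned depth image (meters, HxW) into 3D points in the
    2D camera's coordinate system and attach the RGB color of each pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    valid = depth > 0                                 # ignore missing depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)              # (N, 3) 3D points
    colors = rgb[valid]                               # (N, 3) RGB per point
    return points, colors
```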
Next, at block 606, the 3D LiDAR and 2D RGB cameras of the CA/AD vehicles are cross calibrated to enable subsequent multi-view analysis of the images they capture. After the calibration of the 2D cameras and 3D LiDAR, each vehicle is considered as a 3D vision system which outputs a 3D point cloud with RGB information. Then, the extrinsic parameters of the multiple vehicle cameras are cross calibrated in order to combine all the point clouds together. In embodiments, the cross calibration is performed in two stages:
Stage 1: Estimate the rotation and translation between two neighboring 3D vision systems.
A neighboring 3D vision system is the one that is the closest to the current system of interest in physical distance. In various embodiments, only the closest pair of vision systems will be calibrated, because they may share the largest view field. Further, the iterative closest point (ICP) method is used to estimate the rotation and translation between the 3D vision systems via the registration of two 3D point clouds. It is known that ICP may converge to a local minimum if its initialization parameters are not well set. Experience has shown that, with the coarse 2D location and pose of each vehicle, good initialization of the ICP translation and rotation is achieved with the method. Further, very accurate extrinsic parameters between neighboring vehicle vision systems can be estimated.
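As a simplified illustration of stage 1 only, the sketch below uses the Open3D library's point-to-point ICP, seeded with the coarse planar pose described later with references to Figure 7A; the library choice, distance threshold and function names are assumptions for illustration and not requirements of the disclosure.

```python
import numpy as np
import open3d as o3d

def cross_calibrate(points_a, points_b, coarse_yaw, coarse_translation):
    """Estimate the rotation/translation from vehicle B's vision system to
    vehicle A's, initializing ICP with the coarse pose (yaw about the vertical
    axis and a 3-element translation, z typically 0) from data capture."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(points_b)   # (N, 3) array
    dst = o3d.geometry.PointCloud()
    dst.points = o3d.utility.Vector3dVector(points_a)

    # Coarse initialization: rotation about the vertical axis plus translation.
    c, s = np.cos(coarse_yaw), np.sin(coarse_yaw)
    init = np.eye(4)
    init[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    init[:3, 3] = coarse_translation

    result = o3d.pipelines.registration.registration_icp(
        src, dst, 1.0, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation   # 4x4 homogeneous transform holding (R, T)
```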
Stage 2: Set 3D coordinates of the 3D vision system on the CA/AD vehicle to be operated in substantially the center among the proximally operated CA/AD vehicles, as the world coordinate system. Transfer the coordinate systems of all other 3D vision systems on other CA/AD vehicles to the world coordinate system one by one.
If Cw represents the world coordinate system, and C1 and Cw, C1 and C2 are neighboring coordinate systems, the relationship of the extrinsic parameter calibration is governed by the following equations:
Cw = R1*C1 + T1 (1)
C1 = R2*C2 + T2 (2)
where (R1, T1) is the rotation and translation between C1 and Cw; (R2, T2) is the rotation and translation between C1 and C2.
So the translation and rotation between C2 and Cw are given by the equation:
Cw = R1* (R2*C2 + T2) + T1 = (R1*R2) *C2 + (R1*T2 + T1) (3)
Using these equations, one by one, the transfers of all coordinate systems to the world coordinate system are determined.
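By way of illustration, the chaining of pairwise calibration results into direct transforms to the world coordinate system may be sketched as follows, using equations (1)-(3); the helper name is illustrative only.

```python
import numpy as np

def chain_to_world(pairwise):
    """Given pairwise extrinsics [(R1, T1), (R2, T2), ...] where (R1, T1) maps
    C1 to Cw and (Rk, Tk) maps Ck to C(k-1), return the direct (R, T) from each
    camera coordinate system to the world coordinate system, following
    Cw = (R1*R2)*C2 + (R1*T2 + T1)."""
    transforms = []
    R_acc, T_acc = np.eye(3), np.zeros(3)
    for R, T in pairwise:
        T_acc = R_acc @ T + T_acc    # accumulate translation first (uses old R_acc)
        R_acc = R_acc @ R            # then accumulate rotation
        transforms.append((R_acc.copy(), T_acc.copy()))
    return transforms
```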
In various embodiments, when calibrating camera extrinsic parameters between neighboring vehicles, two measures are taken for highly robust and accurate camera extrinsic parameter calibration. First, the 3D point clouds, rather than the 2D color images from the cameras, are used for extrinsic parameter calibration. Generally, 3D registration of point clouds is much more robust than traditional 2D camera calibration. ICP (Iterative Closest Point) or its variants may perform very robustly. Second, good initialization parameters for ICP are estimated to guarantee its convergence. These initialization parameters are based on the coarse vehicle position and orientation. Additionally, after 3D calibration of multiple nearby cars, the 3D point clouds from these cars may be merged into a larger one with a larger view field, much less occlusion and higher resolution. Further, 3D object detection is done on the final merged point cloud, and the 3D pose of the objects may also be obtained during the 3D object detection.
Referring now to Figures 7A-7C, wherein respective example processes for collecting roadway images, single camera motion based object detection and multi-view object detection, according to various embodiments, are illustrated. Figure 7A illustrates an example process for collecting images of roadways, according to various embodiments. As illustrated, process 700 for collecting images of roadways, which is performed on each proximally operated CA/AD vehicle, includes operations at blocks 702-708. In various embodiments, operations at blocks 702-708 may be performed by components of a CA/AD system, e.g., CA/AD system 400 of Figure 4. In alternate embodiments, process 700 may have more or fewer operations.
Process 700 starts at block 702. At block 702, the CA/AD vehicle is self-localized. In various embodiments, self-location of a CA/AD vehicle may be performed using a combination of sensor data from GPS/GLONASS and IMU. In general, with GPS and GLONASS data, a CA/AD vehicle may locate itself very accurately and robustly on the roadways. But it may fail occasionally, when both GPS and GLONASS signals are seriously occluded. Under such circumstances, IMU data are used for short-term continuous self-location.
Next, at block 704, the CA/AD vehicle performs coarse three-dimensional (3D) camera position and orientation estimation. In various embodiments, the offsets between the vehicle camera's position/orientation and the vehicle's position/orientation are fixed and can be measured before data capture. For these embodiments, only the vehicle's 3D position and orientation are estimated. In various embodiments, only the coarse 3D position and orientation is estimated, assuming the other proximally operated vehicles are on the same horizontal plane (ground plane) . Therefore, only the 2D position and orientation on the ground plane are estimated.
In various embodiments, the position from vehicle self-location is used as the coarse position. For these embodiments, experience has shown that the vehicle self-location error is generally within 1m. Though this seems to be quite good for general vehicle navigation applications, the error is still considered a little large for extrinsic parameter calibration between different vehicle cameras, or the same vehicle camera at different times. Therefore, for the vehicle orientation, the vehicle velocity vector is estimated from the vehicle motion trajectory via a trajectory differential operation. The velocity direction is considered as the vehicle's coarse orientation. In this way, the coarse 3D position and orientation of the vehicle camera are obtained. In various embodiments, the coarse 3D position and orientation will also be used later during the offline processing as initialization parameters to estimate the fine extrinsic parameters.
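A simplified sketch of this trajectory differential step is shown below, assuming the trajectory is a sequence of planar (x, y) self-location fixes; the function name is illustrative only.

```python
import numpy as np

def coarse_pose_2d(trajectory):
    """Estimate the vehicle's coarse 2D pose on the ground plane: the latest
    self-location fix gives the position, and the heading is taken from the
    velocity vector obtained by differentiating the trajectory."""
    trajectory = np.asarray(trajectory, dtype=float)   # (N, 2) planar fixes
    velocity = trajectory[-1] - trajectory[-2]          # trajectory differential
    heading = np.arctan2(velocity[1], velocity[0])      # radians on the ground plane
    return trajectory[-1], heading
```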
From block 704, process 700 proceeds to blocks 706 and 708. At block 706, from time to time (e.g., periodically) synchronization signals are sent to or received from the other proximally operated CA/AD vehicles to synchronize capturing of images of the roadways. At block 708, RGB and LiDAR images of the roadways are captured continuously, with some of the images captured synchronously with the other proximally operated CA/AD vehicles, in response to the synchronization signals.
As described earlier, the captured RGB and LiDAR images are outputted and used in real time to detect objects as the CA/AD vehicle travels enroute on the roadways to its destination. Further, the captured RGB and LiDAR images are also outputted for subsequent offline single camera motion based object detection analysis, as well as multi-view object detection analysis.
Figure 7B illustrates an example process for single camera motion based object detection. As shown, for the illustrated embodiments, process 720 for single camera motion based object detection includes operations performed at blocks 724-726. In various embodiments, the operations may be performed by, e.g., ADDG 85 of Figure 1. Process 720 is correspondingly performed on each of the sequences of images collected by the proximally operated CA/AD vehicles. For each performance, process 720 starts at block 724. At block 724, moving area detection is performed on a sequence of images captured by an image sensor of a CA/AD vehicle. In various embodiments, a point cloud is generated for each frame of a sequence of consecutively captured images. Each vehicle is considered to be a different view point at a different time. At each view point, the 3D scene is reconstructed. (Construction of the 3D scene will be described more fully below when multi-view object detection is described.) At the same time, the area with large registration error will be the moving area (including the moving object and background region) .
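As a simplified illustration of flagging the moving area, points of the current frame whose nearest-neighbor residual against the already registered previous frame exceeds a threshold may be kept; the KD-tree approach and the threshold value below are assumptions for illustration and not details from the disclosure.

```python
import numpy as np
from scipy.spatial import cKDTree

def moving_area(points_prev, points_curr, max_residual=0.3):
    """Return the points of the current frame whose registration residual
    against the previous (already aligned) frame is large; these form the
    candidate moving area handed to the moving-object detector."""
    residual = cKDTree(points_prev).query(points_curr)[0]   # nearest-neighbor distance
    return points_curr[residual > max_residual]
```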
Next, at block 726, on detection of a moving area, moving object detection is performed. In some embodiments, moving objects of only three categories are detected: pedestrian, cyclist and vehicle. Vehicle is a broad category that includes sub-categories such as car, truck, bus, etc. In various embodiments, the detection method may be any object detection method known in the art, but specially trained for these three categories of interest; it therefore achieves higher accuracy and faster speed in detecting moving objects of the three categories of interest. In alternate embodiments, additional categories may be detected.
Figure 7C illustrates an example process for multi-view object detection. As shown, for the illustrated embodiments, process 740 for multi-view object detection includes operations performed at blocks 742-746. In various embodiments, the operations may be performed by e.g., ADDG 85 of Figure 1. In other embodiments, process 740 may include more or fewer operations.
Process 740, for the illustrated embodiments, starts at blocks 742 and 744. At block 742, 3D scenes are reconstructed. After transferring all vehicle camera coordinate systems to the world coordinate system, all point clouds are also transferred to the world coordinate system and merged together. The point cloud from a single vehicle is sparse and its individual field of view is small; the merged 3D point cloud has a larger field of view, less occlusion and higher resolution. In various embodiments, as there are many overlapping points, the merged point clouds are processed to remove the redundant points while keeping the details. In various embodiments, the redundant points may be removed, with the details kept, in accordance with the method described in H. Pfister, M. Zwicker, J. van Baar, and M. Gross, Surfels: Surface elements as rendering primitives, in ACM Transactions on Graphics (Proc. of SIGGRAPH), 2000. In alternate embodiments, other redundant point removal methods may be employed instead. On removal of the redundant points, a 3D point cloud of the whole environment is obtained.
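The sketch below illustrates the coordinate transfer and merging of per-vehicle point clouds under the convention Cw = R*Ci + T used elsewhere in this disclosure; the voxel de-duplication shown is only a simple stand-in for the cited surfel-based redundant point removal.

import numpy as np

def merge_point_clouds(clouds, poses):
    """Transfer per-vehicle clouds into the world frame and concatenate them.

    clouds : list of (Ni, 3) arrays in each vehicle camera's coordinate system
    poses  : list of (R, T) pairs, R a 3x3 rotation and T a length-3 translation
             mapping camera coordinates Ci into world coordinates Cw = R*Ci + T
    """
    world = [pts @ R.T + T for pts, (R, T) in zip(clouds, poses)]
    return np.concatenate(world, axis=0)

def remove_redundant_points(points, voxel=0.05):
    """Keep one point per small voxel so overlapping (redundant) points are dropped."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]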
From block 742, process 740 may proceed to block 746. At block 746, 3D object detection is performed. In various embodiments, the 3D objects are detected in the merged point cloud. Each 3D object is represented as a 3D bounding box with a facing orientation. A deep learning based method is used to detect the vehicles, pedestrians, cyclists, traffic signs and signals in the 3D space of the merged point cloud. In various embodiments, the deep learning based method may be the method described in Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner, Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks, 2017 IEEE International Conference on Robotics and Automation (ICRA 2017). In alternate embodiments, other 3D object detection methods may be employed instead.
Back at block 744, while the operations of block 742 are being performed to reconstruct 3D scenes, 3D vehicle projection is performed. Some of the data capturing vehicles are in the field of view of the merged point cloud. After cross calibration of the image sensors of the vehicles, their 3D positions and orientations in the world coordinate system are known. In various embodiments, information such as the model, size and even 3D shape of the CA/AD vehicles is also known. It is therefore not necessary to detect these CA/AD vehicles; their known positions, sizes and facing orientations are directly added to the ground-truth list of the autonomous driving dataset being generated.
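A minimal sketch of adding the data-capturing vehicles directly to the ground-truth list from their known, calibrated poses and model dimensions; the field names are hypothetical placeholders.

def add_capture_vehicles_to_ground_truth(vehicle_states, ground_truth):
    """Append the known data-capturing vehicles as labelled 3D boxes (no detection needed).

    vehicle_states : list of dicts with known 'position' (world frame), 'heading'
                     (radians) and 'size' (length, width, height) per vehicle
    ground_truth   : list of labelled 3D boxes being accumulated for the dataset
    """
    for state in vehicle_states:
        ground_truth.append({
            "category": "vehicle",
            "center": state["position"],   # known from calibration, not detected
            "size": state["size"],         # known model dimensions
            "yaw": state["heading"],       # facing orientation
            "source": "known_capture_vehicle",
        })
    return ground_truth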
On performance of the operations of  blocks  744 and 746, the 3D vehicle projections and the results of the 3D object detection are outputted for the object detection result merge operations.
Referring now to Figure 8, wherein an example process for merging the results of object detection from different object detection methods, according to various embodiments, is illustrated. As shown, process 800 for merging the results of object detection from different methods includes the operations performed at blocks 802-804. In various embodiments, the operations may be performed by e.g., ADDG 85 of Figure 1. In other embodiments, process 800 may include more or fewer operations.
Process 800 starts at block 802. At block 802, the results of the real-time local object detection by the CA/AD vehicles, the results of the single camera motion based object detection analysis, and the results of the multi-view object detection analysis are merged together. In various embodiments, the results from the three methods are merged together with a non-maximal suppression method. In the context of object detection, the non-maximal suppression method is used to transform a smooth response map that triggers many imprecise object window hypotheses into, ideally, a single bounding box for each detected object. Experience has shown that merging these redundant results greatly improves detection accuracy. In various embodiments, among the methods, i.e., relative to each other, the results of the real-time local object detection are considered to have the highest confidence, the results of the motion based object detection analysis the middle confidence, and the results of the multi-view object detection analysis the lowest confidence. These confidences are employed to remove redundancy during the suppression procedure.
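The sketch below illustrates a confidence-ordered non-maximal suppression over the three result sources; the axis-aligned box parameterization, the IoU test and the 0.5 threshold are simplifying assumptions for illustration.

# Relative source confidences used during suppression: real-time local detection
# highest, motion based analysis middle, multi-view analysis lowest.
SOURCE_RANK = {"local": 3, "motion": 2, "multi_view": 1}

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def merge_detections(detections, iou_threshold=0.5):
    """Keep the highest-confidence detection per object; suppress overlapping duplicates.

    detections : list of dicts with 'box' (x1, y1, x2, y2), 'score' and
                 'source' in {'local', 'motion', 'multi_view'}
    """
    ranked = sorted(detections,
                    key=lambda d: (SOURCE_RANK[d["source"]], d["score"]),
                    reverse=True)
    kept = []
    for det in ranked:
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept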
Next, at block 804, the detected 3D objects are back projected to the vehicles' coordinate systems. The goal of these operations is to obtain the 3D object location and orientation in each vehicle's coordinate system by back-projecting the 3D detection results to the original vehicle coordinate system. First, the 3D object coordinates and orientation are transferred to the original vehicle camera's 3D coordinate system via rotation and translation. The rotation and translation are obtained by inverting the equation from the multiple-vehicle camera calibration:
Cw = R1*C1 + T1  →  C1 = R1^-1 * (Cw - T1)     (4)
Then, the 2D ground truth in the original vision coordinate system is calculated. In various embodiments, using the camera intrinsic parameters, the vertices and edges of the 3D objects are projected onto the 2D image plane via a perspective projection model.
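A short sketch of equation (4) and the subsequent perspective projection, assuming row-vector points, a rotation matrix R1 and translation T1 from the camera cross calibration, and a 3x3 intrinsic matrix K; all names are placeholders.

import numpy as np

def world_to_camera(points_w, R1, T1):
    """Back-project world-frame points using equation (4): C1 = R1^-1 * (Cw - T1)."""
    return (np.asarray(points_w, dtype=float) - T1) @ np.linalg.inv(R1).T

def project_to_image(points_c, K):
    """Perspective projection of camera-frame 3D points with intrinsic matrix K."""
    uvw = points_c @ K.T
    return uvw[:, :2] / uvw[:, 2:3]   # 2D pixel coordinates of the projected vertices

# Example use: project the 8 vertices of a detected 3D bounding box into the
# original camera image to obtain its 2D ground truth.
# box_2d = project_to_image(world_to_camera(box_vertices_world, R1, T1), K)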
Referring now to Figure 9, wherein an example neural network, in accordance with various embodiments, is shown. Example neural network 900 may be suitable for use e.g., by object detection subsystem 140 of Figure 1 or object detection subsystem 406 of Figure 4. As shown, example neural network 900 may be a multilayer feedforward neural network (FNN) comprising an input layer 912, one or more hidden layers 914 and an output layer 916. Input layer 912 receives data of input variables (x_i) 902. Hidden layer(s) 914 process the inputs, and eventually, output layer 916 outputs the determinations or assessments (y_i) 904. In one example implementation, the input variables (x_i) 902 of the neural network are set as a vector containing the relevant variable data, while the output determination or assessment (y_i) 904 of the neural network is also represented as a vector.
Multilayer feedforward neural network (FNN) may be expressed through the following equations:
ho_i = f ( Σ_{k=1..R} (iw_{i,k} * x_k) + hb_i ) ,  for i = 1, ..., N
y_i = f ( Σ_{k=1..N} (hw_{i,k} * ho_k) + ob_i ) ,  for i = 1, ..., S
where ho_i and y_i are the hidden layer variables and the final outputs, respectively; f () is typically a non-linear function, such as the sigmoid function or rectified linear unit (ReLU) function, that mimics the neurons of the human brain; R is the number of inputs; N is the size of the hidden layer, i.e., the number of neurons; and S is the number of the outputs.
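For illustration, a minimal NumPy evaluation of the two equations above, with the sigmoid chosen as f(); the weight shapes follow the definitions of R, N and S, and the variable names are the iw, hb, hw and ob of the present disclosure.

import numpy as np

def fnn_forward(x, iw, hb, hw, ob):
    """Forward pass of the two-layer feedforward network defined above.

    x  : (R,) input vector                  iw : (N, R) input-to-hidden weights
    hb : (N,) hidden biases                 hw : (S, N) hidden-to-output weights
    ob : (S,) output biases
    """
    f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid non-linearity
    ho = f(iw @ x + hb)                      # hidden layer variables ho_i
    y = f(hw @ ho + ob)                      # final outputs y_i
    return y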
The goal of the FNN is to minimize an error function E between the network outputs and the desired targets, by adapting the network variables iw, hw, hb, and ob, via training, as follows:
E = Σ_{k=1..m} E_k
where
E_k = Σ_{p=1..S} (t_kp - y_kp)^2
where y_kp and t_kp are the predicted and target values of the pth output unit for sample k, respectively, and m is the number of samples.
For object detection subsystem 140 or 406, input variables (x_i) 902 may include various sensor data collected by the various vehicle sensors, as well as data describing factors relevant to object detection. The output variables (y_i) 904 may include the objects detected, e.g., a pedestrian, a vehicle, a bicyclist, a traffic sign, a traffic light, and so forth. The network variables of the hidden layer(s) of the neural network are determined by the training data.
In the example of Figure 9, for simplicity of illustration, there is only one hidden layer in the neural network. In some other embodiments, there can be many hidden layers. Furthermore, the neural network can have some other type of topology, such as a convolutional neural network (CNN), a recurrent neural network (RNN), and so forth.
Referring now to Figure 10, wherein a software component view of the in-vehicle system, according to various embodiments, is illustrated. As shown, for the embodiments, IVS or CA/AD system 1000, which could be IVS or CA/AD system 100 or 400, includes hardware 1002 and software 1010. Software 1010 includes hypervisor 1012 hosting a number of virtual machines (VMs) 1022-1028. Hypervisor 1012 is configured to host execution of VMs 1022-1028. The VMs 1022-1028 include a service VM 1022 and a number of user VMs 1024-1028. Service VM 1022 includes a service OS hosting execution of a number of instrument cluster applications 1032. User VMs 1024-1028 may include a first user VM 1024 having a first user OS hosting execution of front seat infotainment applications 1034, a second user VM 1026 having a second user OS hosting execution of rear seat infotainment applications 1036, a third user VM 1028 having a third user OS hosting execution of navigation and object detection subsystems and ADDG Agent 1038, and so forth.
Except for the autonomous driving dataset generation with automatic object labelling technology of the present disclosure, software 1010 may otherwise be any one of a number of such elements known in the art. For example, hypervisor 1012 may be any one of a number of hypervisors known in the art, such as KVM, an open source hypervisor; Xen, available from Citrix Inc. of Fort Lauderdale, FL; or VMware, available from VMware Inc. of Palo Alto, CA; and so forth. Similarly, the service OS of service VM 1022 and the user OS of user VMs 1024-1028 may be any one of a number of OS known in the art, such as Linux, available e.g., from Red Hat of Raleigh, NC, or Android, available from Google of Mountain View, CA.
Referring now to Figure 11, wherein an example computing platform that may be suitable for use to practice aspects of the present disclosure, according to various embodiments, is illustrated. As shown, computing platform 1100 may be hardware 1002 of Figure 10, or a computing platform of one of the servers 60 of Figure 1. For the illustrated embodiments, computing platform 1100 includes one or more system-on-chips (SoCs) 1102, ROM 1103 and system memory 1104. Each SoC 1102 may include one or more processor cores (CPUs), one or more graphics processor units (GPUs), and one or more accelerators, such as computer vision (CV) and/or deep learning (DL) accelerators. ROM 1103 may include basic input/output system services (BIOS) 1105. The CPUs, GPUs, and CV/DL accelerators may be any one of a number of these elements known in the art. Similarly, ROM 1103 and BIOS 1105 may be any one of a number of ROM and BIOS known in the art, and system memory 1104 may be any one of a number of volatile storage devices known in the art. In various embodiments, one of the CV/DL accelerators may be used to implement the object detection subsystem of a CA/AD system.
Additionally, computing platform 1100 may include persistent storage devices 1106. Examples of persistent storage devices 1106 may include, but are not limited to, flash drives, hard drives, compact disc read-only memory (CD-ROM) and so forth. Further, computing platform 1100 may include one or more input/output (I/O) interfaces 1108 to interface with one or more I/O devices, such as sensors 1120. Other example I/O devices may include, but are not limited to, display, keyboard, cursor control and so forth. Computing platform 1100 may also include one or more communication interfaces 1110 (such as network interface cards, modems and so forth). Communication devices may include any number of communication and I/O devices known in the art. Examples of communication devices may include, but are not limited to, networking interfaces for Near Field Communication (NFC), WiFi, Cellular communication (such as LTE 4G/5G) and so forth. The elements may be coupled to each other via system bus 1111, which may represent one or more buses. In the case of multiple buses, they may be bridged by one or more bus bridges (not shown).
Each of these elements may perform its conventional functions known in the art. In particular, ROM 1103 may include BIOS 1105 having a boot loader. System memory 1104 and persistent storage devices 1106 may be employed to store a working copy and a permanent copy of the programming instructions implementing the operations associated with hypervisor 1012 (including, for some embodiments, functions associated with ADDG 85 or the ADDG agent 150/408), the service/user OS of service/user VMs 1022-1028, and components of navigation subsystem 1038, collectively referred to as computational logic 1122. The various elements may be implemented by assembler instructions supported by processor core(s) of SoCs 1102 or high-level languages, such as, for example, C, that can be compiled into such instructions.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as methods or computer program products. Accordingly, the present disclosure, in addition to being embodied in hardware as earlier described, may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to as a "circuit," "module" or "system." Furthermore, the present disclosure may take the form of a computer program product embodied in any tangible or non-transitory medium of expression having computer-usable program code embodied in the medium. Figure 12 illustrates an example computer-readable non-transitory storage medium that may be suitable for use to store instructions that cause an apparatus, in response to execution of the instructions by the apparatus, to practice selected aspects of the present disclosure described with reference to Figures 1-8. As shown, non-transitory computer-readable storage medium 1202 may include a number of programming instructions 1204. Programming instructions 1204 may be configured to enable a device, e.g., computing platform 1100, in response to execution of the programming instructions, to implement (aspects of) hypervisor 1012 (including, for some embodiments, functions associated with the ADDG or the ADDG agent), the service/user OS of service/user VMs 1022-1028, or components of navigation subsystem 1038. In alternate embodiments, programming instructions 1204 may be disposed on multiple computer-readable non-transitory storage media 1202 instead. In still other embodiments, programming instructions 1204 may be disposed on computer-readable transitory storage media 1202, such as signals.
Any combination of one or more computer usable or computer readable medium (s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM) , a read-only memory (ROM) , an erasable programmable read-only memory (EPROM or Flash memory) , an optical fiber, a portable compact disc read-only memory (CD-ROM) , an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) .
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture,  functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function (s) . It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an" and "the" are intended to include plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Embodiments may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product or computer readable media. The computer program product may be a computer storage medium readable by a computer system and encoding computer program instructions for executing a computer process.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill without departing from the scope and spirit of the disclosure. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for embodiments with various modifications as are suited to the particular use contemplated.
Thus, various example embodiments of the present disclosure have been described, including, but not limited to:
Example 1 is a method for generating an autonomous driving dataset for training computer-assisted or autonomous driving (CA/AD) systems of CA/AD vehicles, comprising: proximally operating a plurality of CA/AD vehicles on a plurality of roadways; collecting a plurality of sequences of images of the plurality of roadways with image sensors disposed in the plurality of proximally operated CA/AD vehicles, including synchronously collecting some of the images by the image sensors; correspondingly processing the plurality of sequences of images collected by the CA/AD systems of the CA/AD vehicles to detect objects on the plurality of roadways; individually processing the sequences of images collected to detect objects on the plurality of roadways via single camera motion based object detection analysis; collectively processing the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis; and generating the autonomous driving dataset based at least in part on the object detection results of the corresponding, individual and collective processing of the sequence of images.
Example 2 is example 1, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises establishing inter-vehicle communication between the proximally operated plurality of CA/AD vehicles, and dynamically selecting one of the plurality of CA/AD vehicles as a master vehicle among the plurality of CA/AD vehicles to coordinate at least in part the collecting of the plurality of sequences of images of the plurality of roadways with the image sensors disposed in the plurality of proximally operated CA/AD vehicles.
Example 3 is example 2, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises the master vehicle sending synchronization signals to the other CA/AD vehicles to synchronize at least in part the collection of images for the multi-view object detection analysis.
Example 4 is example 1, wherein generating comprises merging the object detection results of the corresponding, individual and collective processing of the sequence of images with a non-maximal suppression method.
Example 5 is example 4, wherein generating further comprises back projecting objects in the merged object detection results into respective coordinate systems of the CA/AD vehicles.
Example 6 is any one of examples 1-5, further comprising correspondingly calibrating the image sensors of the plurality of CA/AD vehicles, as well as cross calibrating the image sensors of pairs of neighboring CA/AD vehicles.
Example 7 is example 6, further comprising on corresponding calibrating the image sensors of the plurality of CA/AD vehicles, generating independent three dimensional (3D) point cloud coordinate systems with 3D coordinate systems of two dimensional image sensors.
Example 8 is example 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles comprises estimating rotation and translation between the image sensors of the pair of neighboring CA/AD vehicles.
Example 9 is example 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles further comprises setting 3D coordinates of the image sensor of the CA/AD vehicle to be operated in substantially the center among the proximally operated CA/AD vehicles, as the world coordinate system.
Example 10 is example 9, wherein if Cw represents the world coordinate system; C1 and Cw, C1 and C2 are neighboring coordinate systems, relationship of the extrinsic parameter calibration are governed by the following equations:
Cw = R1*C1 + T1
C1 = R2*C2 + T2
where (R1, T1) is the rotation and translation between C1 and Cw;
(R2, T2) is the rotation and translation between C1 and C2.
Example 11 is a computer-assisted or autonomous driving (CA/AD) system for a CA/AD vehicle comprising: a sensor interface and an input/output (I/O) interface; and an autonomous driving dataset generator (ADDG) agent coupled with the sensor interface, and the I/O interface; wherein the ADDG agent, via the sensor interface, is to forward synchronization signals to an image sensor of the CA/AD vehicle, and to receive a sequence of images of a plurality of roadways collected by the image sensor, at least some of received images being collected synchronously with image collections on one or more other proximally operated CA/AD vehicles, based at least in part on the synchronization signals; and wherein the ADDG agent, via the I/O interface, is to output the received  sequence of images to an ADDG to process the sequence of images to detect objects on the plurality of roadways under a plurality of manners, and to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the plurality of manners of processing.
Example 12 is example 11, further comprising an inter-vehicle communication interface coupled to the ADDG agent, wherein the ADDG agent, via the inter-vehicle communication interface, is to send or receive the synchronization signals to or from the one or more other proximally operated CA/AD vehicles, to synchronize collections of some of the images among the CA/AD vehicle and the one or more other proximally operated CA/AD vehicles.
Example 13 is example 11, further comprising an object detection subsystem coupled to the sensor interface; wherein the object detection subsystem, via the sensor interface, is also to receive the sequence of images of the plurality of roadways collected by the image sensor, and locally detect objects in the plurality of roadways based at least in part on the images; wherein the ADDG agent is to further output, via the I/O interface, to the ADDG, which further bases its generation of an autonomous driving dataset with automated object labelling on results of the local detection of objects on the plurality of roadways.
Example 14 is example 11, wherein the ADDG agent is further arranged to determine a geographic location of the CA/AD vehicle based on geolocation data provided by a global positioning system disposed on the CA/AD vehicle or motion data provided by an inertial measurement unit of the CA/AD vehicle.
Example 15 is any one of examples 11-14, wherein the ADDG agent is further arranged to estimate a three-dimensional (3D) location and orientation of the image sensor of the CA/AD vehicle via coarse estimation of a 3D location and orientation of the CA/AD vehicle that includes estimation of a two-dimensional (2D) location and orientation of the CA/AD vehicle on a ground plane.
Example 16 is at least one computer-readable medium (CRM) having instructions stored therein to cause a computing system, in response to execution of the instructions by a processor of the computing system, to operate an autonomous driving dataset generator (ADDG) to: individually process a plurality of sequences of images collected by image sensors of a plurality of proximally operated computer-assisted or autonomous driving (CA/AD) vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis, including detection of moving areas within the images;  collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, including reconstruction of three dimensional (3D) scenes within the images; and generate an autonomous driving dataset with automated object labelling, based at least in part on results of the individual and collective processing of the sequence of images.
Example 17 is example 16, wherein the computing system is further caused to operate the ADDG to generate a plurality of independent 3D point cloud coordinate systems corresponding to the image sensors of the proximally operated CA/AD vehicles for use to cross calibrate image sensors of pairs of neighboring CA/AD vehicles.
Example 18 is example 16, wherein to individually process the plurality of sequences of images collected by image sensors of the plurality of proximally operated CA/AD vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis includes detection of pedestrians, cyclists and vehicles within the detected moving areas within the images.
Example 19 is example 16, wherein to collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis further includes to represent detected objects with 3D bounding boxes having facing orientations.
Example 20 is example 16, wherein reconstruction of 3D scenes within the images comprises transfer of all coordinate systems of the image sensors of the CA/AD vehicles to a world coordinate system, as well as transfer of all point clouds of the image sensors of the CA/AD vehicles to the world coordinate system, and merger of the transferred point clouds.
Example 21 is example 16, wherein the computing system is further caused to operate the ADDG to perform 3D projections of the CA/AD vehicles including position, size and facing orientation of the CA/AD vehicles.
Example 22 is any one of examples 16-21, wherein to generate the autonomous driving dataset with automated object labelling includes to merge 3D object detection results of the individual and collective processing of the sequence of images.
Example 23 is example 22, wherein the computing system is further caused to operate the ADDG to receive local object detection results for the plurality of roadways by the CA/AD vehicles; and wherein to merge further includes to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images.
Example 24 is example 23, wherein to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images comprises to merge the local object detection results, and the 3D object detection results of the individual and collective processing of the sequence of images using a non-maximal suppression method.
Example 25 is example 23, wherein to generate the autonomous driving dataset with automated object labelling further includes to back project the merged 3D object detection results to 3D ground truth in each CA/AD vehicle’s coordinate system.
It will be apparent to those skilled in the art that various modifications and variations can be made in the disclosed embodiments of the disclosed device and associated methods without departing from the spirit or scope of the disclosure. Thus, it is intended that the present disclosure covers the modifications and variations of the embodiments disclosed above provided that the modifications and variations come within the scope of any claims and their equivalents.

Claims (25)

  1. A method for generating an autonomous driving dataset for training computer-assisted or autonomous driving (CA/AD) systems of CA/AD vehicles, comprising:
    proximally operating a plurality of CA/AD vehicles on a plurality of roadways;
    collecting a plurality of sequences of images of the plurality of roadways with image sensors disposed in the plurality of proximally operated CA/AD vehicles, including synchronously collecting some of the images by the image sensors;
    correspondingly processing the plurality of sequences of images collected by the CA/AD systems of the CA/AD vehicles to detect objects on the plurality of roadways;
    individually processing the sequences of images collected to detect objects on the plurality of roadways via single camera motion based object detection analysis;
    collectively processing the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis; and
    generating the autonomous driving dataset based at least in part on the object detection results of the corresponding, individual and collective processing of the sequence of images.
  2. The method of claim 1, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises establishing inter-vehicle communication between the proximally operated plurality of CA/AD vehicles, and dynamically selecting one of the plurality of CA/AD vehicles as a master vehicle among the plurality of CA/AD vehicles to coordinate at least in part the collecting of the plurality of sequences of images of the plurality of roadways with the image sensors disposed in the plurality of proximally operated CA/AD vehicles.
  3. The method of claim 2, wherein proximally operating the plurality of CA/AD vehicles on the plurality of roadways comprises the master vehicle sending synchronization signals to the other CA/AD vehicles to synchronize at least in part the collection of images for the multi-view object detection analysis.
  4. The method of claim 1, wherein generating comprises merging the object detection results of the corresponding, individual and collective processing of the sequence of images with a non-maximal suppression method.
  5. The method of claim 4, wherein generating further comprises back projecting objects in the merged object detection results into respective coordinate systems of the CA/AD vehicles.
  6. The method of any one of claims 1-5, further comprising correspondingly calibrating the image sensors of the plurality of CA/AD vehicles, as well as cross calibrating the image sensors of pairs of neighboring CA/AD vehicles.
  7. The method of claim 6, further comprising on corresponding calibrating the image sensors of the plurality of CA/AD vehicles, generating independent three dimensional (3D) point cloud coordinate systems with 3D coordinate systems of two dimensional image sensors.
  8. The method of claim 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles comprises estimating rotation and translation between the image sensors of the pair of neighboring CA/AD vehicles.
  9. The method of claim 6, wherein cross calibrating the image sensors of a pair of neighboring CA/AD vehicles further comprises setting 3D coordinates of the image sensor of the CA/AD vehicle to be operated in substantially the center among the proximally operated CA/AD vehicles, as the world coordinate system.
  10. The method of claim 9, wherein if Cw represents the world coordinate system; C1 and Cw, C1 and C2 are neighboring coordinate systems, relationship of the extrinsic parameter calibration are governed by the following equations:
    Cw = R1*C1 + T1
    C1 = R2*C2 + T2
    where (R1, T1) is the rotation and translation between C1 and Cw;
    (R2, T2) is the rotation and translation between C1 and C2.
  11. A computer-assisted or autonomous driving (CA/AD) system for a CA/AD vehicle comprising:
    a sensor interface and an input/output (I/O) interface; and
    an autonomous driving dataset generator (ADDG) agent coupled with the sensor interface, and the I/O interface;
    wherein the ADDG agent, via the sensor interface, is to forward synchronization signals to an image sensor of the CA/AD vehicle, and to receive a sequence of images of a plurality of roadways collected by the image sensor, at least some of received images being collected synchronously with image collections on one or more other proximally operated CA/AD vehicles, based at least in part on the synchronization signals; and
    wherein the ADDG agent, via the I/O interface, is to output the received sequence of images to an ADDG to process the sequence of images to detect objects on the plurality of roadways under a plurality of manners, and to generate an autonomous driving dataset with automated object labelling, based at least in part on results of the plurality of manners of processing.
  12. The CA/AD system of claim 11, further comprising an inter-vehicle communication interface coupled to the ADDG agent, wherein the ADDG agent, via the inter-vehicle communication interface, is to send or receive the synchronization signals to or from the one or more other proximally operated CA/AD vehicles, to synchronize collections of some of the images among the CA/AD vehicle and the one or more other proximally operated CA/AD vehicles.
  13. The CA/AD system of claim 11, further comprising an object detection subsystem coupled to the sensor interface; wherein the object detection subsystem, via the sensor interface, is also to receive the sequence of images of the plurality of roadways collected by the image sensor, and locally detect objects in the plurality of roadways based at least in part on the images; wherein the ADDG agent is to further output, via the I/O interface, to the ADDG, which further bases its generation of an autonomous driving dataset with automated object labelling on results of the local detection of objects on the plurality of roadways.
  14. The CA/AD system of claim 11, wherein the ADDG agent is further arranged to determine a geographic location of the CA/AD vehicle based on geolocation data provided by a global positioning system disposed on the CA/AD vehicle or motion data provided by an inertial measurement unit of the CA/AD vehicle.
  15. The CA/AD system of any one of claims 11-14, wherein the ADDG agent is further arranged to estimate a three-dimensional (3D) location and orientation of the image sensor of the CA/AD vehicle via coarse estimation of a 3D location and orientation of the CA/AD vehicle that includes estimation of a two-dimensional (2D) location and orientation of the CA/AD vehicle on a ground plane.
  16. At least one computer-readable medium (CRM) having instructions stored therein to cause a computing system, in response to execution of the instructions by a processor of the computing system, to operate an autonomous driving dataset generator (ADDG) to:
    individually process a plurality of sequences of images collected by image sensors of a plurality of proximally operated computer-assisted or autonomous driving (CA/AD) vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis, including detection of moving areas within the images;
    collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, including reconstruction of three dimensional (3D) scenes within the images; and
    generate an autonomous driving dataset with automated object labelling, based at least in part on results of the individual and collective processing of the sequence of images.
  17. The CRM of claim 16, wherein the computing system is further caused to operate the ADDG to generate a plurality of independent 3D point cloud coordinate systems corresponding to the image sensors of the proximally operated CA/AD vehicles for use to cross calibrate image sensors of pairs of neighboring CA/AD vehicles.
  18. The CRM of claim 16, wherein to individually process the plurality of sequences of images collected by image sensors of the plurality of proximally operated CA/AD vehicles to detect objects on the plurality of roadways via single camera motion based object detection analysis includes detection of pedestrians, cyclists and vehicles within the detected moving areas within the images.
  19. The CRM of claim 16, wherein to collectively process the sequences of images collected to detect objects on the plurality of roadways via multi-view object detection analysis, further includes to represent detected objects with 3D bounding boxes having facing orientations.
  20. The CRM of claim 16, wherein reconstruction of 3D scenes within the images comprises transfer of all coordinate systems of the image sensors of the CA/AD vehicles to a world coordinate system, as well as transfer of all point clouds of the image sensors of the CA/AD vehicles to the world coordinate system, and merger of the transferred point clouds.
  21. The CRM of claim 16, wherein the computing system is further caused to operate the ADDG to perform 3D projections of the CA/AD vehicles including position, size and facing orientation of the CA/AD vehicles.
  22. The CRM of any one of claims 16-21, wherein to generate the autonomous driving dataset with automated object labelling includes to merge 3D object detection results of the individual and collective processing of the sequence of images.
  23. The CRM of claim 22, wherein the computing system is further caused to operate the ADDG to receive local object detection results for the plurality of roadways by the CA/AD vehicles; and wherein to merge further includes to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images.
  24. The CRM of claim 23, wherein to merge the local object detection results with the 3D object detection results of the individual and collective processing of the sequence of images comprises to merge the local object detection results, and the 3D object detection results of the individual and collective processing of the sequence of images using a non-maximal suppression method.
  25. The CRM of claim 23, wherein to generate the autonomous driving dataset with automated object labelling further includes to back project the merged 3D object detection results to 3D ground truth in each CA/AD vehicle's coordinate system.
PCT/CN2019/080776 2019-04-01 2019-04-01 Autonomous driving dataset generation with automatic object labelling methods and apparatuses WO2020199072A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201980090668.0A CN113366488A (en) 2019-04-01 2019-04-01 Autonomous driving data set generation with automatic object tagging method and apparatus
PCT/CN2019/080776 WO2020199072A1 (en) 2019-04-01 2019-04-01 Autonomous driving dataset generation with automatic object labelling methods and apparatuses
EP19922546.7A EP3948647A4 (en) 2019-04-01 2019-04-01 Autonomous driving dataset generation with automatic object labelling methods and apparatuses

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/080776 WO2020199072A1 (en) 2019-04-01 2019-04-01 Autonomous driving dataset generation with automatic object labelling methods and apparatuses

Publications (1)

Publication Number Publication Date
WO2020199072A1 true WO2020199072A1 (en) 2020-10-08

Family

ID=72664854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080776 WO2020199072A1 (en) 2019-04-01 2019-04-01 Autonomous driving dataset generation with automatic object labelling methods and apparatuses

Country Status (3)

Country Link
EP (1) EP3948647A4 (en)
CN (1) CN113366488A (en)
WO (1) WO2020199072A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8537338B1 (en) * 2011-08-24 2013-09-17 Hrl Laboratories, Llc Street curb and median detection using LIDAR data
US20170032198A1 (en) * 2015-07-30 2017-02-02 Magna Electronics Inc. Vehicle vision system with object detection
US20170297488A1 (en) * 2016-04-19 2017-10-19 GM Global Technology Operations LLC Surround view camera system for object detection and tracking
US20180262739A1 (en) * 2017-03-10 2018-09-13 Denso International America, Inc. Object detection system
US20180373263A1 (en) 2017-06-23 2018-12-27 Uber Technologies, Inc. Collision-avoidance system for autonomous-capable vehicles

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3948647A4

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612283A (en) * 2020-12-24 2021-04-06 新石器慧义知行智驰(北京)科技有限公司 Sensor calibration method, device, equipment, system and medium
CN112612283B (en) * 2020-12-24 2024-01-16 新石器(盐城)智能制造有限公司 Sensor calibration method, device, equipment, system and medium
CN114155227A (en) * 2021-12-07 2022-03-08 苏州佳祺仕信息科技有限公司 Flexible product size detection method, device and system
CN114155227B (en) * 2021-12-07 2024-01-26 苏州佳祺仕科技股份有限公司 Flexible product size detection method, device and system
WO2023169758A1 (en) * 2022-03-09 2023-09-14 Mercedes-Benz Group AG Method for calibrating at least one sensor of an environment sensor system of multiple vehicles

Also Published As

Publication number Publication date
EP3948647A4 (en) 2022-11-16
EP3948647A1 (en) 2022-02-09
CN113366488A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
US20210049778A1 (en) Object velocity from images
US20210370968A1 (en) A real-time map generation system for autonomous vehicles
CN114667437A (en) Map creation and localization for autonomous driving applications
CN111918053A (en) Vehicle image verification
US11520347B2 (en) Comprehensive and efficient method to incorporate map features for object detection with LiDAR
US11282164B2 (en) Depth-guided video inpainting for autonomous driving
WO2020199072A1 (en) Autonomous driving dataset generation with automatic object labelling methods and apparatuses
CN116685874A (en) Camera-laser radar fusion object detection system and method
US20230161034A1 (en) Point cloud registration for lidar labeling
US20230135088A1 (en) 3d surface reconstruction with point cloud densification using deep neural networks for autonomous systems and applications
US20220028262A1 (en) Systems and methods for generating source-agnostic trajectories
CN116830164A (en) LiDAR decorrelated object detection system and method
KR20230023530A (en) Semantic annotation of sensor data using unreliable map annotation inputs
CN114973050A (en) Deep neural network aware ground truth data generation in autonomous driving applications
CN116048060A (en) 3D surface structure estimation based on real world data using neural networks for autonomous systems and applications
US20230136235A1 (en) 3d surface reconstruction with point cloud densification using artificial intelligence for autonomous systems and applications
US20230136860A1 (en) 3d surface structure estimation using neural networks for autonomous systems and applications
Katare et al. Autonomous embedded system enabled 3-D object detector:(With point cloud and camera)
US11443147B2 (en) Systems and methods for object detection using stereovision information
WO2024036984A1 (en) Target localization method and related system, and storage medium
US20240062383A1 (en) Ground segmentation through super voxel
US20240062405A1 (en) Identifying stability of an object based on surface normal vectors
US20230139772A1 (en) 3d surface structure estimation using neural networks for autonomous systems and applications
WO2020073271A1 (en) Snapshot image of traffic scenario
WO2020073270A1 (en) Snapshot image of traffic scenario

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922546

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019922546

Country of ref document: EP

Effective date: 20211102