CN114332907A

CN114332907A - Data augmentation including background modification for robust prediction using neural networks

Info

Publication number: CN114332907A
Application number: CN202111145496.8A
Authority: CN
Inventors: N·普里; S·西瓦拉曼; R·谢蒂; N·阿瓦达汗纳姆
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2020-09-30
Filing date: 2021-09-28
Publication date: 2022-04-12
Also published as: US11688074B2; DE102021125234A1; JP2022058135A; US20220101047A1

Abstract

In various examples, a background of an object may be modified to generate a training image. A segmentation mask may be generated and used to generate an object image that includes image data representing the object. The object image may be integrated into different backgrounds and used for data augmentation when training a neural network. Data augmentation may also be performed using hue adjustment (e.g., of the object image) and/or rendering three-dimensional capture data that corresponds to the object from selected views. Inference scores may be analyzed to select a background for an image to be included in a training data set. Backgrounds may be selected and training images may be added to the training data set iteratively during training (e.g., between epochs). Further, early or late fusion that uses object mask data may be employed to improve the inference performed by a neural network trained using the object mask data.

Description

Data augmentation including background modification for robust prediction using neural networks

Background

When training a neural network to perform a predictive task, such as object classification, the accuracy of the trained neural network is typically limited by the quality of the training data set. For training to produce a robust neural network, the network should be trained using challenging training images. For example, when training a neural network for hand gesture recognition (e.g., thumbs up, peace sign, fist, etc.), the network may have difficulty detecting gestures in front of certain environmental features. If a gesture includes an extended finger, the network may perform well when the gesture is in front of a mostly solid-colored environment, but may perform poorly when the gesture is in front of an environment that includes certain color patterns. As another example, the network may have difficulty with certain gestures viewed from certain angles, or when the environment and the hand have similar hues.

However, whether a particular training image is challenging for the neural network may depend on many factors, such as the prediction task being performed, the architecture of the neural network, and the other training images seen by the network. It is therefore difficult to construct a training data set that yields a robustly trained network by anticipating which training images should be used. It may be possible to estimate which features of training images are likely to be challenging for the neural network. However, even if such an estimate is possible and accurate, it may not be possible or practical to obtain enough images showing these features to adequately train the network.

Disclosure of Invention

Embodiments of the present disclosure relate to data augmentation including background modification for robust prediction using neural networks. Systems and methods are disclosed that provide data augmentation techniques, such as those based on background modification, that can be used to increase the robustness of trained neural networks.

In contrast to conventional systems, the present disclosure provides for modifying the background of an object to generate a training image. A segmentation mask may be generated and used to generate an object image that includes image data representing the object. The object image may be integrated into different backgrounds and used for data augmentation when training a neural network. Other aspects of the present disclosure provide data augmentation using hue adjustment (e.g., of the object image) and/or rendering three-dimensional capture data that corresponds to the object from selected view directions. The present disclosure also provides for analyzing inference scores to select a background for an image to be included in a training data set. Backgrounds may be selected and training images may be added to the training data set iteratively during training (e.g., between epochs). Further, the present disclosure provides for early or late fusion that uses object mask data to improve the inference performed by a neural network trained using the object mask data.

Drawings

The systems and methods of the present invention for data augmentation including background modification for robust prediction using neural networks are described in detail below with reference to the accompanying drawings, in which:

FIG. 1 is a data flow diagram illustrating an example process for training one or more machine learning models based at least on integrating object images with a background in accordance with some embodiments of the present disclosure;

FIG. 2 is a data flow diagram illustrating an example process for generating an object image and integrating the object image with one or more backgrounds in accordance with some embodiments of the present disclosure;

FIG. 3 includes an example of preprocessing that may be used to generate an object mask for generating an object image, in accordance with some embodiments of the present disclosure;

FIG. 4 is an illustration of how a three-dimensional capture of an object may be rasterized from multiple views in accordance with some embodiments of the present disclosure;

FIG. 5A is a data flow diagram illustrating an example of inference using a machine learning model and early fusion of object mask data, in accordance with some embodiments of the present disclosure;

FIG. 5B is a data flow diagram illustrating an example of inference using a machine learning model and late fusion of object mask data, in accordance with some embodiments of the present disclosure;

FIG. 6 is a flow diagram illustrating a method for training one or more machine learning models based at least on integrating an image of an object with at least one background, in accordance with some embodiments of the present disclosure;

FIG. 7 is a flow diagram illustrating a method for inference using a machine learning model, in which a mask corresponding to an image and at least a portion of the image are provided as inputs, in accordance with some embodiments of the present disclosure;

FIG. 8 is a flow diagram illustrating a method for selecting a background of an object for training one or more machine learning models, in accordance with some embodiments of the present disclosure;

FIG. 9 is a block diagram of an example computing device suitable for use in implementing some embodiments of the present disclosure;

FIG. 10 is a block diagram of an example data center suitable for use in implementing some embodiments of the present disclosure;

FIG. 11A is an illustration of an example autonomous vehicle, according to some embodiments of the disclosure;

FIG. 11B is an example of camera positions and fields of view of the example autonomous vehicle of FIG. 11A, in accordance with some embodiments of the present disclosure;

FIG. 11C is a block diagram of an example system architecture of the example autonomous vehicle of FIG. 11A, according to some embodiments of the present disclosure; and

FIG. 11D is a system diagram of communications between one or more cloud-based servers and the example autonomous vehicle of FIG. 11A, according to some embodiments of the present disclosure.

Detailed Description

Systems and methods are disclosed related to data augmentation including background modification for robust prediction using neural networks. The disclosed data augmentation techniques, such as those based on background modification, can be used to increase the robustness of trained neural networks.

The disclosed embodiments may be implemented using a variety of different systems, such as automotive systems, robotics, aerospace systems, medical systems, boating systems, intelligent area monitoring systems, simulation systems, and/or other areas of technology. The disclosed methods may be used for any perception-based or, more generally, image-based analysis using machine learning models, for example for monitoring and/or tracking of objects and/or environments.

Applications of the disclosed technology include multimodal sensor interfaces, which may be applied in a healthcare environment. For example, a patient who is intubated or otherwise unable to speak may communicate using hand poses or gestures that are interpreted by a computing system. Applications of the disclosed technology also include autonomous driving and/or vehicle control or interaction. For example, the disclosed techniques may be used to implement recognition of hand poses or gestures sensed within the cabin of the vehicle 1100 of FIGs. 11A-11D to control convenience functions, such as control of multimedia options. Gesture recognition may also be applied to the external environment of the vehicle 1100 to control any of a variety of autonomous driving control operations, including Advanced Driver Assistance System (ADAS) functions.

As various examples, the disclosed techniques may be implemented in a system that includes or is included in one or more of a system for performing conversational AI or personal assistance operations, a system for performing simulation operations to test or validate autonomous machine applications, a system for performing deep learning operations, a system implemented using edge devices, a system incorporating one or more Virtual Machines (VMs), a system implemented at least in part in a data center, or a system implemented at least in part using cloud computing resources.

In contrast to conventional systems, the present disclosure provides for identifying a region in an image that corresponds to an object and using the region to filter, remove, replace, or otherwise modify the background of the object and/or the object itself to generate a training image. In accordance with the present disclosure, a segmentation mask may be generated that identifies one or more segments corresponding to an object and one or more segments corresponding to a background of the object in a source image. The segmentation mask may be applied to the source image to identify the region corresponding to the object, e.g., to generate an object image that includes image data representing the object. The object image may be integrated into different backgrounds and used for data augmentation when training a neural network. Other aspects of the present disclosure provide data augmentation using hue adjustment (e.g., of the object image) and/or rendering three-dimensional capture data that corresponds to the object from selected view directions.

Further aspects of the present disclosure provide methods for selecting a background of an object for training a neural network. In accordance with the present disclosure, a machine learning model (MLM) may be at least partially trained, and inference data may be generated by the MLM using images that include different backgrounds. The MLM may be the neural network being trained or a different MLM. Inference scores corresponding to the inference data may be analyzed to select one or more features of training images, such as a particular background or type of background for images to be included in the training data set. The images may be selected from existing images or generated from an object image and a background using any suitable method, such as those described herein. In at least one embodiment, one or more features may be selected and one or more corresponding training images may be added to the training data set iteratively during training.

The present disclosure also provides methods of using object mask data to improve the inference performed by a neural network trained using the object mask data. Late fusion may be performed, in which one set of inference data is generated from the source image and another set of inference data is generated from an image that captures the object mask data, such as an object image (e.g., using two copies of the neural network). The sets of inference data may be fused and used to update the neural network. In a further example, early fusion may be performed, in which the source image and the image capturing the object mask data are combined and the inference data is generated from the combined image. The object mask data may be used to de-emphasize or otherwise modify the background of the source image.
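The difference between the two fusion strategies can be illustrated with a minimal sketch. It assumes NumPy and a hypothetical `model` callable that maps an image to per-class scores. Averaging the two score sets for late fusion and attenuating background pixels by a fixed `background_gain` for early fusion are illustrative choices only; the disclosure does not prescribe a particular fusion rule.

```python
import numpy as np

def late_fusion(model, source_image, object_image):
    """Run the model on both images and fuse the two score sets (here: their mean)."""
    scores_source = model(source_image)
    scores_object = model(object_image)
    return (scores_source + scores_object) / 2.0

def early_fusion(model, source_image, object_mask, background_gain=0.25):
    """Attenuate the background with the mask first, then run the model once."""
    mask = object_mask[..., np.newaxis]               # (H, W) -> (H, W, 1)
    weights = mask + (1.0 - mask) * background_gain   # 1 on the object, gain elsewhere
    return model(source_image * weights)

# Toy stand-in for a trained network (assumption): two scores from the image mean.
def toy_model(image):
    mean = float(image.mean())
    return np.array([mean, 1.0 - mean])

image = np.full((4, 4, 3), 0.8)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                                  # object occupies the center
object_image = image * mask[..., np.newaxis]          # background zeroed out

fused_late = late_fusion(toy_model, image, object_image)
fused_early = early_fusion(toy_model, image, mask)
```

Late fusion costs two forward passes but leaves the source image untouched, whereas early fusion needs one pass on a modified input.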

Referring now to FIG. 1, FIG. 1 is a data flow diagram illustrating an example process 100 for training one or more machine learning models based at least on integrating object images with backgrounds, in accordance with some embodiments of the present disclosure. By way of example, the process 100 is described with respect to a machine learning model (MLM) training system 140. The MLM training system 140 may include, among other potential components, a background integrator 102, an MLM trainer 104, an MLM post-processor 106, and a background selector 108.

At a high level, the process 100 may include the background integrator 102 receiving one or more backgrounds 110 (which may be referred to as background images) and object images 112 corresponding to one or more objects (e.g., to be classified, analyzed, and/or detected by the MLM 122). The background integrator 102 may integrate the object images 112 with the backgrounds 110 to produce image data that captures (e.g., represents) the backgrounds 110 and at least a portion of the objects in one or more images. The MLM trainer 104 may generate one or more inputs 120 to the MLM 122 from the image data. The MLM 122 may process the inputs 120 to generate one or more outputs 124. The MLM post-processor 106 may process the outputs 124 to generate prediction data 126 (e.g., inference scores, object class labels, object bounding boxes or shapes, etc.). The background selector 108 may analyze the prediction data 126 and select one or more of the backgrounds 110 and/or objects for training based at least on the prediction data 126. In some embodiments, the process 100 may be repeated for any number of iterations until the one or more MLMs 122 are sufficiently trained. Alternatively, the background selector 108 may be used once or intermittently to select the backgrounds 110 for the first training iteration and/or any other iteration.

For example, and without limitation, the MLMs 122 described herein may include any type of machine learning model, such as machine learning models using linear regression, logistic regression, decision trees, support vector machines (SVMs), naive Bayes, k-nearest neighbors (KNN), k-means clustering, random forests, dimensionality reduction algorithms, gradient boosting algorithms, neural networks (e.g., autoencoders, convolutional, recurrent, perceptrons, long short-term memory (LSTM), Hopfield, Boltzmann, deep belief, deconvolutional, generative adversarial, liquid state machines, etc.), and/or other types of machine learning models.

The process 100 may be used, at least in part, to train one or more MLMs 122 to perform a predictive task. The present disclosure focuses on pose recognition and/or gesture recognition, and more particularly on hand pose recognition. However, the disclosed techniques are broadly applicable to training MLMs to perform a variety of possible prediction tasks, such as image and/or object classification tasks. Examples include object detection, bounding box or shape determination, object classification, pose classification, and so forth. For example, FIG. 2 shows an example of an image 246 that may be captured by the input 120 to the MLM 122. The process 100 may be used to train the MLM 122 to predict the pose of the hand depicted in the image 246 (e.g., thumbs up, thumbs down, fist, peace sign, open hand, OK sign, etc.).

In some embodiments, one or more iterations of the process 100 may not include training the one or more MLMs 122. For example, the background selector 108 may use an iteration to select one or more backgrounds 110 and/or objects corresponding to the object images 112 to be included in a training data set used by the MLM trainer 104 for training. Further, in some examples, the process 100 may select the backgrounds 110 and/or objects for a training data set in one iteration using one MLM 122, and may train the same or a different MLM 122 using that training data set (e.g., in subsequent iterations of the process 100 or otherwise). For example, the MLM 122 used to select from the backgrounds 110 for training may be partially or fully trained to perform the predictive task. Where an iteration of the process 100 uses a trained or partially trained MLM 122 to select from the backgrounds 110 for training another MLM 122, the iteration can guide the training of the other MLM 122 by selecting challenging background images and/or combinations of background and object images for training.

In various examples, one or more iterations of the process 100 may form a feedback loop in which the background selector 108 uses the prediction data 126 from an iteration (e.g., a training epoch) to select one or more backgrounds 110 and/or objects corresponding to the object images 112 for inclusion in a subsequent training data set. In subsequent iterations (e.g., subsequent training epochs) of the process 100, the background integrator 102 may generate or otherwise prepare or select corresponding images that the MLM trainer 104 may incorporate into the training data set. The training data set may then be applied to the MLM 122 being trained to generate prediction data 126, which the background selector 108 uses to select one or more backgrounds 110 and/or objects corresponding to the object images 112 for inclusion in a further training data set. The feedback loop may be used to determine augmentations that ensure continued improvement in the accuracy, generalization performance, and robustness of the MLM 122 being trained.
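The feedback loop can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `evaluate`, `select`, and `train_one_epoch` are hypothetical callables standing in for the MLM plus post-processor, the selector, and the MLM trainer, and the toy accuracy table exists only to make the sketch runnable.

```python
def run_feedback_loop(backgrounds, evaluate, select, train_one_epoch, num_epochs):
    """Alternate score-driven background selection with training epochs."""
    training_set = []
    history = []
    for _ in range(num_epochs):
        # Prediction data: one score per candidate background.
        scores = {bg: evaluate(bg) for bg in backgrounds}
        # Selector picks the backgrounds to add for the next epoch.
        chosen = select(scores)
        # Integrator output (here just the background names) joins the training set.
        training_set.extend(chosen)
        train_one_epoch(training_set)
        history.append(list(chosen))
    return training_set, history

# Toy stand-ins (assumptions): per-background accuracy that improves with training.
accuracy = {"plain": 0.95, "stripes": 0.40, "dots": 0.60}

def evaluate(background):
    return accuracy[background]

def select(scores):
    # Choose the single background on which the model currently does worst.
    return [min(scores, key=scores.get)]

def train_one_epoch(training_set):
    # Pretend the most recently added background benefits from the epoch.
    latest = training_set[-1]
    accuracy[latest] = min(1.0, accuracy[latest] + 0.3)

training_set, history = run_feedback_loop(
    list(accuracy), evaluate, select, train_one_epoch, num_epochs=3
)
```

In this toy run the loop keeps returning to whichever background is currently hardest, which is the behavior the feedback loop is meant to produce.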

In various examples, based at least on the selection of one or more backgrounds and/or objects by the background selector 108, the MLM trainer 104 may incorporate one or more images from the background integrator 102 that include the selected backgrounds and/or the selected combinations of backgrounds and objects. As an example, the MLM trainer 104 may add one or more images to the training data set used for a previous training iteration and/or epoch, so that the training data set grows with each iteration. However, in some cases, the MLM trainer 104 may also remove one or more images from the training data set used for a previous training iteration and/or epoch (e.g., based on a selection for removal by the background selector 108 and/or based on the training data set exceeding a threshold number of images). In the illustrated example, the background integrator 102 may generate one or more images to be included in the training data set at the beginning of an iteration of the process 100, based on the selection made by the background selector 108. In other examples, one or more images may be pre-generated by the background integrator 102, e.g., at least partially prior to any training using the process 100 and/or during one or more previous iterations. Where images are pre-generated, the MLM trainer 104 may retrieve them from storage based on the selections made by the background selector 108.

The selection of a background as described herein may refer to selecting a background for inclusion in at least one image used for training. The selection of the background may also include selecting an object to be included in the image with the background. In at least one embodiment, the background selector 108 may select a background and/or a combination of background and object based at least on a confidence of the MLM 122 in one or more predictions made using the MLM 122. In various examples, the confidence may be captured by a set of inference scores that correspond to predictions of the prediction task performed by one or more MLMs 122 on one or more images. For example, the predictions may be made in a current iteration and/or one or more previous iterations of the process 100. An inference score may refer to a score that the MLM is trained, or is being trained, to provide with respect to the predictive task or a portion thereof. In some examples, the inference score may represent a confidence of the MLM with respect to one or more respective outputs 124 (e.g., tensor data) and/or may be used to determine or calculate a confidence with respect to the prediction task. For example, the inference score may represent a confidence that an object detected by the MLM 122 in an image belongs to a target category (e.g., a probability that the input 120 belongs to the target category).

The background selector 108 may use a variety of possible approaches to select one or more backgrounds and/or combinations of backgrounds and objects based on inference scores. In at least one embodiment, the background selector 108 may select one or more particular backgrounds and/or objects based at least on an analysis of the inference scores corresponding to images that contain those elements. In at least one embodiment, the background selector 108 may select one or more backgrounds and/or objects of a particular category or type, or having one or more other particular features, based at least on an analysis of inference scores corresponding to images that include elements having those one or more features (e.g., a particular background, a particular object, texture, color, lighting conditions, object and/or overlay, hue, perspective, orientation, skin tone, size, theme, included background elements, etc., as described with respect to FIG. 4). For example, the background selector 108 may select at least one background that includes blinds and a gesture that includes an open palm based at least on the inference scores of images sharing those features. As another example, the background selector 108 may select a particular background based at least on an inference score of an image that includes the background. As an additional example, the background selector 108 may select a particular background and object class (e.g., thumbs up, thumbs down, etc.) based at least on an inference score of an image that includes the background and an object of that class.

In some cases, the inference scores may be evaluated by the background selector 108 based at least on computing a confusion score. The confusion score may serve as a metric that quantifies the relative confusion of the network with respect to predictions made for one or more images having a particular set of features. Where the confusion score exceeds a threshold (e.g., indicating sufficient confusion), the background selector 108 may select at least some elements (e.g., backgrounds or combinations of backgrounds and objects) having that set of features to modify the training data set. In at least one embodiment, the confusion for a background may be based at least in part on the number of correct and incorrect predictions made for images that include elements having the set of features. For example, a confusion score may be calculated based at least on a ratio between correct and incorrect predictions. Additionally or alternatively, a confusion score may be calculated based at least on a variance in the prediction accuracy or inference scores for images that include elements having the set of features (e.g., indicating that results differ widely between target categories when a particular background or background type is used).
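A confusion score of this kind can be sketched in a few lines. The exact formula is not given in the text, so the definitions below are assumptions: `confusion_score` uses the fraction of incorrect predictions, and `class_accuracy_gap` uses the spread between the best and worst per-class accuracy as one possible measure of disagreement between target categories. The labels and the threshold of 0.3 are hypothetical.

```python
from collections import defaultdict

def confusion_score(predictions, labels):
    """Fraction of incorrect predictions among images sharing one feature set."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    incorrect = sum(p != t for p, t in zip(predictions, labels))
    return incorrect / len(labels)

def class_accuracy_gap(predictions, labels):
    """Difference between the best and worst per-class accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, actual in zip(predictions, labels):
        total[actual] += 1
        correct[actual] += int(predicted == actual)
    accuracies = [correct[c] / total[c] for c in total]
    return max(accuracies) - min(accuracies)

# Hypothetical predictions for four images that share one background type.
predicted = ["fist", "fist", "ok", "peace"]
actual = ["fist", "open", "open", "peace"]

score = confusion_score(predicted, actual)
gap = class_accuracy_gap(predicted, actual)
needs_more_training_images = score > 0.3  # threshold chosen for illustration
```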

The background selector 108 may select one or more backgrounds and/or combinations of backgrounds and objects to include in at least one image of the training data set based at least on selecting one or more different sets of element features. For example, the background selector 108 may select the one or more sets of element features based at least on the corresponding confusion scores. The background selector 108 may rank the different feature sets and select one or more of the sets for modifying the training data set based on the ranking. As a non-limiting example, the background selector 108 may select the top N particular backgrounds or combinations of backgrounds and objects (or other feature sets), where N is an integer (e.g., from among those whose confusion score exceeds a threshold).
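The ranking step can be sketched as a threshold filter followed by a top-N cut. The feature-set keys, scores, and threshold below are hypothetical values for illustration only.

```python
def select_top_n(confusion_by_feature_set, n, threshold):
    """Return up to n feature sets whose confusion exceeds the threshold, most confused first."""
    eligible = [
        (feature_set, score)
        for feature_set, score in confusion_by_feature_set.items()
        if score > threshold
    ]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [feature_set for feature_set, _ in eligible[:n]]

# Hypothetical confusion scores keyed by (background type, object class).
confusion = {
    ("blinds", "open_palm"): 0.62,
    ("plain_wall", "fist"): 0.05,
    ("striped_shirt", "peace"): 0.41,
    ("foliage", "thumbs_up"): 0.33,
}

selected = select_top_n(confusion, n=2, threshold=0.3)
```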

The background integrator 102 may select, retrieve (e.g., from storage), and/or generate one or more images that satisfy the selections made by the background selector 108 to modify the training data set. Where the background integrator 102 generates an image from a selected background, the background integrator 102 may include the entire background image or one or more portions of the background in the image. For example, the background integrator 102 may sample a region from the background (e.g., a rectangle sized based on the input 120 of the MLM 122) using a random or non-random sampling method. Thus, the image processed by the MLM 122 may include the entire background image or a region of the background image (e.g., a sampled region). Similarly, where the background integrator 102 generates an image from a selected object, the background integrator 102 may include the entirety of the object image or a portion of the object image in the image.
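Random region sampling can be sketched as a uniformly placed crop. The sketch assumes NumPy arrays in height-width-channel layout; the function name and the uniform sampling strategy are illustrative assumptions, and a non-random strategy (e.g., a fixed grid) could be substituted.

```python
import numpy as np

def sample_background_region(background, out_height, out_width, rng):
    """Crop a uniformly placed out_height x out_width rectangle from the background."""
    height, width = background.shape[:2]
    if out_height > height or out_width > width:
        raise ValueError("background is smaller than the requested region")
    top = int(rng.integers(0, height - out_height + 1))   # upper bound is exclusive
    left = int(rng.integers(0, width - out_width + 1))
    return background[top:top + out_height, left:left + out_width]

rng = np.random.default_rng(seed=0)
background = np.arange(6 * 8 * 3).reshape(6, 8, 3)        # toy 6x8 RGB background
region = sample_background_region(background, 4, 4, rng)
```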

In at least one embodiment, the background integrator 102 may generate one or more synthetic backgrounds. For example, synthetic backgrounds may be generated where the biases and sensitivities of the network are understood (e.g., from empirical measurement of network performance or intuitively). For example, synthetic backgrounds of a particular type (e.g., dots and stripes) may be generated for a network that is sensitive to those particular textures, with the background selector 108 selecting the background type. One or more synthetic backgrounds may be generated prior to any training of the MLM 122 and/or during training (e.g., between iterations or epochs). Various possible methods may be used to generate a synthetic background, for example, rendering a three-dimensional virtual environment associated with the background type, algorithmically generating a texture that includes the selected pattern, modifying an existing background or image, and so forth.
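Algorithmic texture generation for the dots and stripes example can be sketched with NumPy. The function names, pattern parameters, and single-channel float output are assumptions made for illustration.

```python
import numpy as np

def stripes(height, width, period=8):
    """Vertical stripes alternating between 0 and 1, each `period` pixels wide."""
    columns = (np.arange(width) // period) % 2
    return np.tile(columns, (height, 1)).astype(np.float32)

def dots(height, width, spacing=8, radius=2):
    """Filled dots of the given radius centered in each spacing x spacing cell."""
    ys, xs = np.mgrid[0:height, 0:width]
    dy = (ys % spacing) - spacing // 2
    dx = (xs % spacing) - spacing // 2
    return ((dy ** 2 + dx ** 2) <= radius ** 2).astype(np.float32)

striped_background = stripes(4, 16, period=4)
dotted_background = dots(16, 16)
```

Either array can be stacked to three channels or tinted before being used as a background image.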

According to aspects of the present disclosure, the object image 112 may be extracted from one or more source images, and the background integrator 102 may replace or modify the original background of the source images with one or more backgrounds 110. Referring now to FIG. 2, FIG. 2 is a data flow diagram illustrating an example process 200 for generating an object image 212 and integrating the object image 212 with one or more backgrounds 110, according to some embodiments of the present disclosure.

By way of example, process 200 is described with respect to object image extraction system 202. Among other potential components, the object image extraction system 202 may include a region identifier 204, a preprocessor 206, and an image data determiner 208.

As an overview, in the process 200, the region identifier 204 may be configured to identify regions within a source image. For example, the region identifier 204 may identify a region 212A within the source image 220 that corresponds to an object (e.g., a hand) having a background in the source image 220. The region identifier 204 may further generate a segmentation mask 222 that includes a segment 212B corresponding to the object, based on the identified region 212A. The region identifier 204 may further detect the location of the object to define a region 230 of the source image 220 and/or the segmentation mask 222. The preprocessor 206 may process at least a portion of the segment 212B in the region 230 of the segmentation mask 222 to generate an object mask 232. The image data determiner 208 may generate the object image 212 from the source image 220 using the object mask 232. The object image 212 may then be provided to the background integrator 102 for integration with one or more backgrounds 110 (e.g., by superimposing or overlaying the object on a background image).
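The final two steps, generating the object image from the object mask and overlaying it on a background, can be sketched as per-pixel blending. This assumes NumPy arrays with values in [0, 1] and treats the object mask as an alpha channel; the text does not state that the integration is implemented this way, so the blending rule is an assumption.

```python
import numpy as np

def composite(source_image, object_mask, background):
    """Keep source pixels where the mask is 1, background where it is 0, and blend between."""
    alpha = object_mask[..., np.newaxis].astype(np.float32)   # (H, W) -> (H, W, 1)
    object_image = source_image * alpha                       # object on a black field
    return object_image + background * (1.0 - alpha)

source = np.ones((4, 4, 3), dtype=np.float32)        # toy all-white source image
new_background = np.zeros((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0                                 # object occupies the center
mask[0, 0] = 0.5                                     # a soft (blurred) mask edge

training_image = composite(source, mask, new_background)
```

A soft mask edge produces a gradual transition between object and background rather than a hard cut-out boundary.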

According to various embodiments, one or more object images 112, such as object image 212, may be generated prior to training MLM122 and/or during training MLM 122. For example, one or more object images 112 may be generated (e.g., as described in fig. 2) and stored and then retrieved as needed to generate the input 120 in the process 100. As another example, one or more object images 112 may be generated during process 100, for example, on-the-fly or as needed by background integrator 102. In some embodiments, object images 112 are generated on-the-fly in process 100, and may then be stored and/or reused in subsequent iterations of process 100 and/or used to train MLMs other than MLM122 at a later time.

As described herein, the region identifier 204 can identify a region 212A within the source image 220 that corresponds to an object (e.g., a hand) having a background in the source image 220. In the example shown, the region identifier 204 can also identify a region 210A within the source image 220 that corresponds to the background of the object. In other examples, the region identifier 204 may only identify the region 212A.

In at least one embodiment, the region identifier 204 may identify the region 212A, and thereby determine at least a segment 212B of the source image 220 corresponding to the object, based at least on performing image segmentation on the source image 220. Image segmentation may also be used to identify the region 210A, and thereby determine a segment 210B of the source image 220 corresponding to the background. In at least one embodiment, the region identifier 204 may generate data representing a segmentation mask 222 from the source image 220, where the segmentation mask 222 indicates the segment 212B corresponding to the object (white pixels in FIG. 2) and/or the segment 210B corresponding to the background (black pixels in FIG. 2).

The region identifier 204 may be implemented in a number of possible ways, such as using AI-driven background removal. In at least one embodiment, the region identifier 204 includes one or more MLMs that are trained to classify or label individual pixels or groups of pixels of an image. For example, an MLM may be trained to identify the foreground (e.g., corresponding to an object) and/or the background in an image, and the image segments may correspond to the foreground and/or the background. As a non-limiting example, the region identifier 204 may be implemented using the background removal technology of RTX Greenscreen from NVIDIA Corporation. In some examples, the MLM may be trained to identify object types and label pixels accordingly.

In at least one embodiment, the region identifier 204 may include one or more object detectors, such as object detectors trained to detect the objects (e.g., hands) to be classified by the MLM 122. An object detector may be implemented using one or more MLMs trained to detect objects. The object detector may output data indicative of the object's location, which may be used to define a region 230 of the source image 220 and/or the segmentation mask 222 that includes the object. For example, the object detector may be trained to provide a bounding box or shape of the object, and the bounding box or shape may be used to define the region 230.

In the illustrated example, the region 230 may be defined by expanding the bounding box, while in other examples the bounding box itself may be used as the region 230. In this example, the region identifier 204 may identify the segment 212B corresponding to the object by applying the source image 220 to the MLM. In other examples, the region identifier 204 may apply the region 230 to the MLM instead of (or in addition to) the source image 220. By applying the source image 220 to the MLM, the MLM may have additional context that is not available in the region 230, which may improve the accuracy of the MLM.
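Expanding a detector's bounding box to obtain a larger region can be sketched as scaling the box about its center and clipping to the image bounds. The `(left, top, right, bottom)` box format, the function name, and the scale factor of 1.5 are assumptions for illustration; the text does not specify how far the box is expanded.

```python
def expand_box(box, scale, image_width, image_height):
    """Scale a (left, top, right, bottom) box about its center and clip it to the image."""
    left, top, right, bottom = box
    pad_x = (right - left) * (scale - 1.0) / 2.0
    pad_y = (bottom - top) * (scale - 1.0) / 2.0
    return (
        max(0, int(left - pad_x)),
        max(0, int(top - pad_y)),
        min(image_width, int(right + pad_x)),
        min(image_height, int(bottom + pad_y)),
    )

# Hypothetical detector output for a hand in a 120x100 image.
detected_box = (40, 30, 100, 90)
region = expand_box(detected_box, scale=1.5, image_width=120, image_height=100)
```

The extra margin leaves room for later mask operations such as dilation and blurring near the object's edge.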

In embodiments where the region 230 is determined, the preprocessor 206 may perform preprocessing based at least on the region 230. For example, the region 230 of the segmentation mask 222 may be preprocessed by the preprocessor 206 before being used by the image data determiner 208. In at least one embodiment, the preprocessor 206 may crop image data corresponding to the region 230 from the segmentation mask 222 and process the cropped image data to generate an object mask 232. The preprocessor 206 may perform various types of preprocessing on the region 230, which may improve the results produced by the image data determiner 208.

Referring now to fig. 3, fig. 3 includes an example of preprocessing that may be used to generate an object mask for generating an object image, in accordance with some embodiments of the present disclosure. For example, the preprocessor 206 may crop the segmentation mask 222, resulting in the object mask 300A. The preprocessor 206 may perform dilation on the object mask 300A to produce an object mask 300B. The preprocessor 206 may then blur the object mask 300B, producing an object mask 232. The object mask 232 may then be used by the image data determiner 208 to generate the object image 212.

The preprocessor 206 may use dilation to expand the segment 212B of the object mask 300A corresponding to the object. For example, the segment 212B may be expanded into the segment 210B corresponding to the background. In an embodiment, the preprocessor 206 may perform binary dilation. Other types of dilation, such as grayscale dilation, may be performed. As an example, the preprocessor 206 may first blur the object mask 300A and then perform grayscale dilation. Dilation may be used to increase the robustness of the object mask 232 to errors in the segmentation mask 222. For example, where the object includes a hand, part of the palm may sometimes be classified as background. Dilation is one way to correct this potential error. Other mask preprocessing techniques are within the scope of the present disclosure, such as binary or grayscale erosion. As an example, erosion may be performed on the segment 210B corresponding to the background.

The preprocessor 206 may use blur (e.g., Gaussian blur) to help the background integrator 102 smooth the transition between image data corresponding to the object in the object image 212 and image data corresponding to the background 110. Without smoothing the transition between the object and the background, the region corresponding to the edge of the object mask 232 may appear sharp and artificial. Using a blending technique such as blurring and then applying the object mask may result in a more natural or realistic transition between the object in the image 246 and the background 110. Although the mask preprocessing has been described as being performed on the object mask prior to applying the mask, in other examples, the image data determiner 208 may perform similar or different image processing operations when applying the object mask (e.g., to the source image 220).
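
The crop, dilate, and blur sequence described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the 3x3 structuring element, the mean filter used in place of a Gaussian blur, and the names `binary_dilate`, `box_blur`, and `make_object_mask` are all hypothetical choices.

```python
import numpy as np

def binary_dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element (assumed shape)."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # A pixel is set if any pixel in its 3x3 neighborhood is set.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def box_blur(mask, radius=1):
    """Soften mask edges with a mean filter (a simple stand-in for Gaussian blur)."""
    size = 2 * radius + 1
    padded = np.pad(mask.astype(np.float32), radius, mode="edge")
    h, w = mask.shape
    acc = np.zeros((h, w), dtype=np.float32)
    for dy in range(size):
        for dx in range(size):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (size * size)

def make_object_mask(segmentation_mask, box, dilate_iters=1, blur_radius=1):
    """Crop the segmentation mask to `box`, dilate it, then blur it."""
    x_min, y_min, x_max, y_max = box
    cropped = segmentation_mask[y_min:y_max, x_min:x_max]
    dilated = binary_dilate(cropped, dilate_iters)
    return box_blur(dilated, blur_radius)
```

The result is a floating-point mask in [0, 1] whose soft edges can serve as per-pixel blending weights.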

Returning to fig. 2, image data determiner 208 may generally use an object mask (such as object mask 232) to generate object image 212. For example, the image data determiner 208 may use the object mask 232 to identify and/or extract the region 242 from the source image 220 corresponding to the object. In other embodiments, the object mask may not be employed and another technique may be used to identify and/or extract the region 242. When using the object mask 232, the image data determiner 208 may multiply the object mask 232 with the source image 220 to obtain the object image 212, the object image 212 including image data representing a region 242 corresponding to an object (e.g., a foreground of the source image 220).

In integrating the object image 212 with the background 110, the background integrator 102 may use the object image 212 as a mask and may apply the inverse of the mask to the background 110, with the resulting image being blended with the object image 212. For example, the background integrator 102 may perform alpha compositing between the object image 212 and the background 110. The background integrator 102 may use a variety of potential blending techniques to blend the object image 212 with the background 110. In some embodiments, the background integrator 102 may use alpha blending to integrate the object image 212 with the background 110. Alpha blending may zero out the background from the object image 212 when combining the object image 212 with the background 110, and foreground pixels may be superimposed on the background 110 to generate the image 246, or pixels may be weighted (e.g., from 0 to 1) when combining image data from the object image 212 and the background 110 according to blurring or other processing applied by the preprocessor 206. In at least one embodiment, the background integrator 102 may employ one or more seamless blending techniques. A seamless blending technique may be intended to create a seamless boundary between the object in the image 246 and the background 110. Examples of seamless blending techniques include gradient domain blending, Laplacian pyramid blending, and Poisson blending.
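
As a minimal sketch of the mask multiplication and alpha blending described above, the following assumes floating-point images in [0, 1] and an object mask of the same height and width. The function name `composite` and the array layout are assumptions made for illustration.

```python
import numpy as np

def composite(source, background, object_mask):
    """Alpha-composite the masked object over a new background.

    source, background: float arrays of shape (H, W, 3) in [0, 1].
    object_mask: float array of shape (H, W) in [0, 1]; 1 marks the object.
    """
    alpha = object_mask[..., np.newaxis]      # broadcast over the color channels
    object_image = alpha * source             # mask multiplied with the source image
    # The inverse of the mask weights the background.
    return object_image + (1.0 - alpha) * background
```

With a binary mask this superimposes the foreground pixels on the background; with a blurred mask, edge pixels become a weighted mix of both images.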

Further examples of data enhancement techniques

As described herein, the process 200 may be used to enhance a training data set used to train an MLM, such as the MLM 122, using the process 100. The present disclosure provides further methods that may be used to enhance the training data set. In accordance with at least some embodiments, the hue of objects identified in a source image (e.g., source image 220) can be modified for data enhancement. As an example, where the object represents at least a portion of a human, skin tones, hair colors, and/or other tones may be modified to augment the training data set. For example, the hue of one or more portions of the region corresponding to the object may be shifted (e.g., uniformly or otherwise). In at least one embodiment, the hue may be selected randomly or non-randomly. In some cases, the hue may be selected based on analyzing the prediction data 126. For example, the hue may be used as a feature (e.g., by the background integrator 102) for selecting or generating one or more training images, as described herein.

Certain regions, such as the background or non-primary or secondary portions of the region, tend to have consistent color tones across different real-world variations of the object, and may retain their original color tone. For example, if the object is an automobile, the color tone of the body panels may change while the color tone of the lights, bumper, and tires is maintained. In at least one embodiment, the background integrator 102 can perform the tone modification. For example, tone modification may be performed on one or more portions of the object represented in the object image 212. In other examples, the object image 212 or the object mask 232 may be used (e.g., by the image data determiner 208) to identify image data representing the object and modify the color tone in one or more regions of the source image 220. These examples may not include the background integrator 102.
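
One way to illustrate a uniform hue shift restricted to the object is sketched below. The 0.5 mask threshold, the `hue_offset` parameter (a fraction of a full turn around the color wheel), and the function name `shift_hue` are assumptions for illustration and are not taken from the disclosure.

```python
import colorsys
import numpy as np

def shift_hue(image, object_mask, hue_offset):
    """Rotate the hue of pixels where object_mask > 0.5; other pixels are unchanged.

    image: float array of shape (H, W, 3), RGB in [0, 1].
    object_mask: array of shape (H, W); values above 0.5 mark the object.
    hue_offset: hue rotation as a fraction of a full turn (e.g., 1/3 is 120 degrees).
    """
    out = image.copy()
    rows, cols = np.nonzero(object_mask > 0.5)
    for y, x in zip(rows, cols):
        h, s, v = colorsys.rgb_to_hsv(*image[y, x])
        out[y, x] = colorsys.hsv_to_rgb((h + hue_offset) % 1.0, s, v)
    return out
```

Passing a mask that covers only the portions to be varied (for example, body panels but not tires) leaves the remaining pixels with their original tone.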

In accordance with at least some embodiments, the source image 220 may be rendered from various different views of objects in the environment for data enhancement. Referring now to fig. 4, fig. 4 is an illustration of how a three-dimensional (3D) capture 402 of an object may be rasterized from multiple views, in accordance with some embodiments of the present disclosure. In at least one embodiment, the object image extraction system 202 can select a view of the object in the environment. For example, the object image extraction system 202 may select from the views 406A, 406B, 406C or any arbitrary view of the objects in the environment 400. The object image extraction system 202 can then generate a source image 220 based at least on rasterizing a 3D capture of the object in the environment from the selected view. For example, the three-dimensional (3D) capture 402 may include depth information captured by a physical or virtual depth sensing camera in a physical or virtual environment (which may be different from the environment 400). In one or more embodiments, the 3D capture 402 may include a point cloud that captures at least a portion of the object and potentially additional elements of the environment 400. For example, if view 406A is selected, the object image extraction system 202 may use at least the 3D capture 402 to rasterize the source image 220 from the view 406A of the camera 404. In at least one embodiment, the view may be selected randomly or non-randomly. In some cases, a view may be selected based on analyzing the prediction data 126. For example, the view may be used as a feature (e.g., by the background integrator 102) for selecting or generating one or more training images, as described herein (e.g., in connection with object categories). In at least one embodiment, the object may be rasterized from the view to generate an object image 112, which may then be integrated with one or more backgrounds using the methods described herein. In other examples, the object may be rasterized with the background 110 (a two-dimensional image) or with other 3D content of the environment 400 to form the background.
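
The disclosure does not specify how the rasterization is performed. As one possible sketch, the following projects a colored point cloud through an assumed pinhole camera and keeps the nearest point per pixel with a depth buffer. The camera model, the parameter names (`rotation`, `translation`, `focal`), and the one-pixel point splatting are all illustrative assumptions.

```python
import numpy as np

def rasterize_points(points, colors, rotation, translation, focal, size):
    """Project a colored point cloud into an image using a z-buffer.

    points: (N, 3) world coordinates; colors: (N, 3) RGB in [0, 1].
    rotation (3, 3) and translation (3,) map world to camera coordinates.
    focal: focal length in pixels; size: (height, width) of the output image.
    """
    height, width = size
    cam = points @ rotation.T + translation
    image = np.zeros((height, width, 3), dtype=np.float64)
    depth = np.full((height, width), np.inf)
    for (x, y, z), color in zip(cam, colors):
        if z <= 0:                          # point is behind the camera
            continue
        u = int(round(focal * x / z + width / 2))
        v = int(round(focal * y / z + height / 2))
        if 0 <= u < width and 0 <= v < height and z < depth[v, u]:
            depth[v, u] = z                 # keep the nearest point per pixel
            image[v, u] = color
    return image, depth
```

Selecting a different view then amounts to passing a different `rotation` and `translation`; pixels that no point reaches keep an infinite depth, which could be used to derive an object mask.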

Examples of inference using object masks

As described herein, object masks may be used for data enhancement when training the MLM 122, e.g., using the process 100. In at least one embodiment, the MLM 122 trained using mask data may perform inference on an image without utilizing object masks. For example, the input 120 to the MLM 122 during deployment may correspond to one or more images captured by a camera. In such examples, the object masks may be used for data enhancement only. In other embodiments, object masks may also be utilized for inference. Examples of how inference can be performed using object masks are described with reference to figs. 5A and 5B.

Referring now to fig. 5A, fig. 5A is a data flow diagram 500 illustrating an example of inference using the MLM 122 and early fusion of object mask data, according to some embodiments of the present disclosure. In the example of fig. 5A, the MLM 122 may be trained to perform inference on an image 506 while utilizing an object image 508 corresponding to the object mask of the image 506. For example, the input 120 can be generated from a combination of the image 506 and the object image 508 and then provided to the MLM 122 (e.g., a neural network), which can generate output 124 that includes inference data 510. Where the MLM 122 includes a neural network, the inference data 510 may include tensor data from the neural network. Post-processing may be performed on the inference data 510 to generate the prediction data 126.

The object image 508 is one example of object mask data that may be combined with the image 506 for inference. Where early fusion of object mask data is used for inference, as in the data flow diagram 500, the input 120 to the MLM 122 may be similarly generated during training (e.g., in the process 100). In general, the object mask data may capture information about the manner in which the region identifier 204 generates the mask from the source image. By utilizing object mask data during both training and inference, the MLM 122 may learn to account for any errors or unnatural artifacts that may result from object mask generation. The object mask data may also capture information about the manner in which the preprocessor 206 preprocesses the object masks, capturing any errors or unnatural artifacts that may result from the preprocessing or remain after the preprocessing.

The object image 508 may be generated (e.g., at inference time) using the object image extraction system 202, similar to the object image 212. Although the object image 508 is shown in figs. 5A and 5B, in other examples, the segmentation mask 222 and/or the object mask 232 (before or after preprocessing by the preprocessor 206) may be used in addition to or instead of the object image 508.

The input 120 may be generated from a combination of the image 506 and the object image 508 (more generally object mask data) using various methods. In at least one embodiment, the image 506 and the object image 508 are provided as separate inputs 120 to the MLM 122. As a further example, the image 506 and the object image 508 may be combined to form a combined image and the input 120 may be generated from the combined image. In at least one embodiment, the object mask data may be used to fade, weaken, mark, indicate, distinguish, or otherwise modify one or more portions of the image 506 (as captured by the object mask data) that represent the background relative to the object or foreground of the image 506. For example, the image 506 may be blended with the object image 508, resulting in fading, blurring, or defocusing of the background of the image 506 (e.g., using depth effects). When combining image 506 and object image 508, the weights used to determine the resulting pixel colors may decrease (e.g., exponentially) with distance from the object, as indicated by the object mask data (e.g., using a focus effect).
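
Two of the combination strategies above can be sketched as follows, assuming floating-point images in [0, 1]. The function names, the four-channel layout, and the fixed `background_weight` are hypothetical; a fixed weight is used here for simplicity in place of a distance-dependent falloff.

```python
import numpy as np

def fuse_as_channels(image, object_mask):
    """Stack the mask as a fourth channel: (H, W, 3) + (H, W) -> (H, W, 4)."""
    return np.concatenate([image, object_mask[..., np.newaxis]], axis=-1)

def fade_background(image, object_mask, background_weight=0.3):
    """Attenuate background pixels while keeping object pixels at full strength.

    Pixels where the mask is 1 keep weight 1; pixels where the mask is 0
    are scaled by `background_weight`; soft mask values interpolate between.
    """
    weights = background_weight + (1.0 - background_weight) * object_mask
    return image * weights[..., np.newaxis]
```

Either result could serve as the basis for the input to the model, provided the same construction is used during training and inference.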

Referring now to fig. 5B, fig. 5B is a data flow diagram 502 illustrating an example of inference using late fusion of MLM122 and object mask data, according to some embodiments of the present disclosure.

In the example of fig. 5B, the MLM 122 may provide separate outputs 124 for the image 506 and the object image 508. The output 124 may include inference data 510A corresponding to the image 506 and inference data 510B corresponding to the object image 508. Further, the MLM 122 may include separate inputs 120 for the image 506 and the object image 508. For example, the MLM 122 may include multiple copies of the MLM 122 trained to perform inference on images, where one copy performs inference on the image 506 and generates the inference data 510A, and another copy performs inference on the object image 508 and generates the inference data 510B (e.g., in parallel). Post-processing may be performed on the inference data 510A and the inference data 510B, and late fusion may be used to generate the prediction data 126. For example, corresponding tensor values across the inference data 510A and the inference data 510B may be combined (e.g., averaged) to fuse the inference data, which may then be further post-processed to generate the prediction data 126. In at least one embodiment, the tensor values of the inference data 510A and the inference data 510B may be combined using weights (e.g., using a weighted average). In at least one embodiment, the weights may be tuned using a validation data set.
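
A weighted-average late fusion, with the weight tuned on validation data, might look like the sketch below. The function names, the single scalar weight, and the use of classification accuracy as the tuning criterion are assumptions made for illustration.

```python
import numpy as np

def late_fuse(scores_image, scores_object, weight_image=0.5):
    """Weighted average of two inference score tensors of the same shape."""
    scores_image = np.asarray(scores_image, dtype=np.float64)
    scores_object = np.asarray(scores_object, dtype=np.float64)
    return weight_image * scores_image + (1.0 - weight_image) * scores_object

def tune_weight(val_scores_image, val_scores_object, val_labels, candidates):
    """Pick the candidate weight with the highest accuracy on validation data.

    val_scores_*: arrays of shape (num_samples, num_classes).
    val_labels: integer class labels of shape (num_samples,).
    """
    def accuracy(weight):
        fused = late_fuse(val_scores_image, val_scores_object, weight)
        return float(np.mean(np.argmax(fused, axis=1) == np.asarray(val_labels)))
    return max(candidates, key=accuracy)
```

A weight of 0.5 reduces to a plain average of the two outputs.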

Inference using the MLM 122 may also include temporal filtering of the inference scores to generate the prediction data 126, which may improve the temporal stability of the predictions. Furthermore, the illustrated examples relate primarily to static gesture recognition. However, the disclosed techniques may also be applied to dynamic gesture recognition, in which a gesture unfolds over time. To train and use the MLM 122 to predict dynamic gestures, in at least one embodiment, the MLM 122 may be provided with a plurality of images that capture the object over a period of time or over a number or sequence of frames. Where object mask data is used, the object mask data may be provided for each input image.
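
The disclosure does not name a particular temporal filter. As one common choice, an exponential moving average over per-frame scores is sketched below; the `momentum` parameter and the function name `smooth_scores` are assumptions.

```python
import numpy as np

def smooth_scores(frame_scores, momentum=0.8):
    """Exponential moving average over per-frame score vectors.

    frame_scores: array of shape (T, C) with one score vector per frame.
    Returns an array of the same shape with the smoothed scores.
    """
    frame_scores = np.asarray(frame_scores, dtype=np.float64)
    smoothed = np.empty_like(frame_scores)
    running = frame_scores[0]               # initialize with the first frame
    for t, scores in enumerate(frame_scores):
        running = momentum * running + (1.0 - momentum) * scores
        smoothed[t] = running
    return smoothed
```

A higher momentum suppresses single-frame flicker in the predicted class at the cost of slower response to genuine changes.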

Referring now to fig. 6, each block of method 600, as well as other methods described herein, includes a computational process that may be performed using any combination of hardware, firmware, and/or software. For example, various functions may be performed by a processor executing instructions stored in a memory. The method may also be embodied as computer useable instructions stored on a computer storage medium. The method may be provided by a stand-alone application, a service or a hosted service (either alone or in combination with another hosted service), or a plug-in to another product, to name a few. Further, by way of example, the method 600 is described with respect to the system 140 of fig. 1 and the system 202 of fig. 2. However, the method may additionally or alternatively be performed by any one or any combination of systems, including but not limited to those described herein.

Fig. 6 is a flow diagram illustrating a method 600 for training one or more machine learning models based at least on combining object images with at least one background in accordance with some embodiments of the present disclosure. At block B602, the method 600 includes identifying a region in the first image that corresponds to an object having a first background. For example, the region identifier 204 may identify a region 212A in the source image 220 that corresponds to an object having a background in the source image 220.

At block B604, the method 600 includes determining image data representing the object based at least on the region. For example, the image data determiner 208 may determine image data representing the object based at least on the region 212A of the object. In at least one embodiment, the image data determiner 208 may determine the image data using an object mask 232 or a non-mask based approach.

At block B606, the method 600 includes generating a second image including an object having a second background using the image data. The background integrator 102 may generate an image 246 including an object having the background 110 based at least on combining the object with the background 110 using the image data. For example, the image data determiner 208 may incorporate the image data into the object image 212 and provide the object image 212 to the background integrator 102 for integration with the background 110.

At block B608, the method 600 includes training at least one neural network to perform a prediction task using the second image. For example, the MLM trainer 104 may train the MLM using the images 246 to classify objects in the images.

Referring now to fig. 7, fig. 7 is a flow diagram illustrating a method 700 for inference using a machine learning model, where a mask corresponding to an image and at least a portion of the image are input, according to some embodiments of the present disclosure. At block B702, the method 700 includes obtaining (or accessing) at least one neural network trained to perform predictive tasks on images using inputs generated from masks corresponding to objects. For example, the MLM122 of fig. 5A or 5B may be obtained (accessed) and may have been trained in accordance with the process 100 of fig. 1.

At block B704, the method 700 includes generating a mask corresponding to an object in the image, where the object has a background in the image. For example, the region identifier 204 may generate a segmentation mask 222 corresponding to an object in the source image 220, where the object has a background in the source image 220.

At block B706, the method 700 includes generating inputs to the at least one neural network using the mask. For example, the input 120 of fig. 5A or 5B may be generated using the segmentation mask 222 (or, more generally, object mask data). The input 120 may capture the object with at least a portion of the background.

At block B708, the method 700 includes generating at least one prediction of the prediction task based at least on applying the input to the at least one neural network. For example, the MLM122 may be used to generate at least one prediction for a prediction task based at least on applying the input 120 to the MLM122, and may use the output 124 from the MLM122 to determine the prediction data 126.

Referring now to fig. 8, fig. 8 is a flow diagram illustrating a method 800 for selecting a background of an object for training one or more machine learning models, according to some embodiments of the present disclosure. At block B802, the method 800 includes receiving images of one or more objects having a plurality of backgrounds. For example, the MLM training system 140 may receive images of one or more objects having multiple backgrounds 110.

At block B804, the method 800 includes generating a set of inference scores corresponding to the prediction tasks using the images. For example, the MLM trainer 104 may provide inputs 120 to one or more MLMs 122 (or different MLMs) to generate outputs 124, and the MLM post-processor 106 may process the outputs 124 to produce prediction data 126.

At block B806, the method 800 includes selecting a background based at least on one or more of the inference scores. For example, the background selector 108 may select one or more of the backgrounds 110 based at least on the prediction data 126.
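
This passage does not state the selection criterion. One plausible sketch, assuming that backgrounds on which the model scores lowest are the most useful to add to training, is shown below. The criterion, the dictionary layout, and the function name `select_backgrounds` are illustrative assumptions.

```python
def select_backgrounds(scores_by_background, count=1):
    """Return the ids of the `count` backgrounds with the lowest mean inference score.

    scores_by_background: dict mapping a background id to a list of inference
    scores (e.g., confidence in the correct class) for images on that background.
    Backgrounds with no scores are ignored.
    """
    mean_scores = {
        background: sum(scores) / len(scores)
        for background, scores in scores_by_background.items()
        if scores
    }
    # Sort ascending so the hardest backgrounds come first.
    return sorted(mean_scores, key=mean_scores.get)[:count]
```

The selected ids could then be passed to the background integration step to generate additional training images between epochs.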

At block B808, the method 800 includes generating an image based at least on integrating the object with the background. For example, the background integrator 102 may generate an image based at least on integrating the object with the background (e.g., using the object image 112 and the background 110).

At block B810, the method 800 includes training at least one neural network using the images to perform the prediction task. For example, the MLM trainer 104 may use the images to train one or more MLMs 122.

Example computing device

Fig. 9 is a block diagram of an example computing device 900 suitable for implementing some embodiments of the present disclosure. The computing device 900 may include an interconnect system 902 that directly or indirectly couples the following devices: memory 904, one or more Central Processing Units (CPUs) 906, one or more Graphics Processing Units (GPUs) 908, a communication interface 910, input/output (I/O) ports 912, input/output components 914, a power supply 916, one or more presentation components 918 (e.g., display(s)), and one or more logic units 920. In at least one embodiment, the computing device 900 may include one or more Virtual Machines (VMs), and/or any components thereof may include virtual components (e.g., virtual hardware components). As non-limiting examples, the one or more GPUs 908 can include one or more vGPUs, the one or more CPUs 906 can include one or more vCPUs, and/or the one or more logic units 920 can include one or more virtual logic units. Thus, the computing device 900 may include discrete components (e.g., a complete GPU dedicated to the computing device 900), virtual components (e.g., a portion of a GPU dedicated to the computing device 900), or a combination thereof.

Although the various blocks of fig. 9 are shown as connected via the interconnect system 902 with lines, this is not intended to be limiting and is for clarity only. For example, in some embodiments, a presentation component 918 such as a display device can be considered an I/O component 914 (e.g., if the display is a touch screen). As another example, the CPU 906 and/or the GPU 908 may include memory (e.g., the memory 904 may represent a storage device other than the memory of the GPU 908, the CPU 906, and/or other components). In other words, the computing device of fig. 9 is merely illustrative. No distinction is made between categories such as "workstation," "server," "laptop," "desktop," "tablet," "client device," "mobile device," "handheld device," "gaming console," "Electronic Control Unit (ECU)," "virtual reality system," and/or other device or system types, as all are contemplated within the scope of the computing device of fig. 9.

The interconnect system 902 may represent one or more links or buses, such as an address bus, a data bus, a control bus, or a combination thereof. The interconnect system 902 may include one or more bus or link types, such as an Industry Standard Architecture (ISA) bus, an Extended Industry Standard Architecture (EISA) bus, a Video Electronics Standards Association (VESA) bus, a Peripheral Component Interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, and/or another type of bus or link. In some embodiments, there is a direct connection between the components. By way of example, the CPU906 may be directly connected to the memory 904. Further, the CPU906 may be directly connected to the GPU 908. Where there is a direct or point-to-point connection between components, the interconnect system 902 may include a PCIe link to perform the connection. In these examples, the PCI bus need not be included in computing device 900.

Memory 904 may include any of a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 900. Computer readable media may include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.

Computer storage media may include volatile and nonvolatile media, and/or removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data types. For example, memory 904 may store computer readable instructions (e.g., representing programs and/or program elements, such as an operating system). Computer storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 900. As used herein, a computer storage medium does not include a signal per se.

Communication media may embody computer readable instructions, data structures, program modules, and/or other data types in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" may refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The CPU 906 may be configured to execute at least some computer-readable instructions in order to control one or more components of the computing device 900 to perform one or more of the methods and/or processes described herein. Each of the CPUs 906 can include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) capable of processing a large number of software threads simultaneously. The CPU 906 may include any type of processor, and may include different types of processors, depending on the type of computing device 900 implemented (e.g., a processor with fewer cores for a mobile device and a processor with more cores for a server). For example, depending on the type of computing device 900, the processor may be an Advanced RISC Machine (ARM) processor implemented using Reduced Instruction Set Computing (RISC) or an x86 processor implemented using Complex Instruction Set Computing (CISC). The computing device 900 may include one or more CPUs 906 in addition to one or more microprocessors or supplementary coprocessors, such as math coprocessors.

In addition to or in lieu of the CPU 906, the GPU 908 may be configured to execute at least some computer readable instructions to control one or more components of the computing device 900 to perform one or more of the methods and/or processes described herein. One or more of the GPUs 908 can be an integrated GPU (e.g., integrated with one or more of the CPUs 906), and/or one or more of the GPUs 908 can be a discrete GPU. In an embodiment, one or more of the GPUs 908 can be a coprocessor of one or more of the CPUs 906. The GPU 908 may be used by the computing device 900 to render graphics (e.g., 3D graphics) or to perform general-purpose computations. For example, the GPU 908 may be used for general-purpose computing on GPUs (GPGPU). The GPU 908 may include hundreds or thousands of cores capable of processing hundreds or thousands of software threads simultaneously. The GPU 908 may generate pixel data for an output image in response to a rendering command (e.g., a rendering command from the CPU 906 received via a host interface). The GPU 908 may include a graphics memory, such as a display memory, for storing pixel data or any other suitable data, such as GPGPU data. Display memory may be included as part of memory 904. The GPUs 908 may include two or more GPUs operating in parallel (e.g., via a link). The link may connect the GPUs directly (e.g., using NVLink) or may connect the GPUs through a switch (e.g., using NVSwitch). When combined together, each GPU 908 can generate pixel data or GPGPU data for a different portion of an output or for a different output (e.g., a first GPU for a first image and a second GPU for a second image). Each GPU may include its own memory, or may share memory with other GPUs.

In addition to, or in lieu of, the CPU 906 and/or the GPU 908, the logic unit(s) 920 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 900 to perform one or more of the methods and/or processes described herein. In embodiments, the CPU(s) 906, GPU(s) 908, and/or logic unit(s) 920 may perform any combination of methods, processes, and/or portions thereof, either separately or jointly. One or more of the logic units 920 may be part of and/or integrated within one or more of the CPUs 906 and/or the GPUs 908, and/or one or more of the logic units 920 may be discrete components or otherwise external to the CPU 906 and/or the GPU 908. In an embodiment, one or more of the logic units 920 may be a coprocessor of one or more of the CPUs 906 and/or one or more of the GPUs 908.

Examples of the logic unit 920 include one or more processing cores and/or components thereof, such as Tensor Cores (TC), Tensor Processing Units (TPU), Pixel Visual Cores (PVC), Vision Processing Units (VPU), Graphics Processing Clusters (GPC), Texture Processing Clusters (TPC), Streaming Multiprocessors (SM), Tree Traversal Units (TTU), Artificial Intelligence Accelerators (AIA), Deep Learning Accelerators (DLA), Arithmetic Logic Units (ALU), Application Specific Integrated Circuits (ASIC), Floating Point Units (FPU), input/output (I/O) elements, Peripheral Component Interconnect (PCI) or peripheral component interconnect express (PCIe) elements, and/or the like.

The communication interface 910 may include one or more receivers, transmitters, and/or transceivers that enable the computing device 900 to communicate with other computing devices via an electronic communication network, including wired and/or wireless communication. The communication interface 910 may include components and functionality to enable communication over any of a number of different networks, such as a wireless network (e.g., Wi-Fi, Z-Wave, Bluetooth LE, ZigBee, etc.), a wired network (e.g., over Ethernet or InfiniBand), a low-power wide area network (e.g., LoRaWAN, SigFox, etc.), and/or the internet.

The I/O ports 912 may enable the computing device 900 to be logically coupled to other devices including I/O components 914, presentation components 918, and/or other components, some of which may be built into (e.g., integrated into) the computing device 900. Illustrative I/O components 914 include a microphone, mouse, keyboard, joystick, game pad, game controller, satellite dish, scanner, printer, wireless device, and so forth. The I/O component 914 may provide a Natural User Interface (NUI) that handles user-generated air gestures, speech, or other physiological inputs. In some examples, the input may be transmitted to an appropriate network element for further processing. The NUI may implement any combination of voice recognition, stylus recognition, facial recognition, biometric recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, and touch recognition associated with a display of computing device 900 (as described in more detail below). Computing device 900 may include a depth camera such as a stereo camera system, an infrared camera system, an RGB camera system, touch screen technology, and combinations of these for gesture detection and recognition. Further, the computing device 900 may include an accelerometer or gyroscope (e.g., as part of an Inertial Measurement Unit (IMU)) that enables motion detection. In some examples, the output of an accelerometer or gyroscope may be used by computing device 900 to render immersive augmented reality or virtual reality.

The power source 916 may include a hard-wired power source, a battery power source, or a combination thereof. The power source 916 may provide power to the computing device 900 to enable operation of the components of the computing device 900.

The presentation component(s) 918 may include a display (e.g., a monitor, a touch screen, a television screen, a heads-up display (HUD), other display types, or a combination thereof), speakers, and/or other presentation components. The presentation component(s) 918 may receive data from other components (e.g., the GPU(s) 908, the CPU(s) 906, etc.) and output the data (e.g., as an image, video, sound, etc.).

Example data center

Fig. 10 illustrates an example data center 1000 that can be used in at least one embodiment of the present disclosure. The data center 1000 includes a data center infrastructure layer 1010, a framework layer 1020, a software layer 1030, and an application layer 1040.

As shown in FIG. 10, the data center infrastructure layer 1010 may include a resource coordinator 1012, grouped computing resources 1014, and node computing resources ("node C.R.s") 1016(1)-1016(N), where "N" represents any positive integer (which may be a different integer "N" than used in other figures). In at least one embodiment, node C.R.s 1016(1)-1016(N) may include, but are not limited to, any number of central processing units ("CPUs") or other processors (including accelerators, field programmable gate arrays (FPGAs), graphics processors or graphics processing units (GPUs), etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid state drives or disk drives), network input/output ("NW I/O") devices, network switches, virtual machines ("VMs"), power modules, and/or cooling modules, etc. In some embodiments, one or more of node C.R.s 1016(1)-1016(N) may correspond to a server having one or more of the computing resources described above. Further, in some embodiments, node C.R.s 1016(1)-1016(N) may include one or more virtual components, such as vGPUs, vCPUs, and/or the like, and/or one or more of node C.R.s 1016(1)-1016(N) may correspond to a virtual machine (VM).

In at least one embodiment, the grouped computing resources 1014 may include separate groupings of node C.R.s 1016 housed within one or more racks (not shown), or many racks housed in data centers at various geographic locations (also not shown). Separate groupings of node C.R.s 1016 within the grouped computing resources 1014 may include grouped compute, network, memory, or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s 1016, including CPUs, GPUs, and/or other processors, may be grouped within one or more racks to provide compute resources to support one or more workloads. The one or more racks may also include any number of power modules, cooling modules, and/or network switches, in any combination.

The resource coordinator 1012 may configure or otherwise control one or more node C.R.s 1016(1)-1016(N) and/or grouped computing resources 1014. In at least one embodiment, the resource coordinator 1012 may include a software-defined infrastructure ("SDI") management entity for the data center 1000. In at least one embodiment, the resource coordinator 1012 may include hardware, software, or some combination thereof.

In at least one embodiment, as shown in FIG. 10, framework layer 1020 includes a job scheduler 1032, a configuration manager 1034, a resource manager 1036, and a distributed file system 1038. In at least one embodiment, framework layer 1020 may include a framework to support software 1032 of software layer 1030 and/or one or more applications 1042 of application layer 1040. Software 1032 or applications 1042 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. Framework layer 1020 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark (hereinafter "Spark"), that may utilize distributed file system 1038 for large-scale data processing (e.g., "big data"). In at least one embodiment, job scheduler 1032 may include a Spark driver to facilitate scheduling of workloads supported by various layers of data center 1000. Configuration manager 1034 may be capable of configuring different layers, such as software layer 1030 and framework layer 1020, including Spark and distributed file system 1038, for supporting large-scale data processing. Resource manager 1036 may be capable of managing clustered or grouped computing resources mapped to or allocated for support of distributed file system 1038 and job scheduler 1032. In at least one embodiment, the clustered or grouped computing resources may include grouped computing resources 1014 at data center infrastructure layer 1010. Resource manager 1036 may coordinate with resource coordinator 1012 to manage these mapped or allocated computing resources.
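By way of example and not limitation, the mapping of grouped computing resources to scheduled jobs may be sketched in Python as follows. The class names, the `free_gpus` field, and the first-fit policy are assumptions made only for illustration; the disclosure does not specify an allocation algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    """Hypothetical stand-in for one grouping of node C.R.s (e.g., one rack)."""
    name: str
    free_gpus: int


@dataclass
class ResourceManager:
    """Hypothetical resource manager using a first-fit policy (an assumption)."""
    groups: list
    allocations: dict = field(default_factory=dict)

    def allocate(self, job: str, gpus_needed: int):
        # Return the name of the first group with enough free GPUs, or None.
        for group in self.groups:
            if group.free_gpus >= gpus_needed:
                group.free_gpus -= gpus_needed
                self.allocations[job] = (group.name, gpus_needed)
                return group.name
        return None

    def release(self, job: str) -> None:
        # Return a finished job's GPUs to the group they came from.
        group_name, gpus = self.allocations.pop(job)
        for group in self.groups:
            if group.name == group_name:
                group.free_gpus += gpus


manager = ResourceManager([NodeGroup("rack-a", 4), NodeGroup("rack-b", 8)])
first = manager.allocate("train-job", 6)   # does not fit rack-a, lands on rack-b
second = manager.allocate("infer-job", 2)  # fits rack-a
```

In this sketch, a request that no group can satisfy returns `None`, leaving any queuing or retry decision to the caller.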

In at least one embodiment, software 1032 included in software layer 1030 may include software used by at least portions of node C.R.s 1016(1)-1016(N), grouped computing resources 1014, and/or distributed file system 1038 of framework layer 1020. The one or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scanning software, database software, and streaming video content software.

In at least one embodiment, one or more applications 1042 included in application layer 1040 may include one or more types of applications used by at least portions of node C.R.s 1016(1)-1016(N), grouped computing resources 1014, and/or distributed file system 1038 of framework layer 1020. The one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), and/or other machine learning applications used in conjunction with one or more embodiments.

In at least one embodiment, any of configuration manager 1034, resource manager 1036, and resource coordinator 1012 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible manner. Self-modifying actions may relieve a data center operator of data center 1000 from making potentially poor configuration decisions and may help avoid underutilized and/or poorly performing portions of the data center.

Data center 1000 may include tools, services, software, or other resources to train or use one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, the machine learning model may be trained by computing weight parameters according to a neural network architecture using software and/or computing resources described above with respect to data center 1000. In at least one embodiment, using the weight parameters calculated by one or more training techniques (such as, but not limited to, those described herein), the information can be inferred or predicted using the trained or deployed machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 1000.
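As a non-limiting illustration of the train-then-infer flow described above, the following minimal Python sketch computes a single weight parameter by gradient descent and then uses it for prediction. The one-weight linear model, the toy data, the learning rate, and the epoch count are assumptions chosen for brevity and are not the neural network architectures of the disclosure.

```python
def train(samples, learning_rate=0.05, epochs=200):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad
    return w


def infer(w, x):
    """Apply the trained weight to a new input."""
    return w * x


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data following y = 2x
weight = train(samples)
prediction = infer(weight, 4.0)
```

In a data center such as data center 1000, the same two phases would typically run on different resources, with training on grouped accelerators and inference on whichever deployed resources serve requests.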

In at least one embodiment, the data center 1000 may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, and/or other hardware (or virtual compute resources corresponding thereto) to perform training and/or inference using the above-described resources. Further, one or more of the software and/or hardware resources described above may be configured as a service to allow users to train models or perform inference on information, such as image recognition, speech recognition, or other artificial intelligence services.

Example network environment

A network environment suitable for use in implementing embodiments of the present disclosure may include one or more client devices, servers, network attached storage (NAS), other backend devices, and/or other device types. The client devices, servers, and/or other device types (e.g., each device) may be implemented on one or more instances of the computing device(s) 900 of FIG. 9 (e.g., each device may include similar components, features, and/or functionality of the computing device(s) 900). Further, where backend devices (e.g., servers, NAS, etc.) are implemented, the backend devices may be included as part of the data center 1000, an example of which is described in more detail herein with respect to FIG. 10.

The components of the network environment may communicate with each other via one or more networks, which may be wired, wireless, or both. The network may include multiple networks, or a network of networks. For example, the network may include one or more wide area networks (WANs), one or more local area networks (LANs), one or more public networks such as the Internet and/or the public switched telephone network (PSTN), and/or one or more private networks. Where the network includes a wireless telecommunications network, components such as base stations, communications towers, or even access points (as well as other components) may provide wireless connectivity.

Compatible network environments may include one or more peer-to-peer network environments (in which case, a server may not be included in the network environment) and one or more client-server network environments (in which case, one or more servers may be included in the network environment). In a peer-to-peer network environment, the functionality described herein with respect to a server may be implemented on any number of client devices.

In at least one embodiment, the network environment may include one or more cloud-based network environments, distributed computing environments, combinations thereof, or the like. A cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more servers, which may include one or more core network servers and/or edge servers. The framework layer may include a framework that supports software of a software layer and/or one or more applications of an application layer. The software or applications may respectively include web-based service software or applications. In embodiments, one or more client devices may use the web-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open-source software web application framework, such as one that may use a distributed file system for large-scale data processing (e.g., "big data").

A cloud-based network environment may provide cloud computing and/or cloud storage that carries out any combination of the computing and/or data storage functions described herein (or one or more portions thereof). Any of these various functions may be distributed over multiple locations from central or core servers (e.g., of one or more data centers that may be distributed across a state, a region, a country, the globe, etc.). If a connection to a user (e.g., a client device) is relatively close to an edge server, a core server may designate at least a portion of the functionality to the edge server. A cloud-based network environment may be private (e.g., limited to a single organization), may be public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).

The client device may include at least some of the components, features, and functions of the example computing device(s) 900 described herein with respect to fig. 9. By way of example and not limitation, a client device may be implemented as a Personal Computer (PC), a laptop computer, a mobile device, a smartphone, a tablet computer, a smart watch, a wearable computer, a Personal Digital Assistant (PDA), an MP3 player, a virtual reality headset, a Global Positioning System (GPS) or device, a video player, a camera, a monitoring device or system, a vehicle, a boat, a spacecraft, a virtual machine, a drone, a robot, a handheld communication device, a hospital device, a gaming device or system, an entertainment system, a vehicle computer system, an embedded system controller, a remote control, an appliance, a consumer electronic device, a workstation, an edge device, any combination of these depicted devices, or any other suitable device.

Example autonomous vehicle

FIG. 11A is an illustration of an example autonomous vehicle 1100, in accordance with some embodiments of the present disclosure. The autonomous vehicle 1100 (alternatively referred to herein as the "vehicle 1100") may include, without limitation, a passenger vehicle, such as an automobile, a truck, a bus, an emergency vehicle, a shuttle vehicle, an electric or motorized bicycle, a motorcycle, a fire truck, a police vehicle, an ambulance, a boat, a construction vehicle, an underwater vehicle, a drone, and/or another type of vehicle (e.g., one that is unmanned and/or that accommodates one or more passengers). Autonomous vehicles are generally described in terms of automation levels defined by the National Highway Traffic Safety Administration (NHTSA), a division of the U.S. Department of Transportation, and the Society of Automotive Engineers (SAE) "Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles" (Standard No. J3016-201806, published on June 15, 2018; Standard No. J3016-201609, published on September 30, 2016; and previous and future versions of this standard). The vehicle 1100 may be capable of functionality in accordance with one or more of Level 3 through Level 5 of the autonomous driving levels. For example, depending on the embodiment, the vehicle 1100 may be capable of conditional automation (Level 3), high automation (Level 4), and/or full automation (Level 5).
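For illustration only, the automation levels referred to above may be represented as an enumeration. In the following Python sketch, the names for Levels 3 through 5 follow the text, the names for Levels 0 through 2 are taken from SAE J3016 rather than from this description, and the helper function name is hypothetical.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """SAE J3016 driving automation levels."""
    NO_AUTOMATION = 0
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


def in_described_range(level: AutomationLevel) -> bool:
    """True for Levels 3-5, the range named in the text for vehicle 1100."""
    return (
        AutomationLevel.CONDITIONAL_AUTOMATION
        <= level
        <= AutomationLevel.FULL_AUTOMATION
    )
```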

Vehicle 1100 may include components such as a chassis, a body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other components of the vehicle. The vehicle 1100 may include a propulsion system 1150, such as an internal combustion engine, a hybrid power plant, an all-electric engine, and/or another type of propulsion system. Propulsion system 1150 may be connected to a driveline of vehicle 1100, which may include a transmission, to effect propulsion of vehicle 1100. The propulsion system 1150 may be controlled in response to receiving a signal from the throttle/accelerator 1152.

A steering system 1154, which may include a steering wheel, may be used to steer the vehicle 1100 (e.g., along a desired path or route) when the propulsion system 1150 is operating (e.g., when the vehicle is in motion). The steering system 1154 may receive signals from a steering actuator 1156. The steering wheel may be optional for full automation (Level 5) functionality.

The brake sensor system 1146 may be used to operate the vehicle brakes in response to receiving signals from the brake actuators 1148 and/or brake sensors.

One or more controllers 1136, which may include one or more systems on a chip (SoCs) 1104 (FIG. 11C) and/or one or more GPUs, may provide signals (e.g., representative of commands) to one or more components and/or systems of the vehicle 1100. For example, the one or more controllers may send signals to operate the vehicle brakes via one or more brake actuators 1148, to operate the steering system 1154 via one or more steering actuators 1156, and to operate the propulsion system 1150 via one or more throttle/accelerators 1152. The one or more controllers 1136 may include one or more onboard (e.g., integrated) computing devices (e.g., supercomputers) that process sensor signals and output operation commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a human driver in driving the vehicle 1100. The one or more controllers 1136 may include a first controller 1136 for autonomous driving functions, a second controller 1136 for functional safety functions, a third controller 1136 for artificial intelligence functionality (e.g., computer vision), a fourth controller 1136 for infotainment functionality, a fifth controller 1136 for redundancy in emergency conditions, and/or other controllers. In some examples, a single controller 1136 may handle two or more of the above functionalities, two or more controllers 1136 may handle a single functionality, and/or any combination thereof.

The one or more controllers 1136 may provide the signals for controlling one or more components and/or systems of the vehicle 1100 in response to sensor data (e.g., sensor inputs) received from one or more sensors. The sensor data may be received from, for example and without limitation, global navigation satellite system sensors 1158 (e.g., Global Positioning System sensors), RADAR sensors 1160, ultrasonic sensors 1162, LIDAR sensors 1164, inertial measurement unit (IMU) sensors 1166 (e.g., accelerometers, gyroscopes, magnetic compasses, magnetometers, etc.), microphones 1196, stereo cameras 1168, wide-angle cameras 1170 (e.g., fisheye cameras), infrared cameras 1172, surround cameras 1174 (e.g., 360 degree cameras), long-range and/or mid-range cameras 1198, speed sensors 1144 (e.g., for measuring the speed of the vehicle 1100), vibration sensors 1142, steering sensors 1140, brake sensors (e.g., as part of the brake sensor system 1146), and/or other sensor types.

One or more of the controllers 1136 may receive inputs (e.g., represented by input data) from an instrument cluster 1132 of the vehicle 1100 and provide outputs (e.g., represented by output data, display data, etc.) via a human-machine interface (HMI) display 1134, an audible annunciator, a loudspeaker, and/or via other components of the vehicle 1100. The outputs may include information such as vehicle speed, time, map data (e.g., the HD map 1122 of FIG. 11C), location data (e.g., the location of the vehicle 1100, such as on a map), direction, locations of other vehicles (e.g., an occupancy grid), information about objects and the status of objects as perceived by the controller(s) 1136, and so forth. For example, the HMI display 1134 may display information about the presence of one or more objects (e.g., a street sign, a caution sign, a traffic light changing, etc.) and/or information about driving maneuvers the vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).

The vehicle 1100 further includes a network interface 1124, which may use one or more wireless antennas 1126 and/or modems to communicate over one or more networks. For example, the network interface 1124 may be capable of communication over LTE, WCDMA, UMTS, GSM, CDMA2000, etc. The one or more wireless antennas 1126 may also enable communication between objects in the environment (e.g., vehicles, mobile devices, etc.) using one or more local area networks, such as Bluetooth, Bluetooth LE, Z-Wave, ZigBee, etc., and/or one or more low-power wide-area networks (LPWANs), such as LoRaWAN, SigFox, etc.

Fig. 11B is an example of camera locations and fields of view for the example autonomous vehicle 1100 of fig. 11A, according to some embodiments of the present disclosure. The cameras and respective fields of view are one example embodiment and are not intended to be limiting. For example, additional and/or alternative cameras may be included, and/or the cameras may be located at different locations on the vehicle 1100.

The camera types for the cameras may include, but are not limited to, digital cameras that may be adapted for use with the components and/or systems of the vehicle 1100. The cameras may operate at Automotive Safety Integrity Level (ASIL) B and/or at another ASIL. The camera types may be capable of any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc., depending on the embodiment. The cameras may be capable of using rolling shutters, global shutters, another type of shutter, or a combination thereof. In some examples, the color filter array may include a red clear clear clear (RCCC) color filter array, a red clear clear blue (RCCB) color filter array, a red blue green clear (RBGC) color filter array, a Foveon X3 color filter array, a Bayer sensor (RGGB) color filter array, a monochrome sensor color filter array, and/or another type of color filter array. In some embodiments, clear pixel cameras, such as cameras with an RCCC, an RCCB, and/or an RBGC color filter array, may be used in an effort to increase light sensitivity.
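As a non-limiting illustration of how a color filter array assigns one filter to each pixel, the following Python sketch models three of the arrays named above as repeating 2x2 tiles and computes the fraction of pixels under a clear filter. The RGGB tile is the standard Bayer layout; the tile orders shown for RCCC and RCCB are assumptions, since the placement within the tile may differ between sensors, and the function names are hypothetical.

```python
# 2x2 repeating tiles, listed row by row. R = red, G = green, B = blue, C = clear.
# The RGGB tile is the standard Bayer layout; the RCCC and RCCB tile orders are
# assumed here for illustration and may differ between sensors.
CFA_TILES = {
    "RGGB": (("R", "G"), ("G", "B")),
    "RCCC": (("R", "C"), ("C", "C")),
    "RCCB": (("R", "C"), ("C", "B")),
}


def filter_at(pattern: str, row: int, col: int) -> str:
    """Return the filter color covering the pixel at (row, col)."""
    tile = CFA_TILES[pattern]
    return tile[row % 2][col % 2]


def clear_fraction(pattern: str) -> float:
    """Fraction of pixels under a clear filter in one tile."""
    cells = [c for tile_row in CFA_TILES[pattern] for c in tile_row]
    return cells.count("C") / len(cells)
```

Under these assumed tiles, three of every four RCCC pixels and two of every four RCCB pixels are unfiltered, which is consistent with the use of clear pixel cameras to improve light sensitivity.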

In some examples, one or more of the cameras may be used to perform Advanced Driver Assistance System (ADAS) functions (e.g., as part of a redundant or fail-safe design). For example, a multi-function monocular camera may be installed to provide functions including lane departure warning, traffic sign assistance, and intelligent headlamp control. One or more of the cameras (e.g., all of the cameras) may record and provide image data (e.g., video) simultaneously.

One or more of the cameras may be mounted in a mounting assembly, such as a custom-designed (3-D printed) assembly, in order to cut out stray light and reflections from within the car (e.g., reflections from the dashboard reflected in the windshield mirrors) that may interfere with the camera's image data capture abilities. With reference to wing-mirror mounting assemblies, the wing-mirror assemblies may be custom 3-D printed so that the camera mounting plate matches the shape of the wing mirror. In some examples, one or more cameras may be integrated into the wing mirror. For side-view cameras, one or more cameras may also be integrated within the four pillars at each corner of the cabin.

Cameras with a field of view that includes portions of the environment in front of the vehicle 1100 (e.g., front-facing cameras) may be used for surround view, to help identify forward-facing paths and obstacles, and to help provide, with the aid of one or more controllers 1136 and/or control SoCs, information critical to generating an occupancy grid and/or determining a preferred vehicle path. Front-facing cameras may be used to perform many of the same ADAS functions as LIDAR, including emergency braking, pedestrian detection, and collision avoidance. Front-facing cameras may also be used for ADAS functions and systems including lane departure warning ("LDW"), autonomous cruise control ("ACC"), and/or other functions such as traffic sign recognition.

A variety of cameras may be used in a front-facing configuration, including, for example, a monocular camera platform that includes a CMOS (complementary metal oxide semiconductor) color imager. Another example may be a wide-angle camera 1170 that may be used to perceive objects coming into view from the periphery (e.g., pedestrians, crossing traffic, or bicycles). Although only one wide-angle camera is shown in FIG. 11B, there may be any number of wide-angle cameras 1170 on the vehicle 1100. In addition, long-range cameras 1198 (e.g., a long-view stereo camera pair) may be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. The long-range cameras 1198 may also be used for object detection and classification, as well as basic object tracking.

One or more stereo cameras 1168 may also be included in a front-facing configuration. The stereo camera 1168 may include an integrated control unit comprising a scalable processing unit, which may provide programmable logic (e.g., an FPGA) and a multi-core microprocessor with an integrated CAN or Ethernet interface on a single chip. Such a unit may be used to generate a 3-D map of the vehicle's environment, including a distance estimate for all the points in the image. An alternative stereo camera 1168 may include a compact stereo vision sensor that may include two camera lenses (one each on the left and right) and an image processing chip that may measure the distance from the vehicle to the target object and use the generated information (e.g., metadata) to activate the autonomous emergency braking and lane departure warning functions. Other types of stereo cameras 1168 may be used in addition to, or alternatively from, those described herein.
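By way of example and not limitation, the distance measurement described above may be sketched using the standard pinhole relation for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two lenses, and d is the disparity in pixels. The function name and the numeric values in the following Python sketch are hypothetical, and the description does not state that stereo camera 1168 uses this exact computation.

```python
def depth_from_disparity(
    focal_length_px: float, baseline_m: float, disparity_px: float
) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError(
            "disparity must be positive; zero disparity means a point at infinity"
        )
    return focal_length_px * baseline_m / disparity_px


# Hypothetical values: 1000 px focal length, 0.25 m baseline, 10 px disparity.
distance_m = depth_from_disparity(1000.0, 0.25, 10.0)
```

Because depth is inversely proportional to disparity, doubling the disparity halves the estimated distance, so nearby objects are resolved more finely than distant ones.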

Cameras with a field of view that includes portions of the environment to the side of the vehicle 1100 (e.g., side-view cameras) may be used for surround view, providing information used to create and update the occupancy grid, as well as to generate side impact collision warnings. For example, surround cameras 1174 (e.g., four surround cameras 1174 as illustrated in FIG. 11B) may be positioned on the vehicle 1100. The surround cameras 1174 may include wide-angle cameras 1170, fisheye cameras, 360 degree cameras, and/or the like. For example, four fisheye cameras may be positioned on the vehicle's front, rear, and sides. In an alternative arrangement, the vehicle may use three surround cameras 1174 (e.g., left, right, and rear), and may leverage one or more other cameras (e.g., a forward-facing camera) as a fourth surround view camera.

Cameras with a field of view that includes portions of the environment to the rear of the vehicle 1100 (e.g., rear-view cameras) may be used for park assistance, surround view, rear collision warnings, and creating and updating the occupancy grid. A wide variety of cameras may be used, including, but not limited to, cameras that are also suitable as front-facing cameras (e.g., long-range and/or mid-range cameras 1198, stereo cameras 1168, infrared cameras 1172, etc.), as described herein.

Fig. 11C is a block diagram of an example system architecture for the example autonomous vehicle 1100 of fig. 11A, according to some embodiments of the present disclosure. It should be understood that this arrangement and the other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by an entity may be carried out by hardware, firmware, and/or software. For example, various functions may be carried out by a processor executing instructions stored in a memory.

Each of the components, features, and systems of the vehicle 1100 in FIG. 11C is illustrated as being connected via a bus 1102. The bus 1102 may include a Controller Area Network (CAN) data interface (alternatively referred to herein as a "CAN bus"). A CAN may be a network inside the vehicle 1100 used to aid in control of various features and functionality of the vehicle 1100, such as actuation of brakes, acceleration, braking, steering, windshield wipers, etc. A CAN bus may be configured to have dozens or even hundreds of nodes, each with its own unique identifier (e.g., a CAN ID). The CAN bus may be read to find steering wheel angle, ground speed, engine revolutions per minute (RPM), button positions, and/or other vehicle status indicators. The CAN bus may be ASIL B compliant.
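As a non-limiting illustration of reading a vehicle status indicator from a CAN bus, the following Python sketch decodes a steering wheel angle from a single frame. The CAN ID `0x025`, the signed 16-bit big-endian encoding, and the 0.1 degree scale are hypothetical; actual identifiers and signal layouts are specific to each vehicle and are not given in this description.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    can_id: int   # 11-bit standard identifier
    data: bytes   # up to 8 payload bytes for classic CAN


# Hypothetical message layout: ID 0x025 carries a signed 16-bit steering angle
# in units of 0.1 degree (big-endian) in the first two payload bytes.
STEERING_ANGLE_ID = 0x025


def decode_steering_angle(frame: CanFrame) -> float:
    """Return the steering wheel angle in degrees from a hypothetical frame."""
    if frame.can_id != STEERING_ANGLE_ID:
        raise ValueError(f"unexpected CAN ID {frame.can_id:#x}")
    (raw,) = struct.unpack_from(">h", frame.data, 0)
    return raw / 10.0


frame = CanFrame(STEERING_ANGLE_ID, struct.pack(">h", -155) + bytes(6))
angle_deg = decode_steering_angle(frame)
```

Filtering on the identifier before decoding reflects the fact that every node on the bus sees every frame and must select the messages it cares about.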

Although the bus 1102 is described herein as being a CAN bus, this is not intended to be limiting. For example, in addition to, or alternatively from, the CAN bus, FlexRay and/or Ethernet may be used. Additionally, although a single line is used to represent the bus 1102, this is not intended to be limiting. For example, there may be any number of buses 1102, which may include one or more CAN buses, one or more FlexRay buses, one or more Ethernet buses, and/or one or more other types of buses using a different protocol. In some examples, two or more buses 1102 may be used to perform different functions, and/or may be used for redundancy. For example, a first bus 1102 may be used for collision avoidance functionality and a second bus 1102 may be used for actuation control. In any example, each bus 1102 may communicate with any of the components of the vehicle 1100, and two or more buses 1102 may communicate with the same components. In some examples, each SoC 1104, each controller 1136, and/or each computer within the vehicle may have access to the same input data (e.g., inputs from sensors of the vehicle 1100), and may be connected to a common bus, such as the CAN bus.

The vehicle 1100 may include one or more controllers 1136, such as those described herein with respect to fig. 11A. The controller 1136 may be used for a variety of functions. The controller 1136 may be coupled to any of the other various components and systems of the vehicle 1100, and may be used for control of the vehicle 1100, artificial intelligence of the vehicle 1100, infotainment for the vehicle 1100, and/or the like.

The vehicle 1100 may include one or more systems on a chip (SoCs) 1104. The SoC 1104 may include CPU 1106, GPU 1108, processor 1110, cache 1112, accelerator 1114, data store 1116, and/or other components and features not illustrated. The SoC 1104 may be used to control the vehicle 1100 in a variety of platforms and systems. For example, one or more SoCs 1104 may be combined in a system (e.g., the system of the vehicle 1100) with an HD map 1122, which may obtain map refreshes and/or updates via the network interface 1124 from one or more servers (e.g., one or more servers 1178 of FIG. 11D).

CPU 1106 may include a CPU cluster or CPU complex (alternatively referred to herein as a "CCPLEX"). CPU 1106 may include multiple cores and/or L2 caches. For example, in some embodiments, CPU 1106 may include eight cores in a coherent multiprocessor configuration. In some embodiments, CPU 1106 may include four dual-core clusters, where each cluster has a dedicated L2 cache (e.g., a 2 MB L2 cache). CPU 1106 (e.g., the CCPLEX) may be configured to support simultaneous cluster operation, enabling any combination of the clusters of CPU 1106 to be active at any given time.

CPU 1106 may implement power management capabilities that include one or more of the following features: individual hardware blocks may be clock-gated automatically when idle to save dynamic power; each core clock may be gated when the core is not actively executing instructions due to execution of WFI/WFE instructions; each core may be independently power-gated; each core cluster may be independently clock-gated when all cores are clock-gated or power-gated; and/or each core cluster may be independently power-gated when all cores are power-gated. CPU 1106 may further implement an enhanced algorithm for managing power states, in which allowed power states and desired wake times are specified, and the hardware/microcode determines the best power state to enter for the core, the cluster, and the CCPLEX. The processing cores may support simplified power state entry sequences in software, with the work offloaded to microcode.
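For illustration only, the selection of a power state from a set of allowed states and a wake time may be sketched in Python as follows. The state names, relative power figures, exit latencies, and the selection rule (the lowest-power allowed state whose exit latency fits within the wake time) are assumptions; the description does not disclose the actual algorithm implemented in hardware or microcode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    relative_power: float    # lower is better; hypothetical units
    exit_latency_us: int     # time needed to wake from this state


# Hypothetical states for one core, from shallowest to deepest.
STATES = [
    PowerState("active", 1.00, 0),
    PowerState("clock_gated", 0.30, 5),
    PowerState("power_gated", 0.05, 200),
]


def choose_power_state(allowed: set, wake_deadline_us: int) -> PowerState:
    """Pick the lowest-power allowed state that can wake up in time."""
    candidates = [
        s for s in STATES
        if s.name in allowed and s.exit_latency_us <= wake_deadline_us
    ]
    if not candidates:
        raise ValueError("no allowed state meets the wake-time requirement")
    return min(candidates, key=lambda s: s.relative_power)
```

With these assumed figures, a core that must wake within 50 microseconds would be clock-gated rather than power-gated, because power gating saves more energy but takes too long to exit.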

GPU 1108 may include an integrated GPU (alternatively referred to herein as an "iGPU"). GPU 1108 may be programmable and may be efficient for parallel workloads. In some examples, GPU 1108 may use an enhanced tensor instruction set. GPU 1108 may include one or more streaming microprocessors, where each streaming microprocessor may include an L1 cache (e.g., an L1 cache with at least 96 KB storage capacity), and two or more of the streaming microprocessors may share an L2 cache (e.g., an L2 cache with a 512 KB storage capacity). In some embodiments, GPU 1108 may include at least eight streaming microprocessors. GPU 1108 may use one or more compute application programming interfaces (APIs). In addition, GPU 1108 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's CUDA).

GPU 1108 may be power-optimized for best performance in automotive and embedded use cases. For example, GPU 1108 may be fabricated on a fin field-effect transistor (FinFET) process. However, this is not intended to be limiting, and GPU 1108 may be fabricated using other semiconductor manufacturing processes. Each streaming microprocessor may incorporate a number of mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores may be partitioned into four processing blocks. In such an example, each processing block may be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA tensor cores for deep learning matrix arithmetic, an L0 instruction cache, a warp scheduler, a dispatch unit, and/or a 64 KB register file. In addition, the streaming microprocessors may include independent parallel integer and floating-point data paths to provide for efficient execution of workloads with a mix of computation and addressing calculations. The streaming microprocessors may include independent thread scheduling capability to enable finer-grain synchronization and cooperation between parallel threads. The streaming microprocessors may include a combined L1 data cache and shared memory unit in order to improve performance while simplifying programming.

GPU1108 may include a High Bandwidth Memory (HBM) and/or 16GB HBM2 memory subsystem that provides a peak memory bandwidth of approximately 900GB/s in some examples. In some examples, a Synchronous Graphics Random Access Memory (SGRAM), such as a fifth generation graphics double data rate synchronous random access memory (GDDR5), may be used in addition to or instead of HBM memory.

GPU1108 may include a unified memory technology that includes access counters to allow memory pages to be more precisely migrated to the processor that most frequently accesses them, thereby increasing the efficiency of the memory range shared between processors. In some examples, Address Translation Service (ATS) support may be used to allow GPU1108 to directly access CPU1106 page tables. In such an example, when the GPU1108 Memory Management Unit (MMU) experiences a miss, an address translation request may be transmitted to CPU 1106. In response, CPU1106 can look for a virtual-to-physical mapping for the address in its page table and transfer the translation back to GPU 1108. In this way, unified memory technology may allow a single unified virtual address space to be used for memory for both CPU1106 and GPU1108, thereby simplifying GPU1108 programming and the porting of applications to GPU 1108.

Further, GPU1108 may include an access counter that may track how often GPU1108 accesses memory of other processors. The access counters may help ensure that memory pages are moved to the physical memory of the processor that most frequently accesses those pages.
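
The access-counter behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `AccessCounterMigrator`, the `threshold` parameter, and the migration rule (migrate once a remote processor has accessed a page at least `threshold` times and more often than the current holder) are hypothetical and are not taken from the description, which does not specify a migration policy.

```python
from collections import defaultdict


class AccessCounterMigrator:
    """Toy model: count accesses per (page, processor) and migrate a page
    to the processor that accesses it most often."""

    def __init__(self, threshold=4):
        # Hypothetical policy knob: remote accesses needed before migrating.
        self.threshold = threshold
        # counts[page][processor] -> number of accesses seen so far.
        self.counts = defaultdict(lambda: defaultdict(int))
        # location[page] -> processor whose physical memory holds the page.
        self.location = {}

    def place(self, page, processor):
        """Set the initial physical location of a page."""
        self.location[page] = processor

    def access(self, page, processor):
        """Record one access and return the page's location afterwards."""
        self.counts[page][processor] += 1
        home = self.location[page]
        if processor != home:
            remote = self.counts[page][processor]
            local = self.counts[page][home]
            # Migrate only when the remote processor is the heavier user.
            if remote >= self.threshold and remote > local:
                self.location[page] = processor
        return self.location[page]
```

For example, a page initially placed in CPU memory stays there until the GPU has accessed it `threshold` times, after which the model reports the page as resident in GPU memory.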

SoC1104 may include any number of caches 1112, including those described herein. For example, cache 1112 may include an L3 cache available to both CPU1106 and GPU1108 (e.g., connected to both CPU1106 and GPU 1108). Cache 1112 may include a write-back cache that may track the state of a line, for example, by using a cache coherency protocol (e.g., MEI, MESI, MSI, etc.). Depending on the embodiment, the L3 cache may comprise 4MB or more, but smaller cache sizes may also be used.

SoC1104 may include an Arithmetic Logic Unit (ALU) that may be used to perform processing, such as processing DNNs, for any of a variety of tasks or operations with respect to vehicle 1100. In addition, SoC1104 may include a Floating Point Unit (FPU), or other math coprocessor or numeric coprocessor type, for performing mathematical operations within the system. For example, SoC1104 may include one or more FPUs integrated as execution units within CPU1106 and/or GPU 1108.

The SoC1104 may include one or more accelerators 1114 (e.g., hardware accelerators, software accelerators, or a combination thereof). For example, SoC1104 may include a hardware acceleration cluster, which may include an optimized hardware accelerator and/or a large on-chip memory. The large on-chip memory (e.g., 4MB SRAM) may enable the hardware acceleration cluster to accelerate neural networks and other computations. The hardware acceleration cluster may be used to supplement GPU1108 and offload some tasks of GPU1108 (e.g., freeing up more cycles of GPU1108 for performing other tasks). As one example, the accelerator 1114 may be used for targeted workloads that are stable enough to be amenable to acceleration (e.g., perception, Convolutional Neural Networks (CNNs), etc.). As used herein, the term "CNN" may include all types of CNNs, including region-based or Regional Convolutional Neural Networks (RCNNs) and Fast RCNNs (e.g., for object detection).

The accelerator 1114 (e.g., hardware acceleration cluster) may include a Deep Learning Accelerator (DLA). The DLA may include one or more Tensor Processing Units (TPUs) that may be configured to provide an additional 10 trillion operations per second for deep learning applications and inferencing. The TPU may be an accelerator configured for, and optimized for, performing image processing functions (e.g., for CNN, RCNN, etc.). The DLA can be further optimized for a specific set of neural network types and floating point operations, as well as for inferencing. DLA designs can provide higher performance per millimeter than general purpose GPUs and far exceed CPU performance. The TPU may perform several functions, including a single instance convolution function that supports, for example, INT8, INT16, and FP16 data types for both features and weights, as well as post-processor functions.

DLAs can quickly and efficiently execute neural networks, particularly CNNs, on processed or unprocessed data for any of a wide variety of functions, such as, for example and without limitation: a CNN for object recognition and detection using data from the camera sensor; a CNN for distance estimation using data from the camera sensor; a CNN for emergency vehicle detection and identification using data from the microphone; a CNN for face recognition and owner recognition using data from the camera sensor; and/or a CNN for security and/or safety-related events.

The DLA may perform any function of GPU1108, and by using an inference accelerator, for example, a designer may target either the DLA or GPU1108 for any function. For example, the designer may focus CNN processing and floating point operations on the DLA and leave other functionality to the GPU1108 and/or other accelerators 1114.

The accelerator 1114 (e.g., hardware acceleration cluster) may include a Programmable Visual Accelerator (PVA), which may alternatively be referred to herein as a computer vision accelerator. PVA may be designed and configured to accelerate computer vision algorithms for Advanced Driver Assistance System (ADAS), autonomous driving, and/or Augmented Reality (AR), and/or Virtual Reality (VR) applications. PVA can provide a balance between performance and flexibility. For example, each PVA may include, for example and without limitation, any number of Reduced Instruction Set Computer (RISC) cores, Direct Memory Access (DMA), and/or any number of vector processors.

The RISC core may interact with an image sensor (e.g., of any of the cameras described herein), an image signal processor, and/or the like. Each of these RISC cores may include any number of memories. Depending on the embodiment, the RISC core may use any of several protocols. In some examples, the RISC core may execute a real-time operating system (RTOS). The RISC core may be implemented using one or more integrated circuit devices, Application Specific Integrated Circuits (ASICs), and/or memory devices. For example, the RISC core may include an instruction cache and/or tightly coupled RAM.

DMA may enable components of the PVA to access system memory independently of CPU 1106. The DMA may support any number of features to provide optimization to the PVA, including, but not limited to, support for multidimensional addressing and/or circular addressing. In some examples, DMA may support addressing up to six or more dimensions, which may include block width, block height, block depth, horizontal block stepping, vertical block stepping, and/or depth stepping.
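
As a rough illustration of how the six addressing dimensions listed above could combine, the sketch below enumerates the linear addresses touched by one block transfer. It is an assumption-laden model, not the PVA's DMA programming interface: the function name `block_addresses`, the row-major layout, and the `row_pitch` and `plane_pitch` parameters (elements per row and per depth plane) are hypothetical.

```python
def block_addresses(base, row_pitch, plane_pitch,
                    block_w, block_h, block_d,
                    step_x, step_y, step_z,
                    bx, by, bz):
    """Return the linear element addresses of block (bx, by, bz).

    block_w/h/d : block width, height and depth, in elements
    step_x/y/z  : horizontal, vertical and depth block stepping
    row_pitch   : elements between consecutive rows (assumed layout)
    plane_pitch : elements between consecutive depth planes (assumed layout)
    """
    # Origin of the selected block inside the row-major volume.
    origin = (base
              + bz * step_z * plane_pitch
              + by * step_y * row_pitch
              + bx * step_x)
    # Walk the block depth-first, then rows, then columns.
    return [origin + z * plane_pitch + y * row_pitch + x
            for z in range(block_d)
            for y in range(block_h)
            for x in range(block_w)]
```

With an 8-element row pitch, a 2x2x1 block, and a stepping of two in both the horizontal and vertical directions, block (1, 1, 0) starts at element 18 and covers elements 18, 19, 26, and 27.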

The vector processor may be a programmable processor that may be designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In some examples, the PVA may include a PVA core and two vector processing subsystem partitions. The PVA core may include a processor subsystem, one or more DMA engines (e.g., two DMA engines), and/or other peripherals. The vector processing subsystem may operate as the main processing engine of the PVA and may include a Vector Processing Unit (VPU), an instruction cache, and/or a vector memory (e.g., VMEM). The VPU core may include a digital signal processor, such as, for example, a Single Instruction Multiple Data (SIMD), Very Long Instruction Word (VLIW) digital signal processor. The combination of SIMD and VLIW may enhance throughput and speed.

Each of the vector processors may include an instruction cache and may be coupled to a dedicated memory. As a result, in some examples, each of the vector processors may be configured to execute independently of the other vector processors. In other examples, the vector processors included in a particular PVA may be configured to employ data parallelism. For example, in some embodiments, multiple vector processors included in a single PVA may execute the same computer vision algorithm, but on different regions of the image. In other examples, the vector processors included in a particular PVA may perform different computer vision algorithms simultaneously on the same image, or even different algorithms on sequential images or portions of images. Any number of PVAs may be included in a hardware acceleration cluster, and any number of vector processors may be included in each of these PVAs. In addition, the PVA may include additional Error Correction Code (ECC) memory to enhance overall system safety.

The accelerator 1114 (e.g., hardware acceleration cluster) may include an on-chip computer vision network and SRAM to provide high bandwidth, low latency SRAM for the accelerator 1114. In some examples, the on-chip memory may include at least 4MB SRAM, consisting of, for example and without limitation, eight field-configurable memory blocks, which may be accessed by both the PVA and the DLA. Each pair of memory blocks may include an Advanced Peripheral Bus (APB) interface, configuration circuitry, a controller, and a multiplexer. Any type of memory may be used. The PVA and DLA may access the memory via a backbone that provides high-speed memory access to the PVA and DLA. The backbone may include an on-chip computer vision network that interconnects the PVA and DLA to memory (e.g., using APB).

The on-chip computer vision network may include an interface that determines, prior to transmitting any control signals/addresses/data, that both the PVA and the DLA provide ready and valid signals. Such an interface may provide separate phases and separate channels for transmitting control signals/addresses/data, as well as burst-type communication for continuous data transmission. This type of interface may conform to ISO 26262 or IEC 61508 standards, but other standards and protocols may also be used.

In some examples, the SoC1104 may include a real-time ray tracing hardware accelerator such as described in U.S. patent application No. 16/101,232, filed on August 10, 2018. The real-time ray tracing hardware accelerator may be used to quickly and efficiently determine the location and extent of objects (e.g., within a world model) in order to generate real-time visualization simulations for RADAR signal interpretation, for sound propagation synthesis and/or analysis, for SONAR system simulation, for general wave propagation simulation, for comparison with LIDAR data for localization and/or other functional purposes, and/or for other uses. In some embodiments, one or more Tree Traversal Units (TTUs) may be used to perform one or more ray tracing related operations.

The accelerator 1114 (e.g., hardware acceleration cluster) has a wide range of autonomous driving uses. The PVA may be a programmable visual accelerator that may be used for critical processing stages in ADAS and autonomous vehicles. The capabilities of the PVA are a good match for algorithm domains that require predictable processing at low power and low latency. In other words, the PVA performs well on semi-dense or dense regular computation, even on small data sets that require predictable runtimes with low latency and low power. Thus, in the context of platforms for autonomous vehicles, PVAs are designed to run classical computer vision algorithms because they are efficient at object detection and at operating on integer math.

For example, according to one embodiment of the technology, the PVA is used to perform computer stereo vision. In some examples, algorithms based on semi-global matching may be used, but this is not intended to be limiting. Many applications for level 3-5 autonomous driving require on-the-fly motion estimation/stereo matching (e.g., structure from motion, pedestrian recognition, lane detection, etc.). The PVA may perform computer stereo vision functions on input from two monocular cameras.

In some examples, the PVA may be used to perform dense optical flow. In other examples, the PVA may process raw RADAR data (e.g., using a 4D fast Fourier transform) to provide processed RADAR data. In still other examples, the PVA is used for time-of-flight depth processing, for example by processing raw time-of-flight data to provide processed time-of-flight data.

The DLA can be used to run any type of network to enhance control and driving safety, including, for example, neural networks that output a confidence measure for each object detection. Such confidence values may be interpreted as probabilities, or as providing a relative "weight" of each detection compared to the other detections. The confidence value enables the system to make further decisions as to which detections should be considered true positive detections rather than false positive detections. For example, the system may set a threshold for confidence, and only detections that exceed the threshold are considered true positive detections. In an Automatic Emergency Braking (AEB) system, a false positive detection may cause the vehicle to automatically perform emergency braking, which is clearly undesirable. Therefore, only the most confident detections should be considered as triggers for AEB. The DLA may run a neural network for regressing the confidence value. The neural network may take as its input at least some subset of parameters, such as bounding box dimensions, a ground plane estimate obtained (e.g., from another subsystem), Inertial Measurement Unit (IMU) sensor 1166 output related to vehicle 1100 orientation, distance, 3D position estimates of objects obtained from the neural network and/or other sensors (e.g., LIDAR sensor 1164 or RADAR sensor 1160), and so on.
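
The confidence thresholding described above can be sketched as follows. The `Detection` type, the function names, and the threshold values are illustrative assumptions only; the description does not specify particular thresholds or data structures.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # e.g., "pedestrian" (illustrative)
    confidence: float   # regressed confidence value in [0, 1]


def true_positive_candidates(detections, threshold):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.confidence > threshold]


def should_trigger_aeb(detections, aeb_threshold=0.95):
    """Trigger AEB only for the most confident detections.

    The default of 0.95 is a hypothetical value chosen for illustration;
    a stricter threshold reduces braking caused by false positives.
    """
    return any(d.confidence > aeb_threshold for d in detections)
```

Using a separate, stricter threshold for AEB than for general detection reflects the point made above: a false positive is more costly when it causes emergency braking.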

SoC1104 may include one or more data stores 1116 (e.g., memory). Data store 1116 may be an on-chip memory of SoC1104, which may store a neural network to be executed on the GPU and/or DLA. In some examples, the data store 1116 may be large enough to store multiple instances of a neural network for redundancy and safety. The data store 1116 may include an L2 or L3 cache 1112. References to the data store 1116 may include references to memory associated with the PVA, DLA, and/or other accelerators 1114 as described herein.

The SoC1104 may include one or more processors 1110 (e.g., embedded processors). Processor 1110 may include a boot and power management processor, which may be a dedicated processor and subsystem for handling boot and power management functions and related security implementations. The boot and power management processor may be part of the SoC1104 boot sequence and may provide runtime power management services. The boot and power management processor may provide clock and voltage programming, assistance with system low power state transitions, management of SoC1104 thermal and temperature sensors, and/or management of SoC1104 power states. Each temperature sensor may be implemented as a ring oscillator whose output frequency is proportional to temperature, and SoC1104 may use the ring oscillators to detect the temperature of CPU1106, GPU1108, and/or accelerator 1114. If it is determined that the temperature exceeds a threshold, the boot and power management processor may enter a temperature fault routine and place the SoC1104 in a lower power state and/or place the vehicle 1100 in a chauffeur-to-safe-stop mode (e.g., bringing the vehicle 1100 to a safe stop).
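
A minimal sketch of the temperature fault check described above is given below, assuming a purely proportional calibration between ring-oscillator frequency and temperature. The calibration constant `hz_per_kelvin`, the state names, and the function names are hypothetical and stand in for hardware-specific details the description does not give.

```python
def temperature_from_frequency(freq_hz, hz_per_kelvin):
    """Convert a ring-oscillator frequency to a temperature, assuming the
    frequency is directly proportional to temperature."""
    return freq_hz / hz_per_kelvin


def check_thermal(freqs_hz, hz_per_kelvin, limit_kelvin):
    """Return (power_state, overheated_units) for the monitored units.

    freqs_hz maps a unit name (e.g., "cpu", "gpu") to its oscillator
    frequency. If any unit exceeds the limit, a lower power state is
    requested, mirroring the temperature fault routine.
    """
    temps = {name: temperature_from_frequency(f, hz_per_kelvin)
             for name, f in freqs_hz.items()}
    overheated = sorted(name for name, t in temps.items() if t > limit_kelvin)
    state = "LOW_POWER" if overheated else "NORMAL"
    return state, overheated
```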

The processor 1110 may further include a set of embedded processors that may function as an audio processing engine. The audio processing engine may be an audio subsystem that allows for full hardware support for multi-channel audio over multiple interfaces and a wide range of flexible audio I/O interfaces. In some examples, the audio processing engine is a dedicated processor core having a digital signal processor with dedicated RAM.

The processor 1110 may further include an always-on-processor engine that may provide the necessary hardware features to support low power sensor management and wake-up use cases. The always-on-processor engine may include a processor core, tightly coupled RAM, support peripherals (e.g., timers and interrupt controllers), various I/O controller peripherals, and routing logic.

The processor 1110 may further include a safety cluster engine that includes a dedicated processor subsystem that handles safety management for automotive applications. The safety cluster engine may include two or more processor cores, tightly coupled RAM, supporting peripherals (e.g., timers, interrupt controllers, etc.), and/or routing logic. In a safety mode, the two or more cores may operate in lockstep mode and act as a single core with comparison logic that detects any differences between their operations.
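
Lockstep operation with comparison logic can be modeled in software as two executions of the same computation whose results are compared at every step. The sketch below is an analogy only: real lockstep cores compare in hardware, and the names `run_lockstep` and `LockstepMismatch` are hypothetical.

```python
class LockstepMismatch(Exception):
    """Raised when the two cores disagree on a result."""


def run_lockstep(core_a, core_b, inputs):
    """Feed the same inputs to both cores and compare each result.

    core_a and core_b are callables standing in for the two processor
    cores. The pair behaves like a single core: one result per input is
    returned, and any difference is reported as a fault.
    """
    results = []
    for step, value in enumerate(inputs):
        out_a = core_a(value)
        out_b = core_b(value)
        if out_a != out_b:
            raise LockstepMismatch(
                f"step {step}: core A produced {out_a!r}, core B {out_b!r}")
        results.append(out_a)
    return results
```

When both callables agree, `run_lockstep` returns a single list of results; a disagreement at any step raises `LockstepMismatch`, which corresponds to the comparison logic flagging a difference.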

Processor 1110 may further include a real-time camera engine, which may include a dedicated processor subsystem for handling real-time camera management.

The processor 1110 may further include a high dynamic range signal processor, which may include an image signal processor, which is a hardware engine that is part of the camera processing pipeline.

Processor 1110 may include a video image compositor, which may be a processing block (e.g., implemented on a microprocessor) that implements the video post-processing functions required by a video playback application to generate a final image for a player window. The video image compositor may perform lens distortion correction for wide angle camera 1170, surround camera 1174, and/or for in-cab surveillance camera sensors. The in-cab monitoring camera sensor is preferably monitored by a neural network running on another instance of the advanced SoC, configured to recognize in-cab events and respond accordingly. The in-cab system may perform lip reading to activate mobile phone services and place a call, dictate an email, change vehicle destinations, activate or change the infotainment systems and settings of the vehicle, or provide voice-activated web surfing. Certain functions are available to the driver only when the vehicle is operating in the autonomous mode, and are disabled otherwise.

The video image compositor may include enhanced temporal noise reduction for spatial and temporal noise reduction. For example, in the case of motion in video, noise reduction weights spatial information appropriately, reducing the weight of information provided by neighboring frames. In the case where the image or portion of the image does not include motion, the temporal noise reduction performed by the video image compositor may use information from previous images to reduce noise in the current image.
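
One way to picture the motion-dependent weighting described above is a per-pixel blend between a temporal estimate (using the previous frame) and a spatial estimate (using neighboring pixels in the current frame). The sketch below works on a single scanline for brevity. The linear weighting rule, the three-pixel spatial window, and the `motion_threshold` parameter are assumptions made for illustration, not the compositor's actual filter.

```python
def denoise_row(current, previous, motion_threshold=10.0):
    """Motion-adaptive noise reduction for one row of pixel intensities.

    Where the pixel barely changed between frames, the temporal average
    dominates; where it changed a lot (motion), the weight of the previous
    frame drops and the spatial average of the current frame dominates.
    """
    n = len(current)
    output = []
    for i in range(n):
        # Spatial estimate: mean of the pixel and its immediate neighbors.
        window = current[max(0, i - 1):min(n, i + 2)]
        spatial = sum(window) / len(window)
        # Temporal estimate: mean of the current and previous frame values.
        temporal = 0.5 * (current[i] + previous[i])
        # Temporal weight falls linearly from 1 (static) to 0 (strong motion).
        motion = abs(current[i] - previous[i])
        w_temporal = max(0.0, 1.0 - motion / motion_threshold)
        output.append(w_temporal * temporal + (1.0 - w_temporal) * spatial)
    return output
```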

The video image compositor may also be configured to perform stereoscopic correction on the input stereoscopic lens frames. The video image compositor may further be used for user interface composition when the operating system desktop is in use and GPU1108 is not required to continuously render (render) new surfaces. Even when GPU1108 is powered on and active for 3D rendering, the video image compositor may be used to ease the burden on GPU1108 to improve performance and responsiveness.

SoC1104 may further include a Mobile Industry Processor Interface (MIPI) camera serial interface for receiving video and input from cameras, a high-speed interface, and/or a video input block that may be used for camera and related pixel input functions. SoC1104 may further include an input/output controller that may be controlled by software and may be used to receive I/O signals that are not committed to a specific role.

SoC1104 may further include a wide range of peripheral interfaces to enable communication with peripherals, audio codecs, power management, and/or other devices. The SoC1104 may be used to process data from cameras (connected via Gigabit Multimedia Serial Link and Ethernet), data from sensors (e.g., LIDAR sensor 1164, RADAR sensor 1160, etc., which may be connected via Ethernet), data from the bus 1102 (e.g., velocity of the vehicle 1100, steering wheel position, etc.), and/or data from GNSS sensors 1158 (connected via an Ethernet or CAN bus). SoC1104 may further include dedicated high-performance mass storage controllers, which may include their own DMA engines, and which may be used to free CPU1106 from routine data management tasks.

SoC1104 may be an end-to-end platform with a flexible architecture that spans automation levels 3-5, providing a comprehensive functional safety architecture that leverages and efficiently uses computer vision and ADAS technology to achieve diversity and redundancy, along with deep learning tools to provide a platform for a flexible and reliable driving software stack. The SoC1104 may be faster, more reliable, and even more energy and space efficient than conventional systems. For example, when combined with the CPU1106, GPU1108, and data store 1116, the accelerator 1114 may provide a fast and efficient platform for level 3-5 autonomous vehicles.

The techniques thus provide capabilities and functionality not achievable by conventional systems. For example, computer vision algorithms may be executed on CPUs that may be configured, using a high-level programming language such as the C programming language, to execute a wide variety of processing algorithms across a wide variety of visual data. However, CPUs often fail to meet the performance requirements of many computer vision applications, such as those related to execution time and power consumption. In particular, many CPUs are not capable of executing complex object detection algorithms in real time, which is a requirement of onboard ADAS applications and of practical level 3-5 autonomous vehicles.

In contrast to conventional systems, by providing a CPU complex, a GPU complex, and a hardware acceleration cluster, the techniques described herein allow multiple neural networks to be executed simultaneously and/or sequentially, and the results combined together to achieve level 3-5 autonomous driving functionality. For example, a CNN executing on the DLA or a dGPU (e.g., GPU 1120) may perform text and word recognition, allowing the supercomputer to read and understand traffic signs, including signs for which the neural network has not been specifically trained. The DLA may further include a neural network capable of recognizing, interpreting, and providing a semantic understanding of the sign, and communicating the semantic understanding to a path planning module running on the CPU complex.

As another example, multiple neural networks may be operating simultaneously, as required for level 3, 4, or 5 driving. For example, a warning sign reading "Caution: flashing lights indicate icing conditions," together with an electric light, may be interpreted by several neural networks, either independently or collectively. The sign itself may be recognized as a traffic sign by a first deployed neural network (e.g., a trained neural network), and the text "flashing lights indicate icing conditions" may be interpreted by a second deployed neural network, which informs the path planning software of the vehicle (preferably executing on the CPU complex) that icing conditions exist when flashing lights are detected. The flashing lights may be identified by operating a third deployed neural network over a plurality of frames, which informs the path planning software of the vehicle of the presence (or absence) of the flashing lights. All three neural networks may run simultaneously, for example, within the DLA and/or on the GPU 1108.
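
The way the three network outputs could be combined for path planning is sketched below. The three networks themselves are not modeled: their outputs are represented by plain Boolean values, and the transition-counting rule used to decide that a light is flashing across frames is an assumption made for illustration, as are all function and parameter names.

```python
def is_flashing(light_on_per_frame, min_transitions=2):
    """Decide whether a light is flashing from per-frame on/off states.

    A light counts as flashing when its state changes at least
    min_transitions times over the observed frames (hypothetical rule).
    """
    transitions = sum(
        1 for a, b in zip(light_on_per_frame, light_on_per_frame[1:])
        if a != b)
    return transitions >= min_transitions


def icing_conditions_present(is_traffic_sign, text_links_flashing_to_icing,
                             light_on_per_frame):
    """Combine the outputs of the three networks.

    is_traffic_sign              : output of the first network
    text_links_flashing_to_icing : output of the second network
    light_on_per_frame           : per-frame output of the third network
    """
    return (is_traffic_sign
            and text_links_flashing_to_icing
            and is_flashing(light_on_per_frame))
```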

In some examples, the CNN for facial recognition and owner recognition may use data from the camera sensor to identify the presence of an authorized driver and/or owner of the vehicle 1100. The always-on sensor processing engine may be used to unlock the vehicle and turn on the lights when the vehicle owner approaches the driver door and, in a security mode, to disable the vehicle when the vehicle owner leaves the vehicle. In this manner, the SoC1104 provides security against theft and/or carjacking.

In another example, the CNN used for emergency vehicle detection and identification may use data from the microphone 1196 to detect and identify emergency vehicle sirens. In contrast to conventional systems that use a generic classifier to detect sirens and manually extract features, the SoC1104 uses CNNs to classify environmental and urban sounds and to classify visual data. In a preferred embodiment, the CNN running on the DLA is trained to identify the relative closing speed of the emergency vehicle (e.g., by using the Doppler effect). The CNN may also be trained to identify emergency vehicles specific to the local area in which the vehicle is operating, as identified by GNSS sensor 1158. Thus, for example, while operating in Europe, the CNN will seek to detect European sirens, and while in the United States, the CNN will seek to identify only North American sirens. Once an emergency vehicle is detected, a control program may be used to execute an emergency vehicle safety routine, slow the vehicle, drive to the curb, stop the vehicle, and/or idle the vehicle, with the assistance of the ultrasonic sensor 1162, until the emergency vehicle passes.
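
For reference, the physical relationship that a Doppler-based closing-speed estimate relies on can be written out directly. The sketch below is not the trained CNN described above; it applies the textbook Doppler formula for a sound source approaching a stationary listener, and it assumes the emitted siren frequency is known. The function name and the default speed of sound (343 m/s, roughly air at 20 degrees Celsius) are illustrative.

```python
def closing_speed_mps(observed_hz, emitted_hz, speed_of_sound_mps=343.0):
    """Estimate the closing speed of a sound source from its Doppler shift.

    For a source approaching a stationary listener at speed v:
        observed = emitted * c / (c - v)
    Solving for v gives:
        v = c * (1 - emitted / observed)
    A positive result means the source is approaching; a negative result
    means it is receding.
    """
    return speed_of_sound_mps * (1.0 - emitted_hz / observed_hz)
```

A siren heard at a higher pitch than its emitted frequency yields a positive closing speed, and an unchanged pitch yields zero.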

The vehicle may include a CPU 1118 (e.g., a discrete CPU or dCPU) that may be coupled to SoC1104 via a high speed interconnect (e.g., PCIe). The CPU 1118 may include, for example, an X86 processor. The CPU 1118 may be used to perform any of a variety of functions including, for example, arbitrating potentially inconsistent results between ADAS sensors and the SoC1104, and/or monitoring the status and health of the controller 1136 and/or infotainment SoC 1130.

Vehicle 1100 may include a GPU 1120 (e.g., a discrete GPU or dGPU) that may be coupled to SoC1104 via a high-speed interconnect (e.g., NVLINK by NVIDIA). The GPU 1120 may provide additional artificial intelligence functionality, for example, by executing redundant and/or different neural networks, and may be used to train and/or update the neural networks based at least in part on input from sensors (e.g., sensor data) of the vehicle 1100.

The vehicle 1100 may further include a network interface 1124, which may include one or more wireless antennas 1126 (e.g., one or more wireless antennas for different communication protocols, such as a cellular antenna, a Bluetooth antenna, etc.). Network interface 1124 may be used to enable wireless connectivity over the internet with the cloud (e.g., with server 1178 and/or other network devices), with other vehicles, and/or with computing devices (e.g., passengers' client devices). To communicate with other vehicles, a direct link may be established between the two vehicles, and/or an indirect link may be established (e.g., across a network and through the internet). The direct link may be provided using a vehicle-to-vehicle communication link. The vehicle-to-vehicle communication link may provide the vehicle 1100 with information about vehicles in proximity to the vehicle 1100 (e.g., vehicles in front of, to the side of, and/or behind the vehicle 1100). This function may be part of a cooperative adaptive cruise control function of vehicle 1100.

Network interface 1124 may include a SoC that provides modulation and demodulation functions and enables controller 1136 to communicate over a wireless network. Network interface 1124 may include a radio frequency front end for up-conversion from baseband to radio frequency and down-conversion from radio frequency to baseband. The frequency conversion may be performed by well-known processes and/or may be performed using a super-heterodyne process. In some examples, the radio frequency front end functionality may be provided by a separate chip. The network interface may include wireless functionality for communicating over LTE, WCDMA, UMTS, GSM, CDMA2000, Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and/or other wireless protocols.

The vehicle 1100 may further include a data store 1128 that may include an off-chip (e.g., off-SoC 1104) memory device. The data store 1128 can include one or more storage elements including RAM, SRAM, DRAM, VRAM, flash memory, a hard disk, and/or other components and/or devices that can store at least one bit of data.

The vehicle 1100 may further include a GNSS sensor 1158. GNSS sensors 1158 (e.g., GPS and/or assisted GPS sensors, differential GPS (DGPS) sensors, etc.) may be used to assist in mapping, sensing, occupancy grid generation, and/or path planning functions. Any number of GNSS sensors 1158 may be used, including for example and without limitation GPS using a USB connector with an Ethernet to serial (RS-232) bridge.

The vehicle 1100 may further include RADAR sensors 1160. The RADAR sensor 1160 may be used by the vehicle 1100 for long-range vehicle detection, even in dark and/or severe weather conditions. The RADAR functional safety level may be ASIL B. The RADAR sensor 1160 may use the CAN and/or bus 1102 (e.g., to transmit data generated by the RADAR sensor 1160) for control and for access to object tracking data, with Ethernet used to access raw data in some examples. A wide variety of RADAR sensor types may be used. For example and without limitation, the RADAR sensor 1160 may be adapted for front, rear, and side RADAR use. In some examples, a pulsed Doppler RADAR sensor is used.

The RADAR sensor 1160 may include different configurations, such as long range with a narrow field of view, short range with a wide field of view, short range side coverage, and so forth. In some examples, a long-range RADAR may be used for adaptive cruise control functions. The long-range RADAR system may provide a broad field of view (e.g., within a 250 m range) achieved by two or more independent scans. The RADAR sensor 1160 may help distinguish between static objects and moving objects and may be used by the ADAS system for emergency braking assistance and forward collision warning. The long-range RADAR sensor may include a monostatic, multi-mode RADAR with multiple (e.g., six or more) fixed RADAR antennas and high-speed CAN and FlexRay interfaces. In an example with six antennas, the central four antennas may create a focused beam pattern designed to record the surroundings of the vehicle 1100 at higher speeds with minimal interference from traffic in adjacent lanes. The other two antennas may extend the field of view, making it possible to quickly detect a vehicle entering or leaving the lane of the vehicle 1100.

As one example, a mid-range RADAR system may include a range of up to 160m (front) or 80m (rear) and a field of view of up to 42 degrees (front) or 150 degrees (rear). The short range RADAR system may include, but is not limited to, RADAR sensors designed to be mounted at both ends of the rear bumper. When mounted at both ends of the rear bumper, such RADAR sensor systems can create two beams that continuously monitor blind spots behind and beside the vehicle.

The short range RADAR system may be used in ADAS systems for blind spot detection and/or lane change assistance.

The vehicle 1100 may further include an ultrasonic sensor 1162. Ultrasonic sensors 1162, which may be placed in front, behind, and/or to the sides of the vehicle 1100, may be used for parking assistance and/or to create and update occupancy grids. A wide variety of ultrasonic sensors 1162 may be used, and different ultrasonic sensors 1162 may be used for different detection ranges (e.g., 2.5m, 4m). The ultrasonic sensor 1162 may operate at a functional safety level of ASIL B.

The vehicle 1100 may include a LIDAR sensor 1164. The LIDAR sensor 1164 may be used for object and pedestrian detection, emergency braking, collision avoidance, and/or other functions. The LIDAR sensor 1164 may have a functional safety level of ASIL B. In some examples, the vehicle 1100 may include a plurality of LIDAR sensors 1164 (e.g., two, four, six, etc.) that may use Ethernet (e.g., to provide data to a gigabit Ethernet switch).

In some examples, the LIDAR sensor 1164 may be capable of providing a list of objects and their distances for a 360 degree field of view. Commercially available LIDAR sensors 1164 may have an advertised range of approximately 100 m, for example, with an accuracy of 2 cm to 3 cm, and may support a 100 Mbps Ethernet connection. In some examples, one or more non-protruding LIDAR sensors 1164 may be used. In such an example, the LIDAR sensor 1164 may be implemented as a small device that may be embedded into the front, rear, sides, and/or corners of the vehicle 1100. In such an example, the LIDAR sensor 1164 may provide a field of view of up to 120 degrees horizontal and 35 degrees vertical, with a range of 200 m, even for low reflectivity objects. A front-mounted LIDAR sensor 1164 may be configured for a horizontal field of view between 45 degrees and 135 degrees.

In some examples, LIDAR technology such as 3D flash LIDAR may also be used. 3D flash LIDAR uses a flash of laser light as an emission source to illuminate the vehicle surroundings up to about 200 m. The flash LIDAR unit includes a receptor that records the laser pulse transit time and the reflected light on each pixel, which in turn corresponds to the range from the vehicle to the object. The flash LIDAR may allow a highly accurate and distortion-free image of the surrounding environment to be generated with each laser flash. In some examples, four flash LIDAR sensors may be deployed, one on each side of the vehicle 1100. Available 3D flash LIDAR systems include solid state 3D staring array LIDAR cameras (e.g., non-scanning LIDAR devices) with no moving parts other than a fan. A flash LIDAR device may use a 5 nanosecond class I (eye safe) laser pulse per frame and may capture the reflected laser light in the form of a 3D range point cloud and co-registered intensity data. Because a flash LIDAR is a solid-state device with no moving parts, the LIDAR sensor 1164 may be less susceptible to motion blur, vibration, and/or shock.
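The relationship between the recorded laser pulse transit time and the range to the object can be illustrated with a short sketch. It assumes that the recorded transit time is the round-trip time of the pulse and that each pixel is processed independently; the function names are hypothetical and are not taken from the disclosure.

```python
# Speed of light in vacuum, in metres per second.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_transit_time(round_trip_seconds: float) -> float:
    """Convert a round-trip pulse transit time into a one-way range in metres."""
    if round_trip_seconds < 0:
        raise ValueError("transit time must be non-negative")
    # The pulse travels to the object and back, so the path length is halved.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


def range_image(transit_times):
    """Map a 2D grid of per-pixel transit times to a 2D grid of ranges."""
    return [[range_from_transit_time(t) for t in row] for row in transit_times]
```

Under these assumptions, a range of about 200 m corresponds to a round-trip transit time of roughly 1.33 microseconds.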

The vehicle may further include IMU sensors 1166. In some examples, IMU sensor 1166 may be located at the center of the rear axle of vehicle 1100. IMU sensors 1166 may include, for example and without limitation, accelerometers, magnetometers, gyroscopes, magnetic compasses, and/or other sensor types. In some examples, for example in a six-axis application, IMU sensors 1166 may include an accelerometer and a gyroscope, while in a nine-axis application, IMU sensors 1166 may include an accelerometer, a gyroscope, and a magnetometer.

In some embodiments, the IMU sensors 1166 may be implemented as a miniature high-performance GPS-assisted inertial navigation system (GPS/INS) that combines micro-electromechanical systems (MEMS) inertial sensors, a high-sensitivity GPS receiver, and advanced Kalman filtering algorithms to provide estimates of position, velocity, and attitude. As such, in some examples, the IMU sensor 1166 may enable the vehicle 1100 to estimate heading without input from a magnetic sensor by directly observing and correlating changes in velocity from the GPS to the IMU sensor 1166. In some examples, IMU sensors 1166 and GNSS sensors 1158 may be combined into a single integrated unit.

The vehicle may include microphones 1196 positioned in and/or around vehicle 1100. The microphone 1196 may be used for emergency vehicle detection and identification, among other things.

The vehicle may further include any number of camera types, including a stereo camera 1168, a wide angle camera 1170, an infrared camera 1172, a surround camera 1174, a long-range and/or mid-range camera 1198, and/or other camera types. These cameras may be used to capture image data around the entire periphery of the vehicle 1100. The type of camera used depends on the embodiment and the requirements of the vehicle 1100, and any combination of camera types may be used to provide the necessary coverage around the vehicle 1100. Further, the number of cameras may vary depending on the embodiment. For example, the vehicle may include six cameras, seven cameras, ten cameras, twelve cameras, and/or another number of cameras. As one example and not by way of limitation, these cameras may support Gigabit Multimedia Serial Link (GMSL) and/or Gigabit Ethernet. Each of the cameras is described in more detail herein with respect to fig. 11A and 11B.

The vehicle 1100 may further include a vibration sensor 1142. The vibration sensor 1142 may measure vibrations of a component of the vehicle, such as an axle. For example, a change in vibration may indicate a change in the road surface. In another example, when two or more vibration sensors 1142 are used, the difference between the vibrations may be used to determine the friction or slip of the road surface (e.g., when there is a vibration difference between the powered drive shaft and the free rotating shaft).

The vehicle 1100 may include an ADAS system 1138. In some examples, ADAS system 1138 may include a SoC. The ADAS system 1138 may include autonomous/adaptive/automatic cruise control (ACC), Cooperative Adaptive Cruise Control (CACC), Forward Collision Warning (FCW), Automatic Emergency Braking (AEB), Lane Departure Warning (LDW), Lane Keeping Assist (LKA), Blind Spot Warning (BSW), Rear Cross Traffic Warning (RCTW), Collision Warning System (CWS), Lane Centering (LC), and/or other features and functions.

The ACC system may use a RADAR sensor 1160, a LIDAR sensor 1164, and/or a camera. ACC systems may include longitudinal ACC and/or lateral ACC. The longitudinal ACC monitors and controls the distance to the vehicle immediately in front of the vehicle 1100, and automatically adjusts the vehicle speed to maintain a safe distance from the vehicle in front. The lateral ACC performs distance keeping and advises the vehicle 1100 to change lanes when necessary. The lateral ACC is related to other ADAS applications such as LCA and CWS.

The CACC uses information from other vehicles, which may be received directly from the other vehicles over a wireless link via network interface 1124 and/or wireless antenna 1126, or indirectly through a network connection (e.g., over the internet). The direct link may be provided by a vehicle-to-vehicle (V2V) communication link, while the indirect link may be an infrastructure-to-vehicle (I2V) communication link. In general, the V2V communication concept provides information about the immediately preceding vehicle (e.g., the vehicle immediately in front of and in the same lane as vehicle 1100), while the I2V communication concept provides information about traffic further ahead. The CACC system may include either or both of I2V and V2V information sources. Given the information about the vehicles ahead of vehicle 1100, the CACC may be more reliable, and it has the potential to improve the smoothness of traffic flow and reduce road congestion.

FCW systems are designed to alert the driver to a hazard so that the driver can take corrective action. The FCW system uses a front-facing camera and/or RADAR sensor 1160 coupled to a dedicated processor, DSP, FPGA, and/or ASIC that is electrically coupled to driver feedback such as a display, speaker, and/or vibrating component. The FCW system may provide an alert in the form of, for example, a sound, a visual warning, a vibration, and/or a quick brake pulse.

The AEB system detects an impending forward collision with another vehicle or other object and may automatically apply the brakes if the driver does not take corrective action within a specified time or distance parameter. The AEB system may use a front-facing camera and/or RADAR sensor 1160 coupled to a dedicated processor, DSP, FPGA, and/or ASIC. When the AEB system detects a hazard, it typically first alerts the driver to take corrective action to avoid the collision, and if the driver does not take corrective action, the AEB system may automatically apply the brakes in an effort to prevent, or at least mitigate, the impact of the predicted collision. AEB systems may include techniques such as dynamic brake support and/or crash-imminent braking.

The LDW system provides visual, audible, and/or tactile warnings, such as steering wheel or seat vibrations, to alert the driver when the vehicle 1100 crosses a lane marker. The LDW system does not activate when the driver indicates an intentional lane departure by activating a turn signal. LDW systems may use a front-facing camera coupled to a dedicated processor, DSP, FPGA, and/or ASIC that is electrically coupled to driver feedback such as a display, speaker, and/or vibrating component.

The LKA system is a variation of the LDW system. If the vehicle 1100 begins to leave the lane, the LKA system provides a steering input or braking that corrects the vehicle 1100.

The BSW system detects and alerts the driver to vehicles in the blind spot of the car. BSW systems may provide visual, audible, and/or tactile alerts to indicate that it is unsafe to merge or change lanes. The system may provide additional warnings when the driver uses the turn signal. The BSW system may use rear-facing camera and/or RADAR sensors 1160 coupled to a special purpose processor, DSP, FPGA and/or ASIC that is electrically coupled to driver feedback such as a display, speakers and/or vibrating components.

The RCTW system may provide visual, audible, and/or tactile notification when an object is detected outside of the range of the rear camera when the vehicle 1100 is reversing. Some RCTW systems include an AEB to ensure that the vehicle brakes are applied to avoid a crash. The RCTW system may use one or more rear RADAR sensors 1160 coupled to a special purpose processor, DSP, FPGA and/or ASIC that is electrically coupled to driver feedback such as a display, speaker and/or vibrating components.

Conventional ADAS systems may be prone to false positive results, which may be annoying and distracting to the driver, but are typically not catastrophic, because the ADAS system alerts the driver and allows the driver to decide whether a safety condition really exists and act accordingly. However, in the autonomous vehicle 1100, in the event of conflicting results, the vehicle 1100 itself must decide whether to heed the results from the primary or secondary computer (e.g., first controller 1136 or second controller 1136). For example, in some embodiments, the ADAS system 1138 may be a backup and/or secondary computer for providing perception information to a backup computer rationality module. The backup computer rationality monitor may run redundant, diverse software on hardware components to detect faults in perception and dynamic driving tasks. Output from the ADAS system 1138 may be provided to a supervising MCU. If the outputs from the primary and secondary computers conflict, the supervising MCU must determine how to reconcile the conflict to ensure safe operation.

In some examples, the primary computer may be configured to provide a confidence score to the supervising MCU indicating the confidence of the primary computer in the selected result. If the confidence score exceeds a threshold, the supervising MCU may follow the direction of the primary computer regardless of whether the secondary computer provides a conflicting or inconsistent result. In the event that the confidence score does not satisfy the threshold and the primary and secondary computers indicate different results (e.g., a conflict), the supervising MCU may arbitrate between the computers to determine the appropriate result.
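The threshold-based decision described above can be sketched as follows. This is a minimal illustration rather than the implementation of the supervising MCU: the class, the function, and the arbitration callback are hypothetical names, and returning the shared result when the two computers agree below the threshold is an assumption, since the text does not address that case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ComputerOutput:
    result: str                # e.g. "brake" or "continue" (hypothetical labels)
    confidence: float = 0.0    # confidence score; only the primary's score is used


def supervise(
    primary: ComputerOutput,
    secondary: ComputerOutput,
    threshold: float,
    arbitrate: Callable[[ComputerOutput, ComputerOutput], str],
) -> str:
    """Select a result from the primary and secondary computer outputs."""
    # Confidence exceeds the threshold: follow the primary computer
    # regardless of what the secondary computer reports.
    if primary.confidence > threshold:
        return primary.result
    # Threshold not satisfied and the results conflict: arbitrate.
    if primary.result != secondary.result:
        return arbitrate(primary, secondary)
    # Assumption: with no conflict, the shared result is used.
    return primary.result
```

For example, with a threshold of 0.8 and an arbitration policy that prefers the secondary computer, a primary output of "continue" at confidence 0.5 and a secondary output of "brake" would resolve to "brake".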

The supervising MCU may be configured to run a neural network that is trained and configured to determine, based at least in part on outputs from the primary computer and the secondary computer, the conditions under which the secondary computer provides false alarms. Thus, the neural network in the supervising MCU may learn when the output of the secondary computer can be trusted and when it cannot. For example, when the secondary computer is a RADAR-based FCW system, the neural network in the supervising MCU may learn when the FCW system is identifying metallic objects that are not in fact hazards, such as a drainage grate or manhole cover that triggers an alarm. Similarly, when the secondary computer is a camera-based LDW system, the neural network in the supervising MCU may learn to override the LDW when a bicyclist or pedestrian is present and a lane departure is in fact the safest maneuver. In embodiments that include a neural network running on the supervising MCU, the supervising MCU may include at least one of a DLA or a GPU suitable for running the neural network with associated memory. In a preferred embodiment, the supervising MCU may include and/or be included as a component of the SoC 1104.

In other examples, ADAS system 1138 may include a secondary computer that performs ADAS functions using conventional computer vision rules. As such, the secondary computer may use classical computer vision rules (if-then), and the presence of a neural network in the supervising MCU may improve reliability, safety, and performance. For example, the diverse implementation and intentional non-identity make the overall system more fault tolerant, especially to faults caused by software (or software-hardware interface) functionality. For example, if there is a software bug or error in the software running on the primary computer, and the non-identical software code running on the secondary computer provides the same overall result, the supervising MCU may have greater confidence that the overall result is correct and that the bug in the software or hardware on the primary computer is not causing a material error.

In some examples, the output of the ADAS system 1138 may be fed to a perception block of the primary computer and/or a dynamic driving task block of the primary computer. For example, if the ADAS system 1138 indicates a forward collision warning due to an object immediately ahead, the perception block may use this information when identifying objects. In other examples, the secondary computer may have its own trained neural network and thereby reduce the risk of false positives, as described herein.

The vehicle 1100 may further include an infotainment SoC 1130 (e.g., an in-vehicle infotainment system (IVI)). Although shown and described as a SoC, the infotainment system may not be a SoC and may include two or more discrete components. Infotainment SoC 1130 may include a combination of hardware and software that may be used to provide audio (e.g., music, personal digital assistants, navigation instructions, news, broadcasts, etc.), video (e.g., TV, movies, streaming media, etc.), telephony (e.g., hands-free calls), network connectivity (e.g., LTE, Wi-Fi, etc.), and/or information services (e.g., navigation systems, rear-parking assistance, radio data systems, vehicle-related information such as fuel level, total distance covered, brake fluid level, door open/close, air filter information, etc.) to vehicle 1100. For example, infotainment SoC 1130 may include a radio, a disk player, a navigation system, a video player, USB and Bluetooth connectivity, a car computer, in-car entertainment, Wi-Fi, steering wheel audio controls, hands-free voice control, a heads-up display (HUD), HMI display 1134, a telematics device, a control panel (e.g., for controlling and/or interacting with various components, features, and/or systems), and/or other components. The infotainment SoC 1130 may further be used to provide information (e.g., visual and/or audible) to a user of the vehicle, such as information from the ADAS system 1138, autonomous driving information such as planned vehicle maneuvers, trajectories, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.), and/or other information.

Infotainment SoC 1130 may include GPU functionality. Infotainment SoC 1130 may communicate with other devices, systems, and/or components of vehicle 1100 via bus 1102 (e.g., CAN bus, Ethernet, etc.). In some examples, the infotainment SoC 1130 may be coupled to a supervising MCU such that the GPU of the infotainment system may perform some self-driving functions in the event that the primary controller 1136 (e.g., the primary and/or backup computer of the vehicle 1100) fails. In such an example, infotainment SoC 1130 may place vehicle 1100 in a chauffeur-to-safe-stop mode as described herein.

Vehicle 1100 may further include an instrument cluster 1132 (e.g., a digital instrument cluster, an electronic instrument cluster, a digital instrument panel, etc.). The instrument cluster 1132 may include a controller and/or a supercomputer (e.g., a separate controller or supercomputer). The instrument cluster 1132 may include a suite of instruments such as a speedometer, fuel level, oil pressure, tachometer, odometer, turn indicator, gear position indicator, seat belt warning light, parking brake warning light, engine fault light, airbag (SRS) system information, lighting controls, safety system controls, navigation information, and so forth. In some examples, information may be displayed and/or shared between infotainment SoC 1130 and instrument cluster 1132. In other words, instrument cluster 1132 may be included as part of infotainment SoC 1130, or vice versa.

Fig. 11D is a system diagram of communications between a cloud-based server and the example autonomous vehicle 1100 of fig. 11A, according to some embodiments of the present disclosure. The system 1176 may include a server 1178, a network 1190, and vehicles, including vehicle 1100. Server 1178 may include a plurality of GPUs 1184(A)-1184(H) (collectively referred to herein as GPUs 1184), PCIe switches 1182(A)-1182(H) (collectively referred to herein as PCIe switches 1182), and/or CPUs 1180(A)-1180(B) (collectively referred to herein as CPUs 1180). The GPUs 1184, CPUs 1180, and PCIe switches 1182 may be interconnected with high-speed interconnects such as, for example and without limitation, the NVLink interface 1188 developed by NVIDIA and/or PCIe connections 1186. In some examples, the GPUs 1184 are connected via NVLink and/or an NVSwitch SoC, and the GPUs 1184 and the PCIe switches 1182 are connected via PCIe interconnects. Although eight GPUs 1184, two CPUs 1180, and two PCIe switches are shown, this is not intended to be limiting. Depending on the embodiment, each of the servers 1178 may include any number of GPUs 1184, CPUs 1180, and/or PCIe switches. For example, each of the servers 1178 can include eight, sixteen, thirty-two, and/or more GPUs 1184.

The server 1178 may receive, over the network 1190 and from the vehicles, image data representing images showing unexpected or changed road conditions, such as recently commenced road work. The server 1178 may transmit, over the network 1190 and to the vehicles, the neural network 1192, the updated neural network 1192, and/or map information 1194, including information about traffic and road conditions. The updates to the map information 1194 may include updates to the HD map 1122, such as information about construction sites, potholes, detours, flooding, or other obstructions. In some examples, the neural network 1192, the updated neural network 1192, and/or the map information 1194 may have resulted from new training and/or experience represented in data received from any number of vehicles in the environment, and/or may have been generated based at least in part on training performed at a data center (e.g., using the server 1178 and/or other servers).

The server 1178 can be used to train a machine learning model (e.g., a neural network) based at least in part on training data. The training data may be generated by the vehicles, and/or may be generated in a simulation (e.g., using a game engine). In some examples, the training data is labeled (e.g., where the neural network benefits from supervised learning) and/or subject to other preprocessing, while in other examples, the training data is not labeled and/or preprocessed (e.g., where the neural network does not require supervised learning). Training may be performed according to any one or more classes of machine learning techniques, including but not limited to the following classes: supervised training, semi-supervised training, unsupervised training, self-learning, reinforcement learning, federated learning, transfer learning, feature learning (including principal component and cluster analysis), multi-linear subspace learning, manifold learning, representation learning (including sparse dictionary learning), rule-based machine learning, anomaly detection, and any variant or combination thereof. Once the machine learning model is trained, the machine learning model may be used by the vehicles (e.g., transmitted to the vehicles over network 1190), and/or the machine learning model may be used by server 1178 to remotely monitor the vehicles.

In some examples, the server 1178 may receive data from the vehicles and apply the data to up-to-date real-time neural networks for real-time intelligent inference. The server 1178 may include a deep learning supercomputer and/or a dedicated AI computer powered by the GPUs 1184, such as the DGX and DGX Station machines developed by NVIDIA. However, in some examples, the server 1178 may include a deep learning infrastructure of a data center that is powered using only CPUs.

The deep learning infrastructure of the server 1178 may be capable of fast, real-time inference and may use this capability to assess and verify the health of the processors, software, and/or associated hardware in the vehicle 1100. For example, the deep learning infrastructure may receive periodic updates from the vehicle 1100, such as a sequence of images and/or the objects that the vehicle 1100 has located in that sequence of images (e.g., via computer vision and/or other machine learning object classification techniques). The deep learning infrastructure may run its own neural network to identify the objects and compare them with the objects identified by the vehicle 1100, and if the results do not match and the infrastructure concludes that the AI in the vehicle 1100 is malfunctioning, the server 1178 may transmit a signal to the vehicle 1100 instructing a fail-safe computer of the vehicle 1100 to assume control, notify the passengers, and complete a safe parking maneuver.
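One way such a comparison could be organized is sketched below. The text does not specify how a mismatch is measured or how a malfunction is concluded, so the set-overlap (Jaccard) measure, the 0.8 threshold, the signal string, and all function names here are assumptions made purely for illustration; the sketch also collapses "results do not match" and "the AI is malfunctioning" into a single check.

```python
from typing import Iterable, Optional


def objects_agree(
    vehicle_labels: Iterable[str],
    server_labels: Iterable[str],
    min_overlap: float = 0.8,
) -> bool:
    """Return True if the two sets of identified objects overlap sufficiently."""
    vehicle_set = set(vehicle_labels)
    server_set = set(server_labels)
    union = vehicle_set | server_set
    if not union:
        # Neither side identified anything, so there is nothing to dispute.
        return True
    # Jaccard overlap: shared objects divided by all objects seen by either side.
    return len(vehicle_set & server_set) / len(union) >= min_overlap


def health_check_signal(
    vehicle_labels: Iterable[str],
    server_labels: Iterable[str],
    min_overlap: float = 0.8,
) -> Optional[str]:
    """Return a hypothetical fail-safe signal when the results do not match."""
    if objects_agree(vehicle_labels, server_labels, min_overlap):
        return None
    return "ENGAGE_FAIL_SAFE"
```

A deployed system would likely compare object positions and classes over several frames before concluding that the in-vehicle AI is malfunctioning.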

For inference, the server 1178 may include the GPUs 1184 and one or more programmable inference accelerators (e.g., TensorRT of NVIDIA). The combination of GPU-powered servers and inference acceleration may make real-time responses possible. In other examples, such as where performance is less critical, servers powered by CPUs, FPGAs, and other processors may be used for inference.

The disclosure may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal digital assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The present disclosure may be practiced in a wide variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialized computing devices, and the like. The disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.

As used herein, a recitation of "and/or" with respect to two or more elements should be interpreted to mean only one element or a combination of elements. For example, "element A, element B, and/or element C" may include only element A, only element B, only element C, element A and element B, element A and element C, element B and element C, or elements A, B, and C. Further, "at least one of element A or element B" may include at least one of element A, at least one of element B, or at least one of element A and at least one of element B. Further, "at least one of element A and element B" may include at least one of element A, at least one of element B, or at least one of element A and at least one of element B.

The subject matter of the present disclosure is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this disclosure. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.

Claims

1. A method, comprising:

identifying, in a first image, a region corresponding to an object having a first background in the first image;

determining image data representing the object based at least on the region of the object;

generating a second image including the object having a second background based at least on integrating the object having the second background using the image data; and

training at least one neural network to perform a prediction task using the second image.

2. The method of claim 1, wherein the identification of the region comprises determining that at least a first segment of the first image corresponds to the object and at least a second segment of the first image corresponds to the first background based at least on performing image segmentation on the first image.

3. The method of claim 1, wherein the training of the at least one neural network is to classify one or more poses of the object.

4. The method of claim 1, wherein determining the image data comprises generating a mask based at least on the region in the first image and applying the mask to the first image.

5. The method of claim 1, wherein determining the image data comprises generating a mask based at least on the region in the first image, and generating the second image is based at least on performing one or more of dilation or erosion on a portion of the mask corresponding to the object.

6. The method of claim 1, wherein determining the image data comprises generating a mask based at least on the region in the image, and generating the second image is based at least on blurring at least a portion of a boundary of the mask corresponding to the object.

7. The method of claim 1, wherein generating the second image comprises modifying a hue of the object.

8. The method of claim 1, further comprising:

selecting a view of the object in the environment; and

generating the first image from a three-dimensional capture of the object in the environment of the view based at least on rasterization.

9. The method of claim 1, further comprising generating a third image comprising the object with a third background based at least on integrating the object with the third background, wherein the training of the at least one neural network further uses the third image.

10. The method of claim 1, wherein the integration comprises seamless blending of the object with the second background.

11. A method, comprising:

receiving images of one or more objects having a plurality of backgrounds;

generating a set of inference scores corresponding to one or more predictions of a prediction task performed on the one or more objects using the images;

selecting a background based at least on one or more inference scores of the set of inference scores;

generating an image based at least on integrating an object with the background based at least on the selection of the background; and

applying the image during training of at least one neural network to perform the prediction task.

12. The method of claim 11, wherein the one or more inference scores comprise a plurality of inference scores for a set of the images that includes the background, and the selecting is based at least on an analysis of the plurality of inference scores.

13. The method of claim 11, wherein the one or more inference scores are generated using the at least one neural network in a first epoch of training the at least one neural network and the use of the image is in a second epoch of the training.

14. The method of claim 11, wherein generating the image comprises generating a mask based at least on identifying regions of the object in the image and applying the mask to the image.

15. The method of claim 11, wherein the selection of the background is based at least on determining that at least one of the one or more inference scores is below a threshold.

16. The method of claim 11, wherein the one or more inference scores correspond to a first cropped region of the background and the integrating is of the object with a second cropped region of the background different from the first cropped region.

17. The method of claim 11, wherein the background is a background type and the method further comprises synthetically generating the background based at least on the background type.

18. A system, comprising:

one or more processors; and

one or more memory devices storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:

based at least on modifying a background of an object using a mask, obtaining at least one neural network trained to perform a prediction task on the image using an input generated from the mask corresponding to the object in the image;

generating a mask corresponding to an object in an image, wherein the object has a background in the image;

generating an input to the at least one neural network using the mask, the input capturing the object with at least a portion of the background; and

generating at least one prediction of the prediction task based at least on applying the input to the at least one neural network.

19. The system of claim 18, wherein the input comprises a first input representing at least a portion of the image and a second input representing at least a portion of the mask, wherein the image comprises the object with at least a portion of the background.

20. The system of claim 18, wherein the generating of the input comprises generating image data based at least on modifying the background using the mask, and at least a portion of the input corresponds to the image data.

21. The system of claim 18, wherein the generating is based at least on fusing a first set of inference scores corresponding to a first portion of the input representing at least a portion of the image of the object including at least a portion of the background and a second set of inference scores corresponding to a second portion of the input representing at least a portion of the mask.

22. The system of claim 18, wherein the operations are performed by at least one of:

a control system for an autonomous or semi-autonomous machine;

a perception system of an autonomous or semi-autonomous machine;

a system for performing a simulation operation;

a system to perform deep learning operations;

a system implemented using an edge device;

a system implemented using a robot;

a system incorporating one or more Virtual Machines (VMs);

a system implemented at least in part at a data center; or

a system implemented at least in part using cloud computing resources.