CN113762461A - Training neural networks with finite data using reversible enhancement operators - Google Patents

Training neural networks with finite data using reversible enhancement operators

Info

Publication number
CN113762461A
Authority
CN
China
Prior art keywords
data
enhancement
distribution
training
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110623844.1A
Other languages
Chinese (zh)
Inventor
T·T·卡拉斯
M·S·艾塔拉
J·J·海尔斯顿
S·M·莱内
J·T·莱赫蒂宁
T·O·艾拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/210,934 (published as US20210383241A1)
Application filed by Nvidia Corp filed Critical Nvidia Corp
Publication of CN113762461A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

Embodiments relate to techniques for training a neural network, such as a generative adversarial network (GAN), using a limited amount of data. Training a GAN with too little example data typically causes the discriminator to overfit, resulting in training divergence and poor results. An adaptive discriminator enhancement mechanism is used that significantly stabilizes training with limited data, providing the ability to train high-quality GANs. An enhancement operator applied to the distribution of inputs to the discriminator used to train the generator represents a reversible transformation, which ensures that no enhancements leak into the images produced by the generator. Reducing the amount of training data required to achieve convergence has the potential to significantly aid many applications and may increase the use of generative models in areas such as medicine.

Description

Training neural networks with finite data using reversible enhancement operators
Claim of priority
The present application claims the benefit of U.S. provisional application No. 63/035,448, entitled "Training a Generative Adversarial Network with Limited Data," filed on June 5, 2020, which is incorporated herein by reference in its entirety.
Background
A large amount of training data is needed to adequately train a generative adversarial network (GAN) to perform well, e.g., in order to generate realistic images. Training a GAN with too little data typically results in overfitting of the discriminator, where the feedback of the discriminator to the generator becomes meaningless, causing the training to diverge rather than converge. Unfortunately, it may be difficult to provide large amounts of training data. It can be challenging to collect a sufficiently large set of images for a particular application that imposes constraints on subject type, image quality, geographic location, time period, privacy, copyright status, and the like. These difficulties are further exacerbated in applications that require the capture of new, custom data sets: acquiring, processing, and distributing the millions of images required to train a modern high-quality, high-resolution GAN is an expensive undertaking.
In almost all areas of deep learning, data set enhancement provides additional training data and is a standard solution to overfitting. For example, training an image classifier under rotations, added noise, and the like results in increasing invariance to these semantics-preserving distortions (a highly desirable quality in a classifier). In contrast, a GAN trained under similar dataset enhancements learns to generate the enhanced distribution. In general, such "leakage" of enhancements into the generated images is highly undesirable. For example, noise enhancement produces noisy generated images even if no noise is present in the data set. In other words, the generator learns to generate images containing the enhancements. There is a need to address these and/or other problems associated with the prior art.
Disclosure of Invention
Embodiments of the present disclosure relate to a technique for training a neural network, such as a generative adversarial network (GAN), using a limited amount of data. Depending on the task, the GAN-generated data may include images, audio, video, three-dimensional (3D) objects, text, and the like. Training a GAN using too little example data typically results in the discriminator overfitting, resulting in training divergence. To avoid divergence during training, an adaptive discriminator enhancement mechanism is used that significantly stabilizes training in limited-data regimes. The adaptive discriminator enhancement mechanism does not require changes to the loss function or network architecture and is applicable both when training from scratch and when fine-tuning an existing GAN on another data set. The enhancement operator applied to the distribution of the inputs to the discriminator represents a reversible transformation, which ensures that no enhancements leak into the images produced by the generator trained using the discriminator. Reducing the amount of training data required to achieve convergence has the potential to significantly aid many applications. For example, reducing the amount of training data needed may increase the use of generative models in fields such as medicine.
A method, computer-readable medium, and system for training a neural network using a limited amount of data are disclosed. The neural network receives training data comprising input data and ground truth output, wherein the input data is associated with a first distribution. Applying at least one enhancement to the input data to produce enhanced input data associated with a second distribution, wherein an enhancement operator corresponding to a transformation from the first distribution to the second distribution is invertible and specifies the at least one enhancement. The augmented input data is processed by the neural network according to parameters to produce output data, and the parameters are adjusted to reduce differences between the output data and the ground truth output.
Drawings
The present system and method for adaptive enhancement is described in detail below with reference to the attached drawing figures, wherein:
fig. 1A illustrates a block diagram of a GAN training framework, according to an embodiment.
Fig. 1B shows a graph of convergence for various amounts of GAN training data using prior art techniques.
Fig. 1C shows a GAN training process using the prior art.
Fig. 2A illustrates a block diagram of an example enhanced training configuration suitable for implementing some embodiments of the present disclosure.
Fig. 2B illustrates a block diagram of an example enhanced training configuration suitable for use in implementing some embodiments of the present disclosure.
Fig. 2C illustrates a transformation using an irreversible enhancement operator, according to an embodiment.
Fig. 2D illustrates a transformation using a reversible enhancement operator, according to an embodiment.
Fig. 2E illustrates a flow diagram of a method of training a neural network with limited data using discriminator enhancement, according to an embodiment.
Fig. 3A illustrates the enhancement unit shown in fig. 2A and 2B according to an embodiment.
Fig. 3B shows a graph of FIDs for different p-values and various amounts of GAN training data, according to an embodiment.
Fig. 3C shows a graph of FID and various amounts of GAN training data for different target values of adaptive discriminator enhancement, according to an embodiment.
Fig. 3D illustrates a flow diagram of a method for training a neural network with limited data using adaptive discriminator enhancement, according to an embodiment.
Fig. 3E illustrates an improved GAN training process enhanced using an adaptive discriminator according to an embodiment.
FIG. 4 illustrates an example parallel processing unit suitable for implementing some embodiments of the present disclosure.
FIG. 5A is a conceptual diagram of a processing system implemented using the PPU of FIG. 4, which is suitable for implementing some embodiments of the present disclosure.
FIG. 5B illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
FIG. 5C illustrates components of an exemplary system that can be used to train and utilize machine learning in at least one embodiment.
Fig. 6 illustrates an exemplary streaming system suitable for implementing some embodiments of the present disclosure.
Detailed Description
Systems and methods are disclosed that relate to training neural networks (such as generative adversarial networks) using a limited amount of data. First, there is a need to better understand how the amount of available training data affects GAN training. A comprehensive analysis of the conditions for preventing enhancement leakage is then described. Then, different sets of enhancements may be provided, and an adaptive control scheme enables a consistent solution to be used regardless of the amount of training data, the attributes of the data set, or the exact training setup (e.g., training from scratch or transfer learning).
In particular, the techniques described herein may be used to train a generative adversarial network (GAN). The GAN includes a generator neural network and a discriminator neural network. The generator receives a latent code (random numbers) and applies the parameters learned during training to the latent code to generate output data, such as an image. The discriminator acts as an adaptive loss function that is used during training of the generator. The training data of the discriminator includes example output data (real data) with which the output data (generated data) produced by the generator should be consistent. The discriminator determines whether the generated data appears similar to the real data included in the training data. For example, when the generator is trained to generate images of human faces, the discriminator determines whether a generated human face appears similar to the example images (e.g., real images) of human faces. The discriminator enhancement techniques provide a wide range of enhancements that prevent overfitting of the discriminator while ensuring that the enhancements do not leak into the generated data.
Fig. 1A illustrates a block diagram of a GAN training framework, according to an embodiment. The GAN120 may be implemented by a program, a custom circuit, or by a combination of custom circuit and program. For example, GAN120 may be implemented using a GPU, a CPU, or any processor capable of performing the operations described herein. Moreover, those of ordinary skill in the art will appreciate that any system that performs the operations of the GAN120 is within the scope and spirit of embodiments of the present invention.
The GAN 120 includes a generator (neural network) 100, a discriminator (neural network) 110, and a training loss unit 115. The topology of both the generator 100 and the discriminator 110 may be modified during training. The GAN 120 may operate in an unsupervised setting or a conditional setting. The generator 100 receives input and produces output data. Depending on the task, the output data may be images, audio, video, or other types of data (configuration settings). In one embodiment, the training simulates a "two-player game," where the generator 100 and the discriminator 110 operate as the first and second players, respectively. In one embodiment, the generator 100 learns to generate data that the discriminator 110 judges to be authentic rather than generated. In other words, the game is such that the generator 100 tries to trick the discriminator 110 into concluding that the generated data is authentic.
In an embodiment, the training is performed over one or more iterations, where each iteration consists of multiple phases performed in an alternating manner. In the first phase, the parameters of the discriminator 110 are learned while the discriminator 110 receives the generated output data produced by the generator 100. The discriminator 110 is trained to indicate that the generated output data is generated and not authentic. In the second phase, the discriminator 110 receives example output data (e.g., a real image of a human face), and the parameters are updated as the discriminator 110 is trained to indicate that the example output data is real and not generated. In an embodiment, the ground truth output indicates that the example output data is real (not generated) and that the generated data is not real. In the third phase, the parameters of the generator 100 are adjusted (while the parameters of the discriminator 110 are fixed) as the generator 100 learns to generate output data that the discriminator 110 indicates is consistent with the example output data. In an embodiment, transfer learning is performed, and training is started using a generator and a discriminator that have already been trained for a different task.
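To make the alternating phases concrete, the following minimal sketch shows one training iteration in PyTorch-style code using a non-saturating logistic loss. The names (G, D, opt_G, opt_D, latent_dim) are illustrative assumptions and not the claimed implementation.

```python
import torch
import torch.nn.functional as F

def training_iteration(G, D, opt_G, opt_D, real_images, latent_dim):
    """One alternating GAN iteration (minimal sketch, non-saturating logistic loss)."""
    batch, device = real_images.shape[0], real_images.device

    # Phases 1 and 2: update the discriminator on generated and real images.
    z = torch.randn(batch, latent_dim, device=device)
    with torch.no_grad():
        fake_images = G(z)                      # generator parameters stay fixed here
    d_fake = D(fake_images)                     # pushed towards "generated"
    d_real = D(real_images)                     # pushed towards "real"
    loss_D = F.softplus(d_fake).mean() + F.softplus(-d_real).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Phase 3: update the generator so that the discriminator judges its output as real.
    z = torch.randn(batch, latent_dim, device=device)
    loss_G = F.softplus(-D(G(z))).mean()        # discriminator parameters are not updated
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```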
As shown in fig. 1A, an earlier generated image of a human face may include some realistic characteristics (e.g., eyes, mouth, nose, hair), but not consistent with a real image of the human face. The discriminator 110 indicates that the earlier generated image is inconsistent with the example output data used to train the discriminator 110. The generated image, which is later produced during training of the generator 100, is more consistent with the real image of the human face.
In one embodiment, the discriminator 110 outputs a continuous value indicating how closely the output data matches the example output data. For example, in one embodiment, the discriminator 110 outputs a first training stimulus (e.g., a high value, TRUE, or a first state) when the output data is determined to match the example output data, and outputs a second training stimulus (e.g., a low value, FALSE, or a second state) when the output data is determined not to match the example output data. The training loss unit 115 adjusts the parameters (weights) of the generator 100 and the discriminator 110 based on the output of the discriminator 110. In an embodiment, the training loss unit 115 adjusts the parameters to reduce the difference between the values and the ground truth output.
When the generator 100 is trained for a particular task, such as generating an image of a human face, the discriminator 110 outputs a high value when the output data is an image of a human face. The output data generated by the generator 100 need not be identical to the example output data used by the discriminator 110 to determine that the output data matches the example output data. In the context of the following description, the discriminator 110 determines that the output data matches the example output data when the output data is similar to any of the example output data. Training may continue to adjust parameters of the discriminator and/or generator by providing the generated output data and the example output data to the discriminator and evaluating the output values.
In a conditional setting, the input data to the GAN120 may include other data, such as images, classification labels, segmentation contours, and other (additional) types of data (distribution, audio, etc.). After training, the additional data may be used as an explicit way to control the output of the generator, e.g. to generate images corresponding to a particular selection of classification labels. Additional data may be specified in addition to the random latent code, or the additional data may completely replace the random latent code.
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may optionally be incorporated with or without the exclusion of other features described.
Fig. 1B illustrates convergence for various amounts of GAN training data using prior art techniques. Different curves on the graph correspond to different training data set sizes used to train the GAN. The vertical axis indicates the Fréchet inception distance (FID) and the horizontal axis corresponds to the number of training samples (i.e., real images) shown to the discriminator. The training data set is repeatedly input to the discriminator (in randomized order) until a fixed number of training samples have been processed. FID measures the difference between the generated images and the real images, with lower values indicating greater similarity. For a training data set size of 140k, the FID improves steadily as the training progresses. In contrast, for training data set sizes of 30k, 10k, and 2k, the FID begins to improve and then degrades as training continues. Specifically, training data sets of sizes 2k, 10k, and 30k begin to degrade at points 122, 124, and 126, respectively. The less data in the training dataset, the earlier the FID begins to rise. In contrast, as further described herein, when a GAN is trained using the discriminator enhancement techniques, a stable improvement is achieved throughout the training process, even for reduced training data set sizes.
Fig. 1C shows a GAN training process using the prior art. The graph for 50k image training 130 shows the FID improving up to point 128, where the progress reverses. The discriminator outputs 135 show the values D(x) produced in response to inputs x during training. At 2M training samples, the values output by the discriminator for the real and generated images form overlapping distributions 132, where negative values indicate generated images and positive values indicate real images. The overlapping distributions 132 at 2M training samples are shown in the graph, where the overlap region is centered at D(x) = 0. As training continues, at 13M training samples the distributions 134 have drifted apart, coinciding with the FID starting to rise. For training samples beyond 13M, the separation continues to increase.
The real and generated distributions initially overlap but, as the discriminator becomes more and more confident, they drift further and further apart, and the point at which the FID begins to degrade coincides with the loss of sufficient overlap between the distributions. This is a strong indication of overfitting, further evidenced by degrading accuracy (FID) measured for a separate validation set. The discriminator enhancement techniques employ versatile enhancements that prevent the discriminator from becoming overly confident.
By definition, any enhancement applied to the training data set will be inherited by the generated images as a result of training the generator. Recently, balanced consistency regularization (bCR) has been proposed as a solution that purportedly does not leak enhancements into the generated images. Under bCR, two sets of enhancements applied to the same input image should produce the same discriminator output. A consistency regularization term is added to the discriminator loss, and discriminator consistency is enforced for both real and generated images, while no enhancements or consistency loss terms are applied when training the generator. As such, bCR effectively strives to generalize the discriminator by blinding it to the enhancements used in the CR term. However, meeting this goal opens the door to leaking enhancements, as the generator is free to produce images containing enhancements without any penalty. In practice, bCR does suffer from leakage and therefore does not provide a high-quality solution for training a GAN with limited data.
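For comparison, the following sketch outlines a bCR-style consistency term as described above, assuming a PyTorch discriminator D and an augment() callable; the squared-difference form and the names are illustrative assumptions. Because only the discriminator sees the enhancements, the generator incurs no penalty for producing enhanced-looking images, which is the leakage path noted above.

```python
import torch

def bcr_consistency_loss(D, augment, real_images, fake_images):
    """Balanced consistency regularization term (baseline sketch):
    penalize the discriminator for changing its output when the same image
    is presented with and without enhancement, for real and generated images."""
    loss_real = (D(augment(real_images)) - D(real_images)).square().mean()
    loss_fake = (D(augment(fake_images)) - D(fake_images)).square().mean()
    return loss_real + loss_fake  # added to the discriminator loss only
```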
Instead of using a separate CR loss term, the random discriminator enhancement technique evaluates the discriminator using only enhanced images. Similarly, when training the generator, the feedback the generator receives from the discriminator is based on evaluating the discriminator using enhanced versions of the generated images. The possibility of using only enhanced images has received little attention, probably because at first glance it is not obvious that it would even work: if the discriminator never sees what the training images really look like, it is unclear whether it can properly guide the generator. Hereinafter, conditions under which the random discriminator enhancement technique will not leak enhancements into the generated images are determined, as further described in conjunction with fig. 2D, 3B, and 3C, and a complete pipeline may be implemented based on these conditions.
Fig. 2A illustrates a block diagram of an example enhanced training configuration suitable for implementing some embodiments of the present disclosure. The output data generated by the generator 100 is processed by the enhancement unit 200 before being input to the discriminator 110. In one embodiment, one or more enhancement units 200 are executed within the GAN120 in conjunction with the discriminator 110. The training loss unit 115 updates the parameters of the generator 100 based on the values output by the trained discriminator 110. In an embodiment, the enhancement operators implemented by the enhancement unit 200 are one or more reversible transformations, as further described herein. In an embodiment, the updates are computed via backpropagation, and the enhancement operator is differentiable.
The discriminator enhancement performed by the enhancement unit 200 may be understood as placing distorting (possibly even destructive) goggles on the discriminator 110 and requiring the generator 100 to produce samples that are indistinguishable from the example output data (e.g., the training data set) when viewed through the goggles. Conceptually, the enhancement unit 200 corresponds to the goggles. In one embodiment, the enhancement is random.
Fig. 2B illustrates another block diagram of an example adaptive augmented training configuration suitable for implementing some embodiments of the present disclosure. The discriminator 110 is trained using the example output data processed by the enhancement unit 200 and using the generated output data produced by the generator 100 that is also processed by the enhancement unit 200. The discriminator 110 processes the enhanced example output data and the enhanced generated output data to produce a value. The training loss unit 115 provides a discriminator parameter update based on the value and whether the input to the discriminator 110 corresponds to the enhanced example output data or the enhanced generated output data. In an embodiment, the update is calculated to reduce the difference between the value and the ground truth output.
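The configurations of figs. 2A and 2B can be sketched as follows, again in PyTorch-style code with illustrative names: every image reaching the discriminator, whether real or generated, first passes through the enhancement unit (here augment()), including when the discriminator provides the feedback used for generator updates.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(G, D, augment, real_images, latent_dim):
    """Fig. 2B: the discriminator only ever sees enhanced images (sketch)."""
    z = torch.randn(real_images.shape[0], latent_dim, device=real_images.device)
    fake_images = G(z).detach()
    d_real = D(augment(real_images))   # enhanced example output data
    d_fake = D(augment(fake_images))   # enhanced generated output data
    return F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

def generator_loss(G, D, augment, batch, latent_dim, device):
    """Fig. 2A: generator feedback is based on enhanced generated images (sketch)."""
    z = torch.randn(batch, latent_dim, device=device)
    return F.softplus(-D(augment(G(z)))).mean()
```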
In an embodiment, the training goal achieved by the training loss unit 115 is to have the distribution x of the generated output data match the distribution y of the example output data as closely as possible; in other words, the goal of training the generator 100 is to generate output data with a distribution x = y. Furthermore, for image generation, when the enhancement unit 200 applies an enhancement operator T to all inputs of the discriminator 110 (both real images and generated images), the training target drives the distribution of the enhanced generated images T(x) to match the distribution of the enhanced real images T(y), i.e., T(x) = T(y). When the enhancement operator is invertible, there is only one distribution x which, when enhanced, produces T(y); in other words, there is a unique distribution x satisfying T(x) = T(y). As long as the corruption process is represented by an invertible transformation of probability distributions over the data space, the generator 100 is trained to implicitly undo the corruption introduced by the enhancement unit 200 and find the correct distribution. In other words, T(x) = T(y) necessarily implies x = y if the enhancement operator is invertible, but not if the operator is irreversible. In the context of the following description, an enhancement operator representing such a reversible transformation is referred to as leak-free, since the enhancement will not leak into the generated data. When the transformation represented by the enhancement operator is invertible, the distribution of the output data generated by the generator, considered as a whole, converges to become more similar to that of the real images, x = y. The enhancement operator operates on the data distribution and specifies the enhancements applied to individual samples (e.g., individual images) of the input data.
The power of the reversible transformations performed by the enhancement unit 200 is that conclusions about the equality or inequality of the underlying data sets can be drawn by observing only the enhanced data sets, without the generator 100 and/or the discriminator 110 ever seeing the data sets without enhancement. It is important to understand that this does not mean that the enhancements performed on individual images need to be undone. For example, an enhancement as extreme as setting the input image to zero 90% of the time is invertible in the probability distribution sense: even for humans, it is easy to reason about the original distribution by ignoring the black images until only 10% of the images remain. On the other hand, random rotations chosen uniformly from {0°, 90°, 180°, 270°} are irreversible: it is impossible to discern the original orientation after the enhancement.
Fig. 2C illustrates a transformation using an irreversible enhancement operator, according to an embodiment. A random rotation chosen with uniform probability from {0°, 90°, 180°, 270°} is an enhancement operator corresponding to an irreversible transformation that produces the set of enhanced data. The example output data 212 includes upright faces that are enhanced to produce the enhanced example output data 216. The same set of enhancements is applied to the generated data 210, which erroneously includes only upside-down faces, to produce the enhanced generated data 214. The generator should be trained to produce generated data 210 consistent with the example output data 212. However, because the distribution of the enhanced generated data 214 matches the distribution of the enhanced example output data 216, the discriminator will incorrectly conclude that the generated data 210 is consistent with the example output data 212. In particular, a uniform probability of 25% for each of the four rotations produces the same uniform distribution for both the enhanced generated data 214 and the enhanced example output data 216, regardless of the original orientations and whether they match between the set of generated data 210 and the set of example output data 212.
Fig. 2D illustrates a transformation using a reversible enhancement operator, according to an embodiment. In contrast to the irreversible rotation transformation shown in fig. 2C, the rotation transformation can be made reversible by changing the probability from uniform to non-uniform. The enhancement unit 200 processes the generated data 211 and the example output data 213 using the reversible rotation transformation to produce enhanced generated data 220 and enhanced example output data 222, respectively. The probability of performing the enhancement is set to 80%, so each of the four rotations 0°, 90°, 180°, and 270° occurs with a probability of 20%. Because no enhancement occurs with a probability of 20%, the resulting probability of a 0° orientation is 40%.
Because the enhancement is sometimes skipped, the relative occurrence of the 0° orientation increases, and the enhanced distributions can now match only if the generated data 211 has the correct orientation consistent with the example output data 213. Similarly, many other random enhancements may be designed to be leak-free when there is a non-zero probability of skipping the enhancement. Such enhancements include, for example, geometric warping, color transformations, deterministic mappings (e.g., basis transformations), additive noise, transformation groups (e.g., image or color space rotations, flips, and scaling), and projections (e.g., cropping). Furthermore, composing leak-free enhancements in a fixed order produces an overall leak-free enhancement. In an embodiment, a probability value p, where 0 ≤ p ≤ 1, is provided as an input to the enhancement unit 200 to control the application of the enhancements. When p < 1, the transformation corresponding to the enhancement operator is reversible.
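As an illustration of the non-uniform rotation described above, the sketch below applies a random 90° rotation to each image only with probability p; the function name and per-image loop are illustrative assumptions, not the claimed pipeline.

```python
import torch

def random_rot90(images, p):
    """Rotate each image (C, H, W) by a random multiple of 90 degrees with probability p.

    With p = 1 the four orientations become equally likely regardless of the input,
    so the distribution-level transformation is not invertible; with p < 1 the
    original orientation stays over-represented, keeping it invertible (leak-free)."""
    out = []
    for img in images:
        if torch.rand(()).item() < p:
            k = int(torch.randint(0, 4, (1,)))      # 0, 90, 180, or 270 degrees
            img = torch.rot90(img, k, dims=(1, 2))
        out.append(img)
    return torch.stack(out)
```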
Fig. 2E illustrates a flow diagram of a method 250 for training a neural network with limited data using adaptive and/or random discriminator enhancement, according to an embodiment. Each block of method 250 described herein includes a computational process that may be performed using any combination of hardware, firmware, and/or software. For example, the different functions may be performed by a processor executing instructions stored in a memory. The method 250 may also be embodied as computer useable instructions stored on a computer storage medium. The method 250 may be provided by a standalone application, service, or hosted service (either standalone or in combination with another hosted service), or a plug-in to another product, to name a few. Further, by way of example, the method 250 is described with respect to the GAN120 of fig. 1A and/or the training configuration shown in fig. 2A and/or fig. 2B. However, this method may additionally or alternatively be performed by any system or any combination of systems, including but not limited to the systems described herein. For example, the method 250 may be used to train a neural network for tasks such as classification. Moreover, one of ordinary skill in the art will appreciate that any system that performs the method 250 is within the scope and spirit of embodiments of the present disclosure.
At step 255, a neural network (e.g., a discriminator) receives training data including input data and ground truth output, wherein the input data is associated with a first distribution. In one embodiment, the neural network is a discriminator 110. In an embodiment, the input data comprises images and each image in the training data is enhanced. In an embodiment, at least one enhancement is randomly disabled based on a value (p) defining the enhancement strength. In an embodiment, at least one enhancement is differentiable. In an embodiment, the at least one enhancement is implemented as a sequence of different enhancements.
At step 260, at least one enhancement is applied to the input data to produce enhanced input data associated with the second distribution, wherein an enhancement operator corresponding to the transformation from the first distribution to the second distribution is invertible and specifies the at least one enhancement. In an embodiment, the input data is output data produced by the generator 100. In an embodiment, the first distribution is simply a distribution for which applying at least one enhancement to the input data associated with the first distribution results in an enhanced distribution that matches the second distribution.
At step 265, the augmented input data is processed by the neural network according to the parameters to produce output data. At step 270, the parameters are adjusted to reduce the difference between the output data and the ground truth output. In an embodiment, the input data comprises a first subset of the generated data and a second subset of the real data, and the first subset of the generated data is generated by the generator neural network model based on the second parameter. In an embodiment, the second parameter is adjusted such that the distribution of the first subset and the second subset more closely matches.
In an embodiment, the input data comprises a first subset of generated data and a second subset of real data, the output data comprises a value indicative of a first state or a second state, and the parameters are adjusted such that a first portion of the output data produced for the first subset more closely matches the first state and a second portion of the output data produced for the second subset more closely matches the second state. In other words, the distributions of the real and generated input images are matched by way of the corresponding distributions of the discriminator classification outputs produced for the enhanced real and generated input images.
Fig. 3A shows an enhancement unit 200 according to an embodiment. The enhancement pipeline within the enhancement unit 200 includes a series of transform units 300, each of which is configured to apply a transform to produce enhanced data. In embodiments, the enhancement unit 200 may comprise fewer or more transform units 300. In an embodiment, the transform units 300 are stochastic, i.e., they may produce different outputs when executed multiple times on the same input. The probability value p input to each transform unit 300 controls the enhancement strength and ensures that the transformation is invertible. In one embodiment, the enhancement unit 200 comprises at least one transform unit 300. In one embodiment, each transform unit 300 may be configured to perform any one of N transforms. In another embodiment, each transform unit 300 is configured to perform a particular one of the N transforms. In one embodiment, the N transforms may be grouped into six categories: pixel blitting (x-flips, 90° rotations, integer translations), more general geometric transformations, color transformations, image-space filtering, additive noise, and cutout (cropping). In one embodiment, during training, each image shown to the discriminator 110 is first processed by the enhancement unit 200 using a predefined set of transforms applied in a fixed order. Because the enhancements are also used when training the generator 100, the enhancements should be differentiable so that generator parameter updates can be computed via backpropagation.
The strength of the enhancements is controlled by a scalar p ∈ [0, 1], such that each transform unit 300 applies its transform with probability p or skips it with probability 1 − p. In an embodiment, the transform unit 300 always processes the data input to it, and p is used to selectively output either the enhanced data or the original input data according to the probability p. In an embodiment, the same p value is input to each transform unit 300. In one embodiment, the randomization according to p is performed separately for each enhancement and each image in a mini-batch. Given the presence of multiple transform units 300 in the enhancement pipeline, the discriminator is unlikely to see a completely clean image even if the value of p is fairly small. Nevertheless, the generator 100 is guided to produce only clean images as long as p remains below a practical safety limit.
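A minimal sketch of such a pipeline is given below, assuming a list of differentiable, batched transform callables (illustrative names); each transform is applied independently per image with probability p, mirroring the per-image, per-enhancement randomization described above.

```python
import torch

class AugmentPipeline(torch.nn.Module):
    """Chain of stochastic transforms, each applied with probability p (sketch)."""
    def __init__(self, transforms):
        super().__init__()
        self.transforms = transforms   # list of differentiable callables: images -> images
        self.p = 0.0                   # shared strength, updated externally (e.g., by ADA)

    def forward(self, images):
        for t in self.transforms:
            # Independent Bernoulli(p) decision for every image and every transform.
            apply = torch.rand(images.shape[0], 1, 1, 1, device=images.device) < self.p
            images = torch.where(apply, t(images), images)
        return images
```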
Not unexpectedly, using p = 1 may result in enhancements leaking into the images output by the generator 100. More specifically, experiments have demonstrated that the rotations by random multiples of 90° described above should be skipped during at least a portion of the time for the generator 100 to train optimally. When p is too high, the generator 100 cannot know which way the generated images should face and ends up picking one of the possibilities at random. As can be expected, this problem is not exclusive to the limiting case of p = 1. In practice, the training setup is also ill-conditioned for nearby values due to finite sampling, finite representational power of the networks, inductive bias, and training dynamics. Based on experiments with several different kinds of enhancements, the generated images are always oriented correctly, i.e., there is no leakage, when p remains below approximately 0.85.
The effectiveness of random discriminator enhancement is evaluated by performing an exhaustive sweep of p for different enhancement categories and data set sizes. The optimal strength of the enhancements depends largely on the amount of training data. Thus, relying on any fixed p may not be the best choice for data sets of different sizes. Also, the training improvement may vary for different enhancement categories.
Fig. 3B shows a graph of FID for different p values and various amounts of GAN training data, according to an embodiment. For a 2k training set, strong enhancement can be used to obtain the best results. Specifically, a value of p = 0.4 corresponds to the lowest FID at point 315. The curves also indicate that some of the enhancements become leaky as p approaches 1. In the case of the 10k and 50k training sets, higher values of p are less useful, and a value of p = 0.2 corresponds to the lowest FID at points 305 and 310, respectively. Using a 140k training set (not shown), the situation is markedly different: all enhancements are detrimental, so p = 0 provides the best results. In fact, this sensitivity to dataset size requires an expensive grid search to determine the optimal p value for each dataset size.
Preferably, manual adjustment of the enhancement strength is avoided, and the enhancement strength is instead controlled dynamically during training based on the degree of overfitting. One way to quantify overfitting is to use a separate validation set and observe its behavior relative to the training set. For a training data set of limited size, overfitting begins when the validation set starts to behave more and more like the generated images. This is a quantifiable effect, albeit with the disadvantage of requiring a separate validation set when training data may already be in short supply. As shown in fig. 1C, the output values of the discriminator 110 for the real and generated images diverge symmetrically around zero as overfitting worsens. This divergence can be quantified without a separate validation set.
The discriminator output values for the training set, validation set, and generated images are denoted D_train, D_validation, and D_generated, respectively, and E[·] denotes their mean over N consecutive mini-batches. In one embodiment, N = 4 corresponds to 4 x 64 = 256 images. Two reasonable overfitting heuristics can be defined:

r_v = (E[D_train] - E[D_validation]) / (E[D_train] - E[D_generated])        r_t = E[sign(D_train)]

For both heuristics, r = 0 means no overfitting and r = 1 indicates complete overfitting, and the goal is to adjust the enhancement probability p so that the selected heuristic matches an appropriate target value. The first heuristic, r_v, expresses the output of the discriminator for the validation set relative to the training set and the generated images. Because the first heuristic assumes the existence of a separate validation set, it is used primarily as a comparison to evaluate the second heuristic. The second heuristic, r_t, estimates the portion of the training set for which the discriminator 110 outputs positive values. Both r_v and r_t are effective in preventing overfitting, and both improve the results beyond the best fixed p found using a grid search (see fig. 3B). In one embodiment, the second heuristic r_t is used because it is less sensitive to the choice of hyper-parameters than the obvious alternative E[D_train].
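Computing the heuristics from raw discriminator outputs is straightforward; the sketch below is illustrative and assumes tensors of scalar discriminator outputs collected over the last N mini-batches.

```python
import torch

def overfitting_heuristics(d_train, d_validation, d_generated):
    """r_v and r_t as defined above; each argument holds D outputs over N mini-batches."""
    e_train, e_val, e_gen = d_train.mean(), d_validation.mean(), d_generated.mean()
    r_v = (e_train - e_val) / (e_train - e_gen)   # requires a separate validation set
    r_t = torch.sign(d_train).mean()              # uses only the training-set outputs
    return r_v.item(), r_t.item()
```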
The enhancement strength p may be controlled as follows. In an embodiment, p is initialized to zero and adjusted every N mini-batches based on the selected overfitting heuristic. If the heuristic indicates too much or too little overfitting, p is adjusted by incrementing or decrementing it by a fixed adjustment size (i.e., a predetermined amount). In one embodiment, the determination is based on comparing the heuristic value with a predefined target value: if the heuristic value is higher than the target value, there is too much overfitting, and vice versa. In an embodiment, the adjustment size is set such that p can rise from 0 to 1 quickly enough, e.g., over 500k images. The value of p is clamped at 0 from below, meaning that it cannot become negative after decrementing. Dynamically controlling the enhancement strength in this way may be referred to as adaptive discriminator enhancement (ADA). The ADA mechanism does not require changes to the loss function or network architecture and is applicable both when training from scratch and when fine-tuning an existing GAN on another data set.
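An illustrative controller for p, following the embodiment just described (initialize at zero, adjust every N mini-batches by a fixed step sized so that p could sweep from 0 to 1 over roughly 500k images, clamp to [0, 1]); the helper names and defaults here are assumptions.

```python
import torch

def ada_update(p, d_train, target=0.6, images_per_interval=4 * 64, ramp_images=500_000):
    """Adjust the enhancement strength p based on the r_t heuristic (sketch)."""
    r_t = torch.sign(d_train).mean().item()       # overfitting heuristic over N mini-batches
    step = images_per_interval / ramp_images      # fixed adjustment size
    p = p + step if r_t > target else p - step    # too much vs. too little overfitting
    return min(max(p, 0.0), 1.0)                  # clamp so p stays in [0, 1]

# Usage sketch inside the training loop:
# pipeline.p = ada_update(pipeline.p, d_train_outputs)
```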
Fig. 3C shows a graph of FID for different target values (x-axis) of the r_t heuristic and various amounts of GAN training data when adaptive discriminator enhancement is used, according to an embodiment. In an embodiment, the selected target value is 0.6 and the r_t heuristic is used to adjust the enhancement strength. The dashed lines show the optimal fixed value of p for each training data set from fig. 3B. When ADA is used for the 2k training set, the best result is obtained at point 320 with a target value of 0.5. In the case of the 10k and 50k training sets, the target values 0.5 and 0.6 correspond to the lowest FID at points 322 and 324, respectively. Using the 140k training set, a target value of 0.6 provides the best results.
Fig. 3D illustrates a flow diagram of a method 330 for training a neural network with limited data using adaptive discriminator enhancement, according to an embodiment. Each block of method 330 described herein includes a computational process that may be performed using any combination of hardware, firmware, and/or software. For example, the different functions may be performed by a processor executing instructions stored in a memory. The method 330 may also be embodied as computer-useable instructions stored on a computer storage medium. The method 330 may be provided by a standalone application, a service, or a hosted service (either alone or in combination with another hosted service), or a plug-in to another product, to name a few. Further, by way of example, the method 330 is described with respect to the GAN120 of fig. 1A and/or the training configuration shown in fig. 2A and/or fig. 2B. However, this method may additionally or alternatively be performed by any system or any combination of systems, including but not limited to the systems described herein. For example, the method 330 may be used to train a neural network for tasks such as classification. Moreover, one of ordinary skill in the art will appreciate that any system that performs the method 330 is within the scope and spirit of embodiments of the present disclosure.
At step 335, the enhancement strength is initialized by setting p to a value (e.g., zero). At step 340, the neural network model is trained for N mini-batches, where N is a positive integer. At step 345, if the neural network model being trained is overfitting, the method 330 proceeds to step 350. Otherwise, the method 330 proceeds to step 352 and decreases the enhancement strength before proceeding to step 355. At step 350, the enhancement unit 200 increases the enhancement strength before proceeding to step 355. In an embodiment, an adjustment size for increasing or decreasing the enhancement strength is applied to p. In an embodiment, the enhancement unit 200 increments or decrements p based on the output data generated by the neural network model for the N mini-batches (e.g., D_train). In an embodiment, a statistic of the output data produced by the neural network model for the N mini-batches (e.g., r_t) is compared with a reference (e.g., the target value 0.6), and p is adjusted based on the result of the comparison.
In step 355, the training loss unit 115 determines whether training should continue and, if so, returns to step 340. Otherwise, the training is complete. In an embodiment, training is complete when the loss calculated by the training loss unit 115 is below a threshold. In one embodiment, the training is complete when the entire training data set has been applied to the neural network model M times, where M is a predefined constant. In one embodiment, training is complete when a predetermined number of lots have been completed. In one embodiment, the neural network model includes at least one of the generator 100 and the discriminator 110.
Fig. 3E illustrates an improved GAN training process using adaptive discriminator enhancement, according to an embodiment. Similar to the graph for 50k image training 130 shown in fig. 1C, the graph 365 for 20k image training without any enhancement shows the FID improving until point 362, where the progress begins to reverse. Similar to the discriminator outputs 135 shown in fig. 1C, the discriminator outputs 370 show the values D(x) produced during training. Before point 362, the values output by the discriminator form overlapping distributions for the real data and the generated data. As training continues past point 362, the discriminator output values separate.
In contrast, when ADA is used for 20k image training 375, the FID improves steadily throughout the training and reaches a lower value than that reached at any point by non-enhanced training 365. Similarly, the discriminator output 380 has an overlapping distribution for the true data and the generated data throughout the training. Thus, overfitting that occurs during conventional training can be avoided when random discriminator enhancement or ADA is used during training, especially when a reduced size training data set is used.
Data-driven generative modeling means learning a computational recipe that generates complex data purely based on examples. In addition to playing a fundamental role in machine learning, generative models have several uses within applied machine learning research, for example as priors and regularizers. Generative models advance the ability of computer vision and graphics algorithms to analyze and synthesize realistic images. Random discriminator enhancement (SDA) and ADA reliably stabilize training and greatly improve the results of training generative models when the training data is limited. SDA and ADA can be used to train high-quality generative models using significantly less data than is required by existing methods. In particular, when ADA is used, the generator output improves steadily throughout training, regardless of the training set size, and no overfitting occurs. Without enhancement, the gradients the generator receives from the discriminator become very simplistic over time: the discriminator starts paying attention to only a handful of features, and the generator is free to create otherwise meaningless images.
SDA and ADA make it easier to train high-quality generative models with custom data sets, significantly lowering the barriers to applying GAN-type models in many applied research fields. For example, modeling the space of possible appearances of biological samples (tissues, tumors, etc.) is a growing field of research that has long suffered from limited availability of high-quality data. In general, generative models hold the promise of increasing understanding of the complex and hard-to-pinpoint relationships in many real-world phenomena.
Parallel processing architecture
FIG. 4 illustrates a Parallel Processing Unit (PPU) 400 according to one embodiment. The PPU 400 may be used to implement the GAN 120. The PPU 400 may be used to implement one or more of the generator 100, the discriminator 110, the enhancement unit 200, and the training loss unit 115 shown in fig. 1A, 2A, 2B, and 3A. The PPU 400 may be configured to perform the methods 250 and/or 330.
In an embodiment, PPU 400 is a multithreaded processor implemented on one or more integrated circuit devices. PPU 400 is a latency hiding architecture designed for processing many threads in parallel. A thread (e.g., a thread of execution) is an instance of a set of instructions configured to be executed by PPU 400. In one embodiment, PPU 400 is a Graphics Processing Unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device. In other embodiments, PPU 400 may be used to perform general-purpose computations. Although one example parallel processor is provided herein for purposes of illustration, it is specifically noted that this processor is set forth for purposes of illustration only, and any processor may be used in addition to and/or in place of this processor.
One or more PPUs 400 may be configured to accelerate thousands of High Performance Computing (HPC), data centers, cloud computing, and machine learning applications. PPU 400 may be configured to accelerate a wide variety of deep learning systems and applications for autonomous vehicles, simulations, computational graphics (such as ray or path tracking), deep learning, high-precision speech, image and text recognition systems, intelligent video analysis, molecular simulation, drug development, disease diagnosis, weather forecasting, big data analysis, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, and personalized user recommendations, among others.
As shown in FIG. 4, the PPU 400 includes an input/output (I/O) unit 405, a front end unit 415, a scheduler unit 420, a work allocation unit 425, a hub 430, a crossbar (XBar) 470, one or more general purpose processing clusters (GPCs) 450, and one or more memory partition units 480. The PPU 400 may be connected to host processors or other PPUs 400 via one or more high-speed NVLink 410 interconnects. The PPU 400 may be connected to a host processor or other peripheral devices via an interconnect 402. The PPU 400 may also be connected to a local memory 404, which includes a plurality of memory devices. In one embodiment, the local memory may include a plurality of Dynamic Random Access Memory (DRAM) devices. The DRAM devices may be configured as a High Bandwidth Memory (HBM) subsystem, with multiple DRAM dies stacked within each device.
The NVLink 410 interconnect enables the system to be scalable and includes one or more PPUs 400 in combination with one or more CPUs, supporting cache coherency between PPUs 400 and CPUs, and CPU hosting. Data and/or commands may be sent by NVLink 410 to and from other units of PPU 400, such as one or more replication engines, video encoders, video decoders, power management units, etc. (not explicitly shown) via hub 430. NVLink 410 is described in more detail in conjunction with FIG. 5B.
The I/O unit 405 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the interconnect 402. The I/O unit 405 may communicate with the host processor directly via the interconnect 402 or through one or more intermediate devices, such as a memory bridge. In one embodiment, I/O unit 405 may communicate with one or more other processors (e.g., one or more PPUs 400) via interconnect 402. In one embodiment, I/O unit 405 implements a peripheral component interconnect express (PCIe) interface for communicating over a PCIe bus, and interconnect 402 is a PCIe bus. In alternative embodiments, the I/O unit 405 may implement other types of known interfaces for communicating with external devices.
I/O unit 405 decodes packets received via interconnect 402. In one embodiment, the data packets represent commands configured to cause PPU 400 to perform different operations. The I/O unit 405 transmits the decoded command to various other units of the PPU 400 as specified by the command. For example, some commands may be sent to the front end unit 415. Other commands may be sent to hub 430 or other units of PPU 400, such as one or more replication engines, video encoders, video decoders, power management units, and the like (not explicitly shown). In other words, I/O unit 405 is configured to route communications between and among the various logical units of PPU 400.
In one embodiment, a program executed by a host processor encodes a command stream in a buffer that provides workloads to the PPU 400 for processing. The workload may include a number of instructions and data to be processed by those instructions. A buffer is an area of memory that is accessible (e.g., read/write) by both the host processor and the PPU 400. For example, I/O unit 405 may be configured to access buffers in system memory connected to interconnect 402 via memory requests transmitted over interconnect 402. In one embodiment, the host processor writes the command stream to a buffer and then sends a pointer to the beginning of the command stream to PPU 400. The front end unit 415 receives pointers to one or more command streams. Front end unit 415 manages one or more streams, reads commands from the streams and forwards the commands to the various units of PPU 400.
The front end unit 415 is coupled to a scheduler unit 420, which configures the various GPCs 450 to process tasks defined by one or more streams. The scheduler unit 420 is configured to track status information related to various tasks managed by the scheduler unit 420. The status may indicate which GPC450 the task is assigned to, whether the task is active or inactive, a priority associated with the task, and so on. The scheduler unit 420 manages the execution of multiple tasks on one or more GPCs 450.
The scheduler unit 420 is coupled to a work allocation unit 425 configured to dispatch tasks for execution on the GPCs 450. The work allocation unit 425 may track several scheduled tasks received from the scheduler unit 420. In one embodiment, the work allocation unit 425 manages a pending task pool and an active task pool for each GPC 450. When a GPC450 completes execution of a task, the task is evicted from the active task pool of the GPC450, and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 450. If an active task on a GPC450 has been idle, for example while waiting for a data dependency to be resolved, the active task may be evicted from the GPC450 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 450.
In one embodiment, the host processor executes a driver kernel that implements an Application Programming Interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 400. In one embodiment, multiple computing applications are executed simultaneously by PPU 400, and PPU 400 provides isolation, quality of service (QoS), and independent address spaces for the multiple computing applications. An application may generate instructions (e.g., API calls) that cause the driver kernel to generate one or more tasks to be executed by PPU 400. The driver kernel exports the tasks to one or more streams being processed by PPU 400. Each task may include one or more groups of related threads, referred to herein as thread bundles (warps). In one embodiment, a thread bundle includes 32 related threads that may be executed in parallel. Cooperating threads may refer to multiple threads that include instructions to perform the task and that may exchange data through shared memory. Tasks may be distributed to one or more processing units within a GPC 450, and instructions are scheduled for execution by at least one thread bundle.
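As a hedged illustration of the grouping just described, the short Python sketch below computes how many 32-thread thread bundles (warps) a task's cooperative threads occupy; the function name and thread counts are illustrative assumptions, not part of the PPU programming interface.

```python
# Illustrative only: how a task's cooperative threads map onto 32-thread
# thread bundles (warps), assuming the warp size stated above.
WARP_SIZE = 32

def warps_for_task(num_threads: int) -> int:
    """Number of warps needed to cover num_threads cooperative threads."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE

# A task launched with 256 cooperative threads occupies 8 warps, each of
# which can be scheduled for parallel execution on a GPC.
assert warps_for_task(256) == 8
assert warps_for_task(100) == 4   # the final warp is only partially filled
```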
The work distribution unit 425 communicates with one or more GPCs 450 via XBar 470. XBar 470 is an interconnection network that couples many of the elements of PPU 400 to other elements of PPU 400. For example, XBar 470 may be configured to couple work allocation unit 425 to a particular GPC 450. Although not explicitly shown, one or more other units of PPU 400 may also be connected to XBar 470 via hub 430.
Tasks are managed by the scheduler unit 420 and dispatched to the GPCs 450 by the work distribution unit 425. The GPCs 450 are configured to process the tasks and generate results. The results may be consumed by other tasks within a GPC 450, routed to a different GPC 450 via the XBar 470, or stored in memory 404. The results may be written to memory 404 via the memory partition units 480, which implement a memory interface for writing data to and reading data from memory 404. The results may be transmitted to another PPU 400 or a CPU via NVLink 410. In one embodiment, PPU 400 includes a number U of memory partition units 480 equal to the number of separate and distinct memory devices of memory 404 coupled to PPU 400. Each GPC 450 may include a memory management unit to provide translation of virtual addresses to physical addresses, memory protection, and arbitration of memory requests. In one embodiment, the memory management unit provides one or more Translation Lookaside Buffers (TLBs) for performing the translation of virtual addresses to physical addresses in memory 404.
In one embodiment, the memory interface implements an HBM2 memory interface, and Y equals half of U. In one embodiment, the HBM2 memory stacks are located on the same physical package as the PPU 400, providing significant power and area savings compared to conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y equals 4, with two 128-bit channels per die for a total of 8 channels and a data bus width of 1024 bits.
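For clarity, the following Python snippet simply works through the bus-width arithmetic stated above (four dies per stack, two 128-bit channels per die); the variable names are illustrative only.

```python
# Worked example of the HBM2 data bus width described above (arithmetic only).
dies_per_stack = 4           # Y = 4 memory dies per HBM2 stack
channels_per_die = 2         # two 128-bit channels per die
channel_width_bits = 128

channels_per_stack = dies_per_stack * channels_per_die      # 8 channels
bus_width_bits = channels_per_stack * channel_width_bits    # 1024-bit data bus
assert channels_per_stack == 8 and bus_width_bits == 1024
```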
In one embodiment, memory 404 supports Single Error Correction Double Error Detection (SECDED) Error Correction Codes (ECC) to protect data. ECC provides higher reliability for computing applications that are sensitive to data corruption. Reliability is especially important in large-scale cluster computing environments, where PPUs 400 process very large data sets and/or run applications for extended periods.
In one embodiment, PPU 400 implements a multi-level memory hierarchy. In one embodiment, the memory partition unit 480 supports unified memory to provide a single unified virtual address space for CPU and PPU 400 memory, enabling data sharing between virtual memory systems. In one embodiment, the frequency of accesses by a PPU 400 to memory located on other processors is tracked to ensure that memory pages are moved to the physical memory of the PPU 400 that accesses the pages more frequently. In one embodiment, NVLink 410 supports address translation services that allow the PPU 400 to directly access the CPU's page tables and provide the PPU 400 with full access to CPU memory.
In one embodiment, a copy engine transfers data between multiple PPUs 400 or between a PPU 400 and a CPU. The copy engine may generate a page fault for an address that is not mapped into the page tables. The memory partition unit 480 may then service the page fault, mapping the address into the page table, after which the copy engine may perform the transfer. In conventional systems, memory is pinned (e.g., non-pageable) for multiple copy engine operations between multiple processors, which significantly reduces the available memory. With hardware page faulting, addresses can be passed to the copy engines without worrying whether the memory pages are resident, and the copy process is transparent.
Data from memory 404 or other system memory may be fetched by the memory partition unit 480 and stored in the L2 cache 460, which is located on-chip and shared among the GPCs 450. As shown, each memory partition unit 480 includes a portion of the L2 cache associated with a corresponding memory 404. Lower level caches may then be implemented in various units within the GPCs 450. For example, each processing unit within a GPC 450 may implement a level one (L1) cache. The L1 cache is a private memory dedicated to a particular processing unit. The L2 cache 460 is coupled to the memory interface and the XBar 470, and data from the L2 cache may be fetched and stored in each of the L1 caches for processing.
In one embodiment, the processing units within each GPC 450 implement a SIMD (single instruction, multiple data) architecture, in which each thread in a set of threads (e.g., a thread bundle) is configured to process a different set of data based on the same set of instructions. All threads in the set of threads execute the same instructions. In another embodiment, the processing unit implements a SIMT (single instruction, multiple thread) architecture, in which each thread in a set of threads is configured to process a different set of data based on the same set of instructions, but in which individual threads in the set of threads are allowed to diverge during execution. In an embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, enabling concurrency between thread bundles and serial execution within thread bundles when threads within a thread bundle diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, enabling equal concurrency among all threads, within and between thread bundles. When execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency.
Cooperative Groups is a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads communicate, enabling richer, more efficient parallel decompositions to be expressed. Cooperative launch APIs support synchronization among thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the __syncthreads() function). However, programmers often want to define thread groups at smaller than thread block granularities and synchronize within the defined groups to enable greater performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.
Cooperative Groups enables programmers to define groups of threads explicitly at sub-block (e.g., as small as a single thread) and multi-block granularities and to perform collective operations, such as synchronization, on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative Groups primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks.
The processing unit includes a large number (e.g., 128, etc.) of distinct processing cores (e.g., functional units) that may be fully pipelined, single-precision, double-precision, and/or mixed-precision, and that include a floating-point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating-point arithmetic logic units implement the IEEE 754-2008 standard for floating-point arithmetic. In one embodiment, the cores include 64 single-precision (32-bit) floating-point cores, 64 integer cores, 32 double-precision (64-bit) floating-point cores, and 8 tensor cores.
The tensor cores are configured to perform matrix operations. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as GEMM (matrix-matrix multiplication) for convolution operations during neural network training and inference. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply-and-accumulate operation D = A × B + C, where A, B, C, and D are 4×4 matrices.
In one embodiment, the matrix multiply inputs A and B may be integer, fixed-point, or floating-point matrices, while the accumulation matrices C and D may be integer, fixed-point, or floating-point matrices of equal or higher bit width. In one embodiment, the tensor cores operate on 1-, 4-, or 8-bit integer input data with 32-bit integer accumulation. The 8-bit integer matrix multiply requires 1024 operations and results in a full-precision product, which is then accumulated using 32-bit integer addition with the other intermediate products of an 8×8×16 matrix multiply. In one embodiment, the tensor cores operate on 16-bit floating-point input data with 32-bit floating-point accumulation. The 16-bit floating-point multiply requires 64 operations and results in a full-precision product, which is then accumulated using 32-bit floating-point addition with the other intermediate products of a 4×4×4 matrix multiply. In practice, the tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. APIs (such as the CUDA 9 C++ API) expose specialized matrix load, matrix multiply-and-accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16×16-size matrices spanning all 32 threads of the thread bundle.
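The following Python/NumPy sketch emulates, purely for illustration, the per-tensor-core operation D = A × B + C on 4×4 tiles with 16-bit floating-point inputs and 32-bit floating-point accumulation; the hardware performs this as a fused warp-level instruction, whereas the sketch only reproduces the arithmetic.

```python
import numpy as np

# Emulation of D = A x B + C on a 4x4 tile: fp16 inputs, fp32 accumulation.
A = np.random.rand(4, 4).astype(np.float16)   # multiplicand (fp16)
B = np.random.rand(4, 4).astype(np.float16)   # multiplier   (fp16)
C = np.random.rand(4, 4).astype(np.float32)   # accumulator  (fp32)

# Promote the fp16 operands to full precision before accumulation, mirroring
# the full-precision products described above.
D = A.astype(np.float32) @ B.astype(np.float32) + C
assert D.dtype == np.float32 and D.shape == (4, 4)
```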
Each processing unit may also include M Special Function Units (SFUs) that perform special functions (e.g., attribute evaluation, reciprocal square root, etc.). In one embodiment, the SFUs may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs may include texture units configured to perform texture map filtering operations. In one embodiment, the texture units are configured to load texture maps (e.g., a 2D array of texels) from memory 404 and sample the texture maps to produce sampled texture values for use in shader programs executed by the processing unit. In one embodiment, the texture maps are stored in shared memory, which may include an L1 cache. The texture units implement texture operations, such as filtering operations using mip-maps (e.g., texture maps of varying levels of detail). In one embodiment, each processing unit includes two texture units.
Each processing unit also includes N Load Store Units (LSUs) that implement load and store operations between the shared memory and the register file. Each processing unit includes an interconnect network that connects each of the cores to the register file and connects the LSUs to the register file and the shared memory. In one embodiment, the interconnect network is a crossbar that can be configured to connect any of the cores to any of the registers in the register file and to connect the LSUs to the register file and to memory locations in the shared memory.
The shared memory is an array of on-chip memory that allows for data storage and communication between the processing units and between threads within the processing units. In one embodiment, the shared memory comprises 128KB of storage capacity and is in the path from each processing unit to the memory partition unit 480. The shared memory can be used to cache reads and writes. One or more of the shared memory, the L1 cache, the L2 cache, and the memory 404 are backing stores.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory enables the shared memory to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
When configured for general-purpose parallel computation, a simpler configuration can be used compared with graphics processing. Specifically, fixed-function graphics processing units are bypassed, creating a much simpler programming model. In the general-purpose parallel computation configuration, the work distribution unit 425 assigns and distributes blocks of threads directly to the processing units within the GPCs 450. The threads execute the same program, using a unique thread ID in the calculation to ensure that each thread generates unique results, using one or more processing units to execute the program and perform calculations, using shared memory to communicate between threads, and using the LSUs to read and write global memory through the shared memory and the memory partition unit 480. When configured for general-purpose parallel computation, the processing units can also write commands that the scheduler unit 420 can use to launch new work on the processing units.
PPUs 400 may each include and/or be configured to perform functions of one or more processing cores and/or components thereof, such as Tensor Cores (TCs), Tensor Processing Units (TPUs), Pixel Vision Cores (PVCs), Ray Tracing (RT) cores, Vision Processing Units (VPUs), Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Tree Traversal Units (TTUs), Artificial Intelligence Accelerators (AIAs), Deep Learning Accelerators (DLAs), Arithmetic Logic Units (ALUs), Application Specific Integrated Circuits (ASICs), Floating Point Units (FPUs), input/output (I/O) elements, Peripheral Component Interconnect (PCI) or peripheral component interconnect express (PCIe) elements, and/or the like.
PPU 400 may be included in a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), Personal Digital Assistant (PDA), digital camera, vehicle, head mounted display, handheld electronic device, etc. In one embodiment, PPU 400 is included on a single semiconductor substrate. In another embodiment, PPU 400 is included on a system on a chip (SoC) along with one or more other devices, such as an additional PPU 400, memory 404, a Reduced Instruction Set Computer (RISC) CPU, a Memory Management Unit (MMU), a digital-to-analog converter (DAC), and so forth.
In one embodiment, PPU 400 may be included on a graphics card that includes one or more memory devices 404. The graphics card may be configured to interface with a PCIe slot on a motherboard of the desktop computer. In yet another embodiment, PPU 400 may be an Integrated Graphics Processing Unit (iGPU) or a parallel processor included in a chipset of a motherboard. In yet another embodiment, PPU 400 may be implemented in reconfigurable hardware. In yet another embodiment, portions of PPU 400 may be implemented in reconfigurable hardware.
Exemplary Computing System
Systems with multiple GPUs and CPUs are used in various industries as developers expose and exploit more parallelism in applications such as artificial intelligence computing. High performance GPU acceleration systems with tens to thousands of compute nodes are deployed in data centers, research institutions, and supercomputers to address larger problems. As the number of processing devices within high performance systems increases, communication and data transfer mechanisms need to be extended to support the increased bandwidth.
FIG. 5A is a conceptual diagram of a processing system 500 implemented using the PPU 400 of FIG. 4, according to one embodiment. Exemplary system 500 may be configured to implement method 250 shown in fig. 2E and/or method 330 shown in fig. 3D. Processing system 500 includes a CPU 530, a switch 510, and a plurality of PPUs 400 and corresponding memory 404.
NVLink 410 provides a high-speed communication link between each PPU 400. Although a particular number of NVLink 410 and interconnect 402 connections are shown in FIG. 5B, the number of connections to each PPU 400 and CPU 530 may vary. Switch 510 interfaces between interconnect 402 and CPU 530. PPU 400, memory 404, and NVLink 410 may be located on a single semiconductor platform to form parallel processing module 525. In one embodiment, the switch 510 supports two or more protocols that interface between various different connections and/or links.
In another embodiment (not shown), NVLink 410 provides one or more high-speed communication links between each PPU 400 and CPU 530, and switch 510 interfaces between interconnect 402 and each PPU 400. PPU 400, memory 404, and interconnect 402 may be located on a single semiconductor platform to form parallel processing module 525. In yet another embodiment (not shown), interconnect 402 provides one or more communication links between each PPU 400 and CPU 530, and switch 510 interfaces between each PPU 400 using NVLink 410 to provide one or more high-speed communication links between PPUs 400. In another embodiment (not shown), NVLink 410 provides one or more high speed communication links between PPU 400 and CPU 530 through switch 510. In yet another embodiment (not shown), interconnect 402 provides one or more communication links directly between each PPU 400. One or more NVLink 410 high speed communication links may be implemented as physical NVLink interconnects or on-chip or bare-die interconnects using the same protocol as NVLink 410.
In the context of this specification, a single semiconductor platform may refer to a sole, unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity that simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be placed separately or in various combinations of semiconductor platforms, depending on the needs of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPUs 400 and/or memories 404 may be packaged devices. In one embodiment, the CPU 530, the switch 510, and the parallel processing module 525 are located on a single semiconductor platform.
In one embodiment, the signaling rate of each NVLink 410 is 20 to 25 gigabits/second, and each PPU 400 includes six NVLink 410 interfaces (as shown in FIG. 5A, five NVLink 410 interfaces are included for each PPU 400). Each NVLink 410 provides a data transfer rate of 25 gigabytes/second in each direction, with six links providing a total of 300 gigabytes/second. When the CPU 530 also includes one or more NVLink 410 interfaces, the NVLinks 410 may be dedicated exclusively to PPU-to-PPU communication as shown in FIG. 5A, or to some combination of PPU-to-PPU and PPU-to-CPU communication.
In one embodiment, NVLink 410 allows direct load/store/atomic access from the CPU 530 to each PPU 400's memory 404. In one embodiment, NVLink 410 supports coherency operations, allowing data read from the memories 404 to be stored in the cache hierarchy of the CPU 530, reducing cache access latency for the CPU 530. In one embodiment, NVLink 410 includes support for Address Translation Services (ATS), allowing the PPU 400 to directly access page tables within the CPU 530. One or more of the NVLinks 410 may also be configured to operate in a low-power mode.
Fig. 5B illustrates an exemplary system 565 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. Exemplary system 565 may be configured to implement method 250 shown in fig. 2E and/or method 330 shown in fig. 3D.
As shown, a system 565 is provided that includes at least one central processing unit 530 coupled to a communication bus 575. The communication bus 575 may directly or indirectly couple one or more of the following devices: main memory 540, network interface 535, one or more CPUs 530, one or more display devices 545, one or more input devices 560, the switch 510, and the parallel processing system 525. The communication bus 575 may be implemented using any suitable protocol and may represent one or more links or buses, such as an address bus, a data bus, a control bus, or a combination thereof. The communication bus 575 may include one or more bus or link types, such as an Industry Standard Architecture (ISA) bus, an Extended Industry Standard Architecture (EISA) bus, a Video Electronics Standards Association (VESA) bus, a Peripheral Component Interconnect (PCI) bus, a Peripheral Component Interconnect Express (PCIe) bus, a HyperTransport link, and/or another type of bus or link. In some embodiments, there are direct connections between components. For example, the CPU 530 may be directly connected to the main memory 540. Further, the CPU 530 may be directly connected to the parallel processing system 525. Where there is a direct or point-to-point connection between components, the communication bus 575 may include a PCIe link to carry out the connection. In these examples, a PCI bus need not be included in the system 565.
Although the various blocks of FIG. 5B are shown as connected via the communication bus 575 with lines, this is not intended to be limiting and is for clarity only. For example, in some embodiments, a presentation component such as the display device 545 may be considered an I/O component such as the input device 560 (e.g., if the display is a touch screen). As another example, the CPU 530 and/or the parallel processing system 525 may include memory (e.g., the main memory 540 may be representative of a storage device in addition to the parallel processing system 525, the CPU 530, and/or other components). In other words, the computing device of FIG. 5B is merely illustrative. Distinction is not made between such categories as "workstation," "server," "laptop," "desktop," "tablet," "client device," "mobile device," "handheld device," "gaming console," "Electronic Control Unit (ECU)," "virtual reality system," and/or other device or system types, as all are contemplated within the scope of the computing device of FIG. 5B.
The system 565 also includes a main memory 540. Control logic (software) and data are stored in main memory 540, and main memory 540 may take the form of a variety of computer-readable media. Computer readable media can be any available media that can be accessed by system 565. Computer readable media can include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.
Computer storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data types. For example, main memory 540 may store computer readable instructions (e.g., representing programs and/or program elements, such as an operating system). Computer storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the system 565. As used herein, computer storage media does not include signals per se.
Communication media may embody computer readable instructions, data structures, program modules, and/or other data types in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" may refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The computer programs, when executed, enable the system 565 to perform various functions. The CPUs 530 may be configured to execute at least some of the computer readable instructions to control one or more components of the system 565 to perform one or more of the methods and/or processes described herein. The CPUs 530 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) capable of handling a multitude of software threads simultaneously. The CPUs 530 may include any type of processor, and may include different types of processors depending on the type of system 565 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers). For example, depending on the type of system 565, the processor may be an Advanced RISC Machines (ARM) processor implemented using Reduced Instruction Set Computing (RISC) or an x86 processor implemented using Complex Instruction Set Computing (CISC). The system 565 may include one or more CPUs 530 in addition to one or more microprocessors or supplementary coprocessors, such as math coprocessors.
In addition to or in lieu of CPU 530, parallel processing module 525 may be configured to execute at least some of the computer readable instructions to control one or more components of system 565 to perform one or more of the methods and/or processes described herein. Parallel processing module 525 may be used by system 565 to render graphics (e.g., 3D graphics) or to perform general-purpose computations. For example, parallel processing module 525 may be used for general purpose computing on a GPU (GPGPU). In embodiments, one or more CPUs 530 and/or parallel processing modules 525 may perform any combination of methods, processes, and/or portions thereof, either discretely or jointly.
System 565 also includes one or more input devices 560, a parallel processing system 525, and one or more display devices 545. The display device 545 may include a display (e.g., a monitor, touchscreen, television screen, Heads Up Display (HUD), other display types, or combinations thereof), speakers, and/or other presentation components. Display device 545 may receive data from other components (e.g., parallel processing system 525, CPU 530, etc.) and output the data (e.g., as images, video, sound, etc.).
Network interface 535 may enable system 565 to be logically coupled to other devices, including an input device 560, one or more display devices 545, and/or other components, some of which may be built into (e.g., integrated within) system 565. Illustrative input devices 560 include a microphone, mouse, keyboard, joystick, game pad, game controller, satellite dish, scanner, printer, wireless device, or the like. Input device 560 may provide a Natural User Interface (NUI) that handles air gestures, speech, or other physiological input generated by a user. In some cases, the input may be sent to an appropriate network element for further processing. The NUI may implement any combination of voice recognition, stylus recognition, facial recognition, biometric recognition, on-screen and near-screen gesture recognition, air gestures, head and eye tracking, and touch recognition (as described in more detail below) associated with the display of system 565. The system 565 may include a depth camera for gesture detection and recognition, such as a stereo camera system, an infrared camera system, an RGB camera system, touch screen technology, and combinations of these. Additionally, system 565 can include an accelerometer or gyroscope (e.g., as part of an Inertial Measurement Unit (IMU)) that enables detection of motion. In some examples, system 565 may use the output of an accelerometer or gyroscope to render immersive augmented reality or virtual reality.
Further, system 565 can be coupled to a network (e.g., a telecommunications network, a Local Area Network (LAN), a wireless network, a Wide Area Network (WAN) (e.g., the internet), a peer-to-peer network, a cable network, etc.) through network interface 535 for communication purposes. System 565 can be included within a distributed network and/or cloud computing environment.
Network interface 535 may include one or more receivers, transmitters, and/or transceivers that enable system 565 to communicate with other computing devices via an electronic communication network (including wired and/or wireless communication). Network interface 535 may include components and functionality to enable communication via any of a number of different networks, such as a wireless network (e.g., Wi-Fi, Z-wave, bluetooth LE, ZigBee, etc.), a wired network (e.g., via ethernet or infiniband communication), a low-power wide area network (e.g., LoRaWAN, SigFox, etc.), and/or the internet.
The system 565 may also include a secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a Digital Versatile Disk (DVD) drive, a recording device, or a Universal Serial Bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner. The system 565 may also include a hard-wired power supply, a battery power supply, or a combination thereof (not shown). The power supply may provide power to the system 565 to enable the components of the system 565 to operate.
Each of the aforementioned modules and/or devices may even reside on a single semiconductor platform to form system 565. Alternatively, the different modules may also be positioned individually or in different combinations of semiconductor platforms, as desired by the user. While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Example Network Environment
A network environment suitable for implementing embodiments of the present disclosure may include one or more client devices, servers, Network Attached Storage (NAS), other backend devices, and/or other device types. The client devices, servers, and/or other device types (e.g., each device) may be implemented on one or more instances of the processing system 500 of FIG. 5A and/or the exemplary system 565 of FIG. 5B; for example, each device may include similar components, features, and/or functionality of the processing system 500 and/or the exemplary system 565.
The components of the network environment may communicate with each other via one or more networks, which may be wired, wireless, or both. The network may include multiple networks or one of multiple networks. For example, the network may include one or more Wide Area Networks (WANs), one or more Local Area Networks (LANs), one or more public networks such as the internet and/or the Public Switched Telephone Network (PSTN), and/or one or more private networks. Where the network comprises a wireless telecommunications network, components such as base stations, communication towers, or even access points (among other components) may provide wireless connectivity.
Compatible network environments may include one or more peer-to-peer network environments (in which case, a server may not be included in the network environment) and one or more client-server network environments (in which case, one or more servers may be included in the network environment). In a peer-to-peer network environment, the functionality described herein with respect to a server may be implemented on any number of client devices.
In at least one embodiment, the network environment may include one or more cloud-based network environments, distributed computing environments, combinations thereof, or the like. A cloud-based network environment may include a framework layer, a work scheduler, a resource manager, and a distributed file system implemented on one or more servers, which may include one or more core network servers and/or edge servers. The framework layer may include a framework that supports software of the software layer and/or one or more applications of the application layer. The software or applications may include web-based service software or applications, respectively. In embodiments, one or more client devices may use web-based service software or applications (e.g., by accessing the service software and/or applications via one or more Application Programming Interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open source software web application framework, such as may be used for large-scale data processing (e.g., "big data") using a distributed file system.
A cloud-based network environment may provide cloud computing and/or cloud storage that carries out any combination of the computing and/or data storage functions described herein (or one or more portions thereof). Any of these various functions may be distributed over multiple locations from central or core servers (e.g., of one or more data centers that may be distributed across a state, a region, a country, the globe, etc.). If a connection to a user (e.g., a client device) is relatively close to one or more edge servers, the one or more core servers may designate at least a portion of the functionality to the edge servers. A cloud-based network environment may be private (e.g., limited to a single organization), public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).
The client devices may include at least some of the components, features, and functionality of the example processing system 500 of FIG. 5A and/or the exemplary system 565 of FIG. 5B. By way of example and not limitation, a client device may be implemented as a Personal Computer (PC), a laptop computer, a mobile device, a smartphone, a tablet computer, a smart watch, a wearable computer, a Personal Digital Assistant (PDA), an MP3 player, a virtual reality headset, a Global Positioning System (GPS) or device, a video player, a camera, a surveillance device or system, a vehicle, a boat, a spacecraft, a virtual machine, a drone, a robot, a handheld communication device, a hospital device, a gaming device or system, an entertainment system, a vehicle computer system, an embedded system controller, a remote control, an appliance, a consumer electronic device, a workstation, an edge device, any combination of these delineated devices, or any other suitable device.
Machine Learning
Deep Neural Networks (DNNs) developed on processors such as PPU 400 have been used for diverse use cases: from self-driving cars to faster drug development, from automatic image captioning in image databases to intelligent real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. Much as a child is initially taught by an adult to correctly identify and classify various shapes and is eventually able to identify shapes without any coaching, a deep learning or neural learning system needs to be trained in object recognition and classification so that it becomes smarter and more efficient at identifying basic objects, occluded objects, and the like, while also assigning context to objects.
At the simplest level, neurons in the human brain look at various inputs received, assign a level of importance to each of these inputs, and pass the output to other neurons for processing. Artificial neurons or perceptrons are the most basic model of neural networks. In one example, a perceptron may receive one or more inputs representing various features of an object that the perceptron is being trained to recognize and classify, and each of these features is given a weight based on the importance of the feature when defining the shape of the object.
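As a minimal sketch of the perceptron just described (the feature values, weights, and bias below are illustrative assumptions, not values from the disclosure):

```python
import numpy as np

# A minimal perceptron: weighted inputs are summed and passed through a
# step activation to produce a binary classification.
def perceptron(features: np.ndarray, weights: np.ndarray, bias: float) -> int:
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    return int(np.dot(features, weights) + bias > 0.0)

features = np.array([0.7, 0.2, 0.9])   # e.g., measurements of an object's shape
weights = np.array([1.5, 0.3, 2.0])    # importance assigned to each feature
print(perceptron(features, weights, bias=-1.0))  # -> 1
```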
Deep Neural Network (DNN) models include multiple layers of many connected nodes (e.g., perceptrons, boltzmann machines, radial basis functions, convolutional layers, etc.), which can be trained with large amounts of input data to solve complex problems quickly and with high accuracy. In one example, the first layer of the DNN model decomposes the input image of the car into various parts and finds basic patterns (such as lines and corners). The second layer assembles the lines to look for higher level patterns such as wheels, windshields and mirrors. The next layer identifies the type of vehicle, and the last few layers generate labels for the input images to identify the model of a particular automobile brand.
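A minimal sketch of such a layered model is shown below; the layer sizes and random weights are illustrative placeholders rather than a trained network, and the comments map each stage onto the car example above.

```python
import numpy as np

# Each layer transforms the output of the previous one, from low-level
# patterns toward a final label. This is an untrained stand-in, not the
# convolutional network described above.
rng = np.random.default_rng(0)

def dense(x, in_dim, out_dim):
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)            # linear transform + ReLU

x = rng.standard_normal((1, 64))              # flattened input image features
h1 = dense(x, 64, 32)                         # basic patterns: lines and corners
h2 = dense(h1, 32, 16)                        # higher-level parts: wheels, mirrors
logits = h2 @ rng.standard_normal((16, 3))    # scores for three vehicle labels
print(int(np.argmax(logits)))                 # predicted label index
```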
Once the DNNs are trained, they may be deployed and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process through which a DNN extracts useful information from a given input) include identifying handwritten numbers on checks deposited in ATM machines, identifying images of friends in photos, delivering movie recommendations to over fifty million users, identifying and classifying different types of automobiles, pedestrians, and road hazards in driverless cars, or translating human speech in real time.
During training, data flows through the DNN in a forward propagation phase until a prediction is produced that indicates a label corresponding to the input. If the neural network does not correctly label the input, the errors between the correct label and the predicted label are analyzed, and the weights are adjusted for each feature during a backward propagation phase until the DNN correctly labels the input and the other inputs in the training dataset. Training complex neural networks requires massive amounts of parallel computing performance, including the floating-point multiplications and additions supported by PPU 400. Inference is less compute-intensive than training; it is a latency-sensitive process in which a trained neural network is applied to new inputs it has not seen before to classify images, detect emotions, identify recommendations, recognize and translate speech, and generally infer new information.
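The following sketch illustrates, under simplifying assumptions, the forward-propagation / error-analysis / weight-adjustment cycle described above, using logistic regression as a stand-in for a DNN; the data, learning rate, and iteration count are arbitrary.

```python
import numpy as np

# Train-by-error-correction: forward propagation produces a prediction, the
# error against the correct label is measured, and the weights are adjusted
# (backward propagation) until the inputs are labeled correctly.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))             # training inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # correct labels
w = np.zeros(5)

for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-(X @ w)))     # forward propagation
    error = pred - y                          # error vs. correct labels
    w -= 0.1 * (X.T @ error) / len(y)         # adjust weights (gradient step)

accuracy = np.mean((pred > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```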
Neural networks rely heavily on matrix mathematics, and complex multi-layer networks require a large amount of floating point performance and bandwidth to improve efficiency and speed. With thousands of processing cores, optimized for matrix mathematical operations, and delivering performance in the tens to hundreds of TFLOPS, PPU 400 is a computing platform capable of delivering the performance required for deep neural network-based artificial intelligence and machine learning applications.
Further, images generated by applying one or more of the techniques disclosed herein may be used to train, test, or certify DNNs used to recognize objects and environments in the real world. Such images may include scenes of roadways, factories, buildings, urban settings, rural settings, humans, animals, and any other physical object or real-world setting. Such images may be used to train, test, or certify DNNs that are employed in machines or robots to manipulate, handle, or modify physical objects in the real world. Furthermore, such images may be used to train, test, or certify DNNs that are employed in autonomous vehicles to navigate and move the vehicles through the real world. Additionally, images generated by applying one or more of the techniques disclosed herein may be used to convey information to users of such machines, robots, and vehicles.
Fig. 5C illustrates components of an example system 555 that can be used to train and utilize machine learning in accordance with at least one embodiment. As will be discussed, the various components may be provided by various combinations of computing devices and resources or a single computing system, which may be under the control of a single entity or multiple entities. Further, aspects may be triggered, initiated, or requested by different entities. In at least one embodiment, training of the neural network may be directed by a vendor associated with the vendor environment 506, and in at least one embodiment, may be requested by a customer or other user who is able to access the vendor environment through the client device 502 or other such resource. In at least one embodiment, the training data (or data to be analyzed by the trained neural network) may be provided by a provider, a user, or a third-party content provider 524, among others. In at least one embodiment, client device 502 can be a vehicle or object that can navigate on behalf of a user, e.g., the user can submit a request and/or receive instructions to facilitate device navigation.
In at least one embodiment, the request can be submitted over at least one network 504 to be received by the provider environment 506. In at least one embodiment, the client device may be any suitable electronic and/or computing device that enables a user to generate and send such a request, such as, but not limited to, a desktop computer, a notebook computer, a computer server, a smart phone, a tablet computer, a game console (portable or otherwise), a computer processor, computing logic, a set-top box, and so forth. The one or more networks 504 may include any suitable network for sending requests or other such data, and may include, for example, the internet, an intranet, an ethernet, a cellular network, a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), an ad hoc network with direct wireless connections between peers, and so forth.
In at least one embodiment, the request may be received at an interface layer 508, which in this example may forward the data to the training and reasoning manager 532. In at least one embodiment, the training and reasoning manager 532 can be a system or service including hardware and software for managing requests and servicing the corresponding data or content, and in at least one embodiment, the training and reasoning manager 532 can receive a request to train a neural network and can provide the data for the request to the training module 512. In at least one embodiment, the training module 512 can select an appropriate model or network to use, if one is not specified by the request, and can train the model using the relevant training data. In at least one embodiment, the training data may be a batch of data stored in the training data store 514, received from the client device 502, or obtained from a third-party provider 524. In at least one embodiment, the training module 512 may be responsible for training the data. The neural network may be any suitable network, such as a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN). Once the neural network is trained and successfully evaluated, the trained neural network may be stored in a model repository 516, which may, for example, store different models or networks for users, applications, services, and the like. In at least one embodiment, multiple models may exist for a single application or entity, as different models may be utilized based on a number of different factors.
In at least one embodiment, at a subsequent point in time, a request for content (e.g., a path determination) or data that is determined or affected at least in part by the trained neural network may be received from the client device 502 (or another such device). The request may include, for example, input data to be processed using the trained neural network to obtain one or more inferences or other output values, classifications, or predictions. In at least one embodiment, the input data may be received by the interface layer 508 and directed to the inference module 518, although a different system or service may also be used. In at least one embodiment, if not already stored locally to the inference module 518, the inference module 518 can obtain an appropriate trained network, such as a trained Deep Neural Network (DNN) as described herein, from the model store 516. The inference module 518 can provide the data as input to the trained network, which can then generate one or more inferences as output. This may include, for example, a classification of an instance of the input data. In at least one embodiment, the inference can then be sent to the client device 502 for display to the user or other communication with the user. In at least one embodiment, context data for the user may also be stored to a user context data store 522, which may include data about the user that may be useful as input to the network in generating inferences or in determining data to return to the user. In at least one embodiment, relevant data, which may include at least some of the input or inference data, may also be stored to a local database 534 for use in processing future requests. In at least one embodiment, the user may use account information or other information to access resources or functionality of the provider environment. In at least one embodiment, user data may also be collected and used to further train the models, if permitted and available, to provide more accurate inferences for future requests. In at least one embodiment, a request to execute the machine learning application 526 on the client device 502 may be received through a user interface, and the results may be displayed through the same interface. The client device may include resources, such as a processor 528 and memory 562 for generating the request and processing the results or response, and at least one data storage element 552 for storing data for the machine learning application 526.
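A hedged sketch of this request path is given below; the class and method names (InferenceModule, the model store's load(), the network's classify()) are hypothetical placeholders and not the API of the provider environment described above.

```python
# Sketch: the interface layer hands input data to an inference module, which
# loads a trained network from the model repository (if not already cached
# locally) and returns the network's output for the client.
class InferenceModule:
    def __init__(self, model_store):
        self.model_store = model_store
        self._cache = {}

    def infer(self, model_id, input_data):
        if model_id not in self._cache:                  # obtain trained network
            self._cache[model_id] = self.model_store.load(model_id)
        network = self._cache[model_id]
        return network.classify(input_data)              # e.g., a classification

# Usage (assuming a model_store object exposing a load() method):
# result = InferenceModule(model_store).infer("trained_dnn", request_input_data)
```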
In at least one embodiment, the processor 528 (or a processor of the training module 512 or the inference module 518) will be a Central Processing Unit (CPU). However, as described above, resources in such an environment may utilize GPUs to process data for at least certain types of requests. GPUs with thousands of cores (e.g., PPU 400) are designed to handle substantial parallel workloads and have therefore become popular in deep learning for training neural networks and generating predictions. While offline builds using a GPU allow larger, more complex models to be trained faster, generating predictions offline means that request-time input features cannot be used, or that predictions must be generated for all permutations of features and stored in a look-up table in order to serve real-time requests. If the deep learning framework supports a CPU mode, and the model is small and simple enough that a feed-forward pass can be performed on the CPU with reasonable latency, then a service on a CPU instance can host the model. In this case, training can be done offline on a GPU and inference done in real time on the CPU. If the CPU approach is not feasible, the service can run on a GPU instance. However, because GPUs have different performance and cost characteristics than CPUs, running a service that offloads the runtime algorithm to the GPU may require that it be designed differently from a CPU-based service.
In at least one embodiment, video data may be provided from client device 502 for enhancement in provider environment 506. In at least one embodiment, the video data may be processed for enhancement on client device 502. In at least one embodiment, the video data may be streamed from the third-party content provider 524 and enhanced by the third-party content provider 524, the provider environment 506, or the client device 502. In at least one embodiment, video data may be provided from client device 502 for use as training data in vendor environment 506.
In at least one embodiment, supervised and/or unsupervised training may be performed by the client device 502 and/or the vendor environment 506. In at least one embodiment, a set of training data 514 (e.g., classified or labeled data) may be provided as input to be used as training data. In at least one embodiment, the training data may include instances of at least one type of object for which the neural network is to be trained, as well as information identifying the type of object. In at least one embodiment, the training data may include a set of images that each include a representation of an object type, where each image further includes or is associated with a label, metadata, classification, or other information identifying the object type represented in the respective image. Various other types of data may also be used as training data, including text data, audio data, video data, and the like. In at least one embodiment, the training data 514 is provided as training input to the training module 512. In at least one embodiment, the training module 512 may be a system or service comprising hardware and software, such as one or more computing devices executing a training application for training a neural network (or other models or algorithms, etc.). In at least one embodiment, the training module 512 receives an instruction or request indicating the type of model to be used for the training. In at least one embodiment, the model may be any suitable statistical model, network, or algorithm for such purposes, which may include artificial neural networks, deep learning algorithms, learning classifiers, Bayesian networks, and the like. In at least one embodiment, the training module 512 may select an initial model or other untrained model from an appropriate repository 516 and train the model with the training data 514, generating a trained model (e.g., a trained deep neural network) that may be used to classify similar types of data or to generate other such inferences. In at least one embodiment in which such training data is not used, an appropriate initial model may still be selected by the training module 512 for training on the input data.
In at least one embodiment, the model may be trained in a number of different ways, which may depend in part on the type of model selected. In at least one embodiment, a machine learning algorithm may be provided with a set of training data, where the model is a model artifact created by the training process. In at least one embodiment, each instance of training data contains a correct answer (e.g., a classification), which may be referred to as a target or target attribute. In at least one embodiment, the learning algorithm finds patterns in the training data that map the input data attributes to the target (the answer to be predicted) and outputs a machine learning model that captures these patterns. In at least one embodiment, the machine learning model may then be used to obtain predictions for new data for which the target is not specified.
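The sketch below illustrates this fit-then-predict pattern with a nearest-centroid classifier standing in for any such learning algorithm; the data and class structure are illustrative assumptions.

```python
import numpy as np

# The learning algorithm finds patterns mapping input attributes to a target
# in the training data; the resulting model artifact is then used to predict
# targets for new data for which the target is not specified.
class NearestCentroid:
    def fit(self, X, y):                       # learn patterns mapping X to y
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self                            # the trained model artifact

    def predict(self, X_new):                  # targets for new, unlabeled data
        d = np.linalg.norm(X_new[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(NearestCentroid().fit(X, y).predict(np.array([[0.1, 0.0], [1.0, 0.9]])))  # [0 1]
```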
In at least one embodiment, the training and reasoning manager 532 may select from a set of machine learning models including binary classification, multi-class classification, generation, and regression models. In at least one embodiment, the type of model to be used may depend, at least in part, on the type of target to be predicted.
Example Streaming System
Images generated by applying one or more of the techniques disclosed herein may be displayed on a monitor or other display device. In some embodiments, the display device may be coupled directly to the system or processor generating or rendering the images. In other embodiments, the display device may be coupled indirectly to the system or processor, such as via a network. Examples of such networks include the internet, mobile telecommunications networks, Wi-Fi networks, as well as any other wired and/or wireless networking system. When the display device is indirectly coupled, the images generated by the system or processor may be streamed over the network to the display device. Such streaming allows, for example, video games or other applications that render images to be executed on a server, in a data center, or in a cloud-based computing environment, and the rendered images to be transmitted to and displayed on one or more user devices (such as a computer, video game console, smartphone, or other mobile device, etc.) that are physically separate from the server or data center. Hence, the techniques disclosed herein can be applied to enhance the images that are streamed and to enhance services that stream images, such as NVIDIA GeForce Now (GFN), Google Stadia, and the like.
Fig. 6 is an example system diagram of a streaming system 605 according to some embodiments of the present disclosure. In an embodiment, streaming system 605 is a gaming streaming system. Fig. 6 includes one or more servers 603 (which may include similar components, features, and/or functions as the example processing system 500 of fig. 5A and/or the example system 565 of fig. 5B), one or more client devices 604 (which may include similar components, features, and/or functions as the example processing system 500 of fig. 5A and/or the example system 565 of fig. 5B), and one or more networks 606 (which may be similar to one or more networks described herein). In some embodiments of the present disclosure, system 605 may be implemented.
In the system 605, for a game session, the one or more client devices 604 may only receive input data in response to inputs to the one or more input devices, transmit the input data to the one or more servers 603, receive encoded display data from the one or more servers 603, and display the display data on the display 624. As such, the more computationally intense computing and processing is offloaded to the one or more servers 603 (e.g., rendering, in particular ray or path tracing, for graphical output of the game session is executed by one or more GPUs of the one or more servers 603). In other words, the game session is streamed from the one or more servers 603 to the one or more client devices 604, thereby reducing the requirements of the one or more client devices 604 for graphics processing and rendering.
For example, with respect to an instantiation of a game session, a client device 604 may display a frame of the game session on the display 624 based on receiving the display data from the one or more servers 603. The client device 604 may receive an input to one of the one or more input devices and generate input data in response. The client device 604 may transmit the input data to the one or more servers 603 via the communication interface 621 and over the one or more networks 606 (e.g., the internet), and the one or more servers 603 may receive the input data via the communication interface 618. The CPUs may receive the input data, process the input data, and transmit data to the GPUs that causes the GPUs to generate a rendering of the game session. For example, the input data may represent a movement of the user's character in a game, firing a weapon, reloading, passing a ball, turning a vehicle, etc. The rendering component 612 may render the game session (e.g., representative of the result of the input data), and the rendering capture component 614 may capture the rendering of the game session as display data (e.g., as image data capturing the rendered frame of the game session). The rendering of the game session may include lighting and/or shadow effects computed using one or more parallel processing units of the one or more servers 603, such as GPUs, which may further employ one or more dedicated hardware accelerators or processing cores to perform ray or path tracing techniques. The encoder 616 may then encode the display data to generate encoded display data, and the encoded display data may be transmitted to the client device 604 over the one or more networks 606 via the communication interface 618. The client device 604 may receive the encoded display data via the communication interface 621, and the decoder 622 may decode the encoded display data to generate the display data. The client device 604 may then display the display data via the display 624.
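A hedged sketch of this round trip is shown below; every function name (capture_input, render_frame, encode, decode, present) is a hypothetical placeholder standing in for the components in FIG. 6, not an actual API of the streaming system.

```python
# Sketch of the streaming loop: the client captures input, the server renders
# and encodes a frame of the session, and the client decodes and displays it.
def server_step(input_data, render_frame, encode):
    rendering = render_frame(input_data)       # GPU renders the game session
    return encode(rendering)                   # encoder produces encoded display data

def client_step(capture_input, send_to_server, decode, present):
    input_data = capture_input()               # input device on the client
    encoded = send_to_server(input_data)       # round trip over the network
    present(decode(encoded))                   # decoder output shown on the display
```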
It should be noted that the techniques described herein may be embodied in executable instructions stored in a computer readable medium for use by or in connection with a processor-based instruction execution machine, system, device, or apparatus. Those skilled in the art will appreciate that for some embodiments, different types of computer-readable media may be included for storing data. As used herein, "computer-readable medium" includes one or more of any suitable medium for storing executable instructions of a computer program, such that an instruction-executing machine, system, apparatus, or device can read (or retrieve) the instructions from the computer-readable medium and execute the instructions for implementing the described embodiments. Suitable storage formats include one or more of electronic, magnetic, optical, and electromagnetic formats. A non-exhaustive list of conventional exemplary computer readable media includes: a portable computer diskette; random Access Memory (RAM); read Only Memory (ROM); erasable programmable read-only memory (EPROM); a flash memory device; and optical storage devices including portable Compact Discs (CDs), portable Digital Video Discs (DVDs), and the like.
It is to be understood that the arrangement of components shown in the figures is for illustrative purposes and that other arrangements are possible. For example, one or more of the elements described herein may be implemented in whole or in part as electronic hardware components. Other elements may be implemented in software, hardware, or a combination of software and hardware. Further, some or all of these other elements may be combined, some elements may be omitted entirely, and additional components may be added while still achieving the functionality described herein. Thus, the subject matter described herein can be embodied in a number of different variations, and all such variations are contemplated to be within the scope of the claims.
To facilitate an understanding of the subject matter described herein, many aspects are described in terms of sequences of actions. Those skilled in the art will recognize that the different actions could be performed by specialized circuits or circuitry, by program instructions being executed by one or more processors, or by a combination of both. The description herein of any sequence of actions is not intended to imply that the particular order described for performing the sequence must be followed. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The use of the terms "a" and "an" and "the" and similar references in the context of describing the subject matter (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term "at least one" followed by a list of one or more items (e.g., "at least one of A and B") should be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the scope of protection sought is defined by the claims as set forth hereinafter together with any equivalents thereof. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate the subject matter and does not pose a limitation on the scope of the subject matter unless otherwise claimed. The use of the term "based on" and other similar phrases, in both the claims and the written description, indicating a condition leading to a result is not intended to exclude any other conditions that lead to that result. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention as claimed.

Claims (20)

1. A computer-implemented method, comprising:
receiving training data comprising input data and ground truth output, wherein the input data is associated with a first distribution;
applying at least one enhancement to the input data to produce enhanced input data associated with a second distribution, wherein an enhancement operator corresponding to a transformation from the first distribution to the second distribution is invertible and specifies the at least one enhancement;
processing, by a neural network model, the enhanced input data according to parameters to produce output data; and
adjusting the parameter to reduce a difference between the output data and the ground truth output.
2. The computer-implemented method of claim 1, wherein the first distribution is the only distribution for which the enhancement operator transforms the first distribution into an enhanced distribution that matches the second distribution.
3. The computer-implemented method of claim 1, wherein the input data comprises a first subset of generated data and a second subset of real data, the output data comprises a value indicative of a first state or a second state, and the parameter is adjusted such that a first portion of the output data produced for the first subset more closely matches the first state and a second portion of the output data produced for the second subset more closely matches the second state.
4. The computer-implemented method of claim 1, wherein the input data includes images and the at least one enhancement is disabled for at least one image in the training data.
5. The computer-implemented method of claim 1, wherein the applying further comprises: randomly disabling the at least one enhanced application for the portion of the input data based on a value.
6. The computer-implemented method of claim 5, wherein the value is dynamically adjusted during repeated applying and processing.
7. The computer-implemented method of claim 6, wherein the value is increased or decreased based on at least a portion of the output data.
8. The computer-implemented method of claim 6, wherein at least a portion of the output data is compared to a reference to produce a result, and the value is adjusted based on the result.
9. The computer-implemented method of claim 1, wherein the at least one enhancement is differentiable.
10. The computer-implemented method of claim 1, wherein the at least one enhancement is implemented as a sequence of different enhancements.
11. The computer-implemented method of claim 1, wherein the input data comprises a first subset of generated data and a second subset of real data, and the first subset is produced by a generator neural network model based on a second parameter.
12. The computer-implemented method of claim 11, further comprising: after adjusting the parameters, adjusting the second parameters such that the distributions of the first subset and the second subset more closely match.
13. The computer-implemented method of claim 1, wherein the steps of applying, processing, and adjusting are performed within a cloud computing environment.
14. The computer-implemented method of claim 1, wherein the steps of applying, processing, and adjusting are performed on a server or in a data center, and at least one of the output data and the parameters is streamed to a user device.
15. The computer-implemented method of claim 1, wherein the neural network model is used to train, test, or certify another neural network model employed in a machine, robot, or autonomous vehicle.
16. The computer-implemented method of claim 1, wherein the steps of applying, processing, and adjusting are performed on a virtual machine that includes a portion of a graphics processing unit.
17. A system, comprising:
a processor configured to:
receiving training data comprising input data and ground truth output, wherein the input data is associated with a first distribution;
applying at least one enhancement to the input data to produce enhanced input data associated with a second distribution, wherein an enhancement operator corresponding to a transformation from the first distribution to the second distribution is invertible and specifies the at least one enhancement;
processing, by a neural network model, the enhanced input data according to parameters to produce output data; and
adjusting the parameter to reduce a difference between the output data and the ground truth output.
18. The system of claim 17, wherein the first distribution is the only distribution for which applying the at least one enhancement to the distribution results in an enhanced distribution that matches the second distribution.
19. A non-transitory computer readable medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of:
receiving training data comprising input data and ground truth output, wherein the input data is associated with a first distribution;
applying at least one enhancement to the input data to produce enhanced input data associated with a second distribution, wherein an enhancement operator corresponding to a transformation from the first distribution to the second distribution is invertible and specifies the at least one enhancement;
processing, by a neural network model, the enhanced input data according to parameters to produce output data; and
adjusting the parameter to reduce a difference between the output data and the ground truth output.
20. A computer-implemented method, comprising:
receiving training data comprising input data and ground truth output, wherein the input data is associated with a first distribution;
identifying a reversible enhancement operator corresponding to a transformation from the first distribution to a second distribution;
selecting at least one enhancement specified by the enhancement operator;
enhancing the input data using the at least one enhancement to produce enhanced input data associated with the second distribution;
processing, by a neural network model, the enhanced input data according to parameters to produce output data; and
adjusting the parameter to reduce a difference between the output data and the ground truth output.
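The claims above recite a training loop in which at least one invertible augmentation is applied to the input data, the augmented data is processed by a neural network model, the model parameters are adjusted to reduce the difference between the output and the ground truth output, and, per claims 5-8, the probability with which the augmentation is applied is adjusted dynamically based on a portion of the output data. The following PyTorch-style sketch illustrates that loop under stated assumptions and is not the claimed or disclosed implementation: the toy discriminator Disc, the particular flip/rotation augmentation, the random stand-in data, and the values of target and step are all illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Disc(nn.Module):
    """Toy discriminator standing in for the neural network model of claim 1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(), nn.Linear(64 * 8 * 8, 1),
        )

    def forward(self, x):
        return self.net(x)

def augment(x: torch.Tensor, p: float) -> torch.Tensor:
    """Apply an illustrative flip/rotation augmentation to each image with probability p."""
    mask = torch.rand(x.shape[0], device=x.device) < p
    if mask.any():
        aug = torch.flip(x[mask], dims=[3])           # horizontal flip
        k = int(torch.randint(0, 4, (1,)))            # rotation by a multiple of 90 degrees
        aug = torch.rot90(aug, k, dims=[2, 3])
        x = x.clone()
        x[mask] = aug
    return x

disc = Disc()
opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
p, target, step = 0.0, 0.6, 0.01                      # illustrative adaptive-probability settings

for _ in range(100):
    real = torch.rand(16, 3, 32, 32)                  # stand-in for real training images
    fake = torch.rand(16, 3, 32, 32)                  # stand-in for generator output
    x = torch.cat([real, fake])                       # input data with two subsets (claim 3)
    y = torch.cat([torch.ones(16, 1), torch.zeros(16, 1)])   # ground truth output

    out = disc(augment(x, p))                         # process the enhanced input data
    loss = F.binary_cross_entropy_with_logits(out, y)
    opt.zero_grad()
    loss.backward()
    opt.step()                                        # adjust the parameters

    # Claims 5-8: raise or lower the application probability based on part of the output data.
    r_t = out[:16].sign().mean().item()               # how confidently the real subset looks "real"
    p = min(max(p + (step if r_t > target else -step), 0.0), 1.0)

The last two lines mirror the heuristic of claims 6-8: when too large a portion of the outputs produced for the real subset indicates the "real" state (a symptom of overfitting), the augmentation probability is increased, and otherwise it is decreased.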
CN202110623844.1A 2020-06-05 2021-06-04 Training neural networks with finite data using reversible enhancement operators Pending CN113762461A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063035448P 2020-06-05 2020-06-05
US63/035,448 2020-06-05
US17/210,934 US20210383241A1 (en) 2020-06-05 2021-03-24 Training neural networks with limited data using invertible augmentation operators
US17/210,934 2021-03-24

Publications (1)

Publication Number Publication Date
CN113762461A (en) 2021-12-07

Family

ID=78605367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110623844.1A Pending CN113762461A (en) 2020-06-05 2021-06-04 Training neural networks with finite data using reversible enhancement operators

Country Status (2)

Country Link
CN (1) CN113762461A (en)
DE (1) DE102021205690A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2016159017A1 (en) * 2015-03-31 2018-02-01 国立大学法人東北大学 Magnetoresistive element, magnetic memory device, manufacturing method, operating method, and integrated circuit
CN109313724A (en) * 2016-06-01 2019-02-05 科磊股份有限公司 The system and method for neural network and forward physical model is incorporated to for semiconductor application
US20180101784A1 (en) * 2016-10-05 2018-04-12 D-Wave Systems Inc. Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
CN110892417A (en) * 2017-06-05 2020-03-17 D5Ai有限责任公司 Asynchronous agent with learning coach and structurally modifying deep neural network without degrading performance
CN110800062A (en) * 2017-10-16 2020-02-14 因美纳有限公司 Deep convolutional neural network for variant classification
CN111145106A (en) * 2019-12-06 2020-05-12 深圳市雄帝科技股份有限公司 Image enhancement method, device, medium and equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338385A (en) * 2021-12-31 2022-04-12 上海商汤智能科技有限公司 Network configuration method and system, electronic device and storage medium
CN116248412A (en) * 2023-04-27 2023-06-09 中国人民解放军总医院 Shared data resource abnormality detection method, system, equipment, memory and product
CN116248412B (en) * 2023-04-27 2023-08-22 中国人民解放军总医院 Shared data resource abnormality detection method, system, equipment, memory and product
CN116664566A (en) * 2023-07-28 2023-08-29 成都数智创新精益科技有限公司 OLED panel screen printing quality control method, system and device and medium
CN116664566B (en) * 2023-07-28 2023-09-26 成都数智创新精益科技有限公司 OLED panel screen printing quality control method, system and device and medium

Also Published As

Publication number Publication date
DE102021205690A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
US11620521B2 (en) Smoothing regularization for a generative neural network
US20210383241A1 (en) Training neural networks with limited data using invertible augmentation operators
CN113762461A (en) Training neural networks with finite data using reversible enhancement operators
CN113496271A (en) Neural network control variables
US20220198622A1 (en) High dynamic range support for legacy applications
US11610370B2 (en) Joint shape and appearance optimization through topology sampling
US20230052645A1 (en) Multiresolution hash encoding for neural networks
US20220222832A1 (en) Machine learning framework applied in a semi-supervised setting to perform instance tracking in a sequence of image frames
CN114647527A (en) Measuring and detecting idle processing periods and determining root causes thereof in cloud-based streaming applications
US11282258B1 (en) Adaptive sampling at a target sampling rate
CN115202922A (en) Packed Error Correction Code (ECC) for compressed data protection
US20230298243A1 (en) 3d digital avatar generation from a single or few portrait images
US20230237342A1 (en) Adaptive lookahead for planning and learning
US20230252692A1 (en) Learning dense correspondences for images
US20220398283A1 (en) Method for fast and better tree search for reinforcement learning
US20220391781A1 (en) Architecture-agnostic federated learning system
US20230062503A1 (en) Pruning and accelerating neural networks with hierarchical fine-grained structured sparsity
US20230289509A1 (en) Parallel mask rule checking on evolving mask shapes in optical proximity correction flows
US11830145B2 (en) Generation of differentiable, manifold meshes of arbitrary genus
US11605001B2 (en) Weight demodulation for a generative neural network
CN113408694A (en) Weight demodulation for generative neural networks
US20240127067A1 (en) Sharpness-aware minimization for robustness in sparse neural networks
US11689750B2 (en) Workload-based dynamic throttling of video processing functions using machine learning
US20240112308A1 (en) Joint neural denoising of surfaces and volumes
US20230111375A1 (en) Augmenting and dynamically configuring a neural network model for real-time systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination