CN110059793A - Progressive modification of generative adversarial neural networks - Google Patents

Progressive modification of generative adversarial neural networks

Info

Publication number
CN110059793A
CN110059793A (application CN201811242789.6A)
Authority
CN
China
Prior art keywords
gan
training
neural network
data
topology
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811242789.6A
Other languages
Chinese (zh)
Other versions
CN110059793B (en)
Inventor
Tero Tapani Karras
T. O. Aila
Samuli Matias Laine
J. T. Lehtinen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 16/156,994 (US11250329B2)
Application filed by Nvidia Corp
Publication of CN110059793A
Application granted
Publication of CN110059793B
Active (current legal status)
Anticipated expiration (legal status)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses progressive modification of a generative adversarial network (GAN). Specifically, a GAN learns a particular task by being shown many examples. In one scenario, a GAN can be trained to generate new images that include a particular object (such as a face or a bicycle). Rather than training a complex GAN that has a predetermined topology of features and interconnections between the features to learn the task, the topology of the GAN is modified while the GAN is trained for the task. The topology of the GAN may be simple at the start and become more complex as the GAN learns during training, eventually evolving to match the predetermined topology of the complex GAN. At the start, the GAN learns large-scale details of the task (a bicycle has two wheels) and later, as the GAN becomes more complex, learns finer details (the wheels have spokes).

Description

Progressive modification of generative adversarial neural networks
Priority claim
This application claims priority to U.S. Provisional Application No. 62/577,611 (attorney docket number NVIDP1193+/17-HE-0239-US0), entitled "Progressive Growing of Generative Adversarial Networks," filed on October 26, 2017, the entire contents of which are incorporated by reference herein.
Technical field
The present invention relates to generative adversarial networks (GANs), and in particular to modifying the topology of a GAN during training.
Background
After a generative adversarial network (GAN) has been trained for a particular task, the GAN can be used to generate new output data. For example, after being trained using a high-dimensional distribution of example images, a GAN can generate new images. Generative methods are recently finding widespread use, for example in speech synthesis, image-to-image translation, and image inpainting. However, there are serious practical problems in training a GAN successfully. Training is often unstable, the synthesized images are often unrealistic, and the variability of the output can suddenly deteriorate during training, a phenomenon known as mode collapse.
The traditional techniques for training generative models each have significant strengths and weaknesses. Autoregressive models produce sharp images but are slow to evaluate and have no latent representation, because autoregressive models directly model the conditional distribution over pixels, which may limit their applicability. Variational autoencoders (VAEs) are easy to train but, due to restrictions in the model, tend to produce blurry results. Traditional GANs produce sharp images, although only at fairly small resolutions and with limited variation, and the training remains unstable despite recent progress. Hybrid methods combine various strengths of autoregressive models, VAEs, and GANs, but so far lag behind GANs in image quality.
In general, a GAN consists of two neural networks: a generator and a discriminator (also known as a critic). The generator produces a sample (such as an image) from a latent code, and ideally the distribution of the images produced by the generator should be indistinguishable from the distribution of the images used to train the GAN. Since a function is generally not available for verifying that the distributions truly match, a discriminator is trained to perform this assessment. The discriminator is differentiable, and during training, gradients are computed to steer the generator toward producing outputs that more closely resemble the training images. Typically, the discriminator implements an adaptive loss function, and the discriminator is discarded once the generator has been trained.
When measuring the difference between the images in the training distribution and the generated images, the gradients can point in more or less random directions if the distributions do not have substantial overlap, i.e., if the generated images are too easy to tell apart from the training images. As the resolution of the generated images increases, training the generator may become more difficult, because higher resolutions make it easier for the discriminator to tell the generated images apart from the training images, thereby drastically amplifying the computed gradients and hindering convergence of the generator. There is a need for addressing these issues and/or other issues associated with the prior art.
Summary of the invention
A generative adversarial network (GAN) learns a particular task by being shown many examples. In one scenario, a GAN can be trained to generate new images that include a particular object (such as a face or a bicycle). Rather than training a complex GAN that has a predetermined topology of features and interconnections between the features to learn the task, the topology of the GAN is modified while the GAN is trained for the task. The topology of the GAN may be simple at the start and become more complex as the GAN learns during training, eventually evolving to match the predetermined topology of the complex GAN. At the start, the GAN learns large-scale details of the task (a bicycle has two wheels) and later, as the GAN becomes more complex, learns finer details (the wheels have spokes).
A method, computer-readable medium, and system are disclosed for progressively modifying the topology of a GAN during training. The GAN includes a generator neural network coupled to a discriminator neural network. The GAN is trained for a first duration, where the topology of the GAN includes features within the generator neural network and the discriminator neural network and the interconnections between the features. The topology of the GAN is modified to produce a modified GAN, and the modified GAN is then trained for a second duration.
Brief description of the drawings
Figure 1A illustrates a block diagram of a GAN system, in accordance with an embodiment.
Figure 1B illustrates a conceptual diagram of GAN topology modification during training, in accordance with an embodiment.
Figure 1C illustrates a flowchart of a method for modifying the topology of a GAN during training, in accordance with an embodiment.
Figure 2A illustrates a block diagram of another GAN system, in accordance with an embodiment.
Figure 2B illustrates a technique for smoothly modifying a GAN topology, in accordance with an embodiment.
Figure 2C illustrates a flowchart of a method for transitioning smoothly between GAN topologies, in accordance with an embodiment.
Figure 2D illustrates example images generated by a GAN trained to generate images of bicycles, in accordance with an embodiment.
Figure 3 illustrates a parallel processing unit, in accordance with an embodiment.
Figure 4A illustrates a general processing cluster within the parallel processing unit of Figure 3, in accordance with an embodiment.
Figure 4B illustrates a memory partition unit of the parallel processing unit of Figure 3, in accordance with an embodiment.
Figure 5A illustrates the streaming multiprocessor of Figure 4A, in accordance with an embodiment.
Figure 5B is a conceptual diagram of a processing system implemented using the PPU of Figure 3, in accordance with an embodiment.
Figure 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
Detailed description
A training technique for a GAN is disclosed, in which the topology of the GAN is modified by adding or removing layers (e.g., fully-connected layers, convolutional layers, upsampling, pooling, normalization, compression, etc.), adding or removing features (e.g., feature maps, neurons, activations, etc.), and adding or removing connections between features. In an embodiment, modifying the GAN topology changes the processing capacity of the generator neural network and/or the discriminator neural network included in the GAN. For example, training may start with low-resolution images, and processing layers may be added to the GAN as the resolution of the images is progressively increased. This incremental technique allows the training to first discover the large-scale structure of the image distribution and then shift attention to increasingly finer-scale details, instead of having to learn all scales simultaneously.
Figure 1A illustrates a block diagram of a GAN 100, in accordance with an embodiment. The GAN 100 may be implemented by a program, custom circuitry, or a combination of custom circuitry and a program. For example, the GAN 100 may be implemented using a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations described herein. Furthermore, persons of ordinary skill in the art will understand that any system that performs the operations of the GAN 100 is within the scope and spirit of embodiments of the present invention.
The GAN 100 includes a generator (neural network) 110, a discriminator (neural network) 115, and a training loss unit 105. The topologies of both the generator 110 and the discriminator 115 may be modified during training. The GAN 100 may operate in an unsupervised setting or a conditional setting. The generator 110 receives input data and produces output data. In the unsupervised setting, the input data may be a latent code, i.e., a random N-dimensional vector drawn, for example, from a Gaussian distribution. Depending on the task, the output data may be images, audio, video, or other types of data (e.g., configuration settings). The discriminator 115 is an adaptive loss function that is used during training of the generator 110. The generator 110 and the discriminator 115 are trained using a training dataset that includes example output data with which the output data produced by the generator 110 should be consistent. The generator 110 generates output data in response to the input data, and the discriminator 115 determines whether the output data appears similar to the example output data included in the training data.
In the unsupervised setting, the discriminator 115 outputs a continuous value indicating how well the output data matches the example output data. For example, in one embodiment, the discriminator 115 outputs a first training stimulus (e.g., a high value) when it determines that the output data matches the example output data, and outputs a second training stimulus (e.g., a low value) when it determines that the output data does not match the example output data. The training loss unit 105 adjusts the parameters (weights) of the GAN 100 based on the output of the discriminator 115. When the generator 110 is trained for a particular task, such as generating images of bicycles, the discriminator outputs a high value when the output data is an image of a bicycle. The output data produced by the generator 110 does not need to be identical to any one example in the example output data for the discriminator 115 to determine that the output data matches the example output data. In the context of the following description, the discriminator 115 determines that the output data matches the example output data when the output data is similar to any of the example output data.
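As an illustrative, non-limiting sketch of the roles of the discriminator 115 and the training loss unit 105, the following PyTorch-style Python shows one possible adversarial training step. The non-saturating logistic loss used here is one common choice and is an assumption, not a requirement of the embodiments; the module names `generator`, `discriminator`, and the optimizers are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=512):
    """One adversarial update: the discriminator learns to score real images high
    and generated images low; the generator is then updated to raise the
    discriminator's score on its own samples (non-saturating GAN loss)."""
    batch = real_images.shape[0]

    # Discriminator update (the adaptive loss function view of the critic).
    z = torch.randn(batch, latent_dim)                  # latent codes
    fake_images = generator(z).detach()                 # stop gradients into the generator
    d_real = discriminator(real_images)                 # high value expected
    d_fake = discriminator(fake_images)                 # low value expected
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: steer outputs toward the training distribution.
    z = torch.randn(batch, latent_dim)
    g_loss = F.softplus(-discriminator(generator(z))).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```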
In the conditional setting, the input data to the generator 110 may include other (additional) data, such as an image, a classification label (e.g., "bicycle"), segmentation contours (e.g., outlines of objects), or other data types (distributions, audio, etc.). The additional data may be specified in addition to the random latent code, or the additional data may completely replace the random latent code. The training dataset may include input/output data pairs, and the task of the discriminator 115 may be to determine whether the output data of the generator 110 appears consistent with its input, based on the example input/output pairs that the discriminator 115 has seen in the training data.
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
Figure 1B illustrates a conceptual diagram of topology modification of the GAN 100 during training, in accordance with an embodiment. Training of the GAN 100 begins at the top of Figure 1B and proceeds toward the bottom of Figure 1B as the topology of the GAN 100 is modified. During training, the topology of the GAN 100 is modified to add or remove layers, to add or remove features (such as feature maps or neurons), to add or remove connections between features, and so on. As shown in Figure 1B, layers are added to both the generator 110 and the discriminator 115.
In one embodiment, each of the parameters of the generator 110-A and the discriminator 115-A may be initialized with random values. In an embodiment, the weights are initialized by drawing them from a unit Gaussian distribution and are then scaled at runtime according to ŵ_i = w_i / c, where w_i are the weights and c is a per-layer normalization constant. The benefit of scaling the weights at runtime rather than during initialization is somewhat subtle and relates to the scale invariance of commonly used adaptive stochastic gradient descent methods. These methods normalize a gradient update by its estimated standard deviation, making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, those parameters can take longer to adjust. Scaling the weights by a per-layer normalization constant ensures that the dynamic range, and thus the learning speed, is the same for all weights.
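A minimal sketch of this runtime weight scaling (often called an equalized learning rate) is shown below in PyTorch-style Python. It assumes the common convention of deriving the per-layer constant from He's initializer, i.e., scaling by sqrt(2 / fan_in) at every forward pass; the exact constant and the class name `EqualizedLinear` are illustrative assumptions rather than part of the disclosure.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualizedLinear(nn.Module):
    """Fully-connected layer whose weights are stored as N(0, 1) samples and
    rescaled at runtime by a per-layer constant, so that every weight sees the
    same effective dynamic range under adaptive optimizers such as Adam."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Per-layer normalization constant (He-style), applied every forward pass.
        self.scale = math.sqrt(2.0 / in_features)

    def forward(self, x):
        return F.linear(x, self.weight * self.scale, self.bias)
```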
In an embodiment, modification of the topology of the GAN 100 changes the processing capacity of the generator 110 and/or the discriminator 115. For example, one or more layers may be added to modify the topology of the GAN 100 and increase the processing capacity. When the GAN 100 is trained to generate images, both the generator 110 and the discriminator 115 are modified simultaneously and incrementally, where the training starts with easier low-resolution images and new layers that introduce higher-resolution details are added as training progresses. More specifically, the GAN 100 may initially be configured with a generator 110-A and a discriminator 115-A that each include only a few layers 120 and 122, respectively, to process example output data having a spatial resolution of 4×4 pixels and to generate output data having a spatial resolution of 4×4 pixels. A conversion layer 230 included in the generator 110-A projects feature vectors to the output data format. For example, when the generator 110 is trained for an image generation task, the conversion layer 230 projects feature vectors to RGB colors to produce the output image. A conversion layer 235 included in the discriminator 115-A projects the output data format to feature vectors. When the discriminator 115 is trained for an image generation task, the conversion layer 235 projects RGB colors to feature vectors. In an embodiment, the conversion layer 230 and the conversion layer 235 each perform a 1×1 convolution.
In an embodiment, the one or more layers 120 include two convolutional layers. For example, a 4×4 convolutional layer may be followed by a 3×3 convolutional layer. Similarly, the one or more layers 122 may include three convolutional layers and a downsampling layer. For example, a 1×1 convolutional layer may be followed by two 3×3 convolutional layers, where the one or more layers 122 process the output data produced by the generator 110-A along with the example output data. The resolution of the example output data included in the training data may be reduced to match the spatial resolution of the generated output images.
As training proceeds, the topology of the GAN 100 is modified by adding layers and increasing the resolution of the example images. Layers 121 and layers 124 are added to the generator 110-A and the discriminator 115-A, respectively, to produce a generator 110-B and a discriminator 115-B, and the spatial resolution of the generated output images is increased. For example, the spatial resolution of the example output images and of the generated output images may increase from 4×4 pixels to 8×8 pixels. In an embodiment, the layers 121 include an upsampling layer and two 3×3 convolutional layers. Similarly, the layers 124 may include two 3×3 convolutional layers and a downsampling layer.
As training proceeds further, the topology of the GAN 100 is modified by gradually adding more layers, ultimately including layers 123 and layers 126, to the generator 110-B and the discriminator 115-B, respectively, to produce a generator 110-C and a discriminator 115-C, and the spatial resolution of the generated output images is increased. For example, the spatial resolution of the example output images and of the generated output images may increase up to 1024×1024 pixels. In an embodiment, the layers 123 include an upsampling layer, two 3×3 convolutional layers, and a 1×1 convolutional layer. Similarly, the layers 126 may include a 3×3 convolutional layer, a 4×4 convolutional layer, and a fully-connected layer.
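To make the progressive growth of the generator concrete, the following PyTorch-style sketch grows a generator block by block from a 4×4 base resolution, with a 1×1 "toRGB" convolution playing the role of the conversion layer 230. The channel counts, block contents, and class names are illustrative assumptions and are not prescribed by the embodiments.

```python
import torch
import torch.nn as nn

def gen_block(in_ch, out_ch):
    """Upsample followed by two 3x3 convolutions, as one possible layer 121/123."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
    )

class ProgressiveGenerator(nn.Module):
    def __init__(self, latent_dim=512, base_ch=512):
        super().__init__()
        # 4x4 base: project the latent code to a 4x4 feature map (layers 120).
        self.base = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, base_ch, 4), nn.LeakyReLU(0.2),
            nn.Conv2d(base_ch, base_ch, 3, padding=1),  nn.LeakyReLU(0.2),
        )
        self.blocks = nn.ModuleList()            # grown during training
        self.to_rgb = nn.Conv2d(base_ch, 3, 1)   # conversion layer (1x1 convolution)
        self.ch = base_ch

    def grow(self, new_ch):
        """Modify the topology: append a higher-resolution block and replace the
        toRGB conversion layer so that it matches the new feature width."""
        self.blocks.append(gen_block(self.ch, new_ch))
        self.to_rgb = nn.Conv2d(new_ch, 3, 1)
        self.ch = new_ch

    def forward(self, z):
        x = self.base(z.view(z.shape[0], -1, 1, 1))
        for block in self.blocks:
            x = block(x)
        return self.to_rgb(x)
```

A matching discriminator would mirror this structure, starting with a 1×1 "fromRGB" conversion layer (conversion layer 235) followed by convolution and downsampling blocks.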
The parameters of all existing layers may be updated throughout the entire training process, which means that the weights of any one or more layers of the generator 110 and/or the discriminator 115 may be updated during training. A weight associated with each feature controls the contribution or influence of that feature on the output of the layer. The value of each weight is not set directly; instead, the weights are learned during training. The training loss unit 105 updates the weight values of the discriminator 115 to better distinguish the output data produced by the generator 110 from the example output data, and updates the weight values of the generator 110 to reduce the differences between the output data produced by the generator 110 and the example output data.
Training the GAN 100 starting from a low resolution provides stability for each topology of the GAN 100, even when the number of layers is increased to generate high-resolution output images. Compared with training that starts with all of the layers, gradually modifying the topology during training also reduces the training time. When the training dataset includes high-resolution example images, the example images included in the training dataset are modified (downsampled) before being input to the discriminator 115, until the generator 110 produces the final high-resolution images. Importantly, the GAN 100 includes a single generator 110 and a single discriminator 115. The final topology of the GAN 100 may be determined in advance, and one or more layers are added or removed as the GAN 100 is trained for each increment (increase or decrease) of the output data resolution.
Although the training of the GAN 100 is described in the context of image generation, the additional input data (in the conditional setting) may be data other than image data, and the generator 110 and the discriminator 115 may be trained using the incremental modification technique so that the generator 110 produces other types of data. Depending on the task, the output data may be images, audio, video, or other types of data (e.g., configuration settings). As the topology is modified, the resolution of the output data or the processing capacity of the GAN 100 is gradually modified.
Figure 1C illustrates a flowchart of a method 130 for modifying the topology of a GAN during training, in accordance with an embodiment. The method 130 may be performed by a program, custom circuitry, or a combination of custom circuitry and a program. For example, the method 130 may be executed by a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations of the generator neural network and the discriminator neural network. Furthermore, persons of ordinary skill in the art will understand that any system that performs the method 130 is within the scope and spirit of embodiments of the present invention.
At step 135, the GAN 100 is trained for a first duration, where the topology of the GAN 100 includes features in the generator 110 and interconnections between the features and features in the discriminator 115. In an embodiment, the GAN 100 processes three-dimensional image data. In an embodiment, the GAN 100 processes audio data. In an embodiment, the training data includes example output data, and during training the generator 110 processes input data to produce output data. In an embodiment, the example output data is modified to produce modified training data that is input to the discriminator 115 together with the output data. The training loss unit 105 receives the output of the discriminator 115 and produces updated parameters for the GAN 100.
In one embodiment, modifying the training data includes increasing or decreasing the density of the example output data. For example, the spatial resolution of the example output data may be reduced. In one embodiment, the training data includes additional (example) input data, such as images, classification labels, segmentation contours, and other types of data (distributions, audio, etc.), and the additional input data is paired with the example output data.
At step 140, the topology of the GAN 100 is modified to produce a modified GAN 100. In an embodiment, modifying the topology changes the processing capacity of the generator 110 and/or the discriminator 115. In an embodiment, the modifications to the GAN 100 and to the example training images are specific to the task. For example, in an embodiment, the topology is modified by adding one or more layers (e.g., fully-connected layers, convolutional layers, upsampling, pooling, normalization, compression, etc.) to the generator 110 and/or the discriminator 115, adding or removing features (e.g., feature maps, neurons, activations, etc.), adding or removing connections between features, and so on. For a different task, the topology may be modified by removing one or more layers from the generator 110 and/or the discriminator 115, adding or removing features, or adding or removing connections between features.
At step 145, the modified GAN 100 is trained for a second duration. In an embodiment, the modified training data for the first duration differs from the modified training data for the second duration. In an embodiment, the modified training data for the first duration is modified according to a first function, and the modified training data for the second duration is modified according to a second function that is different from the first function. In an embodiment, the training data is image data, and the pixel resolution of the training data is reduced by a greater amount during the first duration than during the second duration.
Figure 2A illustrates a block diagram of another GAN 200, in accordance with an embodiment. In addition to the generator 110, the discriminator 115, and the training loss unit 105, the GAN 200 also includes an example output data pre-processing unit 215. The example output data pre-processing unit 215 is configured to modify the example output data included with the training data according to the current topology of the GAN 200. The density of the example output data may be increased or decreased before it is input to the discriminator 115. When the example output data is an image, the spatial resolution of the image may be increased or decreased by upsampling or downsampling, respectively.
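As a sketch of what the example output data pre-processing unit 215 might do for image data, the helper below downsamples high-resolution training images to the resolution currently produced by the generator. Average pooling is used here as one reasonable downsampling filter; the function name and filter choice are assumptions for illustration.

```python
import torch.nn.functional as F

def preprocess_real_images(real_images, target_resolution):
    """Reduce the spatial resolution of example output images so that they match
    the resolution of the images currently produced by the generator."""
    current = real_images.shape[-1]
    factor = current // target_resolution
    if factor > 1:
        real_images = F.avg_pool2d(real_images, kernel_size=factor)
    return real_images
```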
In an embodiment, the modification of the topology is introduced smoothly by interpolating between the old and new topologies and between the old and new training data. The generator 110 and the discriminator 115 may be configured to modify the topology smoothly, and the example output data pre-processing unit 215 may be configured to transition smoothly from first modified example output data to second modified example output data. The smooth transition may reduce the sudden shock to a GAN 200 whose first topology is already well trained. For example, the first topology with lower-capacity layers may already be well trained, so a second topology with additional layers is introduced gradually during training while the training data is correspondingly modified gradually for the second topology.
Figure 2B illustrates a technique for smoothly modifying the topology of the GAN 200, in accordance with an embodiment. A generator 110-D and a discriminator 115-D are configured using a first topology. In an embodiment, for the first topology, the modified example output data is image data with a resolution of 16×16 pixels. The modified example output data may be downsampled from higher-resolution example output data. The generator 110-D includes one or more layers 220 and a conversion layer 230 that projects feature vectors to the output data format. The discriminator 115-D includes a conversion layer 235 and one or more layers 222, where the conversion layer 235 projects the output data format to feature vectors.
The generator 110-D and the discriminator 115-D are trained for a first duration before transitioning to a second topology. In an embodiment, the processing capacities of the generator 110-D and the discriminator 115-D are doubled when transitioning from the first topology to the second topology. The transition is shown in Figure 2B as a generator 110-E and a discriminator 115-E, and the second topology is shown in Figure 2B as a generator 110-F and a discriminator 115-F. In an embodiment, for the second topology, the modified example output data is image data with a resolution of 32×32 pixels. The density (e.g., spatial resolution) of the intermediate data output by the one or more layers 220 is doubled and input to one or more layers 221 and to a second conversion layer 230B. In an embodiment, the intermediate data is doubled using nearest-neighbor filtering.
The higher-resolution output data produced by the one or more layers 221 corresponding to the second topology is input to the conversion layer 230B. During the transition from the first topology to the second topology, the layers that operate on the higher-density data (e.g., the one or more layers 221) are treated as a residual block that produces intermediate data scaled by a weight α that increases linearly from 0 to 1 over the second duration. As shown in Figure 2B, the higher-density intermediate data is scaled by α, and the intermediate data that is simply doubled in density and corresponds to the first topology is scaled by 1-α. The scaled intermediate data are summed and input to the discriminator 115-E.
The discriminator 115-E includes the conversion layer 235 and a second conversion layer 235B, each of which projects the data received from the generator 110-E to feature vectors. Before reaching the second conversion layer 235B, the density (i.e., spatial resolution) of the data corresponding to the first topology is halved. In an embodiment, average pooling is used to halve the data. One or more layers 223 process the feature vectors to produce processed data corresponding to the second topology. The processed data produced by the one or more layers 223 is halved to produce output data corresponding to the second topology. In an embodiment, the processed data is 32×32-density data that is halved to produce 16×16-density data.
As shown in Figure 2B, the output data corresponding to the second topology is scaled by α, and the output data that is simply halved in density and corresponds to the first topology is scaled by 1-α. The scaled output data are summed and input to the one or more layers 222. In an embodiment, the modified example output data corresponding to the first topology is image data with a resolution of 16×16 pixels. During the topology transition, the example output data pre-processing unit 215 uses α to interpolate between the 16×16-pixel image data and the 32×32-pixel image data corresponding to the second topology, similar to how the generator 110-E and the discriminator 115-E blend the two topologies. When α reaches 1, the transition to the second topology (i.e., the generator 110-F and the discriminator 115-F shown in Figure 2B) is complete.
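The α-weighted blend described above can be sketched as follows for the generator side, in PyTorch-style Python. Here `old_to_rgb` corresponds to conversion layer 230, `new_convs` and `new_to_rgb` to the layers 221 and conversion layer 230B; the same interpolation (with downsampling instead of upsampling) applies on the discriminator side and to the pre-processed training images. The names and the nearest-neighbor/average-pooling choices are assumptions.

```python
import torch.nn.functional as F

def generator_fade_forward(features_16, old_to_rgb, new_convs, new_to_rgb, alpha):
    """Blend the old output path with the new higher-resolution residual path.
    alpha ramps linearly from 0 to 1 over the second training duration."""
    up = F.interpolate(features_16, scale_factor=2, mode="nearest")  # density doubled
    old_rgb = old_to_rgb(up)               # first-topology path, weight (1 - alpha)
    new_rgb = new_to_rgb(new_convs(up))    # layers 221 followed by conversion layer 230B
    return (1.0 - alpha) * old_rgb + alpha * new_rgb

def blend_real_images(real_32, alpha):
    """Interpolate the example output data the same way: a 32x32 image whose
    high-frequency content is faded in as alpha grows."""
    real_16_up = F.interpolate(F.avg_pool2d(real_32, 2), scale_factor=2, mode="nearest")
    return (1.0 - alpha) * real_16_up + alpha * real_32
```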
Although the topology modification is described in the context of image generation, topology modifications may also be performed for other tasks. For example, a modification that removes layers can be performed by reversing the order of the modifications in Figure 2B (which proceed from top to bottom as the training time increases). In one embodiment, the training data includes additional input data paired with the example output data, such as images, classification labels, segmentation contours, and other types of data (distributions, audio, etc.), and the additional input data is interpolated to transition smoothly when the topology is modified.
Figure 2C illustrates a flowchart of a method 250 for transitioning smoothly between GAN topologies, in accordance with an embodiment. The method 250 may be performed by a program, custom circuitry, or a combination of custom circuitry and a program. For example, the method 250 may be executed by a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations of the generator neural network and the discriminator neural network. Furthermore, persons of ordinary skill in the art will understand that any system that performs the method 250 is within the scope and spirit of embodiments of the present invention.
At step 135, the GAN 100 is trained for a first duration using a first topology that includes the generator 110-D and the discriminator 115-D. At step 255, the topology of the GAN 100 is modified to produce a modified GAN 100 that includes the generator 110-E and the discriminator 115-E. At step 260, the training data is modified to correspond to the output data density of the second topology. For example, the training data may be downsampled. At step 265, as the GAN 100 is trained for a second duration, the GAN 100 is configured to interpolate between the first topology and the second topology. At step 270, as the GAN 100 is trained for the second duration, interpolation is performed between the training data of the first topology and the modified training data of the second topology. At step 275, as the GAN 100 is trained for the second duration, the weights of the generator 110-E and the discriminator 115-E are updated based on a loss function computed using the output of the discriminator 115-E.
At step 280, the training loss unit 105 determines whether an accuracy level has been reached and, if not, training continues. The accuracy level may be a predetermined threshold (i.e., a criterion). In an embodiment, as the accuracy increases, the value of α controlling the smooth transition from the first topology to the second topology may also increase. Alternatively, a predetermined amount of training data may be used to train the GAN 100 for each increment of α.
When the accuracy level is reached at step 280, then at step 285 it is determined whether the GAN 100 matches the final topology. If so, training is complete. Otherwise, steps 255, 260, 265, 270, 275, and 280 are repeated, and the GAN 100 is modified to transition to another topology and training continues. For example, in an embodiment, the topology is further modified by adding one or more layers to the generator 110 and/or the discriminator 115, adding or removing features, adding or removing connections between features, and so on. For a different task, the topology may be modified by removing one or more layers from the generator 110 and/or the discriminator 115, adding or removing features, or adding or removing connections between features.
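Putting the steps of method 250 together, the following PyTorch-style skeleton sketches one possible outer training loop: train at the current topology, grow, ramp α over the second duration while interpolating both the network paths and the training data, and repeat until the final topology is reached. The schedule constants, the helper names (`adversarial_update`, `preprocess_real_images`, `blend_real_images`, a `gan` wrapper exposing `grow` and `set_alpha`), and the use of image counts in place of an explicit accuracy check are assumptions for illustration only.

```python
def progressive_training(gan, data_loader, resolutions=(4, 8, 16, 32, 64),
                         images_per_phase=600_000):
    """Outer loop of progressive topology modification (a sketch of method 250)."""
    for i, res in enumerate(resolutions):
        if i > 0:
            gan.grow(res)                          # step 255: modify the topology
        seen = 0
        for real in data_loader:
            # Steps 260/270: modify and interpolate the example output data.
            real = preprocess_real_images(real, res)
            if i > 0 and seen < images_per_phase:  # fade-in (second duration)
                alpha = seen / images_per_phase    # step 265: ramp alpha from 0 to 1
                gan.set_alpha(alpha)
                real = blend_real_images(real, alpha)
            else:
                gan.set_alpha(1.0)                 # stabilization phase
            adversarial_update(gan, real)          # step 275: update the weights
            seen += real.shape[0]
            if seen >= 2 * images_per_phase:       # stand-in for the step 280 check
                break
    return gan                                     # step 285: final topology reached
```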
Figure 2D illustrates example images generated by a GAN 100 that was trained to generate images of bicycles, in accordance with an embodiment. The generator 110 produces each image in response to receiving a latent code as input. In an embodiment, the discriminator 115 is used to train the GAN 100 and is not used to generate images once training is complete. During training, both the generator 110 and the discriminator 115 are gradually modified, transitioning from one topology to another. The example training data is modified during training, starting with low-resolution images and increasing the resolution of the images as new layers that process higher-resolution details are added. Modifying the GAN 100 during training greatly stabilizes the training and enables the GAN 100 to produce images of unprecedented quality compared with conventional techniques.
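After training, only the generator is needed to produce new images; a minimal, assumed usage sketch (with `trained_generator` standing in for the fully grown generator 110):

```python
import torch

# Draw random latent codes and generate new images with the trained generator.
with torch.no_grad():
    z = torch.randn(8, 512)          # eight latent codes from a unit Gaussian
    images = trained_generator(z)    # e.g., eight generated images of bicycles
```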
Gradually modifying the topology of the GAN 100 provides two key benefits: the GAN 100 converges to a considerably better optimum, and the total training time is reduced by approximately a factor of two. The improved convergence is explained by an implicit form of curriculum learning imposed by the gradually increasing capacity of the generator 110 and the discriminator 115. Without the incremental modification, all layers of the generator 110 and the discriminator 115 are tasked with simultaneously finding succinct intermediate representations for both the large-scale variation and the small-scale detail. With gradual modification, however, the existing low-density layers are likely to have converged already, so the task of the generator 110 and the discriminator 115 is only to refine the representations with effects of smaller and smaller scale as new layers are introduced. As for the training time, the gradual modification gains a significant head start because the generator 110 and the discriminator 115 are shallow and quick to evaluate at the beginning.
Parallel processing architecture
Figure 3 illustrates a parallel processing unit (PPU) 300, in accordance with an embodiment. In an embodiment, the PPU 300 is a multi-threaded processor implemented on one or more integrated circuit devices. The PPU 300 is a latency-hiding architecture designed to process many threads in parallel. A thread (i.e., a thread of execution) is an instantiation of a set of instructions configured to be executed by the PPU 300. In an embodiment, the PPU 300 is a graphics processing unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device such as a liquid crystal display (LCD) device. In other embodiments, the PPU 300 may be utilized for performing general-purpose computations. While one exemplary parallel processor is provided herein for illustrative purposes, it should be strongly noted that this processor is set forth for illustrative purposes only and that any processor may be employed to supplement and/or substitute for it.
One or more PPUs 300 may be configured to accelerate thousands of high-performance computing (HPC), data center, and machine learning applications. The PPU 300 may be configured to accelerate numerous deep learning systems and applications, including autonomous vehicle platforms, deep learning, high-accuracy speech, image, and text recognition systems, intelligent video analytics, molecular simulation, drug discovery, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, personalized user recommendations, and the like.
As shown in Figure 3, the PPU 300 includes an input/output (I/O) unit 305, a front end unit 315, a scheduler unit 320, a work distribution unit 325, a hub 330, a crossbar (Xbar) 370, one or more general processing clusters (GPCs) 350, and one or more partition units 380. The PPU 300 may be connected to a host processor or other PPUs 300 via one or more high-speed NVLink 310 interconnects. The PPU 300 may be connected to a host processor or other peripheral devices via an interconnect 302. The PPU 300 may also be connected to a local memory comprising a number of memory devices 304. In an embodiment, the local memory may comprise a number of dynamic random access memory (DRAM) devices. The DRAM devices may be configured as a high-bandwidth memory (HBM) subsystem, with multiple DRAM dies stacked within each device.
The NVLink 310 interconnect enables systems to scale and to include one or more PPUs 300 combined with one or more CPUs, supports cache coherence between the PPUs 300 and the CPUs, and supports CPU mastering. Data and/or commands may be transmitted by the NVLink 310 through the hub 330 to or from other units of the PPU 300, such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). The NVLink 310 is described in more detail in conjunction with Figure 5B.
The I/O unit 305 is configured to transmit and receive communications (i.e., commands, data, etc.) from a host processor (not shown) over the interconnect 302. The I/O unit 305 may communicate with the host processor directly via the interconnect 302 or through one or more intermediate devices, such as a memory bridge. In an embodiment, the I/O unit 305 may communicate with one or more other processors, such as one or more PPUs 300, via the interconnect 302. In an embodiment, the I/O unit 305 implements a Peripheral Component Interconnect Express (PCIe) interface for communications over a PCIe bus, and the interconnect 302 is a PCIe bus. In alternative embodiments, the I/O unit 305 may implement other types of well-known interfaces for communicating with external devices.
The I/O unit 305 decodes packets received via the interconnect 302. In an embodiment, the packets represent commands configured to cause the PPU 300 to perform various operations. The I/O unit 305 transmits the decoded commands to various other units of the PPU 300 as the commands may specify. For example, some commands may be transmitted to the front end unit 315. Other commands may be transmitted to the hub 330 or other units of the PPU 300, such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). In other words, the I/O unit 305 is configured to route communications between and among the various logical units of the PPU 300.
In an embodiment, a program executed by the host processor encodes a command stream in a buffer that provides workloads to the PPU 300 for processing. A workload may comprise several instructions and data to be processed by those instructions. The buffer is a region in memory that is accessible (i.e., read/write) by both the host processor and the PPU 300. For example, the I/O unit 305 may be configured to access the buffer in a system memory connected to the interconnect 302 via memory requests transmitted over the interconnect 302. In an embodiment, the host processor writes the command stream to the buffer and then transmits a pointer to the start of the command stream to the PPU 300. The front end unit 315 receives pointers to one or more command streams. The front end unit 315 manages the one or more streams, reading commands from the streams and forwarding commands to the various units of the PPU 300.
The front end unit 315 is coupled to a scheduler unit 320 that configures the various GPCs 350 to process tasks defined by the one or more streams. The scheduler unit 320 is configured to track state information related to the various tasks managed by the scheduler unit 320. The state may indicate which GPC 350 a task is assigned to, whether the task is active or inactive, a priority level associated with the task, and so forth. The scheduler unit 320 manages the execution of a plurality of tasks on the one or more GPCs 350.
The scheduler unit 320 is coupled to a work distribution unit 325 that is configured to dispatch tasks for execution on the GPCs 350. The work distribution unit 325 may track a number of scheduled tasks received from the scheduler unit 320. In an embodiment, the work distribution unit 325 manages a pending task pool and an active task pool for each of the GPCs 350. The pending task pool may comprise a number of slots (e.g., 32 slots) that contain tasks assigned to be processed by a particular GPC 350. The active task pool may comprise a number of slots (e.g., 4 slots) for tasks that are actively being processed by the GPCs 350. As a GPC 350 finishes the execution of a task, that task is evicted from the active task pool for the GPC 350, and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 350. If an active task has been idle on the GPC 350, such as while waiting for a data dependency to be resolved, then the active task may be evicted from the GPC 350 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 350.
The work distribution unit 325 communicates with the one or more GPCs 350 via the XBar (crossbar) 370. The XBar 370 is an interconnect network that couples many of the units of the PPU 300 to other units of the PPU 300. For example, the XBar 370 may be configured to couple the work distribution unit 325 to a particular GPC 350. Although not shown explicitly, one or more other units of the PPU 300 may also be connected to the XBar 370 via the hub 330.
Tasks are managed by the scheduler unit 320 and dispatched to a GPC 350 by the work distribution unit 325. The GPC 350 is configured to process the task and generate results. The results may be consumed by other tasks within the GPC 350, routed to a different GPC 350 via the XBar 370, or stored in the memory 304. The results can be written to the memory 304 via the partition units 380, which implement a memory interface for reading data from, and writing data to, the memory 304. The results can be transmitted to another PPU 300 or a CPU via the NVLink 310. In an embodiment, the PPU 300 includes a number U of partition units 380 that is equal to the number of separate and distinct memory devices 304 coupled to the PPU 300. A partition unit 380 is described in more detail below in conjunction with Figure 4B.
In an embodiment, a host processor executes a driver kernel that implements an application programming interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 300. In an embodiment, multiple compute applications are simultaneously executed by the PPU 300, and the PPU 300 provides isolation, quality of service (QoS), and independent address spaces for the multiple compute applications. An application may generate instructions (i.e., API calls) that cause the driver kernel to generate one or more tasks for execution by the PPU 300. The driver kernel outputs tasks to one or more streams being processed by the PPU 300. Each task may comprise one or more groups of related threads, referred to herein as a warp. In an embodiment, a warp comprises 32 related threads that may be executed in parallel. Cooperating threads may refer to a plurality of threads that include instructions to perform the task and that may exchange data through shared memory. Threads and cooperating threads are described in more detail in conjunction with Figure 5A.
Figure 4A illustrates a GPC 350 of the PPU 300 of Figure 3, in accordance with an embodiment. As shown in Figure 4A, each GPC 350 includes a number of hardware units for processing tasks. In an embodiment, each GPC 350 includes a pipeline manager 410, a pre-raster operations unit (PROP) 415, a raster engine 425, a work distribution crossbar (WDX) 480, a memory management unit (MMU) 490, and one or more data processing clusters (DPCs) 420. It will be appreciated that the GPC 350 of Figure 4A may include other hardware units in lieu of, or in addition to, the units shown in Figure 4A.
In an embodiment, the operation of the GPC 350 is controlled by the pipeline manager 410. The pipeline manager 410 manages the configuration of the one or more DPCs 420 for processing tasks allocated to the GPC 350. In an embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement at least a portion of a graphics rendering pipeline. For example, a DPC 420 may be configured to execute a vertex shader program on the programmable streaming multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route packets received from the work distribution unit 325 to the appropriate logical units within the GPC 350. For example, some packets may be routed to fixed-function hardware units in the PROP 415 and/or the raster engine 425, while other packets may be routed to the DPCs 420 for processing by the primitive engine 435 or the SM 440. In an embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a compute pipeline.
The PROP unit 415 is configured to route data generated by the raster engine 425 and the DPCs 420 to a raster operations (ROP) unit, described in more detail in conjunction with Figure 4B. The PROP unit 415 may also be configured to perform optimizations for color blending, organize pixel data, perform address translations, and the like.
The raster engine 425 includes a number of fixed-function hardware units configured to perform various raster operations. In an embodiment, the raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile coalescing engine. The setup engine receives transformed vertices and generates plane equations associated with the geometric primitive defined by the vertices. The plane equations are transmitted to the coarse raster engine to generate coverage information (e.g., an x,y coverage mask for a tile) for the primitive. The output of the coarse raster engine is transmitted to the culling engine, where fragments associated with the primitive that fail a z-test are culled, and to the clipping engine, where fragments lying outside a viewing frustum are clipped. The fragments that survive clipping and culling may be passed to the fine raster engine to generate attributes for the pixel fragments based on the plane equations generated by the setup engine. The output of the raster engine 425 comprises fragments to be processed, for example, by a fragment shader implemented within a DPC 420.
Each DPC 420 included in the GPC 350 includes an M-pipe controller (MPC) 430, a primitive engine 435, and one or more SMs 440. The MPC 430 controls the operation of the DPC 420, routing packets received from the pipeline manager 410 to the appropriate units in the DPC 420. For example, packets associated with a vertex may be routed to the primitive engine 435, which is configured to fetch vertex attributes associated with the vertex from the memory 304. In contrast, packets associated with a shader program may be transmitted to the SM 440.
The SM 440 comprises a programmable streaming processor that is configured to process tasks represented by a number of threads. Each SM 440 is multi-threaded and configured to execute a plurality of threads (e.g., 32 threads) from a particular group of threads concurrently. In an embodiment, the SM 440 implements a SIMD (Single-Instruction, Multiple-Data) architecture, where each thread in a group of threads (i.e., a warp) is configured to process a different set of data based on the same set of instructions. All threads in the group of threads execute the same instructions. In another embodiment, the SM 440 implements a SIMT (Single-Instruction, Multiple-Thread) architecture, where each thread in a group of threads is configured to process a different set of data based on the same set of instructions, but where individual threads in the group of threads are allowed to diverge during execution. In an embodiment, a program counter, call stack, and execution state are maintained for each warp, enabling concurrency between warps and serial execution within warps when threads within a warp diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, enabling equal concurrency between all threads, within and between warps. When execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. The SM 440 is described in more detail below in conjunction with Figure 5A.
The MMU 490 provides an interface between the GPC 350 and the partition unit 380. The MMU 490 may provide translation of virtual addresses into physical addresses, memory protection, and arbitration of memory requests. In an embodiment, the MMU 490 provides one or more translation lookaside buffers (TLBs) for performing the translation of virtual addresses into physical addresses in the memory 304.
Fig. 4 B shows the memory partition unit 380 of the PPU 300 according to Fig. 3 of one embodiment.As shown in Figure 4 B, Memory partition unit 380 includes raster manipulation (ROP) unit 450, second level (L2) cache 460 and memory interface 470. Memory interface 470 is coupled to memory 304.Memory interface 470 may be implemented 32 for high speed data transfer, 64, 128,1024 bit data bus etc..In one embodiment, PPU 300 incorporates U memory interface 470, each pair of subregion list Member 380 has a memory interface 470, wherein each pair of zoning unit 380 is connected to corresponding memory devices 304.For example, PPU 300 may be coupled to up to Y memory devices 304, and such as high bandwidth memory stacks or figure double data rate version This 5 Synchronous Dynamic Random Access Memory or other kinds of long-time memory.
In one embodiment, memory interface 470 realizes HBM2 memory interface, and Y is equal to the half of U.One In a embodiment, HBM2 memory stacking is located in physical package identical with PPU 300, provides and routine GDDR5SDRAM system System is compared to significant power height and area savings.In one embodiment, each HBM2 is stacked including four memory bare crystallines simultaneously And Y is equal to 4, it includes two 128 bit ports of each bare crystalline that wherein HBM2, which is stacked, in total 8 channels and 1024 data/address bus Width.
In one embodiment, memory 304 supports the double false retrievals of SEC code to survey (SECDED) error correcting code (ECC) to protect Data.For the computer applied algorithm sensitive to data corruption, ECC provides higher reliability.It is calculated in large construction cluster In environment, reliability is even more important, and wherein PPU300 handles very big data set and/or long-play application program.
In one embodiment, PPU 300 realizes multi-level store layered structure.In one embodiment, memory point Area's unit 380 supports Unified Memory to provide single unified virtual address space for CPU and 300 memory of PPU, enables Data sharing between virtual memory system.In one embodiment, the storage by PPU 300 to being located on other processors The access frequency of device is tracked, and is deposited with ensuring that locked memory pages are moved to the physics of the PPU 300 of more frequently accession page Reservoir.In one embodiment, NVLink 310 supports Address Translation services, and PPU 300 is allowed directly to access the page table of CPU And provide the complete access by PPU 300 to CPU memory.
In one embodiment, replication engine transmits data between multiple PPU 300 or between PPU 300 and CPU. Replication engine can be the address generation page fault for being not mapped to page table.Then, memory partition unit 380 can be with service page Face mistake, maps the address into page table, and replication engine can execute transmission later.In the conventional system, for multiple processing Multiple replication engines operation fixed memory (that is, can not paging) between device, it significantly reduces available memories.Due to hard Part page fault, address can not have to worry whether locked memory pages are resident for delivery to replication engine, and reproduction process is It is no transparent.
Data from the memory 304 or other system memory may be fetched by the memory partition unit 380 and stored in the L2 cache 460, which is located on-chip and is shared among the various GPCs 350. As shown, each memory partition unit 380 includes a portion of the L2 cache 460 associated with a corresponding memory device 304. Lower-level caches may then be implemented in various units within the GPCs 350. For example, each of the SMs 440 may implement a level one (L1) cache. The L1 cache is private memory that is dedicated to a particular SM 440. Data from the L2 cache 460 may be fetched and stored in each of the L1 caches for processing in the functional units of the SMs 440. The L2 cache 460 is coupled to the memory interface 470 and the XBar 370.
The ROP unit 450 performs graphics raster operations related to pixel color, such as color compression, pixel blending, and the like. The ROP unit 450 also implements depth testing in conjunction with the raster engine 425, receiving a depth for a sample location associated with a pixel fragment from the culling engine of the raster engine 425. The depth is tested against a corresponding depth in a depth buffer for the sample location associated with the fragment. If the fragment passes the depth test for the sample location, then the ROP unit 450 updates the depth buffer and transmits the result of the depth test to the raster engine 425. It will be appreciated that the number of partition units 380 may be different than the number of GPCs 350 and, therefore, each ROP unit 450 may be coupled to each of the GPCs 350. The ROP unit 450 tracks packets received from the different GPCs 350 and determines which GPC 350 a result generated by the ROP unit 450 is routed to through the Xbar 370. Although the ROP unit 450 is included within the memory partition unit 380 in Fig. 4B, in other embodiments the ROP unit 450 may be outside of the memory partition unit 380. For example, the ROP unit 450 may reside in the GPC 350 or another unit.
Fig. 5 A shows the Steaming Multiprocessors 440 of Fig. 4 A according to one embodiment.As shown in Figure 5A, SM 440 includes Instruction cache 505, one or more dispatcher units 510, register file 520, one or more processing cores 550, one or more special function units (SFU) 552, one or more load/store units (LSU) 554, Internet Network 580, shared memory/L1 cache 570.
As described above, 325 scheduler task of Work distribution unit on the GPC 350 of PPU 300 to execute.Task is assigned To the specific DPC 420 in GPC 350, and if task is associated with coloration program, which can be assigned to SM 440.Dispatcher unit 510 receive the task from Work distribution unit 325 and management be assigned to one of SM 440 or The instruction of more thread blocks is dispatched.510 scheduling thread block of dispatcher unit using the thread Shu Zhihang as parallel thread, wherein Per thread block is assigned at least one thread beam.In one embodiment, 32 threads of per thread Shu Zhihang.Scheduler list Member 510 can manage multiple and different thread blocks, thread beam be distributed to different thread blocks, then in each phase clock cycle Between by the instruction dispatch from multiple and different cooperative groups to each function unit (that is, core 550, SFU 552 and LSU 554)。
Cooperative groups are the programming models for organizing communication sets of threads, allow developer to express thread and are communicating Used granularity makes it possible to express richer, more efficient parallel decomposition.Cooperation starting API is supported between thread block Synchronism, to execute parallel algorithm.Conventional programming model provides single simple structure: cross-thread for synchronous collaboration thread The fence (barrier) (that is, syncthreads () function) of all threads of block.However, it is generally desirable to be less than by programmer The size definition sets of threads of thread block granularity, and it is synchronous in defined group, with complete group of function interface of collective The form of (collective group-wide function interface) enables higher performance, design flexibility and soft Part reuses.
Cooperative groups enable a programmer to explicitly define line at sub-block (that is, small as single thread) and muti-piece granularity Journey group and group performance is executed, the synchronism on thread in such as cooperative groups.Programming model is supported dry across software boundary Net combination, so as to library and utility function can in home environment it is safely synchronous, without assuming convergence.Cooperative groups Pel enables the parallel new model of affiliate, including Producer-consumer problem is parallel, opportunism is parallel and across entire thread The global synchronization of block grid.
A dispatch unit 515 is configured to transmit instructions to one or more of the functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same warp to be dispatched during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.
Each SM 440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, the register file 520 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 520. In another embodiment, the register file 520 is divided between the different warps being executed by the SM 440. The register file 520 provides temporary storage for operands connected to the data paths of the functional units.
Each SM 440 comprises L processing cores 550. In one embodiment, the SM 440 includes a large number (e.g., 128, etc.) of distinct processing cores 550. Each core 550 may include a fully pipelined, single-precision, double-precision, and/or mixed-precision processing unit that includes a floating point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating point arithmetic logic units implement the IEEE 754-2008 standard for floating point arithmetic. In one embodiment, the cores 550 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores.
Tensor cores are configured to perform matrix operations and, in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In one embodiment, each tensor core operates on a 4x4 matrix and performs a matrix multiply and accumulate operation D = A x B + C, where A, B, C, and D are 4x4 matrices.
In one embodiment, the matrix multiply inputs A and B are 16-bit floating point matrices, while the accumulation matrices C and D may be 16-bit floating point or 32-bit floating point matrices. Tensor cores operate on 16-bit floating point input data with 32-bit floating point accumulation. The 16-bit floating point multiply requires 64 operations and results in a full precision product that is then accumulated, using 32-bit floating point addition, with the other intermediate products for a 4x4x4 matrix multiply. In practice, tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. An API, such as the CUDA 9 C++ API, exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16x16 size matrices spanning all 32 threads of the warp.
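The arithmetic of the multiply-and-accumulate operation described above can be illustrated with a minimal NumPy sketch; it models only the D = A x B + C computation with 16-bit inputs and 32-bit accumulation, not the tensor core hardware or the CUDA warp-level API, and the random test matrices are illustrative assumptions:

# Minimal NumPy sketch of D = A x B + C with fp16 multiply inputs and
# fp32 accumulation, as described above. Illustration only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)   # fp16 multiply input
B = rng.standard_normal((4, 4)).astype(np.float16)   # fp16 multiply input
C = rng.standard_normal((4, 4)).astype(np.float32)   # fp32 accumulator

# Each fp16 product fits exactly in fp32; accumulation is performed in fp32.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype, D.shape)  # prints: float32 (4, 4)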
Each SM 440 also comprises M SFUs 552 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In one embodiment, the SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs 552 may include a texture unit configured to perform texture map filtering operations. In one embodiment, the texture units are configured to load texture maps (e.g., a 2D array of texels) from the memory 304 and sample the texture maps to produce sampled texture values for use in shader programs executed by the SM 440. In one embodiment, the texture maps are stored in the shared memory/L1 cache 470. The texture units implement texture operations such as filtering operations using mip-maps (i.e., texture maps of varying levels of detail). In one embodiment, each SM 440 includes two texture units.
Each SM 440 also comprises N LSUs 554 that implement load and store operations between the shared memory/L1 cache 570 and the register file 520. Each SM 440 includes an interconnect network 580 that connects each of the functional units to the register file 520 and connects the LSUs 554 to the register file 520 and the shared memory/L1 cache 570. In one embodiment, the interconnect network 580 is a crossbar that can be configured to connect any of the functional units to any of the registers in the register file 520 and to connect the LSUs 554 to the register file and to memory locations in the shared memory/L1 cache 570.
The shared memory/L1 cache 570 is an array of on-chip memory that allows for data storage and communication between the SM 440 and the primitive engine 435 and between threads in the SM 440. In one embodiment, the shared memory/L1 cache 570 comprises 128 KB of storage capacity and is in the path from the SM 440 to the partition unit 380. The shared memory/L1 cache 570 can be used to cache reads and writes. One or more of the shared memory/L1 cache 570, the L2 cache 460, and the memory 304 are backing stores.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
When configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. Specifically, the fixed-function graphics processing units shown in Fig. 3 are bypassed, creating a much simpler programming model. In the general purpose parallel computation configuration, the work distribution unit 325 assigns and distributes blocks of threads directly to the DPCs 420. The threads in a block execute the same program, using a unique thread ID in the calculation to ensure each thread generates unique results, using the SM 440 to execute the program and perform calculations, using the shared memory/L1 cache 570 to communicate between threads, and using the LSU 554 to read and write global memory through the shared memory/L1 cache 570 and the memory partition unit 380. When configured for general purpose parallel computation, the SM 440 can also write commands that the scheduler unit 320 can use to launch new work on the DPCs 420.
The PPU 300 may be included in a desktop computer, a laptop computer, a tablet computer, servers, supercomputers, a smart phone (e.g., a wireless, hand-held device), a personal digital assistant (PDA), a digital camera, a vehicle, a head-mounted display, a hand-held electronic device, and the like. In one embodiment, the PPU 300 is embodied on a single semiconductor substrate. In another embodiment, the PPU 300 is included in a system-on-a-chip (SoC) along with one or more other devices such as additional PPUs 300, the memory 204, a reduced instruction set computer (RISC) CPU, a memory management unit (MMU), a digital-to-analog converter (DAC), and the like.
In one embodiment, the PPU 300 may be included on a graphics card that includes one or more memory devices 304. The graphics card may be configured to interface with a PCIe slot on a motherboard of a desktop computer. In yet another embodiment, the PPU 300 may be an integrated graphics processing unit (iGPU) or a parallel processor included in the chipset of the motherboard.
Exemplary computing system
Systems with multiple GPUs and CPUs are used in a variety of industries as developers expose and leverage more parallelism in applications such as artificial intelligence computing. High-performance GPU-accelerated systems with tens to many thousands of compute nodes are deployed in data centers, research facilities, and supercomputers to solve ever larger problems. As the number of processing devices within the high-performance systems increases, the communication and data transfer mechanisms need to scale to support the increased bandwidth.
Fig. 5B is a conceptual diagram of a processing system 500 implemented using the PPU 300 of Fig. 3, in accordance with one embodiment. The exemplary system 565 may be configured to implement the method 130 shown in Fig. 1C and/or the method 250 shown in Fig. 2C. The processing system 500 includes a CPU 530, a switch 510, and multiple PPUs 300, each with a respective memory 304. The NVLink 310 provides high-speed communication links between each of the PPUs 300. Although a particular number of NVLink 310 and interconnect 302 connections are illustrated in Fig. 5B, the number of connections to each PPU 300 and the CPU 530 may vary. The switch 510 interfaces between the interconnect 302 and the CPU 530. The PPUs 300, memories 304, and NVLinks 310 may be situated on a single semiconductor platform to form a parallel processing module 525. In one embodiment, the switch 510 supports two or more protocols to interface between various different connections and/or links.
In another embodiment (not shown), the NVLink 310 provides one or more high-speed communication links between each of the PPUs 300 and the CPU 530, and the switch 510 interfaces between the interconnect 302 and each of the PPUs 300. The PPUs 300, memories 304, and interconnect 302 may be situated on a single semiconductor platform to form a parallel processing module 525. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links between each of the PPUs 300 and the CPU 530, and the switch 510 interfaces between each of the PPUs 300 using the NVLink 310 to provide one or more high-speed communication links between the PPUs 300. In another embodiment (not shown), the NVLink 310 provides one or more high-speed communication links between the PPUs 300 and the CPU 530 through the switch 510. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links directly between each of the PPUs 300. One or more of the NVLink 310 high-speed communication links may be implemented as a physical NVLink interconnect or either an on-chip or on-die interconnect using the same protocol as the NVLink 310.
In the context of the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity that simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPUs 300 and/or memories 304 may be packaged devices. In one embodiment, the CPU 530, the switch 510, and the parallel processing module 525 are situated on a single semiconductor platform.
In one embodiment, the signaling rate of each NVLink 310 is 20 to 25 Gigabits/second, and each PPU 300 includes six NVLink 310 interfaces (as shown in Fig. 5B, five NVLink 310 interfaces are included for each PPU 300). Each NVLink 310 provides a data transfer rate of 25 Gigabits/second in each direction, with six links providing 300 Gigabits/second. The NVLinks 310 can be used exclusively for PPU-to-PPU communication, as shown in Fig. 5B, or for some combination of PPU-to-PPU and PPU-to-CPU communication when the CPU 530 also includes one or more NVLink 310 interfaces.
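The aggregate figure quoted above follows from the per-link, per-direction rate; the following minimal sketch of that arithmetic simply assumes the stated six links and 25 Gigabits/second per direction:

# Hypothetical arithmetic for the aggregate NVLink bandwidth quoted above.
LINKS = 6
RATE_PER_DIRECTION = 25   # Gigabits/second per link, per direction
DIRECTIONS = 2

aggregate = LINKS * RATE_PER_DIRECTION * DIRECTIONS
print(aggregate)  # prints: 300 (Gigabits/second total, both directions combined)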
In one embodiment, the NVLink 310 allows direct load/store/atomic access from the CPU 530 to each PPU's 300 memory 304. In one embodiment, the NVLink 310 supports coherency operations, allowing data read from the memories 304 to be stored in the cache hierarchy of the CPU 530, reducing cache access latency for the CPU 530. In one embodiment, the NVLink 310 includes support for Address Translation Services (ATS), allowing the PPU 300 to directly access page tables within the CPU 530. One or more of the NVLinks 310 may also be configured to operate in a low-power mode.
Fig. 5C illustrates an exemplary system 565 in which the various architecture and/or functionality of the various previous embodiments may be implemented. The exemplary system 565 may be configured to implement the method 130 shown in Fig. 1C and/or the steps 250 of the method shown in Fig. 2C.
As shown, a system 565 is provided including at least one central processing unit 530 that is connected to a communication bus 575. The communication bus 575 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The system 565 also includes a main memory 540. Control logic (software) and data are stored in the main memory 540, which may take the form of random access memory (RAM).
The system 565 also includes input devices 560, the parallel processing system 525, and display devices 545, i.e., a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display, or the like. User input may be received from the input devices 560, e.g., keyboard, mouse, touchpad, microphone, and the like. Each of the foregoing modules and/or devices may even be situated on a single semiconductor platform to form the system 565. Alternatively, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
Further, the system 565 may be coupled to a network (e.g., a telecommunications network, local area network (LAN), wireless network, wide area network (WAN) such as the Internet, peer-to-peer network, cable network, or the like) through a network interface 535 for communication purposes.
The system 565 may also include a secondary storage (not shown). The secondary storage 610 includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in the main memory 540 and/or the secondary storage. Such computer programs, when executed, enable the system 565 to perform various functions. The memory 540, the storage, and/or any other storage are possible examples of computer-readable media.
The architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 565 may take the form of a desktop computer, a laptop computer, a tablet computer, servers, supercomputers, a smart phone (e.g., a wireless, hand-held device), a personal digital assistant (PDA), a digital camera, a vehicle, a head-mounted display, a hand-held electronic device, a mobile phone device, a television, a workstation, game consoles, an embedded system, and/or any other type of logic.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present application should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following and later-submitted claims and their equivalents.
Machine learning
Deep neural networks (DNNs) developed on processors such as the PPU 300 have been used for diverse use cases, from self-driving cars to faster drug development, from automatic image captioning in online image databases to smart real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. A child is initially taught by an adult to correctly identify and classify various shapes, and eventually is able to identify shapes without any coaching. Similarly, a deep learning or neural learning system needs to be trained in object recognition and classification in order to become smarter and more efficient at identifying basic objects, occluded objects, and the like, while also assigning context to objects.
At the simplest level, neurons in the human brain look at the various inputs that are received, importance levels are assigned to each of these inputs, and an output is passed on to other neurons to act upon. An artificial neuron, or perceptron, is the most basic model of a neural network. In one example, a perceptron may receive one or more inputs that represent various features of an object that the perceptron is being trained to recognize and classify, and each of these features is assigned a certain weight based on the importance of that feature in defining the shape of the object.
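A minimal Python sketch of such a perceptron follows; the feature values, weights, and threshold are illustrative assumptions chosen only to show the weighted-sum-and-threshold behavior described above, not values taken from this disclosure:

# Minimal perceptron sketch: weight each input feature by its importance,
# sum the weighted inputs, and "fire" if the sum exceeds a threshold.
def perceptron(inputs, weights, threshold):
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0

# Example: three features with different importance weights (assumed values).
features = [0.9, 0.2, 0.7]
weights = [0.6, 0.1, 0.3]
print(perceptron(features, weights, threshold=0.5))  # prints: 1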
A deep neural network (DNN) model includes multiple layers of many connected nodes (e.g., perceptrons, Boltzmann machines, radial basis functions, convolutional layers, etc.) that can be trained with enormous amounts of input data to quickly solve complex problems with high accuracy. In one example, a first layer of the DNN model breaks down an input image of an automobile into various sections and looks for basic patterns such as lines and angles. The second layer assembles the lines to look for higher-level patterns such as wheels, windshields, and mirrors. The next layer identifies the type of vehicle, and the final few layers generate a label for the input image, identifying the model of a specific automobile brand.
Once the DNN is trained, the DNN can be deployed and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process through which a DNN extracts useful information from a given input) include identifying handwritten numbers on checks deposited into ATM machines, identifying images of friends in photos, delivering movie recommendations to over fifty million users, identifying and classifying different types of automobiles, pedestrians, and road hazards in driverless cars, and translating human speech in real time.
During training, data flows through the DNN in a forward propagation phase until a prediction is produced that indicates a label corresponding to the input. If the neural network does not correctly label the input, errors between the correct label and the predicted label are analyzed, and the weights are adjusted for each feature during a backward propagation phase until the DNN correctly labels the input and the other inputs in the training dataset. Training complex neural networks requires massive amounts of parallel computing performance, including the floating-point multiplications and additions supported by the PPU 300. Inferencing is less compute-intensive than training; it is a latency-sensitive process in which a trained neural network is applied to new inputs it has not seen before, to classify images, translate speech, and generally infer new information.
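The forward-propagation and back-propagation cycle described above can be sketched, in heavily simplified form, for a single linear layer trained with gradient descent; the layer size, learning rate, and random data below are illustrative assumptions only:

# Minimal NumPy sketch of forward propagation, error analysis, and weight
# adjustment (back-propagation) for one linear layer with a squared-error loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4))          # 8 training inputs, 4 features each
y = rng.standard_normal((8, 1))          # correct "labels"
W = rng.standard_normal((4, 1)) * 0.1    # weights to be adjusted

learning_rate = 0.05
for step in range(100):
    pred = x @ W                         # forward propagation: produce a prediction
    error = pred - y                     # difference between prediction and label
    grad = x.T @ error / len(x)          # gradient of 0.5 * mean squared error w.r.t. W
    W -= learning_rate * grad            # back-propagation step: adjust the weights

print(float(np.mean((x @ W - y) ** 2)))  # the loss decreases over training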
Neural networks rely heavily on matrix math operations, and complex multi-layered networks require tremendous amounts of floating-point performance and bandwidth for both efficiency and speed. With thousands of processing cores optimized for matrix math operations and delivering tens to hundreds of TFLOPS of performance, the PPU 300 is a computing platform capable of delivering the performance required for deep-neural-network-based artificial intelligence and machine learning applications.
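Before turning to the claims, the progressive scheme that this application is directed to (train a GAN, modify its topology, and smoothly fade in the modification during a second training duration) can be illustrated with a minimal, heavily simplified sketch. The layer shapes, the fade schedule, the placeholder training steps, and all variable names below are assumptions for illustration only and do not reproduce the claimed implementation:

# Minimal sketch: train a generator topology, add a layer to modify the
# topology, then smoothly fade in the modification by interpolating between
# outputs produced with the old and the modified topology.
import numpy as np

rng = np.random.default_rng(2)

def layer(dim_in, dim_out):
    return rng.standard_normal((dim_in, dim_out)) * 0.1

def forward(layers, z):
    h = z
    for W in layers:
        h = np.tanh(h @ W)
    return h

generator = [layer(16, 32), layer(32, 32)]   # topology used for the first duration

# First training duration (placeholder for adversarial updates with a discriminator).
z = rng.standard_normal((4, 16))

# Modify the topology: add a layer to the generator.
modified_generator = generator + [layer(32, 32)]

# Second training duration: smoothly fade in the modified topology.
for step, alpha in enumerate(np.linspace(0.0, 1.0, num=5)):
    out_old = forward(generator, z)                   # output with the original topology
    out_new = forward(modified_generator, z)          # output with the modified topology
    out = (1.0 - alpha) * out_old + alpha * out_new   # interpolate old and new outputs
    # ... discriminator and generator parameter updates would go here ...
    print(step, round(float(alpha), 2), out.shape)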

Claims (21)

1. A computer-implemented method, comprising:
training a generative adversarial network (GAN) for a first duration, the GAN comprising a generator neural network coupled to a discriminator neural network, wherein a topology of the GAN comprises features within the generator neural network and the discriminator neural network and interconnections between the features;
modifying the topology of the GAN to produce a modified GAN; and
training the modified GAN for a second duration.
2. The computer-implemented method of claim 1, wherein modifying the topology changes a processing capability of the generator neural network.
3. The computer-implemented method of claim 1, wherein the topology is modified by adding at least one layer to the generator neural network.
4. The computer-implemented method of claim 1, wherein the topology is modified by adding at least one layer to the discriminator neural network.
5. The computer-implemented method of claim 1, wherein the topology is modified by removing at least one layer from the generator neural network.
6. The computer-implemented method of claim 1, wherein training data includes example output data, and the method further comprises, during the training of the modified GAN:
processing input data by the generator neural network to produce output data;
modifying the example output data to produce modified training data; and
processing the modified training data and the output data by the discriminator neural network to produce updated parameters for the GAN.
7. The computer-implemented method of claim 6, wherein the modified training data for the first duration is different compared with the modified training data for the second duration.
8. The computer-implemented method of claim 6, wherein the modified training data for the first duration is modified according to a first function and the modified training data for the second duration is modified according to a second function that is different than the first function.
9. The computer-implemented method of claim 6, wherein the training data further includes additional input data, and the additional input data is paired with the example output data.
10. The computer-implemented method of claim 6, wherein modifying the training data comprises increasing or decreasing a density of the example output data.
11. The computer-implemented method of claim 6, wherein the training data is image data and modifying the training data comprises reducing a pixel resolution of the training data.
12. The computer-implemented method of claim 11, wherein the pixel resolution of the training data is reduced by a greater amount for the first duration compared with the second duration.
13. The computer-implemented method of claim 1, further comprising smoothly modifying the topology during the second duration.
14. The computer-implemented method of claim 13, wherein first intermediate values generated using the topology of the GAN are interpolated with second intermediate values generated using the modified topology of the GAN.
15. The computer-implemented method of claim 1, wherein the GAN processes three-dimensional image data.
16. The computer-implemented method of claim 1, wherein the GAN processes audio data.
17. A system, comprising:
a generative adversarial network (GAN) including a generator neural network coupled to a discriminator neural network, wherein
the GAN is trained for a first duration, and a topology of the GAN comprises features within the generator neural network and the discriminator neural network and interconnections between the features;
the topology of the GAN is modified to produce a modified GAN; and
the modified GAN is trained for a second duration.
18. The system of claim 17, wherein modifying the topology changes a processing capability of the generator neural network.
19. The system of claim 17, wherein the topology is modified by adding at least one layer to the generator neural network.
20. A non-transitory computer-readable medium storing computer instructions for training a generative adversarial network (GAN), the GAN comprising a generator neural network coupled to a discriminator neural network, wherein the instructions, when executed by one or more processors, cause the one or more processors to perform the steps of:
training the GAN for a first duration, wherein a topology of the GAN comprises features within the generator neural network and the discriminator neural network and interconnections between the features;
modifying the topology of the GAN to produce a modified GAN; and
training the modified GAN for a second duration.
21. A computer-implemented method for training a generative adversarial network (GAN), the GAN comprising a generator neural network coupled to a discriminator neural network, the method comprising:
receiving example output data at an input of the GAN;
processing input data by the generator neural network to produce generator output data;
comparing, by the discriminator neural network, the example output data with the generator output data, and
outputting a first training stimulus if the generator output data sufficiently matches the example output data according to a criterion, and
outputting a second training stimulus if the generator output data does not sufficiently match the example output data according to the criterion; and
in response to the discriminator neural network outputting the second training stimulus, modifying at least one of a layer, a feature, and an interconnection in at least one of the generator neural network and the discriminator neural network, to modify the GAN.
CN201811242789.6A 2017-10-26 2018-10-24 Progressive modification of a generative adversarial neural network Active CN110059793B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762577611P 2017-10-26 2017-10-26
US62/577,611 2017-10-26
US16/156,994 US11250329B2 (en) 2017-10-26 2018-10-10 Progressive modification of generative adversarial neural networks
US16/156,994 2018-10-10

Publications (2)

Publication Number Publication Date
CN110059793A 2019-07-26
CN110059793B CN110059793B (en) 2024-01-26

Family

ID=66242953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811242789.6A Active CN110059793B (en) Progressive modification of a generative adversarial neural network

Country Status (1)

Country Link
CN (1) CN110059793B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
US9648547B1 (en) * 2013-06-28 2017-05-09 Google Inc. Self-organizing topology management
CN106845471A (en) * 2017-02-20 2017-06-13 深圳市唯特视科技有限公司 A kind of vision significance Forecasting Methodology based on generation confrontation network
CN107154023A (en) * 2017-05-17 2017-09-12 电子科技大学 Face super-resolution reconstruction method based on generation confrontation network and sub-pix convolution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘进锋: "A concise and efficient method for accelerating convolutional neural networks" (一种简洁高效的加速卷积神经网络的方法), 科学技术与工程 (Science Technology and Engineering), no. 33 *
王坤峰; 苟超; 段艳杰; 林懿伦; 郑心湖; 王飞跃: "Research progress and prospects of generative adversarial networks (GAN)" (生成式对抗网络GAN的研究进展与展望), 自动化学报 (Acta Automatica Sinica), no. 03 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110460600A (en) * 2019-08-13 2019-11-15 南京理工大学 The combined depth learning method generated to network attacks can be resisted
CN110796237A (en) * 2019-10-28 2020-02-14 宁夏吉虎科技有限公司 Method and device for detecting attack resistance of deep neural network
CN110796237B (en) * 2019-10-28 2023-04-07 宁夏吉虎科技有限公司 Method and device for detecting attack resistance of deep neural network
CN111007399A (en) * 2019-11-15 2020-04-14 浙江大学 Lithium battery state of charge prediction method based on improved generation countermeasure network
CN111007399B (en) * 2019-11-15 2022-02-18 浙江大学 Lithium battery state of charge prediction method based on improved generation countermeasure network
WO2021169292A1 (en) * 2020-02-24 2021-09-02 上海理工大学 Adversarial optimization method for training process of generative adversarial neural network
US11315343B1 (en) 2020-02-24 2022-04-26 University Of Shanghai For Science And Technology Adversarial optimization method for training process of generative adversarial network
CN113408694A (en) * 2020-03-16 2021-09-17 辉达公司 Weight demodulation for generative neural networks
CN113762461A (en) * 2020-06-05 2021-12-07 辉达公司 Training neural networks with finite data using reversible enhancement operators
US20220374714A1 (en) * 2021-05-19 2022-11-24 Nvidia Corporation Real time enhancement for streaming content

Also Published As

Publication number Publication date
CN110059793B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
US11763168B2 (en) Progressive modification of generative adversarial neural networks
US11861890B2 (en) Style-based architecture for generative neural networks
US10872399B2 (en) Photorealistic image stylization using a neural network model
US10984286B2 (en) Domain stylization using a neural network model
CN110176054A Generation of synthetic images for training a neural network model
US10595039B2 (en) System and method for content and motion controlled action video generation
CN110009705A Creating images using mappings representing different types of pixels
CN110059793A Progressive modification of generative adversarial neural networks
US10783393B2 (en) Semi-supervised learning for landmark localization
US20190171936A1 (en) Progressive Modification of Neural Networks
CN109472858A Differentiable rendering pipeline for inverse graphics
CN110363294A Representing a neural network using paths in the network to improve performance of the neural network
US20210150357A1 (en) Smoothing regularization for a generative neural network
DE102018117813A1 (en) Timely data reconstruction with an external recurrent neural network
CN108734272A Convolutional neural network optimization mechanism
US11961001B2 (en) Parallel forward and backward propagation
CN108734649A Neural network training mechanism
US20210287096A1 (en) Microtraining for iterative few-shot refinement of a neural network
CN111191784A (en) Transposed sparse matrix multiplied by dense matrix for neural network training
CN110383296A System and method for providing automatic program synthesis for deep stacking
DE102018114799A1 Semi-supervised learning for landmark localization
CN115797543A Single image inverse rendering
US11605001B2 (en) Weight demodulation for a generative neural network
CN110383206A System and method for generating Gaussian random numbers using hardware acceleration
CN117953092A (en) Creating images using mappings representing different types of pixels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant