US20190156214A1 - Systems and methods for exchange of data in distributed training of machine learning algorithms - Google Patents

Systems and methods for exchange of data in distributed training of machine learning algorithms

Info

Publication number
US20190156214A1
US20190156214A1 (Application US16/192,924)
Authority
US
United States
Prior art keywords
layer
nodes
node
backpropagation
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/192,924
Inventor
Alexander Matveev
Nir Shavit
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neuralmagic Inc
Original Assignee
Neuralmagic Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neuralmagic Inc
Priority to US16/192,924
Assigned to Neuralmagic Inc. Assignors: MATVEEV, ALEXANDER; SHAVIT, NIR
Publication of US20190156214A1
Legal status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/046Forward inferencing; Production systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1916Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques

Definitions

  • The invention relates generally to machine learning, and more specifically to training neural networks using distributed systems.
  • Neural networks (NN) or connectionist systems are computing systems inspired by biological computing systems, but operating using manufactured digital computing technology. NNs are made up of computing units typically called neurons (which are artificial neurons, as opposed to biological neurons) communicating with each other via connections, links or edges. In common NN implementations, the signal at the link between artificial neurons may be for example a real number, and the output of each neuron may be computed by a function of the (typically weighted) sum of its inputs, such as the ReLU rectifier function. NN links or edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection.
  • NN neurons are divided or arranged into layers, where different layers may perform different kinds of transformations on their inputs and may have different patterns of connections with other layers.
  • a higher or upper layer, or a layer “above” another layer is a layer more towards the output layer
  • a lower layer, preceding layer, or a layer “below” another layer is a layer towards the input layer.
  • Such systems may learn to perform tasks by considering example input data, generally without being programmed with any task-specific rules, being presented with the correct output for the data, and self-correcting.
  • the NN may execute a forward-backward pass where in the forward pass the NN is presented with an input and produces an output, and in the backward pass (backpropagation) the NN is presented with the correct output, generates an error (e.g., a “loss”), and generates update gradients which are used to alter the weights at the links or edges.
  • Example NN types include the convolutional neural network (CNN) and networks using long short-term memory (LSTM) units.
  • a NN is simulated by one or more computing nodes or cores, such as generic central processing units (CPUs, e.g. as embodied in personal computers) or graphics processing units (GPUs such as provided by Nvidia Corporation), which may be connected by a data network.
  • a collection of such connected computers may be termed a pod, and computers used with NNs may be single socket (e.g. one main processor) or multi-socket (e.g. multiple processors in one machine, sharing some memory).
  • One or more computing nodes may model a NN using known data structures.
  • the trained NN may for example recognize or categorize images, perform speech processing, or other tasks.
  • a NN may be modeled as an abstract mathematical object, such as a function.
  • a NN may be translated physically to CPU or GPU operations, for example as a sequence of matrix operations where entries in the matrix represent neurons (e.g. artificial neurons connected by edges or links) and matrix functions represent functions of the NN.
  • the NN may be presented with training data.
  • a NN may learn to identify images that contain cats by analyzing example images that have been manually labeled as “cat” or “not a cat” and using the results to identify cats in other images.
  • the NN may do this without any prior knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead, during learning the NN automatically generates identifying characteristics from the learning material that it processes.
  • One method of training a NN is data parallel learning, where the data or training sets are divided (typically by a master node or core) and each core or node operates on the same NN but independently on only a portion of the data, using forward and backward passes; after each forward/backward pass the nodes or cores exchange parameters (e.g. weights or gradients) with each other, or send them to the master, to arrive at the parameters for that iteration.
  • a master node may send one different image, or a set of images, and the same model of the NN, to each of four CPUs.
  • Each CPU may execute a forward and backward pass over all layers of the model on its specific image, and send the resulting parameters to the master, which then creates an updated model from the parameters sent by all four CPUs.
  • Each node or processor may at times store a different version (with different parameters) of the same NN.
  • each node executes the full machine learning model on a subset of examples, so the number of parameters a node needs to communicate may be the same as the model size.
  • Network bottlenecks may slow the learning process. High bandwidth interconnections may be used to speed data transfer, but such systems are expensive compared to lower bandwidth networks, such as an Ethernet network.
  • a loss, inconsistency or error value may be calculated at the output or at an output layer, with possibly multiple loss values being created, e.g. one for each node in an output layer.
  • the output layer or set of layers typically is or includes a fully connected (FC) layer, where each neuron in the layer accepts an input, edge or link from each neuron/output of a lower or preceding layer (e.g., a layer closer to the input).
  • This fully connected layer is an example of a layer where the number of weights is high (because there may be a link between every input neuron and every output neuron) and yet the layer has a relatively low amount of compute, because the computation as a whole may be equivalent to a matrix multiply rather than a convolution.
  • a loss for a network may represent the difference or inconsistency between the value or values output from the network, and the correct value/values that should be output given the data input to the NN.
  • a loss value may be, for example, a negative log-likelihood or residual sum of squares, but may be computed in another manner.
  • In NN learning it is desired to minimize loss, and after receiving a loss the NN model may be updated by modifying weight values in the network using backpropagation.
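  • As a minimal illustration of this update step (a generic formula, not one prescribed by the patent): for a weight w, a learning rate \eta, and a gradient g_t = \partial L / \partial w_t of the loss L obtained by backpropagation, a plain stochastic gradient descent update is

        w_{t+1} = w_t - \eta \, g_t

    Embodiments may of course use other update rules, for example with momentum or adaptive learning rates.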
  • Systems and methods of the present invention may make exchanging data in a neural network (NN) during training more efficient.
  • Exchanging weights among a number of processors training a NN across iterations may in some embodiments include sorting generated weights, compressing the sorted weights, and transmitting the compressed sorted weights. On each Kth iteration a sort order of the sorted weights may be created and transmitted.
  • Embodiments may exchange weights among processors training a NN by executing a forward pass to produce a set of loss values for processors, transmitting loss values to other processors, and at each of the processors, performing backpropagation on at least one layer of the NN using loss values received from other processors.
  • FIG. 1A is a block diagram of a neural network according to an embodiment of the present invention.
  • FIG. 1B is a block diagram of a neural network according to an embodiment of the present invention.
  • FIG. 1C is a block diagram of a system for training a neural network according to an embodiment of the present invention.
  • FIG. 2 is a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.
  • FIG. 3 is a flowchart of a method according to embodiments of the present invention.
  • FIG. 4 depicts a prior art process for training using a multi-node system.
  • FIG. 5 is a flowchart of a method according to embodiments of the present invention.
  • FIG. 6 is a diagram depicting a method of exchanging weights among processors according to embodiments of the present invention.
  • Embodiments of the invention include systems and methods that may reduce the amount of data communicated during the training process of NNs (e.g. convolutional neural networks, or other networks) using a system including multiple nodes such as CPUs connected via a relatively slow connection such as an Ethernet or similar inexpensive network.
  • CPUs if used, may contain multiple cores, so that certain tasks may be done in parallel or concurrently: for example transmitting or receiving data, sorting, compression, portions of a backward or forward pass, etc.
  • embodiments of the invention are applicable to other, non-NN tasks, for transferring large amounts of data. While a CNN is discussed as an example NN used, embodiments of the invention may be used with other NNs, such as LSTMs. Further, while CPU based machines are discussed, GPUs or other types of processors may be used.
  • Embodiments of the present invention may be used with pods, and single socket or multi-socket systems, or other types of systems.
  • Embodiments of the invention may take advantage of the computational properties of a NN such as a CNN to distribute the computation and thus reduce the overall communication.
  • Loss values may be transmitted by nodes to a master node or other nodes, which may use the loss values to calculate gradients and/or weights to modify the model.
  • the computation of these parameters may be relatively computationally easy, e.g., have a low computational burden relative to other layers, as in the case of an FC layer, where the computation per output node is a simple dot product
  • Some prior techniques use compression to reduce data size of data transmitted among nodes; however, such techniques achieve only a lossy reduction, e.g. reducing the granularity or accuracy of data on decompression.
  • lossy compression might increase convergence time (e.g., where the NN converges to a state where the error of the calculations is small) or even preclude convergence at all.
  • the computational properties of the weight distributions during NN training contribute to improving compression and distribution of the weights, and thus reduce the overall communication overheads, with, in some cases, no loss of accuracy (e.g. using lossless compression).
  • the distribution or transmission of other parameters, such as loss values or gradients may also be made more efficient.
  • FIG. 1A is a simplified block diagram of a NN according to an embodiment of the present invention; in typical use thousands of neurons and links are used.
  • NN 1000 may input data as for example an input vector 1010 of values (representing, e.g. a photograph, voice recording, or any sort of data), and may produce an output of signals or values, for example output vector 1020 .
  • NN 1000 may have neurons arranged into layers 1030 , each including neurons 1040 connected to other neurons by links or edges 1050 .
  • FIG. 1B is a block diagram of a neural network according to an embodiment of the present invention.
  • NN 1100 may input data, for example image 1110 (e.g.
  • NN 1100 may in one example have layers 1130 (convolution), 1132 (pooling), 1134 (convolution), 1136 (pooling), and one or more output layers 1138 , which may include for example an FC layer 1138 A and a softmax layer 1138 B. Each layer may include neurons connected to other neurons by links or edges.
  • the NNs in FIGS. 1A and 1B are typically simulated, and represented as data, for example in a system such as shown in FIG. 1C , below.
  • a convolutional layer may apply a convolution operation to its input, passing its result to the next layer.
  • the convolution operation may for example emulate the response of an individual neuron to visual stimuli, and may for example include neurons processing data only for its receptive field.
  • a convolutional layer's parameters may include a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume.
  • each filter may be convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter.
  • the NN may learn filters that activate when they detect some specific type of feature at some spatial position in the input.
  • Stacking the activation maps for all filters along the depth dimension may form the full output volume of the convolution layer. Every entry in the output volume for a convolutional layer can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
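  • As a sketch only (generic NumPy code; the function name conv2d_single_filter and the sizes are illustrative, not from the patent), the following shows one filter convolved across the width and height of a 2-D input, producing one activation map by taking the dot product of the filter with each input patch:

        import numpy as np

        def conv2d_single_filter(image, kernel):
            # Slide the kernel over the image; each output entry is the dot
            # product of the kernel with the image patch it overlaps
            # (no padding, stride 1).
            kh, kw = kernel.shape
            oh = image.shape[0] - kh + 1
            ow = image.shape[1] - kw + 1
            activation_map = np.zeros((oh, ow))
            for y in range(oh):
                for x in range(ow):
                    patch = image[y:y + kh, x:x + kw]
                    activation_map[y, x] = np.sum(patch * kernel)
            return activation_map

        image = np.random.rand(28, 28)     # e.g. a small grayscale input
        kernel = np.random.rand(3, 3)      # one learnable 3x3 filter
        print(conv2d_single_filter(image, kernel).shape)   # (26, 26) activation map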
  • NNs used for classification tasks may produce, for each class i, an output z_i, sometimes called a logit, which may encode or represent the likelihood that a given example input should be classified to class i.
  • Logits z_i, for each class i (e.g., for image recognition dog, cat, llama, etc.) may be transformed into probabilities q_i by comparing each z_i to the other logits, in for example a softmax layer.
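  • In standard notation (not specific to this patent), the softmax transformation of logits z_i into probabilities q_i is

        q_i = \frac{e^{z_i}}{\sum_j e^{z_j}}

    so that each q_i compares logit z_i against all the other logits, and the q_i sum to 1.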
  • FIG. 1C is a block diagram of a system for training a neural network according to an embodiment of the present invention.
  • a system of computing devices 1 may include computing nodes 10 , 20 and 30 , connected by one or more communications network(s) 40 .
  • Communications network 40 may be for example an Ethernet network, but may be one or more other communications networks.
  • Node 10 may be a master node, distributing training data, collecting parameters and creating updated models, and nodes 20 and 30 may be training nodes, executing forward and backward passes on training data, sending parameters (e.g. weights for edges or links) to master node 10 , and updating the nodes' internal representations of the NN after receiving data from the master node.
  • a training node may function as a master node.
  • a fixed “master” node need not be used, and one or more training nodes may execute the functionality of a master node.
  • other numbers of nodes may be used, for example 10 training nodes, 1,028 training nodes, or other numbers.
  • Other numbers of master nodes may be used, for example an embodiment may include two master nodes and 16 training nodes, or 16 nodes total.
  • Master node 10 may include data 12 , e.g., training sets (such as collections of images, audio files, etc) and model data 14 representing a NN (e.g. data representing artificial neurons, links, weights, etc.) and including for example parameters such as weights, and possibly for example the arrangement of nodes, layers and edges.
  • Each of nodes 10 , 20 and 30 may model the same complete NN, including neurons, links, weights, etc. as the other nodes, but each of nodes 20 and 30 may train on a different set of data.
  • Each node 20 and 30 may model the same NN as master node 10 , and may include for example NN data 22 and 32 .
  • the NN may be for example a CNN, but may be another type of NN.
  • the NN modeled by NN data 22 and 32 may include an input layer 50 , convolution layers 52 and 56 , pool layers 54 and 58 , a fully connected layer 60 , and a softmax layer 62 .
  • Other numbers and types of layers may be used.
  • the NN made of layers 50 - 62 may function and be simulated as is known in the art.
  • a system such as shown in FIG. 1C may execute a trained NN at inference time, although at inference time such NNs may be executed by one processing node, e.g. a workstation, PC, server, etc.
  • Nodes may be for example CPU based systems (e.g. workstations, PCs), GPU based systems, or other systems.
  • master node 10 is a CPU based system and training nodes may be other systems such as GPU based systems.
  • Nodes 10 , 20 and 30 may be or include structures such as those shown in FIG. 2 . While in some embodiments a generic CPU (e.g. a workstation, a PC (personal computer), a multi-core system) is discussed as a node, embodiments of the invention may be used with other types of nodes, such as GPUs. Further, while example embodiments of the invention discuss a relatively simple, slow communications connection between nodes, such as an Ethernet, other networks or communications systems, such as relatively fast, expensive, and specially made systems, may be used.
  • FIG. 2 shows a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.
  • Computing device 100 may include a controller or processor 105 that may be or include, for example, one or more central processing unit processor(s) (CPU), one or more graphics processing unit(s) (GPU or GPGPU), a chip or any suitable computing or computational device, an operating system 115 , a memory 120 , a storage 130 , input devices 135 and output devices 140 .
  • modules and equipment such as nodes 10 , 20 and 30 , and other equipment mentioned herein may be or include a computing device such as included in FIG. 2 , although various units among these entities may be combined into one computing device.
  • Operating system 115 may be or may include any code segment designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 100 , for example, scheduling execution of programs.
  • Memory 120 may be or may include, for example, a Random Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units.
  • Memory 120 may be or may include a plurality of, possibly different memory units.
  • Memory 120 may store for example, instructions to carry out a method (e.g. code 125 ), and/or data such as user responses, interruptions, etc.
  • Executable code 125 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 125 may be executed by controller 105 possibly under control of operating system 115 . For example, executable code 125 may when executed cause NN training, coordination of NN training tasks, NN execution or inference, etc. according to embodiments of the present invention. In some embodiments, more than one computing device 100 or components of device 100 may be used for multiple functions described herein. For the various modules and functions described herein, one or more computing devices 100 or components of computing device 100 may be used. Devices that include components similar or different to those included in computing device 100 may be used, and may be connected to a network and used as a system.
  • One or more processor(s) 105 may be configured to carry out embodiments of the present invention by for example executing software or code.
  • Storage 130 may be or may include, for example, a hard disk drive, a floppy disk drive, a Compact Disk (CD) drive, a CD-Recordable (CD-R) drive, a universal serial bus (USB) device or other suitable removable and/or fixed storage unit.
  • Data such as instructions, code, NN model data, parameters, etc. may be stored in a storage 130 and may be loaded from storage 130 into a memory 120 where it may be processed by controller 105 . In some embodiments, some of the components shown in FIG. 2 may be omitted.
  • Input devices 135 may be or may include for example a mouse, a keyboard, a touch screen or pad or any suitable input device. It will be recognized that any suitable number of input devices may be operatively connected to computing device 100 as shown by block 135 .
  • Output devices 140 may include one or more displays, speakers and/or any other suitable output devices. It will be recognized that any suitable number of output devices may be operatively connected to computing device 100 as shown by block 140 .
  • Any applicable input/output (I/O) devices may be connected to computing device 100 , for example, a wired or wireless network interface card (NIC), a modem, printer or facsimile machine, a universal serial bus (USB) device or external hard drive may be included in input devices 135 and/or output devices 140 .
  • NIC network interface card
  • USB universal serial bus
  • Embodiments of the invention may include one or more article(s) (e.g. memory 120 or storage 130 ) such as a computer or processor non-transitory readable medium, or a computer or processor non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory, encoding, including or storing instructions, e.g., computer-executable instructions, which, when executed by a processor or controller, carry out methods disclosed herein.
  • each neuron computes its own gradient for a link for the neuron, the gradient to be applied to adjust the weight of the link.
  • (A neuron taking action such as transmitting data, computing data, etc., may mean that a processor simulating the neuron performs a computation to simulate such action; e.g. a computing node simulating a number of neurons may perform the actual action that is ascribed to the neuron.)
  • a node that is simulating neurons may collect weights or other parameters and transmit them to a master node. The master node may receive and collect parameters and construct a model based on these parameters: e.g.
  • a master node may collect all weights from nodes, and for each link, average the weights to produce an updated weight for that link, the weight being a part of the updated model. Techniques other than averaging may be used.
  • a number of nodes simulate forward/backward passes on the same NN at the same time using different data sets: the resulting changes in parameters, e.g. weights, are sent by each node to a master node which creates an updated model from the parameters and sends the model back to the nodes.
  • one node acts as both a node simulating neurons and also the master node for all nodes.
  • parameters such as weights are represented as floating point (e.g. 32 bit) numbers, but may be represented in other ways, such as integers or numbers represented by different numbers of bits.
  • nodes may communicate parameters such as weights or other parameters with a master node or other nodes by first sorting or arranging them, for example according to the values of the parameters, and then applying a ZIP or a similar lossless compression algorithm (e.g., 7zip, or another suitable compression algorithm) to the sorted sequence. Sorting or arranging, e.g. in order of value, to place similar or same values adjacent to each other in an ordered list or vector, may allow for improved parameter compression, due to the nature of compression algorithms like ZIP.
  • Sorted data is typically easier to compress than unsorted data because sequential values are in order when data is sorted so their non-negative differences can be encoded in place of the original values, and repeated values are all contiguous and can be encoded by including a count along with the first instance.
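  • A minimal sketch (illustrative Python; the patent does not prescribe a particular codec, and the quantization below is only to keep the toy example simple, since the patent describes lossless compression of the parameters themselves) of why sorting helps a ZIP-style compressor: after sorting, adjacent values differ by small non-negative amounts and repeated values form runs:

        import zlib
        import numpy as np

        rng = np.random.default_rng(0)
        weights = rng.normal(0.0, 0.05, size=100_000)       # hypothetical weight values
        quantized = np.round(weights * 32768).astype(np.int32)

        sorted_q = np.sort(quantized)
        deltas = np.diff(sorted_q)        # small non-negative differences; the first
                                          # value would be sent separately

        print(len(zlib.compress(quantized.tobytes(), 9)))   # unsorted
        print(len(zlib.compress(sorted_q.tobytes(), 9)))    # sorted: similar values adjacent
        print(len(zlib.compress(deltas.tobytes(), 9)))      # deltas: mostly tiny numbers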
  • the sort order may be used to re-order the data to its proper order; thus in some embodiments a sort order is transmitted typically on the first iteration of training, and periodically on some but importantly not all successive iterations.
  • one iteration may include a forward and backward pass on a batch of multiple inputs, e.g. multiple images, at one time, after which the model may be updated.
  • non-master nodes may exchange loss data by arranging the data in order of value of individual loss parameters, compressing the data, transmitting, uncompressing, and rearranging according to a sort order.
  • the sort order itself does not compress well or at all, and thus transmitting the sort order with each compressed set of parameters would not typically result in bandwidth savings.
  • a sorted order or order of the arranging is sent, which over time (as parameters change) becomes an approximate sorted order that still suffices to enable effective compression, allowing for lossless compression of parameters.
  • this approximate sorted order does not need to be sent with every set of parameters, and thus the cost of sending the sort order or order of the arranging may be amortized over many iterations of compressed data being sent.
  • the parameters are arranged or sorted to be ordered as the order of the last sort order created, before compression or “Zipping”, and not according to the actual values of the parameters at that time.
  • the parameters in the intermediate iterations may be arranged in an order which does not match that of an actual order sorted by value.
  • the quality of the compression may be determined by the extent to which the order of parameters reflects the true sort order of the data; in some embodiments the effectiveness is greatly helped if the order over consecutive sequences or iterations of values transmitted does not change by much. A typical property of the sequences of weights in consecutive training batches or iterations of a neural network trained using backward propagation and stochastic gradient descent is that the differences between consecutive sequences are small, since they are determined by the gradients, which are typically small.
  • consecutive sequences of weights from iterations of backward propagation and stochastic gradient descent therefore have small differences in their sort orders and small differences between their values, lending themselves to good compression even based on the sort order of preceding iterations.
  • the sort/compress/transmit sequence where sorting by value and creating a sort order occurs only periodically can be in both directions—e.g. when the master sends parameters back to slave or “non-master” nodes—and also between slave nodes.
  • the sort order may be an order that the sender (any node, master or slave node) creates and sends to the receiver, typically periodically. If the sort order is shared between two nodes, e.g. a master and a slave node, and one node (e.g. the slave node) created it originally, the other node (e.g. the master node) need not create a sort order. However, any sender, master or slave node, may create a sort order if needed.
  • the typical pattern for distributed training of machine learning algorithms includes for example iterating or repeating the following operations (a sketch in code appears after this list):
  • Each node simulating a set of NN neurons executes a forward-backward pass that calculates or generates new updated weights of links or edges.
  • nodes transmit parameters such as their newly calculated weights to a master node, or to other nodes.
  • master node may receive the parameters and update the model, e.g. by for each link averaging the weights received from nodes.
  • Each node may receive a model and may update its parameters, e.g. update the weights for its links or edges.
  • the process may repeat starting with operation 1 .
  • the iteration may stop when for example a certain accuracy is achieved, after a certain number of runs or iterations, when training data is exhausted, or on other conditions.
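  • The iteration pattern above, sketched end to end (toy Python; the least-squares forward_backward stand-in and all names are illustrative placeholders, not the patent's model or API):

        import numpy as np

        def average_weights(weight_sets):
            # The master combines per-node results, e.g. by averaging each
            # weight across nodes to build the updated model.
            return np.mean(weight_sets, axis=0)

        def forward_backward(model, batch, lr=0.1):
            # Toy stand-in for a node's forward-backward pass (a least-squares
            # gradient step), so the loop below actually runs end to end.
            x, y = batch
            grad = x.T @ (x @ model - y) / len(y)
            return model - lr * grad

        rng = np.random.default_rng(0)
        true_model = np.arange(8.0)              # what the toy model should learn
        model = np.zeros(8)                      # master's copy of the parameters
        for iteration in range(100):
            local_results = []
            for node in range(4):                # each node trains on its own data portion
                x = rng.normal(size=(32, 8))
                y = x @ true_model
                local_results.append(forward_backward(model.copy(), (x, y)))
            model = average_weights(local_results)   # master updates and redistributes
        print(np.round(model, 2))                # approaches [0. 1. 2. ... 7.]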
  • each node executes the full machine learning model on a subset of examples, and thus the number of parameters a node needs to communicate is the same as the model size.
  • nodes may compress parameters such as the weights generated at each node during the distributed training, before they are sent over the network. In one embodiment:
  • Nodes may sort or order the weights of the links or edges. Sorting may be in any suitable order, e.g. low to high, or high to low. Sorting may be based on for example the value of the weights themselves.
  • Nodes may compress their sorted weights by using ZIP or a lossless compression algorithm.
  • Sorting and compressing parameters may work well since there may be many similar numbers in the sorted sequence among the parameters for each node which reduces the overall entropy and allows ZIP type compression algorithms to compress well.
  • the nodes that receive the sorted-and-compressed data should know the sort-order in order to be able to access the data appropriately.
  • the sending and receiving nodes have a common understanding of the order of the weights being sent. For example, each edge, link or weight in the NN may be assigned an address, index or a place in an ordered list. For example, both the sending and receiving nodes understand the first weight is for edge X of the network. After sorting, a new understanding, a sort order, may be sent.
  • Sort-order or arrangement order information may be for example a table, one column being a weight, edge or link number or index, and the other column being the order in the compressed data, for each weight, edge or link.
  • Sort order or arrangement order information may be for example a list or vector (e.g. an ordered list of numbers), where for each ordered entry X the number indicates the place, position or order, in the compressed list, of the parameter numbered or indexed X. Other forms for a sort order may be used.
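  • A small sketch (illustrative Python, not from the patent) of a sort order held as an index vector, and of how a receiver uses it to restore the original order; the table form described above, or the inverse of this permutation, are equivalent encodings:

        import numpy as np

        weights = np.array([0.30, -0.12, 0.05, 0.29, -0.11])

        # Sort order as an index vector: position i of the sorted list holds the
        # weight originally at index sort_order[i].
        sort_order = np.argsort(weights)          # here: [1, 4, 2, 3, 0]
        sorted_weights = weights[sort_order]      # what would be compressed and sent

        # Receiver side: undo the permutation using the shared sort order.
        restored = np.empty_like(sorted_weights)
        restored[sort_order] = sorted_weights
        assert np.array_equal(restored, weights)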
  • sort-order information does not compress well, and sorting itself may be computationally expensive. Transmitting the sort order may be expensive in terms of network bandwidth, and thus transmitting sort information with each compressed list may be expensive and may eliminate the advantages of compression.
  • a sort-order is not generated and sent for each iteration, but rather only periodically or once every Kth iteration (K being an integer greater than one), so that the same sort-order is used across multiple iterations, and the cost of communicating and/or calculating the sort-order is amortized over them. K may be fixed as part of the system design, or may change periodically, over time or from iteration to iteration based on for example certain conditions.
  • the NN learns and changes its weights, yet many if not most of the weights do not change by a large percentage from iteration to iteration of training. Thus the actual order, from high to low or low to high, of the weights, changes from iteration to iteration but not by much.
  • gradients which are applied to edge or link weights to change the weights are small. For example, a gradient may be +/−0.0002.
  • a process may include, for example: on iteration X, sort the parameters by value, create a sort order, compress, and transmit the compressed parameters along with the sort order.
  • For iterations X+1 through X+K−1, the process may be the same, except that the parameters are arranged according to the same sort order as the previous iteration (not re-sorted by their current values), then compressed and transmitted without a sort order.
  • Iteration X+K is the same as iteration X: if K is a pre-set interval, such as 20 (causing a new sort to be created once every 20 iterations), at iteration X+K the process is again sort, compress, transmit.
  • the sort order may be created and transmitted only every K (e.g. 10, 20 or another suitable integer) iterations, so the cost of sending it will be amortized across K iterations.
  • K can be variable rather than fixed. This works best as long as the sort order does not change much across iterations, which is typically the case for distributed machine learning where the parameters change slowly.
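  • One way the sender-side behavior described above might be realized (an illustrative sketch assuming zlib as the lossless compressor and a fixed interval K; the class, function and parameter names are placeholders, not the patent's):

        import zlib
        import numpy as np

        class WeightSender:
            """Sketch of a sender that re-sorts only every K iterations."""
            def __init__(self, k=20):
                self.k = k
                self.sort_order = None

            def send(self, weights, iteration, transmit):
                if self.sort_order is None or iteration % self.k == 0:
                    # Every Kth iteration: sort by value, create a new sort order,
                    # and transmit it along with the compressed sorted weights.
                    self.sort_order = np.argsort(weights)
                    transmit(zlib.compress(weights[self.sort_order].tobytes()),
                             self.sort_order)
                else:
                    # Otherwise: arrange by the previous (now approximate) sort
                    # order, compress, and transmit without a sort order.
                    transmit(zlib.compress(weights[self.sort_order].tobytes()), None)

        # Minimal usage: print payload size and whether a sort order accompanied it.
        rng = np.random.default_rng(0)
        w = rng.normal(0, 0.05, 10_000).astype(np.float32)
        sender = WeightSender(k=20)
        for it in range(3):
            w += rng.normal(0, 0.0002, w.shape).astype(np.float32)  # small gradient-like drift
            sender.send(w, it, lambda data, order: print(len(data), order is not None))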
  • nodes may transmit parameters after computing parameters for each layer, and thus parameters may be sent multiple times for each backward pass.
  • a master node may update model parameters after receiving parameters for a certain layer, and transmit the parameters back to the nodes after this: for example a node may compute backpropagation of parameters of the next lower layer while the master is updating the parameters of layers above this layer, whose backpropagation has already ended.
  • a sequence of backpropagation may include nodes updating parameters for layer X; nodes transmitting parameters for layer X; nodes updating parameters for layer X+1, (higher and more towards output than layer X) while master computes model for layer X (concurrently, or simultaneously); master sending model to nodes for layer X; nodes sending parameters for layer X+1; etc. Other orders may be used. Further, in some embodiments nodes may complete a full backpropagation of all layers before sending parameters.
  • FIG. 3 is a flowchart of a method for exchanging or transmitting parameters such as weights according to embodiments of the present invention, while conducting training on a NN. While in one embodiment the operations of FIG. 3 are carried out using systems as shown in FIGS. 1 and 2 , in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 3 may be used with or combined with the embodiment shown in FIG. 5 .
  • a number of nodes may receive training sets or data from one or more master nodes.
  • a master node may send one image each to a number of nodes.
  • the nodes may be for example processors representing a NN using data, the NN including for example artificial neurons connected by edges or links.
  • the NN may be “virtual” and no actual physical neurons, links, etc. may exist, existing rather as data used by the nodes.
  • each node may execute a forward pass on the training data received, to produce an output.
  • each node may execute a backward or backpropagation pass, comparing its output for a NN to the expected output for the training data used, and calculating parameters such as weights for links or edges, or other data.
  • all layers in the NN, or at least two layers may have parameters generated.
  • the sorting/reordering, compressing and transmitting operations may occur for that layer.
  • the nodes or processors during the backward or backpropagation pass calculate or generate gradients for links and calculate or generate weights for the links based on or using the gradients. For example, gradients may be factors that adjust the values of the weights.
  • each node may sort or arrange parameters created in operation 320 , for example according to the values of the parameters, to create sorted parameters, e.g. sorted weights.
  • a sort order, order of arranging, ordering, or index may be created and possibly saved or stored, based on the sorting process.
  • Each node may have a different locally-created sort order. For example, while sorting the parameters, the new position of each parameter (when compared to the position of the parameter before sorting) may be saved as a sort order.
  • parameters exchanged in a NN system have some inherent order understood by all entities in the system, and the sort process changes that order. Sorting or arranging may be for example low to high, high to low, etc. according to the numerical value of the parameter.
  • the period between when sorting is performed according to the values of the parameters, and a sort order is created, may vary from iteration or cycle to iteration or cycle, and thus K may vary.
  • the sorting performed in operation 340 may be a rearrangement or re-ordering of parameters according to a prior sort order (e.g. the last Kth iteration, or the last time operation 330 was performed), and the “sorted parameters” are not sorted according to some ranking of their own values, but rather are arranged according to a prior sort order.
  • the parameters sorted or rearranged in operations 330 and 340 may be compressed by a node, to produce compressed sorted parameters, e.g. compressed sorted weights, typically using lossless compression, although other compression methods may be used.
  • the parameters may be Zipped.
  • the “compressed sorted parameters” may be not sorted according to their own order; rather they may be sorted or arranged according to a prior sort order.
  • data size savings are greatest when the parameters are weights, which typically have a similar order across iterations, as opposed to gradients, which often do not have a similar order across iterations.
  • sorting and compressing may be performed with parameters other than weights, such as gradients, losses, etc.
  • each node may transmit or send its compressed parameters to a master node, or one or more other processors or nodes. If the iteration is a “create sort” iteration, e.g. every K'th iteration, the sort order, ordering, or index created in operation 330 may also be transmitted, for example with the compressed parameters.
  • a master node or processor may receive the parameters and create an updated model of the NN. In order to do so, the master may decompress the parameters, and place the parameters in the order according to the last sort order received.
  • the parameters are typically re-ordered or re-sorted according to the last sort order received for the node that sent the parameters: thus the master node may maintain or store a different “last” sort order for each node sending it parameters.
  • the master node reordering decompressed parameters to their original, proper order may be performed for data received from each node using a separately received sort order, as typically the sort order or indexing from each node is different.
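  • A matching receiver-side sketch (illustrative; it pairs with the sender sketch above, and the class and method names are again placeholders) in which the master keeps the last sort order received from each node and uses it to restore the decompressed parameters to their proper order:

        import zlib
        import numpy as np

        class ParameterReceiver:
            """Sketch: master stores a separate 'last' sort order per sending node."""
            def __init__(self):
                self.last_sort_order = {}          # node id -> most recent sort order

            def receive(self, node_id, payload, sort_order=None):
                if sort_order is not None:         # a "create sort" (every Kth) iteration
                    self.last_sort_order[node_id] = sort_order
                order = self.last_sort_order[node_id]
                sorted_params = np.frombuffer(zlib.decompress(payload), dtype=np.float32)
                params = np.empty_like(sorted_params)
                params[order] = sorted_params      # undo the permutation
                return params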
  • the master node may send an updated model to the nodes performing the simulation, and the nodes may update their internal representation of the NN.
  • the updated model may be sent using parameters sorted and compressed according to the relevant sort order.
  • the process may repeat or iterate, moving back to operation 300 .
  • the iteration may stop when for example a certain accuracy is achieved, after a certain number of runs or iterations, or on other conditions. Other or different operations may be used.
  • a node receiving data, e.g. a master node, may use operations similar to operations 300 - 390 to transmit data to nodes, or non-master (e.g. “slave”) nodes may use such operations to transmit data.
  • a master node may use the sort order received from node A to transmit model parameters back to node A, by sorting the parameters according to the last sort order received from node A, then compressing the data. Node A then decompresses the received model data and uses the last sort order it created to sort the data back to its original order.
  • a master node may create its own sort order periodically.
  • parameters may be transmitted using a sort and compress method (e.g.
  • data transmitted using a sort and compress method may be from a node executing a forward/backward pass to another node executing a forward/backward pass.
  • data transmitted using a sort and compress method may include parameters other than weights: for example data may include gradient or loss data.
  • a node, typically when performing calculations relevant to an output layer (typically an FC layer), may, instead of using only the loss or error produced at that node to calculate weights or gradients for that layer, in addition use losses from other nodes, and may transmit or communicate its own losses or loss values to other nodes.
  • One or more nodes receiving the losses may receive all losses from all nodes simulating a forward pass, and then compute, in series for losses from each different node sending losses, a gradient and/or weight for each link or edge to the output layer. This may be in place of a master node receiving and averaging parameters for that particular layer.
  • the gradients may be averaged.
  • the nodes receiving loss data may be a master node, or may be all nodes conducting a forward pass, in which case all such nodes perform the same calculations using losses. Since in certain NNs the number of links to neurons in an FC output layer is orders of magnitude greater than the number of loss values for the output layer, this may reduce the amount of data to be communicated (which may allow for a less expensive communications link), in exchange, in some embodiments, for the modest cost of multiple nodes using the global loss values to compute weights or gradients for the model. Further, typically computation for an FC layer, possibly involving a matrix multiplication, is less burdensome than for other layers such as a convolution layer, which may asymptotically involve as many as the square of the number of operations of the matrix multiply.
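  • As a purely hypothetical illustration (the numbers are not from the patent): an FC output layer connecting 4,096 inputs to 1,000 classes holds 4,096 × 1,000 = 4,096,000 weights, while the loss or logit values for that layer number on the order of 1,000 per example. Exchanging losses instead of FC weights therefore shrinks the data sent for that layer by roughly three orders of magnitude per example, in exchange for each receiving node repeating the comparatively cheap matrix-multiply backpropagation step for that layer.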
  • a master node may compute new weights for the model for most layers by accepting weight values computed by nodes and, for example, averaging them.
  • multiple nodes may compute the new weights (the weights after applying the gradients for the model) from the losses by performing the backpropagation computation step for the layer. This may lower the amount of data that is transmitted. This may be especially useful for a system using a small number of nodes, e.g. a pod of 16 or 32 nodes (other numbers of nodes may be used).
  • the layer or subset of layers on which backpropagation is performed using non-local losses has associated with the layer a large fraction of the total weights in the NN but a much smaller fraction of the weight compute burden in the NN, even when computing using non-local losses.
  • Since compression may be considered a translation of data movement burden (e.g. network burden) to compute burden, this approach may be considered analogous to compression, in that there is a reduction in data movement burden (fewer weights are moved) and an increase in computation burden (each node redundantly performs substantially similar loss-to-weight calculations). However, given the architecture of some systems, this may result in faster processing.
  • a measure of the amount of parameter transmission or network burden may be the number of bytes sent, or the number of parameters sent.
  • a measure of the amount of compute or processing burden may be the number of computer operations (e.g., machine operations) needed to compute gradients and weights during backpropagation.
  • a layer may have a different amount or burden of computation than other layers, and a layer's transmission of parameters such as gradients or weights may have a different amount or burden for this transmission than other layers.
  • the “compute” ratio of the compute burden of the layer or layers on which backpropagation is performed using non-local losses to the compute burden of the other layers in the NN on which backpropagation is performed using local losses may be smaller than the “transmission” ratio of the data transmission burden of the layer or layers on which backpropagation is performed using non-local losses to the transmission burden of the other layers in the NN on which backpropagation is performed using local losses.
  • the ratio of compute burden of layer(s) on which backpropagation is performed with non-local losses to the compute burden for the other layers in the NN may be less than the ratio of the number of weights for the layer(s) on which backpropagation is performed to the number of weights for the other layers in the NN.
  • the layer(s) on which backpropagation is performed using non-local losses have more weights than another layer, or than all the other layers in the NN (e.g. cumulatively).
  • the layer(s) on which backpropagation is performed using non-local losses has associated with the layer(s) a larger amount of weight values and/or a smaller amount of weight compute burden than all other layers in the NN cumulatively—e.g. than all the values and burdens for the other layers combined.
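  • The relationships in the preceding bullets can be summarized in symbols (a paraphrase, not notation used in the patent): let C_{nl} and C_{l} be the backpropagation compute burdens, and W_{nl} and W_{l} the weight counts (or transmission burdens), of the layer(s) trained using non-local losses and of the remaining layers, respectively. Embodiments target layers for which

        \frac{C_{nl}}{C_{l}} < \frac{W_{nl}}{W_{l}}

    i.e. layers carrying a disproportionately large share of the weights relative to their share of the compute.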
  • FIG. 4 depicts a prior art process for training using a multi-node system, using two nodes 402 and 404 and one master node 400 ; other numbers of nodes may be used and a master may be part of a node performing NN simulation.
  • Nodes 400 , 402 and 404 may be connected by for example network 406 and may simulate a NN 410 including layers 411 , 412 , 413 , 414 , 415 , 416 and 417 .
  • Master node 400 may store datasets 420 , e.g., training data, and model data and parameters 422 .
  • Embodiments of the present invention may improve on the system of FIG. 4 . Referring to FIG. 4 , in some processes for distributed (multi-node) training of machine learning algorithms operations such as the following may be used:
  • a master may send (operation 430 ) parameters or a model and input data to the nodes.
  • Each node may execute (operation 432 ) a forward-backward pass that generates update gradients and weights.
  • Nodes execute a weight synchronization algorithm, which may involve a parameter update. This may involve nodes sending parameters to one or more master nodes (operation 440 ).
  • a loss 460 may be generated, and convolution layers may generate parameters e.g. parameters 462 , and an FC layer may generate parameters 464 .
  • One or more master nodes may accept parameters to update the model (operation 442 ), e.g. by averaging weights, and send the model back to the nodes; or this may involve each node receiving all other nodes' parameters, so that each node can update its parameters based on averaging weights from all other nodes' executions just as the master would have done.
  • the full model may be transmitted by the nodes to the master over the network.
  • nodes may need to communicate parameters such as weights or gradients over the network to other nodes.
  • each node executes the full machine learning model on a subset of examples, and thus the number of parameters a node needs to communicate is the same as the model size, which is a large amount of data to communicate.
  • a synchronization procedure (operation 2 above) in distributed data-parallel training of neural networks included transmitting all of the parameters such as weights or gradients of the backward pass to a master node, or to the other nodes.
  • the need to transmit the FC parameters to other nodes is avoided.
  • compute and gradient memory requirements of the different neural network layers are not balanced or the same.
  • (1) the amount of compute needed to execute (e.g. compute weights for, during training) the FC layer is low compared to other layers such as a convolution layer, and (2) the parameter memory requirement of the FC layer is relatively high (e.g. parameters for each FC node having links from each prior layer node must be stored), while for convolution layers memory requirements may be low (since typically convolution layer neurons are less connected to the layer inputting to the convolution layer compared to an FC layer).
  • the FC layer compute burden may be only 4% of the total CNN compute burden, while the parameter memory burden is 93% of the parameters for the NN.
  • While embodiments are described as applied to an FC layer of a CNN, other types of layers can be used, and other types of NNs can be used.
  • embodiments may be applied to the training of any CNN that has a final layer in which the ratio of compute to data size is very small, that is, there is little computation but a lot of parameter data to be transferred.
  • operations such as the following may be performed, typically for each node simulating forward/backward passes.
  • the FC layer is given as an example of a layer where losses may be transmitted instead of other parameters; embodiments may be used with other FC “style” layers that have a large number of weights and low compute costs.
  • FIG. 5 is a flowchart of a method of exchanging weights among nodes or processors according to embodiments of the present invention. While in one embodiment the operations of FIG. 5 are carried out using systems as shown in FIGS. 1 and 2 , in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 5 may be used with or combined with the embodiment shown in FIG. 3 . For example some or all of data such as parameters, weights, gradients, and/or loss data may, in an embodiment of FIG. 5 , be transmitted using an embodiment of FIG. 3 . Typically, embodiments of FIG. 5 achieve the most savings in data transmission when nodes are CPU-based. CPU systems may for example have advantages over GPU systems in memory size, which may be important as some embodiments of FIG. 5 require the storage of multiple sets of losses and gradients. However, embodiments of FIG. 5 may be used with systems where nodes are GPU-based. In one embodiment, for each processor or node i which is not a master:
  • the node or processor may receive training data and execute a forward-pass on the NN, which may generate a set of loss values, e.g. loss(es)_i. These may be termed, for each node, local losses: losses local to that node.
  • the processor or node may send or transmit loss(es)_i to other nodes executing a forward pass (e.g. non-master nodes). In other embodiments, such losses may be sent to a master node, which may perform the calculations discussed herein.
  • backpropagation or a backward pass may occur at the node or processor.
  • the node may execute a full backward pass for all layers using its own loss only (“local” losses), not including the other losses received, resulting in the new weights and gradients for all layers including the FC.
  • In some embodiments, local-loss gradients for the layers which will have losses from other nodes applied (e.g. in operation 550 ) are not applied to modify layer weights. Rather, these gradients are stored or saved, to be used later in operation 550 to modify the weights: this is because modification of a model using losses should typically be performed on the model which generated the losses, as opposed to a modified model.
  • the node may perform backpropagation using the loss values produced by the processor or node.
  • the node may receive the loss(es) of each other node.
  • the node has multiple sets of losses (one for each node in the system, including its own loss(es)).
  • this operation may be performed in an order different than implied; for example nodes sending and receiving losses may be done somewhat concurrently, and this may be performed while nodes are performing other processing, such as backpropagation.
  • the node may transmit or send parameters such as gradients or modified weights generated in operation 520 , apart from or excluding those for the FC layer (or a layer to be used with an embodiment of the present invention), to other nodes or to a master node, substantially as in a standard data-parallel approach. While in some embodiments, the backward pass results for layers such as a convolution layer still are transmitted, the number of parameter values for such layers may be small relative to those for an FC layer, and thus large savings in network traffic may result from not sending the FC layer parameters and sending only the losses.
  • Operations such as sending and receiving data may be performed at any suitable time, and do not have to be performed in the order implied in FIG. 5 .
  • the order of operations in flowcharts ( FIG. 5 and other flowcharts) in this application may be altered in certain embodiments if suitable. For example, transmitting losses may be performed after, or concurrently with, transmitting parameters; other suitable modifications may be implemented.
  • the node may perform backpropagation training on a limited subset of layers, e.g. at least one layer of the NN such as an FC layer possibly including layers from the FC to the output, using loss values received from a set of other nodes or processors, e.g. non-local losses.
  • Application of gradients to weights on such layers may also be performed using saved gradients from operation 520 .
  • the layer(s) on which backpropagation is performed using losses from other nodes have, associated with them, a larger number of weight values and a smaller compute burden than another layer in the NN, e.g. when compared to a convolution layer. Note that gradients have already been computed for this layer (and all layers) using the local losses in operation 520 .
  • the node may execute a backward pass for the higher layers, one after the other, down to and including the FC layer, but typically not beyond (e.g., below, towards the input) the FC layer, not continuing with the backpropagation for layers below (towards the input) and beyond the FC.
  • This backpropagation may occur individually and separately for each set of non-local losses received from another node, as typically the losses cannot be combined or averaged across different nodes.
  • the receiving processor may perform a separate backpropagation operation on the layers down to and including the FC layer.
  • one backward pass is done down to and including the FC layer but not beyond for the loss of each node other than the local node, the gradients—but not weight changes—resulting from the backward pass accumulating or being stored.
  • a model should be modified using losses generated for that model, and thus gradients should be applied to the un-modified model, rather than having gradients generated from a model modified from the model that generated the losses.
  • the gradients generated for the relevant layer in operation 520 using local losses, and the gradients generated in operation 550 based on non-local losses are accumulated, and then applied to the NN model stored by the node by for example being applied to the relevant weights, or averaged then applied to the weights.
  • Weights for all other layers may be updated based on weights received from the master or from other nodes.
  • parallelizing among threads the backpropagation of losses for all other parts of the model (e.g. backpropagation in one pass across nodes, then combining weights), except for the typically inexpensive FC layer and its typically inexpensive preceding layers, may allow loss calculation time to be reduced.
  • this node has the full FC backward pass result (or, in the case that the FC layer is not the final top-most layer, the full result for every layer from the output to and including the FC layer): each node has the same weights for the FC layer, as if a master node had averaged the FC layer weights and sent the weights to the nodes.
  • Such a technique may improve NN learning communication in that in some example NNs, the actual weights of the FC layer, which may be 90% of the NN weights, are never transmitted. Rather, only the loss values and the 10% of the weights (in one example) for the other layers are transmitted.
  • the node may receive parameters from a master node or other nodes, as a model update (e.g. their calculated average) or from other nodes (e.g. as individual weights to be averaged by the node) and may apply them to the NN being processed by the node, to update the model stored by the node.
  • the node may receive individual parameters such as weights for all other layers, apart from FC, or apart from the layers from the FC to the output inclusive.
  • the node may have new or updated weights: for all layers lower than the FC (towards the input), obtained from the master, and for layers above the FC (towards the output) and including FC from a locally performed loss based calculation.
  • weight updates for layers from the output through and including the FC layer may be computed by executing a backward pass for each set of loss values separately, while for layers between the FC layer and the input layer, weight updates are calculated by a master averaging locally computed weight values. This may decrease the communications burden while only slightly increasing the processing burden.
  • a NN used with embodiments of FIG. 5 includes at least one fully connected layer, at least one input layer, and at least one convolution layer, and possibly other layers; see, e.g., the example NN structures of FIGS. 1 and 6 .
  • other structures of NNs may be used with a process such as in FIG. 5 .
  • non-master or slave nodes may send losses to one central node, such as a master node, which may execute backpropagation for a selected subset of layers (e.g. layers from FC to output inclusive) for each loss set, integrate the results into the model or update the model (e.g. by applying each resulting gradient set to the model), and send the updated model back to other nodes.
  • This may be performed in conjunction with the master node receiving parameters regarding other layers such as weights or gradients and updating the model based on those other parameters: the NN model updated by the master using both loss data and parameters such as weights or gradients may be sent to the non-master nodes conducting training.
  • the backpropagation for those layers is typically independent for each loss set.
  • a loss set from node A may be applied to the model used to generate the losses to generate gradients
  • a loss set from node B may be applied to the model used to generate the losses, etc.
  • the multiple sets of gradients may be then applied to the weights from the model used to generate the losses, for the relevant layer.
  • operations such as: a node sending a loss or set of loss values; and the same node executing a backward pass (e.g. operation 530 , a backward pass based on “local” losses for the processing node only) or a portion of a backward pass, may be done in parallel, or concurrently.
  • Different cores within the same processor may be dedicated to different tasks. Improvements may result from tasks being done in parallel such as for example transmitting or receiving data, sorting, compression, portions of a backward or forward pass, etc.
  • nodes computing a forward pass may send their loss values to each other and each may compute FC gradients (e.g. gradients to be used to change weights inputting to neurons in an FC layer) and apply them to alter the FC weights, individually, which may allow for FC layer weights or gradients to not be transmitted; rather only weights or gradients for other layers are transmitted.
  • each node computes the total/aggregated FC gradient result which adds computation time to the node, but this is more than made up for with communications savings. For example, if there are N nodes in the distributed system, then the compute time added is FC_layer_compute time*N: savings are maximized when the FC_layer_compute time is small (relative to other layers) and the number of nodes in the system is small. However, savings may result from systems without such characteristics.
  • Such a system of transmitting losses instead of other parameters such as weights or gradients may be combined with the embodiments for improved compression using sorting, as discussed herein, which itself may result in a 3× reduction in communications.
  • the two techniques in combination may in some examples result in a 30× reduction of network traffic, in a lossless way.
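  • As a hedged, back-of-the-envelope illustration of the compute-versus-communication trade-off described above (every number below is a hypothetical stand-in, not a figure from the specification):

```python
# Every number below is a hypothetical stand-in, not a figure from the text.
fc_backward_ms = 1.0      # assumed cost of one FC-layer backward pass on one node
n_nodes = 8               # assumed number of training nodes in the system

# Per the formula above, each node adds roughly FC_layer_compute_time * N of
# compute per iteration (one FC backward pass per loss set), in exchange for
# never transmitting the FC weights or FC gradients over the network.
added_compute_ms = fc_backward_ms * n_nodes
print(f"~{added_compute_ms:.0f} ms extra compute per node per iteration")
```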
  • FIG. 6 is a diagram depicting a method of exchanging weights among processors according to embodiments of the present invention. While in one embodiment the operations of FIG. 6 are carried out using systems as shown in FIGS. 1 and 2 , in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 6 may be used with or combined with the embodiments shown in FIGS. 3 and/or 5 .
  • FIG. 6 shows an embodiment with two nodes 610 and 620 executing simulations, including models of a NN 612 and 622, one master node 600 including model information such as parameters 602 and training data 604, and a network 650 connecting nodes 600 , 610 and 620 .
  • An iteration may include Phase 1, the execution, and Phase 2, the parameter update.
  • In Phase 1, a master sends parameters and input data to the nodes, the nodes perform a forward pass, and then each node 610 and 620 transmits its loss value(s) (e.g. the forward pass result) to the other node of 610 and 620.
  • each node has one loss data set from each node (itself and other nodes), in this example two losses. Each node may use these losses to compute the final result for FC gradients locally by itself. Then, each node may continue to execute the rest of the backward pass in a way similar to the standard data-parallel approach: for example each node may send convolution weight gradients to master 600 , master 600 may sum convolution weight gradients of nodes 610 and 620 performing forward and backward passes, and may send the final result (e.g. a model) to nodes 610 and 620 .
  • an improvement may result from FC gradients not being transmitted over network 650 at any point in time, which has the potential to provide an order of magnitude reduction in network traffic in many CNNs (without any loss in accuracy).
  • NNs other than CNNs may be used, and while embodiments discuss treating an FC layer differently, embodiments of the present invention may perform local calculations for layers other than an FC layer.
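  • The following is a minimal, hypothetical sketch of one training iteration at a non-master node following the general flow of FIGS. 5 and 6, in which loss values rather than FC weights or FC gradients are exchanged; the stub functions stand in for a real NN framework and communication layer, so only the control flow is meant to be illustrative, not the claimed implementation itself:

```python
import numpy as np

# Stubs standing in for a real NN framework and communication layer; they only
# return dummy data so that the control flow below can be read (and executed)
# end to end. Array sizes are arbitrary.
def forward_pass(model, batch):                 # local forward pass -> loss values
    return np.random.randn(10)

def backward_all_layers(model, losses):         # full backward pass, local losses only
    return {"conv": np.random.randn(100), "fc": np.random.randn(1000)}

def backward_down_to_fc(model, losses):         # backward pass from output down to
    return np.random.randn(1000)                # and including the FC layer only

def exchange_losses(local_losses):              # send local losses, receive others'
    return [np.random.randn(10) for _ in range(3)]

def exchange_gradients(non_fc_grads):           # ordinary data-parallel exchange,
    return non_fc_grads                         # FC parameters excluded

def apply_gradients(model, fc, others):         # update the local copy of the model
    pass

def train_iteration(model, batch):
    local_losses = forward_pass(model, batch)
    remote_losses = exchange_losses(local_losses)        # small messages only

    grads = backward_all_layers(model, local_losses)
    fc_grads = [grads.pop("fc")]                          # hold back FC gradients

    averaged_other = exchange_gradients(grads)            # non-FC layers as usual

    for losses in remote_losses:                          # one FC-only backward pass
        fc_grads.append(backward_down_to_fc(model, losses))   # per non-local loss set

    # Accumulate the FC gradients and apply everything to the unmodified local
    # model; the FC weights themselves are never sent over the network.
    apply_gradients(model, fc=np.mean(fc_grads, axis=0), others=averaged_other)

train_iteration(model=None, batch=None)
```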
  • Embodiments of the present invention may improve prior NN training and inference by for example allowing for less expensive, more common or commodity equipment to be used. For example, an Ethernet or less expensive data link may be used, and CPU based machines may be used instead of GPU based machines.
  • While GPUs may be used with some embodiments of the present invention, typically GPUs are not as powerful as CPUs at performing certain algorithms such as compression, which involves some sequential tasks: typically, GPUs are better at massively parallel tasks than CPUs, and CPUs may outperform GPUs at sequential tasks. Thus GPUs may not be as powerful at performing the compression discussed herein, which may enable the use of less expensive network connections.
  • CPUs may be better than GPUs at interleaving, pipelining and complex parallel tasks which may be performed according to some embodiments of the present invention.
  • GPUs may lack the large memory size CPU machines have, which may lower the ability of GPU machines to buffer a large amount of data.
  • a node may receive, and buffer or store, a large amount of input training data to process, and may process such data in sequence.
  • a node may multitask or interleave tasks, for example, at the same time, performing a forward pass for one layer of input data (e.g., an input image) while sorting and/or compressing the parameter data for another layer.
  • While embodiments have been described in the context of NN learning, data processing in other contexts may make use of an embodiment of the sort-and-compress method as described herein. Embodiments are applicable to any system in which the relative order of the elements to be compressed does not change much from one iteration to the next. Thus embodiments may be applied to systems other than machine learning. For example, an embodiment may be used to transmit pixel data for images. A sort-and-compress or sort-and-ZIP algorithm may be applicable to any set of numbers that are generated during iterations.
  • Embodiments of the present invention may be applicable to any set of numbers generated during iterations of distributed or other training, such as floating point parameters or gradients, or integer parameters or gradients that may be a result of quantization, 8 bit representations, etc.
  • Embodiments of the invention may be applicable to NNs computed with any sort of nodes, e.g. CPUs, GPUs, or other types of processors. However, embodiments may be particularly useful with CPU based nodes, as sorting and compressing (e.g. sequential compression) may be easier to implement efficiently, or may execute faster, on a CPU.
  • a process may first quantize floating point parameters to integers, and then perform a sort-and-compress process as described herein.
  • the terms “plurality” and “a plurality” as used herein can include, for example, “multiple” or “two or more”.
  • the terms “plurality” or “a plurality” can be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like.
  • the term set when used herein can include one or more items.
  • the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.

Abstract

Systems and methods may make exchanging data in a neural network (NN) during training more efficient. Exchanging weights among a number of processors training a NN across iterations may include sorting generated weights, compressing the sorted weights, and transmitting the compressed sorted weights. On each Kth iteration a sort order of the sorted weights may be created and transmitted. Exchanging weights among processors training a NN may include executing a forward pass to produce a set of loss values for processors, transmitting loss values to other processors, and at each of the processors, performing backpropagation on at least one layer of the NN using loss values received from other processors.

Description

    RELATED APPLICATION DATA
  • This application claims benefit from U.S. provisional patent application 62/588,970, filed on Nov. 21, 2017 and entitled “A Lossless Compression-Based Method for Reducing Network Traffic in Distributed Training of Machine Learning Algorithms”, and this application claims benefit from U.S. provisional patent application 62/588,324, filed on Nov. 18, 2017 and entitled “A Method for Reducing Network Traffic for Distributed Training of Neural Networks with Fully Connected Layers”, each incorporated herein by reference in its entirety.
  • FIELD OF THE INVENTION
  • The invention relates generally to machine learning; specifically to training neural networks using distributed systems.
  • BACKGROUND
  • Neural networks (NN) or connectionist systems are computing systems inspired by biological computing systems, but operating using manufactured digital computing technology. NNs are made up of computing units typically called neurons (which are artificial neurons, as opposed to biological neurons) communicating with each other via connections, links or edges. In common NN implementations, the signal at the link between artificial neurons may be for example a real number, and the output of each neuron may be computed by a function of the (typically weighted) sum of its inputs, such as the ReLU rectifier function. NN links or edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Typically, NN neurons are divided or arranged into layers, where different layers may perform different kinds of transformations on their inputs and may have different patterns of connections with other layers. Typically, a higher or upper layer, or a layer “above” another layer, is a layer more towards the output layer, and a lower layer, preceding layer, or a layer “below” another layer, is a layer towards the input layer.
  • Such systems may learn to perform tasks by considering example input data, generally without being programmed with any task-specific rules, being presented with the correct output for the data, and self-correcting. During learning the NN may execute a forward-backward pass where in the forward pass the NN is presented with an input and produces an output, and in the backward pass (backpropagation) the NN is presented with the correct output, generates an error (e.g., a “loss”), and generates update gradients which are used to alter the weights at the links or edges.
  • Various types of NNs exist. For example, a convolutional neural network (CNN) is a deep, feed-forward network, which includes one or more convolutional layers, fully connected layers, and pooling layers. CNNs are particularly useful for visual and speech applications. Other NNs include for example long short-term memory (LSTM) networks.
  • In practice, a NN, or NN learning, is simulated by one or more computing nodes or cores, such as generic central processing units (CPUs, e.g. as embodied in personal computers) or graphics processing units (GPUs such as provided by Nvidia Corporation), which may be connected by a data network. A collection of such connected computers may be termed a pod, and computers used with NNs may be single socket (e.g. one main processor) or multi-socket (e.g. multiple processors in one machine, sharing some memory). One or more computing nodes may model a NN using known data structures. During inference, the trained NN may for example recognize or categorize images, perform speech processing, or perform other tasks.
  • A NN may be modeled as an abstract mathematical object, such as a function. A NN may be translated physically to CPU or GPU as for example a sequence of matrix operations where entries in the matrix represent neurons (e.g. artificial neurons connected by edges or links) and matrix functions represent functions of the NN.
  • During learning, the NN, or the computing nodes modeling the NN, may be presented with training data. For example, in an image recognition application, a NN may learn to identify images that contain cats by analyzing example images that have been manually labeled as “cat” or “not a cat” and using the results to identify cats in other images. The NN may do this without any prior knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead, during learning the NN automatically generates identifying characteristics from the learning material that it processes.
  • One method of training in a NN is data parallel learning, where (typically via a master node or core), the data or training sets are divided, and each core or node operates on the same NN, using forward and backward passes, on only a portion of the data independently, and after each forward/backward pass the nodes or cores exchange parameters (e.g. weights or gradients) with each other, or send them to the master, to come up with the right parameters for the iteration. For example, on each iteration, a master node may send one different image, or a set of images, and the same model of the NN, to each of four CPUs. Each CPU may execute a forward and backward pass over all layers of the model on its specific image, and send the resulting parameters to the master, which then creates an updated model from the parameters sent by all four CPUs. Each node or processor may at times store a different version (with different parameters) of the same NN.
  • When a node communicates its resulting weights over the network to other nodes after an iteration of training, a large amount of data may need to be sent. For example, in the data-parallel convolutional NN training approach, each node executes the full machine learning model on a subset of examples, so the number of parameters a node needs to communicate may be the same as the model size. For example, in case of AlexNet CNN, there may be 220 MB of parameters, and if 10 nodes operate on the data, 220 MB*10=2.2 GB of parameters are transferred in both directions over the network for each iteration. Network bottlenecks may slow the learning process. High bandwidth interconnections may be used to speed data transfer, but such systems are expensive compared to more low bandwidth networks, such as an Ethernet network.
  • In some NNs, a loss, inconsistency or error value may be calculated at the output or at an output layer, with possibly multiple loss values being created, e.g. one for each node in an output layer. The output layer or set of layers typically is or includes a fully connected (FC) layer, where each neuron in the layer accepts an input, edge or link from each neuron/output of a lower or preceding layer (e.g., a layer closer to the input). This fully connected layer is an example of a layer where the number of weights is high (because there may be a link between every input neuron and every output neuron) and yet the layer has a relatively low amount of compute, because the computation as a whole may be equivalent to a matrix multiply rather than a convolution. A loss for a network may represent the difference or inconsistency between the value or values output from the network, and the correct value/values that should be output given the data input to the NN. A loss value may be, for example, a negative log-likelihood or residual sum of squares, but may be computed in another manner. In NN learning, it is desired to minimize loss, and after receiving a loss the NN model may be updated by modifying weight values in the network using backpropagation.
  • SUMMARY
  • Systems and methods of the present invention may make exchanging data in a neural network (NN) during training more efficient. Exchanging weights among a number of processors training a NN across iterations may in some embodiments include sorting generated weights, compressing the sorted weights, and transmitting the compressed sorted weights. On each Kth iteration a sort order of the sorted weights may be created and transmitted. Embodiments may exchange weights among processors training a NN by executing a forward pass to produce a set of loss values for processors, transmitting loss values to other processors, and at each of the processors, performing backpropagation on at least one layer of the NN using loss values received from other processors.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Non-limiting examples of embodiments of the disclosure are described below with reference to figures attached hereto that are listed following this paragraph. Dimensions of features shown in the figures are chosen for convenience and clarity of presentation and are not necessarily shown to scale.
  • The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features and advantages thereof, can be understood by reference to the following detailed description when read with the accompanied drawings. Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like reference numerals indicate corresponding, analogous or similar elements, and in which:
  • FIG. 1A is a block diagram of a neural network according to an embodiment of the present invention.
  • FIG. 1B is a block diagram of a neural network according to an embodiment of the present invention.
  • FIG. 1C is a block diagram of a system for training a neural network according to an embodiment of the present invention.
  • FIG. 2 is a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.
  • FIG. 3 is a flowchart of a method according to embodiments of the present invention.
  • FIG. 4 depicts a prior art process for training using a multi-node system.
  • FIG. 5 is a flowchart of a method according to embodiments of the present invention.
  • FIG. 6 is a diagram depicting a method of exchanging weights among processors according to embodiments of the present invention.
  • It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn accurately or to scale. For example, the dimensions of some of the elements can be exaggerated relative to other elements for clarity, or several physical components can be included in one functional block or element.
  • DETAILED DESCRIPTION
  • In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention.
  • Embodiments of the invention include systems and methods that may reduce the amount of data communicated during the training process of NNs (e.g. convolutional neural networks, or other networks) using a system including multiple nodes such as CPUs connected via a relatively slow connection such as an Ethernet or similar inexpensive network. CPUs, if used, may contain multiple cores, so that certain tasks may be done in parallel or concurrently: for example transmitting or receiving data, sorting, compression, portions of a backward or forward pass, etc. However, embodiments of the invention are applicable to other, non-NN tasks, for transferring large amounts of data. While a CNN is discussed as an example NN used, embodiments of the invention may be used with other NNs, such as LSTMs. Further, while CPU based machines are discussed, GPUs or other types of processors may be used. Embodiments of the present invention may be used with pods, and single socket or multi-socket systems, or other types of systems.
  • Embodiments of the invention may take advantage of the computational properties of a NN such as a CNN to distribute the computation and thus reduce the overall communication. Loss values may be transmitted by nodes to a master node or other nodes, which may use the loss values to calculate gradients and/or weights to modify the model. The computation of these parameters may be relatively computationally easy, e.g., have a low computational burden relative to other layers, as in the case of an FC layer, where the computation per output node is a simple dot product of its weights. In contrast, in this same FC layer, the number of weights (and similarly the number of gradient values, one per weight) is high relative to convolutional layers since in an FC layer each node may receive a link or edge from each node in its input layer. This number is even larger when compared to the network's overall number of loss values, which is usually the number of outputs the NN has.
  • Some prior techniques use compression to reduce data size of data transmitted among nodes; however, such techniques achieve only a lossy reduction, e.g. reducing the granularity or accuracy of data on decompression. Such lossy compression might increase convergence time (e.g., where the NN converges to a state where the error of the calculations is small) or even preclude convergence at all. In some embodiments of the present invention, the computational properties of the weight distributions during NN training contribute to improving compression and distribution of the weights, and thus reduce the overall communication overheads, with, in some cases, no loss of accuracy (e.g. using lossless compression). The distribution or transmission of other parameters, such as loss values or gradients, may also be made more efficient.
  • FIG. 1A is a simplified block diagram of a NN according to an embodiment of the present invention; in typical use thousands of neurons and links are used. NN 1000 may input data as for example an input vector 1010 of values (representing, e.g. a photograph, voice recording, or any sort of data), and may produce an output of signals or values, for example output vector 1020. NN 1000 may have neurons arranged into layers 1030, each including neurons 1040 connected to other neurons by links or edges 1050. FIG. 1B is a block diagram of a neural network according to an embodiment of the present invention. NN 1100 may input data, for example image 1110 (e.g. an input vector, matrix or other data) and may produce an output of signals or values, for example output vector 1120, which may for example indicate the content of or a description of the image. Other input data may be analyzed. NN 1100 may in one example have layers 1130 (convolution), 1132 (pooling), 1134 (convolution), 1136 (pooling), and one or more output layers 1138, which may include for example an FC layer 1138A and a softmax layer 1138B. Each layer may include neurons connected to other neurons by links or edges. The NNs in FIGS. 1A and 1B are typically simulated, and represented as data, for example in a system such as shown in FIG. 1C, below.
  • A convolutional layer may apply a convolution operation to its input, passing its result to the next layer. The convolution operation may for example emulate the response of an individual neuron to visual stimuli, and may for example include neurons processing data only for its receptive field. A convolutional layer's parameters may include a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter may be convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the NN may learn filters that activate when they detect some specific type of feature at some spatial position in the input. Stacking the activation maps for all filters along the depth dimension may form the full output volume of the convolution layer. Every entry in the output volume for a convolutional layer can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
  • NNs used for classification tasks, e.g. classifying photographs into descriptions of the content, may produce, for each class i, an output z_i, sometimes called a logit, which may encode or represent the likelihood that a given example input should be classified to class i. Logits z_i, for each class i, (e.g., for image recognition dog, cat, llama, etc.) may be transformed into probabilities q_i by comparing each z_i to the other logits, in for example a softmax layer.
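  • For reference, a standard formulation (not specific to the claimed embodiments) of the softmax transformation mentioned above, mapping logits z_i to probabilities q_i, is:

$$ q_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$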
  • FIG. 1C is a block diagram of a system for training a neural network according to an embodiment of the present invention. Referring to FIG. 1C, a system of computing devices 1 may include computing nodes 10, 20 and 30, connected by one or more communications network(s) 40. Communications network 40 may be for example an Ethernet network, but may be one or more other communications networks. Node 10 may be a master node, distributing training data, collecting parameters and creating updated models, and nodes 20 and 30 may be training nodes, executing forward and backward passes on training data, sending parameters (e.g. weights for edges or links) to master node 10, and updating the nodes' internal representations of the NN after receiving data from the master node. In alternative embodiments, a training node (e.g. node 20 or 30) may function as a master node. In further embodiments, a fixed “master” node need not be used, and one or more training nodes may execute the functionality of a master node. Further, while only two training nodes are shown, other numbers of nodes may be used, for example 10 training nodes, 1,028 training nodes, or other numbers. Other numbers of master nodes may be used, for example an embodiment may include two master nodes and 16 training nodes, or 16 nodes total.
  • Master node 10 may include data 12, e.g., training sets (such as collections of images, audio files, etc.) and model data 14 representing a NN (e.g. data representing artificial neurons, links, weights, etc.) and including for example parameters such as weights, and possibly for example the arrangement of nodes, layers and edges. Each of nodes 10, 20 and 30 may model the same complete NN, including neurons, links, weights, etc. as the other nodes, but each of nodes 20 and 30 may train on a different set of data. Each node 20 and 30 may model the same NN as master node 10, and may include for example NN data 22 and 32. The NN may be for example a CNN, but may be another type of NN. For example, the NN modeled by NN data 22 and 32 may include an input layer 50, convolution layers 52 and 56, pool layers 54 and 58, a fully connected layer 60, and a softmax layer 62. Other numbers and types of layers may be used. The NN made of layers 50-62 may function and be simulated as is known in the art. A system such as shown in FIG. 1C may execute a trained NN at inference time, although at inference time such NNs may be executed by one processing node, e.g. a workstation, PC, server, etc.
  • Nodes may be for example CPU based systems (e.g. workstations, PCs), GPU based systems, or other systems. In one example embodiment, master node 10 is a CPU based system and training nodes may be other systems such as GPU based systems. Nodes 10, 20 and 30 may be or include structures such as those shown in FIG. 2. While in some embodiments a generic CPU (e.g. a workstation, a PC (personal computer), a multi-core system) is discussed as a node, embodiments of the invention may be used with other types of nodes, such as GPUs. Further, while example embodiments of the invention discuss a relatively simple, slow communications connection between nodes, such as an Ethernet, other networks or communications systems, such as relatively fast, expensive, and specially made systems, may be used.
  • FIG. 2 shows a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention. Computing device 100 may include a controller or processor 105 that may be or include, for example, one or more central processing unit processor(s) (CPU), one or more Graphics Processing Unit(s) (GPU or GPGPU), a chip or any suitable computing or computational device, an operating system 115, a memory 120, a storage 130, input devices 135 and output devices 140. Each of modules and equipment such as nodes 10, 20 and 30, and other equipment mentioned herein may be or include a computing device such as included in FIG. 2, although various units among these entities may be combined into one computing device.
  • Operating system 115 may be or may include any code segment designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 100, for example, scheduling execution of programs. Memory 120 may be or may include, for example, a Random Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Memory 120 may be or may include a plurality of, possibly different memory units. Memory 120 may store for example, instructions to carry out a method (e.g. code 125), and/or data such as user responses, interruptions, etc.
  • Executable code 125 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 125 may be executed by controller 105 possibly under control of operating system 115. For example, executable code 125 may when executed cause NN training, coordination of NN training tasks, NN execution or inference, etc. according to embodiments of the present invention. In some embodiments, more than one computing device 100 or components of device 100 may be used for multiple functions described herein. For the various modules and functions described herein, one or more computing devices 100 or components of computing device 100 may be used. Devices that include components similar or different to those included in computing device 100 may be used, and may be connected to a network and used as a system. One or more processor(s) 105 may be configured to carry out embodiments of the present invention by for example executing software or code. Storage 130 may be or may include, for example, a hard disk drive, a floppy disk drive, a Compact Disk (CD) drive, a CD-Recordable (CD-R) drive, a universal serial bus (USB) device or other suitable removable and/or fixed storage unit. Data such as instructions, code, NN model data, parameters, etc. may be stored in a storage 130 and may be loaded from storage 130 into a memory 120 where it may be processed by controller 105. In some embodiments, some of the components shown in FIG. 2 may be omitted.
  • Input devices 135 may be or may include for example a mouse, a keyboard, a touch screen or pad or any suitable input device. It will be recognized that any suitable number of input devices may be operatively connected to computing device 100 as shown by block 135. Output devices 140 may include one or more displays, speakers and/or any other suitable output devices. It will be recognized that any suitable number of output devices may be operatively connected to computing device 100 as shown by block 140. Any applicable input/output (I/O) devices may be connected to computing device 100, for example, a wired or wireless network interface card (NIC), a modem, printer or facsimile machine, a universal serial bus (USB) device or external hard drive may be included in input devices 135 and/or output devices 140.
  • Embodiments of the invention may include one or more article(s) (e.g. memory 120 or storage 130) such as a computer or processor non-transitory readable medium, or a computer or processor non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory, encoding, including or storing instructions, e.g., computer-executable instructions, which, when executed by a processor or controller, carry out methods disclosed herein.
  • In some NNs, during backpropagation, each neuron computes its own gradient for a link for the neuron, the gradient to be applied to adjust the weight of the link. (When discussed herein, a neuron taking action such as transmitting data, computing data, etc., may mean that a processor simulating the neuron performs a computation to simulate such action; e.g. a computing node simulating a number of neurons may perform the actual action that is ascribed to the neuron.) A node that is simulating neurons may collect weights or other parameters and transmit them to a master node. The master node may receive and collect parameters and construct a model based on these parameters: e.g. a master node may collect all weights from nodes, and for each link, average the weights to produce an updated weight for that link, the weight being a part of the updated model. Techniques other than averaging may be used. In data parallel learning, a number of nodes simulate forward/backward passes on the same NN at the same time using different data sets: the resulting changes in parameters, e.g. weights, are sent by each node to a master node which creates an updated model from the parameters and sends the model back to the nodes. In some embodiments of the present invention one node acts as both a node simulating neurons and also the master node for all nodes. Typically parameters such as weights are represented as floating point (e.g. 32 bit) numbers, but may be represented in other ways, such as integers or numbers represented by different numbers of bits.
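  • As a purely illustrative aside (not the claimed implementation), the per-link averaging a master node might perform can be sketched as follows, assuming each training node returns its weights as one flat array:

```python
import numpy as np

# Purely illustrative: a master node receives one flat vector of link weights
# from each training node and averages them element-wise (per link) to form
# the updated model for the next iteration. The received data is synthetic.
received = [np.random.randn(1000).astype(np.float32) for _ in range(4)]  # 4 nodes
updated_weights = np.mean(np.stack(received), axis=0)   # one averaged weight per link
```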
  • In embodiments of the present invention nodes may communicate parameters such as weights or other parameters with a master node or other nodes by first sorting or arranging them, for example according to the values of the parameters, and then applying a ZIP or a similar lossless compression algorithm (e.g., 7zip, or another suitable compression algorithm) to the sorted sequence. Sorting or arranging, e.g. in order of value, to place similar or same values adjacent to each other in an ordered list or vector, may allow for improved parameter compression, due to the nature of compression algorithms like ZIP. Sorted data is typically easier to compress than unsorted data because sequential values are in order when data is sorted so their non-negative differences can be encoded in place of the original values, and repeated values are all contiguous and can be encoded by including a count along with the first instance. After decompressing or unpacking the “zipped” or compressed data the sort order may be used to re-order the data to its proper order; thus in some embodiments a sort order is transmitted typically on the first iteration of training, and periodically on some but importantly not all successive iterations. In some embodiments, one iteration may include a forward and backward pass on a batch of multiple inputs, e.g. multiple images, at one time, after which the model may be updated. The distribution or transmission of other parameters, such as loss values or gradients may also be made more efficient by combining sorting with compression as taught herein; furthermore transmission among nodes other than a master node may take advantage of such methods. For example, non-master nodes may exchange loss data by arranging the data in order of value of individual loss parameters, compressing the data, transmitting, uncompressing, and rearranging according to a sort order.
  • The sort order itself, as a collection of data, does not compress well or at all, and thus transmitting the sort order with each compressed set of parameters would not typically result in bandwidth savings. Thus, in some embodiments a sorted order or order of the arranging is sent, which over time (as parameters change) becomes an approximate sorted order that still suffices to improve compression, allowing for lossless compression of parameters. In some embodiments, this approximate sorted order does not need to be sent with every set of parameters, and thus the cost of sending the sort order or order of the arranging may be amortized over many iterations of compressed data being sent. Typically, in each iteration in-between when the sorted order is created and sent, the parameters are arranged or sorted to be ordered as the order of the last sort order created, before compression or "Zipping", and not according to the actual values of the parameters at that time. Thus the parameters in the intermediate iterations (between when the sorted order was created) may be arranged in an order which does not match that of an actual order sorted by value. The quality of the compression may be determined by the extent to which the order of parameters reflects the true sort order of the data; in some embodiments the effectiveness is greatly helped if the order over consecutive sequences or iterations of values transmitted does not change by much. A typical property of the sequences of weights in consecutive training batches or iterations of a neural network trained using backward propagation and stochastic gradient descent is that the differences between the consecutive sequences are small, since they are determined by the gradients, which are typically small. Thus consecutive sequences of weights from iterations of backward propagation and stochastic gradient descent have small differences in their sort orders and small differences between their values, lending themselves to good compression even based on the sort order of preceding iterations.
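  • The effect described above can be illustrated with a small, hypothetical example using Python's zlib as a stand-in for any ZIP-style lossless compressor; the synthetic, quantized parameter array is illustrative only:

```python
import zlib
import numpy as np

# Illustrative only: the same values compress better with a ZIP-style (DEFLATE)
# compressor once sorted, because equal and near-equal values become adjacent.
# The synthetic array is quantized to 16-bit integers so that repeats occur,
# in the spirit of the quantized parameters mentioned elsewhere herein.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=100_000)
quantized = np.round(weights * 1000).astype(np.int16)    # illustrative quantization

unsorted_size = len(zlib.compress(quantized.tobytes()))
sorted_size = len(zlib.compress(np.sort(quantized).tobytes()))
print(unsorted_size, sorted_size)   # the sorted stream compresses substantially better
```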
  • In some embodiments the sort/compress/transmit sequence, where sorting by value and creating a sort order occurs only periodically, can be used in both directions (e.g. when the master sends parameters back to slave or "non-master" nodes) and also between slave nodes. The sort order may be an order that the sender (any node, master or slave node) creates and sends to the receiver, typically periodically. If the sort order is shared between two nodes, e.g. a master and slave node, and one node (e.g. the slave node) created it originally, the other node (e.g. the master node) need not create a sort order. However, any sender, master or slave node, may create a sort order if needed.
  • The typical pattern for distributed training of machine learning algorithms includes for example iterating or repeating:
  • 1) Each node simulating a set of NN neurons executes a forward-backward pass that calculates or generates new updated weights of links or edges.
  • 2) The system executes a parameter synchronization algorithm: for example nodes transmit parameters such as their newly calculated weights to a master node, or to other nodes. A master node may receive the parameters and update the model, e.g. by for each link averaging the weights received from nodes.
  • 3) Each node may receive a model and may update its parameters, e.g. update the weights for its links or edges.
  • 4) Repeat: the process may repeat starting with operation 1. The iteration may stop when for example a certain accuracy is achieved, after a certain number of runs or iterations, when training data is exhausted, or on other conditions.
  • In the second step, when a node needs to communicate its parameters over the network to other nodes, problems may arise. For example, in the case of a data-parallel convolutional neural network training approach, each node executes the full machine learning model on a subset of examples, and thus the number of parameters a node needs to communicate is the same as the model size. For example, as discussed, in case of the AlexNet CNN, there may be for example 220 MB in parameters, and thus in the example case of 10 nodes, 220 MB*10=2.2 GB of parameters are transferred over the network in both directions for each iteration. If in one case the time it takes to complete an iteration is approximately 30 ms, 2.2 GB must be transferred in 30 ms over the network to avoid any network bottlenecks. This requires a 2.2 GB/0.030=73 GB/sec network link, which is faster than most reasonably priced interconnections provide (much more expensive links may be available, but this requires expense and a specialized data link). This may prevent the system from scaling.
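  • The example arithmetic above can be reproduced directly; the figures are those given in the example, not measurements:

```python
# Reproducing the example arithmetic above (figures from the example, not measurements).
model_size_gb = 220 / 1000.0        # 220 MB of parameters, expressed in GB
nodes = 10
iteration_time_s = 0.030

data_per_iteration_gb = model_size_gb * nodes                 # ~2.2 GB each direction
required_link_gb_per_s = data_per_iteration_gb / iteration_time_s
print(f"{data_per_iteration_gb:.1f} GB per iteration, ~{required_link_gb_per_s:.0f} GB/sec link needed")
```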
  • In one embodiment, nodes may compress parameters such as the weights generated at each node during the distributed training, before they are sent over the network. In one embodiment:
  • Nodes may sort or order the weights of the links or edges. Sorting may be in any suitable order, e.g. low to high, or high to low. Sorting may be based on for example the value of the weights themselves.
  • Nodes may compress their sorted weights by using ZIP or a lossless compression algorithm.
  • Sorting and compressing parameters may work well since there may be many similar numbers in the sorted sequence among the parameters for each node, which reduces the overall entropy and allows ZIP type compression algorithms to compress well. However, the nodes that receive the sorted-and-compressed data should know the sort order in order to be able to access the data appropriately. Typically, without the addition of sorting, the sending and receiving nodes have a common understanding of the order of the weights being sent. For example, each edge, link or weight in the NN may be assigned an address, index or a place in an ordered list. For example, both the sending and receiving nodes understand the first weight is for edge X of the network. After sorting, a new understanding, a sort order, may be sent. Sort-order or arrangement order information may be for example a table, one column being a weight, edge or link number or index, and the other column being the order in the compressed data, for each weight, edge or link. Sort order or arrangement order information may be for example a list or vector (e.g. an ordered list of numbers), where for each ordered entry X the number indicates the place, position or order, in the compressed list, of the parameter numbered or indexed X. Other forms for a sort order may be used.
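  • One simple, hypothetical way to represent such a sort order is as an index vector (e.g. the output of numpy's argsort); the receiver applies the inverse permutation to restore the order both sides already agree on:

```python
import numpy as np

# Illustrative only: a sort order represented as an index vector (argsort output).
# The sender transmits the permutation only periodically; the receiver uses it to
# put decompressed values back into the order both sides already agree on.
weights = np.array([0.30, -0.10, 0.05, 0.20], dtype=np.float32)
order = np.argsort(weights)            # e.g. [1, 2, 3, 0]
sorted_weights = weights[order]        # what is actually compressed and transmitted

restored = np.empty_like(sorted_weights)
restored[order] = sorted_weights       # receiver inverts the permutation
assert np.array_equal(restored, weights)
```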
  • Typically, sort-order information does not compress well, and sorting itself may be computationally expensive. Transmitting the sort order may be expensive in terms of network bandwidth, and thus transmitting sort information with each compressed list may be expensive and may eliminate the advantages of compression. Thus, in one embodiment of the invention, a sort-order is not generated and sent for each iteration, but rather only periodically or once every K′th iteration (K being an integer greater than one), so that the same sort-order is used across multiple iterations, and the cost of communicating and/or calculating the sort-order is amortized over them. K may be fixed as part of the system design, or may change periodically, over time or from iteration to iteration based on for example certain conditions.
  • In some embodiments, as the NN learns and changes its weights, many if not most of the weights do not change by a large percentage from iteration to iteration of training. Thus the actual order, from high to low or low to high, of the weights changes from iteration to iteration, but not by much. Typically, gradients which are applied to edge or link weights to change the weights are small. For example, a gradient may be +/−0.0002.
  • Thus in one embodiment, a process may include:
  • For iteration X:
      • a. Each node computing weights or another relevant parameter sorts the weights or other parameters, recording or saving the sort order or index order of the sorted weights, for example in a vector. The node may save or store the sort order locally for future use, as discussed further.
      • b. Each node compresses the sorted weights for example using ZIP or another suitable compression technology, typically lossless, to produce compressed sorted weights.
      • c. Each node transmits or sends the compressed or Zipped weights to, for example a master node.
      • d. Each node transmits or sends its sort order to, for example a master node. The master node decompresses the parameters, and reorders or resorts the parameters to their original order, according to the last sort order or indexing received.
  • For iteration X+1 through X+K−1 (not a “create sort” iteration):
      • a. Each node computing weights or another relevant parameter places, sorts or orders the weights according to the sort order or indexing order of iteration X, or the sort order last created for that node (each node computing parameters typically has a different sort order). No new sort order is created; thus the sorting is a rearrangement according to a prior sort order, e.g. the last or most recent sort order, as opposed to a sorting based on the value of the weights themselves. Typically, at this point, the parameter list is not fully sorted by value, but it is "almost-sorted", according to the previously computed sort order, so that ZIP or another suitable compression algorithm can benefit from it.
      • b. Each node compresses the parameters ordered by its prior sort order (the “sorted parameters”).
      • c. Each node transmits the compressed parameters. The receiving node decompresses the parameters, and reorders the parameters to their original order, according to the last or most recent sort order or indexing received for the node that sent the parameters.
  • Iteration X+K is the same as iteration X.
  • For iteration X+1 through X+K−1 the process may be the same: arrange according to the same sort order as the previous iteration, compress, transmit. For iteration X+K, where K is a pre-set interval, such as 20 (causing a new sort to be created once every 20 iterations) the process may be the same as iteration X: sort, compress, transmit. Thus, the sort order may be created and transmitted only every K (e.g. 10, 20 or another suitable integer) iterations, so the cost of sending it will be amortized across K iterations. K can be variable rather than fixed. This works best as long as the sort order does not change much across iterations, which is typically the case for distributed machine learning where the parameters change slowly.
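  • A minimal, hypothetical sketch of the sender side of this schedule, assuming the parameters of interest arrive as one flat numpy array per iteration (the function and variable names are illustrative, not taken from the specification):

```python
import zlib
import numpy as np

K = 20                  # example refresh interval; K may be fixed or variable
_last_order = None      # sort order this node retains locally between refreshes

def pack_parameters(weights: np.ndarray, iteration: int):
    """Return (compressed_bytes, sort_order_or_None) for one training iteration.

    A fresh sort order is created, and returned for transmission, only every
    K-th iteration; in between, parameters are arranged by the previous order,
    so they are only approximately sorted but still compress well.
    """
    global _last_order
    if _last_order is None or iteration % K == 0:
        _last_order = np.argsort(weights)            # "create sort" iteration
        order_to_send = _last_order
    else:
        order_to_send = None                         # reuse the prior sort order
    arranged = weights[_last_order]
    return zlib.compress(arranged.tobytes()), order_to_send
```

The receiver would decompress each message and, using the most recently received sort order for that sender, restore the agreed-upon parameter order.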
  • In some embodiments, nodes may transmit parameters after computing parameters for each layer, and thus parameters may be sent multiple times for each backward pass. Further, a master node may update model parameters after receiving parameters for a certain layer, and transmit the parameters back to the nodes after this: for example a node may compute backpropagation of parameters of the next lower layer while it is updating the parameters of layers above this layer whose backpropagation has already ended. Thus in some embodiments a sequence of backpropagation may include nodes updating parameters for layer X; nodes transmitting parameters for layer X; nodes updating parameters for layer X+1, (higher and more towards output than layer X) while master computes model for layer X (concurrently, or simultaneously); master sending model to nodes for layer X; nodes sending parameters for layer X+1; etc. Other orders may be used. Further, in some embodiments nodes may complete a full backpropagation of all layers before sending parameters.
  • FIG. 3 is a flowchart of a method for exchanging or transmitting parameters such as weights according to embodiments of the present invention, while conducting training on a NN. While in one embodiment the operations of FIG. 3 are carried out using systems as shown in FIGS. 1 and 2, in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 3 may be used with or combined with the embodiment shown in FIG. 5.
  • In operation 300 a number of nodes, e.g. computing nodes, or processors executing or simulating a neural network, may receive training sets or data from one or more master nodes. For example a master node may send one image each to a number of nodes. The nodes may be for example processors representing a NN using data, the NN including for example artificial neurons connected by edges or links. Thus the NN may be “virtual” and no actual physical neurons, links, etc. may exist, existing rather as data used by the nodes.
  • In operation 310, each node may execute a forward pass on the training data received, to produce an output.
  • In operation 320, each node may execute a backward or backpropagation pass, comparing its output for a NN to the expected output for the training data used, and calculating parameters such as weights for links or edges, or other data. In some embodiments, during each iteration, all layers in the NN, or at least two layers, may have parameters generated. In some embodiments, after each computation of a layer's parameters, the sorting/reordering, compressing and transmitting operations may occur for that layer. In some embodiments, during each iteration the nodes or processors during the backward or backpropagation pass calculate or generate gradients for links and calculate or generate weights for the links based on or using the gradients. For example, gradients may be factors that adjust the values of the weights.
  • In operation 330, if the iteration is a “create sort” iteration or a periodically occurring “ordering” iteration, e.g. every Kth iteration (typically including the first iteration), where K is an integer, each node may sort or arrange parameters created in operation 320, for example according to the values of the parameters, to create sorted parameters, e.g. sorted weights. A sort order, order of arranging, ordering, or index may be created and possibly saved or stored, based on the sorting process. Each node may have a different locally-created sort order. For example, while sorting the parameters, the new position of each parameter (when compared to the position of the parameter before sorting) may be saved as a sort order. Typically, parameters exchanged in a NN system have some inherent order understood by all entities in the system, and the sort process changes that order. Sorting or arranging may be for example low to high, high to low, etc., according to the numerical value of the parameter. The period between when sorting is performed according to the values of the parameters, and a sort order is created, may vary from iteration or cycle to iteration or cycle, and thus K may vary.
  • In operation 340, if the iteration is not a periodically occurring “create sort” iteration or “ordering” iteration but rather an “in-between” iteration, no new sort order is created, and sorting or arranging is performed on the parameters created in operation 320 based on the last sort order or order of arranging created by this node or processor (each node may store a different “last” sort order). Thus the sorting performed in operation 340 may be a rearrangement or re-ordering of parameters according to a prior sort order (e.g. from the most recent “create sort” iteration, the last time operation 330 was performed), and the “sorted parameters” are not sorted according to some ranking of their own values, but rather are arranged according to a prior sort order.
  • In operation 350, the parameters sorted or rearranged in operations 330 and 340 may be compressed by a node, to produce compressed sorted parameters, e.g. compressed sorted weights, typically using lossless compression, although other compression methods may be used. For example, the parameters may be Zipped. As noted, the “compressed sorted parameters” may not be sorted according to their own order; rather they may be sorted or arranged according to a prior sort order. Typically, with an embodiment that sorts parameters before compressing, the data size savings is greatest when the parameters are weights, which typically have a similar order across iterations, as opposed to gradients, which often do not have a similar order across iterations. However, sorting and compressing may be performed with parameters other than weights, such as gradients, losses, etc.
  • In operation 360, each node may transmit or send its compressed parameters to a master node, or one or more other processors or nodes. If the iteration is a “create sort” iteration, e.g. every Kth iteration, the sort order, ordering, or index created in operation 330 may also be transmitted, for example with the compressed parameters.
  • In operation 370, a master node or processor may receive the parameters and create an updated model of the NN. In order to do so, the master may decompress the parameters, and place the parameters in the order according to the last sort order received. The parameters are typically re-ordered or re-sorted according to the last sort order received for the node that sent the parameters: thus the master node may maintain or store a different “last” sort order for each node sending it parameters. The master node reordering decompressed parameters to their original, proper order may be performed for data received from each node using a separately received sort order, as typically the sort order or indexing from each node is different.
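  • A matching receiver-side sketch, again illustrative only and using the same numpy/zlib assumptions as the sender sketch above: the receiving node keeps the most recent sort order obtained from each sender (last_order_by_node is a hypothetical per-sender store) and uses it to restore the decompressed parameters to their original order.

```python
import numpy as np
import zlib

last_order_by_node = {}    # node id -> most recent sort order received from that node

def receive_parameters(node_id, compressed, sort_order=None):
    """Decompress one node's parameters and restore their original order."""
    if sort_order is not None:                        # only sent on "create sort" iterations
        last_order_by_node[node_id] = np.asarray(sort_order)
    order = last_order_by_node[node_id]
    arranged = np.frombuffer(zlib.decompress(compressed), dtype=np.float32)
    restored = np.empty_like(arranged)
    restored[order] = arranged                        # undo arranged = weights[order]
    return restored
```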
  • In operation 380, the master node may send an updated model to the nodes performing the simulation, and the nodes may update their internal representation of the NN. The updated model may be sent using parameters sorted and compressed according to the relevant sort order.
  • In operation 390, the process may repeat or iterate, moving back to operation 300. The iteration may stop when, for example, a certain accuracy is achieved, after a certain number of runs or iterations, or on other conditions. Other or different operations may be used.
  • In some embodiments, a node receiving data (e.g. a master node) may use operations similar to operations 300-390 to transmit data to nodes, or non-master (e.g. “slave”) nodes may use such operations to transmit data. For example, a master node may use the sort order received from node A to transmit model parameters back to node A, by sorting the parameters according to the last sort order received from node A, then compressing the data. Node A then decompresses the received model data and uses the last sort order it created to sort the data back to its original order. Alternately, a master node may create its own sort order periodically. In some embodiments, parameters may be transmitted using a sort and compress method (e.g. to or from a master) after each layer has data computed, but such data may also be transmitted after a complete backward pass. In some embodiments, data transmitted using a sort and compress method may be sent from a node executing a forward/backward pass to another node executing a forward/backward pass. In some embodiments, data transmitted using a sort and compress method may include parameters other than weights: for example data may include gradient or loss data.
  • In some embodiments a node, typically when performing calculations relevant to an output layer (typically an FC layer), may, instead of using only the loss or error produced at that node to calculate weights or gradients for that layer, in addition use losses from other nodes, and may transmit or communicate its own losses or loss values to other nodes. One or more nodes receiving the losses may receive all losses from all nodes simulating a forward pass, and then compute, in series for the losses from each different node sending losses, a gradient and/or weight for each link or edge to the output layer. This may be in place of a master node receiving and averaging parameters for that particular layer. In one embodiment, once the gradients are computed, the gradients, or the final node weights after applying the gradients, may be averaged. The nodes receiving loss data may be a master node, or may be all nodes conducting a forward pass, in which case all such nodes perform the same calculations using the losses. Since in certain NNs the number of links to neurons in an FC output layer is orders of magnitude greater than the number of loss values for the output layer, this may reduce the amount of data to be communicated (which may allow for a less expensive communications link), in exchange, in some embodiments, for the modest cost of multiple nodes using the global loss values to compute weights or gradients for the model. Further, the computation for an FC layer, possibly involving a matrix multiplication, is typically less burdensome than for other layers such as a convolution layer, which may asymptotically involve as many as the square of the number of operations of the matrix multiply. Thus, in some embodiments, while a master node may compute new weights for the model for most layers by accepting weight values computed by nodes and, for example, averaging them, for an FC layer multiple nodes (or a master node) may compute the new weights (the weights after applying the gradients for the model) from the losses by performing the backpropagation computation step for the layer. This may lower the amount of data that is transmitted. This may be especially useful for a system using a small number of nodes, e.g. a pod of 16 or 32 nodes (other numbers of nodes may be used). A control-flow sketch of such a round follows.
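  • A minimal control-flow sketch of one such loss-exchange round follows; the helper callables (broadcast_losses, gather_all_losses, fc_backward, apply_fc_update) are hypothetical placeholders for a node's communication and backpropagation machinery, not functions defined by this description.

```python
def fc_update_round(local_losses, broadcast_losses, gather_all_losses,
                    fc_backward, apply_fc_update):
    """One FC-layer update round in which loss values, rather than FC weights
    or gradients, are exchanged between nodes."""
    broadcast_losses(local_losses)        # loss vectors are small compared to FC weight tensors
    all_losses = gather_all_losses()      # one loss set per node, including this node's own
    # One separate backward step per loss set; loss sets from different nodes
    # are not averaged before backpropagation.
    grads = [fc_backward(loss_set) for loss_set in all_losses]
    avg_grad = sum(grads) / len(grads)
    apply_fc_update(avg_grad)             # every node applies the same aggregate update
```

  Because every node (or the master) runs the same computation over the same collection of losses, each ends up with the same FC-layer weights without those weights ever crossing the network.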
  • In some embodiments, the layer or subset of layers on which backpropagation is performed using non-local losses has associated with the layer a large fraction of the total weights in the NN but a much smaller fraction of the weight compute burden in the NN, even when computing using non-local losses. Since compression may be considered a translation of data movement burden (e.g. network burden) to data compute burden, this may be considered analogous to compression, in that there is a reduction in data movement burden (fewer weights are moved) and an increase in computation burden (each node redundantly performs substantially similar loss-to-weight calculations). However, given the architecture of some systems, this may result in faster processing. A measure of the amount of parameter transmission or network burden may be the number of bytes sent, or the number of parameters sent. A measure of the amount of compute or processing burden may be the number of computer operations (e.g., machine operations) needed to compute gradients and weights during backpropagation. A layer may have a different amount or burden of computation than other layers, and a layer's transmission of parameters such as gradients or weights may have a different amount or burden for this transmission than other layers. In some embodiments, the “compute” ratio of the compute burden of the layer or layers on which backpropagation is performed using non-local losses to the compute burden of the other layers in the NN on which backpropagation is performed using local losses may be smaller than the “transmission” ratio of the data transmission burden of the layer or layers on which backpropagation is performed using non-local losses to the transmission burden of the other layers in the NN on which backpropagation is performed using local losses. Since the number or amount of weights for a layer is analogous to or a measure of its transmission burden, in some embodiments the ratio of the compute burden of the layer(s) on which backpropagation is performed with non-local losses to the compute burden for the other layers in the NN may be less than the ratio of the number of weights for the layer(s) on which backpropagation is performed to the number of weights for the other layers in the NN. In some embodiments the layer(s) on which backpropagation is performed using non-local losses have more weights than another layer, or than all the other layers in the NN (e.g. cumulatively). In some embodiments the layer(s) on which backpropagation is performed using non-local losses has associated with the layer(s) a larger amount of weight values and/or a smaller amount of weight compute burden than all other layers in the NN cumulatively—e.g. than all the values and burdens for the other layers combined.
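  • Stated symbolically (the notation is ours, not taken from this description): writing C for weight-compute burden, T for parameter-transmission burden and W for weight count, with subscript nl for the layer(s) backpropagated using non-local losses and loc for the remaining layers, the condition above is

```latex
\frac{C_{nl}}{C_{loc}} < \frac{T_{nl}}{T_{loc}},
\qquad\text{and, taking weight count as a proxy for transmission burden,}\qquad
\frac{C_{nl}}{C_{loc}} < \frac{W_{nl}}{W_{loc}}.
```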
  • FIG. 4 depicts a prior art process for training using a multi-node system, using two nodes 402 and 404 and one master node 400; other numbers of nodes may be used and a master may be part of a node performing NN simulation. Nodes 400, 402 and 404 may be connected by for example network 406 and may simulate a NN 410 including layers 411, 412, 413, 414, 415, 416 and 417. Master node 400 may store datasets 420, e.g., training data, and model data and parameters 422. Embodiments of the present invention may improve on the system of FIG. 4. Referring to FIG. 4, in some processes for distributed (multi-node) training of machine learning algorithms operations such as the following may be used:
  • 1) A master may send (operation 430) parameters or a model and input data to the nodes. Each node may execute (operation 432) a forward-backward pass that generates update gradients and weights.
  • 2) Nodes execute a weight synchronization algorithm, which may involve a parameter update. This may involve nodes sending parameters to one or more master nodes (operation 440). In some embodiments, a loss 460 may be generated, and convolution layers may generate parameters e.g. parameters 462, and an FC layer may generate parameters 464. One or more master nodes may accept parameters to update the model (operation 442), e.g. by averaging weights, and send the model back to the nodes; or this may involve each node receiving all other nodes' parameters, so that each node can update its parameters based on averaging weights from all other nodes' executions just as the master would have done. As a result, in the standard data-parallel approach, the full model may be transmitted by the nodes to the master over the network.
  • 3) Each node updates its parameters.
  • 4) Iteration repeats at operation (1).
  • In operation 2, the weight synchronization, nodes may need to communicate parameters such as weights or gradients over the network to other nodes. For example, in the data-parallel learning approach, each node executes the full machine learning model on a subset of examples, and thus the number of parameters a node needs to communicate is the same as the model size, which is a large amount of data to communicate. In the case, for example, of the AlexNet CNN, there may be for example 220 MBytes of parameters, so for 10 nodes, 220 MBytes*10=2.2 GBytes of parameters must be transferred over the network in both directions for each iteration. The time it takes to complete an iteration can be for example approximately 30 ms, so 2.2 GBytes must be transferred in 30 ms over the network to avoid any network bottlenecks. This requires a 2.2 GB/0.030=73 GB/sec network link, faster than the capabilities of most reasonably priced network links. This may prevent the system from scaling.
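  • The link-speed estimate above can be reproduced with a few lines; the figures are the examples given in the preceding paragraph, and the variable names are ours.

```python
model_size_gb = 0.220        # ~220 MBytes of AlexNet-scale parameters per node
nodes = 10
iteration_time_s = 0.030     # ~30 ms per training iteration

traffic_gb = model_size_gb * nodes                    # 2.2 GBytes exchanged per iteration
required_link_gb_s = traffic_gb / iteration_time_s    # ~73 GB/sec to avoid a network bottleneck
print(f"{traffic_gb:.1f} GB per iteration -> ~{required_link_gb_s:.0f} GB/sec link required")
```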
  • In prior art systems, a synchronization procedure (operation 2 above) in distributed data-parallel training of neural networks included transmitting all of the parameters such as weights or gradients of the backward pass to a master node, or to the other nodes. For example, in the case of a six layer CNN with the layers INPUT, CONV_1, POOL_1, CONV_2, POOL_2, FC and SOFTMAX, after a node finishes the forward-backward pass for its set of input examples, there are new parameters generated for CONV_1, CONV_2 and the FC layers. At this point, a node may send all of these parameters to one or more other nodes (or a master node).
  • In one embodiment, for the FC final or output layers of a NN such as a CNN, the need to transmit the FC parameters to other nodes is avoided. Typically, the compute and gradient memory requirements of the different neural network layers are not balanced or the same. For example, (1) the amount of compute needed to execute (e.g. compute weights for, during training) the FC layer is low compared to other layers such as a convolution layer, and (2) the parameter memory requirement of the FC layer is relatively high (e.g. parameters for each FC node having links from each prior layer node must be stored), while for convolution layers memory requirements may be low (since typically convolution layer neurons are less connected to the layer inputting to the convolution layer compared to an FC layer). For example, in one example of the AlexNet CNN, the FC layer compute burden may be only 4% of the total CNN compute burden, while the parameter memory burden is 93% of the parameters for the NN. While embodiments are described as applied to an FC layer of a CNN, other types of layers can be used, and other types of NNs can be used. For example, embodiments may be applied to the training of any CNN that has a final layer in which the ratio of compute to data size is very small, that is, there is little computation and a lot of parameter data to be transferred.
  • In one embodiment, operations such as the following may be performed, typically for each node simulating forward/backward passes. While in the following an FC layer is given as an example of a layer where losses may be transmitted instead of other parameters, other embodiments may be used with layers other than FC layers, such as FC “style” layers that have a large number of weights and low compute costs.
  • FIG. 5 is a flowchart of a method of exchanging weights among nodes or processors according to embodiments of the present invention. While in one embodiment the operations of FIG. 5 are carried out using systems as shown in FIGS. 1 and 2, in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 5 may be used with or combined with the embodiment shown in FIG. 3. For example some or all of data such as parameters, weights, gradients, and/or loss data may, in an embodiment of FIG. 5, be transmitted using an embodiment of FIG. 3. Typically, embodiments of FIG. 5 achieve the most savings in data transmission when nodes are CPU-based. CPU systems may for example have advantages over GPU systems in memory size, which may be important as some embodiments of FIG. 5 require the storage of multiple sets of losses and gradients. However, embodiments of FIG. 5 may be used with systems where nodes are GPU-based. In one embodiment, for each processor or node i which is not a master:
  • In operation 500, the node or processor may receive training data and execute a forward-pass on the NN, which may generate a set of loss values, e.g. loss(es)_i. These may be termed, for each node, local losses: losses local to that node.
  • In operation 510, the processor or node may send or transmit loss(es)_i to other nodes executing a forward pass (e.g. non-master nodes). In other embodiments, such losses may be sent to a master node, which may perform the calculations discussed herein.
  • In operation 520, backpropagation or a backward pass may occur at the node or processor. The node may execute a full backward pass for all layers using its own loss only (“local” losses), not including the other losses received, resulting in the new weights and gradients for all layers including the FC. Typically, during the full backpropagation pass using local losses, gradients for the layers which will have losses for other nodes applied (e.g. in operation 550) are not applied to modify layer weights. Rather, these gradients are stored or saved, to be used later in operation 550 to modify the weights: this is because modification of a model using losses should typically be performed on the model which generated the losses, as opposed to a modified model. In some embodiments, prior to performing backpropagation using loss values received from a set of other processors or nodes (e.g. operation 550), the node may perform backpropagation using the loss values produced by the processor or node.
  • In operation 530, the node may receive the loss(es) of each other node. At this point, the node has multiple sets of losses (one for each node in the system, including its own loss(es)). As with other operations, this operation may be performed in an order different than implied; for example nodes sending and receiving losses may be done somewhat concurrently, and this may be performed while nodes are performing other processing, such as backpropagation.
  • In operation 540 the node may transmit or send parameters such as gradients or modified weights generated in operation 520, apart from or excluding those for the FC layer (or a layer to be used with an embodiment of the present invention), to other nodes or to a master node, substantially as in a standard data-parallel approach. While in some embodiments, the backward pass results for layers such as a convolution layer still are transmitted, the number of parameter values for such layers may be small relative to those for an FC layer, and thus large savings in network traffic may result from not sending the FC layer parameters and sending only the losses.
  • Operations such as sending and receiving data may be performed at any suitable time, and do not have to be performed in the order implied in FIG. 5. The order of operations in flowcharts (FIG. 5 and other flowcharts) in this application may be altered in certain embodiments if suitable. For example, transmitting losses may be performed after, or concurrently with, transmitting parameters; other suitable modifications may be implemented.
  • In operation 550 the node may perform backpropagation training on a limited subset of layers, e.g. at least one layer of the NN such as an FC layer, possibly including layers from the FC to the output, using loss values received from a set of other nodes or processors, e.g. non-local losses. Application of gradients to weights on such layers may also be performed using saved gradients from operation 520. In some embodiments the layer(s) on which backpropagation is performed using losses from other nodes has associated with the layer a larger amount of weight values and a smaller amount of weight compute burden than another layer in the NN, e.g. when compared to a convolution layer. Note that gradients have already been computed for this layer (and all layers) using the local losses in operation 520.
  • For example, for the losses of each other node (“non-local” losses), apart from the local losses of the node, the node may execute a backward pass for higher layers down to, until, and including the FC layer, but typically not beyond (e.g., below, towards the input) the FC layer, one loss set after the other, not continuing with the backpropagation for layers below (towards the input) and beyond the FC. This backpropagation may occur individually and separately for each set of non-local losses received from another node, as typically the losses cannot be combined or averaged across different nodes. For example, for each other processor for which loss values are received, the receiving processor may perform a separate backpropagation operation on the layers down to and including the FC layer. Thus in one embodiment, in operation 550, one backward pass is done down to and including the FC layer, but not beyond, for the loss of each node other than the local node, the gradients—but not weight changes—resulting from the backward pass accumulating or being stored. Typically, a model should be modified using losses generated for that model, and thus gradients should be applied to the un-modified model, rather than having gradients generated from a model modified from the model that generated the losses. Thus the gradients generated for the relevant layer in operation 520 using local losses, and the gradients generated in operation 550 based on non-local losses, are accumulated, and then applied to the NN model stored by the node, for example by being applied to the relevant weights, or averaged and then applied to the weights; a brief sketch follows. Weights for all other layers may be updated based on weights received from the master or from other nodes. In some embodiments, the parallelization among threads of the backpropagation of losses for all other parts of the model—e.g. backpropagation in one pass across nodes, then combining weights—except for the typically inexpensive FC layer and its typically inexpensive preceding layers, may allow for loss calculation time to be reduced. At this point this node has the full FC backward pass result (or, in the case that the FC layer is not the final top-most layer, the full result for every layer from the output to and including the FC layer): each node has the same weights for the FC layer, as if a master node had averaged the FC layer weights and sent the weights to the nodes. Such a technique may improve NN learning communication in that, in some example NNs, the actual weights of the FC layer, which may be 90% of the NN weights, are never transmitted. Rather, only the loss and the 10% of the weights (in one example) for the other layers are transmitted.
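  • As an illustrative-only sketch of the bookkeeping just described (the names and the plain gradient-descent step are assumptions, not part of this description): the gradient saved from the local pass of operation 520 and one gradient per non-local loss set from operation 550 are accumulated first, and only then applied, so every gradient is computed against the unmodified model that produced the losses.

```python
import numpy as np

def apply_fc_gradients(fc_weights, local_grad, non_local_grads, lr=0.01):
    """Combine the saved local-loss gradient with one gradient per non-local
    loss set, then apply the result once to the unmodified FC weights."""
    all_grads = [local_grad] + list(non_local_grads)   # one entry per node's loss set
    avg_grad = np.mean(all_grads, axis=0)              # average (or simply sum) the accumulated gradients
    return fc_weights - lr * avg_grad                  # single update to the original weights
```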
  • In operation 560 the node may receive parameters from a master node (e.g. as a model update, such as a calculated average) or from other nodes (e.g. as individual weights to be averaged by the node), and may apply them to the NN being processed by the node, to update the model stored by the node. For example, the node may receive individual parameters such as weights for all other layers, apart from the FC, or apart from the layers from the FC to the output inclusive. At this point the node may have new or updated weights: for all layers lower than the FC (towards the input), obtained from the master, and for layers above the FC (towards the output) and including the FC, from a locally performed loss-based calculation. Thus improvement may be achieved in some embodiments in that, for weight updates for layers from the output through and including the FC layer, the node may execute a backward pass for all loss values separately, while for layers between the FC layer and the input layer, weight updates are calculated by a master averaging locally computed weight values. This may decrease communications burden while only slightly increasing processing burden.
  • The process may iterate again with operation 500. Typically, a NN used with embodiments of FIG. 5 includes at least one fully connected layer, at least one input layer, and at least one convolution layer, and possibly other layers; see, e.g., the example NN structures of FIGS. 1 and 6. However, other structures of NNs may be used with a process such as in FIG. 5.
  • In some embodiments, non-master or slave nodes may send losses to one central node, such as a master node, which may execute backpropagation for a selected subset of layers (e.g. layers from FC to output inclusive) for each loss set, integrate the results into the model or update the model (e.g. by applying each resulting gradient set to the model), and send the updated model back to other nodes. This may be performed in conjunction with the master node receiving parameters regarding other layers such as weights or gradients and updating the model based on those other parameters: the NN model updated by the master using both loss data and parameters such as weights or gradients may be sent to the non-master nodes conducting training. Whether the master or a number of slaves perform sequential backpropagation for certain layers using loss data from multiple nodes, the backpropagation for those layers is typically independent for each loss set. E.g. a loss set from node A may be applied to the model used to generate the losses to generate gradients, a loss set from node B may be applied to the model used to generate the losses, etc., and the multiple sets of gradients may be then applied to the weights from the model used to generate the losses, for the relevant layer.
  • In some embodiments, there is no accuracy loss in using embodiments of FIG. 5, since the algorithm is semantically the same as the prior art data-parallel forward-backward pass algorithm. In some embodiments, operations such as: a node sending a loss or set of loss values; and the same node executing a backward pass (e.g. operation 520, a backward pass based on “local” losses for the processing node only) or a portion of a backward pass, may be done in parallel, or concurrently. Different cores within the same processor may be dedicated to different tasks. Improvements may result from tasks being done in parallel such as for example transmitting or receiving data, sorting, compression, portions of a backward or forward pass, etc.
  • Communications improvements may result when nodes communicate their losses to other nodes, so that each node can have all of the losses and compute the total or aggregated FC gradients locally. In some embodiments, nodes computing a forward pass (typically slave nodes as opposed to master nodes) may send their loss values to each other and each may compute FC gradients (e.g. gradients to be used to change weights inputting to neurons in an FC layer) and apply them to alter the FC weights, individually, which may allow for FC layer weights or gradients to not be transmitted; rather only weights or gradients for other layers are transmitted. In prior art systems, weights or gradients for the FC layer are transmitted, which takes up a lot of network bandwidth. This may result in significant improvement to NN learning technology, as in some example systems, 90% of the weights of the NN may be for the FC layer. In another example, 93% of the weights are in the FC layer in one example of the AlexNet CNN. A dramatic reduction in overall communications may result. In some embodiments, each node computes the total/aggregated FC gradient result, which adds computation time to the node, but this is more than made up for with communications savings. For example, if there are N nodes in the distributed system, then the compute time added is FC_layer_compute_time*N: savings are maximized when the FC_layer_compute_time is small (relative to other layers) and the number of nodes in the system is small. However, savings may result from systems without such characteristics.
  • Such a system of transmitting losses instead of other parameters such as weights or gradients may be combined with the embodiments for improved compression using sorting, as discussed herein, which itself may result in a 3× reduction in communications. The two techniques in combination may in some examples result in a 30× reduction of network traffic, in a lossless way.
  • FIG. 6 is a diagram depicting a method of exchanging weights among processors according to embodiments of the present invention. While in one embodiment the operations of FIG. 6 are carried out using systems as shown in FIGS. 1 and 2, in other embodiments other systems and equipment can be used. Further, embodiments of the example operations of FIG. 6 may be used with or combined with the embodiments shown in FIGS. 3 and/or 5.
  • FIG. 6 shows an embodiment with two nodes 610 and 620 executing simulations, including models of a NN 612 and 622, one master node 600 including model information such as parameters 602 and training data 604, and a network 650 connecting nodes 600, 610 and 620. Other numbers of nodes may be used. An iteration may include Phase 1, the execution, and Phase 2, the parameter update. In Phase 1, a master sends parameters and input data to the nodes, the nodes perform a forward pass, and then each node 610 and 620 transmits its loss value(s) (e.g. the forward pass result) to the other node of 610 and 620. In Phase 2, each node has one loss data set from each node (itself and other nodes), in this example two losses. Each node may use these losses to compute the final result for FC gradients locally by itself. Then, each node may continue to execute the rest of the backward pass in a way similar to the standard data-parallel approach: for example each node may send convolution weight gradients to master 600, master 600 may sum convolution weight gradients of nodes 610 and 620 performing forward and backward passes, and may send the final result (e.g. a model) to nodes 610 and 620. In some embodiments, an improvement may result from FC gradients not being transmitted over network 650 at any point in time, which has the potential to provide an order of magnitude reduction in network traffic in many CNNs (without any loss in accuracy). NNs other than CNNs may be used, and while embodiments discuss treating an FC layer differently, embodiments of the present invention may perform local calculations for layers other than an FC layer.
  • In some prior art systems, most of the computation necessary to train or perform inference in neural networks is performed by specialized, massively parallel hardware devices, such as GPUs. Such devices may have thousands of relatively weak processing cores, specialized to perform “regular,” predictable computation, which follows exactly the same control flow pattern, such as massive matrix multiplications.
  • Embodiments of the present invention may improve prior NN training and inference by for example allowing for less expensive, more common or commodity equipment to be used. For example, an Ethernet or less expensive data link may be used, and CPU based machines may be used instead of GPU based machines. While GPUs may be used with some embodiments of the present invention, typically, GPUs are not as powerful as CPUs at performing certain algorithms such as compression, which involves some sequential tasks: typically, GPUs are better at massively parallel tasks than CPUs, and CPUs may outperform GPUs at sequential tasks. Thus GPUs may not be as powerful at performing compression as discussed herein, which may enable the use of less expensive network connections. Further, CPUs may be better than GPUs at interleaving, pipelining and complex parallel tasks which may be performed according to some embodiments of the present invention. GPUs may lack the large memory size CPU machines have, which may lower the ability of GPU machines to buffer a large amount of data. In some embodiments, a node may receive, and buffer or store, a large amount of input training data to process, and may process such data in sequence. In some embodiments, a node may multitask or interleave tasks, for example, at the same time, performing a forward pass for one layer of input data (e.g., an input image) while sorting and/or compressing the parameter data for another layer.
  • While embodiments have been described in the context of NN learning, data processing in other contexts may make use of an embodiment of the sort-and-compress method as described herein. Embodiments are applicable to any system in which the relative order of the elements to be compressed does not change much from one iteration to the next. Thus embodiments may be applied to systems other than machine learning. For example, an embodiment may be used to transmit pixel data for images. A sort-and-compress or sort-and-ZIP algorithm may be applicable to any set of numbers that are generated during iterations.
  • Embodiments of the present invention may be applicable to any set of numbers generated during iterations of distributed or other training, such as floating point parameters or gradients, or integer parameters or gradients that may be a result of quantization, 8 bit representations, etc.
  • Embodiments of the invention may be applicable to NNs computed with any sort of nodes, e.g. CPUs, GPUs, or other types of processors. However, embodiments may be particularly useful with CPU-based nodes, as sorting and compressing (e.g. sequential compression) may be easier to implement efficiently, or may execute faster, on a CPU.
  • In some embodiments, it is possible to use quantization, a known compression technique for gradients. For example, a process may first quantize floating point parameters to integers, and then perform a sort-and-compress process as described herein.
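  • A brief sketch of that combination follows; the symmetric 8-bit quantizer here is a generic illustration rather than a specific quantization scheme, and numpy/zlib are assumed as before.

```python
import numpy as np
import zlib

def quantize_sort_compress(params, sort_order):
    """Quantize float parameters to int8, then arrange by the prior sort order and compress."""
    scale = float(np.max(np.abs(params))) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.round(params / scale).astype(np.int8)            # floating point -> 8-bit integers
    return zlib.compress(q[sort_order].tobytes()), scale     # scale is needed to dequantize
```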
  • One skilled in the art will realize the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the invention described herein. Scope of the invention is thus indicated by the appended claims, rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
  • In the foregoing detailed description, numerous specific details are set forth in order to provide an understanding of the invention. However, it will be understood by those skilled in the art that the invention can be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention. Some features or elements described with respect to one embodiment can be combined with features or elements described with respect to other embodiments.
  • Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing,” “computing,” “calculating,” “determining,” “establishing”, “analyzing”, “checking”, or the like, can refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that can store instructions to perform operations and/or processes.
  • Although embodiments of the invention are not limited in this regard, the terms “plurality” and “a plurality” as used herein can include, for example, “multiple” or “two or more”. The terms “plurality” or “a plurality” can be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. The term set when used herein can include one or more items. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.

Claims (20)

What is claimed is:
1. A method of conducting training on a neural network (NN), the NN comprising neurons arranged into layers, the method comprising:
at each of a plurality of processors, executing a forward pass on the NN to produce a set of loss values for the processor;
at each of the plurality of the processors, transmitting the set of loss values to a set of processors of the plurality of processors; and
at each of the plurality of the processors, performing backpropagation on at least one layer of the NN using loss values received from a set of processors of the plurality of processors, wherein the ratio of compute burden of the at least one layer on which backpropagation is performed to the compute burden for the other layers in the NN is less than the ratio of the number of weights for the at least one layer on which backpropagation is performed to the number of weights for the other layers in the NN.
2. The method of claim 1, wherein the at least one layer on which backpropagation is performed has associated with the layer a larger amount of weight values than another layer in the NN.
3. The method of claim 1, wherein the at least one layer on which backpropagation is performed is a fully connected layer.
4. The method of claim 1, wherein the NN comprises at least one fully connected layer, at least one input layer, and at least one convolution layer.
5. The method of claim 1, comprising, at each of the plurality of processors, prior to performing backpropagation using loss values received from a set of processors, performing backpropagation using the loss values produced by the processor.
6. The method of claim 1, wherein performing backpropagation on at least one layer of the NN using loss values received from a set of processors of the plurality of processors comprises, for each processor for which loss values are received, performing a separate backpropagation operation.
7. The method of claim 1, wherein performing backpropagation on at least one layer of the NN using loss values received from a set of processors of the plurality of processors comprises performing backpropagation down to, including, but not beyond a fully connected layer.
8. The method of claim 1, comprising at each of a plurality of the processors receiving parameters from a master node and updating a NN model stored by the processor.
9. A method of conducting training on a model of a neural network (NN), the NN arranged into layers, the method comprising, at each of a plurality of nodes:
receiving training data to produce a set of local losses;
sending the losses to other nodes of the plurality of nodes;
for at least a first layer, performing backpropagation using local losses and losses received from other nodes of the plurality of nodes; and
for at least a second layer, performing backpropagation using local losses and not using losses received from other nodes of the plurality of nodes, wherein the ratio of compute burden of the at least one layer on which backpropagation is performed to the compute burden for the other layers in the NN is less than the ratio of the number of weights for the at least one layer on which backpropagation is performed to the number of weights for the other layers in the NN.
10. The method of claim 9, wherein the first layer is a fully connected layer.
11. The method of claim 9, wherein the NN comprises at least one fully connected layer, at least one input layer, and at least one convolution layer.
12. The method of claim 9, wherein performing backpropagation using losses received from other nodes comprises, for each node for which losses are received, performing a separate backpropagation operation.
13. The method of claim 9, comprising at each of a plurality of nodes receiving parameters from a master node and updating the NN model.
14. A system conducting training on a neural network (NN), the NN comprising neurons arranged into layers, the system comprising:
a plurality of nodes, each node comprising a memory and a processor configured to:
execute a forward pass on the NN to produce a set of loss values for the node;
transmit the set of loss values to a set of nodes of the plurality of nodes; and
perform backpropagation on at least one layer of the NN using loss values received from a set of nodes of the plurality of nodes.
15. The system of claim 14, wherein the at least one layer on which backpropagation is performed has associated with the layer a larger amount of weight values and a smaller amount of weight compute burden than all other layers in the NN cumulatively.
16. The system of claim 14, wherein the at least one layer on which backpropagation is performed is a fully connected layer.
17. The system of claim 14, wherein the NN comprises at least one fully connected layer, at least one input layer, and at least one convolution layer.
18. The system of claim 14, wherein at each node the processor is configured to, prior to performing backpropagation using loss values received from a set of nodes, perform backpropagation using the loss values produced by the processor.
19. The system of claim 14, wherein performing backpropagation on at least one layer of the NN using loss values received from a set of nodes comprises, for each node for which loss values are received, performing a separate backpropagation operation.
20. A method of conducting training of a neural network (NN), the NN arranged into layers and represented as a NN model, the method comprising:
at each of a plurality of non-master nodes:
receiving training data to produce a set of losses,
sending the losses to a master node;
at the master node:
for a subset of the NN layers, performing backpropagation using the losses;
updating the NN model; and
transmitting the NN model to the non-master nodes.
US16/192,924 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms Abandoned US20190156214A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/192,924 US20190156214A1 (en) 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762588324P 2017-11-18 2017-11-18
US201762588970P 2017-11-21 2017-11-21
US16/192,924 US20190156214A1 (en) 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms

Publications (1)

Publication Number Publication Date
US20190156214A1 true US20190156214A1 (en) 2019-05-23

Family

ID=66533132

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/192,924 Abandoned US20190156214A1 (en) 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms
US16/193,051 Active 2042-03-19 US11715287B2 (en) 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms

Family Applications After (1)

Application Number Title Priority Date Filing Date
US16/193,051 Active 2042-03-19 US11715287B2 (en) 2017-11-18 2018-11-16 Systems and methods for exchange of data in distributed training of machine learning algorithms

Country Status (1)

Country Link
US (2) US20190156214A1 (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781456A (en) * 2019-09-27 2020-02-11 上海麦克风文化传媒有限公司 Sorting weight updating method
US10832133B2 (en) 2018-05-31 2020-11-10 Neuralmagic Inc. System and method of executing neural networks
US10902318B2 (en) 2017-11-06 2021-01-26 Neuralmagic Inc. Methods and systems for improved transforms in convolutional neural networks
US10963787B2 (en) 2018-05-31 2021-03-30 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
WO2021082681A1 (en) * 2019-10-29 2021-05-06 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
US11195095B2 (en) 2019-08-08 2021-12-07 Neuralmagic Inc. System and method of accelerating execution of a neural network
US11216732B2 (en) 2018-05-31 2022-01-04 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
US11295206B2 (en) 2020-02-07 2022-04-05 Google Llc Interleaving memory requests to accelerate memory accesses
WO2022133725A1 (en) * 2020-12-22 2022-06-30 Orange Improved distributed training of graph-embedding neural networks
US11449363B2 (en) 2018-05-31 2022-09-20 Neuralmagic Inc. Systems and methods for improved neural network execution
US11544559B2 (en) 2019-01-08 2023-01-03 Neuralmagic Inc. System and method for executing convolution in a neural network
US11556757B1 (en) 2020-12-10 2023-01-17 Neuralmagic Ltd. System and method of executing deep tensor columns in neural networks
US11636343B2 (en) 2018-10-01 2023-04-25 Neuralmagic Inc. Systems and methods for neural network pruning with accuracy preservation
US11715287B2 (en) 2017-11-18 2023-08-01 Neuralmagic Inc. Systems and methods for exchange of data in distributed training of machine learning algorithms
US11960982B1 (en) 2021-10-21 2024-04-16 Neuralmagic, Inc. System and method of determining and executing deep tensor columns in neural networks

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672663B2 (en) 2016-10-07 2020-06-02 Xcelsis Corporation 3D chip sharing power circuit
KR102647767B1 (en) 2016-10-07 2024-03-13 엑셀시스 코포레이션 Direct-bonded native interconnects and active base die
US10672745B2 (en) 2016-10-07 2020-06-02 Xcelsis Corporation 3D processor
US10580757B2 (en) 2016-10-07 2020-03-03 Xcelsis Corporation Face-to-face mounted IC dies with orthogonal top interconnect layers
US11176450B2 (en) * 2017-08-03 2021-11-16 Xcelsis Corporation Three dimensional circuit implementing machine trained network
US10580735B2 (en) 2016-10-07 2020-03-03 Xcelsis Corporation Stacked IC structure with system level wiring on multiple sides of the IC die
US11037330B2 (en) * 2017-04-08 2021-06-15 Intel Corporation Low rank matrix compression
US11630994B2 (en) * 2018-02-17 2023-04-18 Advanced Micro Devices, Inc. Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
US11500842B2 (en) * 2018-12-14 2022-11-15 Sap Se Adaptive compression optimization for effective pruning
US20210065038A1 (en) * 2019-08-26 2021-03-04 Visa International Service Association Method, System, and Computer Program Product for Maintaining Model State
EP4035096A4 (en) * 2019-09-23 2023-07-19 Presagen Pty Ltd Decentralised artificial intelligence (ai)/machine learning training system
CN111091180B (en) * 2019-12-09 2023-03-10 腾讯科技(深圳)有限公司 Model training method and related device
US20210295168A1 (en) * 2020-03-23 2021-09-23 Amazon Technologies, Inc. Gradient compression for distributed training
CN111539519A (en) * 2020-04-30 2020-08-14 成都成信高科信息技术有限公司 Convolutional neural network training engine method and system for mass data
CN112200301B (en) * 2020-09-18 2024-04-09 星宸科技股份有限公司 Convolution computing device and method
CN117201198B (en) * 2023-11-07 2024-01-26 北京数盾信息科技有限公司 Distributed high-speed encryption computing method

Family Cites Families (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3334807B2 (en) 1991-07-25 2002-10-15 株式会社日立製作所 Pattern classification method and apparatus using neural network
US8131659B2 (en) 2008-09-25 2012-03-06 Microsoft Corporation Field-programmable gate array based accelerator system
US8583896B2 (en) 2009-11-13 2013-11-12 Nec Laboratories America, Inc. Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain
US8700552B2 (en) 2011-11-28 2014-04-15 Microsoft Corporation Exploiting sparseness in training deep neural networks
US9275334B2 (en) * 2012-04-06 2016-03-01 Applied Materials, Inc. Increasing signal to noise ratio for creation of generalized and robust prediction models
US9811775B2 (en) 2012-12-24 2017-11-07 Google Inc. Parallelizing neural networks during training
US9613001B2 (en) 2013-12-20 2017-04-04 Intel Corporation Processing device for performing convolution operations
US10540587B2 (en) * 2014-04-11 2020-01-21 Google Llc Parallelizing the training of convolutional neural networks
EP3146463B1 (en) 2014-05-23 2020-05-13 Ventana Medical Systems, Inc. Systems and methods for detection of biological structures and/or patterns in images
US10223333B2 (en) 2014-08-29 2019-03-05 Nvidia Corporation Performing multi-convolution operations in a parallel processing system
US9760538B2 (en) 2014-12-22 2017-09-12 Palo Alto Research Center Incorporated Computer-implemented system and method for efficient sparse matrix representation and processing
US10996959B2 (en) 2015-01-08 2021-05-04 Technion Research And Development Foundation Ltd. Hybrid processor
US20160239706A1 (en) 2015-02-13 2016-08-18 Qualcomm Incorporated Convolution matrix multiply with callback for deep tiling for deep convolutional neural networks
WO2016154440A1 (en) 2015-03-24 2016-09-29 Hrl Laboratories, Llc Sparse inference modules for deep learning
US9633306B2 (en) 2015-05-07 2017-04-25 Siemens Healthcare Gmbh Method and system for approximating deep neural networks for anatomical object detection
US9747546B2 (en) 2015-05-21 2017-08-29 Google Inc. Neural network processor
US10083395B2 (en) 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
US11423311B2 (en) 2015-06-04 2022-08-23 Samsung Electronics Co., Ltd. Automatic tuning of artificial neural networks
US20160379109A1 (en) 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
US9972063B2 (en) 2015-07-30 2018-05-15 International Business Machines Corporation Pipelined approach to fused kernels for optimization of machine learning workloads on graphical processing units
US20190073582A1 (en) 2015-09-23 2019-03-07 Yi Yang Apparatus and method for local quantization for convolutional neural networks (cnns)
US9904874B2 (en) 2015-11-05 2018-02-27 Microsoft Technology Licensing, Llc Hardware-efficient deep convolutional neural networks
US9558156B1 (en) 2015-11-24 2017-01-31 International Business Machines Corporation Sparse matrix multiplication using a single field programmable gate array module
US20170193361A1 (en) 2015-12-31 2017-07-06 Microsoft Technology Licensing, Llc Neural network training performance optimization framework
US11170294B2 (en) 2016-01-07 2021-11-09 Intel Corporation Hardware accelerated machine learning
US11055063B2 (en) 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US20170372202A1 (en) 2016-06-15 2017-12-28 Nvidia Corporation Tensor processing using low precision format
US10891538B2 (en) 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
US20190303743A1 (en) 2016-08-13 2019-10-03 Intel Corporation Apparatuses, methods, and systems for neural networks
US11887001B2 (en) 2016-09-26 2024-01-30 Intel Corporation Method and apparatus for reducing the parameter density of a deep neural network (DNN)
CN110073359B (en) 2016-10-04 2023-04-04 奇跃公司 Efficient data placement for convolutional neural networks
US10360163B2 (en) 2016-10-27 2019-07-23 Google Llc Exploiting input data sparsity in neural network compute units
US10157045B2 (en) 2016-11-17 2018-12-18 The Mathworks, Inc. Systems and methods for automatically generating code for deep learning systems
KR102224510B1 (en) 2016-12-09 2021-03-05 베이징 호라이즌 인포메이션 테크놀로지 컴퍼니 리미티드 Systems and methods for data management
JP6864224B2 (en) 2017-01-27 2021-04-28 富士通株式会社 Processor, information processing device and how the processor operates
US11086967B2 (en) 2017-03-01 2021-08-10 Texas Instruments Incorporated Implementing fundamental computational primitives using a matrix multiplication accelerator (MMA)
US10726514B2 (en) 2017-04-28 2020-07-28 Intel Corporation Compute optimizations for low precision machine learning operations
US10776699B2 (en) 2017-05-05 2020-09-15 Intel Corporation Optimized compute hardware for machine learning operations
US20180336468A1 (en) 2017-05-16 2018-11-22 Nec Laboratories America, Inc. Pruning filters for efficient convolutional neural networks for image recognition in surveillance applications
WO2018214913A1 (en) 2017-05-23 2018-11-29 上海寒武纪信息科技有限公司 Processing method and accelerating device
CN107832839B (en) 2017-10-31 2020-02-14 南京地平线机器人技术有限公司 Method and apparatus for performing operations in convolutional neural networks
WO2019090325A1 (en) 2017-11-06 2019-05-09 Neuralmagic, Inc. Methods and systems for improved transforms in convolutional neural networks
WO2019099899A1 (en) 2017-11-17 2019-05-23 Facebook, Inc. Analyzing spatially-sparse data based on submanifold sparse convolutional neural networks
US20190156214A1 (en) 2017-11-18 2019-05-23 Neuralmagic Inc. Systems and methods for exchange of data in distributed training of machine learning algorithms
KR102551277B1 (en) 2017-12-13 2023-07-04 한국전자통신연구원 System and method for merge-join
US10572568B2 (en) 2018-03-28 2020-02-25 Intel Corporation Accelerator for sparse-dense matrix multiplication
WO2019222150A1 (en) 2018-05-15 2019-11-21 Lightmatter, Inc. Algorithms for training neural networks with photonic hardware accelerators
US11449363B2 (en) 2018-05-31 2022-09-20 Neuralmagic Inc. Systems and methods for improved neural network execution
US10832133B2 (en) 2018-05-31 2020-11-10 Neuralmagic Inc. System and method of executing neural networks
US10963787B2 (en) 2018-05-31 2021-03-30 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
TW202013265A (en) 2018-06-04 2020-04-01 美商萊特美特股份有限公司 Methods for computing convolutions using programmable nanophotonics
US10599429B2 (en) 2018-06-08 2020-03-24 Intel Corporation Variable format, variable sparsity matrix multiplication instruction
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US20210201124A1 (en) 2018-08-27 2021-07-01 Neuralmagic Inc. Systems and methods for neural network convolutional layer matrix multiplication using cache memory
EP3847590A4 (en) 2018-09-07 2022-04-20 Intel Corporation Convolution over sparse and quantization neural networks
CN109460817B (en) 2018-09-11 2021-08-03 华中科技大学 Convolutional neural network on-chip learning system based on nonvolatile memory
US10719323B2 (en) 2018-09-27 2020-07-21 Intel Corporation Systems and methods for performing matrix compress and decompress instructions
US11636343B2 (en) 2018-10-01 2023-04-25 Neuralmagic Inc. Systems and methods for neural network pruning with accuracy preservation
US10713012B2 (en) 2018-10-15 2020-07-14 Intel Corporation Method and apparatus for efficient binary and ternary support in fused multiply-add (FMA) circuits
US11676003B2 (en) 2018-12-18 2023-06-13 Microsoft Technology Licensing, Llc Training neural network accelerators using mixed precision data formats
US11544559B2 (en) 2019-01-08 2023-01-03 Neuralmagic Inc. System and method for executing convolution in a neural network

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902318B2 (en) 2017-11-06 2021-01-26 Neuralmagic Inc. Methods and systems for improved transforms in convolutional neural networks
US11715287B2 (en) 2017-11-18 2023-08-01 Neuralmagic Inc. Systems and methods for exchange of data in distributed training of machine learning algorithms
US11449363B2 (en) 2018-05-31 2022-09-20 Neuralmagic Inc. Systems and methods for improved neural network execution
US10832133B2 (en) 2018-05-31 2020-11-10 Neuralmagic Inc. System and method of executing neural networks
US10915816B2 (en) 2018-05-31 2021-02-09 Neuralmagic Inc. System and method of executing neural networks
US10963787B2 (en) 2018-05-31 2021-03-30 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
US11216732B2 (en) 2018-05-31 2022-01-04 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
US11636343B2 (en) 2018-10-01 2023-04-25 Neuralmagic Inc. Systems and methods for neural network pruning with accuracy preservation
US11544559B2 (en) 2019-01-08 2023-01-03 Neuralmagic Inc. System and method for executing convolution in a neural network
US11195095B2 (en) 2019-08-08 2021-12-07 Neuralmagic Inc. System and method of accelerating execution of a neural network
US11797855B2 (en) 2019-08-08 2023-10-24 Neuralmagic, Inc. System and method of accelerating execution of a neural network
CN110781456A (en) * 2019-09-27 2020-02-11 上海麦克风文化传媒有限公司 Sorting weight updating method
WO2021082681A1 (en) * 2019-10-29 2021-05-06 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
US11295206B2 (en) 2020-02-07 2022-04-05 Google Llc Interleaving memory requests to accelerate memory accesses
US11928580B2 (en) 2020-02-07 2024-03-12 Google Llc Interleaving memory requests to accelerate memory accesses
US11556757B1 (en) 2020-12-10 2023-01-17 Neuralmagic Ltd. System and method of executing deep tensor columns in neural networks
WO2022133725A1 (en) * 2020-12-22 2022-06-30 Orange Improved distributed training of graph-embedding neural networks
US11960982B1 (en) 2021-10-21 2024-04-16 Neuralmagic, Inc. System and method of determining and executing deep tensor columns in neural networks

Also Published As

Publication number Publication date
US20190156215A1 (en) 2019-05-23
US11715287B2 (en) 2023-08-01

Similar Documents

Publication Title
US11715287B2 (en) Systems and methods for exchange of data in distributed training of machine learning algorithms
US10902318B2 (en) Methods and systems for improved transforms in convolutional neural networks
US20230140474A1 (en) Object recognition with reduced neural network weight precision
US10832133B2 (en) System and method of executing neural networks
JP6574503B2 (en) Machine learning method and apparatus
US10482380B2 (en) Conditional parallel processing in fully-connected neural networks
CN106062786B (en) Computing system for training neural networks
EP3924892A1 (en) Adjusting activation compression for neural network training
CN110309847B (en) Model compression method and device
JP6869676B2 (en) Information processing equipment, information processing methods and programs
Mundy et al. An efficient SpiNNaker implementation of the neural engineering framework
US11657284B2 (en) Neural network model apparatus and compressing method of neural network model
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN114127740A (en) Data parallelism in distributed training of artificial intelligence models
WO2022206717A1 (en) Model training method and apparatus
CN114127741A (en) Dynamic multi-tier execution for artificial intelligence modeling
Parthasarathy et al. DEFER: distributed edge inference for deep neural networks
CN114219076A (en) Quantum neural network training method and device, electronic device and medium
CN115860100A (en) Neural network model training method and device and computing equipment
Bharadwaj et al. Distributed-memory sparse kernels for machine learning
US11853391B1 (en) Distributed model training
Moe et al. Implementing spatio-temporal graph convolutional networks on graphcore ipus
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
JP7060719B2 (en) Methods, equipment, and related products for processing data
CN111382848A (en) Computing device and related product

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

AS Assignment

Owner name: NEURALMAGIC INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MATVEEV, ALEXANDER;SHAVIT, NIR;REEL/FRAME:048195/0237

Effective date: 20190114

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION