CN111275173B - Neural network training method, device and equipment thereof - Google Patents


Info

Publication number
CN111275173B
CN111275173B (application CN202010089702.7A)
Authority
CN
China
Prior art keywords
processing unit
tensor
global
summation
neural network
Prior art date
Legal status
Active
Application number
CN202010089702.7A
Other languages
Chinese (zh)
Other versions
CN111275173A (en)
Inventor
朱亦博
江逸敏
蓝昶
郭传雄
Current Assignee
ByteDance Inc
Original Assignee
ByteDance Inc
Priority date
Filing date
Publication date
Application filed by ByteDance Inc filed Critical ByteDance Inc
Priority to CN202010089702.7A priority Critical patent/CN111275173B/en
Publication of CN111275173A publication Critical patent/CN111275173A/en
Application granted granted Critical
Publication of CN111275173B publication Critical patent/CN111275173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present disclosure provides a neural network training method, apparatus, device, and non-transitory computer-readable storage medium. The method comprises: generating, by a first processing unit, a tensor related to the neural network; summing, by a second processing unit, the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, a parameter update of the neural network based on the global tensor sum obtained by the second processing unit.

Description

Neural network training method, device and equipment thereof
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to neural network training methods, apparatus, devices, and non-transitory computer-readable storage media.
Background
Training deep neural networks (Deep Neural Network, DNN) is important for running many modern services. Numerous DNN models may be trained for applications including computer vision (CV), natural language processing (NLP), and the like. The computation required to train the latest models has increased tremendously over the past few years. For example, training BERT (one of the most popular NLP models) requires 16 days on four cloud TPUs; even longer training times are required if training is performed on a single-GPU machine. In order to complete training in a reasonable time, many GPUs must be used and training must be performed in a distributed fashion.
The most common distributed training technique today is data parallelism, where each processing unit (e.g., GPU) keeps a local copy of the entire model, trains on different data, and shares information (knowledge), typically in the form of gradients, with the others either synchronously or asynchronously. Currently, there are two main architectures for DNN training: all-reduce and PS (Parameter Server). All-reduce trains using graphics processing units (GPUs), while PS trains using both GPUs and central processing units (CPUs). In recent years all-reduce has become more popular because it outperforms all existing PS implementations at lower hardware cost. However, all-reduce has a limited range of applications because it does not support asynchronous training. Furthermore, even with all-reduce, training performance often scales far from linearly due to communication overhead. PS, while supporting asynchronous training, has the disadvantages of low performance, lack of cross-framework support, and high hardware cost.
Disclosure of Invention
The present disclosure provides a neural network training method, apparatus, device, and non-transitory computer-readable storage medium.
According to an aspect of the present disclosure, there is provided a neural network training method, including: generating, by a first processing unit, a tensor related to the neural network; summing, by a second processing unit, the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, a parameter update of the neural network based on the global tensor sum, wherein the first processing unit is different from the second processing unit.
According to another aspect of the present disclosure, there is provided an apparatus for neural network training, the apparatus comprising: a tensor generation module that generates a tensor related to the neural network; a tensor summation module that performs tensor summation on the tensors generated by the tensor generation module to obtain a global tensor sum; and a parameter updating module that performs a parameter update of the neural network based on the global tensor sum, wherein the tensor generation module and the parameter updating module are on a first processing unit and the tensor summation module is on a second processing unit different from the first processing unit.
According to yet another aspect of the present disclosure, there is provided an apparatus for neural network training, the apparatus comprising a processor and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to implement a neural network training method according to the present disclosure, wherein the processor comprises a first processing unit and a second processing unit different from the first processing unit.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable medium having stored thereon computer instructions which, when executed by a computer, perform a neural network training method according to the present disclosure.
As will be described in detail below, the present disclosure proposes a new neural network training architecture, and methods, apparatus, devices, and non-transitory computer-readable media thereof. Compared with the existing neural network training architecture all-reduce and PS, the method has the advantages of optimal performance, support of cross-frame and asynchronous training at the same time and relatively low hardware cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the technology claimed and are not intended to limit the technical concepts of the present disclosure.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. Like reference numerals refer to like elements throughout the drawings. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIGS. 1A and 1B illustrate component arrangements of existing DNN training architectures all-reduce and PS;
FIG. 2 shows the end-to-end training speed when parameter updates are performed on the PS using different devices;
FIG. 3 illustrates a neural network training architecture, according to some embodiments of the present disclosure;
FIG. 4 illustrates a flow chart of a neural network training method, according to some embodiments of the present disclosure;
FIG. 5 illustrates a tree topology of a neural network training architecture, according to some embodiments of the present disclosure;
FIG. 6A illustrates a flowchart of performing tensor summation by directly replicating a second processing unit and communicating a global tensor summation to a first processing unit according to some embodiments of the present disclosure;
FIG. 6B illustrates a flowchart of performing tensor summation by a topology aware second processing unit and communicating a global tensor summation to a first processing unit according to some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of performing a neural network training method according to some embodiments of the present disclosure through a multi-stage pipeline, according to some embodiments of the present disclosure;
FIG. 8A illustrates a flowchart of a neural network training method performed by synchronous training according to some embodiments of the present disclosure;
FIG. 8B illustrates a workflow of performing a neural network training method according to some embodiments of the present disclosure through synchronous training, according to some embodiments of the present disclosure;
FIG. 9A illustrates a flowchart of a neural network training method performed by asynchronous training according to some embodiments of the present disclosure;
FIG. 9B illustrates a workflow of performing a neural network training method according to some embodiments of the present disclosure through asynchronous training, according to some embodiments of the present disclosure;
FIGS. 10A-10D illustrate results of performance evaluation of a neural network training method on a public cloud with a 20Gbps TCP/IP network and NVLink-supporting VMs, according to an embodiment of the present disclosure;
FIGS. 11A and 11B illustrate the benefits of neural network training methods using topology aware summation according to some embodiments of the present disclosure;
FIGS. 12A-12D illustrate results of performance evaluations of neural network training methods on RDMA clusters (clusters) with different network bandwidths, according to some embodiments of the present disclosure;
FIGS. 13A and 13B illustrate convergence performance of a neural network training method according to some embodiments of the present disclosure performed by asynchronous training;
FIG. 14 is a schematic diagram of an apparatus for neural network training, according to some embodiments of the present disclosure;
FIG. 15 is a schematic diagram of an apparatus for neural network training, according to some embodiments of the present disclosure;
FIG. 16 is another schematic diagram of an apparatus for neural network training, according to some embodiments of the present disclosure; and
fig. 17 is a schematic diagram of a non-transitory computer-readable storage medium for neural network training, according to some embodiments of the disclosure.
Detailed Description
In order for those skilled in the art to better understand the new neural network training architecture and neural network training methods thereof according to some embodiments of the present disclosure, the present disclosure will briefly introduce DNN training and two existing distributed training architectures, all-reduce and PS, and their limitations.
1. Distributed DNN training
DNN training: DNN models are typically composed of multiple layers with many parameters. Performing DNN training on a single processing unit (e.g., GPU) typically includes three steps: (1) forward propagation, which feeds a batch of training data into the model, propagates it through the dataflow graph, and calculates a loss function to evaluate the accuracy of the model; (2) backward propagation, which calculates the gradient of each parameter using the loss function; (3) parameter update, which uses the gradients to update the parameters through some optimizer (e.g., SGD (Stochastic Gradient Descent), Adam, etc.). Training a DNN iteratively refines the model parameters through the three steps described above until the loss function is minimized (i.e., the model converges).
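For concreteness, the following minimal sketch (not taken from the patent) illustrates the three steps on a single processing unit with a toy linear model; the model, data, loss, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)          # model parameters
lr = 0.01                        # assumed SGD learning rate

for step in range(100):
    x = rng.normal(size=(32, 10))            # a batch of training data
    y = x @ np.ones(10)                      # assumed ground-truth targets

    # (1) forward propagation: compute predictions and the loss
    pred = x @ w
    loss = np.mean((pred - y) ** 2)

    # (2) backward propagation: gradient of the loss w.r.t. the parameters
    grad = 2.0 * x.T @ (pred - y) / len(y)

    # (3) parameter update: apply an optimizer (plain SGD here)
    w -= lr * grad
```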
Distributed DNN training with data parallelism: Because DNN training is very time consuming, the need to scale training continues to grow. One typical approach is called data parallelism, which divides the entire data set across multiple distributed computing devices (called "accelerators", which may be single GPUs, or single or multiple machines with multiple GPUs), each holding a complete copy of the model parameters. Since the data assigned to each accelerator is different, the gradients generated by backward propagation will also be different. Thus, data parallelism requires that all accelerators be synchronized during each training iteration. There are two main ways to perform this synchronization, namely the PS architecture and the all-reduce architecture. Distributed training with data parallelism mainly comprises five steps: (1) forward propagation, (2) backward propagation, (3) network transmission, (4) tensor summation across the distributed accelerators, and (5) model updating via an optimizer such as SGD.
Synchronous and asynchronous training: Distributed training may be performed in a synchronous or asynchronous manner. Synchronous training enforces a global barrier after each iteration, so all accelerators need to wait for each other until all gradients have been added to the global model. In some real-world environments, the training speeds of the accelerators may differ, and the overall performance of synchronous training is constrained by the slowest accelerator (i.e., the straggler). In this case, developers tend to use asynchronous training to alleviate performance bottlenecks. By eliminating the global barrier between accelerators, asynchronous training allows some accelerators to run faster than others, avoiding the effect of stragglers and potentially speeding up training.
2. Existing architecture and limitations thereof
2.1 all-reduce
The all-reduce architecture, which comes from the HPC community, sums the gradients of each accelerator in a collective fashion. The component arrangement of the all-reduce architecture is shown in fig. 1A, where (1) forward propagation, (2) backward propagation, (4) tensor summation, and (5) parameter update among the five steps of distributed DNN training are all performed by the GPU. In distributed DNN training with the all-reduce architecture, each accelerator updates its own parameters locally after reducing the gradients. Ring is the most popular all-reduce algorithm in DNN training. Assume that there are n accelerators. To all-reduce 1MB of data across these accelerators using the ring algorithm, each accelerator needs to transmit 2(n-1)/n × 1MB in total. The factor 2(n-1)/n has been proven to be optimal in flat topologies. In addition to the ring algorithm, other all-reduce algorithms may be used for hierarchical topologies. However, at the top level of a hierarchical topology, the traffic is still 2(m-1)/m, where m is the number of nodes at the top level.
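As a quick check of the traffic figure quoted above, the small helper below (the function name is an illustrative assumption) evaluates the 2(n-1)/n formula for a 1MB tensor on 8 accelerators.

```python
def ring_allreduce_traffic_mb(tensor_mb: float, n: int) -> float:
    """Each of n accelerators transmits 2*(n-1)/n times the tensor size."""
    return 2.0 * (n - 1) / n * tensor_mb

print(ring_allreduce_traffic_mb(1.0, 8))   # 1.75 MB per accelerator
```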
All-reduce does not support asynchronous training. By design, all-reduce primitives require the participation of all accelerators and inherently impose a global barrier. In contrast, the PS architecture does not have this problem, because the server maintains an up-to-date model and each accelerator only needs to synchronize with the server, not with the other accelerators.
Furthermore, the synchronous nature of all-reduce may lead to ordering problems and higher coordination overhead. Most DNN frameworks make the situation even worse because, for efficiency reasons, they do not enforce a fixed order of computation. Thus, different accelerators will generate gradients in different orders, and each all-reduce operation must wait for the others before starting. The HPC community has demonstrated the optimality of all-reduce algorithms.
2.2 Parameter Server
The Parameter Server (PS) architecture is an abstraction that includes two roles: accelerators and servers. An accelerator performs the computation and then pushes the gradients to the server. The server sums the gradients from the different accelerators and updates the parameters. Finally, the accelerator pulls the updated parameters from the server and then proceeds to the next iteration. The component arrangement of the PS architecture is shown in fig. 1B, where (1) forward propagation and (2) backward propagation, among the five steps of distributed DNN training, are performed by the accelerators (i.e., GPUs), while (4) tensor summation and (5) parameter updating are performed by the server (i.e., CPU).
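A minimal single-process sketch of this push/pull cycle is given below; the class, the method names, and the plain-SGD update are illustrative assumptions rather than any particular PS implementation.

```python
import numpy as np

class ParameterServer:
    """Holds the global parameters; sums pushed gradients and applies the optimizer."""
    def __init__(self, params, lr=0.01):
        self.params = params.copy()
        self.lr = lr

    def push(self, grads):
        # (4) sum the gradients from the accelerators and (5) update the parameters
        total = np.sum(grads, axis=0)
        self.params -= self.lr * total

    def pull(self):
        return self.params.copy()

server = ParameterServer(np.zeros(4))
# each accelerator runs (1) forward and (2) backward locally, then (3) transmits
# its gradient to the server (toy gradients are used here)
grads_from_accelerators = [np.ones(4), 2 * np.ones(4)]
server.push(grads_from_accelerators)
new_params = server.pull()       # accelerators fetch the updated parameters
```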
Parameter updates on CPU-based servers can result in poor performance. Although the parameter update is computationally lighter than forward and backward propagation, it still runs much slower on a CPU than on a GPU. This can be demonstrated with an experiment: only one accelerator and one server are used to train the VGG16 DNN. The accelerator and the server are connected over 100Gbps Ethernet, and each has a Tesla V100 GPU. The accelerator always uses the GPU for computation, while the server performs the parameter update using either 1) the entire CPU on the host (32 cores with hyper-threading and Intel MKL), or 2) a GPU. Three types of optimizers were tested: SGD, Momentum, and RMSProp. The results are shown in FIG. 2. As can be seen from FIG. 2, even when MKL is enabled, the parameter update on the CPU is significantly slower than the parameter update on just one GPU, resulting in a significant performance degradation. As the optimizer becomes more complex (from simple SGD to complex RMSProp), the performance gap becomes larger.
In the remainder of this document, reference is made solely to a CPU-based parameter server unless otherwise indicated. GPU-based PS is expensive because it requires an additional GPU. Therefore, GPU-based servers are less popular in practice.
In addition, a parameter server implementation cannot work across frameworks. The main reason is that the optimizers the servers run are complex and must be scheduled by the framework engine. For the same reason, when customizing an optimizer, a developer must program with a non-reusable, framework-specific syntax.
2.3 Common perceptions of all-reduce and PS
Regarding all-reduce and PS, one common notion is that they are the best architectures for distributed DNN training. The HPC community has been optimizing all-reduce for many years, so it is very mature in design and implementation. When asynchronous training is needed, all-reduce cannot be employed, and developers simply fall back to PS, which is the only choice today.
3. Neural network training architecture and neural network training method thereof
As described above, the current common notion regarding distributed DNN training is that all-reduce and PS are the best architectures for it. Based on this, related research has mainly focused on how to optimize the all-reduce and PS architectures.
Is this notion correct? Setting aside the conventional wisdom about distributed DNN training and studying the entire problem and design space of distributed neural network training, the inventors of the present disclosure surprisingly found that all-reduce and PS are not optimal in terms of system performance and versatility.
After further studying DNN clusters, the inventors of the present disclosure noticed that the main computation of distributed DNN training is done by the GPUs, while the CPUs are relatively idle. Based on this observation, the inventors of the present disclosure considered utilizing the spare CPUs to improve training performance. The question then becomes which computations can be placed on the CPU without the CPU becoming a performance bottleneck. After extensive analysis, computation, and simulation, the inventors of the present disclosure found that the CPU is suitable only for tensor summation.
The present invention has been made in view of the above. A neural network training architecture according to an embodiment of the present disclosure is depicted in fig. 3. Compared to the all-reduce architecture shown in fig. 1A, it can greatly improve the utilization efficiency of the PCIe and network bandwidth of the GPU accelerators, and it also supports asynchronous training. Compared to the PS shown in fig. 1B, it has higher performance, lower hardware cost, and supports cross-framework use.
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
First, referring to fig. 3, fig. 3 illustrates a new neural network training architecture (hereinafter also referred to as SS (Summation Server)) according to some embodiments of the present disclosure. In the neural network training architecture shown in fig. 3, according to some embodiments of the present disclosure, forward propagation, backward propagation, and parameter updating are performed by a first processing unit (e.g., GPU), while tensor summation is performed by a second processing unit (e.g., CPU), where the first processing unit is different from the second processing unit. In this way, the neural network training architecture according to some embodiments of the present disclosure avoids the performance bottleneck that exists in PS, while utilizing additional CPU resources compared to all-reduce. Furthermore, neural network training architectures according to some embodiments of the present disclosure may also employ scheduling algorithms to accelerate network communications.
Table 1 shows a comparison of a neural network training architecture and all-reduce and PS according to an embodiment of the present disclosure.
                              All-reduce    PS      Summation Server
Performance                   Good          Poor    Best (2x all-reduce)
Cross-framework               Yes           No      Yes
Supports asynchronous         No            Yes     Yes
Hardware cost                 Lowest        High    Low (<3% extra)
TABLE 1
As can be seen from Table 1, the new neural network training architecture according to embodiments of the present disclosure has the best performance and supports both asynchronous training and cross-framework use, compared with all-reduce and PS.
In the foregoing, a new neural network training architecture according to some embodiments of the present disclosure is described with reference to fig. 3. Hereinafter, the present disclosure will describe a workflow of the neural network training architecture shown in fig. 3 with reference to fig. 4 to 9B.
Fig. 4 illustrates a flow chart of a neural network training method, according to some embodiments of the present disclosure. Referring to fig. 4, a neural network training method according to some embodiments of the present disclosure includes steps S400, S402, and S404.
At step S400, a first processing unit (e.g., GPU) generates a tensor related to the neural network. In some embodiments, the first processing unit generates the tensor related to the neural network after forward propagation and backward propagation. In some embodiments (e.g., in synchronous training), the tensor may be the gradient of each parameter of the neural network calculated using the loss function. In other embodiments (e.g., in asynchronous training), the tensor may be the difference between the parameters of the neural network after and before an update. After that, the method proceeds to step S402. At step S402, a second processing unit (e.g., CPU) sums the tensors generated by the first processing units to obtain a global tensor sum. After the global tensor sum is obtained, it may be communicated to the first processing unit. In some embodiments, the global tensor sum may be communicated to the first processing unit by direct replication. However, other approaches are also applicable (e.g., the topology-aware approach described in detail below in connection with fig. 6B). After the global tensor sum is communicated to the first processing unit, the method proceeds to step S404. At step S404, the first processing unit performs a parameter update of the neural network based on the global tensor sum calculated by the second processing unit.
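The following single-process NumPy sketch walks through steps S400 to S404; simulating the "GPUs" and the "CPU" as ordinary Python objects, and the averaging-plus-SGD update rule, are illustrative assumptions.

```python
import numpy as np

def cpu_sum(tensors):
    # step S402: the second processing unit only performs tensor summation
    return np.sum(tensors, axis=0)

n_gpus, lr = 4, 0.01
params = [np.zeros(8) for _ in range(n_gpus)]            # one model replica per GPU

# step S400: each first processing unit produces a tensor (e.g., a gradient)
local_tensors = [np.full(8, i + 1.0) for i in range(n_gpus)]

global_sum = cpu_sum(local_tensors)                      # step S402

# step S404: every first processing unit updates its replica from the global sum
for i in range(n_gpus):
    params[i] -= lr * (global_sum / n_gpus)              # assumed averaging + SGD
```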
The neural network training method shown in fig. 4, according to some embodiments of the present disclosure, is an example workflow of the new neural network training architecture shown in fig. 3 according to embodiments of the present disclosure. The new neural network training architecture and its neural network training method described in connection with fig. 3 and 4 result in a truly cross-framework implementation, since only a simple raw summation operation is left on the server CPU. In addition, the new neural network training architecture and its neural network training method are also versatile in that they support both synchronous and asynchronous training, which will be described in detail later in connection with fig. 8A to 9B. As a result, the new neural network training architecture according to embodiments of the present disclosure is the most versatile (e.g., supporting the asynchronous training that all-reduce cannot support, and also supporting the cross-framework use that PS does not support) and highest performing distributed DNN training architecture. Compared with all-reduce, the SS moves the tensor summation operation from the GPU to the idle CPU, which lightens the computational load on the GPU and accelerates the whole training process; compared with PS, the SS performs only tensor summation on the CPU, which reduces the amount of CPU computation and the number of CPUs required, thereby improving training speed and saving hardware cost.
The new neural network training architecture and its neural network training method according to embodiments of the present disclosure described in connection with fig. 3 and 4 use two processing resources, namely a first processing unit and a second processing unit (e.g., GPU and CPU); this is also referred to as utilizing heterogeneous resources.
With heterogeneous resources, hierarchical communication and summation can be applied without having to perform global all-reduce. If properly designed, traffic on certain (PCIe or network) links may be significantly reduced. According to some embodiments of the present disclosure, a new neural network training architecture according to embodiments of the present disclosure that utilizes heterogeneous resources may be modeled as a tree topology as shown in fig. 5.
The tree topology shown in fig. 5 can be regarded as a symmetrical three-level tree topology comprising a root node (C_global, the global summation server), intermediate nodes (C_0 to C_{M-1}), and leaf nodes (N_0, N_1, ...). It should be noted that the tree topology shown in fig. 5 is only one example. The new neural network training architecture according to embodiments of the present disclosure may establish a corresponding tree topology according to the actual situation; the number of levels may be fewer or more than three, and the structure may be symmetrical or asymmetrical, which is not limited by the present disclosure.
In the above, a tree topology of a new neural network training architecture according to an embodiment of the present disclosure is described with reference to fig. 5. In the following, an example method of tensor summation based on a tree topology of a neural network training architecture and communication of global tensor sums to a first processing unit will be described in connection with fig. 6A-6B.
Fig. 6A illustrates a flowchart of performing tensor summation by directly replicating a second processing unit and communicating a global tensor summation to a first processing unit, according to some embodiments of the present disclosure. The method shown in fig. 6A may be referred to hereinafter as a direct replication strategy.
Referring to fig. 6A, performing tensor summation by direct replication strategy and communicating the global tensor summation to the first processing unit includes steps s402_2a, s402_4a and s600_2a.
At step S402_2a, the tensor generated by the first processing unit is copied to the second processing unit. Thereafter, the method proceeds to step S402_4a. At step S402_4a, the second processing unit performs tensor summation on the tensors generated by the first processing units. The resulting global tensor sum is then directly copied back to the first processing unit (step S600_2a).
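A minimal sketch of these three steps, with the "GPUs" and the "CPU" simulated as NumPy arrays in a single process (an assumption made purely for illustration):

```python
import numpy as np

gpu_tensors = [np.full(4, i + 1.0) for i in range(8)]    # tensors on 8 GPUs

# S402_2a: every GPU copies its full tensor to the CPU
cpu_buffer = [t.copy() for t in gpu_tensors]

# S402_4a: the CPU sums the copies into the global tensor sum
global_sum = np.sum(cpu_buffer, axis=0)

# S600_2a: the global tensor sum is copied back to every GPU unchanged
gpu_results = [global_sum.copy() for _ in gpu_tensors]
```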
In the above, performing tensor summation by direct replication strategy and communicating global tensor summation to the first processing unit is described in connection with fig. 6A. Next, the present disclosure will briefly analyze network traffic for each switch (PCIe) in the tree topology when using a direct replication policy.
For simplicity, this disclosure takes the two-level tree shown in the lower left portion of fig. 5 as an example for analysis. Assume that this two-level tree includes a server (C_0) with p = 2 PCIe switches (S_0 and S_{p-1}) (the first level, which may also be referred to as the upper-level links of the two-level tree in the tree topology), and that each switch is connected to n = 4 first processing units, i.e., GPUs (N_0, ..., N_{n-1}) (the second level, which may also be referred to as the lower-level links of the two-level tree in the tree topology). Now, it is desired to sum one tensor of 1MB across the total of 8 first processing units. In the direct copy policy shown in fig. 6A, all the first processing units (N_0, ..., N_{n-1}) copy their tensors to the second processing unit (C_0), the second processing unit performs the tensor summation, and the result is copied back. Therefore, the traffic on each first processing unit-switch link (under S_0 and S_{p-1}) is 1MB, and the traffic on each switch-root link (to S_root) is 4MB in each direction.
From the above analysis, it can be seen that if the bandwidth of the first processing unit switch link is very limited and the bandwidth of the switch root link is very large, the performance of this solution may be better than ring-based all-reduce, since ring-based all-reduce will bi-directionally transmit 1.75MB on each link.
Fig. 6B illustrates a flow diagram for performing tensor summation by a topology aware second processing unit and communicating a global tensor summation to a first processing unit according to some embodiments of the present disclosure. The method shown in fig. 6B may be referred to hereinafter as a topology aware policy.
Referring to fig. 6B, performing tensor summation by topology aware policy and communicating the global tensor summation to the first processing unit includes steps s402_2b, s402_4b, s402_6b, s600_2b and s600_4b.
At step S402_2b, the tensors generated by the first processing units under the same switch are summed to obtain aggregate data. In some embodiments, the tensors generated by the first processing units under the same switch may be summed by performing a reduce-scatter on all of the first processing units under the same switch to obtain the aggregate data (this step may be referred to as reduce-scatter hereinafter). Note that other methods are also applicable. For example, in other embodiments, the tensors generated by the first processing units under the same switch may be summed by copying them to one first processing unit and summing them there to obtain the aggregate data. Thereafter, the method proceeds to step S402_4b. At step S402_4b, the aggregate data is copied to the second processing unit (this step may be referred to hereinafter as the first processing unit-to-second processing unit copy; in the case where the first processing unit is a GPU and the second processing unit is a CPU, it may be referred to as the GPU-CPU copy). After the aggregate data is copied to the second processing unit, the method proceeds to step S402_6b. At step S402_6b, the second processing unit performs tensor summation on the aggregate data to obtain the global tensor sum. In some embodiments, the second processing unit may sum all of the aggregate data by performing a reduction on it to obtain the global tensor sum (this step may be referred to hereinafter as the second processing unit reduction; in the case where the second processing unit is a CPU, it may be referred to as the CPU reduction). Note that methods other than the second processing unit reduction are also applicable. After the global tensor sum is obtained, the method proceeds to step S600_2b, where each first processing unit copies its partition data from the second processing unit back to itself. Here, the partition data of a first processing unit refers to the portion of data corresponding to that first processing unit after the global tensor sum is divided equally into n parts, where n is the number of first processing units under the same switch (this step may be referred to hereinafter as the second processing unit-to-first processing unit copy; in the case where the first processing unit is a GPU and the second processing unit is a CPU, it may be referred to as the CPU-GPU copy). In some embodiments, the portion of data corresponding to a first processing unit is the portion corresponding to the process number of that first processing unit. Other correspondences are also applicable, and the present disclosure is not limited thereto, as long as the correspondence enables each first processing unit under the same switch to obtain a mutually different piece of data. After the CPU-GPU copy, the method proceeds to step S600_4b. At step S600_4b, each first processing unit performs a global-gather (all-gather) operation together with the other first processing units under the same switch, so that each first processing unit obtains the complete global tensor sum.
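The five stages above can be simulated in a single process as follows. The sketch reproduces only the data movement (the reduce-scatter result, the GPU-to-CPU copy, the CPU reduction, the copy back, and the global-gather result), not the actual PCIe communication; the sizes, values, and variable names are illustrative assumptions.

```python
import numpy as np

p, n, size = 2, 4, 8        # switches, GPUs per switch, tensor length
gpus = [[np.full(size, 10 * s + g + 1.0) for g in range(n)] for s in range(p)]

# S402_2b: reduce-scatter inside each switch -> GPU g ends up holding shard g
# of the per-switch sum (simulated here by summing and splitting)
local_sums = [np.sum(gpus[s], axis=0) for s in range(p)]
shards = [np.array_split(local_sums[s], n) for s in range(p)]

# S402_4b: each GPU copies its 1/n shard to the second processing unit (CPU)
cpu_data = [[shards[s][g] for g in range(n)] for s in range(p)]

# S402_6b: the CPU reduces the p per-switch contributions into the global shards
global_shards = [np.sum([cpu_data[s][g] for s in range(p)], axis=0) for g in range(n)]

# S600_2b: each GPU copies its own shard of the global tensor sum back, and
# S600_4b: a global-gather inside each switch gives every GPU the full sum
global_sum = np.concatenate(global_shards)
gpu_results = [[global_sum.copy() for _ in range(n)] for _ in range(p)]

# sanity check: the result equals the straightforward sum over all GPUs
assert np.allclose(global_sum, np.sum([t for sw in gpus for t in sw], axis=0))
```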
In the above, the topology aware policy is described in connection with fig. 6B. Similar to fig. 6A, the present disclosure will hereinafter take the two-level tree shown in the lower left portion of fig. 5 as an example to analyze the network traffic on the individual switch links in the tree topology when the topology aware policy is applied. The configuration and requirements of the two-level tree shown in the lower left part of fig. 5 and the size of the tensor are consistent with the above analysis regarding the direct replication strategy.
At step S402_2b, when the tensors generated by the first processing units under the same switch are summed by reduce-scatter, only (n-1)/n MB of traffic is generated inside the PCIe switch (e.g., S_0 in fig. 5), where n is the number of first processing units under the same switch (in this example, n = 4). After the reduce-scatter, each first processing unit holds 1/n MB of aggregate data. At step S402_4b, each first processing unit (e.g., N_0, ..., N_{n-1} in fig. 5, where n = 4) copies its 1/n MB of aggregate data to the second processing unit (C_0 in fig. 5), which results in 1/n MB of traffic along the path. Since there are a total of p switches (in this example, p = 2), the full data size on the second processing unit should be p MB. At step S402_6b, the CPU performs a reduction on all p MB of data in order to obtain the global tensor sum across all n×p first processing units. This reduction does not cause any traffic. At step S600_2b, each first processing unit copies its 1/n MB of partition data from the second processing unit back to itself. This causes 1/n MB of traffic in the opposite direction per leaf node. At step S600_4b, each first processing unit performs a global-gather operation with the other first processing units under the same PCIe switch. This creates (n-1)/n MB of reverse traffic inside the switch.
From the above analysis, it follows that when the topology aware policy is applied, the traffic on each first processing unit-switch link will be 1.75MB, while the traffic on each switch-root link will be 1MB in each direction. If the switch-root link is the bottleneck, the performance of this solution will be significantly better than ring-based all-reduce.
Note that the traffic analysis on the switch links in the tree topology performed above with respect to fig. 6A and 6B only considers training within one physical machine, i.e., the lower left portion of fig. 5. It can be extended to distributed training on multiple machines. In addition, there are many other solutions besides the direct replication strategy shown in fig. 6A and the topology aware strategy shown in fig. 6B; for example, one solution may be to apply direct replication to a first set of tensors generated by the first processing units and the topology aware policy to a second set of tensors generated by the first processing units. Ultimately, the best solution may depend on the specific topology of the heterogeneous resources. The inventors of the present disclosure have found, after research, analysis, and simulation, that in practice the topology aware strategy described in connection with fig. 6B generally works well and is (near) optimal under today's hardware architectures.
In order for those skilled in the art to better understand the present disclosure, the present disclosure will clarify some possible problems before continuing the description below.
1. Why does topology-aware summation use a topology different from that of all-reduce?
In this regard, the main difference is that topology-aware summation uses a second processing unit resource different from the one used for tensor generation (e.g., the GPU generates the tensors while the CPU performs the tensor summation), whereas all-reduce uses the same resource (e.g., the GPU) for both tensor generation and tensor summation. Utilizing heterogeneous resources allows topology-aware summation to further reduce traffic below the theoretical limit of all-reduce. Consider N 1-GPU servers connected in a full mesh: no matter what topology all-reduce uses, there must be links carrying more than 2×(N-1)/N times the tensor size of traffic. However, the new neural network training architecture according to embodiments of the present disclosure can reduce the traffic on each link to 1× the tensor size.
2. In some embodiments, why are "reduce-scatter" and "global-gather" used instead of "reduce and broadcast"? Reduce and broadcast would create a local hotspot on the root node, whereas reduce-scatter and global-gather balance the traffic across all nodes and links.
In the above, performing tensor summation by the topology aware policy and communicating the global tensor sum to the first processing unit is described in connection with fig. 6B. As can be seen from that description, topology aware summation involves multiple steps. To accelerate the multiple steps in the topology aware policy, the inventors of the present disclosure introduced a multi-stage pipeline into the topology aware policy. Fig. 7 shows the complete multi-stage pipeline.
Typically, on a worker node, each first processing unit is used by an independent process. One of these processes may be selected as the root and the others as non-root. For example, on a machine with 8 first processing units (e.g., GPUs), there is one root process and seven non-root processes. In the topology aware policy described above in connection with fig. 6B, five main stages are described. In the following, the disclosure will explain several other stages.
Internal "ready" (ready), "execute" (do), and "copy) signals. Within each worker node, processes must coordinate with each other to perform local reduce-scatter and broadcast operations. For example, in the first pipeline stage, a non-root first processing unit detects that its generated key-K (i.e., tensor) has been generated by the first processing unit and is ready to be reduced-scattered, and sends a "ready" signal (e.g., unix socket) to the root using the IPC method. The root receives these signals and eventually finds the ready keys K on all nodes. The root then broadcasts an "execute" signal to each non-root process indicating that their pace on key K is synchronized and can proceed to the next stage. The "copy" signal also takes a similar step.
Push and pull. Only the root process is responsible for network operations. Thus, the non-root processes skip these stages. This does not mean that non-root processes simply wait idly; they can handle other tensors while waiting for a "copy" signal from the root process.
Each stage in the pipeline may be implemented as a separate thread with a corresponding tensor queue. The thread continuously polls its queue for outstanding tensors and performs one operation of the pipeline. After finishing processing a tensor, the thread passes it to the queue of the downstream thread, unless it is the last stage. Thus, all stages are pipelined. Note that it is not mandatory that every stage in the pipeline be implemented as a separate thread with a corresponding tensor queue, because the processing of some stages (e.g., "send key ready signal to root") is very fast, and a corresponding queue may not be needed for such stages. Accordingly, a corresponding tensor queue may be maintained for at least one stage of the pipeline, for example, for one of the steps S402_2b, S402_4b, S402_6b, S600_2b, and S600_4b described above in connection with fig. 6B.
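A minimal sketch of this "one thread plus one tensor queue per stage" structure, using only the Python standard library; the function names and the placeholder stage functions are illustrative assumptions.

```python
import queue
import threading
import time

def make_stage(stage_fn, downstream=None):
    """Each pipeline stage owns a queue and a thread that polls it."""
    q = queue.Queue()

    def worker():
        while True:
            tensor = q.get()                 # poll this stage's queue
            if tensor is None:               # shutdown sentinel
                break
            result = stage_fn(tensor)
            if downstream is not None:
                downstream.put(result)       # hand the tensor to the next stage

    threading.Thread(target=worker, daemon=True).start()
    return q

# assumed example: three chained stages (e.g., reduce-scatter -> copy -> reduce)
stage3 = make_stage(lambda t: t)
stage2 = make_stage(lambda t: t, downstream=stage3)
stage1 = make_stage(lambda t: t, downstream=stage2)
stage1.put({"key": "tensor_K", "payload": [1.0, 2.0]})
time.sleep(0.1)                              # let the toy pipeline drain
```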
In addition, the inventors of the present disclosure have noted that tensors of different sizes may be generated at different times when training a neural network. To make the pipeline smoother, tensors may be partitioned when tensors of different sizes are processed, so that the tensors to be processed are of the same or approximately the same size. Summing the partitions does not affect correctness. In addition, the tensor queues are actually priority queues, so tensors can be ordered based on a recently proposed scheduling algorithm. These optimizations can greatly improve the end-to-end performance of the pipeline.
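A small sketch of such tensor partitioning (the chunk size is an arbitrary assumption); summing the chunks independently and concatenating the results is equivalent to summing the whole tensors.

```python
import numpy as np

def partition_tensor(tensor, chunk_elems=1024):
    """Split a flat tensor into chunks of roughly chunk_elems elements each."""
    n_chunks = max(1, int(np.ceil(tensor.size / chunk_elems)))
    return np.array_split(tensor, n_chunks)

chunks = partition_tensor(np.ones(3000), chunk_elems=1024)
print([c.size for c in chunks])   # three roughly equal chunks of 1000 elements
```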
Note that the stages in fig. 7 are flexible and can be adjusted according to training requirements. For example, for a stand-alone training task, the push-pull phase would be automatically deleted.
With respect to synchronous training and asynchronous training, as described above with reference to table 1, the new neural network training architecture according to embodiments of the present disclosure supports both synchronous and asynchronous algorithms. Hereinafter, the present disclosure will describe a synchronous training algorithm of a new neural network training architecture according to an embodiment of the present disclosure in conjunction with fig. 8A and 8B, and an asynchronous training algorithm in conjunction with fig. 9A and 9B.
Fig. 8A illustrates a flowchart of a neural network training method performed by synchronous training according to some embodiments of the present disclosure. Referring to fig. 8A, a neural network training method performed by synchronous training according to some embodiments of the present disclosure includes steps S800, S802, S804, and S806. At step S800, the first processing unit communicates the generated tensor to the second processing unit. After that, the second processing unit performs tensor summation on the received tensors (step S802). Furthermore, the second processing unit imposes a global barrier, in that the first processing unit can only pull the global tensor sum after all first processing units have pushed their tensors and the second processing unit has summed all of them. In other words, the first processing unit pulls the global tensor sum from the second processing unit only after the second processing unit has finished summing the tensors generated by all first processing units (step S804). After the first processing unit pulls the global tensor sum from the second processing unit, it may perform a parameter update based on the global tensor sum and begin the next iteration, until convergence. In some embodiments, the tensors in synchronous training may be the gradients generated after forward and backward propagation.
Fig. 8B presents an iterative flow of performing a neural network training method according to some embodiments of the present disclosure by synchronous training, where g represents a gradient, according to some embodiments of the present disclosure. The detailed algorithm is shown in algorithm 1.
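Since Algorithm 1 itself is not reproduced here, the following thread-based sketch illustrates the synchronous push/sum/pull/update flow of fig. 8A and 8B under stated assumptions: the class and method names, the use of a threading barrier, and the averaging-plus-SGD update are ours, not the patent's.

```python
import threading
import numpy as np

class SummationServer:
    """Second processing unit: only sums tensors and enforces the global barrier."""
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.pending = []
        self.global_sum = None
        self.barrier = threading.Barrier(n_workers)
        self.lock = threading.Lock()

    def push(self, grad):                        # S800: a worker pushes its gradient
        with self.lock:
            self.pending.append(grad)
            if len(self.pending) == self.n_workers:
                self.global_sum = np.sum(self.pending, axis=0)   # S802: sum all tensors
                self.pending = []

    def pull(self):                              # S804: only after every worker pushed
        self.barrier.wait()
        return self.global_sum

def worker(server, params, lr=0.01):
    grad = np.ones_like(params)                  # stands in for forward/backward
    server.push(grad)
    g = server.pull()
    params -= lr * g / server.n_workers          # parameter update (assumed averaging + SGD)

server = SummationServer(n_workers=2)
replicas = [np.zeros(4), np.zeros(4)]
threads = [threading.Thread(target=worker, args=(server, r)) for r in replicas]
for t in threads:
    t.start()
for t in threads:
    t.join()
```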
Fig. 9A illustrates a flowchart of a neural network training method performed by asynchronous training according to some embodiments of the present disclosure.
Referring to fig. 9A, a neural network training method performed by asynchronous training according to some embodiments of the present disclosure includes steps S900, S902, S904, S906, and S908. After forward and backward propagation, the first processing unit first performs a parameter update and calculates the difference (Δw) between the updated parameters and the parameters before the update (step S900), and then communicates Δw to the second processing unit (step S902). The second processing unit maintains the latest parameter values w_t (step S904). Specifically, in some embodiments, each time the second processing unit receives a push request containing Δw, it adds Δw to w_t. In asynchronous training, the second processing unit does not enforce a global barrier, so the first processing unit can pull parameter values from the second processing unit at any time (step S906), without waiting for the second processing unit to finish summing the differences generated by all first processing units. After the first processing unit pulls the parameter values from the second processing unit, it may perform a parameter update based on the pulled parameter values (step S908).
Fig. 9B presents an iterative flow of performing a neural network training method according to some embodiments of the present disclosure through asynchronous training, where w represents a parameter, according to some embodiments of the present disclosure. The detailed algorithm is shown in algorithm 2.
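Similarly, since Algorithm 2 is not reproduced here, the following single-process sketch illustrates the asynchronous "send delta parameters, receive parameters" flow of fig. 9A and 9B; the class and function names and the toy gradient are illustrative assumptions.

```python
import numpy as np

class AsyncSummationServer:
    """Keeps the latest parameters w_t; every pushed delta is added in, and
    workers may pull at any time (no global barrier)."""
    def __init__(self, init_params):
        self.w = init_params.copy()

    def push_delta(self, delta_w):        # S902/S904: add the received delta to w_t
        self.w += delta_w

    def pull(self):                       # S906: may be called at any time
        return self.w.copy()

def async_worker_step(server, local_w, lr=0.01):
    grad = np.ones_like(local_w)          # stands in for forward/backward
    updated = local_w - lr * grad         # S900: local parameter update
    delta_w = updated - local_w           # S900: difference after vs. before update
    server.push_delta(delta_w)            # S902: send the delta, not the raw gradient
    return server.pull()                  # S906/S908: pull the latest parameters

server = AsyncSummationServer(np.zeros(4))
w = server.pull()
for _ in range(3):
    w = async_worker_step(server, w)
```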
The asynchronous algorithm described in connection with fig. 9A and 9B does not simply apply the "send gradient, receive gradient" pattern of the synchronous algorithm, because doing so could cause the parameters to drift significantly (between the individual first processing units) after many iterations. In contrast, following the "send delta parameters, receive parameters" mode described with reference to fig. 9A and 9B avoids such drift.
For both synchronous and asynchronous training, the second processing unit in the neural network training architecture according to some embodiments of the present disclosure performs only summation. This allows the new neural network training architecture according to embodiments of the present disclosure to work across frameworks, since tensor summation can be implemented independently of any framework.
In the foregoing, the present disclosure has described, in conjunction with fig. 3-9B, a new neural network training architecture and its neural network training method according to embodiments of the present disclosure, including the basic flow, the direct replication strategy, the topology aware policy, and synchronous and asynchronous training.
Note that while specific examples are included in the disclosure above, they are to be considered illustrative only and not for the purpose of limitation. It will be apparent to those skilled in the art, after having understood the present disclosure, that various changes in form and details may be made therein without departing from the spirit and scope of the claims and their equivalents.
In the foregoing, a new neural network training architecture and its neural network training method according to embodiments of the present disclosure were described in connection with fig. 3-9B, using a first processing unit (e.g., GPU) for tensor generation and parameter updating, and a second processing unit (e.g., CPU) for tensor summation. In addition, tensor summation by the direct replication strategy, tensor summation by the topology aware strategy, neural network training through a pipeline, and synchronous and asynchronous training algorithms were also disclosed. Compared with the existing neural network training architectures all-reduce and PS, the new neural network training architecture and its neural network training method according to embodiments of the present disclosure move tensor summation to the CPU (compared with all-reduce) and perform only the tensor summation operation on the CPU (compared with PS), and therefore offer advantages such as faster neural network training, relatively low hardware cost, cross-framework support, and support for both synchronous and asynchronous training. In asynchronous training, instead of simply applying the "send gradient, receive gradient" pattern of the synchronous algorithm, the "send delta parameters, receive parameters" pattern is used, which avoids the significant parameter drift that the former could cause after many iterations. In addition, applying the topology aware policy generally works well and is (near) optimal under today's hardware architectures, and introducing a pipeline into the new neural network training architecture and its neural network training method can further accelerate the training process. Evaluation shows that neural network training methods according to some embodiments of the present disclosure can be up to 149% faster than all-reduce, and achieve a 67% speed increase and 18.75 times cost savings compared to PS with state-of-the-art optimizations. Hereinafter, the present disclosure will analyze and describe specific implementations of the new neural network training architecture and its neural network training method according to embodiments of the present disclosure. Note that the following analysis and implementations are merely intended to help those skilled in the art better understand the present disclosure and are not limiting. For example, in the analysis and implementations below, the first processing unit is exemplified by a GPU and the second processing unit by a CPU, but this should not be construed as limiting.
4. Optimality analysis
First, the present disclosure will analyze why summation based on the topology aware policy is better than ring-based all-reduce and is (near) optimal.
In the optimality analysis below, the present disclosure models the system architecture as a tree-based hierarchical graph G = (V, E). Fig. 5 depicts the topology of G. N denotes the set of leaf nodes (GPUs), S denotes the set of intermediate nodes (switches), and C denotes the set of CPU nodes, with V = N ∪ S ∪ C. Each edge e(v_x, v_y) represents the link from vertex v_x to vertex v_y, and t(v_x, v_y) denotes the amount of traffic sent from v_x to v_y. Further, p is defined as the number of switches (p ≥ 2), and n as the number of leaf nodes connected to each switch.
In the analysis of the present disclosure, it is assumed that G has the following features:
each intermediate node represents a switch that can forward data but cannot perform summation;
each CPU node may sum;
each edge in E is duplex and the bandwidths in both directions are equal. Will b (v) x ,v y ) Denoted as e (v) x ,v y ) B (v) x ,v y )=b(v y ,v x );
Let G be symmetrical. The bandwidths on the same level of the tree are equal. For example, for any j, k ε [0, p-1],x,y∈[jn,(j+1)n-1],b(S j ,S root )=b(S k ,S root ) And b (N) x ,S j )=b(N y ,S j ) This is true.
First, consider the lower left part of fig. 5. It is a two-level tree. The data from N_0 to N_{pn-1} needs to be summed at C_0. As mentioned before, there are many possible solutions and the design space is very large. The problem is first analyzed intuitively by considering two simplified cases:
1) Suppose b(N_i, S_j) is much larger than b(S_j, S_root), so that the edge e(S_j, S_root) is the bottleneck. In this case, it is necessary to minimize the traffic on e(S_j, S_root) while relaxing the traffic requirement on e(N_i, S_j). From the above disclosure, it is clear that the topology aware policy described in connection with fig. 6B already satisfies this requirement: when a tensor of 1MB needs to be summed, the traffic on e(S_j, S_root) is only 1MB. This value cannot be less than 1MB, otherwise the data collected at C_0 would be incomplete.
2) Suppose b(S_j, S_root) is much larger than b(N_i, S_j). Since e(N_i, S_j) is now the bottleneck, it is desirable to minimize the traffic on this link. Clearly, when a tensor of 1MB needs to be summed, the minimum traffic on e(N_i, S_j) is 1MB. The corresponding solution is that each GPU copies its entire data to C_0, i.e., the direct replication strategy.
e(S_j, S_root) may be called the upper-level link, and e(N_i, S_j) the lower-level link. From the above analysis, the following lemma can be derived.
Lemma 1. For a two-level tree in a hierarchical topology, if the upper-layer link bandwidth is sufficiently larger than the lower-layer link bandwidth, the direct replication strategy is the best solution; if the lower-layer link bandwidth is sufficiently larger than the upper-layer link bandwidth, the topology-aware policy is the best solution.
In other words, for a two-level tree in a hierarchical topology, when the bandwidth of the lower link is smaller than the bandwidth of the upper link, tensors generated by the first processing unit may be summed by the second processing unit by direct replication; and tensors generated by the first processing unit may be summed by the second processing unit through topology awareness when the bandwidth of the lower link is greater than the bandwidth of the upper link.
Note that calling e(S_j, S_root) the upper-layer link and e(N_i, S_j) the lower-layer link above applies only to the portion shown at the bottom left of fig. 5 and is not limiting. For example, if each accelerator (e.g., the lower-left portion of fig. 5 is one accelerator) is taken as a leaf node and its internal architecture is ignored, the entire topology shown in fig. 5 is effectively a two-level tree; in that case e(C_m, S_global) is called the lower-layer link and e(S_global, C_global) the upper-layer link, where m = 0, …, M-1.
Next, reconsider the assumptions on b(S_j, S_root) and b(N_i, S_j). In practice, neither of these two bandwidths may be sufficiently larger than the other, in which case neither the topology-aware policy nor the direct replication policy alone is optimal. Instead, the best solution should be a combination of the two strategies: intuitively, direct replication is applied to one part (x) of the data (i.e., of the tensors generated by the first processing unit) and the topology-aware policy is applied to the other part (y, with x + y = 1). For a particular x and y, the job completion time J is minimized. The traffic on the two links is calculated separately: on e(S_j, S_root), the traffic consists of n direct copies plus the topology-aware policy traffic; on e(N_i, S_j), the traffic consists of one direct copy plus the complete traffic of the topology-aware policy.
Since J is determined by the traffic on each link divided by that link's bandwidth, the best J is highly correlated with the two bandwidth terms. To simplify the analysis, example bandwidths on a PCIe fabric are used: on the inventors' own cluster (section 7.2), b(N_i, S_j) = 13 GB/s and b(S_j, S_root) = 10 GB/s were measured. Hereinafter, let M = 1 GB and n = 4. Combining formulas (1), (2) and x + y = 1, the goal is to find the x ∈ [0, 1] that minimizes J(x); solving this gives the optimal solution x* = 9/93 and J(x*) = 0.129 s.
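The optimization above can be illustrated numerically. The exact forms of formulas (1) and (2) are not reproduced here, so the short Python sketch below assumes a simple cost model read from the preceding description: the upper-layer link carries n·x·M (the direct copies) plus y·M (the locally aggregated topology-aware part); the lower-layer link carries x·M (one direct copy) plus (2n-1)/n·y·M for the topology-aware part (local reduce send and receive plus copying one partition to the CPU); and the job completion time is the larger of traffic divided by bandwidth over the two links. These traffic coefficients are an assumption made for illustration, not the original formulas; under them the sketch reproduces the reported x* = 9/93 and J(x*) ≈ 0.129 s.

    # Sketch of the assumed two-level-tree cost model (not formulas (1)/(2) themselves).
    M = 1.0           # tensor size per GPU, in GB
    n = 4             # GPUs per switch
    b_lower = 13.0    # measured b(N_i, S_j), GB/s (section 7.2)
    b_upper = 10.0    # measured b(S_j, S_root), GB/s (section 7.2)

    def job_time(x):
        """Completion time when a fraction x of the data uses direct replication
        and y = 1 - x uses the topology-aware policy (assumed traffic model)."""
        y = 1.0 - x
        upper = n * x * M + y * M                  # n direct copies + one aggregated sum
        lower = x * M + (2 * n - 1) / n * y * M    # one direct copy + local-reduce traffic
        return max(upper / b_upper, lower / b_lower)

    best_x = min((i / 100000 for i in range(100001)), key=job_time)
    print(best_x, job_time(best_x))   # ~0.0968 (= 9/93), ~0.129 s
    print(job_time(0.0))              # topology-aware only; the text reports J(0) = 0.134 s
    print(job_time(1.0))              # direct replication only, ~0.4 s

    # Ring all-reduce over p*n = 8 GPUs (p = 2) is bottlenecked by the upper-layer
    # link: 2*(p*n - 1)/(p*n) * M / b_upper = 0.175 s, the J_ar value quoted below.
    p = 2
    print(2 * (p * n - 1) / (p * n) * M / b_upper)   # 0.175 s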
The topology-aware policy is almost optimal for local GPUs. The performance gap between the best solution and the topology-aware policy can now be calculated. When x = 0 the solution is exactly the topology-aware policy, so the job completion time is J(0) = 0.134 s, while the optimal time obtained above is 0.129 s. Thus, the topology-aware policy according to embodiments of the present disclosure is very close to the optimal solution, with only about a 4% performance difference.
For local GPUs, the topology-aware policy is superior to ring-based all-reduce. Let J_ar denote the job completion time of ring all-reduce and J_ta that of the topology-aware policy, and let k denote the ratio of the lower-layer link bandwidth to the upper-layer link bandwidth. It can be shown that when k > 1, J_ta < J_ar holds for any n and any p ≥ 2. On the inventors' test platform, k = 1.3, so the topology-aware policy is always better than ring all-reduce. For example, using the above values and letting p = 2 gives J_ar = 0.175 s, and J_ta is 23.4% smaller.
The new neural network training architecture according to embodiments of the present disclosure is the best solution for distributed communication. According to Lemma 1, the direct replication strategy is best suited for distributed training and is better than all-reduce. In fig. 5, if each accelerator (i.e., the portion shown at the bottom left of fig. 5) is taken as a leaf node and its internal architecture is ignored, the topology shown in fig. 5 is effectively a two-level tree in which the bandwidth of the upper-layer link is large enough (many summation servers can be used at the same time). Quantitatively, each accelerator sends 1 GB to the summation servers, while under all-reduce the traffic is 2 × (N-1)/N × 1 GB, approaching 2 GB for large N. This shows that the performance of the new neural network training architecture according to embodiments of the present disclosure is superior to all-reduce.
5. Training algorithm analysis
The above analyzed why summation based on the topology-aware policy is better than ring-based all-reduce and is (near-)optimal. Hereinafter, the present disclosure analyzes the synchronous and asynchronous algorithms of the new neural network training architecture according to embodiments of the present disclosure. Their convergence behaviour is discussed and compared with the original synchronous algorithm and the original asynchronous algorithm of PS (Asynchronous Parallel, ASP).
Synchronous training analysis. The present disclosure begins with the synchronous algorithm. Because the server enforces a global barrier, all accelerators are always in the same training step. The new neural network training architecture according to embodiments of the present disclosure and all-reduce both follow a "send gradient, receive gradient" pattern, with the finally received gradient being the sum of the gradients of all accelerators; the parameters are then updated locally. Thus, the algorithm is equivalent to all-reduce and has the same convergence behavior as all-reduce.
Asynchronous training analysis. In asynchronous training there is no synchronization requirement between accelerators; each accelerator communicates with the server at its own cadence. Under an asynchronous setting, simply applying the "send gradient, receive gradient" pattern of the synchronous algorithm may result in significant drift of the parameters (between accelerators) after many iterations. In contrast, following the "send delta parameter, receive parameter" mode described with reference to fig. 9A, such drift can be avoided. The present disclosure analyzes the asynchronous algorithm of the new neural network training architecture according to embodiments of the present disclosure and proves that it is equivalent to ASP.
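As an illustration of this "send delta parameter, receive parameter" pattern, the following Python sketch shows one possible shape of the interaction between an accelerator and the summation server. The class and method names (SumServer, push_delta, pull, step) and the use of plain SGD in place of the generic optimizer f are assumptions for illustration only, not the API of the disclosed system.

    import numpy as np

    class SumServer:
        """Conceptual summation server: it only adds received deltas to its parameters."""
        def __init__(self, w0):
            self.w = w0.copy()
        def push_delta(self, delta):   # hypothetical method name
            self.w += delta            # summation only; no optimizer state on the server
        def pull(self):
            return self.w.copy()

    class Accelerator:
        """Conceptual worker following 'send delta parameter, receive parameter'."""
        def __init__(self, w0, lr=0.1):
            self.w = w0.copy()
            self.lr = lr
        def step(self, server, grad):
            new_w = self.w - self.lr * grad     # local optimizer update, i.e. w + f(g)
            server.push_delta(new_w - self.w)   # send only the parameter delta
            self.w = server.pull()              # receive the latest global parameters

    # Toy usage: two accelerators stepping asynchronously, in an arbitrary order.
    w0 = np.zeros(4)
    server = SumServer(w0)
    a, b = Accelerator(w0), Accelerator(w0)
    a.step(server, grad=np.ones(4))
    b.step(server, grad=2 * np.ones(4))
    print(server.w)   # w0 + f(g_a) + f(g_b), the same state ASP would reach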
Theorem 2. The asynchronous algorithm of the new neural network training architecture according to embodiments of the present disclosure is equivalent to ASP.
Proof. Consider a server connected to n accelerators. The high-level idea of the proof is to show that, given the same communication order (push and pull order), the algorithm of the present disclosure generates the same state as ASP (i.e., the parameters are the same on the server and on the n accelerators). Let f be a general representation of the optimizer, so that an optimizer step can be expressed as w ← w + f(g_{i,t}), where g_{i,t} is the gradient of accelerator i at iteration t. Let w_ps and w_ss denote the parameters under PS and under the new neural network training architecture according to embodiments of the present disclosure, respectively, and let w_{i,t} denote the parameters of accelerator i at iteration t (t ∈ [1, T]). For all accelerators and servers, the parameters are initialized to w_0. After T iterations, the updated parameters are:
w_ps = w_0 + Σ_{t=1}^{T} Σ_{i=1}^{n} f(g_{i,t})    (3)
w_ss = w_0 + Σ_{t=1}^{T} Σ_{i=1}^{n} Δw_{i,t}    (4)
Next, mathematical induction is used to show that Δw_{i,t} = f(g_{i,t}) holds for any i and t.
Base case t = 1: given the initial parameters w_0, the gradient g_{i,1} is computed from w_0. Under the parameter server, accelerator i pushes g_{i,1} to the server, which updates to w_{ps,1} = w_0 + f(g_{i,1}). Under SS, accelerator i pushes f(g_{i,1}) to the server, which updates to w_{ss,1} = w_0 + f(g_{i,1}). Thus Δw_{i,t} = f(g_{i,t}) holds for t = 1. Under both architectures, the parameters on accelerator i are the same after it receives the pull response from the server.
Induction step: suppose the claim holds for t = k (k ≥ 1); then both architectures compute the gradient g_{i,k+1} from the same parameters w_k. Similar to the base case, w_{ps,k+1} = w_k + f(g_{i,k+1}) and w_{ss,k+1} = w_k + f(g_{i,k+1}) are obtained. Thus Δw_{i,t} = f(g_{i,t}) holds for t = k + 1. By induction, Δw_{i,t} = f(g_{i,t}) holds for all i and t.
Returning to equations (3) and (4): since Δw_{i,t} = f(g_{i,t}) holds for any i and t, it follows that w_ps = w_ss. This completes the proof, because after any T iterations the parameters under the algorithm of the present disclosure equal those under ASP.
Theorem 2 shows that the algorithm of the present disclosure has the same convergence as ASP. This is also confirmed by the experimental evaluation below (section 7.3).
6. Implementation of the embodiments
Sections 4 and 5 above presented the optimality analysis and the training algorithm analysis of the new neural network training architecture according to embodiments of the present disclosure. In this chapter, the present disclosure introduces the implementation of the new neural network training architecture and its neural network training method according to embodiments of the present disclosure.
The core of the new neural network training architecture (i.e., SS) according to embodiments of the present disclosure is implemented in C++, while the framework plug-ins contain both C++ and Python. In total, SS consists of approximately 6500 lines of Python code and 8600 lines of C++ code. SS currently supports Keras, TensorFlow, MXNet, and PyTorch.
Adapting to different frameworks: to support different frameworks, framework plug-ins are designed for user code. These plug-ins expose APIs to user code and act as an intermediate layer between the framework and the Core. A plug-in converts the framework-specific data structure into a unified abstraction called a Task and enqueues the Task to the Core; the Core then retrieves the Task from the queue and processes it using the underlying communication library.
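To make the plug-in/Core hand-off concrete, the following Python sketch shows one possible shape of the Task abstraction and the queue between them. The names (Task, plugin_submit, core_loop) and fields are illustrative assumptions; the actual SS Core is implemented in C++ and its interfaces are not reproduced here.

    import queue
    import threading
    from dataclasses import dataclass

    @dataclass
    class Task:
        """Framework-agnostic unit of work handed from a plug-in to the Core."""
        key: int            # unique ID of the tensor (see the key-value scheme below)
        buffer: memoryview  # raw tensor data, already framework-independent
        op: str             # e.g. "push_sum" or "pull"

    task_queue: "queue.Queue[Task]" = queue.Queue()

    def plugin_submit(key, framework_tensor):
        """Called by a framework plug-in: convert the tensor and enqueue a Task."""
        buf = memoryview(framework_tensor)      # the conversion itself is framework specific
        task_queue.put(Task(key=key, buffer=buf, op="push_sum"))

    def core_loop():
        """Core side: drain the queue and hand tasks to the communication library."""
        while True:
            task = task_queue.get()
            # here the real Core would invoke NCCL / ps-lite / RDMA primitives
            print(f"processing task key={task.key} op={task.op} bytes={task.buffer.nbytes}")
            task_queue.task_done()

    threading.Thread(target=core_loop, daemon=True).start()
    plugin_submit(0, bytearray(16))   # toy example: a 16-byte "tensor"
    task_queue.join()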
Intra-node communication: data-path messages are transmitted using NCCL, and control-path messages (e.g., for coordination) are transmitted over UNIX sockets. When gradients are copied from the GPU to CPU memory (e.g., for topology-aware summation, i.e., the topology-aware policy above), shared memory is used so that different processes can access the same buffers. If topology-aware summation is enabled, the system sums data from different NUMA nodes on the CPU, and OpenMP is used for parallel CPU summation.
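The parallel CPU summation step can be illustrated with the following NumPy sketch, which splits the gradient buffer into partitions and sums the per-GPU copies of each partition in a separate thread (NumPy can release the GIL during large array additions, so the threads may run in parallel). The real system performs this step in C++ with OpenMP; the code below is a conceptual stand-in, not the SS implementation.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_sum(gpu_copies, num_parts=4):
        """Sum per-GPU gradient copies into one buffer, one partition per worker.

        gpu_copies: list of equally sized 1-D float32 arrays already copied to CPU
        memory (e.g. via shared memory)."""
        total = np.zeros_like(gpu_copies[0])
        bounds = np.linspace(0, total.size, num_parts + 1, dtype=int)

        def sum_partition(part):
            lo, hi = bounds[part], bounds[part + 1]
            for copy in gpu_copies:
                total[lo:hi] += copy[lo:hi]   # partitions are disjoint, so no locking

        with ThreadPoolExecutor(max_workers=num_parts) as pool:
            list(pool.map(sum_partition, range(num_parts)))
        return total

    # Toy usage: four "GPU" gradients of 1M floats each.
    grads = [np.ones(1 << 20, dtype=np.float32) for _ in range(4)]
    print(parallel_sum(grads)[:3])   # [4. 4. 4.]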
Inter-node communication: ps-lite is used as the network library; it natively supports TCP, and RDMA support (also open-sourced) is added as an extension. For TCP, the number of TCP sockets is increased to raise the achievable bandwidth as much as possible. For RDMA, meta information is transferred using Send-Recv and data messages are transferred using RDMA-Write. In-place tensor push/pull is also implemented to remove memory copies on the RDMA data path.
Summation server: the summation function is based on the MXNet engine, but the summation is transparent to user code and can therefore be generic across different frameworks. SS also inherits the key-value-store communication model: each tensor that needs to be transferred is converted to a key with a unique ID, and each key is divided into multiple segments of (approximately) equal size that are randomly assigned to different servers in a load-balanced manner.
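The key partitioning described here can be sketched as follows; the function name and the returned segment fields are illustrative assumptions rather than the actual SS data structures.

    import random

    def partition_tensor(key, num_elems, num_servers, target_parts=None):
        """Split a keyed tensor into (approximately) equal segments and assign each
        segment to a summation server."""
        parts = target_parts or num_servers
        base, extra = divmod(num_elems, parts)
        server_order = random.sample(range(num_servers), k=min(parts, num_servers))
        assignments, offset = [], 0
        for i in range(parts):
            length = base + (1 if i < extra else 0)       # near-equal segment sizes
            server = server_order[i % len(server_order)]  # spread load across servers
            assignments.append({"key": key, "segment": i,
                                "offset": offset, "length": length,
                                "server": server})
            offset += length
        return assignments

    # Toy usage: a 10-element tensor with key 7, spread over 3 summation servers.
    for seg in partition_tensor(key=7, num_elems=10, num_servers=3):
        print(seg)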
Interface: the interface of SS is chosen to be Horovod-alike, since Horovod is already a widely used training architecture. In this way, switching from Horovod to SS can be done without difficulty. For example, for a Horovod MNIST training sample, only the Horovod import statement needs to be changed.
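For example, a Horovod-style PyTorch training step looks like the sketch below; with an SS plug-in exposing the same API, only the import line would change. The module name suggested for SS in the comment is hypothetical.

    # A Horovod-style training script; with an SS plug-in exposing the same API,
    # switching would amount to replacing the import below.
    import torch
    import horovod.torch as hvd   # with SS: e.g. "import ss.torch as hvd" (hypothetical name)

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    model = torch.nn.Linear(784, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # The distributed optimizer hides gradient aggregation (all-reduce in Horovod,
    # summation-server push/pull in SS) behind the same interface.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    x = torch.randn(32, 784).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()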
7. Evaluation
Chapter 6 presented the implementation of the new neural network training architecture and its neural network training method according to embodiments of the present disclosure. In this chapter, the present disclosure evaluates SS on different DNN models, training architectures, and hardware settings. The inventors of the present disclosure tested popular CNN and NLP models, including ResNet50 (MXNet implementation), VGG16 (MXNet implementation), Transformer (PyTorch implementation), and BERT (TensorFlow implementation). Unless explicitly stated, the per-GPU batch sizes for ResNet50, VGG16, Transformer, and BERT were chosen to be 64, 128, 20, respectively.
There are two test platforms: a public cloud with NVLink-enabled virtual machines (VMs) and a 20 Gbps TCP/IP network, and a private cluster with non-NVLink machines and a 100 Gbps RDMA (RoCEv2) network. In both settings, the GPUs are NVIDIA Tesla V100 with CUDA 10.0 and cuDNN 7.0.
Horovod (with NCCL) was chosen as the all-reduce implementation because it provides state-of-the-art distributed training performance. NCCL is at version 2.4.2, which automatically selects either ring-based or tree-based all-reduce depending on the hardware topology. For PS, ByteScheduler is enabled, which can improve performance by up to 196%.
When evaluating SS and PS, the number of servers is set equal to the number of accelerators so that the total network bandwidth of the servers matches that of the accelerators. Intel MKL is installed on the servers to improve CPU performance. Except in section 7.3, all experiments use synchronous training.
7.1 Performance on public cloud
The machines in this section are VMs on the public cloud; this is a typical network-bound DNN training environment. Each machine has eight Tesla V100 GPUs with internal NVLink and a 20 Gbps TCP/IP network.
CNN models. First, two popular CNN models are run: ResNet50 (computation-intensive) and VGG16 (communication-intensive). To avoid potential effects of data I/O loading and noise, synthetic training data is used. MXNet is used as the training engine, and gradient compression is not used. The metric is the number of images trained per second across all GPUs. SS is compared with Horovod (HVD) and with MXNet PS with ByteScheduler (PS+BSC), which performs better than native MXNet PS.
Fig. 10A shows the results on ResNet50. Although there is no significant acceleration on a single node (8 GPUs), the performance of SS is 44% and 12% higher than Horovod and ByteScheduler, respectively, in the 64-GPU case. The performance advantage of SS is more pronounced on VGG16 (fig. 10B), which is communication-intensive: for example, with 64 GPUs the performance of SS is 100% better than Horovod, consistent with the earlier analysis (SS can be up to 2 times faster than all-reduce). SS also outperforms PS+ByteScheduler by 55%, because PS performs parameter updates on the CPU servers while SS keeps them on the GPUs. Furthermore, SS occupies very few CPU cores (section 7.4), while PS occupies all 32 CPU cores on each server.
NLP models. Next, SS was evaluated on two widely used NLP models: Transformer and BERT. Both are communication-intensive, especially BERT. In this experiment, the Horovod all-reduce based implementation of ByteScheduler (HVD+BSC) was used, since PyTorch has no PS implementation. Unlike the previous CNN experiments, real data is used instead of synthetic data to better demonstrate the behavior of SS in real-world settings.
For the Transformer (PyTorch implementation), the WMT16 Multimodal Translation (De-En) dataset was used. Speed is measured as the number of samples trained per second. As shown in fig. 10C, SS performs 31-87% better than all-reduce and 36-67% better than ByteScheduler.
BERT (TensorFlow implementation) is trained on an internal dataset. This particular implementation contains a special gradient clipping-by-global-norm operation that is not compatible with ByteScheduler, so no comparison with ByteScheduler is made. XLA and FP16 compression are used to accelerate computation and communication, respectively. From the results in fig. 10D, SS improves performance over all-reduce by at least 55% (for 16 GPUs) and 95% (for 64 GPUs).
7.2 Performance on private RDMA clusters
Next, SS is evaluated on a cluster with a 100 Gbps RoCEv2 network. These machines do not have NVLink but are based on a PCIe switch fabric: there are 4 GPUs under each PCIe switch and two PCIe switches per machine. Here the communication bottleneck is actually the internal PCIe links, and the topology-aware policy of the present disclosure can still improve performance.
Topology-aware summation on a single machine. The evaluation starts with an 8-GPU machine and uses the (communication-intensive) VGG16. The number of GPUs is varied over 1, 4, and 8. One or four GPUs sit under the same PCIe switch and therefore have no hierarchical topology; with 8 GPUs the topology is hierarchical. Four configurations are compared: (1) All-on-CPU, where all gradients are copied to CPU memory and the weights are updated on the CPU; (2) all-reduce across all GPUs; (3) SS-unaware, i.e., SS with topology-aware summation disabled; and (4) SS-aware, i.e., SS with topology-aware summation enabled. All results are shown in figs. 11A and 11B. Two different batch sizes are selected to vary the communication overhead; take the smaller batch size (32) in fig. 11A as an example.
First, All-on-CPU does not scale well to more GPUs, especially in the 8-GPU case, because it suffers from communication hot spots on the memory links of the PCIe switches. Next, all-reduce and SS are considered together. With 1 or 4 GPUs their performance is similar. However, in the 8-GPU case, topology-aware summation is enabled and the performance of SS is up to 30% higher than all-reduce. The version of SS without topology awareness shows the same performance as all-reduce, indicating that the improvement comes from topology-aware summation.
Topology-aware summation across multiple machines. Here SS is evaluated in the distributed setting. A total of 32 GPUs on four machines are used, and the four CV and NLP models described above are tested again. The available NIC bandwidth of each machine is varied from 5 Gbps to 100 Gbps to show the performance of SS as the bottleneck gradually shifts from the external network to the internal PCIe links. The results are shown in figs. 12A to 12D.
In general, the lower the network bandwidth, the greater the advantage of SS. With a 100 Gbps RDMA network, the performance gain on all models drops to the same level as in the single-machine case above, since the gain then comes only from topology-aware summation. With a 5 Gbps network, SS shows significant speed improvements on all models. The inflection point differs per model: 10 Gbps for ResNet50, 40 Gbps for VGG16, 25 Gbps for Transformer, and 40 Gbps for BERT. In all tested cases, SS performs 10-149% better than Horovod all-reduce and 3-26% better than PS+ByteScheduler. Note that PS+ByteScheduler consumes 128 CPU cores, far more than SS, and its implementation is not cross-framework.
7.3 Asynchronous training
Next, the present disclosure demonstrates that the asynchronous algorithm according to the present disclosure has the same convergence as, and faster training speed than, the original asynchronous training algorithm of PS (ASP). ResNet50 is trained on 32 GPUs with the ILSVRC-2012 ImageNet dataset, which contains about 1.2M images for training and 50K images for validation. During training, the top-5 accuracy and time cost of each epoch are collected. All accelerators are completely asynchronous, without any staleness limit control. The present disclosure compares the asynchronous algorithm according to embodiments of the present disclosure with ASP.
Convergence versus training steps. As can be seen from fig. 13A, the two loss curves almost overlap, indicating that the algorithm according to embodiments of the present disclosure and ASP converge at the same rate. This confirms the analysis in chapter 5 of the present disclosure.
Convergence versus wall-clock time. As shown in fig. 13B, SS converges 26% faster than the ASP implementation using MXNet. This is because the system according to embodiments of the present disclosure has higher training throughput, so each epoch completes faster.
7.4 Additional CPU cost
Compared with all-reduce, SS requires additional CPUs for summation. For example, on a public cloud, a user must rent additional CPU-only VMs to run the server instances. The present disclosure evaluates how much these additional CPU resources cost.
The number of CPU cores required is determined by the total network bandwidth of the accelerators: the faster the accelerators send out tensors, the faster the servers must sum them. The number of accelerators and the network bandwidth of each accelerator are varied, and the inventors of the present disclosure determined the minimum number of CPU cores that has no impact on the end-to-end training speed for each total accelerator bandwidth.
The results are shown in Table 2, using Intel(R) Xeon(R) Platinum CPUs at 2.50 GHz. The listed CPU cores cover not only the summation but also all other overhead, including the network stack. As expected, the number of CPU cores required grows linearly with the total bandwidth, since the summation is fully parallel. The CPUs need not be on the same server; for example, when the total bandwidth exceeds 100 Gbps, more than one 100 Gbps server is required.
Total bandwidth (Gbps)    100    150    200    250    300
Summation server            8     12     16     20     24
PS-SGD                     40     60     80    100    120
PS-RMSProp                120    180    240    300    360
PS-Adam                   150    225    300    375    450
TABLE 2. Minimum number of CPU cores required versus total accelerator bandwidth.
The cost can be estimated using the AWS public price list. Consider the VM c5n.xlarge, which has 4 CPU cores and a 25 Gbps network; its CPU-to-network ratio meets the requirements of Table 2. The total memory consumption of a summation server is the same as the DNN model size, usually at most a few GB, so the 10.5 GB of RAM on c5n.xlarge is sufficient. Moreover, SS partitions the model across multiple server instances, further reducing per-instance memory consumption. Each c5n.xlarge costs $0.216 per hour.
A typical 8-GPU instance, p3.16xlarge, has a 25 Gbps network and costs $24.48 per hour. This means that if a user rents N p3.16xlarge instances (N ≥ 2), only N c5n.xlarge instances are needed as summation servers, which costs less than 1% of the GPU VMs. Alternatively, the user may rent the highest-end GPU VM, p3dn.24xlarge, which has 8 GPUs and a 100 Gbps network and costs $31.2 per hour. To match the 100 Gbps network bandwidth of each p3dn.24xlarge, the user needs four c5n.xlarge instances, which cost $0.864 per hour in total, less than 3% of the cost of the p3dn.24xlarge VM.
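The cost ratios above follow from simple arithmetic on the quoted hourly prices; the short Python sketch below merely reproduces the two figures.

    # Hourly AWS prices quoted in the text (USD per hour).
    c5n_xlarge = 0.216     # 4 CPU cores, 25 Gbps network
    p3_16xlarge = 24.48    # 8 GPUs, 25 Gbps network
    p3dn_24xlarge = 31.2   # 8 GPUs, 100 Gbps network

    # One c5n.xlarge summation server per p3.16xlarge accelerator VM.
    print(c5n_xlarge / p3_16xlarge)          # ~0.0088, i.e. less than 1% of the GPU VM cost

    # Four c5n.xlarge per p3dn.24xlarge, to match its 100 Gbps NIC.
    print(4 * c5n_xlarge / p3dn_24xlarge)    # ~0.028, i.e. less than 3% of the GPU VM cost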
Table 2 also shows the CPU requirements of PS for comparison. A more complex (advanced) optimizer requires more CPU cores. For example, the Adam optimizer used for BERT requires 18.75 times as many CPU cores as SS, which amounts to 56% of the cost of the p3dn.24xlarge accelerators.
If the user instead selects a GPU-based PS, for example using 30 p3.2xlarge instances (the cheapest GPU VM, with a 10 Gbps network) to match the accelerators of three p3dn.24xlarge instances (300 Gbps in total), it is even more expensive: about 98% of the accelerator cost.
This large cost difference between SS and PS is due to the fact that the former server only sums up, while the latter server also updates parameters and becomes a bottleneck.
So far, the new neural network training architecture (SS) and its neural network training method according to embodiments of the present disclosure have been described in conjunction with figs. 3 to 9B and analyzed in conjunction with figs. 10A to 13B, including the optimality analysis and the training algorithm analysis. As can be seen from the above, compared with all-reduce and PS, the new neural network training architecture (SS) and its training method according to embodiments of the present disclosure achieve the best performance, support both synchronous and asynchronous training, and are cross-framework.
Hereinafter, the present disclosure will describe an apparatus, device, and non-transitory computer-readable storage medium for neural network training according to embodiments of the present disclosure in connection with fig. 14 to 17.
Fig. 14 is a schematic diagram of an apparatus 1400 for neural network training, according to some embodiments of the present disclosure. As shown in fig. 14, an apparatus 1400 for neural network training according to some embodiments of the present disclosure may include a tensor generation module 1410, a tensor summation module 1420, and a parameter update module 1430, wherein the tensor generation module 1410 and the parameter update module 1430 are on a first processing unit and the tensor summation module 1420 is on a second processing unit different from the first processing unit. In some embodiments, the first processing unit is a GPU and the second processing unit is a CPU; note that this is not limiting. The tensor generation module 1410 is used to generate a tensor related to the neural network, which in some example embodiments is a gradient tensor. The tensor summation module 1420 sums the tensors generated by the tensor generation module 1410 to obtain a global tensor sum. The parameter update module 1430 performs parameter updates of the neural network based on the global tensor sum. Alternatively or additionally, the tensor generation module 1410, tensor summation module 1420, and parameter update module 1430 shown in fig. 14 may also perform the neural network training method described above in connection with figs. 4 to 9B according to embodiments of the disclosure.
Fig. 15 is a schematic diagram of an apparatus 1500 for neural network training, according to some embodiments of the present disclosure. As shown in fig. 15, an apparatus 1500 for neural network training according to embodiments of the present disclosure may include a processor 1510 and a memory 1520, the processor 1510 including a first processing unit 1510-1 and a second processing unit 1510-2 different from the first processing unit, the first processing unit 1510-1 being a GPU and the second processing unit 1510-2 being a CPU in some embodiments; stored on the memory 1520 are computer instructions that, when loaded and executed by the processor 1510, cause the processor 1510 to perform the method for neural network training according to the embodiments of the disclosure described above in connection with fig. 4-9B.
Fig. 16 is another schematic diagram of an apparatus 1600 for neural network training, according to some embodiments of the present disclosure. Fig. 16 illustrates a schematic of a structure of an apparatus 1600 suitable for use in implementing training of neural networks according to embodiments of the present disclosure. The electronic device 1600 may be a cloud platform or server, etc. It should be noted that the electronic device for neural network training shown in fig. 16 is merely an example, which does not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 16, the electronic device 1600 may include a processing device 1610, where the processing device 1610 includes a first processing unit (e.g., GPU) and a second processing unit (e.g., CPU), which may perform various suitable actions and processes according to programs stored in a Read Only Memory (ROM) 1620 or programs loaded from a storage device 1680 into a Random Access Memory (RAM) 1630. In RAM1630, various programs and data required for operation of device 1600 are also stored. Processing device 1610, ROM 1620, and RAM1630 are connected to each other via bus 1640. An input/output (I/O) interface 1650 is also connected to the bus 1640.
In general, the following devices may be connected to the I/O interface 1650: input devices 1660 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 1670 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 1680 including, for example, magnetic tape, hard disk, etc.; and a communication device 1690. The communication means 1690 may allow the electronic device 1600 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 16 shows an electronic device 1600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 1690, or from storage device 1680, or from ROM 1620. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 1610.
Fig. 17 is a schematic diagram of a non-transitory computer-readable storage medium 1700 for neural network training, according to some embodiments of the disclosure. As shown in fig. 17, a non-transitory computer-readable storage medium 1700 for neural network training in accordance with an embodiment of the present disclosure has stored thereon computer program instructions 1710, which when loaded and executed by a processor, cause the processor to perform the method for neural network training described above in connection with fig. 4-9B.
In the above, the new neural network training architecture and its neural network training method, apparatus, device, and non-transitory computer-readable storage medium according to embodiments of the present disclosure were described in conjunction with figs. 3 to 9B and 14 to 17, and the new neural network training architecture and its neural network training method according to embodiments of the present disclosure were analyzed in conjunction with figs. 10A to 13B. As can be seen from the above disclosure, compared with all-reduce and PS, the new neural network training architecture (SS) and its neural network training method according to embodiments of the present disclosure achieve the best performance, support both synchronous and asynchronous training, and are cross-framework.
It should be noted that the computer readable medium described above in the present disclosure may be a computer readable signal medium or a non-transitory computer readable storage medium or any combination of the above. The non-transitory computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the non-transitory computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a non-transitory computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a non-transitory computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two internet protocol addresses; sending a node evaluation request comprising the at least two internet protocol addresses to node evaluation equipment, wherein the node evaluation equipment selects an internet protocol address from the at least two internet protocol addresses and returns the internet protocol address; receiving an Internet protocol address returned by the node evaluation equipment; wherein the acquired internet protocol address indicates an edge node in the content distribution network.
Alternatively, the computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: receiving a node evaluation request comprising at least two internet protocol addresses; selecting an internet protocol address from the at least two internet protocol addresses; returning the selected internet protocol address; wherein the received internet protocol address indicates an edge node in the content distribution network.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, at least the following methods, apparatuses, devices, and non-transitory computer storage media for neural network training are provided.
A neural network training method according to one or more embodiments of the present disclosure, the method comprising: generating, by a first processing unit, a tensor related to the neural network; performing, by a second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, a parameter update of the neural network based on the global tensor sum, wherein the first processing unit is different from the second processing unit.
The neural network training method according to one or more embodiments of the present disclosure further includes, after tensor summing the tensors generated by the first processing unit by the second processing unit, communicating the global tensor sum to the first processing unit.
A neural network training method according to one or more embodiments of the present disclosure, wherein performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum includes: performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit through direct copying to obtain the global tensor sum.
A neural network training method according to one or more embodiments of the present disclosure, wherein performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum includes: performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit through topology awareness to obtain the global tensor sum.
A neural network training method according to one or more embodiments of the present disclosure, wherein tensor summation is performed on tensors generated by the first processing unit by the second processing unit, to obtain a global tensor sum, including: direct replication is applied to a first set of tensors generated by a first processing unit, and topology awareness is applied to a second set of tensors generated by the first processing unit.
A neural network training method according to one or more embodiments of the present disclosure, wherein tensor summation is performed on tensors generated by the first processing unit by the second processing unit, to obtain a global tensor sum, including: for the two-level tree in the hierarchical topology, when the bandwidth of the lower-layer link is smaller than that of the upper-layer link, tensor summation is carried out on tensors generated by the first processing unit through direct copying; and tensor summation is performed on the tensor generated by the first processing unit by the second processing unit through topology awareness when the bandwidth of the lower link is greater than the bandwidth of the upper link.
A neural network training method according to one or more embodiments of the present disclosure, wherein tensor summing, by the second processing unit, the tensors generated by the first processing unit, by direct replication, results in a global tensor sum, comprising: copying the tensor generated by the first processing unit to the second processing unit; the second processing unit performs tensor summation on the tensors generated by the first processing unit to obtain a global tensor summation; and wherein communicating the global tensor sum to the first processing unit comprises: the global tensor sum is copied back to the first processing unit.
A neural network training method according to one or more embodiments of the present disclosure, wherein tensor summation is performed on tensors generated by the first processing unit by the second processing unit through topology awareness, to obtain a global tensor sum, including: step 1: tensor summation is carried out on tensors generated by a first processing unit under the same switch to obtain aggregation data; step 2: copying the aggregate data to a second processing unit; and step 3: the second processing unit performs tensor summation on the aggregation data to obtain a global tensor sum; and wherein communicating the global tensor sum to the first processing unit comprises: step 4: the first processing unit copies its partition data from the second processing unit back to itself; step 5: each second processing unit performs a global-collection operation with the second processing units under the same switch.
A neural network training method according to one or more embodiments of the present disclosure, wherein the steps 1, 2, 3, 4 and 5 are performed by a pipeline, and wherein a respective tensor queue is maintained at least one of the steps 1, 2, 3, 4 and 5.
A neural network training method according to one or more embodiments of the present disclosure, wherein during execution of the steps 1, 2, 3, 4 and 5 through a pipeline, tensors generated by the first processing unit are tensor-partitioned such that tensors to be processed are the same in size.
A neural network training method according to one or more embodiments of the present disclosure, wherein generating, by a first processing unit, a tensor related to the neural network; performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, the parameter update of the neural network based on the global tensor sum includes: the first processing unit communicates the tensor it generated to the second processing unit; the second processing unit performs tensor summation on the received tensors; after the second processing unit completes the tensor summation of the tensors generated by all the first processing units, the first processing unit pulls the global tensor sum from the second processing unit; and the first processing unit performs the parameter update based on the global tensor sum; wherein the tensor is a gradient.
A neural network training method according to one or more embodiments of the present disclosure, wherein generating, by a first processing unit, a tensor related to the neural network; performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, the parameter update of the neural network based on the global tensor sum includes: the first processing unit locally performs the parameter update and calculates the difference between the updated parameters and the parameters before the update; the first processing unit communicates the calculated difference to the second processing unit; the second processing unit maintains parameter values based on the received differences; the first processing unit pulls the parameter values from the second processing unit; and the first processing unit performs the parameter update based on the parameter values, wherein the operation of the first processing unit pulling the parameter values from the second processing unit does not have to be performed after the second processing unit has completed summing the differences generated by all the first processing units.
A neural network training method according to one or more embodiments of the present disclosure, wherein the first processing unit is a Graphics Processor (GPU) and the second processing unit is a Central Processing Unit (CPU).
An apparatus for neural network training according to one or more embodiments of the present disclosure, the apparatus comprising: a tensor generation module that generates a tensor associated with the neural network; a tensor summation module that performs tensor summation on the tensors generated by the tensor generation module to obtain a global tensor sum; and a parameter update module that performs parameter updating of the neural network based on the global tensor sum, wherein the tensor generation module and the parameter update module are on a first processing unit and the tensor summation module is on a second processing unit different from the first processing unit.
An apparatus for neural network training according to one or more embodiments of the present disclosure includes a processor and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to implement a neural network training method according to embodiments of the present disclosure, wherein the processor includes a first processing unit and a second processing unit different from the first processing unit.
A non-transitory computer-readable storage medium according to one or more embodiments of the present disclosure has stored thereon computer instructions that, when executed by a computer, perform a neural network training method according to embodiments of the present disclosure.
The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example embodiments formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (15)

1. A neural network training method, comprising:
generating, by a first processing unit, a tensor related to the neural network;
the second processing unit sums the tensors generated by the first processing unit to obtain a global tensor sum; and
performing, by a first processing unit, a parameter update of the neural network based on the global tensor sum,
the second processing unit performs tensor summation on the tensors generated by the first processing unit, so as to obtain a global tensor sum, which includes: the tensor generated by the first processing unit is summed by the second processing unit, through topology awareness, to obtain a global tensor sum,
the second processing unit summing the tensors generated by the first processing unit through topology awareness to obtain the global tensor sum includes the following steps:
step 1: tensor summation is carried out on tensors generated by a first processing unit under the same switch to obtain aggregation data;
Step 2: copying the aggregate data to a second processing unit; and
step 3: the second processing unit performs tensor summation on the aggregation data to obtain a global tensor sum;
wherein the first processing unit and the second processing unit are hardware processing units and the first processing unit is different from the second processing unit.
2. The method of claim 1, further comprising, after the tensor summation by the second processing unit of the tensors generated by the first processing unit, communicating the global tensor summation to the first processing unit.
3. The method of claim 2, wherein summing the tensors generated by the first processing unit by the second processing unit to obtain a global tensor sum comprises: performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit through direct copying to obtain the global tensor sum.
4. The method of claim 3, wherein summing the tensors generated by the first processing unit by the second processing unit to obtain a global tensor sum comprises: direct replication is applied to a first set of tensors generated by a first processing unit, and topology awareness is applied to a second set of tensors generated by the first processing unit.
5. The method of claim 3, wherein summing the tensors generated by the first processing unit by the second processing unit to obtain a global tensor sum comprises: for a two-level tree in a hierarchical topology,
when the bandwidth of the lower link is smaller than that of the upper link, tensor summation is performed on the tensors generated by the first processing unit through direct copying; and
When the bandwidth of the lower link is larger than the bandwidth of the upper link, tensors generated by the first processing unit are summed by the second processing unit through topology awareness.
6. The method of any of claims 3 to 5, wherein tensor summing, by the second processing unit, the tensors generated by the first processing unit by direct replication, resulting in a global tensor sum comprises:
copying the tensor generated by the first processing unit to the second processing unit;
the second processing unit performs tensor summation on the tensors generated by the first processing unit to obtain a global tensor summation;
and wherein communicating the global tensor sum to the first processing unit comprises: the global tensor sum is copied back to the first processing unit.
7. The method of any of claims 2 to 5, wherein communicating the global tensor sum to a first processing unit comprises:
Step 4: the first processing unit copies its partition data from the second processing unit back to itself;
step 5: each second processing unit performs a global-collection operation with the second processing units under the same switch.
8. The method of claim 7, wherein the steps 1, 2, 3, 4 and 5 are performed by pipelining,
and wherein a respective tensor queue is maintained at least one of said steps 1, 2, 3, 4 and 5.
9. The method of claim 8, wherein during execution of the steps 1, 2, 3, 4 and 5 through a pipeline, tensors generated by the first processing unit are tensor partitioned such that the tensors to be processed are the same in size.
10. The method of any one of claims 1 to 5, wherein generating, by a first processing unit, a tensor related to the neural network; performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, the parameter update of the neural network based on the global tensor sum includes:
the first processing unit communicates the tensor it generated to the second processing unit;
The second processing unit performs a tensor summation on the received tensors;
after the second processing unit completes the tensor summation of the tensors generated by all the first processing units, the first processing unit pulls the global tensor summation from the second processing unit; and
the first processing unit performs the parameter update based on the global tensor sum;
wherein the tensor is a gradient.
11. The method of any one of claims 1 to 5, wherein generating, by a first processing unit, a tensor related to the neural network; performing, by the second processing unit, tensor summation on the tensors generated by the first processing unit to obtain a global tensor sum; and performing, by the first processing unit, the parameter update of the neural network based on the global tensor sum includes:
the first processing unit locally executes parameter updating and calculates the difference between the parameter after updating and the parameter before updating;
the first processing unit communicating the calculated difference to the second processing unit;
the second processing unit maintaining parameter values based on the received differences;
the first processing unit pulls out the parameter value from the second processing unit; and
the first processing unit performs a parameter update based on the parameter values,
Wherein the operation of the first processing unit pulling out the parameter values from the second processing unit does not have to be performed after the second processing unit has completed summing up the tensors of the differences generated by all the first processing units.
12. The method of any of claims 1 to 5, wherein the first processing unit is a Graphics Processor (GPU) and the second processing unit is a Central Processing Unit (CPU).
13. An apparatus for neural network training, the apparatus comprising:
a tensor generation module that generates a tensor associated with the neural network;
the tensor summation module is used for performing tensor summation on the tensors generated by the tensor generation module to obtain a global tensor sum; and
a parameter updating module that performs parameter updating of the neural network based on the global tensor sum,
the tensor summation module performs tensor summation on the tensors generated by the tensor generation module to obtain a global tensor sum, and the method comprises the following steps: tensor summation is carried out on tensors generated by the tensor generation module through topology perception by the tensor summation module to obtain a global tensor summation,
the tensor summation module performs tensor summation on the tensors generated by the tensor generation module through topology perception, so as to obtain a global tensor sum, and the method comprises the following steps:
Step 1: tensor summation is carried out on tensors generated by a tensor generation module under the same switch to obtain aggregation data;
step 2: copying the aggregate data to a tensor summing module; and
step 3: the tensor summation module performs tensor summation on the aggregated data to obtain a global tensor sum; wherein the tensor generation module and the parameter update module are on a first processing unit and the tensor summation module is on a second processing unit different from the first processing unit, and wherein the first processing unit and the second processing unit are hardware processing units.
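As a hedged illustration of Steps 1 to 3 in claim 13 (not the claimed hardware arrangement), the sketch below first sums the tensors produced under each switch, copies only the per-switch aggregates across the switch boundary, and then reduces those aggregates into the global tensor sum. The topology dictionary and switch names are assumed example data; the structure suggests that only one aggregate per switch needs to cross the inter-switch links.

```python
# Illustrative only: topology-aware summation in three steps.
import numpy as np

topology = {                      # switch id -> tensors produced under that switch
    "switch_0": [np.array([1., 2.]), np.array([3., 4.])],
    "switch_1": [np.array([5., 6.]), np.array([7., 8.])],
}

# Step 1: intra-switch summation produces one aggregate per switch.
aggregates = {sw: sum(tensors) for sw, tensors in topology.items()}

# Step 2: only the aggregates cross the switch boundary (less inter-switch traffic).
copied_to_summation_module = list(aggregates.values())

# Step 3: the summation module reduces the aggregates into the global tensor sum.
global_tensor_sum = sum(copied_to_summation_module)
print(global_tensor_sum)          # [16. 20.] for this example
```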
14. An apparatus for neural network training, the apparatus comprising a processor and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of claims 1 to 12, wherein the processor comprises a first processing unit and a second processing unit different from the first processing unit.
15. A non-transitory computer readable storage medium having stored thereon computer instructions which, when executed by a computer, perform the method of any of claims 1 to 12.
CN202010089702.7A 2020-02-12 2020-02-12 Neural network training method, device and equipment thereof Active CN111275173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010089702.7A CN111275173B (en) 2020-02-12 2020-02-12 Neural network training method, device and equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010089702.7A CN111275173B (en) 2020-02-12 2020-02-12 Neural network training method, device and equipment thereof

Publications (2)

Publication Number Publication Date
CN111275173A CN111275173A (en) 2020-06-12
CN111275173B (en) 2023-08-04

Family

ID=71003759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010089702.7A Active CN111275173B (en) 2020-02-12 2020-02-12 Neural network training method, device and equipment thereof

Country Status (1)

Country Link
CN (1) CN111275173B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342525A (en) * 2020-07-24 2021-09-03 北京一流科技有限公司 Distributed data processing system and method thereof
CN114363248B (en) * 2020-09-29 2023-04-07 华为技术有限公司 Computing system, accelerator, switching plane and aggregation communication method
CN114298329A (en) * 2021-08-05 2022-04-08 腾讯科技(深圳)有限公司 Model training method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245743A (en) * 2019-05-23 2019-09-17 中山大学 A kind of asynchronous distributed deep learning training method, apparatus and system
WO2019181374A1 (en) * 2018-03-23 2019-09-26 日本電信電話株式会社 Distributed deep learning system
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN110503194A (en) * 2019-08-09 2019-11-26 苏州浪潮智能科技有限公司 A kind of method and system of distributed parallel training

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949746B2 (en) * 2016-10-27 2021-03-16 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
US11164079B2 (en) * 2017-12-15 2021-11-02 International Business Machines Corporation Multi-GPU deep learning using CPUs

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019181374A1 (en) * 2018-03-23 2019-09-26 日本電信電話株式会社 Distributed deep learning system
CN110245743A (en) * 2019-05-23 2019-09-17 中山大学 A kind of asynchronous distributed deep learning training method, apparatus and system
CN110503194A (en) * 2019-08-09 2019-11-26 苏州浪潮智能科技有限公司 A kind of method and system of distributed parallel training
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures; Jose Marques et al.; arXiv:1712.02546v1; 2017-12-07; entire document *
Scaling Distributed Machine Learning with the Parameter Server; Mu Li et al.; BigDataScience '14: Proceedings of the 2014 International Conference on Big Data Science and Computing; 2014-08-04; entire document *
Large-scale prototype learning algorithms for heterogeneous parallel architectures; Su Tonghua et al.; Journal of Harbin Institute of Technology; 2016-11-30; Vol. 48, No. 11; entire document *

Also Published As

Publication number Publication date
CN111275173A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
Thorpe et al. Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads
CN111275173B (en) Neural network training method, device and equipment thereof
US11372919B2 (en) Distributed graph databases that facilitate streaming data insertion and queries by efficient throughput edge addition
US11321393B2 (en) Distributed graph databases that facilitate streaming data insertion and queries by reducing number of messages required to add a new edge by employing asynchronous communication
Tatsumura et al. Scaling out Ising machines using a multi-chip architecture for simulated bifurcation
US11106976B2 (en) Neural network output layer for machine learning
US11314775B2 (en) Distributed graph databases that facilitate streaming data insertion and low latency graph queries
US11429902B2 (en) Method, device and computer program product for deploying a machine learning model
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
JP2021505993A (en) Robust gradient weight compression scheme for deep learning applications
US20210295168A1 (en) Gradient compression for distributed training
US20190279038A1 (en) Data flow graph node parallel update for machine learning
Chen et al. On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs
Fang et al. RedSync: reducing synchronization bandwidth for distributed deep learning training system
US11763146B1 (en) Processing loops in computational graphs
CN112836787A (en) Reducing deep neural network training times through efficient hybrid parallelization
Rong et al. Distributed equivalent substitution training for large-scale recommender systems
Wang et al. MGG: Accelerating graph neural networks with fine-grained intra-kernel communication-computation pipelining on multi-GPU platforms
FR2971596A1 (en) DEVICE FOR ACCELERATING THE EXECUTION OF A SYSTEMS SIMULATION
Lin et al. Purine: A bi-graph based deep learning framework
CN110325984B (en) System and method for hierarchical community detection in graphics
WO2022250910A1 (en) Scaling deep graph learning in distributed setting
CN114358859B (en) Large-scale embedding model training method and system based on graph and used for click rate prediction
Berloco et al. Distributed Analytics For Big Data: A Survey

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant