CN114580664A - Training analysis method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN114580664A (application CN202210203925.0A)
- Authority
- CN
- China
- Prior art keywords
- time
- tensor
- communication
- dependency graph
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N20/00 — Machine learning; G06N20/20 — Ensemble learning
- G06N3/02 — Neural networks; G06N3/04 — Architecture, e.g. interconnection topology
- G06N3/08 — Learning methods; G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N5/00 — Computing arrangements using knowledge-based models; G06N5/01 — Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The disclosure relates to a training analysis method, a training analysis device, a storage medium and an electronic device, which are used for realizing fine-grained tracking of communication operators, aligning tracking timestamps from different computing units and improving analysis accuracy of distributed training. The method comprises the following steps: determining a global dependency graph of the deep neural network; determining the time offset of a plurality of processors relative to a reference processor by taking a minimized time alignment function as a target and taking the operation dependency among all nodes in the global dependency graph before and after time alignment as a constraint condition; and according to the global dependency graph and the time offset, simulating a distributed training process of the deep neural network in an off-line mode, and determining a distributed training analysis result of the deep neural network according to an off-line simulation result.
Description
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a training analysis method, an apparatus, a storage medium, and an electronic device.
Background
Distributed training for large datasets has been widely used in machine learning models, such as Deep Neural Networks (DNNs), to support various intelligent technology driven applications. Compared with single-node training, distributed training of the same machine learning model using multiple processors can reduce training time cost.
In the related art, distributed training is analyzed mainly by estimating the communication time of synchronizing tensors between processors from the bandwidth and the tensor size. A tensor is data generated in the distributed training process, and may be, for example, a gradient generated during distributed training. This coarse-grained estimation method treats the synchronization of one tensor as a black box and does not distinguish communication queuing time from actual communication transmission time, so the specific training details of distributed training cannot be accurately analyzed.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a training analysis method for analyzing a distributed training process of a deep neural network, the deep neural network operating on a plurality of devices, the method comprising:
determining a global dependency graph of the deep neural network, wherein the global dependency graph is used for characterizing the dependencies between the computation nodes in the deep neural network running on the processor of each device and the communication relationships between communication nodes used to synchronize, among the plurality of processors, the tensors generated by the computation nodes, and the global dependency graph comprises a plurality of communication nodes between every two processors;
determining time offsets of the plurality of processors relative to a reference processor, with minimizing a time alignment function as the objective and with the operation dependencies among the nodes in the global dependency graph remaining unchanged before and after time alignment as the constraint condition, wherein the time alignment function characterizes the time differences of tensors received by processors from the same sending processor and the time offset differences between processors in the same device, and the reference processor is any one of the processors included in the plurality of devices;
and according to the global dependency graph and the time offset, off-line simulating a distributed training process of the deep neural network, and determining a distributed training analysis result of the deep neural network according to an off-line simulation result.
In a second aspect, the present disclosure provides a training analysis apparatus for analyzing a distributed training process of a deep neural network, the deep neural network operating on a plurality of devices, the apparatus comprising:
a first determining module, configured to determine a global dependency graph of the deep neural network, where the global dependency graph is used to characterize the dependencies between computation nodes in the deep neural network running on a processor of each device and the communication relationships between communication nodes used to synchronize, among the plurality of processors, the tensors generated by the computation nodes, and the global dependency graph includes a plurality of the communication nodes between every two processors;
a second determining module, configured to determine time offsets of the multiple processors relative to a reference processor, with minimizing a time alignment function as the objective and with the operation dependencies between nodes in the global dependency graph remaining unchanged before and after time alignment as the constraint condition, where the time alignment function characterizes the time differences of tensors received by processors from the same sending processor and the time offset differences between processors in the same device, and the reference processor is any one of the processors included in the multiple devices;
and the training simulation module is used for simulating the distributed training process of the deep neural network in an off-line manner according to the global dependency graph and the time offset, and determining the distributed training analysis result of the deep neural network according to the off-line simulation result.
In a third aspect, the present disclosure provides a non-transitory computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.
By the above technical scheme, the distributed training process of the deep neural network can be simulated and analyzed offline according to the global dependency graph and the time offsets. Compared with the related art, in which tensor synchronization between two processors is treated as a black box, this is equivalent to splitting one tensor synchronization process into multiple communication sub-processes, thereby distinguishing communication queuing time from actual communication transmission time, obtaining more accurate distributed training time, and improving the offline analysis accuracy of distributed training. On the other hand, the time offsets are obtained by minimizing a time alignment function that characterizes the time differences of tensors received by processors from the same sending processor and the time offset differences between processors in the same device, so that the timestamps of the offline simulated training on different processors can be aligned offline, providing a more accurate communication transmission time for the synchronization of each tensor and further improving the analysis accuracy of distributed training.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a training analysis method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a global dependency graph in a training analysis method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a critical path in a training analysis method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a process diagram illustrating a training analysis method according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a training analysis device according to an exemplary embodiment of the present disclosure;
fig. 6 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units. It is further noted that references to "a", "an", and "the" modifications in the present disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
All actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
Distributed training for large datasets has been widely used in machine learning models, such as Deep Neural Networks (DNNs), to support various intelligent technology driven applications. Compared with single-node training, the distributed training of the same machine learning model by using a plurality of devices can reduce the training time cost.
In the related art, an analyzer is mainly used to profile a deep neural network job (i.e., the training process of a deep neural network), collecting the trace of each operation in the job (mainly the execution time of each operation) and analyzing the dependencies between operations to obtain the dependency graph corresponding to the job; a simulator then simulates the running of the job using the results collected by the analyzer to obtain a performance evaluation of the job.
The inventor has found through research that, in a distributed training scenario, the related art can collect the trace of each operation in a job through hardware analysis tools, built-in analyzers of machine learning frameworks, and analyzers in communication libraries, and can analyze the dependencies between operations to obtain the dependency graph corresponding to the job. A hardware analysis tool may collect the start or end time of a processor kernel, GPU (Graphics Processing Unit) memory usage, and other hardware information, and may also collect related analysis information while the job is running. However, such hardware analysis tools depend on the hardware and do not capture the dependencies between operators in the deep neural network, which makes parsing kernel-level traces challenging.
As for built-in analyzers of machine learning frameworks, TensorFlow, PyTorch, and MXNet, for example, all provide built-in analyzers. These analyzers collect trace information for operators in deep neural networks, including the time and memory consumption of executing each operator. The TensorFlow and MXNet analyzers also collect coarse-grained communication traces for their distributed APIs (Application Programming Interfaces), including the start time and end time of each communication operator. However, these tools do not exclude communication queuing time inside the communication library and therefore cannot obtain the true communication transmission time.
As for analyzers in communication libraries, two communication schemes (i.e., parameter synchronization schemes) are widely adopted for distributed training in the related art: (1) AllReduce, in which the processors participating in training (also called workers, usually GPUs) are connected in a tree or ring topology, synchronously aggregate gradients using the corresponding communication pattern, and then update the training parameters; (2) the Parameter Server (PS) architecture, in which a processor participating in training pushes its locally generated gradient to the PS and then pulls the aggregated gradient from the PS. The AllReduce communication scheme may treat the entire synchronization task of a tensor as a single operation and collect the start time and duration of such a synchronization task on a single processor. The PS communication scheme allows tensor division, and its analyzer can collect the time taken to pull a tensor from the PS and the time taken to upload a tensor to the PS. However, neither communication scheme supports the tracking of computation operators.
Alternatively, the related art may also monitor the gradient noise metric by inserting a monitoring operation into the dependency graph and aggregate the gradient noise metric between processors using a collective communication layer. However, this approach also does not track the execution time of operators and the dependencies among operators.
Further, the inventors have also found through research that the analyzers in the related art do not support timestamp alignment between the processors and the parameter server. Even with clock synchronization tools such as NTP (Network Time Protocol) or other more accurate tools, there is millisecond or sub-millisecond clock drift between the processors and the parameter server. Moreover, the analysis tools in the related art can capture only the start time of a RECV operation (i.e., a tensor receiving operation), but cannot acquire the exact time at which data reception actually starts. If the timestamps are not aligned exactly, errors in the communication timestamps may accumulate, thereby increasing the error of the end-to-end performance estimation.
In summary, in a distributed training scenario, an analyzer in the related art has a problem that operator tracking and timestamp alignment cannot be supported. In addition, the analyzer in the related art estimates the communication time of the synchronization tensor between the processors mainly based on the bandwidth and the tensor size. The coarse-grained estimation method considers the synchronization of one tensor as a black box, and does not distinguish communication queuing time and actual communication transmission time, so that the simulator cannot accurately simulate the running time of distributed training operation, and further cannot perform accurate performance evaluation on the distributed training operation.
In view of this, the present disclosure provides a training analysis method, which provides a general distributed training analysis manner, distinguishes communication queuing time and actual communication transmission time, realizes fine-grained communication operator tracking, aligns tracking timestamps from different processors, and provides more accurate global time for distributed deep neural network training, so that a simulator can more accurately simulate running time of distributed training operation, and further perform more accurate performance evaluation on the distributed training operation.
FIG. 1 is a flow chart illustrating a training analysis method according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the training analysis method may be used for analyzing a distributed training process of a deep neural network, the deep neural network running on a plurality of devices, and includes the following steps:

Step 101, determining a global dependency graph of the deep neural network. The global dependency graph is used for characterizing the dependencies between the computation nodes in the deep neural network running on the processor of each device and the communication relationships between communication nodes used to synchronize, among the plurality of processors, the tensors generated by the computation nodes, and the global dependency graph includes a plurality of communication nodes between every two processors.

Step 102, determining the time offsets of the multiple processors relative to a reference processor, with minimizing a time alignment function as the objective and with the operation dependencies among the nodes in the global dependency graph remaining unchanged before and after time alignment as the constraint condition. The time alignment function characterizes the time differences of tensors received by processors from the same sending processor and the time offset differences between processors in the same device, and the reference processor is any processor included in the plurality of devices.

Step 103, simulating the distributed training process of the deep neural network offline according to the global dependency graph and the time offsets, and determining a distributed training analysis result of the deep neural network according to the offline simulation result.
It should be understood at first that, an implementation scenario of the training analysis method in the embodiment of the present disclosure may be that after performing online training on a deep neural network running on a plurality of processors, performance data (such as time overhead of each computing node and communication node) of the online training is obtained. And then, performing off-line simulation based on the performance data, thereby obtaining a distributed training analysis result of the deep neural network and realizing off-line optimization of the deep neural network. The deep neural network can be a deep neural network capable of realizing different application functions such as text recognition, voice recognition, image recognition and the like. It should be understood that if the deep neural network is a deep neural network for implementing an image recognition function, the calculation operation and the communication operation referred to in the present disclosure are both operations on image data. Of course, if the deep neural network is a deep neural network for implementing other functions, the computing operation and the communication operation mentioned in the present disclosure are operations for corresponding types of data.
Further, it should be understood that the processor in the embodiments of the present disclosure may be a processor having a hardware structure, such as a GPU. Accordingly, the processor of a device may be all GPUs that the device includes. Alternatively, the processor in the embodiments of the present disclosure may be a process for performing a corresponding computing function. Accordingly, the processor of the device may be a process running a deep neural network, which is not limited by the embodiments of the present disclosure.
In a possible manner, step 101 may be: determining a local dependency graph of the deep neural network running on a processor of each device, wherein the local dependency graph is used for representing the dependency relationship between the computing nodes in the deep neural network, and the local dependency graph comprises a first virtual operation identification used for indicating the tensor synchronization start and the tensor synchronization end. Then, a communication topology of the plurality of processors is determined, the communication topology being used for characterizing communication relations between communication nodes for synchronizing tensors generated by the compute nodes among the plurality of processors, and the communication topology including a second virtual operation identifier for indicating a start of tensor synchronization and an end of tensor synchronization. And finally, establishing association between the local dependency graph and the communication topology through the first virtual operation identifier and the second virtual operation identifier with the same identification information to obtain a global dependency graph.
That is to say, the establishment of the global dependency graph can be divided into two parts, namely the local dependency graph and the communication topology. As shown in fig. 2, the local dependency graph corresponds to the computation part of distributed training, and each circle in the local dependency graph in fig. 2 represents a computation node corresponding to one layer of the network structure in the deep neural network. The local dependency graph may be established according to the network structure of the deep neural network, or may be established by referring to the way dependency graphs are generated in machine learning frameworks such as TensorFlow and MXNet, which is not repeated here. In the embodiment of the disclosure, in order to associate the local dependency graph with the communication topology, the local dependency graph may include first virtual operation identifiers indicating the start of tensor synchronization and the end of tensor synchronization. For example, referring to fig. 2, the local dependency graph includes a first virtual operation identifier A1 indicating the start of tensor synchronization and a first virtual operation identifier B1 indicating the end of tensor synchronization. It should be understood that fig. 2 represents virtual communication operations as dashed triangles.
With continued reference to fig. 2, the communication topology is used to describe how tensors are transmitted between processors of the device, the nodes in fig. 2 are used to represent the processors of the device (i.e., computation units participating in training in distributed training, such as GPUs or processing processes, also referred to as workers), the dashed triangles in fig. 2 represent virtual communication operations, the solid triangles in fig. 2 represent communication nodes, and one communication node corresponds to a communication operation (a sending operation or a receiving operation).
In the embodiment of the present disclosure, the communication topology mainly includes two types of communication operations: 1) a producer sends a tensor to another processor; 2) a consumer receives a tensor from another processor. Each transfer of a tensor between any two processors can be tagged with a unique transaction ID, which connects the producer to the corresponding consumer possessing the same transaction ID. One triangle in the communication topology in fig. 2 represents one communication operation (either a sending operation or a receiving operation).
In the PS architecture, each tensor upload operation and tensor pull operation is considered as a pair of transmit and receive operations between the processor and the parameter server. Correspondingly, the transaction ID may be generated by an IP address of a sender or a receiver, a tensor name, and an operation type identifier, where the operation type identifier is used to characterize an operation corresponding to the transaction ID as a tensor uploading operation or a tensor pulling operation. Whereas in the AllReduce architecture, the transfer of the tensor to the next processor (i.e., worker) on the AllReduce ring can be represented by a pair of transmit and receive operations. Accordingly, the transaction ID may be generated by the block ID of the tensor partition and the step ID in the AllReduce ring. The block ID of the tensor partition is used for identifying the tensor corresponding to each operation, and the step ID is used for identifying which two processors in the AllReduce ring are targeted by each operation. The relevant content of the AllReduce ring can refer to the related art, and is not described herein.
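As an illustration, the sketch below shows one way such transaction IDs could be formed for the two architectures; the field layout and helper names are assumptions for illustration rather than a prescribed format.

```python
# Illustrative sketch of transaction-ID construction for matching a tensor sending
# operation with its corresponding receiving operation. The field layout is an
# assumption; any scheme that is unique per transfer and identical on both ends works.

def ps_transaction_id(peer_ip: str, tensor_name: str, op_type: str) -> str:
    """PS architecture: op_type distinguishes tensor upload ('push') from pull ('pull')."""
    return f"{peer_ip}::{tensor_name}::{op_type}"

def allreduce_transaction_id(chunk_id: int, step_id: int) -> str:
    """Ring AllReduce: chunk_id identifies the tensor partition being passed along the
    ring, step_id identifies which hop (pair of processors) the transfer belongs to."""
    return f"chunk{chunk_id}::step{step_id}"

# The producer's SEND and the consumer's RECV carry the same ID, so the two trace
# events can be connected in the communication topology.
send_id = ps_transaction_id("10.0.0.2", "conv1/grad", "push")
recv_id = ps_transaction_id("10.0.0.2", "conv1/grad", "push")
assert send_id == recv_id
```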
In the disclosed embodiment, in order to associate the local dependency graph with the communication topology, the communication topology of each tensor may include a pair of input and output virtual operations marked with the tensor name, indicating the start and end of the tensor transmission, i.e., second virtual operation identifiers indicating the start of tensor synchronization and the end of tensor synchronization. For example, referring to fig. 2, the communication topology includes a second virtual operation identifier A2 indicating the start of tensor synchronization and a second virtual operation identifier B2 indicating the end of tensor synchronization.
And then, establishing association between the local dependency graph and the communication topology through the first virtual operation identifier and the second virtual operation identifier which have the same identification information, so as to obtain a global dependency graph. For example, following the above example, the first virtual operation identifier a1 and the second virtual operation identifier a2 have the same identifier information "a", so that the first virtual operation identifier a1 and the second virtual operation identifier a2 have a connection relationship therebetween. Meanwhile, the first virtual operation identifier B1 and the second virtual operation identifier B2 have the same identifier information "B", and thus the first virtual operation identifier B1 and the second virtual operation identifier B2 have a connection relationship therebetween. Therefore, the local dependency graph and the communication topology can be associated to obtain the global dependency graph.
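The association step can be sketched as follows with plain adjacency lists; the node names follow the A/B example above, and the graph encoding is an assumption for illustration.

```python
# Minimal sketch: associate a local dependency graph with a communication topology by
# merging virtual operation identifiers that carry the same identification information.
# Graphs are plain adjacency lists {node: [successor, ...]}; node names are illustrative.

def merge_graphs(local_dag: dict, comm_topo: dict) -> dict:
    def canonical(node: str) -> str:
        # 'A1'/'A2' -> 'A', 'B1'/'B2' -> 'B'; other nodes keep their names.
        if node and node[-1] in "12" and node[:-1].isalpha():
            return node[:-1]
        return node

    global_dag: dict = {}
    for graph in (local_dag, comm_topo):
        for src, dsts in graph.items():
            global_dag.setdefault(canonical(src), [])
            for dst in dsts:
                global_dag[canonical(src)].append(canonical(dst))
    return global_dag

# Local graph: backward op -> virtual "sync start" A1, virtual "sync end" B1 -> update op.
local_dag = {"bw_conv": ["A1"], "B1": ["update_conv"]}
# Communication topology of the same tensor: A2 -> SEND -> RECV -> B2.
comm_topo = {"A2": ["send_conv"], "send_conv": ["recv_conv"], "recv_conv": ["B2"]}
print(merge_graphs(local_dag, comm_topo))
# {'bw_conv': ['A'], 'B': ['update_conv'], 'A': ['send_conv'], 'send_conv': ['recv_conv'], 'recv_conv': ['B']}
```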
By the method, the structure of the global dependency graph is decoupled into the local dependency graph and the communication topology, the local dependency graph and the communication topology are associated through the virtual operation identifiers with the same identification information, and under the condition that processors participating in distributed training are unchanged, when the local dependency graph is different due to different machine learning frames, the communication topology does not need to be modified, various machine learning frames and communication libraries can be supported, and the communication modification cost is reduced.
In addition, referring to fig. 2, the global dependency graph includes a plurality of communication nodes between every two processors, and compared with a mode in which tensor synchronization between the processors is regarded as a black box in the related art, the method is equivalent to splitting a tensor synchronization process into a plurality of communication sub-processes, so that communication queuing time and actual communication transmission time can be distinguished, more accurate distributed training time is obtained, and accuracy of offline analysis of distributed training is improved.
After the global dependency graph is obtained, the time offset of the multiple processors relative to the reference processor can be determined through a time alignment function, the start time of the RECV operation (namely tensor receiving operation) is corrected in an off-line mode, and the more accurate communication duration of each tensor is obtained.
For example, let W and P represent the set of processors (i.e., workers) and the set of PSs in distributed training, respectively. It should be understood that for the AllReduce architecture, P is empty. Let $s_i^{op}$ and $e_i^{op}$ respectively represent the start and end timestamps of an operation op on a node $i \in W \cup P$ measured in the initial trace, and let $\hat{s}_i^{op}$ and $\hat{e}_i^{op}$ respectively indicate the adjusted start and end timestamps, where node i is a processor or a PS. Assume that node 0 is the reference node for the other nodes, i.e., all other nodes are aligned to the time axis of node 0. Accordingly, the time offset $\theta_i$ of node i can be represented by the difference in measured time between node i and node 0, i.e., $\hat{s}_i^{op} = s_i^{op} + \theta_i$ and $\hat{e}_i^{op} = e_i^{op} + \theta_i$. Based on this, a time alignment function can be established to align the timestamps between the nodes in the following manner.
First, operations (i.e., RECV operations) in which the same node receives the same tensor from the same sender in different training iterations should have similar execution times; these receiving operations may be said to belong to the same receiving-operation series. For a pair of tensor sending and receiving operations from node i to node j, the adjusted start time of the sending operation, $\hat{s}_i^{send}$, may be used to correct the start time of the receiving operation, $\hat{s}_j^{recv}$. Thus, the time of receiving a tensor in the time alignment function can be determined as follows: determine the end time of the tensor receiving operation performed by the first processor, and determine the latest time among the start time of the tensor receiving operation performed by the first processor and the start time of the tensor sending operation performed by the second processor, where the end time, the start time of the tensor receiving operation, and the start time of the tensor sending operation are all expressed with the time offset of the corresponding processor relative to the reference processor, and the first processor and the second processor are different. Then, the time difference between the end time of the tensor receiving operation and the latest time is determined as the time for the first processor to receive the tensor from the second processor.

For example, the time to receive the tensor, i.e., the execution time of the tensor receiving operation, changes from $e_j^{recv} - s_j^{recv}$ to $\hat{e}_j^{recv} - \max(\hat{s}_j^{recv}, \hat{s}_i^{send})$.
It should be understood that, in practical applications, a tensor receiving operation may start before the corresponding tensor sending operation; in that case, if the execution time of the tensor receiving operation were measured from its own start time, the determined execution time would be longer than the actual one. Therefore, the above method reduces the time determination error and improves the accuracy of subsequent calculations.
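Under the notation above, the corrected receive duration can be computed directly once the offsets are known; the following is a minimal sketch (argument names are illustrative).

```python
# Sketch: execution time of a tensor receiving operation after time alignment.
# The RECV runs on processor j, the matching SEND on processor i; theta_j and theta_i
# are their time offsets relative to the reference processor.

def aligned_recv_duration(recv_start: float, recv_end: float, send_start: float,
                          theta_j: float, theta_i: float) -> float:
    s_recv = recv_start + theta_j        # adjusted RECV start
    e_recv = recv_end + theta_j          # adjusted RECV end
    s_send = send_start + theta_i        # adjusted SEND start
    # If the RECV was posted before the sender actually started, the early part is
    # queuing/waiting time, so the duration is measured from the later of the two starts.
    return e_recv - max(s_recv, s_send)

# Example: RECV posted at t=10 but the sender only starts at t=14 after alignment.
print(aligned_recv_duration(10.0, 20.0, 13.0, theta_j=0.0, theta_i=1.0))  # 6.0
```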
In the embodiment of the present disclosure, the first objective is to minimize, by adjusting the time offsets θ of the time axes, the variance of the execution time of the tensor receiving operations in the same receiving-operation series:

$$O_1(\theta) = \sum_{f_{recv} \in \mathcal{F}} \operatorname{Var}_{(i,j) \in f_{recv}}\left[\hat{e}_j^{recv} - \max\left(\hat{s}_j^{recv}, \hat{s}_i^{send}\right)\right]$$

where $\mathcal{F}$ represents the set of operations corresponding to all receiving-operation series, $f_{recv}$ represents one receiving-operation series, j represents the node performing the tensor receiving operation, and i represents the node performing the tensor sending operation.
Second, since processors on the same device share the same physical clock, processors on the same device should have the same time offset. Let M be the set of all devices and $g_m$ the set of nodes corresponding to the processors on device m. The time offset difference between processors on the same device is measured according to the following formula:

$$O_2(\theta) = \sum_{m \in M} \operatorname{Var}_{i \in g_m}(\theta_i)$$

Thus, the time offsets of the processors included in the plurality of devices relative to the reference processor, i.e., the time offsets θ for time alignment between the distributed nodes, can be determined by solving the following optimization problem:

$$\min_{\theta}\; a_1 O_1(\theta) + a_2 O_2(\theta)$$
$$\text{s.t.}\quad \theta_0 = 0, \qquad \hat{e}_i^{o_1} \le \hat{s}_j^{o_2}\ \ \forall (o_1, o_2) \in E$$

where $a_1$ and $a_2$ are the weights of the two objectives, $a_1, a_2 \ge 0$, and E is the set of inter-operation dependencies. The constraint $\hat{e}_i^{o_1} \le \hat{s}_j^{o_2}$ ensures that the operation dependencies between nodes in the global dependency graph are not changed before and after time alignment; that is, if operation $o_2$ on node j depends on operation $o_1$ on node i in the global dependency graph, the adjusted start time of the former is not earlier than the adjusted end time of the latter. The optimization problem can be solved with an optimization tool in the related art, such as CVXPY, which is not limited by the embodiments of the present disclosure.
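A minimal CVXPY sketch of this alignment problem is given below. To keep the problem convex, the sketch measures each receive duration from the sender's adjusted start time rather than the max of the two adjusted start times, and it models both objectives as sums of squared deviations; this simplification and all input formats are assumptions for illustration.

```python
import cvxpy as cp
import numpy as np

# Sketch of the timestamp-alignment problem: one offset theta[i] per node, node 0 is
# the reference. All inputs are measured trace timestamps.
#   recv_series: list of series; each series is a list of (i, j, send_start, recv_end)
#       tuples for the same (sender, receiver, tensor) across training iterations.
#   deps: list of (i, o1_end, j, o2_start) tuples, one per dependency edge o1 -> o2.
#   machines: list of lists of node ids sharing a physical clock.

def align_offsets(n_nodes, recv_series, deps, machines, a1=1.0, a2=1.0):
    theta = cp.Variable(n_nodes)

    obj = 0
    for series in recv_series:
        # Receive duration measured from the sender's adjusted start (convex simplification).
        durs = cp.hstack([(e + theta[j]) - (s + theta[i]) for i, j, s, e in series])
        obj += a1 * cp.sum_squares(durs - cp.sum(durs) / len(series))
    for group in machines:
        offs = cp.hstack([theta[i] for i in group])
        obj += a2 * cp.sum_squares(offs - cp.sum(offs) / len(group))

    constraints = [theta[0] == 0]
    # Dependencies must still hold after alignment: adjusted end of o1 <= adjusted start of o2.
    constraints += [e1 + theta[i] <= s2 + theta[j] for i, e1, j, s2 in deps]

    cp.Problem(cp.Minimize(obj), constraints).solve()
    return np.asarray(theta.value)
```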
After the time offset is obtained, the distributed training process of the deep neural network can be simulated and analyzed in an off-line mode according to the time offset and the global dependency graph.
In a possible manner, simulating the distributed training process of the deep neural network offline according to the global dependency graph and the time offsets may be performed as follows. First, each processor corresponds to one operation queue, the sending operations of each tensor in the global dependency graph correspond to one operation queue, and the receiving operations of each tensor correspond to one operation queue; each operation queue is used to store the operations to be executed and has a queue time expressed with the time offset, where the queue time is the end time of the last operation executed by the processor or by the tensor communication process corresponding to that operation queue. Then, the target operation queue with the earliest queue time is determined, the head-of-queue operation stored in the target operation queue is executed, the queue time of the target operation queue is updated based on the end time of that operation, and the step of determining the target operation queue with the earliest queue time is repeated until all operations in the global dependency graph have been executed; among the queue times corresponding to the plurality of operation queues, the latest queue time is determined as the iteration time of the global dependency graph. Accordingly, determining the distributed training analysis result of the deep neural network according to the offline simulation result may be: determining the distributed training analysis result of the deep neural network according to the iteration time.
In particular implementations of the present disclosure, execution of the global dependency graph may be simulated based on a Kahn topological sort algorithm. It should be appreciated that the simulation process in the related art uses a global ready queue to store all currently executable operations. For distributed training, each processor (i.e., worker), PS and communication process can be regarded as a virtual device, and a ready queue and a queue time are maintained for each virtual device in the embodiments of the present disclosure. The communication process is a tensor communication process corresponding to a tensor sending operation or a tensor receiving operation, and the queue time is the end time of the last operation executed on the virtual device. An operation is queued to the corresponding virtual device once it is ready (i.e., all dependent operations are completed). Then, the virtual device with the smallest queue time may be continuously selected and an operation execution is taken from the head of its device queue (i.e., the head of the queue), and the corresponding queue time is updated with the execution time of the operation. After all operations in the global dependency graph are finished running, the maximum queue time can be used as the iteration time of the global dependency graph to evaluate the performance of the global dependency graph.
It should be understood that a global dependency graph may have multiple possible topological orderings. However, before simulation, the embodiments of the present disclosure have already run a number of training iterations of the deep neural network (for example, averaging over 10 iterations), i.e., the trace information in the global dependency graph is obtained from multiple training iterations, and the embodiments of the present disclosure use first-in-first-out (FIFO) queues, so the most likely topological ordering can be generated quickly and the efficiency of offline analysis is improved.
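The replay described above can be sketched as follows: each worker, PS, and tensor-communication stream is treated as a virtual device with its own FIFO ready queue and queue time, and the device with the smallest queue time repeatedly executes the operation at the head of its queue. The data structures and field names are illustrative assumptions.

```python
from collections import deque

# Sketch of the offline replay. Each operation is described by:
#   device: the virtual device (worker, PS, or tensor-communication stream) it runs on
#   duration: its execution time from the aligned trace
#   succs: names of the operations that depend on it
# preds_left counts unfinished predecessors, i.e., Kahn-style topological execution.

def simulate(ops: dict) -> float:
    preds_left = {name: 0 for name in ops}
    for op in ops.values():
        for succ in op["succs"]:
            preds_left[succ] += 1

    queues, qtime = {}, {}                       # per virtual device: FIFO queue, queue time
    for name, op in ops.items():
        queues.setdefault(op["device"], deque())
        qtime.setdefault(op["device"], 0.0)
        if preds_left[name] == 0:
            queues[op["device"]].append((name, 0.0))   # (operation, time it became ready)

    done = 0
    while done < len(ops):
        # Pick the non-empty device queue with the earliest queue time.
        dev = min((d for d in queues if queues[d]), key=lambda d: qtime[d])
        name, ready_at = queues[dev].popleft()
        end = max(qtime[dev], ready_at) + ops[name]["duration"]
        qtime[dev] = end
        done += 1
        for succ in ops[name]["succs"]:
            preds_left[succ] -= 1
            if preds_left[succ] == 0:
                queues[ops[succ]["device"]].append((succ, end))
    return max(qtime.values())                   # latest queue time = iteration time

ops = {
    "fw":   {"device": "worker0", "duration": 3.0, "succs": ["bw"]},
    "bw":   {"device": "worker0", "duration": 4.0, "succs": ["send"]},
    "send": {"device": "comm0",   "duration": 2.0, "succs": ["recv"]},
    "recv": {"device": "comm1",   "duration": 2.0, "succs": []},
}
print(simulate(ops))  # 11.0
```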
In practical applications, after simulating the running of the distributed training job by using the simulator to obtain the performance evaluation of the job, the job can be optimized, for example, changing the operation in the dependency graph, such as replacing several operations by one operation, or changing the time information of part of the operations.
The inventor carefully researches and discovers that for the optimization, the related technology mainly comprises a communication optimization mode and a calculation optimization mode. The communication optimization mode comprises the following steps: 1) the communication operations corresponding to the small tensors are fused, and the communication overhead in gradient synchronization is reduced; 2) dividing a large tensor into a plurality of tensors and then communicating, so as to execute tensor uploading operation and tensor pulling operation in parallel in a PS framework; 3) gradient size reduction by various compression methods; 4) tensor transmission scheduling algorithms are defined to better perform computation operations and communication operations in parallel. The calculation optimization mode comprises the following steps: 1) a plurality of calculation operators are fused into a whole operator, so that the operator scheduling overhead is reduced; 2) by using 16-bit floating point numbers instead of 32-bit floating point numbers as the number format for most computing operations, memory consumption is reduced and computation is faster.
For the communication optimization methods, in practical applications, the granularity of tensor fusion can be set through environment variables, i.e., tensors are fused into a new tensor of a specified size for transmission, or all tensors arriving within a specific time interval are fused. Although the related art provides an automatic search based on Bayesian search, it is an online search process that consumes time of the actual training process and cannot perform the search automatically offline.
In addition, the inventor has also found through research that the effects of different optimization methods conflict with each other. For example, DL compilers employ operator fusion techniques to reduce GPU memory accesses, but their default approach may fuse all back-propagation operators, which postpones the communication operations of the tensors corresponding to these back-propagation operators. Although the computation time may become shorter after fusing operators, the end-to-end training time may instead increase because there is less overlap between computation and communication. In addition, some memory optimization techniques also interact with computation and communication: recomputation of intermediate results sacrifices training speed to reduce the memory footprint and may delay the communication of tensors.
Moreover, a very large combination strategy space exists between different optimization modes. Specifically, in the actual optimization process, the optimizer needs to search for an optimal optimization technique for each operator and how to implement the combination of the optimization techniques, however, the number of operators included in the global dependency graph corresponding to the deep neural network distributed training is huge, and therefore, it is time-consuming and labor-consuming to directly and manually find the optimal optimization strategy.
Therefore, the optimization search framework for the distributed training task is provided, a user-defined optimization technology can be added, and an optimal optimization scheme can be automatically searched.
In a possible manner, a global execution graph may be generated according to the global dependency graph, and the critical path may be determined according to the global execution graph, where the global execution graph is used to indicate the execution order of operations between nodes having dependencies in the global dependency graph. Then, an optimization strategy for the global dependency graph is determined with the objective of minimizing the operation execution time of the nodes on the critical path, and the optimization strategy is applied to the global dependency graph to obtain an optimized dependency graph, where the nodes on the critical path include communication group nodes and computation nodes of the deep neural network, and a communication group node includes a plurality of communication nodes. Accordingly, simulating the distributed training process of the deep neural network offline according to the global dependency graph and the time offset may be: simulating the distributed training process of the deep neural network offline according to the optimized dependency graph and the time offset.
It should be appreciated that, in the disclosed embodiment, given the global dependency graph G of a distributed DNN training job and a set of optimization strategies S, the goal is to identify bottlenecks in the global dependency graph through offline simulation and to generate a subset $S^* \subseteq S$ of the optimization strategies that minimizes the training time per iteration (referred to as the iteration time for short):

$$S^* = \arg\min_{S' \subseteq S} \text{iteration\_time}\big(f(G, S')\big)$$

where $G' = f(G, S')$ is the modified global dependency graph after applying the optimization strategies in $S'$.
In the embodiment of the present disclosure, the critical path of the global execution graph is checked and optimized continuously and iteratively, so that a corresponding optimization strategy is determined for each node on the critical path. The critical path C may comprise a series of computation and communication operations: $C = [p_0, p_1, \ldots, p_i, q_i, q_{i+1}, \ldots, q_{|C|-1}]$, where $p_0, p_1, \ldots, p_i$ are computation operations, $q_i, q_{i+1}, \ldots, q_{|C|-1}$ are communication operations, and |C| is the total number of nodes on the critical path, i.e., the length of the critical path. To facilitate analyzing the relationship between communication and computation, the fine-grained communication operations (e.g., the corresponding set of tensor sending and receiving operations) for transmitting tensor n in the global execution graph may be abstracted as a communication operation group $q_n$. FIG. 3 depicts a critical path (illustrated by the node path enclosed in dashed lines) and the correspondence between each pair of operations $p_n$ and $q_n$, $n = 0, 1, \ldots, |C|-1$. Since the generation of a tensor depends on a computation operation, the critical path may start with a sequence of computation operations and end with a sequence of communication operations. On the one hand, each computation operation $p_n$ $(n = 1, \ldots, i)$ on the critical path possibly corresponds to a communication operation group $q_n$ that is not located on the critical path, because the computation operations are the bottleneck in that stage. On the other hand, each communication operation group $q_n$ $(n = i, \ldots, |C|-1)$ on the critical path corresponds to a computation operation $p_n$ that is not on the critical path, since communication is the bottleneck in that stage.

In each iteration, each operation on the critical path, from $p_0$ to $q_{|C|-1}$, can be checked and a corresponding optimization strategy determined. Let $d_n$ denote the decision made when $p_n/q_n$ is examined; the optimization strategy in the embodiment of the present disclosure may include the following cases: 1) $d_n = \mathrm{opfs}$, denoting fusing the two computation operations $p_{n-1}$ and $p_n$; 2) $d_n = \mathrm{tsfs}$, denoting fusing the communication operations $q_{n-1}$ and $q_n$ corresponding to the tensors generated by the computation operations $p_{n-1}$ and $p_n$; 3) $d_n = \mathrm{opfs\_tsfs}$, denoting simultaneously fusing the computation operations $p_{n-1}$ and $p_n$ and the corresponding communication operations $q_{n-1}$ and $q_n$; 4) $d_n = \mathrm{null}$, denoting no fusion. It should be understood that $d_0 = \mathrm{null}$. When tensor partitioning is enabled, an optimal partition number $k_n$ can also be determined for the tensor on the critical path. Unlike tensor fusion and operator fusion, tensor partitioning does not delay the start time of tensor communication and only changes the duration of tensor communication; therefore, for each possible value of $d_n$, the optimal $k_n$ can be found first, and then the optimal value of $d_n$ is picked. Referring to Table 1, let $T_n$ be the time required from the start of execution of the global dependency graph until $p_n/q_n$ finishes executing. The goal of the optimization is to find the best decisions $D = [d_0, d_1, \ldots, d_{|C|-1}]$ and $K = [k_0, k_1, \ldots, k_{|C|-1}]$ that minimize the execution time of the nodes on the critical path, namely:

$$\min_{D, K}\; T_{|C|-1}$$
TABLE 1
The following describes a specific determination method of the above-mentioned optimization strategy.
In a possible approach, determining the optimization strategy for the global dependency graph may be: for each computation node on the critical path, when the operation execution time of the communication group node preceding the communication group node corresponding to the computation node is less than or equal to a first fusion time difference, determining an operator fusion strategy as an optimization strategy for the global dependency graph, where the operator fusion strategy is used to fuse the computation operations of the computation node and its preceding computation node, and the first fusion time difference is the time difference between the total computation time of the computation node and its preceding computation node and the fused computation time of the computation node and its preceding computation node.
The computation node is used to execute a computation operation, and the communication group node is used to execute a communication operation. Denoting the execution time of an operation by $t(\cdot)$, the computation time of the computation node is the execution time of the computation operation $p_n$ executed by the computation node, i.e., $t(p_n)$, and the operation execution time of the communication group node preceding the communication group node corresponding to the computation node is the execution time of the communication operation $q_{n-1}$, i.e., $t(q_{n-1})$. The fused computation time is obtained by collecting the trace of the fused operation offline after the computation operations of the computation node and its preceding computation node are fused, which is not limited in the embodiments of the present disclosure.

In the disclosed embodiment, if $t(q_{n-1}) \le t(p_{n-1}) + t(p_n) - \mathrm{opfs\_time}(p_{n-1}, p_n)$, then the $T_n$ achieved by operator fusion is less than that without fusion, i.e., $T_n(d_n = \mathrm{opfs}) < T_n(d_n = \mathrm{null})$, and fusing $p_{n-1}$ and $p_n$ is better than not fusing. Conversely, if $t(q_{n-1}) > t(p_{n-1}) + t(p_n) - \mathrm{opfs\_time}(p_{n-1}, p_n)$, then the $T_n$ achieved by operator fusion is not less than that without fusion, i.e., $T_n(d_n = \mathrm{opfs}) \ge T_n(d_n = \mathrm{null})$, and fusing $p_{n-1}$ and $p_n$ results in worse performance. Here, $\mathrm{opfs\_time}(p_{n-1}, p_n)$ is the execution time of the computation operation obtained after fusing $p_{n-1}$ and $p_n$ (i.e., the fused computation time).
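Under the notation above, the operator-fusion test can be written compactly as a sketch; opfs_time is the profiled duration of the fused operator, and the argument names are illustrative.

```python
# Sketch of the operator-fusion decision for computation node p_n on the critical path.
#   t_prev_comm: execution time of the preceding communication group q_{n-1}
#   t_prev_comp, t_comp: execution times of p_{n-1} and p_n
#   t_fused: profiled execution time of the fused operator opfs(p_{n-1}, p_n)

def should_fuse_operators(t_prev_comm: float, t_prev_comp: float,
                          t_comp: float, t_fused: float) -> bool:
    saving = (t_prev_comp + t_comp) - t_fused   # first fusion time difference
    # Fuse only if the communication q_{n-1} hidden behind the computation is short
    # enough that the fused (earlier-finishing) computation does not wait for it.
    return t_prev_comm <= saving

print(should_fuse_operators(t_prev_comm=1.0, t_prev_comp=3.0, t_comp=2.0, t_fused=3.5))  # True
```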
It should be appreciated that the inventors have also investigated that when considering operator fusion and tensor fusion simultaneously, their corresponding communication operations (if any) do not sacrifice overlap between the computational communications if fusing the two computational operations better than not.
Therefore, in a possible manner, after the operator fusion strategy is determined as the optimization strategy for the global dependency graph, the tensor fusion strategy is also determined as the optimization strategy for the global dependency graph, and the tensor fusion strategy is used for fusing the communication operation of the communication group node corresponding to the communication group node and the previous computing node. Accordingly, the optimization strategies include an operator fusion strategy and a tensor fusion strategy.
In other possible ways, determining the optimization strategy for the global dependency graph may also be: and determining a first optimal partition number of a tensor corresponding to the converged communication group node and a second optimal partition number of a tensor corresponding to the communication group node under the condition of the communication operation of the converged communication group node and the previous communication group node aiming at each communication group node in the critical path. When the operation ending time of a previous communication group node of the communication group nodes is larger than a second fusion time difference, determining a tensor fusion strategy as an optimization strategy for the global dependency graph, wherein the tensor fusion strategy is used for fusing the communication operation of the communication group nodes and the previous communication group nodes, the second fusion time difference is a time difference between the time obtained by adding the first tensor synchronization time to the calculation ending time of the calculation node corresponding to the communication group nodes and the second tensor synchronization time, the first tensor synchronization time is the tensor synchronization time after partitioning the tensor corresponding to the fusion communication operation according to a first optimal partition number, and the second tensor synchronization time is the tensor synchronization time after partitioning the tensor corresponding to the communication group nodes according to a second optimal partition number.
For example, the optimal number of partitions of the corresponding tensor can be obtained through grid search, and the grid search mode may refer to the related art, which is not described herein again.
Illustratively, a communication group node includes a plurality of communication nodes for performing a plurality of communication operations. The communication group node preceding a communication group node is the previous communication group node on the critical path. The fused communication operation is the operation obtained by fusing the first communication operation corresponding to the communication group node and the second communication operation corresponding to its preceding communication group node, and the tensor corresponding to the fused communication operation is the tensor obtained by fusing the tensors corresponding to the first and second communication operations. The operation end time of the preceding communication group node is the end time of the communication operation $q_{n-1}$, i.e., $\hat{e}^{q_{n-1}}$; the computation end time of the computation node corresponding to the communication group node is the end time of the computation operation $p_n$ executed by that computation node, i.e., $\hat{e}^{p_n}$.
Therefore, in the disclosed embodiment, if $t_{\mathrm{end}}(q_{n-1}) > t_{\mathrm{end}}(p_n) + t_{\mathrm{sync}}(q_{n-1} \oplus q_n, k^*_{n-1,n}) - t_{\mathrm{sync}}(q_n, k^*_n)$, then fusing the communication operation groups $q_{n-1}$ and $q_n$ yields an end time $T_n$ smaller than without fusion, i.e., fusing $q_{n-1}$ and $q_n$ is better than not fusing. On the contrary, if $t_{\mathrm{end}}(q_{n-1}) \le t_{\mathrm{end}}(p_n) + t_{\mathrm{sync}}(q_{n-1} \oplus q_n, k^*_{n-1,n}) - t_{\mathrm{sync}}(q_n, k^*_n)$, then fusing the communication operation groups $q_{n-1}$ and $q_n$ yields a $T_n$ that is not smaller than without fusion, i.e., fusing $q_{n-1}$ and $q_n$ is no better than not fusing. Here, $k^*_{n-1,n}$ is the optimal number of partitions of the fused tensor obtained by fusing the tensor corresponding to the communication operation group $q_n$ and the tensor corresponding to the previous communication operation group $q_{n-1}$, i.e., the first optimal partition number; $t_{\mathrm{sync}}(q_{n-1} \oplus q_n, k^*_{n-1,n})$ is the tensor synchronization time after partitioning the fused tensor according to the first optimal partition number, i.e., the first tensor synchronization time; $k^*_n$ is the optimal number of partitions of the tensor corresponding to the communication operation group $q_n$, i.e., the second optimal partition number; and $t_{\mathrm{sync}}(q_n, k^*_n)$ is the tensor synchronization time after partitioning the tensor corresponding to $q_n$ according to the second optimal partition number, i.e., the second tensor synchronization time.
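The above criterion can be captured by a small decision helper. The following Python sketch is illustrative only; the argument names are assumptions introduced here, and the caller is assumed to supply the end times and the two tensor synchronization times defined above.

```python
def should_fuse_tensors(t_end_prev_comm, t_end_comp, t_sync_fused, t_sync_unfused):
    """Return True when fusing communication operation groups q_{n-1} and q_n is
    predicted to finish q_n earlier than leaving them unfused.

    t_end_prev_comm : end time of the previous communication operation group q_{n-1}
    t_end_comp      : calculation end time of the computation operation p_n
    t_sync_fused    : first tensor synchronization time (fused tensor, first optimal partitions)
    t_sync_unfused  : second tensor synchronization time (q_n tensor, second optimal partitions)
    """
    second_fusion_time_diff = t_end_comp + t_sync_fused - t_sync_unfused
    return t_end_prev_comm > second_fusion_time_diff
```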
Likewise, the inventors have also found that, when operator fusion and tensor fusion are considered simultaneously, if fusing two communication operations is better than not fusing them, then fusing their corresponding computation operations does not sacrifice the overlap between computation and communication.
Therefore, in a possible manner, after the tensor fusion strategy is determined as the optimization strategy for the global dependency graph, the operator fusion strategy is also determined as an optimization strategy for the global dependency graph, and the operator fusion strategy is used for fusing the calculation operations of the calculation node corresponding to the communication group node and the calculation node corresponding to the previous communication group node. Accordingly, the optimization strategies include an operator fusion strategy and a tensor fusion strategy.
In other possible manners, after the tensor fusion policy is determined as the optimization policy for the global dependency graph, the operator fusion policy and the tensor partitioning policy are determined as the optimization policy for the global dependency graph, the operator fusion policy is used for fusing the calculation nodes corresponding to the communication group nodes and the calculation nodes corresponding to the previous communication group nodes, and the tensor partitioning policy is used for partitioning and synchronizing the corresponding tensors according to the first optimal partition number and the second optimal partition number. Accordingly, the optimization strategies include an operator fusion strategy, a tensor fusion strategy, and a tensor partitioning strategy.
Therefore, the global dependency graph can be optimized by combining optimization strategies of operator fusion, tensor fusion and tensor partitioning, and the optimization effect of the global dependency graph is further improved.
It should be appreciated that exhaustive exploration of the large-scale optimization strategy space is very time consuming. Therefore, the embodiment of the present disclosure also provides a way to accelerate the automatic search optimization strategy.
In a possible manner, the coarsened view may be determined first by: determining a target computing node which does not correspond to a communication node in the global dependency graph, and fusing the target computing node and computing operation corresponding to a subsequent computing node of the target computing node, and/or determining a target communication node which is related to the same computing node in the global dependency graph, and fusing the communication operation corresponding to the target communication node. Accordingly, according to the global dependency graph and the time offset, the distributed training process of the offline simulation deep neural network may be: and according to the coarsened view and the time offset, simulating a distributed training process of the deep neural network in an off-line mode.
It should be understood that if a certain computing node has no corresponding communication node, then the computing node and the subsequent computing node necessarily satisfy the operator fusion condition described above. Therefore, the computing operations of the computing node and the subsequent computing node can be fused directly without searching the optimization strategy, which reduces the search time and improves the optimization efficiency. Similarly, a plurality of communication group nodes associated with the same computing node necessarily satisfy, together with that computing node, the tensor fusion condition described above, so the corresponding communication operations can be fused directly without searching the optimization strategy, which likewise reduces the search time and improves the optimization efficiency.
That is, in the embodiment of the present disclosure, a computation operation that does not generate a tensor (i.e., the computation operation of a target computing node having no corresponding communication node) and the next computation operation that does generate a tensor (i.e., the computation operation corresponding to the computing node subsequent to the target computing node) may be put into one group, and communication operations connected to the same computation operation may be put into one group. The computation operations or communication operations within a group can then be merged directly to obtain a coarsened view. Finally, the distributed training process of the deep neural network can be simulated off line based on the coarsened view and the time offset, and the optimization strategy can be searched based on the coarsened view, thereby shortening the search time and improving the optimization efficiency.
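As a purely illustrative sketch of this grouping step (the data-structure names are assumptions, not the original implementation; the computation operations of one processor are assumed to form a chain listed in execution order):

```python
def coarsen_view(compute_ops, produces_tensor, comm_ops_of):
    """Group operations for the coarsened view (illustrative sketch).

    compute_ops     : computation operations of one processor, in execution order
    produces_tensor : dict mapping a computation op to True if it generates a
                      tensor that must be synchronized (i.e. has communication)
    comm_ops_of     : dict mapping a computation op to the communication ops
                      attached to it
    Returns (compute_groups, comm_groups); the ops inside each group are merged
    directly, without searching the optimization strategy.
    """
    compute_groups, current = [], []
    for op in compute_ops:
        current.append(op)
        # An op without a tensor is always merged forward; a tensor-producing
        # op closes the current group.
        if produces_tensor.get(op, False):
            compute_groups.append(current)
            current = []
    if current:                       # trailing ops with no tensor
        compute_groups.append(current)

    # Communication operations connected to the same computation op form a group.
    comm_groups = [ops for ops in comm_ops_of.values() if len(ops) > 1]
    return compute_groups, comm_groups
```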
In other possible ways, the first tensor synchronization time above may be determined by: determining a first communication operation involved in synchronizing the tensor corresponding to the fusion communication operation, generating a first communication subgraph based on the first communication operation, the first optimal partition number, and the tensor corresponding to the fusion communication operation, and determining the first tensor synchronization time based on the first communication subgraph. The second tensor synchronization time above may be determined as follows: determining a second communication operation involved in synchronizing the tensor corresponding to the communication group node, generating a second communication subgraph based on the second communication operation, the second optimal partition number, and the tensor corresponding to the communication group node, and determining the second tensor synchronization time based on the second communication subgraph.
It should be appreciated that, in order to avoid frequently simulating the entire global dependency graph during the policy search, in particular when estimating the tensor synchronization time $t_{\mathrm{sync}}(s, k)$, the disclosed embodiments provide a way of partial simulation. For example, given the current global dependency graph, all communication operations $S_p$ associated with the current tensor are identified, a communication subgraph comprising $S_p$ and the edges between these operations is generated, and the tensor synchronization time is then obtained by simulating the execution of the communication subgraph.
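A minimal sketch of this partial simulation, assuming a graph object exposing `comm_ops` and `edges_between` and a `simulate` callback (all names introduced here for illustration only), might look as follows:

```python
def estimate_tsync_by_partial_simulation(global_graph, tensor_id, num_partitions, simulate):
    """Estimate t_sync(s, k) without replaying the entire global dependency graph.

    global_graph.comm_ops(tensor_id)  -> set S_p of communication ops touching the tensor
    global_graph.edges_between(ops)   -> dependency edges among the given ops
    simulate(subgraph)                -> simulated makespan of the subgraph
    """
    s_p = global_graph.comm_ops(tensor_id)          # communication ops associated with the tensor
    edges = global_graph.edges_between(s_p)         # keep only the edges internal to S_p
    subgraph = {"ops": s_p, "edges": edges, "partitions": num_partitions}
    return simulate(subgraph)                       # tensor synchronization time
```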
In other possible manners, the optimization strategy is executed on the global dependency graph to obtain an optimized dependency graph, which may be: and synchronously executing an optimization strategy on the local dependency graph corresponding to each processor in the global dependency graph to obtain the optimization dependency graph corresponding to each processor. Accordingly, according to the optimization dependency graph and the time offset, the distributed training process of the offline simulation deep neural network may be: and simulating the distributed training process of the deep neural network off line according to the optimized dependency graph and the time offset corresponding to each processor.
It should be understood that, for distributed training in the data-parallel mode, multiple processors train the same deep neural network, so in the training analysis stage each analysis is performed for one processor, and after the analysis for that processor is completed, a similar analysis is performed for the next processor. Since the training targets the same deep neural network, the local dependency graphs corresponding to different processors in the global dependency graph are the same. If the search for the optimization strategy were repeated for each analysis, unnecessary search time would inevitably be added. Therefore, the embodiment of the present disclosure provides a synchronous execution manner to accelerate the search of the optimization policy: after the optimization policy for the global dependency graph is determined, the part of the optimization policy concerning the compute nodes is executed synchronously on the local dependency graph corresponding to each processor in the global dependency graph, so as to obtain the optimized dependency graph corresponding to each processor. Finally, the off-line simulation of the training process can be executed according to the optimized dependency graph and the time offset corresponding to each processor, thereby completing the off-line simulation and optimization of the distributed training.
The training analysis method provided by the present disclosure is explained below by another exemplary embodiment.
Referring to fig. 4, after performing an online distributed training job through a corresponding machine learning framework and a corresponding communication framework, an analyzer collects the trace of each operation in the job (mainly the execution time of the operation), analyzes the dependency relationships between the operations to obtain a global dependency graph corresponding to the job, and performs time alignment on the global dependency graph to obtain a global time axis view. The machine learning framework includes TensorFlow, MXNet, etc., and the communication framework includes BytePS, Horovod, etc.; the embodiments of the present disclosure are not limited thereto. Then, using the results collected by the analyzer, a simulator simulates the running of the operations to obtain a performance evaluation of the job, and finally the job is optimized according to the performance evaluation result.
As shown in FIG. 4, the optimizer may include a scheduler and a policy registration module for receiving various registered optimization policies. The recomputation strategy and the gradient accumulation strategy shown in FIG. 4 may refer to the related art and are not described here. The scheduler comprises an evaluation module and a candidate module: the evaluation module is used for evaluating the iteration time and the memory usage of the global dependency graph, and the candidate module is used for determining the operations corresponding to the critical path in the global dependency graph. Referring to FIG. 4, the candidate module sends the operation information corresponding to the critical path to the policy registration module in step (1), and the policy registration module returns the registered policy optimization schemes to the scheduler in step (2) in response to the received operation information. The scheduler determines a target policy optimization scheme from the registered schemes according to the evaluation result of the evaluation module, calls the target policy optimization scheme from the policy registration module in step (3), and executes the target optimization scheme on the global dependency graph to obtain an optimized dependency graph. Then, the simulator is triggered in step (4) to perform off-line simulation again on the basis of the optimized dependency graph, the result of the off-line simulation is returned in step (5), and the optimization strategy is re-determined, and so on. In each analysis pass, the global dependency graph and the critical path are updated, and the search is then repeated on the new critical path until the iteration time of the global dependency graph reaches a preset time threshold or a preset analysis stop condition is met.
It should be appreciated that the optimizer analyzes the bottlenecks in the global dependency graph and optimizes them in an iterative manner. First, the optimizer simulates the running of the initial global dependency graph through the simulator and evaluates the iteration time and the memory usage of the global dependency graph. If the memory usage exceeds a memory threshold, a memory optimization technique is invoked to reduce memory occupation. The memory optimization technique may refer to the related art and is not described herein again.
The optimizer then proceeds with throughput optimization to minimize the time span of the critical path in the global execution graph. Given the critical path $C = [p_0, p_1, \ldots, p_i, q_i, q_{i+1}, \ldots, q_{|C|-1}]$, the optimizer first checks the computation operations $p_n$ ($n = 1, \ldots, i$) in order. Since the performance of this part of the critical path is limited by computation, the optimizer first evaluates, according to the above judgment condition, whether to apply the operator fusion strategy to the computation operations $p_{n-1}$ and $p_n$. If yes, the operator fusion strategy is invoked to fuse $p_{n-1}$ and $p_n$, and the tensor fusion strategy is invoked to fuse the corresponding communication operations $q_{n-1}$ and $q_n$. Then, the optimal number of partitions is calculated and applied to the tensor corresponding to $q_n$ (if tensor fusion has been performed, $q_n$ denotes the fused tensor).
Next, the optimizer examines the communication operations $q_n$ ($n = i, \ldots, |C|-1$) on the critical path in turn. In this segment, the performance bottleneck is communication. The optimizer first calculates the optimal number of partitions $k^*$ for the fused tensor and for the unfused tensor, respectively, namely determines the first optimal partition number and the second optimal partition number. It then evaluates, according to the above judgment condition, whether to apply the tensor fusion strategy to the communication operations $q_{n-1}$ and $q_n$. If yes, the tensor fusion strategy is invoked to fuse the communication operations $q_{n-1}$ and $q_n$, and the operator fusion strategy is invoked at the same time to fuse the corresponding computation operations $p_{n-1}$ and $p_n$. Finally, the corresponding optimal partition number $k^*$ is applied to the tensor corresponding to the fusion communication operation. After the optimization is applied, the global dependency graph is updated. The optimizer updates the critical path using the simulator and then repeats the search on the new critical path until the iteration time of the global dependency graph reaches a preset time threshold or a preset analysis stop condition is met.
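One possible shape of this heuristic search, written as an illustrative Python sketch (the dependency-graph and simulator interfaces, as well as the decision and partition-search callbacks, are assumptions and not the original implementation), is:

```python
def optimize_critical_path(dep_graph, simulator, max_rounds,
                           should_fuse_ops, should_fuse_tensors, best_partition):
    """Heuristic critical-path search (illustrative sketch).

    best_partition(groups) is assumed to return the optimal partition number
    for the tensor obtained by fusing the listed communication groups (a
    one-element list means no fusion).
    """
    for _ in range(max_rounds):
        path = simulator.critical_path(dep_graph)     # recomputed after every round
        changed = False
        # Computation-bound prefix of the critical path: p_1 .. p_i
        for prev_p, p in path.compute_pairs():
            if should_fuse_ops(prev_p, p):
                dep_graph.fuse_compute(prev_p, p)
                dep_graph.fuse_communication(prev_p.comm_group, p.comm_group)
                dep_graph.partition_tensor(p.comm_group, best_partition([p.comm_group]))
                changed = True
        # Communication-bound suffix of the critical path: q_i .. q_{|C|-1}
        for prev_q, q in path.comm_pairs():
            k_fused = best_partition([prev_q, q])
            k_plain = best_partition([q])
            if should_fuse_tensors(prev_q, q, k_fused, k_plain):
                dep_graph.fuse_communication(prev_q, q)
                dep_graph.fuse_compute(prev_q.compute_node, q.compute_node)
                dep_graph.partition_tensor(q, k_fused)
                changed = True
        if not changed:        # no profitable rewrite found on the current path
            break
    return dep_graph
```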
By the above method, a global dependency graph can be constructed for a distributed machine learning training task, the specific operations in the communication process can be analyzed and their trace information collected, and the timestamps of traces on different processors can be aligned off line based on the dependency relationships among the communication operations and their distribution over the processors. Moreover, a search framework is provided for the optimization of the distributed machine learning training task, to which user-defined optimization techniques can be added, and an optimal optimization scheme can be searched automatically. For operator fusion, tensor fusion, and tensor partitioning, the optimal operator fusion, tensor fusion, and partitioning strategies for the distributed machine learning training task may be searched based on a heuristic search algorithm comprising the corresponding judgment conditions. In addition, the strategy search time can be shortened and the efficiency of strategy optimization improved through techniques such as view coarsening, partial simulation, and symmetry (namely parallel execution).
Through testing, in distributed training scenarios using TensorFlow and MXNet as training frameworks, PS and AllReduce as communication libraries, and RDMA (Remote Direct Memory Access) and TCP (Transmission Control Protocol) as transmission protocols, the training analysis method provided by the disclosure, because it analyzes the tensor communication process and aligns the timestamps of different nodes, can accurately simulate the distributed training of a deep neural network compared with the related-art approaches, with an average error of less than 5% (10 times better than Daydream). In addition, the optimizer in the training analysis method provided by the disclosure can effectively explore good optimization strategies; compared with the related-art approaches, the training throughput can be improved by 3.48 times, with the improvement being especially significant for large-scale distributed training tasks. Furthermore, compared with the related art, the training analysis method provided by the disclosure can reduce the strategy search time by two orders of magnitude, that is, it shortens the strategy search time and improves the efficiency of strategy optimization.
Based on the same concept, the present disclosure also provides a training analysis apparatus, which may become part or all of an electronic device through software, hardware or a combination of both. Referring to fig. 5, the apparatus is used for analyzing a distributed training process of a deep neural network, the deep neural network operates in a plurality of devices, and the training analysis apparatus 500 includes:
a first determining module 501, configured to determine a global dependency graph of the deep neural network, where the global dependency graph is used to characterize dependencies between computation nodes in the deep neural network running on a processor of each device and communication relationships between communication nodes used to synchronize tensors generated by the computation nodes among a plurality of processors, and each two processors in the global dependency graph include a plurality of the communication nodes therebetween;
a second determining module 502, configured to determine time offsets of the plurality of processors relative to a reference processor, with the objective of minimizing a time alignment function and under the constraint condition that the operation dependencies between nodes in the global dependency graph do not change before and after time alignment, where the time alignment function is used to characterize the time differences of tensors received by the processors from the same processor and the time offset differences between processors in the same device, and the reference processor is any one of the processors in the plurality of devices;
a training simulation module 503, configured to simulate, offline, a distributed training process of the deep neural network according to the global dependency graph and the time offset, and determine, according to a result of the offline simulation, a distributed training analysis result of the deep neural network.
Optionally, the first determining module 501 is configured to:
determining a local dependency graph of a deep neural network running on a processor of each device, wherein the local dependency graph is used for characterizing dependency relationships among computing nodes in the deep neural network, and the local dependency graph comprises a first virtual operation identifier used for indicating tensor synchronization start and tensor synchronization end;
determining a communication topology of the plurality of processors, the communication topology being used for characterizing communication relationships between a plurality of the processors for synchronizing tensors generated by the compute nodes, and the communication topology comprising a second virtual operation identification for indicating a start of tensor synchronization and an end of tensor synchronization;
and establishing association between the local dependency graph and the communication topology through the first virtual operation identifier and the second virtual operation identifier with the same identifier information to obtain a global dependency graph.
Optionally, in the time alignment function, the time of receiving the tensor is determined by:
a third determining module, configured to determine an end time of a first processor performing a tensor receiving operation, and determine a latest time from a start time of the first processor performing the tensor receiving operation and a start time of a second processor performing a tensor transmitting operation, where the end time, the start time of the first processor performing the tensor receiving operation, and the start time of the second processor performing the tensor transmitting operation are all represented by a time offset of the corresponding processor from the reference processor, and the first processor and the second processor are different;
a fourth determining module for determining a time difference between an end time of the first processor performing the tensor reception operation and the latest time as a time when the first processor receives a tensor from the second processor.
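For illustration, the receive time used in the time alignment function can be written as a short helper; the timestamps are assumed to already include each processor's offset relative to the reference processor, and all names below are introduced here for illustration only.

```python
def tensor_receive_time(recv_start, recv_end, send_start):
    """Time taken by the first processor to receive a tensor from the second
    processor (illustrative sketch of the quantity used in the alignment objective)."""
    latest_start = max(recv_start, send_start)   # transfer cannot begin before both sides are ready
    return recv_end - latest_start


# Example: local timestamps shifted by each processor's offset to the reference.
recv = {"start": 10.0, "end": 14.0, "offset": 0.3}   # first processor
send = {"start": 9.5, "offset": -0.1}                # second processor
t_recv = tensor_receive_time(recv["start"] + recv["offset"],
                             recv["end"] + recv["offset"],
                             send["start"] + send["offset"])
```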
Optionally, the apparatus 500 further comprises:
a fifth determining module, configured to generate a global execution graph according to the global dependency graph, and determine a critical path according to the global execution graph, where the global execution graph is used to indicate an operation execution order among nodes having a dependency relationship in the global dependency graph;
the optimization module is used for determining an optimization strategy for the global dependency graph by taking the minimization of the operation execution time of the nodes on the critical path as a target, and executing the optimization strategy on the global dependency graph to obtain an optimized dependency graph, wherein the nodes on the critical path comprise communication group nodes and computing nodes in the deep neural network, and the communication group nodes comprise a plurality of communication nodes;
the training simulation module 503 is configured to:
and simulating the distributed training process of the deep neural network off line according to the optimization dependency graph and the time offset.
Optionally, the optimization module is configured to:
for each computing node in the critical path, when the operation execution time of a previous communication group node of the communication group node corresponding to the computing node is less than or equal to a first fusion time difference, determining an operator fusion strategy as an optimization strategy for the global dependency graph, where the operator fusion strategy is used for fusing the computing operations of the computing node and the previous computing node, and the first fusion time difference is a time difference between the total computing time of the computing node and the previous computing node and the fusion computing time for the computing node and the previous computing node.
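As an illustrative sketch of this judgment condition (the parameter names are assumptions introduced here, not the original notation):

```python
def should_fuse_compute_ops(t_exec_prev_comm_group, t_comp_prev, t_comp_cur, t_comp_fused):
    """Return True when the operator fusion strategy may be applied to a
    computing node and its previous computing node.

    t_exec_prev_comm_group : operation execution time of the previous
                             communication group node
    t_comp_prev, t_comp_cur: computation times of the previous and current node
    t_comp_fused           : computation time after fusing the two operations
    """
    first_fusion_time_diff = (t_comp_prev + t_comp_cur) - t_comp_fused
    return t_exec_prev_comm_group <= first_fusion_time_diff
```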
Optionally, the apparatus 500 further comprises:
a sixth determining module, configured to determine a tensor fusion policy as an optimization policy for the global dependency graph after determining the operator fusion policy as the optimization policy for the global dependency graph, where the tensor fusion policy is used to fuse communication operations of the communication group node and a communication group node corresponding to the previous computing node;
accordingly, the optimization strategy includes the operator fusion strategy and the tensor fusion strategy.
Optionally, the optimization module is configured to:
determining, for each communication group node in the critical path, a first optimal partition number of a tensor corresponding to a fusion communication operation under the condition of fusing the communication group node and a communication operation of the previous communication group node, and determining a second optimal partition number of a tensor corresponding to the communication group node;
when the operation end time of a previous communication group node of the communication group nodes is greater than a second fusion time difference, determining a tensor fusion strategy as an optimization strategy for the global dependency graph, wherein the tensor fusion strategy is used for fusing the communication operation of the communication group nodes and the previous communication group node, the second fusion time difference is a time difference between a time obtained by adding a first tensor synchronization time to the calculation end time of a calculation node corresponding to the communication group node and a second tensor synchronization time, the first tensor synchronization time is a tensor synchronization time obtained by partitioning a tensor corresponding to the fusion communication operation according to the first optimal partition number, and the second tensor synchronization time is a tensor synchronization time obtained by partitioning a tensor corresponding to the communication group node according to the second optimal partition number.
Optionally, the apparatus 500 further comprises:
a seventh determining module, configured to determine an operator fusion policy as the optimization policy for the global dependency graph after determining the tensor fusion policy as the optimization policy for the global dependency graph, where the operator fusion policy is used to fuse the computing nodes corresponding to the communication group nodes and the computing nodes corresponding to the previous communication group nodes;
accordingly, the optimization strategy includes the operator fusion strategy and the tensor fusion strategy.
Optionally, the apparatus 500 further comprises:
an eighth determining module, configured to determine, after determining the tensor fusion policy as an optimization policy for the global dependency graph, an operator fusion policy and a tensor partitioning policy as optimization policies for the global dependency graph, where the operator fusion policy is used to fuse calculation operations of a calculation node corresponding to the communication group node and a calculation node corresponding to a previous communication group node, and the tensor partitioning policy is used to partition and synchronize corresponding tensors according to the first optimal number of partitions and the second optimal number of partitions;
accordingly, the optimization strategies include the operator fusion strategy, the tensor fusion strategy, and the tensor partitioning strategy.
Optionally, the first tensor synchronization time is determined by a first sub-graph module, the first sub-graph module being configured to: determine a first communication operation involved in synchronizing the tensor corresponding to the fusion communication operation, generate a first communication subgraph based on the first communication operation, the first optimal number of partitions and the tensor corresponding to the fusion communication operation, and determine the first tensor synchronization time based on the first communication subgraph;
the second tensor synchronization time is determined by a second sub-graph module configured to: determine a second communication operation involved in synchronizing the tensor corresponding to the communication group node, generate a second communication subgraph based on the second communication operation, the second optimal partition number and the tensor corresponding to the communication group node, and determine the second tensor synchronization time based on the second communication subgraph.
Optionally, the optimization module is configured to:
synchronously executing the optimization strategy on the local dependency graph corresponding to each processor in the global dependency graph to obtain the optimization dependency graph corresponding to each processor;
the off-line simulation of the distributed training process of the deep neural network according to the optimization dependency graph and the time offset comprises the following steps:
and according to the optimization dependency graph and the time offset corresponding to each processor, simulating a distributed training process of the deep neural network in an off-line mode.
Optionally, the apparatus 500 further comprises:
the coarsening module is used for determining a coarsened view in the following way: determining a target computing node which does not correspond to a communication node in the global dependency graph, and fusing the target computing node and computing operation corresponding to a subsequent computing node of the target computing node, and/or determining a target communication node which is related to the same computing node in the global dependency graph, and fusing the communication operation corresponding to the target communication node;
the training simulation module 503 is configured to:
and according to the coarsening view and the time offset, simulating a distributed training process of the deep neural network in an off-line mode.
Optionally, the training simulation module 503 is configured to:
each processor corresponds to an operation queue, the sending operation of each tensor in the global dependency graph corresponds to an operation queue, the receiving operation of each tensor corresponds to an operation queue, the operation queue is used for storing the operation to be executed, the operation queue has queue time represented by the time offset, and the queue time is the ending time of the last operation executed by the processor or the tensor communication process corresponding to the operation queue;
determining a target operation queue with the earliest queue time, executing the head operation stored in the target operation queue, updating the queue time of the target operation queue based on the execution ending time of the head operation, returning to the step of determining the target operation queue with the earliest queue time until all operations in the global dependency graph are executed, and determining the latest queue time as the iteration time of the global dependency graph in the queue time corresponding to a plurality of operation queues;
and determining a distributed training analysis result of the deep neural network according to the iteration time.
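A compact, illustrative sketch of this queue-driven replay is given below; the queue contents, the `run_op` callback, and the omission of cross-queue dependency checks are simplifying assumptions made only for illustration.

```python
import heapq

def simulate_iteration(op_queues, run_op):
    """Discrete-event replay of one training iteration (illustrative sketch).

    op_queues : dict mapping a queue id (one queue per processor, per tensor send
                operation and per tensor receive operation; ids assumed comparable,
                e.g. strings) to its list of operations, ordered per the dependency graph
    run_op(op, start_time) is assumed to return the end time of the operation.
    Returns the iteration time, i.e. the latest queue time once all queues drain.
    """
    queue_time = {qid: 0.0 for qid in op_queues}
    heap = [(0.0, qid) for qid in op_queues]       # (queue time, queue id)
    heapq.heapify(heap)
    while heap:
        t, qid = heapq.heappop(heap)               # target queue: earliest queue time
        if not op_queues[qid]:                     # all operations of this queue done
            continue
        head = op_queues[qid].pop(0)               # execute the head operation
        queue_time[qid] = run_op(head, t)          # queue time := end time of head
        heapq.heappush(heap, (queue_time[qid], qid))
    return max(queue_time.values())                # iteration time of the global graph
```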
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Based on the same concept, the present disclosure also provides a non-transitory computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of any of the above-described training analysis methods.
Based on the same concept, the present disclosure also provides an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of any of the above-mentioned training analysis methods.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or installed from the storage means 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining a global dependency graph of the deep neural network, wherein the global dependency graph is used for characterizing the dependency relationship between the calculation nodes in the deep neural network running on the processor of each device and the communication relationship between a plurality of communication nodes used for synchronizing tensors generated by the calculation nodes, and each two processors in the global dependency graph comprise a plurality of communication nodes; determining time offsets of a plurality of processors relative to a reference processor by taking a minimized time alignment function as a target and taking operation dependency among nodes in the global dependency graph before and after time alignment as a constraint condition, wherein the time alignment function is used for representing time difference of tensors received by the processors from the same processor and time offset difference among processors in the same device, and the reference processor is any one of the processors in the plurality of devices; and according to the global dependency graph and the time offset, off-line simulating a distributed training process of the deep neural network, and determining a distributed training analysis result of the deep neural network according to an off-line simulation result.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, C++, and including conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, technical solutions formed by interchanging the above features with (but not limited to) features having similar functions disclosed in this disclosure are also encompassed.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (16)
1. A training analysis method for analyzing a distributed training process of a deep neural network, the deep neural network operating on a plurality of devices, the method comprising:
determining a global dependency graph of the deep neural network, wherein the global dependency graph is used for characterizing the dependency relationship between the computing nodes in the deep neural network running on the processor of each device and the communication relationship between a plurality of communication nodes used for synchronizing tensors generated by the computing nodes, and each two processors in the global dependency graph comprise a plurality of communication nodes;
determining time offsets of a plurality of processors relative to a reference processor by taking a minimized time alignment function as a target and taking operation dependency among nodes in the global dependency graph before and after time alignment as a constraint condition, wherein the time alignment function is used for representing time difference of tensors received by the processors from the same processor and time offset difference among processors in the same device, and the reference processor is any one of the processors in the plurality of devices;
and according to the global dependency graph and the time offset, off-line simulating a distributed training process of the deep neural network, and determining a distributed training analysis result of the deep neural network according to an off-line simulation result.
2. The method of claim 1, wherein determining the global dependency graph for the deep neural network comprises:
determining a local dependency graph of a deep neural network running on a processor of each device, wherein the local dependency graph is used for characterizing dependency relationships among computing nodes in the deep neural network, and the local dependency graph comprises a first virtual operation identifier used for indicating tensor synchronization start and tensor synchronization end;
determining a communication topology for a plurality of the processors, the communication topology being used to characterize communication relationships between a plurality of communication nodes between the processors for synchronizing tensors generated by the compute nodes, and the communication topology including a second virtual operation identification for indicating a start of tensor synchronization and an end of tensor synchronization;
and establishing association between the local dependency graph and the communication topology through the first virtual operation identifier and the second virtual operation identifier with the same identifier information to obtain a global dependency graph.
3. The method of claim 1, wherein the time of the receive tensor in the time alignment function is determined by:
determining an end time of a first processor performing a tensor reception operation, and determining a latest time among a start time of the first processor performing the tensor reception operation and a start time of a second processor performing a tensor transmission operation, wherein the end time, the start time of the first processor performing the tensor reception operation, and the start time of the second processor performing the tensor transmission operation are each represented by a time offset of the corresponding processor from the reference processor, and the first processor and the second processor are different;
determining a time difference between an end time of the first processor performing the tensor reception operation and the latest time as a time at which the first processor receives a tensor from the second processor.
4. The method according to any one of claims 1-3, further comprising:
generating a global execution graph according to the global dependency graph, and determining a critical path according to the global execution graph, wherein the global execution graph is used for indicating an operation execution sequence among nodes with dependency relationship in the global dependency graph;
determining an optimization strategy for the global dependency graph with the goal of minimizing the operation execution time of the nodes on the critical path as a target, and executing the optimization strategy for the global dependency graph to obtain an optimized dependency graph, wherein the nodes on the critical path comprise communication group nodes and computing nodes in the deep neural network, and the communication group nodes comprise a plurality of communication nodes;
the off-line simulation of the distributed training process of the deep neural network according to the global dependency graph and the time offset comprises:
and simulating the distributed training process of the deep neural network off line according to the optimization dependency graph and the time offset.
5. The method of claim 4, wherein determining the optimization strategy for the global dependency graph comprises:
for each computing node in the critical path, when the operation execution time of a previous communication group node of the communication group node corresponding to the computing node is less than or equal to a first fusion time difference, determining an operator fusion strategy as an optimization strategy for the global dependency graph, where the operator fusion strategy is used for fusing the computing operations of the computing node and the previous computing node, and the first fusion time difference is a time difference between the total computing time of the computing node and the previous computing node and the fusion computing time for the computing node and the previous computing node.
6. The method of claim 5, further comprising:
after determining the operator fusion strategy as the optimization strategy for the global dependency graph, determining a tensor fusion strategy as the optimization strategy for the global dependency graph, wherein the tensor fusion strategy is used for fusing communication operations of the communication group nodes and the communication group nodes corresponding to the previous computing node;
accordingly, the optimization strategy includes the operator fusion strategy and the tensor fusion strategy.
7. The method of claim 4, wherein determining the optimization strategy for the global dependency graph comprises:
determining, for each communication group node in the critical path, a first optimal partition number of a tensor corresponding to a fusion communication operation under the condition of fusing the communication group node and a communication operation of the previous communication group node, and determining a second optimal partition number of a tensor corresponding to the communication group node;
when the operation end time of a previous communication group node of the communication group nodes is greater than a second fusion time difference, determining a tensor fusion strategy as an optimization strategy for the global dependency graph, wherein the tensor fusion strategy is used for fusing the communication operation of the communication group nodes and the previous communication group node, the second fusion time difference is a time difference between a time obtained by adding a first tensor synchronization time to the calculation end time of a calculation node corresponding to the communication group node and a second tensor synchronization time, the first tensor synchronization time is a tensor synchronization time obtained by partitioning a tensor corresponding to the fusion communication operation according to the first optimal partition number, and the second tensor synchronization time is a tensor synchronization time obtained by partitioning a tensor corresponding to the communication group node according to the second optimal partition number.
8. The method of claim 7, further comprising:
after determining the tensor fusion strategy as the optimization strategy of the global dependency graph, determining an operator fusion strategy as the optimization strategy of the global dependency graph, wherein the operator fusion strategy is used for fusing the computing nodes corresponding to the communication group nodes and the computing nodes corresponding to the previous communication group nodes;
accordingly, the optimization strategy includes the operator fusion strategy and the tensor fusion strategy.
9. The method of claim 7, further comprising:
after the tensor fusion strategy is determined as the optimization strategy of the global dependency graph, determining an operator fusion strategy and a tensor partitioning strategy as the optimization strategy of the global dependency graph, wherein the operator fusion strategy is used for fusing the computing nodes corresponding to the nodes of the communication group and the computing nodes corresponding to the nodes of the previous communication group, and the tensor partitioning strategy is used for partitioning and synchronizing the corresponding tensors according to the first optimal partitioning number and the second optimal partitioning number;
accordingly, the optimization strategies include the operator fusion strategy, the tensor fusion strategy, and the tensor partitioning strategy.
10. The method of claim 7, wherein the first tensor synchronization time is determined by: determining a first communication operation related to synchronizing the tensor corresponding to the fusion communication operation, generating a first communication subgraph based on the first communication operation, the first optimal number of partitions and the tensor corresponding to the fusion communication operation, and determining the first tensor synchronization time based on the first communication subgraph;
the second tensor synchronization time is determined by: determining a second communication operation related to synchronizing tensors corresponding to the communication group nodes, generating a second communication subgraph based on the second communication operation, the second optimal partition number and the tensors corresponding to the communication group nodes, and determining second tensor synchronization time based on the second communication subgraph.
11. The method of claim 4, wherein the performing the optimization strategy on the global dependency graph to obtain an optimized dependency graph comprises:
synchronously executing the optimization strategy on the local dependency graph corresponding to each processor in the global dependency graph to obtain the optimization dependency graph corresponding to each processor;
the off-line simulation of the distributed training process of the deep neural network according to the optimization dependency graph and the time offset comprises the following steps:
and simulating the distributed training process of the deep neural network off line according to the optimization dependency graph corresponding to each processor and the time offset.
12. The method according to any one of claims 1-3, further comprising:
determining a coarsened view by: determining a target computing node which does not correspond to a communication node in the global dependency graph, and fusing the target computing node and computing operation corresponding to a subsequent computing node of the target computing node, and/or determining a target communication node which is related to the same computing node in the global dependency graph, and fusing the communication operation corresponding to the target communication node;
the off-line simulation of the distributed training process of the deep neural network according to the global dependency graph and the time offset comprises:
and according to the coarsening view and the time offset, simulating a distributed training process of the deep neural network in an off-line mode.
13. The method according to any one of claims 1-3, wherein the off-line simulation of the distributed training process of the deep neural network according to the global dependency graph and the time offset comprises:
each processor corresponds to an operation queue, the sending operation of each tensor in the global dependency graph corresponds to an operation queue, and the receiving operation of each tensor corresponds to an operation queue; each operation queue is used for storing operations to be executed and has a queue time represented on the basis of the time offset, the queue time being the end time of the last operation executed by the processor or tensor communication process corresponding to that operation queue;
determining a target operation queue with the earliest queue time, executing the head operation stored in the target operation queue, updating the queue time of the target operation queue based on the end time of executing the head operation, and returning to the step of determining the target operation queue with the earliest queue time until all operations in the global dependency graph have been executed; and determining, among the queue times corresponding to the plurality of operation queues, the latest queue time as the iteration time of the global dependency graph;
determining a distributed training analysis result of the deep neural network according to the result of the off-line simulation, wherein the determining comprises the following steps:
and determining a distributed training analysis result of the deep neural network according to the iteration time.
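Claim 13 describes a queue-driven replay of the dependency graph. The sketch below implements that scheduling rule (always advance the queue with the earliest queue time) on an assumed data model; the `SimOp` structure and the ready-set handling of dependencies are illustrative simplifications, not the patent's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class SimOp:
    name: str
    duration: float      # profiled compute or transfer time in seconds
    deps: tuple = ()     # names of operations that must finish before this one

def simulate(queues: dict, time_offsets: dict) -> float:
    """Replay one iteration. `queues` maps a queue id (a processor, or a
    per-tensor send/recv channel) to its ordered list of SimOps; `time_offsets`
    gives each queue's initial queue time. Returns the iteration time."""
    queue_time = dict(time_offsets)
    finished = {}                                   # op name -> end time
    pending = {qid: list(ops) for qid, ops in queues.items()}

    while any(pending.values()):
        # Queues whose head operation has all of its dependencies finished.
        ready = [qid for qid, ops in pending.items()
                 if ops and all(d in finished for d in ops[0].deps)]
        # Scheduling rule from the claim: advance the earliest queue first.
        qid = min(ready, key=lambda q: queue_time[q])
        op = pending[qid].pop(0)
        dep_end = max((finished[d] for d in op.deps), default=0.0)
        start = max(queue_time[qid], dep_end)
        finished[op.name] = start + op.duration
        queue_time[qid] = start + op.duration

    # The latest queue time is the iteration time of the dependency graph.
    return max(queue_time.values())
```

A tensor's send and receive operations would typically sit in their own queues with `deps` pointing at the producing computation, which is how computation/communication overlap shows up in the replayed timeline.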
14. A training analysis apparatus for analyzing a distributed training process of a deep neural network, the deep neural network operating on a plurality of devices, the apparatus comprising:
a first determining module, configured to determine a global dependency graph of the deep neural network, where the global dependency graph is used to characterize dependencies between computation nodes of the deep neural network running on the processor of each device and communication relationships between communication nodes used to synchronize, among a plurality of processors, the tensors generated by the computation nodes, and a plurality of the communication nodes are included between every two processors in the global dependency graph;
a second determining module, configured to determine time offsets of the plurality of processors relative to a reference processor, with the objective of minimizing a time alignment function under the constraint that operation dependencies between nodes in the global dependency graph do not change before and after time alignment, where the time alignment function is used to characterize the time differences of tensors received by the processors from the same processor and the time offset differences between processors in the same device, and the reference processor is a processor of any one of the plurality of devices;
and a training simulation module, configured to simulate the distributed training process of the deep neural network off-line according to the global dependency graph and the time offsets, and to determine a distributed training analysis result of the deep neural network according to the result of the off-line simulation.
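A rough sketch of the kind of time-offset estimation the second determining module performs is given below. It minimizes the spread of aligned arrival gaps for tensors sent by the same processor and penalizes alignments that would reverse a send-to-receive dependency; the use of SciPy's Nelder-Mead optimizer, the large linear penalty, and the omission of the same-device offset term are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_offsets(events, processors, reference):
    """events: (sender, receiver, t_send, t_recv) tuples in local clocks.
    Returns a time offset per processor, with the reference pinned to zero."""
    others = [p for p in processors if p != reference]

    def objective(x):
        off = dict(zip(others, x))
        off[reference] = 0.0
        gaps_by_sender, penalty = {}, 0.0
        for s, r, ts, tr in events:
            gap = (tr + off[r]) - (ts + off[s])     # aligned send-to-receive gap
            gaps_by_sender.setdefault(s, []).append(gap)
            if gap < 0:                             # would reverse a dependency
                penalty += 1e6 * -gap
        spread = sum(np.var(g) for g in gaps_by_sender.values())
        return spread + penalty

    result = minimize(objective, np.zeros(len(others)), method="Nelder-Mead")
    offsets = dict(zip(others, result.x))
    offsets[reference] = 0.0
    return offsets
```

Pinning the reference processor's offset to zero removes the global shift that the alignment objective cannot determine on its own.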
15. A non-transitory computer readable medium on which a computer program is stored, wherein the computer program, when executed by a processing device, carries out the steps of the method according to any one of claims 1 to 13.
16. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing device for executing the computer program in the storage device to carry out the steps of the method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210203925.0A CN114580664A (en) | 2022-03-03 | 2022-03-03 | Training analysis method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114580664A (en) | 2022-06-03
Family
ID=81776518
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210203925.0A | 2022-03-03 | 2022-03-03 | Training analysis method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114580664A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130290223A1 (en) * | 2012-04-27 | 2013-10-31 | Yahoo! Inc. | Method and system for distributed machine learning |
US20140067738A1 (en) * | 2012-08-28 | 2014-03-06 | International Business Machines Corporation | Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization |
US20150324690A1 (en) * | 2014-05-08 | 2015-11-12 | Microsoft Corporation | Deep Learning Training System |
US20180262402A1 (en) * | 2016-04-15 | 2018-09-13 | Nec Laboratories America, Inc. | Communication efficient sparse-reduce in distributed machine learning |
CN112955909A (en) * | 2019-02-01 | 2021-06-11 | 华为技术有限公司 | Distributed training method and device of neural network |
CN113419931A (en) * | 2021-05-24 | 2021-09-21 | 北京达佳互联信息技术有限公司 | Performance index determination method and device of distributed machine learning system |
Non-Patent Citations (2)
Title |
---|
YANGHUA PENG et al.: "DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters", IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, vol. 32, no. 8, 19 January 2021 (2021-01-19), pages 1947 - 1960, XP011839954, DOI: 10.1109/TPDS.2021.3052895 *
LI Xiangqiao; LI Chen; TIAN Lihua; ZHANG Yulong: "Research on optimization of parallel training of convolutional neural networks" (卷积神经网络并行训练的优化研究), Computer Technology and Development (计算机技术与发展), no. 08, 30 April 2018 (2018-04-30), pages 19 - 23 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115314397A (en) * | 2022-08-05 | 2022-11-08 | 中科计算技术西部研究院 | Network simulation method, system, device and storage medium for distributed training |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107330522B (en) | Method, device and system for updating deep learning model | |
US7890649B2 (en) | System and method for scalable processing of multi-way data stream correlations | |
CN113327598B (en) | Model training method, voice recognition method, device, medium and equipment | |
CN112866059B (en) | Lossless network performance testing method and device based on artificial intelligence application | |
CN111967339B (en) | Method and device for planning unmanned aerial vehicle path | |
US9367293B2 (en) | System and method for compiler assisted parallelization of a stream processing operator | |
CN113765928B (en) | Internet of things intrusion detection method, equipment and medium | |
CN114389975B (en) | Network bandwidth estimation method, device and system, electronic equipment and storage medium | |
JP2021504680A (en) | Computer mounting methods, computer program products and equipment | |
Wang et al. | FLOWPROPHET: Generic and accurate traffic prediction for data-parallel cluster computing | |
CN114580664A (en) | Training analysis method and device, storage medium and electronic equipment | |
CN112433853A (en) | Heterogeneous sensing data partitioning method for parallel application of supercomputer data | |
Hu et al. | dpro: A generic performance diagnosis and optimization toolkit for expediting distributed dnn training | |
Luthra et al. | TCEP: Transitions in operator placement to adapt to dynamic network environments | |
CN103733188A (en) | Invalidating streams in operator graph | |
CN113793128A (en) | Method, device, equipment and computer readable medium for generating business fault reason information | |
KR20220084508A (en) | Asymmetric centralized training for distributed deep learning | |
US20120029961A1 (en) | Method, device and computer program product for service balancing in an electronic communications system | |
CN114154018A (en) | Cloud-edge collaborative video stream processing method and system for unmanned system | |
CN114997380A (en) | Sampler and device for graph neural network model execution | |
CN110633707A (en) | Method and device for predicting speed | |
CN110633596A (en) | Method and device for predicting vehicle direction angle | |
CN112394914A (en) | Edge-centered programming method for application of Internet of things | |
Mayer et al. | Real-time Batch Scheduling in Data-Parallel Complex Event Processing | |
CN114662669A (en) | Neural network architecture searching method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||