CN112070223A - Model parallel method based on Tensorflow framework - Google Patents
Model parallel method based on Tensorflow framework Download PDFInfo
- Publication number
- CN112070223A (application CN202010825175.1A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- algorithm
- level
- path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a model parallel method based on the TensorFlow framework. A parallel optimization algorithm is added to TensorFlow's model parallelism, replacing the original random model partitioning with model partitioning performed by a novel greedy algorithm. The strategy of the optimization algorithm is to find the critical path in the computation graph and then run a greedy algorithm, whose objective is minimum completion time, over the devices executing that path, placing the nodes on the critical path on the device with the highest execution speed. Keeping the critical path on the same device minimizes network transmission delay and thereby reduces task completion time. To avoid a complex critical-path algorithm, to handle the case where the memory of a single device cannot hold the entire critical path, and to also account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ordering the nodes by computational complexity.
Description
Technical Field
The invention relates to the field of computers, and in particular to a model parallel method based on the TensorFlow framework.
Background
Since TensorFlow was released as open source, deep-learning research in academia and industry has grown at an unprecedented pace. As the relevant models become deeper and more complex, with ever more layers in their hierarchical structures, neural network models grow larger and larger, gradually exceeding the memory limit of a single device, and the need to reduce model training time grows daily. TensorFlow, however, is tightly constrained to a single compute node, a limitation that becomes more pronounced as data sets grow. Distributed parallelism is an effective way to improve the training efficiency of deep-learning models and to overcome the memory bottleneck of a single device. A distributed parallel algorithm is therefore needed that handles neural network models too large for the memory of a single device while also reducing model training time.
Disclosure of Invention
In order to solve the above problem, an embodiment of the present invention provides a model parallel method based on the TensorFlow framework.
The embodiment of the invention provides a model parallel method based on a Tensorflow framework, which comprises the following steps:
The parallel optimization algorithm replaces the original random model partitioning with model partitioning performed by a novel greedy algorithm.
The model parallel optimization algorithm is scalable: it enables users to run larger models and to complete computations faster by adding more devices.
The model parallel optimization algorithm does not simply partition devices at random based on experience. Its strategy is to find the critical path in the computation graph, run a greedy algorithm whose objective is minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
The model partitioning algorithm is implemented on the TensorFlow framework platform as follows:
Input: a set D of n devices, each with an execution-speed description s; a set V of m computation nodes, each with a computational-complexity description c; and the edge set E of the computation graph.
Output: a matrix O of dimensions m × n, representing the final solution of the algorithm. Element O_{i,j} = 1 means that the j-th node is finally placed on the i-th device for execution, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. Each element O_{i,j} is either 1 (placed) or 0 (not placed).
1. Compute the source-based ranking level of each node from the given node set V and the edge set E of the computation graph.
2. Compute the destination-based ranking level of each node from the given node set V and the edge set E of the computation graph.
3. Sum each node's source-based and destination-based ranking levels to obtain its final ranking level.
4. Sort the nodes by final ranking level.
5. Sort the set D of n devices by execution speed.
6. Declare an m × n matrix O, each element of which records on which device a node is placed for execution.
7. Initialize O by filling it with 0.
8. For each node in sorted order, determine whether each device can store and execute it; set the corresponding entry to 1 (placed) or 0 (not placed).
9. When all nodes have been processed, the algorithm terminates.
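The greedy placement in steps 4 through 9 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `greedy_place`, `node_mem`, and `dev_mem` are assumptions, and the "can this device store and execute the node" check of step 8 is reduced to a simple memory-capacity test.

```python
def greedy_place(ranks, node_mem, dev_speed, dev_mem):
    """Greedy placement: highest-ranked node first, onto the fastest
    device whose remaining memory can still hold it (steps 4-9)."""
    nodes = sorted(range(len(ranks)), key=lambda j: -ranks[j])         # step 4
    devs = sorted(range(len(dev_speed)), key=lambda i: -dev_speed[i])  # step 5
    O = [[0] * len(ranks) for _ in dev_speed]   # placement matrix (steps 6-7)
    free = list(dev_mem)
    for j in nodes:
        for i in devs:                          # step 8: first device that fits
            if free[i] >= node_mem[j]:
                O[i][j] = 1
                free[i] -= node_mem[j]
                break
    return O                                    # step 9
```

For example, with ranks [3, 1, 2], unit node memories, device speeds [2, 1], and device memories [2, 1], the two highest-ranked nodes fill the faster device and the remaining node overflows to the slower one.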
The model parallel optimization algorithm estimates the critical path by ordering the nodes by computational complexity. This complexity-ordering optimization comprises a source-based ranking algorithm and a destination-based ranking algorithm.
The source-based ranking algorithm for the nodes is implemented on the TensorFlow framework platform as follows:
Input: the adjacency matrix M of the computation-graph directed graph.
Output: the array source_ranks of the nodes' source-based ranking levels.
1. Traverse the adjacency matrix M to find the nodes with in-degree 0.
2. Build a linked list with each such node as head node.
3. For each adjacency-matrix element equal to 1 (an edge), append the corresponding successor node to the linked list.
4. Iterate until every path in the computation graph has been built into a linked list.
5. Declare source_ranks, the list of the nodes' source-based ranking levels.
6. Traverse each linked list from its head node to the current node to obtain each node's source-based ranking level, and append it to the new list.
7. Return the nodes' source-based ranking levels source_ranks.
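A minimal Python sketch of the source-based ranking computation follows. Instead of the linked-list bookkeeping described above, it uses an equivalent Kahn-style topological sweep over the adjacency matrix, relaxing a longest-path value at each edge; the function and variable names are illustrative, not from the patent.

```python
def source_ranks(M, c):
    """Source-based rank of each node in a DAG: the maximum sum of node
    complexities over any path ending at that node. M is the adjacency
    matrix (M[u][v] == 1 for edge u -> v), c the complexity of each node."""
    n = len(M)
    indeg = [sum(M[u][v] for u in range(n)) for v in range(n)]
    ranks = [c[v] if indeg[v] == 0 else 0 for v in range(n)]
    stack = [v for v in range(n) if indeg[v] == 0]   # in-degree-0 sources (step 1)
    while stack:
        u = stack.pop()
        for v in range(n):
            if M[u][v]:
                # A longer path to v may run through u; keep the maximum.
                ranks[v] = max(ranks[v], ranks[u] + c[v])
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
    return ranks
```

On a diamond graph 0 → {1, 2} → 3 with complexities [1, 2, 3, 4], the sink's rank is 8, i.e. the heavier branch 0 → 2 → 3.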
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a computational graph of the need for model partitioning in the practice of the present invention;
FIG. 2 is a source-based ranking and a destination-based ranking in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating model partitioning according to node importance ranking in an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Considering the network transmission delay, device heterogeneity, computational complexity of the compute nodes, and the degree of interdependence among the compute nodes during deep-learning model training, the TensorFlow-based model parallel optimization algorithm does not simply partition devices at random based on experience. Its strategy is to find the critical path in the computation graph, run a greedy algorithm targeting minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed. The critical path is the longest path (the one with the highest total computational complexity) from any starting node to the end node. It is sought out because its nodes have high computational complexity and strong execution dependencies between them; placing them on the same device minimizes network transmission delay, provided the device memory is not exceeded, and thereby reduces task completion time.
To avoid a complex critical-path algorithm, to handle the case where the memory of a single device cannot hold the entire critical path, and to also account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ordering the nodes by computational complexity. Any node in the computation-graph directed graph has both a source attribute and a destination attribute, and a ranking algorithm is designed for each:
a. Source-based ranking algorithm: the source-based ranking level of any node vi is the maximum, over all paths from any starting node vj to vi, of the sum of the computational complexities of all nodes on the path. The corresponding formula is defined as follows:
SourceRank(vi) = ci + max over predecessors vj of SourceRank(vj), with SourceRank(vi) = ci if vi has no predecessor #(1-1)
b. Destination-based ranking algorithm: the destination-based ranking level of any node vi is the maximum, over all paths from vi to any terminating node vj, of the sum of the computational complexities of all nodes on the path. The corresponding formula is defined as follows:
EndRank(vi) = ci + max over successors vj of EndRank(vj), with EndRank(vi) = ci if vi has no successor #(1-2)
With these two definitions, the ranking level of any node vi in the computation graph can be given: it is the sum of the node's source-based and destination-based ranking levels. The corresponding formula is:
Rank(vi) = SourceRank(vi) + EndRank(vi) #(1-3)
As mentioned above, the ranking level of a node vi represents the node's importance in the computation graph and the degree to which other nodes depend on it: the higher the level, the more important the node and the stronger the dependence of other nodes on it. Once the ranking level of every node is obtained, all nodes in the set V are sorted by ranking level; the node with the highest level is placed on the computing device with the highest execution speed, then the node with the second-highest level, iterating until all computation nodes are placed, which completes the greedy algorithm. This concludes the design of the optimization algorithm.
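The full ranking under definitions (1-1) through (1-3) can be sketched in Python as below. The destination-based level is obtained by running the same longest-path sweep on the edge-reversed graph; the function names are illustrative assumptions, not from the patent.

```python
def total_ranks(M, c):
    """Rank(v) = SourceRank(v) + EndRank(v) (formula 1-3), where M is the
    adjacency matrix of the DAG and c the complexity of each node."""
    n = len(M)

    def longest_to(adj):
        # Max complexity sum over paths ending at each node (Kahn-style sweep).
        indeg = [sum(adj[u][v] for u in range(n)) for v in range(n)]
        rank = [c[v] if indeg[v] == 0 else 0 for v in range(n)]
        stack = [v for v in range(n) if indeg[v] == 0]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v]:
                    rank[v] = max(rank[v], rank[u] + c[v])
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        stack.append(v)
        return rank

    rev = [[M[v][u] for v in range(n)] for u in range(n)]  # edge-reversed graph
    src, end = longest_to(M), longest_to(rev)              # (1-1) and (1-2)
    return [s + e for s, e in zip(src, end)]               # (1-3)
```

Note that SourceRank and EndRank each include the node's own complexity ci, so per formula (1-3) a node on the critical path receives the path's total complexity plus its own ci once more; the ordering among nodes is what the greedy placement consumes.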
For clarity of description, the following describes the execution flow of the algorithm in the embodiment of the present invention:
A computation graph such as the one shown in fig. 1 now needs to be model-partitioned. Each circle in the graph represents a TensorFlow operation, i.e., a compute node in the TensorFlow computation graph; the number inside each circle is that node's computational complexity.
First, as shown in FIG. 2, a source-based ranking level and a destination-based ranking level are computed for all nodes in the graph according to the foregoing formula definitions.
Finally, as shown in fig. 3, the total ranking levels of all nodes are calculated, and the nodes are ranked by importance accordingly. After sorting, the model can be partitioned according to the sorting result onto two devices Dev0 and Dev1 (execution speed: Dev0 > Dev1) by the iteration described above.
Thus, the problem of placing m computation-graph nodes on n devices in a distributed system so as to achieve minimum completion time under a given network bandwidth is solved by modeling the computation graph with a critical path, and the time complexity of the algorithm is controlled at O(n log n).
According to the algorithm design in the previous section, the corresponding problem-solving flow can now be described as follows: from its two inputs, the set V of m computation nodes with complexity description c and the edge set E of the computation graph, the algorithm computes each node's source-based and destination-based ranking levels, adds them to obtain the node's final ranking level, and finally places the nodes greedily on the devices. For a compute node vi, its computational complexity ci is an input condition of the algorithm.
From the above description of the embodiments, it is clear for a person skilled in the art that the embodiments can be implemented by means of a software platform and a hardware platform, and based on such understanding, the technical solutions described above may be essentially or partially implemented in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the method described in each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (5)
1. A model parallel method based on the TensorFlow framework, characterized in that the original random model partitioning is replaced by model partitioning performed by a novel greedy algorithm.
2. The TensorFlow framework-based model parallel method as claimed in claim 1, wherein the model parallel optimization algorithm is scalable, enabling users to run larger models and to complete computations faster by adding more devices.
3. The TensorFlow framework-based model parallel method as claimed in claim 1, wherein the model parallel optimization algorithm does not simply partition devices at random based on experience; its strategy is to find the critical path in the computation graph, run a greedy algorithm targeting minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
4. The TensorFlow framework-based model parallel method according to claim 1 or 2, wherein the model parallel optimization algorithm estimates the critical path by ordering the nodes by computational complexity; the problem of placing m computation-graph nodes on n devices in a distributed system so as to achieve minimum completion time under a given network bandwidth is solved by modeling the critical path, and the time complexity of the algorithm is controlled at O(n log n).
5. The computational-complexity-ordering optimization algorithm comprises a source-based ranking algorithm and a destination-based ranking algorithm, implemented on the TensorFlow framework; the ranking level of any compute node vi is the sum of the node's source-based and destination-based ranking levels; a node's ranking level represents its importance in the computation graph and the degree to which other nodes depend on it, and the higher the level, the more important the node and the stronger the dependence of other nodes on it; after the ranking level of every node is obtained, all nodes in the set V are sorted by ranking level, the node with the highest level is placed on the computing device with the highest execution speed, then the node with the second-highest level, iterating until all computation nodes are placed, which completes the greedy algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010825175.1A CN112070223A (en) | 2020-08-17 | 2020-08-17 | Model parallel method based on Tensorflow framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112070223A true CN112070223A (en) | 2020-12-11 |
Family
ID=73662178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010825175.1A Pending CN112070223A (en) | 2020-08-17 | 2020-08-17 | Model parallel method based on Tensorflow framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112070223A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115809699A (en) * | 2023-02-03 | 2023-03-17 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model inference |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819664A (en) * | 2012-07-18 | 2012-12-12 | 中国人民解放军国防科学技术大学 | Influence maximization parallel accelerating method based on graphic processing unit |
CN109909657A (en) * | 2019-04-02 | 2019-06-21 | 北京无线电测量研究所 | A kind of automatic welding paths planning method of antenna array |
- 2020-08-17: CN application CN202010825175.1A filed; status Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819664A (en) * | 2012-07-18 | 2012-12-12 | 中国人民解放军国防科学技术大学 | Influence maximization parallel accelerating method based on graphic processing unit |
CN109909657A (en) * | 2019-04-02 | 2019-06-21 | 北京无线电测量研究所 | A kind of automatic welding paths planning method of antenna array |
Non-Patent Citations (1)
Title |
---|
何马均 (He Majun): "Research on Efficient Distributed Parallel Algorithms for the Deep Learning Framework TensorFlow" *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115809699A (en) * | 2023-02-03 | 2023-03-17 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model inference |
CN115809699B (en) * | 2023-02-03 | 2023-06-23 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model reasoning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3085897C (en) | Evolutionary architectures for evolution of deep neural networks | |
Hasegawa et al. | A novel chaotic search for quadratic assignment problems | |
CN112651509B (en) | Method and device for determining quantum circuit | |
Huang et al. | An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks | |
CN109690576A (en) | The training machine learning model in multiple machine learning tasks | |
CN108021983A (en) | Neural framework search | |
Kampolis et al. | A multilevel approach to single-and multiobjective aerodynamic optimization | |
CN105550746A (en) | Training method and training device of machine learning model | |
Zomaya et al. | A framework for reinforcement-based scheduling in parallel processor systems | |
EP4006788A1 (en) | Quantum circuit determining method and apparatus, device, and storage medium | |
CN111966495B (en) | Data processing method and device | |
Ma et al. | A comprehensive improved salp swarm algorithm on redundant container deployment problem | |
De Lima et al. | Efficient ridesharing dispatch using multi-agent reinforcement learning | |
Sun et al. | A teaching-learning-based optimization with feedback for LR fuzzy flexible assembly job shop scheduling problem with batch splitting | |
CN112070223A (en) | Model parallel method based on Tensorflow framework | |
CN116644804B (en) | Distributed training system, neural network model training method, device and medium | |
CN113163004A (en) | Industrial Internet edge task unloading decision method, device and storage medium | |
CN114386309B (en) | Agent optimization problem scale unification method in cloud computing environment | |
WO2022166125A1 (en) | Recommendation system with adaptive weighted baysian personalized ranking loss | |
CN113986816A (en) | Reconfigurable computing chip | |
CN110175287B (en) | Flink-based matrix decomposition implicit feedback recommendation method and system | |
Haralampiev | Neural network approaches for a facility location problem | |
Skraba et al. | Application of self-gonfiguring genetic algorithm for human resource management | |
Vuchener et al. | Dynamic load-balancing with variable number of processors based on graph repartitioning | |
CN115001978B (en) | Cloud tenant virtual network intelligent mapping method based on reinforcement learning model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201211 |