CN112070223A - Model parallel method based on Tensorflow framework - Google Patents
Model parallel method based on Tensorflow framework Download PDFInfo
- Publication number
- CN112070223A (application CN202010825175.1A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- algorithm
- level
- path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a model parallel method based on the TensorFlow framework. A parallel optimization algorithm is added to TensorFlow's model parallelism, replacing the original random model partitioning with model partitioning performed by a novel greedy algorithm. The strategy of the optimization algorithm is to find the critical path in the computation graph and then run a greedy algorithm, whose objective is minimum completion time, over the devices executing that path, placing the nodes on the critical path on the device with the highest execution speed. Keeping the critical path on the same device minimizes network transmission delay and thereby reduces task completion time. To avoid a complex critical-path algorithm, to handle the case where the memory of a single device cannot hold the entire critical path, and to also account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ordering the nodes by computational complexity.
Description
Technical Field
The invention relates to the field of computers, and in particular to a model parallel method based on the TensorFlow framework.
Background
Since TensorFlow was released as open source, deep-learning research in academia and industry has grown at an unprecedented pace. As the relevant models become deeper and more complex, with ever more layers in their hierarchical structures, neural network models grow larger and larger, gradually exceeding the memory limit of a single device, and the need to reduce model training time grows daily. TensorFlow, however, is tightly constrained to a single compute node, a limitation that becomes more pronounced as data sets grow. Distributed parallelism is an effective way to improve the training efficiency of deep-learning models and to overcome the memory bottleneck of a single device. A distributed parallel algorithm is therefore needed that handles neural network models too large for the memory of a single device while also reducing model training time.
Disclosure of Invention
In order to solve the above problem, an embodiment of the present invention provides a model parallel method based on the TensorFlow framework.
The embodiment of the invention provides a model parallel method based on a Tensorflow framework, which comprises the following steps:
The parallel optimization algorithm replaces the original random model partitioning with model partitioning performed by a novel greedy algorithm.
The model parallel optimization algorithm is scalable: it enables users to run larger models and to complete computations faster by adding more devices.
The model parallel optimization algorithm does not simply partition devices at random based on experience. Its strategy is to find the critical path in the computation graph, run a greedy algorithm whose objective is minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
The model partitioning algorithm is implemented on the TensorFlow framework platform as follows:
Input: a set D of n devices, each with an execution-speed description s; a set V of m computation nodes, each with a computational-complexity description c; and the edge set E of the computation graph.
Output: a matrix O of dimensions m × n, representing the final solution of the algorithm. Element O_{i,j} = 1 means that the j-th node is finally placed on the i-th device for execution, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. Each element O_{i,j} is either 1 (placed) or 0 (not placed).
1. Compute the source-based ranking level of each node from the given node set V and the edge set E of the computation graph.
2. Compute the destination-based ranking level of each node from the given node set V and the edge set E of the computation graph.
3. Sum each node's source-based and destination-based ranking levels to obtain its final ranking level.
4. Sort the nodes by final ranking level.
5. Sort the set D of n devices by execution speed.
6. Declare an m × n matrix O, each element of which records on which device a node is placed for execution.
7. Initialize O by filling it with 0.
8. For each node in sorted order, determine whether each device can store and execute it; set the corresponding entry to 1 (placed) or 0 (not placed).
9. When all nodes have been processed, the algorithm terminates.
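The greedy placement in steps 4 through 9 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `greedy_place`, `node_mem`, and `dev_mem` are assumptions, and the "can this device store and execute the node" check of step 8 is reduced to a simple memory-capacity test.

```python
def greedy_place(ranks, node_mem, dev_speed, dev_mem):
    """Greedy placement: highest-ranked node first, onto the fastest
    device whose remaining memory can still hold it (steps 4-9)."""
    nodes = sorted(range(len(ranks)), key=lambda j: -ranks[j])         # step 4
    devs = sorted(range(len(dev_speed)), key=lambda i: -dev_speed[i])  # step 5
    O = [[0] * len(ranks) for _ in dev_speed]   # placement matrix (steps 6-7)
    free = list(dev_mem)
    for j in nodes:
        for i in devs:                          # step 8: first device that fits
            if free[i] >= node_mem[j]:
                O[i][j] = 1
                free[i] -= node_mem[j]
                break
    return O                                    # step 9
```

For example, with ranks [3, 1, 2], unit node memories, device speeds [2, 1], and device memories [2, 1], the two highest-ranked nodes fill the faster device and the remaining node overflows to the slower one.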
The model parallel optimization algorithm estimates the critical path by ordering the nodes by computational complexity. This complexity-ordering optimization comprises a source-based ranking algorithm and a destination-based ranking algorithm.
The source-based ranking algorithm for the nodes is implemented on the TensorFlow framework platform as follows:
Input: the adjacency matrix M of the computation-graph directed graph.
Output: the array source_ranks of the nodes' source-based ranking levels.
1. Traverse the adjacency matrix M to find the nodes with in-degree 0.
2. Build a linked list with each such node as head node.
3. For each adjacency-matrix element equal to 1 (an edge), append the corresponding successor node to the linked list.
4. Iterate until every path in the computation graph has been built into a linked list.
5. Declare source_ranks, the list of the nodes' source-based ranking levels.
6. Traverse each linked list from its head node to the current node to obtain each node's source-based ranking level, and append it to the new list.
7. Return the nodes' source-based ranking levels source_ranks.
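A minimal Python sketch of the source-based ranking computation follows. Instead of the linked-list bookkeeping described above, it uses an equivalent Kahn-style topological sweep over the adjacency matrix, relaxing a longest-path value at each edge; the function and variable names are illustrative, not from the patent.

```python
def source_ranks(M, c):
    """Source-based rank of each node in a DAG: the maximum sum of node
    complexities over any path ending at that node. M is the adjacency
    matrix (M[u][v] == 1 for edge u -> v), c the complexity of each node."""
    n = len(M)
    indeg = [sum(M[u][v] for u in range(n)) for v in range(n)]
    ranks = [c[v] if indeg[v] == 0 else 0 for v in range(n)]
    stack = [v for v in range(n) if indeg[v] == 0]   # in-degree-0 sources (step 1)
    while stack:
        u = stack.pop()
        for v in range(n):
            if M[u][v]:
                # A longer path to v may run through u; keep the maximum.
                ranks[v] = max(ranks[v], ranks[u] + c[v])
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
    return ranks
```

On a diamond graph 0 → {1, 2} → 3 with complexities [1, 2, 3, 4], the sink's rank is 8, i.e. the heavier branch 0 → 2 → 3.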
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a computational graph of the need for model partitioning in the practice of the present invention;
FIG. 2 is a source-based ranking and a destination-based ranking in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating model partitioning according to node importance ranking in an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Considering the network transmission delay, device heterogeneity, computational complexity of the compute nodes, and the degree of interdependence among the compute nodes during deep-learning model training, the TensorFlow-based model parallel optimization algorithm does not simply partition devices at random based on experience. Its strategy is to find the critical path in the computation graph, run a greedy algorithm targeting minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed. The critical path is the longest path (the one with the highest total computational complexity) from any starting node to the end node. It is sought out because its nodes have high computational complexity and strong execution dependencies between them; placing them on the same device minimizes network transmission delay, provided the device memory is not exceeded, and thereby reduces task completion time.
To avoid a complex critical-path algorithm, to handle the case where the memory of a single device cannot hold the entire critical path, and to also account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ordering the nodes by computational complexity. Any node in the computation-graph directed graph has both a source attribute and a destination attribute, and a ranking algorithm is designed for each:
a. Source-based ranking algorithm: the source-based ranking level of any node vi is the maximum, over all paths from any starting node vj to vi, of the sum of the computational complexities of all nodes on the path. The corresponding formula is defined as follows:
SourceRank(vi) = ci + max over predecessors vj of SourceRank(vj), with SourceRank(vi) = ci if vi has no predecessor #(1-1)
b. Destination-based ranking algorithm: the destination-based ranking level of any node vi is the maximum, over all paths from vi to any terminating node vj, of the sum of the computational complexities of all nodes on the path. The corresponding formula is defined as follows:
EndRank(vi) = ci + max over successors vj of EndRank(vj), with EndRank(vi) = ci if vi has no successor #(1-2)
With these two definitions, the ranking level of any node vi in the computation graph can be given: it is the sum of the node's source-based and destination-based ranking levels. The corresponding formula is:
Rank(vi) = SourceRank(vi) + EndRank(vi) #(1-3)
As mentioned above, the ranking level of a node vi represents the node's importance in the computation graph and the degree to which other nodes depend on it: the higher the level, the more important the node and the stronger the dependence of other nodes on it. Once the ranking level of every node is obtained, all nodes in the set V are sorted by ranking level; the node with the highest level is placed on the computing device with the highest execution speed, then the node with the second-highest level, iterating until all computation nodes are placed, which completes the greedy algorithm. This concludes the design of the optimization algorithm.
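The full ranking under definitions (1-1) through (1-3) can be sketched in Python as below. The destination-based level is obtained by running the same longest-path sweep on the edge-reversed graph; the function names are illustrative assumptions, not from the patent.

```python
def total_ranks(M, c):
    """Rank(v) = SourceRank(v) + EndRank(v) (formula 1-3), where M is the
    adjacency matrix of the DAG and c the complexity of each node."""
    n = len(M)

    def longest_to(adj):
        # Max complexity sum over paths ending at each node (Kahn-style sweep).
        indeg = [sum(adj[u][v] for u in range(n)) for v in range(n)]
        rank = [c[v] if indeg[v] == 0 else 0 for v in range(n)]
        stack = [v for v in range(n) if indeg[v] == 0]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v]:
                    rank[v] = max(rank[v], rank[u] + c[v])
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        stack.append(v)
        return rank

    rev = [[M[v][u] for v in range(n)] for u in range(n)]  # edge-reversed graph
    src, end = longest_to(M), longest_to(rev)              # (1-1) and (1-2)
    return [s + e for s, e in zip(src, end)]               # (1-3)
```

Note that SourceRank and EndRank each include the node's own complexity ci, so per formula (1-3) a node on the critical path receives the path's total complexity plus its own ci once more; the ordering among nodes is what the greedy placement consumes.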
For clarity of description, the following describes the execution flow of the algorithm in the embodiment of the present invention:
A computation graph such as the one shown in fig. 1 now needs to be model-partitioned. Each circle in the graph represents a TensorFlow operation, i.e., a compute node in the TensorFlow computation graph; the number inside each circle is that node's computational complexity.
First, as shown in FIG. 2, a source-based ranking level and a destination-based ranking level are computed for all nodes in the graph according to the foregoing formula definitions.
Finally, as shown in fig. 3, the total ranking levels of all nodes are calculated, and the nodes are ranked by importance accordingly. After sorting, the model can be partitioned according to the sorting result onto two devices Dev0 and Dev1 (execution speed: Dev0 > Dev1) by the iteration described above.
Thus, the problem of placing m computation-graph nodes on n devices in a distributed system so as to achieve minimum completion time under a given network bandwidth is solved by modeling the computation graph with a critical path, and the time complexity of the algorithm is controlled at O(n log n).
According to the algorithm design in the previous section, the corresponding problem-solving flow can now be described as follows: from its two inputs, the set V of m computation nodes with complexity description c and the edge set E of the computation graph, the algorithm computes each node's source-based and destination-based ranking levels, adds them to obtain the node's final ranking level, and finally places the nodes greedily on the devices. For a compute node vi, its computational complexity ci is an input condition of the algorithm.
From the above description of the embodiments, it is clear for a person skilled in the art that the embodiments can be implemented by means of a software platform and a hardware platform, and based on such understanding, the technical solutions described above may be essentially or partially implemented in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the method described in each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (5)
1. A model parallel method based on the TensorFlow framework, characterized in that the original random model partitioning is replaced by model partitioning performed by a novel greedy algorithm.
2. The TensorFlow framework-based model parallel method as claimed in claim 1, wherein the model parallel optimization algorithm is scalable, enabling users to run larger models and to complete computations faster by adding more devices.
3. The TensorFlow framework-based model parallel method as claimed in claim 1, wherein the model parallel optimization algorithm does not simply partition devices at random based on experience; its strategy is to find the critical path in the computation graph, run a greedy algorithm targeting minimum completion time over the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
4. The TensorFlow framework-based model parallel method according to claim 1 or 2, wherein the model parallel optimization algorithm estimates the critical path by ordering the nodes by computational complexity; the problem of placing m computation-graph nodes on n devices in a distributed system so as to achieve minimum completion time under a given network bandwidth is solved by modeling the critical path, and the time complexity of the algorithm is controlled at O(n log n).
5. The computational-complexity-ordering optimization algorithm comprises a source-based ranking algorithm and a destination-based ranking algorithm, implemented on the TensorFlow framework; the ranking level of any compute node vi is the sum of the node's source-based and destination-based ranking levels; a node's ranking level represents its importance in the computation graph and the degree to which other nodes depend on it, and the higher the level, the more important the node and the stronger the dependence of other nodes on it; after the ranking level of every node is obtained, all nodes in the set V are sorted by ranking level, the node with the highest level is placed on the computing device with the highest execution speed, then the node with the second-highest level, iterating until all computation nodes are placed, which completes the greedy algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010825175.1A CN112070223A (en) | 2020-08-17 | 2020-08-17 | Model parallel method based on Tensorflow framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112070223A true CN112070223A (en) | 2020-12-11 |
Family
ID=73662178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010825175.1A Pending CN112070223A (en) | 2020-08-17 | 2020-08-17 | Model parallel method based on Tensorflow framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112070223A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115809699A (en) * | 2023-02-03 | 2023-03-17 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model inference |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819664A (en) * | 2012-07-18 | 2012-12-12 | 中国人民解放军国防科学技术大学 | Influence maximization parallel accelerating method based on graphic processing unit |
CN109909657A (en) * | 2019-04-02 | 2019-06-21 | 北京无线电测量研究所 | A kind of automatic welding paths planning method of antenna array |
- 2020-08-17: CN application CN202010825175.1A filed; status Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819664A (en) * | 2012-07-18 | 2012-12-12 | 中国人民解放军国防科学技术大学 | Influence maximization parallel accelerating method based on graphic processing unit |
CN109909657A (en) * | 2019-04-02 | 2019-06-21 | 北京无线电测量研究所 | A kind of automatic welding paths planning method of antenna array |
Non-Patent Citations (1)
Title |
---|
何马均 (He Majun): "Research on Efficient Distributed Parallel Algorithms for the Deep Learning Framework TensorFlow" *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115809699A (en) * | 2023-02-03 | 2023-03-17 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model inference |
CN115809699B (en) * | 2023-02-03 | 2023-06-23 | 之江实验室 | Method and device for estimating minimum memory occupation amount required by neural network model reasoning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3085897C (en) | Evolutionary architectures for evolution of deep neural networks | |
Hasegawa et al. | A novel chaotic search for quadratic assignment problems | |
CN112651509B (en) | Method and device for determining quantum circuit | |
Huang et al. | An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks | |
CN109690576A (en) | The training machine learning model in multiple machine learning tasks | |
CN108021983A (en) | Neural framework search | |
Kampolis et al. | A multilevel approach to single-and multiobjective aerodynamic optimization | |
CN105550746A (en) | Training method and training device of machine learning model | |
Zomaya et al. | A framework for reinforcement-based scheduling in parallel processor systems | |
EP4006788A1 (en) | Quantum circuit determining method and apparatus, device, and storage medium | |
CN111966495B (en) | Data processing method and device | |
Ma et al. | A comprehensive improved salp swarm algorithm on redundant container deployment problem | |
De Lima et al. | Efficient ridesharing dispatch using multi-agent reinforcement learning | |
Sun et al. | A teaching-learning-based optimization with feedback for LR fuzzy flexible assembly job shop scheduling problem with batch splitting | |
CN112070223A (en) | Model parallel method based on Tensorflow framework | |
CN116644804B (en) | Distributed training system, neural network model training method, device and medium | |
CN113163004A (en) | Industrial Internet edge task unloading decision method, device and storage medium | |
CN114386309B (en) | Agent optimization problem scale unification method in cloud computing environment | |
WO2022166125A1 (en) | Recommendation system with adaptive weighted baysian personalized ranking loss | |
CN113986816A (en) | Reconfigurable computing chip | |
CN110175287B (en) | Flink-based matrix decomposition implicit feedback recommendation method and system | |
Haralampiev | Neural network approaches for a facility location problem | |
Skraba et al. | Application of self-gonfiguring genetic algorithm for human resource management | |
Vuchener et al. | Dynamic load-balancing with variable number of processors based on graph repartitioning | |
CN115001978B (en) | Cloud tenant virtual network intelligent mapping method based on reinforcement learning model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201211 |