CN112070223A - Model parallel method based on Tensorflow framework - Google Patents

Model parallel method based on Tensorflow framework

Info

Publication number
CN112070223A
CN112070223A
Authority
CN
China
Prior art keywords
nodes
node
algorithm
level
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010825175.1A
Other languages
Chinese (zh)
Inventor
田文洪
谢远伦
杨锦涛
许凌霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010825175.1A
Publication of CN112070223A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a model parallel method based on the Tensorflow framework. A parallel optimization algorithm is added to TensorFlow's model parallelism, replacing the original random model partitioning with partitioning performed by an innovative greedy algorithm. The strategy of the model parallel optimization algorithm is to find the critical path in the computation graph, apply a greedy algorithm that takes minimum completion time as its objective to the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed. Keeping the critical path on the same device minimizes network transmission delay and thereby reduces task completion time. To avoid a complex algorithm for computing the critical path, to handle the case where the memory of a single device cannot hold the entire critical path, and to still account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ranking nodes according to their computational complexity.

Description

Model parallel method based on Tensorflow framework
Technical Field
The invention relates to the field of computers, in particular to a model parallel method based on a Tensorflow framework.
Background
Since the TensorFlow framework was open-sourced, deep-learning research in academia and industry has developed at an unprecedented pace. As the relevant models grow deeper and more complex, with ever more layers in their hierarchical structures, neural network models become larger and larger, gradually exceeding the memory limit of a single device, and the need to reduce model training time grows daily. TensorFlow, however, is highly constrained on a single compute node, and this limitation becomes more prominent as the dataset size increases. Distributed parallelism is an effective way to improve the training efficiency of deep-learning models and to overcome the memory bottleneck of a single device. There is therefore a need for a distributed parallel algorithm that addresses neural network models too large for the memory of a single device while also reducing model training time.
Disclosure of Invention
In order to solve the above problem, an embodiment of the present invention provides a model parallel method based on a TensorFlow framework.
The embodiment of the invention provides a model parallel method based on a Tensorflow framework, which comprises the following steps:
The parallel optimization algorithm replaces the original random model partitioning with model partitioning executed by an innovative greedy algorithm.
The model parallel optimization algorithm is scalable: by adding more devices, users can run larger models and complete computations faster.
The model parallel optimization algorithm does not simply partition across devices at random based on experience. Its strategy is to find the critical path in the computation graph, apply a greedy algorithm that takes minimum completion time as its objective to the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
The model partitioning algorithm is implemented on the TensorFlow framework platform as follows:
Input: a set D of n devices, each with an execution speed description s; a set V of m compute nodes, each with a computational complexity description c; and the edge set E of the computation graph.
Output: a matrix O of dimensions m × n representing the final solution of the algorithm. Element O_{i,j} of the matrix means that the j-th node is finally placed on the i-th device for execution, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. The value of O_{i,j} can only be 1 or 0, where 1 means placed and 0 means not placed.
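For concreteness, the input and output described above can be sketched with simple containers as follows. This sketch is illustrative only: the container names Device and Node and their memory fields are assumptions introduced here, and Python serves merely as the host language.

# A minimal sketch of the algorithm's data, assuming plain Python containers.
# Device, Node, and the memory fields are hypothetical names introduced here;
# the patent only specifies speed s, complexity c, and the 0/1 matrix O.
from dataclasses import dataclass

@dataclass
class Device:
    speed: float    # execution speed description s
    memory: float   # capacity used by the placement check in step 8 below

@dataclass
class Node:
    complexity: float  # computational complexity description c
    memory: float      # memory the node's operation needs on a device

# D: list of n Device, V: list of m Node,
# E: edge set of the computation graph as (src, dst) index pairs.
# The solution is the 0/1 matrix O with O[i][j] == 1 iff node j is
# placed on device i.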
1. Compute the source-based ranking level of each node from the given set V of compute nodes and the edge set E of the computation graph.
2. Compute the destination-based ranking level of each node from the given set V of compute nodes and the edge set E of the computation graph.
3. Sum each node's source-based and destination-based ranking levels to obtain its final ranking level.
4. Sort the nodes by their final ranking level.
5. Sort the set D of n devices by execution speed.
6. Declare an m × n matrix O, each element of which records the device on which a node is placed for execution.
7. Initialize the matrix O by filling it with 0 elements.
8. For each node in sorted order, determine whether each device can store and execute it, writing 1 for placed and 0 for not placed.
9. When all nodes have been processed, the algorithm terminates.
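The nine steps above can be made concrete as in the following sketch. It assumes the Device and Node containers sketched earlier and a helper source_ranks (a possible implementation follows the source-based ranking algorithm below); the destination-based level is obtained by running the same helper on the transposed graph. This is one interpretation of the greedy placement under those assumptions, not verbatim code from the patent.

# A sketch of steps 1-9, under the assumptions stated above.
def place_nodes(devices, nodes, edges):
    m, n = len(nodes), len(devices)
    c = [v.complexity for v in nodes]
    M = [[0] * m for _ in range(m)]          # adjacency matrix of the graph
    for a, b in edges:
        M[a][b] = 1
    src = source_ranks(c, M)                 # step 1: source-based levels
    Mt = [[M[j][i] for j in range(m)] for i in range(m)]
    end = source_ranks(c, Mt)                # step 2: destination-based levels
    rank = [s + e for s, e in zip(src, end)] # step 3: final ranking level
    order = sorted(range(m), key=lambda j: rank[j], reverse=True)          # step 4
    devs = sorted(range(n), key=lambda i: devices[i].speed, reverse=True)  # step 5
    O = [[0] * m for _ in range(n)]          # steps 6-7: zero-filled matrix,
    free = [d.memory for d in devices]       # O[i][j]: node j on device i
    for j in order:                          # step 8: fastest device that fits
        for i in devs:
            if free[i] >= nodes[j].memory:
                O[i][j] = 1
                free[i] -= nodes[j].memory
                break
    return O                                 # step 9: all nodes processed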
The model parallel optimization algorithm estimates the critical path by ranking nodes according to their computational complexity.
The calculation complexity ranking optimization algorithm comprises a source-based ranking algorithm and a destination-based ranking algorithm.
The source-based ranking algorithm is implemented on the TensorFlow framework platform as follows:
Input: the adjacency matrix M representation of the computation graph (a directed graph).
Output: the array source_ranks of source-based ranking levels of the nodes.
1. Traverse the adjacency matrix M to find a node with in-degree 0.
2. Create a linked list with that node as the head.
3. Check each element of the adjacency matrix that equals 1, i.e., each edge of the graph.
4. Add each newly reached node to the linked list.
5. Iterate until every path in the computation graph has been built into a linked list.
6. Declare source_ranks, the list of source-based ranking levels of the nodes.
7. Traverse each linked list from the head node to the current node to obtain each node's source-based ranking level, and append it to the newly created list.
8. Return the array source_ranks of source-based ranking levels.
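The steps above build explicit linked lists of paths. An equivalent way to obtain the same quantity, offered here as an assumption rather than the patent's exact procedure, is a dynamic program over a topological order (Kahn's algorithm): starting from the in-degree-0 nodes found in step 1, each node's source-based level is the maximum over its predecessors of the predecessor's level plus the node's own complexity.

# A sketch of the source-based ranking level as a topological-order dynamic
# program; it computes, for each node, the maximum complexity sum over all
# paths from a zero-in-degree node to that node. Assumes the computation
# graph is a DAG given by a 0/1 adjacency matrix M and complexities c.
from collections import deque

def source_ranks(c, M):
    m = len(c)
    indeg = [sum(M[i][j] for i in range(m)) for j in range(m)]
    ranks = list(c)                          # every path includes the node itself
    q = deque(j for j in range(m) if indeg[j] == 0)
    while q:
        i = q.popleft()
        for j in range(m):
            if M[i][j]:
                ranks[j] = max(ranks[j], ranks[i] + c[j])
                indeg[j] -= 1
                if indeg[j] == 0:
                    q.append(j)
    return ranks

# The destination-based level is symmetric: run source_ranks on the
# transposed adjacency matrix.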
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a computation graph requiring model partitioning in an embodiment of the present invention;
FIG. 2 is a source-based ranking and a destination-based ranking in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating model partitioning according to node importance ranking in an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In view of the network transmission delay, device heterogeneity, computational complexity of the compute nodes, and interdependence among compute nodes during deep-learning model training, the TensorFlow-based model parallel optimization algorithm does not simply partition across devices at random based on experience. Its strategy is to find the critical path in the computation graph, apply a greedy algorithm that takes minimum completion time as its objective to the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed. The critical path is the longest path (or the one with the highest computational complexity) from any start node to the end node. The critical path is sought because its nodes have high computational complexity and strong execution dependencies among them; placing them on the same device minimizes network transmission delay, without exceeding the device's memory, and thereby reduces task completion time.
To avoid a complex algorithm for computing the critical path, to handle the case where the memory of a single device cannot hold the entire critical path, and to still account for the importance of nodes on long paths, the optimization algorithm estimates the critical path by ranking the computational complexity of the nodes. Any node in the directed computation graph has both a source attribute and a destination attribute, and a ranking algorithm is designed for each:
a. Source-based ranking algorithm: for any node v_i, its source-based ranking level is the maximum, over all paths from any start node v_j to node v_i, of the sum of the complexities of all nodes on the path. The corresponding formula is defined as follows:
SourceRank(v_i) = max_{p: v_j → v_i} Σ_{v_k ∈ p} c_k    (1-1)
b. Destination-based ranking algorithm: for any node v_i, its destination-based ranking level is the maximum, over all paths from node v_i to any terminal node v_j, of the sum of the complexities of all nodes on the path. The corresponding formula is defined as follows:
EndRank(v_i) = max_{p: v_i → v_j} Σ_{v_k ∈ p} c_k    (1-2)
With these two definitions, the ranking level of any node v_i in the computation graph can be given: the ranking level of any compute node v_i is the sum of the node's source-based ranking level and destination-based ranking level. The corresponding formula is expressed as follows:
Rank(v_i) = SourceRank(v_i) + EndRank(v_i)    (1-3)
As mentioned above, the ranking level of node v_i represents the node's importance in the computation graph and the degree of dependence between nodes; the higher the ranking level, the more important the node and the stronger the dependence of other nodes on it. Once the ranking level of every node v_i is obtained, all nodes in the set V can be sorted by ranking level. According to the sorted result, the node with the highest ranking level is placed on the compute device with the highest execution speed, then the node with the second-highest ranking level, and so on, iterating until all compute nodes are placed, which completes the greedy algorithm. This concludes the design of the optimization algorithm.
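As a worked illustration of formulas (1-1) to (1-3), consider a hypothetical four-node diamond graph; it is not the graph of FIG. 1, and the complexities are chosen only to make the arithmetic visible. The sketch reuses the source_ranks helper from the previous section.

# Hypothetical example: edges v0 -> v1 -> v3 and v0 -> v2 -> v3,
# with complexities c = [2, 5, 1, 3].
c = [2, 5, 1, 3]
M = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
Mt = [[M[j][i] for j in range(4)] for i in range(4)]
src = source_ranks(c, M)                  # [2, 7, 3, 10]
end = source_ranks(c, Mt)                 # [10, 8, 4, 3]
rank = [s + e for s, e in zip(src, end)]  # [12, 15, 7, 13]
# The greedy therefore places v1 first, then v3, then v0, then v2;
# the critical-path nodes v0, v1, v3 all outrank the off-path node v2.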
For clarity of description, the following describes the execution flow of the algorithm in the embodiment of the present invention:
A computation graph such as that shown in FIG. 1 now requires model partitioning. Each circle in the graph represents a TensorFlow operation, i.e., a compute node in the TensorFlow computation graph; the number inside a circle is the node's computational complexity.
First, as shown in FIG. 2, a source-based ranking level and a destination-based ranking level are computed for all nodes in the graph according to the foregoing formula definitions.
Finally, as shown in FIG. 3, the total ranking levels of all nodes are computed and the nodes are ranked by importance accordingly. After sorting, the model can be divided according to the sorting result across two devices Dev0 and Dev1 (execution speed: Dev0 > Dev1) by the iterative placement described above.
At this point, the problem of placing the m nodes of a computation graph on the n devices of a distributed system so as to achieve the minimum completion time under a given network bandwidth is solved by modeling the computation graph with the critical path, and the time complexity of the algorithm is kept at O(n log n).
Following the algorithm design of the previous section, the corresponding problem-solving flow can now be described as follows: from its two inputs, the set V of m compute nodes with computational complexity description c and the edge set E of the computation graph, the algorithm computes the source-based and destination-based ranking levels of each node, adds them to obtain the node's final ranking level, and finally places the nodes on devices greedily. For a given compute node v_i, its computational complexity c_i is an input condition for running the algorithm.
From the above description of the embodiments, it is clear for a person skilled in the art that the embodiments can be implemented by means of a software platform and a hardware platform, and based on such understanding, the technical solutions described above may be essentially or partially implemented in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the method described in each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (5)

1. A model parallel method based on a Tensorflow framework, characterized in that the original random model partitioning mode is replaced by model partitioning executed by an innovative greedy algorithm.
2. The Tensorflow framework-based model parallel method according to claim 1, wherein the model parallel optimization algorithm is scalable, enabling users to run larger models and complete computations faster by adding more devices.
3. The Tensorflow framework-based model parallel method according to claim 1, wherein the model parallel optimization algorithm does not simply partition the devices at random based on experience; its strategy is to find the critical path in the computation graph, apply a greedy algorithm that takes minimum completion time as its objective to the devices executing that path, and place the nodes on the critical path on the device with the highest execution speed.
4. The Tensorflow framework-based model parallel method according to claim 1 or 2, wherein the model parallel optimization algorithm estimates the critical path by ranking nodes according to their computational complexity; the problem of placing the m nodes of a computation graph on the n devices of a distributed system so as to achieve the minimum completion time under a given network bandwidth is solved by modeling the critical path, and the time complexity of the algorithm is kept at O(n log n).
5. The computational complexity ranking optimization algorithm comprises a source-based ranking algorithm and a destination-based ranking algorithm, implemented on the TensorFlow framework; the ranking level of any compute node v_i is the sum of the node's source-based and destination-based ranking levels; the ranking level of a node represents the node's importance in the computation graph and the degree of dependence between nodes, and the higher the ranking level, the more important the node and the stronger the dependence of other nodes on it; once the ranking level of every node is obtained, all nodes in the set V are sorted by ranking level, the node with the highest ranking level is placed on the compute device with the highest execution speed according to the sorted result, then the node with the second-highest ranking level, iterating until all compute nodes are placed, completing the greedy algorithm.
CN202010825175.1A 2020-08-17 2020-08-17 Model parallel method based on Tensorflow framework Pending CN112070223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010825175.1A CN112070223A (en) 2020-08-17 2020-08-17 Model parallel method based on Tensorflow framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010825175.1A CN112070223A (en) 2020-08-17 2020-08-17 Model parallel method based on Tensorflow framework

Publications (1)

Publication Number Publication Date
CN112070223A true CN112070223A (en) 2020-12-11

Family

ID=73662178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010825175.1A Pending CN112070223A (en) 2020-08-17 2020-08-17 Model parallel method based on Tensorflow framework

Country Status (1)

Country Link
CN (1) CN112070223A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809699A (en) * 2023-02-03 2023-03-17 之江实验室 Method and device for estimating minimum memory occupation amount required by neural network model inference

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819664A (en) * 2012-07-18 2012-12-12 中国人民解放军国防科学技术大学 Influence maximization parallel accelerating method based on graphic processing unit
CN109909657A (en) * 2019-04-02 2019-06-21 北京无线电测量研究所 A kind of automatic welding paths planning method of antenna array

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819664A (en) * 2012-07-18 2012-12-12 中国人民解放军国防科学技术大学 Influence maximization parallel accelerating method based on graphic processing unit
CN109909657A (en) * 2019-04-02 2019-06-21 北京无线电测量研究所 A kind of automatic welding paths planning method of antenna array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何马均: "Research on Efficient Distributed Parallel Algorithms for the Deep Learning Framework TensorFlow" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809699A (en) * 2023-02-03 2023-03-17 之江实验室 Method and device for estimating minimum memory occupation amount required by neural network model inference
CN115809699B (en) * 2023-02-03 2023-06-23 之江实验室 Method and device for estimating minimum memory occupation amount required by neural network model reasoning

Similar Documents

Publication Publication Date Title
CA3085897C (en) Evolutionary architectures for evolution of deep neural networks
Hasegawa et al. A novel chaotic search for quadratic assignment problems
CN112651509B (en) Method and device for determining quantum circuit
Huang et al. An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN108021983A (en) Neural framework search
Kampolis et al. A multilevel approach to single-and multiobjective aerodynamic optimization
CN105550746A (en) Training method and training device of machine learning model
Zomaya et al. A framework for reinforcement-based scheduling in parallel processor systems
EP4006788A1 (en) Quantum circuit determining method and apparatus, device, and storage medium
CN111966495B (en) Data processing method and device
Ma et al. A comprehensive improved salp swarm algorithm on redundant container deployment problem
De Lima et al. Efficient ridesharing dispatch using multi-agent reinforcement learning
Sun et al. A teaching-learning-based optimization with feedback for LR fuzzy flexible assembly job shop scheduling problem with batch splitting
CN112070223A (en) Model parallel method based on Tensorflow framework
CN116644804B (en) Distributed training system, neural network model training method, device and medium
CN113163004A (en) Industrial Internet edge task unloading decision method, device and storage medium
CN114386309B (en) Agent optimization problem scale unification method in cloud computing environment
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
CN113986816A (en) Reconfigurable computing chip
CN110175287B (en) Flink-based matrix decomposition implicit feedback recommendation method and system
Haralampiev Neural network approaches for a facility location problem
Skraba et al. Application of self-gonfiguring genetic algorithm for human resource management
Vuchener et al. Dynamic load-balancing with variable number of processors based on graph repartitioning
CN115001978B (en) Cloud tenant virtual network intelligent mapping method based on reinforcement learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201211)