CN115016938A - Calculation graph automatic partitioning method based on reinforcement learning - Google Patents
- Publication number
- CN115016938A (application number CN202210650630.8A)
- Authority
- CN
- China
- Prior art keywords
- graph
- core
- reinforcement learning
- action
- condition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a reinforcement-learning-based automatic computation graph partitioning method, which comprises the following steps: step 1, performing topological sorting on the computation graph to convert it into a linear table; step 2, modeling the computation graph to be partitioned and the many-core processor as a Markov decision process in reinforcement learning, extracting the subgraph partitioning and core-resource allocation of the current computation graph as the state, treating the adjustment of layer allocation between two adjacent cores as the action, and using the running time and storage of the computation graph on the many-core processor as the reward; and step 3, solving the Markov decision process with the REINFORCE algorithm and training the graph-partitioning algorithm model.
Description
Technical Field
The invention belongs to the field of resource allocation and reinforcement learning, and relates to a method for solving the problem of deep learning calculation graph partitioning by using a reinforcement learning algorithm.
Background
In recent years, deep learning has achieved dramatic success in fields such as image analysis, natural language processing, speech recognition, and video classification. However, deep learning relies on powerful computing capacity, and optimizing the deep learning system framework to reduce computing requirements plays an important role in deep learning applications. To meet the explosive computing demands of deep learning models, AI chips designed specifically for the AI domain have been widely adopted.
AI chips generally adopt a many-core architecture. An AI chip is dedicated to handling the large number of computational tasks in AI applications, while other non-computational tasks are still handled by the CPU. An AI chip integrates multiple cores and can purposefully accelerate certain algorithms or tasks. In the market, AI chips are defined more by their function than by their architecture. In recent years, AI chip products related to deep learning have appeared in succession, with solutions introduced by technology giants such as Google, Intel, and NVIDIA as well as startups such as Cambricon and Horizon Robotics. With the development and maturity of chip design technology, AI chip architectures have iterated rapidly alongside AI technology.
A deep learning compiler often splits a model implemented in different frameworks into multiple subtasks to be deployed on a many-core chip, and a pipeline structure is used to process the computing task so as to achieve optimal performance. In a pipeline structure, a large computing task can be divided into several parallel subtask sets, where each subtask set is processed in parallel by multiple cores of the many-core chip, and the mapping from subtask sets to core-resource sets is completed by the run-time system. When a deep learning model is run as a computing task, the task is divided into multiple subgraphs, i.e., the computation graph of the deep learning model is partitioned. In a computation graph, nodes represent the computation process; the advantage is that operations can be partitioned and even executed across multiple GPUs.
In a pipeline architecture, it is important to distribute computing tasks to the individual processors in an optimized manner; this problem is commonly referred to as load balancing. Load balancing is a fundamental problem in parallel computing, which maximizes parallel application performance by minimizing processor idle time and inter-processor communication time.
The scheduling of the pipeline structure can reduce the idle time of the processor, improve the program execution performance and increase the utilization rate of hardware resources. However, limitations of system resources such as processor core performance, on-chip storage, communication bandwidth, etc., can impact software pipelining performance.
Reinforcement learning has achieved notable results in resource scheduling. In 2017, Mirhoseini et al. proposed using reinforcement learning to optimize the placement of computation-graph nodes for TensorFlow models in a distributed system. That work uses a sequence-to-sequence (Seq2Seq) model consisting of an encoder and a decoder: the nodes of the computation graph are fed into the model in topological order, and the device assigned to each node is produced as output. Since then, many resource scheduling schemes based on reinforcement learning have been proposed. Addanki et al. use a reinforcement learning algorithm to schedule neural networks over distributed resources; their approach iteratively refines the resource allocation scheme rather than producing the node allocation of the computation graph all at once. Subsequent research likewise applies reinforcement learning to task scheduling in distributed systems, with the differences mostly lying in the specific deep learning model used. However, the above reinforcement-learning methods allocate resources according to the resource layout and are not suitable when the core-resource layout of the processor changes. For the resource scheduling problem of distributed stream processing systems, Luo proposed an algorithm based on deep reinforcement learning and multi-level graph partitioning that allocates resources according to the number of resources: graph coarsening reduces the complexity of the graph, the large-scale dataflow graph is reduced to a small-scale one, reinforcement learning is used for training, and the result is mapped back to the large-scale dataflow graph. However, after the dataflow graph is coarsened, the space in which reinforcement learning can search for the optimal solution is restricted, which limits the effectiveness of reinforcement learning to a certain extent.
The present invention mainly studies how to effectively partition the computation graph corresponding to a model and achieve load balance when training a deep learning model on a many-core chip architecture. To this end, an algorithm is designed that automatically partitions the deep learning computation graph and allocates a number of core resources to each subgraph, so that the running time of the deep learning model on the many-core chip is minimized; it is a method that allocates resources according to the number of resources.
Disclosure of Invention
The invention aims to provide a computation graph automatic partitioning method based on reinforcement learning, which can automatically partition a deep learning computation graph corresponding to a deep learning model into sub-graphs and allocate core resources to each sub-graph according to the number of resources so as to achieve the aim of shortening the running time of the deep learning model.
In order to achieve the above object, the method for automatically partitioning a computational graph based on reinforcement learning according to the present invention comprises the following steps:
step 1, performing topological sorting on a calculation graph to convert the calculation graph into a linear table;
step 2, modeling the computation graph to be partitioned and the many-core processor as a Markov decision process in reinforcement learning, extracting the subgraph partitioning and core-resource allocation of the current computation graph as the state, treating the adjustment of layer allocation between two adjacent cores as the action, and using the running time and storage of the computation graph on the many-core processor as the reward;
and step 3, solving the Markov decision process with the REINFORCE algorithm and training the graph-partitioning algorithm model.
The specific process of the step 1 is as follows:
step 11, performing topological sorting on the deep learning computation graph and converting it into a linear table structure, where the order of the elements in the linear table is consistent with the execution order of the nodes and the data elements in the linear table correspond to the layers of the deep learning model;
and step 12, recording for each layer the operation type and hyper-parameters as well as the data volume on each edge of the graph, thereby obtaining the total number of nodes in the computation graph and the operation count, required storage, and routing volume of each operation (an illustrative sketch of these two steps is given below).
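The following is a minimal sketch of steps 11 and 12 in Python; the patent does not publish code, so the Layer fields and function names here are assumptions made purely for illustration, though the recorded quantities (operation count, storage, routing volume) follow step 12.

```python
# Illustrative sketch only; names and fields are assumptions, not the patent's code.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str
    op_type: str          # operation type of the layer, e.g. "conv" or "fc"
    macs: int             # operation count of the layer
    storage: int          # storage required by the layer
    routing: int          # data volume sent along the layer's outgoing edges
    succs: list = field(default_factory=list)  # names of successor layers

def topological_sort(graph: dict[str, Layer]) -> list[Layer]:
    """Convert the computation graph into a linear table (Kahn's algorithm)."""
    indeg = {name: 0 for name in graph}
    for layer in graph.values():
        for s in layer.succs:
            indeg[s] += 1
    ready = deque(n for n, d in indeg.items() if d == 0)
    linear_table = []
    while ready:
        n = ready.popleft()
        linear_table.append(graph[n])
        for s in graph[n].succs:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return linear_table
```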
The step 2 specifically comprises the following steps:
step 21, extracting the subgraph division condition and the core resource allocation condition of the current computation graph as the states in reinforcement learning: the state is composed of two parts, the first part is a node division and resource allocation state of the computational graph, and the second part is an operand state of each subgraph;
step 22, treating the adjustment of layer allocation between two adjacent cores as the action in reinforcement learning, of which there are four types: merging all layers of two adjacent cores onto the next core for processing, merging all layers of two adjacent cores onto the previous core for processing, handing the last layer processed by the previous core (i.e., the latest of its layers in linear-table order) to the next core for processing, and handing the first layer processed by the next core (i.e., the earliest of its layers in linear-table order) to the previous core for processing;
and step 23, using the running time and storage of the computation graph on the many-core chip as the reward in reinforcement learning: the reward value is set to reward = a × max(T) + b, where T = {t_1, t_2, ..., t_k} is the running time of the computation graph G on the many-core processor M and S = {s_1, s_2, ..., s_k} denotes the data storage on the many-core processor M; if max(S) exceeds the limit, b is assigned a penalty value. The goal of training is to make the reward value as large as possible (an illustrative sketch of this reward is given below).
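As one hedged illustration of the reward in step 23, the sketch below assumes a negative coefficient a, so that a larger reward corresponds to a shorter bottleneck running time, and assumes b acts as a fixed penalty; the function name and default constants are not taken from the patent.

```python
# Minimal sketch of the reward, under stated assumptions about a and b.
def reward(run_times, storages, storage_limit, a=-1.0, penalty=-100.0):
    """reward = a * max(T) + b, with b set to a penalty if max(S) exceeds the limit.

    run_times : per-subgraph running times t_1..t_k on the many-core processor
    storages  : per-core-group storage amounts s_1..s_k
    """
    b = penalty if max(storages) > storage_limit else 0.0
    return a * max(run_times) + b
```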
The step 3 specifically includes the following steps:
step 31, initializing the whole graph-partitioning environment: importing the deep learning computation graph converted into the linear table structure, counting the total number of nodes, initializing the core-resource allocation, initializing the selectable actions according to the total number of core resources, resetting the variables that record the number of cores in each subgraph, the state of each core resource, the number of partitioned nodes, and the reward value to their initial states, and initializing the probabilities of all actions;
step 32, selecting an action a according to the action probability, and changing the current environment state s after executing the action a;
step 33, calculating the calculation amount, the storage amount and the routing amount of each sub-graph according to the current state, respectively comparing the three values in each sub-graph to obtain three values with the worst condition, and performing weighting operation to calculate the reward value r;
step 34, judging whether the reward value meets the preset requirement: if it does, updating the action probabilities and ending the current episode; if it does not, continuing to select actions and repeating steps 32-34;
step 35, inputting all s, a, and r of the episode into a neural network for training, and updating the probabilities of selecting actions;
and step 36, ending the training process after the set number of episodes is reached, and saving the neural network model (a minimal REINFORCE training-loop sketch is given below).
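To make the flow of steps 31 through 36 concrete, the following is a minimal REINFORCE sketch under stated assumptions: the environment object `env` and its reset/step interface are hypothetical, and a simple state-independent softmax policy stands in for the neural network of FIG. 3.

```python
# Minimal REINFORCE sketch of steps 31-36; env and the softmax policy are assumptions.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def train(env, n_actions, episodes=1000, lr=0.01, gamma=1.0):
    theta = np.zeros(n_actions)                    # action-preference parameters
    for _ in range(episodes):                      # step 36: run a set number of episodes
        env.reset()                                # step 31: initialize the environment
        actions, rewards, done = [], [], False
        while not done:
            a = np.random.choice(n_actions, p=softmax(theta))  # step 32: sample an action
            _state, r, done = env.step(a)          # step 33: execute it, observe reward r
            actions.append(a)
            rewards.append(r)
        # steps 34-35: update action probabilities with the REINFORCE rule,
        # using grad log pi(a) = onehot(a) - softmax(theta) and the returns G_t
        G, grad = 0.0, np.zeros_like(theta)
        for t in reversed(range(len(actions))):
            G = rewards[t] + gamma * G
            step_grad = -softmax(theta)
            step_grad[actions[t]] += 1.0
            grad += G * step_grad
        theta += lr * grad
    return theta
```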
Drawings
FIG. 1 is a flow chart of a computational graph automatic partitioning method based on reinforcement learning;
FIG. 2 is a schematic diagram of a many-core chip oriented computation graph automatic partitioning method structure based on deep reinforcement learning;
FIG. 3 is a diagram of the neural network architecture.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
The invention provides a reinforcement-learning-based automatic computation graph partitioning method comprising three parts: performing topological sorting on the computation graph, modeling the computation-graph partitioning problem as a Markov decision process in reinforcement learning, and training a neural network model with a REINFORCE-based automatic computation graph partitioning algorithm oriented to many-core processors. The specific implementation is as follows:
The many-core-processor-oriented computation graph partitioning problem is defined as follows. For a computation graph G = (O, E), O = {op_1, op_2, ..., op_m} is the set of operators on the graph and E is the set of edges. The computation graph G is divided into a set of subgraphs G' = {g_1, g_2, ..., g_k}, where k ≤ m, i.e., each subgraph g consists of one or more operators. The computation graph G is deployed and run on a many-core processor M, on which n core resources are integrated. Each subgraph is assigned a number of cores: C = {c_1, c_2, ..., c_k} is the set of core-resource counts, meaning that subgraph g_i is assigned c_i core resources, where c_1 + c_2 + ... + c_k = n, and the c_i core resources responsible for subgraph g_i form one core resource group. Each partitioning scheme P = (G', C) means that the computation graph G is partitioned into subgraphs according to G' and core resources are allocated to each subgraph according to the core-count allocation C. Under partitioning scheme P, S(P) = {s_1, s_2, ..., s_k} denotes the data storage on the many-core processor M, where s_i is the storage amount of a single core in the i-th core resource group, and T(P) = {t_1, t_2, ..., t_k} is the running time of the computation graph G on M, where t_i is the running time of subgraph g_i. The training goal is to find an allocation scheme P = (G', C) such that max(T) is minimized while max(S) does not exceed the storage upper limit of the many-core processor M.
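This objective can be expressed as a small sketch; the cost functions time_of and storage_of below are hypothetical placeholders for whatever profiling or modeling yields t_i and s_i for subgraph g_i on c_i cores, and are not part of the patent.

```python
# Sketch of the partitioning objective under stated assumptions.
from dataclasses import dataclass

@dataclass
class PartitionScheme:
    subgraphs: list[list[str]]   # G' = {g_1..g_k}: operator names of each subgraph
    cores: list[int]             # C  = {c_1..c_k}: cores assigned to each subgraph

def objective(p: PartitionScheme, n_cores: int, storage_limit: float,
              time_of, storage_of):
    """Check c_1 + ... + c_k == n and max(S) <= limit; return max(T) to be minimized."""
    assert sum(p.cores) == n_cores
    T = [time_of(g, c) for g, c in zip(p.subgraphs, p.cores)]
    S = [storage_of(g, c) for g, c in zip(p.subgraphs, p.cores)]
    if max(S) > storage_limit:
        return None               # infeasible: storage limit exceeded
    return max(T)                 # the pipeline-stage bottleneck time
```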
According to the analysis, the invention relates to a calculation graph automatic partitioning method based on reinforcement learning, which comprises the following steps:
step A, performing topological sorting on a calculation graph to convert the calculation graph into a linear table;
step B, modeling a computational graph to be divided and a many-core processor as a Markov decision process in reinforcement learning;
and step C, solving the Markov decision process with the REINFORCE algorithm and training the graph-partitioning algorithm model.
Further, the detailed description of step A is provided in the summary of the invention.
The step B specifically comprises the following steps:
and step B1, modeling the state as a subgraph division condition and a core resource allocation condition of the current computation graph. Because the main factor influencing load balance is the operand of the subgraph, the state is composed of two parts, the first part is the node division and resource allocation state of the computational graph, and the second part is the operand state of each subgraph.
The node-partitioning and resource-allocation state of the computation graph is recorded by a list arranged_core, which represents how each core resource on the many-core processor handles the computation graph. The processor contains n core resources core_1, core_2, ..., core_n, and the list arranged_core is written as [layers_1, layers_2, layers_3, layers_4, ..., layers_n], where 0 ≤ layers_i ≤ m and layers_i is the number of layers processed by core_i. If no layers_i is 0, the whole computation graph is divided into n subgraphs whose layer counts are layers_1, layers_2, layers_3, layers_4, ..., layers_n. When layers_i, layers_{i+1}, ..., layers_{i+p} are all 0 and layers_{i+p+1} ≠ 0, i.e., when p+1 consecutive zeros appear in arranged_core (0 ≤ p ≤ m-1), the i-th through (i+p+1)-th core resources form one core resource group, and the layers_{i+p+1} layers constitute one subgraph. It should be emphasized that, in actual operation, core_i, core_{i+1}, ..., core_{i+p} do not process 0 layers; this state setting is merely a modeling convenience for reinforcement learning.
The operation-count state of each subgraph is recorded by a list allocated_macs, which represents the amount of computation of the layers_i layers in the arranged_core list. allocated_macs is set up in correspondence with arranged_core and is written as [macs_1, macs_2, macs_3, macs_4, ..., macs_n], where macs_i is the total operation count of the layers_i layers. The meaning of macs_i = 0 is defined analogously to layers_i.
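A hedged sketch of building these two state lists is given below. The list names follow the (inconsistently translated) names in the text, and the convention of placing a group's zeros before its single non-zero entry is an assumption drawn from the description above.

```python
# Sketch of the state lists; names and the zero-placement convention are assumptions.
def build_state(layer_macs, layers_per_group, cores_per_group):
    """layer_macs: MACs of each layer in linear-table order.
    layers_per_group / cores_per_group: layer count and core count of each subgraph."""
    arranged_core, allocated_macs = [], []
    idx = 0
    for n_layers, n_cores in zip(layers_per_group, cores_per_group):
        # a group's layer count and total MACs are attached to one core entry;
        # the remaining cores of the group are marked with 0 (modeling convention)
        arranged_core += [0] * (n_cores - 1) + [n_layers]
        allocated_macs += [0] * (n_cores - 1) + [sum(layer_macs[idx: idx + n_layers])]
        idx += n_layers
    return arranged_core, allocated_macs
```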
And step B2, under the current dividing state, adjusting the layer number distribution between two adjacent cores to be modeled as one action.
There are four types of actions for adjusting the number of layers of two adjacent cores. For adjacent cores core_i and core_{i+1}, the four types of actions that can be taken are: (a) merging all layers on the two adjacent cores core_i, core_{i+1} onto the next core core_{i+1} for processing, i.e., core_i is merged into core_{i+1}'s core resource group and the new core resource group takes over the layers_i layers; (b) merging all layers on the two adjacent cores core_i, core_{i+1} onto the previous core core_i for processing, i.e., core_i's core resource group processes all layers of layers_i and layers_{i+1}, and core_{i+1} joins core_{i+2}'s core resource group; (c) the previous core resource core_i hands over the last of the layers_i layers it processes (i.e., the latest of them in linear-table order) to the next core resource core_{i+1}; (d) the next core resource core_{i+1} hands over the first of the layers_{i+1} layers it processes (i.e., the earliest of them in linear-table order) to the previous core resource core_i.
The action space is the set of all selectable actions. Actions are taken on two adjacent cores; for n cores there are n-1 pairs of adjacent cores, so each type of action has n-1 choices. There are four types of actions, so the action space size is 4 × (n-1). Table 1 shows the definitions of these four types of actions; the first two types are merge actions and the last two types are split actions. To facilitate code writing, action numbering starts at 0.
TABLE 1 Definitions of four classes of actions
When an action is executed, it may turn out to be invalid for the current environment state, in which case the environment state is left unchanged. For example, if the two cores are already in the same core resource group, adjusting the layer allocation between them has no effect, and the action does not change the state. An illustrative sketch of applying the split actions is given below.
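The sketch covers only the split actions (c) and (d) and simplifies by treating each list entry as the layer count of one core group, ignoring the zero-entry grouping convention and the allocated_macs bookkeeping; it is an assumption-laden illustration, not the patent's implementation.

```python
# Hedged sketch of split actions on a simplified arranged_core list.
def move_last_layer_forward(arranged_core, i):
    """Action (c): group i hands its last layer to the next group.
    Returns the state unchanged if the action is invalid (e.g. too few layers)."""
    state = list(arranged_core)
    if i + 1 >= len(state) or state[i] <= 1:
        return state                      # invalid action: leave the state unchanged
    state[i] -= 1
    state[i + 1] += 1
    return state

def move_first_layer_back(arranged_core, i):
    """Action (d): group i+1 hands its first layer back to group i."""
    state = list(arranged_core)
    if i + 1 >= len(state) or state[i + 1] <= 1:
        return state
    state[i + 1] -= 1
    state[i] += 1
    return state
```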
Step B3 models the running time and storage of the computation graph on the many-core chip as the reward. The goal here is to find an allocation scheme P = (G', C) such that max(T) is minimized while max(S) does not exceed the storage upper limit of the many-core processor M. The reward value is set to reward = a × max(T) + b, and b is assigned a penalty value if max(S) exceeds the limit. The goal of training is to make the reward value as large as possible.
Further, the detailed steps of training the graph-partitioning algorithm model in step C are as follows:
step C1, initializing the whole graph-partitioning environment: importing the deep learning computation graph converted into the linear table structure, counting the total number of nodes, initializing the core-resource allocation, initializing the selectable actions according to the total number of core resources, resetting the variables that record the number of cores in each subgraph, the state of each core resource, the number of partitioned nodes, and the reward value to their initial states, initializing the probabilities of all actions, recording the total number of cores as n, and setting m = n - 1;
step C2, selecting an action a according to the action probabilities and executing it, which changes the current environment state s;
step C3, calculating the computation amount, storage amount, and routing amount of each subgraph according to the current state s, taking the worst of each of the three quantities across the subgraphs, and combining them by a weighting operation to obtain the reward value r;
step C4, judging whether r meets the preset requirement: if it does, updating the action probabilities and ending the current episode; if it does not, continuing to select actions and repeating steps C2-C4;
step C5, inputting all s, a, and r of the episode into the neural network for training, and updating the probabilities of selecting actions (FIG. 3 shows the structure of the neural network);
and step C6, ending the training process after the set number of episodes is reached, and saving the neural network model (a hedged sketch of a stand-in policy network and its REINFORCE update is given below).
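The architecture of the network in FIG. 3 is not reproduced in the text, so the following stand-in uses a small multilayer perceptron over the two state lists; the layer sizes, the use of PyTorch, and the wiring shown in the comments are assumptions, not the patent's published design.

```python
# Hypothetical policy network and REINFORCE update (steps C5-C6); layer sizes assumed.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, n_cores: int):
        super().__init__()
        # state = arranged_core (n values) concatenated with allocated_macs (n values)
        # output = probabilities over the 4 * (n - 1) actions
        self.net = nn.Sequential(
            nn.Linear(2 * n_cores, 128),
            nn.ReLU(),
            nn.Linear(128, 4 * (n_cores - 1)),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)

def update(policy, optimizer, states, actions, rewards, gamma=1.0):
    """One-episode REINFORCE update: loss = -sum_t G_t * log pi(a_t | s_t)."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    loss = 0.0
    for s, a, G in zip(states, actions, returns):
        probs = policy(s)                 # s: 1-D float tensor of length 2 * n_cores
        loss = loss - G * torch.log(probs[a])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# example wiring (hypothetical):
#   policy = PolicyNet(n_cores=16)
#   optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
```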
The above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same, and any modifications made on the basis of the technical solutions according to the technical ideas presented by the present invention are within the scope of the present invention.
Claims (4)
1. A computational graph automatic partitioning method based on reinforcement learning is characterized by comprising the following steps:
step 1, performing topological sorting on a calculation graph to convert the calculation graph into a linear table;
step 2, modeling a computation graph to be divided and a many-core processor as a Markov decision process in reinforcement learning, extracting a sub-graph division condition and a core resource distribution condition of the current computation graph as states in the reinforcement learning, adjusting layer number distribution between two adjacent cores as actions in the reinforcement learning, and using the running time and the storage condition of the computation graph on the many-core processor as rewards in the reinforcement learning;
and step 3, solving the Markov decision process with the REINFORCE algorithm and training the graph-partitioning algorithm model.
2. The method for automatically partitioning a computational graph based on reinforcement learning according to claim 1, wherein the specific process of the step 1 is as follows:
step 11, performing topological sorting on the deep learning computation graph and converting it into a linear table structure, where the order of the elements in the linear table is consistent with the execution order of the nodes and the data elements in the linear table correspond to the layers of the deep learning model;
and step 12, recording for each layer the operation type and hyper-parameters as well as the data volume on each edge of the graph, thereby obtaining the total number of nodes in the computation graph and the operation count, required storage, and routing volume of each operation.
3. The method for automatically partitioning a computational graph based on reinforcement learning according to claim 1, wherein the specific process of the step 2 is:
step 21, extracting the subgraph division condition and the core resource allocation condition of the current computation graph as the states in reinforcement learning: the state is composed of two parts, the first part is a node division and resource allocation state of the computational graph, and the second part is an operand state of each subgraph;
step 22, treating the adjustment of layer allocation between two adjacent cores as the action in reinforcement learning, of which there are four types: merging all layers of two adjacent cores onto the next core for processing, merging all layers of two adjacent cores onto the previous core for processing, handing the last layer processed by the previous core (i.e., the latest of its layers in linear-table order) to the next core for processing, and handing the first layer processed by the next core (i.e., the earliest of its layers in linear-table order) to the previous core for processing;
and step 23, using the running time and storage of the computation graph on the many-core chip as the reward in reinforcement learning: the reward value is set to reward = a × max(T) + b, where T = {t_1, t_2, ..., t_k} is the running time of the computation graph G on the many-core processor M and S = {s_1, s_2, ..., s_k} denotes the data storage on the many-core processor M; if max(S) exceeds the limit, b is assigned a penalty value, and the goal of training is to make the reward value as large as possible.
4. The method for automatically partitioning a computational graph based on reinforcement learning according to claim 1, wherein the specific process of the step 3 is:
step 31, initializing the whole graph-partitioning environment: importing the deep learning computation graph converted into the linear table structure, counting the total number of nodes, initializing the core-resource allocation, initializing the selectable actions according to the total number of core resources, resetting the variables that record the number of cores in each subgraph, the state of each core resource, the number of partitioned nodes, and the reward value to their initial states, and initializing the probabilities of all actions;
step 32, selecting an action a according to the action probability, and changing the current environment state s after executing the action a;
step 33, calculating the calculation amount, the storage amount and the routing amount of each sub-graph according to the current state, respectively comparing the three values in each sub-graph to obtain three values with the worst condition, and performing weighting operation to calculate the reward value r;
step 34, judging whether the reward value meets the preset requirement: if it does, updating the action probabilities and ending the current episode; if it does not, continuing to select actions and repeating steps 32-34;
step 35, inputting all s, a, and r of the episode into a neural network for training, and updating the probabilities of selecting actions;
and step 36, ending the training process after the set number of episodes is reached, and saving the neural network model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210650630.8A CN115016938A (en) | 2022-06-09 | 2022-06-09 | Calculation graph automatic partitioning method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210650630.8A CN115016938A (en) | 2022-06-09 | 2022-06-09 | Calculation graph automatic partitioning method based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115016938A true CN115016938A (en) | 2022-09-06 |
Family
ID=83073421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210650630.8A Pending CN115016938A (en) | 2022-06-09 | 2022-06-09 | Calculation graph automatic partitioning method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115016938A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024065525A1 (en) * | 2022-09-29 | 2024-04-04 | Intel Corporation | Method and apparatus for optimizing deep learning computation graph |
CN115374914A (en) * | 2022-10-24 | 2022-11-22 | 北京白海科技有限公司 | Distributed training method, parallel deep learning framework and electronic equipment |
CN115374914B (en) * | 2022-10-24 | 2023-04-07 | 北京白海科技有限公司 | Distributed training method, parallel deep learning framework and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||