CN115421897A - Core particle-oriented deep neural network pipeline parallel scheduling method and device - Google Patents

Core particle-oriented deep neural network pipeline parallel scheduling method and device

Info

Publication number: CN115421897A
Application number: CN202211381782.9A
Authority: CN (China)
Prior art keywords: neural network, pipeline, deep neural, calculation, core
Legal status: Granted; Active
Other versions: CN115421897B (Chinese)
Inventors: 潘秋红, 许慧卿, 毛旷, 汤昭荣, 杨弢, 杨佳宁, 叶茂伟, 王颖
Original and current assignee: Zhejiang Lab
Application filed by Zhejiang Lab
Priority to CN202211381782.9A
Publication of CN115421897A
Application granted; publication of CN115421897B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a core particle-oriented deep neural network pipeline parallel scheduling method and device. The method comprises the following steps: acquiring a deep neural network and a core particle topological structure; constructing a deep neural network computation graph according to the deep neural network and reducing the computation graph; dividing pipeline groups according to the reduced computation graph to obtain a pipeline group graph; dividing pipeline parallel regions according to the pipeline group graph and the core particle topological structure; determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel regions and the core particle topological structure; and deploying the deep neural network on the core particle according to the scheduling strategy and executing pipeline parallel inference of the deep neural network.

Description

Core particle-oriented deep neural network pipeline parallel scheduling method and device
Technical Field
The invention belongs to the technical field at the intersection of deep learning, parallel computing and core grain (chiplet) architecture, and particularly relates to a core grain-oriented deep neural network pipeline parallel scheduling method and device.
Background
Deep learning has become key to solving complex real-world problems. The use of deep neural networks has grown rapidly, and researchers and practitioners apply these models to a wide range of applications, including image recognition, object detection, language translation, audio synthesis, and autonomous driving. As deep neural networks are more widely developed and used, their scale keeps increasing to improve efficiency; today the most advanced deep neural networks have tens to hundreds of layers, require billions of operations, and need hundreds of megabytes to store activation data and weights. As such networks become larger and deeper, the accompanying computational and storage requirements pose great challenges to the computational and storage performance of neural network accelerators and can cause commonly used parallelization methods to break down.
In recent years, with the development of core particle (chiplet) technology, neural network accelerators are being upgraded from single chips to core particles. A core particle packages small chiplet units into a module chip that fulfils a specific function; compared with an ordinary chip, the chiplet approach offers higher flexibility and performance at lower cost, which is very favorable for developing domain-specific architectures for neural network accelerators.
Therefore, for a highly integrated and more complex structure such as a core grain, parallel scheduling of the deep neural network on the core grain is required in order to fully utilize core grain resources and support inference and training of very large-scale deep neural networks. An effective parallel approach today is pipeline parallelism: the deep neural network is partitioned, and different regions of its computation graph are assigned to different sub-core-grain modules on the core grain for execution. Pipeline parallelism can effectively alleviate under-utilization of resources, but it also raises the challenge of allocating resources effectively. Therefore, how to partition the deep neural network and how to map its different regions onto the core grain are crucial for making full use of core grain resources and keeping deep neural network inference latency as low as possible. However, existing pipeline parallel methods usually target general-purpose graphics processor clusters and are not suited to core grain hardware.
Disclosure of Invention
The embodiment of the application aims to provide a core particle-oriented deep neural network pipeline parallel scheduling method and device, so as to solve the technical problem that the prior art can hardly perform efficient pipeline parallel scheduling of a deep neural network on a core particle. The method partitions the deep neural network reasonably, fully considers the topological structure of the core particle when mapping deep neural network regions onto the core particle, makes full use of core particle resources, reduces data transmission among deep neural network regions, and improves the speed of the deep neural network.
According to a first aspect of the embodiments of the present application, a core-grain-oriented deep neural network pipeline parallel scheduling method is provided, including:
step (1): acquiring a deep neural network and a core particle topological structure;
step (2): constructing a deep neural network calculation graph according to the deep neural network and reducing the deep neural network calculation graph;
step (3): dividing the pipeline group according to the reduced deep neural network computation graph to obtain a pipeline group graph;
step (4): dividing a pipeline parallel area according to the pipeline group diagram and the core particle topological structure;
step (5): determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel region and the core particle topological structure;
step (6): deploying the deep neural network on the core particles according to the deep neural network pipeline parallel scheduling strategy, and executing deep neural network pipeline parallel reasoning.
Further, the step (2) includes:
(2.1) performing control flow analysis on the deep neural network to create a deep neural network computation graph G, wherein the deep neural network computation graph G consists of a node set V and a directed edge set E, each node v in G represents an operator, and each directed edge e represents a data dependency relationship between nodes; each node v in G comprises 3 attributes [in-degree, out-degree, computation amount], wherein the in-degree represents the number of operators in the computation graph G whose output serves as input to the current operator, the out-degree represents the number of operators in the computation graph G that take the current operator's output as input, and the computation amount represents the number of calculations required by the current operator; each edge e in G comprises 3 attributes [start point, end point, weight], wherein the weight represents the data amount of the elements output by the operator at the edge's start point;
(2.2) traversing the nodes in the computation graph G: for each node v_i whose in-degree and out-degree are both 1, examine its successor node v_j; if the in-degree and out-degree of v_j are also both 1, merge v_i and v_j into a node w_i, wherein the computation amount of w_i is the sum of the computation amounts of v_i and v_j;
(2.3) recursively traversing the calculation graph G until no node which can be merged exists, and obtaining a reduced neural network calculation graph G'.
Further, the step (3) includes:
(3.1) dividing pipeline groups p: operators in the reduced neural network computation graph G' that are at the same depth and have no data dependency relationship are divided into the same pipeline group p, wherein the computation amount of p is the cumulative sum of the computation amounts of all nodes in the pipeline group;
(3.2) adding the starting point and the end point of the edge in the G' into the attribute of the pipeline group;
(3.3) reducing the edges in G' to obtain a pipeline group graph S, wherein the S consists of a pipeline group set P and a directed edge set ES, each node P in the S represents a pipeline group, each pipeline group comprises a node set VS, and each directed edge ES represents the data dependency among the pipeline groups and the data quantity required to be transmitted.
Further, the step (3.3) includes:
(3.3.1) for multiple edges whose start point is the same node and whose end points are different nodes within the same pipeline group, keep only the first edge traversed;
(3.3.2) recursively executing (3.3.1) until there are no edges from the same node to different nodes in the same pipeline group in G';
(3.3.3) merging the edges between pipeline groups, wherein the start point and end point of a merged edge are the names of the respective pipeline groups, and its weight is the cumulative sum of the weights of all the edges before merging.
Further, the step (4) includes:
(4.1) acquiring a pipeline group diagram S and the number M of sub-core particles in the core particle topological structure;
(4.2) calculating the expected overhead t0 of S on the M sub-core particles according to the pipeline group graph S and the number M of sub-core particles;
(4.3) taking the edge es_i with the smallest weight in S;
(4.4) using es_i, dividing S into two pipeline group graphs S1 and S2, and dividing the M sub-core particles into two parts M1 and M2 according to the computation amounts of S1 and S2; if the computation amount of either of the two subgraphs divided by es_i is less than 1/M of the total computation amount of S, returning to step (4.3) and taking, in ascending order of weight, the edge in S with the smallest weight after es_i;
(4.5) calculating, by the same calculation method as in step (4.2), the expected overhead t1 of S1 on the M1 sub-core particles and the expected overhead t2 of S2 on the M2 sub-core particles;
(4.6) calculating the expected data transmission overhead t3 from S1 to S2;
(4.7) comparing t0 with t1 + t2 + t3;
(4.8) if t0 > t1 + t2 + t3 and the number of currently divided regions is less than the total number of sub-core particles in the core particle, recursively invoking steps (4.1)-(4.7) on the pipeline group graph S1 with sub-core particle number M1 and on the pipeline group graph S2 with sub-core particle number M2; otherwise executing step (4.9);
(4.9) if t0 <= t1 + t2 + t3, or the number of currently divided regions is not less than the total number of sub-core particles in the core particle, obtaining the division result of the pipeline parallel regions, wherein the division result comprises the pipeline parallel region division of the deep neural network, the data dependency relationships among different regions, and the number of sub-core particles each region needs to occupy.
Further, the step (4.2) comprises:
(4.2.1) obtaining the data volume d of S, wherein d is the sum of the weights of all edges in S;
(4.2.2) obtaining the number N of embedded neural network processors contained in one sub-core particle and the size w × w of the systolic array on each embedded neural network processor;
(4.2.3) calculating the data transmission overhead C of S according to the number M of sub-core particles, the data volume d and the number N of embedded neural network processors [the formula is given as an image in the original publication];
(4.2.4) calculating the computation overhead d_v of a single operator v in S: the available sub-core particles are divided according to the computation amount of each operator in each pipeline group, and the computation overhead of each operator v in S is calculated separately; let m × k × n be the computation amount that operator v distributes onto each embedded neural network processor; if m > w or n > w, the computation overhead d_v of the single operator v is obtained by adding the ratio of the distributed computation amount m × k × n to the systolic array size w × w and 2 times the one-dimensional systolic beat count w - 1; if m < w and n < w, the computation overhead d_v of the single operator v is obtained by adding the computation amounts of the individual dimensions and subtracting 1;
(4.2.5) calculating data calculation overhead D of S: for each pipeline group p in S, the maximum value of the calculation overhead of each operator is the calculation overhead of the pipeline group, and the sum of the calculation overheads of all the pipeline groups in S is the data calculation overhead D of S;
(4.2.6) the data transmission cost C of S is added to the data computation cost D of S to obtain the expected cost of computing pipeline set map S on the set of sub-core grains.
Further, the step (5) is:
and deploying the pipeline parallel areas on the core grain topological structure according to the data dependency relationship among the pipeline parallel areas, and folding the current row to the next row to continue distribution according to the reverse direction when the residual core grain number of the current row cannot meet the distribution requirement.
According to a second aspect of the embodiments of the present application, there is provided a core-grain-oriented deep neural network pipeline parallel scheduling apparatus, including:
the acquisition module is used for acquiring a deep neural network and a core particle topological structure;
the construction module is used for constructing a deep neural network calculation graph and reducing the deep neural network calculation graph according to the deep neural network;
the first dividing module is used for dividing the pipeline group according to the reduced deep neural network calculation diagram to obtain a pipeline group diagram;
the second division module is used for dividing the pipeline parallel area according to the pipeline group diagram and the core grain topological structure;
the determining module is used for determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel region and the core grain topological structure;
and the deployment module is used for deploying the deep neural network on the core particles according to the deep neural network pipeline parallel scheduling strategy and executing the deep neural network pipeline parallel reasoning.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
(1) The core particle-oriented deep neural network pipeline parallel scheduling method can automatically deploy the deep neural network to core particle hardware, reasonably divide the deep neural network through technologies such as pipeline parallel and the like, fully consider the topological structure of the core particles when mapping the deep neural network region to the core particles, fully utilize core particle resources, reduce data transmission among deep neural network regions and improve the speed of the deep neural network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a core-grain-oriented deep neural network pipeline parallel scheduling method in accordance with an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating a core particle topology, according to an exemplary embodiment.
FIG. 3 is a schematic diagram of a deep neural network computational graph shown in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a process of generating a pipeline group diagram for a deep neural network computation graph according to an exemplary embodiment, where (a) is a reduced deep neural network computation graph, (b) is a result of performing pipeline group partitioning on (a), (c) is a result of performing reduction on edges in (b), and (d) is a result of performing merging on edges between pipeline groups in (c).
FIG. 5 is a schematic diagram illustrating a pipeline parallel section in accordance with an illustrative embodiment.
FIG. 6 is a schematic diagram illustrating the location of pipeline parallel regions on a core die placement in accordance with an exemplary embodiment.
Fig. 7 is a block diagram illustrating a core-grain-oriented deep neural network pipeline parallel scheduling apparatus according to an exemplary embodiment.
FIG. 8 is a schematic diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to a determination," depending on the context.
Fig. 1 is a flowchart illustrating a core-kernel-oriented deep neural network pipeline parallel scheduling method according to an exemplary embodiment, where the method is applied to a core kernel and may include the following steps:
step (1): acquiring a deep neural network and a core particle topological structure;
step (2): constructing a deep neural network calculation graph according to the deep neural network and reducing the deep neural network calculation graph;
step (3): dividing the pipeline group according to the reduced deep neural network computation graph to obtain a pipeline group graph;
step (4): dividing a pipeline parallel area according to the pipeline group diagram and the core particle topological structure;
step (5): determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel region and the core particle topological structure;
step (6): deploying the deep neural network on the core particles according to the pipeline parallel scheduling strategy of the deep neural network, and executing pipeline parallel reasoning of the deep neural network.
According to the embodiment, the deep neural network can be automatically deployed on the core grain hardware, the deep neural network is reasonably divided through technologies such as pipeline parallel and the like, the topological structure of the core grains is fully considered when the deep neural network area is mapped to the core grains, core grain resources are fully utilized, data transmission among deep neural network areas is reduced, and the speed of the deep neural network is improved.
The core-grain-oriented deep neural network pipeline parallel scheduling method disclosed by the application is applicable to all neural networks deployed on a core grain topological structure; the core grain topology shown in fig. 2 is taken as an example for explanation in the application. The following description based on fig. 2 is not intended to limit the scope of the present invention.
In the specific implementation of the step (1), a deep neural network and a core particle topological structure are obtained;
Specifically, the obtained deep neural network is a file describing network structure information, and may be a model file generated by a deep learning framework or a network structure configuration file. The obtained core particle topological structure comprises the array size M1 × N1 of the chiplet units in the core particle and the NPU (Neural-network Processing Unit) array size M2 × N2 within each chiplet unit.
Specifically, a deep neural network is obtained first, and the following neural network configuration file is taken as an example, and the configuration file describes operators contained in the deep neural network, the type and size of each operator, and data dependency among the operators:
[Neural network configuration file listing, shown as an image in the original publication.]
then, a core particle topology is obtained, as shown in fig. 2, where NOP and NOC are inter-chiplet-unit routing components and intra-chiplet-unit routing components, respectively, the obtained chiplet-unit array size is 2*2, and the NPU array size in the chiplet unit is 4*4, and if numbering is started from 0, the core particle coordinate (1,1,0,2) represents the NPU in the first row and the third column on the chiplet unit in the second row and the second column in the chiplet-unit array. This step is the basis for pipeline parallelism by mapping deep neural networks onto core grains.
In the specific implementation of the step (2), according to the deep neural network, constructing a deep neural network computation graph and reducing the deep neural network computation graph;
specifically, the step (2) includes:
(2.1) performing control flow analysis on the deep neural network to create a deep neural network computation graph G, wherein the deep neural network computation graph G consists of a node set V and a directed edge set E, each node v in G represents an operator, and each directed edge e represents a data dependency relationship between nodes; each node v in G contains 3 attributes [in-degree, out-degree, computation amount], where the in-degree represents the number of operators in the computation graph G whose output serves as input to the current operator (namely, the number of edges in G whose end point is v), the out-degree represents the number of operators in the computation graph G that take the current operator's output as input (namely, the number of edges in G whose start point is v), and the computation amount represents the number of calculations required by the current operator; each edge e in G contains 3 attributes [start point, end point, weight], where the weight represents the data amount of the elements output by the operator at the edge's start point;
Specifically, for the deep neural network provided by the above neural network configuration file, the deep neural network computation graph G shown in fig. 3 is constructed. The node set V in G = {v1, v2, …, v29}, each node corresponds one-to-one to an operator in the deep neural network, and each node v records the type and size of its operator and stores the 3 attributes [in-degree, out-degree, computation amount]. For example, node v2 represents Conv(64, 2 × 2, 256), which means that the operator is a convolution operator with an input dimension of 64, a convolution kernel size of 2 × 2, and an output dimension of 256. The only predecessor of v2 is v1, and its successors are [v3, v4, v5, v6], so the in-degree of v2 is 1 and the out-degree is 4. For this deep neural network, the size of the initial input tensor determines the size of the operator output at each layer. Taking the input (32, 64, 64) as an example, the output tensor size of v2 is (256, 56, 56), and the computation amount of v2, obtained from the convolution operator computation amount formula, is 2 × (2 × 2 × 64 + 1) × 56 × 56 × 256 = 412647424.
The convolution operator computation amount formula (rendered as an image in the original publication) is consistent with computation amount = 2 × H × W × (K_h × K_w × C_in + 1) × C_out, where H and W are the sizes of the output tensor, K_h and K_w are the convolution kernel sizes, and C_in and C_out are the numbers of input and output channels, respectively.
For the edges in the deep neural network computation graph G, taking the edge labeled with attributes in fig. 3 as an example, the edge indicates that the output of node v2 is the input of node v3; the output tensor size of v2 is (256, 56, 56), so the start point of the edge is v2, the end point is v3, and the weight is 256 × 56 × 56 = 802816.
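As a quick cross-check of the numbers above (an illustrative sketch, not code from the patent; the factor of 2 in the formula is inferred from the stated value 412647424):

```python
def conv_computation_amount(h: int, w: int, k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """Computation amount of a convolution operator: 2 * H * W * (Kh * Kw * Cin + 1) * Cout."""
    return 2 * h * w * (k_h * k_w * c_in + 1) * c_out

def edge_weight(output_shape) -> int:
    """Edge weight = number of elements in the producing operator's output tensor."""
    n = 1
    for dim in output_shape:
        n *= dim
    return n

# Node v2: Conv(64, 2 x 2, 256) with output tensor (256, 56, 56).
print(conv_computation_amount(h=56, w=56, k_h=2, k_w=2, c_in=64, c_out=256))  # 412647424
print(edge_weight((256, 56, 56)))                                             # 802816
```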
(2.2) traversing the nodes in the computation graph G: for each node v_i whose in-degree and out-degree are both 1, examine its successor node v_j; if the in-degree and out-degree of v_j are also both 1, merge v_i and v_j into a node w_i, wherein the computation amount of w_i is the sum of the computation amounts of v_i and v_j;
Specifically, the traversal order required in this step can be combined with the recursive traversal in (2.3), for example a pre-order or post-order traversal. In this embodiment, a pre-order traversal is adopted. Taking the computation graph G shown in fig. 3 as an example, the pre-order traversal reaches v3, the first node whose in-degree and out-degree are both 1; its successor node v7 is obtained, and the in-degree and out-degree of v7 are found to also both be 1, so v3 and v7 can be merged into one node, whose computation amount is the computation amount of v3 plus the computation amount of v7 = 7310752.
(2.3) recursively traversing the computation graph G until no node that can be merged remains, obtaining the reduced neural network computation graph G'.
Specifically, recursively traversing the computation graph G shown in fig. 3 yields the reduced computation graph G' shown in (a) in fig. 4. This step reduces the computation graph by merging operators whose in-degree and out-degree are both 1, so that a group of operators in a completely serial relationship ends up in the same pipeline parallel region after division, i.e., these operators are mapped onto the same sub-core particle. The output of each operator in such a completely serial group is the input of its successor and has no data dependency on any other operator, so its output is transmitted within the current sub-core particle region instead of being transmitted to the sub-core particles corresponding to other pipeline parallel regions, which reduces data transmission overhead.
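A minimal sketch of this reduction, assuming the graph is held as a dictionary of computation amounts plus a set of directed edges (the representation and the individual computation amounts of v3 and v7 are illustrative assumptions; only their sum 7310752 appears above):

```python
def reduce_graph(nodes: dict, edges: set):
    """Merge chains of nodes whose in-degree and out-degree are both 1.

    nodes: {name: computation_amount}; edges: {(src, dst)}. Edge weights are
    omitted for brevity. Returns the reduced (nodes, edges).
    """
    changed = True
    while changed:  # recurse until no mergeable node remains
        changed = False
        indeg = {v: 0 for v in nodes}
        outdeg = {v: 0 for v in nodes}
        succ = {}
        for s, d in edges:
            outdeg[s] += 1
            indeg[d] += 1
            succ.setdefault(s, []).append(d)
        for v in list(nodes):
            if indeg[v] == 1 and outdeg[v] == 1:
                (w,) = succ[v]
                if indeg[w] == 1 and outdeg[w] == 1:
                    # Merge v with its successor w: sum their computation amounts,
                    # redirect w's outgoing edges to start from v, and drop w.
                    nodes[v] += nodes.pop(w)
                    edges.discard((v, w))
                    edges = {(v if s == w else s, d) for s, d in edges}
                    changed = True
                    break
    return nodes, edges

# v3 -> v7 is a completely serial chain and gets merged into one node.
nodes = {"v2": 412647424, "v3": 3655376, "v7": 3655376, "v8": 100}
edges = {("v2", "v3"), ("v3", "v7"), ("v7", "v8")}
print(reduce_graph(nodes, edges))
```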
In the specific implementation of the step (3), dividing the pipeline group according to the reduced deep neural network computation graph to obtain a pipeline group graph;
specifically, the step (3) may include:
(3.1) dividing pipeline groups p: operators in the reduced neural network computation graph G' that are at the same depth and have no data dependency relationship are divided into the same pipeline group p, wherein the computation amount of p is the cumulative sum of the computation amounts of all nodes in the pipeline group;
Specifically, taking the reduced computation graph G' shown in (a) in fig. 4 as an example, G' can be divided into 7 pipeline groups. Traversing the reduced computation graph G' by depth, depths 1 and 2 each contain only one node, so pipeline group p1 contains only v1 and pipeline group p2 contains only v2. The four nodes [w1, w2, w3, w4] at depth 3 in G' are at the same depth and have no data dependency relationship, so they are divided into the same pipeline group p3, and the computation amount of p3 is the sum of the computation amounts of the four nodes: 14775712 × 4 = 59102848. Similarly, pipeline group p6 contains the four nodes [w5, w6, w7, w8], and its computation amount is 54871680.
(3.2) adding the starting point and the end point of the edge in the G' into the attribute of the pipeline group;
specifically, taking the reduced edge e1[ v1, v2, 207936] connecting v1- > v2 in the calculation graph G' shown in (b) in fig. 4 as an example, the attributes of the pipeline group to which v1 and v2 belong are added: e1[ p1 (v 1), p2 (v 2), 207936].
(3.3) reducing the edges in G' to obtain a pipeline group graph S, wherein the S consists of a pipeline group set P and a directed edge set ES, each node P in the S represents a pipeline group, each pipeline group comprises a node set VS, and each directed edge ES represents the data dependency among the pipeline groups and the data quantity to be transmitted.
Specifically, in the step (3.3), the reducing the edge in G' may include the following sub-steps:
(3.3.1) for edges whose start point is the same node and whose end points are different nodes within the same pipeline group, only the first edge traversed is retained;
Specifically, in the reduced computation graph G' shown in fig. 4 (b), the four edges connecting v2 with the four nodes of pipeline group p3 all start from node v2 in pipeline group p2, and their end points are different nodes that all belong to pipeline group p3; these four edges can therefore be reduced, and only the first traversed edge e2[p2(v2), p3(w1), 802816] is retained.
(3.3.2) recursively executing (3.3.1) until there are no edges from the same node to different nodes in the same pipeline group in G';
specifically, recursively processing the reduced computation graph shown in fig. 4 (b) reduces the edges connecting v2 and four nodes in the pipeline group p3, retains only e2[ p2 (v 2), p3 (w 1), 802816], reduces the edges connecting v16 and four nodes in the pipeline group p6, retains only e2[ p5 (v 16), p6 (w 5), 746496]. As a result, as shown in fig. 4 (c), there are no edges from the same node to different nodes in the same pipeline group in fig. 4 (c).
(3.3.3) merging the edges between pipeline groups, wherein the start point and end point of a merged edge are the names of the respective pipeline groups, and its weight is the cumulative sum of the weights of all the edges before merging;
Specifically, taking the computation graph G' shown in fig. 4 (c) as an example, 4 edges exist between the pipeline groups p3 and p4 and need to be merged; the start point of the merged edge is p3, the end point is p4, and the weight is the sum of the weights of the 4 edges before merging: 746496 × 4 = 2985984. Similarly, the 4 edges between pipeline groups p6 and p7 are merged.
Through steps (3.3.1) to (3.3.3), taking fig. 4 as an example, the reduced computation graph G' shown in (a) in fig. 4 is gradually transformed by the above flow into the pipeline group graph S shown in (d) in fig. 4. S is composed of a pipeline group set P and a directed edge set ES, where the pipeline group set P includes 7 pipeline groups, the directed edge set ES includes 6 edges, and each edge represents the data dependency relationship between pipeline groups and the data amount to be transmitted. For example, the edge es1[p1, p2, 207936] in (d) in fig. 4 indicates that the result of pipeline group p1 needs to be transmitted to pipeline group p2, and the transmitted data amount is 207936.
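The construction of the pipeline group graph can be sketched as follows, under the assumption that the reduced graph is given as a node dictionary plus weighted edges; grouping is by node depth, a duplicate edge from one node into a group is dropped, and the remaining inter-group edges are merged by accumulating their weights (all names are illustrative, not from the patent):

```python
from collections import defaultdict

def build_pipeline_group_graph(nodes: dict, edges: dict):
    """nodes: {name: computation_amount}; edges: {(src, dst): weight}.

    Returns (groups, group_compute, group_edges): group members, summed
    computation amount per group, and merged inter-group data amounts.
    """
    preds = defaultdict(list)
    for (s, d) in edges:
        preds[d].append(s)

    depth = {}
    def node_depth(v):  # depth = 1 + max depth over predecessors (memoized)
        if v not in depth:
            depth[v] = 1 + max((node_depth(p) for p in preds[v]), default=0)
        return depth[v]

    groups, group_compute, node_group = defaultdict(set), defaultdict(int), {}
    for v in nodes:
        g = node_depth(v)          # nodes at the same depth share a pipeline group
        groups[g].add(v)
        group_compute[g] += nodes[v]
        node_group[v] = g

    # Keep only the first edge from a given source node into a given group,
    # then merge the surviving inter-group edges by accumulating their weights.
    seen, group_edges = set(), defaultdict(int)
    for (s, d), w in edges.items():
        key = (s, node_group[d])
        if key in seen:
            continue               # duplicate edge from the same node into the same group
        seen.add(key)
        if node_group[s] != node_group[d]:
            group_edges[(node_group[s], node_group[d])] += w

    return dict(groups), dict(group_compute), dict(group_edges)

# Tiny example mirroring p2 -> p3: one node fanning out to four parallel nodes
# keeps a single inter-group edge of weight 802816.
nodes = {"v2": 10, "w1": 4, "w2": 4, "w3": 4, "w4": 4}
edges = {("v2", "w1"): 802816, ("v2", "w2"): 802816,
         ("v2", "w3"): 802816, ("v2", "w4"): 802816}
print(build_pipeline_group_graph(nodes, edges))
```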
In the specific implementation of the step (4), dividing a pipeline parallel area according to the pipeline group diagram and the core particle topological structure;
specifically, the step (4) may include:
(4.1) acquiring the pipeline group graph S and the number M of sub-core particles in the core particle topological structure;
Specifically, taking the pipeline group graph S shown in fig. 4 (d) and the set of sub-core particles represented by the core particle topology shown in fig. 2 as an example, the pipeline group graph S and the number of sub-core particles M = 4 are obtained.
(4.2) calculating the expected overhead t0 of S on the M sub-core particles according to the pipeline group graph S and the number M of sub-core particles;
In particular, said step (4.2) may comprise the following sub-steps:
(4.2.1) obtaining a data volume d of S, wherein the data volume d is the weighted sum of each edge in S;
specifically, the pipeline group graph S shown in (d) in fig. 4 includes 7 edges, and the weight of each edge is sequentially accumulated to obtain the data amount d =8258624 of S.
(4.2.2) obtaining the number N of embedded neural network processors contained in one sub-core particle and the size w × w of the systolic array on each embedded neural network processor;
Specifically, each sub-core particle in the core particle topology shown in fig. 2 contains N = 16 embedded neural network processors, and the systolic array size on each embedded neural network processor is w × w = 16 × 16.
(4.2.3) calculating the data transmission overhead C of S according to the number M of sub-core particles, the data volume d and the number N of embedded neural network processors [the formula is given as an image in the original publication];
Specifically, substituting d = 8258624, M = 4 and N = 16 from the example into the formula, the data transmission overhead of the pipeline group graph S shown in (d) in fig. 4 is calculated as C = 812958.3.
(4.2.4) calculating the computation overhead d_v of a single operator v in S: the available sub-core particles are divided according to the computation amount of each operator in each pipeline group, and the computation overhead of each operator v in S is calculated separately; let m × k × n be the computation amount that operator v distributes onto each embedded neural network processor; if m > w or n > w, the computation overhead d_v of the single operator v is obtained by adding the ratio of the distributed computation amount m × k × n to the systolic array size w × w and 2 times the one-dimensional systolic beat count w - 1; if m < w and n < w, the computation overhead d_v of the single operator v is obtained by adding the computation amounts of the individual dimensions and subtracting 1;
Specifically, the pipeline group p3 in the pipeline group graph S contains 4 operators [w1, w2, w3, w4], and according to the computation amounts of these 4 operators it is determined that each operator can be placed on 1 sub-core particle. Then the computation overhead of each operator in S is computed one by one; for example, the computation amount operator v1 distributes onto each embedded neural network processor is 57 × 57 × 4098, which satisfies the condition m > w or n > w, so the computation overhead of v1 is m × n × k / (w × w) + 2(w - 1) = 52039.38. For a pipeline group containing only 1 operator, such as p1, the computation overhead is that of its single operator; for a pipeline group with several operators, such as p3, the computation overhead of each of its 4 operators is 3637.35, so the computation overhead of pipeline group p3 is 3637.35.
(4.2.5) calculating data calculation overhead D of S: for each pipeline group p in S, the maximum value of the calculation overhead of each operator is the calculation overhead of the pipeline group, and the sum of the calculation overheads of all the pipeline groups in S is the data calculation overhead D of S;
specifically, the calculation overhead of each pipeline group is sequentially superimposed to obtain the data calculation overhead D =108132.206 of S.
(4.2.6) adding the data transmission cost C of S to the data calculation cost D of S to obtain the expected cost of the calculation pipeline group diagram S on the sub-core particle set;
specifically, the pipeline group diagram shown in (d) in fig. 4 has an expected overhead of 921090.506 over the set of sub-core-grains represented by the core-grain topology shown in fig. 3.
(4.3) taking the edge es_i with the smallest weight in S;
Specifically, the edge es1[p1, p2, 207936] with the smallest weight in (d) in fig. 4 is obtained.
(4.4) using es_i, dividing S into two pipeline group graphs S1 and S2, and dividing the M sub-core particles into two parts M1 and M2 according to the computation amounts of S1 and S2; if the computation amount of either of the two subgraphs divided by es_i is less than 1/M of the total computation amount of S, returning to step (4.3) and taking, in ascending order of weight, the edge in S with the smallest weight after es_i;
Specifically, if the computation amount of either of the two subgraphs divided by the current es_i is less than 1/M of the total computation amount of S, such a subgraph cannot make full use of the core particle resources when placed on the core particle, so the method must return to step (4.3). In this example, the edge es1[p1, p2, 207936] divides S into two pipeline group graphs S1 and S2, where S1 contains the pipeline group p1 and S2 contains the pipeline groups p2-p7. According to the computation amounts of S1 and S2, the 4 sub-core particles are divided such that S1 and S2 each use 2 sub-core particles.
(4.5) calculating, by the same calculation method as in step (4.2), the expected overhead t1 of S1 on the M1 sub-core particles and the expected overhead t2 of S2 on the M2 sub-core particles;
Specifically, using the calculation method shown in steps (4.2.1) to (4.2.6), the expected overhead t1 of S1 on 2 sub-core particles is calculated as 104048.76, and the expected overhead t2 of S2 on 2 sub-core particles is calculated as 690648.85.
(4.6) calculating the expected data transmission overhead t3 from S1 to S2;
Specifically, according to the corresponding formula [given as an image in the original publication], the expected data transmission overhead t3 from S1 to S2 is calculated as 14945.4.
(4.7) comparing t0 with t1 + t2 + t3;
(4.8) if t0 > t1 + t2 + t3 and the number of currently divided regions is less than the total number of sub-core particles in the core particle, recursively invoking steps (4.1)-(4.7) on the pipeline group graph S1 with sub-core particle number M1 and on the pipeline group graph S2 with sub-core particle number M2; otherwise executing step (4.9);
Specifically, in the above example t0 = 921090.506 and t1 + t2 + t3 = 809643.01; by comparison, t0 > t1 + t2 + t3 holds and the number of currently divided regions is smaller than the total number of sub-core particles in the core particle, so the edge es1[p1, p2, 207936] is selected to divide the pipeline group graph S into two parts S1 and S2, which are allocated 2 sub-core particles each. At this point S1 contains no internal edge and cannot be divided further. Steps (4.1)-(4.7) are executed recursively to compute the division scheme of the pipeline group graph S2 with sub-core particle number M2 = 2.
(4.9) if t0 <= t1 + t2 + t3, or the number of currently divided regions is not less than the total number of sub-core particles in the core particle, obtaining the division result of the pipeline parallel regions, wherein the division result comprises the pipeline parallel region division of the deep neural network, the data dependency relationships among different regions, and the number of sub-core particles each region needs to occupy.
Specifically, by the division, the pipeline group diagram S is finally divided into 3 pipeline parallel regions shown in fig. 5, where the pipeline parallel region stage1 occupies 2 sub-core particles, and the pipeline parallel regions stage2 and stage3 each occupy 1 sub-core particle.
In the specific implementation of steps (4.1)-(4.9), the pipeline parallel regions are divided according to the sizes of the edges in the pipeline group graph, and reasonable stopping conditions are set so that a balance is reached between the number of divided pipeline parallel regions and the maximum overhead of each region; the method therefore does not divide out too many pipeline parallel regions, and the overall running time after pipeline-parallel execution of the deep neural network is kept as small as possible. In addition, the number of occupied sub-core particles is determined from the computation amount of each deep neural network pipeline parallel region, i.e., the size of the occupied core particle area is determined more accurately from the computation requirement of each pipeline parallel region. In this way, core grain resources are better utilized and the latency of the pipeline stages is balanced.
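The recursive bipartitioning of steps (4.1)-(4.9) can be sketched as below; the expected-overhead and transmission-overhead formulas are passed in as functions because the patent only provides them as images, and the interface, names, and toy cost functions are assumptions of this sketch:

```python
def split_on_edge(compute: dict, edges: dict, cut: tuple):
    """Remove edge `cut` from the pipeline group graph and split it into the two
    weakly connected halves; compute = {group: amount}, edges = {(a, b): weight}."""
    remaining = {e: w for e, w in edges.items() if e != cut}
    adj = {g: set() for g in compute}
    for a, b in remaining:
        adj[a].add(b)
        adj[b].add(a)
    side1, stack = set(), [cut[0]]
    while stack:
        g = stack.pop()
        if g not in side1:
            side1.add(g)
            stack.extend(adj[g])
    sub = lambda side: ({g: compute[g] for g in side},
                        {e: w for e, w in remaining.items() if e[0] in side and e[1] in side})
    return sub(side1), sub(set(compute) - side1)

def partition_regions(compute, edges, M, total_M, expected_cost, cut_cost, region_count=None):
    """Return pipeline parallel regions as [(set of groups, sub-core count), ...]."""
    if region_count is None:
        region_count = [1]                               # regions divided so far
    t0 = expected_cost(compute, edges, M)
    total_compute = sum(compute.values())
    for cut in sorted(edges, key=edges.get):             # edges by ascending weight
        S1, S2 = split_on_edge(compute, edges, cut)
        c1, c2 = sum(S1[0].values()), sum(S2[0].values())
        if min(c1, c2) < total_compute / M:
            continue                                     # subgraph too small, try next edge
        M1 = min(max(1, round(M * c1 / total_compute)), M - 1)
        M2 = M - M1
        t1, t2 = expected_cost(*S1, M1), expected_cost(*S2, M2)
        t3 = cut_cost(S1, S2)
        if t0 > t1 + t2 + t3 and region_count[0] < total_M:
            region_count[0] += 1
            return (partition_regions(*S1, M1, total_M, expected_cost, cut_cost, region_count)
                    + partition_regions(*S2, M2, total_M, expected_cost, cut_cost, region_count))
        break                                            # splitting does not pay off: stop
    return [(set(compute), M)]

# Toy usage with stand-in cost functions (not the patent's formulas):
chain_compute = {"p1": 10, "p2": 40, "p3": 40}
chain_edges = {("p1", "p2"): 5, ("p2", "p3"): 8}
dummy_cost = lambda comp, edg, m: sum(comp.values()) / max(m, 1) + 100 * sum(edg.values())
dummy_cut = lambda s1, s2: 1.0
print(partition_regions(chain_compute, chain_edges, 4, 4, dummy_cost, dummy_cut))
```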
In the specific implementation of the step (5), a deep neural network pipeline parallel scheduling strategy is determined according to the divided pipeline parallel region and the core grain topological structure;
specifically, the step (5) may include:
and deploying the pipeline parallel areas on the core grain topological structure according to the data dependency relationship among the pipeline parallel areas, and folding the current row to the next row to continue distribution according to the reverse direction when the residual core grain number of the current row cannot meet the distribution requirement.
In this embodiment, the placement of the deep neural network pipeline parallel regions on the core grain is determined in the form shown in FIG. 6. Fig. 6 shows a core grain containing 4 × 4 sub-core grains and a deep neural network divided into 7 pipeline parallel regions; the method adopts an "S"-shaped placement in order to reduce the overhead of data transmission between regions. First, the pipeline parallel region s1 needs 2 sub-core particles, so it is placed on the first 2 sub-core particles of the first row of the core particle. The next pipeline parallel region s2 needs 3 sub-core particles, but only 2 unallocated sub-core particles remain in the first row of the core particle, so the 1st sub-core particle from the right of the second row is allocated to s2 together with them; continuing in this way yields the allocation result shown in fig. 6. Applying the above process to the deep neural network pipeline parallel regions of fig. 5 and the core particle topology of fig. 2, the two sub-core particles in the first row of the core particle are assigned to the first pipeline parallel region, the first sub-core particle from the right of the second row is assigned to the second pipeline parallel region, and the second sub-core particle from the right is assigned to the third pipeline parallel region. Determining the placement of the deep neural network pipeline parallel regions on the core grain in this "S" shape keeps adjacent pipeline parallel regions as close together on the core grain as possible, thereby reducing the data transmission overhead between pipeline parallel regions.
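The "S"-shaped (serpentine) placement can be sketched as follows: sub-core particles are consumed row by row, with every other row traversed right to left, so that consecutive pipeline parallel regions occupy adjacent sub-core particles; the helper below is an illustrative assumption, not code from the patent:

```python
def serpentine_placement(rows: int, cols: int, region_sizes):
    """Assign each pipeline parallel region a list of (row, col) sub-core particle
    coordinates, walking the core particle grid in an 'S' shape."""
    order = []
    for r in range(rows):
        cols_order = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)  # fold back on odd rows
        order.extend((r, c) for c in cols_order)
    if sum(region_sizes) > len(order):
        raise ValueError("not enough sub-core particles for the requested regions")
    placement, i = [], 0
    for size in region_sizes:
        placement.append(order[i:i + size])
        i += size
    return placement

# The FIG. 5 regions (2, 1, 1 sub-core particles) on the 2 x 2 core particle of FIG. 2:
print(serpentine_placement(2, 2, [2, 1, 1]))
# [[(0, 0), (0, 1)], [(1, 1)], [(1, 0)]]
```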
In the specific implementation of the step (6), according to the deep neural network pipeline parallel scheduling strategy, deploying the deep neural network on the core particles, and executing deep neural network pipeline parallel reasoning;
specifically, the deep neural network and the pipeline parallel scheduling strategy thereof are transmitted to a core grain compiler (where the compiler is a well-known technology in the field, and the core grain compiler refers to a well-known compiler bound to the core grain), the compiler analyzes the deep neural network structure and the pipeline parallel scheduling strategy, automatically generates executable codes, maps different parts of the deep neural network to different sub-core grains of the core grain according to the pipeline parallel scheduling strategy, and performs scheduling during running.
Corresponding to the foregoing embodiments of the core-grain-oriented deep neural network pipeline parallel scheduling method, the present application also provides embodiments of a core-grain-oriented deep neural network pipeline parallel scheduling apparatus.
FIG. 7 is a block diagram illustrating a core-grain oriented deep neural network pipeline parallel scheduler in accordance with an exemplary embodiment. Referring to fig. 7, the apparatus may include:
the acquisition module 21 is used for acquiring a deep neural network and a core particle topological structure;
a constructing module 22, configured to construct a deep neural network computation graph according to the deep neural network and reduce the deep neural network computation graph;
the first dividing module 23 is configured to divide the pipeline group according to the reduced deep neural network computation graph to obtain a pipeline group graph;
a second dividing module 24, configured to divide a pipeline parallel region according to the pipeline group diagram and the core particle topology;
the determining module 25 is configured to determine a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel region and the core particle topological structure;
and the deployment module 26 is configured to deploy the deep neural network to the core particles according to the deep neural network pipeline parallel scheduling policy, and execute deep neural network pipeline parallel inference.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the core-grain-oriented deep neural network pipeline parallel scheduling method described above. As shown in fig. 8, which is a hardware structure diagram of a device with data processing capability on which the core-grain-oriented deep neural network pipeline parallel scheduling method according to the embodiment of the present invention runs, in addition to the processor, memory and network interface shown in fig. 8, the device in which the apparatus embodiment is located may also include other hardware according to its actual function, which is not described again here.
Accordingly, the present application also provides a computer readable storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the core-grain-oriented deep neural network pipeline parallel scheduling method is implemented. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit of any data processing capable device and an external storage device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.

Claims (10)

1. A core particle-oriented deep neural network pipeline parallel scheduling method is characterized by comprising the following steps:
step (1): acquiring a deep neural network and a core particle topological structure;
step (2): constructing a deep neural network calculation graph according to the deep neural network and reducing the deep neural network calculation graph;
and (3): dividing the pipeline group according to the reduced deep neural network computation graph to obtain a pipeline group graph;
and (4): dividing a pipeline parallel area according to the pipeline group diagram and the core particle topological structure;
and (5): determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel region and the core particle topological structure;
and (6): and deploying the deep neural network on the core particles according to the pipeline parallel scheduling strategy of the deep neural network, and executing pipeline parallel reasoning of the deep neural network.
2. The method of claim 1, wherein step (2) comprises:
(2.1) performing control flow analysis on the deep neural network to create a deep neural network computation graph G, wherein the deep neural network computation graph G consists of a node set V and a directed edge set E, each node v in G represents an operator, and each directed edge e represents a data dependency relationship between nodes; each node v in G comprises 3 attributes [in-degree, out-degree, computation amount], wherein the in-degree represents the number of operators in the computation graph G whose output serves as input to the current operator, the out-degree represents the number of operators in the computation graph G that take the current operator's output as input, and the computation amount represents the number of calculations required by the current operator; each edge e in G comprises 3 attributes [start point, end point, weight], wherein the weight represents the data amount of the elements output by the operator at the edge's start point;
(2.2) traversing the nodes in the computation graph G: for each node v_i whose in-degree and out-degree are both 1, examine its successor node v_j; if the in-degree and out-degree of v_j are also both 1, merge v_i and v_j into a node w_i, wherein the computation amount of w_i is the sum of the computation amounts of v_i and v_j;
(2.3) recursively traversing the calculation graph G until no node which can be merged exists, and obtaining a reduced neural network calculation graph G'.
3. The method of claim 1, wherein step (3) comprises:
(3.1) dividing pipeline groups p: operators in the reduced neural network computation graph G' that are at the same depth and have no data dependency relationship are divided into the same pipeline group p, wherein the computation amount of p is the cumulative sum of the computation amounts of all nodes in the pipeline group;
(3.2) adding the starting point and the end point of the edge in the G' into the attribute of the pipeline group;
(3.3) reducing the edges in G' to obtain a pipeline group graph S, wherein the S consists of a pipeline group set P and a directed edge set ES, each node P in the S represents a pipeline group, each pipeline group comprises a node set VS, and each directed edge ES represents the data dependency among the pipeline groups and the data quantity to be transmitted.
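
A possible reading of step (3.1) in Python: operators are assigned a depth by a topological sweep, and all operators at the same depth form one pipeline group whose computation amount is the sum over its nodes. Interpreting "depth" as longest-path distance from the graph inputs is an assumption; with that choice, two nodes at the same depth can never be connected by an edge, so they have no direct data dependency.

```python
# Illustrative sketch of step (3.1): one pipeline group per depth level,
# with the group's computation amount accumulated over its nodes.
from collections import defaultdict, deque

def pipeline_groups(nodes: dict) -> dict:
    indeg = {n: len(v.preds) for n, v in nodes.items()}
    depth = {n: 0 for n in nodes}
    q = deque(n for n, d in indeg.items() if d == 0)
    while q:                                     # topological sweep to assign depths
        n = q.popleft()
        for s in nodes[n].succs:
            depth[s] = max(depth[s], depth[n] + 1)
            indeg[s] -= 1
            if indeg[s] == 0:
                q.append(s)
    groups = defaultdict(lambda: {"nodes": [], "compute": 0.0})
    for n, d in depth.items():
        groups[d]["nodes"].append(n)
        groups[d]["compute"] += nodes[n].compute
    return dict(groups)
```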
4. The method of claim 3, wherein step (3.3) comprises:
(3.3.1) for multiple edges that start from the same node and end at different nodes within the same pipeline group, retaining only the first edge traversed;
(3.3.2) recursively executing (3.3.1) until G' contains no edges from the same node to different nodes within the same pipeline group;
(3.3.3) merging the edges between pipeline groups, wherein the start point and end point of a merged edge are the names of the corresponding pipeline groups, and its weight is the accumulated sum of the weights of all edges before merging.
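
A hedged sketch of steps (3.3.1)-(3.3.3): duplicate edges from one node into the same pipeline group are dropped, then the surviving edges are merged group-to-group with their weights accumulated. The data layout (edge triples plus a node-to-group map) is assumed, not specified by the patent.

```python
# Illustrative sketch of steps (3.3.1)-(3.3.3).
from collections import defaultdict

def build_group_graph(edges, group_of):
    """edges: list of (src_node, dst_node, weight); group_of: node -> group name."""
    seen, kept = set(), []
    for src, dst, w in edges:                    # (3.3.1)/(3.3.2): keep only the first
        key = (src, group_of[dst])               # edge from src into a given group
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, dst, w))
    merged = defaultdict(float)
    for src, dst, w in kept:                     # (3.3.3): merge edges between groups
        gs, gd = group_of[src], group_of[dst]
        if gs != gd:
            merged[(gs, gd)] += w                # weight = accumulated sum of weights
    return dict(merged)
```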
5. The method of claim 1, wherein the step (4) comprises:
(4.1) acquiring the pipeline group graph S and the number M of sub-core particles in the core particle topological structure;
(4.2) calculating the expected overhead t_0 of S on the M sub-core particles according to the pipeline group graph S and the number M of sub-core particles;
(4.3) taking the edge es_i with the minimum weight in S;
(4.4) dividing S into two pipeline group graphs S_1 and S_2 by cutting es_i, and dividing the M sub-core particles into two parts M_1 and M_2 according to the computation amounts of S_1 and S_2; if the computation amount of either subgraph obtained by cutting es_i is less than 1/M of the total computation amount of S, returning to step (4.3) and taking, in ascending order of weight, the edge in S with the next smallest weight after es_i;
(4.5) calculating, by the same method as in step (4.2), the expected overhead t_1 of S_1 on the M_1 sub-core particles and the expected overhead t_2 of S_2 on the M_2 sub-core particles;
(4.6) calculating the expected data transmission overhead t_3 from S_1 to S_2;
(4.7) comparing t_0 with t_1 + t_2 + t_3;
(4.8) if t_0 > t_1 + t_2 + t_3 and the number of currently divided regions is less than the total number of sub-core particles in the core particle, recursively invoking steps (4.1)-(4.7) on the pipeline group graph S_1 with the sub-core particle number M_1 and on the pipeline group graph S_2 with the sub-core particle number M_2; otherwise, executing step (4.9);
(4.9) if t_0 <= t_1 + t_2 + t_3, or the number of currently divided regions is not less than the total number of sub-core particles in the core particle, obtaining the division result of the pipeline parallel regions, wherein the division result comprises the pipeline parallel region division of the deep neural network, the data dependency relationships among different regions, and the number of sub-core particles to be occupied by each region.
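
The recursion of steps (4.1)-(4.9) can be sketched as follows. `expected_overhead`, `split_at_edge` and `transfer_overhead` are placeholder callables standing in for the cost model of step (4.2) and the graph-cut bookkeeping, and apportioning sub-core particles proportionally to computation amount is an illustrative choice, not necessarily the patent's exact rule.

```python
# Illustrative sketch of steps (4.1)-(4.9): split S at a minimum-weight edge,
# apportion the M sub-core particles by computation amount, and keep the split
# only if it lowers the expected overhead.

def partition(S, M, expected_overhead, split_at_edge, transfer_overhead):
    if M <= 1:
        return [(S, M)]
    t0 = expected_overhead(S, M)                          # (4.2)
    for es in sorted(S.edges, key=lambda e: e.weight):    # (4.3)/(4.4): smallest weight first
        S1, S2 = split_at_edge(S, es)
        if min(S1.compute, S2.compute) < S.compute / M:   # reject overly unbalanced cuts
            continue
        M1 = min(M - 1, max(1, round(M * S1.compute / S.compute)))
        M2 = M - M1
        t1, t2 = expected_overhead(S1, M1), expected_overhead(S2, M2)   # (4.5)
        t3 = transfer_overhead(S1, S2)                    # (4.6)
        if t0 > t1 + t2 + t3:                             # (4.7)/(4.8): recurse if cheaper
            return (partition(S1, M1, expected_overhead, split_at_edge, transfer_overhead)
                    + partition(S2, M2, expected_overhead, split_at_edge, transfer_overhead))
        break                                             # (4.9): keep S as one region
    return [(S, M)]
```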
6. The method of claim 5, wherein step (4.2) comprises:
(4.2.1) obtaining the data volume d of S, wherein d is the sum of the weights of all edges in S;
(4.2.2) acquiring the number N of embedded neural network processors contained in one sub-core particle and the size w of the systolic array on each embedded neural network processor;
(4.2.3) calculating the data transmission overhead C of S according to the number M of sub-core particles, the data volume d and the number N of embedded neural network processors:
[transmission-overhead formula provided as an image in the original publication]
(4.2.4) calculating the computation cost d_v of a single operator v in S: the available sub-core particles are apportioned according to the computation amount of each operator in each pipeline group, and the computation cost of each operator v in S is calculated separately; the operator v distributes a computation amount of m × k × n onto each embedded neural network processor; if m > w or n > w, the computation cost d_v of the single operator is obtained by adding the ratio of the computation amount m × k × n allocated to each embedded neural network processor over the systolic array size w × w to 2 times the one-dimensional systolic beat count w − 1; if m < w and n < w, the computation cost d_v of the single operator is obtained by adding up the computation amounts of the dimensions and then subtracting 1;
(4.2.5) calculating the data computation overhead D of S: for each pipeline group p in S, the maximum of the computation costs of its operators is the computation cost of the pipeline group, and the sum of the computation costs of all pipeline groups in S is the data computation overhead D of S;
(4.2.6) adding the data transmission overhead C of S to the data computation overhead D of S to obtain the expected overhead of the pipeline group graph S on the given sub-core particles.
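
A rough Python rendering of the cost model of step (4.2). Because the transmission-overhead formula of (4.2.3) is published only as an image, `transmission_cost` is left as a caller-supplied placeholder; the per-operator cost follows one plausible reading of (4.2.4), namely (m·k·n)/(w·w) + 2(w − 1) when a dimension exceeds the array size and m + k + n − 1 otherwise. The attributes `S.edges`, `S.groups`, `group.ops` and `op.shape` are assumed.

```python
# Illustrative sketch of step (4.2): expected overhead = transmission C + computation D.

def operator_cost(m, k, n, w):
    if m > w or n > w:
        return (m * k * n) / (w * w) + 2 * (w - 1)   # array tiling + fill/drain beats
    return m + k + n - 1                             # small case: sum of dims minus 1

def expected_overhead(S, M, N, w, transmission_cost):
    d = sum(e.weight for e in S.edges)               # (4.2.1): total data volume of S
    C = transmission_cost(M, d, N)                   # (4.2.3): formula is an image in the patent
    D = 0.0
    for group in S.groups:                           # (4.2.5): per-group cost is the max
        D += max(operator_cost(*op.shape, w) for op in group.ops)
    return C + D                                     # (4.2.6)
```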
7. The method of claim 1, wherein the step (5) is:
deploying the pipeline parallel regions on the core particle topological structure according to the data dependency relationships among the pipeline parallel regions, and, when the number of remaining sub-core particles in the current row cannot meet the allocation requirement, folding to the next row and continuing the allocation in the reverse direction.
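
Step (5) describes a serpentine (boustrophedon) placement. A small sketch, assuming a rows × cols grid of sub-core particles and regions given in data-dependency order with their sub-core-particle demands:

```python
# Illustrative sketch of step (5): snake-order placement of pipeline parallel
# regions, folding to the next row in the reverse direction when a row runs out.

def serpentine_placement(regions, rows, cols):
    """regions: list of (region_id, num_subcores) in data-dependency order."""
    placement = {}
    r, c, step = 0, 0, 1                       # step flips sign on each row change
    for region_id, need in regions:
        assigned = []
        while need > 0:
            if not (0 <= c < cols):            # row exhausted: fold to the next row,
                r, step = r + 1, -step         # continue in the reverse direction
                c = cols - 1 if step < 0 else 0
                if r >= rows:
                    raise ValueError("not enough sub-core particles")
            assigned.append((r, c))
            c += step
            need -= 1
        placement[region_id] = assigned
    return placement
```

The snake order keeps regions that exchange data placed on physically adjacent sub-core particles even across row boundaries, which is consistent with the dependency-ordered deployment the claim describes.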
8. A core particle-oriented deep neural network pipeline parallel scheduling device, characterized by comprising:
an acquisition module, used for acquiring a deep neural network and a core particle topological structure;
a construction module, used for constructing a deep neural network computation graph according to the deep neural network and reducing the deep neural network computation graph;
a first dividing module, used for dividing pipeline groups according to the reduced deep neural network computation graph to obtain a pipeline group graph;
a second dividing module, used for dividing pipeline parallel regions according to the pipeline group graph and the core particle topological structure;
a determining module, used for determining a deep neural network pipeline parallel scheduling strategy according to the divided pipeline parallel regions and the core particle topological structure;
a deployment module, used for deploying the deep neural network on the core particle according to the deep neural network pipeline parallel scheduling strategy and executing deep neural network pipeline parallel inference.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1-7.
CN202211381782.9A 2022-11-07 2022-11-07 Core particle-oriented deep neural network pipeline parallel scheduling method and device Active CN115421897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211381782.9A CN115421897B (en) 2022-11-07 2022-11-07 Core particle-oriented deep neural network pipeline parallel scheduling method and device

Publications (2)

Publication Number Publication Date
CN115421897A (en) 2022-12-02
CN115421897B (en) 2023-03-24

Family

ID=84207833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211381782.9A Active CN115421897B (en) 2022-11-07 2022-11-07 Core particle-oriented deep neural network pipeline parallel scheduling method and device

Country Status (1)

Country Link
CN (1) CN115421897B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9599531B1 (en) * 2015-12-21 2017-03-21 International Business Machines Corporation Topological connectivity and relative distances from temporal sensor measurements of physical delivery system
CN107679621A (en) * 2017-04-19 2018-02-09 北京深鉴科技有限公司 Artificial neural network processing unit
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning
WO2021190761A1 (en) * 2020-03-27 2021-09-30 Huawei Technologies Co., Ltd. Parallel computing scheme generation for neural networks
CN112381211A (en) * 2020-11-20 2021-02-19 西安电子科技大学 System and method for executing deep neural network based on heterogeneous platform
GB202201928D0 (en) * 2021-03-08 2022-03-30 Nvidia Corp Machine learning techniques for predicting depth information in image data
CN113487029A (en) * 2021-08-05 2021-10-08 杭州电子科技大学 Transplantable neural network distributed parallel strategy searching method
CN114841309A (en) * 2022-03-28 2022-08-02 阿里云计算有限公司 Data processing method and device and electronic equipment
CN115016937A (en) * 2022-06-09 2022-09-06 中国人民解放军国防科技大学 Memory scheduling method for pipeline parallel training
CN115115052A (en) * 2022-08-11 2022-09-27 杭州电子科技大学 Neural network self-adaptive distributed parallel training method based on genetic algorithm
CN115186821A (en) * 2022-09-13 2022-10-14 之江实验室 Core particle-oriented neural network inference overhead estimation method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Chao et al.: "Research Progress on Hardware Acceleration for Machine Learning Based on FPGA", Chinese Journal of Computers *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115829017A (en) * 2023-02-20 2023-03-21 之江实验室 Data processing method, device, medium and equipment based on core particles
CN115829017B (en) * 2023-02-20 2023-05-23 之江实验室 Method, device, medium and equipment for processing data based on core particles
CN115860081A (en) * 2023-03-01 2023-03-28 之江实验室 Core particle algorithm scheduling method and system, electronic equipment and storage medium
CN115860081B (en) * 2023-03-01 2023-05-26 之江实验室 Core algorithm scheduling method, system, electronic equipment and storage medium
CN116739090A (en) * 2023-05-12 2023-09-12 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116739090B (en) * 2023-05-12 2023-11-28 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN117130760A (en) * 2023-10-24 2023-11-28 中诚华隆计算机技术有限公司 Intelligent core particle selection scheduling method and system
CN117130760B (en) * 2023-10-24 2024-01-26 中诚华隆计算机技术有限公司 Intelligent core particle selection scheduling method and system

Also Published As

Publication number Publication date
CN115421897B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN115421897B (en) Core particle-oriented deep neural network pipeline parallel scheduling method and device
US6064819A (en) Control flow and memory management optimization
WO2021190127A1 (en) Data processing method and data processing device
DE60318086T2 (en) SYSTEM AND METHOD OF REDUCING LINE DELAY OR OVERLOAD IN SYNTHESIS OF HARDWARE SOLVERN
CN115186821B (en) Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
CN102576314B Mapping processing logic having data-parallel threads across multiple processors
CN111488205B (en) Scheduling method and scheduling system for heterogeneous hardware architecture
EP0974898A2 (en) A method for determining a storage-bandwidth optimized memory organization of an essentially digital device
CN114492782B (en) On-chip core compiling and mapping method and device of neural network based on reinforcement learning
CN115543639A (en) Optimization method for distributed execution of deep learning task and distributed system
US11630983B2 (en) Graph conversion method
CN116467061B (en) Task execution method and device, storage medium and electronic equipment
CN115269204B (en) Memory optimization method and device for neural network compiling
CN115860081A (en) Core particle algorithm scheduling method and system, electronic equipment and storage medium
CN115421735A (en) Heterogeneous deployment method and device for deep learning task and electronic equipment
CN111090613B (en) Low-complexity hardware and software partitioning and scheduling method based on graph partitioning
Sousa et al. Efficient tensor slicing for multicore NPUs using memory burst modeling
Shirazi et al. PARSA: A parallel program scheduling and assessment environment
CN114911610A (en) Task compiling method and device and compiler
Bringmann et al. Cross-level hierarchical high-level synthesis
JP2009301453A (en) Distributed memory type multiprocessor system, masked reverse shift communication method and program
CN114860417B (en) Multi-core neural network processor and multi-task allocation scheduling method for same
US20230023545A1 (en) Methods and systems for deep learning chip design generation
Biagioni Scan-directed load balancing
Hou et al. Presynthesis partitioning for hardware/software cosynthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant