CN111752691B - Method, device, equipment and storage medium for sorting AI (Artificial Intelligence) computation graphs - Google Patents

Method, device, equipment and storage medium for sorting AI (Artificial Intelligence) computation graphs

Info

Publication number
CN111752691B
CN111752691B (application CN202010577847.1A)
Authority
CN
China
Prior art keywords
branch
nodes
node
calculation
computation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010577847.1A
Other languages
Chinese (zh)
Other versions
CN111752691A (en)
Inventor
邹伟
熊超
蔡权雄
牛昕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Corerain Technologies Co Ltd
Original Assignee
Shenzhen Corerain Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Corerain Technologies Co Ltd filed Critical Shenzhen Corerain Technologies Co Ltd
Priority to CN202010577847.1A priority Critical patent/CN111752691B/en
Publication of CN111752691A publication Critical patent/CN111752691A/en
Priority to PCT/CN2021/098307 priority patent/WO2021259041A1/en
Application granted granted Critical
Publication of CN111752691B publication Critical patent/CN111752691B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for sorting AI (Artificial Intelligence) computation graphs, wherein the method comprises the following steps: obtaining a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes; performing topological sorting on the computation graph to obtain a first arrangement order of the computation graph; determining a second arrangement order of a plurality of branch computation nodes according to the branch ordering of a plurality of branches in the computation graph; and replacing the arrangement order of the branch computation nodes in the first arrangement order with the second arrangement order corresponding to the branch computation nodes to obtain the target arrangement order of the computation graph. By performing topological sorting and branch ordering on the computation graph, the embodiment of the invention realizes the sorting of a computation graph based on a data flow architecture, so that the execution order of each computation node in the computation graph can be uniquely determined, and the performance of a chip based on the data flow architecture is improved.

Description

Method, device, equipment and storage medium for sorting AI (Artificial Intelligence) computation graphs
Technical Field
Embodiments of the present invention relate to the technical field of artificial intelligence, and in particular to a method, a device, equipment and a storage medium for sorting AI (Artificial Intelligence) computation graphs.
Background
A deep learning model is essentially a computational graph. A convolutional neural network model, for example, is essentially a directed acyclic computational graph containing a large number of computation nodes, each of which represents a computational operation. Input dependencies exist between nodes: the current computation node can execute only after all of its input nodes have finished computing.
Currently, most artificial intelligence chips are developed on the instruction set architectures of the CPU (Central Processing Unit) and GPU (Graphics Processing Unit). CPUs and GPUs generally have an instruction control unit and a mature cache mechanism, so a computational graph can be fed directly into an instruction-set-architecture AI chip for execution.
In recent years, artificial intelligence chips developed on a data flow architecture have attracted wide attention for their high utilization and low latency. However, the data flow architecture and the instruction set architecture are not the same architecture: the instruction control unit and cache mechanism of the instruction set architecture cannot be reused in the data flow architecture. An AI chip developed on a data flow architecture has no instruction control unit and can only execute a computation graph in an order that has been optimized in advance. A processing method is therefore needed to optimize the node ordering of the computation graph in advance, so that an AI chip developed on a data flow architecture can compute the graph normally.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, apparatus, device and storage medium for ordering AI computation graphs, so as to achieve ordering of computation graphs based on a data flow architecture, improve the utilization of the caches of chips based on the data flow architecture, and improve chip performance.
In a first aspect, an embodiment of the present invention provides a method for ordering AI computation graphs, including:
obtaining a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes;
performing topological sorting on the computation graph to obtain a first arrangement order of the computation graph;
determining a second arrangement order of the plurality of branch computing nodes according to the branch ordering of the plurality of branches in the computation graph;
and replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain the target arrangement sequence of the calculation graph.
Further, determining a second ranking order of the plurality of branch computing nodes according to a branch ranking of the plurality of branches in the computation graph includes:
determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node;
determining the weight of each branch according to the number of branch calculation nodes included in each branch;
sorting the branches corresponding to each branch start node according to the weight of each branch;
and determining a second arrangement sequence of a plurality of branch computing nodes corresponding to each branch starting node according to the branch ordering.
Further, determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node includes:
if the number of outputs of a computation node is greater than 1, taking the computation node as a branch start node;
if the number of outputs of a computation node located after the branch start node is greater than 1 or equal to 0, taking the computation node as a branch termination node, wherein one branch start node corresponds to at least two branch termination nodes;
and forming branches from the branch starting node to the branch ending node according to the direction of the computational graph.
Further, determining the weight of each branch according to the number of branch computing nodes included in each branch includes:
determining whether a composite computing node exists in the current branch, wherein the composite computing node is a computing node shared by the current branch and other branches except a branch starting node;
if not, the number of branch calculation nodes included in the current branch is taken as the weight of the current branch.
Further, after determining whether the composite computing node exists in the current branch, the method further includes:
if yes, determining whether the composite computing node is a branch starting node of other branches;
if yes, the number of branch calculation nodes corresponding to the branch starting node of the current branch is used as the weight of the current branch.
Further, after determining whether the composite computing node is a branch start node of another branch, the method further includes:
if not, the weight of the current branch is set to infinity;
and after the branch sequencing is completed by other branches where the composite computing node is located, taking the number of the branch computing nodes included in the current branch as the weight of the current branch.
Further, the method further comprises the following steps:
and allocating ping-pong caches to the computational graph according to the target arrangement sequence of the computational graph.
In a second aspect, an embodiment of the present invention provides an apparatus for sorting AI computation graphs, including:
a computation graph acquisition module, configured to acquire a computation graph based on a data flow architecture, where the computation graph includes a plurality of computation nodes;
the topological sorting module is used for performing topological sorting on the computation graph to obtain a first arrangement order of the computation graph;
A branch ordering module, configured to determine a second ordering order of a plurality of branch computing nodes according to branch ordering of a plurality of branches in the computation graph;
the target arrangement order determining module is used for replacing the arrangement order of the branch computing nodes in the first arrangement order with a second arrangement order corresponding to the branch computing nodes to obtain the target arrangement order of the computation graph.
In a third aspect, an embodiment of the present invention provides an apparatus, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for ordering AI computation graphs provided by any of the embodiments of the invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method for sorting AI computation graphs provided by any embodiment of the present invention.
According to the embodiment of the invention, the topology ordering and the branch ordering are carried out on the computation graph, so that the ordering of the computation graph based on the data flow architecture is realized, the execution sequence of each computation node in the computation graph can be uniquely determined, and the chip performance based on the data flow architecture is improved.
Drawings
FIG. 1A is a flowchart illustrating a method for sorting AI calculation graphs according to an embodiment of the invention;
FIG. 1B is a schematic diagram of an exemplary calculation diagram according to a first embodiment of the present invention;
fig. 2A is a flowchart illustrating a method for sorting AI computation graphs according to a second embodiment of the present invention;
fig. 2B is a flowchart illustrating a method for determining branch weights according to a second embodiment of the present invention;
FIG. 2C is a schematic diagram of an exemplary calculation diagram according to a second embodiment of the present invention;
fig. 2D is a schematic diagram of ping-pong cache allocation for a computation graph according to the second embodiment of the present invention;
fig. 2E is a schematic diagram of the computation order of each computation node in a computation graph after topological sorting according to the second embodiment of the present invention;
FIG. 2F is a schematic diagram of the computation order of each computation node in a computation graph after branch ordering according to the second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a sorting device for AI computation graphs according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Furthermore, the terms "first," "second," and the like, may be used herein to describe various directions, acts, steps, or elements, etc., but these directions, acts, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. The terms "first," "second," and the like, are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, "plurality", "batch" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Example 1
Fig. 1A is a flow chart of an AI computation graph sorting method according to an embodiment of the present invention, which is applicable to node sorting of a deep learning model computation graph based on a data flow architecture. As shown in fig. 1A, the method for sorting AI computation graphs according to the first embodiment of the present invention includes:
s110, acquiring a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes.
Specifically, a computation graph based on a data flow architecture is the computation graph of a deep learning model developed on a data flow architecture. A computation graph expresses the computation flow of a data structure as a directed acyclic graph (Directed Acyclic Graph, DAG) that includes a plurality of computation nodes. Each node represents an arithmetic operation, such as addition, subtraction or multiplication, or a tensor-shape operation, such as shape transformation or slicing of multidimensional data.
S120, performing topological sorting on the computation graph to obtain a first arrangement order of the computation graph.
Specifically, topologically sorting a directed acyclic graph G means arranging all vertices of G in a linear sequence such that, for any pair of vertices u and v, if the edge (u, v) ∈ E(G), then u appears before v in the sequence. Performing topological sorting on the computation graph thus orders all computation nodes according to the data flow direction in the graph; the result of the topological sorting is called the first arrangement order.
In general, when a computation node has more than one output, the result of topological sorting is not unique. For example, for the computation graph shown in fig. 1B, the first arrangement order obtained after topological sorting may be A->B->C->D->E->F->G->H, or A->B->C->D->F->E->G->H, or A->B->C->E->G->H->D->F.
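For illustration, this non-uniqueness can be reproduced with a short sketch. This is not the patent's implementation; it is a plain Kahn's-algorithm topological sort over an edge list reconstructed from the description of Fig. 1B (the edge list is an assumption):

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm: produce one valid linear order of a DAG.
    Any order that respects every edge is an equally legal result,
    which is why the first arrangement order is not unique."""
    indegree = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order

# Edges reconstructed from the description of Fig. 1B (an assumption):
nodes = list("ABCDEFGH")
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"),
         ("D", "F"), ("E", "G"), ("G", "H")]
first_order = topological_sort(nodes, edges)
print(first_order)  # one legal order of Fig. 1B
```

Swapping the queue for a stack, or visiting successors in a different order, yields the other legal orders listed above; this ambiguity is exactly what the branch ordering of S130 removes.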
S130, determining a second arrangement sequence of the plurality of branch computing nodes according to the branch sequences of the plurality of branches in the computing graph.
Specifically, when a computation node has more than one output, it forms one branch along each output direction. Such a node is called a branch start node, and the computation nodes within the branches are called branch computation nodes; obviously, one branch start node forms at least two branches. By branch-ordering the several branches of each branch start node, the execution order of each branch can be uniquely determined, which uniquely determines the arrangement order of all branch computation nodes of the branches formed from that start node. This arrangement order of all branch computation nodes is called the second arrangement order.
Optionally, the branches may be sorted in ascending order of weight, where the weight of a branch is generated by a preset rule, for example by directly setting the number of branch computation nodes in the branch as its weight, or by setting the weight according to the node types of the computation nodes. Illustratively, the computation graph of FIG. 1B contains the branches B->C, B->D->F and B->E->G->H. Mark branch B->C as branch 1, branch B->D->F as branch 2 and branch B->E->G->H as branch 3, and take the number of branch computation nodes in each branch as its weight: branch 1 has weight 2, branch 2 has weight 3 and branch 3 has weight 4. Sorting in ascending order of weight gives the branch order 1, 2, 3, that is, branch 1 is executed first, then branch 2, and finally branch 3, yielding the second arrangement order of all branch computation nodes: B->C->D->F->E->G->H.
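The weight-and-sort step just described can be sketched in a few lines of Python. The branch lists for Fig. 1B are taken from the text; this is an illustration, not the patent's code:

```python
# Branches of start node B in Fig. 1B, as listed in the text
branches = [["B", "C"], ["B", "D", "F"], ["B", "E", "G", "H"]]

# Weight of a branch = number of branch computation nodes it contains;
# sort the branches in ascending order of weight
branches.sort(key=len)

# Concatenate: emit the shared start node once, then each branch's
# remaining nodes in the sorted branch order
second_order = ["B"] + [n for b in branches for n in b[1:]]
print(second_order)  # ['B', 'C', 'D', 'F', 'E', 'G', 'H']
```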
S140, replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain the target arrangement sequence of the calculation graph.
Specifically, in topological sorting, the arrangement order of the computation nodes near a branch start node is generally not unique, so the arrangement order of the branch computation nodes in the first arrangement order is not unique; the second arrangement order determined by branch ordering, however, is unique. Replacing the arrangement order of the corresponding branch computation nodes in the first arrangement order with the second arrangement order therefore yields a unique arrangement order for the computation graph, namely the target arrangement order of the computation graph.
For example, for the computation graph shown in fig. 1B, the first arrangement order obtained by topologically sorting the graph is A->B->C->D->E->F->G->H, and the second arrangement order of the branch computation nodes is B->C->D->F->E->G->H. Replacing the arrangement order of the corresponding branch computation nodes in the first arrangement order with the second arrangement order gives the target arrangement order of the computation graph: A->B->C->D->F->E->G->H.
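The replacement step can be sketched as a single pass over the first order: every slot occupied by a branch computation node is refilled from the second order. A minimal illustration; the name `merge_orders` is hypothetical:

```python
def merge_orders(first, second):
    """Keep non-branch nodes in place; refill the slots occupied by
    branch computation nodes with the unique second arrangement order."""
    branch_nodes = set(second)
    it = iter(second)
    return [next(it) if n in branch_nodes else n for n in first]

first = list("ABCDEFGH")   # a first arrangement order of Fig. 1B
second = list("BCDFEGH")   # the second arrangement order of start node B
target = merge_orders(first, second)
print(target)  # ['A', 'B', 'C', 'D', 'F', 'E', 'G', 'H']
```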
According to the method for sorting the AI computational graph, which is provided by the embodiment of the invention, the topological sorting and the branch sorting are carried out on the computational graph, so that the sorting of the computational graph based on the data flow architecture is realized, the execution sequence of each computational node in the computational graph can be uniquely determined, and the chip performance based on the data flow architecture is improved.
Example two
Fig. 2A is a flow chart of a sorting method of AI computation graphs according to a second embodiment of the present invention, which is a further refinement of the above embodiment. As shown in fig. 2A, the method for sorting AI computation graphs provided in the second embodiment of the present invention includes:
s210, acquiring a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes.
S220, performing topological sorting on the computation graph to obtain a first arrangement order of the computation graph.
S230, determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node.
Specifically, the computation graph of a deep learning model is a directed acyclic graph that usually has numerous computation nodes and a complex structure, so there are usually several branch start nodes as well. Every branch is based on its corresponding branch start node, so to determine the branches, the branch start nodes must be determined first.
Further, determining a plurality of branch start nodes and a plurality of branches corresponding to each branch start node includes steps S231 to S233 (not shown in the figure).
S231, if the number of outputs of a computation node is greater than 1, taking the computation node as a branch start node.
Specifically, when a computation node has more than one output, that is, its output number is greater than 1, the node is called a branch start node. Illustratively, referring to the computation graph shown in FIG. 1B, computation node B can be determined to be a branch start node.
S232, if the number of outputs of a computation node located after the branch start node is greater than 1 or equal to 0, taking the computation node as a branch termination node, wherein one branch start node corresponds to at least two branch termination nodes.
Specifically, the branch termination node is the last computation node of a branch. If the number of outputs of a computation node located after the branch start node is greater than 1, that node can serve as the branch start node of other branches, so the current branch should end at that node; that is, the node serves as a branch termination node corresponding to the current branch start node. If the number of outputs of a computation node located after the branch start node is equal to 0, that node is the last computation node, so it likewise serves as a branch termination node corresponding to the current branch start node. Since a branch start node is a computation node with more than one output, its output is connected to at least 2 computation nodes, and one branch start node therefore corresponds to at least two branch termination nodes. Illustratively, referring to the computation graph shown in FIG. 1B, the branch termination nodes corresponding to branch start node B can be determined to include computation node C, computation node F and computation node H.
S233, forming branches from the branch starting node to the branch ending node according to the direction of the calculation graph.
Specifically, a branch corresponds to a linear arrangement whose start point and end point have each been determined; the branch itself is determined by following the data flow direction, i.e. the direction of the computation graph. Illustratively, referring to the computation graph shown in FIG. 1B, the branch formed by branch start node B and branch termination node C is B->C, the branch formed by branch start node B and branch termination node F is B->D->F, and the branch formed by branch start node B and branch termination node H is B->E->G->H.
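Steps S231 to S233 can be sketched as follows, assuming a successor-list representation of Fig. 1B (the edge data and the helper name `branches_from` are assumptions): a branch follows each output of the start node until it reaches a node whose output count is not exactly 1.

```python
def branches_from(start, succ):
    """S231-S233: from a branch start node (output count > 1), follow
    each output direction until a branch termination node, i.e. a node
    whose output count is greater than 1 or equal to 0 (S232)."""
    result = []
    for nxt in succ[start]:
        branch = [start, nxt]
        while len(succ[branch[-1]]) == 1:   # keep walking single-output nodes
            branch.append(succ[branch[-1]][0])
        result.append(branch)
    return result

# Successor lists reconstructed from the description of Fig. 1B (assumed)
succ = {"A": ["B"], "B": ["C", "D", "E"], "C": [], "D": ["F"],
        "E": ["G"], "F": [], "G": ["H"], "H": []}
start_nodes = [n for n in succ if len(succ[n]) > 1]   # S231
print(start_nodes)               # ['B']
print(branches_from("B", succ))  # [['B','C'], ['B','D','F'], ['B','E','G','H']]
```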
S240, determining the weight of each branch according to the number of branch calculation nodes included in each branch.
Specifically, the branch weight is determined according to the number of branch computation nodes included in the branch: the number of branch computation nodes in a branch is set as the weight of that branch. For example, referring to the computation graph shown in fig. 1B, branch B->C contains 2 branch computation nodes, so the weight of branch B->C is 2; branch B->D->F contains 3 branch computation nodes, so the weight of branch B->D->F is 3; and branch B->E->G->H contains 4 branch computation nodes, so the weight of branch B->E->G->H is 4.
Further, as shown in fig. 2B, the method for determining the branching weight includes:
s241, determining whether a composite computing node exists in the current branch, wherein the composite computing node is a computing node shared by the current branch and other branches except the branch starting node.
Specifically, because the structure of a computation graph is complex, one computation node may well be shared by several branches; such a shared computation node is called a composite computation node. Since one branch start node corresponds to several branches, the branch start node naturally appears in all of them, and it is therefore not counted as a composite computation node. If the current branch contains a composite computation node, the current branch shares a computation node with other branches, and step S243 is executed; otherwise, step S242 is executed.
S242, if not, taking the number of branch computation nodes included in the current branch as the weight of the current branch.
Specifically, if no composite computation node exists in the current branch, the current branch is a single branch, that is, it does not cross any other branch, and the number of branch computation nodes it includes can be used as its weight. Illustratively, referring to the computation graph shown in FIG. 1B, branches B->C, B->D->F and B->E->G->H are all single branches, so their corresponding weights may be set to 2, 3 and 4.
S243, if so, determining whether the composite computing node is a branch starting node of other branches.
Specifically, if a composite computation node exists in the current branch, there are two cases. In the first case, the composite computation node is the branch termination node of the current branch, but its output number is greater than 1, so it is simultaneously the branch start node of other branches; in this case, step S244 is executed.
In the other case, the composite computation node is a computation node of the current branch other than the branch start node and the branch termination node. Its input number is then greater than 1 while its output number is 1, so its input depends on at least 2 computation nodes, and it is likewise a non-start, non-termination computation node of the other branches; in this case, step S245 is executed.
S244, if yes, taking the number of branch calculation nodes corresponding to the branch starting node of the current branch as the weight of the current branch.
Specifically, if the composite computation node is the branch start node of other branches, the computation nodes to be computed after the current branch still depend on the output data of the current branch's termination node; that output is not the final output data. The weight of the current branch should therefore be set to the largest among all branches corresponding to the current branch's start node, so that the current branch is executed last and data belonging to other branches is not overwritten. Optionally, the weight of the current branch may be maximized either by taking the number of branch computation nodes corresponding to the current branch's start node as its weight, or by taking the sum of the weights of all branches corresponding to that start node as its weight. For example, referring to the computation graph shown in fig. 2C, the current branch B->E->G contains a composite computation node G, which is the branch start node of other branches. The branches corresponding to branch start node B of the current branch are B->C, B->D->F and B->E->G, with weights 2, 3 and 3 respectively. If the sum of all branch weights is set as the weight of the current branch, its weight is 2+3+3=8; if the number of branch computation nodes corresponding to branch start node B is set as the weight, and that number is 6, the weight of the current branch is 6. The weights of branches B->C, B->D->F and B->E->G corresponding to branch start node B are then 2, 3 and 8 respectively, or 2, 3 and 6 respectively.
S245, if not, the weight of the current branch is set to infinity.
S246, after the other branches containing the composite computing node complete their branch ordering, the number of branch calculation nodes included in the current branch is used as the weight of the current branch.
Specifically, when the composite computing node is not the branch starting node of another branch, that is, when it is a computing node of the current branch other than the branch starting node and the branch ending node, the current branch also relies on the output data of other branches as input. By the input-dependency property of the computation graph, the current branch can run normally only after all of its input computing nodes have been computed. The weight of the current branch is therefore first set to infinity while the other branches containing the composite computing node complete their branch ordering, which guarantees that all input computing nodes of the current branch are computed before the current branch runs; the weight is then restored to its normal value, namely the number of branch computing nodes included in the current branch.
S250, sorting the branches corresponding to each branch starting node according to the weight of each branch.
S260, determining a second arrangement sequence of a plurality of branch computing nodes corresponding to each branch starting node according to the branch ordering.
S270, replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain the target arrangement sequence of the calculation graph.
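Steps S220-S270 above can be sketched as a short Python pipeline. The names are illustrative, and the standard-library `graphlib` sorter stands in for whatever topological sort implementation the patent contemplates:

```python
from graphlib import TopologicalSorter

def target_order(edges, branch_order):
    """Sketch of S220-S270: compute one valid topological order (the
    first arrangement order), then splice the branch-sorted sequence
    (the second arrangement order) into the slots occupied by the
    branch computing nodes.

    edges maps each node to its successors; node names are illustrative.
    """
    preds = {u: [] for u in edges}
    for u, vs in edges.items():
        for v in vs:
            preds.setdefault(v, []).append(u)
    first = list(TopologicalSorter(preds).static_order())   # S220
    branch_set = set(branch_order)
    it = iter(branch_order)
    # S270: non-branch nodes keep their positions; the branch-node
    # slots are refilled in the branch-sorted order.
    return [next(it) if n in branch_set else n for n in first]
```

For the graph of fig. 1B with branch order C, D, F, E, G, H, this yields the target order A- > B- > C- > D- > F- > E- > G- > H quoted later in the text, whichever valid topological order the sorter happens to produce.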
S280, allocating ping-pong caches to the calculation graphs according to the target arrangement sequence of the calculation graphs.
Specifically, a chip based on data flow generally adopts a ping-pong caching mechanism during data computation: data is cached alternately in two on-chip caches, while final output data is written to an external cache, which also backs up the computation data of each computing node. Computing nodes exchange data between the on-chip caches quickly, whereas transferring output to the external cache takes much longer; on the other hand, the on-chip caches typically have small storage space and the external cache has large storage space. Allocating ping-pong caches to the computation graph according to the target arrangement order keeps the data produced during execution in the on-chip caches as much as possible, which greatly reduces the computation time of the computation graph and improves chip performance.
For example, for the computation graph shown in fig. 1B, assume there are two on-chip caches B1 and B2 and one external low-speed cache D1. Allocating ping-pong caches to the computation graph of fig. 1B yields the cache allocation graph shown in fig. 2D, where the cache label on each node indicates where its data is stored after computation: the data of computing node A is stored in cache B1; by the ping-pong mechanism, the data of computing node B is stored in cache B2; the data of computing nodes D and E is stored in cache B1; and since computing nodes C, F and H have no outputs, their output data is stored as final output data in the low-speed cache D1.
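One allocation rule consistent with this example can be sketched in Python. This is a toy reading, not the patent's stated algorithm: each computing node writes to the on-chip cache opposite the one its input was read from, graph inputs start in B1, and nodes without consumers write final output to the external cache D1:

```python
def allocate_pingpong(edges, order):
    """Toy ping-pong allocation in the spirit of fig. 2D; assumes a
    single-input graph. edges maps each node to its successors, and
    order is the target arrangement order of the computation graph.
    """
    pred = {v: u for u, vs in edges.items() for v in vs}
    placement = {}
    for node in order:
        if not edges[node]:            # no consumers: final output
            placement[node] = 'D1'
        elif node not in pred:         # graph input starts in B1
            placement[node] = 'B1'
        else:                          # ping-pong between B1 and B2
            placement[node] = 'B2' if placement[pred[node]] == 'B1' else 'B1'
    return placement
```

Applied to the fig. 1B graph in the target order, this rule reproduces the placement described above (A in B1, B in B2, D and E in B1, G in B2, and C, F, H in D1).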
If only topological ordering is performed on the computation graph shown in fig. 1B, one possible result is the computation order shown in fig. 2E, where the number on each computing node gives its position in the execution sequence; for example, computing node A is numbered 1, meaning it executes first. As can be seen from fig. 2E, after computing node B finishes and stores its data in cache B2, computing node C reads from cache B2, computes, and stores its result in the low-speed cache D1; computing node D then reads from cache B2, computes, and stores its result in cache B1; the next node to compute is computing node E, which also reads from cache B2 and stores its result in cache B1. When computing node E finishes, its data is stored in cache B1 and overwrites the data of computing node D, so computing node F, which executes next, must load the input data of computing node D from cache D1. This greatly increases the time consumed by data transmission, and if the backup data is not loaded, a computation error results.
However, after topological ordering and branch ordering are performed on the computation graph with the AI computation graph sorting method provided by this embodiment of the invention, the target arrangement order of the computation graph is: A- > B- > C- > D- > F- > E- > G- > H, and the computation order of each computing node is shown in fig. 2F. As can be seen from fig. 2F, computing node D finishes and caches its data in cache B1; computing node F then reads the data of computing node D from cache B1 and computes; after computing node F finishes, computing node E reads the data of computing node B from cache B2, computes, and caches its result in cache B1; and computing node G then reads the data of computing node E from cache B1, computes, and caches its result in cache B2.
According to the AI computation graph sorting method provided by this embodiment of the invention, topological ordering and branch ordering of the computation graph uniquely determine the execution order of every computing node, realizing ordering of computation graphs based on a data flow architecture. The computation process of the graph thus fits the ping-pong caching mechanism of the data flow architecture better, which raises the cache utilization of chips based on the data flow architecture, speeds up computation of the graph, and improves overall chip performance.
Example III
Fig. 3 is a schematic structural diagram of an apparatus for sorting AI computation graphs according to a third embodiment of the present invention. This embodiment is applicable to node ordering of AI computation graphs for deep learning models based on a data flow architecture. The apparatus can implement the AI computation graph sorting method of any embodiment of the present invention, and has the corresponding functional structure and beneficial effects.
As shown in fig. 3, the sorting device for AI computation graphs provided in the third embodiment of the present invention includes: a computational graph acquisition module 310, a topological sort module 320, a branch sort module 330, and a target sort order determination module 340, wherein:
the computation graph acquisition module 310 is configured to acquire a computation graph based on a data flow architecture, where the computation graph includes a plurality of computation nodes;
the topology ordering module 320 is configured to perform topology ordering on the computation graph to obtain a first order of the computation graph;
the branch ordering module 330 is configured to determine a second ordering order of the plurality of branch computing nodes according to branch ordering of the plurality of branches in the computation graph;
the target arrangement order determining module 340 is configured to replace the arrangement order of the branch computing nodes in the first arrangement order with a second arrangement order corresponding to the branch computing nodes, so as to obtain a target arrangement order of the computation graph.
Further, the branch ordering module 330 includes:
the branch determining unit is used for determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node;
a weight determining unit, configured to determine a weight of each branch according to the number of branch computing nodes included in each branch;
the branch ordering unit is used for ordering the branches corresponding to each branch starting node according to the weight of each branch;
and the second arrangement sequence unit is used for determining the second arrangement sequence of the plurality of branch calculation nodes corresponding to each branch starting node according to the branch ordering.
Further, the branch determining unit is specifically configured to:
if the number of outputs of a calculation node is greater than 1, the calculation node is used as a branch starting node;
if the number of outputs of a calculation node located after the branch starting node is greater than 1 or equal to 0, the calculation node is used as a branch ending node, wherein one branch starting node corresponds to at least two branch ending nodes;
and forming a branch from the branch starting node to the branch ending node along the direction of the computation graph.
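Under the stated rules, branch identification can be sketched in Python; the `edges` adjacency format and all node names are illustrative assumptions, not part of the patent:

```python
def find_branches(edges):
    """Sketch of the rules above: out-degree > 1 marks a branch
    starting node; walking forward along the graph, the first node
    with out-degree > 1 or 0 ends the branch.

    edges maps each node to the list of its successors.
    """
    starts = [n for n, vs in edges.items() if len(vs) > 1]
    branches = {}
    for start in starts:
        branches[start] = []
        for succ in edges[start]:
            path, node = [start], succ
            while True:
                path.append(node)
                outs = edges[node]
                if len(outs) != 1:   # out-degree 0 or > 1: branch end
                    break
                node = outs[0]
            branches[start].append(path)
    return branches
```

On the fig. 1B graph this yields the branches B- > C, B- > D- > F and B- > E- > G- > H under the single branch starting node B, matching the example in the description.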
Further, the weight determining unit is specifically configured to:
Determining whether a composite computing node exists in the current branch, wherein the composite computing node is a computing node, other than the branch starting node, that the current branch shares with another branch;
if the current branch does not exist, taking the number of branch computing nodes included in the current branch as the weight of the current branch;
if yes, determining whether the composite computing node is a branch starting node of other branches;
if yes, taking the number of branch calculation nodes corresponding to the branch starting node of the current branch as the weight of the current branch;
if not, the weight of the current branch is set to infinity;
and after the other branches containing the composite computing node complete their branch ordering, taking the number of branch computing nodes included in the current branch as the weight of the current branch.
Further, the device further comprises: and the ping-pong buffer allocation module is used for allocating ping-pong buffers to the calculation graphs according to the target arrangement sequence of the calculation graphs.
According to the AI computation graph ordering device provided by the third embodiment of the invention, the computation graph ordering based on the data flow architecture is realized by performing topological ordering and branch ordering on the computation graph, the execution sequence of each computation node in the computation graph can be uniquely determined, and the chip performance based on the data flow architecture is improved.
Example IV
Fig. 4 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary device 412 suitable for use in implementing embodiments of the invention. The device 412 shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 4, device 412 is in the form of a general purpose device. Components of device 412 may include, but are not limited to: one or more processors 416, a storage 428, and a bus 418 that connects the various system components (including the storage 428 and the processors 416).
Bus 418 represents one or more of several types of bus structures, including a memory device bus or memory device controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the industry standard architecture (Industry Standard Architecture, ISA) bus, the micro channel architecture (Micro Channel Architecture, MCA) bus, the enhanced ISA bus, the video electronics standards association (Video Electronics Standards Association, VESA) local bus, and the peripheral component interconnect (Peripheral Component Interconnect, PCI) bus.
Device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
The storage 428 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory, RAM) 430 and/or cache memory 432. Device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in fig. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk such as a compact disc read-only memory (Compact Disc-Read Only Memory, CD-ROM), digital versatile disc (Digital Video Disc-Read Only Memory, DVD-ROM), or other optical media, may be provided. In such cases, each drive may be coupled to bus 418 via one or more data medium interfaces. Storage 428 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for example, in the storage 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 442 generally perform the functions and/or methodologies in the described embodiments of the invention.
The device 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the device 412, and/or with any device (e.g., a network card, a modem, etc.) that enables the device 412 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 422. Also, device 412 may communicate with one or more networks, such as a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN) and/or a public network such as the Internet, via network adapter 420. As shown in fig. 4, network adapter 420 communicates with the other modules of device 412 over bus 418. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with device 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, redundant arrays of independent disks (Redundant Arrays of Independent Disks, RAID) systems, tape drives, data backup storage systems, and the like.
The processor 416 executes various functional applications and data processing by running programs stored in the storage 428, for example implementing the AI computation graph sorting method provided by any embodiment of the invention, which may include:
obtaining a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes;
performing topological sorting on the calculation graphs to obtain a first arrangement sequence of the calculation graphs;
determining a second arrangement order of the plurality of branch computing nodes according to the branch ordering of the plurality of branches in the computation graph;
and replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain the target arrangement sequence of the calculation graph.
Example five
The fifth embodiment of the present invention further provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the AI computation graph sorting method provided in any embodiment of the present invention; the method may include:
obtaining a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes;
performing topological sorting on the calculation graphs to obtain a first arrangement sequence of the calculation graphs;
Determining a second arrangement order of the plurality of branch computing nodes according to the branch ordering of the plurality of branches in the computation graph;
and replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain the target arrangement sequence of the calculation graph.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method for ordering AI computation graphs, comprising:
obtaining a computation graph based on a data flow architecture, wherein the computation graph comprises a plurality of computation nodes;
performing topological sorting on the calculation graphs to obtain a first arrangement sequence of the calculation graphs;
determining a second arrangement order of the plurality of branch computing nodes according to the branch ordering of the plurality of branches in the computation graph;
replacing the arrangement sequence of the branch calculation nodes in the first arrangement sequence with a second arrangement sequence corresponding to the branch calculation nodes to obtain a target arrangement sequence of the calculation graph;
The determining a second arrangement order of the plurality of branch computing nodes according to the branch ordering of the plurality of branches in the computation graph includes:
determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node;
determining the weight of each branch according to the number of branch calculation nodes included in each branch;
according to the weight of each branch, sorting the branches corresponding to each branch starting node;
and determining a second arrangement sequence of a plurality of branch computing nodes corresponding to each branch starting node according to the branch ordering.
2. The method of claim 1, wherein determining a plurality of branch initiation nodes and a plurality of branches corresponding to each branch initiation node comprises:
if the output number of the calculation nodes is greater than 1, the calculation nodes are used as branch starting nodes;
if the output number of the calculation nodes located after the branch starting node is greater than 1 or equal to 0, the calculation nodes are used as branch ending nodes, wherein one branch starting node corresponds to at least two branch ending nodes;
and forming branches from the branch starting node to the branch ending node according to the direction of the computational graph.
3. The method of claim 2, wherein determining the weight of each branch based on the number of branch computation nodes that each branch includes comprises:
determining whether a composite computing node exists in the current branch, wherein the composite computing node is a computing node shared by the current branch and other branches except a branch starting node;
if not, the number of branch calculation nodes included in the current branch is taken as the weight of the current branch.
4. The method of claim 3, wherein after determining whether a composite compute node is present in the current branch, further comprising:
if yes, determining whether the composite computing node is a branch starting node of other branches;
if yes, the number of branch calculation nodes corresponding to the branch starting node of the current branch is used as the weight of the current branch.
5. The method of claim 4, wherein after determining whether the composite computing node is a branch start node for another branch, further comprising:
if not, the weight of the current branch is set to infinity;
and after the branch sequencing is completed by other branches where the composite computing node is located, taking the number of the branch computing nodes included in the current branch as the weight of the current branch.
6. The method of claim 1, wherein sorting the plurality of branches corresponding to each branch initiation node according to the weight of each branch further comprises:
and allocating ping-pong caches to the computational graph according to the target arrangement sequence of the computational graph.
7. An AI computation graph sorting apparatus, comprising:
a computation graph acquisition module, configured to acquire a computation graph based on a data flow architecture, where the computation graph includes a plurality of computation nodes;
the topology ordering module is used for performing topology ordering on the calculation graphs to obtain a first arrangement sequence of the calculation graphs;
a branch ordering module, configured to determine a second ordering order of a plurality of branch computing nodes according to branch ordering of a plurality of branches in the computation graph;
the target arrangement order determining module is used for replacing the arrangement order of the branch computing nodes in the first arrangement order with a second arrangement order corresponding to the branch computing nodes to obtain a target arrangement order of the computation graph;
the branch ordering module comprises:
the branch determining unit is used for determining a plurality of branch starting nodes and a plurality of branches corresponding to each branch starting node;
A weight determining unit, configured to determine a weight of each branch according to the number of branch computing nodes included in each branch;
the branch ordering unit is used for ordering the branches corresponding to each branch starting node according to the weight of each branch;
and the second arrangement sequence unit is used for determining the second arrangement sequence of the plurality of branch calculation nodes corresponding to each branch starting node according to the branch ordering.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of ordering AI computation graphs of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the AI computation graph ordering method of any one of claims 1-6.
CN202010577847.1A 2020-06-22 2020-06-22 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs Active CN111752691B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010577847.1A CN111752691B (en) 2020-06-22 2020-06-22 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs
PCT/CN2021/098307 WO2021259041A1 (en) 2020-06-22 2021-06-04 Ai computational graph sorting method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010577847.1A CN111752691B (en) 2020-06-22 2020-06-22 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs

Publications (2)

Publication Number Publication Date
CN111752691A CN111752691A (en) 2020-10-09
CN111752691B true CN111752691B (en) 2023-11-28

Family

ID=72675663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010577847.1A Active CN111752691B (en) 2020-06-22 2020-06-22 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs

Country Status (2)

Country Link
CN (1) CN111752691B (en)
WO (1) WO2021259041A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111752691B (en) * 2020-06-22 2023-11-28 深圳鲲云信息科技有限公司 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs
CN112308449B (en) * 2020-11-12 2023-04-14 山东鲁软数字科技有限公司 Method and system for sorting devices in branches in ring network diagram of power distribution automation system
CN114021708B (en) * 2021-09-30 2023-08-01 浪潮电子信息产业股份有限公司 Data processing method, device and system, electronic equipment and storage medium
CN114911630B (en) * 2022-07-14 2022-11-04 小米汽车科技有限公司 Data processing method and device, vehicle, storage medium and chip

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473265A (en) * 2013-07-25 2013-12-25 深圳市华傲数据技术有限公司 Method and device for analyzing layout of flow chart
CN107679010A (en) * 2017-09-20 2018-02-09 东南大学 A kind of operator mapped system and method towards reconfigureable computing array
CN109656568A (en) * 2018-12-28 2019-04-19 黑龙江省工业技术研究院 On-demand reducible program control flowchart figure accessibility indexing means
CN110321999A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural computing figure optimization method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9032362B2 (en) * 2012-09-10 2015-05-12 Sap Se System and method for generating high performance calculators for calculation graphs
CN111752691B (en) * 2020-06-22 2023-11-28 深圳鲲云信息科技有限公司 Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473265A (en) * 2013-07-25 2013-12-25 深圳市华傲数据技术有限公司 Method and device for analyzing layout of flow chart
CN107679010A (en) * 2017-09-20 2018-02-09 东南大学 A kind of operator mapped system and method towards reconfigureable computing array
CN110321999A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural computing figure optimization method
CN109656568A (en) * 2018-12-28 2019-04-19 黑龙江省工业技术研究院 On-demand reducible program control flowchart figure accessibility indexing means

Also Published As

Publication number Publication date
CN111752691A (en) 2020-10-09
WO2021259041A1 (en) 2021-12-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant