WO2021259041A1 - Sorting method, apparatus, device, and storage medium for AI computation graphs - Google Patents


Info

Publication number
WO2021259041A1
WO2021259041A1 · PCT/CN2021/098307 · CN2021098307W
Authority
WO
WIPO (PCT)
Prior art keywords
branch
calculation
node
nodes
computing
Prior art date
Application number
PCT/CN2021/098307
Other languages
English (en)
French (fr)
Inventor
邹伟
熊超
蔡权雄
牛昕宇
Original Assignee
深圳鲲云信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司
Publication of WO2021259041A1 publication Critical patent/WO2021259041A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and, for example, to a sorting method, apparatus, device, and storage medium for artificial intelligence (AI) calculation graphs.
  • A deep learning model is essentially a calculation graph; for example, a convolutional neural network model is essentially a directed acyclic calculation graph.
  • The calculation graph contains a large number of calculation nodes. Each calculation node represents a calculation operation and has input dependencies: all input nodes of the current calculation node must be calculated before the current calculation node can be executed.
  • The development of artificial intelligence chips is mostly based on the instruction set architectures of the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). CPUs and GPUs usually have instruction control units and mature caching mechanisms, so a calculation graph can be directly input to an instruction-set-architecture artificial intelligence chip and run.
  • An artificial intelligence chip based on the data flow architecture is highly valued because of its high utilization and low latency.
  • However, the data flow architecture and the instruction set architecture are different architectures; the instruction control unit and cache mechanism of the instruction set architecture cannot be transplanted to the data flow architecture, and an artificial intelligence chip developed on the data flow architecture has no instruction control unit and executes only in the order of a pre-optimized calculation graph. Therefore, there is an urgent need for a processing method that optimizes the ordering of the calculation graph's nodes in advance, so that an artificial intelligence chip developed on the data flow architecture can run the calculation graph normally.
  • the present application provides a sorting method, device, device, and storage medium of AI calculation graphs, so as to realize the sorting of calculation graphs based on the data flow architecture, improve the utilization rate of the chip cache based on the data flow architecture, and improve the performance of the chip.
  • A sorting method for AI calculation graphs is provided, including: obtaining a calculation graph based on a data flow architecture, where the calculation graph includes multiple calculation nodes; performing topological sorting on the multiple calculation nodes to obtain a first arrangement order of the calculation graph; determining, according to the branch ordering of multiple branches in the calculation graph, a second arrangement order of the multiple branch calculation nodes among the multiple calculation nodes; and replacing the arrangement order of the multiple branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the multiple branch calculation nodes to obtain a target arrangement order of the calculation graph.
  • A sorting apparatus for AI calculation graphs is provided, including:
  • a calculation graph obtaining module, configured to obtain a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; a topological sorting module, configured to perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph; a branch ordering module, configured to determine, according to the branch ordering of the multiple branches in the calculation graph, a second arrangement order of the multiple branch calculation nodes among the multiple calculation nodes; and a target arrangement order determining module, configured to replace the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes to obtain the target arrangement order of the calculation graph.
  • a device including:
  • one or more processors; and a storage device configured to store one or more programs, where, when the one or more programs are executed by the one or more processors, the one or more processors implement the above sorting method for AI calculation graphs.
  • A computer-readable storage medium is also provided, on which a computer program is stored; when the program is executed by a processor, the above-mentioned sorting method for AI calculation graphs is implemented.
  • FIG. 1A is a schematic flowchart of a method for sorting AI calculation graphs according to Embodiment 1 of this application;
  • FIG. 1B is a schematic structural diagram of a calculation graph provided in Embodiment 1 of this application.
  • FIG. 2A is a schematic flowchart of a method for sorting AI calculation graphs according to Embodiment 2 of this application;
  • FIG. 2B is a schematic flowchart of a method for determining branch weights according to Embodiment 2 of this application;
  • FIG. 2C is a schematic structural diagram of a calculation graph provided in Embodiment 2 of this application.
  • FIG. 2D is a schematic diagram of ping-pong buffer allocation of a calculation graph provided in Embodiment 2 of this application;
  • FIG. 2E is a schematic diagram of the calculation sequence of multiple computing nodes in a calculation graph after topological sorting according to Embodiment 2 of this application;
  • FIG. 2F is a schematic diagram of the calculation sequence of multiple computing nodes in a calculation graph after branch sorting according to Embodiment 2 of this application;
  • FIG. 3 is a schematic structural diagram of a sorting device for AI calculation graphs provided in Embodiment 3 of this application;
  • FIG. 4 is a schematic structural diagram of a device provided in Embodiment 4 of this application.
  • The terms “first”, “second”, etc. may be used herein to describe various directions, actions, steps or elements, but these directions, actions, steps or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another.
  • the terms “first”, “second”, etc. cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, the features defined with “first” and “second” may explicitly or implicitly include one or more of these features.
  • Unless otherwise defined, “multiple” and “batch” mean at least two, for example two or three.
  • FIG. 1A is a schematic flowchart of a method for sorting AI calculation graphs according to Embodiment 1 of this application.
  • This embodiment is applicable to the node sorting of calculation graphs of a deep learning model based on a data flow architecture.
  • The method can be executed by a sorting device for AI calculation graphs, which can be implemented in hardware and/or software.
  • the method for sorting AI calculation graphs provided in Embodiment 1 of the present application includes:
  • the calculation graph based on the data flow architecture refers to the calculation graph of the deep learning model developed based on the data flow architecture.
  • A calculation graph is a calculation process with a directed acyclic graph (DAG) as its data structure. It includes multiple calculation nodes, each representing an arithmetic operation or a physical operation: arithmetic operations such as addition, subtraction, and multiplication; physical operations such as shape transformation and slicing of multi-dimensional data.
  • To perform topological sorting on a directed acyclic graph G is to arrange all vertices of G into a linear sequence such that for any pair of vertices u and v, if the edge (u, v) ∈ E(G), then u appears before v in the linear sequence. Topological sorting of the calculation graph therefore sorts all computing nodes in the calculation graph according to the data flow direction in the graph, and the result of the topological sorting is called the first arrangement order.
  • the sorting result obtained by topological sorting is not unique.
  • Exemplarily, referring to the calculation graph shown in FIG. 1B, the first arrangement order obtained after topological sorting can be A->B->C->D->E->F->G->H; it can also be A->B->C->D->F->E->G->H, or A->B->C->E->G->H->D->F.
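  • As an illustration only (the patent itself gives no code), the topological sorting step above can be sketched with Kahn's algorithm; the edge list below is an assumption modeled on the A–H graph the text describes for FIG. 1B, and tie-breaking by insertion order picks just one of the several valid orders:

```python
from collections import deque

def topological_sort(edges, nodes):
    """Kahn's algorithm: repeatedly emit a node whose inputs have all
    been calculated. Ties are broken by insertion order, so the result
    is only one of several valid topological orders."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order

# Assumed edges, modeled on the A-H graph described for FIG. 1B.
nodes = list("ABCDEFGH")
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"),
         ("D", "F"), ("E", "G"), ("G", "H")]
print("->".join(topological_sort(edges, nodes)))  # A->B->C->D->E->F->G->H
```

  • Changing the tie-breaking (e.g. popping from the right) would yield another of the equally valid first arrangement orders noted above.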
  • When a computing node has multiple outputs, it can form a branch along each output direction. Such a computing node is called the branch start node, and the computing nodes within a branch are all called branch computing nodes. One branch start node can thus form at least two branches.
  • The multiple branches can be sorted in ascending order according to the weight of each branch. The weight of a branch can be generated according to a preset rule: for example, the number of branch computing nodes included in the branch is directly set as the branch's weight, or the weight is set according to the node types of the computing nodes.
  • Exemplarily, referring to FIG. 1B, the branches in the figure are B->C, B->D->F, and B->E->G->H. Mark branch B->C as branch 1, branch B->D->F as branch 2, and branch B->E->G->H as branch 3, and take the number of branch computing nodes contained in a branch as its weight. Then the weight of branch 1 is 2, the weight of branch 2 is 3, and the weight of branch 3 is 4. Sorting the branches in ascending order of weight gives the branch order 1, 2, 3: branch 1 is executed first, then branch 2, and finally branch 3, yielding the second arrangement order of all branch computing nodes: B->C->D->F->E->G->H.
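  • The branch-sorting step just described can be sketched as follows — a minimal illustration, assuming (as in the example) that a branch's weight is simply its node count and that the shared branch start node is emitted only once:

```python
def branch_second_order(branches):
    """Sort the branches of one branch start node by ascending weight
    (here the weight is simply the branch's node count) and splice them
    into one sequence, emitting the shared start node only once."""
    ranked = sorted(branches, key=len)      # ascending weight
    order = [ranked[0][0]]                  # the branch start node
    for branch in ranked:
        order.extend(branch[1:])            # nodes after the start node
    return order

# The three branches of start node B from the example above.
branches = [["B", "C"], ["B", "D", "F"], ["B", "E", "G", "H"]]
print("->".join(branch_second_order(branches)))  # B->C->D->F->E->G->H
```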
  • After topological sorting, the sequence of computing nodes near a branch start node is usually not unique, that is, the order of the branch computing nodes within the first arrangement order is not unique, whereas the second arrangement order of the branch computing nodes determined by branch sorting is unique. Therefore, replacing the arrangement order of the corresponding branch computing nodes in the first arrangement order with the second arrangement order yields a unique arrangement order for the calculation graph, namely the target arrangement order of the calculation graph.
  • the first arrangement order obtained by topological sorting is: A->B->C->D->E->F->G->H
  • the second arrangement order of the computing nodes is: B->C->D->F->E->G->H
  • the target arrangement order of the calculation graph is: A->B->C->D->F->E->G->H.
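  • The replacement step can be sketched as a positional substitution — an illustrative sketch (not the patent's implementation): keep the non-branch nodes where topological sorting placed them, and fill the positions occupied by branch nodes with the second arrangement order.

```python
def target_order(first_order, second_order):
    """Keep non-branch nodes where topological sorting placed them and
    fill the slots occupied by branch nodes with the branch-sorted
    (second) arrangement order, preserving its internal order."""
    branch_nodes = set(second_order)
    replacement = iter(second_order)
    return [next(replacement) if n in branch_nodes else n
            for n in first_order]

first = list("ABCDEFGH")                        # one topological order
second = ["B", "C", "D", "F", "E", "G", "H"]    # branch-sorted order
print("->".join(target_order(first, second)))   # A->B->C->D->F->E->G->H
```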
  • The AI calculation graph sorting method provided in Embodiment 1 of the present application realizes the sorting of calculation graphs based on the data flow architecture through topological sorting and branch sorting of the calculation graph, can uniquely determine the execution order of the multiple calculation nodes in the calculation graph, and improves the performance of chips based on the data flow architecture.
  • FIG. 2A is a schematic flowchart of a method for sorting AI calculation graphs according to Embodiment 2 of this application. This embodiment is an illustration of the foregoing embodiment. As shown in FIG. 2A, the method for sorting the AI calculation graph provided in the second embodiment of the present application includes:
  • S220 Perform topological sorting on the multiple computing nodes to obtain a first arrangement order of the computing graph.
  • The calculation graph of a deep learning model is a directed acyclic graph that usually contains a large number of calculation nodes and has a relatively complex structure, and therefore has multiple branch start nodes. Since each branch is determined from its corresponding branch start node, the branch start node must be determined before the branch itself.
  • Determining multiple branch starting nodes and multiple branches corresponding to each branch starting node includes steps S231 to S233 (not shown in the figure).
  • When a computing node has multiple outputs, that is, when its output number is greater than 1, the computing node is called a branch start node. Exemplarily, referring to the calculation graph shown in FIG. 1B, computing node B can be determined to be the branch start node.
  • the branch termination node refers to the last calculation node of the branch, so the branch termination node must be located after the branch start node.
  • When the output number of a calculation node located after the branch start node is greater than 1, that node can serve as another branch start node, so the current branch should end at that node; that is, the node is a branch termination node corresponding to the current branch start node. When the output number of a calculation node located after the branch start node equals 0, the node is the last computing node and likewise serves as a branch termination node corresponding to the current branch start node.
  • the branch starting node is a computing node whose output number is greater than 1, that is, the output of the branch starting node is connected to at least two computing nodes. Therefore, one branch starting node corresponds to at least two branch ending nodes.
  • the branch termination node corresponding to the branch start node B includes a calculation node C, a calculation node F, and a calculation node H.
  • A branch is equivalent to a linear arrangement whose start point and end point have been determined; the branch is then determined by following the data flow direction, that is, the direction of the calculation graph.
  • the branch formed by the branch starting node B and the branch ending node C is: B->C
  • the branch formed by the branch starting node B and the branch ending node F is: B->D->F
  • the branch composed of the branch start node B and the branch end node H is: B->E->G->H.
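  • The branch-determination steps S231 to S233 can be sketched as follows — a hypothetical helper that finds each branch start node (output number > 1) and walks each output direction until a termination node (output number > 1 or equal to 0); the adjacency map is assumed from the FIG. 1B description:

```python
def find_branches(successors):
    """For every branch start node (output number > 1), walk each output
    direction until a branch termination node: a node whose output
    number is greater than 1 (a new branch start) or equal to 0 (the
    last node). The termination node is included in the branch."""
    branches = {}
    for node, outs in successors.items():
        if len(outs) <= 1:
            continue                            # not a branch start node
        paths = []
        for first in outs:                      # one branch per output
            branch, cur = [node, first], first
            while len(successors.get(cur, [])) == 1:
                cur = successors[cur][0]
                branch.append(cur)
            paths.append(branch)
        branches[node] = paths
    return branches

# Assumed adjacency, modeled on the FIG. 1B description.
successors = {"A": ["B"], "B": ["C", "D", "E"], "C": [],
              "D": ["F"], "E": ["G"], "F": [], "G": ["H"], "H": []}
print(find_branches(successors))
# {'B': [['B', 'C'], ['B', 'D', 'F'], ['B', 'E', 'G', 'H']]}
```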
  • S240 Determine the weight of each branch according to the number of branch calculation nodes included in each branch.
  • Determining the branch weight based on the number of branch calculation nodes included in the branch means setting that number as the branch's weight.
  • Exemplarily, the number of branch calculation nodes of branch B->C is 2, so the weight of branch B->C is 2; the number of branch calculation nodes of branch B->D->F is 3, so its weight is 3; the number of branch calculation nodes of branch B->E->G->H is 4, so its weight is 4.
  • the method for determining branch weights includes:
  • Determine whether a composite computing node exists in the current branch; if not, step S242 is executed, otherwise, step S243 is executed.
  • If no composite computing node exists, the current branch is a single branch, that is, the current branch has no cross relationship with other branches, and the number of branch computing nodes included in the current branch can be used as the weight of the current branch.
  • Exemplarily, the branches B->C, B->D->F and B->E->G->H are all single branches, so the corresponding weights can be set to 2, 3 and 4.
  • If a composite computing node exists in the current branch, step S244 is executed at this time.
  • A composite computing node is a computing node in the current branch other than the branch start node and the branch termination node whose input number is greater than 1 and whose output number is 1; that is, the input of the composite computing node depends on the outputs of at least two computing nodes, and the composite computing node is also a computing node of other branches (again excluding their branch start and termination nodes).
  • If the composite computing node is the branch start node of other branches, step S245 is executed.
  • If the composite computing node is the branch start node of other branches, it means that there are computing nodes to be calculated after the current branch, that is, the output data of the current branch's termination node is not the final output data. Among all the branches corresponding to the current branch's start node, the weight of the current branch should therefore be set to the largest, so that the current branch is executed last and data on other branches is not overwritten.
  • Setting the weight of the current branch to the maximum can mean taking the number of branch calculation nodes corresponding to the current branch's start node as the weight of the current branch, or setting the sum of the weights of all branches of that start node as the weight of the current branch.
  • Exemplarily, referring to the calculation graph shown in FIG. 2C, the branches corresponding to B include B->C, B->D->F, and B->E->G, with weights 2, 3, and 3 respectively. If the sum of all branch weights is set as the weight of the current branch, the weight of the current branch is 2+3+3=8; if the number of branch calculation nodes corresponding to branch start node B is set as the weight of the current branch, that number is 6, so the weight of the current branch is 6.
  • Thus the weights of the branches B->C, B->D->F and B->E->G corresponding to branch start node B are finally 2, 3, and 8, or correspondingly 2, 3, and 6.
  • If the composite computing node is not the branch start node of other branches, that is, the composite computing node is merely a computing node in the current branch other than the branch start node and the branch termination node, then the current branch still depends on the output data of other branches as input. The current branch can run normally only after all of its input computing nodes have been calculated; therefore, the weight of the current branch is set to infinity at this time.
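  • The weight rules of these steps might be sketched like this — a simplified, hypothetical helper (the function name, signature, and adjacency maps are assumptions; the patent's deferred re-weighting of infinite-weight branches after their dependencies are sorted is omitted here):

```python
import math

def branch_weight(branch, sibling_branches, successors, predecessors):
    """Weight rules sketched from the steps above:
      - no composite node in the branch      -> node count of the branch
      - composite node that is a branch
        start node of other branches         -> sum of the sibling branch
                                                weights (branch runs last)
      - composite node that starts no branch -> infinity (deferred until
                                                the branches it depends on
                                                have been sorted)"""
    interior = branch[1:-1]    # exclude branch start and termination nodes
    composite = [n for n in interior if len(predecessors[n]) > 1]
    if not composite:
        return len(branch)                       # single branch
    if any(len(successors[n]) > 1 for n in composite):
        return sum(len(b) for b in sibling_branches)
    return math.inf

# Single branches (no composite node): the weight is the node count.
successors   = {"A": ["B"], "B": ["C", "D", "E"], "C": [],
                "D": ["F"], "E": ["G"], "F": [], "G": ["H"], "H": []}
predecessors = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["B"],
                "F": ["D"], "G": ["E"], "H": ["G"]}
siblings = [["B", "C"], ["B", "D", "F"], ["B", "E", "G", "H"]]
print([branch_weight(b, siblings, successors, predecessors)
       for b in siblings])  # [2, 3, 4]
```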
  • S250 Perform branch sorting on multiple branches corresponding to the start node of each branch according to the weight of each branch.
  • S260 Determine a second arrangement order of the multiple branch calculation nodes corresponding to each branch start node according to the branch order.
  • S280 Allocate a ping-pong buffer to the calculation graph according to the target arrangement sequence of the calculation graph.
  • In the data flow architecture, the data calculation process usually uses a ping-pong cache mechanism: during calculation, data is cached alternately in two on-chip caches, while final output data is cached in an external low-speed cache. At the same time, the external low-speed cache also backs up the calculation data of every computing node. Data exchange between computing nodes through the on-chip caches is fast, whereas transmitting output to the external low-speed cache takes longer; however, the storage space of the on-chip caches is usually small, while that of the external low-speed cache is larger.
  • the ping-pong buffer is allocated to the calculation graph according to the target arrangement order, so that the data during the calculation graph running process is stored in the on-chip cache as much as possible, which can greatly reduce the calculation time of the calculation graph and improve the performance of the chip.
  • the cache mode marked in the figure indicates the data cache location of each computing node after the calculation is completed.
  • Exemplarily, referring to FIG. 2D, the data calculated by computing node A is stored in cache B1; after computing node B has completed its calculation, its data is stored in cache B2; and the data calculated by computing nodes D and E must be stored in cache B1. Since computing nodes C, F, and H have no outputs, the output data of computing nodes C, F, and H all serve as final output data and are stored in the low-speed cache D1.
  • After topological sorting, a possible arrangement order is: A->B->C->D->E->F->G->H.
  • the calculation order is graphically represented, and the calculation order of multiple calculation nodes in the calculation graph after topological sorting is shown in Fig. 2E.
  • the calculation order corresponding to calculation node A is 1, which means that calculation node A executes first.
  • Under this ordering, computing node E executes before computing node F and writes its result to cache B1, overwriting the data of computing node D, that is, the input data of computing node F is overwritten. The backup data of computing node D then needs to be loaded from the low-speed cache D1, which greatly increases the time consumed by data transmission; and if the backup data is loaded incorrectly, a calculation error results.
  • the target arrangement order of the obtained calculation graph is: A->B->C->D->F->E->G->H
  • The calculation sequence of the multiple computing nodes is shown in FIG. 2F. As can be seen from FIG. 2F, computing node D completes its calculation and caches its data in cache B1; next, computing node F obtains the data of computing node D from cache B1 for calculation. Computing node E then obtains the data of computing node B from cache B2 for calculation and caches its data in cache B1, and computing node G obtains the data of computing node E from cache B1 for calculation and caches its data in cache B2. It can thus be seen that under this ordering, the data of the computing nodes can be kept in the high-speed caches to the greatest extent, which speeds up the calculation, avoids data overwriting during the calculation process, and also avoids data transmission errors.
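  • The effect of the two orderings on the ping-pong cache can be checked with a small simulation — an illustrative sketch in which `producer_buf` encodes the cache assignments the text reads off FIG. 2D, and a slow reload is counted whenever a node's input has already been evicted from its on-chip cache:

```python
def slow_reloads(order, producer_buf, inputs):
    """Count how often a node's input has been evicted from its on-chip
    buffer and must be re-loaded from the external low-speed cache D1.
    producer_buf maps each node to the buffer its output is written to:
    'B1'/'B2' (on-chip) or 'D1' (final output, low-speed cache)."""
    held = {"B1": None, "B2": None}   # which node's data each buffer holds
    reloads = 0
    for node in order:
        for src in inputs.get(node, []):
            buf = producer_buf[src]
            if buf in held and held[buf] != src:
                reloads += 1          # evicted: fetch the backup from D1
        out = producer_buf[node]
        if out in held:
            held[out] = node          # overwrite the buffer's contents
    return reloads

inputs = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["B"],
          "F": ["D"], "G": ["E"], "H": ["G"]}
producer_buf = {"A": "B1", "B": "B2", "C": "D1", "D": "B1",
                "E": "B1", "F": "D1", "G": "B2", "H": "D1"}
print(slow_reloads(list("ABCDEFGH"), producer_buf, inputs))            # 1
print(slow_reloads(["A", "B", "C", "D", "F", "E", "G", "H"],
                   producer_buf, inputs))                              # 0
```

  • For the topological order A–H the simulation counts the one reload of node D's backup described above; for the target arrangement order it counts none.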
  • The sorting method for AI calculation graphs provided in the second embodiment of the present application can uniquely determine the execution order of the multiple calculation nodes in the calculation graph by performing topological sorting and branch sorting on the calculation graph, and realizes the sorting of calculation graphs based on the data flow architecture.
  • the calculation process of the calculation graph can be more suitable for the ping-pong cache mechanism within the data flow architecture, which improves the utilization rate of the chip cache based on the data flow architecture, makes the calculation speed of the calculation graph faster, and improves the overall performance of the chip.
  • FIG. 3 is a schematic structural diagram of a sorting device for AI calculation graphs provided in the third embodiment of the application.
  • This embodiment is applicable to the node sorting of the calculation graphs of the deep learning model based on the data flow architecture.
  • The device can execute the sorting method for AI calculation graphs provided in any embodiment of this application and has the functional structure and effects corresponding to the executed method. For content not described in detail in this embodiment, refer to the description of any method embodiment of this application.
  • The AI calculation graph sorting apparatus includes: a calculation graph acquisition module 310, a topological sorting module 320, a branch sorting module 330, and a target arrangement order determining module 340.
  • the calculation graph obtaining module 310 is configured to obtain a calculation graph based on a data flow architecture, and the calculation graph includes a plurality of calculation nodes; the topological sorting module 320 is configured to perform topological sorting on the plurality of calculation nodes to obtain the calculation graph The first arrangement order; the branch ordering module 330 is set to determine the second arrangement order of the multiple branch computing nodes according to the branch ordering of the multiple branches in the calculation graph; the target arrangement order determining module 340 is set to set the first arrangement The arrangement order of the plurality of branch calculation nodes in the sequence is replaced with the second arrangement order corresponding to the plurality of branch calculation nodes to obtain the target arrangement order of the calculation graph.
  • the branch sorting module 330 includes:
  • the branch determining unit is configured to determine multiple branch starting nodes and multiple branches corresponding to each branch starting node; the weight determining unit is configured to determine the weight of each branch according to the number of branch calculation nodes included in each branch;
  • the branch sorting unit is set to sort the multiple branches corresponding to each branch starting node according to the weight of each branch; the second sorting sequence unit is set to determine the location of each branch starting node according to the branch sorting The second arrangement order of the corresponding multiple branch calculation nodes.
  • the branch determination unit is set to:
  • If the output number of a computing node is greater than 1, the computing node is used as a branch start node; if the output number of a computing node located after the branch start node is greater than 1 or equal to 0, that computing node is used as a branch termination node, where one branch start node corresponds to at least two branch termination nodes; and a branch is formed from the branch start node to the branch termination node according to the direction of the calculation graph.
  • the weight determination unit is set to:
  • Determine whether a composite computing node exists in the current branch, where the composite computing node is a computing node, other than the branch start node, shared by the current branch and other branches; if no composite computing node exists in the current branch, the number of branch computing nodes included in the current branch is used as the weight of the current branch; if a composite computing node exists in the current branch, determine whether the composite computing node is the branch start node of other branches; if the composite computing node is the branch start node of other branches, the number of branch calculation nodes corresponding to the branch start node of the current branch is taken as the weight of the current branch; if the composite computing node is not the branch start node of other branches, the weight of the current branch is set to infinity, and after the other branches where the composite computing node is located have completed branch sorting, the number of branch calculation nodes included in the current branch is used as the weight of the current branch.
  • the device further includes: a ping-pong buffer allocation module configured to allocate a ping-pong buffer to the calculation graph according to the target arrangement sequence of the calculation graph.
  • the AI calculation graph sorting device provided in the third embodiment of the present application realizes the sorting of the calculation graph based on the data flow architecture by performing topological sorting and branch sorting on the calculation graph, and can uniquely determine the execution order of multiple calculation nodes in the calculation graph. Improved chip performance based on data flow architecture.
  • FIG. 4 is a schematic structural diagram of a device provided in Embodiment 4 of this application.
  • Figure 4 shows a block diagram of an exemplary device 412 suitable for implementing embodiments of the present application.
  • the device 412 shown in FIG. 4 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the device 412 is represented in the form of a general-purpose device.
  • the components of the device 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 connecting different system components (including the storage device 428 and the processor 416).
  • the bus 418 represents one or more of several types of bus structures, including a storage device bus or a storage device controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any bus structure among multiple bus structures.
  • Exemplarily, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • the device 412 includes a variety of computer system readable media. These media can be any available media that can be accessed by the device 412, including volatile and non-volatile media, removable and non-removable media.
  • the storage device 428 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (RAM) 430 and/or a cache memory 432.
  • the device 412 may include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • the storage system 434 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 4, usually referred to as a "hard drive").
  • a magnetic disk drive configured to read and write a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive configured to read and write a removable non-volatile optical disk (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media), may be provided.
  • each drive may be connected to the bus 418 through one or more data media interfaces.
  • the storage device 428 may include at least one program product, and the program product has a set (for example, at least one) program modules, and these program modules are configured to perform the functions of the embodiments of the present application.
  • a program/utility tool 440 having a set of (at least one) program module 442 may be stored in, for example, the storage device 428.
  • such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the program module 442 generally executes the functions and/or methods in the embodiments described in this application.
  • the device 412 may also communicate with one or more external devices 414 (for example, a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the device 412, and/or with any device (such as a network card, a modem, etc.) that enables the device 412 to communicate with one or more other computing devices. This communication may be performed through an input/output (Input/Output, I/O) interface 422.
  • the device 412 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 420.
  • as shown in FIG. 4, the network adapter 420 communicates with the other modules of the device 412 through the bus 418.
  • other hardware and/or software modules can be used in conjunction with the device 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
  • the processor 416 executes a variety of functional applications and data processing by running programs stored in the storage device 428, for example, to implement the AI calculation graph sorting method provided by any embodiment of the present application.
  • the method may include:
  • obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; performing topological sorting on the calculation graph to obtain a first arrangement order of the plurality of calculation nodes; determining a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
  • the fifth embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the method for sorting an AI calculation graph provided in any embodiment of the present application is implemented.
  • the method includes: obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; performing topological sorting on the calculation graph to obtain a first arrangement order of the plurality of calculation nodes; determining a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
  • the computer storage medium of the embodiment of the present application may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • the computer program code used to perform the operations of this application can be written in one or more programming languages or a combination thereof.
  • the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or terminal.
  • the remote computer may be connected to the user's computer through any kind of network including LAN or WAN, or may be connected to an external computer (for example, using an Internet service provider to connect through the Internet).

Abstract

A method, apparatus, device, and storage medium for sorting an AI calculation graph. The method for sorting an AI calculation graph includes: obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes (S110); performing topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph (S120); determining a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to the branch ordering of a plurality of branches in the calculation graph (S130); and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph (S140).

Description

Method, apparatus, device, and storage medium for sorting an AI calculation graph
This application claims priority to Chinese patent application No. 202010577847.1, filed with the China National Intellectual Property Administration on June 22, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, for example, to a method, apparatus, device, and storage medium for sorting an Artificial Intelligence (AI) calculation graph.
Background
A deep learning model is essentially a calculation graph. For example, a convolutional neural network model is essentially a directed acyclic calculation graph containing a large number of calculation nodes. Each calculation node represents a calculation operation and has input dependencies: a calculation node can only be executed after all of its input nodes have finished computing.
Most artificial intelligence chips are developed based on the instruction set architectures of the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). CPUs and GPUs usually have instruction control units and mature caching mechanisms, so a calculation graph can be fed directly into an instruction-set-architecture AI chip for execution.
AI chips developed based on a data flow architecture have attracted attention for their high utilization and low latency. However, the data flow architecture is not the same architecture as the instruction set architecture: the instruction control unit and caching mechanism of the instruction set architecture cannot be reused in the data flow architecture. Moreover, an AI chip developed based on a data flow architecture has no instruction control unit and executes only according to a pre-optimized order of the calculation graph. Therefore, a processing method is urgently needed to optimize the node ordering of the calculation graph in advance, so that an AI chip based on a data flow architecture can execute the calculation graph correctly.
Summary
This application provides a method, apparatus, device, and storage medium for sorting an AI calculation graph, so as to realize the sorting of a calculation graph based on a data flow architecture, improve the cache utilization of a chip based on a data flow architecture, and improve chip performance.
A method for sorting an AI calculation graph is provided, including:
obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; performing topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph; determining a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
An apparatus for sorting an AI calculation graph is also provided, including:
a calculation graph obtaining module, configured to obtain a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; a topological sorting module, configured to perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph; a branch sorting module, configured to determine a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and a target arrangement order determining module, configured to replace the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain the target arrangement order of the calculation graph.
A device is also provided, including:
one or more processors; and a storage apparatus configured to store one or more programs, where when the one or more programs are executed by the one or more processors, the one or more processors implement the above method for sorting an AI calculation graph.
A computer-readable storage medium is also provided, storing a computer program which, when executed by a processor, implements the above method for sorting an AI calculation graph.
Brief Description of the Drawings
FIG. 1A is a schematic flowchart of a method for sorting an AI calculation graph provided in Embodiment 1 of this application;
FIG. 1B is a schematic structural diagram of a calculation graph provided in Embodiment 1 of this application;
FIG. 2A is a schematic flowchart of a method for sorting an AI calculation graph provided in Embodiment 2 of this application;
FIG. 2B is a schematic flowchart of a method for determining branch weights provided in Embodiment 2 of this application;
FIG. 2C is a schematic structural diagram of a calculation graph provided in Embodiment 2 of this application;
FIG. 2D is a schematic diagram of ping-pong buffer allocation for a calculation graph provided in Embodiment 2 of this application;
FIG. 2E is a schematic diagram of the calculation order of a plurality of calculation nodes in a calculation graph after topological sorting, provided in Embodiment 2 of this application;
FIG. 2F is a schematic diagram of the calculation order of a plurality of calculation nodes in a calculation graph after branch sorting, provided in Embodiment 2 of this application;
FIG. 3 is a schematic structural diagram of an apparatus for sorting an AI calculation graph provided in Embodiment 3 of this application;
FIG. 4 is a schematic structural diagram of a device provided in Embodiment 4 of this application.
Detailed Description
This application is described below with reference to the accompanying drawings and embodiments.
Before discussing the exemplary embodiments, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes multiple steps as sequential processing, many of the steps may be performed in parallel, concurrently, or simultaneously. In addition, the order of the steps may be rearranged. The processing may be terminated when its operations are completed, but may also have additional steps not included in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
In addition, the terms "first", "second", etc. may be used herein to describe various directions, actions, steps, elements, etc., but these directions, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step, or element from another. The terms "first", "second", etc. shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "a plurality of" and "a batch of" mean at least two, for example two or three, unless otherwise defined.
Embodiment 1
FIG. 1A is a schematic flowchart of a method for sorting an AI calculation graph provided in Embodiment 1 of this application. This embodiment is applicable to the node sorting of a calculation graph of a deep learning model based on a data flow architecture. The method may be implemented by an apparatus for sorting an AI calculation graph, which may be realized in hardware or software. As shown in FIG. 1A, the method for sorting an AI calculation graph provided in Embodiment 1 of this application includes:
S110: Obtain a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes.
A calculation graph based on a data flow architecture refers to the calculation graph of a deep learning model developed based on a data flow architecture. A calculation graph is a calculation flow whose data structure is a Directed Acyclic Graph (DAG). It includes a plurality of calculation nodes, each of which represents an arithmetic operation (such as addition, subtraction, multiplication, or division) or a physical operation (such as shape transformation or slicing of multi-dimensional data).
S120: Perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph.
Topologically sorting a directed acyclic graph G means arranging all vertices of G into a linear sequence such that for any pair of vertices u and v in the graph, if the edge (u, v) ∈ E(G), then u appears before v in the linear sequence. Topologically sorting the calculation graph thus means sorting all calculation nodes in the calculation graph according to the data flow direction in the graph; the result of the topological sorting is called the first arrangement order.
Usually, when a calculation node has multiple outputs, the result of topological sorting is not unique. For example, for the calculation graph shown in FIG. 1B, the first arrangement order obtained after topological sorting may be A->B->C->D->E->F->G->H, or A->B->C->D->F->E->G->H, or A->B->C->E->G->H->D->F.
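The topological sort described above can be sketched with the standard Kahn's algorithm. This is an illustrative sketch: the edge list and node names below are assumed to mirror the example graph of FIG. 1B, and the FIFO tie-breaking of the ready queue is one arbitrary choice among the several valid orders the text lists.

```python
from collections import deque

def topological_sort(edges, nodes):
    """Kahn's algorithm: repeatedly emit a node whose inputs are all done.
    Which ready node is emitted first is arbitrary, which is why the result
    is generally not unique when a node has several outputs."""
    indegree = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order

# Calculation graph of FIG. 1B (assumed edge list).
nodes = list("ABCDEFGH")
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"),
         ("D", "F"), ("E", "G"), ("G", "H")]
print(topological_sort(edges, nodes))
# With this FIFO tie-breaking: ['A','B','C','D','E','F','G','H']
```

Replacing the deque with a different tie-breaking policy (e.g. a stack) yields one of the other valid orders, which is exactly the non-uniqueness the branch sorting below resolves.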
S130: Determine a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph.
When a calculation node has multiple outputs, the node forms a branch along each output direction. Such a node is called a branch start node, and the calculation nodes in a branch are called branch calculation nodes; one branch start node forms at least two branches. By performing branch sorting on the multiple branches of each branch start node, the execution order of the branches can be uniquely determined, which in turn uniquely determines the arrangement order of all branch calculation nodes of the branches formed from that branch start node; the arrangement order of all branch calculation nodes is called the second arrangement order.
Optionally, the branches may be sorted in ascending order according to the weight of each branch. The weight of a branch may be generated according to a preset rule, for example, by directly setting the number of branch calculation nodes contained in the branch as the branch's weight, or by setting the weight according to the node type of the calculation nodes. For example, in the calculation graph shown in FIG. 1B, the branches include B->C, B->D->F, and B->E->G->H. Denote branch B->C as branch 1, branch B->D->F as branch 2, and branch B->E->G->H as branch 3. Taking the number of branch calculation nodes contained in each branch as its weight, branch 1 has weight 2, branch 2 has weight 3, and branch 3 has weight 4. Sorting in ascending order by weight yields the branch order 1, 2, 3: branch 1 is executed first, then branch 2, and finally branch 3, giving the second arrangement order of all branch calculation nodes: B->C->D->F->E->G->H.
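The ascending-weight branch sort of the example above can be sketched as follows. This is a minimal sketch under stated assumptions: the weight is taken to be the node count of each branch (the example's preset rule), and the branch lists for start node B are copied from the FIG. 1B example.

```python
def second_order(start_node, branches):
    """Sort one start node's branches by ascending weight (= node count)
    and concatenate them, keeping only each node's first appearance, to
    produce the second arrangement order of the branch calculation nodes."""
    ordered = sorted(branches, key=len)  # weight = number of branch nodes
    seq = []
    for branch in ordered:
        for node in branch:
            if node not in seq:  # the shared start node appears only once
                seq.append(node)
    return seq

# Branches of FIG. 1B rooted at start node B (assumed).
branches = [["B", "C"], ["B", "D", "F"], ["B", "E", "G", "H"]]
print(second_order("B", branches))
# -> ['B', 'C', 'D', 'F', 'E', 'G', 'H']
```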
S140: Replace the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
In topological sorting, the arrangement order of the calculation nodes near a branch start node is usually not unique, i.e., the arrangement order of the branch calculation nodes in the first arrangement order is not unique, whereas the second arrangement order of the branch calculation nodes determined by branch sorting is unique. Therefore, replacing the arrangement order of the corresponding branch calculation nodes in the first arrangement order with the second arrangement order yields a unique arrangement order of the calculation graph, namely the target arrangement order of the calculation graph.
For example, for the calculation graph shown in FIG. 1B, the first arrangement order obtained by topological sorting is A->B->C->D->E->F->G->H, and the second arrangement order of the branch calculation nodes is B->C->D->F->E->G->H. Replacing the arrangement order of the corresponding branch calculation nodes in the first arrangement order with the second arrangement order gives the target arrangement order of the calculation graph: A->B->C->D->F->E->G->H.
The method for sorting an AI calculation graph provided in Embodiment 1 of this application realizes the sorting of a calculation graph based on a data flow architecture by performing topological sorting and branch sorting on the calculation graph, can uniquely determine the execution order of the plurality of calculation nodes in the calculation graph, and improves the performance of a chip based on a data flow architecture.
Embodiment 2
FIG. 2A is a schematic flowchart of a method for sorting an AI calculation graph provided in Embodiment 2 of this application. This embodiment elaborates on the above embodiment. As shown in FIG. 2A, the method for sorting an AI calculation graph provided in Embodiment 2 of this application includes:
S210: Obtain a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes.
S220: Perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph.
S230: Determine a plurality of branch start nodes and the plurality of branches corresponding to each branch start node.
The calculation graph of a deep learning model is a directed acyclic graph, which usually has a large number of calculation nodes and a relatively complex structure, so there are also multiple branch start nodes. Every branch derives from its corresponding branch start node, so to determine the branches, the branch start nodes must be determined first.
Determining the plurality of branch start nodes and the plurality of branches corresponding to each branch start node includes steps S231 to S233 (not shown in the figure).
S231: If the number of outputs of a calculation node is greater than 1, take the calculation node as a branch start node.
When a calculation node has multiple outputs, i.e., the number of outputs of the calculation node is greater than 1, the node is called a branch start node. For example, referring to the calculation graph shown in FIG. 1B, calculation node B can be determined to be a branch start node.
S232: If the number of outputs of a calculation node located after the branch start node is greater than 1 or equal to 0, take that calculation node as a branch end node, where one branch start node corresponds to at least two branch end nodes.
A branch end node is the last calculation node of a branch, so a branch end node necessarily lies after the branch start node. When a calculation node located after the branch start node has more than one output, that node can serve as the branch start node of other branches, so the current branch should end at that node, i.e., that node should serve as a branch end node corresponding to the current branch start node. When a calculation node located after the branch start node has zero outputs, that node is already the last calculation node, so it also serves as a branch end node corresponding to the current branch start node. A branch start node is a calculation node with more than one output, i.e., its outputs connect to at least two calculation nodes, so one branch start node corresponds to at least two branch end nodes. For example, referring to the calculation graph shown in FIG. 1B, the branch end nodes corresponding to branch start node B include calculation nodes C, F, and H.
S233: Form a branch from the branch start node to the branch end node according to the direction of the calculation graph.
A branch is equivalent to a linear sequence. Once its start and end points are determined, the branch is determined by the data flow direction, i.e., the direction of the calculation graph. For example, referring to the calculation graph shown in FIG. 1B, the branch formed by branch start node B and branch end node C is B->C, the branch formed by branch start node B and branch end node F is B->D->F, and the branch formed by branch start node B and branch end node H is B->E->G->H.
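Steps S231 to S233 can be sketched as below. This is an illustrative sketch: the adjacency-list representation, and the walk that follows each output until it meets the first fan-out (output count > 1) or sink (output count 0) node, are assumptions consistent with the rules above; the example graph is the assumed FIG. 1B structure.

```python
def find_branches(succ):
    """Return {start_node: [branch, ...]} per S231-S233.

    A start node has more than one output (S231); each branch ends at the
    first later node whose output count is > 1 or == 0 (S232); the branch
    follows the graph direction from start node to end node (S233)."""
    branches = {}
    for node, outs in succ.items():
        if len(outs) <= 1:
            continue  # not a branch start node
        branches[node] = []
        for first in outs:
            branch, cur = [node], first
            while True:
                branch.append(cur)
                nxt = succ.get(cur, [])
                if len(nxt) != 1:  # fan-out (>1) or sink (0): end node
                    break
                cur = nxt[0]
            branches[node].append(branch)
    return branches

# FIG. 1B as an adjacency list (assumed).
succ = {"A": ["B"], "B": ["C", "D", "E"], "C": [], "D": ["F"],
        "E": ["G"], "F": [], "G": ["H"], "H": []}
print(find_branches(succ))
# -> {'B': [['B', 'C'], ['B', 'D', 'F'], ['B', 'E', 'G', 'H']]}
```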
S240: Determine the weight of each branch according to the number of branch calculation nodes included in each branch.
Determining a branch's weight according to the number of branch calculation nodes it includes means setting the number of branch calculation nodes contained in the branch as the branch's weight. For example, referring to the calculation graph shown in FIG. 1B, branch B->C contains 2 branch calculation nodes, so its weight is 2; branch B->D->F contains 3 branch calculation nodes, so its weight is 3; and branch B->E->G->H contains 4 branch calculation nodes, so its weight is 4.
As shown in FIG. 2B, the method for determining branch weights includes:
S241: Determine whether a composite calculation node exists in the current branch, where a composite calculation node is a calculation node shared by the current branch and other branches, excluding the branch start node.
Because the structure of a calculation graph is complex, a calculation node may well exist in multiple branches. A calculation node shared by multiple branches is called a composite calculation node. Since one branch start node corresponds to multiple branches and therefore naturally exists in multiple branches, a branch start node is not regarded as a composite calculation node. If a composite calculation node exists in the current branch, the current branch shares a calculation node with other branches, and step S243 is performed; otherwise, step S242 is performed.
S242: If no composite calculation node exists in the current branch, take the number of branch calculation nodes included in the current branch as the weight of the current branch.
If no composite calculation node exists in the current branch, the current branch is a single branch, i.e., it does not intersect with other branches, so the number of branch calculation nodes it includes can be taken as its weight. For example, referring to the calculation graph shown in FIG. 1B, branches B->C, B->D->F, and B->E->G->H are all single branches, so their corresponding weights can be set to 2, 3, and 4.
S243: If a composite calculation node exists in the current branch, determine whether the composite calculation node is the branch start node of another branch.
If a composite calculation node exists in the current branch, there are two cases. In the first case, the composite calculation node is the branch end node of the current branch, but that branch end node has more than one output, so it is also the branch start node of other branches; in this case, step S244 is performed.
In the other case, the composite calculation node is a calculation node of the current branch other than the branch start node and the branch end node. This means the composite calculation node has more than one input and one output, i.e., its input depends on the outputs of at least two calculation nodes; the composite calculation node is then also a node other than the branch start node and the branch end node in other branches. In this case, step S245 is performed.
S244: If the composite calculation node is the branch start node of another branch, take the number of branch calculation nodes corresponding to the branch start node of the current branch as the weight of the current branch.
If the composite calculation node is the branch start node of another branch, there are still calculation nodes to be computed after the current branch, i.e., the output data of the current branch's branch end node is not the final output data. In this case, among all branches corresponding to the current branch's branch start node, the weight of the current branch should be set to the maximum, so that the current branch is executed last and the data of other branches is not overwritten. Optionally, setting the current branch's weight to the maximum may be done by taking the number of branch calculation nodes corresponding to the current branch's branch start node as the current branch's weight, or by taking the sum of the weights of all branches corresponding to the current branch's branch start node as the current branch's weight. For example, referring to the calculation graph shown in FIG. 2C, the current branch B->E->G contains a composite calculation node G, which is the branch start node of other branches. The branches corresponding to the current branch's branch start node B include B->C, B->D->F, and B->E->G, with corresponding weights 2, 3, and 3. Taking the sum of all branch weights as the current branch's weight gives 2+3+3=8; alternatively, taking the number of branch calculation nodes corresponding to branch start node B, which is 6, as the current branch's weight gives 6. The final weights of branches B->C, B->D->F, and B->E->G corresponding to branch start node B are thus 2, 3, and 8, or correspondingly 2, 3, and 6.
S245: If the composite calculation node is not the branch start node of another branch, set the weight of the current branch to infinity.
S246: After the other branches containing the composite calculation node complete branch sorting, take the number of branch calculation nodes included in the current branch as the weight of the current branch.
When the composite calculation node is not the branch start node of another branch, i.e., it is a node of the current branch other than the branch start node and the branch end node, the current branch also depends on the output data of other branches as input during execution. According to the input-dependency property of the calculation graph, the current branch can only run normally after all of its corresponding input calculation nodes have finished computing. Therefore, the weight of the current branch is set to infinity for the time being; after the other branches containing the composite calculation node complete branch sorting, so that all input calculation nodes of the current branch are computed before it, the weight of the current branch is restored to the normal case, i.e., the number of branch calculation nodes included in the current branch is taken as its weight.
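The weight rules of S241 to S246 can be sketched in a single function as below. This is a simplified sketch under stated assumptions: the branch sets are precomputed lists, the S244 maximum weight uses the node-count variant from the example (giving 6 for FIG. 2C), and the hypothetical nodes continuing after G are invented labels, since the text does not name them.

```python
import math

def branch_weight(branch, sibling_branches, all_branches):
    """Weight of one branch per S241-S246 (single-pass, simplified).

    sibling_branches: the branches sharing this branch's start node;
    all_branches: every branch in the calculation graph."""
    others = [b for b in all_branches if b is not branch]
    # S241: composite nodes = nodes (except our start node) shared with others
    shared = [n for n in branch[1:] if any(n in b for b in others)]
    if not shared:
        return len(branch)                     # S242: single branch
    if any(n == b[0] for n in shared for b in others):
        # S244: the composite node starts another branch -> maximum weight,
        # here the node count of the whole fan-out under our start node
        return len({n for b in sibling_branches for n in b})
    return math.inf                            # S245 (restored later per S246)

# FIG. 2C (assumed): branches under B, plus branches continuing from the
# composite node G; "H" and "I" are hypothetical continuation nodes.
under_b = [["B", "C"], ["B", "D", "F"], ["B", "E", "G"]]
under_g = [["G", "H"], ["G", "I"]]
all_b = under_b + under_g
print([branch_weight(b, under_b, all_b) for b in under_b])  # -> [2, 3, 6]
```

A full implementation would iterate S245/S246: once the branches feeding a shared inner node are sorted, the infinite weight is replaced by the branch's node count, as the text describes.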
S250: Perform branch sorting on the plurality of branches corresponding to each branch start node according to the weight of each branch.
S260: Determine, according to the branch sorting, the second arrangement order of the plurality of branch calculation nodes corresponding to each branch start node.
S270: Replace the arrangement order of the branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the branch calculation nodes, to obtain the target arrangement order of the calculation graph.
S280: Allocate ping-pong buffers to the calculation graph according to the target arrangement order of the calculation graph.
In a data-flow-based chip, the data calculation process usually adopts a ping-pong buffering mechanism: during calculation, data is alternately buffered in two on-chip caches, and the final output data is buffered in an external low-speed memory; meanwhile, the external low-speed memory also backs up the calculation data of every calculation node. Data exchange between calculation nodes via the on-chip caches is faster, while transferring a node's output to the external low-speed memory takes longer; however, the storage space of the on-chip caches is usually small, while that of the external low-speed memory is large. Allocating ping-pong buffers to the calculation graph according to the target arrangement order keeps the data of the running calculation graph in the on-chip caches as much as possible, which can greatly reduce the calculation time of the calculation graph and improve chip performance.
For example, for the calculation graph shown in FIG. 1B, suppose there are two on-chip caches B1 and B2 and one external low-speed memory D1. Allocating ping-pong buffers to the calculation graph of FIG. 1B yields the buffer allocation diagram shown in FIG. 2D, where each labeled buffer indicates where a calculation node's data is stored after it finishes computing. The data of calculation node A is stored in cache B1; according to the ping-pong mechanism, the data of calculation node B is stored in cache B2, and the data of calculation nodes D and E are both stored in cache B1. Since calculation nodes C, F, and H have no outputs, their output data serves as final output data and is stored in the low-speed memory D1.
If the calculation graph shown in FIG. 1B is only topologically sorted, a possible ordering is A->B->C->D->E->F->G->H. Representing the calculation order of the nodes graphically, the calculation order of the calculation nodes after topological sorting is shown in FIG. 2E; for example, calculation node A has calculation order 1, meaning it executes first. As FIG. 2E shows, after calculation node B finishes computing and stores its data in cache B2, calculation node C fetches data from B2 to compute and stores its result in low-speed memory D1; then calculation node D fetches data from B2 to compute and stores its data in cache B1; calculation node E computes next, also fetching data from B2 and storing its data in cache B1. After calculation node E finishes computing, cache B1 holds the data of calculation node E, and the data of calculation node D has been overwritten. For calculation node F, which computes next, its input data has been overwritten, so the backed-up data of calculation node D must be loaded from low-speed memory D1, which greatly increases the data transfer time; moreover, if loading the backup data fails, the calculation will be wrong.
However, after topologically sorting and branch-sorting the calculation graph using the method for sorting an AI calculation graph provided in the embodiments of this application, the target arrangement order of the calculation graph is A->B->C->D->F->E->G->H; the calculation order of the calculation nodes is shown in FIG. 2F. As FIG. 2F shows, calculation node D finishes computing and buffers its data in cache B1; calculation node F then fetches the data of calculation node D from B1 to compute; after calculation node F finishes, calculation node E fetches the data of calculation node B from cache B2 to compute and buffers its data in B1; then calculation node G fetches the data of calculation node E from B1 to compute and buffers its data in B2. With this ordering, node data is buffered in the on-chip caches as much as possible, which speeds up computation, avoids data overwriting during calculation, and also avoids data transfer errors.
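The effect described in the two paragraphs above can be checked with a small simulation. This is a sketch under stated assumptions: the placement rule (a node's output goes to the on-chip cache opposite its input's cache, and nodes without consumers write to the slow memory D1) is inferred from the FIG. 2D example, and "forced reloads" counts inputs whose on-chip copy was overwritten and must be reloaded from the D1 backup.

```python
def count_forced_reloads(order, inputs, consumers):
    """Simulate the two-cache ping-pong model of the example and count how
    often a node's input was overwritten on-chip, forcing a slow reload."""
    where = {}                       # node -> "B1" / "B2" / "D1"
    cache = {"B1": None, "B2": None}  # node whose output each cache holds
    reloads = 0
    for node in order:
        for src in inputs.get(node, []):
            if where.get(src) in ("B1", "B2") and cache[where[src]] != src:
                reloads += 1  # overwritten on-chip: reload backup from D1
        if not consumers.get(node):
            where[node] = "D1"       # final output: slow external memory
        else:
            src = inputs.get(node, [None])[0]
            dest = "B2" if where.get(src) == "B1" else "B1"  # ping-pong
            cache[dest] = node
            where[node] = dest
    return reloads

# FIG. 1B dependencies (assumed).
inputs = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["B"],
          "F": ["D"], "G": ["E"], "H": ["G"]}
consumers = {"A": ["B"], "B": ["C", "D", "E"], "D": ["F"],
             "E": ["G"], "G": ["H"]}
print(count_forced_reloads(list("ABCDEFGH"), inputs, consumers))  # -> 1
print(count_forced_reloads(list("ABCDFEGH"), inputs, consumers))  # -> 0
```

Under this model, the plain topological order forces one reload (F must refetch D's data, as the text describes), while the target order needs none.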
The method for sorting an AI calculation graph provided in Embodiment 2 of this application can uniquely determine the execution order of the plurality of calculation nodes in the calculation graph by performing topological sorting and branch sorting on the calculation graph, realizes the sorting of a calculation graph based on a data flow architecture, makes the calculation process of the calculation graph better suited to the ping-pong buffering mechanism inside the data flow architecture, improves the on-chip cache utilization of a chip based on a data flow architecture, speeds up the computation of the calculation graph, and improves the overall performance of the chip.
Embodiment 3
FIG. 3 is a schematic structural diagram of an apparatus for sorting an AI calculation graph provided in Embodiment 3 of this application. This embodiment is applicable to the node sorting of a calculation graph of a deep learning model based on a data flow architecture. The apparatus can implement the method for sorting an AI calculation graph provided in any embodiment of this application and has the corresponding functional structures and effects of the implemented method; for content not described in detail in this embodiment, refer to the description of any method embodiment of this application.
As shown in FIG. 3, the apparatus for sorting an AI calculation graph provided in Embodiment 3 of this application includes: a calculation graph obtaining module 310, a topological sorting module 320, a branch sorting module 330, and a target arrangement order determining module 340.
The calculation graph obtaining module 310 is configured to obtain a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; the topological sorting module 320 is configured to perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph; the branch sorting module 330 is configured to determine a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and the target arrangement order determining module 340 is configured to replace the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain the target arrangement order of the calculation graph.
The branch sorting module 330 includes:
a branch determining unit, configured to determine a plurality of branch start nodes and the plurality of branches corresponding to each branch start node; a weight determining unit, configured to determine the weight of each branch according to the number of branch calculation nodes included in each branch; a branch sorting unit, configured to perform branch sorting on the plurality of branches corresponding to each branch start node according to the weight of each branch; and a second arrangement order unit, configured to determine, according to the branch sorting, the second arrangement order of the plurality of branch calculation nodes corresponding to each branch start node.
The branch determining unit is configured to:
take a calculation node as a branch start node if the number of outputs of the calculation node is greater than 1; take a calculation node located after the branch start node as a branch end node if its number of outputs is greater than 1 or equal to 0, where one branch start node corresponds to at least two branch end nodes; and form a branch from the branch start node to the branch end node according to the direction of the calculation graph.
The weight determining unit is configured to:
determine whether a composite calculation node exists in the current branch, where a composite calculation node is a calculation node shared by the current branch and other branches, excluding the branch start node; if no composite calculation node exists in the current branch, take the number of branch calculation nodes included in the current branch as the weight of the current branch; if a composite calculation node exists in the current branch, determine whether the composite calculation node is the branch start node of another branch; if the composite calculation node is the branch start node of another branch, take the number of branch calculation nodes corresponding to the branch start node of the current branch as the weight of the current branch; if the composite calculation node is not the branch start node of another branch, set the weight of the current branch to infinity; and after the other branches containing the composite calculation node complete branch sorting, take the number of branch calculation nodes included in the current branch as the weight of the current branch.
The apparatus further includes: a ping-pong buffer allocation module, configured to allocate ping-pong buffers to the calculation graph according to the target arrangement order of the calculation graph.
The apparatus for sorting an AI calculation graph provided in Embodiment 3 of this application realizes the sorting of a calculation graph based on a data flow architecture by performing topological sorting and branch sorting on the calculation graph, can uniquely determine the execution order of the plurality of calculation nodes in the calculation graph, and improves the performance of a chip based on a data flow architecture.
Embodiment 4
FIG. 4 is a schematic structural diagram of a device provided in Embodiment 4 of this application. FIG. 4 shows a block diagram of an exemplary device 412 suitable for implementing the embodiments of this application. The device 412 shown in FIG. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in FIG. 4, the device 412 takes the form of a general-purpose device. The components of the device 412 may include, but are not limited to: one or more processors 416, a storage apparatus 428, and a bus 418 connecting the different system components (including the storage apparatus 428 and the processors 416).
The bus 418 represents one or more of several types of bus structures, including a storage apparatus bus or storage apparatus controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The device 412 includes a variety of computer-system-readable media. These media may be any available media accessible by the device 412, including volatile and non-volatile media, and removable and non-removable media.
The storage apparatus 428 may include computer-system-readable media in the form of volatile memory, such as a Random Access Memory (RAM) 430 and/or a cache memory 432. The device 412 may include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 434 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 4, usually called a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive configured to read and write a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive configured to read and write a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media), may be provided. In these cases, each drive may be connected to the bus 418 through one or more data media interfaces. The storage apparatus 428 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of this application.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for example, in the storage apparatus 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in this application.
The device 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the device 412, and/or with any device (e.g., a network card, a modem, etc.) that enables the device 412 to communicate with one or more other computing devices. Such communication may occur through an Input/Output (I/O) interface 422. The device 412 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through a network adapter 420. As shown in FIG. 4, the network adapter 420 communicates with the other modules of the device 412 through the bus 418. Although not shown in the figure, other hardware and/or software modules may be used in conjunction with the device 412, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, data backup storage systems, and the like.
The processor 416 executes various functional applications and data processing by running programs stored in the storage apparatus 428, for example, implementing the method for sorting an AI calculation graph provided in any embodiment of this application, which may include:
obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; performing topological sorting on the calculation graph to obtain a first arrangement order of the plurality of calculation nodes; determining a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
Embodiment 5
Embodiment 5 of this application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for sorting an AI calculation graph provided in any embodiment of this application is implemented, which may include:
obtaining a calculation graph based on a data flow architecture, where the calculation graph includes a plurality of calculation nodes; performing topological sorting on the calculation graph to obtain a first arrangement order of the plurality of calculation nodes; determining a second arrangement order of a plurality of branch calculation nodes according to the branch ordering of a plurality of branches in the calculation graph; and replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
The computer storage medium of the embodiments of this application may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, Radio Frequency (RF), etc., or any suitable combination of the above.
The computer program code for performing the operations of this application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or terminal. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Claims (10)

  1. A method for sorting an artificial intelligence (AI) calculation graph, comprising:
    obtaining a calculation graph based on a data flow architecture, wherein the calculation graph comprises a plurality of calculation nodes;
    performing topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph;
    determining a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to a branch ordering of a plurality of branches in the calculation graph; and
    replacing the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
  2. The method according to claim 1, wherein the determining a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to a branch ordering of a plurality of branches in the calculation graph comprises:
    determining a plurality of branch start nodes among the plurality of calculation nodes and a plurality of branches corresponding to each branch start node;
    determining a weight of each branch according to a number of branch calculation nodes comprised in each branch;
    performing branch sorting on the plurality of branches corresponding to each branch start node according to the weight of each branch; and
    determining, according to the branch sorting, the second arrangement order of the plurality of branch calculation nodes corresponding to each branch start node.
  3. The method according to claim 2, wherein the determining a plurality of branch start nodes among the plurality of calculation nodes and a plurality of branches corresponding to each branch start node comprises:
    in a case where a number of outputs of a calculation node is greater than 1, taking the calculation node as a branch start node;
    in a case where a number of outputs of a calculation node located after the branch start node is greater than 1 or equal to 0, taking the calculation node located after the branch start node as a branch end node, wherein one branch start node corresponds to at least two branch end nodes; and
    forming a branch from the branch start node to the branch end node according to a direction of the calculation graph.
  4. The method according to claim 3, wherein the determining a weight of each branch according to a number of branch calculation nodes comprised in each branch comprises:
    determining whether a composite calculation node exists in a current branch, wherein the composite calculation node is a calculation node shared by the current branch and other branches, excluding the branch start node of the current branch; and
    in response to no composite calculation node existing in the current branch, taking the number of branch calculation nodes comprised in the current branch as the weight of the current branch.
  5. The method according to claim 4, after the determining whether a composite calculation node exists in a current branch, further comprising:
    in response to the composite calculation node existing in the current branch, determining whether the composite calculation node is a branch start node of another branch; and
    in response to the composite calculation node being a branch start node of another branch, taking the number of branch calculation nodes corresponding to the branch start node of the current branch as the weight of the current branch.
  6. The method according to claim 5, after the determining whether the composite calculation node is a branch start node of another branch, further comprising:
    in response to the composite calculation node not being a branch start node of another branch, setting the weight of the current branch to infinity; and
    after the other branches containing the composite calculation node complete branch sorting, taking the number of branch calculation nodes comprised in the current branch as the weight of the current branch.
  7. The method according to claim 2, after the obtaining a target arrangement order of the calculation graph, further comprising:
    allocating ping-pong buffers to the calculation graph according to the target arrangement order of the calculation graph.
  8. An apparatus for sorting an artificial intelligence (AI) calculation graph, comprising:
    a calculation graph obtaining module, configured to obtain a calculation graph based on a data flow architecture, wherein the calculation graph comprises a plurality of calculation nodes;
    a topological sorting module, configured to perform topological sorting on the plurality of calculation nodes to obtain a first arrangement order of the calculation graph;
    a branch sorting module, configured to determine a second arrangement order of a plurality of branch calculation nodes among the plurality of calculation nodes according to a branch ordering of a plurality of branches in the calculation graph; and
    a target arrangement order determining module, configured to replace the arrangement order of the plurality of branch calculation nodes in the first arrangement order with the second arrangement order corresponding to the plurality of branch calculation nodes, to obtain a target arrangement order of the calculation graph.
  9. A device, comprising:
    at least one processor; and
    a storage apparatus configured to store at least one program,
    wherein when the at least one program is executed by the at least one processor, the at least one processor implements the method for sorting an artificial intelligence (AI) calculation graph according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method for sorting an artificial intelligence (AI) calculation graph according to any one of claims 1-7.
PCT/CN2021/098307 2020-06-22 2021-06-04 Method, apparatus, device, and storage medium for sorting an AI calculation graph WO2021259041A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010577847.1 2020-06-22
CN202010577847.1A CN111752691B (zh) 2020-06-22 2020-06-22 Method, apparatus, device, and storage medium for sorting an AI calculation graph

Publications (1)

Publication Number Publication Date
WO2021259041A1 (zh)

Family

ID=72675663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098307 WO2021259041A1 (zh) 2020-06-22 2021-06-04 Method, apparatus, device, and storage medium for sorting an AI calculation graph

Country Status (2)

Country Link
CN (1) CN111752691B (zh)
WO (1) WO2021259041A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111752691B (zh) * 2020-06-22 2023-11-28 深圳鲲云信息科技有限公司 Ai计算图的排序方法、装置、设备及存储介质
CN112308449B (zh) * 2020-11-12 2023-04-14 山东鲁软数字科技有限公司 配电自动化系统环网图中分支内设备排序的方法及系统
CN114021708B (zh) * 2021-09-30 2023-08-01 浪潮电子信息产业股份有限公司 一种数据处理方法、装置、系统、电子设备及存储介质
CN114911630B (zh) * 2022-07-14 2022-11-04 小米汽车科技有限公司 数据处理方法、装置、车辆、存储介质及芯片

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473265A (zh) * 2013-07-25 2013-12-25 深圳市华傲数据技术有限公司 Method and device for analyzing flowchart layout
US20140075408A1 (en) * 2012-09-10 2014-03-13 Sap Ag System and Method for Generating High Performance Calculators for Calculation Graphs
CN107679010A (zh) * 2017-09-20 2018-02-09 东南大学 Operator mapping system and method for a reconfigurable computing array
CN109656568A (zh) * 2018-12-28 2019-04-19 黑龙江省工业技术研究院 On-demand graph reachability indexing method for reducible program control flow graphs
CN110321999A (zh) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural network calculation graph optimization method
CN111752691A (zh) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Method, apparatus, device, and storage medium for sorting an AI calculation graph


Also Published As

Publication number Publication date
CN111752691A (zh) 2020-10-09
CN111752691B (zh) 2023-11-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828045

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21828045

Country of ref document: EP

Kind code of ref document: A1