WO2024065869A1 - Instruction execution method and device for graph computing - Google Patents

Instruction execution method and device for graph computing

Info

Publication number
WO2024065869A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2022/124006
Other languages
English (en)
French (fr)
Inventor
王宏升
陈�光
曾令仿
潘爱民
Original Assignee
之江实验室
Application filed by 之江实验室 filed Critical 之江实验室
Priority to US18/071,978 priority Critical patent/US20240118897A1/en
Publication of WO2024065869A1 publication Critical patent/WO2024065869A1/zh

Classifications

    • G06F 9/3838 — Dependency mechanisms, e.g. register scoreboarding (concurrent instruction execution; instruction issuing, e.g. dynamic instruction scheduling or out-of-order execution)
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (complex mathematical operations)
    • G06F 9/30105 — Register structure (register arrangements)
    • G06F 9/3885 — Concurrent instruction execution using a plurality of independent parallel functional units
    • G06N 3/04 — Neural networks; architecture, e.g. interconnection topology
    • G06N 3/063 — Physical realisation, i.e. hardware implementation, of neural networks using electronic means

Definitions

  • the present invention relates to the technical field of computer systems based on a specific computing model, and in particular to an instruction execution method and device for graph computing.
  • the present invention constructs a topological order of parallel instructions by analyzing the dependencies between instructions during the execution of computational graphs, provides a method and device for scheduling parallel instructions to hardware resources as quickly as possible, and provides a compilation technology for an instruction execution method and device for graph computing.
  • the purpose of the present invention is to provide an instruction execution method and device for graph computing, which solves the problem of how to analyze the dependencies between instructions contained in nodes during the execution of the computational graph from a global perspective and derive the topological order of instructions that can be executed in parallel in the global computational graph based on the dependencies, so as to schedule the parallel instructions to hardware resources as quickly as possible.
  • a method for executing instructions for graph computing comprising the following steps:
  • Step S1 Send the operator of each node in the computation graph used for neural network calculation to the operator interpreter;
  • Step S2 The operator interpreter constructs runtime instructions
  • Step S3 define instruction dependencies
  • Step S4 construct an instruction dependency graph
  • Step S5 construct a topological order of parallel instructions
  • Step S6 Scheduling parallel instructions to hardware resources
  • Step S7 construct the shortest schedule of parallel instructions: the shortest time required for parallel instruction execution under the condition of hardware resource constraints;
  • Step S8 Release the executed instruction.
  • The instruction dependencies defined in step S3 include a strong write-read dependency, a weak read-write dependency, and a weak write-write dependency.
  • The strong write-read dependency is: according to the instruction operations, a register is written first and the same register is read later; the instruction operation that later reads the register depends on the instruction operation that first writes it.
  • The weak read-write dependency is: according to the instruction operations, a register is read first and the same register is written later; the instruction operation that later writes the register depends on the instruction operation that first reads it.
  • The weak write-write dependency is: according to the instruction operations, a register is written first and the same register is written again later; the instruction operation that later writes the register depends on the instruction operation that first writes it.
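The three relations can be summarized in a small helper function (an illustrative sketch of the classification rule, not code from the patent; the names are ours):

```python
# Classify the dependency between two instruction operations that touch the
# SAME register, following the patent's three relations: the dependency of
# the later operation on the earlier one.

def classify_dependency(first_op, second_op):
    """first_op / second_op: 'write' or 'read' accesses to one register.
    Returns the dependency of the second operation on the first, or None."""
    if first_op == "write" and second_op == "read":
        return "strong write-read"   # true data dependency
    if first_op == "read" and second_op == "write":
        return "weak read-write"     # anti-dependency
    if first_op == "write" and second_op == "write":
        return "weak write-write"    # output dependency
    return None                      # read after read needs no ordering
```

Two reads of the same register impose no ordering, which is why only three relations are defined.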
  • step S4 are: traversing each node in turn according to the topological structure of the computational graph, and constructing the dependency edges of each node to form an instruction dependency graph by analyzing the dependency relationship between each node instruction and its successor node instruction.
  • step S5 are: traversing each computing node in turn according to the topological structure of the computational graph, and obtaining the parallel execution instructions of each step in the execution flow according to the instruction dependency graph to obtain the topological order of the parallel instructions.
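The per-step grouping described in step S5 can be sketched as a level-by-level topological traversal of the instruction dependency graph (an illustrative sketch; the function and variable names are ours, not the patent's):

```python
# Group the nodes of an instruction dependency graph into parallel steps:
# a node joins a step once every node it depends on has executed in an
# earlier step.

def parallel_steps(nodes, edges):
    """nodes: iterable of node ids; edges: (pred, succ) dependency pairs.
    Returns a list of sets, one set of parallel nodes per step."""
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)
    done, steps = set(), []
    while len(done) < len(preds):
        ready = {n for n in preds if n not in done and preds[n] <= done}
        if not ready:
            raise ValueError("cycle in instruction dependency graph")
        steps.append(ready)
        done |= ready
    return steps
```

Each returned set is one step of instructions that may execute in parallel; the list order is the topological order of parallel instructions.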
  • step S6 are: scheduling each step of parallel execution instruction to the corresponding hardware resources according to the topological order of the instruction dependency graph.
  • the present invention also provides an instruction execution device for graph computing, comprising a memory and one or more processors, wherein the memory stores executable code, and when the one or more processors execute the executable code, they are used to implement an instruction execution method for graph computing described in any one of the above embodiments.
  • the present invention also provides a computer-readable storage medium having a program stored thereon, and when the program is executed by a processor, an instruction execution method for graph computing described in any one of the above embodiments is implemented.
  • the present invention analyzes the dependencies between instructions contained in nodes in the execution process of the computation graph from a global perspective, and derives the topological order of instructions that can be executed in parallel in the global computation graph based on the dependencies, providing a method and device for scheduling parallel instructions to hardware resources as quickly as possible.
  • the efficiency of graph computing instruction execution is improved by analyzing and designing parallel computing operations, and a compilation technology for a graph computing instruction execution method and device is provided.
  • the optimization model of the instruction execution method and device for graph computing is used to optimize the compilation efficiency of the computation graph, which promotes the development of the application of deep neural network models in the relationship graph.
  • FIG1 is a schematic flow chart of an instruction execution method for graph computing according to the present invention.
  • FIG2 is an architecture diagram of an instruction execution method for graph computing according to the embodiment.
  • FIG3 is a computation graph for neural network calculation according to the embodiment.
  • FIG4 shows the runtime instructions constructed by the operator interpreter of the embodiment.
  • FIG5 is a diagram showing the dependency relationships between instructions in the embodiment.
  • FIG6 is an example of analyzing instruction dependencies.
  • FIG7 is the first step of executing instructions in parallel in the embodiment.
  • FIG8 is the second step of executing instructions in parallel in the embodiment.
  • FIG9 is the third step of executing instructions in parallel in the embodiment.
  • FIG10 is the fourth step of executing instructions in parallel in the embodiment.
  • FIG11 is the fifth step of executing instructions in parallel in the embodiment.
  • FIG12 is the sixth step of executing instructions in parallel in the embodiment.
  • FIG13 is the seventh step of executing instructions in parallel in the embodiment.
  • FIG14 is the eighth step of executing instructions in parallel in the embodiment.
  • FIG15 is an example of analyzing the parallel execution order of instructions.
  • FIG16 is a diagram showing the shortest scheduling of parallel instructions according to the embodiment.
  • FIG17 is a schematic diagram of the structure of an instruction execution device for graph computing according to the present invention.
  • a method for executing instructions for graph computing includes the following steps:
  • Step S1 Send the operator of each node in the computation graph used for neural network calculation to the operator interpreter;
  • Step S2 The operator interpreter constructs runtime instructions
  • Step S3 define instruction dependencies
  • the instruction dependencies include a strong write-read dependency, a weak read-write dependency, and a weak write-write dependency;
  • the strong write-read dependency is: according to the instruction operations, a register is written first and the same register is read later; the instruction operation that later reads the register depends on the instruction operation that first writes it;
  • the weak read-write dependency is: according to the instruction operations, a register is read first and the same register is written later; the instruction operation that later writes the register depends on the instruction operation that first reads it;
  • the weak write-write dependency is: according to the instruction operations, a register is written first and the same register is written again later; the instruction operation that later writes the register depends on the instruction operation that first writes it.
  • Step S4 construct an instruction dependency graph
  • each node is traversed in turn, and by analyzing the dependency relationship between each node instruction and its successor node instruction, the dependency edges of each node are constructed to form an instruction dependency graph.
  • Step S5 construct a topological order of parallel instructions
  • Each computing node is traversed in turn according to the topological structure of the computing graph, and at the same time, the parallel execution instructions of each step in the execution flow are obtained according to the instruction dependency graph to obtain the topological order of the parallel instructions.
  • Step S6 Scheduling parallel instructions to hardware resources
  • each step of parallel execution instruction is scheduled to the corresponding hardware resource.
  • Step S7 Construct the shortest schedule of parallel instructions: the shortest time required for parallel instruction execution under the condition of hardware resource constraints.
  • Step S8 Release the executed instruction.
  • Embodiment: referring to FIG2, an architecture diagram of an instruction execution method for graph computing is shown.
  • a method for executing instructions for graph computing comprising the following steps:
  • step S1 sending the operator of each node in the computation graph used for neural network computation to the operator interpreter;
  • tf.matmul(x, y): represents the matrix multiplication operation of tensor x and tensor y;
  • tf.subtract(x, y): represents the matrix subtraction operation of tensor x and tensor y;
  • tf.add(x, y): represents the matrix addition operation of tensor x and tensor y.
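As a rough illustration of what these three operators compute, here are pure-Python stand-ins operating on matrices given as nested lists (our sketch for clarity, not the patent's implementation):

```python
# Pure-Python equivalents of the three tensor operators named above.

def matmul(x, y):
    """Matrix multiplication of x (m x k) and y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def subtract(x, y):
    """Element-wise matrix subtraction x - y."""
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def add(x, y):
    """Element-wise matrix addition x + y."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
```

For example, matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) yields [[19, 22], [43, 50]].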
  • step S2 the operator interpreter constructs runtime instructions
  • the data loading instruction LD indicates a write-register instruction, which writes the value of the tensor variable x in memory into register ri;
  • MUL ri, rj, rk indicates a matrix multiplication operation: the tensor variables in registers rj and rk are read, a matrix multiplication is performed on them, and the result is written into register ri;
  • ADD ri, rj, rk indicates a matrix addition operation: the tensor variables in registers rj and rk are read, a matrix addition is performed on them, and the result is written into register ri;
  • SUB ri, rj, rk indicates a matrix subtraction operation: the tensor variables in registers rj and rk are read, a matrix subtraction is performed on them, and the result is written into register ri.
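One plausible way to encode such runtime instructions is to record, for each instruction, the registers it writes and reads — exactly the information the dependency analysis of step S3 needs. This is a hypothetical representation; the field names are ours:

```python
# Minimal instruction records: opcode plus written and read register sets.

def instr(op, writes=(), reads=()):
    return {"op": op, "writes": frozenset(writes), "reads": frozenset(reads)}

# MUL r1, r2, r3: read r2 and r3, multiply, write the result into r1
mul = instr("MUL", writes=["r1"], reads=["r2", "r3"])
# LD r2, x: write the value of tensor variable x from memory into r2
ld = instr("LD", writes=["r2"])
```

The dependency definitions of step S3 then reduce to set intersections on these fields.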
  • step S3 defining instruction dependencies
  • the data loading instruction LD indicates a write-register instruction, which writes the value of the tensor variable x in memory into register ri;
  • the data storage instruction ST indicates a read-register instruction, which reads the value in register ri and writes it to the tensor variable y in memory;
  • Write1 ri denotes the former operation of writing register ri;
  • Read1 ri denotes the former operation of reading register ri;
  • Read2 ri denotes the latter operation of reading register ri.
  • the instruction dependencies include a strong write-read dependency, a weak read-write dependency, and a weak write-write dependency;
  • the strong write-read dependency is: according to the instruction operations, a register is written first and the same register is read later; the instruction operation that later reads the register depends on the instruction operation that first writes it;
  • the weak read-write dependency is: according to the instruction operations, a register is read first and the same register is written later; the instruction operation that later writes the register depends on the instruction operation that first reads it;
  • the weak write-write dependency is: according to the instruction operations, a register is written first and the same register is written again later; the instruction operation that later writes the register depends on the instruction operation that first writes it.
  • Step S4 construct an instruction dependency graph
  • each node is traversed in turn, and by analyzing the dependency relationship between each node instruction and its successor node instruction, the dependency edge of each node is constructed to form an instruction dependency graph;
  • Analyzing the dependency relationship between each node's instruction and the instructions of its successor nodes means determining, for each such pair, whether the relationship is a strong write-read dependency, a weak read-write dependency, or a weak write-write dependency.
  • Vi → Vj indicates that node Vj is strongly dependent on node Vi, that is, nodes Vi and Vj have a write-read dependency relationship.
  • Vi ⇢ Vj indicates that node Vj is weakly dependent on node Vi, that is, nodes Vi and Vj have a read-write dependency relationship.
  • In step 1, the parallel instructions that can be executed simultaneously include the instructions at every node Vi that has no dependency predecessor in the instruction dependency graph.
  • Node V1 includes a write of register r1 and node V3 includes a read of register r1, so there is a strong write-read dependency between the instructions of nodes V1 and V3.
  • Node V2 includes a write of register r2 and node V3 includes a read of register r2, so there is a strong write-read dependency between the instructions of nodes V2 and V3.
  • Node V3: 1) Node V3 includes a read of register r2 and node V4 includes a write of register r2, so there is a weak read-write dependency between the instructions of nodes V3 and V4. 2) Node V3 includes a write of register r1 and node V7 includes a read of register r1, so there is a strong write-read dependency between the instructions of nodes V3 and V7.
  • Node V4 includes a write of register r2 and node V6 includes a read of register r2, so there is a strong write-read dependency between the instructions of nodes V4 and V6.
  • Node V5 includes a write of register r3 and node V6 includes a read of register r3, so there is a strong write-read dependency between the instructions of nodes V5 and V6.
  • Node V6: 1) Node V6 includes a write of register r2 and node V7 includes a read of register r2, so there is a strong write-read dependency between the instructions of nodes V6 and V7. 2) Node V6 includes a read of register r3 and node V9 includes a write of register r3, so there is a weak read-write dependency between the instructions of nodes V6 and V9.
  • Node V7 includes a read of register r2 and node V8 includes a write of register r2, so there is a weak read-write dependency between the instructions of nodes V7 and V8.
  • Node V8 includes a write of register r2 and node V10 includes a read of register r2, so there is a strong write-read dependency between the instructions of nodes V8 and V10.
  • Node V9 includes a write of register r3 and node V10 includes a read of register r3, so there is a strong write-read dependency between the instructions of nodes V9 and V10.
  • Node V10 includes a write of register r2 and node V11 includes a read of register r2, so there is a strong write-read dependency between the instructions of nodes V10 and V11.
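The pairwise analysis above can be mechanized: given each node's written and read register sets, the relation follows from set intersections, with the strong write-read case checked first. A minimal sketch (node register sets are the ones stated above; V7's sets include only the accesses the text mentions):

```python
# Classify the dependency of a later node n2 on an earlier node n1 from
# their register access sets.

def dependency(n1, n2):
    if n1["writes"] & n2["reads"]:
        return "strong write-read"
    if n1["reads"] & n2["writes"]:
        return "weak read-write"
    if n1["writes"] & n2["writes"]:
        return "weak write-write"
    return None

v1 = {"writes": {"r1"}, "reads": set()}
v3 = {"writes": {"r1"}, "reads": {"r1", "r2"}}
v4 = {"writes": {"r2"}, "reads": set()}
v6 = {"writes": {"r2"}, "reads": {"r2", "r3"}}
v7 = {"writes": set(), "reads": {"r1", "r2"}}
```

With these sets, dependency(v1, v3) is the strong write-read relation on r1, dependency(v3, v4) is the weak read-write relation on r2, and dependency(v6, v7) is the strong write-read relation on r2.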
  • Step S5 construct a topological order of parallel instructions
  • The parallel execution instructions of each step mean that, at runtime, if every dependency predecessor of the current instruction to be analyzed in the instruction dependency graph has already been executed (or the instruction has no predecessor), then the instructions that can currently be executed in parallel include the current instruction to be analyzed.
  • the first step of parallel execution of instructions is shown, such as the instructions covered by the gray rectangular shadow indicated by symbol 1 in the figure;
  • The first step can execute instructions in parallel: since the instructions included in nodes V1, V2 and V5 have no dependency predecessors, the instructions included in nodes V1, V2 and V5 can be executed in parallel in the first step.
  • the second step of parallel execution of instructions is shown, such as the instructions covered by the gray rectangular shadow indicated by symbol 2 in the figure.
  • The second step can execute instructions in parallel: since node V3 depends only on the instructions contained in nodes V1 and V2, which were executed in the first step, the second step can execute the instructions contained in node V3. Since node V6 depends on node V4 in addition to node V5, and node V4 depends on node V3, there is an indirect dependency between node V6 and node V3, so the second step cannot execute the instructions contained in node V6. The final analysis shows that the second step can execute the instructions contained in node V3 in parallel.
  • FIG. 9 there is shown the third step of parallel execution of instructions, such as the instructions covered by the gray rectangular shadow indicated by symbol 3 in the figure.
  • The third step can execute instructions in parallel: the nodes that directly depend on node V3 include node V4 and node V7. Since node V4 depends only on node V3, the third step can execute the instructions contained in node V4. Since node V7 depends on node V6 in addition to node V3, and node V6 depends on node V4, there is an indirect dependency between node V7 and node V4, so the third step cannot execute the instructions contained in node V7. The final analysis shows that the third step can execute the instructions contained in node V4 in parallel.
  • FIG. 10 there is shown the fourth step of parallel execution of instructions, such as the instructions covered by the gray rectangular shadow indicated by symbol 4 in the figure.
  • The fourth step can execute instructions in parallel: the only node that directly depends on node V4 is node V6. Although node V6 also depends on node V5, the first step has already executed the instructions contained in node V5, so at the fourth step node V6 can be regarded as depending only on node V4. Therefore, the fourth step can execute the instructions contained in node V6. The final analysis shows that the fourth step can execute the instructions contained in node V6 in parallel.
  • the fifth step of parallel execution of instructions is shown, such as the instructions covered by the gray rectangular shadow indicated by symbol 5 in the figure.
  • The fifth step can execute instructions in parallel: the nodes that directly depend on node V6 include node V7 and node V9. Node V9 depends only on node V6, and the other predecessors of node V7 (nodes V3 and V6) have already been executed. The final analysis shows that the fifth step can execute the instructions contained in nodes V7 and V9 in parallel.
  • FIG. 12 there is shown the sixth step of parallel execution of instructions, such as the instructions covered by the gray rectangular shadow indicated by symbol 6 in the figure.
  • The sixth step can execute instructions in parallel: the nodes that directly depend on node V7 include node V8, and the nodes that directly depend on node V9 include node V10; but node V10 also depends on node V8, so it cannot yet be executed. The final analysis shows that the sixth step can execute the instructions contained in node V8 in parallel.
  • FIG. 13 there is shown the seventh step of parallel execution of instructions, such as the instructions covered by the gray rectangular shadow indicated by symbol 7 in the figure.
  • step 7 instructions can be executed in parallel: Since the nodes that directly depend on node V 8 include node V 10 , although node V 10 also depends on node V 9 , the instructions included in node V 9 have been executed in step 5. The final analysis shows that the instructions included in node V 10 can be executed in parallel in step 7.
  • FIG. 14 there is shown the eighth step of parallel execution of instructions, such as the instructions covered by the gray rectangular shadow indicated by the symbol 8 in the figure.
  • The eighth step can execute instructions in parallel: since the only node that directly depends on node V10 is node V11, the final analysis shows that the eighth step can execute the instructions contained in node V11 in parallel.
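The eight steps derived above can be checked mechanically. Using the dependency edges from the node-by-node analysis (strong and weak alike), a node becomes ready once every node it depends on has executed in an earlier step:

```python
# Edge (i, j) means node Vj depends on node Vi, as derived in the analysis.
EDGES = [(1, 3), (2, 3), (3, 4), (3, 7), (4, 6), (5, 6),
         (6, 7), (6, 9), (7, 8), (8, 10), (9, 10), (10, 11)]
NODES = range(1, 12)  # V1 .. V11

preds = {n: set() for n in NODES}
for u, v in EDGES:
    preds[v].add(u)

# Level-by-level traversal: each iteration collects one parallel step.
done, steps = set(), []
while len(done) < len(preds):
    ready = {n for n in preds if n not in done and preds[n] <= done}
    if not ready:
        raise RuntimeError("cycle in instruction dependency graph")
    steps.append(ready)
    done |= ready
```

The computed steps are {V1, V2, V5}, {V3}, {V4}, {V6}, {V7, V9}, {V8}, {V10}, {V11} — matching steps one through eight above.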
  • Step S6 Scheduling parallel instructions to hardware resources
  • each step of parallel execution instruction is scheduled to the corresponding hardware resource
  • Each step of parallel execution of instructions is scheduled to the corresponding hardware resources, wherein the data loading instruction LD and the data storage instruction ST related to data transfer are scheduled to the memory unit, and the instruction related to arithmetic operation is scheduled to the arithmetic logic unit.
  • the scheduling of instructions to hardware resources refers to scheduling each step of parallel instructions to the earliest position where the corresponding hardware resources can start execution. Considering that the resources related to the hardware memory port are being used by the instructions contained in the predecessor node on which the current instruction depends, the earliest position where the hardware resources can start execution refers to the position where the execution of the instructions contained in the predecessor node on which the current instruction depends in the topological structure diagram of the instruction dependency relationship ends.
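The "earliest position" rule can be sketched as a simple list scheduler: each instruction runs on its hardware unit, one at a time per unit, starting no earlier than the finish time of every predecessor it depends on. This is an illustrative sketch with our own names, not the patent's scheduler:

```python
# Earliest-start scheduling over shared hardware units.

def schedule(order, preds, unit_of, duration):
    """order: nodes in topological order; preds[n]: dependency predecessors;
    unit_of[n]: hardware unit name; duration[n]: cycles.
    Returns each node's start cycle."""
    finish, unit_free, start = {}, {}, {}
    for n in order:
        dep_ready = max((finish[p] for p in preds.get(n, ())), default=0)
        s = max(dep_ready, unit_free.get(unit_of[n], 0))  # earliest position
        start[n] = s
        finish[n] = s + duration[n]
        unit_free[unit_of[n]] = finish[n]  # unit busy until this finishes
    return start
```

For instance, two loads on the memory unit serialize (starts 0 and 2 with two-cycle loads), while an arithmetic instruction depending on the first load can start on the arithmetic logic unit as soon as that load finishes.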
  • Scheduling the first-step parallel instructions includes the following process: 1) Since the first-step parallel instructions include the instructions of nodes V1, V2 and V5, and these are all data transfer instructions, the instructions of nodes V1, V2 and V5 are scheduled to the memory unit. 2) The instructions of nodes V1, V2 and V5 are scheduled to the earliest position where the memory unit can start execution, that is, the starting position of the memory unit, such as the position marked by symbol 1 in the memory unit in FIG15.
  • the scheduling of the second-step parallel instructions includes the following process: 1) Since the second-step parallel instructions include the instructions included in node V 3 , and the instructions are all arithmetic operation instructions, the instructions included in node V 3 are scheduled to the arithmetic logic unit. 2) The instructions included in node V 3 are scheduled to the earliest position where the arithmetic logic unit can start execution, such as the position marked by symbol 2 in the arithmetic logic unit in Figure 15.
  • Scheduling the third-step parallel instructions includes the following process: 1) Since the third-step parallel instructions include the instructions of node V4, and these are data transfer instructions, the instructions of node V4 are scheduled to the memory unit. 2) The instructions of node V4 are scheduled to the earliest position where the memory unit can start execution, such as the position marked by symbol 3 in the memory unit in FIG15.
  • Scheduling the fourth step parallel instructions includes the following process: 1) Since the fourth step parallel instructions include the instructions included in the node V 6 , and the instructions are all arithmetic operation instructions, the instructions included in the node V 6 are scheduled to the arithmetic logic unit. 2) The instructions included in the node V 6 are scheduled to the earliest position where the arithmetic logic unit can start execution, such as the position marked by the symbol 4 in the arithmetic logic unit in Figure 15.
  • Scheduling the fifth-step parallel instructions includes the following process: 1) Since the fifth-step parallel instructions include the instructions of nodes V7 and V9, where the instructions of node V9 are data transfer instructions and the instructions of node V7 are arithmetic operation instructions, the instructions of node V9 are scheduled to the memory unit and the instructions of node V7 are scheduled to the arithmetic logic unit. 2) The instructions of node V9 are scheduled to the earliest position where the memory unit can start execution, such as the position marked by symbol 5 in the memory unit in FIG15; the instructions of node V7 are scheduled to the earliest position where the arithmetic logic unit can start execution, such as the position marked by symbol 5 in the arithmetic logic unit in FIG15.
  • Scheduling the sixth-step parallel instructions includes the following process: 1) Since the sixth-step parallel instructions include the instructions of node V8, and these are data transfer instructions, the instructions of node V8 are scheduled to the memory unit. 2) The instructions of node V8 are scheduled to the earliest position where the memory unit can start execution, such as the position marked by symbol 6 in the memory unit in FIG15.
  • Scheduling the seventh step parallel instructions includes the following process: 1) Since the seventh step parallel instructions include the instructions included in the node V10 , and the instructions are all arithmetic operation instructions, the instructions included in the node V10 are scheduled to the arithmetic logic unit. 2) The instructions included in the node V10 are scheduled to the earliest position where the arithmetic logic unit can start execution, such as the position marked by the symbol 7 in the arithmetic logic unit in Figure 15.
  • Scheduling the eighth step parallel instructions includes the following process: 1) Since the eighth step parallel instructions include the instructions included in the node V 11 , and the instructions are all arithmetic operation instructions, the instructions included in the node V 11 are scheduled to the arithmetic logic unit. 2) The instructions included in the node V 11 are scheduled to the earliest position where the arithmetic logic unit can start execution, such as the position marked by the symbol 8 in the arithmetic logic unit in FIG. 15.
  • Step S7 construct the shortest schedule of parallel instructions: the shortest time required for parallel instruction execution under the condition of hardware resource constraints;
  • the shortest schedule for constructing parallel instructions refers to the shortest time required for parallel instruction execution under the condition of hardware resource constraints. It is assumed that all instruction operations require one clock cycle, except for the data loading instruction LD, which requires two clock cycles. Considering that the hardware resources adopt a mechanism of first caching the data to be loaded into a temporary table for the case of loading first and then storing immediately, and then storing the data from the temporary table into the memory resource when the data storage instruction needs to be executed, the data storage instruction ST at the same storage location can be executed one clock after the data loading instruction LD at the location starts.
  • Each data handling instruction occupies the hardware memory port while it executes; when multiple data handling instructions need to be executed in parallel, only one can execute at a time, and the execution order follows the principle of giving priority to the instruction that can be executed earliest in the topological structure diagram of the instruction dependency.
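Under these assumptions — LD takes two clock cycles, other instructions one, and a single memory port serializes data handling instructions — the six-cycle figure for the first step follows directly. A minimal check (constant names are ours, reflecting our reading of the cost model):

```python
# Cost model from the text: LD = 2 cycles, other instructions = 1 cycle,
# one memory port, and an ST to the same location may begin one clock
# after the matching LD starts (via the temporary table).
LD_CYCLES = 2
ALU_CYCLES = 1
ST_AFTER_LD_OFFSET = 1

def serialized_load_time(num_loads):
    """Cycles to issue num_loads LD instructions through one memory port."""
    return num_loads * LD_CYCLES

# First parallel step: the LDs of nodes V1, V2 and V5 run back to back.
first_step_cycles = serialized_load_time(3)
```

With three back-to-back two-cycle loads, the first step occupies six clock cycles, matching the schedule described next.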
  • Constructing the shortest schedule of parallel instructions comprises the following process:
  • Shortest scheduling of the first step of parallel instructions: since the first-step parallel instructions comprise nodes V1, V2 and V5, each containing a data loading instruction LD among the data movement instructions, and each data loading instruction takes two clock cycles, the data loading instructions LD contained in nodes V1, V2 and V5 are executed in sequence, following the principle of executing first the instruction that can start earliest in the topological graph of instruction dependencies. This operation takes 6 clock cycles in total.
  • Shortest scheduling of the second step of parallel instructions: since the second-step parallel instructions comprise the arithmetic instruction SUB contained in node V3, the operation takes 1 clock cycle in total.
  • Shortest scheduling of the third step of parallel instructions: since the third-step parallel instructions comprise the data loading instruction LD contained in node V4, the operation takes 2 clock cycles in total.
  • Shortest scheduling of the fourth step of parallel instructions: since the fourth-step parallel instructions comprise the arithmetic instruction MUL contained in node V6, the operation takes 1 clock cycle in total.
  • Shortest scheduling of the fifth step of parallel instructions: since the fifth-step parallel instructions comprise the arithmetic instruction ADD contained in node V7 and the data loading instruction LD contained in node V9, the ADD instruction of node V7 and the LD instruction of node V9 can execute simultaneously; the ADD instruction takes 1 clock cycle and the LD instruction takes 2 clock cycles, so the operation takes 2 clock cycles in total.
  • Shortest scheduling of the sixth step of parallel instructions: since the sixth-step parallel instructions comprise the data loading instruction LD contained in node V8, the operation takes 2 clock cycles in total.
  • Shortest scheduling of the seventh step of parallel instructions: since the seventh-step parallel instructions comprise the arithmetic instruction ADD contained in node V10, the operation takes 1 clock cycle in total.
  • Shortest scheduling of the eighth step of parallel instructions: since the eighth-step parallel instructions comprise the arithmetic instruction SUB contained in node V11, the operation takes 1 clock cycle in total.
  • The time required to execute the whole topological graph of instruction dependencies is the accumulation of the time required by each step of the shortest schedule above. The total is therefore 6+1+2+1+2+2+1+1, i.e., executing the topological graph takes 16 clock cycles in total, as shown in Figure 16.
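The per-step timing reasoning above can be sketched in a few lines of Python. The step contents and latencies (two cycles per LD, one cycle per arithmetic instruction, a single serialized memory port, arithmetic overlapping with memory traffic within a step) follow this embodiment's stated assumptions; the code is an illustrative model, not part of the claimed method.

```python
# Timing model for the shortest schedule of this embodiment.
# Assumptions: LD takes 2 cycles, every other instruction takes 1 cycle,
# LD instructions within one step serialize on the single memory port,
# and arithmetic may overlap with memory traffic in the same step.

LD_CYCLES = 2
OP_CYCLES = 1

# Each step lists the instruction kinds it contains (nodes V1..V11).
steps = [
    ["LD", "LD", "LD"],  # step 1: V1, V2, V5
    ["SUB"],             # step 2: V3
    ["LD"],              # step 3: V4
    ["MUL"],             # step 4: V6
    ["ADD", "LD"],       # step 5: V7 and V9 run in parallel
    ["LD"],              # step 6: V8
    ["ADD"],             # step 7: V10
    ["SUB"],             # step 8: V11
]

def step_cycles(kinds):
    # Memory operations serialize on the port; arithmetic overlaps with them.
    mem = sum(LD_CYCLES for k in kinds if k == "LD")
    alu = sum(OP_CYCLES for k in kinds if k != "LD")
    return max(mem, alu) if mem else alu

per_step = [step_cycles(s) for s in steps]
total = sum(per_step)
print(per_step)  # [6, 1, 2, 1, 2, 2, 1, 1]
print(total)     # 16
```

Running the sketch reproduces the per-step cycle counts 6, 1, 2, 1, 2, 2, 1, 1 and the 16-cycle total of Figure 16.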
  • In Figure 16, the notation ⓒ: a means that executing the c-th step of parallel instructions takes a clock cycles; for example, ①: 6 means that executing the first step of parallel instructions takes 6 clock cycles.
  • Step S8: release the instructions that have finished executing.
  • The present invention also provides an embodiment of an instruction execution device for graph computing.
  • An embodiment of the present invention provides an instruction execution device for graph computing, including a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the instruction execution method for graph computing in the above embodiment.
  • The embodiments of the instruction execution device for graph computing of the present invention can be applied to any device with data processing capability, such as a computer or another device or apparatus.
  • The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed by the processor of the device with data processing capability in which it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution. At the hardware level, Figure 17 shows a hardware structure diagram of the device with data processing capability in which the instruction execution device for graph computing of the present invention is located; in addition to the processor, memory, network interface, and non-volatile memory shown in Figure 17, the device in which the embodiment's apparatus is located may also include other hardware according to the actual function of that device, which is not described in detail here.
  • Since the device embodiment substantially corresponds to the method embodiment, the relevant parts may refer to the description of the method embodiment.
  • The device embodiment described above is only illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.
  • An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the instruction execution method for graph computing in the above embodiments is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the aforementioned embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, a flash card, etc. equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capability.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Devices For Executing Special Programs (AREA)
  • Advance Control (AREA)

Abstract

The present invention discloses an instruction execution method and apparatus for graph computing, comprising the following steps: Step S1: deliver the operator of each node in a computation graph for neural network computing to an operator interpreter; Step S2: the operator interpreter builds runtime instructions; Step S3: define instruction dependencies; Step S4: build an instruction dependency graph; Step S5: build a topological order of parallel instructions; Step S6: schedule the parallel instructions onto hardware resources; Step S7: construct the shortest schedule of the parallel instructions: the shortest time required to execute the parallel instructions under hardware resource constraints; Step S8: release the instructions that have finished executing. The present invention analyzes, from a global perspective, the dependencies among the instructions contained in the nodes during execution of the computation graph, derives from those dependencies the topological order of the instructions in the global computation graph that can execute in parallel, provides a method and apparatus for scheduling the parallel instructions onto hardware resources as fast as possible, and optimizes the compilation efficiency of the computation graph.

Description

Instruction execution method and apparatus for graph computing
Cross-reference to related applications
This application claims priority to Chinese patent application No. CN 202211177797.3, entitled "Instruction execution method and apparatus for graph computing", filed with the China National Intellectual Property Administration on September 27, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of computer systems based on specific computation models, and in particular to an instruction execution method and apparatus for graph computing.
Background
With the deployment of neural network models in recent years, compilation techniques for neural networks have become increasingly important. Existing computation-graph compilation techniques still do not analyze, from a global perspective, the dependencies among the instructions contained in the nodes during execution of the computation graph, nor do they derive from those dependencies the topological order of the instructions in the global computation graph that can execute in parallel. By analyzing the dependencies among instructions during execution of the computation graph and building a topological order of parallel instructions, the present invention provides a method and apparatus for scheduling parallel instructions onto hardware resources as fast as possible, and provides a compilation technique for an instruction execution method and apparatus for graph computing.
Summary of the invention
The object of the present invention is to provide an instruction execution method and apparatus for graph computing, which solve the problem of how to analyze, from a global perspective, the dependencies among the instructions contained in the nodes during execution of a computation graph, derive from those dependencies the topological order of the instructions in the global computation graph that can execute in parallel, and schedule the parallel instructions onto hardware resources as fast as possible.
The technical solution adopted by the present invention is as follows:
An instruction execution method for graph computing comprises the following steps:
Step S1: deliver the operator of each node in the computation graph for neural network computing to the operator interpreter;
Step S2: the operator interpreter builds the runtime instructions;
Step S3: define instruction dependencies;
Step S4: build the instruction dependency graph;
Step S5: build the topological order of parallel instructions;
Step S6: schedule the parallel instructions onto hardware resources;
Step S7: construct the shortest schedule of parallel instructions: the shortest time required to execute the parallel instructions under hardware resource constraints;
Step S8: release the instructions that have finished executing.
Further, the instruction dependencies in step S3 include write-read strong dependencies, read-write weak dependencies, and write-write weak dependencies.
Further, the write-read strong dependency is: an instruction operation writes a register first and a later instruction operation reads the same register, and the instruction operation that later reads the register depends on the instruction operation that first writes the register.
Further, the read-write weak dependency is: an instruction operation reads a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first reads the register.
Further, the write-write weak dependency is: an instruction operation writes a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first writes the register.
Further, step S4 specifically comprises: traversing each node in turn according to the topological structure of the computation graph and, by analyzing the dependencies between each node's instructions and the instructions of its successor nodes, building each node's dependency edges to form the instruction dependency graph.
Further, step S5 specifically comprises: traversing each computation node in turn according to the topological structure of the computation graph and, meanwhile, obtaining from the instruction dependency graph the instructions executed in parallel at each step of the execution flow, thereby obtaining the topological order of the parallel instructions.
Further, step S6 specifically comprises: scheduling the instructions executed in parallel at each step onto the corresponding hardware resources according to the topological order of the instruction dependency graph.
The present invention also provides an instruction execution apparatus for graph computing, comprising a memory and one or more processors, wherein executable code is stored in the memory, and when executing the executable code, the one or more processors implement the instruction execution method for graph computing described in any of the above embodiments.
The present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the instruction execution method for graph computing described in any of the above embodiments is implemented.
The beneficial effects of the present invention are as follows: the present invention analyzes, from a global perspective, the dependencies among the instructions contained in the nodes during execution of a computation graph, derives from those dependencies the topological order of the instructions in the global computation graph that can execute in parallel, and provides a method and apparatus for scheduling parallel instructions onto hardware resources as fast as possible. It improves the instruction execution efficiency of graph computing by analyzing and designing parallel computing operations, and provides a compilation technique for an instruction execution method and apparatus for graph computing. When researchers and engineering users develop algorithm models, they can use the described instruction execution method and apparatus for graph computing to optimize their models, which optimizes the compilation efficiency of the computation graph and promotes the development and deployment of deep neural network models.
Brief description of the drawings
Figure 1 is a flow diagram of the instruction execution method for graph computing of the present invention;
Figure 2 is an architecture diagram of the instruction execution method for graph computing of the embodiment;
Figure 3 is the computation graph for neural network computing of the embodiment;
Figure 4 shows the operator interpreter of the embodiment building the runtime instructions;
Figure 5 shows the dependencies among the instructions of the embodiment;
Figure 6 shows the analysis of instruction dependencies of the embodiment;
Figure 7 shows the instructions executed in parallel in the first step of the embodiment;
Figure 8 shows the instructions executed in parallel in the second step of the embodiment;
Figure 9 shows the instructions executed in parallel in the third step of the embodiment;
Figure 10 shows the instructions executed in parallel in the fourth step of the embodiment;
Figure 11 shows the instructions executed in parallel in the fifth step of the embodiment;
Figure 12 shows the instructions executed in parallel in the sixth step of the embodiment;
Figure 13 shows the instructions executed in parallel in the seventh step of the embodiment;
Figure 14 shows the instructions executed in parallel in the eighth step of the embodiment;
Figure 15 shows the analysis of the parallel execution order of the instructions of the embodiment;
Figure 16 shows the shortest scheduling of parallel instructions of the embodiment;
Figure 17 is a structural diagram of the instruction execution apparatus for graph computing of the present invention.
Detailed description of the embodiments
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Figure 1, an instruction execution method for graph computing comprises the following steps:
Step S1: deliver the operator of each node in the computation graph for neural network computing to the operator interpreter;
Step S2: the operator interpreter builds the runtime instructions;
Step S3: define instruction dependencies;
The instruction dependencies include write-read strong dependencies, read-write weak dependencies, and write-write weak dependencies;
Further, the write-read strong dependency is: an instruction operation writes a register first and a later instruction operation reads the same register, and the instruction operation that later reads the register depends on the instruction operation that first writes the register;
Further, the read-write weak dependency is: an instruction operation reads a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first reads the register;
Further, the write-write weak dependency is: an instruction operation writes a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first writes the register.
Step S4: build the instruction dependency graph;
Traverse each node in turn according to the topological structure of the computation graph and, by analyzing the dependencies between each node's instructions and the instructions of its successor nodes, build each node's dependency edges to form the instruction dependency graph.
Step S5: build the topological order of parallel instructions;
Traverse each computation node in turn according to the topological structure of the computation graph and, meanwhile, obtain from the instruction dependency graph the instructions executed in parallel at each step of the execution flow, thereby obtaining the topological order of the parallel instructions.
Step S6: schedule the parallel instructions onto hardware resources;
Schedule the instructions executed in parallel at each step onto the corresponding hardware resources according to the topological order of the instruction dependency graph.
Step S7: construct the shortest schedule of parallel instructions: the shortest time required to execute the parallel instructions under hardware resource constraints.
Step S8: release the instructions that have finished executing.
Embodiment: referring to Figure 2, which shows the architecture diagram of the instruction execution method for graph computing;
An instruction execution method for graph computing comprises the following steps:
Referring to Figure 3, step S1: deliver the operator of each node in the computation graph for neural network computing to the operator interpreter;
tf.matmul(x,y): denotes matrix multiplication of tensor x and tensor y;
tf.subtract(x,y): denotes matrix subtraction of tensor x and tensor y;
tf.add(x,y): denotes matrix addition of tensor x and tensor y;
Referring to Figure 4, step S2: the operator interpreter builds the runtime instructions;
LD ri, x: a register-write instruction that writes the value of the tensor variable x in memory into register ri;
MUL ri, rj, rk: performs a matrix multiplication: reads the tensor variables in registers rj and rk, multiplies the obtained tensor variables as matrices, and writes the result into register ri;
ADD ri, rj, rk: performs a matrix addition: reads the tensor variables in registers rj and rk, adds the obtained tensor variables as matrices, and writes the result into register ri;
SUB ri, rj, rk: performs a matrix subtraction: reads the tensor variables in registers rj and rk, subtracts the obtained tensor variables as matrices, and writes the result into register ri.
Referring to Figure 5, step S3: define instruction dependencies;
LD ri, x: a register-write instruction that writes the value of the tensor variable x in memory into register ri;
ST y, ri: a register-read instruction that reads the value in register ri and writes it into the tensor variable y in memory;
In Figure 5, four operations on a register ri are distinguished: the former instruction's write of register ri, the former instruction's read of register ri, the latter instruction's write of register ri, and the latter instruction's read of register ri.
The instruction dependencies include write-read strong dependencies, read-write weak dependencies, and write-write weak dependencies;
Further, the write-read strong dependency is: an instruction operation writes a register first and a later instruction operation reads the same register, and the instruction operation that later reads the register depends on the instruction operation that first writes the register;
Further, the read-write weak dependency is: an instruction operation reads a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first reads the register;
Further, the write-write weak dependency is: an instruction operation writes a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first writes the register.
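As an illustrative sketch (the function and set names here are ours, not the patent's), the three dependency kinds can be derived directly from the register sets a "former" instruction and a "latter" instruction write and read:

```python
# Hypothetical sketch of classifying the dependency of a latter instruction
# on a former instruction over the registers they share.

def classify(former_writes, former_reads, latter_writes, latter_reads):
    """Return the dependency kinds of the latter instruction on the former."""
    deps = set()
    if former_writes & latter_reads:
        deps.add("write-read strong")   # write first, read the same register later
    if former_reads & latter_writes:
        deps.add("read-write weak")     # read first, write the same register later
    if former_writes & latter_writes:
        deps.add("write-write weak")    # write first, write the same register later
    return deps

# LD r1, x  followed by  SUB r1, r1, r2:
print(sorted(classify({"r1"}, set(), {"r1"}, {"r1", "r2"})))
# → ['write-read strong', 'write-write weak']
```

For example, the pair LD r1, x followed by SUB r1, r1, r2 exhibits both a write-read strong dependency (r1 is written, then read) and a write-write weak dependency (r1 is written twice).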
Step S4: build the instruction dependency graph;
Traverse each node in turn according to the topological structure of the computation graph and, by analyzing the dependencies between each node's instructions and the instructions of its successor nodes, build each node's dependency edges to form the instruction dependency graph;
Analyzing the dependencies between each node's instructions and the instructions of its successor nodes means analyzing, for each node, the dependency relations between its instructions and those of its successor nodes, where the dependency relations include a write-read strong dependency, a read-write weak dependency, and a write-write weak dependency.
Referring to Figure 6, which shows the analysis process of building the dependency edges for each node;
Vi → Vj: denotes that node Vj strongly depends on node Vi, i.e., node Vi and node Vj have a write-read dependency.
Vi → Vj (drawn with a different edge style in Figure 6): denotes that node Vj weakly depends on node Vi, i.e., node Vi and node Vj have a read-write dependency.
① Vi: denotes that the parallel instructions that can execute simultaneously in step 1 include the instruction at node Vi.
Node V1: node V1 contains a write of register r1 and node V3 contains a read of register r1, so there is a write-read strong dependency between the instructions of node V1 and node V3.
Node V2: node V2 contains a write of register r2 and node V3 contains a read of register r2, so there is a write-read strong dependency between the instructions of node V2 and node V3.
Node V3: 1) node V3 contains a read of register r2 and node V4 contains a write of register r2, so there is a read-write weak dependency between the instructions of node V3 and node V4. 2) node V3 contains a write of register r1 and node V7 contains a read of register r1, so there is a write-read strong dependency between the instructions of node V3 and node V7.
Node V4: node V4 contains a write of register r2 and node V6 contains a read of register r2, so there is a write-read strong dependency between the instructions of node V4 and node V6.
Node V5: node V5 contains a write of register r3 and node V6 contains a read of register r3, so there is a write-read strong dependency between the instructions of node V5 and node V6.
Node V6: 1) node V6 contains a write of register r2 and node V7 contains a read of register r2, so there is a write-read strong dependency between the instructions of node V6 and node V7. 2) node V6 contains a read of register r3 and node V9 contains a write of register r3, so there is a read-write weak dependency between the instructions of node V6 and node V9.
Node V7: node V7 contains a read of register r2 and node V8 contains a write of register r2, so there is a read-write weak dependency between the instructions of node V7 and node V8.
Node V8: node V8 contains a write of register r2 and node V10 contains a read of register r2, so there is a write-read strong dependency between the instructions of node V8 and node V10.
Node V9: node V9 contains a write of register r3 and node V10 contains a read of register r3, so there is a write-read strong dependency between the instructions of node V9 and node V10.
Node V10: node V10 contains a write of register r2 and node V11 contains a read of register r2, so there is a write-read strong dependency between the instructions of node V10 and node V11.
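The edge-building analysis above can be sketched as a forward scan that tracks, per register, the last writer and the readers since that write. The per-node read/write sets below are reconstructed from this embodiment's analysis; operands not stated explicitly (e.g. the full operand list of V11) are assumptions for illustration only.

```python
# Sketch of step S4: deriving dependency edges from each node's register
# reads and writes, scanning nodes in the computation graph's topological order.

# node id -> (registers read, registers written); reconstructed from the text.
nodes = {
    1: (set(), {"r1"}),          # LD  r1, ...
    2: (set(), {"r2"}),          # LD  r2, ...
    3: ({"r1", "r2"}, {"r1"}),   # SUB r1, r1, r2
    4: (set(), {"r2"}),          # LD  r2, ...
    5: (set(), {"r3"}),          # LD  r3, ...
    6: ({"r2", "r3"}, {"r2"}),   # MUL r2, r2, r3
    7: ({"r1", "r2"}, {"r1"}),   # ADD r1, r1, r2
    8: (set(), {"r2"}),          # LD  r2, ...
    9: (set(), {"r3"}),          # LD  r3, ...
    10: ({"r2", "r3"}, {"r2"}),  # ADD r2, r2, r3
    11: ({"r2"}, {"r2"}),        # SUB using r2 (operands assumed)
}

def build_edges(nodes):
    edges = set()
    last_write = {}   # register -> node that last wrote it
    readers = {}      # register -> nodes that read it since the last write
    for v, (reads, writes) in nodes.items():
        for r in reads:                     # write-read strong dependency
            if r in last_write:
                edges.add((last_write[r], v))
            readers.setdefault(r, []).append(v)
        for r in writes:
            prior = [u for u in readers.get(r, []) if u != v]
            if prior:                       # read-write weak dependency
                edges.update((u, v) for u in prior)
            elif r in last_write:           # write-write weak dependency
                edges.add((last_write[r], v))
            last_write[r], readers[r] = v, []
    return edges

print(sorted(build_edges(nodes)))
# → [(1, 3), (2, 3), (3, 4), (3, 7), (4, 6), (5, 6), (6, 7), (6, 9),
#    (7, 8), (8, 10), (9, 10), (10, 11)]
```

The scan reproduces exactly the twelve dependency edges derived node by node in the analysis above.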
Step S5: build the topological order of parallel instructions;
Traverse each computation node in turn according to the topological structure of the computation graph and, meanwhile, obtain from the instruction dependency graph the instructions executed in parallel at each step of the execution flow, thereby obtaining the topological order of the parallel instructions;
The instructions executed in parallel at each step mean that, when runtime execution reaches the state of the instruction currently under analysis, if that instruction has no predecessor node it depends on in the instruction dependency graph, then the instructions that can currently execute in parallel include the instruction currently under analysis.
Referring to Figure 7, which shows the instructions executed in parallel in the first step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ① in the figure;
Instructions executable in parallel in the first step: since the instructions contained in nodes V1, V2 and V5 have no dependencies, the instructions contained in nodes V1, V2 and V5 can be executed in parallel in the first step.
Referring to Figure 8, which shows the instructions executed in parallel in the second step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ② in the figure.
Instructions executable in parallel in the second step: since node V3 depends on the instructions contained in nodes V1 and V2, the instructions contained in node V3 can be executed in the second step. Since node V6 depends not only on node V5 but also on node V4, and node V4 in turn depends on node V3, node V6 has an indirect dependency on node V3, so the instructions contained in node V6 cannot be executed in the second step. The analysis concludes that the instructions contained in node V3 can be executed in parallel in the second step.
Referring to Figure 9, which shows the instructions executed in parallel in the third step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ③ in the figure.
Instructions executable in parallel in the third step: the nodes that directly depend on node V3 are V4 and V7. Node V4 depends only on node V3, so the instructions contained in node V4 can be executed in the third step. Node V7 depends not only on node V3 but also on node V6, and node V6 in turn depends on node V4, so node V7 has an indirect dependency on node V4 and its instructions cannot be executed in the third step. The analysis concludes that the instructions contained in node V4 can be executed in parallel in the third step.
Referring to Figure 10, which shows the instructions executed in parallel in the fourth step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ④ in the figure.
Instructions executable in parallel in the fourth step: the only node that directly depends on node V4 is V6. Although node V6 also depends on node V5, the instructions contained in node V5 finished executing in the first step, so at the fourth step node V6 can be regarded as depending only on node V4, and its instructions can be executed in the fourth step. The analysis concludes that the instructions contained in node V6 can be executed in parallel in the fourth step.
Referring to Figure 11, which shows the instructions executed in parallel in the fifth step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ⑤ in the figure.
Instructions executable in parallel in the fifth step: the nodes that directly depend on node V6 are V7 and V9, and node V9 depends only on node V6. The analysis concludes that the instructions contained in nodes V7 and V9 can be executed in parallel in the fifth step.
Referring to Figure 12, which shows the instructions executed in parallel in the sixth step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ⑥ in the figure.
Instructions executable in parallel in the sixth step: the node that directly depends on node V7 is V8, and the node that directly depends on node V9 is V10, but node V10 depends on node V8. The analysis concludes that the instructions contained in node V8 can be executed in parallel in the sixth step.
Referring to Figure 13, which shows the instructions executed in parallel in the seventh step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ⑦ in the figure.
Instructions executable in parallel in the seventh step: the node that directly depends on node V8 is V10; although node V10 also depends on node V9, the instructions contained in node V9 finished executing in the fifth step. The analysis concludes that the instructions contained in node V10 can be executed in parallel in the seventh step.
Referring to Figure 14, which shows the instructions executed in parallel in the eighth step, i.e., the instructions covered by the gray shaded rectangle marked with the symbol ⑧ in the figure.
Instructions executable in parallel in the eighth step: the only node that directly depends on node V10 is V11. The analysis concludes that the instructions contained in node V11 can be executed in parallel in the eighth step.
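The eight per-step parallel sets derived above can be reproduced mechanically by a layered topological sort: repeatedly peel off the nodes whose dependencies are all satisfied. The following Python sketch encodes the twelve dependency edges of this embodiment (node Vi is written as the integer i) and is an illustration, not a normative part of the method.

```python
# Sketch of step S5: grouping instructions into per-step parallel sets by
# repeatedly removing zero-in-degree nodes from the dependency graph.

from collections import defaultdict

edges = [(1, 3), (2, 3), (3, 4), (3, 7), (4, 6), (5, 6), (6, 7), (6, 9),
         (7, 8), (8, 10), (9, 10), (10, 11)]
all_nodes = set(range(1, 12))  # V1 .. V11

def parallel_steps(nodes, edges):
    indeg = {v: 0 for v in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        indeg[v] += 1
        succ[u].append(v)
    steps, ready = [], sorted(v for v in nodes if indeg[v] == 0)
    while ready:
        steps.append(ready)
        nxt = []
        for u in ready:            # retire this step; release successors
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        ready = sorted(nxt)
    return steps

print(parallel_steps(all_nodes, edges))
# → [[1, 2, 5], [3], [4], [6], [7, 9], [8], [10], [11]]
```

The output matches the eight steps of Figures 7-14: {V1, V2, V5}, {V3}, {V4}, {V6}, {V7, V9}, {V8}, {V10}, {V11}.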
Step S6: schedule the parallel instructions onto hardware resources;
Schedule the instructions executed in parallel at each step onto the corresponding hardware resources according to the topological order of the instruction dependency graph;
When scheduling each step's parallel instructions onto the corresponding hardware resources, the data loading instructions LD and data storing instructions ST concerning data movement are scheduled to the memory unit, and the instructions concerning arithmetic operations are scheduled to the arithmetic logic unit. Scheduling instructions onto hardware resources means scheduling each step's parallel instructions to the earliest position at which the corresponding hardware resource can start executing them. Considering that the resource of the hardware memory port is continuously occupied by the instructions contained in the predecessor nodes on which the current instruction depends, the earliest position at which the hardware resource can start executing is the position at which the instructions contained in the current instruction's predecessor nodes in the topological graph of instruction dependencies finish executing.
Scheduling the first step of parallel instructions comprises the following process: 1) since the first-step parallel instructions include the instructions contained in nodes V1, V2 and V5, all of which are data movement instructions, the instructions contained in nodes V1, V2 and V5 are scheduled to the memory unit; 2) the instructions contained in nodes V1, V2 and V5 are scheduled to the earliest position at which the memory unit can start executing them, i.e., the start of the memory unit, as marked by the symbol ① in Figure 15.
Scheduling the second step of parallel instructions comprises the following process: 1) since the second-step parallel instructions include the instructions contained in node V3, all of which are arithmetic instructions, they are scheduled to the arithmetic logic unit; 2) they are scheduled to the earliest position at which the arithmetic logic unit can start executing them, as marked by the symbol ② in the arithmetic logic unit in Figure 15.
Scheduling the third step of parallel instructions comprises the following process: 1) since the third-step parallel instructions include the instructions contained in node V4, which are data movement instructions, they are scheduled to the memory unit; 2) they are scheduled to the earliest position at which the memory unit can start executing them, as marked by the symbol ③ in Figure 15.
Scheduling the fourth step of parallel instructions comprises the following process: 1) since the fourth-step parallel instructions include the instructions contained in node V6, all of which are arithmetic instructions, they are scheduled to the arithmetic logic unit; 2) they are scheduled to the earliest position at which the arithmetic logic unit can start executing them, as marked by the symbol ④ in the arithmetic logic unit in Figure 15.
Scheduling the fifth step of parallel instructions comprises the following process: 1) since the fifth-step parallel instructions include the instructions contained in nodes V7 and V9, where the instructions of node V9 are data movement instructions and the instructions of node V7 are arithmetic instructions, the instructions of node V9 are scheduled to the memory unit and the instructions of node V7 are scheduled to the arithmetic logic unit; 2) the instructions of node V9 are scheduled to the earliest position at which the memory unit can start executing them, as marked by the symbol ⑤ in Figure 15, and the instructions of node V7 are scheduled to the earliest position at which the arithmetic logic unit can start executing them, as marked by the symbol ⑤ in the arithmetic logic unit in Figure 15.
Scheduling the sixth step of parallel instructions comprises the following process: 1) since the sixth-step parallel instructions include the instructions contained in node V8, which are data movement instructions, they are scheduled to the memory unit; 2) they are scheduled to the earliest position at which the memory unit can start executing them, as marked by the symbol ⑥ in Figure 15.
Scheduling the seventh step of parallel instructions comprises the following process: 1) since the seventh-step parallel instructions include the instructions contained in node V10, all of which are arithmetic instructions, they are scheduled to the arithmetic logic unit; 2) they are scheduled to the earliest position at which the arithmetic logic unit can start executing them, as marked by the symbol ⑦ in the arithmetic logic unit in Figure 15.
Scheduling the eighth step of parallel instructions comprises the following process: 1) since the eighth-step parallel instructions include the instructions contained in node V11, all of which are arithmetic instructions, they are scheduled to the arithmetic logic unit; 2) they are scheduled to the earliest position at which the arithmetic logic unit can start executing them, as marked by the symbol ⑧ in the arithmetic logic unit in Figure 15.
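A minimal sketch of this dispatch rule, assuming the per-step instruction contents listed above: LD/ST instructions go to the memory unit and arithmetic instructions go to the arithmetic logic unit, with each instruction placed at its step's slot as the earliest available position. The data layout is ours, for illustration only.

```python
# Sketch of step S6: dispatching each step's parallel instructions to the
# unit type they need (memory unit for LD/ST, arithmetic logic unit otherwise).

step_instrs = [
    [("V1", "LD"), ("V2", "LD"), ("V5", "LD")],  # step 1
    [("V3", "SUB")],                             # step 2
    [("V4", "LD")],                              # step 3
    [("V6", "MUL")],                             # step 4
    [("V7", "ADD"), ("V9", "LD")],               # step 5
    [("V8", "LD")],                              # step 6
    [("V10", "ADD")],                            # step 7
    [("V11", "SUB")],                            # step 8
]

def dispatch(steps):
    schedule = {"memory": [], "alu": []}
    for i, instrs in enumerate(steps, start=1):
        for node, op in instrs:
            unit = "memory" if op in ("LD", "ST") else "alu"
            schedule[unit].append((i, node, op))  # step index = earliest slot
    return schedule

sched = dispatch(step_instrs)
print(sched["alu"])
# → [(2, 'V3', 'SUB'), (4, 'V6', 'MUL'), (5, 'V7', 'ADD'),
#    (7, 'V10', 'ADD'), (8, 'V11', 'SUB')]
```

The resulting two lists correspond to the two rows of Figure 15: the memory unit receives the LD instructions of steps ①, ③, ⑤, ⑥, and the arithmetic logic unit receives the SUB/MUL/ADD instructions of steps ②, ④, ⑤, ⑦, ⑧.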
Step S7: construct the shortest schedule of parallel instructions: the shortest time required to execute the parallel instructions under hardware resource constraints;
Constructing the shortest schedule of parallel instructions means finding the shortest time required to execute the parallel instructions under hardware resource constraints. It is assumed that every instruction operation takes one clock cycle, except the data loading instruction LD, which takes two clock cycles. For the case of a load immediately followed by a store, the hardware first caches the data to be loaded in a temporary table and, when the data storing instruction needs to be executed, stores the data from the temporary table into the memory resource; therefore a data storing instruction ST on a storage location can start executing one clock cycle after the data loading instruction LD on that location starts. In the process of constructing the shortest schedule, each data movement instruction occupies the hardware memory port while it executes, so when several data movement instructions need to execute in parallel, only one can execute at a time, and the execution order follows the principle of giving priority to the instruction that can start earliest in the topological graph of instruction dependencies.
Constructing the shortest schedule of parallel instructions comprises the following process:
Shortest scheduling of the first step of parallel instructions: since the first-step parallel instructions comprise nodes V1, V2 and V5, each containing a data loading instruction LD among the data movement instructions, and each data loading instruction takes two clock cycles, the data loading instructions LD contained in nodes V1, V2 and V5 are executed in sequence, following the principle of executing first the instruction that can start earliest in the topological graph of instruction dependencies. This operation takes 6 clock cycles in total.
Shortest scheduling of the second step of parallel instructions: since the second-step parallel instructions comprise the arithmetic instruction SUB contained in node V3, the operation takes 1 clock cycle in total.
Shortest scheduling of the third step of parallel instructions: since the third-step parallel instructions comprise the data loading instruction LD contained in node V4, the operation takes 2 clock cycles in total.
Shortest scheduling of the fourth step of parallel instructions: since the fourth-step parallel instructions comprise the arithmetic instruction MUL contained in node V6, the operation takes 1 clock cycle in total.
Shortest scheduling of the fifth step of parallel instructions: since the fifth-step parallel instructions comprise the arithmetic instruction ADD contained in node V7 and the data loading instruction LD contained in node V9, the ADD instruction of node V7 and the LD instruction of node V9 can execute simultaneously; the ADD instruction takes 1 clock cycle and the LD instruction takes 2 clock cycles, so the operation takes 2 clock cycles in total.
Shortest scheduling of the sixth step of parallel instructions: since the sixth-step parallel instructions comprise the data loading instruction LD contained in node V8, the operation takes 2 clock cycles in total.
Shortest scheduling of the seventh step of parallel instructions: since the seventh-step parallel instructions comprise the arithmetic instruction ADD contained in node V10, the operation takes 1 clock cycle in total.
Shortest scheduling of the eighth step of parallel instructions: since the eighth-step parallel instructions comprise the arithmetic instruction SUB contained in node V11, the operation takes 1 clock cycle in total.
The time required to execute the whole topological graph of instruction dependencies is the accumulation of the time required by each step of the shortest schedule above. The total is therefore 6+1+2+1+2+2+1+1, i.e., executing the topological graph takes 16 clock cycles in total, as shown in Figure 16.
Meaning of the symbols in Figure 16:
ⓒ: a denotes that executing the c-th step of parallel instructions takes a clock cycles; for example, ①: 6 means that executing the first step of parallel instructions takes 6 clock cycles.
Step S8: release the instructions that have finished executing.
Corresponding to the foregoing embodiments of the instruction execution method for graph computing, the present invention also provides embodiments of an instruction execution apparatus for graph computing.
Referring to Figure 17, the instruction execution apparatus for graph computing provided by an embodiment of the present invention comprises a memory and one or more processors; executable code is stored in the memory, and when executing the executable code, the one or more processors implement the instruction execution method for graph computing in the above embodiments.
The embodiments of the instruction execution apparatus for graph computing of the present invention can be applied to any device with data processing capability, such as a computer or another device or apparatus. The apparatus embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in the logical sense is formed by the processor of the device with data processing capability in which it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution. At the hardware level, Figure 17 shows a hardware structure diagram of the device with data processing capability in which the instruction execution apparatus for graph computing of the present invention is located; in addition to the processor, memory, network interface, and non-volatile memory shown in Figure 17, the device with data processing capability in which the apparatus of the embodiment is located may also include other hardware according to the actual function of that device, which is not described in detail here.
For the implementation process of the functions of the units in the above apparatus, refer to the implementation process of the corresponding steps in the above method, which is not repeated here.
Since the apparatus embodiment substantially corresponds to the method embodiment, the relevant parts may refer to the description of the method embodiment. The apparatus embodiment described above is only illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the instruction execution method for graph computing in the above embodiments is implemented.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card equipped on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

  1. An instruction execution method for graph computing, characterized by comprising the following steps:
    Step S1: delivering the operator of each node in a computation graph for neural network computing to an operator interpreter;
    Step S2: the operator interpreter building runtime instructions;
    Step S3: defining instruction dependencies;
    Step S4: building an instruction dependency graph;
    Step S5: building a topological order of parallel instructions;
    Step S6: scheduling the parallel instructions onto hardware resources;
    Step S7: constructing the shortest schedule of the parallel instructions: the shortest time required to execute the parallel instructions under hardware resource constraints;
    Step S8: releasing the instructions that have finished executing.
  2. The instruction execution method for graph computing according to claim 1, characterized in that the instruction dependencies in step S3 include write-read strong dependencies, read-write weak dependencies, and write-write weak dependencies.
  3. The instruction execution method for graph computing according to claim 2, characterized in that the write-read strong dependency is: an instruction operation writes a register first and a later instruction operation reads the same register, and the instruction operation that later reads the register depends on the instruction operation that first writes the register.
  4. The instruction execution method for graph computing according to claim 2, characterized in that the read-write weak dependency is: an instruction operation reads a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first reads the register.
  5. The instruction execution method for graph computing according to claim 2, characterized in that the write-write weak dependency is: an instruction operation writes a register first and a later instruction operation writes the same register, and the instruction operation that later writes the register depends on the instruction operation that first writes the register.
  6. The instruction execution method for graph computing according to claim 1, characterized in that step S4 specifically comprises: traversing each node in turn according to the topological structure of the computation graph and, by analyzing the dependencies between each node's instructions and the instructions of its successor nodes, building each node's dependency edges to form the instruction dependency graph.
  7. The instruction execution method for graph computing according to claim 1, characterized in that step S5 specifically comprises: traversing each computation node in turn according to the topological structure of the computation graph and, meanwhile, obtaining from the instruction dependency graph the instructions executed in parallel at each step of the execution flow, thereby obtaining the topological order of the parallel instructions.
  8. The instruction execution method for graph computing according to claim 1, characterized in that step S6 specifically comprises: scheduling the instructions executed in parallel at each step onto the corresponding hardware resources according to the topological order of the instruction dependency graph.
  9. An instruction execution apparatus for graph computing, characterized by comprising a memory and one or more processors, wherein executable code is stored in the memory, and when executing the executable code, the one or more processors implement the instruction execution method for graph computing according to any one of claims 1-8.
  10. A computer-readable storage medium, characterized in that a program is stored thereon, and when the program is executed by a processor, the instruction execution method for graph computing according to any one of claims 1-8 is implemented.
PCT/CN2022/124006 2022-09-27 2022-10-09 一种用于图计算的指令执行方法及装置 WO2024065869A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/071,978 US20240118897A1 (en) 2022-09-27 2022-11-30 Instruction Execution Method and Apparatus for Graph Computation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211177797.3A CN115269016A (zh) 2022-09-27 2022-09-27 一种用于图计算的指令执行方法及装置
CN202211177797.3 2022-09-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/071,978 Continuation US20240118897A1 (en) 2022-09-27 2022-11-30 Instruction Execution Method and Apparatus for Graph Computation

Publications (1)

Publication Number Publication Date
WO2024065869A1 true WO2024065869A1 (zh) 2024-04-04

Family

ID=83756230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124006 WO2024065869A1 (zh) 2022-09-27 2022-10-09 一种用于图计算的指令执行方法及装置

Country Status (3)

Country Link
US (1) US20240118897A1 (zh)
CN (1) CN115269016A (zh)
WO (1) WO2024065869A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105408859A (zh) * 2013-09-12 2016-03-16 马维尔国际贸易有限公司 用于指令调度的方法和系统
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 神经网络编译器架构及编译方法
CN111309479A (zh) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 一种任务并行处理的实现方法、装置、设备和介质
CN114461351A (zh) * 2022-04-13 2022-05-10 之江实验室 一种用于神经网络计算的动态图执行方法及装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070958A1 (en) * 2007-01-25 2010-03-18 Nec Corporation Program parallelizing method and program parallelizing apparatus
CN108595157B (zh) * 2018-04-28 2022-05-10 百度在线网络技术(北京)有限公司 区块链数据的处理方法、装置、设备和存储介质
CN110825440B (zh) * 2018-08-10 2023-04-14 昆仑芯(北京)科技有限公司 指令执行方法和装置
CN110377340B (zh) * 2019-07-24 2021-06-01 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN112463709A (zh) * 2019-09-09 2021-03-09 上海登临科技有限公司 可配置的异构人工智能处理器
US11640295B2 (en) * 2020-06-26 2023-05-02 Intel Corporation System to analyze and enhance software based on graph attention networks
CN112037061A (zh) * 2020-08-31 2020-12-04 深圳前海微众银行股份有限公司 区块链中交易的处理方法、装置、电子设备及存储介质
CN113554161A (zh) * 2021-07-20 2021-10-26 清华大学 一种神经网络加速器编译方法及装置
CN114237775A (zh) * 2022-02-21 2022-03-25 众连智能科技有限公司 一种并行执行方法、装置、电子设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105408859A (zh) * 2013-09-12 2016-03-16 马维尔国际贸易有限公司 用于指令调度的方法和系统
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 神经网络编译器架构及编译方法
CN111309479A (zh) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 一种任务并行处理的实现方法、装置、设备和介质
CN114461351A (zh) * 2022-04-13 2022-05-10 之江实验室 一种用于神经网络计算的动态图执行方法及装置

Also Published As

Publication number Publication date
US20240118897A1 (en) 2024-04-11
CN115269016A (zh) 2022-11-01

Similar Documents

Publication Publication Date Title
Aiken et al. Perfect pipelining: A new loop parallelization technique
TWI442233B (zh) 使用交易以平行化循序框架之方法及用於記錄相關指令之電腦儲存媒體
US8667260B2 (en) Building approximate data dependences with a moving window
WO2024021192A1 (zh) 一种用于神经网络计算的图优化方法和装置
JP2014146355A (ja) トランザクションを用いるシーケンシャルフレームワークの並行化
Hosabettu et al. Proof of correctness of a processor with reorder buffer using the completion functions approach
Drăgoi et al. Automatic linearizability proofs of concurrent objects with cooperating updates
WO2024065867A1 (zh) 一种用于神经网络编译的内存优化方法及装置
US8458671B1 (en) Method and system for stack back-tracing in computer programs
Kokologiannakis et al. Dynamic partial order reductions for spinloops
WO2018076979A1 (zh) 一种指令间数据依赖的检测方法和装置
WO2024065869A1 (zh) 一种用于图计算的指令执行方法及装置
Bai et al. Computing execution times with execution decision diagrams in the presence of out-of-order resources
Hosny et al. Characterizing and optimizing EDA flows for the cloud
Abdulla et al. Overcoming Memory Weakness with Unified Fairness
Goossens Dataflow management, dynamic load balancing, and concurrent processing for real‐time embedded vision applications using Quasar
US20040123072A1 (en) Method and system for modeling non-interlocked diversely bypassed exposed pipeline processors for static scheduling
Kazemi et al. A scratchpad memory-based execution platform for functional reactive systems and its static timing analysis
Suba Hierarchical pipelining of nested loops in high-level synthesis
WO2024065866A1 (zh) 一种用于计算图编译的中间表示方法及装置
CN115004150A (zh) 用于预测和调度软件流水化循环中的复制指令的方法和装置
JP2015038646A (ja) 情報処理装置及び情報処理方法
US20240104016A1 (en) Intermediate Representation Method and Apparatus for Compiling Computation Graphs
Patwardhan et al. Polyhedral Model Guided Automatic GPU Cache Exploitation Framework
WO2024065868A1 (zh) 一种用于图计算并行执行的中间表示方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960473

Country of ref document: EP

Kind code of ref document: A1