WO2023123266A1 - Subgraph compilation and execution method and related device - Google Patents

Subgraph compilation and execution method and related device

Info

Publication number
WO2023123266A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
subgraph
reconstruction
target
Prior art date
Application number
PCT/CN2021/143304
Other languages
English (en)
French (fr)
Inventor
焦建兵
许世峰
范礼
林嘉树
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/143304 priority Critical patent/WO2023123266A1/zh
Priority to CN202180064167.2A priority patent/CN116710891A/zh
Publication of WO2023123266A1 publication Critical patent/WO2023123266A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a subgraph compiling and execution method and related equipment.
  • a neural network model must be compiled by software before it can be executed on hardware, so that functions such as training and inference can be realized with the model.
  • as shown in Figure 1, taking a target detection network model as an example, after the steps of model parsing, model compilation, model optimization, and model execution, the model can accurately recognize an input cat image.
  • XLA (Accelerated Linear Algebra).
  • TensorRT: an inference engine for Nvidia GPUs, which as a whole is divided into model parsing (parser) and engine optimization.
  • Embodiments of the present application provide a method for compiling and executing a subgraph and related equipment, which can reduce scheduling time.
  • the embodiment of the present application provides a method for compiling a subgraph, including: obtaining a first subgraph, where the first subgraph is any one of multiple subgraphs obtained by segmenting a computation graph, and the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes; reconstructing the first subgraph to obtain a first reconstructed subgraph, where the first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node; and compiling a first target reconstruction node to obtain compiled data of the first target reconstruction node, where the first target reconstruction node is any first reconstruction node in the at least one first reconstruction node, and the first target reconstruction node includes M first nodes of the plurality of first nodes and the directed edges between the M first nodes, where M is a positive integer.
  • the compiled data of the first target reconstruction node includes compiled data of the M first nodes.
  • the directed edges in the computation graph include data edges and control edges, and the directed edges described in this application likewise include data edges and control edges; for example, the directed edges between the plurality of first nodes include the data edges and control edges between the plurality of first nodes.
  • the computation graph is divided into multiple subgraphs; any one of the multiple subgraphs is then reconstructed to obtain a reconstructed subgraph, and the reconstruction nodes in the reconstructed subgraph serve as the scheduling unit, so that any reconstruction node in the reconstructed subgraph can be compiled.
  • the reconstructed subgraph obtained in this application includes at least one reconstruction node and directed edges between the at least one reconstruction node, so the reconstructed subgraph is itself essentially a subgraph; and since any reconstruction node in the reconstructed subgraph includes one or more nodes of the original subgraph together with the directed edges between those nodes, a reconstruction node is essentially a subgraph of smaller size.
  • using the reconstruction node as the scheduling unit is essentially a graph scheduling mechanism: scheduling one reconstruction node for compilation at a time is equivalent to scheduling one or more nodes of the subgraph at once. Compared with using the individual nodes of the subgraph (such as computing nodes) as the scheduling unit and scheduling one node at a time, using reconstruction nodes as the scheduling unit reduces the scheduling time when compiling the subgraph. Further, the compiled data obtained by compiling any reconstruction node is used to execute the one or more nodes it contains.
  • since a reconstruction node includes multiple nodes, scheduling one reconstruction node is equivalent to scheduling multiple nodes at a time; compared with using the nodes in a subgraph (such as computing nodes) as the scheduling unit, this application can significantly reduce scheduling time.
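  • The graph scheduling mechanism above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the node names and grouping are hypothetical. Each scheduling unit costs one dispatch, so grouping nodes into reconstruction nodes shrinks the dispatch count:

```python
def schedule(units):
    """Return the number of dispatch operations: one per scheduling unit."""
    return len(units)

# A subgraph of six compute nodes (hypothetical names).
nodes = ["conv1", "relu1", "conv2", "relu2", "add", "softmax"]

# Node-level scheduling: one dispatch per node.
node_dispatches = schedule(nodes)

# Reconstruct the subgraph into two reconstruction nodes, each grouping
# three of the original nodes plus the directed edges between them.
recon_nodes = [["conv1", "relu1", "conv2"], ["relu2", "add", "softmax"]]
recon_dispatches = schedule(recon_nodes)

print(node_dispatches, recon_dispatches)  # 6 dispatches vs. 2 dispatches
```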
  • compiling the first target reconstruction node to obtain the compiled data of the first target reconstruction node includes: dividing each of the M first nodes into at least one first child node to obtain a first thread-level subgraph, where the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, and N is a positive integer greater than or equal to M; and compiling the first thread-level subgraph to obtain the compiled data of the first target reconstruction node.
  • the compiled data of the first target reconstruction node includes compiled data of the N first child nodes.
  • the child nodes described in this application are obtained by dividing nodes into threads: a node is divided into at least one child node, each child node is one thread, and the child nodes obtained from the same node can be executed in parallel.
  • the operator operation represented by each of the M first nodes is divided into at least one thread, each thread being represented by a child node; dividing the M first nodes into threads yields the N first child nodes. Executing the N first child nodes is equivalent to executing the M first nodes, and the child nodes obtained by dividing the same first node can be executed concurrently, which reduces the execution time of each of the M first nodes; this reduces the total execution time of the M first nodes and improves performance.
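  • The thread division described above can be sketched as follows. This is a hypothetical example (the operator, chunk sizes, and thread count are illustrative): one node's operator workload is split into child-node chunks that run concurrently because no edges exist between child nodes of the same node:

```python
from concurrent.futures import ThreadPoolExecutor

def split_node(data, num_children):
    """Divide one node's operator workload into child-node chunks (threads)."""
    k = len(data) // num_children
    return [data[i * k:(i + 1) * k] for i in range(num_children)]

def child_kernel(chunk):
    # Each child node executes the same operator on its slice of the data.
    return [x * 2 for x in chunk]

data = list(range(8))
chunks = split_node(data, num_children=4)   # one node -> 4 child nodes

# Child nodes obtained from the same node have no directed edges between
# them, so they may execute concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(child_kernel, chunks))

result = [x for part in parts for x in part]
print(result)  # same output as executing the undivided node
```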
  • the compiled data of the first target reconstruction node includes compiled data of a first object, where the first object is one of the M first nodes or one of the N first child nodes; the compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation (CMO) indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
  • one or more of these items (the storage locations, life cycles, and cache management operation indication information of the input and output data, the dependency relationships, and the type of computing device that executes a node or child node) can be compiled into the compiled data of that node or child node, which facilitates its execution and improves performance.
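  • A per-object compiled-data record of this kind might be organized as follows. This is a hypothetical sketch; the field names, value encodings, and device label are illustrative and not the patent's actual format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TensorInfo:
    storage_location: str          # e.g. a memory or cache address (illustrative)
    life_cycle: Tuple[int, int]    # (first-use step, last-use step)
    cmo: List[str]                 # cache management operation indications

@dataclass
class CompiledObject:
    name: str
    inputs: List[TensorInfo]
    outputs: List[TensorInfo]
    depends_on_count: int          # number of objects this one depends on
    dependents: List[str]          # objects that depend on this one
    device_type: str               # computing device that executes the object

obj = CompiledObject(
    name="matmul_0",
    inputs=[TensorInfo("mem:0x1000", (0, 2), ["load-before-exec"])],
    outputs=[TensorInfo("cache:0x80", (1, 3), ["write-back", "free-after-last-use"])],
    depends_on_count=1,
    dependents=["relu_0"],
    device_type="NPU",
)
print(obj.depends_on_count, obj.device_type)
```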
  • if the first object is one of the N first child nodes, the compiled data of the first object further includes a second object, where the second object is one of the N first child nodes and the second object and the first object are obtained by dividing the same first node. In this way, information about the other child nodes executed in parallel with a child node can be compiled into that child node's compiled data, which facilitates executing the child node and the other child nodes in parallel.
  • the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: write target input data from memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node (or not the output data of a first child node in the first thread-level subgraph); and delete the target input data from the cache after the target input data has been input to the first nodes in the first target reconstruction node (or to the first child nodes in the first thread-level subgraph) a first preset number of times.
  • that is, when a node's target input data is not produced inside the reconstruction node, the cache management operation indication information of the node's input data needs to indicate: write the target input data from memory into the cache before the node executes, so that the node can run; optionally, it can also indicate: once the target input data no longer needs to serve as input data for other nodes in the reconstruction node, delete it from the cache, so as to release cache space reasonably.
  • likewise, the cache management operation indication information of a child node's input data needs to indicate: write the target input data from memory into the cache before the child node executes, so that the child node can run; optionally, it can also indicate: once the target input data no longer needs to serve as input data for other child nodes in the thread-level subgraph, delete it from the cache, so as to release cache space reasonably.
  • the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: write the output data of the first object into memory; delete the output data of the first object from the cache after it has been written into memory; and delete the output data of the first object from the cache after it has been input to the first nodes in the first target reconstruction node (or to the first child nodes in the first thread-level subgraph) a second preset number of times.
  • that is, the cache management operation indication information of the output data of any node in the reconstruction node optionally indicates: write the node's output data into memory, so that the output data can be used for other purposes; and delete the node's output data from the cache after it has been written into memory, or after it no longer needs to serve as input data for other nodes in the reconstruction node, so as to release cache space reasonably.
  • likewise, the cache management operation indication information of the output data of any child node in the thread-level subgraph optionally indicates: write the child node's output data into memory, so that the output data can be used for other purposes; and delete the child node's output data from the cache after it has been written into memory, or after it no longer needs to serve as input data for other child nodes in the thread-level subgraph, so as to release cache space reasonably.
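  • Replaying such compiled cache-management indications might look like the following sketch. The operation names ("load", "exec", "free") and the plan are hypothetical; the point is that loads happen before a node executes and frees happen after the last consumer, keeping the cache footprint small:

```python
def apply_cmo(ops):
    """Replay compiled cache-management indications: 'load' writes data from
    memory into the cache before a node runs, 'exec' runs a node whose
    inputs must already be cached, 'free' deletes data after its last use."""
    cache, footprint = set(), []
    for op, data in ops:
        if op == "load":
            cache.add(data)
        elif op == "exec":
            missing = [d for d in data if d not in cache]
            assert not missing, f"inputs not cached: {missing}"
        elif op == "free":
            cache.discard(data)
        footprint.append(len(cache))   # cache occupancy after each step
    return footprint

# "x" is an external input: loaded before its consumer runs, freed after
# its last consumer has run.
plan = [("load", "x"), ("exec", ["x"]), ("free", "x")]
print(apply_cmo(plan))  # cache occupancy returns to 0 after the free
```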
  • the dependency relationship of the first object includes: the number of nodes that the first object depends on and the nodes that depend on the first object; or the number of child nodes that the first object depends on and the child nodes that depend on the first object.
  • the dependency relationship of any node in the reconstruction node includes the number of nodes that it depends on; during execution, the node can be executed once the number of nodes it depends on drops to 0. The dependency relationship of a node also includes the other nodes that depend on it; in this way, the execution order among the nodes in the reconstruction node can be controlled.
  • likewise, the dependency relationship of any child node in the thread-level subgraph includes the number of child nodes that the child node depends on; the child node can be executed once that count drops to 0, and its dependency relationship also includes the other child nodes that depend on it, so the execution order among the child nodes in the thread-level subgraph can be controlled.
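  • The count-to-zero scheme above is a dependency-counting traversal; a minimal sketch (node names hypothetical) is:

```python
from collections import deque

def execute(nodes, edges):
    """Run nodes in dependency order: a node becomes ready when the count
    of nodes it depends on drops to zero."""
    depends_on = {n: 0 for n in nodes}    # number of unfinished dependencies
    dependents = {n: [] for n in nodes}   # nodes that depend on n
    for src, dst in edges:
        depends_on[dst] += 1
        dependents[src].append(dst)

    ready = deque(n for n in nodes if depends_on[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)               # "execute" the node
        for d in dependents[n]:
            depends_on[d] -= 1        # one dependency of d has finished
            if depends_on[d] == 0:
                ready.append(d)       # d can now be executed
    return order

# Diamond: a -> b, a -> c, b -> d, c -> d.
print(execute(["a", "b", "c", "d"],
              [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```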
  • the dependency relationship of the first object includes a first dependency relationship of the first object, used to indicate a directed edge between the first object and a third object; if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
  • in this way, the directed edges between the nodes in the reconstruction node can be expressed at compile time as node dependencies, for example as the first dependency relationship of a node, thereby expressing the directed edges between the nodes in the reconstruction node.
  • similarly, the directed edges between child nodes in the thread-level subgraph can be expressed at compile time as child-node dependencies, for example as the first dependency relationship of a child node; here, the directed edges between child nodes in the thread-level subgraph mainly refer to the directed edges between child nodes obtained by dividing different nodes.
  • if the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, and the fourth object is executed by a second computing device, then the dependency relationship of the first object also includes a second dependency relationship of the first object, used to represent the directed edge between the first object and a first communication operator node.
  • the output data of the first object is transmitted from the first computing device to the second computing device by the first communication operator node. If the first object is one of the M first nodes, the fourth object is a first target node in a second target reconstruction node, where the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the first target node is a first node in the plurality of first nodes other than the M first nodes; or the second target reconstruction node is any second reconstruction node in a second reconstructed subgraph, where the second reconstructed subgraph is obtained by performing the reconstruction on a second subgraph, the second subgraph is a subgraph in the multiple subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph. If the first object is one of the N first child nodes, the fourth object is a second child node in a second thread-level subgraph.
  • the present application uses communication operators, such as collective communication operators, to implement data communication among multiple computing devices (such as multiple chips or multiple processors) and multiple servers.
  • the present application may transmit data from the first computing device to the second computing device through a first collective communication operator, that is, the first communication operator node is a node representing the first collective communication operator; since a collective communication operator can be represented as a collective communication operator node, the first communication operator node is also the first collective communication operator node. Moreover, like the nodes in the computation graph, subgraph, or reconstruction node, a collective communication operator node can be divided into threads.
  • the method of obtaining the second reconstructed subgraph in this application is the same as the method of obtaining the first reconstructed subgraph.
  • multiple reconstruction nodes can be executed in a distributed manner. Taking two reconstruction nodes and two computing devices as an example, one reconstruction node is executed by one computing device and the other reconstruction node by the other computing device; the two reconstruction nodes may be in the same reconstructed subgraph or in different reconstructed subgraphs. If the output data of a node in one reconstruction node is the input data of a node in the other reconstruction node, that output data must be transmitted from one computing device to the other.
  • this application can use a collective communication operator node to transmit the output data of a node in one reconstruction node from one computing device to the other; because the collective communication operator node carries that output data to the other computing device as the input data of a node in the other reconstruction node, there is a directed edge between the node in the first reconstruction node and the collective communication operator node, and a directed edge between the collective communication operator node and the node in the other reconstruction node. These directed edges can then be expressed as dependency relationships of the respective nodes, for example as the second dependency relationship of a node in one of the reconstruction nodes.
  • similarly, multiple thread-level subgraphs can be executed in a distributed manner. Taking two thread-level subgraphs and two computing devices as an example, one thread-level subgraph is executed by one computing device and the other by the other computing device. The two thread-level subgraphs may be obtained by thread division of different reconstruction nodes in the same reconstructed subgraph, or of reconstruction nodes in different reconstructed subgraphs. If the output data of a child node in one thread-level subgraph is the input data of a child node in the other thread-level subgraph, that output data must be transmitted from one computing device to the other; this application can use a collective communication operator node to do so. Because the collective communication operator node carries the output data of a child node in one thread-level subgraph to the other computing device as the input data of a child node in the other thread-level subgraph, there is a directed edge between the child node in one thread-level subgraph and the collective communication operator node, and a directed edge between the collective communication operator node and the child node in the other thread-level subgraph.
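  • Rewriting a cross-device edge to pass through a communication operator node might look like the following sketch. This is illustrative only (the device map, node names, and naming scheme for the inserted node are hypothetical): each edge whose endpoints sit on different devices becomes two edges through an inserted communication node:

```python
def insert_comm_nodes(edges, device_of):
    """Rewrite each cross-device edge (u, v) into u -> comm -> v, mirroring
    how a communication operator node carries data between devices."""
    new_edges, comms = [], []
    for u, v in edges:
        if device_of[u] != device_of[v]:
            comm = f"comm_{u}_to_{v}"          # hypothetical naming
            comms.append(comm)
            new_edges += [(u, comm), (comm, v)]
        else:
            new_edges.append((u, v))
    return new_edges, comms

edges = [("a", "b"), ("b", "c")]
device_of = {"a": 0, "b": 0, "c": 1}           # b -> c crosses devices
new_edges, comms = insert_comm_nodes(edges, device_of)
print(new_edges)
print(comms)
```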
  • if the input data of the first object includes the output data of a fifth object, the first object is executed by the first computing device, and the fifth object is executed by a third computing device, then the dependency relationship of the first object also includes a third dependency relationship of the first object, used to represent the directed edge between the first object and a second communication operator node.
  • the output data of the fifth object is transmitted from the third computing device to the first computing device by the second communication operator node. If the first object is one of the M first nodes, the fifth object is a second target node in a third target reconstruction node, where the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node in the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in a third reconstructed subgraph, where the third reconstructed subgraph is obtained by performing the reconstruction on a third subgraph, the third subgraph is a subgraph in the multiple subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph. If the first object is one of the N first child nodes, the fifth object is a third child node in a third thread-level subgraph.
  • this application transmits data from the third computing device to the first computing device through a second collective communication operator, so the second communication operator node is a node representing the second collective communication operator, that is, the second communication operator node is the second collective communication operator node. It should be further noted that the method for obtaining the third reconstructed subgraph in this application is the same as the method for obtaining the first reconstructed subgraph.
  • as before, multiple reconstruction nodes or thread-level subgraphs can be executed in a distributed manner: a collective communication operator node transmits the output data of a node (or child node) in one reconstruction node (or thread-level subgraph) from one computing device to another as the input data of a node (or child node) in the other, and the resulting directed edges to and from the collective communication operator node are expressed as dependency relationships of the respective nodes.
  • the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the third target node is a first node in the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in a fourth reconstructed subgraph, where the fourth reconstructed subgraph is obtained by performing the reconstruction on a fourth subgraph, the fourth subgraph is a subgraph in the multiple subgraphs other than the first subgraph, and the third target node is a fourth node in the fourth subgraph.
  • the method for obtaining the fourth reconstructed subgraph in this application is the same as the method for obtaining the first reconstructed subgraph.
  • the two reconstruction nodes may be in the same reconstructed subgraph or in different reconstructed subgraphs; if there is a directed edge between a node in one reconstruction node and a node in the other reconstruction node, that directed edge can be converted into a directed edge between the two reconstruction nodes, and the directed edge between the two reconstruction nodes can be expressed through the dependency relationship of one of them, so that the order between two reconstruction nodes in the same stream can be controlled.
  • specifically, the first target reconstruction node and the fourth target reconstruction node are in the same stream, and there is a directed edge between a first node in the first target reconstruction node and the third target node in the fourth target reconstruction node; this directed edge can be converted into a directed edge between the first target reconstruction node and the fourth target reconstruction node, which is then expressed as the dependency relationship of the first target reconstruction node and compiled into its compiled data, thereby controlling the order between the first target reconstruction node and the fourth target reconstruction node.
  • the first target reconstruction node and the fourth target reconstruction node may be reconstruction nodes in the same reconstructed subgraph, or reconstruction nodes in different reconstructed subgraphs.
  • the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fourth target node is a first node in the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in a fifth reconstructed subgraph, where the fifth reconstructed subgraph is obtained by performing the reconstruction on a fifth subgraph, the fifth subgraph is a subgraph in the multiple subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
  • the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes the first dependency relationship and the second dependency relationship of a first sending operator node: the first dependency relationship of the first sending operator node represents the directed edge between the first sending operator node and the first target reconstruction node, and the second dependency relationship of the first sending operator node represents the directed edge between the first sending operator node and a first receiving operator node, where there is a directed edge between the first receiving operator node and the fifth target reconstruction node. It should be noted that the method for obtaining the fifth reconstructed subgraph in this application is the same as the method for obtaining the first reconstructed subgraph.
  • the two reconstruction nodes may be in the same reconstructed subgraph or in different reconstructed subgraphs; if there is a directed edge between a node in one reconstruction node and a node in the other reconstruction node, that directed edge can be converted into a directed edge between the one reconstruction node and a sending operator node, a directed edge between the sending operator node and a receiving operator node, and a directed edge between the receiving operator node and the other reconstruction node.
  • the directed edge between the one reconstruction node and the sending operator node, and the directed edge between the sending operator node and the receiving operator node, can be expressed through the dependency relationships of the sending operator node; the directed edge between the sending operator node and the receiving operator node, and the directed edge between the receiving operator node and the other reconstruction node, can be expressed through the dependency relationships of the receiving operator node, so as to control the order between reconstruction nodes in different streams.
  • specifically, the first target reconstruction node and the fifth target reconstruction node are in different streams, and there is a directed edge between a first node in the first target reconstruction node and the fourth target node in the fifth target reconstruction node. This directed edge can be converted as follows: the directed edge between the first sending operator node and the first target reconstruction node is expressed as the first dependency relationship of the first sending operator node, the directed edge between the first sending operator node and the first receiving operator node is expressed as the second dependency relationship of the first sending operator node, and both are compiled into the compiled data of the first target reconstruction node; the directed edge between the first receiving operator node and the fifth target reconstruction node is expressed as the first dependency relationship of the first receiving operator node, the directed edge between the first sending operator node and the first receiving operator node is expressed as the second dependency relationship of the first receiving operator node, and both are compiled into the compiled data of the fifth target reconstruction node; thereby the order between the first target reconstruction node and the fifth target reconstruction node is controlled.
• the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fifth target node is a first node in the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph, the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node further includes the first dependency relationship and the second dependency relationship of the second receiving operator node; the first dependency relationship of the second receiving operator node represents the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node represents the directed edge between the second receiving operator node and the second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node. It should be noted that the method for obtaining the sixth reconstructed subgraph in this application is the same as the method for obtaining the first reconstructed subgraph.
• the two reconstruction nodes may be reconstruction nodes in the same reconstructed subgraph or in different reconstructed subgraphs; if there is a directed edge between a node in one of the reconstruction nodes and a node in the other reconstruction node, that directed edge can be converted into a directed edge between the one reconstruction node and a sending operator node, a directed edge between the sending operator node and a receiving operator node, and a directed edge between the receiving operator node and the other reconstruction node.
• the directed edge between the one reconstruction node and the sending operator node, and the directed edge between the sending operator node and the receiving operator node, can be expressed through the dependency relationships of the sending operator node; likewise, the directed edge between the sending operator node and the receiving operator node, and the directed edge between the receiving operator node and the other reconstruction node, can be expressed through the dependency relationships of the receiving operator node, so as to control the execution order between reconstruction nodes in different streams.
• since the first target reconstruction node and the sixth target reconstruction node are in different streams, and there is a directed edge between the first node in the first target reconstruction node and the fifth target node in the sixth target reconstruction node, that directed edge can be converted into a directed edge between the sixth target reconstruction node and the second sending operator node, a directed edge between the second sending operator node and the second receiving operator node, and a directed edge between the second receiving operator node and the first target reconstruction node; the directed edge between the sixth target reconstruction node and the second sending operator node is expressed as the first dependency relationship of the second sending operator node, the directed edge between the second sending operator node and the second receiving operator node is expressed as the second dependency relationship of the second sending operator node, and both are compiled into the compiled data of the sixth target reconstruction node; similarly, the directed edge between the second receiving operator node and the first target reconstruction node is expressed as the first dependency relationship of the second receiving operator node, the directed edge between the second receiving operator node and the second sending operator node is expressed as the second dependency relationship of the second receiving operator node, and both are compiled into the compiled data of the first target reconstruction node, thereby controlling the execution order between the first target reconstruction node and the sixth target reconstruction node.
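• The edge conversion described above can be sketched as follows. This is an illustrative model only: the dictionary layout, the names `send`/`recv`, and the stream labels are assumptions for the sketch, not the patent's concrete data format.

```python
# A directed edge between reconstruction nodes in different streams is
# replaced by a send/receive operator pair; the pair's dependency
# relationships encode the original edge, so one stream's execution can
# gate the other's.
def convert_cross_stream_edge(src_recon, dst_recon):
    """Replace src -> dst with src -> send -> recv -> dst."""
    send = {"op": "send",
            "dep1": (src_recon, "send"),   # edge: source reconstruction node <-> send
            "dep2": ("send", "recv")}      # edge: send <-> recv
    recv = {"op": "recv",
            "dep1": ("recv", dst_recon),   # edge: recv <-> destination reconstruction node
            "dep2": ("send", "recv")}      # edge: send <-> recv
    # the sending operator's dependencies are compiled into the source
    # stream's data, the receiving operator's into the destination stream's
    return {"stream_of_src": send, "stream_of_dst": recv}

edges = convert_cross_stream_edge("recon1", "recon5")
print(edges["stream_of_src"]["dep1"])  # ('recon1', 'send')
print(edges["stream_of_dst"]["dep1"])  # ('recv', 'recon5')
```

At execution time, the destination stream cannot pass the receive operator until the source stream has executed the matching send operator, which reproduces the ordering the original directed edge imposed.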
• the embodiment of the present application provides a method for executing a subgraph, including: obtaining the compiled data of the first target reconstruction node in the first reconstructed subgraph, where the first reconstructed subgraph is obtained by reconstructing the first subgraph, the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes, and the first subgraph is any one of a plurality of subgraphs; the first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node, the first target reconstruction node is any one of the at least one first reconstruction node, and the first target reconstruction node includes M first nodes among the plurality of first nodes and directed edges between the M first nodes, where M is a positive integer; and executing the compiled data of the first target reconstruction node.
• the calculation graph is divided into multiple subgraphs, any subgraph among the multiple subgraphs is reconstructed to obtain a reconstructed subgraph, and the reconstruction node in the reconstructed subgraph is then used as the scheduling unit, so that any reconstruction node in the reconstructed subgraph can be compiled.
• the reconstructed subgraph obtained in this application includes at least one reconstruction node and directed edges between the at least one reconstruction node, so the reconstructed subgraph obtained in this application is essentially a subgraph; moreover, since any reconstruction node in the reconstructed subgraph includes one or more nodes of the subgraph and the directed edges between those nodes, the reconstruction node is essentially a subgraph of smaller size.
  • the reconstruction node as the scheduling unit is essentially a graph scheduling mechanism, and scheduling one reconstruction node for compilation at a time is equivalent to scheduling one or more nodes in any subgraph for compilation;
  • the scheduling time can be reduced by using the reconstructed node as the scheduling unit when compiling the subgraph.
• the compiled data of any reconstruction node, obtained by compiling that reconstruction node, is used to execute one or more nodes in the subgraph; when the subgraph is executed, the compiled data of one reconstruction node is scheduled at a time, which executes the compiled data of one or more nodes in the subgraph at one time. Scheduling one reconstruction node is therefore equivalent to scheduling one or more nodes in the subgraph for execution, rather than scheduling a single node of the subgraph at a time, so the scheduling time can be reduced by taking the reconstruction node as the scheduling unit when the subgraph is executed.
• the first subgraph includes multiple first nodes and directed edges between the multiple first nodes, and the first reconstructed subgraph obtained by reconstructing the first subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node, where the first target reconstruction node includes M first nodes in the first subgraph and the directed edges between the M first nodes, and M is a positive integer; compiling the first target reconstruction node is equivalent to compiling the M first nodes, and the compiled data of the first target reconstruction node is used to execute the M first nodes, that is, executing the first target reconstruction node is equivalent to executing the M first nodes.
  • the embodiment of the present application provides a graph scheduling mechanism, which takes the reconstructed node of the graph structure as the scheduling unit, and can reduce the scheduling time.
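• The scheduling benefit described above can be illustrated with a minimal sketch. The class and function names here are invented for illustration; the point is only that dispatching one reconstruction node per scheduling event covers M original nodes, so fewer events are needed than with node-by-node scheduling.

```python
from dataclasses import dataclass, field

@dataclass
class ReconstructionNode:
    """A fused group of original nodes plus the directed edges among them."""
    nodes: list                                 # the M first nodes this node covers
    edges: list = field(default_factory=list)   # directed edges inside the group

def schedule(recon_subgraph):
    """Dispatch one reconstruction node per scheduling event."""
    events = 0
    executed = []
    for recon_node in recon_subgraph:      # one event per reconstruction node,
        events += 1                        # not one per original node
        executed.extend(recon_node.nodes)  # its compiled data runs all M nodes
    return events, executed

subgraph = [ReconstructionNode(nodes=["conv", "bias", "relu"]),
            ReconstructionNode(nodes=["pool"])]
events, executed = schedule(subgraph)
print(events)    # 2 scheduling events instead of 4
print(executed)  # all 4 original nodes are still executed
```

With node-level scheduling the same four operators would cost four scheduling events; the graph scheduling mechanism reduces this to two while executing the same work.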
• the compiled data of the first target reconstruction node is obtained by compiling the first thread-level subgraph, and the first thread-level subgraph is obtained by dividing each of the M first nodes into at least one first child node; the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, where N is a positive integer greater than or equal to M.
• any reconstruction node includes M first nodes and directed edges between the M first nodes; when compiling any reconstruction node, each of the M first nodes is divided into at least one first child node, that is, the operator operation represented by each of the M first nodes is divided into at least one thread, and each of those threads is represented by one first child node; dividing the M first nodes in this way yields N first child nodes, where N is a positive integer greater than or equal to M, and the N first child nodes together with the directed edges between them constitute the first thread-level subgraph; compiling the first thread-level subgraph yields the compiled data of the first target reconstruction node, which includes the compiled data of the N first child nodes.
• since executing the N first child nodes is equivalent to executing the M first nodes, and the first child nodes divided from the same first node can be executed concurrently, when the compiled data of the N first child nodes is scheduled, the first child nodes divided from the same first node can be executed concurrently, thereby reducing the execution time of each of the M first nodes, that is, reducing the total execution time of the M first nodes and improving performance.
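• The thread-level splitting described above can be sketched as follows. The operator names, the split-into-row-parts scheme, and the use of a thread pool are illustrative assumptions; the sketch only shows that child nodes divided from the same node, having no edges between them, may run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def split_node(node, parts):
    """Divide one first node's operator into `parts` first child nodes."""
    return [(node, p) for p in range(parts)]

def run_child(child):
    node, part = child
    return f"{node}[part{part}]"

# M = 2 first nodes, each split into 2 child nodes -> N = 4 child nodes
children = split_node("matmul", 2) + split_node("add", 2)
with ThreadPoolExecutor() as pool:
    # child nodes of the same first node have no mutual directed edges, so
    # they may execute concurrently; running all N children is equivalent
    # to running the M original nodes
    results = list(pool.map(run_child, children))
print(sorted(results))
```

Because the two parts of `matmul` (and of `add`) run in parallel, each original node finishes in roughly the time of one part, which is the execution-time reduction the passage describes.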
  • the compiled data of the first target reconstruction node includes compiled data of the M first nodes or compiled data of the N first child nodes;
• executing the compiled data of the first target reconstruction node includes: writing the compiled data of R sixth objects into the cache; and, for each sixth object among the R sixth objects, performing the following operation: when the number of eighth objects that the seventh object depends on is 0, reading the compiled data of the seventh object from the cache and executing it, where the seventh object is any one of the R sixth objects; if the R sixth objects are the M first nodes, the eighth objects are nodes; if the R sixth objects are the N first child nodes, the eighth objects are child nodes.
• that is, for any node in the reconstruction node, the node is executed when the number of nodes it depends on is 0; for any child node in the thread-level subgraph, the child node is executed when the number of child nodes it depends on is 0; setting the execution timing of any node or child node in this way is beneficial to improving execution efficiency.
• the compiled data of the seventh object includes the initial number of eighth objects that the seventh object depends on; if that initial number is not 0, then after each ninth object is executed, the number of eighth objects that the seventh object depends on is decremented by 1, where a ninth object is an eighth object that the seventh object depends on.
• for any node in the reconstruction node, the time at which it is pushed to the computing resource for execution is after the number of nodes it depends on reaches 0; this application compiles the initial number of nodes that any node depends on into the compiled data of that node, so that during execution the timing of pushing the node to the computing resource can be determined according to the number of nodes it depends on. If the initial number of nodes that any node depends on is 0, then once the compiled data of that node is written into the cache, it can be pushed to the computing resource for execution; if the initial number is not 0, then after each node it depends on is executed, the number of nodes it depends on is decremented by 1, and once that number reaches 0, the node can be pushed to the computing resource for execution.
• similarly, for any child node in the thread-level subgraph, the time at which it is pushed to the computing resource for execution is after the number of child nodes it depends on reaches 0; this application compiles the initial number of child nodes that any child node depends on into the compiled data of that child node, so that during execution the timing of pushing the child node to the computing resource can be determined according to the number of child nodes it depends on. If the initial number of child nodes that any child node depends on is 0, then once the compiled data of that child node is written into the cache, it can be pushed to the computing resource for execution. If the initial number of child nodes that any child node depends on is not 0, the compiled data of that child node can be pushed to the computing resource only after the child nodes it depends on have been executed; in this case, after each child node it depends on finishes, the number of child nodes it depends on is decremented by 1, and once that number reaches 0, the child node is pushed to the computing resource for execution.
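• The dependency-count rule described above can be sketched as a small dataflow scheduler. The function and variable names are illustrative; the sketch shows only the mechanism: each object carries the initial number of objects it depends on, it is pushed for execution when that count reaches 0, and finishing an object decrements the count of every object that depends on it.

```python
from collections import deque

def execute(dep_count, successors):
    """dep_count: object -> initial number of objects it depends on.
    successors: object -> list of objects that depend on it."""
    ready = deque(o for o, c in dep_count.items() if c == 0)
    order = []
    while ready:
        obj = ready.popleft()        # count is 0: push to the computing resource
        order.append(obj)
        for nxt in successors.get(obj, []):
            dep_count[nxt] -= 1      # one dependency has finished executing
            if dep_count[nxt] == 0:  # now eligible for execution
                ready.append(nxt)
    return order

dep_count = {"a": 0, "b": 1, "c": 2}
successors = {"a": ["b", "c"], "b": ["c"]}
print(execute(dep_count, successors))  # ['a', 'b', 'c']
```

Here `a` starts with no dependencies and runs first; finishing it makes `b` ready, and finishing `b` brings `c`'s count to 0, so the directed edges of the graph are respected without any central ordering step.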
• when the initial number of eighth objects that the seventh object depends on is 0, the compiled data of the seventh object is written into the cache before the seventh object is executed; when the initial number of eighth objects that the seventh object depends on is not 0, the compiled data of the seventh object is written into the cache during the execution of the ninth objects.
• that is, if the number of nodes that any node depends on is 0, that node does not need to wait for other nodes to finish executing before it can execute; in this case, the compiled data of that node can be written into the cache before the node is executed, so as to ensure the smooth execution of the reconstruction node.
• if the number of nodes that any node depends on is not 0, that node must wait for the nodes it depends on to finish executing before it can be executed; in this case, the compiled data of that node can be written into the cache while the nodes it depends on are executing, so that once those nodes have all finished, the node can be executed immediately, ensuring the smooth execution of the reconstruction node; moreover, because the compiled data is not written into the cache too early, it does not occupy the cache for a long time, so the cache is utilized reasonably.
• likewise, if the number of child nodes that any child node in the thread-level subgraph depends on is 0, that child node does not need to wait for other child nodes to finish executing before it can execute; in this case, the compiled data of that child node can be written into the cache before the child node is executed, so as to ensure the smooth execution of the thread-level subgraph.
• if the number of child nodes that any child node depends on is not 0, that child node must wait for the child nodes it depends on to finish executing before it can be executed; in this case, the compiled data of that child node can be written into the cache while the child nodes it depends on are executing, so that once those child nodes have all finished, the child node can be executed immediately, ensuring the smooth execution of the thread-level subgraph; moreover, because the compiled data is not written into the cache too early, it does not occupy the cache for a long time, so the cache is utilized reasonably.
• the compiled data of the first target reconstruction node includes the compiled data of a first object, and the first object is one of the M first nodes or one of the N first child nodes
• the compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
• if the first object is one of the N first child nodes, the compiled data of the first object further includes a second object, where the second object is one of the N first child nodes and the second object and the first object are obtained by dividing the same first node.
• the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: writing the target input data from the memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node, or the target input data is not the output data of a first child node in the first thread-level subgraph; and deleting the target input data in the cache after the target input data has been input to first nodes in the first target reconstruction node a first preset number of times, or after the target input data has been input to first child nodes in the first thread-level subgraph the first preset number of times.
• the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: writing the output data of the first object into the memory; deleting the output data of the first object in the cache after that output data has been written into the memory; and deleting the output data of the first object in the cache after that output data has been input to first nodes in the first target reconstruction node a second preset number of times, or after that output data has been input to first child nodes in the first thread-level subgraph the second preset number of times.
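• The delete-after-a-preset-number-of-uses indication can be sketched as a use-counted cache. The `ManagedCache` class, its method names, and the byte payload are invented for illustration; the sketch only demonstrates evicting a cached output once it has been fed to consumers the preset number of times.

```python
class ManagedCache:
    def __init__(self):
        self.entries = {}   # name -> [data, remaining allowed uses]

    def put(self, name, data, preset_uses):
        self.entries[name] = [data, preset_uses]

    def consume(self, name):
        """Feed cached data to one consumer; evict after the preset count."""
        data, uses = self.entries[name]
        uses -= 1
        if uses == 0:
            del self.entries[name]          # delete the output data in the cache
        else:
            self.entries[name][1] = uses
        return data

cache = ManagedCache()
cache.put("conv_out", data=b"\x00\x01", preset_uses=2)  # preset number of times = 2
cache.consume("conv_out")           # first consumer node reads it
print("conv_out" in cache.entries)  # True: one allowed use remains
cache.consume("conv_out")           # second consumer node reads it
print("conv_out" in cache.entries)  # False: evicted after the preset uses
```

Encoding the eviction point in the compiled data this way means the cache never holds an intermediate result longer than its consumers need it, which matches the "reasonable utilization of the cache" goal stated earlier.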
• the dependency relationship of the first object includes a first dependency relationship of the first object, and the first dependency relationship of the first object is used to represent the directed edge between the first object and a third object; if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
• the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, the fourth object is executed by a second computing device, the dependency relationship of the first object further includes a second dependency relationship of the first object, the second dependency relationship of the first object is used to represent the directed edge between the first object and a first communication operator node, and the output data of the first object is transmitted from the first computing device to the second computing device by the first communication operator node; if the first object is one of the M first nodes, the fourth object is the first target node in the second target reconstruction node, where the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node and the first target node is a first node in the plurality of first nodes other than the M first nodes, or the second target reconstruction node is any second reconstruction node in the second reconstruction subgraph, where the second reconstruction subgraph is obtained by performing the reconstruction on the second subgraph, the second subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph; if the first object is one of the N first child nodes, the fourth object is a second child node in the second thread-level subgraph, and the second thread-level subgraph is obtained by dividing each node in the second target reconstruction node into at least one child node.
• the input data of the first object includes the output data of a fifth object, the first object is executed by the first computing device, the fifth object is executed by a third computing device, the dependency relationship of the first object further includes a third dependency relationship of the first object, the third dependency relationship of the first object is used to represent the directed edge between the first object and a second communication operator node, and the output data of the fifth object is transmitted from the third computing device to the first computing device by the second communication operator node; if the first object is one of the M first nodes, the fifth object is the second target node in the third target reconstruction node, where the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node and the second target node is a first node in the plurality of first nodes other than the M first nodes, or the third target reconstruction node is any third reconstruction node in the third reconstruction subgraph, where the third reconstruction subgraph is obtained by performing the reconstruction on the third subgraph, the third subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph; if the first object is one of the N first child nodes, the fifth object is a third child node in the third thread-level subgraph, and the third thread-level subgraph is obtained by dividing each node in the third target reconstruction node into at least one child node.
• the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the third target node is a first node in the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in the fourth reconstruction subgraph, the fourth reconstruction subgraph is obtained by performing the reconstruction on the fourth subgraph, the fourth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the third target node is a fourth node in the fourth subgraph.
• the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node further includes the dependency relationship of the first target reconstruction node, which is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
• the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fourth target node is a first node in the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstruction subgraph, the fifth reconstruction subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
• the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node further includes the first dependency relationship and the second dependency relationship of the first sending operator node; the first dependency relationship of the first sending operator node represents the directed edge between the first sending operator node and the first target reconstruction node, the second dependency relationship of the first sending operator node represents the directed edge between the first sending operator node and the first receiving operator node, and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
• the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fifth target node is a first node in the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph, the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node further includes the first dependency relationship and the second dependency relationship of the second receiving operator node; the first dependency relationship of the second receiving operator node represents the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node represents the directed edge between the second receiving operator node and the second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
• the embodiment of the present application provides a subgraph compiling device, including: an acquisition unit, configured to acquire a first subgraph, where the first subgraph is any one of multiple subgraphs obtained by segmenting a calculation graph and includes a plurality of first nodes and directed edges between the plurality of first nodes; a reconstruction unit, configured to reconstruct the first subgraph to obtain a first reconstructed subgraph, where the first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node; and a compiling unit, configured to compile the first target reconstruction node to obtain compiled data of the first target reconstruction node, where the first target reconstruction node is any first reconstruction node in the at least one first reconstruction node and includes M first nodes among the plurality of first nodes and directed edges between the M first nodes, M being a positive integer.
• the compiling unit is specifically configured to: divide each of the M first nodes into at least one first child node to obtain a first thread-level subgraph, where the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, and N is a positive integer greater than or equal to M; and compile the first thread-level subgraph to obtain the compiled data of the first target reconstruction node.
• the compiled data of the first target reconstruction node includes the compiled data of a first object, and the first object is one of the M first nodes or one of the N first child nodes
• the compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
• if the first object is one of the N first child nodes, the compiled data of the first object further includes a second object, where the second object is one of the N first child nodes and the second object and the first object are obtained by dividing the same first node.
• the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: writing the target input data from the memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node, or the target input data is not the output data of a first child node in the first thread-level subgraph; and deleting the target input data in the cache after the target input data has been input to first nodes in the first target reconstruction node a first preset number of times, or after the target input data has been input to first child nodes in the first thread-level subgraph the first preset number of times.
• the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: writing the output data of the first object into the memory; deleting the output data of the first object in the cache after that output data has been written into the memory; and deleting the output data of the first object in the cache after that output data has been input to first nodes in the first target reconstruction node a second preset number of times, or after that output data has been input to first child nodes in the first thread-level subgraph the second preset number of times.
• the dependency relationship of the first object includes a first dependency relationship of the first object, and the first dependency relationship of the first object is used to represent the directed edge between the first object and a third object; if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
• the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, the fourth object is executed by a second computing device, the dependency relationship of the first object further includes a second dependency relationship of the first object, the second dependency relationship of the first object is used to represent the directed edge between the first object and a first communication operator node, and the output data of the first object is transmitted from the first computing device to the second computing device by the first communication operator node; if the first object is one of the M first nodes, the fourth object is the first target node in the second target reconstruction node, where the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node and the first target node is a first node in the plurality of first nodes other than the M first nodes, or the second target reconstruction node is any second reconstruction node in the second reconstruction subgraph, where the second reconstruction subgraph is obtained by performing the reconstruction on the second subgraph, the second subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph; if the first object is one of the N first child nodes, the fourth object is a second child node in the second thread-level subgraph, and the second thread-level subgraph is obtained by dividing each node in the second target reconstruction node into at least one child node.
  • the input data of the first object includes output data of a fifth object
  • the first object is executed by the first computing device
  • the fifth object is executed by the third computing device
• the dependency relationship of the first object also includes a third dependency relationship of the first object
• the third dependency relationship of the first object is used to represent a directed edge between the first object and the second communication operator node.
• the output data of the fifth object is transmitted from the third computing device to the first computing device by the second communication operator node; wherein, if the first object is one of the M first nodes, the fifth object is the second target node in the third target reconstruction node, the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node in the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in the third reconstruction subgraph, the third reconstruction subgraph is obtained by performing the reconstruction on the third subgraph, the third subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph; if the first object is one of the N first child nodes, the fifth object is the third child node in the third thread-level subgraph, and the third thread-level subgraph is obtained by performing the thread division on the third target reconstruction node.
• the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the third target node is a first node in the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in the fourth reconstruction subgraph, the fourth reconstruction subgraph is obtained by performing the reconstruction on the fourth subgraph, and the fourth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph
• the third target node is a fourth node in the fourth subgraph.
• the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node also includes the dependency relationship of the first target reconstruction node, where the dependency relationship of the first target reconstruction node is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
• the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the fourth target node is a first node in the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstruction subgraph, the fifth reconstruction subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
• the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of the first sending operator node; the first dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first target reconstruction node, the second dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first receiving operator node, and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
• the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the fifth target node is a first node in the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph
• the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of the second receiving operator node; the first dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
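The cross-stream clauses above can be illustrated with a small sketch. This is not the patent's implementation; the function and node names are hypothetical, and tuples stand in for operator nodes. It shows how a send/receive operator pair carries a directed edge between two reconstruction nodes placed on different streams:

```python
# Illustrative sketch (names invented for this example): when a producer and
# a consumer reconstruction node are on different streams, a send operator is
# attached to the producer and a receive operator to the consumer, with
# directed edges producer -> send -> recv -> consumer.

def link_cross_stream(producer, consumer, edges):
    """Insert a send/recv operator pair between two reconstruction nodes
    on different streams; `edges` is a list of (src, dst) directed edges."""
    send_op = ("send", producer)        # send operator on the producer's stream
    recv_op = ("recv", consumer)        # receive operator on the consumer's stream
    edges.append((producer, send_op))   # directed edge: producer -> send
    edges.append((send_op, recv_op))    # directed edge: send -> recv (cross-stream hop)
    edges.append((recv_op, consumer))   # directed edge: recv -> consumer
    return send_op, recv_op

edges = []
send_op, recv_op = link_cross_stream("first_target_node", "fifth_target_node", edges)
assert edges[0] == ("first_target_node", send_op)
assert edges[1] == (send_op, recv_op)
assert edges[2] == (recv_op, "fifth_target_node")
```

The design point mirrored here is that the dependency itself is recorded on the send/receive operator nodes, so each stream only ever waits on an operator in its own schedule.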
• the embodiment of the present application provides a subgraph execution device, comprising: an acquisition unit configured to acquire compiled data of a first target reconstruction node in a first reconstruction subgraph, where the first reconstruction subgraph is obtained by reconstructing a first subgraph, the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes, and the first subgraph is any one of multiple subgraphs obtained by computation graph segmentation; the first reconstruction subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node;
• the first target reconstruction node is any first reconstruction node in the at least one first reconstruction node, and the first target reconstruction node includes M first nodes in the plurality of first nodes and directed edges between the M first nodes, where M is a positive integer; and an execution unit configured to execute the compiled data of the first target reconstruction node.
• the compiled data of the first target reconstruction node is obtained by compiling the first thread-level subgraph, and the first thread-level subgraph is obtained by dividing each first node among the M first nodes into at least one first child node; the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, where N is a positive integer greater than or equal to M.
• the compiled data of the first target reconstruction node includes compiled data of the M first nodes or compiled data of the N first child nodes; the execution unit is specifically configured to: write the compiled data of R sixth objects into the cache; and for each sixth object of the R sixth objects, perform the following operations to execute the compiled data of the first target reconstruction node:
• the seventh object is any one of the R sixth objects; wherein, if the R sixth objects are the M first nodes, the eighth object is a node; if the R sixth objects are the N first child nodes, the eighth object is a child node.
• the compiled data of the seventh object includes an initial number of eighth objects that the seventh object depends on; if the initial number of eighth objects that the seventh object depends on is not 0, after each ninth object is executed, the number of eighth objects that the seventh object depends on is decremented by 1, where the ninth object is an eighth object that the seventh object depends on.
• when the initial number of eighth objects that the seventh object depends on is 0, the compiled data of the seventh object is written into the cache; when the initial number of eighth objects that the seventh object depends on is not 0, the compiled data of the seventh object is written into the cache during the execution of the ninth object.
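The dependency-count scheme above is essentially in-degree bookkeeping. The sketch below is an illustrative model, not the patent's implementation (the `execute` function and sample object names are invented): each object records how many objects it depends on, the count is decremented as each dependency finishes, and the object becomes executable when the count reaches 0:

```python
# Dependency-count execution sketch: objects with count 0 run immediately;
# every completed object decrements the count of its dependents.
from collections import deque

def execute(objects, deps):
    """`deps` maps an object to the set of objects it depends on."""
    count = {o: len(deps.get(o, ())) for o in objects}  # initial dependency numbers
    successors = {o: [] for o in objects}
    for o, ds in deps.items():
        for d in ds:
            successors[d].append(o)
    ready = deque(o for o in objects if count[o] == 0)  # no pre-dependency
    order = []
    while ready:
        o = ready.popleft()
        order.append(o)                  # "execute" the object
        for succ in successors[o]:       # each finished dependency decrements by 1
            count[succ] -= 1
            if count[succ] == 0:
                ready.append(succ)       # now executable
    return order

order = execute(["n1", "n2", "n3"], {"n2": {"n1"}, "n3": {"n1", "n2"}})
assert order == ["n1", "n2", "n3"]
```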
• the compiled data of the first target reconstruction node includes compiled data of a first object, and the first object is one of the M first nodes or one of the N first child nodes.
• the compiled data of the first object includes at least one of the following: the storage location, life cycle and cache management operation indication information of the input data of the first object; the storage location, life cycle and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
  • the first object is one of the N first child nodes
  • the compiled data of the first object further includes: a second object, the second object is one of the N first child nodes, and the second object and the first object are obtained by dividing the same first node.
• the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: write the target input data from the memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node, or the target input data is not the output data of a first child node in the first thread-level subgraph; and after the target input data has been input into first nodes in the first target reconstruction node a first preset number of times, or after the target input data has been input into first child nodes in the first thread-level subgraph a first preset number of times, delete the target input data in the cache.
• the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: write the output data of the first object into memory; after the output data of the first object is written into memory, delete the output data of the first object in the cache; and after the output data of the first object has been input into first nodes in the first target reconstruction node a second preset number of times, or after the output data of the first object has been input into first child nodes in the first thread-level subgraph a second preset number of times, delete the output data of the first object in the cache.
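The cache management indications above amount to use-count-driven eviction with optional write-back. A minimal sketch, assuming the use count is fixed at compile time (the `Cache` class and its methods are illustrative and not the patent's interfaces):

```python
# Use-count cache sketch: an entry is deleted from the cache after it has
# been consumed a preset number of times, optionally persisting to memory
# first ("write-back").  All names here are invented for illustration.

class Cache:
    def __init__(self):
        self.entries = {}   # name -> [data, remaining_uses]
        self.memory = {}    # stands in for main memory

    def put(self, name, data, uses, write_back=False):
        self.entries[name] = [data, uses]
        if write_back:                    # "write the output data into memory"
            self.memory[name] = data

    def consume(self, name):
        data = self.entries[name][0]
        self.entries[name][1] -= 1
        if self.entries[name][1] == 0:    # preset number of uses reached
            del self.entries[name]        # delete the data in the cache
        return data

cache = Cache()
cache.put("out0", 42, uses=2, write_back=True)
assert cache.consume("out0") == 42
assert "out0" in cache.entries           # one use left, still cached
cache.consume("out0")
assert "out0" not in cache.entries       # evicted after the second use
assert cache.memory["out0"] == 42        # but still available in memory
```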
• the dependency relationship of the first object includes a first dependency relationship of the first object, and the first dependency relationship of the first object is used to indicate a directed edge between the first object and a third object; wherein, if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
  • the output data of the first object is the input data of the fourth object
  • the first object is executed by the first computing device
  • the fourth object is executed by the second computing device
• the dependency relationship of the first object also includes a second dependency relationship of the first object, and the second dependency relationship of the first object is used to represent a directed edge between the first object and the first communication operator node.
• the output data of the first object is transmitted from the first computing device to the second computing device by the first communication operator node; wherein, if the first object is one of the M first nodes, the fourth object is the first target node in the second target reconstruction node, the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the first target node is a first node in the plurality of first nodes other than the M first nodes; or the second target reconstruction node is any second reconstruction node in the second reconstruction subgraph, the second reconstruction subgraph is obtained by performing the reconstruction on the second subgraph, the second subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph; if the first object is one of the N first child nodes, the fourth object is the second child node in the second thread-level subgraph, and the second thread-level subgraph is obtained by performing the thread division on the second target reconstruction node.
  • the input data of the first object includes output data of a fifth object
  • the first object is executed by the first computing device
  • the fifth object is executed by the third computing device
• the dependency relationship of the first object also includes a third dependency relationship of the first object
• the third dependency relationship of the first object is used to represent a directed edge between the first object and the second communication operator node.
• the output data of the fifth object is transmitted from the third computing device to the first computing device by the second communication operator node; wherein, if the first object is one of the M first nodes, the fifth object is the second target node in the third target reconstruction node, the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node in the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in the third reconstruction subgraph, the third reconstruction subgraph is obtained by performing the reconstruction on the third subgraph, the third subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph; if the first object is one of the N first child nodes, the fifth object is the third child node in the third thread-level subgraph, and the third thread-level subgraph is obtained by performing the thread division on the third target reconstruction node.
• the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the third target node is a first node in the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in the fourth reconstruction subgraph, the fourth reconstruction subgraph is obtained by performing the reconstruction on the fourth subgraph, and the fourth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph
• the third target node is a fourth node in the fourth subgraph.
• the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node also includes the dependency relationship of the first target reconstruction node, where the dependency relationship of the first target reconstruction node is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
• the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the fourth target node is a first node in the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstruction subgraph, the fifth reconstruction subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
• the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of the first sending operator node; the first dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first target reconstruction node, the second dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first receiving operator node, and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
• the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node
• the fifth target node is a first node in the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph
• the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of the second receiving operator node; the first dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
• the embodiment of the present application provides a graph compiling device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps in the method described in the above first aspect and any possible implementation manner thereof.
• the embodiment of the present application provides a graph execution device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps in the method described in the above second aspect and any possible implementation manner thereof.
• the embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium includes a computer program, and when the computer program is run on a computer or a processor, the computer or the processor performs the method described in the first aspect or the second aspect and any possible implementation thereof.
• an embodiment of the present application provides a computer program product, where the computer program product includes a computer program, and when the computer program is run on a computer or a processor, the computer or the processor performs the method described in the first aspect or the second aspect and any possible implementation thereof.
• the embodiment of the present application provides a chip, including: a processor, configured to call and run a computer program from the memory, so that the device installed with the chip executes the method described in the above first aspect or the second aspect and any possible implementation manner thereof.
• FIG. 1 is a schematic diagram of the application of a neural network model provided by an embodiment of the present application.
• FIG. 2 is a schematic diagram of a system architecture for compiling and executing a subgraph provided by an embodiment of the present application.
• FIG. 3 is a schematic diagram of the principle of subgraph reconstruction provided by an embodiment of the present application.
• FIG. 4 is a schematic diagram of thread division provided by an embodiment of the present application.
• FIG. 5 is a schematic diagram of another thread division provided by an embodiment of the present application.
• FIG. 6 is a schematic diagram of dividing nodes into reconstruction nodes according to an embodiment of the present application.
• FIG. 7 is a schematic diagram of generating a subgraph segmentation strategy provided by an embodiment of the present application.
• FIG. 8 is a schematic diagram of generating another subgraph segmentation strategy provided by an embodiment of the present application.
• FIG. 9 is a schematic diagram of resource allocation provided by an embodiment of the present application.
• FIG. 10 is a schematic diagram of a data prefetching operation provided by an embodiment of the present application.
• FIG. 11 is a schematic diagram of a data invalidation operation provided by an embodiment of the present application.
• FIG. 12 is a schematic diagram of a data write-back operation provided by an embodiment of the present application.
• FIG. 13 is a schematic diagram of compiling the reconstruction node shown in FIG. 5.
• FIG. 14 is a schematic diagram of compiling the thread-level subgraph shown in FIG. 5.
• FIG. 15 is a schematic diagram expressing the dependency relationship of some sub-nodes in the thread-level subgraph shown in FIG. 5.
• FIG. 16 is a schematic diagram of a directed edge provided by an embodiment of the present application.
• FIG. 17 is a schematic representation of a control edge provided by an embodiment of the present application.
• FIG. 18 is a schematic representation of another control edge provided by an embodiment of the present application.
• FIG. 19 is a schematic diagram of an expression of distributed execution provided by an embodiment of the present application.
• FIG. 20 is a schematic diagram of another expression of distributed execution provided by an embodiment of the present application.
• FIG. 21 is a schematic structural diagram of a graph execution device provided by an embodiment of the present application.
• FIG. 22 is a schematic diagram of an asynchronous pipeline on the host side and the device side provided by an embodiment of the present application.
• FIG. 23 is a schematic diagram of another asynchronous pipeline on the host side and the device side provided by an embodiment of the present application.
• FIG. 24 is a schematic diagram of a dependency relationship between nodes in a reconstruction node or a dependency relationship between sub-nodes in a thread-level subgraph provided by an embodiment of the present application.
• FIG. 25 is a schematic diagram of a buffer management operation of input and output data of a node or a sub-node provided by an embodiment of the present application.
• FIG. 26 is a schematic flowchart of a method for compiling a subgraph provided by an embodiment of the present application.
• FIG. 27 is a schematic flowchart of a method for executing a subgraph provided by an embodiment of the present application.
• FIG. 28 is a schematic structural diagram of an apparatus for compiling a subgraph provided by an embodiment of the present application.
• FIG. 29 is a schematic structural diagram of an execution device for a subgraph provided by an embodiment of the present application.
• FIG. 30 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
• "At least one (item)" means one or more, and "multiple" means two or more.
• "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural.
• The character "/" generally indicates that the associated objects are in an "or" relationship.
• "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
• "At least one of a, b or c" may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or multiple.
• this application provides a graph scheduling mechanism that takes the graph as the scheduling unit, including graph compilation and graph execution with the graph as the scheduling unit, so as to reduce scheduling time and improve execution concurrency.
  • FIG. 2 is a schematic diagram of a subgraph compiling and execution system architecture provided by the embodiment of the present application.
• the system architecture includes a graph compiling device 100 and a graph execution device 200.
• the graph compiling device 100 may be deployed on the host side or may be the host itself, and the graph execution device 200 may be deployed on the device side or may be the device itself.
• the input of the graph compiling device 100 is a neural network model generated by a front-end script such as tensorflow/pytorch/onnx; the graph compiling device 100 is responsible for compiling the neural network model to generate a data structure recognized by the system as the input of the graph execution device 200; the graph execution device 200 executes reconstruction nodes or thread-level subgraphs based on the compiled data output by the graph compiling device 100, taking the reconstruction node or thread-level subgraph as the scheduling unit, so as to obtain the inference result of the neural network model or the trained model parameters.
  • the graph compilation device 100 and the graph execution device 200 will be described in detail below.
• 1. The graph compiling device 100
  • the graph compilation device 100 includes a parsing and conversion (parser) unit 101 , a graph optimization unit 102 , a subgraph segmentation unit 103 , a resource allocation unit 104 and a compilation unit 105 , which will be described in detail below.
• The parsing and conversion unit 101 is used to convert a computation graph in a front-end intermediate representation (IR) such as tensorflow/pytorch/onnx into a computation graph in the intermediate representation of this system.
• the calculation graph is composed of nodes and edges.
• the nodes in the calculation graph represent operators (also called operations).
• the lines in the calculation graph are also called edges, and they represent the dependencies between computations.
• the edges in the calculation graph are directed edges, which define the relationship between operations and are divided into two categories: one is used to transmit data and is called a data edge, represented by a solid line; the other is used to define a dependency relationship (that is, the order of execution) and is called a control edge, represented by a dotted line. All nodes in the calculation graph are connected by data edges or control edges.
• Nodes with an in-degree (that is, the number of nodes they depend on) of 0 have no pre-dependency and can be executed immediately; nodes with an in-degree greater than 0 can only be executed after all the nodes they depend on have finished executing.
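The in-degree rule can be shown with a toy graph. In this sketch (node names and data structures are invented for illustration), both data edges and control edges count toward a node's in-degree, and only in-degree-0 nodes are immediately executable:

```python
# Toy computation graph: data edges carry tensors (solid lines), control
# edges only constrain execution order (dotted lines); both contribute to
# a node's in-degree.

data_edges = [("a", "b"), ("b", "c")]     # data transfer
control_edges = [("a", "c")]              # execution-order constraint only

nodes = {"a", "b", "c"}
in_degree = {n: 0 for n in nodes}
for src, dst in data_edges + control_edges:
    in_degree[dst] += 1                   # count every incoming directed edge

ready = sorted(n for n in nodes if in_degree[n] == 0)
assert ready == ["a"]                     # only "a" has no pre-dependency
assert in_degree["c"] == 2                # one data edge plus one control edge
```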
• The graph optimization unit 102 optimizes the computation graph expressed in the system's intermediate representation, for example the computation graph output by the parsing and conversion unit 101, specifically including operator fusion, constant folding, and precision and format optimization for the operators represented by the nodes in the computation graph, where the operators represented by nodes include but are not limited to: calculation operators, communication operators (such as collective communication operators), control operators (such as sending operators and receiving operators), heterogeneous computing operators (such as CPU operators, GPU operators, matrix operators and vector operators), and so on. It should be understood that the graph optimization unit 102 is optional, that is, the input of the subgraph segmentation unit 103 may also be an unoptimized computation graph.
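As one concrete instance of the optimizations listed, a toy constant-folding pass might look like the following. This is a hedged sketch, not the graph optimization unit's actual algorithm; the graph encoding and node names are invented for illustration:

```python
# Toy constant folding: a node is name -> ("const", value) or
# ("add", input1, input2).  When both inputs of an "add" are constants,
# the node is rewritten as a constant; repeating until a fixed point lets
# folding propagate through chains of nodes.

def fold_constants(graph):
    changed = True
    while changed:
        changed = False
        for name, node in list(graph.items()):
            if node[0] == "add":
                a, b = graph[node[1]], graph[node[2]]
                if a[0] == "const" and b[0] == "const":
                    graph[name] = ("const", a[1] + b[1])  # replace op by constant
                    changed = True
    return graph

g = {"x": ("const", 2), "y": ("const", 3),
     "z": ("add", "x", "y"), "w": ("add", "z", "x")}
fold_constants(g)
assert g["z"] == ("const", 5)
assert g["w"] == ("const", 7)   # folding propagated through z
```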
• The subgraph segmentation unit 103 is used to divide the computation graph into multiple subgraphs and then reconstruct each subgraph to obtain multiple reconstructed subgraphs; optionally, each reconstruction node in each reconstructed subgraph can be further divided into threads to obtain the thread-level subgraph corresponding to each reconstruction node, that is, the internal nodes of each reconstruction node in the reconstructed subgraph are divided into at least one thread-level child node to obtain the thread-level subgraph corresponding to each reconstruction node.
• the input of the subgraph segmentation unit 103 may be the output of the graph optimization unit 102 or the output of the parsing and conversion unit 101; the output of the subgraph segmentation unit 103 may be a reconstructed subgraph or a reconstruction node, or may be a thread-level subgraph.
• the principle of subgraph reconstruction is to take at least one node in the subgraph and the directed edges between the at least one node as a reconstruction node, which is equivalent to further dividing the subgraph into multiple smaller-scale subgraphs; these smaller-scale subgraphs are the reconstruction nodes, and the directed edges between these smaller-scale subgraphs are the directed edges between reconstruction nodes.
• the reconstructed subgraph in this application includes at least one reconstruction node and the directed edges between the at least one reconstruction node, so the reconstructed subgraph in this application is still a subgraph in essence; and, since any reconstruction node in the reconstructed subgraph includes one or more nodes of the subgraph before reconstruction and the directed edges between the one or more nodes, the reconstruction node in this application is essentially a subgraph with a smaller graph size, that is, the reconstruction node is itself a graph structure.
  • FIG. 3 is a schematic diagram of a principle of subgraph reconstruction provided by an embodiment of the present application.
  • As shown in FIG. 3, the subgraph includes node a, node b, node c, node d, node e, node f, and the directed edges between node a and node b, between node b and node c, between node c and node d, between node d and node e, and between node e and node f. Node a, node b, node c and the directed edges among them are used as reconstruction node A; node d, node e and the directed edge between them are used as reconstruction node B; node f is used as reconstruction node C. The directed edge between node c and node d becomes the directed edge between reconstruction node A and reconstruction node B, and the directed edge between node e and node f becomes the directed edge between reconstruction node B and reconstruction node C.
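The grouping in FIG. 3 can be sketched in a few lines of Python. This is a hedged illustration, not part of the application: the function name `reconstruct` and the dictionary representation are ours. An edge between two reconstruction nodes exists exactly when an original directed edge crosses a group boundary.

```python
# Sketch of the FIG. 3 reconstruction: nodes a..f joined by a chain of
# directed edges are grouped into reconstruction nodes A, B and C; a
# directed edge between two reconstruction nodes exists iff some original
# edge crosses the group boundary.
def reconstruct(edges, grouping):
    node_to_group = {n: g for g, nodes in grouping.items() for n in nodes}
    group_edges = set()
    for src, dst in edges:
        gs, gd = node_to_group[src], node_to_group[dst]
        if gs != gd:  # edge crosses a boundary, keep it at group level
            group_edges.add((gs, gd))
    return sorted(group_edges)

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
grouping = {"A": ["a", "b", "c"], "B": ["d", "e"], "C": ["f"]}
print(reconstruct(edges, grouping))  # [('A', 'B'), ('B', 'C')]
```

The edge c to d becomes the edge A to B, and e to f becomes B to C, matching the figure.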
  • Thread division of a reconstruction node means dividing each node included in the reconstruction node into at least one sub-node, so as to obtain a thread-level subgraph composed of multiple sub-nodes and the directed edges between them; the directed edge between any two sub-nodes in the thread-level subgraph is determined according to the directed edge between the nodes from which those two sub-nodes were obtained, and the sub-nodes obtained by dividing the same node can be executed in parallel.
  • When a node is divided into at least one sub-node, the input and output data of the node are divided accordingly. For example, the input data of the node is divided into at least one piece of sub-input data, and the pieces of sub-input data serve as the input data of the sub-nodes in one-to-one correspondence; the output data of the sub-nodes together constitute the output data of the node.
  • whether a node can be divided into multiple sub-nodes is determined by whether the input data of the node can be divided.
  • For example, node 1 represents an addition, and the input data of node 1 are data A and data B. Data A can be divided into sub-data A1 and sub-data A2, and data B can be divided into sub-data B1 and sub-data B2, so node 1 can be divided into sub-node 10 and sub-node 11, both of which represent an addition; the input of sub-node 10 is sub-data A1 and sub-data B1, and the input of sub-node 11 is sub-data A2 and sub-data B2.
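The addition example can be checked with a minimal sketch, under our own illustrative names: `split` stands in for the data division and `add` for the operator. Running the two sub-node additions on the halved inputs reproduces the undivided node's output.

```python
# Sketch of dividing the addition node into two thread-level sub-nodes:
# each sub-node receives one half of each input tensor, and the
# concatenated sub-node outputs equal the undivided node's output.
def split(data, parts):
    k = len(data) // parts
    return [data[i * k:(i + 1) * k] for i in range(parts)]

def add(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

data_a = [1, 2, 3, 4]
data_b = [10, 20, 30, 40]

# sub-node 10 consumes (A1, B1); sub-node 11 consumes (A2, B2)
(a1, a2), (b1, b2) = split(data_a, 2), split(data_b, 2)
out = add(a1, b1) + add(a2, b2)
assert out == add(data_a, data_b)  # division preserves the result
print(out)  # [11, 22, 33, 44]
```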
  • directed edges in a computation graph, subgraph, reconstruction subgraph, and thread-level subgraph may include data edges and control edges.
  • In addition, the present application can also analyze the cache occupancy of the computation graph, add dependencies between threads in the thread-level subgraph, and control the concurrency of threads, so as to improve the utilization rate of computing resources and the storage efficiency.
  • For example, the reconstruction node includes a convolution operator_0 node, a vector operator_0 node, a convolution operator_1 node, a vector operator_1 node, and the directed edges between the convolution operator_0 node and the vector operator_0 node, between the vector operator_0 node and the convolution operator_1 node, and between the convolution operator_1 node and the vector operator_1 node. Each of these four nodes is divided into 8 threads, that is, each node is divided into 8 sub-nodes, so that the sub-nodes of the 8 threads and the directed edges between them constitute the thread-level subgraph. Besides the directed edges between sub-nodes within each thread, directed edges may also exist between sub-nodes of different threads among the 8 threads, and synchronization points (not shown in FIG. 4) may also be inserted between the threads. In addition, the concurrency degree of the 8 threads in the thread-level subgraph shown in FIG. 4 is 4, that is, 4 threads are executed concurrently first, and then the other 4 threads are executed concurrently.
  • It should be noted that the numbers of threads into which the nodes in a reconstruction node are divided may be the same or different; when the numbers differ, converging sub-nodes and splitting sub-nodes appear in the resulting thread-level subgraph.
  • For example, the reconstruction node includes an artificial intelligence cube (AIC) operator node, an artificial intelligence CPU (AICPU) operator node, an artificial intelligence vector (AIV) operator node, a communication operator node and a control operator node, together with the directed edges between the AIC operator node and the AICPU operator node, between the AICPU operator node and the AIV operator node, between the AIV operator node and the communication operator node, and between the communication operator node and the control operator node. When dividing threads, the AIC operator node, the AIV operator node and the control operator node are each divided into three threads, while the AICPU operator node and the communication operator node are divided into two threads; among the resulting sub-nodes, the AIV operator 1 sub-node is a splitting sub-node, and the communication operator 1 sub-node and the AICPU operator 1 sub-node are converging sub-nodes.
  • It should be noted that which reconstruction node in the reconstruction subgraph any given node of the subgraph is divided into by the subgraph segmentation unit 103 is determined according to the subgraph segmentation strategy; and, in the process of thread division, the number of threads (threading num) into which any node is divided, that is, the number of sub-nodes obtained from that node, and the concurrency degree (window size) of the sub-nodes are also determined according to the subgraph segmentation strategy.
  • Optionally, the subgraph segmentation strategy may be generated by any of the following three methods.
  • Subgraph segmentation strategy generation method 1
  • First, intervals are divided according to the difference information of the tensor shapes in the subgraph, and then an a-priori thread number and concurrency degree are calculated for each interval. The difference information quantifies the change in tensor shape, and optional calculation methods include the absolute value of the difference in the number of bytes occupied by the tensors, etc. Specifically, the tensor occupancy of the input and output data of each node in the subgraph is counted, and the nodes are clustered according to the magnitude of the difference information of the tensor shapes, so that regions where the tensor shape changes little are grouped into one reconstruction node.
  • FIG. 6 is a schematic diagram of dividing nodes into reconstruction nodes provided by an embodiment of the present application. As shown in FIG. 6, the subgraph includes node a, node b, node c, node d, node e, node f, and the directed edges between node a and node b, between node b and node c, between node c and node d, between node d and node e, and between node e and node f. The numbers of bytes occupied by the input and output tensors of node a are t0 and t1; of node b, t1 and t2; of node c, t2 and t3; of node d, t3 and t4; of node e, t4 and t5; and of node f, t5 and t6. Therefore, the difference information of the input and output tensors of node a is |t1 - t0|, that of node b is |t2 - t1|, and so on for the remaining nodes.
  • the principle of selecting the number of threads and the degree of concurrency is: the sum S of all tensor data volumes in concurrent threads is smaller than the capacity (size) of the L2 cache.
  • According to this principle, the minimum number of thread splits n can be calculated such that S / n is smaller than C_cache, where C_cache represents the capacity of the second-level (L2) cache; if S is already smaller than C_cache, all data can be placed in the second-level cache without splitting threads.
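This rule can be sketched as follows, assuming S and C_cache are given in bytes; the function name `min_thread_splits` is illustrative.

```python
import math

# Minimal sketch of the L2-cache rule: choose the smallest thread count n
# such that the per-thread tensor volume S / n fits in the cache; if S is
# already below the cache capacity, no split is needed (n = 1).
def min_thread_splits(total_bytes, cache_bytes):
    if total_bytes < cache_bytes:
        return 1
    # smallest integer n with total_bytes / n strictly below cache_bytes
    return math.floor(total_bytes / cache_bytes) + 1

print(min_thread_splits(100, 256))   # 1  (fits without splitting)
print(min_thread_splits(1024, 256))  # 5  (1024 / 5 = 204.8 < 256)
```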
  • Subgraph segmentation strategy generation method 2
  • FIG. 7 is a schematic diagram of generating a subgraph segmentation strategy provided by an embodiment of the present application. As shown in FIG. 7, the subgraph segmentation strategy is generated for the computation graph of the neural network model through a strategy search algorithm.
  • Specifically, the neural network model is compiled under a candidate subgraph segmentation strategy, Autotune tuning is performed on the shapes of the newly generated operators, and on-board verification is carried out to obtain the execution performance of the computation graph of the neural network model, for example the on-board running time (on-board cycles), which is fed back to the strategy search algorithm. The optimal subgraph segmentation strategy found during tuning is stored in the strategy knowledge base for practical application. In order to shorten the time spent on model compilation, operator tuning and on-board verification, a performance predictor can also be used directly to evaluate the execution performance of the computation graph under a given subgraph segmentation strategy, and the evaluation result is fed back to the strategy search algorithm; the strategy search algorithm adjusts the subgraph segmentation strategy according to the feedback of the performance predictor, and the optimal subgraph segmentation strategy is stored in the strategy knowledge base for practical application.
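The feedback loop of FIG. 7 can be caricatured as follows. Everything here is an illustrative stand-in, not the application's implementation: the candidate list, the toy `predict_cycles` function standing in for the performance predictor, and an exhaustive search standing in for the strategy search algorithm.

```python
# Toy stand-in for the performance predictor: it scores a candidate thread
# count, and the "search" keeps the candidate with the fewest predicted
# on-board cycles, which would then be stored in the strategy knowledge base.
def predict_cycles(num_threads):
    # illustrative cost model: too few threads underuse the hardware,
    # too many add scheduling overhead
    return abs(num_threads - 6) * 100 + 1000

def search_strategy(candidates):
    # stand-in for the strategy search algorithm: pick the candidate
    # with the lowest predicted cost
    return min(candidates, key=predict_cycles)

best = search_strategy([1, 2, 4, 6, 8, 16])
print(best, predict_cycles(best))  # 6 1000
```

A real search would propose structured strategies (node-to-reconstruction-node assignments, thread counts, concurrency degrees) rather than a single integer, and would mix predictor feedback with occasional on-board measurements.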
  • Subgraph segmentation strategy generation method 3
  • A policy generation neural network model is trained offline and integrated into the graph compiling device 100, and the policy generation neural network model is directly invoked during compilation to infer and generate a subgraph segmentation strategy.
  • FIG. 8 is a schematic diagram of generating another subgraph segmentation strategy provided by an embodiment of the present application.
  • In the offline stage, a data set is prepared; the data set includes a large number of computation graphs for which subgraph segmentation strategies are to be output. The policy generation neural network model generates subgraph segmentation strategies for the computation graphs in the data set, and is updated and improved through algorithms such as reinforcement learning (RL) and supervised training. Finally, the trained policy generation neural network model is used as the policy generator in the online inference stage, and is directly called at compile time to generate subgraph segmentation strategies.
  • Resource allocation unit 104: uses the reconstruction node or thread-level subgraph as the scheduling unit to perform resource allocation (such as memory allocation), data life cycle identification, and the setting of cache management operations; optionally, the resource allocation unit 104 can also optimize the reconstruction node or thread-level subgraph, as shown in FIG. 9. Performing these operations with the reconstruction node or thread-level subgraph as the scheduling unit includes: performing resource allocation, identifying the life cycle of input and output data, and setting cache management operations for each node in the reconstruction node, or doing the same for each sub-node in the thread-level subgraph.
  • the subgraph splitting unit 103 obtains the reconstructed node or thread level subgraph according to the foregoing subgraph splitting strategy.
  • This application uses the reconstruction node or thread-level subgraph as the scheduling unit for scheduling; the range of a reconstruction node or thread-level subgraph can be expressed through a shared scope attribute value, or expressed through a function operator (functionOp).
  • Optimizing the reconstruction node or thread-level subgraph includes: taking the reconstruction node or thread-level subgraph as the scheduling unit and optimizing the operators represented by the nodes in the reconstruction node or by the sub-nodes in the thread-level subgraph, including operator fusion optimization, single-operator optimization, constant folding optimization, data type (dtype) optimization, format optimization, etc. It should be understood that the optimization of operators in the reconstruction node or thread-level subgraph includes but is not limited to the above optimizations.
  • Resource allocation for the reconstruction node or thread-level subgraph includes memory allocation for the inputs, outputs, temporary memory areas (workspace), etc. of the nodes in the reconstruction node or of the sub-nodes in the thread-level subgraph. It should be understood that when memory is allocated for reconstruction nodes or thread-level subgraphs, the range of memory reuse can be between reconstruction nodes or thread-level subgraphs, or within a single reconstruction node or thread-level subgraph.
  • memory allocation includes allocating memory for storage in the graph compiling apparatus 100 and allocating memory for storing in the graph execution apparatus 200 .
  • Data life cycle identification for the reconstruction node or thread-level subgraph includes life cycle identification of the input and output data of each node in the reconstruction node or of each sub-node in the thread-level subgraph, for example identifying data life cycle events such as data generation (produce), data consumption (consume), first read of data, last read of data, first write of data, and last write of data. It should be noted that data generation means that the input data of a node is generated by its preceding nodes (or the input data of a sub-node is generated by its preceding sub-nodes), and data consumption means that the output data of a node is used as the input data of following nodes (or the output data of a sub-node is used as the input data of following sub-nodes).
  • Setting cache management operations likewise means setting cache management operations for the input and output data of each node in the reconstruction node or of each sub-node in the thread-level subgraph; the cache management operations that can be set include the prefetch operation, the invalidation (invalid) operation, the write-back (writeback) operation, the flush operation, etc., so as to maximize cache performance during execution.
  • The prefetch operation refers to prefetching data that comes into the reconstruction node or thread-level subgraph from outside, that is, the output data of a node outside the reconstruction node or of a sub-node outside the thread-level subgraph, and writing it into the cache in advance, so as to reduce the performance cost of reading the data from memory during computation.
  • FIG. 10 is a schematic diagram of a data prefetching operation provided by an embodiment of the present application.
  • As shown in FIG. 10, the output data of node 1 outside the reconstruction node is data 1, and data 1 is the input data of node 2 and node 3 in the reconstruction node, so a prefetch operation is performed on data 1; similarly, if the output data of sub-node 1 outside the thread-level subgraph is data 1, and data 1 is the input data of sub-node 2 and sub-node 3 in the thread-level subgraph, a prefetch operation is performed on data 1. Before node 2 and node 3 (or sub-node 2 and sub-node 3) are executed, data 1 resides in memory, and prefetching data 1 means writing data 1 from memory into the cache.
  • The invalidation operation refers to invalidating, in the cache, data consumed within the reconstruction node or thread-level subgraph once its last use is completed, so as to reduce the performance cost of flushing it to memory.
  • FIG. 11 is a schematic diagram of a data invalidation operation provided by an embodiment of the present application.
  • As shown in FIG. 11, the output data of node 1 in the reconstruction node is data 1, and data 1 is the input data of node 2 and node 3 in the reconstruction node; after data 1 has been used as the input of node 2 and node 3, data 1 is invalidated, that is, data 1 in the cache is deleted. Similarly, if the output data of sub-node 1 in the thread-level subgraph is data 1, and data 1 is the input data of sub-node 2 and sub-node 3 in the thread-level subgraph, data 1 is invalidated after sub-node 2 and sub-node 3 have finished using it.
  • The write-back operation refers to writing data generated in the reconstruction node or thread-level subgraph back to memory when that data is also used outside the reconstruction node or thread-level subgraph.
  • FIG. 12 is a schematic diagram of a data write-back operation provided by an embodiment of the present application.
  • As shown in FIG. 12, the output data of node 1 in the reconstruction node is data 1, and data 1 is not only the input data of node 2 in the reconstruction node but also the input data of node 3 outside the reconstruction node, so a write-back operation is performed on data 1, that is, data 1 in the cache is written into memory. Similarly, if the output data of sub-node 1 in the thread-level subgraph is data 1, and data 1 is not only the input data of sub-node 2 in the thread-level subgraph but also the input data of sub-node 3 outside the thread-level subgraph, a write-back operation is performed on data 1, that is, data 1 in the cache is written into memory.
  • The flush operation refers to appending an invalidation operation after a piece of data is written back to memory, so that the data is also deleted from the cache.
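The four rules above can be sketched as a small decision function. This is a hedged illustration under our own names: `cache_ops` and its three boolean flags (whether the data's producer and consumers lie inside the scheduling unit being compiled) are not part of the application.

```python
# Sketch of choosing cache management operations for one piece of data,
# following the four rules above: prefetch data produced outside the
# scheduling unit, invalidate data after its last internal read, write back
# (and then flush) data that is also consumed outside the unit.
def cache_ops(produced_inside, consumed_inside, consumed_outside):
    ops = []
    if not produced_inside and consumed_inside:
        ops.append("prefetch")   # bring the external input into the cache early
    if produced_inside and consumed_outside:
        ops.append("writeback")  # external readers need the data in memory
        ops.append("flush")      # write-back followed by invalidation
    if consumed_inside and not consumed_outside:
        ops.append("invalid")    # last internal read: drop it from the cache
    return ops

print(cache_ops(False, True, False))  # ['prefetch', 'invalid']
print(cache_ops(True, True, False))   # ['invalid']
print(cache_ops(True, True, True))    # ['writeback', 'flush']
```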
  • Compilation unit 105: used for compiling with reconstruction nodes or thread-level subgraphs as the scheduling units, that is, compiling each reconstruction node or thread-level subgraph as one scheduling unit.
  • The compilation unit 105 compiles each reconstruction node and expresses the computation, cache management operations and dependency relationships of each node in the reconstruction node, that is, it completes the generation of the binary execution code of each node, the representation of its dependency relationships, and the representation of its cache management operations. Likewise, the compilation unit 105 compiles each thread-level subgraph and expresses the computation, cache management operations and dependency relationships of each sub-node in the thread-level subgraph, that is, it completes the generation of the binary execution code of each sub-node, the representation of its dependency relationships, and the representation of its cache management operations.
  • the dependency relationship is used to represent a directed edge, such as a directed edge between nodes, a directed edge between a reconstructed node and a reconstructed node, or a directed edge between a child node and a child node.
  • The compilation unit 105 compiles each reconstruction node to obtain the compiled data of each reconstruction node; compiling a reconstruction node includes compiling each node in the reconstruction node to obtain the compiled data of each node, so the compiled data of a reconstruction node includes the compiled data of each node in it. The compiled data of each node includes information such as the hardware-aware task generated for the node and the descriptor of the node. The hardware-aware task includes the binary execution code of the computing operator executable by the hardware at runtime and the input parameters (args) of the binary execution code, so the task corresponding to a node includes the binary execution code of the computing operator represented by the node and its input parameters. The descriptor includes the dependency relationship and cache management operation indication information, so the descriptor of a node includes the dependency relationship of the node and the cache management operation indication information of the node, and the cache management operation indication information is used to indicate that corresponding cache management operations (such as prefetch, invalidation, write-back and flush) are performed on the input and output data of the node during execution.
  • FIG. 13 is a schematic diagram of compiling the reconstruction node shown in FIG. 5 .
  • As shown in FIG. 13, the reconstruction node includes 5 nodes, namely the AIC operator node, the AICPU operator node, the AIV operator node, the communication operator node and the control operator node; these 5 nodes are compiled separately to generate the task corresponding to each of them. In addition, the reconstruction node includes the directed edges between these 5 nodes, so the descriptor of each of the 5 nodes is also generated during compilation (not shown in FIG. 13). The descriptor of each node includes the dependency relationship of the node, so that the hardware can perceive the execution order of the five nodes during execution; and the aforementioned resource allocation unit 104 sets corresponding cache management operations for each node, so the descriptor of each node also includes cache management operation indication information for the input and output data of that node.
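The per-node compiled data described above can be sketched as a pair of small records. This is an illustrative shape of our own, not the application's actual binary layout: the class names, the byte string standing in for the binary execution code, and the ready-node computation are all assumptions.

```python
from dataclasses import dataclass, field

# Illustrative sketch of per-node compiled data: a hardware-aware "task"
# (stand-ins for the binary execution code and its args) plus a descriptor
# carrying the dependency relationship (pred_cnt, succ_list) and cache
# management operation indications.
@dataclass
class Descriptor:
    pred_cnt: int                     # number of nodes this node depends on
    succ_list: list                   # nodes that depend on this node
    cache_ops: list = field(default_factory=list)

@dataclass
class CompiledNode:
    name: str
    task: bytes                       # stand-in for the binary execution code
    args: tuple = ()
    descriptor: Descriptor = None

aic = CompiledNode("AIC", task=b"\x90", descriptor=Descriptor(0, ["AICPU"]))
aicpu = CompiledNode("AICPU", task=b"\x90",
                     descriptor=Descriptor(1, ["AIV"], ["prefetch"]))

# the hardware can derive the execution order from pred_cnt/succ_list alone:
# only nodes with no unmet predecessors are ready to run
ready = [n.name for n in (aic, aicpu) if n.descriptor.pred_cnt == 0]
print(ready)  # ['AIC']
```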
  • The compilation unit 105 compiles each thread-level subgraph to obtain the compiled data of each thread-level subgraph; the compiled data of a thread-level subgraph is also the compiled data of the reconstruction node whose thread division produced that thread-level subgraph. Compiling a thread-level subgraph includes compiling each sub-node in the thread-level subgraph to obtain the compiled data of each sub-node, so the compiled data of a thread-level subgraph includes the compiled data of each of its sub-nodes; the compiled data of each sub-node includes information such as the hardware-aware task generated for the sub-node and the descriptor of the sub-node. The hardware-aware task includes the binary execution code of the computing operator executable by the hardware at runtime and the input parameters (args) of the binary execution code, so the task corresponding to a sub-node includes the binary execution code of the computing operator represented by the sub-node and its input parameters. The descriptor includes the dependency relationship and cache management operation indication information, so the descriptor of a sub-node includes the dependency relationship of the sub-node and the cache management operation indication information of the sub-node, which is used to indicate that corresponding cache management operations (such as prefetch, invalidation, write-back and flush) are performed on the input and output data of the sub-node during execution.
  • FIG. 14 is a schematic diagram of compiling the thread-level subgraph shown in FIG. 5 .
  • As shown in FIG. 14, the thread-level subgraph includes 13 sub-nodes, and the 13 sub-nodes are compiled separately to generate the task corresponding to each of them. In addition, the thread-level subgraph also includes the directed edges between these 13 sub-nodes, so the descriptor of each of the 13 sub-nodes (not shown in FIG. 14) is also generated during compilation; the descriptor of each sub-node includes its dependency relationship, so that during execution the hardware can perceive the execution order of these 13 sub-nodes. Moreover, the aforementioned resource allocation unit 104 sets corresponding cache management operations for each sub-node, so the descriptor of each sub-node also includes cache management operation indication information for the input and output data of that sub-node.
  • the compiled data of each node or sub-node includes the type of computing device executing the node or sub-node.
  • In other words, the type of heterogeneous processor is added to the task of each node or sub-node, so that each task can be dispatched to the corresponding hardware acceleration unit during execution, ensuring concurrent execution across heterogeneous processors.
  • In this way, heterogeneous operator nodes are scheduled to heterogeneous processors and can be executed concurrently, such as the concurrent execution of CPU operator nodes, GPU operator nodes and application-specific integrated circuit (ASIC) operator nodes.
  • In addition, the threads in a thread-level subgraph can execute concurrently, and the compiled data of each sub-node in the thread-level subgraph also records the other sub-nodes that execute in parallel with that sub-node, where those other sub-nodes and the sub-node are obtained by dividing the same node; in this way, when multiple threads execute, the threads can run concurrently.
  • The dependency relationship of a node can be expressed as the number of other nodes that the node depends on together with the other nodes that depend on the node; similarly, the dependency relationship of a sub-node can be expressed as the number of other sub-nodes that the sub-node depends on together with the other sub-nodes that depend on that sub-node.
  • FIG. 15 is a schematic diagram expressing the dependency relationship of some sub-nodes in the thread-level subgraph shown in FIG. 5 .
  • The dependency can be expressed through predecessors (pred) and successors (succ): pred_cnt expresses the number of predecessor nodes that a node depends on, and succ_list expresses the successor nodes that depend on the node.
  • For example, the pred_cnt of the AIC operator 1 sub-node is 0, and the sub-node that depends on it is the AICPU operator 1 sub-node; the pred_cnt of the AIC operator 2 sub-node is 0, and the sub-node that depends on it is also the AICPU operator 1 sub-node; the pred_cnt of the AICPU operator 1 sub-node is 2, and the sub-node that depends on the AICPU operator 1 sub-node is the AIV operator 1 sub-node.
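The pred_cnt / succ_list expression of FIG. 15 can be derived mechanically from the sub-node edges. The sketch below is illustrative (the function name and the string sub-node labels are ours) and assumes the edges AIC1 to AICPU1, AIC2 to AICPU1, and AICPU1 to AIV1.

```python
from collections import defaultdict

# Sketch of deriving the pred_cnt / succ_list dependency expression from a
# list of directed edges between sub-nodes.
def express_dependencies(edges):
    pred_cnt = defaultdict(int)
    succ_list = defaultdict(list)
    for src, dst in edges:
        pred_cnt[dst] += 1          # dst gains one predecessor
        succ_list[src].append(dst)  # dst depends on src
    return dict(pred_cnt), dict(succ_list)

edges = [("AIC1", "AICPU1"), ("AIC2", "AICPU1"), ("AICPU1", "AIV1")]
pred_cnt, succ_list = express_dependencies(edges)
print(pred_cnt.get("AIC1", 0))  # 0
print(pred_cnt["AICPU1"])       # 2
print(succ_list["AICPU1"])      # ['AIV1']
```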
  • The data edges and control edges between nodes in a reconstruction node can be expressed by the number of other nodes that a node depends on and the other nodes that depend on the node; likewise, the data edges and control edges between sub-nodes in a thread-level subgraph can be expressed by the number of other sub-nodes that a sub-node depends on and the other sub-nodes that depend on that sub-node. That is, these edges can be expressed through the above pred_cnt and succ_list, and compiled into the compiled data of the node or sub-node.
  • FIG. 16 is a schematic diagram of a directed edge provided by an embodiment of the present application.
  • the solid lines between nodes represent data edges.
  • For example, the solid line between the left AIC operator node and the left AICPU operator node represents the data edge between these two nodes, where the number of other nodes that the left AIC operator node depends on (pred_cnt) is 0, and the other node that depends on the left AIC operator node (succ_list) is the left AICPU operator node; this is compiled into the compiled data of the left AIC operator node. The dotted lines between nodes indicate control edges; for example, the dotted line between the left AICPU operator node and the right AICPU operator node indicates the control edge between them, where the number of other nodes that the left AICPU operator node depends on (pred_cnt) is 1, that is, the left AICPU operator node depends on the left AIC operator node, and the other nodes that depend on the left AICPU operator node (succ_list) are the right AICPU operator node and the left AIV operator node; these are compiled into the compiled data of the left AICPU operator node.
  • The second type is control edges between nodes that belong to different reconstruction nodes in the same stream; such control edges are converted into control edges between the different reconstruction nodes. A control edge between reconstruction nodes can be expressed by the number of other reconstruction nodes that a reconstruction node depends on and the other reconstruction nodes that depend on it, that is, through the above pred_cnt and succ_list.
  • the different reconstruction nodes may be reconstruction nodes in the same reconstruction subgraph, or may be reconstruction nodes in different reconstruction subgraphs respectively.
  • FIG. 17 is a schematic representation of a control edge provided by an embodiment of the present application.
  • As shown in FIG. 17, two reconstruction nodes are in the same stream, and the control edge between the AICPU operator node in one reconstruction node and the AICPU operator node in the other reconstruction node is converted into a control edge between the two reconstruction nodes; this control edge is expressed as a dependency relationship between the two reconstruction nodes and compiled into the compiled data of the two reconstruction nodes.
  • The third type is control edges between nodes in different reconstruction nodes in different streams; sending (send) operator nodes and receiving (recv) operator nodes are added to convert such an edge into an expression of control edges between streams, that is, the edge is represented by the control edge between the sending operator node and the receiving operator node, and the control edge between the receiving operator node and the reconstruction node in the other stream. At runtime, the sending operator node sends a synchronization signal to the receiving operator node; the synchronization signal indicates that the nodes that the sending operator node depends on have been executed. After the receiving operator node receives the synchronization signal, the reconstruction node that has a control edge with the receiving operator node can be executed.
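The cross-stream synchronization can be sketched with two threads standing in for the two streams. This is a hedged analogy, not the hardware mechanism: a `threading.Event` plays the role of the synchronization signal between the send and recv operator nodes.

```python
import threading

# Sketch of the third-type control edge: two streams run as threads; the
# "send" operator in stream 1 signals an event once its reconstruction node
# has executed, and the "recv" operator in stream 2 waits on that event
# before its own reconstruction node runs, so the log order always honours
# the control edge.
sync = threading.Event()
log = []

def stream1():
    log.append("recon node 1 executed")
    sync.set()        # send operator: emit the synchronization signal

def stream2():
    sync.wait()       # recv operator: block until the signal arrives
    log.append("recon node 2 executed")

t2 = threading.Thread(target=stream2)
t2.start()
t1 = threading.Thread(target=stream1)
t1.start()
t1.join(); t2.join()
print(log)  # ['recon node 1 executed', 'recon node 2 executed']
```

Starting the waiting stream first shows that the ordering comes from the signal, not from launch order.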
  • the different reconstruction nodes may be reconstruction nodes in the same reconstruction subgraph, or may be reconstruction nodes in different reconstruction subgraphs respectively.
  • FIG. 18 is a schematic representation of another control edge provided by an embodiment of the present application.
  • As shown in FIG. 18, two reconstruction nodes are in two streams respectively, and there is a control edge between the AICPU operator node in one reconstruction node and the AICPU operator node in the other reconstruction node. A sending operator node is added to the stream that includes the first reconstruction node; the sending operator node depends on the first reconstruction node, that is, a directed edge (such as a data edge) is added between the last executed node in the first reconstruction node and the sending operator node, for example a data edge between the last executed AIV operator node and the sending operator node. A receiving operator node is added to the stream that includes the other reconstruction node; the other reconstruction node depends on the receiving operator node, that is, a directed edge (such as a data edge) is added between the receiving operator node and the first executed node in the other reconstruction node, for example a data edge between the receiving operator node and the first executed AIC operator node. The control edge between the sending operator node and the receiving operator node, and the data edge between the receiving operator node and the first executed AIC operator node in the other reconstruction node, can be expressed as the dependency relationship of the receiving operator node and compiled into the compiled data of the other reconstruction node.
• This application also supports distributed execution when the reconstruction node or thread-level subgraph is used as the scheduling unit.
• Different reconstruction nodes can be executed on different computing devices, or different thread-level subgraphs can be executed on different computing devices, thereby enabling distributed execution.
• The different reconstruction nodes can be reconstruction nodes in the same reconstruction subgraph or in different reconstruction subgraphs; the different thread-level subgraphs can be obtained by thread division from reconstruction nodes in the same reconstruction subgraph or from reconstruction nodes in different reconstruction subgraphs.
  • This application uses the reconstruction node or thread-level subgraph as the scheduling unit for distributed execution, and can also realize the data interaction between computing devices.
  • the computing device in this application may be a chip (die), a processor, and the like.
• The present application implements data communication among multiple computing devices and multiple servers through collective communication operators, for example, transmitting data from one computing device to another through a collective communication operator.
• The directed edges between a collective communication operator node and the nodes in a reconstruction node, or between a collective communication operator node and the child nodes in a thread-level subgraph, can be expressed as the dependency relationship of the collective communication operator node, of the nodes in the reconstruction node, or of the child nodes in the thread-level subgraph. The dependency relationship of a collective communication operator node can be expressed as the number of nodes or child nodes that the collective communication operator node depends on and the nodes or child nodes that depend on the collective communication operator node, that is, through pred_cnt and succ_list.
• A collective communication operator node is a node representing a collective communication operator.
• A collective communication operator node is treated like any node in the calculation graph, subgraph, or reconstruction node, and can undergo operations such as thread division and compilation; optionally, when the data to be transported by a collective communication operator node is the output data of child nodes, the collective communication operator node can be divided into threads.
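The pred_cnt / succ_list expression of dependency relationships can be sketched as below. This is a minimal illustration (the function and node names are assumed, not from this application): directed edges of a graph are folded into a dependency count per node and a successor list per node.

```python
def build_dependencies(edges, nodes):
    """Express directed edges as pred_cnt (dependency count) and succ_list (dependents)."""
    pred_cnt = {n: 0 for n in nodes}
    succ_list = {n: [] for n in nodes}
    for src, dst in edges:       # directed edge src -> dst means dst depends on src
        pred_cnt[dst] += 1
        succ_list[src].append(dst)
    return pred_cnt, succ_list

# Example graph: node D -> collective communication (DMU) node -> node Q
nodes = ["D", "DMU0", "Q"]
edges = [("D", "DMU0"), ("DMU0", "Q")]
pred_cnt, succ_list = build_dependencies(edges, nodes)
```

A node with pred_cnt of 0 (here, node D) is initially ready; the succ_list tells the scheduler whose counters to decrement once it finishes.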
  • FIG. 19 is a schematic diagram of an expression of distributed execution provided by an embodiment of the present application.
• As shown in FIG. 19, reconstruction node 0 and reconstruction node 1 are executed in a distributed manner. The output data of node D in reconstruction node 0 (output memory block D) is the input data of node Q in reconstruction node 1,
• and the input data of node H in reconstruction node 0 includes the output data of node M in reconstruction node 1 (output memory block M).
• Computing device 0 executes reconstruction node 0, so computing device 0 executes node D and node H;
• computing device 1 executes reconstruction node 1, so computing device 1 executes node M and node Q.
• Output memory block D is migrated from computing device 0 to computing device 1 by data management unit (DMU) node 0 as the input data of node Q;
• output memory block M is migrated from computing device 1 to computing device 0 by data management unit node 2 as the input data of node H.
• The directed edge between node D and data management unit node 0 can be expressed as a dependency relationship of node D and compiled into the compiled data of node D, and the compiled data of reconstruction node 0 includes the compiled data of node D. The directed edge between node D and data management unit node 0 and the directed edge between data management unit node 0 and node Q are expressed as dependency relationships of data management unit node 0 and compiled into the compiled data of data management unit node 0, and the compiled data of reconstruction node 0 includes the compiled data of data management unit node 0. The directed edge between data management unit node 0 and node Q is expressed as a dependency relationship of node Q and compiled into the compiled data of node Q, and the compiled data of reconstruction node 1 includes the compiled data of node Q.
• Similarly, the directed edge between node M and data management unit node 2 can be expressed as a dependency relationship of node M and compiled into the compiled data of node M, and the compiled data of reconstruction node 1 includes the compiled data of node M. The directed edge between node M and data management unit node 2 and the directed edge between data management unit node 2 and node H are expressed as dependency relationships of data management unit node 2 and compiled into the compiled data of data management unit node 2, and the compiled data of reconstruction node 1 includes the compiled data of data management unit node 2.
• The directed edge between data management unit node 2 and node H is expressed as a dependency relationship of node H and compiled into the compiled data of node H, and the compiled data of reconstruction node 0 includes the compiled data of node H.
• FIG. 19 also includes two other collective communication operator nodes, data management unit node 1 and data management unit node 3. The directed edge between node H and data management unit node 1, the directed edge between data management unit node 1 and node F,
• the directed edge between node Q and data management unit node 3, and the directed edge between data management unit node 3 and node E are expressed in the same way, and details are not repeated here.
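The DMU-based migration in FIG. 19 can be sketched as follows. This is a toy model only (device memories are modeled as plain dictionaries and the node "computation" is simulated; all names are illustrative): a DMU node copies an output memory block from one computing device's memory into another's, so the consumer node on the second device can read it as input.

```python
device_mem = {0: {}, 1: {}}   # per-computing-device memory, modeled as dicts

def run_node(device, name, inputs, mem=device_mem):
    # toy "execution": the node's output records which inputs it consumed
    out = name + "(" + ",".join(mem[device][i] for i in inputs) + ")"
    mem[device][name] = out    # output memory block of this node

def dmu_migrate(src_dev, dst_dev, block, mem=device_mem):
    # DMU node: migrate an output memory block across computing devices
    mem[dst_dev][block] = mem[src_dev][block]

device_mem[0]["in"] = "x"
run_node(0, "D", ["in"])       # computing device 0 executes node D
dmu_migrate(0, 1, "D")         # DMU node 0 moves output memory block D to device 1
run_node(1, "Q", ["D"])        # computing device 1 executes node Q on the migrated data
```

The dependency chain D -> DMU -> Q mirrors the directed edges compiled into the compiled data above: Q cannot run until the migration completes.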
• FIG. 20 is a schematic diagram of another expression of distributed execution provided by an embodiment of the present application.
• As shown in FIG. 20, thread-level subgraph 0 and thread-level subgraph 1 are executed in a distributed manner. The output data of child node 30 in thread-level subgraph 1 (output memory block 30) is the input data of child node 13 in thread-level subgraph 0.
• Computing device 0 executes thread-level subgraph 0, so computing device 0 executes child node 13; computing device 1 executes thread-level subgraph 1, so computing device 1 executes child node 30.
• Output memory block 30 is migrated from computing device 1 to computing device 0 by data management unit node 1 as the input data of child node 13.
• The directed edge between child node 30 and data management unit node 1 can be expressed as a dependency relationship of child node 30 and compiled into the compiled data of child node 30, and the compiled data of thread-level subgraph 1 includes the compiled data of child node 30. The directed edge between child node 30 and data management unit node 1 and the directed edge between data management unit node 1 and child node 13 are expressed as dependency relationships of data management unit node 1 and compiled into the compiled data of data management unit node 1, and the compiled data of thread-level subgraph 1 includes the compiled data of data management unit node 1. The directed edge between data management unit node 1 and child node 13 is expressed as a dependency relationship of child node 13 and compiled into the compiled data of child node 13, and the compiled data of thread-level subgraph 0 includes the compiled data of child node 13.
• Data management unit node 1 can be regarded as the child node obtained by dividing data management unit node 1 into a single thread. In addition, FIG. 20 also includes data management unit child node 00 and data management unit child node 01, which are obtained by dividing data management unit node 0 into threads.
• The output data of child node 12 in thread-level subgraph 0 (output memory block 12) is the input data of child node 50 in thread-level subgraph 1, and the output data of child node 13 in thread-level subgraph 0 (output memory block 13) is the input data of child node 51 in thread-level subgraph 1. Child node 12 and child node 13 are executed by computing device 0, and child node 50 and child node 51 are executed by computing device 1. Output memory block 12 is migrated from computing device 0 to computing device 1 by data management unit child node 01 as the input data of child node 50, and output memory block 13 is migrated from computing device 0 to computing device 1 by data management unit child node 00 as the input data of child node 51.
• The directed edge between child node 12 and data management unit child node 01 can be expressed as a dependency relationship of child node 12 and compiled into the compiled data of child node 12, and the compiled data of thread-level subgraph 0 includes the compiled data of child node 12. The directed edge between child node 12 and data management unit child node 01 and the directed edge between data management unit child node 01 and child node 50 are expressed as dependency relationships of data management unit child node 01 and compiled into the compiled data of data management unit child node 01, and the compiled data of thread-level subgraph 0 includes the compiled data of data management unit child node 01. The directed edge between data management unit child node 01 and child node 50 is expressed as a dependency relationship of child node 50 and compiled into the compiled data of child node 50, and the compiled data of thread-level subgraph 1 includes the compiled data of child node 50.
• Similarly, the directed edge between child node 13 and data management unit child node 00 can be expressed as a dependency relationship of child node 13 and compiled into the compiled data of child node 13, and the compiled data of thread-level subgraph 0 includes the compiled data of child node 13. The directed edge between child node 13 and data management unit child node 00 and the directed edge between data management unit child node 00 and child node 51 are expressed as dependency relationships of data management unit child node 00 and compiled into the compiled data of data management unit child node 00, and the compiled data of thread-level subgraph 0 includes the compiled data of data management unit child node 00. The directed edge between data management unit child node 00 and child node 51 is expressed as a dependency relationship of child node 51 and compiled into the compiled data of child node 51, and the compiled data of thread-level subgraph 1 includes the compiled data of child node 51.
• The following describes the graph execution device 200.
  • the graph execution device 200 includes a scheduling unit 201 and an execution unit 202, which are described in detail below:
• Scheduling unit 201: used to dispatch the compiled data of each reconstruction node or thread-level subgraph output by the graph compiling device 100 to the execution unit 202, with the reconstruction node or thread-level subgraph as the scheduling unit.
• Execution unit 202: configured to execute the compiled data of the reconstruction node or thread-level subgraph, with the reconstruction node or thread-level subgraph as the scheduling unit.
• FIG. 21 is a schematic structural diagram of a graph execution device 200 provided by an embodiment of the present application.
  • the graph execution device 200 may also include a cache 203 and a memory 204 .
• The compiled data of each reconstruction node or thread-level subgraph output by the graph compiling device 100 is initially stored in the memory 204; before executing a reconstruction node or thread-level subgraph, the graph execution device 200 preloads the compiled data of that reconstruction node or thread-level subgraph from the memory 204 into the cache 203, thereby improving runtime performance.
• The scheduling unit 201 can be a software unit, such as microcontroller (micro control unit, MCU) firmware, and the execution unit 202 is a hardware acceleration unit, so that the execution of reconstruction nodes or thread-level subgraphs can be completed through the cooperation of software and hardware.
  • the scheduling unit 201 is responsible for scheduling the compiled data of the reconstructed node or thread-level subgraph, and the execution unit 202 executes the compiled data of the reconstructed node or thread-level subgraph.
  • the specific process is as follows:
• The graph execution device 200 first preloads the compiled data of the reconstruction node or thread-level subgraph into the cache 203 and notifies the scheduling unit 201 to start execution, and the scheduling unit 201 reads the tasks of the initially ready nodes or child nodes from the cache 203.
• The compiled data of a reconstruction node includes the compiled data of at least one node, and the compiled data of each node includes the task corresponding to the node and the descriptor of the node; the descriptor of the node includes the dependency relationship of the node, which can be expressed as the number of nodes that the node depends on and the other nodes that depend on the node. If the number of nodes that a node initially depends on is 0, the task corresponding to that node is an initially ready task.
• Similarly, the compiled data of a thread-level subgraph includes the compiled data of at least one child node, and the compiled data of each child node includes the task corresponding to the child node and the descriptor of the child node; the descriptor of the child node includes the dependency relationship of the child node, which can be expressed as the number of child nodes that the child node depends on and the other child nodes that depend on the child node. If the number of child nodes that a child node initially depends on is 0, the task corresponding to that child node is an initially ready task.
  • the scheduling unit 201 pushes the tasks of the ready nodes or child nodes to the execution unit 202 for execution.
• A ready node refers to a node whose number of depended-on nodes is 0, including initially ready nodes and later-ready nodes.
• A later-ready node refers to a node whose initial number of depended-on nodes is not 0; during the execution of the reconstruction node, each time a node that this node depends on finishes executing, the number of nodes that this node depends on is decremented by 1 (that is, pred_cnt--). After all the nodes that this node depends on have been executed, that is, after the number of nodes that this node depends on is reduced to 0, this node becomes a ready node.
• A ready child node refers to a child node whose number of depended-on child nodes is 0, including initially ready child nodes and later-ready child nodes.
• A later-ready child node refers to a child node whose initial number of depended-on child nodes is not 0; during the execution of the thread-level subgraph, each time a child node that this child node depends on finishes executing, the number of child nodes that this child node depends on is decremented by 1. After all the child nodes that this child node depends on have been executed, that is, after the number of child nodes that this child node depends on is reduced to 0, this child node becomes a ready child node.
  • the execution unit 202 reads the descriptor of the ready node or sub-node from the cache 203, and executes the task of the ready node or sub-node based on the descriptor of the ready node or sub-node.
• The descriptor of a node includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the node; the storage location, life cycle, and cache management operation indication information of the output data of the node; the dependency relationship of the node; and the type of computing device used to execute the node. The execution unit 202 therefore executes the ready node based on the information included in the descriptor of the ready node.
• The descriptor of a child node includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the child node; the storage location, life cycle, and cache management operation indication information of the output data of the child node; the dependency relationship of the child node; the type of computing device used to execute the child node; and the other child nodes that are executed in parallel with the child node. The execution unit 202 therefore executes the ready child node based on the information included in the descriptor of the ready child node.
  • the execution unit 202 notifies the scheduling unit 201 after executing the task of the ready node or child node.
• The scheduling unit 201 reads the number of depended-on nodes of the other nodes that depend on the currently executed node and decrements that number by 1; when the number of nodes that another node depends on is reduced to 0, the task of that node is pushed to the execution unit 202 for execution.
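The scheduling loop described above can be sketched as follows. This is an illustrative software model, not the MCU firmware itself: tasks with a dependency count of 0 are ready; after a task finishes, the scheduler decrements the pred_cnt of every successor and pushes newly ready tasks.

```python
from collections import deque

def execute_reconstruction_node(pred_cnt, succ_list):
    """Run all tasks of a reconstruction node in dependency order."""
    pred_cnt = dict(pred_cnt)                                 # don't mutate caller's copy
    readyq = deque(n for n, c in pred_cnt.items() if c == 0)  # initially ready tasks
    order = []
    while readyq:
        node = readyq.popleft()
        order.append(node)                    # execution unit runs the task
        for succ in succ_list.get(node, []):  # notify scheduler on completion
            pred_cnt[succ] -= 1               # pred_cnt--
            if pred_cnt[succ] == 0:           # later-ready node becomes ready
                readyq.append(succ)
    return order

# node 0 has no dependencies; node 1 and node 2 both depend on node 0
order = execute_reconstruction_node(
    pred_cnt={"node0": 0, "node1": 1, "node2": 1},
    succ_list={"node0": ["node1", "node2"]},
)
```

Node 0 is the only initially ready task; nodes 1 and 2 become ready only after its completion decrements their counters to 0.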
• The graph compiling device 100 also includes a memory. After the graph compiling device 100 compiles and obtains the compiled data of a reconstruction node or thread-level subgraph, it can store the compiled data in the memory of the graph compiling device 100. Before the graph execution device 200 executes the reconstruction node or thread-level subgraph, the graph compiling device 100 writes the compiled data of the reconstruction node or thread-level subgraph into the cache 203 in advance; or, before executing the reconstruction node or thread-level subgraph, the graph execution device 200 preloads the compiled data of the reconstruction node or thread-level subgraph into the cache 203 in advance.
• In this way, runtime performance can be improved: the memory 204 on the graph execution device 200 side can be reduced and the memory of the graph compiling device 100 can be fully utilized; the processing steps of the graph compiling device 100 can be reduced; and the asynchronous pipeline between the graph compiling device 100 and the graph execution device 200 can be adjusted to make the pipeline more balanced.
• The graph compiling device 100 may be set on the host side or be the host itself, and the graph execution device 200 may be set on the device side or be the device itself.
• FIG. 22 and FIG. 23 are schematic diagrams of host-side and device-side asynchronous pipelines provided by embodiments of the present application.
• As shown in FIG. 22, the host side and the device side can implement asynchronous pipeline operations.
• The host side compiles a reconstruction node or thread-level subgraph to obtain its compiled data and stores the compiled data in the host-side memory. Before the device side executes the reconstruction node or thread-level subgraph, the host side writes the compiled data to the device side. The device side then starts to execute the reconstruction node or thread-level subgraph while the host side starts to compile another reconstruction node or thread-level subgraph, obtains its compiled data, and stores it in the host-side memory. After the device side finishes executing the current reconstruction node or thread-level subgraph and before it executes the other one, the host side writes the compiled data of the other reconstruction node or thread-level subgraph to the device side. The above operations are repeated to realize an asynchronous pipeline.
• As shown in FIG. 23, the host side and the device side can likewise implement asynchronous pipeline operations.
• The host side compiles a reconstruction node or thread-level subgraph to obtain its compiled data and stores the compiled data in the host-side memory. Before executing the reconstruction node or thread-level subgraph, the device side reads its compiled data from the host side. The device side then starts to execute the reconstruction node or thread-level subgraph while the host side starts to compile another reconstruction node or thread-level subgraph, obtains its compiled data, and stores it in the host-side memory. After the device side finishes executing the current reconstruction node or thread-level subgraph and before it executes the other one, the device side reads the compiled data of the other reconstruction node or thread-level subgraph from the host side. The above operations are repeated to realize an asynchronous pipeline.
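The host/device asynchronous pipeline above can be sketched with a bounded queue standing in for the device-side cache. This is a simplified illustration (thread and queue names are assumed): the host compiles and hands off one unit at a time while the device executes concurrently.

```python
import queue
import threading

compiled_q = queue.Queue(maxsize=1)   # stands in for the device-side cache (one unit deep)
executed = []

def host(units):
    for u in units:
        # compile a reconstruction node / thread-level subgraph, then write it to the device;
        # put() blocks while the device is still holding the previous unit
        compiled_q.put(f"compiled({u})")
    compiled_q.put(None)              # end-of-stream marker

def device():
    while True:
        data = compiled_q.get()       # read the next unit's compiled data
        if data is None:
            break
        executed.append(f"ran {data}")  # execute while the host compiles the next unit

units = ["rn0", "rn1", "rn2"]
th = threading.Thread(target=host, args=(units,))
td = threading.Thread(target=device)
th.start(); td.start()
th.join(); td.join()
```

Because the queue is bounded, compilation of unit N+1 overlaps with execution of unit N, which is the pipelining the figures describe.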
• The host side and the device side preload the compiled data of the reconstruction node or thread-level subgraph into the device-side cache through a high-speed serial computer expansion bus standard channel (PCIe-through), which can bring the following performance improvements: the memory usage on the device side is reduced and the host-side memory is fully utilized; the processing steps on the host side are reduced; and the asynchronous pipeline is adjusted to make the pipeline more balanced.
• The timing of preloading the compiled data of any node in a reconstruction node, or of any child node in a thread-level subgraph, into the cache 203 is as follows:
• after the scheduling unit 201 is notified that a node that a later-ready node depends on has finished executing (for example, after the first executed node that the later-ready node depends on completes), and before the number of nodes that the later-ready node depends on is decremented by 1, the compiled data of the later-ready node is preloaded into the cache 203. Thus, when the number of nodes that the later-ready node depends on is reduced to 0 and it is pushed to the hardware for execution, its compiled data is already in the cache 203.
  • FIG. 24 is a schematic diagram of a dependency relationship between nodes in a reconstructed node or a dependency relationship between sub-nodes in a thread-level subgraph provided by an embodiment of the present application.
• As shown in FIG. 24, the reconstruction node includes node 0, node 1, and node 2, and both node 1 and node 2 depend on node 0. The compiled data of node 0 is preloaded into the cache 203 before the reconstruction node is executed; the compiled data of node 1 is preloaded into the cache 203 after the execution of node 0 is completed and before the number of nodes that node 1 depends on is decremented by 1; and the compiled data of node 2 is preloaded into the cache 203 after the execution of node 0 is completed and before the number of nodes that node 2 depends on is decremented by 1.
• Similarly, the thread-level subgraph includes child node 0, child node 1, and child node 2, and both child node 1 and child node 2 depend on child node 0. The compiled data of child node 0 is preloaded into the cache 203 before the thread-level subgraph is executed; the compiled data of child node 1 is preloaded into the cache 203 after the execution of child node 0 is completed and before the number of child nodes that child node 1 depends on is decremented by 1; and the compiled data of child node 2 is preloaded into the cache 203 after the execution of child node 0 is completed and before the number of child nodes that child node 2 depends on is decremented by 1.
• The input data of the reconstruction node or thread-level subgraph can also be stored in the memory of the graph compiling device 100, so as to further reduce the memory usage of the graph execution device 200 and make the asynchronous pipeline on the graph compiling device 100 side more balanced.
• The input data of a reconstruction node mainly refers to the output data of nodes outside the reconstruction node, that is, the data coming in from outside the reconstruction node, including the input data of the initially ready nodes in the reconstruction node. Similarly,
• the input data of a thread-level subgraph mainly refers to the output data of child nodes outside the thread-level subgraph, that is, the data coming in from outside the thread-level subgraph, including the input data of the initially ready child nodes in the thread-level subgraph.
• Since the graph compiling device 100 can be set on the host side or be the host itself, and the graph execution device 200 can be set on the device side or be the device itself, the input data of the reconstruction node or thread-level subgraph can also be placed in the host-side memory, and the device side preloads the input data of the reconstruction node or thread-level subgraph into the device-side cache through direct memory access (Direct Memory Access, DMA), so as to further reduce the memory usage of the device side and make the asynchronous pipeline on the host side more balanced.
• The timing of the cache management operations on the input and output data of a node or child node is described as follows:
• For the input data of the reconstruction node (that is, the data coming in from outside the reconstruction node) that serves as the input data of a node in the reconstruction node:
• the input data is prefetched according to its life cycle, that is, the input data is written into the cache 203 before the reconstruction node or the node is executed; and an invalid operation is performed on the input data according to its life cycle, that is, after the input data is used as the input data of other nodes in the reconstruction node for the last time, the input data in the cache 203 is deleted.
• Likewise, for the input data of the thread-level subgraph (that is, the data coming in from outside the thread-level subgraph) that serves as the input data of a child node of the thread-level subgraph:
• the input data is prefetched according to its life cycle, that is, the input data is written into the cache 203 before the thread-level subgraph or the child node is executed; and an invalid operation is performed on the input data according to its life cycle, that is, after the input data is used as the input data of other child nodes in the thread-level subgraph for the last time, the input data in the cache 203 is deleted.
• The output data of a node in the reconstruction node is written back according to its life cycle, that is, the output data of the node is written into the memory; the output data is flushed according to its life cycle, that is, after the output data of the node is written into the memory, the output data of the node in the cache is deleted; and the output data is invalidated according to its life cycle, that is, after the output data is used as the input data of other nodes in the reconstruction node for the last time, the output data in the cache 203 is deleted.
• Likewise, the output data of a child node in the thread-level subgraph is written back according to its life cycle, that is, the output data of the child node is written into the memory; the output data is flushed according to its life cycle, that is, after the output data of the child node is written into the memory, the output data of the child node in the cache is deleted; and the output data is invalidated according to its life cycle, that is, after the output data is used as the input data of other child nodes in the thread-level subgraph for the last time, the output data in the cache 203 is deleted.
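The life-cycle-driven cache operations above can be sketched as follows. This is a simplified model (the class, method, and block names are illustrative): input data is prefetched before a node runs and invalidated after its last use, and output data is written back to memory and flushed from the cache.

```python
class Cache:
    """Toy cache with the operations named in the text: prefetch, write back, flush, invalidate."""
    def __init__(self):
        self.lines = {}    # cache lines currently held
        self.memory = {}   # backing memory for written-back blocks

    def prefetch(self, block, memory):
        # write the input data into the cache before the node executes
        self.lines[block] = memory[block]

    def write(self, block, value):
        # node produces its output data into the cache
        self.lines[block] = value

    def write_back_and_flush(self, block):
        # write the output data into memory, then delete it from the cache
        self.memory[block] = self.lines.pop(block)

    def invalidate(self, block):
        # delete a block after its last use as another node's input
        self.lines.pop(block, None)

mem = {"in0": 7}
cache = Cache()
cache.prefetch("in0", mem)                     # before the first node executes
cache.write("out0", cache.lines["in0"] + 1)    # the node computes its output
cache.invalidate("in0")                        # last use of in0 inside the subgraph
cache.write_back_and_flush("out0")             # out0 leaves the subgraph -> memory
```

After the sequence, the cache is empty and only the written-back output survives in memory, matching the life-cycle rules in the text.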
  • FIG. 25 is a schematic diagram of a buffer management operation of input and output data of a node or a sub-node provided by an embodiment of the present application.
• In FIG. 25, readyq denotes the ready-to-run queue and cq denotes the completion queue.
• The scheduling unit 201 uses the descriptor of node 1 or child node 1 to instruct the hardware to initiate a prefetch operation for data 0: when data 0 needs to be prefetched, the scheduling unit 201 pushes the descriptor of node 1 or child node 1 to the ready-to-run queue, after which the hardware selects an opportune moment to initiate the prefetch.
  • FIG. 26 is a schematic flowchart of a method for compiling a subgraph provided by an embodiment of the present application.
• The method for compiling a subgraph is applied to the graph compiling device 100; the method for compiling a subgraph includes but is not limited to the following operations or steps:
• Step 2601: Obtain a first subgraph, where the first subgraph is any one of multiple subgraphs obtained by cutting the calculation graph, and the first subgraph includes multiple first nodes and directed edges between the multiple first nodes.
  • the calculation graph in this application is a calculation graph of a neural network model, and one calculation graph represents the operation of a neural network model.
  • directed edges in the calculation graph include data edges and control edges, and the directed edges described in this application also include data edges and control edges.
• The directed edges between the multiple first nodes include data edges and control edges between the multiple first nodes.
• Step 2602: Reconstruct the first subgraph to obtain a first reconstructed subgraph, where the first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node.
  • the first reconstructed subgraph is also obtained by reconstructing the first subgraph based on the reconstruction principle described in FIG. 3 .
• Step 2603: Compile a first target reconstruction node to obtain compiled data of the first target reconstruction node, where the first target reconstruction node is any one of the at least one first reconstruction node, and the first target reconstruction node includes M first nodes among the multiple first nodes and directed edges between the M first nodes, where M is a positive integer.
  • the compiled data of the first target reconstruction node includes compiled data of the M first nodes.
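Steps 2601 to 2603 can be sketched end to end as below. All names are illustrative assumptions (the application does not specify a compiler API); the point is that compiling one reconstruction node means compiling its M member nodes, and the reconstruction node's compiled data is the collection of those nodes' compiled data.

```python
def compile_node(node):
    # stand-in for per-node compilation: produce a task plus a descriptor
    return {"task": f"kernel_{node}", "descriptor": {"node": node}}

def compile_reconstruction_node(member_nodes):
    # compiled data of the reconstruction node = compiled data of its M first nodes
    return [compile_node(n) for n in member_nodes]

first_subgraph = ["a", "b", "c", "d"]           # step 2601: a subgraph cut from the graph
first_reconstructed = [["a", "b"], ["c", "d"]]  # step 2602: assumed reconstruction into 2 nodes
target = first_reconstructed[0]                 # any first reconstruction node (M = 2)
compiled = compile_reconstruction_node(target)  # step 2603: one scheduling unit compiled
```

Scheduling `target` once compiles both of its member nodes, which is the graph scheduling mechanism the text describes.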
• Since a first reconstruction node includes M first nodes and the directed edges between the M first nodes, a first reconstruction node is essentially a small subgraph. Therefore, scheduling a first reconstruction node for compilation means scheduling a subgraph smaller in scale than the first subgraph for compilation, so the process of scheduling the first target reconstruction node for compilation is a compilation process under a graph scheduling mechanism.
• In this application, the calculation graph is divided into multiple subgraphs, and any subgraph among the multiple subgraphs is reconstructed to obtain a reconstructed subgraph; then, with the reconstruction node in the reconstructed subgraph as the scheduling unit, any reconstruction node in the reconstructed subgraph is compiled.
  • the reconstructed subgraph obtained in this application includes at least one reconstructed node and a directed edge between the at least one reconstructed node, so the reconstructed subgraph obtained in this application is essentially a subgraph; and, Since any reconstructed node in the reconstructed subgraph includes one or more nodes in the arbitrary subgraph and directed edges between one or more nodes, the reconstructed node is essentially a A subgraph with a smaller subgraph size.
  • using the reconstruction node as the scheduling unit is essentially a graph scheduling mechanism: scheduling one reconstruction node for compilation at a time is equivalent to scheduling one or more nodes of the subgraph for compilation, so using the reconstruction node as the scheduling unit when compiling the subgraph reduces scheduling time.
  • the compiled data of a reconstruction node, obtained by compiling that reconstruction node, is used to execute one or more nodes of the subgraph; when the subgraph is executed, one reconstruction node is scheduled at a time, and the compiled data of that reconstruction node executes one or more nodes of the subgraph in one scheduling step. Scheduling one reconstruction node for execution is therefore equivalent to scheduling one or more nodes of the subgraph for execution, so using the reconstruction node as the scheduling unit when executing the subgraph also reduces scheduling time.
  • the first subgraph includes multiple first nodes and the directed edges between them, and the first reconstructed subgraph obtained by reconstructing the first subgraph includes at least one first reconstruction node and the directed edges between those reconstruction nodes; the first target reconstruction node includes M first nodes of the first subgraph and the directed edges between the M first nodes, where M is a positive integer. Compiling the first target reconstruction node is therefore equivalent to compiling the M first nodes, and the compiled data of the first target reconstruction node is used to execute the M first nodes, that is, executing the first target reconstruction node is equivalent to executing the M first nodes.
  • the embodiment of the present application thus provides a graph scheduling mechanism that uses the graph-structured reconstruction node as the scheduling unit, which reduces scheduling time.
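The graph scheduling mechanism above can be sketched as follows: a subgraph's nodes are grouped into reconstruction nodes, and each reconstruction node (itself a small graph) is compiled as one scheduling unit rather than node by node. All names (`Node`, `ReconstructionNode`, `compile_reconstruction_node`) are illustrative assumptions, not from the patent.

```python
class Node:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)   # names of predecessor nodes (directed edges)

class ReconstructionNode:
    """A group of M nodes plus the directed edges among them."""
    def __init__(self, nodes):
        self.nodes = nodes

def compile_reconstruction_node(rnode):
    # One scheduling call compiles all M nodes in the reconstruction node,
    # so the scheduler is invoked once per group instead of once per node.
    return {n.name: f"compiled({n.name})" for n in rnode.nodes}

a, b, c = Node("a"), Node("b", deps=["a"]), Node("c", deps=["b"])
first_subgraph = [a, b, c]
# Reconstruct: here we simply group the whole subgraph into one node.
rnode = ReconstructionNode(first_subgraph)
compiled = compile_reconstruction_node(rnode)
print(compiled["b"])  # compiled(b)
```

With a realistic scheduler, each `compile_reconstruction_node` call replaces M separate per-node scheduling calls, which is the source of the claimed reduction in scheduling time.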
  • the compiling of the first target reconstruction node to obtain the compiled data of the first target reconstruction node includes: dividing each of the M first nodes into at least one first child node to obtain a first thread-level subgraph, the first thread-level subgraph including N first child nodes and the directed edges between the N first child nodes, where N is a positive integer greater than or equal to M; and compiling the first thread-level subgraph to obtain the compiled data of the first target reconstruction node.
  • the compiled data of the first target reconstruction node includes compiled data of the N first child nodes.
  • the sub-nodes described in this application are obtained by dividing nodes into threads: a node is divided into at least one sub-node, each of which is a thread, and the sub-nodes of the same node can be executed in parallel.
  • the first thread-level subgraph is also obtained by dividing the first target reconstruction node into threads based on the principle of thread division described in FIG. 4 or FIG. 5 .
  • any reconstruction node includes M first nodes and the directed edges between them; when compiling such a reconstruction node, each of the M first nodes is divided into at least one first child node, that is, the operator operation represented by each first node is divided into at least one thread, each thread being represented by one child node. Dividing the M first nodes into threads yields N first child nodes, where N is a positive integer greater than or equal to M, and the N first child nodes together with the directed edges between them constitute the first thread-level subgraph. Compiling the first thread-level subgraph yields the compiled data of the first target reconstruction node, which includes the compiled data of the N first child nodes; since executing the N first child nodes is equivalent to executing the M first nodes, and the child nodes divided from the same first node can be executed concurrently, the execution of the reconstruction node can be parallelized.
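A minimal sketch of this thread-level division, under the assumption that children of the same node get no edges between each other (so they may run concurrently) while an edge between two nodes is inherited by all pairs of their children. The function and variable names are illustrative, not the patent's.

```python
def split_into_children(node_name, num_threads):
    # Each node's operator work is split into one child node per thread.
    return [f"{node_name}.t{i}" for i in range(num_threads)]

def build_thread_level_subgraph(nodes, edges, threads_per_node):
    children = {n: split_into_children(n, threads_per_node[n]) for n in nodes}
    child_edges = []
    for src, dst in edges:  # node-level edge -> edges between their children
        for cs in children[src]:
            for cd in children[dst]:
                child_edges.append((cs, cd))
    all_children = [c for n in nodes for c in children[n]]
    return all_children, child_edges

nodes = ["matmul", "relu"]            # M = 2 first nodes
edges = [("matmul", "relu")]
children, child_edges = build_thread_level_subgraph(
    nodes, edges, {"matmul": 2, "relu": 2})
print(len(children))                             # N = 4 child nodes (N >= M)
print(("matmul.t0", "relu.t1") in child_edges)   # True
```

Note that `matmul.t0` and `matmul.t1` share no edge, matching the statement that child nodes divided from the same node can execute in parallel.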
  • the compiled data of the first target reconstruction node includes compiled data of a first object, where the first object is one of the M first nodes or one of the N first child nodes. The compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation (CMO) indication information of the input data of the first object; the storage location, life cycle, and CMO indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
  • the storage location of input and output data includes the storage location in memory;
  • the cache management operation indication information includes indication information for operations such as the prefetch operation, invalidate operation, write-back operation, and flush operation; the specific processes of these operations are shown in FIG. 10 to FIG. 12.
  • compiling one or more of the storage location, life cycle, and cache management operation indication information of the input and output data, the dependency relationship, and the type of computing device that executes a node or sub-node into the compiled data of that node or sub-node facilitates the execution of that node or sub-node and improves performance.
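One possible in-memory shape for such compiled data is sketched below: per-tensor storage location, life cycle, and CMO indications for inputs and outputs, plus the object's dependency counts and target device type. All field names and values are illustrative assumptions, not the patent's actual encoding.

```python
from dataclasses import dataclass, field

@dataclass
class TensorInfo:
    storage_location: int                    # e.g. a memory offset
    lifecycle: tuple                         # (first use step, last use step)
    cmo: list = field(default_factory=list)  # e.g. ["prefetch", "invalidate"]

@dataclass
class CompiledObject:
    inputs: list        # list[TensorInfo] for the object's input data
    outputs: list       # list[TensorInfo] for the object's output data
    dep_count: int      # number of objects this object depends on
    dependents: list    # objects that depend on this object
    device_type: str    # type of computing device executing the object

obj = CompiledObject(
    inputs=[TensorInfo(0x1000, (0, 3), ["prefetch"])],
    outputs=[TensorInfo(0x2000, (1, 5), ["write_back", "invalidate"])],
    dep_count=2,
    dependents=["relu"],
    device_type="ai_core")
print(obj.device_type)  # ai_core
```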
  • if the first object is one of the N first child nodes, the compiled data of the first object further includes a second object, where the second object is another of the N first child nodes, and the second object and the first object are obtained by dividing the same first node.
  • the information of other sub-nodes executed in parallel with a sub-node can be compiled into the compiled data of that sub-node, which facilitates the parallel execution of that sub-node with the other sub-nodes.
  • the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: write the target input data from the memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of any first node in the first target reconstruction node, or is not the output data of any first child node in the first thread-level subgraph; and delete the target input data from the cache after the target input data has been input to the first nodes in the first target reconstruction node a first preset number of times, or after it has been input to the first child nodes in the first thread-level subgraph a first preset number of times, that is, delete the target input data from the cache after it has been used for the last time as input data of a first node in the first target reconstruction node, or for the last time as input data of a first child node in the first thread-level subgraph.
  • the first preset number of times is the consumption count of the target input data in the first target reconstruction node or the first thread-level subgraph.
  • for the target input data of any node in the reconstruction node that is not the output data of another node in the reconstruction node, the cache management operation indication information of the node's input data needs to indicate: write the target input data from the memory into the cache before executing the node, so that the node can execute; optionally, it can also indicate: delete the target input data from the cache after it no longer needs to be used as input data of other nodes in the reconstruction node, so as to release cache space reasonably.
  • likewise, for the target input data of any child node in the thread-level subgraph that is not the output data of another child node, the cache management operation indication information of the child node's input data needs to indicate: write the target input data from the memory into the cache before executing the child node, so that the child node can execute; optionally, it can also indicate: delete the target input data from the cache after it no longer needs to be used as input data of other child nodes in the thread-level subgraph, so as to release cache space reasonably.
  • the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: write the output data of the first object into memory; delete the output data of the first object from the cache after it has been written into memory; and delete the output data of the first object from the cache after it has been input to the first nodes in the first target reconstruction node a second preset number of times, or after it has been input to the first child nodes in the first thread-level subgraph a second preset number of times, that is, delete the output data of the first object from the cache after it has been used for the last time as input data of a first node in the first target reconstruction node, or for the last time as input data of a first child node in the first thread-level subgraph.
  • the second preset number of times is the consumption count of the output data of the first object in the first target reconstruction node or the first thread-level subgraph.
  • the cache management operation indication information of the output data of any node in the reconstruction node optionally indicates: write the output data of the node into memory, so that the output data can be used for other purposes; and delete the output data of the node from the cache after it has been written into memory, or after it no longer needs to be used as input data of other nodes in the reconstruction node, so as to release cache space reasonably.
  • similarly, the cache management operation indication information of the output data of any sub-node in the thread-level subgraph optionally indicates: write the output data of the sub-node into memory, so that the output data can be used for other purposes; and delete the output data of the sub-node from the cache after it has been written into memory, or after it no longer needs to be used as input data of other sub-nodes in the thread-level subgraph, so as to release cache space reasonably.
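The consumption-count driven release described above can be sketched with a toy cache model: a tensor is prefetched into the cache before its first consumer runs, and invalidated (deleted from the cache) once it has been consumed the preset number of times. The `Cache` class and its methods are assumptions for illustration only.

```python
class Cache:
    def __init__(self):
        self.lines = {}  # tensor name -> remaining consumption count

    def prefetch(self, tensor, preset_count):
        # Write the data from memory into the cache before first use;
        # preset_count is the tensor's consumption count in the group.
        self.lines[tensor] = preset_count

    def consume(self, tensor):
        self.lines[tensor] -= 1
        if self.lines[tensor] == 0:  # last use reached: release cache space
            del self.lines[tensor]

cache = Cache()
cache.prefetch("x", preset_count=2)  # "x" feeds two nodes in the group
cache.consume("x")
print("x" in cache.lines)  # True  (one consumer still pending)
cache.consume("x")
print("x" in cache.lines)  # False (invalidated after last consumption)
```

In the patent's scheme the compiler would emit these prefetch/invalidate indications statically into the compiled data, rather than deciding them at run time.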
  • the dependency relationship of the first object includes: the number of nodes that the first object depends on and the nodes that depend on the first object; or the number of child nodes that the first object depends on and the child nodes that depend on the first object.
  • the dependency relationship of any node in the reconstruction node includes the number of nodes that the node depends on; during execution, the node can be executed once the number of unfinished nodes it depends on has been reduced to 0. The dependency relationship of a node also includes the other nodes that depend on it; in this way, the execution order among the nodes in the reconstruction node can be controlled.
  • likewise, the dependency relationship of any child node in the thread-level subgraph includes the number of child nodes that the child node depends on; during execution, the child node can be executed once the number of unfinished child nodes it depends on has been reduced to 0. The dependency relationship of a child node also includes the other child nodes that depend on it; in this way, the execution order among the child nodes in the thread-level subgraph can be controlled.
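This dependency counting behaves like Kahn-style topological execution: each node carries the number of nodes it depends on and becomes runnable when that count drops to zero, at which point its dependents' counts are decremented. The sketch below is illustrative, not the patent's runtime.

```python
from collections import deque

def execute(dep_count, dependents):
    """Run nodes in dependency order using countdown-to-zero scheduling."""
    order = []
    ready = deque(n for n, c in dep_count.items() if c == 0)
    while ready:
        node = ready.popleft()
        order.append(node)                  # "execute" the node
        for nxt in dependents.get(node, []):
            dep_count[nxt] -= 1
            if dep_count[nxt] == 0:         # all prerequisites finished
                ready.append(nxt)
    return order

dep_count = {"a": 0, "b": 1, "c": 2}        # c depends on a and b
dependents = {"a": ["b", "c"], "b": ["c"]}
print(execute(dep_count, dependents))  # ['a', 'b', 'c']
```

The same loop works unchanged whether the items are nodes in a reconstruction node or child nodes in a thread-level subgraph.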
  • the dependency relationship of the first object includes a first dependency relationship of the first object, which is used to indicate the directed edge between the first object and a third object; wherein, if the first object is one of the M first nodes, the third object is another of the M first nodes; if the first object is one of the N first child nodes, the third object is another of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
  • the directed edges between the nodes in the reconstruction node can be expressed at compile time as the dependency relationships of the nodes, for example as the first dependency relationship of a node, thereby expressing the directed edges between the nodes in the reconstruction node.
  • similarly, the directed edges between the sub-nodes in the thread-level subgraph can be expressed at compile time as the dependency relationships of the sub-nodes, for example as the first dependency relationship of a sub-node, thereby expressing the directed edges between the sub-nodes in the thread-level subgraph; here, the directed edges between sub-nodes mainly refer to the directed edges between sub-nodes obtained by dividing different nodes.
  • if the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, and the fourth object is executed by a second computing device, then the dependency relationship of the first object further includes a second dependency relationship of the first object, which is used to represent the directed edge between the first object and a first communication operator node. The first communication operator node transmits the output data of the first object from the first computing device to the second computing device; wherein, if the first object is one of the M first nodes, the fourth object is a first target node in a second target reconstruction node: the second target reconstruction node is either a first reconstruction node other than the first target reconstruction node among the at least one first reconstruction node, in which case the first target node is a first node other than the M first nodes among the plurality of first nodes, or any second reconstruction node in a second reconstructed subgraph obtained by reconstructing a second subgraph (a subgraph other than the first subgraph among the multiple subgraphs), in which case the first target node is a second node in the second subgraph. If the first object is one of the N first child nodes, the fourth object is any second child node in a second thread-level subgraph, which is obtained by dividing the nodes of a second target reconstruction node into threads.
  • data communication between multiple computing devices (such as multiple chips or multiple processors) and multiple servers is realized through collective communication operators. In the present application, data may be transmitted from the first computing device to the second computing device through a first collective communication operator, that is, the first communication operator node is the node representing the first collective communication operator; a collective communication operator can be represented as a collective communication operator node, so the first communication operator node is the first collective communication operator node. Like the nodes in the calculation graph, subgraph, or reconstruction node, a collective communication operator node can also be divided into threads.
  • the method of obtaining the second reconstructed subgraph in this application is the same as that of obtaining the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph and includes at least one first reconstruction node and the directed edges between them; because the first subgraph includes multiple first nodes and the directed edges between them, the first target reconstruction node in the first reconstructed subgraph includes at least one of the first nodes and the directed edges between them. Similarly, the second reconstructed subgraph is obtained by reconstructing the second subgraph and includes at least one second reconstruction node and the directed edges between them; because the second subgraph includes multiple second nodes and the directed edges between them, any second reconstruction node in the second reconstructed subgraph includes one or more of the second nodes and the directed edges between them.
  • the second reconstructed subgraph is also obtained based on the reconstruction principle described in FIG. 3, and the second thread-level subgraph is also obtained based on the thread division principle described in FIG. 4 or FIG. 5.
  • multiple reconstruction nodes can be executed in a distributed manner. Taking two reconstruction nodes and two computing devices as an example, one reconstruction node is executed by one computing device and the other reconstruction node by the other; the two reconstruction nodes may belong to the same reconstructed subgraph or to different reconstructed subgraphs. If the output data of a node in one reconstruction node is the input data of a node in the other reconstruction node, that output data must be transmitted from one computing device to the other. This application can use a collective communication operator node to transmit the output data of the node in one reconstruction node from one computing device to the other; because the collective communication operator node carries that output data to the other computing device as the input data of a node in the other reconstruction node, there is a directed edge between the node in the first reconstruction node and the collective communication operator node, and a directed edge between the collective communication operator node and the node in the other reconstruction node. The former directed edge can be expressed as a dependency relationship of the node in the first reconstruction node, and the latter as a dependency relationship of the node in the other reconstruction node, for example as the second dependency relationship of the node in the first reconstruction node.
  • similarly, multiple thread-level subgraphs can be executed in a distributed manner. Taking two thread-level subgraphs and two computing devices as an example, one thread-level subgraph is executed by one computing device and the other by the other; the two thread-level subgraphs may be obtained by dividing different reconstruction nodes of the same reconstructed subgraph into threads, or by dividing reconstruction nodes of different reconstructed subgraphs into threads. If the output data of a child node in one thread-level subgraph is the input data of a child node in the other, that output data must be transmitted from one computing device to the other; this application can use a collective communication operator node to perform the transmission. Because the collective communication operator node carries the output data of the child node in one thread-level subgraph to the other computing device as the input data of the child node in the other thread-level subgraph, there is a directed edge between the child node in the first thread-level subgraph and the collective communication operator node, and between the collective communication operator node and the child node in the other thread-level subgraph.
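The cross-device rewriting above can be sketched as an edge transformation: any producer-to-consumer edge whose endpoints are placed on different devices is routed through a communication operator node, giving the producer its extra ("second") dependency edge into the comm node and the comm node an edge into the consumer. All names here are illustrative assumptions.

```python
def insert_comm_node(edges, placement):
    """Rewrite producer->consumer edges that cross devices via a comm node."""
    new_edges = []
    for src, dst in edges:
        if placement[src] != placement[dst]:
            comm = f"comm({src}->{dst})"
            new_edges.append((src, comm))   # edge: producer -> comm node
            new_edges.append((comm, dst))   # edge: comm node -> consumer
        else:
            new_edges.append((src, dst))    # same device: keep edge as-is
    return new_edges

placement = {"n1": "device0", "n2": "device1"}
print(insert_comm_node([("n1", "n2")], placement))
# [('n1', 'comm(n1->n2)'), ('comm(n1->n2)', 'n2')]
```

The same transformation applies whether `n1` and `n2` are nodes of reconstruction nodes or child nodes of thread-level subgraphs.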
  • if the input data of the first object includes the output data of a fifth object, the first object is executed by the first computing device, and the fifth object is executed by a third computing device, then the dependency relationship of the first object further includes a third dependency relationship of the first object, which is used to represent the directed edge between the first object and a second communication operator node. The second communication operator node transmits the output data of the fifth object from the third computing device to the first computing device; wherein, if the first object is one of the M first nodes, the fifth object is a second target node in a third target reconstruction node: the third target reconstruction node is either a first reconstruction node other than the first target reconstruction node among the at least one first reconstruction node, in which case the second target node is a first node other than the M first nodes among the plurality of first nodes, or any third reconstruction node in a third reconstructed subgraph obtained by reconstructing a third subgraph (a subgraph other than the first subgraph among the multiple subgraphs), in which case the second target node is a third node in the third subgraph. If the first object is one of the N first child nodes, the fifth object is any third sub-node in a third thread-level subgraph, which is obtained by dividing the nodes of a third target reconstruction node into threads.
  • the present application implements data communication between multiple computing devices (such as multiple chips or multiple processors) and multiple servers through collective communication operators. Data is transmitted from the third computing device to the first computing device through a second collective communication operator, so the second communication operator node is the node representing the second collective communication operator, that is, the second communication operator node is the second collective communication operator node.
  • the method of obtaining the third reconstructed subgraph in this application is the same as that of obtaining the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph and includes at least one first reconstruction node and the directed edges between them; because the first subgraph includes multiple first nodes and the directed edges between them, the first target reconstruction node in the first reconstructed subgraph includes at least one of the first nodes and the directed edges between them. Similarly, the third reconstructed subgraph is obtained by reconstructing the third subgraph and includes at least one third reconstruction node and the directed edges between them; because the third subgraph includes multiple third nodes and the directed edges between them, any third reconstruction node in the third reconstructed subgraph includes one or more of the third nodes and the directed edges between them.
  • the third reconstructed subgraph is also obtained based on the reconstruction principle described in FIG. 3, and the third thread-level subgraph is also obtained based on the thread division principle described in FIG. 4 or FIG. 5.
  • the fourth target reconstruction node is a first reconstruction node other than the first target reconstruction node among the at least one first reconstruction node, in which case the third target node is a first node other than the M first nodes among the plurality of first nodes; or the fourth target reconstruction node is any fourth reconstruction node in a fourth reconstructed subgraph obtained by reconstructing a fourth subgraph (a subgraph other than the first subgraph among the multiple subgraphs), in which case the third target node is a fourth node in the fourth subgraph.
  • the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node further includes the dependency relationship of the first target reconstruction node, which is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
  • the method of obtaining the fourth reconstructed subgraph in this application is the same as that of obtaining the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph and includes at least one first reconstruction node and the directed edges between them; because the first subgraph includes multiple first nodes and the directed edges between them, the first target reconstruction node in the first reconstructed subgraph includes at least one of the first nodes and the directed edges between them. Similarly, the fourth reconstructed subgraph is obtained by reconstructing the fourth subgraph and includes at least one fourth reconstruction node and the directed edges between them; because the fourth subgraph includes multiple fourth nodes and the directed edges between them, any fourth reconstruction node in the fourth reconstructed subgraph includes one or more of the fourth nodes and the directed edges between them.
  • the fourth reconstructed subgraph is also obtained based on the reconstruction principle described in FIG. 3.
  • the two reconstruction nodes may belong to the same reconstructed subgraph or to different reconstructed subgraphs. If there is a directed edge between a node in one reconstruction node and a node in the other reconstruction node, that directed edge can be converted into a directed edge between the two reconstruction nodes themselves, and the directed edge between the two reconstruction nodes can be expressed as a dependency relationship of one of them, so that the order between two reconstruction nodes in the same stream can be controlled. Accordingly, when the first target reconstruction node and the fourth target reconstruction node are in the same stream and there is a directed edge between a first node in the first target reconstruction node and the third target node in the fourth target reconstruction node, that directed edge can be converted into a directed edge between the first target reconstruction node and the fourth target reconstruction node, which is then expressed as the dependency relationship of the first target reconstruction node and compiled into its compiled data, thereby controlling the order between the first target reconstruction node and the fourth target reconstruction node.
  • the first target reconstruction node and the fourth target reconstruction node may be reconstruction nodes in the same reconstruction subgraph, or may be reconstruction nodes in different reconstruction subgraphs.
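The same-stream conversion described above can be sketched as follows. This is an illustrative sketch only; the function and variable names (`lift_same_stream_edges`, `node_to_recon`) are assumptions for illustration, not identifiers from this application.

```python
# Hypothetical sketch: lifting node-level directed edges into dependency
# relationships between reconstruction nodes in the same stream. The
# dependency set of each downstream reconstruction node would be compiled
# into its compiled data.

def lift_same_stream_edges(node_edges, node_to_recon):
    """node_edges: iterable of (src_node, dst_node) directed edges.
    node_to_recon: maps each node to the reconstruction node containing it.
    Returns, per reconstruction node, the set of reconstruction nodes it
    depends on."""
    deps = {}
    for src, dst in node_edges:
        r_src, r_dst = node_to_recon[src], node_to_recon[dst]
        if r_src != r_dst:  # the edge crosses reconstruction-node boundaries
            deps.setdefault(r_dst, set()).add(r_src)
    return deps

# Example: nodes a, b in reconstruction node R1; node c in R2 (same stream).
node_to_recon = {"a": "R1", "b": "R1", "c": "R2"}
edges = [("a", "b"), ("b", "c")]
print(lift_same_stream_edges(edges, node_to_recon))  # {'R2': {'R1'}}
```

Scheduling R2 after R1 then follows directly from the lifted dependency, without inspecting the node-level edges again.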
• the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fourth target node is a first node among the plurality of first nodes other than the M first nodes; or, the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstructed subgraph, where the fifth reconstructed subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
• the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes the first dependency relationship and the second dependency relationship of the first sending operator node. The first dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first target reconstruction node; the second dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first receiving operator node; and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
• the method of obtaining the fifth reconstructed subgraph in this application is the same as the method of obtaining the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph, and it comprises at least one first reconstruction node and the directed edges between the at least one first reconstruction node; because the first subgraph comprises a plurality of first nodes and the directed edges between the plurality of first nodes, the first target reconstruction node in the first reconstructed subgraph includes at least one of the plurality of first nodes in the first subgraph and the directed edges between those first nodes. Similarly, the fifth reconstructed subgraph is obtained by reconstructing the fifth subgraph, and it includes at least one fifth reconstruction node and the directed edges between the at least one fifth reconstruction node; because the fifth subgraph includes a plurality of fifth nodes and the directed edges between the plurality of fifth nodes, any fifth reconstruction node in the fifth reconstructed subgraph includes one or more of the plurality of fifth nodes and the directed edges between them.
• the fifth reconstructed subgraph is also obtained based on the reconstruction principle described in FIG. 3.
• the two reconstruction nodes can be reconstruction nodes in the same reconstructed subgraph, or reconstruction nodes in different reconstructed subgraphs. If there is a directed edge between a node in one reconstruction node and a node in another reconstruction node, that directed edge can be converted into a directed edge between the one reconstruction node and a sending operator node, a directed edge between the sending operator node and a receiving operator node, and a directed edge between the receiving operator node and the other reconstruction node. The directed edge between the one reconstruction node and the sending operator node, and the directed edge between the sending operator node and the receiving operator node, can be expressed through the dependency relationships of the sending operator node; the directed edge between the sending operator node and the receiving operator node, and the directed edge between the receiving operator node and the other reconstruction node, can be expressed through the dependency relationships of the receiving operator node, so that the order between two reconstruction nodes in different streams can be controlled.
• the first target reconstruction node and the fifth target reconstruction node are in different streams, and there is a directed edge between the first node in the first target reconstruction node and the fourth target node in the fifth target reconstruction node. This directed edge can be converted as follows: the directed edge between the first sending operator node and the first target reconstruction node is expressed as the first dependency relationship of the first sending operator node, the directed edge between the first sending operator node and the first receiving operator node is expressed as the second dependency relationship of the first sending operator node, and both are compiled into the compiled data of the first target reconstruction node; the directed edge between the first receiving operator node and the fifth target reconstruction node is expressed as the first dependency relationship of the first receiving operator node, the directed edge between the first sending operator node and the first receiving operator node is expressed as the second dependency relationship of the first receiving operator node, and both are compiled into the compiled data of the fifth target reconstruction node. In this way, the order between the first target reconstruction node and the fifth target reconstruction node is controlled.
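The cross-stream conversion described above can be sketched roughly as follows. The names (`insert_send_recv`, the tuple encoding of operator nodes) are assumptions for illustration, not the application's implementation.

```python
# Hypothetical sketch: replacing a directed edge that crosses two streams
# with a send/receive operator-node pair, each carrying two dependency
# relationships as described in the text.

def insert_send_recv(src_recon, dst_recon, stream_of):
    """Replace the cross-stream directed edge src_recon -> dst_recon with
    a send/receive operator-node pair; return the dependency records that
    would be compiled into each side's compiled data."""
    assert stream_of[src_recon] != stream_of[dst_recon], "edge must cross streams"
    send = ("send", src_recon, dst_recon)
    recv = ("recv", src_recon, dst_recon)
    # Send node: first dependency is its edge to the source reconstruction
    # node, second dependency is the send -> recv edge.
    src_side = [(send, src_recon), (send, recv)]
    # Receive node: first dependency is its edge to the destination
    # reconstruction node, second dependency is the send -> recv edge.
    dst_side = [(recv, dst_recon), (send, recv)]
    return src_side, dst_side

src_deps, dst_deps = insert_send_recv("R1", "R5", {"R1": "s0", "R5": "s1"})
print(len(src_deps), len(dst_deps))  # 2 2
```

The send-side records go into the compiled data of the reconstruction node in one stream and the receive-side records into the other, which is how ordering across streams is enforced at execution time.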
• the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fifth target node is a first node among the plurality of first nodes other than the M first nodes; or, the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstructed subgraph, where the sixth reconstructed subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes the first dependency relationship and the second dependency relationship of the second receiving operator node. The first dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the first target reconstruction node; the second dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the second sending operator node; and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
• the method of obtaining the sixth reconstructed subgraph in this application is the same as the method of obtaining the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph, and it comprises at least one first reconstruction node and the directed edges between the at least one first reconstruction node; because the first subgraph comprises a plurality of first nodes and the directed edges between the plurality of first nodes, the first target reconstruction node in the first reconstructed subgraph includes at least one of the plurality of first nodes in the first subgraph and the directed edges between those first nodes. Similarly, the sixth reconstructed subgraph is obtained by reconstructing the sixth subgraph, and it includes at least one sixth reconstruction node and the directed edges between the at least one sixth reconstruction node; because the sixth subgraph includes a plurality of sixth nodes and the directed edges between the plurality of sixth nodes, any sixth reconstruction node in the sixth reconstructed subgraph includes one or more of the plurality of sixth nodes and the directed edges between them.
• the sixth reconstructed subgraph is also obtained based on the reconstruction principle described in FIG. 3.
• the two reconstruction nodes can be reconstruction nodes in the same reconstructed subgraph, or reconstruction nodes in different reconstructed subgraphs. If there is a directed edge between a node in one reconstruction node and a node in another reconstruction node, that directed edge can be converted into a directed edge between the one reconstruction node and a sending operator node, a directed edge between the sending operator node and a receiving operator node, and a directed edge between the receiving operator node and the other reconstruction node. The directed edge between the one reconstruction node and the sending operator node, and the directed edge between the sending operator node and the receiving operator node, can be expressed through the dependency relationships of the sending operator node; the directed edge between the sending operator node and the receiving operator node, and the directed edge between the receiving operator node and the other reconstruction node, can be expressed through the dependency relationships of the receiving operator node, so that the order between two reconstruction nodes in different streams can be controlled.
• the first target reconstruction node and the sixth target reconstruction node are in different streams, and there is a directed edge between the first node in the first target reconstruction node and the fifth target node in the sixth target reconstruction node. This directed edge can be converted as follows: the directed edge between the second sending operator node and the sixth target reconstruction node is expressed as the first dependency relationship of the second sending operator node, the directed edge between the second sending operator node and the second receiving operator node is expressed as the second dependency relationship of the second sending operator node, and both are compiled into the compiled data of the sixth target reconstruction node; the directed edge between the second receiving operator node and the first target reconstruction node is expressed as the first dependency relationship of the second receiving operator node, the directed edge between the second receiving operator node and the second sending operator node is expressed as the second dependency relationship of the second receiving operator node, and both are compiled into the compiled data of the first target reconstruction node. In this way, the order between the first target reconstruction node and the sixth target reconstruction node is controlled.
  • FIG. 27 is a schematic flowchart of a subgraph execution method provided by the embodiment of the present application.
• the subgraph execution method is applied to the graph execution device 200; the subgraph execution method includes but is not limited to the following operations or steps:
• Step 2701: Obtain compiled data of the first target reconstruction node in the first reconstructed subgraph. The first reconstructed subgraph is obtained by reconstructing the first subgraph; the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes, and the first subgraph is any one of a plurality of subgraphs obtained by segmenting the computation graph. The first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node; the first target reconstruction node is any first reconstruction node in the at least one first reconstruction node, and it includes M first nodes among the plurality of first nodes and the directed edges between the M first nodes, where M is a positive integer.
• Step 2702: Execute the compiled data of the first target reconstruction node.
• the computation graph is divided into multiple subgraphs, any subgraph among the multiple subgraphs is reconstructed to obtain a reconstructed subgraph, and the reconstruction node in the reconstructed subgraph is then used as the scheduling unit, so that any reconstruction node in the reconstructed subgraph can be compiled as a whole.
• the reconstructed subgraph obtained in this application includes at least one reconstruction node and the directed edges between the at least one reconstruction node, so the reconstructed subgraph is essentially a subgraph; and, since any reconstruction node in the reconstructed subgraph includes one or more nodes of the original subgraph and the directed edges between those nodes, the reconstruction node is essentially a subgraph of smaller size.
• using the reconstruction node as the scheduling unit is essentially a graph scheduling mechanism: scheduling one reconstruction node for compilation at a time is equivalent to scheduling one or more nodes of the subgraph for compilation, so the scheduling time during subgraph compilation can be reduced.
• the compiled data of any reconstruction node, obtained by compiling that reconstruction node, is used to execute one or more nodes of the subgraph. When the subgraph is executed, scheduling the compiled data of one reconstruction node at a time executes the compiled data of one or more nodes of the subgraph at a time; that is, scheduling one reconstruction node is equivalent to scheduling one or more nodes of the subgraph for execution, so the scheduling time during subgraph execution can likewise be reduced.
• the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes, and the first reconstructed subgraph obtained by reconstructing the first subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node. The first target reconstruction node includes M first nodes of the first subgraph and the directed edges between the M first nodes, where M is a positive integer; compiling the first target reconstruction node is therefore equivalent to compiling the M first nodes, and the compiled data of the first target reconstruction node is used to execute the M first nodes, that is, executing the first target reconstruction node is equivalent to executing the M first nodes.
• the embodiment of the present application thus provides a graph scheduling mechanism, which takes the reconstruction node of the graph structure as the scheduling unit and can reduce the scheduling time.
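As an illustration of why the graph scheduling mechanism reduces scheduling time, the following sketch counts scheduling operations with and without reconstruction nodes. All names (`count_schedules`, the example node lists) are hypothetical and for illustration only.

```python
# Illustrative comparison (not from the source): node-level scheduling
# issues one scheduling operation per node, while reconstruction-node
# scheduling issues one per group of M nodes, dividing the number of
# scheduling operations by roughly M.

def count_schedules(subgraph_nodes, recon_nodes=None):
    """Return how many scheduling operations are issued."""
    if recon_nodes is None:          # node-level scheduling
        return len(subgraph_nodes)   # one schedule per node
    return len(recon_nodes)          # one schedule per reconstruction node

nodes = [f"n{i}" for i in range(12)]
recons = [nodes[0:4], nodes[4:8], nodes[8:12]]  # three groups with M = 4
print(count_schedules(nodes))          # 12 scheduling operations
print(count_schedules(nodes, recons))  # 3 scheduling operations
```

The same ratio applies at both compilation and execution time, since each schedule of a reconstruction node compiles or executes all M of its nodes at once.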
• the compiled data of the first target reconstruction node is obtained by compiling the first thread-level subgraph, and the first thread-level subgraph is obtained by dividing each of the M first nodes into at least one first child node; the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, where N is a positive integer greater than or equal to M.
• any reconstruction node includes M first nodes and the directed edges between the M first nodes. When compiling any reconstruction node, each of the M first nodes is divided into at least one first child node; that is, the operator operation represented by each of the M first nodes is divided into at least one thread, and each of these threads is represented by one first child node. Dividing the M first nodes in this way yields N first child nodes, where N is a positive integer greater than or equal to M, and the N first child nodes together with the directed edges between them constitute the first thread-level subgraph. Compiling the first thread-level subgraph yields the compiled data of the first target reconstruction node, which includes the compiled data of the N first child nodes.
• since executing the N first child nodes is equivalent to executing the M first nodes, and the first child nodes divided from the same first node can be executed concurrently, when the compiled data of the N first child nodes is scheduled, the first child nodes divided from the same first node can run concurrently. This reduces the execution time of each of the M first nodes, thereby reducing the total execution time of the M first nodes and improving performance.
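The thread-level division described above can be sketched as follows. The splitting policy and names (`split_into_children`, `run_thread_level`, `threads_per_node`) are assumptions for illustration, and Python threads merely stand in for the concurrent execution of first child nodes on a computing device.

```python
# Minimal sketch (assumed names): each first node is divided into at
# least one first child node (one per thread); child nodes divided from
# the same node have no edges between them and can run concurrently.
from concurrent.futures import ThreadPoolExecutor

def split_into_children(nodes, threads_per_node):
    """Divide each of the M nodes into at least one child node; returns
    the N child nodes of the thread-level subgraph (N >= M)."""
    children = []
    for node in nodes:
        for t in range(threads_per_node.get(node, 1)):
            children.append((node, t))  # (original node, thread index)
    return children

def run_thread_level(children, work):
    # Push the child nodes to the executor; siblings run concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, children))

children = split_into_children(["matmul", "relu"], {"matmul": 4})
print(len(children))  # N = 5 child nodes obtained from M = 2 nodes
```

Here "matmul" is split into four child nodes that can execute in parallel, which is what shortens the execution time of that node.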
• the compiled data of the first target reconstruction node includes the compiled data of the M first nodes or the compiled data of the N first child nodes; executing the compiled data of the first target reconstruction node includes: writing the compiled data of R sixth objects into the cache, and, for each sixth object among the R sixth objects, performing the following operation: when the number of eighth objects that the seventh object depends on is 0, reading the compiled data of the seventh object from the cache and executing it, where the seventh object is any one of the R sixth objects. If the R sixth objects are the M first nodes, the eighth objects are nodes; if the R sixth objects are the N first child nodes, the eighth objects are child nodes.
• when the R sixth objects are the M first nodes, the eighth object is a node, which includes various types of nodes such as computing operator nodes, communication operator nodes, control operator nodes, and heterogeneous computing operator nodes; when the R sixth objects are the N first child nodes, the eighth object is a child node, that is, a child node obtained by dividing any of these types of nodes.
• for any node in the reconstruction node, when the number of nodes that the node depends on is 0, that node is executed; likewise, for any child node in the thread-level subgraph, when the number of child nodes that the child node depends on is 0, that child node is executed. Setting the timing at which any node or child node is executed in this way is beneficial to improving execution efficiency.
• the compiled data of the seventh object includes the initial number of eighth objects that the seventh object depends on; if this initial number is not 0, then after each ninth object is executed, the number of eighth objects that the seventh object depends on is decremented by 1, where a ninth object is an eighth object that the seventh object depends on.
• for any node, the timing at which it is pushed to the computing resource for execution is after the number of nodes it depends on reaches 0. In this application, the initial number of nodes that any node depends on is compiled into the compiled data of that node, so that during execution the timing of pushing the node to the computing resource can be determined from the number of nodes it depends on. If the initial number of nodes that the node depends on is 0, the node can be pushed to the computing resource for execution as soon as its compiled data is written into the cache; if the initial number is not 0, then after each node that it depends on is executed, the number is decremented by 1, and once it reaches 0 the node is pushed to the computing resource for execution.
• similarly, for any child node, the timing at which it is pushed to the computing resource for execution is after the number of child nodes it depends on reaches 0. In this application, the initial number of child nodes that any child node depends on is compiled into the compiled data of that child node, so that during execution the timing of pushing it to the computing resource can be determined from the number of child nodes it depends on. If the initial number of child nodes that the child node depends on is 0, the child node can be pushed to the computing resource for execution as soon as its compiled data is written into the cache; if the initial number is not 0, the child node must wait for the child nodes it depends on to be executed before its compiled data is pushed to the computing resource. In that case, after each child node that it depends on is executed, the number of child nodes it depends on is decremented by 1; once this number reaches 0, the child node is pushed to the computing resource for execution.
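The dependency-count mechanism described above can be sketched as follows. The names (`execute_by_dependency_count`, `successors`) are hypothetical; the sketch only illustrates the count-reaches-0 trigger and the decrement-by-1 rule, not the application's implementation.

```python
# Hypothetical sketch: each object's compiled data carries the initial
# number of objects it depends on; an object is pushed to a computing
# resource once that count reaches 0, and each completed dependency
# decrements the counts of its successors by 1.
from collections import deque

def execute_by_dependency_count(dep_count, successors, run):
    """dep_count: initial dependency count per object (from compiled data).
    successors: object -> objects that depend on it. run: executes one
    object. Returns the order in which objects were executed."""
    pending = dict(dep_count)
    ready = deque(obj for obj, n in pending.items() if n == 0)
    order = []
    while ready:
        obj = ready.popleft()
        run(obj)                      # read compiled data from cache, execute
        order.append(obj)
        for succ in successors.get(obj, ()):
            pending[succ] -= 1        # one dependency finished executing
            if pending[succ] == 0:
                ready.append(succ)    # now eligible for execution
    return order

order = execute_by_dependency_count(
    {"a": 0, "b": 1, "c": 2}, {"a": ["b", "c"], "b": ["c"]}, lambda o: None)
print(order)  # ['a', 'b', 'c']
```

The same loop applies whether the objects are nodes of a reconstruction node or child nodes of a thread-level subgraph.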
• when the initial number of eighth objects that the seventh object depends on is 0, the compiled data of the seventh object is written into the cache before execution begins; when the initial number of eighth objects that the seventh object depends on is not 0, the compiled data of the seventh object is written into the cache during the execution of the ninth objects.
• when the initial number of nodes that any node depends on is 0, that node does not need to wait for other nodes to finish executing; in this case, the compiled data of the node can be written into the cache before the node is executed, so as to ensure the smooth execution of the reconstruction node.
• when the initial number of nodes that any node depends on is not 0, that node must wait for the nodes it depends on to be executed before it can be executed; in this case, the compiled data of the node can be written into the cache while the nodes it depends on are being executed, so that once they have all been executed, the node can be executed immediately, ensuring the smooth execution of the reconstruction node. Moreover, because the compiled data of the node is not written into the cache too early, it does not occupy the cache for a long time, so the cache is utilized reasonably.
• when the initial number of child nodes that any child node in the thread-level subgraph depends on is 0, that child node does not need to wait for other child nodes to finish executing; in this case, the compiled data of the child node can be written into the cache before the child node is executed, so as to ensure the smooth execution of the thread-level subgraph.
• when the initial number of child nodes that any child node depends on is not 0, that child node must wait for the child nodes it depends on to be executed before it can be executed; in this case, the compiled data of the child node can be written into the cache while the child nodes it depends on are being executed, so that once they have all been executed, the child node can be executed immediately, ensuring the smooth execution of the thread-level subgraph. Moreover, because the compiled data of the child node is not written into the cache too early, it does not occupy the cache for a long time, so the cache is utilized reasonably.
  • FIG. 28 is a schematic structural diagram of an apparatus for compiling a subgraph provided by an embodiment of the present application.
  • the apparatus 2800 for compiling a subgraph is applied to the apparatus 100 for compiling a subgraph.
  • the apparatus 2800 for compiling a subgraph includes:
• An acquisition unit 2801, configured to acquire a first subgraph, where the first subgraph is any one of multiple subgraphs obtained by segmenting the computation graph, and the first subgraph includes multiple first nodes and directed edges between the multiple first nodes;
• a reconstruction unit 2802, configured to reconstruct the first subgraph to obtain a first reconstructed subgraph, where the first reconstructed subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node;
• a compiling unit 2803, configured to compile the first target reconstruction node to obtain compiled data of the first target reconstruction node, where the first target reconstruction node is any first reconstruction node in the at least one first reconstruction node, and the first target reconstruction node includes M first nodes among the plurality of first nodes and the directed edges between the M first nodes, M being a positive integer.
• the compiling unit 2803 is specifically configured to: divide each of the M first nodes into at least one first child node to obtain a first thread-level subgraph, where the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, N being a positive integer greater than or equal to M; and compile the first thread-level subgraph to obtain the compiled data of the first target reconstruction node.
• the compiled data of the first target reconstruction node includes compiled data of a first object, where the first object is one of the M first nodes or one of the N first child nodes; the compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
• if the first object is one of the N first child nodes, the compiled data of the first object further includes a second object, where the second object is one of the N first child nodes, and the second object and the first object are obtained by dividing the same first node.
• the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: writing target input data from the memory into the cache before the first object is executed, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node, or the target input data is not the output data of a first child node in the first thread-level subgraph; and deleting the target input data from the cache after the target input data has been input into a first node in the first target reconstruction node a first preset number of times, or has been input into a first child node in the first thread-level subgraph a first preset number of times.
• the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: writing the output data of the first object into the memory; deleting the output data of the first object from the cache after the output data has been written into the memory; and deleting the output data of the first object from the cache after the output data has been input into a first node in the first target reconstruction node a second preset number of times, or has been input into a first child node in the first thread-level subgraph a second preset number of times.
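The use-count-based cache management described above can be illustrated with a minimal sketch. The class and parameter names (`DataCache`, `preset_uses`) are assumptions for illustration, not identifiers from this application.

```python
# Illustrative sketch: input data is written from memory into the cache
# before execution, and deleted from the cache once it has been consumed
# a preset number of times, freeing cache space for other data.

class DataCache:
    def __init__(self):
        self.cache, self.uses_left = {}, {}

    def load_input(self, name, memory, preset_uses):
        # Write the target input data from memory into the cache
        # before the consuming object is executed.
        self.cache[name] = memory[name]
        self.uses_left[name] = preset_uses

    def consume(self, name):
        value = self.cache[name]
        self.uses_left[name] -= 1
        if self.uses_left[name] == 0:   # reached the preset number of inputs
            del self.cache[name]        # delete the data from the cache
        return value

memory = {"x": 3.0}
c = DataCache()
c.load_input("x", memory, preset_uses=2)
c.consume("x")
c.consume("x")
print("x" in c.cache)  # False: evicted after the preset number of uses
```

The same counting idea can apply to output data, with the count set to the number of downstream nodes or child nodes that consume it.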
• the dependency relationship of the first object includes a first dependency relationship of the first object, which is used to indicate a directed edge between the first object and a third object. If the first object is one of the M first nodes, the third object is another one of the M first nodes; if the first object is one of the N first child nodes, the third object is another one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
• in a case where the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, and the fourth object is executed by a second computing device, the dependency relationship of the first object further includes a second dependency relationship of the first object, which is used to represent the directed edge between the first object and a first communication operator node; the output data of the first object is transmitted from the first computing device to the second computing device by the first communication operator node. If the first object is one of the M first nodes, the fourth object is the first target node in a second target reconstruction node, where the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the first target node is a first node among the plurality of first nodes other than the M first nodes; or the second target reconstruction node is any second reconstruction node in a second reconstructed subgraph, where the second reconstructed subgraph is obtained by performing the reconstruction on a second subgraph, the second subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph. If the first object is one of the N first child nodes, the fourth object is any second child node in a second thread-level subgraph, where the second thread-level subgraph is obtained by dividing each node in the second target reconstruction node into at least one second child node.
• in a case where the input data of the first object includes the output data of a fifth object, the first object is executed by the first computing device, and the fifth object is executed by a third computing device, the dependency relationship of the first object further includes a third dependency relationship of the first object, which is used to represent the directed edge between the first object and a second communication operator node; the output data of the fifth object is transmitted from the third computing device to the first computing device by the second communication operator node. If the first object is one of the M first nodes, the fifth object is the second target node in a third target reconstruction node, where the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node among the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in a third reconstructed subgraph, where the third reconstructed subgraph is obtained by performing the reconstruction on a third subgraph, the third subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph. If the first object is one of the N first child nodes, the fifth object is a third child node in a third thread-level subgraph, where the third thread-level subgraph is obtained by dividing each node in the third target reconstruction node into at least one third child node.
• the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the third target node is a first node among the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in the fourth reconstructed subgraph, where the fourth reconstructed subgraph is obtained by performing the reconstruction on the fourth subgraph, the fourth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the third target node is a fourth node in the fourth subgraph.
• the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node also includes the dependency relationship of the first target reconstruction node, which is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
• the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fourth target node is a first node among the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstructed subgraph, where the fifth reconstructed subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
  • the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of a first sending operator node; the first dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first target reconstruction node, the second dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and a first receiving operator node, and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
  • the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fifth target node is a first node among the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph, the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
  • the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of a second receiving operator node; the first dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and a second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
  • each unit of the subgraph compiling device 2800 can also refer to the corresponding description of the method embodiment shown in FIG. 26, and for the beneficial effects brought by the subgraph compiling device 2800, reference can likewise be made to the corresponding description of the method embodiment shown in FIG. 26, which will not be repeated here.
  • FIG. 29 is a schematic structural diagram of a subgraph execution device provided by an embodiment of the present application.
  • the subgraph execution device 2900 is applied to the graph execution device 200.
  • the subgraph execution device 2900 includes:
  • the obtaining unit 2901 is configured to obtain compiled data of a first target reconstruction node in a first reconstruction subgraph, where the first reconstruction subgraph is obtained by reconstructing the first subgraph, the first subgraph includes a plurality of first nodes and directed edges between the plurality of first nodes, and the first subgraph is any one of a plurality of subgraphs obtained by cutting the computation graph; the first reconstruction subgraph includes at least one first reconstruction node and directed edges between the at least one first reconstruction node, the first target reconstruction node is any one of the at least one first reconstruction node, and the first target reconstruction node includes M first nodes among the plurality of first nodes and directed edges between the M first nodes, where M is a positive integer;
  • the execution unit 2902 is configured to execute the compiled data of the first target reconstruction node.
  • the compiled data of the first target reconstruction node is obtained by compiling a first thread-level subgraph, the first thread-level subgraph is obtained by dividing each of the M first nodes into at least one first child node, and the first thread-level subgraph includes N first child nodes and directed edges between the N first child nodes, where N is a positive integer greater than or equal to M.
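The thread-level splitting described above (dividing each first node into first child nodes that can execute concurrently) can be illustrated with a minimal sketch. Here a node's work is modeled as a list of independent items; the chunking scheme, the function names, and the use of a thread pool are assumptions for illustration only, not the embodiment's actual mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_threads(node_work, num_threads):
    """Divide one node's work (here: a list of items) into child nodes.

    Each child node is an independent slice that one thread can execute,
    so the child nodes derived from the same node can run concurrently.
    """
    chunk = (len(node_work) + num_threads - 1) // num_threads
    return [node_work[i:i + chunk] for i in range(0, len(node_work), chunk)]

def run_node_parallel(node_work, num_threads, op):
    """Execute one node by running its child nodes in parallel threads."""
    children = split_into_threads(node_work, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map preserves child order, so the merged output matches
        # what sequential execution of the original node would produce.
        results = list(pool.map(lambda c: [op(x) for x in c], children))
    return [y for part in results for y in part]
```

For example, splitting 8 items across 4 threads yields four child nodes of two items each, and the merged result equals the sequential result.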
  • the compiled data of the first target reconstruction node includes compiled data of the M first nodes or compiled data of the N first child nodes; the execution unit 2902 is specifically configured to: write the compiled data of R sixth objects into the cache; and for each of the R sixth objects, perform the following operations to execute the compiled data of the first target reconstruction node: when the number of eighth objects that a seventh object depends on is 0, read the compiled data of the seventh object from the cache and execute the seventh object according to its compiled data, the seventh object being any one of the R sixth objects; where, if the R sixth objects are the M first nodes, an eighth object is a node; if the R sixth objects are the N first child nodes, an eighth object is a child node.
  • the compiled data of the seventh object includes the initial number of eighth objects that the seventh object depends on; if this initial number is not 0, the number of eighth objects that the seventh object depends on is decremented by 1 each time a ninth object finishes executing, a ninth object being an eighth object that the seventh object depends on.
  • when the initial number of eighth objects that the seventh object depends on is 0, the compiled data of the seventh object is written into the cache before execution; when the initial number is not 0, the compiled data of the seventh object is written into the cache during the execution of the ninth objects.
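The counter-based execution described in the preceding points (each object carries the initial number of objects it depends on, the count is decremented as each predecessor finishes, and the object runs once its count reaches zero) can be sketched as follows. The data structures and names are illustrative, not the embodiment's actual compiled format.

```python
from collections import deque

def execute_by_dependency_count(objects, deps, run):
    """Execute objects once their dependency counters reach zero.

    objects: iterable of object ids (nodes or child nodes).
    deps: dict mapping each object to the objects it depends on.
    run: callback invoked when an object becomes executable.
    Returns the execution order.
    """
    remaining = {o: len(deps.get(o, [])) for o in objects}
    dependents = {o: [] for o in objects}
    for o, preds in deps.items():
        for p in preds:
            dependents[p].append(o)
    # Objects with an initial count of 0 are immediately executable.
    ready = deque(o for o in objects if remaining[o] == 0)
    order = []
    while ready:
        o = ready.popleft()
        run(o)            # compiled data would be read from the cache here
        order.append(o)
        for d in dependents[o]:
            remaining[d] -= 1   # each finished predecessor decrements the count
            if remaining[d] == 0:
                ready.append(d)
    return order
```

With objects a, b, c where c depends on a and b, the scheduler runs a and b first and releases c only after both counters have been decremented.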
  • the compiled data of the first target reconstruction node includes compiled data of a first object, the first object being one of the M first nodes or one of the N first child nodes; the compiled data of the first object includes at least one of the following: the storage location, life cycle, and cache management operation indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; the type of computing device used to execute the first object.
  • the first object is one of the N first child nodes, and the compiled data of the first object further includes: a second object executed in parallel with the first object, the second object being one of the N first child nodes, where the second object and the first object are obtained by dividing the same first node.
  • the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: writing target input data from memory into the cache before executing the first object, where the input data of the first object includes the target input data, and the target input data is not the output data of a first node in the first target reconstruction node, or not the output data of a first child node in the first thread-level subgraph; deleting the target input data from the cache after the target input data has been input a first preset number of times to first nodes in the first target reconstruction node, or a first preset number of times to first child nodes in the first thread-level subgraph.
  • the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: writing the output data of the first object into memory; deleting the output data of the first object from the cache after it has been written into memory; deleting the output data of the first object from the cache after it has been input a second preset number of times to first nodes in the first target reconstruction node, or a second preset number of times to first child nodes in the first thread-level subgraph.
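A toy model of the cache management operations indicated above: input data coming from outside the reconstruction node is prefetched into the cache before use and evicted after a preset number of consumptions, and output data is written back to memory so it can be used elsewhere. The class, its method names, and the dict-based memory model are invented for illustration only.

```python
class CacheManager:
    """Toy model of the cache management operation indications.

    Memory and cache are plain dicts; uses_left tracks the preset
    number of remaining consumptions before a cache entry is deleted.
    """
    def __init__(self, memory):
        self.memory = memory          # backing store
        self.cache = {}               # fast local cache
        self.uses_left = {}           # remaining consumptions before eviction

    def prefetch(self, key, preset_uses):
        # Write target input data from memory into the cache before execution.
        self.cache[key] = self.memory[key]
        self.uses_left[key] = preset_uses

    def consume(self, key):
        # Read a cached value; after the preset number of uses, evict it.
        value = self.cache[key]
        self.uses_left[key] -= 1
        if self.uses_left[key] == 0:  # last consumer: free the cache entry
            del self.cache[key]
            del self.uses_left[key]
        return value

    def write_back(self, key, value):
        # Output data is kept in the cache and also written into memory.
        self.cache[key] = value
        self.memory[key] = value
```

After a prefetch with a preset count of 2, the entry survives the first consumption and is deleted after the second, freeing cache space exactly when no further node needs it.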
  • the dependency relationship of the first object includes a first dependency relationship of the first object, which is used to indicate a directed edge between the first object and a third object; where, if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
  • the output data of the first object is the input data of a fourth object, the first object is executed by a first computing device, the fourth object is executed by a second computing device, the dependency relationship of the first object also includes a second dependency relationship of the first object, which is used to represent the directed edge between the first object and a first communication operator node, and the output data of the first object is transmitted by the first communication operator node from the first computing device to the second computing device; where, if the first object is one of the M first nodes, the fourth object is a first target node in a second target reconstruction node, the second target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the first target node is a first node among the plurality of first nodes other than the M first nodes; or the second target reconstruction node is any second reconstruction node in the second reconstruction subgraph, the second reconstruction subgraph is obtained by performing the reconstruction on the second subgraph, the second subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph; if the first object is one of the N first child nodes, the fourth object is any second child node in the second thread-level subgraph, and the second thread-level subgraph is obtained by dividing each first target node in the second target reconstruction node into at least one second child node.
  • the input data of the first object includes output data of a fifth object, the first object is executed by the first computing device, the fifth object is executed by a third computing device, the dependency relationship of the first object also includes a third dependency relationship of the first object, which is used to represent the directed edge between the first object and a second communication operator node, and the output data of the fifth object is transmitted by the second communication operator node from the third computing device to the first computing device; where, if the first object is one of the M first nodes, the fifth object is a second target node in a third target reconstruction node, the third target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node among the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in the third reconstruction subgraph, the third reconstruction subgraph is obtained by performing the reconstruction on the third subgraph, the third subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph; if the first object is one of the N first child nodes, the fifth object is a third child node in the third thread-level subgraph, and the third thread-level subgraph is obtained by dividing each second target node in the third target reconstruction node into at least one third child node.
  • the fourth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the third target node is a first node among the plurality of first nodes other than the M first nodes; or the fourth target reconstruction node is any fourth reconstruction node in the fourth reconstruction subgraph, the fourth reconstruction subgraph is obtained by performing the reconstruction on the fourth subgraph, the fourth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the third target node is a fourth node in the fourth subgraph
  • the first target reconstruction node and the fourth target reconstruction node are in the same stream, and the compiled data of the first target reconstruction node also includes the dependency relationship of the first target reconstruction node, which is used to represent the directed edge between the first target reconstruction node and the fourth target reconstruction node.
  • the fifth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fourth target node is a first node among the plurality of first nodes other than the M first nodes; or the fifth target reconstruction node is any fifth reconstruction node in the fifth reconstruction subgraph, the fifth reconstruction subgraph is obtained by performing the reconstruction on the fifth subgraph, the fifth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fourth target node is a fifth node in the fifth subgraph.
  • the first target reconstruction node and the fifth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of a first sending operator node; the first dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and the first target reconstruction node, the second dependency relationship of the first sending operator node is used to represent the directed edge between the first sending operator node and a first receiving operator node, and there is a directed edge between the first receiving operator node and the fifth target reconstruction node.
  • the sixth target reconstruction node is a first reconstruction node in the at least one first reconstruction node other than the first target reconstruction node, and the fifth target node is a first node among the plurality of first nodes other than the M first nodes; or the sixth target reconstruction node is any sixth reconstruction node in the sixth reconstruction subgraph, the sixth reconstruction subgraph is obtained by performing the reconstruction on the sixth subgraph, the sixth subgraph is a subgraph in the plurality of subgraphs other than the first subgraph, and the fifth target node is a sixth node in the sixth subgraph.
  • the first target reconstruction node and the sixth target reconstruction node are in different streams, and the compiled data of the first target reconstruction node also includes a first dependency relationship and a second dependency relationship of a second receiving operator node; the first dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and the first target reconstruction node, the second dependency relationship of the second receiving operator node is used to represent the directed edge between the second receiving operator node and a second sending operator node, and there is a directed edge between the second sending operator node and the sixth target reconstruction node.
  • each unit of the subgraph execution device 2900 can also refer to the corresponding description of the method embodiment shown in FIG. 27, and for the beneficial effects brought by the subgraph execution device 2900, reference can likewise be made to the corresponding description of the method embodiment shown in FIG. 27, which will not be repeated here.
  • FIG. 30 is a schematic structural diagram of another computer device provided by an embodiment of the present application.
  • the computer device may be the aforementioned graph compiling device 100 or graph execution device 200. The computer device 3000 includes: at least one CPU; memory, which may include SRAM and ROM; a microcontroller unit (Micro Controller Unit, MCU); a WLAN subsystem; a bus; a transmission interface; and the like.
  • the computer device 3000 may also include other dedicated processors such as an application processor (Application Processor, AP) and an NPU, as well as other subsystems such as a power management subsystem and a clock management subsystem.
  • the connectors include various interfaces, transmission lines, buses, and the like. These interfaces are usually electrical communication interfaces, but may also be mechanical interfaces or interfaces in other forms, which is not limited in this embodiment.
  • the CPU may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor; optionally, the CPU may be a processor group composed of multiple processors, where the multiple processors are coupled to each other through one or more buses.
  • the CPU implements any one of the methods in the foregoing method embodiments by calling program instructions stored in on-chip or off-chip memory. Alternatively, the CPU and the MCU jointly implement any one of those methods, for example with the CPU completing some steps of the method and the MCU completing the other steps. Alternatively, the AP or another special-purpose processor implements any one of the methods in the foregoing method embodiments by calling program instructions stored in on-chip or off-chip memory.
  • the transmission interface may be an interface for receiving and sending data of the processor chip, and usually includes multiple interfaces. The transmission interface may include an Inter-Integrated Circuit (I2C) interface, a Serial Peripheral Interface (SPI), a Universal Asynchronous Receiver-Transmitter (UART) interface, a General-Purpose Input/Output (GPIO) interface, and the like. It should be understood that these interfaces may implement different functions by multiplexing the same physical interface.
  • the transmission interface may also include a High Definition Multimedia Interface (HDMI), a V-By-One interface, an Embedded Display Port (EDP), a Mobile Industry Processor Interface (MIPI), a Display Port (DP), or the like.
  • in an optional case, the above parts are integrated on the same chip; in another optional case, the memory may be an independent chip.
  • a WLAN subsystem may include, for example, radio frequency circuits and a baseband.
  • the chip involved in the embodiments of this application is a system manufactured on the same semiconductor substrate by an integrated circuit process, also called a semiconductor chip; it is a collection of integrated circuits formed on the substrate (usually a semiconductor material such as silicon) by an integrated circuit process. The integrated circuit may include various functional devices; each type of functional device includes transistors such as logic gate circuits, Metal-Oxide-Semiconductor (MOS) transistors, bipolar transistors, or diodes, and may also include components such as capacitors, resistors, or inductors.
  • Each functional device can work independently or under the action of necessary driver software, and can realize various functions such as communication, calculation, or storage.
  • the embodiments of the present application also provide a computer-readable storage medium, which includes a computer program; when the computer program is run on a computer or a processor, the computer or the processor is caused to perform the method described in FIG. 26 or FIG. 27 above.
  • the embodiments of the present application also provide a computer program product, which includes a computer program; when the computer program is run on a computer or a processor, the computer or the processor performs the method described in FIG. 26 or FIG. 27 above.
  • the embodiment of the present application also provides a chip, including: a processor, configured to call and run a computer program from a memory, so that a device equipped with the above-mentioned chip executes the method described in FIG. 26 or FIG. 27 .
  • the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the above units is only a division by logical function; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the above functions are realized in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the modules in the device of the embodiment of the present application can be combined, divided and deleted according to actual needs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a subgraph compiling and execution method in the field of artificial intelligence, and related devices. The method includes: obtaining a first subgraph, the first subgraph being any one of a plurality of subgraphs obtained by cutting a computation graph, the first subgraph including a plurality of first nodes and directed edges between the plurality of first nodes; reconstructing the first subgraph to obtain a first reconstruction subgraph, the first reconstruction subgraph including at least one first reconstruction node and directed edges between the at least one first reconstruction node; and compiling a first target reconstruction node to obtain compiled data of the first target reconstruction node, the first target reconstruction node being any one of the at least one first reconstruction node and including M first nodes among the plurality of first nodes and directed edges between the M first nodes, where M is a positive integer. The embodiments of the present application can reduce scheduling time.

Description

Subgraph compiling and execution method and related devices. Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a subgraph compiling and execution method and related devices.
Background
A neural network model must be compiled in software before it can be executed on hardware, in order to realize functions such as training and inference with the model. As shown in Figure 1, taking a target detection network model as an example, after the steps of model parsing, model compilation, model optimization, and model execution, it can accurately recognize an input image of a cat.
With the continuous development of neural network technology, many systems for processing deep neural networks (DNN) have appeared, such as accelerated linear algebra (XLA) and TensorRT, which support hardware-accelerated processing of various neural network models. XLA is a compiler that accelerates the execution of TensorFlow programs: it compiles the computation graph of a neural network model into binary execution code (computation kernels) specific to that model's operators, and uses information about the current model to make targeted optimizations during compilation. TensorRT is an inference engine applied to Nvidia GPUs, divided overall into three steps: model parsing (parser), engine optimization, and execution. In addition, some stream-based techniques cut the computation graph of a neural network model into multiple subgraphs executed in multiple streams; nodes within a stream execute sequentially, and only nodes in different streams can execute concurrently, so the execution concurrency of the computation graph is low (a node in the computation graph represents an operator or operation of the neural network model). Moreover, when the computation graph executes in the stream-based manner, the scheduling unit is a node/task, so the scheduling time is long, where a task is an executable file compiled for a node and used to execute the operator that the node represents.
Summary
Embodiments of the present application provide a subgraph compiling and execution method and related devices, which can reduce scheduling time.
In a first aspect, an embodiment of the present application provides a subgraph compiling method, including: obtaining a first subgraph, the first subgraph being any one of a plurality of subgraphs obtained by cutting a computation graph, the first subgraph including a plurality of first nodes and directed edges between the plurality of first nodes; reconstructing the first subgraph to obtain a first reconstruction subgraph, the first reconstruction subgraph including at least one first reconstruction node and directed edges between the at least one first reconstruction node; and compiling a first target reconstruction node to obtain compiled data of the first target reconstruction node, the first target reconstruction node being any one of the at least one first reconstruction node, and the first target reconstruction node including M first nodes among the plurality of first nodes and directed edges between the M first nodes, where M is a positive integer. The compiled data of the first target reconstruction node includes compiled data of the M first nodes. It should be understood that the directed edges in a computation graph include data edges and control edges, and the directed edges described in this application likewise include both; for example, the directed edges between the plurality of first nodes include the data edges and control edges between the plurality of first nodes.
In the embodiments of the present application, the computation graph is cut into multiple subgraphs, any one of the subgraphs is reconstructed to obtain a reconstruction subgraph, and compilation of any reconstruction node in that reconstruction subgraph is scheduled with the reconstruction node as the scheduling unit. It should be noted that the reconstruction subgraph obtained in this application includes at least one reconstruction node and the directed edges between them, so it is still essentially a subgraph; and since any reconstruction node includes one or more nodes of the subgraph and the directed edges between those nodes, a reconstruction node is essentially a subgraph smaller in scale than the original subgraph. During subgraph compilation, using reconstruction nodes as the scheduling unit is essentially a graph scheduling mechanism, and scheduling one reconstruction node for compilation equals scheduling one or more nodes of the subgraph at once; compared with using individual nodes (for example, compute nodes) of the subgraph as the scheduling unit, which schedules one node at a time, this reduces scheduling time. Further, the compiled data obtained by compiling a reconstruction node is used to execute the one or more nodes of the subgraph; during subgraph execution, since a reconstruction node includes multiple nodes, scheduling one reconstruction node is equivalent to scheduling multiple nodes at once, so compared with scheduling individual nodes of the subgraph, this application can significantly reduce scheduling time.
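The reconstruction described above can be sketched as follows: the nodes of a subgraph are grouped into reconstruction nodes so that one dispatch schedules a whole group, and the edges between groups are derived from the original node-level edges. This is a minimal illustration; the grouping heuristic used here (chaining nodes in topological order, capped at a chosen group size) is an assumption, since the embodiment does not prescribe a particular grouping strategy.

```python
from collections import defaultdict, deque

def reconstruct(nodes, edges, max_group_size):
    """Group the nodes of a subgraph into reconstruction nodes.

    nodes: list of node ids; edges: list of (src, dst) directed edges.
    Returns the groups (each group is one reconstruction node) and the
    directed edges between groups, derived from the original edges.
    """
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for s, d in edges:
        succ[s].append(d)
        indeg[d] += 1
    # Topological order via Kahn's algorithm.
    order, q = [], deque(n for n in nodes if indeg[n] == 0)
    while q:
        n = q.popleft()
        order.append(n)
        for d in succ[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                q.append(d)
    # Chain consecutive nodes into groups of at most max_group_size;
    # each group is scheduled with a single dispatch.
    groups = [order[i:i + max_group_size]
              for i in range(0, len(order), max_group_size)]
    owner = {n: gi for gi, g in enumerate(groups) for n in g}
    group_edges = {(owner[s], owner[d]) for s, d in edges
                   if owner[s] != owner[d]}
    return groups, sorted(group_edges)
```

With four chained nodes and a group size of two, four node-level dispatches collapse into two group-level dispatches, which is the source of the scheduling-time reduction claimed above.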
In a possible implementation, compiling the first target reconstruction node to obtain its compiled data includes: dividing each of the M first nodes into at least one first child node to obtain a first thread-level subgraph, the first thread-level subgraph including N first child nodes and directed edges between the N first child nodes, where N is a positive integer greater than or equal to M; and compiling the first thread-level subgraph to obtain the compiled data of the first target reconstruction node, which includes compiled data of the N first child nodes. It should be noted that a child node described in this application is obtained by thread-partitioning a node: one node is divided into at least one child node, each child node is one thread, and the at least one child node can execute in parallel.
In this implementation, when compiling the reconstruction node, the operator operation represented by each of the M first nodes is divided into at least one thread, each thread represented by a child node; thread-partitioning the M first nodes yields N first child nodes. Since executing the N first child nodes equals executing the M first nodes, and the child nodes obtained from the same first node can execute concurrently, when the compiled data of the N first child nodes is scheduled for execution, the child nodes derived from the same first node can run concurrently, reducing the execution time of each first node, and thus the total execution time of the M first nodes, improving performance.
In a possible implementation, the compiled data of the first target reconstruction node includes compiled data of a first object, the first object being one of the M first nodes or one of the N first child nodes, and the compiled data of the first object including at least one of the following: the storage location, life cycle, and cache manage operation (CMO) indication information of the input data of the first object; the storage location, life cycle, and cache management operation indication information of the output data of the first object; the dependency relationship of the first object; and the type of computing device used to execute the first object.
In this implementation, one or more of the following can be compiled into the compiled data of any node or child node: the storage location, life cycle, and cache management operation indication information of its input and output data; its dependency relationships; and the type of computing device that executes it. This facilitates the execution of that node or child node and improves performance.
In a possible implementation, the first object is one of the N first child nodes, and the compiled data of the first object further includes: a second object executed in parallel with the first object, the second object being one of the N first child nodes, where the second object and the first object are obtained by dividing the same first node.
In this implementation, information about the other child nodes executed in parallel with a child node can be compiled into that child node's compiled data, which facilitates parallel execution of that child node with the other child nodes.
In a possible implementation, the cache management operation indication information of the input data of the first object is used to indicate at least one of the following: writing target input data from memory into the cache before the first object is executed, where the input data of the first object includes the target input data, and the target input data is not output data of a first node in the first target reconstruction node, or not output data of a first child node in the first thread-level subgraph; and deleting the target input data from the cache after the target input data has been input a first preset number of times to first nodes in the first target reconstruction node, or a first preset number of times to first child nodes in the first thread-level subgraph.
In this implementation, if the input data of any node in a reconstruction node includes target input data coming from outside the reconstruction node, the cache management operation indication information of that node's input data must indicate that the target input data be written from memory into the cache before the node executes, so that the node can run; optionally, it may further indicate that, once the target input data is no longer needed as input to other nodes in the reconstruction node, the target input data be deleted from the cache, reasonably freeing cache space. The same applies to any child node of a thread-level subgraph: target input data coming from outside the thread-level subgraph is written from memory into the cache before the child node executes so that the child node can run, and is optionally deleted from the cache once no other child node in the thread-level subgraph needs it as input, reasonably freeing cache space.
In a possible implementation, the cache management operation indication information of the output data of the first object is used to indicate at least one of the following: writing the output data of the first object into memory; deleting the output data of the first object from the cache after it has been written into memory; and deleting the output data of the first object from the cache after it has been input a second preset number of times to first nodes in the first target reconstruction node, or a second preset number of times to first child nodes in the first thread-level subgraph.
In this implementation, the cache management operation indication information of the output data of any node in a reconstruction node may optionally indicate: writing the node's output data into memory so that the output data can be used elsewhere; and deleting the node's output data from the cache after it has been written into memory, or after it is no longer needed as input to other nodes in the reconstruction node, reasonably freeing cache space. The same applies to the output data of any child node of a thread-level subgraph: it is optionally written into memory for later use, and deleted from the cache after being written into memory or once no other child node in the thread-level subgraph needs it as input.
In a possible implementation, the dependency relationship of the first object includes: the number of nodes the first object depends on and the nodes that depend on the first object, or the number of child nodes the first object depends on and the child nodes that depend on the first object.
In this implementation, the dependency relationship of any node in a reconstruction node includes the number of nodes it depends on; during execution, a node can be executed once that number is decremented to 0. The dependency relationship also includes the other nodes that depend on the node; in this way, the execution order among the nodes of the reconstruction node can be controlled. Likewise, the dependency relationship of any child node in a thread-level subgraph includes the number of child nodes it depends on, the child node being executable once that number is decremented to 0, as well as the other child nodes that depend on it; in this way, the execution order among the child nodes of the thread-level subgraph can be controlled.
In a possible implementation, the dependency relationship of the first object includes a first dependency relationship of the first object, used to represent a directed edge between the first object and a third object; where, if the first object is one of the M first nodes, the third object is one of the M first nodes; if the first object is one of the N first child nodes, the third object is one of the N first child nodes, and the third object and the first object are obtained by dividing different first nodes.
In this implementation, a directed edge between nodes in a reconstruction node can be expressed at compile time as a dependency relationship of a node, for example as the node's first dependency relationship, thereby realizing the expression of directed edges between nodes in the reconstruction node. Likewise, a directed edge between child nodes of a thread-level subgraph can be expressed at compile time as a child node's dependency relationship, for example as the child node's first dependency relationship; here, the directed edges between child nodes of a thread-level subgraph mainly refer to edges between child nodes obtained by dividing different nodes.
In a possible implementation, the output data of the first object is input data of a fourth object, the first object is executed by a first computing device, the fourth object is executed by a second computing device, and the dependency relationship of the first object further includes a second dependency relationship of the first object, which is used to represent a directed edge between the first object and a first communication operator node; the output data of the first object is transmitted by the first communication operator node from the first computing device to the second computing device. If the first object is one of the M first nodes, the fourth object is a first target node in a second target reconstruction node, where the second target reconstruction node is a first reconstruction node among the at least one first reconstruction node other than the first target reconstruction node, and the first target node is a first node among the plurality of first nodes other than the M first nodes; or the second target reconstruction node is any second reconstruction node in a second reconstruction subgraph, where the second reconstruction subgraph is obtained by performing the reconstruction on a second subgraph, the second subgraph is a subgraph among the plurality of subgraphs other than the first subgraph, and the first target node is a second node in the second subgraph. If the first object is one of the N first child nodes, the fourth object is a second child node in a second thread-level subgraph, where the second thread-level subgraph is obtained by dividing each first target node in the second target reconstruction node into at least one second child node, and the second thread-level subgraph includes P second child nodes and directed edges between the P second child nodes, where P is a positive integer. It should be noted that this application uses communication operators, such as collective communication operators, to realize data communication among multiple computing devices (for example, multiple chips or multiple processors) and multiple servers. Data can be transmitted from the first computing device to the second computing device through a first collective communication operator, i.e., the first communication operator node is the node representing the first collective communication operator; a collective communication operator can be represented as a collective communication operator node, and, like the nodes in a computation graph, a subgraph, or a reconstruction node, a collective communication operator node can be thread-partitioned. It should further be noted that the second reconstruction subgraph is obtained in the same way as the first reconstruction subgraph.
In this implementation, multiple reconstruction nodes can be executed in a distributed manner. Taking two reconstruction nodes and two computing devices as an example, one reconstruction node is executed by one computing device and the other reconstruction node by the other computing device; the two reconstruction nodes may belong to the same reconstruction subgraph or to different reconstruction subgraphs. If the output data of a node in one reconstruction node is input data of a node in the other reconstruction node, that output data must be transmitted from the one computing device to the other; this application can use a collective communication operator node to perform the transmission. Since the collective communication operator node carries the output data of the node in the one reconstruction node to the other computing device as input data of the node in the other reconstruction node, there is a directed edge between the node in the one reconstruction node and the collective communication operator node, and a directed edge between the collective communication operator node and the node in the other reconstruction node. These directed edges can be expressed as dependency relationships of the respective nodes, for example as their second dependency relationships; in this way, based on the expression of the directed edges between the nodes in the reconstruction nodes and the collective communication operator node, the distributed execution of multiple reconstruction nodes can be expressed. The same holds for multiple thread-level subgraphs executed in a distributed manner: the two thread-level subgraphs may be obtained by thread-partitioning different reconstruction nodes of the same reconstruction subgraph, or by thread-partitioning reconstruction nodes of different reconstruction subgraphs; if the output data of a child node in one thread-level subgraph is input data of a child node in the other, a collective communication operator node transmits it between the computing devices, the resulting directed edges are expressed as dependency relationships of the respective child nodes, and the distributed execution of multiple thread-level subgraphs can thus be expressed.
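The insertion of a communication operator node on a cross-device edge, as described above, can be sketched as follows: an edge whose endpoints run on different computing devices is rewritten through an intermediate communication node, yielding exactly the two dependency relationships described (one from the producer to the communication node, one from the communication node to the consumer). The node naming scheme and the dict-based device mapping are illustrative assumptions.

```python
def insert_comm_nodes(edges, device_of):
    """Rewrite cross-device edges through a communication operator node.

    edges: list of (src, dst) pairs; device_of: node -> device id.
    An edge whose endpoints sit on different devices is replaced by
    src -> comm -> dst, mirroring the collective communication operator
    node that carries data from one computing device to the other.
    """
    new_edges, comm_nodes = [], []
    for s, d in edges:
        if device_of[s] == device_of[d]:
            new_edges.append((s, d))          # same device: edge unchanged
        else:
            comm = f"comm_{s}_to_{d}"         # illustrative name only
            comm_nodes.append(comm)
            new_edges.append((s, comm))       # producer -> communication node
            new_edges.append((comm, d))       # communication node -> consumer
    return new_edges, comm_nodes
```

For a graph a -> b -> c where a and b run on device 0 and c on device 1, only the b -> c edge is rewritten through a communication node.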
In a possible implementation, the input data of the first object includes output data of a fifth object, the first object is executed by the first computing device, the fifth object is executed by a third computing device, and the dependency relationship of the first object further includes a third dependency relationship of the first object, which is used to represent a directed edge between the first object and a second communication operator node; the output data of the fifth object is transmitted by the second communication operator node from the third computing device to the first computing device. If the first object is one of the M first nodes, the fifth object is a second target node in a third target reconstruction node, where the third target reconstruction node is a first reconstruction node among the at least one first reconstruction node other than the first target reconstruction node, and the second target node is a first node among the plurality of first nodes other than the M first nodes; or the third target reconstruction node is any third reconstruction node in a third reconstruction subgraph, where the third reconstruction subgraph is obtained by performing the reconstruction on a third subgraph, the third subgraph is a subgraph among the plurality of subgraphs other than the first subgraph, and the second target node is a third node in the third subgraph. If the first object is one of the N first child nodes, the fifth object is a third child node in a third thread-level subgraph, where the third thread-level subgraph is obtained by dividing each second target node in the third target reconstruction node into at least one third child node, and the third thread-level subgraph includes Q third child nodes and directed edges between the Q third child nodes, where Q is a positive integer. It should be noted that data is transmitted from the third computing device to the first computing device through a second collective communication operator, so the second communication operator node is the node representing the second collective communication operator. It should further be noted that the third reconstruction subgraph is obtained in the same way as the first reconstruction subgraph.
在本实现方式中,多个重构节点可以分布式执行,以两个重构节点、两个计算装置为例,其中一个重构节点由其中一个计算装置执行,另外一个重构节点由另一个计算装置执行,其中,这两个重构节点可以是同一重构子图中的重构节点,也可以是不同重构子图中的重构节点;如果其中一个重构节点内的节点的输出数据为另外一个重构节点内的节点的输入数据, 那么需要将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个重构节点内的节点的输入数据,故其中一个重构节点内的节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个重构节点内的节点存在有向边;那么,可以将其中一个重构节点内的节点与集合通信算子节点之间存在有向边表达为其中一个重构节点内的节点的依赖关系,以及将集合通信算子节点与另外一个重构节点内的节点存在有向边表达为另外一个重构节点内的节点的依赖关系,例如表达为其中一个重构节点内的节点第二依赖关系和另外一个重构节点内的节点的第二依赖关系;如此,基于重构节点内的节点与集合通信算子节点之间的有向边的表达,可以实现多个重构节点的分布式执行的表达。同理,多个线程级子图可以分布式执行,以两个线程级子图、两个计算装置为例,其中一个线程级子图由其中一个计算装置执行,另外一个线程级子图由另一个计算装置执行,其中,这两个线程级子图可以是分别对同一重构子图中的不同重构节点进行线程划分得到,也可以是分别对不同重构子图中的重构节点进行子线程划分得到;如果其中一个线程级子图内的子节点的输出数据为另外一个线程级子图内的子节点的输入数据,那么需要将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个线程级子图内的子节点的输入数据,故其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个线程级子图内的子节点存在有向边;那么,可以将其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边表达为其中一个线程级子图内的子节点的依赖关系,以及将集合通信算子节点与另外一个线程级子图内的子节点存在有向边表达为另外一个线程级子图内的子节点的依赖关系,例如表达为其中一个线程级子图内的子节点第二依赖关系和另外一个线程级子图内的子节点的第二依赖关系;如此,基于线程级子图内的子节点与集合通信算子节点之间的有向边的表达,可以实现多个线程级子图的分布式执行的表达。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。需要说明的是,本申请得到第四重构子图方式和得到第一重构子图的方式相同。
在本实现方式中,对于在同一流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与另外一个重构节点之间的有向边 来表达,而其中一个重构节点与另外一个重构节点之间的有向边可以通过其中一个重构节点的依赖关系来表达,从而可以控制同一流中的两个重构节点之间的顺序。例如,该第一目标重构节点和第四目标重构节点在同一流中,该第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间的有向边,转换成该第一目标重构节点与第四目标重构节点之间的有向边,再将该第一目标重构节点与第四目标重构节点之间的有向边表达为该第一目标重构节点的依赖关系,并且编译在该第一目标重构节点的编译数据中,从而控制该第一目标重构节点与第四目标重构节点之间的顺序。应理解,该第一目标重构节点和第四目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。需要说明的是,本申请得到第五重构子图方式和得到第一重构子图的方式相同。
在本实现方式中,对于在不同流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边和接收算子节点与另外一个重构节点之间的有向边来表达,而其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边可以通过发送算子节点的依赖关系来表达,以及发送算子节点与接收算子节点之间的有向边、接收算子节点与另外一个重构节点之间的有向边可以通过接收算子节点的依赖关系来表达,从而控制不同流中的两个重构节点之间的顺序。例如,该第一目标重构节点与第五目标重构节点在不同流中,该第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间的有向边,转换成第一发送算子节点与第一目标重构节点之间的有向边、第一发送算子节点与第一接收算子节点之间的有向边、第一接收算子节点与第五目标重构节点之间的有向边;再将第一发送算子节点与第一目标重构节点之间的有向边表达为第一发送算子节点的第一依赖关系,以及将第一发送算子节点与第一接收算子节点之间的有向边表达为第一发送算子节点的第二依赖关系,并且编译在该第一目标重构节点的编译数据中;而且可以将第一接收算子节点与第五目标重构节点之间的有向边可以表达为第一接收算子节点的第一依赖关系,以及将第一发送算子节点与第一接收算子节点之间的有向边表达为第一接收算子节点的第二依赖关系,并且将第一接收算子节点的第一依赖关系和第二依赖关系编译在第五目标重构节点的编译数据中,从而控制该第一目标重构节点与第五目标重构节 点之间的顺序。应理解,该第一目标重构节点和第五目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。需要说明的是,本申请得到第六重构子图方式和得到第一重构子图的方式相同。
在本实现方式中,对于在不同流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边和接收算子节点与另外一个重构节点之间的有向边来表达,而其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边可以通过发送算子节点的依赖关系来表达,以及发送算子节点与接收算子节点之间的有向边、接收算子节点与另外一个重构节点之间的有向边可以通过接收算子节点的依赖关系来表达,从而控制不同流中的两个重构节点之间的顺序。例如,该第一目标重构节点与第六目标重构节点在不同流中,该第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间的有向边,转换成第六目标重构节点与第二发送算子节点之间的有向边、第二发送算子节点与第二接收算子节点之间的有向边、第二接收算子节点与第一目标重构节点之间的有向边;再将第六目标重构节点与第二发送算子节点之间的有向边表达为第二发送算子节点的第一依赖关系,以及将第二发送算子节点与第二接收算子节点之间的有向边表达为第二发送算子节点的第二依赖关系,并且将第二发送算子节点的第一依赖关系和第二依赖关系编译在第六目标重构节点的编译数据中;而且可以将第二接收算子节点与该第一目标重构节点之间的有向边表达为第二接收算子节点的第一依赖关系,以及将第二发送算子节点与第二接收算子节点之间的有向边表达为第二接收算子节点的第二依赖关系,并且将第二接收算子节点的第一依赖关系和第二依赖关系编译在该第一目标重构节点的编译数据中,从而控制该第一目标重构节点与第六目标重构节点之间的顺序。应理解,该第一目标重构节点和第六目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
第二方面,本申请实施例提供了一种子图的执行方法,包括:获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到 的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数;执行所述第一目标重构节点的编译数据。
在本申请实施例中,将计算图切分为多个子图,然后对多个子图中的任意一个子图进行重构以得到重构子图,再以重构子图中的重构节点为调度单位实现对该重构子图中的任意一个重构节点进行编译。需要说明的是,本申请得到的重构子图包括至少一个重构节点和该至少一个重构节点之间的有向边,故本申请得到的重构子图实质上还是一张子图;并且,由于该重构子图中任意一个重构节点包括该任意一个子图中的一个或多个节点和一个或多个节点之间的有向边,故重构节点实质上是一张比该任意一个子图规模更小的子图。在子图编译时,以重构节点为调度单位实质上是一种图调度机制,且一次调度一个重构节点进行编译等于一次调度该任意一个子图中的一个或多个节点进行编译;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点,在子图编译时以重构节点为调度单位可以减少调度时间。进一步地,对该任意一个重构节点进行编译得到的该任意一个重构节点的编译数据用于执行该任意一个子图中的一个或多个节点,在子图执行时,一次调度一个重构节点的编译数据用于执行等于一次调度该任意一个子图中的一个或多个节点的编译数据进行执行,也即本申请以重构节点为调度单位,一次调度一个重构节点以执行该一个重构节点,等于一次调度该任意一个子图中的一个或多个节点以执行该一个或多个节点;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点以执行该一个节点,在子图执行时以重构节点为调度单位可以减少调度时间。例如,对于由计算图切分得到的多个子图中的第一子图而言,第一子图包括多个第一节点和多个第一节点之间的有向边,由第一子图重构得到的第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,其中的第一目标重构节点包括第一子图中的M个第一节点和M个第一节点之间的有向边,M为正整数;对该第一目标重构节点进行编译,等于对该M个第一节点进行编译,并且该第一目标重构节点的编译数据用于执行该M个第一节点,也即执行该第一目标重构节点等于执行该M个第一节点。综上,本申请实施例提供了一种图调度机制,以图结构的重构节点为调度单位,能够减少调度时间。
在一种可能的实现方式中,所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的,所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数。
在本实现方式中,任意一个重构节点包括M个第一节点和M个第一节点之间的有向边,在对该任意一个重构节点进行编译时,将M个第一节点中的每个第一节点划分为至少一个第一子节点,也即将M个第一节点中的每个第一节点所代表的算子操作划分为至少一个线程,至少一个线程中的每个线程通过一个子节点表示;M个第一节点进行线程划分后得到N个第一子节点,且N为大于或等于M的正整数,N个第一子节点和N个第一子节点之间的有向边构成第一线程级子图;对该第一线程级子图进行编译,可以得到该第一目标重构节点的编译数据,且该第一目标重构节点的编译数据包括N个第一子节点的编译数据。由于执行N个第一子节点等于执行了M个第一节点,而由同一个第一节点划分得到的至少一个第一子节点可以并发执行,如此,在调度N个第一子节点的编译数据执行这N个第一子节点时,N个第一子节点中由同一个第一节点划分得到的第一子节点可以并发执行,从而减少M个第一节点中的每个第一 节点的执行时间,也即减少M个第一节点的总执行时间,提升性能。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行所述第一目标重构节点的编译数据,包括:将R个第六对象的编译数据写入缓存中;针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
在本实现方式中,对于重构节点内的任意一个节点来说,当该任意一个节点依赖的节点的数量为0时,执行该任意一个节点;对于线程级子图内的任意一个子节点来说,当该任意一个子节点依赖的子节点的数量为0时,执行该任意一个子节点;如此设置任意一个节点或任意一个子节点被执行的时机,有利于提升执行效率。
在一种可能的实现方式中,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
在本实现方式中,对于重构节点内的任意一个节点来说,其被推送到计算资源中执行的时机是在其依赖的节点的数量为0以后;本申请可以将该任意一个节点依赖的节点的初始数量编译在该任意一个节点的编译数据中,如此,在执行时可以根据该任意一个节点依赖的节点的数量来确定将其推送到计算资源中执行的时机。如果该任意一个节点依赖的节点的初始数量为0,那么当该任意一个节点的编译数据被写入缓存中时,即可推送到计算资源中执行。如果该任意一个节点依赖的节点的初始数量不为0,那么需要等该任意一个节点依赖的节点都执行完了,才可把该任意一个节点的编译数据推送到计算资源中执行;此种情况下,在每执行完一个该任意一个节点依赖的节点之后,将该任意一个节点依赖的节点的数量减1,直至该任意一个节点依赖的节点的数量减为0以后,为该任意一个节点被推送到计算资源中执行的时机,从而将该任意一个节点推送到计算资源中执行。同理,对于线程级子图内的任意一个子节点来说,其被推送到计算资源中执行的时机是在其依赖的子节点的数量为0以后;本申请可以将该任意一个子节点依赖的子节点的初始数量编译在该任意一个子节点的编译数据中,如此,在执行时可以根据该任意一个子节点依赖的子节点的数量来确定将其推送到计算资源中执行的时机。如果该任意一个子节点依赖的子节点的初始数量为0,那么当该任意一个子节点的编译数据被写入缓存中时,即可推送到计算资源中执行。如果该任意一个子节点依赖的子节点的初始数量不为0,那么需要等该任意一个子节点依赖的子节点都执行完了,才可把该任意一个子节点的编译数据推送到计算资源中执行;此种情况下,在每执行完一个该任意一个子节点依赖的子节点之后,将该任意一个子节点依赖的子节点的数量减1,直至该任意一个子节点依赖的子节点的数量减为0以后,为该任意一个子节点被推送到计算资源中执行的时机,从而将该任意一个子节点推送到计算资源中执行。
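上述“依赖数量为0即推送执行、每执行完一个依赖对象即减1”的机制,可以用如下示意代码帮助理解(数据结构与函数名均为假设,非本申请限定的实现):

```python
from collections import deque

def run_reconstructed_node(compiled_objs, deps):
    """compiled_objs: {对象id: 编译数据}; deps: {对象id: 其依赖的对象id列表}。
    依赖的初始数量为0的对象可立即执行;每执行完一个对象,
    依赖它的后继对象的依赖数量减1,减为0后即被推送执行。"""
    pred_cnt = {o: len(deps.get(o, [])) for o in compiled_objs}  # 依赖的初始数量
    succ = {o: [] for o in compiled_objs}                        # 依赖于该对象的后继
    for o, ds in deps.items():
        for d in ds:
            succ[d].append(o)
    ready = deque(o for o, c in pred_cnt.items() if c == 0)      # 初始数量为0者直接就绪
    order = []
    while ready:
        cur = ready.popleft()
        order.append(cur)                                        # 模拟“推送到计算资源中执行”
        for s in succ[cur]:
            pred_cnt[s] -= 1                                     # 每执行完一个依赖,数量减1
            if pred_cnt[s] == 0:
                ready.append(s)
    return order

# 例:b、c均依赖a,d依赖b和c
order = run_reconstructed_node({"a": 0, "b": 0, "c": 0, "d": 0},
                               {"b": ["a"], "c": ["a"], "d": ["b", "c"]})
```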
在一种可能的实现方式中,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
在本实现方式中,如果重构节点内的任意一个节点依赖的节点的初始数量为0,该任意一 个节点无需等待其他节点执行完了以后才可执行;此种情况下,可以在执行该任意一个节点之前将该任意一个节点的编译数据写入缓存中,从而保证该重构节点的顺利执行。如果该任意一个节点依赖的节点的初始数量不为0,该任意一个节点需要等待其依赖的节点都执行完了以后才可执行;此种情况下,可以在执行该任意一个节点依赖的节点的过程中,将该任意一个节点的编译数据写入缓存中,从而在任意一个节点依赖的节点都执行完了以后,可以立马接着执行该任意一个节点,从而保证该重构节点的顺利执行;并且,没有过早将该任意一个节点的编译数据写入缓存中,不会导致该任意一个节点的编译数据长期占用缓存,实现缓存的合理利用。同理,如果线程级子图内的任意一个子节点依赖的子节点的初始数量为0,该任意一个子节点无需等待其他子节点执行完了以后才可执行;此种情况下,可以在执行该任意一个子节点之前将该任意一个子节点的编译数据写入缓存中,从而保证该线程级子图的顺利执行。如果该任意一个子节点依赖的子节点的初始数量不为0,该任意一个子节点需要等待其依赖的子节点都执行完了以后才可执行;此种情况下,可以在执行该任意一个子节点依赖的子节点的过程中,将该任意一个子节点的编译数据写入缓存中,从而在任意一个子节点依赖的子节点都执行完了以后,可以立马接着执行该任意一个子节点,从而保证该线程级子图的顺利执行;并且,没有过早将该任意一个子节点的编译数据写入缓存中,不会导致该任意一个子节点的编译数据长期占用缓存,实现缓存的合理利用。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
在一种可能的实现方式中,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子 节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所 述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
需要说明的是,第二方面所描述的可能的实现方式的有益效果也可以参照第一方面的描述,此处不再重复描述。
第三方面,本申请实施例提供了一种子图的编译装置,包括:获取单元,用于获取第一子图,所述第一子图为由计算图切分得到的多个子图中的任意一个,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边;重构单元,用于对所述第一子图进行重构,以得到第一重构子图,所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边;编译单元,用于对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
在一种可能的实现方式中,所述编译单元,具体用于:将所述M个第一节点中的每个第一节点划分为至少一个第一子节点,以得到第一线程级子图,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数;对所述第一线程级子图进行编译,以得到所述第一目标重构节点的编译数据。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个 第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
在一种可能的实现方式中,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点 为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
需要说明的是,第三方面的有益效果可以参照第一方面的描述,此处不再重复描述。
第四方面,本申请实施例提供了一种子图的执行装置,其特征在于,包括:获取单元,用于获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图 进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数;执行单元,用于执行所述第一目标重构节点的编译数据。
在一种可能的实现方式中,所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的,所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行单元,具体用于:将R个第六对象的编译数据写入缓存中;针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
在一种可能的实现方式中,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
在一种可能的实现方式中,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
在一种可能的实现方式中,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除 所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
需要说明的是,第四方面的有益效果可以参照第一方面或第二方面的描述,此处不再重复描述。
第五方面,本申请实施例提供了一种图编译装置,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置成由所述处理器执行,所述程序包括用于执行如上述第一方面及其中任一可能的实现方式所述的方法中的步骤的指令。
第六方面,本申请实施例提供了一种图执行装置,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置成由所述处理器执行,所述程序包括用于执行如上述第二方面及其中任一可能的实现方式所述的方法中的步骤的指令。
第七方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器执行如上述第一方面或第二方面及其中任一可能的实现方式所述的方法。
第八方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器执行如上述第一方面或第二方面及其中任一可能的实现方式所述的方法。
第九方面,本申请实施例提供了一种芯片,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有上述芯片的设备执行如上述第一方面或第二方面及其中任一可能的实现方式所述的方法。
附图说明
图1是本申请实施例提供的一种神经网络模型的应用示意图。
图2是本申请实施例提供的一种子图的编译、执行系统架构示意图。
图3是本申请实施例提供的一种子图重构的原理示意图。
图4是本申请实施例提供的一种线程划分的示意图。
图5是本申请实施例提供的另一种线程划分的示意图。
图6是本申请实施例提供的一种将节点划分到重构节点的示意图。
图7是本申请实施例提供的一种子图切分策略的生成示意图。
图8是本申请实施例提供的另一种子图切分策略的生成示意图。
图9是本申请实施例提供的一种资源分配的示意图。
图10是本申请实施例提供的一种数据的预取操作的示意图。
图11是本申请实施例提供的一种数据的无效操作的示意图。
图12是本申请实施例提供的一种数据的回写操作的示意图。
图13是对图5所示的重构节点进行编译的示意图。
图14是对图5所示的线程级子图进行编译的示意图。
图15是对图5所示的线程级子图中的部分子节点的依赖关系进行表达的示意图。
图16是本申请实施例提供的一种有向边的示意图。
图17是本申请实施例提供的一种控制边的表达示意图。
图18是本申请实施例提供的另一种控制边的表达示意图。
图19是本申请实施例提供的一种分布式执行的表达的示意图。
图20是本申请实施例提供的另一种分布式执行的表达的示意图。
图21是本申请实施例提供的一种图执行装置的结构示意图。
图22是本申请实施例提供的一种主机侧与设备侧的异步流水线的示意图。
图23是本申请实施例提供的另一种主机侧与设备侧的异步流水线的示意图。
图24是本申请实施例提供的一种重构节点中节点之间的依赖关系或线程级子图中的子节点之间的依赖关系示意图。
图25是本申请实施例提供的一种节点或子节点的输入、输出数据的缓存管理操作示意图。
图26是本申请实施例提供的一种子图的编译方法的流程示意图。
图27是本申请实施例提供的一种子图的执行方法的流程示意图。
图28是本申请实施例提供的一种子图的编译装置的结构示意图。
图29是本申请实施例提供的一种子图的执行装置的结构示意图。
图30是本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
在本说明书中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本说明书所描述的实施例可以与其它实施例相结合。
鉴于以节点/任务为调度单位的流调度机制存在调度时间长和执行并发度低的问题,本申请提供了一种以图为调度单位的图调度机制,包括以图为调度单位的图编译和图执行,以减少调度时间和提升执行并发度。
请参阅图2,图2是本申请实施例提供的一种子图的编译、执行系统架构示意图,该系统架构包括图编译装置100和图执行装置200,可选地,图编译装置100可以设置在主机(host)侧或者就是host,图执行装置200可以设置在设备(device)侧或者就是device。其中,图编译装置100的输入为以tensorflow/pytorch/onnx等前端脚本生成的神经网络模型,图编译装置100负责将神经网络模型进行编译以生成系统认识的数据结构,作为系统的图执行装置200的输入;图执行装置200根据图编译装置100输出的编译数据,以重构节点或线程级子图为调度单位进行重构节点或线程级子图的执行,从而获取神经网络模型推理的结果或者训练的模型参数。下面对图编译装置100和图执行装置200进行具体说明。
一、图编译装置100。
图编译装置100包括解析转换(parser)单元101、图优化单元102、子图切分单元103、资源分配单元104和编译单元105,下面具体说明。
解析转换单元101:用于将tensorflow/pytorch/onnx等前端中间表达(intermediate representation,IR)的计算图转换为本系统中间表达的计算图。
其中,计算图由节点(nodes)和线(edges)组成。计算图中的节点表示操作符(Operator),或者称之为算子。计算图中的线也称为边,表示计算间的依赖,计算图中的边是有向边,定义了操作之间的关系,分为两类:其中一类用来传输数据,称为数据边,用实线表示;另一类用来定义依赖关系(即执行先后顺序),称为控制边,用虚线表示。计算图中所有的节点都通过数据边或者控制边连接,其中入度(也即依赖的节点的数量)为0的节点没有前置依赖,可以立即执行;入度大于0的节点,要等待其依赖的所有节点执行结束之后,才可以执行。
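上述节点、数据边、控制边与入度的概念,可以用一个简单的数据结构示意(仅为便于理解的示意,非本系统中间表达的实际定义):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                        # 节点所表示的算子
    data_preds: list = field(default_factory=list)   # 数据边的前驱节点
    ctrl_preds: list = field(default_factory=list)   # 控制边的前驱节点

    @property
    def in_degree(self):
        # 入度 = 依赖的节点的数量(数据边 + 控制边的前驱之和)
        return len(self.data_preds) + len(self.ctrl_preds)

    def ready(self):
        # 入度为0的节点没有前置依赖,可以立即执行
        return self.in_degree == 0

a = Node("conv")                      # 无前驱,可立即执行
b = Node("relu", data_preds=[a])      # 通过数据边依赖a
c = Node("ctrl", ctrl_preds=[b])      # 通过控制边依赖b
```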
图优化单元102:对已经使用本系统中间表达的计算图进行优化,例如对解析转换单元101输出的计算图进行优化,具体包括对计算图中的节点所表示的算子进行算子融合、常量折叠、精度和格式(format)优化等,其中,节点所表示的算子包括但不限于:计算算子、通信算子(例如集合通信算子等)、控制算子(例如发送算子、接收算子等)、异构计算算子(例如CPU算子、GPU算子、矩阵类算子、向量类算子等)等。应理解,图优化单元102是可选的,也即子图切分单元103的输入也可以是未进行优化的计算图。
子图切分单元103:用于将计算图切分成多个子图,再对每个子图进行重构以得到多个重构子图;可选地,还可以进一步对重构子图中的每个重构节点进行线程(thread)划分以得到每个重构节点对应的线程级子图,也即将重构子图中的每个重构节点内部的节点划分成至少一个线程级的子节点以得到该每个重构节点对应的线程级子图。应理解,子图切分单元103的输入可以是图优化单元102的输出,也可以是解析转换单元101的输出;子图切分单元103的输出可以是重构子图或重构节点,也可以是线程级子图。
其中,子图重构的原理是将子图中的至少一个节点以及这至少一个节点之间的有向边作为一个重构节点,相当于对子图进一步进行切分后得到多个更小规模的子图,这些更小规模的子图就是一个个重构节点,这些更小规模的子图之间的有向边就是重构节点之间的有向边。应理解,本申请中的重构子图包括至少一个重构节点和该至少一个重构节点之间的有向边,故本申请中的重构子图实质上还是一张子图;并且,由于重构子图中任意一个重构节点包括重构之前的子图中的一个或多个节点和这一个或多个节点之间的有向边,故本申请中的重构节点实质上是一张比子图规模更小的子图,也即重构节点为图结构。
举例来说,请参阅图3,图3是本申请实施例提供的一种子图重构的原理示意图。如图3所示,子图包括:节点a、节点b、节点c、节点d、节点e、节点f以及节点a与节点b之间的有向边、节点b与节点c之间的有向边、节点c与节点d之间的有向边、节点d与节点e之间的有向边、节点e与节点f之间的有向边;将节点a、节点b、节点c、节点a与节点b之间的有向边以及节点b与节点c之间的有向边作为重构节点A,将节点d、节点e以及节点d与节点e之间的有向边作为重构节点B,将节点f作为重构节点C;而节点c与节点d之间的有向边即为重构节点A和重构节点B之间的有向边,节点e与节点f之间的有向边即为重构节点B和重构节点C之间的有向边;如此,重构得到重构子图,该重构子图包括:重构节点A、重构节点B、重构节点C以及重构节点A和重构节点B之间的有向边、重构节点B和重构节点C之间的有向边。可以发现,重构节点A、重构节点B和重构节点C实质上是一张比子图规模更小的子图,也即重构节点A、重构节点B和重构节点C为图结构。
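以图3为例,子图重构(将节点分组为重构节点,并将跨组的有向边作为重构节点之间的有向边)可示意如下(分组结果作为已知输入给出,仅作示意):

```python
def rebuild(edges, grouping):
    """edges: 子图中节点间的有向边列表[(src, dst)];
    grouping: {节点: 所属重构节点}。
    跨组的节点间有向边即为重构节点之间的有向边。"""
    return {(grouping[s], grouping[d]) for s, d in edges
            if grouping[s] != grouping[d]}

# 图3:a、b、c归入重构节点A,d、e归入重构节点B,f归入重构节点C
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
grouping = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "C"}
recon_edges = rebuild(edges, grouping)
# c与d之间的有向边成为A与B之间的有向边,e与f之间的有向边成为B与C之间的有向边
```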
其中,对重构节点进行线程划分也即将重构节点中包括的每个节点划分成至少一个子节点,从而得到由多个子节点和这多个子节点之间的有向边构成的线程级子图;而线程级子图中的任意两个子节点之间的有向边是根据划分得到该任意两个子节点的节点之间的有向边确定的,且由同一个节点划分得到的至少一个子节点可以并行执行。应理解,由于在进行线程划分时,一个节点划分成至少一个子节点,那么该节点的输入、输出数据也会划分成等数量 个线程,例如该节点的输入数据也划分为至少一个子输入数据,这至少一个子输入数据分别为至少一个子节点的输入数据,也即这至少一个子输入数据与这至少一个子节点是一一对应的,且这至少一个子节点的输出数据构成该节点的输出数据。其中,一个节点是否可以划分为多个子节点,以该节点的输入数据是否可以划分来确定,若该节点的输入数据可以划分为多个子输入数据,则该节点可以划分为多个子节点;否则,该节点不能划分为多个子节点。例如,节点1表示加号,节点1的输入数据就是数据A和数据B,数据A可以划分为子数据A1和子数据A2,数据B可以划分为子数据B1和子数据B2,那么节点1划分为子节点10和子节点11,子节点10和子节点11均表示加号,子节点10的输入为子数据A1和子数据B1,子节点11的输入为子数据A2和子数据B2。
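上述“节点1表示加号”的线程划分例子可以用如下示意代码表达(数据切分方式为假设,仅作示意):

```python
def split_add_node(data_a, data_b, num_threads):
    """将表示加法的节点按输入数据的划分切成num_threads个子节点:
    每个子节点仍表示加号,只处理自己的一份子输入数据,
    各子节点的输出数据共同构成该节点的输出数据。"""
    n = len(data_a)
    chunk = (n + num_threads - 1) // num_threads
    sub_nodes = []
    for t in range(num_threads):
        lo, hi = t * chunk, min((t + 1) * chunk, n)
        # 子输入数据与子节点一一对应;各子节点之间可并行执行
        sub_nodes.append(lambda a=data_a[lo:hi], b=data_b[lo:hi]:
                         [x + y for x, y in zip(a, b)])
    out = []
    for s in sub_nodes:          # 此处顺序模拟,实际可并发执行
        out.extend(s())
    return out

# 数据A划分为子数据A1、A2,数据B划分为子数据B1、B2,节点1划分为子节点10、11
result = split_add_node([1, 2, 3, 4], [10, 20, 30, 40], 2)
```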
应理解,在本申请中,计算图、子图、重构子图、线程级子图中的有向边均可以包括数据边和控制边。
进一步地,本申请还可以通过分析计算图的缓存(cache)占用,对线程级子图中的各线程之间添加依赖,控制线程的并发度,以提高计算资源利用率和存储效率。
举例来说,请参阅图4,图4是本申请实施例提供的一种线程划分的示意图。如图4所示,重构节点包括:卷积算子_0节点、向量算子_0节点、卷积算子_1节点、向量算子_1节点,以及卷积算子_0节点与向量算子_0节点之间的有向边、向量算子_0节点与卷积算子_1节点之间的有向边、卷积算子_1节点与向量算子_1节点之间的有向边;将卷积算子_0节点、向量算子_0节点、卷积算子_1节点、向量算子_1节点均划分成8个线程,也即将卷积算子_0节点、向量算子_0节点、卷积算子_1节点、向量算子_1节点中的每个均划分成8个子节点,从而8个线程的子节点以及8个线程的子节点之间的有向边构成线程级子图;其中,图4中仅示出了8个线程中的每个线程内的子节点之间的有向边;应理解,8个线程中的线程之间的子节点之间也可能存在有向边(图4中未示出),以及8个线程中的线程之间还可以插入同步点(图4中未示出)。同理,还需要将卷积算子_0节点、向量算子_0节点、卷积算子_1节点、向量算子_1节点的输入、输出数据均划分成8个线程,也即将数据_0、数据_1、数据_2、数据_3、数据_4中的每个均划分成8个子数据。其中,图4所示的线程级子图中的8个线程的并发度为4,也即其中4个线程可以并发执行,另外4个线程又可以并发执行。
需要说明的是,由于前述对重构节点中的节点进行了线程划分,那么在线程级子图中必然会有各线程之间的同步,本申请可以根据重构节点(或计算图、子图)中的节点之间的数据依赖在线程级子图中的各线程之间插入同步点,实现各线程之间的同步;或者说,重构节点中的每个节点被切分的线程数可以相同,也可以不同,其中,在线程数不同的节点之间,在划分得到的线程级子图中存在汇聚子节点和分拆子节点。
举例来说,请参阅图5,图5是本申请实施例提供的另一种线程划分的示意图。如图5所示,重构节点内包括人工智能矩阵(artificial intelligence cube,AIC)算子节点、人工智能矩阵中央处理器(artificial intelligence CPU,AICPU)算子节点、人工智能向量(artificial intelligence vector,AIV)算子节点、通信算子节点和控制算子节点,以及AIC算子节点与AICPU算子节点之间的有向边、AICPU算子节点与AIV算子节点之间的有向边、AIV算子节点与通信算子节点之间的有向边和通信算子节点与控制算子节点之间的有向边;在进行线程划分时,AIC算子节点、AIV算子节点和控制算子节点均划分为3个线程,AICPU算子节点和通信算子节点均划分为2个线程;其中,划分得到的AIV算子1子节点为分拆子节点,通信算子1子节点和AICPU算子1子节点即为汇聚子节点,也为分拆子节点。
其中,子图切分单元103在进行子图重构的过程中,将子图中的任意一个节点划分到重构 子图中的哪个重构节点中,是根据子图切分策略确定的;并且,子图切分单元103在进行线程划分的过程中,任意一个节点划分的线程数(threading num)或者任意一个节点划分得到的子节点的数量以及任意一个子节点的并发度(window size),也是根据子图切分策略确定的。本申请中,子图切分策略由以下三种途径生成。
子图切分策略生成途径一:
子图切分时,根据子图上张量排布形状(tensor shape)的差分信息来划分区间,然后给每个区间计算一个先验的线程数和并发度;其中,差分信息标识张量之间的差异,可选的计算方式包括:张量占用字节数之差的绝对值等。具体地,统计子图中每个节点的输入、输出数据的张量(tensor)占用,并根据张量排布形状的差分信息的大小进行聚类,使得张量排布形状变化不大的地方划分到一个重构节点中。
举例来说,请参阅图6,图6是本申请实施例提供的一种将节点划分到重构节点的示意图,如图6所示,子图包括:节点a、节点b、节点c、节点d、节点e、节点f以及节点a与节点b之间的有向边、节点b与节点c之间的有向边、节点c与节点d之间的有向边、节点d与节点e之间的有向边、节点e与节点f之间的有向边。其中,节点a的输入、输出张量占用字节数为t0和t1,节点b的输入、输出张量占用字节数为t1和t2,节点c的输入、输出张量占用字节数为t2和t3,节点d的输入、输出张量占用字节数为t3和t4,节点e的输入、输出张量占用字节数为t4和t5,节点f的输入、输出张量占用字节数为t5和t6,故节点a的输入输出张量的差分信息为|Δt0-1|、节点b的输入输出张量的差分信息为|Δt1-2|,节点c的输入输出张量的差分信息为|Δt2-3|,节点d的输入输出张量的差分信息为|Δt3-4|,节点e的输入输出张量的差分信息为|Δt4-5|,节点f的输入输出张量的差分信息为|Δt5-6|;由于|Δt0-1|、|Δt1-2|与|Δt2-3|的和的大小,以及|Δt3-4|与|Δt4-5|的和的大小,与|Δt5-6|的大小接近,故可以将节点a、节点b、节点c、节点a与节点b之间的有向边以及节点b与节点c之间的有向边作为重构节点A,将节点d、节点e以及节点d与节点e之间的有向边作为重构节点B,以及将节点f作为重构节点C。
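上述按差分信息聚类划分重构节点的做法,可以用一个贪心分组的示意代码表达(阈值取法为假设,仅作理解之用,非本申请限定的实现):

```python
def group_by_diff(tensor_bytes, threshold):
    """tensor_bytes: 链式子图上各张量占用字节数[t0, t1, ..., tk];
    第i个节点的差分信息为|t(i+1) - t(i)|。
    贪心地把差分信息累计和不超过threshold的相邻节点划入同一重构节点,
    使得张量排布形状变化不大的地方落在一个重构节点中。"""
    diffs = [abs(b - a) for a, b in zip(tensor_bytes, tensor_bytes[1:])]
    groups, cur, acc = [], [], 0
    for i, d in enumerate(diffs):
        if cur and acc + d > threshold:   # 超过阈值则另起一个重构节点
            groups.append(cur)
            cur, acc = [], 0
        cur.append(i)
        acc += d
    if cur:
        groups.append(cur)
    return groups

# 6个节点的差分信息分别为1,1,1,2,1,3;阈值取3时得到类似图6的三个重构节点
groups = group_by_diff([10, 11, 12, 13, 15, 16, 19], 3)
```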
其中,选择线程数与并发度的原则是:可并发线程内的所有张量数据量之和S小于二级缓存(L2 cache)的容量(size)。例如,在并发度window size=2的情况下,可以计算出最小的线程切分份数n:如果S≤C_cache,其中C_cache表示二级缓存的容量,说明不用切线程,全部数据都能放到二级缓存中,为考虑并发性,则可以令n=2;否则,令n=⌈2S/C_cache⌉。
进一步地,如果希望增加并发线程数,例如在并发度window size=4的情况下,那么可以在保持对二级缓存容量需求不变的基础上,将最小的线程切分份数变为2n。
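上述线程切分份数的计算可整理为如下示意代码(其中S为可并发线程内张量数据量之和,c_cache为二级缓存容量;window size不为2时的处理是按上文关系的假设推广,仅作示意):

```python
import math

def min_thread_splits(S, c_cache, window_size=2):
    """按“可并发线程内的张量数据量之和不超过二级缓存容量”计算
    最小线程切分份数n:window_size个线程并发时需满足
    window_size * (S / n) <= c_cache。"""
    if S <= c_cache:
        # 全部数据都能放到二级缓存中,不用切线程;为考虑并发性令n=window_size
        return 2 if window_size == 2 else window_size
    return math.ceil(window_size * S / c_cache)

n2 = min_thread_splits(S=10, c_cache=4, window_size=2)   # ⌈2S/C_cache⌉
n4 = min_thread_splits(S=10, c_cache=4, window_size=4)   # 并发度翻倍,份数变为2n
```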
子图切分策略生成途径二:
通过自动调优工具(Autotune)自动搜索算法找到最优子图切分策略,并将其保存在知识库中,编译时直接读取知识库中保存的策略。
举例来说,请参阅图7,图7是本申请实施例提供的一种子图切分策略的生成示意图。如图7所示,通过策略搜索算法对神经网络模型的计算图生成子图切分策略,应用该子图切分策略后,进行该神经网络模型的编译,并可以在编译过程中对新生成的算子形状进行Autotune调优,并进行上板验证获取该神经网络模型的计算图的执行性能,例如获取上板运行的时间(on-board cycles)并反馈给策略搜索算法,在多次迭代后将调优过程中最优的子图切分策略存入策略知识库,以备实际应用;为了简化模型编译/算子调优/上板验证的时间,也可以通过性能预测器直接对应用该子图切分策略下的计算图进行执行性能的评估,并将评估结果反馈给策略搜索算法,最后通过策略搜索算法根据性能预测器反馈的结果对该子图切分策略进行调整,得到最优的子图切分策略存入策略知识库,以备实际应用。
子图切分策略生成途径三:
离线训练一个策略生成神经网络模型,并将其集成到图编译装置100中,编译时直接调用该策略生成神经网络模型推理产生子图切分策略。
举例来说,请参阅图8,图8是本申请实施例提供的另一种子图切分策略的生成示意图。如图8所示,离线准备数据集,数据集内包括大量待输出子图切分策略的计算图,通过策略生成神经网络模型对数据集中计算图进行子图切分策略生成,并通过强化学习(RL)、监督训练等算法对策略生成神经网络模型进行更新提升,最终将训练好的策略生成神经网络模型作为在线推理阶段的策略生成器,在编译时直接调用以生成子图切分策略。
资源分配单元104:以重构节点或线程级子图为调度单位进行资源分配、数据生命周期识别以及设定缓存管理操作;可选地,资源分配单元104还可以对重构节点或线程级子图进行优化,如图9所示。其中,以重构节点或线程级子图为调度单位进行资源分配、数据生命周期识别以及设定缓存管理操作,包括:对重构节点中的每个节点进行资源分配(例如内存分配)、输入、输出数据的生命周期识别以及设定缓存管理操作,或对线程级子图中的每个子节点进行资源分配、输入、输出数据的生命周期识别以及设定缓存管理操作。
应理解,子图切分单元103根据前述子图切分策略得到重构节点或线程级子图。本申请为了以该重构节点或线程级子图为调度单位进行调度,可以将重构节点或线程级子图的范围通过相同的范围(scope)属性值表达,或者通过功能算子(functionOp)进行表达。如此,通过这样的表达,可以将一个重构节点或线程级子图作为一个整体呈现给软件栈,方便编译和执行。
其中,对重构节点或线程级子图进行优化包括:以重构节点或线程级子图为调度单位,对重构节点内的节点或线程级子图内的子节点所表示的算子进行优化,包括算子融合优化、单算子优化、常量折叠优化、数据类型(dtype)优化、格式(format)优化等。应理解,对重构节点或线程级子图内的算子进行优化,包括但不限于上述优化。
其中,对重构节点或线程级子图进行资源分配包括对重构节点内的节点或线程级子图内子节点的输入、输出、临时内存区(workspace)等进行内存分配。应理解,在对重构节点或线程级子图进行内存分配时,内存的复用的范围可以是重构节点之间或线程级子图之间,也可以是重构节点或线程级子图之内。
需要说明的是,内存分配包括分配在图编译装置100中存储时的内存以及分配在图执行装置200中存储时的内存。
其中,对重构节点或线程级子图的数据生命周期识别,包括对重构节点内的每个节点或线程级子图中的每个子节点的输入、输出数据进行生命周期识别。例如,识别数据的生成(produce)、数据的消费(consume)、数据第一次读、数据最后一次读、数据第一次写、数据最后一次写等数据生命周期;需要说明的是,数据的生成表示某个节点的输入数据由前面几个节点生成或某个子节点的输入数据由前面几个子节点生成,数据的消费表示这个节点的输出数据作为后面几个节点的输入数据或这个子节点的输出数据作为后面几个子节点的输入数据。
其中,设定缓存管理操作也即对重构节点内的每个节点或线程级子图中的每个子节点的输入、输出数据设定缓存管理操作;设定的缓存管理操作包括预取(prefetch)操作、无效(invalid)操作、回写(writeback)操作和刷新(flush)操作等操作,从而在执行时最大限度提升缓存的使用性能。下面具体介绍。
(1)预取操作:是指针对从重构节点或线程级子图外进来的数据,也即非重构节点内的 节点或线程级子图内的子节点的输出数据,对该数据进行预取,而预先写入缓存中,以减少计算时从内存读取该数据而带来的性能消耗。
举例来说,请参阅图10,图10是本申请实施例提供的一种数据的预取操作的示意图。如图10所示,重构节点外的节点1的输出数据为数据1,数据1为重构节点内的节点2和节点3的输入数据,则对数据1进行预取操作;或者,线程级子图外的子节点1的输出数据为数据1,数据1为线程级子图内的子节点2和子节点3的输入数据,则对数据1进行预取操作;其中,在节点2、节点3或子节点2、子节点3执行之前,数据1是存储在内存中的,对数据1进行预取也即将数据1从内存写入缓存中。
(2)无效操作:是指针对重构节点或线程级子图内消耗完成的数据,最后一次使用完成后,将缓存中的该数据无效掉,减少刷回内存带来的性能消耗。
举例来说,请参阅图11,图11是本申请实施例提供的一种数据的无效操作的示意图。如图11所示,重构节点内的节点1的输出数据为数据1,数据1为重构节点内的节点2和节点3的输入数据,数据1在作为重构节点内的节点2和节点3的输入数据之后,不再作为重构节点内的其他节点的输入数据,则对数据1进行无效操作,也即将缓存中的数据1删除;或者,线程级子图内的子节点1的输出数据为数据1,数据1为线程级子图内的子节点2和子节点3的输入数据,数据1在作为线程级子图内的子节点2和子节点3的输入数据之后,不再作为线程级子图内的其他子节点的输入数据,则对数据1进行无效操作,也即将缓存中的数据1删除。
(3)回写操作:是指针对在重构节点或线程级子图内使用的数据,若在该重构节点或该线程级子图外也在使用,则在该数据生成时回写至内存中。
举例来说,请参阅图12,图12是本申请实施例提供的一种数据的回写操作的示意图。如图12所示,重构节点内的节点1的输出数据为数据1,数据1除为重构节点内的节点2的输入数据之外,还为重构节点外的节点3的输入数据,则对数据1进行回写操作,也即将缓存中的数据1写入内存中;或者,线程级子图内的子节点1的输出数据为数据1,数据1除为线程级子图内的子节点2的输入数据之外,还为线程级子图外的子节点3的输入数据,则对数据1进行回写操作,也即将缓存中的数据1写入内存中。
(4)刷新操作:是指在某个数据回写至内存之后,增加无效操作以删除缓存中的该数据。
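上述预取、无效、回写、刷新四种缓存管理操作的设定条件,可按图10至图12的描述整理为如下示意代码(函数名与判断方式均为便于理解的假设,非本申请限定的实现):

```python
def cache_ops_for(data, inside_nodes, producer, consumers):
    """根据数据的生产者/消费者与重构节点(或线程级子图)的内外关系,
    为该数据设定缓存管理操作。inside_nodes: 重构节点内的节点集合。"""
    ops = []
    if producer not in inside_nodes and any(c in inside_nodes for c in consumers):
        ops.append("prefetch")        # 从重构节点外进来的数据:预先写入缓存
    if producer in inside_nodes:
        if all(c in inside_nodes for c in consumers):
            ops.append("invalid")     # 仅在重构节点内消耗:最后一次使用后无效掉
        else:
            ops.append("writeback")   # 重构节点外也在使用:生成时回写至内存
            ops.append("flush")       # 回写之后增加无效操作以删除缓存中的该数据
    return ops

ops1 = cache_ops_for("数据1", {"n2", "n3"}, producer="n1",
                     consumers=["n2", "n3"])                     # 对应图10
ops2 = cache_ops_for("数据1", {"n1", "n2", "n3"}, producer="n1",
                     consumers=["n2", "n3"])                     # 对应图11
ops3 = cache_ops_for("数据1", {"n1", "n2"}, producer="n1",
                     consumers=["n2", "n3"])                     # 对应图12
```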
编译单元105:用于以重构节点或线程级子图为调度单位进行编译,将每个重构节点或每个线程级子图编译成一个调度单位。
具体地,编译单元105对每个重构节点进行编译,对重构节点内的每个节点的计算、缓存管理操作和依赖关系进行表达,也即完成每个节点的二进制执行代码、依赖关系的表达以及缓存管理操作的表达的生成。或者,编译单元105对每个线程级子图进行编译,对线程级子图内的每个子节点的计算、缓存管理操作任务和依赖关系进行表达,也即完成每个子节点的二进制执行代码、依赖关系的表达以及缓存管理操作的表达的生成。其中,依赖关系用于表示有向边,例如节点与节点之间的有向边,重构节点与重构节点之间的有向边,或者子节点与子节点之间的有向边。应理解,每个节点或子节点的缓存管理操作是由前述资源分配单元104设定的。
其中,编译单元105对每个重构节点进行编译以得到每个重构节点的编译数据,而对每个重构节点进行编译包括对该重构节点内的每个节点进行编译以得到该重构节点内的每个节点的编译数据,故每个重构节点的编译数据包括该重构节点内的每个节点的编译数据,每个节点的编译数据包括针对该节点生成的硬件感知的任务、该节点的描述符等信息。本申请中, 硬件感知的任务包括运行时的硬件可执行的计算算子的二进制执行代码以及该二进制执行代码的入参(args)等,故节点对应的任务包括该节点所表示的计算算子的二进制执行代码以及该二进制执行代码的入参;描述符包括依赖关系和缓存管理操作指示信息等,故节点的描述符包括该节点的依赖关系以及该节点的缓存管理操作指示信息,且该节点的缓存管理操作指示信息用于指示在执行时对该节点的输入、输出数据进行相应的缓存管理操作(例如预取、无效、回写和刷新等)。
举例来说,请参阅图13,图13是对图5所示的重构节点进行编译的示意图。如图13所示,重构节点中包括AIC算子节点、AICPU算子节点、AIV算子节点、通信算子节点和控制算子节点这5个节点,对这5个节点分别进行编译,生成这5个节点中每个节点对应的任务;此外,重构节点中包括这5个节点之间的有向边,故在编译时还生成这5个节点中每个节点的描述符(图13中未示出),其中,每个节点的描述符包括该节点的依赖关系,从而在执行时硬件可以感知这5个节点的执行顺序;并且,前述资源分配单元104为每个节点设定了相应的缓存管理操作,故每个节点的描述符还包括每个节点的输入、输出数据的缓存管理操作指示信息。
或者,编译单元105对每个线程级子图进行编译以得到每个线程级子图的编译数据,每个线程级子图的编译数据也即划分得到该线程级子图的重构节点的编译数据,而对每个线程级子图进行编译包括对该线程级子图内的每个子节点进行编译以得到该线程级子图内的每个子节点的编译数据,故每个线程级子图的编译数据包括该线程级子图内的每个子节点的编译数据,每个子节点的编译数据包括针对该子节点生成的硬件感知的任务、该子节点的描述符等信息。本申请中,硬件感知的任务包括运行时的硬件可执行的计算算子的二进制执行代码以及该二进制执行代码的入参(args)等,故子节点对应的任务包括该子节点所表示的计算算子的二进制执行代码以及该二进制执行代码的入参;描述符包括依赖关系和缓存管理操作指示信息等,故子节点的描述符包括该子节点的依赖关系以及该子节点的缓存管理操作指示信息且该子节点的缓存管理操作指示信息用于指示在执行时对该子节点的输入、输出数据进行相应的缓存管理操作(例如预取、无效、回写和刷新等)。
举例来说,请参阅图14,图14是对图5所示的线程级子图进行编译的示意图。如图14所示,线程级子图内包括13个子节点,对这13个子节点分别进行编译,生成这13个子节点中每个子节点对应的任务;此外,线程级子图还包括这13个子节点之间的有向边,故在编译时还生成这13个子节点中每个子节点的描述符(图14中未示出),其中,每个子节点的描述符包括每个子节点的依赖关系,从而在执行时硬件可以感知这13个子节点的执行顺序;并且,前述资源分配单元104为每个子节点设定了相应的缓存管理操作,故每个子节点的描述符还包括每个子节点的输入、输出数据的缓存管理操作指示信息。
在本申请中,每个节点或子节点的编译数据包括执行该节点或子节点的计算装置的类型。具体地,在每个节点或子节点的任务中增加异构处理器的类型,以便在执行时硬件调度到对应的硬件加速单元进行处理,保证异构处理器的并发执行。如此,在执行时将异构算子节点调度到异构处理器上,让异构算子节点并发执行,例如CPU算子节点、GPU算子节点、专用集成电路(Application Specific Integrated Circuit,ASIC)算子节点的并发执行。
在本申请中,线程级子图内的各线程可以并发执行,线程级子图中的每个子节点的编译数据中还包括与该子节点并行执行的其他子节点,该其他子节点和该子节点是由同一节点划分得到的。如此,在多个线程执行时,线程间可以并发执行。
在本申请中,某个节点的依赖关系可以表达为该节点依赖的其他节点的数量和依赖于该节点的其他节点,某个子节点的依赖关系可以表达为该子节点依赖的其他子节点的数量和依赖于该子节点的其他子节点。
举例来说,请参阅图15,图15是对图5所示的线程级子图中的部分子节点的依赖关系进行表达的示意图。如图15所示,依赖关系表达可以通过前面的(predecessor,pred)和后继的(successor,succ)来表达,其中,通过pred_cnt表达某个节点依赖前面节点的个数,succ_list表达依赖该节点的后继节点。例如,AIC算子1子节点依赖前面节点的个数为0,依赖AIC算子1子节点的子节点为AICPU算子1子节点;AIC算子2子节点依赖前面节点的个数为0,依赖AIC算子2子节点的子节点为AICPU算子1子节点;AICPU算子1子节点依赖前面节点的个数为2,依赖AICPU算子1子节点的子节点为AIV算子1子节点。
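图15中通过pred_cnt与succ_list表达依赖关系的方式,可用如下一段假设性的Python数据结构示意(节点名取自图15的例子,字典结构为本示例的假设):

```python
# 以 pred_cnt / succ_list 表达图15中部分子节点的依赖关系(假设性示意)
deps = {
    "AIC算子1":   {"pred_cnt": 0, "succ_list": ["AICPU算子1"]},
    "AIC算子2":   {"pred_cnt": 0, "succ_list": ["AICPU算子1"]},
    "AICPU算子1": {"pred_cnt": 2, "succ_list": ["AIV算子1"]},
    "AIV算子1":   {"pred_cnt": 1, "succ_list": []},
}

# 一致性检查:某个节点的 pred_cnt 应等于它在所有 succ_list 中被引用的次数
for name, d in deps.items():
    refs = sum(name in other["succ_list"] for other in deps.values())
    assert refs == d["pred_cnt"]
```

也即AICPU算子1依赖前面节点的个数为2(AIC算子1与AIC算子2),其后继为AIV算子1。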
在本申请中,依赖关系的表达包括但不限于以下情况:
(1)有向边表达:
第一种,重构节点内的节点之间的数据边和控制边均可以通过节点依赖的其他节点的数量和依赖于该节点的其他节点来表达,或线程级子图内子节点之间的数据边和控制边均可以通过子节点依赖的其他子节点的数量和依赖于该子节点的其他子节点来表达。也即,重构节点内的节点之间的数据边和控制边或线程级子图内的子节点之间的数据边和控制边均可以通过上述pred_cnt和succ_list表达,并且编译在该节点或子节点的编译数据中。
举例来说,请参阅图16,图16是本申请实施例提供的一种有向边的示意图。如图16所示,节点之间的实线表示数据边,例如左边的AIC算子节点与左边的AICPU算子节点之间的实线表示左边的AIC算子节点与左边的AICPU算子节点之间的数据边,左边的AIC算子节点的依赖的其他节点的数量(pred_cnt)为0,依赖于左边的AIC算子节点的其他节点(succ_list)为左边的AICPU算子节点,并将其编译在左边的AIC算子节点的编译数据中;节点之间的虚线表示控制边,例如左边的AICPU算子节点与右边的AICPU算子节点之间的虚线表示左边的AICPU算子节点与右边的AICPU算子节点之间的控制边,其中,左边的AICPU算子节点的依赖的其他节点的数量(pred_cnt)为1,也即为左边的AIC算子节点,依赖于左边的AICPU算子节点的其他节点(succ_list)为右边的AICPU算子节点和左边的AIV算子节点,并将其编译在左边的AICPU算子节点的编译数据中。
第二种,同一流内的不同重构节点内的节点之间的控制边,转换为该不同重构节点之间的控制边来表达;而不同重构节点之间的控制边可以通过重构节点依赖的其他重构节点的数量和依赖于该重构节点的其他重构节点来表达,也即通过上述pred_cnt和succ_list表达。需要说明的是,该不同重构节点可以是同一重构子图中的重构节点,也可以是分别在不同重构子图中的重构节点。
举例来说,请参阅图17,图17是本申请实施例提供的一种控制边的表达示意图。如图17所示,两个重构节点在同一流内,其中一个重构节点内的AICPU算子节点与另外一个重构节点内的AICPU算子节点之间的控制边,转换成这两个重构节点之间的控制边来表达,并把这两个重构节点之间的控制边表达成这两个重构节点之间的依赖关系,编译在这两个重构节点的编译数据中。
第三种,不同流内的不同重构节点内的节点之间的控制边,增加发送(send)算子节点和接收(recv)算子节点,以转换为流之间的控制边的表达,以控制流之间的前后顺序;具体地,将不同流内的不同重构节点内的节点之间的控制边转换成某个流内的重构节点与发送算子节点的控制边、发送算子节点与接收算子节点的控制边、接收算子节点与其他流内的重构节点的控制边来表达。发送算子节点可以向接收算子节点发送同步信号,该同步信号可以指示该发送算子节点依赖的节点都已经执行完成,如此,该接收算子节点在接收到该发送算子节点发送的同步信号后,与该接收算子节点之间存在控制边的重构节点可以执行。需要说明的是,该不同重构节点可以是同一重构子图中的重构节点,也可以是分别在不同重构子图中的重构节点。
举例来说,请参阅图18,图18是本申请实施例提供的另一种控制边的表达示意图。如图18所示,两个重构节点分别在两个流内,其中一个重构节点内的AICPU算子节点与另外一个重构节点内的AICPU算子节点之间存在控制边;在其中一个重构节点或者包括其中一个重构节点的流中增加发送算子节点,发送算子节点依赖于其中一个重构节点,也即在发送算子节点与其中一个重构节点内最后被执行的节点之间增加有向边(如数据边),例如在其中一个重构节点内最后被执行的AIV算子节点与发送算子节点之间增加数据边;以及在包括另一个重构节点的流中增加接收算子节点,另外一个重构节点依赖于接收算子节点,也即在接收算子节点与另外一个重构节点内最先被执行的节点之间增加有向边(如数据边),例如在另外一个重构节点内最先被执行的AIC算子节点与接收算子节点之间增加数据边;并且发送算子节点与接收算子节点之间有控制边,从而转换成了两个流之间的控制边来表达;而其中一个重构节点内最后被执行的AIV算子节点与发送算子节点之间增加的数据边、发送算子节点与接收算子节点之间的控制边可以表达成发送算子节点的依赖关系,并编译在其中一个重构节点的编译数据中,同理,发送算子节点与接收算子节点之间的控制边、接收算子节点与另外一个重构节点内最先被执行的AIC算子节点之间的数据边可以表达成接收算子节点的依赖关系,并编译在另外一个重构节点的编译数据中。
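上述将跨流控制边拆分为"重构节点→发送算子节点→接收算子节点→重构节点"三段有向边的变换,可用如下假设性的Python函数示意(函数名与节点命名方式均为本示例的假设):

```python
def split_cross_stream_edge(src, dst):
    """将不同流内两个重构节点之间的控制边,转换为:
    src -> send 的有向边、send -> recv 的控制边(携带同步信号)、
    recv -> dst 的有向边,共三条边(假设性示意)。"""
    send = f"send_{src}_{dst}"  # 在 src 所在流中增加的发送算子节点
    recv = f"recv_{src}_{dst}"  # 在 dst 所在流中增加的接收算子节点
    return [(src, send), (send, recv), (recv, dst)]
```

例如对"重构节点0"与"重构节点1"之间的跨流控制边调用该函数,即得到可分别编译进两个重构节点的编译数据中的三条边。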
(2)分布式执行的表达:
由于神经网络模型可以分布式执行,本申请在以重构节点或者线程级子图为调度单位时,也支持分布式执行,例如不同的重构节点可以在不同的计算装置中执行,或不同的线程级子图可以在不同的计算装置中执行,从而实现分布式执行。需要说明的是,该不同重构节点可以是同一重构子图中的重构节点,也可以是分别在不同重构子图中的重构节点;不同的线程级子图可以对同一重构子图内的重构节点进行线程划分得到,也可以分别对不同的重构子图内的重构节点进行线程划分得到。
在神经网络的分布式执行时,各计算装置之间涉及数据交互,本申请在以重构节点或者线程级子图为调度单位进行分布式执行时,同样也可以实现各计算装置之间的数据交互;其中,本申请中的计算装置可以是芯片(die)、处理器等。本申请通过一些集合通信算子来实现多个计算装置以及多个服务器(server)之间的数据通信,例如通过集合通信算子将数据从一个计算装置中传输至另一个计算装置中。因此,本申请中集合通信算子节点与重构节点内的节点之间存在有向边,或集合通信算子节点与线程级子图内的子节点之间存在有向边,而集合通信算子节点与重构节点内的节点之间的有向边,或集合通信算子节点与线程级子图内的子节点之间的有向边可以表达为集合通信算子节点的依赖关系、重构节点内的节点的依赖关系、或线程级子图内的子节点的依赖关系;其中,集合通信算子节点的依赖关系可以表达为集合通信算子节点依赖的节点或子节点的数量和依赖该集合通信算子节点的节点或子节点,也即通过pred_cnt和succ_list进行表达。应理解,集合通信算子节点也即表示集合通信算子的节点,集合通信算子节点与计算图、子图或重构节点中的节点一样,可以进行线程划分以及编译等操作;其中,当集合通信算子节点需要搬运的数据是子节点的输出数据时,可选地将集合通信算子节点进行线程划分。
举例来说,请参阅图19,图19是本申请实施例提供的一种分布式执行的表达的示意图。如图19所示,重构节点0和重构节点1进行分布式执行,重构节点0内的节点D的输出数据(输 出内存块D)为重构节点1内的节点Q的输入数据,重构节点0内的节点H的输入数据包括重构节点1内的节点M的输出数据(输出内存块M);计算装置0执行重构节点0,故计算装置0执行节点D和节点H;计算装置1执行重构节点1,故计算装置1执行节点M和节点Q;输出内存块D由数据管理单元(data management unit,DMU)节点0从计算装置0迁移至计算装置1中,以作为节点Q的输入数据;输出内存块M由数据管理单元节点2从计算装置1迁移至计算装置0中,以作为节点H的输入数据;数据管理单元节点0和数据管理单元节点2均为集合通信算子节点。如此,在编译时,可以将节点D与数据管理单元节点0之间的有向边表达为节点D的依赖关系,并编译在节点D的编译数据中,且该重构节点0的编译数据包括节点D的编译数据;将节点D与数据管理单元节点0之间的有向边、数据管理单元节点0与节点Q之间的有向边表达为数据管理单元节点0的依赖关系,并编译在数据管理单元节点0的编译数据中,且该重构节点0的编译数据包括数据管理单元节点0的编译数据;以及将数据管理单元节点0与节点Q之间的有向边表达为节点Q的依赖关系,并编译在节点Q的编译数据中,且该重构节点1的编译数据包括节点Q的编译数据。同理,在编译时,可以将节点M与数据管理单元节点2之间的有向边表达为节点M的依赖关系,并编译在节点M的编译数据中,且该重构节点1的编译数据包括节点M的编译数据;将节点M与数据管理单元节点2之间的有向边、数据管理单元节点2与节点H之间的有向边表达为数据管理单元节点2的依赖关系,并编译在数据管理单元节点2的编译数据中,且该重构节点1的编译数据包括数据管理单元节点2的编译数据;以及将数据管理单元节点2与节点H之间的有向边表达为节点H的依赖关系,并编译在节点H的编译数据中,且该重构节点0的编译数据包括节点H的编译数据。此外,图19中还包括数据管理单元节点1和数据管理单元节点3两个集合通信算子节点,节点H与数据管理单元节点1之间的有向边,数据管理单元节点1与节点F之间的有向边,节点Q与数据管理单元节点3之间的有向边,以及数据管理单元节点3与节点E之间的有向边,也采用相同的方式表达,此处不再赘述。
又举例来说,请参阅图20,图20是本申请实施例提供的一种分布式执行的表达的示意图。如图20所示,线程级子图0和线程级子图1进行分布式执行,线程级子图1内的子节点30的输出数据(输出内存块30)为线程级子图0内的子节点13的输入数据;计算装置0执行线程级子图0,故计算装置0执行子节点13;计算装置1执行线程级子图1,故计算装置1执行子节点30;输出内存块30由数据管理单元节点1从计算装置1迁移至计算装置0中,以作为子节点13的输入数据。如此,在编译时,可以将子节点30与数据管理单元节点1之间的有向边表达为子节点30的依赖关系,并编译在子节点30的编译数据中,且该线程级子图1的编译数据包括子节点30的编译数据;将子节点30与数据管理单元节点1之间的有向边、数据管理单元节点1与子节点13之间的有向边表达为数据管理单元节点1的依赖关系,并编译在数据管理单元节点1的编译数据中,且该线程级子图1的编译数据包括数据管理单元节点1的编译数据;以及将数据管理单元节点1与子节点13之间的有向边表达为子节点13的依赖关系,并编译在子节点13的编译数据中,且该线程级子图0的编译数据包括子节点13的编译数据。应理解,数据管理单元节点1可以看作是将数据管理单元节点1划分成1个线程的子节点;此外,图20中还包括数据管理单元节点00和数据管理单元节点01,数据管理单元节点00和数据管理单元节点01是对数据管理单元节点0进行线程划分得到的。其中,线程级子图0内的子节点12的输出数据(输出内存块12)为线程级子图1内的子节点50的输入数据,线程级子图0内的子节点13的输出数据(输出内存块13)为线程级子图1内的子节点51的输入数据;子节点12和子节点13由计算装置0执行,子节点50和子节点51由计算装置1执行;输出内存块12由数据管理单元子节点01从计算装置0迁移至计算装置1中,以作为子节点50的输入数据;输出内存块13由数据管理单元子节点00 从计算装置0迁移至计算装置1中,以作为子节点51的输入数据。如此,在编译时,可以将子节点12与数据管理单元节点01之间的有向边表达为子节点12的依赖关系,并编译在子节点12的编译数据中,且该线程级子图0的编译数据包括子节点12的编译数据;将子节点12与数据管理单元节点01之间的有向边、数据管理单元节点01与子节点50之间的有向边表达为数据管理单元节点01的依赖关系,并编译在数据管理单元节点01的编译数据中,且该线程级子图0的编译数据包括数据管理单元节点01的编译数据;以及将数据管理单元节点01与子节点50之间的有向边表达为子节点50的依赖关系,并编译在子节点50的编译数据中,且该线程级子图1的编译数据包括子节点50的编译数据。同理,在编译时,可以将子节点13与数据管理单元节点00之间的有向边表达为子节点13的依赖关系,并编译在子节点13的编译数据中,且该线程级子图0的编译数据包括子节点13的编译数据;将子节点13与数据管理单元节点00之间的有向边、数据管理单元节点00与子节点51之间的有向边表达为数据管理单元节点00的依赖关系,并编译在数据管理单元节点00的编译数据中,且该线程级子图0的编译数据包括数据管理单元节点00的编译数据;以及将数据管理单元节点00与子节点51之间的有向边表达为子节点51的依赖关系,并编译在子节点51的编译数据中,且该线程级子图1的编译数据包括子节点51的编译数据。
二、图执行装置200。
图执行装置200包括调度单元201和执行单元202,下面具体说明:
调度单元201:用于以重构节点或线程级子图为调度单位,将图编译装置100输出的一个个重构节点或线程级子图的编译数据调度给执行单元202。
执行单元202:用于以重构节点或线程级子图为调度单位对重构节点或线程级子图的编译数据进行执行。
具体地,请参阅图21,图21是本申请实施例提供的一种图执行装置200的结构示意图,如图21所示,图执行装置200还可以包括缓存203和内存204。其中,图编译装置100输出的一个个重构节点或线程级子图的编译数据最开始是存储在内存204中,图执行装置200需要在执行重构节点或线程级子图之前将重构节点或线程级子图的编译数据从内存204中预加载到缓存203中,从而提高运行时的性能。其中,调度单元201可以是软件单元,例如微控制器(micro control unit,MCU)固件(firmware),而执行单元202为硬件加速单元,从而通过软硬件配合的方式完成重构节点或线程级子图的执行过程,调度单元201负责重构节点或线程级子图的编译数据的调度,执行单元202对重构节点或线程级子图的编译数据进行执行,具体流程如下:
①、图执行装置200先将重构节点或线程级子图的编译数据预加载进缓存203中,并通知调度单元201启动执行,调度单元201从缓存203中读取初始就绪的节点或子节点的任务。
其中,重构节点包括至少一个节点的编译数据,且每个节点的编译数据包括该节点对应的任务和该节点的描述符,而该节点的描述符包括该节点的依赖关系,该节点的依赖关系可以表达为该节点依赖的节点数量以及依赖该节点的其他节点,若该节点初始依赖的节点数量为0(也即pred_cnt=0),则该节点对应的任务为初始就绪的任务。同理,线程级子图包括至少一个子节点的编译数据,且每个子节点的编译数据包括该子节点对应的任务和该子节点的描述符,而该子节点的描述符包括该子节点的依赖关系,该子节点的依赖关系可以表达为该子节点依赖的子节点数量以及依赖该子节点的其他子节点,若该子节点初始依赖的子节点数量为0,则该子节点对应的任务为初始就绪的任务。
②、调度单元201将就绪的节点或子节点的任务推送到执行单元202中执行。
其中,就绪的节点是指依赖的节点的数量为0的节点,包括初始就绪的节点和后期就绪的节点;后期就绪的节点是指:某个节点初始依赖的节点的数量不为0,但在重构节点的执行过程中,每执行完一个该节点依赖的节点后,则该节点依赖的节点的数量减1(也即pred_cnt--),等该节点依赖的节点都已经被执行完以后,也即该节点依赖的节点的数量减为0以后,该节点为就绪的节点。同理,就绪的子节点是指依赖的子节点的数量为0的子节点,包括初始就绪的子节点和后期就绪的子节点;后期就绪的子节点是指:某个子节点初始依赖的子节点的数量不为0,但在线程级子图的执行过程中,每执行完一个该子节点依赖的子节点后,则该子节点依赖的子节点的数量减1,等该子节点依赖的子节点都已经被执行完以后,也即该子节点依赖的子节点的数量减为0以后,该子节点为就绪的子节点。
③、执行单元202从缓存203中读取该就绪的节点或子节点的描述符,基于该就绪的节点或子节点的描述符执行该就绪的节点或子节点的任务。
其中,节点的描述符包括以下至少一项:该节点的输入数据的存储位置、生命周期和缓存管理操作指示信息,该节点的输出数据的存储位置、生命周期和缓存管理操作指示信息,该节点的依赖关系,用于执行该节点的计算装置的类型;故执行单元202基于该就绪的节点的描述符包括的这些信息来执行该就绪的节点。同理,子节点的描述符包括以下至少一项:该子节点的输入数据的存储位置、生命周期和缓存管理操作指示信息,该子节点的输出数据的存储位置、生命周期和缓存管理操作指示信息,该子节点的依赖关系,用于执行该子节点的计算装置的类型,与该子节点并行执行的其他子节点;故执行单元202基于该就绪的子节点的描述符包括的这些信息来执行该就绪的子节点。
④、执行单元202执行完该就绪的节点或子节点的任务,通知调度单元201。
⑤、调度单元201读取依赖于当前执行完成的节点的其他节点依赖的节点的数量,将该其他节点依赖的节点的数量减1,当该其他节点依赖的节点的数量减为0,将该其他节点的任务推送到执行单元202中执行。
重复上述步骤②至⑤,直到重构节点或线程级子图执行完成。
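上述步骤②至⑤构成的调度循环,可用如下一段假设性的Python代码示意(函数名、数据结构均为本示例的假设,并非调度单元201固件的实际实现):

```python
from collections import deque

def run_reconstructed_node(deps, execute):
    """按 pred_cnt / succ_list 调度执行一个重构节点内的所有节点(示意)。

    deps:    {节点名: {"pred_cnt": int, "succ_list": [后继节点名]}},会被本函数原地修改
    execute: 执行单个节点任务的回调,模拟执行单元202
    返回:    节点的实际执行顺序
    """
    # 步骤①:读取初始就绪的节点(pred_cnt == 0)
    ready = deque(n for n, d in deps.items() if d["pred_cnt"] == 0)
    done = []
    while ready:
        node = ready.popleft()
        execute(node)          # 步骤②③:将就绪节点的任务推送到执行单元执行
        done.append(node)
        # 步骤④⑤:执行完成后,将后继节点的 pred_cnt 减1,减为0则变为就绪
        for succ in deps[node]["succ_list"]:
            deps[succ]["pred_cnt"] -= 1
            if deps[succ]["pred_cnt"] == 0:
                ready.append(succ)
    return done
```

例如对一个"A、B汇聚到C,C再到D"的菱形依赖,该循环保证C在A、B之后、D在C之后执行。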
作为一种实现方式,图编译装置100也包括内存,图编译装置100编译得到重构节点或线程级子图的编译数据后,可以将其存放在图编译装置100的内存;图执行装置200在执行该重构节点或线程级子图之前,图编译装置100将该重构节点或线程级子图的编译数据提前写入到缓存203中;或者,图执行装置200在执行该重构节点或线程级子图之前,将该重构节点或线程级子图的编译数据提前预加载到缓存203中。如此,可以提高运行时的性能;例如,减少对图执行装置200侧的内存204的占用,充分利用图编译装置100侧的内存;减少图编译装置100的处理步骤,调整图编译装置100与图执行装置200的异步流水线使得流水更加均衡。其中,图编译装置100可以设置在主机侧或者就是主机,图执行装置200可以设置在设备侧或者就是设备侧。
举例来说,请参阅图22和图23,图22和图23分别是本申请实施例提供的一种主机侧与设备侧的异步流水线的示意图。如图22所示,主机侧和设备侧可以实现异步流水操作,主机侧对一个重构节点或线程级子图进行编译以得到该重构节点或线程级子图的编译数据,并将该重构节点或线程级子图的编译数据存放在主机侧的内存中;主机侧在设备侧执行该重构节点或线程级子图之前,将该重构节点或线程级子图的编译数据写入设备侧;设备侧开始执行该重构节点或线程级子图,并且主机侧开始对另一个重构节点或线程级子图进行编译以得到该另一个重构节点或线程级子图的编译数据,并将该另一个重构节点或线程级子图的编译数据存放在主机侧内存;在设备侧执行完该重构节点或线程级子图之后,执行该另一个重构节点或线程级子图之前,主机侧将该另一重构节点或线程级子图的编译数据写入设备侧,重复上 述操作实现异步流水。如图23所示,主机侧和设备侧可以实现异步流水操作,主机侧对一个重构节点或线程级子图进行编译以得到该重构节点或线程级子图的编译数据,并将该重构节点或线程级子图的编译数据存放在主机侧的内存中;设备侧执行该重构节点或线程级子图之前,从主机侧读取该重构节点或线程级子图的编译数据;设备侧开始执行该重构节点或线程级子图,并且主机侧开始对另一个重构节点或线程级子图进行编译以得到该另一个重构节点或线程级子图的编译数据,并将该另一个重构节点或线程级子图的编译数据存放在主机侧内存;设备侧在执行完该重构节点或线程级子图之后,执行该另一个重构节点或线程级子图之前,从主机侧读取该另一重构节点或线程级子图的编译数据,重复上述操作实现异步流水。其中,主机侧与设备侧通过高速串行计算机扩展总线标准通道(PCIe-through)的方式将重构节点或线程级子图的编译数据预加载到设备侧的缓存中,可以带来以下方面的性能提升:减少对设备侧的内存占用,充分利用主机侧的内存;减少主机侧的处理步骤,调整异步流水线使得流水更加均衡。
作为一种实现方式,重构节点中的任意一个节点的编译数据或线程级子图中的任意一个子节点的编译数据预加载至缓存203中的时机如下描述:
(1)对于初始就绪的节点,在执行该重构节点之前,将该初始就绪的节点的编译数据预加载至缓存203中;同理,对于初始就绪的子节点,在执行该线程级子图之前,将该初始就绪的子节点的编译数据预加载至缓存203中。
(2)对于后期就绪的节点,在调度单元201收到该后期就绪的节点依赖的节点执行完成之后,例如该后期就绪的节点依赖的最先被执行的一个节点执行完成之后,对该后期就绪的节点依赖的节点的数量进行减1之前,将该后期就绪的节点的编译数据预加载至缓存203中;如此,待该后期就绪的节点依赖的节点的数量减到0时,推送到硬件执行时,该后期就绪的节点的编译数据已经在缓存203中了。
举例来说,请参阅图24,图24是本申请实施例提供的一种重构节点中节点之间的依赖关系或线程级子图中的子节点之间的依赖关系示意图。如图24所示,重构节点中包括节点0、节点1和节点2,节点1和节点2均依赖于节点0;其中,节点0的编译数据是在执行该重构节点之前预加载至缓存203中的;而节点1的编译数据是在节点0执行完成以后,对节点1依赖的节点的数量进行减1之前,预加载至缓存203中的;以及节点2的编译数据是在节点0执行完成以后,对节点2依赖的节点的数量进行减1之前,预加载至缓存203中的。同理,线程级子图中包括子节点0、子节点1和子节点2,子节点1和子节点2均依赖于子节点0;其中,子节点0的编译数据是在执行该线程级子图之前预加载至缓存203中的;而子节点1的编译数据是在子节点0执行完成以后,对子节点1依赖的子节点的数量进行减1之前,预加载至缓存203中的;子节点2的编译数据是在子节点0执行完成以后,对子节点2依赖的子节点的数量进行减1之前,预加载至缓存203中的。
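上述两种预加载时机(初始就绪的节点在执行重构节点之前预加载;后期就绪的节点在其依赖的节点执行完成之后、pred_cnt减1之前预加载),可用如下假设性的Python函数示意(函数名与返回的标记字符串均为本示例的假设):

```python
def preload_plan(deps):
    """给出重构节点内每个节点的编译数据的预加载时机(假设性示意)。

    deps: {节点名: {"pred_cnt": 初始依赖的节点数量}}
    返回: {节点名: 预加载时机标记}
    """
    plan = {}
    for node, d in deps.items():
        if d["pred_cnt"] == 0:
            # 初始就绪:在执行该重构节点之前预加载至缓存
            plan[node] = "before_execution"
        else:
            # 后期就绪:在其依赖的节点执行完成之后、pred_cnt 减1之前预加载
            plan[node] = "after_predecessor_done_before_decrement"
    return plan
```

以图24为例,节点0在执行重构节点前预加载,节点1与节点2均在节点0执行完成后、对其pred_cnt减1之前预加载。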
作为一种实现方式,重构节点或线程级子图的输入数据也可以存放在图编译装置100的内存中,达到进一步减小图执行装置200侧的内存占用和调整图编译装置100侧异步流水线更加均衡的目的。其中,重构节点的输入数据主要是指非重构节点内的节点的输出数据,也即从重构节点外进来的数据,包括重构节点内的初始就绪的节点的输入数据;同理,线程级子图的输入数据主要是指非线程级子图内的子节点的输出数据,也即从线程级子图外进来的数据,包括线程级子图内的初始就绪的子节点的输入数据。
举例来说,图编译装置100可以设置在主机侧或者就是主机,图执行装置200可以设置在设备侧或者就是设备,重构节点或线程级子图的输入数据亦可放在主机侧内存中,设备侧通过直接存储器访问(Direct Memory Access,DMA)的方式将该重构节点或线程级子图的输入数据预加载到设备侧缓存中,从而达到进一步减小设备侧内存占用和调整主机侧异步流水线更加均衡的目的。
作为一种实现方式,节点或子节点的输入、输出数据的缓存管理操作的时机如下描述:
(1)输入数据发起缓存管理操作。
对于重构节点的输入数据(也即从重构节点外进来的数据)作为该重构节点的某个节点的输入数据,根据该输入数据的生命周期对该输入数据进行预取操作,也即在执行该重构节点或该节点之前,将该输入数据写入至缓存203中;并且,根据该输入数据的生命周期对该输入数据进行无效操作,也即在该输入数据最后一次作为重构节点内的其他节点的输入数据之后,删除缓存203中的该输入数据。对于线程级子图的输入数据(也即从线程级子图外进来的数据)作为该线程级子图的某个子节点的输入数据,根据该输入数据的生命周期对该输入数据进行预取操作,也即在执行该线程级子图或该子节点之前,将该输入数据写入至缓存203中;并且,根据该输入数据的生命周期对该输入数据进行无效操作,也即在该输入数据最后一次作为线程级子图内的其他子节点的输入数据之后,删除缓存203中的该输入数据。
(2)输出数据发起缓存管理操作。
对于重构节点内的某个节点的输出数据,根据该输出数据的生命周期对该输出数据进行回写操作,也即将该节点的输出数据写入内存中;并且,根据该输出数据的生命周期对该输出数据进行刷新操作,也即在将该节点的输出数据写入内存中之后,删除缓存中的该节点的输出数据;以及根据该输出数据的生命周期对该输出数据进行无效操作,也即在该输出数据最后一次作为重构节点内的其他节点的输入数据之后,删除缓存203中的该输出数据。同理,对于线程级子图内的某个子节点的输出数据,根据该输出数据的生命周期对该输出数据进行回写操作,也即将该子节点的输出数据写入内存中;并且,根据该输出数据的生命周期对该输出数据进行刷新操作,也即在将该子节点的输出数据写入内存中之后,删除缓存中的该子节点的输出数据;以及根据该输出数据的生命周期对该输出数据进行无效操作,也即在该输出数据最后一次作为线程级子图内的其他子节点的输入数据之后,删除缓存203中的该输出数据。
举例来说,请参阅图25,图25是本申请实施例提供的一种节点或子节点的输入、输出数据的缓存管理操作示意图。如图25所示,节点1为重构节点内的其中一个节点,将就绪(pred_cnt=0)的节点1对应的任务推送到准备运行队列(readyq)中,从完成队列(cq)中收到节点1对应的任务的完成响应,查询依赖于节点1的后续节点对应的任务。对于节点1的输入数据——数据0而言,如果数据0为重构节点外进来的数据,则需要对数据0进行预取操作;在收到节点1对应的任务的完成响应后,将数据0的消费次数(cons_cnt)减1,当数据0的消费次数减到0时,对数据0发起无效操作。对于节点1的输出数据——数据2而言,则将数据2的生产次数(prod_cnt)减1,当数据2的生产次数减为0时,对数据2发起回写操作,例如数据2的生产次数为1,数据2的生产次数减一次之后就发起回写操作。同理,子节点1为线程级子图内的其中一个子节点,将就绪(pred_cnt=0)的子节点1对应的任务推送到准备运行队列(readyq)中,从完成队列(cq)中收到子节点1对应的任务的完成响应,查询依赖于子节点1的后续子节点对应的任务。对于子节点1的输入数据——数据0而言,如果数据0为线程级子图外进来的数据,则需要对数据0进行预取操作;在收到子节点1对应的任务的完成响应后,将数据0的消费次数(cons_cnt)减1,当数据0的消费次数减到0时,对数据0发起无效操作。对于子节点1的输出数据——数据2而言,则将数据2的生产次数(prod_cnt)减1,当数据2的生产次数减为0时,对数据2发起回写操作,例如数据2的生产次数为1,数据2的生产次数减一次之后就发起回写操作。其中,调度单元201通过节点1或子节点1的描述符指定硬件发起对数据0的预取操作:当数据0需要预取时,调度单元201将节点1或子节点1的描述符推送到准备运行队列中之后,由硬件择机发起预取。
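图25中基于消费次数(cons_cnt)与生产次数(prod_cnt)触发无效、回写操作的时机,可用如下假设性的Python代码示意(类名与方法名均为本示例的假设):

```python
class DataLifetime:
    """用消费次数(cons_cnt)与生产次数(prod_cnt)决定缓存操作的触发时机(假设性示意)。"""

    def __init__(self, cons_cnt, prod_cnt):
        self.cons_cnt = cons_cnt  # 该数据还会被消费(作为输入)的次数
        self.prod_cnt = prod_cnt  # 该数据还会被生产(作为输出)的次数
        self.ops = []             # 按顺序记录被触发的缓存管理操作

    def on_consumed(self):
        # 每收到一次消费该数据的任务的完成响应,消费次数减1;减到0时发起无效操作
        self.cons_cnt -= 1
        if self.cons_cnt == 0:
            self.ops.append("invalid")

    def on_produced(self):
        # 每完成一次生产,生产次数减1;减到0时发起回写操作
        self.prod_cnt -= 1
        if self.prod_cnt == 0:
            self.ops.append("write_back")
```

例如生产次数为1的数据2,在生产次数减一次之后即触发回写操作;消费次数为2的数据0,在第二次消费完成后才触发无效操作。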
下面结合具体实施方式从方法侧对本申请提供的以图为调度单位的图编译和图执行流程进行详细的介绍。
一、图编译阶段。
请参阅图26,图26是本申请实施例提供的一种子图的编译方法的流程示意图,该子图的编译方法应用于图编译装置100;该子图的编译方法包括但不限于如下操作或步骤:
步骤2601:获取第一子图,所述第一子图为由计算图切分得到的多个子图中的任意一个,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边。
其中,本申请中的计算图为神经网络模型的计算图,一张计算图表示一个神经网络模型的运算。
应理解,计算图中的有向边包括数据边和控制边,本申请中所描述的有向边也均包括数据边和控制边,例如多个第一节点之间的有向边包括多个第一节点之间的数据边和控制边。
步骤2602:对所述第一子图进行重构,以得到第一重构子图,所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边。
应理解,第一重构子图也是基于图3所描述的重构原理对第一子图进行重构得到的。
步骤2603:对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
其中,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据。
其中,由于一个第一重构节点包括M个第一节点和该M个第一节点之间的有向边,也即一个第一重构节点本质上是一张比第一子图的规模更小的子图;故调度一个第一重构节点进行编译,也即调度一张比第一子图的规模更小的子图进行编译,因此调度第一目标重构节点进行编译的过程是一种图调度机制的编译过程。
在本申请实施例中,将计算图切分为多个子图,然后对多个子图中的任意一个子图进行重构以得到重构子图,再以重构子图中的重构节点为调度单位实现对该重构子图中的任意一个重构节点进行编译。需要说明的是,本申请得到的重构子图包括至少一个重构节点和该至少一个重构节点之间的有向边,故本申请得到的重构子图实质上还是一张子图;并且,由于该重构子图中任意一个重构节点包括该任意一个子图中的一个或多个节点和一个或多个节点之间的有向边,故重构节点实质上是一张比该任意一个子图规模更小的子图。在子图编译时,以重构节点为调度单位实质上是一种图调度机制,且一次调度一个重构节点进行编译等于一次调度该任意一个子图中的一个或多个节点进行编译;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点,在子图编译时以重构节点为调度单位可以减少调度时间。进一步地,对该任意一个重构节点进行编译得到的该任意一个重构节点的编译数据用于执行该任意一个子图中的一个或多个节点,在子图执行时,一次调度一个重构节点的编译数据用于执行等于一次调度该任意一个子图中的一个或多个节点的编译数据进行执行,也即本申请以重构节点为调度单位,一次调度一个重构节点以执行该 一个重构节点,等于一次调度该任意一个子图中的一个或多个节点以执行该一个或多个节点;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点以执行该一个节点,在子图执行时以重构节点为调度单位可以减少调度时间。例如,对于由计算图切分得到的多个子图中的第一子图而言,第一子图包括多个第一节点和多个第一节点之间的有向边,由第一子图重构得到的第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,其中的第一目标重构节点包括第一子图中的M个第一节点和M个第一节点之间的有向边,M为正整数;对该第一目标重构节点进行编译,等于对该M个第一节点进行编译,并且该第一目标重构节点的编译数据用于执行该M个第一节点,也即执行该第一目标重构节点等于执行该M个第一节点。综上,本申请实施例提供了一种图调度机制,以图结构的重构节点为调度单位,能够减少调度时间。
在一种可能的实现方式中,所述对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,包括:将所述M个第一节点中的每个第一节点划分为至少一个第一子节点,以得到第一线程级子图,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数;对所述第一线程级子图进行编译,以得到所述第一目标重构节点的编译数据。其中,所述第一目标重构节点的编译数据包括所述N个第一子节点的编译数据。需要说明的是,本申请所描述的子节点是对节点进行线程划分得到的,也即一个节点划分为至少一个子节点,该至少一个子节点中的每个子节点为一个线程,该至少一个子节点可以并行执行。
应理解,第一线程级子图也是基于图4或图5所描述的线程划分原理对该第一目标重构节点进行线程划分得到的。
在本实现方式中,任意一个重构节点包括M个第一节点和M个第一节点之间的有向边,在对该任意一个重构节点进行编译时,将M个第一节点中的每个第一节点划分为至少一个第一子节点,也即将M个第一节点中的每个第一节点所代表的算子操作划分为至少一个线程,至少一个线程中的每个线程通过一个子节点表示;M个第一节点进行线程划分后得到N个第一子节点,且N为大于或等于M的正整数,N个第一子节点和N个第一子节点之间的有向边构成第一线程级子图;对该第一线程级子图进行编译,可以得到该第一目标重构节点的编译数据,且该第一目标重构节点的编译数据包括N个第一子节点的编译数据;由于执行N个第一子节点等于执行了M个第一节点,而由同一个第一节点划分得到的至少一个第一子节点可以并发执行,如此,在调度N个第一子节点的编译数据执行这N个第一子节点时,N个第一子节点中由同一个第一节点划分得到的第一子节点可以并发执行,从而减少M个第一节点中的每个第一节点的执行时间,也即减少M个第一节点的总执行时间,提升性能。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作(cache manage operation,CMO)指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。需要说明的是,为了便于描述,本申请中的“对象”这一术语代指节点或子节点。
其中,输入、输出数据的存储位置包括在内存中的存储位置;缓存管理操作指示信息包括预取操作、无效操作、回写操作、刷新操作等指示信息,预取操作、无效操作、回写操作、刷新操作的具体过程如图10至图12所示。
在本实现方式中,可以将任意一个节点或子节点的输入、输出数据的存储位置、生命周期和缓存管理操作指示信息、依赖关系、执行其的计算装置的类型等信息中的一项或多项编译进该任意一个节点或子节点的编译数据中,如此,有利于该任意一个节点或子节点的执行,提升性能。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在本实现方式中,可以将与任意一个子节点并行执行的其他子节点的信息编译进该任意一个子节点的编译数据中,如此,有利于该任意一个子节点与其他子节点并行执行。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据,也即在所述目标输入数据最后一次作为所述第一目标重构节点中的第一节点的输入数据之后,或在所述目标输入数据最后一次作为所述第一线程级子图中的第一子节点的输入数据之后,删除所述缓存中的所述目标输入数据。
应理解,第一预设次数也即为目标输入数据在第一目标重构节点或第一线程级子图中的消费次数。
在本实现方式中,如果重构节点中的任意一个节点的输入数据包括从该重构节点外进来的目标输入数据,那么该节点的输入数据的缓存管理操作指示信息需要指示:在执行该节点之前将该目标输入数据从内存中写入缓存中,以使得该节点可以执行;可选地,该节点的输入数据的缓存管理操作指示信息还可以指示:在目标输入数据不需要作为该重构节点中的其他节点的输入数据之后,删除缓存中的目标输入数据,从而合理释放缓存的空间。同理,如果线程级子图中的任意一个子节点的输入数据包括从该线程级子图外进来的目标输入数据,那么该子节点的输入数据的缓存管理操作指示信息需要指示:在执行该子节点之前将该目标输入数据从内存中写入缓存中,以使得该子节点可以执行;可选地,该子节点的输入数据的缓存管理操作指示信息还可以指示:在目标输入数据不需要作为该线程级子图中的其他子节点的输入数据之后,删除缓存中的目标输入数据,从而合理释放缓存的空间。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据,也即在所述第一对象的输出数据最后一次作为所述第一目标重构节点中的第一节点的输入数据之后,或在所述第一对象的输出数据最后一次作为所述第一线程级子图中的第一子节点的输入数据之后,删除缓存中的所述第一对象的输出数据。
应理解,第二预设次数也即为第一对象的输出数据在第一目标重构节点或第一线程级子图中的消费次数。
在本实现方式中,重构节点中的任意一个节点的输出数据的缓存管理操作指示信息可选 地指示:将该节点的输出数据写入内存中,以便该节点的输出数据用于他用;在将该节点的输出数据写入内存中之后或在该节点的输出数据不需要作为该重构节点中的其他节点的输入数据之后,删除缓存中的该节点的输出数据,从而合理释放缓存的空间。同理,线程级子图中的任意一个子节点的输出数据的缓存管理操作指示信息可选地指示:将该子节点的输出数据写入内存中,以便该子节点的输出数据用于他用;在将该子节点的输出数据写入内存中之后或在该子节点的输出数据不需要作为该线程级子图中的其他子节点的输入数据之后,删除缓存中的该子节点的输出数据,从而合理释放缓存的空间。
在一种可能的实现方式中,所述第一对象的依赖关系包括:所述第一对象依赖的节点的数量和依赖于所述第一对象的节点,或所述第一对象依赖的子节点的数量和依赖于所述第一对象的子节点。
在本实现方式中,重构节点中的任意一个节点的依赖关系包括该节点依赖的节点的数量,在执行节点时,当在节点依赖的节点的数量减为0时,可以执行该节点;并且,节点的依赖关系还包括依赖于该节点的其他节点;如此,可以控制重构节点中的节点之间的执行顺序。同理,线程级子图中的任意一个子节点的依赖关系包括该子节点依赖的子节点的数量,在执行子节点时,当在子节点依赖的子节点的数量减为0时,可以执行该子节点;并且,该子节点的依赖关系还包括依赖于该子节点的其他子节点;如此,可以控制线程级子图中的子节点之间的执行顺序。
在一种可能的实现方式中,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在本实现方式中,重构节点中的节点之间的有向边在编译时可以表达为节点的依赖关系,例如表达为节点的第一依赖关系,从而实现重构节点中的节点之间的有向边的表达。同理,线程级子图中的子节点之间的有向边在编译时可以表达为子节点的依赖关系,例如表达为子节点的第一依赖关系,从而实现线程级子图中的子节点之间的有向边的表达;其中,线程级子图中的子节点之间的有向边主要指由不同节点划分得到的子节点之间的有向边。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的任意一个第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。需要说明的是, 本申请通过一些集合通信算子来实现多个计算装置(例如多个芯片或多个处理器)以及多个服务器(server)之间的数据通信,例如集合通信算子。本申请可以通过第一集合通信算子将数据从第一计算装置传输至第二计算装置,也即第一通信算子节点为表示第一集合通信算子的节点;其中,集合通信算子可以被表示为集合通信算子节点,故第第一通信算子节点即为第一集合通信算子节点;并且,集合通信算子节点与计算图、子图或重构节点中的节点一样,可以进行线程划分。进一步需要说明的是,本申请得到第二重构子图方式和得到第一重构子图的方式相同;其中,第一重构子图是对第一子图进行重构得到的,第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,因为第一子图包括多个第一节点和多个第一节点之间的有向边,故第一重构子图中的第一目标重构节点包括第一子图中的多个第一节点中的至少一个第一节点和该至少一个第一节点之间的有向边;同理,第二重构子图是对第二子图进行重构得到的,第二重构子图包括至少一个第二重构节点和至少一个第二重构节点之间的有向边,因为第二子图包括多个第二节点和多个第二节点之间的有向边,故第二重构子图中的任意一个第二重构节点包括第二子图中的多个第二节点中的至少一个第二节点和该至少一个第二节点之间的有向边。
应理解,第二重构子图也是基于图3所描述的重构原理的,第二线程级子图也是基于图4或图5所描述的线程划分原理得到的。
在本实现方式中,多个重构节点可以分布式执行,以两个重构节点、两个计算装置为例,其中一个重构节点由其中一个计算装置执行,另外一个重构节点由另一个计算装置执行,其中,这两个重构节点可以是同一重构子图中的重构节点,也可以是不同重构子图中的重构节点;如果其中一个重构节点内的节点的输出数据为另外一个重构节点内的节点的输入数据,那么需要将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个重构节点内的节点的输入数据,故其中一个重构节点内的节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个重构节点内的节点存在有向边;那么,可以将其中一个重构节点内的节点与集合通信算子节点之间存在有向边表达为其中一个重构节点内的节点的依赖关系,以及将集合通信算子节点与另外一个重构节点内的节点存在有向边表达为另外一个重构节点内的节点的依赖关系,例如表达为其中一个重构节点内的节点第二依赖关系和另外一个重构节点内的节点的第二依赖关系;如此,基于重构节点内的节点与集合通信算子节点之间的有向边的表达,可以实现多个重构节点的分布式执行的表达。同理,多个线程级子图可以分布式执行,以两个线程级子图、两个计算装置为例,其中一个线程级子图由其中一个计算装置执行,另外一个线程级子图由另一个计算装置执行,其中,这两个线程级子图可以是分别对同一重构子图中的不同重构节点进行线程划分得到,也可以是分别对不同重构子图中的重构节点进行子线程划分得到;如果其中一个线程级子图内的子节点的输出数据为另外一个线程级子图内的子节点的输入数据,那么需要将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个线程级子图内的子节点的输入数据,故其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个线程 级子图内的子节点存在有向边;那么,可以将其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边表达为其中一个线程级子图内的子节点的依赖关系,以及将集合通信算子节点与另外一个线程级子图内的子节点存在有向边表达为另外一个线程级子图内的子节点的依赖关系,例如表达为其中一个线程级子图内的子节点第二依赖关系和另外一个线程级子图内的子节点的第二依赖关系;如此,基于线程级子图内的子节点与集合通信算子节点之间的有向边的表达,可以实现多个线程级子图的分布式执行的表达。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的任意一个第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。需要说明的是,本申请通过一些集合通信算子来实现多个计算装置(例如多个芯片或多个处理器)以及多个服务器(server)之间的数据通信,例如集合通信算子。本申请通过第二集合通信算子将数据从第三计算装置传输至第二计算装置,故第二通信算子节点为表示第二集合通信算子的节点,也即第二通信算子节点为第二集合通信算子节点。进一步需要说明的是,本申请得到第三重构子图方式和得到第一重构子图的方式相同;其中,第一重构子图是对第一子图进行重构得到的,第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,因为第一子图包括多个第一节点和多个第一节点之间的有向边,故第一重构子图中的第一目标重构节点包括第一子图中的多个第一节点中的至少一个第一节点和该至少一个第一节点之间的有向边;同理,第三重构子图是对第三子图进行重构得到的,第三重构子图包括至少一个第三重构节点和至少一个第三重构节点之间的有向边,因为第三子图包括多个第三节点和多个第三节点之间的有向边,故第三重构子图中的任意一个第三重构节点包括第三子图中的多个第三节点中的至少一个第三节点和该至少一个第三节点之间的有向边。
应理解,第三重构子图也是基于图3所描述的重构原理的,第三线程级子图也是基于图4或图5所描述的线程划分原理得到的。
在本实现方式中,多个重构节点可以分布式执行,以两个重构节点、两个计算装置为例,其中一个重构节点由其中一个计算装置执行,另外一个重构节点由另一个计算装置执行,其中,这两个重构节点可以是同一重构子图中的重构节点,也可以是不同重构子图中的重构节点;如果其中一个重构节点内的节点的输出数据为另外一个重构节点内的节点的输入数据,那么需要将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个重构节点内的节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个重构节点内 的节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个重构节点内的节点的输入数据,故其中一个重构节点内的节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个重构节点内的节点存在有向边;那么,可以将其中一个重构节点内的节点与集合通信算子节点之间存在有向边表达为其中一个重构节点内的节点的依赖关系,以及将集合通信算子节点与另外一个重构节点内的节点存在有向边表达为另外一个重构节点内的节点的依赖关系,例如表达为其中一个重构节点内的节点第二依赖关系和另外一个重构节点内的节点的第二依赖关系;如此,基于重构节点内的节点与集合通信算子节点之间的有向边的表达,可以实现多个重构节点的分布式执行的表达。同理,多个线程级子图可以分布式执行,以两个线程级子图、两个计算装置为例,其中一个线程级子图由其中一个计算装置执行,另外一个线程级子图由另一个计算装置执行,其中,这两个线程级子图可以是分别对同一重构子图中的不同重构节点进行线程划分得到,也可以是分别对不同重构子图中的重构节点进行子线程划分得到;如果其中一个线程级子图内的子节点的输出数据为另外一个线程级子图内的子节点的输入数据,那么需要将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,本申请可以采用集合通信算子节点将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置;而由于集合通信算子节点可以将其中一个线程级子图内的子节点的输出数据从其中一个计算装置传输至另外一个计算装置,作为另外一个线程级子图内的子节点的输入数据,故其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边,集合通信算子节点与另外一个线程级子图内的子节点存在有向边;那么,可以将其中一个线程级子图内的子节点与集合通信算子节点之间存在有向边表达为其中一个线程级子图内的子节点的依赖关系,以及将集合通信算子节点与另外一个线程级子图内的子节点存在有向边表达为另外一个线程级子图内的子节点的依赖关系,例如表达为其中一个线程级子图内的子节点第二依赖关系和另外一个线程级子图内的子节点的第二依赖关系;如此,基于线程级子图内的子节点与集合通信算子节点之间的有向边的表达,可以实现多个线程级子图的分布式执行的表达。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。需要说明的是,本申请得到第四重构子图方式和得到第一重构子图的方式相同;其中,第一重构子图是对第一子图进行重构得到的,第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,因为第一子图包括多个第一节点和多个第一节点之间的有向边,故第一重构子图中的第一目标重构节点包括第一子图中的多个第一节点中的至少一个第一节点和该至少一个第一节点之间的有向边;同理,第四重构子图是对第四子图进行重构得到的,第四重构子图包括至少一个第四重构节点和至少一个第四重构节点之间的有向边,因为第四子图包括多个第四节点和多个第四节点之间的有向边,故第四重构子图中的任意一个第四重构节点包括第四子图中的多个第四节点中的至少一个第四节点和该至少一个第四节点之间的有向边。
应理解,第四重构子图也是基于图3所描述的重构原理的。
在本实现方式中,对于在同一流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与另外一个重构节点之间的有向边来表达,而其中一个重构节点与另外一个重构节点之间的有向边可以通过其中一个重构节点的依赖关系来表达,从而可以控制同一流中的两个重构节点之间的顺序。例如,该第一目标重构节点和第四目标重构节点在同一流中,该第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间的有向边,转换成该第一目标重构节点与第四目标重构节点之间的有向边,再将该第一目标重构节点与第四目标重构节点之间的有向边表达为该第一目标重构节点的依赖关系,并且编译在该第一目标重构节点的编译数据中,从而控制该第一目标重构节点与第四目标重构节点之间的顺序。应理解,该第一目标重构节点和第四目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。需要说明的是,本申请得到第五重构子图方式和得到第一重构子图的方式相同;其中,第一重构子图是对第一子图进行重构得到的,第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,因为第一子图包括多个第一节点和多个第一节点之间的有向边,故第一重构子图中的第一目标重构节点包括第一子图中的多个第一节点中的至少一个第一节点和该至少一个第一节点之间的有向边;同理,第五重构子图是对第五子图进行重构得到的,第五重构子图包括至少一个第五重构节点和至少一个第五重构节点之间的有向边,因为第五子图包括多个第五节点和多个第五节点之间的有向边,故第五重构子图中的任意一个第五重构节点包括第五子图中的多个第五节点中的至少一个第五节点和该至少一个第五节点之间的有向边。
应理解,第五重构子图也是基于图3所描述的重构原理的。
在本实现方式中,对于在不同流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边和接收算子节点与另外一个重构节点之间的有向边来表达,而其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边可以通过发送算子节点的依赖关系来表达,以及发送算子节点与接收算子节点 之间的有向边、接收算子节点与另外一个重构节点之间的有向边可以通过接收算子节点的依赖关系来表达,从而控制不同流中的两个重构节点之间的顺序。例如,该第一目标重构节点与第五目标重构节点在不同流中,该第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间的有向边,转换成第一发送算子节点与第一目标重构节点之间的有向边、第一发送算子节点与第一接收算子节点之间的有向边、第一接收算子节点与第五目标重构节点之间的有向边;再将第一发送算子节点与第一目标重构节点之间的有向边表达为第一发送算子节点的第一依赖关系,以及将第一发送算子节点与第一接收算子节点之间的有向边表达为第一发送算子节点的第二依赖关系,并且编译在该第一目标重构节点的编译数据中;而且可以将第一接收算子节点与第五目标重构节点之间的有向边可以表达为第一接收算子节点的第一依赖关系,以及将第一发送算子节点与第一接收算子节点之间的有向边表达为第一接收算子节点的第二依赖关系,并且将第一接收算子节点的第一依赖关系和第二依赖关系编译在第五目标重构节点的编译数据中,从而控制该第一目标重构节点与第五目标重构节点之间的顺序。应理解,该第一目标重构节点和第五目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。需要说明的是,本申请得到第六重构子图方式和得到第一重构子图的方式相同;其中,第一重构子图是对第一子图进行重构得到的,第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,因为第一子图包括多个第一节点和多个第一节点之间的有向边,故第一重构子图中的第一目标重构节点包括第一子图中的多个第一节点中的至少一个第一节点和该至少一个第一节点之间的有向边;同理,第六重构子图是对第六子图进行重构得到的,第六重构子图包括至少一个第六重构节点和至少一个第六重构节点之间的有向边,因为第六子图包括多个第六节点和多个第六节点之间的有向边,故第六重构子图中的任意一个第六重构节点包括第六子图中的多个第六节点中的至少一个第六节点和该至少一个第六节点之间的有向边。
应理解,第六重构子图也是基于图3所描述的重构原理的。
在本实现方式中,对于在不同流中的两个重构节点,这两个重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点;若其中一个重构节点中的节点与另外一个重构节点中的节点之间存在有向边,可以将其中一个重构节点中的节点与另外一个重构节点中的节点之间的有向边转换成其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点之间的有向边和接收算子节点与另外一个重构节点之间的有向边来表达,而其中一个重构节点与发送算子节点之间的有向边、发送算子节点与接收算子节点 之间的有向边可以通过发送算子节点的依赖关系来表达,以及发送算子节点与接收算子节点之间的有向边、接收算子节点与另外一个重构节点之间的有向边可以通过接收算子节点的依赖关系来表达,从而控制不同流中的两个重构节点之间的顺序。例如,该第一目标重构节点与第六目标重构节点在不同流中,该第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边;可以将该第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间的有向边,转换成第六目标重构节点与第二发送算子节点之间的有向边、第二发送算子节点与第二接收算子节点之间的有向边、第二接收算子节点与第一目标重构节点之间的有向边;再将第六目标重构节点与第二发送算子节点之间的有向边表达为第二发送算子节点的第一依赖关系,以及将第二发送算子节点与第二接收算子节点之间的有向边表达为第二发送算子节点的第二依赖关系,并且将第二发送算子节点的第一依赖关系和第二依赖关系编译在第六目标重构节点的编译数据中;而且可以将第二接收算子节点与该第一目标重构节点之间的有向边表达为第二接收算子节点的第一依赖关系,以及将第二发送算子节点与第二接收算子节点之间的有向边表达为第二接收算子节点的第二依赖关系,并且将第二接收算子节点的第一依赖关系和第二依赖关系编译在该第一目标重构节点的编译数据中,从而控制该第一目标重构节点与第六目标重构节点之间的顺序。应理解,该第一目标重构节点和第六目标重构节点可以是同一个重构子图中的重构节点,也可以是不同重构子图中的重构节点。
需要说明的是,图26所描述的子图的编译方法的相关步骤或操作的解释说明,可以参阅图2至图25所描述的内容。
二、图执行阶段。
请参阅图27,图27是本申请实施例提供的一种子图的执行方法的流程示意图,该子图的执行方法应用于图执行装置200;该子图的执行方法包括但不限于如下操作或步骤:
步骤2701:获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
步骤2702:执行所述第一目标重构节点的编译数据。
在本申请实施例中,将计算图切分为多个子图,然后对多个子图中的任意一个子图进行重构以得到重构子图,再以重构子图中的重构节点为调度单位实现对该重构子图中的任意一个重构节点进行编译。需要说明的是,本申请得到的重构子图包括至少一个重构节点和该至少一个重构节点之间的有向边,故本申请得到的重构子图实质上还是一张子图;并且,由于该重构子图中任意一个重构节点包括该任意一个子图中的一个或多个节点和一个或多个节点之间的有向边,故重构节点实质上是一张比该任意一个子图规模更小的子图。在子图编译时,以重构节点为调度单位实质上是一种图调度机制,且一次调度一个重构节点进行编译等于一次调度该任意一个子图中的一个或多个节点进行编译;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点,在子图编译时以重构节点为调度单位可以减少调度时间。进一步地,对该任意一个重构节点进行编译得到的该任意一个重构节点的编译数据用于执行该任意一个子图中的一个或多个节点,在子图执行时,一 次调度一个重构节点的编译数据用于执行等于一次调度该任意一个子图中的一个或多个节点的编译数据进行执行,也即本申请以重构节点为调度单位,一次调度一个重构节点以执行该一个重构节点,等于一次调度该任意一个子图中的一个或多个节点以执行该一个或多个节点;相比于以该任意一个子图中的节点(例如计算节点)为调度单位,一次调度该任意一个子图中的一个节点以执行该一个节点,在子图执行时以重构节点为调度单位可以减少调度时间。例如,对于由计算图切分得到的多个子图中的第一子图而言,第一子图包括多个第一节点和多个第一节点之间的有向边,由第一子图重构得到的第一重构子图包括至少一个第一重构节点和至少一个第一重构节点之间的有向边,其中的第一目标重构节点包括第一子图中的M个第一节点和M个第一节点之间的有向边,M为正整数;对该第一目标重构节点进行编译,等于对该M个第一节点进行编译,并且该第一目标重构节点的编译数据用于执行该M个第一节点,也即执行该第一目标重构节点等于执行该M个第一节点。综上,本申请实施例提供了一种图调度机制,以图结构的重构节点为调度单位,能够减少调度时间。
在一种可能的实现方式中,所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的,所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数。
在本实现方式中,任意一个重构节点包括M个第一节点和M个第一节点之间的有向边,在对该任意一个重构节点进行编译时,将M个第一节点中的每个第一节点划分为至少一个第一子节点,也即将M个第一节点中的每个第一节点所代表的算子操作划分为至少一个线程,至少一个线程中的每个线程通过一个子节点表示;M个第一节点进行线程划分后得到N个第一子节点,且N为大于或等于M的正整数,N个第一子节点和N个第一子节点之间的有向边构成第一线程级子图;对该第一线程级子图进行编译,可以得到该第一目标重构节点的编译数据,且该第一目标重构节点的编译数据包括N个第一子节点的编译数据。由于执行N个第一子节点等于执行了M个第一节点,而由同一个第一节点划分得到的至少一个第一子节点可以并发执行,如此,在调度N个第一子节点的编译数据执行这N个第一子节点时,N个第一子节点中由同一个第一节点划分得到的第一子节点可以并发执行,从而减少M个第一节点中的每个第一节点的执行时间,也即减少M个第一节点的总执行时间,提升性能。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行所述第一目标重构节点的编译数据,包括:将R个第六对象的编译数据写入缓存中;针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
需要说明的是,当R个第六对象为M个第一节点时,第八对象为节点,该节点包括计算算子节点、通信算子节点、控制算子节点、异构计算算子节点等各类节点;当R个第六对象为N个第一子节点时,第八对象为子节点,该子节点包括计算算子节点、通信算子节点、控制算子节点、异构计算算子节点等各类节点划分得到的子节点。
在本实现方式中,对于重构节点内的任意一个节点来说,当该任意一个节点依赖的节点的数量为0时,执行该任意一个节点;对于线程级子图内的任意一个子节点来说,当该任意一个子节点依赖的子节点的数量为0时,执行该任意一个子节点;如此设置任意一个节点或任意一个子节点被执行的时机,有利于提升执行效率。
在一种可能的实现方式中,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
在本实现方式中,对于重构节点内的任意一个节点来说,其被推送到计算资源中执行的时机是在其依赖的节点的数量为0以后;本申请可以将该任意一个节点依赖的节点的初始数量编译在该任意一个节点的编译数据中,如此,在执行时可以根据该任意一个节点依赖的节点的数量来确定将其推送到计算资源中执行的时机。如果该任意一个节点依赖的节点的初始数量为0,那么当该任意一个节点的编译数据被写入缓存中时,即可推送到计算资源中执行。如果该任意一个节点依赖的节点的初始数量不为0,那么需要等该任意一个节点依赖的节点都执行完了,才可把该任意一个节点的编译数据推送到计算资源中执行;此种情况下,在每执行完一个该任意一个节点依赖的节点之后,将该任意一个节点依赖的节点的数量减1,直至该任意一个节点依赖的节点的数量减为0以后,为该任意一个节点被推送到计算资源中执行的时机,从而将该任意一个节点推送到计算资源中执行。同理,对于线程级子图内的任意一个子节点来说,其被推送到计算资源中执行的时机是在其依赖的子节点的数量为0以后;本申请可以将该任意一个子节点依赖的子节点的初始数量编译在该任意一个子节点的编译数据中,如此,在执行时可以根据该任意一个子节点依赖的子节点的数量来确定将其推送到计算资源中执行的时机。如果该任意一个子节点依赖的子节点的初始数量为0,那么当该任意一个子节点的编译数据被写入缓存中时,即可推送到计算资源中执行。如果该任意一个子节点依赖的子节点的初始数量不为0,那么需要等该任意一个子节点依赖的子节点都执行完了,才可把该任意一个子节点的编译数据推送到计算资源中执行;此种情况下,在每执行完一个该任意一个子节点依赖的子节点之后,将该任意一个子节点依赖的子节点的数量减1,直至该任意一个子节点依赖的子节点的数量减为0以后,为该任意一个子节点被推送到计算资源中执行的时机,从而将该任意一个子节点推送到计算资源中执行。
在一种可能的实现方式中,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
在本实现方式中,如果重构节点内的任意一个节点依赖的节点的初始数量为0,该任意一个节点无需等待其他节点执行完了以后才可执行;此种情况下,可以在执行该任意一个节点之前将该任意一个节点的编译数据写入缓存中,从而保证该重构节点的顺利执行。如果该任意一个节点依赖的节点的初始数量不为0,该任意一个节点需要等待其依赖的节点都执行完了以后才可执行;此种情况下,可以在执行该任意一个节点依赖的节点的过程中,将该任意一个节点的编译数据写入缓存中,从而在任意一个节点依赖的节点都执行完了以后,可以立马接着执行该任意一个节点,从而保证该重构节点的顺利执行;并且,没有过早将该任意一个节点的编译数据写入缓存中,不会导致该任意一个节点的编译数据长期占用缓存,实现缓存的合理利用。同理,如果线程级子图内的任意一个子节点依赖的子节点的初始数量为0,该任意一个子节点无需等待其他子节点执行完了以后才可执行;此种情况下,可以在执行该任意一个子节点之前将该任意一个子节点的编译数据写入缓存中,从而保证该线程级子图的顺利执行。如果该任意一个子节点依赖的子节点的初始数量不为0,该任意一个子节点需要等待其 依赖的子节点都执行完了以后才可执行;此种情况下,可以在执行该任意一个子节点依赖的子节点的过程中,将该任意一个子节点的编译数据写入缓存中,从而在任意一个子节点依赖的子节点都执行完了以后,可以立马接着执行该任意一个子节点,从而保证该线程级子图的顺利执行;并且,没有过早将该任意一个子节点的编译数据写入缓存中,不会导致该任意一个子节点的编译数据长期占用缓存,实现缓存的合理利用。
需要说明的是,图27所描述的子图的执行方法的相关步骤或操作以及术语的解释说明,可以参阅图2至图26所描述的内容,此处不再重复描述。
请参阅图28,图28是本申请实施例提供了一种子图的编译装置的结构示意图,该子图的编译装置2800应用于图编译装置100,该子图的编译装置2800包括:
获取单元2801,用于获取第一子图,所述第一子图为由计算图切分得到的多个子图中的任意一个,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边;
重构单元2802,用于对所述第一子图进行重构,以得到第一重构子图,所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边;
编译单元2803,用于对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
在一种可能的实现方式中,所述编译单元2803,具体用于:将所述M个第一节点中的每个第一节点划分为至少一个第一子节点,以得到第一线程级子图,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数;对所述第一线程级子图进行编译,以得到所述第一目标重构节点的编译数据。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第 一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
在一种可能的实现方式中,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的任意一个第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。
在一种可能的实现方式中，所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边，所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点，所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点；或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点，所述第四重构子图是对第四子图进行所述重构得到的，所述第四子图为所述多个子图中除所述第一子图之外的子图，所述第三目标节点为所述第四子图中的第四节点；所述第一目标重构节点与所述第四目标重构节点在同一流（stream）中，所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系，所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边,所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点,所述第五重构子图是对第五子图进行所述重构得到的,所述第五子图为所述多个子图中除所述第一子图之外的子图,所述第四目标节点为所述第五子图中的第五节点;所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
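同流与不同流两种情况下有向边的连接方式，可以用如下简化代码示意（发送/接收算子节点的命名为示例假设）：同一流中直接保留两个重构节点之间的有向边；不同流中则改为经由发送算子节点与接收算子节点连接。

```python
def link_across_streams(edges, node_a, node_b, stream_of):
    """为两个重构节点之间的依赖建立有向边（简化示意）。

    同一流中：直接保留 node_a -> node_b 的有向边；
    不同流中：改为 node_a -> 发送算子节点 -> 接收算子节点 -> node_b。
    """
    if stream_of[node_a] == stream_of[node_b]:
        return edges + [(node_a, node_b)]
    send = f"send:{node_a}->{node_b}"
    recv = f"recv:{node_a}->{node_b}"
    return edges + [(node_a, send), (send, recv), (recv, node_b)]
```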
需要说明的是,子图的编译装置2800各个单元的实现还可以对应参照图26所示的方法实施例的相应描述,以及子图的编译装置2800带来的有益效果也可以参照图26所示的方法实施例的相应描述,此处不再重复描述。
请参阅图29，图29是本申请实施例提供的一种子图的执行装置的结构示意图，该子图的执行装置2900应用于图执行装置200，该子图的执行装置2900包括：
获取单元2901,用于获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数;
执行单元2902,用于执行所述第一目标重构节点的编译数据。
在一种可能的实现方式中，所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的，所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的，所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边，所述N为大于或等于所述M的正整数。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行单元2902,具体用于:将R个第六对象的编译数据写入缓存中;针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
在一种可能的实现方式中,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
在一种可能的实现方式中,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
在一种可能的实现方式中,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;所述第一对象的依赖关系;用于执行所述第一对象的计算装置的类型。
在一种可能的实现方式中,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据,所述目标输入数据非所述第一目标重构节点中的第一节点的输出数据,或所述目标输入数据非所述第一线程级子图中的第一子节点的输出数据;在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
在一种可能的实现方式中,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:将所述第一对象的输出数据写入内存中;在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
在一种可能的实现方式中，所述第一对象的依赖关系包括所述第一对象的第一依赖关系，所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边；其中，若所述第一对象为所述M个第一节点中的其中一个，所述第三对象为所述M个第一节点中的其中一个；若所述第一对象为所述N个第一子节点中的其中一个，所述第三对象为所述N个第一子节点中的其中一个，所述第三对象和所述第一对象是由不同的第一节点划分得到的。
在一种可能的实现方式中,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点,所述第二目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第一目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第二目标重构节点为第二重构子图中的任意一个第二重构节点,所述第二重构子图是对第二子图进行所述重构得到的,所述第二子图为所述多个子图中除所述第一子图之外的子图,所述第一目标节点为所述第二子图中的第二节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的任意一个第二子节点,所述第二线程级子图是通过将所述第二目标重构节点中的每个第一目标节点划分为至少一个第二子节点得到的,所述第二线程级子图包括P个第二子节点和所述P个第二子节点之间的有向边,所述P为正整数。
在一种可能的实现方式中,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点,所述第三目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第二目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第三目标重构节点为第三重构子图中的任意一个第三重构节点,所述第三重构子图是对第三子图进行所述重构得到的,所述第三子图为所述多个子图中除所述第一子图之外的子图,所述第二目标节点为所述第三子图中的第三节点;若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点,所述第三线程级子图是通过将所述第三目标重构节点中的每个第二目标节点划分为至少一个第三子节点得到的,所述第三线程级子图包括Q个第三子节点和所述Q个第三子节点之间的有向边,所述Q为正整数。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边,所述第四目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第三目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第四目标重构节点为第四重构子图中的任意一个第四重构节点,所述第四重构子图是对第四子图进行所述重构得到的,所述第四子图为所述多个子图中除所述第一子图之外的子图,所述第三目标节点为所述第四子图中的第四节点;所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
在一种可能的实现方式中，所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边，所述第五目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点，所述第四目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点；或所述第五目标重构节点为第五重构子图中的任意一个第五重构节点，所述第五重构子图是对第五子图进行所述重构得到的，所述第五子图为所述多个子图中除所述第一子图之外的子图，所述第四目标节点为所述第五子图中的第五节点；所述第一目标重构节点与所述第五目标重构节点在不同流中，所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系，所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边，所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边，所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
在一种可能的实现方式中,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边,所述第六目标重构节点为所述至少一个第一重构节点中除所述第一目标重构节点之外的第一重构节点,所述第五目标节点为所述多个第一节点中除所述M个第一节点之外的第一节点;或所述第六目标重构节点为第六重构子图中的任意一个第六重构节点,所述第六重构子图是对第六子图进行所述重构得到的,所述第六子图为所述多个子图中除所述第一子图之外的子图,所述第五目标节点为所述第六子图中的第六节点;所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
需要说明的是,子图的执行装置2900各个单元的实现还可以对应参照图27所示的方法实施例的相应描述,以及子图的执行装置2900带来的有益效果也可以参照图27所示的方法实施例的相应描述,此处不再重复描述。
请参阅图30，图30是本申请实施例提供的另一种计算机设备的结构示意图。该计算机设备可以是前述图编译装置100或图执行装置200，该计算机设备3000包括：至少一个CPU、存储器（存储器的类型例如可以包括SRAM和ROM）、微控制器（Microcontroller Unit，MCU）、WLAN子系统、总线、传输接口等。虽然图30中未示出，该计算机设备3000还可以包括应用处理器（Application Processor，AP）、NPU等其他专用处理器，以及电源管理子系统、时钟管理子系统和功耗管理子系统等其他子系统。
计算机设备3000的上述各个部分通过连接器相耦合,示例性的,连接器包括各类接口、传输线或总线等,这些接口通常是电性通信接口,但是也可能是机械接口或其它形式的接口,本实施例对此不做限定。
可选的，CPU可以是一个单核（single-CPU）处理器或多核（multi-CPU）处理器；可选的，CPU可以是多个处理器构成的处理器组，多个处理器之间通过一个或多个总线彼此耦合。在一种可选的情况中，CPU通过调用片上存储器或者片外存储器中存储的程序指令实现如前述方法实施例中的任一种子图的编译或执行方法。在一种可选的情况中，CPU和MCU共同实现如前述方法实施例中的任一种子图的编译或执行方法，例如CPU完成其中的部分步骤，而MCU完成其他步骤。在一种可选的情况中，AP或者其他专用处理器通过调用片上存储器或者片外存储器中存储的程序指令实现如前述方法实施例中的任一种子图的编译或执行方法。
该传输接口可以为处理器芯片用于接收和发送数据的接口，通常包括多种接口。在一种可选的情况下，该传输接口可以包括内部整合电路（Inter-Integrated Circuit，I2C）接口、串行外设接口（Serial Peripheral Interface，SPI）、通用异步收发机（Universal Asynchronous Receiver-Transmitter，UART）接口、通用输入输出（General-Purpose Input/Output，GPIO）接口等。应当理解，这些接口可以通过复用同一物理接口来实现不同的功能。
在一种可选的情况中,传输接口还可以包括高清晰度多媒体接口(High Definition Multimedia Interface,HDMI)、V-By-One接口、嵌入式显示端口(Embedded Display Port,eDP)、移动产业处理器接口(Mobile Industry Processor Interface,MIPI)或Display Port(DP)等。
在一种可选的情况中,上述各部分集成在同一个芯片上;在另一种可选的情况中,存储器可以是独立存在的芯片。
WLAN子系统例如可以包括射频电路和基带。
在本申请实施例中涉及的芯片是以集成电路工艺制造在同一个半导体衬底上的系统,也叫半导体芯片,其可以是利用集成电路工艺制作在衬底(通常是例如硅一类的半导体材料)上形成的集成电路的集合,其外层通常被半导体封装材料封装。所述集成电路可以包括各类功能器件,每一类功能器件包括逻辑门电路、金属氧化物半导体(Metal-Oxide-Semiconductor,MOS)晶体管、双极晶体管或二极管等晶体管,也可包括电容、电阻或电感等其他部件。每个功能器件可以独立工作或者在必要的驱动软件的作用下工作,可以实现通信、运算、或存储等各类功能。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器进行如前述图26或图27中所述的方法。
本申请实施例还提供了一种计算机程序产品,所述计算机程序产品包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器进行如前述图26或图27中所述的方法。
本申请实施例还提供了一种芯片,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有上述芯片的设备执行如前述图26或图27中所述的方法。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本说明书中所提供的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中，应该理解到，所揭露的系统、装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，上述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接，可以是电性、机械或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
上述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例上述方法的全部或部分步骤。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。此外,本申请各实施例中的术语、解释说明,可以参照其他实施例中相应的描述。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
以上实施例仅用以说明本申请的技术方案，而非对其进行限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (39)

  1. 一种子图的编译方法,其特征在于,包括:
    获取第一子图,所述第一子图为由计算图切分得到的多个子图中的任意一个,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边;
    对所述第一子图进行重构,以得到第一重构子图,所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边;
    对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
  2. 根据权利要求1所述的方法,其特征在于,所述对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,包括:
    将所述M个第一节点中的每个第一节点划分为至少一个第一子节点,以得到第一线程级子图,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数;
    对所述第一线程级子图进行编译,以得到所述第一目标重构节点的编译数据。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:
    所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;
    所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;
    所述第一对象的依赖关系;
    用于执行所述第一对象的计算装置的类型。
  4. 根据权利要求3所述的方法,其特征在于,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
  5. 根据权利要求3或4所述的方法,其特征在于,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:
    在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据;
    在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或,在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
  6. 根据权利要求3-5任一项所述的方法,其特征在于,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:
    将所述第一对象的输出数据写入内存中;
    在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;
    在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或,在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
  7. 根据权利要求3-6任一项所述的方法,其特征在于,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
  8. 根据权利要求3-7任一项所述的方法,其特征在于,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的第二子节点。
  9. 根据权利要求3-8任一项所述的方法,其特征在于,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边;
    所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
  11. 根据权利要求1-10任一项所述的方法，其特征在于，所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边；
    所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
  12. 根据权利要求1-11任一项所述的方法,其特征在于,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边;
    所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
  13. 一种子图的执行方法,其特征在于,包括:
    获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数;
    执行所述第一目标重构节点的编译数据。
  14. 根据权利要求13所述的方法,其特征在于,所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的,所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数。
  15. 根据权利要求13或14所述的方法,其特征在于,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行所述第一目标重构节点的编译数据,包括:
    将R个第六对象的编译数据写入缓存中;
    针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:
    在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;
    其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
  16. 根据权利要求15所述的方法,其特征在于,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
  17. 根据权利要求16所述的方法,其特征在于,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
  18. 一种子图的编译装置,其特征在于,包括:
    获取单元,用于获取第一子图,所述第一子图为由计算图切分得到的多个子图中的任意一个,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边;
    重构单元,用于对所述第一子图进行重构,以得到第一重构子图,所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边;
    编译单元,用于对第一目标重构节点进行编译,以得到所述第一目标重构节点的编译数据,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数。
  19. 根据权利要求18所述的装置,其特征在于,所述编译单元,具体用于:
    将所述M个第一节点中的每个第一节点划分为至少一个第一子节点,以得到第一线程级子图,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数;
    对所述第一线程级子图进行编译,以得到所述第一目标重构节点的编译数据。
  20. 根据权利要求18或19所述的装置,其特征在于,所述第一目标重构节点的编译数据包括第一对象的编译数据,所述第一对象为所述M个第一节点中的其中一个或所述N个第一子节点中的其中一个,所述第一对象的编译数据包括以下至少一项:
    所述第一对象的输入数据的存储位置、生命周期和缓存管理操作指示信息;
    所述第一对象的输出数据的存储位置、生命周期和缓存管理操作指示信息;
    所述第一对象的依赖关系;
    用于执行所述第一对象的计算装置的类型。
  21. 根据权利要求20所述的装置,其特征在于,所述第一对象为所述N个第一子节点中的其中一个,所述第一对象的编译数据还包括:与所述第一对象并行执行的第二对象,所述第二对象为所述N个第一子节点中的其中一个,所述第二对象和所述第一对象是由同一第一节点划分得到的。
  22. 根据权利要求20或21所述的装置,其特征在于,所述第一对象的输入数据的缓存管理操作指示信息用于指示以下至少一项:
    在执行所述第一对象之前将目标输入数据从内存中写入缓存中,所述第一对象的输入数据包括所述目标输入数据;
    在所述目标输入数据输入所述第一目标重构节点中的第一节点第一预设次数后,或在所述目标输入数据输入所述第一线程级子图中的第一子节点所述第一预设次数后,删除所述缓存中的所述目标输入数据。
  23. 根据权利要求20-22任一项所述的装置,其特征在于,所述第一对象的输出数据的缓存管理操作指示信息用于指示以下至少一项:
    将所述第一对象的输出数据写入内存中;
    在将所述第一对象的输出数据写入内存中之后,删除缓存中的所述第一对象的输出数据;
    在所述第一对象的输出数据输入所述第一目标重构节点中的第一节点第二预设次数后,或在所述第一对象的输出数据输入所述第一线程级子图中的第一子节点所述第二预设次数之后,删除缓存中的所述第一对象的输出数据。
  24. 根据权利要求20-23任一项所述的装置,其特征在于,所述第一对象的依赖关系包括所述第一对象的第一依赖关系,所述第一对象的第一依赖关系用于表示所述第一对象与第三对象之间的有向边;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第三对象为所述M个第一节点中的其中一个;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第三对象为所述N个第一子节点中的其中一个,所述第三对象和所述第一对象是由不同的第一节点划分得到的。
  25. 根据权利要求20-24任一项所述的装置,其特征在于,所述第一对象的输出数据为第四对象的输入数据,所述第一对象由第一计算装置执行,所述第四对象由第二计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第二依赖关系,所述第一对象的第二依赖关系用于表示所述第一对象与第一通信算子节点之间的有向边,所述第一对象的输出数据由所述第一通信算子节点从所述第一计算装置传输至所述第二计算装置;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第四对象为第二目标重构节点中的第一目标节点;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第四对象为第二线程级子图中的第二子节点。
  26. 根据权利要求20-25任一项所述的装置,其特征在于,所述第一对象的输入数据包括第五对象的输出数据,所述第一对象由第一计算装置执行,所述第五对象由第三计算装置执行,所述第一对象的依赖关系还包括所述第一对象的第三依赖关系,所述第一对象的第三依赖关系用于表示所述第一对象与第二通信算子节点之间的有向边,所述第五对象的输出数据由所述第二通信算子节点从所述第三计算装置传输至所述第一计算装置;
    其中,若所述第一对象为所述M个第一节点中的其中一个,所述第五对象为第三目标重构节点中的第二目标节点;
    若所述第一对象为所述N个第一子节点中的其中一个,所述第五对象为第三线程级子图中的第三子节点。
  27. 根据权利要求18-26任一项所述的装置,其特征在于,所述第一目标重构节点中的第一节点与第四目标重构节点中的第三目标节点之间存在有向边;
    所述第一目标重构节点与所述第四目标重构节点在同一流(stream)中,所述第一目标重构节点的编译数据还包括所述第一目标重构节点的依赖关系,所述第一目标重构节点的依赖关系用于表示所述第一目标重构节点与所述第四目标重构节点之间的有向边。
  28. 根据权利要求18-27任一项所述的装置,其特征在于,所述第一目标重构节点中的第一节点与第五目标重构节点中的第四目标节点之间存在有向边;
    所述第一目标重构节点与所述第五目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第一发送算子节点的第一依赖关系和第二依赖关系,所述第一发送算子节点的第一依赖关系用于表示所述第一发送算子节点与所述第一目标重构节点之间的有向边,所述第一发送算子节点的第二依赖关系用于表示所述第一发送算子节点与第一接收算子节点之间的有向边,所述第一接收算子节点与所述第五目标重构节点之间存在有向边。
  29. 根据权利要求18-28任一项所述的装置,其特征在于,所述第一目标重构节点中的第一节点与第六目标重构节点中的第五目标节点之间存在有向边;
    所述第一目标重构节点与所述第六目标重构节点在不同流中,所述第一目标重构节点的编译数据还包括第二接收算子节点的第一依赖关系和第二依赖关系,所述第二接收算子节点的第一依赖关系用于表示所述第二接收算子节点与所述第一目标重构节点之间的有向边,所述第二接收算子节点的第二依赖关系用于表示所述第二接收算子节点与第二发送算子节点之间的有向边,所述第二发送算子节点与所述第六目标重构节点之间存在有向边。
  30. 一种子图的执行装置,其特征在于,包括:
    获取单元,用于获取第一重构子图中的第一目标重构节点的编译数据,所述第一重构子图是对第一子图进行重构得到的,所述第一子图包括多个第一节点和所述多个第一节点之间的有向边,所述第一子图为由计算图切分得到的多个子图中的任意一个;所述第一重构子图包括至少一个第一重构节点和所述至少一个第一重构节点之间的有向边,所述第一目标重构节点为所述至少一个第一重构节点中的任意一个第一重构节点,所述第一目标重构节点包括所述多个第一节点中的M个第一节点和所述M个第一节点之间的有向边,所述M为正整数;
    执行单元,用于执行所述第一目标重构节点的编译数据。
  31. 根据权利要求30所述的装置,其特征在于,所述第一目标重构节点的编译数据是对第一线程级子图进行编译得到的,所述第一线程级子图是通过将所述M个第一节点中的每个第一节点划分为至少一个第一子节点得到的,所述第一线程级子图包括N个第一子节点和所述N个第一子节点之间的有向边,所述N为大于或等于所述M的正整数。
  32. 根据权利要求30或31所述的装置,其特征在于,所述第一目标重构节点的编译数据包括所述M个第一节点的编译数据或所述N个第一子节点的编译数据;所述执行单元,具体用于:
    将R个第六对象的编译数据写入缓存中;
    针对所述R个第六对象中的每个第六对象,执行以下操作,以执行所述第一目标重构节点的编译数据:
    在第七对象依赖的第八对象的数量为0的情况下,从所述缓存中读取所述第七对象的编译数据,并根据所述第七对象的编译数据执行所述第七对象,所述第七对象为所述R个第六对象中的任意一个;
    其中,若所述R个第六对象为所述M个第一节点,所述第八对象为节点;若所述R个第六对象为所述N个第一子节点,所述第八对象为子节点。
  33. 根据权利要求32所述的装置,其特征在于,所述第七对象的编译数据包括所述第七对象依赖的第八对象的初始数量;若所述第七对象依赖的第八对象的初始数量不为0,在每执行完一个第九对象之后,所述第七对象依赖的第八对象的数量减1,所述第九对象为所述第七对象依赖的第八对象。
  34. 根据权利要求33所述的装置,其特征在于,在所述第七对象依赖的第八对象的初始数量为0的情况下,所述第七对象的编译数据是在执行所述第七对象之前写入所述缓存中的;在所述第七对象依赖的第八对象的初始数量不为0的情况下,所述第七对象的编译数据是在执行所述第九对象的过程中写入所述缓存中的。
  35. 一种图编译装置，其特征在于，包括处理器、存储器，以及一个或多个程序，所述一个或多个程序被存储在所述存储器中，并且被配置为由所述处理器执行，所述程序包括用于进行如权利要求1-12中任一项所述的方法中的步骤的指令。
  36. 一种图执行装置，其特征在于，包括处理器、存储器，以及一个或多个程序，所述一个或多个程序被存储在所述存储器中，并且被配置为由所述处理器执行，所述程序包括用于进行如权利要求13-17中任一项所述的方法中的步骤的指令。
  37. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器进行如权利要求1-12或13-17中任一项所述的方法。
  38. 一种计算机程序产品,所述计算机程序产品包括计算机程序,当所述计算机程序在计算机或处理器上运行时,使得所述计算机或所述处理器进行如权利要求1-12或13-17中任一项所述的方法。
  39. 一种芯片,其特征在于,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有所述芯片的设备执行如权利要求1-12或13-17中任一项所述的方法。
PCT/CN2021/143304 2021-12-30 2021-12-30 子图的编译、执行方法及相关设备 WO2023123266A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/143304 WO2023123266A1 (zh) 2021-12-30 2021-12-30 子图的编译、执行方法及相关设备
CN202180064167.2A CN116710891A (zh) 2021-12-30 2021-12-30 子图的编译、执行方法及相关设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/143304 WO2023123266A1 (zh) 2021-12-30 2021-12-30 子图的编译、执行方法及相关设备

Publications (1)

Publication Number Publication Date
WO2023123266A1 true WO2023123266A1 (zh) 2023-07-06

Family

ID=86997109

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/143304 WO2023123266A1 (zh) 2021-12-30 2021-12-30 子图的编译、执行方法及相关设备

Country Status (2)

Country Link
CN (1) CN116710891A (zh)
WO (1) WO2023123266A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408346A (zh) * 2023-10-25 2024-01-16 北京中科弧光量子软件技术有限公司 一种量子线路确定方法、装置和计算设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130291113A1 (en) * 2012-04-26 2013-10-31 David Bryan Dewey Process flow optimized directed graph traversal
CN110689116A (zh) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 一种神经网络剪枝方法、装置、计算机设备及存储介质
CN111160551A (zh) * 2019-12-04 2020-05-15 上海寒武纪信息科技有限公司 计算图执行方法、计算机设备及存储介质
CN111338635A (zh) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 计算图的图编译方法、装置、设备及存储介质
CN112711478A (zh) * 2019-10-24 2021-04-27 珠海零边界集成电路有限公司 基于神经网络的任务处理方法、装置、服务器和存储介质


Also Published As

Publication number Publication date
CN116710891A (zh) 2023-09-05

Similar Documents

Publication Publication Date Title
JP6908682B2 (ja) グラフに基づくプログラムの仕様の実行
US10318260B2 (en) Method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system
Kaeli et al. Heterogeneous computing with OpenCL 2.0
CN107077364B (zh) 基于特定数据端口连接的识别使用图组件的自动聚类的基于图的程序规范的编译
US9996394B2 (en) Scheduling accelerator tasks on accelerators using graphs
US9009711B2 (en) Grouping and parallel execution of tasks based on functional dependencies and immediate transmission of data results upon availability
CN112381220B (zh) 一种神经网络张量处理器
US10133827B2 (en) Automatic generation of multi-source breadth-first search from high-level graph language
CN104536937A (zh) 基于cpu-gpu异构集群的大数据一体机实现方法
US11403104B2 (en) Neural network processor, chip and electronic device
WO2020083050A1 (zh) 一种数据流处理方法及相关设备
US20220342712A1 (en) Method for Processing Task, Processor, Device and Readable Storage Medium
CN110750265B (zh) 一种面向图计算的高层次综合方法及系统
US20220043770A1 (en) Neural network processor, chip and electronic device
WO2023123266A1 (zh) 子图的编译、执行方法及相关设备
CN110865814A (zh) 一种支持异构计算核架构的编译器实现方法和系统
TW202109286A (zh) 純函數語言神經網路加速器系統及結構
US20080120497A1 (en) Automated configuration of a processing system using decoupled memory access and computation
US20080244152A1 (en) Method and Apparatus for Configuring Buffers for Streaming Data Transfer
Rucker et al. Revet: A language and compiler for dataflow threads
WO2022253075A1 (zh) 一种编译方法及相关装置
Hsiu et al. Multilayer bus optimization for real-time embedded systems
WO2022261867A1 (zh) 一种任务调度方法和装置
WO2023070380A1 (zh) 数据处理装置及神经网络处理器
CN112445724B (zh) 针对片上存储器重用的链接时地址分配方法

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180064167.2

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21969624

Country of ref document: EP

Kind code of ref document: A1