WO2022143419A1 - Node fusion method and device for a computational graph - Google Patents

Node fusion method and device for a computational graph

Info

Publication number
WO2022143419A1
Authority
WO
WIPO (PCT)
Prior art keywords: node, branch, branches, nodes, parallelizable
Application number: PCT/CN2021/140906
Other languages: English (en), French (fr)
Inventors: 张兆创, 高雄, 曾子韬
Original Assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP21914156.1A (EP4258175A4)
Publication of WO2022143419A1
Priority to US18/214,101 (US20230334292A1)


Classifications

    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present application relates to the field of machine learning, and in particular, to a node fusion method and device for computational graphs.
  • A computational graph is a general representation of a computing process: a directed acyclic graph describing functions, widely used on various data processing platforms.
  • A computational graph includes multiple nodes and directed edges. In the field of machine learning, a computational graph is used to represent the computation logic involved in a neural network. Each node in the computational graph represents an operation performed by the neural network (for example, an add node represents an addition operation); such an operation can also be called a computing task, so one node represents one computing task. A directed edge connects a preceding node (which may be called a predecessor or parent node) to a following node (which may be called a successor or child node), indicating that the output of the parent node serves as an input of the child node.
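  • To make the terminology concrete, the following minimal Python sketch (illustrative only; the graph, its node names, and the helper functions are assumptions for this example, not the patent's data model) represents a computational graph as an adjacency list whose directed edges carry a parent's output to its children:

```python
# Minimal sketch of a computational graph: each node is a computing task,
# and a directed edge feeds a parent node's output to a child node.
# The topology mirrors the FIG. 1 graph as described later in the text.
graph = {
    "A": ["B"],        # A's output is B's input (A is B's parent node)
    "B": ["C", "D"],   # B is the common parent node of C and D
    "C": ["E"],
    "D": ["E"],        # E is the common child node of C and D
    "E": [],
}

def children(node):
    """Successor (child) nodes that consume this node's output."""
    return graph[node]

def parents(node):
    """Predecessor (parent) nodes whose outputs this node consumes."""
    return [n for n, succ in graph.items() if node in succ]

print(children("B"))  # ['C', 'D']
print(parents("E"))   # ['C', 'D']
```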
  • To execute a computational graph, a deep learning framework generally proceeds as follows: first, it converts the user-defined neural network into a computational graph (the graph having been optimized); then, following the topological order of the nodes in the computational graph, it loads the computing tasks corresponding to the sorted nodes onto the module of the acceleration hardware (such as a graphics processing unit (GPU), a tensor processing unit (TPU), or an Ascend processor such as the Ascend 910 or Ascend 310) that actually executes the computing tasks; this module is called a device. As shown in FIG. 1, the left part of FIG. 1 is a computational graph, and the right part is the corresponding topological sorting result.
  • Each computing task on the device can only start after the computing task preceding it in the topological order has finished, which means the sorting implicitly adds execution dependencies between computing tasks that are actually data-independent. As shown in FIG. 1, there is no explicit dependency between the computing tasks of node C and node D, yet once the topological order is determined, node C must execute before node D. Finally, the total network execution time on the device is the sum of the time consumed by each individual computing task plus the additional time consumed by interaction and communication.
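  • The serialization effect can be reproduced with a standard Kahn-style topological sort; the sketch below (an illustration, not the framework's scheduler) linearizes the FIG. 1 graph, forcing the data-independent nodes C and D into a fixed order:

```python
# Kahn-style topological sort of the FIG. 1 graph. C and D are
# data-independent, yet the resulting linear order forces one of them
# to run before the other on the device.
from collections import deque

graph = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}

def topological_order(g):
    indegree = {n: 0 for n in g}
    for succ in g.values():
        for child in succ:
            indegree[child] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in g[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

print(topological_order(graph))  # ['A', 'B', 'C', 'D', 'E']: C before D
```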
  • For example, the network model Bert-Large has 1024 hidden layers, and the network model GPT-3 has 2048, which makes the performance requirements on deep learning frameworks increasingly demanding. Since this type of network structure contains many branch structures that are directly or indirectly computationally independent, these branch structures can potentially be executed in parallel. Therefore, in order to obtain higher execution performance, for neural networks with a multi-branch structure, the computing logic of multiple nodes that can be executed in parallel and that consume relatively few computing resources is fused before execution.
  • The existing node fusion methods for computational graphs mainly include the horizontal fusion method and the operator level parallelism method.
  • The horizontal fusion method is an optimization pass of the accelerated linear algebra (XLA) compiler for GPUs; this fusion method is conservative when looking for fusible nodes. It searches for fusible nodes starting from the return node of the computational graph, stops as soon as the search process is interrupted, and requires the nodes to be fused to have the same data arrangement and output data type.
  • The operator level parallelism method is an optimization step of the deep learning framework MXNet, which requires that the input data of the multiple nodes to be fused all come from the same parent node.
  • In summary, these existing node fusion methods for computational graphs carry many constraints and cannot fully search for fusible nodes; an efficient node fusion method for computational graphs therefore needs to be introduced.
  • The embodiments of the present application provide a node fusion method and device for a computational graph. Compared with the prior art, the possibility of additional parallel branches is considered, so that combinations of parallelizable branches different from those defined by prior-art rules can be found; the nodes in these branch combinations can then be fused during node fusion of the computational graph, thereby expanding the range of nodes that can be fused.
  • A first aspect of the embodiments of the present application provides a node fusion method for a computational graph, which can be used in the field of artificial intelligence and, specifically, applied in a deep learning framework. The method includes: first, the deep learning framework obtains the network structure of a neural network.
  • The network structure can be a network structure customized through the API provided by the deep learning framework, or the network structure of a pre-defined neural network obtained directly from a network model library (model zoo).
  • The deep learning framework then converts the network structure into a first computational graph and extracts one or more parallelizable branch groups from it based on the connection relationships between nodes. Each parallelizable branch group includes a plurality of sub-branches (i.e., at least two sub-branches); a sub-branch is a sequential series structure without internal branching, and each sub-branch in each parallelizable branch group includes one or more nodes.
  • A parallelizable branch group indicates that its multiple sub-branches support being executed in parallel.
  • Sub-branches belonging to the same parallelizable branch group must satisfy two conditions (see the sketch after this list): first, there is no dependency between the sub-branches; second, after any two or more nodes belonging to different sub-branches are merged into one node, there is no ring structure in the resulting computational graph, that is, the fused computational graph remains acyclic.
  • At least one of the parallelizable branch groups (which may be referred to as a first parallelizable branch group) satisfies at least one of the following conditions: the inputs of all sub-branches in the first parallelizable branch group come from the same node; the outputs of at least two sub-branches in the first parallelizable branch group point to different nodes; the outputs of all sub-branches in the first parallelizable branch group point to the same node; the inputs of at least two sub-branches in the first parallelizable branch group come from different nodes; the first node of every sub-branch in the first parallelizable branch group has no parent node; or the last node of every sub-branch in the first parallelizable branch group has no child node.
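  • The two membership conditions can be checked mechanically. The following sketch (a simplified illustration under the assumption that the graph is an adjacency-list dict whose every node appears as a key; none of these helpers come from the patent) tests sub-branch independence via reachability and tests for a ring by collapsing a candidate fusion set into a single node:

```python
# Condition 1: no dependency between two sub-branches (no path either way).
# Condition 2: fusing nodes from different sub-branches must not create a
# ring, checked by collapsing the fused nodes and looking for a cycle.

def reachable(g, src, dst):
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return False

def independent(g, branch_a, branch_b):
    """Condition 1: no node of one sub-branch depends on one of the other."""
    return not any(reachable(g, a, b) or reachable(g, b, a)
                   for a in branch_a for b in branch_b)

def fusion_creates_ring(g, fuse_set):
    """Condition 2: collapse fuse_set into one node and test for a cycle."""
    merged, ng = "FUSED", {}
    for n, succ in g.items():
        src = merged if n in fuse_set else n
        for c in succ:
            dst = merged if c in fuse_set else c
            if src != dst:
                ng.setdefault(src, set()).add(dst)
        ng.setdefault(src, set())
    WHITE, GREY, BLACK = 0, 1, 2          # DFS colouring cycle detection
    colour = {n: WHITE for n in ng}
    def dfs(n):
        colour[n] = GREY
        for c in ng[n]:
            if colour[c] == GREY or (colour[c] == WHITE and dfs(c)):
                return True
        colour[n] = BLACK
        return False
    return any(colour[n] == WHITE and dfs(n) for n in ng)

g = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
print(independent(g, ["a"], ["b"]))        # True: branches do not depend on each other
print(fusion_creates_ring(g, {"a", "b"}))  # False: fusing a and b stays acyclic
```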
  • After obtaining the one or more parallelizable branch groups, the deep learning framework can, for each parallelizable branch group, fuse multiple nodes from different sub-branches in that group, thereby obtaining a second computational graph.
  • It should be noted that the one or more parallelizable branch groups described in the embodiments of the present application are not limited to those obtained by the extraction step above; they also include branch groups that can be found by existing search methods. For ease of description, this application describes only the differences from the existing search methods. No matter which method is used to extract a parallelizable branch group, the nodes in each group can be fused based on the fusion method described in this application to obtain the second computational graph.
  • In the above manner, a node fusion method for computational graphs is provided. The possibility of additional parallel branches is considered, so that combinations of parallelizable branches different from those defined by prior-art rules can be found; the nodes in these branch combinations can be fused during node fusion of the computational graph, thereby expanding the range of nodes that can be fused.
  • In one possible implementation, when there are multiple parallelizable branch groups, the parallelizable branch groups further include at least one second parallelizable branch group, which satisfies the following condition: the inputs of all sub-branches in the second parallelizable branch group come from the same node and the outputs of at least two of its sub-branches point to the same node, and each sub-branch in the second parallelizable branch group includes at least two nodes.
  • In the above manner, the method for obtaining parallelizable branch groups based on the connection relationships between nodes in the first computational graph may be the method for obtaining the second parallelizable branch group described above, which has wide applicability.
  • In one possible implementation, some sub-branches in the one or more parallelizable branch groups initially obtained by the deep learning framework may still retain nodes that cannot be fused or that are not suitable for parallel execution; these nodes can be collectively referred to as non-fused nodes.
  • The reason these nodes are preserved when extracting parallelizable branch groups is to keep the branches intact, so that the branches are not interrupted by individual nodes during the search process; once interrupted, potentially parallel-executable nodes would be missed.
  • After the deep learning framework obtains the one or more parallelizable branch groups, it can also remove the non-fused nodes from each sub-branch in each parallelizable branch group (i.e., each target parallelizable branch group), so as to obtain, for each target parallelizable branch group, a group with the non-fused nodes removed (which may be referred to as a third parallelizable branch group); any sub-branch in a target parallelizable branch group may be called a target sub-branch.
  • The deep learning framework then fuses multiple nodes from different sub-branches in each parallelizable branch group with the non-fused nodes removed (i.e., each third parallelizable branch group), where the multiple nodes from different sub-branches refer to nodes that can be fused but have not yet been fused. The resulting second computational graph includes the fusion nodes as well as the unfused nodes of the first computational graph (unfused nodes are nodes that have not been fused).
  • A non-fused node may specifically be a node with an exclusive computing operation, i.e., one for which the compilation suite in use provides no kernel compilation scheme after fusion with certain other nodes; it may also be a node with a specific computing operation: certain operations, such as the matrix multiplication and convolution operations of a neural network, are highly computationally intensive, and in most cases already use the computing resources on the device as fully as possible when executed.
  • In the above manner, the specific way of obtaining the second computational graph is described: the non-fused nodes are first removed from the first computational graph, and the remaining fusible nodes are then fused to obtain the second computational graph; since the non-fused nodes are removed from the first computational graph in advance, fusion efficiency can be improved.
  • In one possible implementation, the above process of fusing multiple nodes from different sub-branches in the third parallelizable branch group is performed iteratively; that is, the step of fusing multiple nodes from different sub-branches in the third parallelizable branch group is repeated until the number of unfused nodes in the third parallelizable branch group is less than 2.
  • In one possible implementation, the process by which the deep learning framework fuses multiple nodes from different sub-branches to obtain a fusion node can be based on a combinatorial search (which may also be called a parallel fusion search).
  • The principle of the combinatorial search is as follows: in any third parallelizable branch group, each sub-branch includes one or more nodes; nodes from different sub-branches can be combined into fusion nodes, while nodes on the same sub-branch cannot be fused with one another because of their front-to-back connection relationship.
  • Accordingly, the fusion node can be obtained by means of a search algorithm.
  • The specific search process may be as follows: first, select (e.g., randomly select) one unfused node from each of multiple sub-branches in the third parallelizable branch group to obtain n unfused nodes, where an unfused node is a node that can be fused but has not yet been fused, and n ≥ 2; then generate m node combinations from the selected n unfused nodes, where each of the m node combinations includes at least two unfused nodes, m ≥ 1, and 2m ≤ n; finally, evaluate the computing power required by the m node combinations using a constructed computing-power evaluation model to obtain m evaluation results. Each of the m evaluation results represents one of the following: the computing-power resources required by the corresponding node combination, or the computing-power resources saved by the corresponding node combination.
  • If a first evaluation result satisfies a preset condition, the first node combination corresponding to the first evaluation result is fused to obtain a first fusion node, and each unfused node in the first node combination is marked as a fused node, where the first evaluation result is one of the m evaluation results and the first node combination is one of the m node combinations.
  • Specifically, in some implementations, each time an evaluation result is obtained in the current round, it can immediately be judged whether that evaluation result satisfies the preset condition, and subsequent processing is performed based on the judgment; in other implementations, several rounds of searching candidate node combinations may first produce multiple evaluation results (that is, at least two evaluation results), after which it is judged whether any of them satisfies the preset condition, and follow-up processing is performed based on the judgment result.
  • The first evaluation result satisfying the preset condition includes at least one of the following situations (a sketch of one search round follows this list): when each of the m evaluation results represents the computing-power resources required by the corresponding node combination, the first evaluation result meets the computing-power requirement of the module (device) on the target acceleration hardware that actually executes computing tasks; when each of the m evaluation results represents the computing-power resources saved by the corresponding node combination, the first evaluation result is optimal among the m evaluation results; or, when each of the m evaluation results represents the computing-power resources saved by the corresponding node combination, the first evaluation result is optimal among x evaluation results, where the x evaluation results are at least two of the m evaluation results.
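  • The round structure described above can be sketched as follows. This is an illustrative stand-in only: `cost_of` and `device_capacity` are hypothetical placeholders for the patent's computing-power evaluation model and the device's computing-power requirement, and the combination generation here does not enforce the disjointness implied by 2m ≤ n:

```python
# One round of the combinatorial (parallel fusion) search, simplified.
import itertools
import random

def one_search_round(sub_branches, device_capacity, cost_of):
    # Step 1: pick one unfused node from each sub-branch -> n nodes (n >= 2).
    picked = [random.choice(branch) for branch in sub_branches if branch]
    if len(picked) < 2:
        return None
    # Step 2: generate m candidate combinations, each with >= 2 nodes.
    combos = [c for r in range(2, len(picked) + 1)
              for c in itertools.combinations(picked, r)]
    # Step 3: evaluate the computing power required by each combination
    # (stand-in for the constructed computing-power evaluation model).
    results = [(combo, sum(cost_of(n) for n in combo)) for combo in combos]
    # Preset condition (one of the cases above): the required computing
    # power fits the device on the target acceleration hardware.
    feasible = [(combo, cost) for combo, cost in results
                if cost <= device_capacity]
    if not feasible:
        return None
    # Among feasible candidates, keep the one that uses the capacity best;
    # this becomes the "first node combination" to be fused.
    return max(feasible, key=lambda pair: pair[1])[0]

branches = [["a1", "a2"], ["b1"], ["c1", "c2"]]
print(one_search_round(branches, device_capacity=10, cost_of=lambda n: 3))
```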
  • In the first computational graph, a multi-branch structure is one in which several branches have the same parent node (also referred to as a common parent node; for ease of explanation, the term common parent node is used throughout the embodiments of this application) or the same child node (also referred to as a common child node); a non-convergent structure refers to those branch structures in which the first node of the branch structure has no parent node or the last node of the branch structure has no child node; apart from the multi-branch and non-convergent structures, the remaining nodes in the first computational graph can be collectively referred to as the scattered structure.
  • In one possible implementation, one way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes may be: searching the first computational graph for multiple branches that have the same parent node (which may be called a first common parent node or a common first parent node; for ease of explanation, it is collectively referred to as the first common parent node in this embodiment of the present application); these branches may be called first branches, and a parallelizable branch group is obtained according to the multiple first branches. And/or, searching the first computational graph for multiple branches that have the same child node (which may be called a first common child node); these branches may be called second branches, and a parallelizable branch group is obtained according to the multiple second branches. This search method may also be referred to as a one-way search in a multi-branch structure.
  • In the above manner, a search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described. The search is based on whether nodes share a common parent node or a common child node, and since common parent nodes and common child nodes exist widely in computational graphs converted from neural networks, this search method can find a large number of parallelizable branch groups.
  • In one possible implementation, a specific way of obtaining a parallelizable branch group according to the multiple first branches may be: taking the first common parent node as the starting point, search downward along each of the multiple first branches according to the connection relationships between nodes, stopping when another common parent node or a common child node in the first computational graph is encountered during the downward search. The nodes traversed by each first branch during the downward search each form a sub-branch; for example, if there are 4 first branches, the downward search yields 4 sub-branches. Each such sub-branch may be called a first sub-branch, and the multiple first sub-branches obtained form a parallelizable branch group; for example, the above-mentioned 4 sub-branches form one parallelizable branch group.
  • It should be noted that each first sub-branch includes neither the first common parent node that served as the starting point nor any common child node: during the downward search along each first branch, if another common parent node (i.e., a second common parent node) is encountered, the second common parent node is included in the corresponding first sub-branch, and if a common child node is encountered, the common child node is excluded from the corresponding first sub-branch.
  • In the above manner, the deep learning framework obtains a parallelizable branch group according to multiple first branches, that is, the downward search starts from a common parent node. This way of extracting parallelizable branch groups from a common parent node ensures that nodes belonging to different sub-branches in the same parallelizable branch group have no inconsistent dependency behavior, which in turn guarantees that fusing nodes across branches will not form a ring structure. In general, when fusing nodes it is necessary to judge whether the fused nodes form a ring, and ring judgment is a very complicated operation; with this search method there is no need to additionally perform a ring-formation judgment on each obtained combination, which simplifies the operation process. A sketch of this one-way search follows.
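  • A simplified sketch of this one-way downward search is given below (assumptions: the graph is an adjacency-list dict, a `parents_of` callback is supplied, and a node with more than one child or more than one parent stands in for a common parent or common child node):

```python
# One-way downward search in a multi-branch structure: start at a common
# parent node, walk each outgoing branch, and stop on encountering another
# common parent node (included) or a common child node (excluded).

def downward_search(g, parents_of, common_parent):
    group = []
    for child in g[common_parent]:
        sub_branch, node = [], child
        while True:
            is_common_parent = len(g[node]) > 1
            is_common_child = len(parents_of(node)) > 1
            if is_common_child:            # common child: exclude and stop
                break
            sub_branch.append(node)        # a second common parent is included
            if is_common_parent or not g[node]:
                break
            node = g[node][0]              # series structure: follow the only edge
        if sub_branch:
            group.append(sub_branch)
    # a parallelizable branch group needs at least two sub-branches
    return group if len(group) >= 2 else None

graph = {"P": ["a1", "b1"], "a1": ["a2"], "a2": ["E"], "b1": ["E"], "E": []}
parents = lambda n: [k for k, v in graph.items() if n in v]
print(downward_search(graph, parents, "P"))  # [['a1', 'a2'], ['b1']]
```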
  • In one possible implementation, a specific way of obtaining a parallelizable branch group according to the multiple second branches may be: taking the first common child node as the starting point, search upward along each of the multiple second branches according to the connection relationships between nodes, stopping when another common child node or a common parent node in the first computational graph is encountered during the upward search. The nodes traversed by each second branch during the upward search each form a sub-branch; for example, if there are 3 second branches, the upward search yields 3 sub-branches. Each such sub-branch may be called a second sub-branch, and the multiple second sub-branches obtained form a parallelizable branch group; for example, the above-mentioned 3 sub-branches form one parallelizable branch group.
  • It should be noted that each second sub-branch includes neither the first common child node that served as the starting point nor any common parent node: during the upward search along each second branch, if another common child node is encountered, that common child node is included in the corresponding second sub-branch, and if a common parent node is encountered, the common parent node is excluded from the corresponding second sub-branch.
  • In the above manner, the deep learning framework obtains a parallelizable branch group according to multiple second branches, that is, the upward search starts from a common child node. This way of extracting parallelizable branch groups from a common child node likewise ensures that nodes belonging to different sub-branches in the same parallelizable branch group have no inconsistent dependency behavior, so fusing nodes across branches does not form a ring structure. Therefore, the search method described in this embodiment of the present application also ensures that no ring structure occurs, and there is no need to additionally perform a ring-formation judgment on each obtained combination, which simplifies the operation process.
  • In one possible implementation, the way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes may also be: searching the first computational graph for multiple third branches, where the first node of each third branch has no parent node, and obtaining a parallelizable branch group according to the multiple third branches; and/or searching the first computational graph for multiple fourth branches, where the last node of each fourth branch has no child node, and obtaining a parallelizable branch group according to the multiple fourth branches.
  • In the above manner, another search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described. The search is carried out from nodes without parent nodes or nodes without child nodes, which are also widespread in computational graphs converted from neural networks. Therefore, in addition to the search based on common parent nodes or common child nodes above, this search method can still find a large number of parallelizable branch groups. It may also be called a one-way search in a non-convergent structure.
  • In one possible implementation, a specific way of obtaining a parallelizable branch group according to the multiple third branches may be: first, the deep learning framework searches the first computational graph for one or more nodes without a parent node; taking each node without a parent node as a starting point, it searches downward along each of the multiple third branches according to the connection relationships between nodes, stopping when a common parent node or a common child node in the first computational graph is encountered during the downward search. The nodes traversed by each third branch each form a sub-branch; for example, if there are 2 third branches, the downward search yields 2 sub-branches. Each such sub-branch may be called a third sub-branch, and the multiple third sub-branches obtained form a parallelizable branch group; for example, the above-mentioned 2 sub-branches form one parallelizable branch group.
  • It should be noted that each third sub-branch does not include any common child node encountered during the downward search but may include the node that served as its starting point: during the downward search along each third branch, if a common parent node is encountered, the common parent node can be included in the corresponding third sub-branch, and if a common child node is encountered, the common child node is excluded from the corresponding third sub-branch.
  • In the above manner, the deep learning framework obtains a parallelizable branch group according to multiple third branches, that is, the downward search starts from nodes without a parent node. This way of extracting parallelizable branch groups from nodes without a parent node can also ensure that nodes belonging to different sub-branches in the same parallelizable branch group have no inconsistent dependency behavior, which guarantees that fusing nodes across branches will not form a ring structure. Therefore, the search method described in the embodiments of the present application can also ensure that no ring structure occurs, and there is no need to additionally perform a ring-formation judgment on each obtained combination, which simplifies the operation process.
  • In one possible implementation, a specific way of obtaining a parallelizable branch group according to the multiple fourth branches may be: first, the deep learning framework searches the first computational graph for one or more nodes without a child node; taking each node without a child node as a starting point, it searches upward along each of the multiple fourth branches according to the direct connection relationships between nodes, stopping when a common parent node or a common child node in the first computational graph is encountered during the upward search. The nodes traversed by each fourth branch during the upward search each form a sub-branch; each such sub-branch may be called a fourth sub-branch, and the multiple fourth sub-branches obtained form a parallelizable branch group; for example, 4 such sub-branches form one parallelizable branch group.
  • It should be noted that each fourth sub-branch does not include any common parent node encountered during the upward search but may include the node that served as its starting point: during the upward search along each fourth branch, if a common child node is encountered, the common child node can be included in the corresponding fourth sub-branch, and if a common parent node is encountered, the common parent node is excluded from the corresponding fourth sub-branch.
  • In the above manner, the deep learning framework obtains a parallelizable branch group according to multiple fourth branches, that is, the upward search starts from nodes without a child node. This way of extracting parallelizable branch groups can also ensure that nodes belonging to different sub-branches in the same parallelizable branch group have no inconsistent dependency behavior; for example, the sub-branches obtained by the upward search from nodes without a child node contain no common parent node, so it is guaranteed that fusing nodes across branches will not form a ring structure. Therefore, the search method described in the embodiments of the present application can also ensure that no ring structure occurs, and there is no need to additionally perform a ring-formation judgment on each obtained combination, which simplifies the operation process. A sketch of locating the starting points for both directions follows.
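  • Locating the starting points for the non-convergent one-way search reduces to finding parentless nodes (for the downward search) and childless nodes (for the upward search), as in this illustrative sketch:

```python
# Starting points for the one-way search in a non-convergent structure:
# nodes with no parent begin the third branches (searched downward), and
# nodes with no child begin the fourth branches (searched upward).

def start_points(g):
    has_parent = {c for succ in g.values() for c in succ}
    no_parent = [n for n in g if n not in has_parent]   # third-branch starts
    no_child = [n for n in g if not g[n]]               # fourth-branch starts
    return no_parent, no_child

g = {"x": ["m"], "y": ["m"], "m": ["u", "v"], "u": [], "v": []}
print(start_points(g))  # (['x', 'y'], ['u', 'v'])
```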
  • Since the one-way search in the multi-branch structure and the one-way search in the non-convergent structure are both based on the direct connection relationships between nodes, the respective parallelizable branch groups can be found directly by searching upward or downward.
  • However, the stopping conditions of these searches are strict, so after the first computational graph has been processed by the above search methods, some potentially parallel-executable nodes remain scattered in the graph; these nodes can be called scattered nodes. Scattered nodes may be independent nodes left behind after neighboring nodes have been merged into parallelizable branch groups, or they may belong to special node structures, such as a local diamond structure that fans out from one node through one or more intermediate paths; such nodes miss the chance of being found because of their own special structure or position, or are discarded because there are no valid nodes on the related branch. This type of node structure can be called a scattered structure. Finding nodes that can be executed in parallel in a scattered structure is more expensive than searching directly on the connection relationships between nodes, because it requires ring judgment.
  • In one possible implementation, the way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes may also be: searching the remaining scattered structure of the first computational graph for multiple fifth branches, and obtaining a parallelizable branch group according to the multiple fifth branches. This search method may also be called a scattered-structure acyclic search.
  • In the above manner, yet another search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described; this search is based on the scattered structure. As a supplementary search method to the two methods above, it can search the first computational graph for fusible nodes to the greatest extent, making the search more extensive. A sketch follows.
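  • The explicit ring judgment that the scattered-structure acyclic search requires can be sketched as follows (illustrative only: candidate scattered nodes are paired, each pair is trial-fused, and the pair is kept only if the collapsed graph stays acyclic; the patent's actual candidate selection is not reproduced here, and the graph is assumed to be an adjacency-list dict whose every node appears as a key):

```python
# Scattered-structure acyclic search, simplified: keep only the pairings
# of scattered nodes whose trial fusion leaves the graph acyclic. Ring
# judgment is the expensive step that the one-way searches avoid; here it
# must be performed explicitly.
import itertools

def acyclic_after_fusion(g, pair):
    merged = frozenset(pair)
    key = lambda n: "FUSED" if n in merged else n
    ng = {}
    for n, succ in g.items():
        ng.setdefault(key(n), set()).update(
            key(c) for c in succ if key(c) != key(n))
    ng.setdefault("FUSED", set())
    seen, on_stack = set(), set()
    def dfs(n):                      # DFS cycle detection on the fused graph
        seen.add(n)
        on_stack.add(n)
        for c in ng.get(n, ()):
            if c in on_stack or (c not in seen and dfs(c)):
                return True
        on_stack.discard(n)
        return False
    return not any(n not in seen and dfs(n) for n in list(ng))

def scattered_search(g, scattered_nodes):
    return [pair for pair in itertools.combinations(scattered_nodes, 2)
            if acyclic_after_fusion(g, pair)]

g = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
print(scattered_search(g, ["a", "d", "e"]))  # pairs that survive ring judgment
```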
  • In one possible implementation, after obtaining the second computational graph, the deep learning framework may further compile it to obtain a compiled second computational graph, which can be directly loaded onto the target acceleration hardware for execution.
  • The compilation process of the second computational graph includes two parts: one part is the compilation of the ordinary, unfused nodes, and the other part is the compilation of the fusion nodes, which yields the kernel corresponding to each fusion node.
  • In the above manner, the compilation process of the second computational graph, which includes the compilation process of the fusion nodes, is described and is readily implementable.
  • In one possible implementation, in order to preserve the parallelism between the nodes that make up a fusion node, the present application completes kernel generation by segment merging when compiling a fusion node; in the generated code, each code segment maintains clear independence through code branching.
  • Segment merging divides IR generation into two stages: 1) in the first stage, each node is scheduled independently to obtain as many sub-IRs as there are nodes; 2) in the second stage, these sub-IRs are fused and modified to obtain the total IR.
  • Specifically, for a fusion node obtained by fusing p nodes, the method of compiling it to obtain the corresponding kernel may be: first, schedule the p nodes independently to obtain p sub-IRs; then fuse the p sub-IRs to obtain a total IR; and finally compile the total IR to obtain the kernel corresponding to the fusion node.
  • In the above manner, the independent scheduling analysis of each node improves the flexibility of fusion and ensures that the fusion-node code is generated correctly; the generated kernels still maintain parallelism, allowing the corresponding target acceleration hardware to perform deeper parallel optimization.
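  • The two-stage flow can be illustrated with strings standing in for real IR; the scheduler and code generator of an actual compilation suite are far more involved, so the sketch below only shows the shape of segment merging (the `block_id` branching is an assumed stand-in for the code branching mentioned above):

```python
# Toy sketch of two-stage segment merging for a fusion node.

def schedule_node(node):
    """Stage 1: schedule one node independently -> one sub-IR."""
    return f"// sub-IR for {node}\n{node}_kernel_body();"

def merge_sub_irs(sub_irs):
    """Stage 2: fuse the sub-IRs into one total IR. Each segment is kept
    independent behind a code branch, so the parallelism between the
    original nodes survives in the fused kernel."""
    segments = [f"if (block_id == {i}) {{\n{ir}\n}}"
                for i, ir in enumerate(sub_irs)]
    return "// total IR for the fusion node\n" + "\n".join(segments)

def compile_fusion_node(nodes):
    sub_irs = [schedule_node(n) for n in nodes]   # p nodes -> p sub-IRs
    total_ir = merge_sub_irs(sub_irs)             # p sub-IRs -> one total IR
    return total_ir                               # then compiled into a kernel

print(compile_fusion_node(["matmul", "add", "relu"]))
```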
  • It should be noted that the deep learning framework can be an AI framework in various specific forms, for example a mainstream deep learning framework such as MindSpore, TensorFlow, TensorNetwork, PyTorch, MXNet, Caffe, or Theano, or another niche deep learning framework; as long as a framework can optimize and compile computational graphs, it can be regarded as the deep learning framework described in the embodiments of this application. The specific form of the deep learning framework is not limited here.
  • A second aspect of the embodiments of the present application provides a deep learning framework that has the function of implementing the method of the first aspect or any possible implementation of the first aspect. This function can be implemented by hardware or by hardware executing corresponding software, where the hardware or software includes one or more modules corresponding to the above function.
  • A third aspect of the embodiments of the present application provides a computer device, which may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to call the program stored in the memory to execute the method of the first aspect of the embodiments of the present application or any possible implementation of the first aspect.
  • A fourth aspect of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions run on a computer, the computer can execute the method of the first aspect or any possible implementation of the first aspect.
  • A fifth aspect of the embodiments of the present application provides a computer program or computer program product; when the computer program or computer program product runs on a computer, the computer can execute the method of the first aspect or any possible implementation of the first aspect.
  • A sixth aspect of the embodiments of the present application provides a chip, where the chip includes at least one processor and at least one interface circuit, the interface circuit is coupled to the processor, the at least one interface circuit is configured to perform transceiving functions and send instructions to the at least one processor, and the at least one processor is used to run a computer program or instructions. The chip has the function of implementing the method of the first aspect or any possible implementation of the first aspect; the function can be implemented by hardware, by software, or by a combination of hardware and software, where the hardware or software includes one or more modules corresponding to the above function.
  • Optionally, the interface circuit is used to communicate with modules outside the chip; for example, the interface circuit can send the second computational graph obtained by the on-chip processor to target acceleration hardware (e.g., a CPU, NPU, GPU, TPU, application specific integrated circuit (ASIC), or FPGA).
  • FIG. 1 is a schematic diagram of the topological sorting of a computational graph;
  • FIG. 2 is a schematic diagram of the network volume of different types of neural networks;
  • FIG. 3 is a schematic structural diagram of the devices and computing units included in different acceleration hardware according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the compilation process (ordinary node compilation) of multiple nodes in the computational graph of a neural network according to an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the horizontal fusion method;
  • FIG. 6 is a schematic diagram of the operator-level parallelism method;
  • FIG. 7 is a schematic structural diagram of an artificial intelligence main framework according to an embodiment of the present application;
  • FIG. 8 is a flowchart of a deep learning framework processing a computational graph according to an embodiment of the present application;
  • FIG. 9 is a schematic flowchart of a node fusion method for a computational graph according to an embodiment of the present application;
  • FIG. 10 is a schematic flowchart of optimizing the first computational graph according to an embodiment of the present application;
  • FIG. 11 is a schematic diagram of different branch structure features in the first computational graph according to an embodiment of the present application;
  • FIG. 12 is a schematic diagram of obtaining a parallelizable branch group according to multiple first branches according to an embodiment of the present application;
  • FIG. 13 is a schematic diagram of obtaining a parallelizable branch group according to multiple second branches according to an embodiment of the present application;
  • FIG. 14 is a schematic diagram of obtaining a parallelizable branch group according to multiple third branches according to an embodiment of the present application;
  • FIG. 15 is a schematic diagram of obtaining a parallelizable branch group according to multiple fourth branches according to an embodiment of the present application;
  • FIG. 16 is a schematic diagram of simplifying the local structure around a target node according to an embodiment of the present application;
  • FIG. 17 is a schematic diagram of finding two further fifth branches from the first computational graph according to an embodiment of the present application;
  • FIG. 18 is a schematic diagram of a typical scattered structure containing nodes that can be executed in parallel according to an embodiment of the present application;
  • FIG. 19 is a schematic diagram of a specific example of the scattered-structure acyclic search according to an embodiment of the present application;
  • FIG. 20 is a schematic diagram of a parallelizable branch group obtained by the deep learning framework according to an embodiment of the present application;
  • FIG. 21 is a schematic diagram of removing non-fused nodes from a first parallelizable branch group to obtain a second parallelizable branch group according to an embodiment of the present application;
  • FIG. 22 is a schematic diagram of the computing-power usage on the device after 3 different nodes are compiled into kernels according to an embodiment of the present application;
  • FIG. 23 is a schematic diagram of the search process of the combinatorial search according to an embodiment of the present application;
  • FIG. 24 is a schematic diagram comparing ordinary node compilation with fusion node compilation according to an embodiment of the present application;
  • FIG. 25 is a specific code example of segment merging according to an embodiment of the present application;
  • FIG. 26 is another specific code example of segment merging according to an embodiment of the present application;
  • FIG. 27 is a schematic flowchart of a node fusion method for a computational graph according to an embodiment of the present application;
  • FIG. 28 is a schematic diagram of the benefit when the kernels before and after node fusion are executed on the device according to an embodiment of the present application;
  • FIG. 29 is a schematic structural diagram of a deep learning framework according to an embodiment of the present application;
  • FIG. 30 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • The embodiments of the present application provide a node fusion method and device for a computational graph. Compared with the prior art, the possibility of additional parallel branches is considered, so that combinations of parallelizable branches different from those defined by prior-art rules can be found; the nodes in these branch combinations can then be fused during node fusion of the computational graph, thereby expanding the range of nodes that can be fused.
  • The embodiments of the present application involve a great deal of related knowledge about neural networks, computational graphs, and so on. To better understand the solutions of the embodiments of the present application, the related terms and concepts that may be involved are first introduced below. It should be understood that the interpretation of these concepts may be constrained by the specific circumstances of the embodiments of this application, but this does not mean that the application is limited to those specific circumstances; the specific circumstances of different embodiments may differ, and no limitation is imposed here.
  • A neural network can be composed of neural units and can be understood as a network with an input layer, a hidden layer, and an output layer; generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN).
  • The work of each layer in a neural network can be described mathematically. From the physical level, the work of each layer can be understood as completing the transformation from input space to output space (that is, from the row space of a matrix to its column space) through five operations on the input space (the set of input vectors): 1. raising/lowering dimension; 2. enlarging/reducing; 3. rotating; 4. translating; 5. bending.
  • A computational graph is a general representation of a computing process: a directed acyclic graph describing functions, widely used on various data processing platforms. A computational graph includes multiple nodes and directed edges.
  • A neural network uses one or more neural network layers (e.g., hidden layers and an output layer) to generate an output for a received input; the output of each hidden layer is used as the input of the next layer (e.g., the next hidden layer or the output layer of the neural network), and each layer of the neural network generates its output from the received input according to the current values of the layer's parameters (e.g., weights).
  • In the field of machine learning, the computational graph is used to represent the computation logic involved in a neural network. Each node in the computational graph represents an operation performed by the neural network (for example, an add node represents an addition operation), which can also be called a computing task: one node represents one computing task. The direction of an arrow on a directed edge represents the flow of data, and a directed edge with an arrow connects the previous node (which may be called a predecessor or parent node) to the next node (which may be called a successor or child node), indicating that the output of the parent node serves as the input of the child node.
  • By convention, the node pointed to by the arrow of a directed edge is the child node of the node at the tail of the edge, and the node at the tail is the parent node of the node the arrow points to. For example, node B in the left part of FIG. 1 is a child node of node A, and node B is at the same time the parent node of node C and node D.
  • If the output of a certain node is used as the input of several different nodes, that node is called the same parent node of those different nodes (it may also be called a common parent node; for ease of explanation, it is collectively referred to as a common parent node in the embodiments of this application). For example, node B in the left part of FIG. 1 is simultaneously the parent node of node C and node D, so node B is called the common parent node of node C and node D. Similarly, if the outputs of several different nodes are all used as the input of a certain node, that node is called the same child node of those different nodes (it may also be called a common child node; it is collectively referred to as a common child node). For example, node E in the left part of FIG. 1 is simultaneously a child node of node C and node D, so node E is called the common child node of node C and node D.
  • In the embodiments of the present application, the process of searching the computational graph includes a downward search and an upward search. A downward search refers to the process of searching from a selected node along the direction of the arrows of the directed edges in the graph (that is, the direction the arrows point); an upward search refers to the process of searching from a selected node against the direction of the arrows of the directed edges (that is, toward the tails of the arrows).
  • Node fusion in a computational graph refers to integrating the computing logic of two or more nodes in the computational graph to obtain a new node, which is called a fusion node; an unfused node is a node that can be fused but has not yet been fused. The input of the new node is the set of the inputs of the nodes before fusion, and the output of the new node includes the outputs of the nodes before fusion. After fusion, the computing tasks corresponding to the fused nodes are integrated into one computing task and loaded for execution onto the module of the corresponding acceleration hardware that actually executes computing tasks (this module may be referred to as a device).
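  • The bookkeeping for a fusion node follows directly from this definition, as in the following sketch (the dict-based node representation is an assumption for illustration, not the patent's data structure):

```python
# Fusion-node bookkeeping: the new node's inputs are the set of the
# original nodes' inputs, and its outputs include every original output.

def fuse_nodes(nodes):
    """nodes: dicts like {'name': ..., 'inputs': [...], 'outputs': [...]}."""
    inputs = []
    for n in nodes:
        for i in n["inputs"]:
            if i not in inputs:            # input set: union without duplicates
                inputs.append(i)
    outputs = [o for n in nodes for o in n["outputs"]]
    return {"name": "+".join(n["name"] for n in nodes),
            "inputs": inputs, "outputs": outputs}

a = {"name": "mul0", "inputs": ["x", "w"], "outputs": ["t0"]}
b = {"name": "mul1", "inputs": ["y", "w"], "outputs": ["t1"]}
print(fuse_nodes([a, b]))
# {'name': 'mul0+mul1', 'inputs': ['x', 'w', 'y'], 'outputs': ['t0', 't1']}
```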
  • In addition, in a computational graph there are also nodes that are not suitable for fusion or that cannot be fused; such nodes may be collectively referred to as non-fused nodes.
  • Acceleration hardware may also be called a hardware accelerator or hardware acceleration chip. As the volume of neural networks grows larger and their network structures become more complex, dedicated computing acceleration hardware has emerged as the times require to undertake heavy computing tasks, such as GPUs, TPUs, and Ascend processors (e.g., Ascend 910 and Ascend 310). Such acceleration hardware enhances the capability for intensive computing and offers ever larger parallel computing space.
  • Generally, the computational graph of a neural network is compiled and then executed by the device on the acceleration hardware that performs the computing tasks. Each piece of acceleration hardware has one or more devices, and each device is further divided into one or more smaller computing units. Different types of acceleration hardware, such as GPUs and TPUs, include different numbers of devices, and the computing units included in the devices of different acceleration hardware differ slightly.
  • For example, the computing unit included in a device of a GPU is the streaming multiprocessor (SM), as shown in sub-diagram (a) of FIG. 3; the computing unit included in a TPU (or an Ascend processor) is the computational core (core), as shown in sub-diagram (b) of FIG. 3.
  • It should be noted that the number of devices included in the GPU and the TPU in FIG. 3, and the number of SMs or cores included in each device, are for illustration only and are not specifically limited here. As shown in Table 1 below, the number of SMs or cores included in different models of GPUs, TPUs, and Ascend processors is considerable, and the number of computing units that can execute in parallel keeps rising.
  • Table 1: Number of parallel-executing computing units in different acceleration hardware
  • An IR is an intermediary for translation between source code and object code during program compilation. Generally, the compiler does not translate source code directly into object code; it first translates it into an "intermediate language" and then translates the "intermediate language" into object code. The compilation process of the compiler is therefore usually divided into a front end and a back end: the front end performs lexical analysis, syntax analysis, semantic analysis, and so on, on the input source code and then generates the intermediate representation (i.e., the IR); the back end optimizes the IR and generates object code that runs directly on the target device (e.g., various acceleration hardware).
  • In the embodiments of the present application, the computing tasks represented by the nodes in the computational graph can be regarded as source code, each node corresponds to an IR, and after each IR is compiled the object code corresponding to each node can be obtained; this object code is called a kernel. The generated kernel can be recognized by the corresponding acceleration hardware and is executed by the device on that acceleration hardware that actually executes computing tasks.
  • FIG. 4 is a schematic diagram of the compilation process of multiple nodes in the computational graph of a neural network according to an embodiment of this application, illustrated with 3 nodes as an example. The computing tasks corresponding to the 3 nodes are translated by the compiler into 3 IRs, and after the IRs are compiled, the 3 resulting kernels can be directly loaded onto the device of the corresponding acceleration hardware, where the computing units on the device (for example, SMs or cores) execute the corresponding kernels in turn.
  • The XLA compiler is a domain-specific linear algebra compiler that can accelerate TensorFlow models, potentially without any source code changes; TensorFlow is a common deep learning framework.
  • AKG is a compiler commonly used in deep learning frameworks and an automatic generator of kernels in neural networks.
  • As described above, the existing node fusion methods for computational graphs mainly include the horizontal fusion method and the operator level parallelism method.
  • The horizontal fusion method is an optimization pass of XLA for GPUs. When this optimization pass is enabled, many small kernels in the computational graph (such as the multiplications and additions performed when updating the training parameters of the neural network) are fused. FIG. 5 is a schematic diagram of the horizontal fusion method. In its implementation, in order to ensure that no ring structure is introduced, the method searches upward starting from the return node of the computational graph (that is, the (ROOT) tuple node in the left part of FIG. 5).
  • Operator-level parallelism is an optimization step of the deep learning framework MXNet, in which all child nodes of the same common parent node are identified and merged into a new node. Because they share a common parent node, the computing tasks of the child nodes are independent of one another, so the new node generated by the fusion exhibits the parallelism of the child nodes.
  • FIG. 6 is a schematic diagram of the operator-level parallelism method. Specifically, starting from the Split node in the left part of FIG. 6, its child nodes Embeding0 to Embeding25 are found and merged into the ParallelOP node in the right part of FIG. 6. Processing all nodes of the entire computational graph that meet this characteristic in a similar way completes the optimization of the computational graph.
  • However, the horizontal fusion method has the following limitations in its implementation: 1) it is conservative when looking for fusible nodes, searching backward for fusible nodes only from the return node of the computational graph (that is, it acts only on the (ROOT) tuple node) and stopping as soon as the search process is interrupted; 2) the nodes to be fused must have the same data arrangement and output data type.
  • The limitation in the implementation of operator-level parallelism is that the input data of the multiple nodes to be fused must come from the same common parent node; that is, the nodes to be fused (e.g., the Embeding nodes in FIG. 6) must be directly connected to the common parent node (e.g., the Split node in FIG. 6).
  • In summary, the existing node fusion methods for computational graphs have many constraints and cannot fully search for fusible nodes. How to search computational graphs for fusible nodes with wider coverage and fewer restrictions has therefore become an urgent problem to be solved.
  • To address this, an embodiment of the present application provides a node fusion method for computational graphs. The method searches for parallelizable branch groups based only on the connection relationships between nodes in the computational graph, the search process is not interrupted by individual non-fused nodes, and the hit rate of searching for fusible nodes in the computational graph of a neural network is thereby improved.
  • Figure 7 shows a structural schematic diagram of the main framework of artificial intelligence.
  • the above artificial intelligence framework is explained along two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output; in this process, the data goes through the refinement of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technology for providing and processing information) to the industrial ecology of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported through the basic platform. The infrastructure communicates with the outside through sensors; its computing power is provided by smart chips (also called AI chips), for example hardware acceleration chips such as CPUs, NPUs, GPUs, TPUs, ASICs, and FPGAs, and the smart chips also include processors of the Ascend series (such as Ascend 910, Ascend 310, etc.); the basic platform includes distributed computing frameworks and related platform guarantees and support such as networks, which can include cloud storage and computing, interconnection networks, and the like.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to a reasoning control strategy; its typical functions are search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, the productization of intelligent information decision-making, and the realization of practical applications. The main application areas include: intelligent terminals, intelligent manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, safe city, etc.
  • the present application can be applied to various fields in the field of artificial intelligence, for example, the field of computer vision (eg, the field of image processing), the field of semantic analysis, etc.
  • the node fusion method of the computational graph in the embodiments of the present application belongs to a specific data processing method under the above "(3) data processing".
  • the node fusion method of the computational graph is applied to a deep learning framework: the fusable nodes in the computational graph are fused through the deep learning framework to obtain fusion nodes, the obtained fusion nodes and the ordinary nodes (that is, the unfused nodes) are processed by the compiler in the deep learning framework into kernels that can be recognized by smart chips (eg, the above-mentioned GPU, TPU and other acceleration hardware), and finally each kernel is executed by the device on the corresponding acceleration hardware.
  • an embodiment of the present application provides a processing flowchart of the deep learning framework for processing the computational graph.
  • the processing flow of the deep learning framework on the computational graph can be divided into four steps, namely network definition, computational graph optimization, compilation, and hardware loading, which are introduced below respectively:
  • in the network definition step, the network structure of the neural network is defined through the API (application programming interface) provided by the deep learning framework; in the computational graph optimization step, an optimization step of parallel fusion is additionally added, which is used to fuse potentially parallel-executable nodes (that is, fusable nodes) in the computation graph to obtain fusion nodes.
  • in the compilation step, the compiler in the deep learning framework is responsible for compiling, according to each node (including the fusion nodes) in the computation graph, the kernels required by the acceleration hardware.
  • LLVM (low level virtual machine) is a compiler framework system, written in C++, used to optimize the compile-time, link-time, run-time, and idle-time of programs written in any programming language; it is open to developers and compatible with existing scripts.
  • common node compilation and fusion node compilation are distinguished at compile time: common node compilation refers to the compilation of nodes in the computation graph that have not been fused, which is similar to the existing compilation process, while fusion node compilation refers to compiling each obtained fusion node, and its compiling process is different from the existing compiling process of common nodes.
  • in the hardware loading step, the kernel corresponding to each node can be obtained, and each kernel can be loaded onto the corresponding acceleration hardware (such as GPU, TPU, Ascend 910, Ascend 310, etc.); the computing task of each node is specifically performed by the device on the corresponding acceleration hardware, which has related functions such as memory allocation, data copying, and device management.
  • the deep learning framework may be an AI framework in various specific forms, for example a mainstream deep learning framework such as mindspore, tensorflow, tensornetwork, pytorch, mxnet, caffe, or theano, or another niche deep learning framework; as long as the deep learning framework can optimize and compile the computation graph, it can be regarded as the deep learning framework described in the embodiments of this application, and its specific form of expression is not limited.
  • FIG. 9 is a schematic flowchart of the node fusion method of the computation graph provided by the present application, and the method can be applied to the deep learning framework described above. The method may include the following steps:
  • first, the deep learning framework obtains the network structure of the neural network. The network structure can be a network structure of the neural network customized through the API provided by the deep learning framework, or a network structure of an existing neural network obtained directly from a network model library (model zoo); this application does not limit how the deep learning framework obtains the neural network.
  • after obtaining the network structure of the neural network, the deep learning framework further converts the neural network into a computational graph, and the obtained computational graph may be called the first computational graph.
  • each parallelizable branch group includes a plurality of sub-branches (that is, at least two sub-branches); a sub-branch is a sequential series structure without branching, and each sub-branch in each parallelizable branch group includes one or more nodes. A parallelizable branch group indicates that its multiple sub-branches support being executed in parallel.
  • sub-branches belonging to the same parallelizable branch group need to satisfy two conditions: first, there is no connection relationship between the sub-branches; second, after any two or more nodes belonging to different sub-branches are merged into one node, no ring structure appears in the computation graph, that is, the fused computation graph contains no ring structure.
  • a connection relationship between nodes in the first computation graph means that, during a specific data execution process, the execution order between certain nodes must be guaranteed.
  • the execution process of this step 902 may be referred to as network branch extraction.
  • at least one parallelizable branch group (which may be referred to as a first parallelizable branch group) among the parallelizable branch groups satisfies at least one of the following conditions: the inputs of all sub-branches in the first parallelizable branch group come from the same node while the outputs of at least two sub-branches point to different nodes; the outputs of all sub-branches in the first parallelizable branch group point to the same node while the inputs of at least two sub-branches come from different nodes; the first node of every sub-branch in the first parallelizable branch group has no parent node; the last node of every sub-branch in the first parallelizable branch group has no child node.
  • when there are multiple parallelizable branch groups, the parallelizable branch groups further include at least one second parallelizable branch group, and the second parallelizable branch group satisfies the following conditions: the inputs of all sub-branches in the second parallelizable branch group come from the same node, the outputs of at least two sub-branches in the second parallelizable branch group point to the same node, and each sub-branch includes at least two nodes.
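  • the two conditions can be made concrete with a small reachability check, reusing the illustrative Node class from the sketch above (this is an illustrative reading, not the literal procedure of the embodiment): if no node of one sub-branch can reach a node of another sub-branch in either direction, the sub-branches have no connection relationship, and fusing nodes across them cannot close a ring:

```python
def reachable(src: Node, dst: Node) -> bool:
    # Depth-first walk along child edges (virtual edges included).
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n is dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(n.children)
    return False

def may_form_group(branches: list[list[Node]]) -> bool:
    # Condition 1: no connection relationship between different sub-branches.
    # Condition 2 follows: with no path in either direction, merging nodes
    # from different sub-branches cannot introduce a ring structure.
    for i, a in enumerate(branches):
        for b in branches[i + 1:]:
            if any(reachable(x, y) or reachable(y, x) for x in a for y in b):
                return False
    return True
```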
  • the obtained first computation graph may be further optimized, and one purpose of the optimization is to enable the optimized first computation graph to intuitively reflect the dependencies between data streams. The optimization may consist of removing nodes unrelated to computation from the first computation graph, adding virtual edges related to dependency analysis, and so on. For example, a synchronization node in the first computation graph expresses a time-sequence relationship of execution and is a synchronization message that has nothing to do with the specific computation, so such a node can be eliminated and a dependency virtual edge added between the related nodes.
  • the left part of Figure 10 is a partial schematic of the original first computation graph, which includes synchronization node D. During optimization, synchronization node D is eliminated and a virtual edge is added between node A and node E; the virtual edge is shown as a dashed line with an arrow in the right part of FIG. 10 and indicates that there is a direct connection relationship between node A and node E.
  • the optimized first computation graph includes the direct connection relationships between nodes, that is, the data flow dependency relationships. A direct connection relationship covers, for each node in the first computation graph, the supply node of its input data (that is, its parent node) and the consumption node of its output data (that is, its child node). For example, given A→B→C, the supply node of the input data of node B is node A, and the consumption node of the output data of node B is node C. Specifically, as shown in the right part of FIG. 10, there is a direct connection relationship between node A and each of node B and node E, a direct connection relationship between node B and node C, and so on, which will not be enumerated one by one here.
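  • a minimal sketch of this optimization, reusing the Node and link helpers above (is_sync is a hypothetical predicate identifying synchronization nodes): each synchronization node is removed and dependency virtual edges are added from its parents to its children, as with node A and node E in Figure 10:

```python
def eliminate_sync_nodes(nodes: list[Node], is_sync) -> None:
    # Remove computation-unrelated synchronization nodes and preserve the
    # execution order they expressed by adding virtual parent->child edges.
    for n in [x for x in nodes if is_sync(x)]:
        for p in n.parents:
            p.children.remove(n)
            for c in n.children:
                link(p, c)  # dependency virtual edge (e.g. A -> E)
        for c in n.children:
            c.parents.remove(n)
        nodes.remove(n)
```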
  • after obtaining the first computational graph (or the optimized first computational graph), the deep learning framework further performs a hierarchical search on it and obtains one or more parallelizable branch groups, where different levels correspond to different search methods. For ease of explanation, the following embodiments are described based on the first computational graph.
  • the hierarchical search is carried out for the different branch structure features in the entire first computation graph.
  • different branch structure features in the first calculation graph can be described by three structural features, which can be called multi-branch structure, non-convergence structure, and scattered structure respectively.
  • FIG. 11 shows the different branch structure features in the first computation graph provided by an embodiment of the present application, which are described below respectively:
  • the multi-branch structure is a branch structure with a common parent node or a common child node, where common parent nodes and common child nodes may be collectively called confluence nodes. This branch structure has the most intuitive parallelism: as shown in sub-schematic (a) of Figure 11, there is no connection between the branches of the multi-branch structure (excluding the confluence nodes), so the branches can be executed in parallel.
  • the non-convergence structure refers to a branch structure in which the first node of the branch structure has no parent node or the last node of the branch structure has no child node. By adding a virtual confluence node to such branches, the structure can be converted into a multi-branch structure and processed accordingly. As shown in sub-schematic (b) of Figure 11, the white nodes and solid arrows are the actual nodes and connection relationships, the gray node is an additionally added virtual confluence node, and the dashed arrows are the virtual connection relationships added along with it.
  • the remaining nodes in the first computation graph can be collectively referred to as the scattered structure. Within the scattered structure there may still be branches, and the nodes between the branches have no direct parent-child relationship; nevertheless, these branch structures can be executed in parallel in terms of topological order. As shown in sub-schematic (c) of Figure 11, there is no direct connection between these nodes, but some of the branch structures have potential parallelism.
  • correspondingly, the hierarchical search process may include at least one of the following search methods: one-way search of the multi-branch structure, one-way search of the non-convergence structure, and acyclic search of the scattered structure.
  • the process of one-way search of the multi-branch structure may also be referred to as the first-level search. The deep learning framework performs the first-level search on the first computational graph, and the first-level search includes but is not limited to at least one of the following:
  • method a: search the first computation graph for multiple branches (which may be called first branches) that have a common parent node (which may be called a first common parent node), and obtain a parallelizable branch group according to the multiple first branches.
  • specifically, the deep learning framework searches the first computational graph for multiple first branches having a common first common parent node, where the first common parent node is any common parent node in the first computation graph, and obtains a parallelizable branch group according to the multiple first branches.
  • a specific implementation of obtaining a parallelizable branch group according to the multiple first branches may be as follows: taking the first common parent node as the starting point, search downward along each of the multiple first branches according to the connection relationships between nodes, and stop when another common parent node (which may be called a second common parent node) or a common child node in the first computation graph is encountered during the downward search. The nodes traversed by each first branch constitute a sub-branch, and each such sub-branch may be called a first sub-branch; the obtained multiple first sub-branches form a parallelizable branch group (for example, if there are 4 first branches, the 4 first sub-branches obtained by the downward search constitute a parallelizable branch group). It should be noted that each first sub-branch includes neither the first common parent node serving as the starting point nor any common child node: during the downward search of each first branch, if another common parent node (that is, a second common parent node) is encountered, that second common parent node is included in the corresponding first sub-branch, whereas if a common child node is encountered, the common child node is excluded from the corresponding first sub-branch.
  • FIG. 12 is a schematic diagram of obtaining a parallelizable branch group according to a plurality of first branches provided by an embodiment of the present application. Node A is the first common parent node, from which two branches (that is, first branches) extend: one branch includes node B, node C, node F, and node G, and the other branch includes node D, node E, node H, and node I. Searching downward along each of these two branches, the left branch stops the downward search when it encounters another common parent node, node C, and the right branch stops the downward search when it encounters a common child node, node H. The nodes traversed by the left branch are A, B, and C, and the nodes traversed by the right branch are A, D, E, and H. Node C is a common parent node and can be included in the left sub-branch; node H is a common child node and is excluded from the right sub-branch; node A, being the first common parent node serving as the starting point, is excluded from both sub-branches. Therefore, the left sub-branch is (B→C) and the right sub-branch is (D→E).
  • method b: the deep learning framework searches the first computational graph for multiple second branches having a common first common child node, where the first common child node is any common child node in the first computation graph, and obtains a parallelizable branch group according to the multiple second branches.
  • a specific implementation of obtaining a parallelizable branch group according to the multiple second branches may be as follows: taking the first common child node as the starting point, search upward along each of the multiple second branches, and stop when another common child node or a common parent node in the first computation graph is encountered during the upward search. The nodes traversed by each second branch constitute a sub-branch; for example, if there are 3 second branches, 3 sub-branches are obtained after the above upward search. Each such sub-branch may be called a second sub-branch, and the obtained multiple second sub-branches form a parallelizable branch group (for example, the above 3 sub-branches constitute a parallelizable branch group). It should be noted that each second sub-branch includes neither the first common child node serving as the starting point nor any common parent node: during the upward search of each second branch, if another common child node is encountered, that common child node is incorporated into the corresponding second sub-branch, whereas if a common parent node is encountered, the common parent node is excluded from the corresponding second sub-branch.
  • FIG. 13 is a schematic diagram of obtaining a parallelizable branch group according to a plurality of second branches provided by an embodiment of the application. Node H is the first common child node, into which two branches (that is, second branches) converge: one branch includes node F and node D, and the other branch includes node G, node E, node C, node I, and node J. Searching upward along these two branches, the left branch stops the upward search when it encounters another common child node, node D, and the right branch stops the upward search when it encounters a common parent node, node I. The nodes traversed by the left branch are H, F, and D, and the nodes traversed by the right branch are H, G, E, C, and I. Node D is a common child node and can be included in the left sub-branch; node I is a common parent node and is excluded from the right sub-branch; node H, being the first common child node serving as the starting point, is excluded from both sub-branches. Therefore, the left sub-branch is (D→F) and the right sub-branch is (C→E→G).
  • the above manner of extracting a parallelizable branch group starting from a common parent node or from a common child node can ensure that nodes belonging to different sub-branches in the same parallelizable branch group do not have inconsistent dependency behaviors: there is no common parent node in a sub-branch obtained by the upward search based on a common child node, and similarly there is no common child node in a sub-branch obtained by the downward search based on a common parent node, which ensures that the fusion of nodes across branches will not form a ring structure.
  • it should be noted that the one-way search of the multi-branch structure may consist of only the search process of method a above, of only the search process of method b above, or of the search processes of method a and method b performed simultaneously, which is not specifically limited here.
  • if the search processes of method a and method b above are performed simultaneously, then a given node (eg, node A) in the first computation graph belongs to the sub-branch that traverses it first, and the node is marked with an identification indicating that it has already been assigned to a sub-branch, so as to prevent repeated grouping in the subsequent search process.
  • the process of one-way search of the non-convergence structure may also be referred to as the second-level search. The deep learning framework performs the second-level search on the first computational graph, and the second-level search includes but is not limited to at least one of the following search processes:
  • method a: the deep learning framework searches the first computational graph for multiple third branches without parent nodes, and obtains a parallelizable branch group according to the multiple third branches.
  • a specific implementation of obtaining a parallelizable branch group according to the multiple third branches may be as follows: first, the deep learning framework searches the first computational graph for one or more nodes without a parent node and takes each such node as a starting point; then, according to the connection relationships between nodes, it searches downward along each of the multiple third branches, and stops when a common parent node or a common child node in the first computation graph is encountered during the downward search. The nodes traversed by each third branch constitute a sub-branch; for example, if there are 2 third branches, 2 sub-branches are obtained after the above downward search. Each such sub-branch may be called a third sub-branch, and the obtained multiple third sub-branches form a parallelizable branch group (for example, the above 2 sub-branches constitute a parallelizable branch group). Each third sub-branch does not include a common child node encountered during the downward search, but may include the node serving as its starting point; during the downward search of each third branch, if a common parent node is encountered, the common parent node can be incorporated into the corresponding third sub-branch, whereas if a common child node is encountered, the common child node is excluded from the corresponding third sub-branch.
  • the search process of method a of the second-level search is similar to that of method a of the first-level search, except that the second-level search does not start from a single node; instead, multiple non-confluence nodes (that is, nodes without a parent node) are taken as a group to perform the downward search, and these non-confluence nodes are also included in their respective sub-branches.
  • FIG. 14 is a schematic diagram of obtaining a parallelizable branch group according to a plurality of third branches according to an embodiment of the present application. The deep learning framework starts from nodes B and D, which have no parent node, and searches downward from each of them, obtaining the sub-branch (B→C) and the sub-branch (D→E); the sub-branch (B→C) and the sub-branch (D→E) constitute a parallelizable branch group.
  • method b: the deep learning framework searches the first computational graph for multiple fourth branches without child nodes, and obtains a parallelizable branch group according to the multiple fourth branches.
  • a specific implementation of obtaining a parallelizable branch group according to the multiple fourth branches may be as follows: first, the deep learning framework searches the first computational graph for one or more nodes without child nodes and takes each such node as a starting point; then, according to the direct connection relationships between nodes, it searches upward along each of the multiple fourth branches, and stops when a common parent node or a common child node in the first computation graph is encountered during the upward search. The nodes traversed by each fourth branch constitute a sub-branch, and each such sub-branch may be called a fourth sub-branch; the obtained multiple fourth sub-branches form a parallelizable branch group (for example, if there are 4 fourth branches, the above 4 sub-branches constitute a parallelizable branch group). Each fourth sub-branch does not include a common parent node encountered during the upward search, but may include the node serving as its starting point; during the upward search of each fourth branch, if a common child node is encountered, the common child node can be incorporated into the corresponding fourth sub-branch, whereas if a common parent node is encountered, the common parent node is excluded from the corresponding fourth sub-branch.
  • the search process of method b of the second-level search is similar to that of method b of the first-level search, except that the second-level search does not start from a single node; instead, multiple non-confluence nodes (that is, nodes without child nodes) are taken as a group to perform the upward search, and these non-confluence nodes are also included in their respective sub-branches.
  • FIG. 15 is a schematic diagram of obtaining a parallelizable branch group according to a plurality of fourth branches according to an embodiment of the present application. The deep learning framework starts from nodes D, G, and F, which have no child nodes, and searches upward from each of them, obtaining the sub-branch (C→D), the sub-branch (G), and the sub-branch (F); the sub-branch (C→D), the sub-branch (G), and the sub-branch (F) constitute a parallelizable branch group.
  • the above manner of extracting a parallelizable branch group starting from nodes without a parent node or from nodes without child nodes can likewise ensure that nodes belonging to different sub-branches in the same parallelizable branch group do not have inconsistent dependency behaviors: for example, there is no common parent node in a sub-branch obtained by the upward search from a node without child nodes, and similarly there is no common child node in a sub-branch obtained by the downward search from a node without a parent node, which guarantees that the fusion of nodes across branches does not form a ring structure. Therefore, the above extraction manner of the one-way search of the non-convergence structure in the embodiments of the present application also ensures that no ring structure occurs, without the need to additionally perform a ring-forming judgment on each obtained combination, which simplifies the operation process.
  • similarly, the one-way search of the non-convergence structure may consist of only the search process of method a above, of only the search process of method b above, or of the search processes of method a and method b performed simultaneously, which is not specifically limited here. If the search processes of method a and method b are performed simultaneously, a node in the first computation graph belongs to the sub-branch that traverses it first, and the node is marked with an identification indicating that it has already been assigned to a sub-branch, so as to prevent repeated grouping in the subsequent search process.
  • the process of acyclic search of the scattered structure may also be referred to as the third-level search. Finding nodes that can be executed in parallel in the scattered structure incurs a higher search cost, and a ring-forming judgment is also required.
  • specifically, the deep learning framework determines, from the first computation graph, target nodes that have not been divided into any parallelizable branch group. A target node, which may also be called a sensitive node, refers to a node suitable for fusion and needs to meet the following conditions: 1) the computing power resources consumed by the computing task corresponding to the target node when executed on the device do not exceed a preset threshold; 2) the node is not a non-fusable node.
  • after selecting a target node, the deep learning framework takes the target node as the center and simplifies the local structures in its upstream and downstream networks to obtain a branch, which may be called a fifth branch. For example, a rhombus structure, in which the data flow diverges from one node, possibly through intermediate or multi-level branching, and finally converges into another node, can be reduced to a virtual node in the first computation graph, as shown in FIG. 16.
  • FIG. 16 is a schematic diagram of simplifying the local structure around a target node provided by an embodiment of the present application. The parts enclosed by dotted lines between node A and node B in the left part of FIG. 16 are rhombus structures, and the two rhombus structures are each simplified into the gray nodes (that is, virtual nodes) in the right part, where node A and node B are the target nodes. Similarly, other target nodes can be found accordingly and the local structures around them simplified, yielding a plurality of fifth branches, and a parallelizable branch group is obtained according to the obtained multiple fifth branches. FIG. 17 shows the other two fifth branches found from the first computation graph, where the target nodes of one fifth branch are nodes C and D and the target nodes of the other fifth branch are nodes E and F; finally, a parallelizable branch group can be obtained according to the three fifth branches.
  • the role of each virtual node in the multiple fifth branches is to assist in judging whether the target nodes belonging to different fifth branches can be fused. The real nodes corresponding to a virtual node may be non-fusable nodes, or nodes that have already been fused within another parallelizable branch group; therefore, for convenience of operation, a virtual node no longer participates in the judgment of whether it is a fusable node, and in the subsequent compilation process its real nodes are still compiled as ordinary nodes. Taking FIG. 16 and FIG. 17 as examples, the three obtained fifth branches include five virtual nodes in total, and the function of these five virtual nodes is to assist in judging whether the target nodes A to F, which belong to different fifth branches, can be fused.
  • both the first-level search and the second-level search are performed based on the direct connection relationships between nodes, so the respective parallelizable branch groups can be found directly through the upward or downward search. However, their stopping conditions are relatively strict, so after the first computational graph is processed according to the above search methods, some potentially parallel-executable nodes remain scattered in the first computation graph; these nodes may be referred to as scattered nodes.
  • FIG. 18 is a schematic diagram of a typical scattered structure including nodes that can be executed in parallel. It can be seen from FIG. 18 that the sub-branch (B) and the sub-branch (I) form one searched parallelizable branch group, and the sub-branch (H) and the sub-branch (O) form another searched parallelizable branch group, while the sub-branch (C→D→E→F→G) and the sub-branch (J→K→L→M→N) form a potentially parallelizable branch group that was not found. The reasons why this potentially parallelizable branch group is not found are as follows: nodes C and J are found in the downward search based on a common parent node, but because the relevant branch is an empty branch (there is no node between (B→D) or between (I→K), so it is an empty branch), nodes C and J are abandoned, and nodes G and N are abandoned in the same way; nodes D, K, F, and M are themselves target nodes (that is, sensitive nodes), but since they are not hit by the other search processes (that is, not found by the above-mentioned first-level or second-level searches), they are also discarded; and nodes E and L are blocked by the rhombus structures.
  • FIG. 19 is a schematic diagram of a specific example of the acyclic search of the scattered structure provided by an embodiment of the application, obtained on the basis of the scattered structure of FIG. 18. Node BI and node HO in part (a) of Figure 19 are, respectively, the fusion node of nodes B and I and the fusion node of nodes H and O in Figure 18; that is, nodes B and I in Figure 18 are fused to obtain fusion node BI, and nodes H and O are fused to obtain fusion node HO.
  • Step 1: the deep learning framework finds all the scattered target nodes in the first computation graph, such as nodes G, N, C, J, D, K, F, M, E, and L in Figure 18. Centering on each of these nodes, it searches upward/downward and simplifies any rhombus structure formed within a topological-order window of 100 (assuming the topological-order interval is set to 100: if the topological order of a divergence node is 150, the search stops at the node with topological order 50, and if the divergent paths can converge on the same common ancestor, a rhombus structure can be formed within the topological-order window of 100). In this way, nodes C and D are reduced to virtual node Q (fusion node BI is not connected only to nodes C and D, so it cannot itself be reduced into a virtual node); similarly, nodes J and K are reduced to virtual node R, nodes G and F are reduced to virtual node S, and nodes M and N are reduced to virtual node T, thus forming the new branches (Q→E→S) and (R→L→T). Step 2: after that, the deep learning framework takes one node (for example, the first node) from each new branch to perform the ring-forming judgment, and the branches that do not form a ring become the sub-branches of a parallelizable branch group.
  • Step 3: within the new branches containing virtual nodes, as shown in part (b) of Figure 19, the real nodes C and D represented by virtual node Q cannot be executed in parallel with each other, but they can be executed in parallel with the real nodes J and K represented by virtual node R; that is, nodes C and J can be executed in parallel, and nodes D and K can also be executed in parallel. Therefore, the virtual nodes on each sub-branch of the parallelizable branch group are restored to real nodes, which are sorted according to their local topological order to re-form a virtual branch structure, as shown in part (c) of FIG. 19. The computing logic represented by such a virtual branch is the same as that of the original real nodes, but its connection relationships are virtual and are used only for the parallelism analysis of potentially parallelizable branch groups. In this way, the potentially parallelizable branch group can be found, that is, the parallelizable branch group composed of the sub-branch (C→D→E→F→G) and the sub-branch (J→K→L→M→N). Step 4: repeat step 3 above until all the potentially parallelizable branch groups from step 2 have been processed.
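  • the mechanical core of the third-level search can be sketched as two helpers, again over the illustrative structures above (reachable is defined in an earlier sketch): contracting a rhombus structure into a virtual node, and the ring-forming judgment applied before fusing nodes taken from different new branches:

```python
def contract_rhombus(entry: Node, exit_: Node, inner: list[Node],
                     nodes: list[Node]) -> Node:
    # Reduce a rhombus that diverges at `entry` and converges at `exit_`
    # into one virtual node, assuming the inner nodes have no connections
    # outside the rhombus (otherwise, as with fusion node BI, it cannot
    # be reduced).
    for n in inner:
        nodes.remove(n)
    entry.children = [c for c in entry.children if c not in inner]
    exit_.parents = [p for p in exit_.parents if p not in inner]
    v = Node("virtual")
    link(entry, v)
    link(v, exit_)
    nodes.append(v)
    return v

def fusion_forms_ring(a: Node, b: Node) -> bool:
    # Merging a and b into one node closes a ring exactly when one of them
    # can already reach the other through the rest of the graph.
    return reachable(a, b) or reachable(b, a)
```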
  • it should be noted that any one or more of the above levels of search may be performed as needed: only the first-level search, only the second-level search, or only the third-level search, which is not limited here. Multiple levels of search may also be performed, for example at least two of the above three levels; when the deep learning framework performs at least two levels of search, the execution order of the levels is not limited. Taking the case where the deep learning framework performs all three levels of search as an example, their order is not limited: the first-level search may be performed first, then the second-level search, and finally the third-level search; or the second-level search may be performed first, then the first-level search, and finally the third-level search; or the hierarchical search may be performed in any other order, which will not be enumerated here. In addition, the above multi-level search process may be performed iteratively, and the termination condition of the iteration may be that no new parallelizable branch group is found, or that a certain iteration duration is reached; the condition for terminating the search is not limited here.
  • through the above process, the deep learning framework can obtain one or more parallelizable branch groups, and for each parallelizable branch group the deep learning framework can separately fuse multiple nodes from different sub-branches, thereby obtaining a second computational graph.
  • multiple nodes from different sub-branches in the target parallelizable branch group means that no two of the multiple nodes come from the same sub-branch. In any target parallelizable branch group, each sub-branch includes one or more nodes; nodes from different sub-branches can be combined into a fusion node, while nodes on the same sub-branch are not fused because of their sequential connection relationship. On each target parallelizable branch group, the fusion nodes can be obtained based on a search algorithm.
  • for example, as shown in FIG. 20, suppose a parallelizable branch group includes 3 sub-branches in total, of which the first sub-branch includes 4 nodes and the remaining two sub-branches each include 3 nodes. Nodes A, C, and F come from three different sub-branches, and these three nodes can be fused to obtain a fusion node; likewise, nodes B and D come from two different sub-branches, and these two nodes can also be fused to obtain a fusion node, which will not be further illustrated here. When multiple nodes from different sub-branches in the target parallelizable branch group are fused, one or more fusion nodes may be obtained, depending mainly on the fusion method.
  • it should be noted that some sub-branches in the one or more parallelizable branch groups obtained based on the above step 902 may still contain nodes that cannot be fused or that are not suitable for parallel execution; such nodes may be collectively referred to as non-fusable nodes. These nodes are retained when extracting the parallelizable branch groups in order to keep the branches intact, so that the branches are not interrupted by individual nodes during the search process; such interruptions would cause potentially parallel-executable nodes to be missed.
  • therefore, the deep learning framework may further eliminate the non-fusable nodes from each sub-branch in each parallelizable branch group (that is, each target parallelizable branch group), thereby obtaining the parallelizable branch groups with the non-fusable nodes removed (each of which may be referred to as a third parallelizable branch group); any sub-branch in a parallelizable branch group may be called a target sub-branch. Multiple nodes from different sub-branches in each parallelizable branch group with the non-fusable nodes removed (that is, each third parallelizable branch group) are then fused to obtain fusion nodes, and the fusion nodes together with the unfused nodes of the first computation graph constitute the second computation graph.
  • a non-fusable node may specifically be a node with an exclusive operation, for example a node for which the compilation suite used provides no kernel compilation solution after fusion, or a node with a specific operation, such as the matrix multiplication and convolution operations of a neural network; such operations are highly computationally intensive and already make full use of the computing power of the device on their own, so such nodes are also not suitable for parallel execution. One exception is when the data involved are particularly small, in which case the node can still be considered suitable for parallel execution.
  • the following example illustrates this: suppose a target parallelizable branch group obtained by the deep learning framework is as shown in sub-schematic (a) of Figure 21. Before the nodes are fused, each sub-branch of the target parallelizable branch group needs to be "cleaned" to remove the non-fusable nodes. Assuming that nodes H, I, J, and K are determined to be non-fusable nodes, as shown in sub-schematic (b) of Figure 21, nodes H, I, J, and K need to be filtered out; these filtered nodes will later be compiled as ordinary nodes, and the nodes remaining on each sub-branch are eligible for fusion. As shown in sub-schematic (c) of Figure 21, the remaining nodes A, B, C, D, F, and G are fusable nodes, and these fusable nodes participate in the specific fusion process in the subsequent steps.
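  • the "cleaning" step can be sketched as a simple filter; the non-fusable test is left as a pluggable predicate, since the concrete criteria (exclusive operations, compute-heavy matrix multiplication or convolution nodes, etc.) depend on the compilation suite:

```python
def clean_branch_group(group: list[list[Node]],
                       is_non_fusable) -> list[list[Node]]:
    # Drop non-fusable nodes from every sub-branch; they will be compiled
    # later as ordinary nodes. Sub-branches left empty are discarded.
    cleaned = [[n for n in branch if not is_non_fusable(n)]
               for branch in group]
    return [branch for branch in cleaned if branch]
# With H, I, J, K judged non-fusable as in Figure 21, the nodes
# A, B, C, D, F, G remain eligible for fusion.
```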
  • after obtaining each third parallelizable branch group with the non-fusable nodes removed, the deep learning framework fuses multiple nodes from different sub-branches in the third parallelizable branch group, and the process of obtaining a fusion node can be based on a combination search (which may also be called a parallel fusion search). The principle of the combination search is: in any third parallelizable branch group, each sub-branch includes one or more nodes; nodes from different sub-branches can be combined into a fusion node, while nodes on the same sub-branch cannot be fused because of their sequential connection relationship; on each third parallelizable branch group, the fusion nodes can be obtained based on the search algorithm. Specifically, the combination search process can be completed by two parts: one part may be called the candidate combination generation model, and the other part the computing power evaluation model. The candidate combination generation model is responsible for generating a series of candidate node combinations from the fusable nodes, and the computing power evaluation model is responsible for evaluating the computing power consumed after each candidate node combination is fused into one node; this evaluation may also be called benefit evaluation.
  • the specific search process may be as follows: first, select (eg, randomly select) an unfused node from each of the sub-branches in the third parallelizable branch group to obtain n unfused nodes, where an unfused node is a node that has not yet been fused; the candidate combination generation model then generates m candidate node combinations based on the n nodes, and the computing power evaluation model evaluates the computing power consumed by each of the m node combinations to obtain m evaluation results. When a first evaluation result satisfies a preset condition, the first node combination corresponding to the first evaluation result is fused to obtain a first fusion node, and each unfused node in the first node combination is marked as a fused node, where the first evaluation result is one of the m evaluation results and the first node combination is one of the m node combinations.
  • FIG. 22 shows the computing power usage on the device after three different nodes are compiled into kernels according to an embodiment of the application. It is assumed that a certain device on the acceleration hardware includes three computing units. As shown in FIG. 22, kernelA, kernelB, and kernelC are the kernels obtained by compiling the three nodes respectively, the kernels can be directly executed by the computing units of the device, and the size of each block represents the computing power each kernel consumes. It can be seen from Figure 22 that different kernels consume different computing resources, and in the process of executing the kernels in sequence, the resource utilization of the device is unstable and some computing resources (eg, 2 computing units in Figure 22) remain idle.
  • the purpose of constructing the computing power evaluation model in the embodiments of the present application is to select, from the candidate node combinations, target node combinations that can improve the computing power resource utilization of the device, so that multiple parallelizable nodes are fused into one node for execution, in order to improve the average resource utilization of the device and thereby the overall performance of the entire acceleration hardware.
  • it should be noted that the above process of fusing multiple nodes from different sub-branches in the third parallelizable branch group is performed iteratively; that is, the above step of fusing multiple nodes from different sub-branches in the third parallelizable branch group is repeated until fewer than 2 unfused nodes remain in the third parallelizable branch group.
  • Step 1: first, select a third parallelizable branch group that has not yet been processed by fusion from the parallelizable branch groups from which the non-fusable nodes have been eliminated.
  • Step 2: select (eg, randomly select) an unfused node from each sub-branch in the third parallelizable branch group to obtain n unfused nodes, where an unfused node is a node that has not yet been fused, and n ≥ 2. In some rounds, the number of obtained unfused nodes may instead equal the number of valid sub-branches, where a valid sub-branch is a sub-branch on which at least one unfused node remains before the nodes of the current round are selected; for example, if the number of currently valid sub-branches is n', a total of n' unfused nodes are obtained.
  • Step 3: generate a new candidate node combination (that is, one of the above m node combinations) from the n nodes through the candidate combination generation model.
  • Step 4: evaluate the computing power consumed by the new candidate combination through the computing power evaluation model to obtain a new evaluation result (that is, one of the above m evaluation results).
  • Step 5: determine whether the evaluation result satisfies the preset condition; steps 3 and 4 can be repeated until an evaluation result satisfies the preset condition, and the candidate node combination corresponding to the evaluation result that satisfies the preset condition is then selected as the target node combination (that is, the above-mentioned first node combination) and fused to obtain a fusion node (that is, the first fusion node). For example, suppose 7 unfused nodes are obtained through step 2, denoted nodes 0 to 6; after steps 3 to 5, 3 target node combinations are obtained, namely [0, 1, 5], [2, 4], and [3, 6], and fusing them accordingly yields three fusion nodes.
  • Step 6: repeat steps 2 to 5 until fewer than 2 unfused nodes remain in the third parallelizable branch group.
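  • steps 1 to 6 can be put together as the following sketch of one combination-search pass over a third parallelizable branch group; the evaluation model and the pass/fail test are passed in as functions, and node selection is simplified to "the first unfused node of each valid sub-branch" (the embodiment leaves the choice, e.g. random selection, open):

```python
def combination_search(group: list[list[Node]], evaluate, meets_requirement):
    fusion_nodes = []
    unfused = [list(branch) for branch in group]
    while sum(len(b) for b in unfused) >= 2:
        # Step 2: one unfused node from each valid (non-empty) sub-branch.
        candidate = [b[0] for b in unfused if b]
        # Steps 3-5: shrink the candidate combination one node at a time
        # until its evaluation result satisfies the preset condition.
        while len(candidate) >= 2 and not meets_requirement(evaluate(candidate)):
            candidate.pop()
        if len(candidate) < 2:
            break  # no fusable combination remains in this round
        fusion_nodes.append(candidate)  # fuse; mark the members as fused
        for b in unfused:
            for n in candidate:
                if n in b:
                    b.remove(n)
    return fusion_nodes
```

  • with a suitable evaluation model, such a loop can yield target node combinations like A+C, B+F, and D+G in the Figure 23 example discussed next.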
  • Figure 23 is used as an example for description below. Suppose an unprocessed third parallelizable branch group obtained by the deep learning framework is as shown in Figure 23: the third parallelizable branch group includes 3 sub-branches in total, each including 2 fusable nodes (the non-fusable nodes having been eliminated). First, one node is selected from each of these 3 sub-branches; for example, if nodes A, C, and F are selected in the current round, then nodes A, C, and F form the candidate node combination A+C+F, and the computing power consumed by the candidate node combination A+C+F is evaluated through the constructed computing power evaluation model to obtain an evaluation result, where the evaluation result is used to characterize the computing resources required by the node combination.
  • the deep learning framework then determines whether the evaluation result obtained in the current round meets the computing power requirements of the device that executes the specific computing task on the acceleration hardware. If it does, the candidate node combination A+C+F is fused to obtain a fusion node; since the fusion node is obtained from the candidate node combination A+C+F, the 3 fused nodes need to be marked as fused nodes in the original 3 sub-branches, so that nodes A, C, and F are avoided in the subsequent node search process.
  • if the evaluation result does not meet the computing power requirements of the device on the acceleration hardware, it means that nodes A, C, and F are not suitable for fusion together. In this case, nodes can be removed one at a time (for example, one node can be removed randomly) to obtain new candidate node combinations until a candidate node combination suitable for fusion is found. For example, if node A is removed first, a new candidate node combination C+F is obtained, the computing power consumed by the candidate node combination C+F is evaluated through the constructed computing power evaluation model to obtain an evaluation result, and the deep learning framework judges in a similar way whether this evaluation result meets the computing power requirements of the device on the acceleration hardware; if it still does not, another combination such as A+F can be tried. If, say, the evaluation result of the candidate node combination A+F meets the requirements, then A+F is a target node combination and is fused to obtain a fusion node, and the 2 fused nodes must likewise be marked as fused nodes in the original 3 sub-branches, so that nodes A and F are avoided in the subsequent node search process. At this point only one node, node C, is left, and node C continues to participate in the next round of the node search process as an unfused node; the above constitutes one round of the search process.
  • after the fusion node is obtained, the combination search continues according to the above search process until fewer than two nodes remain unfused. For example, suppose that after several rounds of searching in Figure 23, 3 target node combinations are obtained in total, namely A+C, B+F, and D+G, and all of the nodes have been fused.
  • Step 1: assuming that the number of sub-branches in a third parallelizable branch group is n and that the number of nodes extracted in the first round is n, the n nodes can be marked to obtain nodes 1 to n.
  • Step 3: repeat step 2 until fewer than 2 of the n nodes remain unfused.
  • in one approach, the construction principle of the computing power evaluation model is simply to consider whether the computing power resources consumed by the candidate node combination meet the computing power requirements of the device that executes the specific computing tasks on the acceleration hardware. Therefore, the calculation formula of the computing power evaluation model can be as shown in the following formula (1):
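  • a plausible form of formula (1), reconstructed here as an assumption from the parameter definitions below (the original expression is not reproduced in this text), is that the combined consumption of the selected nodes should reach the device's total computing power scaled by the coefficient q:

$$gain=\sum_{i=1}^{k} cal_{i}, \qquad gain \ \geq\ q \cdot D_{all} \tag{1}$$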
  • where gain is the evaluation result corresponding to the currently selected candidate node combination, D_all is the total computing resources on the device, k is the number of nodes selected to form the candidate node combination in the current round, i indexes the k nodes, and cal_i is the amount of computing power resources that node i needs to consume during execution; cal_i can be obtained by analyzing the data size, data type, etc. of the corresponding node i. This computing power evaluation does not consider the influence of other nodes and is related only to the characteristics of the node itself. q is a preset coefficient (for example, q can take any value between 0.7 and 1.5) that can be set according to user needs, which is not limited here.
  • if the evaluation result corresponding to a candidate node combination satisfies the above formula (1), it means that the evaluation result meets the computing power requirements of the device that executes the specific computing task on the acceleration hardware.
  • in another approach, the evaluation results are used to characterize the computing power resources required by the node combinations, and the candidate node combination corresponding to the best-performing evaluation result is selected from the obtained evaluation results (for example, from four evaluation results) for fusion to obtain a fusion node. For example, if the evaluation result corresponding to the candidate node combination A+C+F is optimal, the candidate node combination A+C+F is fused to obtain a fusion node; since the fusion node is obtained by fusing the candidate node combination A+C+F, the 3 fused nodes need to be marked as fused nodes in the original 3 sub-branches, so that nodes A, C, and F are avoided in the subsequent node search process. If instead the evaluation result corresponding to the candidate node combination A+F is optimal, the candidate node combination A+F is fused to obtain a fusion node, and the 2 fused nodes likewise need to be marked as fused nodes in the original 3 sub-branches, so that nodes A and F are avoided in the subsequent node search process; at this point only one node, node C, is left, and node C continues to participate in the next round of the node search process as an unfused node. The above constitutes one round of the search process.
  • after the fusion node is obtained, the combination search continues according to the above search process until fewer than two nodes remain unfused; the process is similar to approach (1) above and will not be repeated here.
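  • in another construction of the computing power evaluation model, nodes with similar computing power consumption are preferentially fused (see below). The expression of formula (2) is likewise not reproduced in this text; a form consistent with the descriptions that follow (a larger gain is better, and gain is maximal when the k selected nodes consume similar computing power), stated as an assumption, is:

$$gain=\frac{\sum_{i=1}^{k} cal_{i}}{k \cdot \max_{k}(cal_{i})} \tag{2}$$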
  • in formula (2), gain is the evaluation result corresponding to the currently selected candidate node combination; if m candidate node combinations are obtained, m corresponding evaluation results gain can be obtained according to formula (2). k is the number of nodes currently selected to form the candidate node combination, k ≤ n, where n is the total number of nodes selected in the current round; i denotes the i-th of the k nodes; cal_i is the amount of computing power resources that node i needs to consume during execution; and max_k(cal_i) is the computing resources consumed by the node with the largest computing resource consumption in the currently selected candidate node combination.
  • in this way, m evaluation results gain can be obtained, and the values of gain are then compared; the larger the value of gain, the better the evaluation result. Therefore, from the m obtained evaluation results, the candidate node combination corresponding to the evaluation result with the largest gain is selected for fusion to obtain a fusion node, and the remaining unfused nodes enter the next round of the search process, which proceeds similarly and will not be repeated here.
  • the fusion node is to preferentially fuse the nodes with similar computing power consumption required, and the best benefit is obtained in this case.
The calculation formula of the computing power evaluation model can also be expressed as a ratio, as shown in the following formula (3):
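The formula (3) image is not reproduced here either; since its parameters match formula (2) and a smaller value is better, a plausible reconstruction (an assumption) is:

    gain = max_k(cal_i) / Σ_{i=1}^{k} cal_i        (3)

so that a small ratio again favors combinations whose members have similar computing power consumption.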
Each parameter in formula (3) has the same meaning as in formula (2) above and is not repeated here. According to formula (3), m evaluation results gain are obtained and their values compared; the smaller the gain, the better the evaluation result. Therefore, the candidate node combination corresponding to the evaluation result with the smallest gain is selected from the m evaluation results and fused to obtain a fusion node, and the remaining unfused nodes enter the next round of the search; the process is similar to the above and is not repeated here.
Another possibility is to form candidate node combinations from all nodes of the third parallelizable branch group at once, obtain the evaluation results corresponding to all candidate node combinations according to the constructed computing power evaluation model, and then, based on these results, successively select the one or more candidate node combinations most suitable for fusion to obtain one or more fusion nodes. For ease of understanding, an example follows. Still referring to FIG. 23, suppose an unprocessed third parallelizable branch group obtained by the deep learning framework is as shown in FIG. 23; the group includes 3 sub-branches, each sub-branch includes 2 fusible nodes (non-fusible nodes have already been removed), so the group includes 6 nodes in total, namely nodes A, B, C, D, F, and G. Suppose the combination A+C is first fused into a fusion node; the fused nodes A and C are marked as fused nodes, and accordingly the candidate node combinations containing node A or node C are excluded from the remaining 19 candidate node combinations. After the exclusion, 7 candidate node combinations remain, namely B+D+F, B+D+G, B+F, B+D, B+G, D+F, and D+G; the candidate node combination corresponding to the best-performing of the seven corresponding evaluation results is then selected for fusion to obtain the second fusion node. The computing power evaluation model constructed here is similar to formulas (2) and (3) above and is not repeated. In this variant, an evaluation result being optimal means optimal among x evaluation results, where the x evaluation results are at least two of all m evaluation results.
FIG. 23 is again used as an example. Suppose an unprocessed third parallelizable branch group obtained by the deep learning framework is as shown in FIG. 23; the group includes 3 sub-branches, and each sub-branch includes 2 fusible nodes (the non-fusible nodes have been removed). Suppose the computing power evaluation model evaluates the computing power consumed by the candidate node combinations A+C+F and A+C respectively, yielding two evaluation results; the better-performing of the two is selected, and the corresponding candidate node combination is fused to obtain a fusion node. If the evaluation result corresponding to A+C+F is optimal, A+C+F is fused into a fusion node; because this fusion node is obtained by merging candidate nodes A, C, and F, the three nodes must be marked as fused nodes in the original three sub-branches so that nodes A, C, and F are skipped in the subsequent node search. If the evaluation result corresponding to A+C is optimal, A+C is fused into a fusion node, and these two nodes in their respective sub-branches must likewise be marked as fused nodes so that nodes A and C are skipped in the subsequent node search; at that point only node F remains, and node F continues to participate in the next round of the search as an unfused node. The above constitutes one round of the search. The combination search then continues following this procedure until fewer than two nodes remain unfused; the process is similar to method (2) above and is not repeated here.
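To make the round structure above concrete, the following host-side C++ sketch implements one round of the combination search, scoring candidate combinations with the reconstructed formula (2). The data layout (one vector of nodes per sub-branch), the Gain helper, and the exhaustive bitmask enumeration (suitable only for small n) are illustrative assumptions; the patent does not prescribe an implementation.

    #include <algorithm>
    #include <vector>

    // One fusible node: its per-execution compute cost and its fusion state.
    struct Node { double cal; bool fused = false; };
    // A third parallelizable branch group: one vector of nodes per sub-branch.
    using BranchGroup = std::vector<std::vector<Node>>;

    // Reconstructed formula (2): resources saved = sum - max; larger is better.
    double Gain(const std::vector<Node*>& combo) {
      double sum = 0.0, mx = 0.0;
      for (const Node* n : combo) { sum += n->cal; mx = std::max(mx, n->cal); }
      return sum - mx;
    }

    // One search round: pick one unfused node per sub-branch, enumerate all
    // combinations of at least two of them, and fuse the best-scoring one.
    bool FuseOneRound(BranchGroup& group) {
      std::vector<Node*> picks;
      for (auto& branch : group)
        for (auto& node : branch)
          if (!node.fused) { picks.push_back(&node); break; }  // one per branch
      const int n = static_cast<int>(picks.size());
      if (n < 2) return false;  // fewer than two unfused nodes: stop searching

      double best = -1.0;
      unsigned best_mask = 0;
      for (unsigned mask = 1; mask < (1u << n); ++mask) {
        if (__builtin_popcount(mask) < 2) continue;  // combos need >= 2 nodes
        std::vector<Node*> combo;
        for (int i = 0; i < n; ++i)
          if (mask & (1u << i)) combo.push_back(picks[i]);
        double g = Gain(combo);
        if (g > best) { best = g; best_mask = mask; }
      }
      for (int i = 0; i < n; ++i)  // mark the winning combination as fused
        if (best_mask & (1u << i)) picks[i]->fused = true;
      return true;
    }

Calling FuseOneRound repeatedly until it returns false mirrors the stopping condition above, namely that the search ends when fewer than two nodes remain unfused.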
After the above processing, a computation graph containing fusion nodes is obtained; this computation graph can be called the second computation graph. After obtaining the second computation graph, the deep learning framework can further compile it to obtain a compiled second computation graph, which can be loaded directly onto the acceleration hardware for execution. As shown in FIG. 8, the compilation of the second computation graph includes two parts. One part is ordinary node compilation, that is, the compilation of unfused nodes (including non-fusible nodes and fusible nodes that were not fused); this part is similar to the existing method, see the embodiment corresponding to FIG. 4, and is not repeated here. The other part is fusion node compilation, that is, the compilation performed on the fusion nodes to obtain the kernel corresponding to each fusion node.
Segment merging divides IR generation into two stages: 1) each node is scheduled independently, producing, in the first stage, as many sub-IRs as there are nodes; 2) these sub-IRs are then fused and corrected to obtain the total IR of the second stage. Specifically, assuming a fusion node is obtained by fusing p nodes, the kernel corresponding to the fusion node can be compiled as follows: first, the p nodes are scheduled independently to obtain p sub-IRs; the p sub-IRs are then fused into one total IR; finally, the total IR is compiled to obtain the kernel corresponding to the fusion node.
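As a toy illustration of these two stages, the sketch below represents each stage-one sub-IR simply as its code text plus the block count its independent schedule decided to use, and fuses the sub-IRs by assigning each one a disjoint blockIdx.x range; the SubIR type, the string-based "IR", and the range-based branching are illustrative assumptions rather than the framework's real IR machinery.

    #include <string>
    #include <vector>

    // Stage-1 output for one pre-fusion node: its sub-IR text and the
    // number of blocks its independent schedule decided to use.
    struct SubIR {
      std::string body;
      int blocks;
    };

    // Stage 2: fuse the sub-IRs into one total IR. Each segment is guarded
    // by a branch on blockIdx.x, so the original computation logic of each
    // node keeps its own, disjoint slice of the computing resources.
    std::string FuseSubIRs(const std::vector<SubIR>& subs) {
      std::string total;
      int offset = 0;
      for (const SubIR& s : subs) {
        total += "if (blockIdx.x >= " + std::to_string(offset) +
                 " && blockIdx.x < " + std::to_string(offset + s.blocks) +
                 ") {\n" + s.body + "}\n";
        offset += s.blocks;  // fused grid size = sum of per-node block counts
      }
      return total;  // compiling this total IR yields the fusion node's kernel
    }

This matches the Sqrt+Divide example of FIG. 26 discussed below, where two sub-IRs scheduled at 4 blocks each are fused into a total IR that allocates 8 blocks.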
FIG. 24 is a schematic comparison between ordinary node compilation and fusion node compilation provided by an embodiment of the present application; assume that nodes A, B, and C are all fusible nodes. If nodes A, B, and C are not fused, the compiler (which may also be called an encoder) performs ordinary node compilation on them: as shown in the left part of FIG. 24, the compiler schedules nodes A, B, and C separately, obtains the intermediate representations IR1, IR2, and IR3 corresponding to the three nodes, and then further compiles IR1, IR2, and IR3 to obtain kernel1, kernel2, and kernel3 corresponding to nodes A, B, and C. The three kernels can then be loaded in turn directly onto the device in the corresponding acceleration hardware, and the computing units on the device (e.g., SMs, cores) execute the corresponding kernels in turn. If nodes A, B, and C are fused, the compiler performs fusion node compilation on them: as shown in the right part of FIG. 24, the compiler schedules nodes A, B, and C separately to obtain the corresponding intermediate representations IR1, IR2, and IR3, then fuses IR1, IR2, and IR3 into the total intermediate representation IR(A+B+C), and finally compiles IR(A+B+C) to obtain one operator kernel, kernel(A+B+C). The nodes being fused are thus not scheduled as a single unit; instead, each is analyzed separately and the results are merged afterwards. In this way the kernel logic corresponding to each node remains relatively independent and forms an independent code segment, which expresses its parallelism intuitively. In addition, code branching is applied when compiling the total IR to isolate the computing resources of each node's original computation logic (that is, each pre-fusion node corresponds to one or more specific computing units), which makes the parallelism of the code segments more explicit within the kernel.
A specific code example illustrates the above segment merging process. As shown in FIG. 25, C is the branch condition, A is the Cast node code segment, and B is the Sub node code segment. C sets the resource condition blockIdx.x < 1: when the resource number (i.e., the identification information) blockIdx.x satisfies the condition, the Cast computation in code segment A is executed; when it does not, the Sub computation in code segment B is executed. Therefore, given the parallel-computing characteristics of acceleration hardware (e.g., a GPU), code segments A and B can be executed in parallel. The example in FIG. 25 is the CUDA code of this kernel, in which input_2 and input_0 are the input data addresses of the code, and subtract_2 and cast_0 are the output data addresses. Different computing resources are distinguished by blockIdx.x and threadIdx.x, and data indices are obtained by computation; for example, in code segment B, ((int)blockIdx.x - 1) * 1024 + (int)threadIdx.x determines which element of input_2 the current subtraction operation needs to use.
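FIG. 25 itself is not reproduced in this text. The following is a minimal sketch, in CUDA C++, of what such a fused kernel could look like, assuming the Cast segment is given block 0 (1024 elements, float to half) and the Sub segment the remaining four blocks of 1024 threads; the function name, grid split, element counts, and cast data types are illustrative assumptions, not the patent's exact code.

    #include <cuda_fp16.h>

    // Hypothetical reconstruction of the fused kernel described for FIG. 25.
    // The branch condition C isolates the resources of the two original
    // nodes: block 0 runs the Cast segment (A), the remaining blocks run
    // the Sub segment (B), so the two segments can execute in parallel.
    extern "C" __global__ void fused_cast_subtract(
        const float* input_0,    // input of the Cast node
        const float* input_1,    // second operand of the Sub node
        const float* input_2,    // first operand of the Sub node
        half* cast_0,            // output address of the Cast segment
        float* subtract_2) {     // output address of the Sub segment
      if ((int)blockIdx.x < 1) {
        // Code segment A: Cast, executed only when blockIdx.x satisfies C.
        int idx = (int)threadIdx.x;
        cast_0[idx] = __float2half(input_0[idx]);
      } else {
        // Code segment B: Sub; the index formula from the description picks
        // which element of input_2 the current subtraction uses.
        int idx = ((int)blockIdx.x - 1) * 1024 + (int)threadIdx.x;
        subtract_2[idx] = input_2[idx] - input_1[idx];
      }
    }

Launched as fused_cast_subtract<<<5, 1024>>>(...), block 0 would perform the cast while blocks 1 to 4 perform the subtraction concurrently, which is exactly the parallelism that segment merging is designed to preserve.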
In practice, the compiler in the deep learning framework performs the scheduling analysis of each fusion node, and the corresponding executable code is then obtained through hardware instruction emission. For example, fusion node compilation can be performed with AKG, adding a segment merging step to the AKG compilation flow; that is, AKG's original per-node scheduling analysis is turned into segment merging for each fusion node. Specifically, the first stage of segment merging is performed before hardware instruction emission: AKG performs scheduling analysis on each pre-fusion node to obtain each node's computing-resource usage after scheduling (i.e., the computing resources consumed during actual execution, such as the block size on a GPU) together with the corresponding sub-IR. Sub-diagrams (a) and (b) of FIG. 26 are the sub-IRs of the Sqrt node and the Divide node before fusion, respectively. The parts of the sub-IRs are then integrated by adding a code branch; the integrated code, shown in the shaded part of sub-diagram (c) in FIG. 26, is the total IR of the fusion node. AKG then continues the remaining compilation flow to generate the kernel code corresponding to the final fusion node (Sqrt+Divide). In FIG. 26: sub-diagram (a) expresses that the sqrt operation is performed on input_2 and the result is written into T_sqrt, with 4 blocks to be allocated during execution and each block divided into 256 threads; sub-diagram (b) expresses that input_0/input_1 is computed and the result is written into T_divide_input_0_input_1, likewise with 4 blocks of 256 threads (so that the combined total in (c) is 8 blocks); sub-diagram (c) expresses that after the sub-IRs of (a) and (b) are combined, 8 blocks in total need to be allocated, and each block is divided into 256 threads.
The node fusion method for computation graphs provided by the embodiments of the present application can be summarized in the schematic flowchart shown in FIG. 27, which includes three main steps. First, extract parallelizable branch groups: each parallelizable branch group has multiple sub-branches, there is no connection relationship between the sub-branches, and fusing any two or more nodes that come from different sub-branches does not produce a ring structure in the computation graph. The parallelizable branch groups are searched purely on the basis of the connection relationships between the nodes in the computation graph, and the search is not interrupted by individual non-fusible nodes, which improves the search hit rate for fusible nodes in the neural network. Second, search node combinations: each sub-branch contains one or more nodes; nodes on different sub-branches can be combined into a fusion node, while nodes on the same sub-branch cannot be fused because of their sequential connection. One or more fusion nodes can be obtained by the combination search described in the embodiment corresponding to step 902 above, and the constructed computing power evaluation model further guides the search for fusion nodes, ensuring that the fused node combinations bring the expected benefit. Third, generate fusion nodes by segment merging: each node is scheduled and analyzed independently, the code branch of each node is obtained, and the fusion node is built from these code branches. The advantage of segment merging is that scheduling each node independently increases the flexibility of fusion and guarantees correct generation of the fusion node code, while the generated kernels retain their parallelism, allowing the corresponding acceleration hardware to perform deeper parallel optimization.
FIG. 28 is a schematic diagram, provided by the embodiments of the present application, of the benefit obtained when the kernels before and after node fusion are executed on the device. The abscissa in FIG. 28 represents execution time t. Suppose the computation graph contains 3 fusible nodes. If they are compiled as ordinary nodes, they are compiled into kernelA, kernelB, and kernelC respectively, and according to the original topological ordering, kernelA, kernelB, and kernelC must be executed on the device in turn; cost a, cost b, and cost c denote the respective execution times of kernelA, kernelB, and kernelC on the device. If the three fusible nodes are instead fused according to the node fusion method described in the embodiments of the present application, a fusion node is obtained and compiled into kernel(A+B+C), whose execution time is approximately the largest of cost a, cost b, and cost c. The benefits of the node fusion method come mainly from two parts: 1) the number of kernel loads on the device is reduced, the reduction being equal to the decrease in the total number of nodes in the computation graph after fusion; 2) because nodes A, B, and C each occupy only part of the device's computing units, the total computing power resources allow A, B, and C to be fused (under a more complex and detailed analysis, even if the total resources were insufficient, executing A, B, and C in parallel may still be better than executing them separately); at the same time, since A, B, and C retain their parallelism within the fused kernel(A+B+C), the execution consumption is approximately equal to the maximum time consumption among the three, so the benefit of the fusion node is as shown in formula (4):
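The formula (4) image is not reproduced here; from the sentence above, a plausible reconstruction (an assumption) is:

    benefit = (cost a + cost b + cost c) − max(cost a, cost b, cost c)        (4)

in addition to the overhead saved by loading one kernel instead of three.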
As measured with the node fusion method for computation graphs provided by the embodiments of the present application, running Bert-Base with a batch size of 32 and the Lamb optimizer, a total of 1425 groups of fusion nodes can be found; compared with the performance without parallel fusion, the benefit of one iteration is about 23.87 ms, an optimization ratio of about 6%. In summary, the ordinary nodes scattered in the computation graph are integrated into fusion nodes (which can be regarded as integrating small nodes into large ones), which improves the utilization of device resources when the acceleration hardware runs, reduces the number of kernel loads, and improves the overall execution performance of the network.
FIG. 29 is a schematic structural diagram of a deep learning framework provided by an embodiment of the application. The deep learning framework 2900 includes a conversion module 2901, a search module 2902, and a fusion module 2903. The conversion module 2901 is configured to convert a neural network into a computation graph, called the first computation graph, which represents the computation logic of the neural network. The search module 2902 is configured to extract one or more parallelizable branch groups from the first computation graph based on the connection relationships between the nodes in the first computation graph, where a parallelizable branch group indicates that its multiple sub-branches support being executed in parallel, and a first parallelizable branch group included in the parallelizable branch groups satisfies at least one of the following conditions: the inputs of all sub-branches in the first parallelizable branch group come from the same node and the outputs of at least two sub-branches in the group point to different nodes; the outputs of all sub-branches in the group point to the same node and the inputs of at least two sub-branches in the group come from different nodes; the first node of every sub-branch in the group has no parent node; or the last node of every sub-branch in the group has no child node.
The fusion module 2903 is configured to fuse, for each parallelizable branch group, multiple nodes coming from different sub-branches of the group to obtain a second computation graph; that is, the sub-branch to which each of the multiple nodes belongs differs from the sub-branch to which any other of the multiple nodes belongs. Optionally, when there are multiple parallelizable branch groups, they further include at least one second parallelizable branch group that satisfies the following condition: the inputs of all sub-branches in the second parallelizable branch group come from the same node and the outputs of at least two sub-branches in the group point to the same node, and each sub-branch in the second parallelizable branch group includes at least two nodes.
In this way, the parallelizable branch groups obtained on the basis of the connection relationships between the nodes in the first computation graph may include second parallelizable branch groups as described above, which gives the approach wide applicability. In a possible design, the fusion module 2903 is further configured to remove the non-fusible nodes from each sub-branch of each parallelizable branch group (i.e., each target parallelizable branch group) to obtain the corresponding parallelizable branch groups with the non-fusible nodes removed (each of which may be called a third parallelizable branch group); any sub-branch of a target parallelizable branch group may be called a target sub-branch. The second computation graph includes the fusion nodes and the unfused nodes of the first computation graph (unfused nodes are nodes that have not been fused), and the sub-branch to which each of the multiple nodes in a third parallelizable branch group belongs differs from the sub-branch to which any other of those nodes belongs. This describes the specific way of obtaining the second computation graph: the non-fusible nodes are first removed from the first computation graph, and the remaining fusible nodes are then fused to obtain the second computation graph; because the non-fusible nodes are removed in advance, fusion efficiency can be improved.
Optionally, the deep learning framework 2900 in this embodiment may further include an iteration module 2904, configured to trigger the fusion module 2903 to repeat the step of fusing multiple nodes in the third parallelizable branch group to obtain fusion nodes until the number of unfused nodes in the third parallelizable branch group is less than two. The iteration module 2904 ensures that as many fusible nodes as possible are fused into fusion nodes, improving fusion coverage.
In a possible design, the fusion module 2903 is specifically configured to: select (for example, randomly select) one unfused node from each of the multiple sub-branches in the third parallelizable branch group to obtain n unfused nodes, where an unfused node is a node that has not been fused and n ≥ 2; generate m node combinations from the selected n unfused nodes, where each of the m node combinations includes at least two unfused nodes, m ≥ 1 and 2m ≤ n; and evaluate the computing power required by the m node combinations through the constructed computing power evaluation model to obtain m evaluation results, each of which characterizes one of the following: the computing power resources consumed by each of the m node combinations, or the computing power resources saved by each of the m node combinations. When a first evaluation result satisfies a preset condition, the first node combination corresponding to the first evaluation result is fused to obtain a first fusion node, and each unfused node in the first node combination is marked as a fused node; the first evaluation result is one of the m evaluation results, and the first node combination is one of the m node combinations. In this way the fusion module 2903 fuses nodes on the basis of a combination search to obtain fusion nodes, and the constructed computing power evaluation model further guides the search for fusion nodes, ensuring that the fused node combinations bring the expected benefit.
Optionally, the first evaluation result satisfying the preset condition includes at least one of the following situations: when each of the m evaluation results characterizes the computing power resources consumed by each of the m node combinations, the first evaluation result meets the computing power requirement of the module (device) that executes the computing tasks on the target acceleration hardware; when each of the m evaluation results characterizes the computing power resources saved by each of the m node combinations, the first evaluation result is optimal among the m evaluation results; or, when each of the m evaluation results characterizes the computing power resources saved by each of the m node combinations, the first evaluation result is optimal among x evaluation results, where the x evaluation results are at least two of the m evaluation results. Whether the first evaluation result satisfies the preset condition can thus be judged in several specific ways, which users can choose according to their own needs, providing flexibility.
In a possible design, the search module 2902 is specifically configured to: search the first computation graph for multiple branches (which may be called first branches) that have the same parent node (which may be called a first common parent node or a common first parent node; for ease of description, collectively referred to as the first common parent node in the embodiments of this application), and obtain a parallelizable branch group from the multiple first branches; and/or search the first computation graph for multiple branches (which may be called second branches) that have the same child node (which may be called a first common child node or a common first child node; collectively referred to as the first common child node in the embodiments of this application), and obtain a parallelizable branch group from the multiple second branches. This search method may also be called one-way search over multi-fork structures. It extracts one or more parallelizable branch groups from the first computation graph based on the connection relationships between its nodes, proceeding from whether nodes share a common parent or a common child; since common parent nodes and common child nodes are ubiquitous in computation graphs converted from neural networks, this search method can find a large number of parallelizable branch groups.
In a possible design, the search module 2902 is further configured to: taking the first common parent node as the starting point, search downward along each first branch, stopping when a common second parent node or a common second child node is encountered during the downward search, to obtain the parallelizable branch group corresponding to the multiple first branches; the group includes the first sub-branches corresponding to the multiple first branches, and the nodes contained in each first sub-branch are the nodes traversed while searching down that first branch. None of the first sub-branches includes the first common parent node or a common child node; that is, each first sub-branch excludes the first common parent node that serves as the starting point, and during the downward search along each first branch, if another common parent node (i.e., a second common parent node) is encountered it is incorporated into the corresponding first sub-branch, whereas if a common child node is encountered it is excluded from the corresponding first sub-branch. The deep learning framework thus obtains a parallelizable branch group from the multiple first branches by searching downward from a common parent node. Extracting parallelizable branch groups starting from a common parent node guarantees that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior, which in turn guarantees that fusing nodes across branches cannot form a ring structure. Existing node fusion approaches must judge, after each fusion, whether a ring has been formed, and that judgment is a very cumbersome operation; the search method of this embodiment guarantees that no ring structure appears, so no additional ring check is needed for each obtained combination, which simplifies the procedure.
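For illustration, the following sketch implements this downward search from a common parent node in the same style as the earlier sketches; the adjacency representation, the IsCommonParent/IsCommonChild helpers (a node with more than one child or more than one parent, respectively), and the function names are illustrative assumptions.

    #include <vector>

    // Graph as adjacency lists; children[v] / parents[v] list node ids.
    struct Graph {
      std::vector<std::vector<int>> children, parents;
      bool IsCommonParent(int v) const { return children[v].size() > 1; }
      bool IsCommonChild(int v) const { return parents[v].size() > 1; }
    };

    // Starting from a first common parent `root`, walk down each first
    // branch; another common parent is incorporated into the sub-branch
    // (and stops the walk), while a common child stops it and is excluded.
    std::vector<std::vector<int>> ExtractGroupFromCommonParent(
        const Graph& g, int root) {
      std::vector<std::vector<int>> group;   // one sub-branch per first branch
      for (int start : g.children[root]) {
        std::vector<int> sub;                // excludes `root` itself
        int cur = start;
        while (true) {
          if (g.IsCommonChild(cur)) break;   // exclude common child, stop
          sub.push_back(cur);                // second common parents are kept
          if (g.IsCommonParent(cur)) break;  // ...but also stop the walk
          if (g.children[cur].empty()) break;  // end of the branch
          cur = g.children[cur][0];          // series structure: single child
        }
        if (!sub.empty()) group.push_back(sub);
      }
      return group;
    }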
In a possible design, the search module 2902 is further configured to: taking the first common child node as the starting point, search upward along each second branch, stopping when a common third parent node or a common third child node is encountered during the upward search, to obtain the parallelizable branch group corresponding to the multiple second branches; the group includes the second sub-branches corresponding to the multiple second branches, and the nodes contained in each second sub-branch are the nodes traversed while searching up that second branch. None of the second sub-branches includes the first common child node or a common parent node; that is, each second sub-branch excludes the first common child node that serves as the starting point. The deep learning framework thus obtains a parallelizable branch group from the multiple second branches by searching upward from a common child node. Extracting parallelizable branch groups starting from a common child node likewise guarantees that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior, so fusing nodes across branches cannot form a ring structure. The search method of this embodiment therefore also guarantees that no ring structure appears, and no additional ring check is needed for each obtained combination, which simplifies the procedure.
In a possible design, the search module 2902 is further configured to: search the first computation graph for multiple third branches and obtain a parallelizable branch group from them, where the first node of each third branch has no parent node; and/or search the first computation graph for multiple fourth branches whose last node has no child node and obtain a parallelizable branch group from the multiple fourth branches. This describes a search method that extracts one or more parallelizable branch groups from the first computation graph based on the connection relationships between its nodes, proceeding from nodes without a parent node or nodes without a child node; such nodes are also widespread in computation graphs converted from neural networks, so in addition to the above search based on common parent or common child nodes, this method can still find a large number of parallelizable branch groups. It may also be called one-way search over non-converging structures.
In a possible design, the search module 2902 is further configured to: taking the first node of each third branch as the starting point, search downward along each third branch, stopping when a common parent node or a common child node is encountered during the downward search, to obtain the parallelizable branch group corresponding to the multiple third branches; the group includes the third sub-branches corresponding to the multiple third branches, and the nodes contained in each third sub-branch are the nodes traversed while searching down that third branch. None of the third sub-branches includes a common child node encountered during the downward search; that is, each third sub-branch may include the node serving as its starting point, and during the downward search along each third branch, if a common parent node is encountered it may be incorporated into the corresponding third sub-branch, whereas if a common child node is encountered it is excluded from the corresponding third sub-branch. The deep learning framework thus obtains a parallelizable branch group from the multiple third branches by searching downward from nodes without a parent node. This way of extracting parallelizable branch groups also guarantees that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior, which guarantees that fusing nodes across branches cannot form a ring structure; the search method of this embodiment therefore also guarantees that no ring structure appears, and no additional ring check is needed for each obtained combination, which simplifies the procedure.
In a possible design, the search module 2902 is further configured to: taking the last node of each fourth branch as the starting point, search upward along each fourth branch, stopping when a common parent node or a common child node is encountered during the upward search, to obtain the parallelizable branch group corresponding to the multiple fourth branches; the group includes the fourth sub-branches corresponding to the multiple fourth branches, and the nodes contained in each fourth sub-branch are the nodes traversed while searching up that fourth branch. None of the fourth sub-branches includes a common parent node encountered during the upward search; that is, each fourth sub-branch may include the node serving as its starting point, and during the upward search along each fourth branch, if a common child node is encountered it may be incorporated into the corresponding fourth sub-branch, whereas if a common parent node is encountered it is excluded from the corresponding fourth sub-branch. The deep learning framework thus obtains a parallelizable branch group from the multiple fourth branches by searching upward from nodes without a child node. This way of extracting parallelizable branch groups also guarantees that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior; for example, no common parent node exists within the sub-branches obtained by searching upward from nodes without a child node, so fusing nodes across branches cannot form a ring structure. The search method of this embodiment therefore also guarantees that no ring structure appears, and no additional ring check is needed for each obtained combination, which simplifies the procedure.
In a possible design, the search module 2902 is further configured to: when a target node does not belong to the non-fusible nodes, simplify the local structure around the target node to obtain a fifth branch, where the target node is any node of the first computation graph that has not been assigned to any parallelizable branch group; and, when there are multiple fifth branches, obtain a parallelizable branch group from the multiple fifth branches. This describes a search method that extracts one or more parallelizable branch groups from the first computation graph based on the connection relationships between its nodes, operating on scattered structures; as a supplement to the two methods above, it can find fusible nodes from the first computation graph to the greatest extent, making the search more extensive.
In a possible design, the deep learning framework 2900 further includes a compilation module 2905, configured to compile the fusion nodes in the second computation graph to obtain the operator kernels corresponding to the fusion nodes. The compilation of the second computation graph by the compilation module 2905 thus also includes the compilation of the fusion nodes, which is realizable.
In a possible design, the compilation module 2905 is further configured to: when the fusion node is obtained by fusing p nodes, schedule the p nodes separately to obtain the p sub-intermediate representations (sub-IRs) corresponding to the p nodes; fuse the p sub-IRs to obtain one total IR; and finally compile the total IR to obtain the operator kernel corresponding to the fusion node.
The deep learning framework 2900 may be an AI framework in any of several specific forms, for example a mainstream deep learning framework such as MindSpore, TensorFlow, TensorNetwork, PyTorch, MXNet, Caffe, or Theano, or another niche deep learning framework; as long as a deep learning framework can optimize and compile computation graphs, it can be regarded as the deep learning framework 2900 of this embodiment of the present application, and this application does not limit the specific form of the deep learning framework.
FIG. 30 is a schematic structural diagram of the computer device provided by the embodiment of the present application. For convenience of description, only the parts related to the embodiment of the present application are shown; for specific technical details not disclosed, refer to the method part of the embodiments of the present application. The modules described in the embodiment corresponding to FIG. 29 may be deployed on the computer device 3000 to implement the functions of the deep learning framework in the embodiment corresponding to FIG. 29.
The computer device 3000 is implemented by one or more servers and may vary greatly with configuration or performance; it may include one or more central processing units (CPUs) 3022, memory 3032, and one or more storage media 3030 (e.g., one or more mass storage devices) storing application programs 3042 or data 3044. The memory 3032 and the storage medium 3030 may be short-term storage or persistent storage. The program stored in the storage medium 3030 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations on the computer device 3000. Further, the central processing unit 3022 may be configured to communicate with the storage medium 3030 and to execute, on the computer device 3000, the series of instruction operations in the storage medium 3030.
The computer device 3000 may also include one or more power supplies 3026, one or more wired or wireless network interfaces 3050, one or more input/output interfaces 3058, and/or one or more operating systems 3041, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), and FreeBSD(TM).
In this embodiment of the present application, the central processing unit 3022 is configured to execute the method in the embodiment corresponding to FIG. 9. Specifically, the central processing unit 3022 can be used to: obtain the network structure of a neural network and convert the neural network into a computation graph, called the first computation graph; after obtaining the first computation graph, extract one or more parallelizable branch groups from it based on the dependencies between its nodes, where each parallelizable branch group includes multiple sub-branches (i.e., at least 2), a sub-branch is a sequential series structure without forks, and each sub-branch in each parallelizable branch group includes one or more nodes; a parallelizable branch group indicates that its multiple sub-branches support being executed in parallel. It should be noted that the sub-branches belonging to the same parallelizable branch group must satisfy two conditions: first, there is no dependency between the sub-branches; second, fusing any two or more nodes belonging to different sub-branches into one node leaves no ring structure in the computation graph, i.e., the fused computation graph contains no ring. At least one parallelizable branch group among them (which may be called the first parallelizable branch group) satisfies at least one of the following conditions: the inputs of all sub-branches in the first parallelizable branch group come from the same node and the outputs of at least two sub-branches in the group point to different nodes; the outputs of all sub-branches in the group point to the same node and the inputs of at least two sub-branches in the group come from different nodes; the first node of every sub-branch in the group has no parent node; or the last node of every sub-branch in the group has no child node. One or more parallelizable branch groups are thus obtained, and for each parallelizable branch group, multiple nodes coming from different sub-branches are fused to obtain the second computation graph. The central processing unit 3022 can also be used to execute any step of the method embodiment corresponding to FIG. 9 in the present application; for specific content, refer to the description in the method embodiments described earlier, which is not repeated here.
An embodiment of the present application further provides a computer-readable storage medium that stores a program for signal processing; when the program runs on a computer, it causes the computer to execute the steps performed by the computer device as described in the foregoing embodiments.
The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. In the drawings of the device embodiments provided by the present application, the connection relationships between modules indicate that they have communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
Such a software product can be stored in a readable storage medium, such as a USB flash drive (U disk), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and includes several instructions for instructing a computer device (which may be a personal computer, training equipment, network equipment, etc.) to execute the methods described in the various embodiments of the present application.
The computer program product includes one or more computer instructions. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., high-density digital video discs (DVDs)), or semiconductor media (e.g., solid state drives (SSDs)), etc.


Abstract

The embodiments of the present application disclose a node fusion method and device for computational graphs, applicable to the field of artificial intelligence and specifically to deep learning frameworks. The method includes: converting a neural network into a computational graph, and extracting one or more parallelizable branch groups from the computational graph based on the dependency relationships between its nodes, where the dependency relationships indicate at least one of the following: the parallelizable branch group has a common parent node, the parallelizable branch group has a common child node, the parallelizable branch group has no parent node, or the parallelizable branch group has no child node; and finally fusing multiple nodes that come from different sub-branches of any one parallelizable branch group to obtain a new computational graph. This application considers the possibility of other parallelizable branches, finding combinations of parallelizable branches beyond those delimited by prior-art rules, so that the nodes in these branch combinations can be fused during node fusion of the computational graph, thereby extending the range of obtainable fusible nodes.

Description

Node fusion method and device for computational graphs
This application claims priority to Chinese Patent Application No. 202011595030.3, filed with the Chinese Patent Office on December 28, 2020 and entitled "一种计算图的节点融合方法及设备" (A node fusion method and device for computational graphs), the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of machine learning, and in particular to a node fusion method and device for computational graphs.
Background
A computational graph is a general representation of a computation process: a directed acyclic graph that describes functions and is widely used on all kinds of data processing platforms. A computational graph includes multiple nodes and directed edges. In the field of machine learning, a computational graph is used to represent the computation logic involved in a neural network. Each node in the graph represents a corresponding operation performed by the neural network (for example, an add node represents an addition operation); the corresponding operation may also be called a computing task, and one node represents one computing task. A directed edge connects a preceding node (which may be called a predecessor or parent node) to a following node (which may be called a successor or child node), indicating that the output of the parent node serves as the input of the child node.
A deep learning framework generally executes a computational graph as follows: first, the user-defined neural network is converted into a computational graph (which has been optimized), and then, following the topological order of the nodes in the graph, the computing tasks corresponding to the ordered nodes are loaded one by one onto the device, the module that executes the specific computing tasks on the acceleration hardware (e.g., a graphics processing unit (GPU), a tensor processing unit (TPU), or an Ascend processor such as the Ascend 910 or Ascend 310). As shown in FIG. 1, the left part of FIG. 1 is a computational graph and the right part is the topological ordering corresponding to that graph. During the execution of the entire neural network, because of the effect of the execution order, each computing task on the device can start only after the computing task of the preceding topological order has finished; this means the ordering implicitly adds execution dependencies between computing tasks that are data-independent. In FIG. 1, for example, there is originally no explicit dependency between the computing tasks of node C and node D, but once the topological order is fixed, node C must execute before node D can. The total network execution time on the device is therefore the sum of the times of the individual computing tasks plus the extra time introduced by interaction, communication, and the like.
However, as the network size of neural networks grows (as shown in FIG. 2), their network structures become increasingly complex; for example, the network model Bert-Large has 1024 hidden layers and GPT-3 has 2048, which places ever more demanding performance requirements on deep learning frameworks. Because such network structures contain numerous branch structures that exhibit direct or indirect computational independence, these branch structures can potentially be executed in parallel. Therefore, to obtain higher execution performance for neural networks with multi-branch structures, the computation logic of multiple nodes that can execute in parallel and consume few computing power resources is integrated before execution. The existing node fusion approaches for computational graphs are mainly horizontal fusion and operator level parallelism. Horizontal fusion is an optimization pass of the accelerated linear algebra (XLA) compiler for GPUs; it is conservative when looking for fusible nodes: in its implementation it searches backwards for fusible nodes starting from the return node of the computational graph, stops as soon as the search is interrupted, and requires the nodes to be fused to have the same data layout and output data type. Operator level parallelism is an optimization pass of the deep learning framework MXNet; it requires the input data of the nodes being fused to all come from the same parent node. These existing node fusion approaches all carry many constraints and cannot fully search the fusible nodes; an efficient node fusion approach for computational graphs is therefore urgently needed.
Summary
The embodiments of the present application provide a node fusion method and device for computational graphs. Compared with the prior art, the method considers the possibility of additional parallelizable branches and thus finds combinations of parallelizable branches beyond those delimited by prior-art rules; the nodes in these branch combinations can then be fused during node fusion of the computational graph, extending the range of obtainable fusible nodes.
On this basis, the embodiments of the present application provide the following technical solutions.
According to a first aspect, an embodiment of the present application first provides a node fusion method for computational graphs, usable in the field of artificial intelligence and specifically applicable to a deep learning framework. The method includes: first, the deep learning framework obtains the network structure of a neural network; the network structure may be a custom network structure defined through an API provided by the deep learning framework, or a predefined network structure obtained directly from a model zoo; this application does not limit how the deep learning framework obtains the neural network. After obtaining the network structure, the deep learning framework converts the neural network into a computational graph, called the first computational graph. The deep learning framework then extracts one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes, where each parallelizable branch group includes multiple sub-branches (i.e., at least 2 sub-branches), a sub-branch is a sequential series structure without forks, and each sub-branch in each parallelizable branch group includes one or more nodes. A parallelizable branch group indicates that its multiple sub-branches support being executed in parallel. It should be noted that the sub-branches belonging to the same parallelizable branch group must satisfy two conditions: first, there is no dependency between the sub-branches; second, fusing any two or more nodes belonging to different sub-branches into one node leaves no ring structure in the computational graph, that is, the fused computational graph contains no ring. In the embodiments of the present application, at least one of the parallelizable branch groups (which may be called the first parallelizable branch group) satisfies at least one of the following conditions: the inputs of all sub-branches in the first parallelizable branch group come from the same node and the outputs of at least two sub-branches in the group point to different nodes; the outputs of all sub-branches in the group point to the same node and the inputs of at least two sub-branches in the group come from different nodes; the first node of every sub-branch in the group has no parent node; or the last node of every sub-branch in the group has no child node. After the above steps, the deep learning framework obtains one or more parallelizable branch groups, and for each parallelizable branch group it fuses multiple nodes coming from different sub-branches of the group, thereby obtaining the second computational graph. It should be noted that the one or more parallelizable branch groups described in the embodiments of the present application are not limited to those obtained by the extraction step; they also include branch groups that can be found by existing approaches. For ease of description, this application describes only the points that differ from existing search approaches; regardless of how a parallelizable branch group is extracted, multiple nodes of each parallelizable branch group can be fused in the fusion manner described in this application to obtain the second computational graph.
In the above implementation of the present application, a node fusion method for computational graphs is provided. Compared with the prior art, it considers the possibility of additional parallelizable branches, finds combinations of parallelizable branches beyond those delimited by prior-art rules, and can therefore fuse the nodes in these branch combinations during node fusion of the computational graph, extending the range of obtainable fusible nodes.
In a possible implementation of the first aspect, when there are multiple parallelizable branch groups, they further include at least one second parallelizable branch group that satisfies the following condition: the inputs of all sub-branches in the second parallelizable branch group come from the same node and the outputs of at least two sub-branches in the group point to the same node, and each sub-branch in the second parallelizable branch group includes at least two nodes.
In the above implementation, the parallelizable branch groups obtained from the connection relationships between the nodes of the first computational graph may be obtained in the manner of the second parallelizable branch group described above, which has wide applicability.
In a possible implementation of the first aspect, some sub-branches in the one or more parallelizable branch groups initially obtained by the deep learning framework may still retain nodes that cannot be fused or are unsuitable for parallel execution; such nodes may collectively be called non-fusible nodes. These nodes are retained when the parallelizable branch groups are extracted in order to keep the branches complete, so that the search along a branch is not interrupted by individual nodes; if it were interrupted, potentially parallel-executable nodes would be missed. After obtaining the one or more parallelizable branch groups, the deep learning framework may remove the non-fusible nodes from each sub-branch of each parallelizable branch group (i.e., each target parallelizable branch group) to obtain the corresponding parallelizable branch groups with the non-fusible nodes removed (which may be called third parallelizable branch groups); any sub-branch of a target parallelizable branch group may be called a target sub-branch. Then, multiple nodes coming from different sub-branches of each third parallelizable branch group (i.e., fusible nodes that have not yet been fused, that is, nodes to be fused) are fused to obtain fusion nodes. The second computational graph includes the fusion nodes and the unfused nodes of the first computational graph (unfused nodes are nodes that have not been fused), and the sub-branch to which each of the multiple nodes in a third parallelizable branch group belongs differs from the sub-branch to which any other of those nodes belongs. It should be noted that in the embodiments of the present application a non-fusible node may specifically be a node with an exclusive operation, for example, when the compilation suite used provides no kernel compilation scheme for certain nodes after fusion; it may also be a node with a specific operation, such as the matrix multiplication or convolution operations of a neural network, which are themselves computationally intensive and in most cases use up the computing power resources of the device as fully as possible during execution.
In the above implementation, the specific way of obtaining the second computational graph is described: the non-fusible nodes are first removed from the first computational graph, and the remaining fusible nodes are then fused to obtain the second computational graph; because the non-fusible nodes are removed from the first computational graph in advance, fusion efficiency can be improved.
In a possible implementation of the first aspect, to fuse all fusible nodes as far as possible, the above process of fusing multiple nodes from different sub-branches of a third parallelizable branch group is performed iteratively; that is, the fusing step is repeated until fewer than 2 unfused nodes remain in the third parallelizable branch group.
In the above implementation, it is guaranteed that as many fusible nodes as possible are fused into fusion nodes, which improves fusion coverage.
In a possible implementation of the first aspect, after the deep learning framework removes the non-fusible nodes and obtains the third parallelizable branch groups, the process of fusing multiple nodes from different sub-branches of a third parallelizable branch group to obtain fusion nodes can be performed by combination search (which may also be called parallel fusion search). The principle of the combination search is: in any third parallelizable branch group, each sub-branch includes one or more nodes; nodes from different sub-branches can be combined into a fusion node, while nodes on the same sub-branch cannot be fused because of their sequential connection; on each third parallelizable branch group, fusion nodes can be obtained by a search algorithm. The specific search process may be: first, select (e.g., randomly select) one unfused node from each of the multiple sub-branches of the third parallelizable branch group to obtain n unfused nodes, where an unfused node is a node that has not been fused and n ≥ 2; then generate m node combinations from the selected n unfused nodes, where each of the m node combinations includes at least two unfused nodes, m ≥ 1 and 2m ≤ n; and evaluate the computing power required by each of the m node combinations through the constructed computing power evaluation model to obtain m evaluation results, each of which characterizes one of the following: the computing power resources consumed by each of the m node combinations, or the computing power resources saved by each of the m node combinations. When a first evaluation result satisfies a preset condition, the first node combination corresponding to the first evaluation result is fused to obtain a first fusion node, and each unfused node in the first node combination is marked as a fused node; the first evaluation result is one of the m evaluation results, and the first node combination is one of the m node combinations.
In the above implementation, how nodes are fused by combination search to obtain fusion nodes is described, and the constructed computing power evaluation model can further guide the search for fusion nodes, ensuring that the fused node combinations bring the expected benefit.
In a possible implementation of the first aspect, whether an evaluation result satisfies the preset condition may be judged immediately after the evaluation result of the current round is obtained, with subsequent processing based on the judgment; alternatively, after multiple rounds of searching candidate node combinations yield multiple evaluation results (i.e., at least two), it may be judged whether any of them satisfies the preset condition, with subsequent processing based on the judgment. Therefore, in the embodiments of the present application, the first evaluation result satisfying the preset condition includes at least one of the following situations: when each of the m evaluation results characterizes the computing power resources consumed by each of the m node combinations, the first evaluation result meets the computing power requirement of the module (device) that executes the computing tasks on the target acceleration hardware; when each of the m evaluation results characterizes the computing power resources saved by each of the m node combinations, the first evaluation result is optimal among the m evaluation results; or, when each of the m evaluation results characterizes the computing power resources saved by each of the m node combinations, the first evaluation result is optimal among x evaluation results, where the x evaluation results are at least two of the m evaluation results.
In the above implementation, it is described that whether the first evaluation result satisfies the preset condition can be judged in several specific ways, which users can choose according to their own needs, providing flexibility.
In a possible implementation of the first aspect, the different branch structure features in the first computational graph can be described by 3 kinds of structural features, which may be called the multi-fork structure, the non-converging structure, and the scattered structure. A multi-fork structure is a branch structure that has the same parent node (which may also be called a common parent node; for ease of description, collectively referred to as a common parent node in the embodiments of this application) or the same child node (which may also be called a common child node; collectively referred to as a common child node in the embodiments of this application). A non-converging structure is a branch structure in which the first node of the branch has no parent node or the last node of the branch has no child node. Apart from multi-fork structures and non-converging structures, the remaining nodes of the first computational graph may collectively be called scattered structures. Therefore, one way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is: search the first computational graph for multiple branches (which may be called first branches) that have the same parent node (which may be called the first common parent node or the common first parent node; collectively referred to as the first common parent node in the embodiments of this application), and obtain a parallelizable branch group from the multiple first branches; and/or search the first computational graph for multiple branches (which may be called second branches) that have the same child node (which may be called the first common child node or the common first child node; collectively referred to as the first common child node in the embodiments of this application), and obtain a parallelizable branch group from the multiple second branches. This search method may also be called one-way search over multi-fork structures.
In the above implementation, a search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described; the search proceeds from whether nodes share a common parent or a common child, and since common parent nodes and common child nodes are ubiquitous in computational graphs converted from neural networks, this search method can find a large number of parallelizable branch groups.
In a possible implementation of the first aspect, after the deep learning framework obtains multiple first branches, a parallelizable branch group may be obtained from them as follows: taking the first common parent node as the starting point, search downward along each first branch according to the connection relationships between nodes, stopping when another common parent node of the first computational graph (which may be called a second common parent node) or a common child node is encountered during the downward search; the nodes traversed by each first branch during the downward search form a sub-branch. For example, if there are 4 first branches, the downward search yields 4 sub-branches, each of which may be called a first sub-branch, and the multiple first sub-branches obtained form one parallelizable branch group (the 4 sub-branches above, for instance, form one parallelizable branch group). It should be noted that none of the first sub-branches includes the first common parent node or a common child node; that is, each first sub-branch excludes the first common parent node that serves as the starting point, and during the downward search along each first branch, if another common parent node (i.e., a second common parent node) is encountered it is incorporated into the corresponding first sub-branch, whereas if a common child node is encountered it is excluded from the corresponding first sub-branch.
In the above implementation, how the deep learning framework obtains a parallelizable branch group from multiple first branches is described in detail: the downward search starts from a common parent node. Extracting parallelizable branch groups starting from a common parent node guarantees that nodes belonging to different sub-branches of the same parallelizable branch group have no inconsistent dependency behavior; for example, no common child node exists within the sub-branches obtained by searching downward from a common parent node, which guarantees that fusing nodes across branches cannot form a ring structure. Existing node fusion approaches for computational graphs all need to judge whether a ring is formed after node fusion, and that judgment is a very cumbersome operation; the search method described in the embodiments of the present application guarantees that no ring structure appears, so no additional ring check is needed for each obtained combination, which simplifies the procedure.
In a possible implementation of the first aspect, after the deep learning framework obtains multiple second branches, a parallelizable branch group may be obtained from them as follows: taking the first common child node as the starting point, search upward along each second branch according to the connection relationships between nodes, stopping when another common child node or a common parent node of the first computational graph is encountered during the upward search; the nodes traversed by each second branch during the upward search form a sub-branch. For example, if there are 3 second branches, the upward search yields 3 sub-branches, each of which may be called a second sub-branch, and the multiple second sub-branches obtained form one parallelizable branch group. It should be noted that none of the second sub-branches includes the first common child node or a common parent node; that is, each second sub-branch excludes the first common child node that serves as the starting point, and during the upward search along each second branch, if another common child node is encountered it is incorporated into the corresponding second sub-branch, whereas if a common parent node is encountered it is excluded from the corresponding second sub-branch.
In the above implementation, how the deep learning framework obtains a parallelizable branch group from multiple second branches is described in detail: the upward search starts from a common child node. This method of extracting parallelizable branch groups from a common child node likewise guarantees that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior; for example, no common parent node exists within the sub-branches obtained by searching upward from a common child node, which guarantees that fusing nodes across branches cannot form a ring structure. The search method described in the embodiments of the present application therefore also guarantees that no ring structure appears, needs no additional ring check for each obtained combination, and simplifies the procedure.
In a possible implementation of the first aspect, another way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is: search the first computational graph for multiple third branches and obtain a parallelizable branch group from them, where the first node of each third branch has no parent node; and/or search the first computational graph for multiple fourth branches whose last node has no child node and obtain a parallelizable branch group from the multiple fourth branches.
In the above implementation, another search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described; the search proceeds from nodes without a parent node or nodes without a child node, which are also widespread in computational graphs converted from neural networks. Therefore, in addition to the above search based on common parent or common child nodes, this search method can still find a large number of parallelizable branch groups. It may also be called one-way search over non-converging structures.
In a possible implementation of the first aspect, after the deep learning framework obtains multiple third branches, a parallelizable branch group may be obtained from them as follows: first, the deep learning framework finds one or more nodes without a parent node in the first computational graph, and, taking each node without a parent node as a starting point, searches downward along each of the multiple third branches according to the connection relationships between nodes, stopping when a common parent node or a common child node of the first computational graph is encountered during the downward search; the nodes traversed by each third branch during the downward search form a sub-branch. For example, if there are 2 third branches, the downward search yields 2 sub-branches, each of which may be called a third sub-branch, and the multiple third sub-branches obtained form one parallelizable branch group. It should be noted that none of the third sub-branches includes a common child node encountered during the downward search; that is, each third sub-branch may include the node serving as its starting point, and during the downward search along each third branch, if a common parent node is encountered it may be incorporated into the corresponding third sub-branch, whereas if a common child node is encountered it is excluded from the corresponding third sub-branch.
In the above implementation, how the deep learning framework obtains a parallelizable branch group from multiple third branches is described in detail: the downward search starts from nodes without a parent node. This method of extracting parallelizable branch groups from nodes without a parent node can likewise guarantee that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior; for example, no common child node exists within the sub-branches obtained by searching downward from nodes without a parent node, which guarantees that fusing nodes across branches cannot form a ring structure. The search method of the embodiments of the present application can therefore also guarantee that no ring structure appears, needs no additional ring check for each obtained combination, and simplifies the procedure.
In a possible implementation of the first aspect, after the deep learning framework obtains multiple fourth branches, a parallelizable branch group may be obtained from them as follows: first, the deep learning framework finds one or more nodes without a child node in the first computational graph, and, taking each node without a child node as a starting point, searches upward along each of the multiple fourth branches according to the direct connection relationships between nodes, stopping when a common parent node or a common child node of the first computational graph is encountered during the upward search; the nodes traversed by each fourth branch during the upward search form a sub-branch. For example, if there are 4 fourth branches, the upward search yields 4 sub-branches, each of which may be called a fourth sub-branch, and the multiple fourth sub-branches obtained form one parallelizable branch group. It should be noted that none of the fourth sub-branches includes a common parent node encountered during the upward search; that is, each fourth sub-branch may include the node serving as its starting point, and during the upward search along each fourth branch, if a common child node is encountered it may be incorporated into the corresponding fourth sub-branch, whereas if a common parent node is encountered it is excluded from the corresponding fourth sub-branch.
In the above implementation, how the deep learning framework obtains a parallelizable branch group from multiple fourth branches is described in detail: the upward search starts from nodes without a child node. This method of extracting parallelizable branch groups from nodes without a child node can likewise guarantee that nodes belonging to different sub-branches of the same group have no inconsistent dependency behavior; for example, no common parent node exists within the sub-branches obtained by searching upward from nodes without a child node, which guarantees that fusing nodes across branches cannot form a ring structure. The search method of the embodiments of the present application can therefore also guarantee that no ring structure appears, needs no additional ring check for each obtained combination, and simplifies the procedure.
In a possible implementation of the first aspect, since both the one-way search over multi-fork structures and the one-way search over non-converging structures are performed on the basis of the direct connection relationships between nodes, the corresponding parallelizable branch groups can be found directly by upward or downward search. In these two searches, however, the stopping conditions are rather strict in order to avoid cumbersome ring checks; as a result, after the first computational graph has been processed by the above search methods, some potentially parallel-executable nodes remain scattered in the first computational graph. These may be called scattered nodes. Scattered nodes may be isolated nodes left behind after their predecessor and successor nodes have been incorporated into parallelizable branch groups, or they may form special node structures, such as a local diamond structure, i.e., a structure that fans out from one node, possibly passes through multiple levels of branching, but ultimately converges into another node. Such scattered nodes lose the chance of being found because of their special structure or position, or are abandoned because the relevant branch contains no valid nodes. Structures of this kind may all be called scattered structures. Finding parallel-executable nodes in scattered structures carries a higher search cost than searching directly on the connection relationships between nodes, and a ring check is also needed to guarantee validity. In this case, therefore, a further way of extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is: determine from the first computational graph a target node that has not been assigned to any parallelizable branch group; if the target node is not a non-fusible node, simplify the local structure around the target node to obtain a fifth branch; and, when there are multiple fifth branches, obtain a parallelizable branch group from the multiple fifth branches. This search method may also be called ring-free search over scattered structures.
In the above implementation, yet another search method for extracting one or more parallelizable branch groups from the first computational graph based on the connection relationships between its nodes is described; it operates on scattered structures and, as a supplement to the two methods above, can find fusible nodes from the first computational graph to the greatest extent, making the search more extensive.
In a possible implementation of the first aspect, after obtaining the second computational graph, the deep learning framework may further compile it to obtain a compiled second computational graph, which can be loaded directly onto the target acceleration hardware for execution. The compilation of the second computational graph includes two parts: one part is ordinary node compilation, i.e., the compilation of unfused nodes (including non-fusible nodes and fusible nodes that were not fused); the other part is fusion node compilation, i.e., the compilation performed on the fusion nodes to obtain the kernels corresponding to the fusion nodes.
In the above implementation, it is described that the compilation of the second computational graph also includes the compilation of the fusion nodes, which is realizable.
In a possible implementation of the first aspect, in order to preserve the parallelism among the nodes that precede a fusion node, when compiling a fusion node into a kernel this application completes kernel generation by segment merging, using code branches to keep the code segments explicitly independent of one another. Segment merging divides IR generation into two stages: 1) each node is scheduled independently, producing, in the first stage, as many sub-IRs as there are nodes; 2) these sub-IRs are then fused and corrected to obtain the total IR of the second stage. Specifically, assuming a fusion node is obtained by fusing p nodes, the kernel corresponding to the fusion node may be compiled as follows: first, the p nodes are scheduled independently to obtain p sub-IRs; the p sub-IRs are then fused into one total IR; finally, the total IR is compiled to obtain the kernel corresponding to the fusion node.
In the above implementation, scheduling and analyzing each node independently increases the flexibility of fusion and guarantees correct generation of the fusion node code, and the generated kernels retain their parallelism, allowing the corresponding target acceleration hardware to perform deeper parallel optimization.
In a possible implementation of the first aspect, the deep learning framework may be an AI framework in any of several specific forms, for example a mainstream deep learning framework such as MindSpore, TensorFlow, TensorNetwork, PyTorch, MXNet, Caffe, or Theano, or another niche deep learning framework; as long as a deep learning framework can optimize and compile computational graphs, it can be regarded as the deep learning framework described in the embodiments of the present application, and this application does not limit the specific form of the deep learning framework.
In the above implementation, several common specific forms of the deep learning framework are described, which shows generality.
According to a second aspect, an embodiment of the present application provides a deep learning framework that has the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware or by hardware executing corresponding software, and the hardware or software includes one or more modules corresponding to the above function.
According to a third aspect, an embodiment of the present application provides a computer device, which may include a memory, a processor, and a bus system, where the memory is configured to store a program and the processor is configured to call the program stored in the memory to execute the method of the first aspect or any possible implementation of the first aspect of the embodiments of the present application.
According to a fourth aspect, the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the method of the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, an embodiment of the present application provides a computer program or computer program product which, when run on a computer, causes the computer to execute the method of the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, an embodiment of the present application provides a chip that includes at least one processor and at least one interface circuit, the interface circuit being coupled to the processor; the at least one interface circuit is configured to perform transceiving functions and to send instructions to the at least one processor, and the at least one processor is configured to run a computer program or instructions having the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above function. In addition, the interface circuit is configured to communicate with modules other than the chip; for example, the interface circuit may send the second computational graph obtained by the on-chip processor to the target acceleration hardware (e.g., a CPU, NPU, GPU, TPU, ASIC, or FPGA).
Brief Description of the Drawings
FIG. 1 is a schematic diagram of topological sorting of a computational graph;
FIG. 2 is a schematic diagram of the network sizes of different types of neural networks;
FIG. 3 is a schematic structural diagram of the devices and computing units included in different acceleration hardware according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the compilation process (ordinary node compilation) of multiple nodes in the computational graph of a neural network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the horizontal fusion approach;
FIG. 6 is a schematic diagram of the operator level parallelism approach;
FIG. 7 is a schematic structural diagram of the main framework of artificial intelligence according to an embodiment of the present application;
FIG. 8 is a flowchart of a deep learning framework processing a computational graph according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a node fusion method for computational graphs according to an embodiment of the present application;
FIG. 10 is a schematic flowchart of optimizing the first computational graph according to an embodiment of the present application;
FIG. 11 is a schematic diagram of different branch structure features in the first computational graph according to an embodiment of the present application;
FIG. 12 is a schematic diagram of obtaining a parallelizable branch group from multiple first branches according to an embodiment of the present application;
FIG. 13 is a schematic diagram of obtaining a parallelizable branch group from multiple second branches according to an embodiment of the present application;
FIG. 14 is a schematic diagram of obtaining a parallelizable branch group from multiple third branches according to an embodiment of the present application;
FIG. 15 is a schematic diagram of obtaining a parallelizable branch group from multiple fourth branches according to an embodiment of the present application;
FIG. 16 is a schematic diagram of simplifying the local structure around a target node according to an embodiment of the present application;
FIG. 17 is a schematic diagram of finding another 2 fifth branches from the first computational graph according to an embodiment of the present application;
FIG. 18 is a schematic diagram of a typical scattered structure containing nodes that can be executed in parallel according to an embodiment of the present application;
FIG. 19 is a schematic diagram of a specific example of the ring-free search over scattered structures according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a parallelizable branch group obtained by the deep learning framework according to an embodiment of the present application;
FIG. 21 is a schematic diagram of removing non-fusible nodes from a first parallelizable branch group to obtain a second parallelizable branch group according to an embodiment of the present application;
FIG. 22 is a schematic diagram of the computing power usage on the device after 3 different nodes are each compiled into a kernel according to an embodiment of the present application;
FIG. 23 is a schematic diagram of the search process of the combination search according to an embodiment of the present application;
FIG. 24 is a schematic comparison diagram of ordinary node compilation and fusion node compilation according to an embodiment of the present application;
FIG. 25 is a specific code example of segment merging according to an embodiment of the present application;
FIG. 26 is another specific code example of segment merging according to an embodiment of the present application;
FIG. 27 is a schematic diagram of the flow of the node fusion method for computational graphs according to an embodiment of the present application;
FIG. 28 is a schematic diagram of the benefit when the kernels before and after node fusion are executed on the device according to an embodiment of the present application;
FIG. 29 is a schematic structural diagram of a deep learning framework according to an embodiment of the present application;
FIG. 30 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description of Embodiments
The embodiments of the present application provide a node fusion method and device for computational graphs. Compared with the prior art, they consider the possibility of additional parallelizable branches and thus find combinations of parallelizable branches beyond those delimited by prior-art rules, so that the nodes in these branch combinations can be fused during node fusion of the computational graph, extending the range of obtainable fusible nodes.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms so used are interchangeable where appropriate; this is merely the manner in which objects with the same properties are distinguished in the description of the embodiments of the present application. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units but may include other units not clearly listed or inherent to the process, method, product, or device.
The embodiments of the present application involve a good deal of knowledge related to neural networks, computational graphs, and so on. To better understand the solutions of the embodiments, the related terms and concepts that may be involved are introduced first. It should be understood that the explanations of these concepts may be limited by the specific circumstances of the embodiments of the present application, but this does not mean the application is limited to those circumstances; the specific circumstances of different embodiments may also differ, and no limitation is imposed here.
(1) Neural networks
A neural network may be composed of neural units and may specifically be understood as a neural network with an input layer, hidden layers, and an output layer; generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b) (reconstructed here from the surrounding description; the original publication renders the expression as an image). From the physical level, the work of each layer of the neural network can be understood as completing the transformation from input space to output space (i.e., from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 by "+b", and operation 5 by "a()". The word "space" is used here because the classified objects are not single things but a class of things, and space refers to the collection of all individuals of that class. W is the weight matrix of each layer of the neural network, and each value in the matrix represents the weight value of one neuron of that layer. The matrix W determines the spatial transformation from input space to output space described above, i.e., the W of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
(2)计算图(computational graph)
计算图是一种通用的计算过程表示方法,用于描述函数的有向无环图,普遍应用在各类数据处理平台上,一个计算图包括多个节点和有向边。
在机器学习领域中,由于神经网络是使用一个或多个神经网络层(如,隐藏层、输出层等)来为接收到的输入生成输出,每个隐藏层的输出被用作下一层(如,神经网络的下一个隐藏层或输出层)的输入,神经网络的每一层则根据该层当前的相关参数(如,权重)的当前值由接收到的输入生成输出。因此,在机器学习领域中,计算图用于表示神经网络涉及的计算逻辑,计算图中的每个节点表示神经网络所进行的相应运算(如,add节点代表一个加法运算),该相应运算也可称为计算任务,一个节点代表一个计算任务,有向边的箭头方向表示的是数据的流向,带箭头的有向边将前一个节点(可称为前节点或父节点)连接至后一个节点(可称为后节点或子节点),表示父节点的输出作为子节点的输入。
如图1的左图部分所示,带箭头的线段中箭头所指向的节点就为该线段尾端连接的节点的子节点,该线段尾端连接的节点就为该线段箭头所指向的节点的父节点,例如,图1中左图部分的节点B为节点A的子节点,节点B同时又是节点C和节点D的父节点。此外,若某个节点的输出作为不同节点的输入,那么该某节点就称为该不同节点的同一父节点(也可称为共同父节点,为便于阐述,在本申请实施例中统称为共同父节点),例如,图1中左图部分的节点B同时是节点C和节点D的父节点,那么节点B就称为节点C和节点D的共同父节点。类似地,若某个节点的输入来自不同节点的输出,那么该某节点就称为该不同节点的同一子节点(也可称为共同子节点,为便于阐述,在本申请实施例中统称为共同子节点),例如,图1中左图部分的节点E同时是节点C和节点D的子节点,那么节点E就称为节点C和节点D的共同子节点。
需要注意的是,在本申请实施例中,对计算图进行搜索的过程包括向下搜索和向上搜索,其中,向下搜索的过程指的是以选定的某个节点为起始点,基于计算图中有向边的箭头指向(即箭头所指示的方向)进行搜索的过程;而向上搜索的过程指的是以选定的某个节点为起始点,基于计算图中有向边的箭头指向的反方向(即箭头尾端所指示的方向)进行搜索的过程。
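为便于理解上述向下搜索与向上搜索,下面给出一个极简的Python草图(其中Node、connect、down_search、up_search等名称均为本文为便于说明而假设的,并非任何深度学习框架的真实API):

```python
from collections import deque

class Node:
    """计算图节点:parents为父节点列表,children为子节点列表。"""
    def __init__(self, name):
        self.name = name
        self.parents = []
        self.children = []

def connect(parent, child):
    """添加一条有向边 parent -> child,表示父节点的输出作为子节点的输入。"""
    parent.children.append(child)
    child.parents.append(parent)

def _search(start, next_nodes):
    """从start出发,按next_nodes给定的方向做广度优先遍历。"""
    visited, order, queue = set(), [], deque([start])
    while queue:
        for nxt in next_nodes(queue.popleft()):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt.name)
                queue.append(nxt)
    return order

def down_search(start):
    """向下搜索:沿有向边箭头所指方向遍历start的所有后继节点。"""
    return _search(start, lambda n: n.children)

def up_search(start):
    """向上搜索:沿箭头反方向遍历start的所有前驱节点。"""
    return _search(start, lambda n: n.parents)
```

例如,按图1左图的连接关系依次调用connect(A,B)、connect(B,C)、connect(B,D)、connect(C,E)、connect(D,E)后,down_search(A)将依次返回B、C、D、E,up_search(E)则依次返回C、D、B、A。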
(3)计算图的节点融合
计算图的节点融合是指将计算图中两个或两个以上节点的计算逻辑进行整合,得到一个新节点,该新节点就称为融合节点,已被融合的节点在融合前可称为待融合节点或未融合节点,即可以进行融合的节点但还处于未被融合的状态,新节点的输入为融合前各个节点的输入集合,新节点的输出包括融合前各个节点的输出。通过节点融合,可将所述各个节点各自对应的计算任务整合成一个计算任务加载到对应的加速硬件上具体执行计算任务的模块(该模块可称为device)上进行执行。在本申请实施例中,还有一些不适合进行融合的节点或不能融合的节点,这类节点则可以统称为不可融合节点。
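沿用上文假设的Node与connect,节点融合的输入/输出整合语义可以用如下Python草图示意(仅演示输入集合与输出的归并,真实实现还需整合各节点的计算逻辑并从图中移除原节点):

```python
def fuse(nodes):
    """将若干待融合节点整合为一个融合节点:
    融合节点的输入为各节点输入的集合(剔除组内节点自身),
    融合节点的输出包括各节点原有的全部对外输出。"""
    fused = Node("+".join(n.name for n in nodes))
    group = set(nodes)
    for n in nodes:
        for p in n.parents:
            if p not in group and fused not in p.children:
                connect(p, fused)
        for c in n.children:
            if c not in group and fused not in c.parents:
                connect(fused, c)
    return fused
```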
(4)加速硬件和device
加速硬件也可以称为硬件加速器、硬件加速芯片等,随着神经网络表现出来的网络体量越来越大,其网络结构也越来越复杂,为了承担繁重的计算任务,专用的计算加速硬件应运而生,如,GPU、TPU、昇腾处理器(如,昇腾910和昇腾310)等,这些加速硬件加强了密集计算的能力,而且在设计初期或演进过程中拥有了越来越大的并行计算空间。
具体地,神经网络的计算图被编译后由加速硬件上具体执行计算任务的device执行,一般来说,每个加速硬件上都有一个或多个device,device数量越多,该加速硬件并行计算的能力就越强,每个device内又被划分为一个或多个小的计算单元,不同类型的GPU、TPU等加速硬件包括的device数量不同,且不同加速硬件内device包括的计算单元略有不同,例如,GPU内device包括的计算单元为基本执行单元(streaming multiprocessor,SM),如图3中的(a)子示意图所示;TPU(或昇腾处理器)内包括的计算单元则为计算核心(core),如图3中的(b)子示意图所示。需要注意的是,图3中的GPU、TPU包括的device数量以及各个device内包括的SM数量或core数量仅为示意,具体此处不再举例示意。如下表1所示,不同型号的GPU、DPU、昇腾处理器所包括的SM或core的数量都十分可观,且可并行执行的计算单元数量呈上升增长趋势。
表1:不同加速硬件中可并行执行的计算单元的数量(原表以图片形式给出,具体数值此处无法恢复)
(5)中间表示(intermediate representation,IR)和算子核(kernel)
IR是程序编译过程中,源代码与目标代码之间翻译的中介。根据编译原理知识,编译器不是直接将源代码翻译成目标代码,而是先将其翻译成一种“中间语言”,之后,再由“中间语言”翻译成目标代码。因此,通常将编译器的编译过程分为前端和后端,其中,前端会对所输入的源代码进行词法分析、语法分析、语义分析等,然后生成中间表达形式(即IR),后端再对IR进行优化,生成目标代码,该目标代码可直接在目标设备(如,各种加速硬件)上运行。
在深度学习领域中,神经网络被转换为计算图后,计算图中的节点所代表的计算任务就可看作是源代码,对计算图中的各个节点进行调度后就可得到各节点各自对应的IR,再对各个IR进行编译,就可得到各节点各自对应的目标代码,该目标代码就称为kernel,生成的kernel可被对应的加速硬件识别,由对应的加速硬件上具体执行计算任务的device来执行。
为便于理解,下面举例说明:请参阅图4,图4为本申请实施例提供的神经网络的计算图中多个节点的编译过程的一个示意图,图4以3个节点为例进行示意,首先,编译器(也可称为编码器)对节点A、节点B、节点C各自进行调度,得到这3个节点各自对应的中间表示IR1、IR2、IR3,再进一步对IR1、IR2、IR3进行编译,得到与节点A、节点B、节点C各自对应的kernel1、kernel2、kernel3,得到的这3个kernel就可直接依次加载至对应的加速硬件内的device上,由device上的计算单元(如,SM、core等)依次执行对应的kernel。
(6)加速线性代数(accelerated linear algebra,XLA)编译器
XLA编译器是一种针对特定领域的线性代数编译器,能够加快tensorflow模型的运行速度,而且可能不需要更改源代码。tensorflow是一种常见的深度学习框架。
(7)自动算子核生成器(auto kernel generator,AKG)
AKG是深度学习框架中常见的一种编译器,是神经网络中kernel的自动生成器。
此外,在介绍本申请实施例之前,先对目前神经网络的计算图的节点融合的常见方式进行简单介绍,使得后续便于理解本申请实施例。
目前已有的计算图的节点融合方式主要有水平融合(horizontal fusion)方式和算子级并行(operator level parallelism)方式。其中,水平融合方式是XLA针对GPU的一个优化步骤,启用该优化步骤后,计算图中的许多小kernel(如,更新神经网络的训练参数时的乘法、加法等操作)将被融合起来。如图5所示,图5为水平融合方式的一个示意图,其实现上,为了保证不引入环结构,该方式从计算图的返回节点(即图5中左部分的(ROOT)tuple节点)向上搜索作为输入的节点,符合融合条件的节点则融合在一起,如图5中的右部分所示,从(ROOT)tuple节点往上找到Mul节点和Add节点这两个节点,然后通过添加类似Reshape、Concatenate、Slice的操作将其融合为一个新的节点。
算子级并行方式则是深度学习框架MXNet的一个优化步骤,其实现上,通过识别来自同一共同父节点的所有子节点,将其融合成一个新节点。由于有共同父节点,各子节点各自对应的计算任务相互独立,所以融合产生的新节点表现出了子节点的并行特性。如图6所示,图6示意的是算子级并行方式的一个示意图,具体地,从图6中的左部分图里的Split节点找到该Split的子节点Embeding0~Embeding25,并融合成图6中的右部分图的ParallelOP节点。按照类似的方式,在整个计算图上处理符合这种特征的节点,即完成了对计算图的优化。
由上述可知,水平融合方式在实现上有如下局限:1)寻找可融合的节点时很保守,实现上以计算图的返回节点为起点往回查找可融合的节点(即只作用在(ROOT)tuple节点上),且查找过程中一旦中断就停止;2)待融合的节点需要有相同的数据排布和输出数据类型。而算子级并行方式在实现上的局限是:要求被融合的多个节点的输入数据都来自于共同父节点,也就是要求待融合的节点(如,图6中的各个Embeding节点)与共同父节点(如,图6中的Split节点)必须有直接连接关系。
已有的这些计算图的节点融合方式都具有很多约束条件,无法充分搜索可融合的节点,因此,如何在计算图中搜索可融合节点时覆盖面更为广泛、所受限制更小成为亟待解决的问题。基于此,本申请实施例提供了一种计算图的节点融合方法,该方法只基于计算图中节点之间的连接关系搜索可并行分支组,搜索过程不会被个别不可融合节点所打断,提高了神经网络中可融合节点的搜索命中率。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
首先对人工智能系统总体工作流程进行描述,请参见图7,图7示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主体框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(也可称为AI芯片)提供,例如,CPU、NPU、GPU、TPU、ASIC、FPGA等硬件加速芯片提供,在本申请实施例中,智能芯片也包括昇腾系列(如,昇腾910、昇腾310等)的处理器;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、平安城市等。
本申请可以应用在人工智能的各个领域中,例如,计算机视觉领域(如,图像处理领域)、语义分析领域等等,具体地,结合图7来讲,本申请实施例中计算图的节点融合方法属于上述“(3)数据处理”中的一种具体的数据处理方式,该计算图的节点融合方法应用于深度学习框架,通过深度学习框架将计算图中的可融合的节点进行融合,得到融合节点,并将得到的融合节点以及普通节点(即未被融合的节点)通过深度学习框架中的编译器处理成智能芯片(如,上述所述的GPU、TPU等加速硬件)能识别的kernel,最后由对应的加速硬件上的device具体执行各个kernel。
由于本申请实施例提供的计算图的节点融合方法应用于深度学习框架,因此在介绍本申请实施例前,先对本申请涉及到的深度学习框架进行介绍,请参阅图8,图8为本申请实施例提供的深度学习框架处理计算图的一个处理流程图,深度学习框架对计算图的处理流程主体上可分为4个步骤,分别为网络定义、计算图优化、编译、硬件加载,下面分别进行介绍:
(1)网络定义
用户通过深度学习框架提供的应用程序接口(application programming interface,API)自定义神经网络的网络结构,或从网络模型库(model zoo)中直接获取已经预定义好的神经网络的网络结构。
(2)计算图优化
神经网络的网络结构转化为原始的计算图后,将通过公共子表达式消除(common subexpression elimination,CSE)和死码消除(dead code elimination,DCE)等优化步骤进行优化,这些优化步骤将依次修改优化计算图,得到优化后的计算图。
在本申请实施例中,额外增加了并行融合这一优化步骤,用来融合计算图中潜在的可并行执行的节点(即可融合节点),得到融合节点。
(3)编译
经过上述计算图优化后,深度学习框架中的编译器将负责依据计算图中的各个节点(包括融合节点)编译出加速硬件所需要的kernel。
这里需要注意的是,深度学习框架的类型不同,所采用的编译器也略有不同,例如,在各个主流深度学习框架中,常见的编译器有AKG、LLVM(low level virtual machine)等,LLVM是架构编译器(compiler)的框架系统,以C++编写而成,用于优化以任意程序语言编写的程序的编译时间(compile-time)、链接时间(link-time)、运行时间(run-time)以及空闲时间(idle-time),对开发者保持开放,并兼容已有脚本。
在本申请实施例中,在编译时区分开了普通节点编译和融合节点编译,其中,普通节点编译是指对未进行融合的计算图中的节点进行编译,与现有的编译过程类似,而融合节点编译是指对得到的各个融合节点进行编译,编译过程与现有的普通节点的编译过程不同。
(4)硬件加载
对上述各个节点(包括普通节点和融合节点)编译完成后,就可得到各个节点各自对应的kernel,各个kernel就可被加载至对应的加速硬件(如,GPU、TPU、昇腾910、昇腾310等),由对应的加速硬件上的device具体执行各个节点的计算任务,加速硬件上具有与对应加速硬件相关的如内存分配、数据拷贝、设备管理等功能。
需要说明的是,在本申请的一些实施方式中,深度学习框架可以是多种具体表现形式的AI框架,例如,可以是mindspore、tensorflow、tensornetwork、pytorch、mxnet、caffe、theano等主流的深度学习框架,也可以是其他小众的深度学习框架,只要该深度学习框架能够对计算图进行优化、编译的处理过程,都可以认为是本申请实施例所述的深度学习框架,具体本申请对深度学习框架的表现形式不做限定。
接下来介绍本申请实施例所提供的计算图的节点融合方法,请参阅图9,图9为本申请提供的计算图的节点融合方法的一个流程示意图,该方法可应用于上述所述的深度学习框架,该方法可以包括如下步骤:
901、将第一神经网络转换为第一计算图。
首先,深度学习框架可以获取神经网络的网络结构,该网络结构可以是通过深度学习框架提供的API自定义神经网络的网络结构,也可以是从网络模型库(model zoo)中直接获取已经预定义好的神经网络的网络结构,具体本申请对深度学习框架如何获取神经网络的方式不做限定。
在获得神经网络的网络结构后,深度学习框架将进一步将该神经网络转换成一张计算图,得到的该计算图可称为第一计算图。
902、从该第一计算图中提取一个或多个可并行分支组,该可并行分支组指示属于一个可并行分支组的多个子分支支持被并行执行。
深度学习框架得到该第一计算图之后,会基于该第一计算图中各个节点之间的连接关系从第一计算图中提取到一个或多个可并行分支组,其中,每个可并行分支组都包括多个子分支(即至少2个子分支),子分支为不存在分叉的顺序串联结构,每个可并行分支组中的每个子分支包括一个或多个节点。该可并行分支组就指示该可并行分支组的多个子分支支持被并行执行。这里需要注意的是,属于同一个可并行分支组的子分支需要满足两个条件:一个是各子分支之间不存在连接关系;二是属于不同子分支的任意两个或两个以上节点融合成一个节点后的计算图不存在环结构,即融合后的计算图中不存在环结构。此外,所述的第一计算图中各个节点之间的连接关系是指在具体的数据执行过程中,某些节点间必须保证执行的先后顺序。该步骤902的执行过程可称为网络分支提取。还需要注意的是,在本申请实施例中,可并行分支组中包括的至少一个可并行分支组(可称为第一可并行分支组)满足以下条件中的至少一种:该第一可并行分支组内的所有子分支的输入来自于同一节点且该第一可并行分支组内的至少两个子分支的输出指向不同的节点、该第一可并行分支组内的所有子分支的输出指向同一节点且该第一可并行分支组内的至少两个子分支的输入来自不同的节点、该第一可并行分支组内的所有子分支的第一个节点没有父节点、该第一可并行分支组内的所有子分支的最后一个节点没有子节点。
需要说明的是,在本申请的一些实施方式中,在可并行分支组为多个的情况下,该可并行分支组还包括至少一个第二可并行分支组,该第二可并行分支组满足以下条件:该第二可并行分支组内的所有子分支的输入来自于同一节点且该第二可并行分支组内的至少两个子分支的输出指向同一节点,该第二可并行分支组内的每个子分支包括至少两个节点。
需要说明的是,在本申请的一些实施方式中,得到第一计算图后还可以进一步对第一计算图进行优化,优化的目的之一是使得优化后的第一计算图能够直观反映出数据流之间的依赖关系,例如,优化的手段可以是:去除第一计算图中与计算无关的节点、增加与依赖分析相关的虚边等,例如,第一计算图中的同步节点,代表的是执行上的时间先后关系,是一种同步消息,其与具体计算无关,因此可消除该节点,并为相关节点添加上依赖虚边。如图10所示,图10中的左边部分是原始的第一计算图的一个局部示意,该第一计算图中包括有同步节点D,因此经过优化后,同步节点D会被消除,并添加节点A和节点E之间的虚边,该虚边如图10中的右边部分带箭头的虚线所示,用于表示节点A与节点E之间存在直接连接关系。经过该优化过程,优化后的第一计算图包括各个节点之间的直接连接关系,即数据流依赖关系,该直接连接关系中包含有第一计算图中每个节点输入数据的供给节点(即父节点)与输出数据的消费节点(即子节点),例如,A→B→C,则节点B的输入数据的供给节点为节点A,节点B的输出数据的消费节点为节点C,具体地,如图10中的右边部分,节点A与节点B和节点E都存在直接的连接关系,节点B与节点C存在直接的连接关系等,具体此处不再举例示意。
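以图10的优化为例,去除同步节点并补上依赖虚边的处理可以写成如下Python草图(假设同步节点带有is_sync标记,并沿用前文假设的Node结构,仅为示意):

```python
def eliminate_sync_nodes(nodes):
    """去除与计算无关的同步节点,并在其父、子节点之间补上依赖虚边,
    以保留原有的执行先后关系(对应图10中带箭头的虚线)。"""
    for node in [n for n in nodes if getattr(n, "is_sync", False)]:
        for p in node.parents:
            p.children.remove(node)
            for c in node.children:
                if c not in p.children:
                    connect(p, c)  # 依赖虚边
        for c in node.children:
            c.parents.remove(node)
        nodes.remove(node)
    return nodes
```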
在得到第一计算图(或优化后的第一计算图)后,深度学习框架将进一步对该第一计算图(或优化后的第一计算图)进行分层次搜索,得到一个或多个可并行分支组,其中,不同层次对应不同的搜索方式。为便于阐述,下述实施例均基于第一计算图进行说明。
需要注意的是,由于本申请对第一计算图是进行分层次搜索,并且不同的层次对应不同的搜索方式,分层次搜索是针对整个第一计算图中不同的分支结构特征来进行的,在本申请的一些实施方式中,第一计算图中不同的分支结构特征可以用3种结构特征来描述,分别可称为多分叉结构、无汇聚结构、散落结构,具体可参阅图11,图11为本申请实施例提供的第一计算图中不同的分支结构特征,下面分别进行说明:
(1)多分叉结构
多分叉结构为拥有共同父节点或拥有共同子节点的分支结构,其中,共同父节点和共同子节点可统称为汇合节点,这种分支结构有着最直观的可并行性,如图11中的(a)子示意图所示,多分叉结构的各个分支间(不包括汇合节点),没有明确的连接关系,因而分支间可并行执行。
(2)无汇聚结构
无汇聚结构则是指那些分支结构的第一个节点没有父节点或分支结构的最后一个节点没有子节点的分支结构,通过给这些分支增加一个虚拟的汇合节点,就可以将其转化为多分叉结构进行处理,如图11中的(b)子示意图所示,白色底节点和实箭头为实际的节点和连接关系,灰色底节点是额外增加的虚拟汇合节点,虚箭头为同步加上的虚拟连接关系。
(3)散落结构
除了多分叉结构和无汇聚结构,第一计算图中剩下的节点就可以统称为散落结构,在这些散落结构中,可能还存在一些分支,分支间的节点没有直接的父子关系可以表现出并行性,但这些分支结构就拓扑序来说是可以并行执行的,如图11中的(c)子示意图所示,这些节点之间没有直接连接关系,但是有些分支结构存在可并行执行的可能性。
下面对如何对第一计算图进行分层次搜索的具体搜索过程进行描述,根据第一计算图所呈现的分支结构特征,分层次搜索的过程可以包括如下至少一种搜索方式:多分叉结构单向搜索、无汇聚结构单向搜索、散落结构无环搜索,下面对每种搜索方式分别进行介绍:
(1)多分叉结构单向搜索
多分叉结构单向搜索的过程也可以称为第一层次搜索,具体地,深度学习框架对第一计算图进行第一层次搜索,该第一层次搜索至少包括但不限于如下至少一种具体形式的搜索过程:
a、在第一计算图中搜索拥有共同父节点(可称为第一共同父节点)的多个分支(可称为第一分支),并根据该多个第一分支得到一个可并行分支组。
在这种搜索方式中,深度学习框架会从第一计算图中搜索拥有第一共同父节点的多个第一分支,并根据该多个第一分支得到一个可并行分支组,该第一共同父节点为第一计算图中的任意一个共同父节点。
需要说明的是,在本申请的一些实施方式中,深度学习框架得到多个第一分支之后,根据该多个第一分支得到一个可并行分支组的具体实现方式可以是:以该第一共同父节点为起始点,根据节点间的连接关系,沿该多个第一分支中的每个第一分支各自向下搜索,直至在向下搜索过程中遇到第一计算图中的其他共同父节点(可称为第二共同父节点)或共同子节点时停止,在向下搜索过程中每个第一分支遍历到的节点各自构成一个子分支,例如,若有4个第一分支,那么经过上述向下搜索过程就可得到4个子分支,每个子分支可称为第一子分支,得到的多个第一子分支构成一个可并行分支组,例如,上述4个子分支就构成一个可并行分支组。这里需要注意的是,多个第一子分支中的每个第一子分支不包括第一共同父节点和共同子节点,也就是说,每个第一子分支都不包括作为起始点的第一共同父节点,且在每个第一分支向下搜索过程中,若遇到其他共同父节点(即第二共同父节点),就将该第二共同父节点纳入到对应的第一子分支中,若遇到共同子节点,则在对应的第一子分支中排除该共同子节点。
为便于理解上述过程,下面举例进行示意:请参阅图12,图12为本申请实施例提供的一个根据多个第一分支得到一个可并行分支组的一个示意图,节点A为第一共同父节点,以该节点A为起始点,可搜索到2个分支(即第一分支),其中一个分支包括节点B、节点C、节点F和节点G,另一个分支包括节点D、节点E、节点H和节点I,沿着这2个分支各自向下搜索,左边的分支在向下搜索过程中遇到其他共同父节点C时停止向下搜索,右边的分支在向下搜索过程中遇到共同子节点H时停止向下搜索,此时,左边的分支被遍历到的节点为A、B、C,右边的分支被遍历到的节点为A、D、E、H,然而,节点C为共同父节点,可纳入左边的子分支中,节点H为共同子节点,排除在右边的子分支外,而节点A又是作为起始点的第一共同父节点,均被排除在2个子分支外,因此,左边的子分支为(B→C),右边的子分支为(D→E),这2个子分支就构成一个可并行分支组。
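上述以共同父节点为起始点的向下搜索可以写成如下Python草图(向上搜索与之对称,只需把children与parents互换;is_common_parent等辅助函数为本文为说明而假设的):

```python
def is_common_parent(node):
    return len(node.children) >= 2

def is_common_child(node):
    return len(node.parents) >= 2

def group_from_common_parent(parent):
    """多分叉结构单向搜索(向下):从第一共同父节点出发,沿每个第一分支
    向下收集节点;遇到其他共同父节点则纳入后停止,遇到共同子节点则排除
    后停止,各分支收集到的节点各自构成一个子分支。"""
    group = []
    for node in parent.children:
        branch = []
        while True:
            if is_common_child(node):    # 共同子节点:排除,停止
                break
            branch.append(node.name)
            if is_common_parent(node):   # 其他共同父节点:纳入后停止
                break
            if len(node.children) != 1:  # 分支自然结束
                break
            node = node.children[0]
        if branch:
            group.append(branch)
    return group if len(group) >= 2 else None
```

对图12的结构调用group_from_common_parent(A),将得到子分支[B, C]与[D, E],与上文分析一致。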
b、在第一计算图中搜索拥有共同子节点(可称为第一共同子节点)的多个分支(可称为第二分支),并根据该多个第二分支得到一个可并行分支组。
在这种搜索方式中,深度学习框架会从第一计算图中搜索拥有第一共同子节点的多个第二分支,并根据该多个第二分支得到一个可并行分支组,该第一共同子节点为第一计算图中的任意一个共同子节点。
需要说明的是,在本申请的一些实施方式中,深度学习框架得到多个第二分支之后,根据该多个第二分支得到一个可并行分支组的具体实现方式可以是:以该第一共同子节点为起始点,根据节点间的连接关系,沿该多个第二分支中的每个第二分支各自向上搜索,直至在向上搜索过程中遇到第一计算图中的其他共同子节点或共同父节点时停止,在向上搜索过程中每个第二分支遍历到的节点各自构成一个子分支,例如,若有3个第二分支,那么经过上述向上搜索过程就可得到3个子分支,每个子分支可称为第二子分支,得到的多个第二子分支构成一个可并行分支组,例如,上述3个子分支就构成一个可并行分支组。这里需要注意的是,多个第二子分支中的每个第二子分支不包括该第一共同子节点和共同父节点,也就是说,每个第二子分支都不包括作为起始点的第一共同子节点,且在每个第二分支向上搜索过程中,若遇到其他共同子节点,就将该其他共同子节点纳入到对应的第二子分支中,若遇到共同父节点,则在对应的第二子分支中排除该共同父节点。
为便于理解上述过程,下面举例进行示意:请参阅图13,图13为本申请实施例提供的一个根据多个第二分支得到一个可并行分支组的一个示意图,节点H为第一共同子节点,以该节点H为起始点,可搜索到2个分支(即第二分支),其中一个分支包括节点F、节点D、节点A和节点B,另一个分支包括节点G、节点E、节点C、节点I和节点J,沿着这2个分支各自向上搜索,左边的分支在向上搜索过程中遇到其他共同子节点D时停止向上搜索,右边的分支在向上搜索过程中遇到共同父节点I时停止向上搜索,此时,左边的分支被遍历到的节点为H、F、D,右边的分支被遍历到的节点为H、G、E、C、I,然而,节点D为共同子节点,可纳入左边的子分支中,节点I为共同父节点,排除在右边的子分支外,而节点H又是作为起始点的第一共同子节点,均被排除在2个子分支外,因此,左边的子分支为(D→F),右边的子分支为(C→E→G),这2个子分支就构成一个可并行分支组。
以上从共同父节点或从共同子节点出发提取可并行分支组的方法,可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于共同子节点出发向上搜索得到的子分支中不存在共同父节点的情形,类似地,基于共同父节点出发向下搜索得到的子分支中不存在共同子节点的情形,从而可保证分支间节点的融合不会形成环结构。
目前已有的计算图的节点融合方式中,都需要判断节点融合后是否成环,而判断是否成环是个很繁杂的操作,本申请实施例上述所述的多分叉结构单向搜索的提取方式保证了不会出现环结构,因此无需额外对每次得到的组合进行成环判断,简化了操作流程。
需要说明的是,在本申请实施例中,多分叉结构单向搜索(即第一层次搜索)可以是只进行上述方式a的搜索过程,也可以是只进行上述方式b的搜索过程,还可以是同时进行上述方式a和方式b的搜索过程,具体此处不做限定。但需要注意的是,若是同时进行上述方式a和方式b的搜索过程,那么第一计算图中的某个节点(如,节点A)属于最先被遍历到的那个子分支,并且会对该节点进行标识,该标识就用于表征该节点已被编入子分支,以防止在后续搜索过程中被重复分组。
(2)无汇聚结构单向搜索
无汇聚结构单向搜索的过程也可以称为第二层次搜索,具体地,深度学习框架对第一计算图进行第二层次搜索,该第二层次搜索至少包括但不限于如下至少一种具体形式的搜索过程:
a、在第一计算图中搜索没有父节点的多个分支(可称为第三分支),并根据该多个第三分支得到一个可并行分支组。
在这种搜索方式中,深度学习框架会从第一计算图中搜索没有父节点的多个第三分支,并根据该多个第三分支得到一个可并行分支组。
需要说明的是,在本申请的一些实施方式中,深度学习框架得到多个第三分支之后,根据该多个第三分支得到一个可并行分支组的具体实现方式可以是:首先,深度学习框架从该第一计算图中搜索到没有父节点的一个或多个节点,并以该没有父节点的节点为起始点,根据节点间的连接关系,沿该多个第三分支中的每个第三分支各自向下搜索,直至在向下搜索过程中遇到第一计算图中的共同父节点或共同子节点时停止,在向下搜索过程中每个第三分支遍历到的节点各自构成一个子分支,例如,若有2个第三分支,那么经过上述向下搜索过程就可得到2个子分支,每个子分支可称为第三子分支,得到的多个第三子分支构成一个可并行分支组,例如,上述2个子分支就构成一个可并行分支组。这里需要注意的是,多个第三子分支中的每个第三子分支不包括向下搜索过程中遇到的共同子节点,也就是说,每个第三子分支可以包括作为起始点的节点,且在每个第三分支向下搜索过程中,若遇到共同父节点,可以将该共同父节点纳入到对应的第三子分支中,若遇到共同子节点,则在对应的第三子分支中排除该共同子节点。
从特征上看,第二层次搜索的方式a的搜索过程与上述第一层次搜索的方式a的搜索过程类似,其不同之处在于:第二层次搜索的方式a的搜索过程不是从一个节点出发进行查找,而是将多个无汇聚节点(即没有父节点的节点)作为一组进行向下搜索,且无汇聚节点也被包含在各自的子分支内。如图14所示,图14为本申请实施例提供的一个根据多个第三分支得到一个可并行分支组的一个示意图,深度学习框架从没有父节点的节点B、D出发,各自向下搜索,得到子分支(B→C)和子分支(D→E),子分支(B→C)和子分支(D→E)就构成一个可并行分支组。
b、在第一计算图中搜索没有子节点的多个分支(可称为第四分支),并根据该多个第四分支得到一个可并行分支组。
在这种搜索方式中,深度学习框架会从第一计算图中搜索没有子节点的多个第四分支,并根据该多个第四分支得到一个可并行分支组。
需要说明的是,在本申请的一些实施方式中,深度学习框架得到多个第四分支之后,根据该多个第四分支得到一个可并行分支组的具体实现方式可以是:首先,深度学习框架从该第一计算图中搜索到没有子节点的一个或多个节点,并以该没有子节点的节点为起始点,根据节点间的直接连接关系,沿该多个第四分支中的每个第四分支各自向上搜索,直至在向上搜索过程中遇到第一计算图中的共同父节点或共同子节点时停止,在向上搜索过程中每个第四分支遍历到的节点各自构成一个子分支,例如,若有4个第四分支,那么经过上述向上搜索过程就可得到4个子分支,每个子分支可称为第四子分支,得到的多个第四子分支构成一个可并行分支组,例如,上述4个子分支就构成一个可并行分支组。这里需要注意的是,多个第四子分支中的每个第四子分支不包括向上搜索过程中遇到的共同父节点,也就是说,每个第四子分支可以包括作为起始点的节点,且在每个第四分支向上搜索过程中,若遇到共同子节点,可以将该共同子节点纳入到对应的第四子分支中,若遇到共同父节点,则在对应的第四子分支中排除该共同父节点。
同样地,从特征上看,第二层次搜索的方式b的搜索过程与上述第一层次搜索的方式b的搜索过程类似,其不同之处在于:第二层次搜索的方式b的搜索过程不是从一个节点出发进行查找,而是将多个无汇聚节点(即没有子节点的节点)作为一组进行向上搜索,且无汇聚节点也被包含在各自的子分支内。如图15所示,图15为本申请实施例提供的一个根据多个第四分支得到一个可并行分支组的一个示意图,深度学习框架从没有子节点的节点D、G、F出发,各自向上搜索,得到子分支(C→D)、子分支(G)和子分支(F),子分支(C→D)、子分支(G)和子分支(F)就构成一个可并行分支组。
类似地,以上从没有父节点或从没有子节点的节点出发提取可并行分支组的方法,同样可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于没有子节点的节点出发向上搜索得到的子分支中不存在共同父节点的情形,类似地,基于没有父节点的节点出发向下搜索得到的子分支中不存在共同子节点的情形,从而可保证分支间节点的融合不会形成环结构。因此,本申请实施例上述无汇聚结构单向搜索的提取方式同样可保证不会出现环结构,无需额外对每次得到的组合进行成环判断,简化了操作流程。
需要说明的是,在本申请实施例中,无汇聚结构单向搜索(即第二层次搜索)可以是只进行上述方式a的搜索过程,也可以是只进行上述方式b的搜索过程,还可以是同时进行上述方式a和方式b的搜索过程,具体此处不做限定。但需要注意的是,若是同时进行上述方式a和方式b的搜索过程,那么第一计算图中的某个节点属于最先被遍历到的那个子分支,并且会对该节点进行标识,该标识就用于表征该节点已被编入子分支,以防止在后续搜索过程中被重复分组。
(3)散落结构无环搜索
散落结构无环搜索的过程也可以称为第三层次搜索,在散落结构中找到可并行执行的节点相对于直接基于节点间的连接关系查找而言,搜索成本较高,同时为了保证其有效性,还需进行成环判定。具体地,深度学习框架从第一计算图中确定出未被划分进任意一个可并行分支组的目标节点,该目标节点也可称为敏感节点,是指适合融合的节点,因此,目标节点需满足如下几个条件:1)该目标节点对应的计算任务在device执行时所需消耗的算力资源不能超过预设阈值;2)不属于不可融合节点。深度学习框架选定目标节点后,将以该目标节点为中心,化简其上下游网络中的局部结构,得到一个分支,该分支可称为第五分支。例如,由一个节点散出,中间或经过多级分流,最终汇入另一个节点构成的菱形结构,在第一计算图中可以被化简成一个虚拟节点。如图16所示,图16为本申请实施例提供的对目标节点周围的局部结构进行化简的一个示意图,图16中的左边部分的节点A与节点B之间用虚线框住的部分为菱形结构,这两个菱形结构各自被化简成右图中灰色底节点(即虚拟节点),其中,节点A和节点B即为目标节点;类似地,可据此找到其他目标节点,并化简其周围的局部结构,得到多个第五分支,并根据得到的多个第五分支得到一个可并行分支组。如图17所示,图17示意的是从第一计算图中找到的另外2个第五分支,其中一个第五分支的目标节点为节点C、D,另一个第五分支的目标节点为节点E、F,最后根据这3个第五分支可得到一个可并行分支组。
这里需要注意的是,多个第五分支中各虚拟节点的作用是:辅助判断属于各个不同第五分支的目标节点是否可融合,该虚拟节点对应的真实节点可能是不可融合的节点,也可能是已经被融合进可并行分支组里的节点,因此,为便于操作,该虚拟节点在后续不再参与是否是可融合节点的判断,在后续的编译过程中,依然是作为普通节点进行编译。例如,以图16和图17为例,得到的这3个第五分支一共包括了5个虚拟节点,这5个虚拟节点的作用是用于判断节点A至节点F中属于不同第五分支的至少两个节点是否可融合,这5个虚拟节点并不属于可融合节点,在后续的编译过程中,还是基于各自对应的菱形结构涉及的节点进行普通编译。最后,得到的多个第五分支之间需要进行成环判断,若互相之间不成环,再根据该多个第五分支得到可并行分支组。进行成环判断的目的在于保证其有效性。
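其中的成环判定可以基于可达性分析实现:若两个候选节点之间在原图上已存在有向路径,则把它们融合为一个节点必然引入环。一个极简的Python草图如下(沿用前文假设的Node结构,仅为示意):

```python
def reachable(src, dst):
    """深度优先判断原图中是否存在从src到dst的有向路径。"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node is dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(node.children)
    return False

def fusion_creates_cycle(a, b):
    """若a可达b或b可达a,则融合a、b会在计算图中形成环结构。"""
    return reachable(a, b) or reachable(b, a)
```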
还需要注意的是,本申请实施例进行第三层次搜索的目的在于:第一层次搜索和第二层次搜索都是基于节点之间的直接连接关系进行的,因此可以直接根据向上搜索或向下搜索找到各自对应的可并行分支组。但在上述第一层次搜索或第二层次搜索中,为了避免繁杂的成环判定操作,其停止条件较为严格,这将使按上述搜索方式处理第一计算图后,仍有部分潜在的可并行执行的节点散落在第一计算图中,这些节点可称为散落节点。这些散落节点可能是一些前后节点已经被并入可并行分支组后遗落的独立节点,也可能是一些特殊的节点结构,如,局部的菱形结构,即由一个节点散出,中间或经过多级分流,但最终汇入另一个节点的结构,这些散落节点由于自身特殊的结构或所处位置等原因错失了被搜索到的机会,或因为相关分支上无有效节点而被抛弃。为便于理解,下面举例进行说明,请参阅图18,图18为一种典型的散落结构内包含有可并行执行的节点的示意图,由图18可知,子分支(B)与子分支(I)是搜索到的一个可并行分支组,子分支(H)与子分支(O)是搜索到的另一可并行分支组,从图中结构可以看出,子分支(C→D→E→F→G)与子分支(J→K→L→M→N)是未被找到的潜在可并行分支组。该潜在可并行分支组未被找到的原因在于:其中,C、J在基于共同父节点的向下搜索过程中被找到,但由于相关分支是空分支(因为(B→D)与(I→K)间无节点,故为空分支),所以节点C、J被抛弃,同理被抛弃的还有节点G、N;而节点D、K、F、M本身为目标节点(即敏感节点),但由于未被其他的搜索过程命中(即没有被上述第一层次搜索或第二层次搜索的搜索过程找到),故而也被抛弃;而节点E、L则是被菱形结构阻断。
下面以一个具体的实例说明散落结构无环搜索的过程是怎样的,请参阅图19,图19为本申请实施例提供的散落结构无环搜索的具体实例的一个示意图,图19是基于图18的散落结构得到,图19中(a)部分的节点BI和节点HO分别是图18中节点B、I的融合节点以及节点H、O的融合节点,即对图18中的节点B、I进行融合,得到融合节点BI,并对节点H、O进行融合,得到融合节点HO。散落结构无环搜索的过程具体可分为4个步骤,下面分别进行阐述:步骤1、首先,深度学习框架在第一计算图中找到所有散落的目标节点的集合,如,图18中的节点G、N、C、J、D、K、F、M、E、L等,分别以这些节点为中心,向上搜索/向下搜索并化简在100拓扑序内能形成菱形结构(假设设定拓扑序间隔为100,则若记分叉节点拓扑序为150,则找到50号拓扑序节点时停止,若其分叉能汇聚于同一共同祖先,则能于100拓扑序内形成菱形结构)的局部结构,如图19中(b)部分所示,节点C、D被化简为虚拟节点Q(由于融合节点BI不是仅被节点C、D连接,所以不能化简到一个虚拟节点中),类似地,节点J、K被化简为虚拟节点R,节点G、F被化简为虚拟节点S,节点M、N被化简为虚拟节点T,由此可得到新的分支(Q→E→S)和(R→L→T);步骤2、之后,深度学习框架再分别从新的各分支取一个节点(如第一个节点),进行成环判定,不成环的分支成为可并行分支组的一个子分支。如图19中(b)部分所示,子分支(Q→E→S)和(R→L→T)均不成环,因此,这两个子分支构成一个可并行分支组;步骤3、在新的分支中,虚拟节点如图19中(b)部分所示的节点Q,其所代表的真实节点C、D本身无法并行,但与虚拟节点R所代表的真实节点J、K之间是可以并行的,即节点C与J可并行执行,节点D与K也可并行执行。因此,将该并行分支组的各子分支上的虚拟节点重新还原成真实节点,并按真实节点的局部拓扑序进行排序,重新形成虚拟的分支结构,如图19中的(c)部分所示,将虚拟节点Q展开为(C→D),将虚拟节点R展开为(J→K),将虚拟节点S展开为(F→G),将虚拟节点T展开为(M→N),这些虚拟节点与原真实节点代表的计算逻辑是一样的,但其连接关系是虚拟的,仅用于进行潜在可并行分支组的并行分析。经过上述所述的分析,可找到潜在的可并行分支组,即找到子分支(C→D→E→F→G)和子分支(J→K→L→M→N)构成的可并行分支组;步骤4、重复上述步骤3直到步骤2中的所有潜在可并行分支组都处理完成。
需要说明的是,在本申请的一些实施方式中,可根据需要进行上面任意一个或多个层次的搜索,例如,在用户需要快速进行搜索而对搜索结果的完备性要求不高的情况下,可以是仅进行第一层次搜索,也可以是仅进行第二层次搜索,还可以是仅进行第三层次搜索,具体此处不做限定;又例如,在用户需要尽可能多的搜索出可并行执行的节点的情况下,那么可以进行多个层次的搜索,如,可进行上述三个层次的搜索中的至少两个层次的搜索,但需要注意的是,当深度学习框架执行的是至少两个层次的搜索时,对层次搜索的执行顺序不做限定,以深度学习框架执行的是上述三个层次的搜索为例,不限定上述三个层次的搜索的顺序,可以是先执行第一层次搜索,再执行第二层次搜索,最后执行第三层次搜索,也可以是先执行第二层次搜索,再执行第一层次搜索,最后执行第三层次搜索,或者是任意顺序进行层次搜索,此处不再举例示意。
还需要说明的是,在本申请的一些实施方式中,为了搜索出尽可能多的可并行执行的节点,上述的多层次搜索过程可以是迭代进行的,迭代的终止条件可以是直至找不到新的可并行分支组为止,也可以是达到一定的迭代时长,具体此处对终止搜索的条件不做限定。
903、对一个或多个该可并行分支组中每个可并行分支组中分别来自不同子分支的多个节点进行融合,以基于第一计算图得到第二计算图。
经过上述步骤902,深度学习框架可得到一个或多个可并行分支组,针对每一个可并行分支组,深度学习框架可对每个可并行分支组(可称为目标可并行分支组)中分别来自不同子分支的多个节点进行融合,从而得到第二计算图。这里需要注意的是,目标可并行分支组中分别来自不同子分支的多个节点是指这多个节点中的任意两个节点均不能是来自于同一子分支,也就是在任意一个目标可并行分支组中,每个子分支都包括有一个或多个节点,来自不同子分支的节点可以组合起来作为融合节点,相同子分支上的节点由于前后连接关系则不进行融合,在目标可并行分支组上,可基于搜索算法得到融合节点。
为便于理解,下面举例进行示意:假设深度学习框架得到的一个可并行分支组如图20所示,该可并行分支组一共包括3个子分支,其中第一个子分支包括4个节点,其余两个子分支各自包括3个节点,由图20可知,节点A、C、F分别来自于不同的3个子分支,可对这3个节点进行融合,得到一个融合节点;又例如,节点B、D也分别来自于不同的2个子分支,也可对这2个节点进行融合,得到一个融合节点,具体此处不再举例示意。由图20的示例可知,对目标可并行分支组中分别来自不同子分支的多个节点进行融合时,可以融合得到一个或多个融合节点,这主要取决于融合的方式。
需要说明的是,在本申请的一些实施方式中,基于上述步骤902得到的一个或多个可并行分支组中的某些子分支上可能依然保留有一些不可融合的节点或不适合并行执行的节点,这类节点可统称为不可融合节点,这些节点之所以在提取可并行分支组时被保留下来,是为了让分支保留有完整性,使得分支在搜索过程中不被个别节点打断,一旦打断,将导致潜在的可并行执行的节点被遗漏。因此,深度学习框架得到一个或多个可并行分支组之后,还可以从每个可并行分支组(即目标可并行分支组)中的每个子分支中剔除掉不可融合节点,从而得到剔除了不可融合节点的各个可并行分支组(可称为第三可并行分支组),而目标可并行分支组中的任意一个子分支可称为目标子分支。之后,将剔除了不可融合节点的各个可并行分支组(即第三可并行分支组)中分别来自不同子分支的多个节点进行融合,得到融合节点,各个融合节点与第一计算图中的未融合节点就构成所述第二计算图。
还需要说明的是,在本申请的一些实施方式中,不可融合节点具体可以是具有排他性运算操作的节点,例如,使用的编译套件不提供某些节点融合后的kernel编译方案;也可以是属于特定运算操作的节点,例如,神经网络的矩阵乘操作、卷积操作等特定运算操作,其本身计算密集度高,大多数情况下在device上执行时都会尽可能地将device上的算力资源使用完全,所以这类节点也不适合并行执行,其中一种例外的情况是数据本身特别小时,可视为适合并行执行。
为便于理解上述得到第三可并行分支组的过程,下面举例进行示意:假设深度学习框架得到的一个目标可并行分支组如图21中的(a)子示意图所示,在对节点进行融合之前,需要对该目标可并行分支组的各个子分支进行“清洗”,剔除掉不可融合节点,假设节点H、I、J、K被确定为不可融合节点,如图21中的(b)子示意图所示,那么就需过滤掉该节点H、I、J、K,过滤掉的这些节点在后续就作为普通节点进行编译,而各个子分支留下来的节点才具备进行融合的资格,例如,图21中的(c)子示意图所示,留下来的节点A、B、C、D、F、G为可融合节点,这些可融合节点在后续过程中参与到具体的融合过程中。
还需要说明的是,在本申请的一些实施方式中,深度学习框架在剔除了不可融合节点得到各个第三可并行分支组后,对第三可并行分支组中分别来自不同子分支的多个节点进行融合,得到融合节点的过程可以基于组合搜索(也可称为并行融合搜索)的方式得到,该组合搜索的原则是:在任意一个第三可并行分支组中,每个子分支都包括有一个或多个节点,来自不同子分支的节点可以组合起来作为融合节点,相同子分支上的节点由于前后连接关系则无法进行融合,在每个第三可并行分支组上,可基于搜索算法得到融合节点。具体地,该组合搜索的流程可以由两部分完成,一部分可称为候选组合生成模型,另一部分可称为算力评估模型,其中,候选组合生成模型负责基于第三可并行分支组中的各个节点生成一系列的候选节点组合,而算力评估模型则负责对各个候选节点组合融合成一个节点后所需消耗的算力进行评估,该评估也可称为收益评估。具体的搜索过程可以是:首先,从第三可并行分支组中的子分支中各自选择(如,可以是随机选择)一个未融合节点,得到n个未融合节点,该未融合节点为未被融合的节点,n≥2,之后基于选出的n个未融合节点生成m个节点组合,m个节点组合中的每个节点组合包括至少两个未融合节点,m≥1,并通过构建的算力评估模型对该m个节点组合所需消耗的算力进行评估,得到m个评估结果,在第一评估结果满足预设条件的情况下,将与该第一评估结果对应的第一节点组合进行融合,得到第一融合节点,并将该第一节点组合中的各个未融合节点标记为已融合节点,所述第一评估结果为这m个评估结果中的一个,该第一节点组合为该m个节点组合中的一个。
需要注意的是,本申请实施例构建算力评估模型的原因在于:由于各个节点对应的计算任务对device上算力资源的使用率不同,会使得device的算力资源在整个执行过程中的使用率上下抖动。因此,device在执行资源使用率小的计算任务时,device的部分算力资源将会处于空闲状态。如图22所示,图22为本申请实施例提供的3个不同节点各自编译成kernel后在device上的算力使用情况,假设加速硬件上的某device上包括3个计算单元,如图22中的3个灰色方格所示,kernelA、kernelB、kernelC为3个节点各自被编译后得到的算子核,算子核可直接由device的计算单元执行,kernelA、kernelB、kernelC所在的框的大小表示各自所消耗的算力,由图22可知,不同的kernel所消耗的算力资源不同,并且在依次执行各个kernel的过程中,该device的资源使用率是不稳定的,且还有部分算力资源(如,图22中的2个计算单元)处于空闲状态。因此,本申请实施例中构建算力评估模型的目的在于:从候选节点组合中选择出能提高device算力资源使用率的目标节点组合来融合,这样以融合的方式将多个可并行的节点融合成一个节点进行执行,以此提高device的平均资源使用率,进而提高整个加速硬件的整体性能。
需要说明的是,在本申请的一些实施方式中,为了尽可能将所有可融合节点进行融合,上述将第三可并行分支组中分别来自不同子分支的多个节点进行融合的过程是迭代进行的,即重复执行上述将第三可并行分支组中分别来自不同子分支的多个节点进行融合的步骤,直至该第三可并行分支组中未融合节点的数量少于2个。
为便于理解,下面分步骤对整个组合搜索的过程详细介绍:
步骤1、首先从剔除了不可融合节点的可并行分支组中选取出尚未进行过融合处理的一个第三可并行分支组。
步骤2、从该第三可并行分支组中的每个子分支上各选择(如,可随机选择)一个未融合节点,得到n个未融合节点,其中未融合节点是指未被融合的节点,n≥2。需要注意的是,在本申请的一些实施方式中,得到的未融合节点的数量也可以等于有效子分支的数量,有效子分支是指在本轮次选择节点前至少存在一个未融合节点的子分支。例如,假设当前有效子分支的数量为n′,则一共得到n′个未融合节点。
步骤3、通过候选组合生成模型基于这n个节点生成一个新的候选节点组合(即上述m个节点组合中的一个)。
步骤4、通过算力评估模型对该新的候选组合所需消耗的算力进行评估,得到新的评估结果(即上述m个评估结果中的一个)。
步骤5、判断该评估结果是否满足预设条件,或者,也可以重复执行步骤3和步骤4,直到评估结果满足预设条件,之后将满足预设条件的评估结果对应的候选节点组合选定为目标节点组合(即上述第一节点组合),对该目标节点组合进行融合,得到融合节点(即第一融合节点)。例如,假设经过步骤2得到的未融合节点有7个,分别记为节点0~6,并且经过步骤3至步骤5后得到3组目标节点组合,分别为[0,1,5]、[2,4]、[3,6],据此各自进行融合,可得到3个融合节点。
步骤6、重复执行步骤2至步骤5,直至该第三可并行分支组中未融合节点的数量少于2个。
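上述步骤1至步骤6的组合搜索流程可以概括为如下Python草图(evaluate_fn对应算力评估模型的判定,见后文式(1)~式(3);节点对象、分支结构等均沿用前文假设,且为简化起见,评估不通过时仅从组合末尾依次去掉节点,与正文中“依次减少一个节点”的穷举略有不同):

```python
import random

def combo_search(branches, evaluate_fn, max_shrink=8):
    """branches: 第三可并行分支组,每个元素为一条子分支上的未融合节点列表;
    evaluate_fn: 对候选节点组合返回是否满足预设条件的评估函数。"""
    fused_groups = []
    while True:
        pool = [b for b in branches if b]         # 仍含未融合节点的子分支
        if len(pool) < 2:                         # 未融合节点不足以再组合
            break
        picks = [random.choice(b) for b in pool]  # 每条子分支各取一个节点
        chosen = None
        for _ in range(max_shrink):
            if evaluate_fn(picks):
                chosen = list(picks)
                break
            if len(picks) <= 2:                   # 不能再减少了
                break
            picks.pop()                           # 评估不通过:减少一个节点
        if chosen is None:
            break                                 # 草图的简化终止条件
        fused_groups.append(chosen)
        for b in branches:                        # 标记为已融合:移出候选
            for n in chosen:
                if n in b:
                    b.remove(n)
    return fused_groups
```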
需要注意的是,在本申请的一些实施方式中,可以在得到当前轮次的一个评估结果之后,就判断该评估结果是否满足预设条件,再基于判断结果进行后续处理;也可以是在进行多轮次搜索候选节点组合得到多个评估结果之后(即至少两个评估结果)再判断该多个评估结果是否存在满足预设条件的评估结果,再基于判断结果进行后续处理,下面针对不同情形分别进行介绍,不同情形包括但不限定如下几种:
(1)评估结果达到加速硬件上具体执行计算任务的device的算力要求。
下面以图23为例进行说明,假设深度学习框架得到的一个未处理的第三可并行分支组如图23所示,该第三可并行分支组中一共包括3个子分支,每个子分支中各自包括2个可融合节点(不可融合节点已被剔除),首先从这3个子分支中各自选出一个节点,例如,假设在当前轮次选择出节点A、C、F,那么节点A、C、F就组成一个候选节点组合A+C+F,再通过构建的算力评估模型对该候选节点组合A+C+F所需消耗的算力进行评估,得到一个评估结果,在这种情况下,评估结果用于表征节点组合所需耗费的算力资源。之后深度学习框架判断该当前轮次得到的评估结果是否达到加速硬件上具体执行计算任务的device的算力要求。
若假设该评估结果达到加速硬件上device的算力要求,说明该候选节点组合A+C+F为一个目标节点组合,那么就将该候选节点组合A+C+F进行融合,得到一个融合节点,由于该融合节点是由候选节点组合A+C+F融合得到的,那么在原来的3个子分支中需要对这3个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点A、C、F。
若假设该评估结果没有达到加速硬件上device的算力要求,则说明节点A、C、F不适合融合,此时可依次减少一个节点(如,可随机减少一个节点),得到新的候选节点组合,直至找到适合融合的候选节点组合为止,例如,先减少节点A,那么就得到新的候选节点组合C+F,再通过构建的算力评估模型对该候选节点组合C+F所需消耗的算力进行评估,得到一个评估结果,之后深度学习框架再按照类似的方式判断该评估结果是否达到加速硬件上device的算力要求。假设依然没有达到,则保留节点A,再尝试减少节点C,此时就得到新的候选节点组合A+F,之后再依据上述类似的过程继续判断候选节点组合A+F对应的评估结果是否达到加速硬件上device的算力要求,假设达到了对应的算力要求,那么该候选节点组合A+F为一个目标节点组合,将该候选节点组合A+F进行融合,得到一个融合节点,并且同样需要在原来的3个子分支中对这2个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点A、F,此时只遗留下一个节点C,那么节点C作为未融合节点继续参与到下一个轮次的节点搜索过程中,以上过程就为一个轮次的搜索过程。
后续继续按照上述搜索过程进行组合搜索,直至节点中少于2个节点未被融合。例如,假设图23中经过几个轮次的搜索,一共得到3个目标节点组合,分别为A+C、B+F和D+G,节点中所有的节点都被融合。
这里需要注意的是,在当前轮次的搜索过程中,假设依次减少一个节点后依然没有可融合的候选节点组合,那么就再依次减少2个节点,但前提是保证减少后留下的节点数量大于2个。为便于理解,下面分步骤对该搜索过程进行详细介绍:
步骤1、假设某第三可并行分支组中的子分支数量为n,第一轮次提取出的节点数量为n,那么可对这n个节点进行标记,得到1~n号节点。
步骤2、取节点p~n生成候选节点组合(p的起始值为1),将生成的候选节点组合提供给算力评估模型进行算力评估,得到评估结果,若评估结果未达到加速硬件上device的算力要求,则依次减少节点个数,直至找到适合融合的候选节点组合,记找到的适合融合的候选节点组合包括的节点数量为k,那么得到的该适合融合的候选节点组合为[p,…,(p+k-1)],并更新p=p+k。
步骤3、重复执行步骤2,直至这n个节点中少于2个节点未被融合。
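上述步骤1至步骤3可以进一步写成如下确定性的Python草图(fits_device为式(1)的判定函数,见后文;当某一起点找不到可融合组合时,草图简单地跳过该节点,这属于本文为保证循环终止而做的假设性处理):

```python
def shrink_search(nodes, fits_device):
    """对标记为1~n号的节点:从p号起取[p, p+k)生成候选组合,
    评估未达标则依次减少节点个数k,找到可融合组合后更新 p = p + k。"""
    p, n, groups = 0, len(nodes), []
    while n - p >= 2:
        k = n - p                                 # 先尝试取p~n的全部节点
        while k >= 2 and not fits_device(nodes[p:p + k]):
            k -= 1                                # 依次减少节点个数
        if k < 2:
            p += 1                                # 找不到可融合组合,跳过该起点
            continue
        groups.append(nodes[p:p + k])
        p += k
    return groups
```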
综上所述,在本申请实施例中,算力评估模型的构建原则是只需考虑候选节点组合所需消耗的算力资源是否达到加速硬件上具体执行计算任务的device的算力要求,因此,该算力评估模型的计算公式可如下式(1)所示:

$$\text{gain}=\sum_{i=1}^{k} cal_i \leq q\cdot D_{all} \tag{1}$$
其中,gain为当前选出的候选节点组合对应的评估结果,$D_{all}$为device上总的算力资源,k为当前轮次选出的组成该候选节点组合的节点数量,i为k个节点数量中的第i个节点,$cal_i$为节点i在执行时所需消耗的算力资源大小,$cal_i$可以根据对应节点i的大小、数据类型等进行分析得到,在本申请实施例中,算力评估不考虑其他节点的影响,仅与节点自身特性相关,q为预设系数(如,q的取值可以取0.7~1.5之间的任意数值),具体根据用户需求自行设定,此处不做限定。当候选节点组合对应的评估结果满足上述公式(1),即说明该评估结果达到加速硬件上具体执行计算任务的device的算力要求。
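式(1)的判定可以实现为如下Python草图(d_all、q以及节点的cal属性均为本文假设的参数/属性名;返回的闭包可直接作为前文combo_search、shrink_search中的评估函数使用):

```python
def make_fits_device(d_all, q=1.0):
    """构造式(1)的判定函数:候选组合所需算力之和不超过 q*D_all 即达标。"""
    def fits_device(combo):
        return sum(node.cal for node in combo) <= q * d_all
    return fits_device
```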
(2)评估结果在得到的m个评估结果中表现最优。
依然以图23为例进行说明,假设深度学习框架得到的一个未处理的第三可并行分支组如图23所示,该第三可并行分支组中一共包括3个子分支,每个子分支中各自包括2个可融合节点(不可融合节点已被剔除),首先从这3个子分支中各自选出一个节点,例如,假设在当前轮次选择出节点A、C、F,那么基于该节点A、C、F一共可组成4个候选节点组合,分别为候选节点组合A+C+F、A+C、A+F、C+F,之后通过构建的算力评估模型对该候选节点组合A+C+F、A+C、A+F、C+F所需消耗的算力各自进行评估,得到4个评估结果(即m=4),在这种情况下,评估结果用于表征节点组合所节省的算力资源,节省的算力资源越多,则表明评估结果越优,收益越大,最后,再从这4个评估结果中选择表现最优的那个评估结果对应的候选节点组合进行融合,得到融合节点。例如,若假设候选节点组合A+C+F对应的评估结果是最优的,那么就将该候选节点组合A+C+F进行融合,得到一个融合节点,由于该融合节点是由候选节点组合A+C+F融合得到的,因此需要在原来的3个子分支中对这3个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点A、C、F。若假设候选节点组合C+F对应的评估结果是最优的,那么就将该候选节点组合C+F进行融合,得到一个融合节点,并且同样需要在原来的3个子分支中对这2个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点C、F,此时只遗留下一个节点A,那么节点A作为未融合节点继续参与到下一个轮次的节点搜索过程中,以上过程就为一个轮次的搜索过程。
后续继续按照上述搜索过程进行组合搜索,直至节点中少于2个节点未被融合,过程与上述方式(1)类似,此处不予赘述。
但需要注意的是,在本申请实施例中,算力评估模型的构建原则与上述方式(1)略有不同,本申请实施例算力评估模型的构建原则是从多个候选节点组合中选取评估结果最优的那个,因此,该算力评估模型的计算公式可如下式(2)所示:
$$\text{gain}=\sum_{i=1}^{k} cal_i-\max_{k}(cal_i) \tag{2}$$

其中,gain为当前选出的候选节点组合对应的评估结果,例如,若有m个候选节点组合,那么根据该公式(2)对应可得到m个评估结果gain,k为当前选出的组成该候选节点组合的节点数量,k≤n,n为当前轮次选出的总节点数量,i为k个节点数量中的第i个节点,$cal_i$为节点i在执行时所需消耗的算力资源大小,$\max_k(cal_i)$为当前选出的候选节点组合中算力资源消耗最大的节点所需消耗的算力资源。
基于上述公式(2),可得到m个评估结果gain,之后,再比较各个gain之间的大小,gain的取值越大,则评估结果越优,因此,从得到的m个评估结果gain中选取gain的取值最大的那个评估结果对应的候选节点组合进行融合,得到融合节点,而剩下的未融合节点则进入下一个轮次的搜索过程,过程与上述类似,此处不予赘述。并且,由上述公式(2)可知,融合节点是优先将所需消耗算力资源相近的节点进行融合,这种情况下获得的收益最好。
需要说明的是,从上述算力评估模型的公式(2)可看出,评估结果的收益在某种程度上来源于总值与最大值间的差额,所以当最大值在总值中占比较大时,其收益相对较小,因此,在本申请的一些实施方式中,算力评估模型的计算公式也可以用比值来表示,具体可如下式(3)所示:
$$\text{gain}=\frac{\max_{k}(cal_i)}{\sum_{i=1}^{k} cal_i} \tag{3}$$

其中,公式(3)中的各个参数与上述公式(2)是一样的含义,此处不予赘述。基于上述公式(3),可得到m个评估结果gain,之后,再比较各个gain之间的大小,gain的取值越小,则评估结果越优,因此,从得到的m个评估结果gain中选取gain的取值最小的那个评估结果对应的候选节点组合进行融合,得到融合节点,剩下的未融合节点则进入下一个轮次的搜索过程,过程与上述类似,此处不予赘述。
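式(2)与式(3)对应的收益计算可以示意如下(Python草图,节点的cal属性为本文假设;式(2)取值越大越优,式(3)取值越小越优):

```python
def gain_diff(combo):
    """式(2):组合总算力与最大单节点算力之差,差额越大并行收益越高。"""
    cals = [node.cal for node in combo]
    return sum(cals) - max(cals)

def gain_ratio(combo):
    """式(3):最大单节点算力占组合总算力的比值,占比越小并行收益越高。"""
    cals = [node.cal for node in combo]
    return max(cals) / sum(cals)
```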
需要说明的是,在本申请的一些实施方式中,一种情形还可以是基于第三可并行分支组上的所有节点组成候选节点组合,并根据构建的算力评估模型得到所有候选节点组合各自对应的评估结果,然后基于评估结果依次选择最适合融合的一个或多个候选节点组合进行融合,得到一个或多个融合节点。为便于理解,下面举例进行示意:依然参阅图23,假设深度学习框架得到的一个未处理的第三可并行分支组如图23所示,该第三可并行分支组中一共包括3个子分支,每个子分支中各自包括2个可融合节点(不可融合节点已被剔除),因此一共包括6个节点,分别为节点A、B、C、D、F、G,根据候选节点组合中的任意两个节点不能来自于同一子分支的原则,这6个节点一共可以组合得到20个候选节点组合,分别为A+C+F、A+C+G、A+D+F、A+D+G、B+C+F、B+C+G、B+D+F、B+D+G、A+C、A+F、A+D、A+G、C+F、C+G、B+C、B+F、B+D、B+G、D+F、D+G,之后通过构建的算力评估模型对该20个候选节点组合所需消耗的算力各自进行评估,得到20个评估结果(即m=20),最后,依次从这20个评估结果中选择表现最优的那个评估结果对应的候选节点组合进行融合,得到融合节点。例如,假设这20个候选评估结果中,A+C的评估结果最优,那么首先将A+C进行融合,得到一个融合节点,并将融合的节点A、C标识为已融合节点,据此从剩下的19个候选节点组合中排除掉包含有节点A、C的候选节点组合,排除后还剩下7个候选节点组合,分别为B+D+F、B+D+G、B+F、B+D、B+G、D+F、D+G,之后再依次从这7个候选节点组合对应的7个评估结果中选择表现最优的那个评估结果对应的候选节点组合进行融合,得到第二个融合节点。例如,假设这7个候选评估结果中,B+F的评估结果最优,那么就将B+F进行融合,得到一个融合节点,并将融合的节点B、F标识为已融合节点,经过上述两个轮次,最后还剩下候选节点组合D+G,那么就将该剩下的候选节点组合D+G进行融合,得到第三个融合节点。在本申请实施例中,构建的算力评估模型与上述公式(2)和公式(3)类似,此处不予赘述。
(3)评估结果在x个评估结果中最优,x个评估结果是所有m个评估结果中的至少两个评估结果。
在这种情况下,评估结果依然用于表征节点组合所节省的算力资源,节省的算力资源越多,则表明评估结果越优,收益越大,依然以图23为例进行说明,假设深度学习框架得到的一个未处理的第三可并行分支组如图23所示,该第三可并行分支组中一共包括3个子分支,每个子分支中各自包括2个可融合节点(不可融合节点已被剔除),首先从这3个子分支中各自选出一个节点,例如,假设在当前轮次选择出节点A、C、F,如果是按照上述方式(2)的组合方式,那么基于该节点A、C、F一共可组成4个候选节点组合(即m=4),但在方式(3)中,只需至少得到两个候选节点组合即可,从该至少两个候选节点组合中选择评估结果最优的候选节点组合进行融合,这种组合方式是一种局部选择最优的过程,在保证较优的同时减小计算量。例如,可基于节点A、C、F组合(如,可随机组合)得到2个候选节点组合(即x=2),分别组成候选节点组合A+C+F、A+C,之后通过构建的算力评估模型对该候选节点组合A+C+F、A+C所需消耗的算力各自进行评估,得到2个评估结果,最后,再从这2个评估结果中选择表现最优的那个评估结果对应的候选节点组合进行融合,得到融合节点。例如,若假设候选节点组合A+C+F对应的评估结果是最优的,那么就将该候选节点组合A+C+F进行融合,得到一个融合节点,由于该融合节点是由候选节点组合A+C+F融合得到的,因此需要在原来的3个子分支中对这3个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点A、C、F。若假设候选节点组合A+C对应的评估结果是最优的,那么就将该候选节点组合A+C进行融合,得到一个融合节点,并且同样需要在原来的3个子分支中对这2个已融合的节点标识为已融合节点,使得在后续的节点搜索过程中避开节点A、C,此时只遗留下一个节点F,那么节点F作为未融合节点继续参与到下一个轮次的节点搜索过程中,以上过程就为一个轮次的搜索过程。
后续继续按照上述搜索过程进行组合搜索,直至节点中少于2个节点未被融合,过程与上述方式(2)类似,此处不予赘述。
需要注意的是,在该方式(3)中,构建的算力评估模型与上述方式(2)中的类似,可参阅上述公式(2)和公式(3),具体此处不予赘述。
经过上述步骤901至步骤903,就可得到一张具有融合节点的计算图,该计算图可称为第二计算图,深度学习框架在得到该第二计算图之后,还可以进一步对该第二计算图进行编译,以得到编译后的第二计算图,该编译后的第二计算图就可直接加载至加速硬件去执行。
需要说明的是,在本申请的一些实施方式中,对第二计算图的编译过程包括两部分,如图8所示,一部分是对普通节点编译,即对未融合的节点(包括不可融合节点以及未被融合的节点)进行的编译,这部分编译过程与现有方式类似,具体可参阅图4对应的实施方式,此处不予赘述;而另一部分是对融合节点编译,即对融合节点进行的编译,得到与 融合节点对应的kernel。
需要说明的是,在本申请的一些实施方式中,为了保留融合节点前的各节点间的并行性,本申请在对融合节点编译生成kernel时,通过分段合并的方式完成kernel的生成,并在其中通过代码分支方式让各代码段间保持明确的独立性。分段合并将IR生成分为两阶段:1)对各个节点进行独立调度,得到第一阶段的与节点数相等数量的子IR;2)然后对这些子IR进行融合和修正,得到第二阶段的总IR。具体地,假设某融合节点是由p个节点融合得到,那么对该融合节点进行编译得到对应的kernel的方式具体可以是:首先,对该p个节点各自进行独立调度,得到p个子IR,再对该p个子IR进行融合,得到一个总IR,最后对该总IR进行编译,得到与该融合节点对应的kernel。
为便于理解,下面举例进行示意:请参阅图24,图24为本申请实施例提供的普通节点编译与融合节点编译的一个对比示意图,假设节点A、B、C均为可融合节点,若不对节点A、B、C进行融合,那么编译器(也可称为编码器)对节点A、B、C进行普通节点编译,即如图24中的左边部分所示,编译器对节点A、B、C各自进行调度,得到这3个节点各自对应的中间表示IR1、IR2、IR3,再进一步对IR1、IR2、IR3进行编译,得到与节点A、B、C各自对应的kernel1、kernel2、kernel3,得到的这3个kernel就可直接依次加载至对应的加速硬件内的device上,由device上的计算单元(如,SM、core等)依次执行对应的kernel。若通过本申请提供的计算图的节点融合方法将节点A、B、C进行融合,那么编译器对节点A、B、C进行融合节点编译,即如图24中的右边部分所示,编译器对节点A、B、C各自进行调度,得到这3个节点各自对应的中间表示IR1、IR2、IR3,接着再对中间表示IR1、IR2、IR3进行融合,得到总中间表示IR(A+B+C),最后对该IR(A+B+C)进行编译,得到一个算子核kernel(A+B+C)。
从上述可知,分段合并的优点在于:融合的各个节点并未进行统一调度,而是采取先各自分析、后融合的方式。这种方式可以保有各节点对应的kernel的相对独立性,形成独立的代码段,直观的表达了其可并行性。而在此基础上,在对总IR进行编译时进行代码分支处理,将原来各节点的计算逻辑进行计算资源隔离(即融合前的每个节点各自对应一个或多个特定的计算单元),在kernel中让代码段的并行性更为明确。
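分段合并的两个阶段可以概括为如下Python草图(schedule、node.blocks等均为本文假设的占位接口,真实流程中由AKG等编译器完成调度并生成IR数据结构,这里用字符串代替):

```python
def schedule(node):
    """阶段一:对单个节点独立调度,返回(子IR, 该节点所需的block数)。"""
    return f"IR<{node.name}>", node.blocks

def fuse_and_lower(nodes):
    """阶段二:融合各子IR——为每段代码划定互不重叠的block区间,
    并用资源号上的条件分支把各代码段隔离,保持段间的并行性。"""
    total_ir, offset = [], 0
    for node in nodes:
        sub_ir, blocks = schedule(node)
        lo, hi = offset, offset + blocks
        # 修正资源索引:本段仅在 lo <= blockIdx.x < hi 时执行,
        # 段内block索引整体平移-lo(对应图26中(blockIdx.x-4)的修正)
        total_ir.append(f"if {lo} <= blockIdx.x < {hi}: run {sub_ir} with blockIdx.x-{lo}")
        offset = hi
    return "\n".join(total_ir)  # 占位:真实流程会把总IR继续编译为kernel
```

例如,对各占4个block的Sqrt与Divide两个节点调用fuse_and_lower,将得到blockIdx.x∈[0,4)执行Sqrt、blockIdx.x∈[4,8)执行Divide的总IR,共需8个block,与后文图26的示例一致。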
下面以一个具体的代码实例对上述分段合并的过程进行示意:如图25所示,其中C为分支条件,A为Cast节点代码段,B为Sub节点代码段。C设置了资源条件blockIdx.x<1,当资源号(即标识信息)blockIdx.x满足条件时,将执行代码段A中的Cast计算;当资源号不满足条件时,则执行代码段B中的Sub计算。因此,配合加速硬件(如,GPU)并行计算的硬件特性,可以让代码段A、B并行执行起来。需要注意的是,图25示例为该kernel的CUDA代码。其中,input_2、input_0为该代码的输入数据地址,subtract_2、cast_0为该代码的输出数据地址。CUDA中可以通过blockIdx.x和threadIdx.x来区分不同的计算资源,并通过计算得到数据的索引,如代码段B中,通过(((int)blockIdx.x-1)*1024)+((int)threadIdx.x)来确定当前的减法操作需要使用来自input_2的哪个数据。
需要说明的是,在深度学习框架的实际应用中,可以是通过深度学习框架中的编译器(如,AKG、LLVM等)对各个融合节点进行调度分析,编译器通过硬件指令发射得到相应的执行代码。下面以编译器AKG为例,对融合节点的整个编译过程进行示意:在本申请实施例中,融合节点的编译利用AKG来进行,并在AKG的编译流程中加上分段合并过程,即将原AKG对节点的调度分析转换为对各个融合节点的分段合并处理。具体地,在硬件指令发射前,进行分段合并的第一阶段,第一阶段为AKG对融合前的各节点进行调度分析,得到各节点调度后对算力资源的使用量信息(即各节点实际执行时消耗的算力资源,如在GPU上为Block size)以及对应的子IR,如图26所示,图26中的(a)子示意图和(b)子示意图分别为融合前Sqrt节点与Divide节点各自对应的子IR;在分段合并第二阶段,则通过修正各子IR间的资源信息,如图26中子IR中灰色底部分所示,其中Sqrt节点由于使用的是前0~3号block,其block id不需要更正,而Divide节点使用的是4~7号block,所以相关索引号需要更正为(blockIdx.x-4)。修正后,通过添加代码分支,将各部分子IR整合起来,整合代码如图26中的(c)子示意图灰色底部分所示,即为融合节点的总IR。得到融合节点的总IR后,AKG继续执行剩余编译过程,即可生成最终的融合节点(Sqrt+Divide)对应的kernel的代码。
需要注意的是,在图26对应的代码中,produce T_sqrt{…}代表T_sqrt由花括号所使用,//attr[iter_var(blockIdx.x,range(min=0,ext=8),blockIdx.x)]thread_extent=8表示执行当前IR时,blockIdx.x取值范围为0~8,//attr[iter_var(threadIdx.x,range(min=0,ext=256),threadIdx.x)]thread_extent=256则表示执行当前IR时,threadIdx.x的取值范围为0~256。则图26中各个子示意图表示的是:1、(a)子示意图所表达的是对input_2执行sqrt操作,结果写入T_sqrt中,且执行时共需要分配4单位的block,每个block需要划分为256单位的thread;2、(b)子示意图所表达的是执行input_0/input_1操作,结果写入T_divide_input_0_input_1中,执行时共需要分配4单位block,每个block需要划分为256单位的thread;3、(c)子示意图则代表合并(a)的子IR和(b)的子IR后,共需要分配8单位block,每个block需要划分为256单位的thread。其中当blockIdx.x<4且threadIdx.x<256时,执行sqrt(input_2)操作,结果写入T_sqrt;当blockIdx.x≥4且threadIdx.x<256时,执行input_0/input_1操作,结果写入T_divide_input_0_input_1。
综上所述,本申请实施例提供的计算图的节点融合方法可概括为如图27所示的流程示意图,可包括3个主要步骤,可分别概括为:
①网络结构的分支提取
分析计算图的网络结构,得到网络结构中各个节点之间的连接关系,例如,是否具有共同父节点、共同子节点、没有父节点、没有子节点等,并以此为依据在网络结构中找到一个或多个可并行分支组,每个可并行分支组中存在一个或多个子分支,各子分支之间不存在连接关系,且子分支中分别来自不同子分支的任意两个或两个以上节点融合后的计算图不存在环结构,分支提取完毕后,可得到若干个可并行分支组。在提取过程中,只基于计算图中节点之间的连接关系搜索可并行分支组,搜索过程不会被个别不可融合节点所打断,提高了神经网络中可融合节点的搜索命中率。
②节点组合搜索
在可并行分支组中,每条子分支中包含一个或多个节点,不同子分支上的节点可以组合起来作为融合节点,相同子分支上的节点则由于前后连接关系而无法进行融合。在可并行分支组上,根据上述步骤902对应实施例阐述的组合搜索的方式,可以得到一个或多个融合节点。在一些实施方式中,还可以进一步通过构建的算力评估模型指导搜索融合节点,可保证被融合的节点组合可以带来符合预期的收益。
③融合节点编译
与普通节点不同,为了让并行融合落实到执行中,需要根据融合前各个节点的信息,对整个编译过程进行并行化处理,从而让由融合节点编译而成的kernel带有并行特性,并作为融合前各节点的整合被device执行。具体采用的是上述所述的分段合并方式,即先独立对各个节点进行调度分析,得到各个节点的代码分支,并基于各代码分支得到融合节点,分段合并的好处在于:独立对各个节点进行调度分析提高了融合的灵活度,并保证融合节点代码的正确生成,且生成的kernel间依旧保有并行性,允许对应的加速硬件进行更深层次的并行优化。
为了对本申请实施例所带来的有益效果有更为直观的认识,以下对本申请实施例所带来的技术效果作进一步的对比,如图28所示,图28为本申请实施例提供的节点融合前后的kernel在device上执行时的收益,图28中的横坐标表示执行的时间t,假设计算图中有3个可融合节点,若作为普通节点编译,则分别编译为kernelA、kernelB、kernelC,依照原来的拓扑排序方式,kernelA、kernelB、kernelC需要依次在device上执行,如图28中横坐标上方所示,$cost_a$、$cost_b$、$cost_c$表示kernelA、kernelB、kernelC依次在该device上执行时所花费的时间;若将这3个可融合节点按照本申请实施例所述的节点融合方法进行融合,得到一个融合节点,并将该融合节点编译为kernel(A+B+C),$cost(a|b|c)$表示kernel(A+B+C)在device上执行时所花费的时间,假设横坐标上的小圈圈表示计算任务的执行起始时刻,横坐标上的小方块表示计算任务的执行结束时刻,则由图28可知,节点融合后device执行的时间相对节点融合前device执行时间大大缩短了。
综上所述,本申请实施例提供的计算图的节点融合方法的收益主要来自于两部分:1)kernel被加载到device上的次数减少,kernel加载次数的减少量等于节点被融合后计算图上节点减少的总个数。例如,如图28所示,原来有3个节点,编译成3个kernel后需要各自加载3次,而3个节点融合成一个融合节点后,则可编译成1个kernel,只需要加载1次;2)融合节点在device上执行时的并行收益,例如,如图28所示,假设device上的算力资源为3个计算单元,kernelA、kernelB、kernelC分别执行时都只用了1个计算单元的算力资源,因此算力资源的总量可以允许A、B、C进行融合(在更复杂详细的分析下,即使资源总量不足,A、B、C并行起来也可能会优于分别执行),同时,由于在融合的kernel(A+B+C)中A、B、C仍保有并行性,则其执行消耗约等于三者间最大耗时,因此该融合节点的收益为如下式(4)所示:
$$\text{gain}=cost(a|b|c)\approx\max(cost_a,cost_b,cost_c) \tag{4}$$
需要说明的是,本申请实施例提供的计算图的节点融合方法在GPU V100环境中,运行batch size为32、使用Lamb优化器的Bert-Base,可以搜索出共1425组融合节点。性能上与不开启并行融合相比,一个迭代收益约23.87ms,优化比例约为6%。
综上所述,在本申请上述实施方式中,通过对计算图中可融合节点的融合,将各个散落在计算图中的普通节点整合成融合节点(可看作是将小节点整合成大节点),提高了加速硬件运行时device资源的使用率,同时减少了加载kernel次数,提升了网络的整体执行性能。
在上述对应实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的深度学习框架。具体参阅图29,图29为本申请实施例提供的深度学习框架的一种结构示意图,该深度学习框架2900包括:转换模块2901、搜索模块2902、融合模块2903,其中,转换模块2901,用于将神经网络转换成一张计算图,得到的该计算图可称为第一计算图,该第一计算图用于表征该神经网络的计算逻辑;搜索模块2902,用于基于第一计算图中节点之间的连接关系从第一计算图中提取一个或多个可并行分支组,该可并行分支组指示该可并行分支组的多个子分支支持被并行执行,该可并行分支组中包括的第一可并行分支组满足以下条件中的至少一种:该第一可并行分支组内的所有子分支的输入来自于同一节点且该第一可并行分支组内的至少两个子分支的输出指向不同的节点、该第一可并行分支组内的所有子分支的输出指向同一节点且该第一可并行分支组内的至少两个子分支的输入来自不同的节点、该第一可并行分支组内的所有子分支的第一个节点没有父节点、该第一可并行分支组内的所有子分支的最后一个节点没有子节点;融合模块2903,用于针对每一个可并行分支组,对每个可并行分支组中分别来自不同子分支的多个节点进行融合,从而得到第二计算图,也就是该多个节点中的每个节点所属的子分支都与该多个节点中的其他任何一个节点所属的子分支不同。
在本申请上述实施方式中,相较于现有技术,考虑了其他一些可并行的分支的可能性,从而找到不同于现有技术的规则限定的可以并行的分支的组合,也就可以在计算图的节点融合中,融合这些分支的组合中的节点,从而扩展了能够获取的可融合节点的范围。
在一种可能的设计中,在可并行分支组为多个的情况下,该可并行分支组还包括至少一个第二可并行分支组,该第二可并行分支组满足以下条件:该第二可并行分支组内的所有子分支的输入来自于同一节点且该第二可并行分支组内的至少两个子分支的输出指向同一节点,该第二可并行分支组内的每个子分支包括至少两个节点。
在本申请上述实施方式中,基于第一计算图中各节点之间的连接关系得到可并行分支组的方式可以是如上述得到第二可并行分支组的方式,具备广泛适用性。
在一种可能的设计中,融合模块2903,还用于:从每个可并行分支组(即目标可并行分支组)中的每个子分支中剔除掉不可融合节点,从而得到剔除了不可融合节点的各个可并行分支组(可称为第三可并行分支组),而目标可并行分支组中的任意一个子分支可称为目标子分支。之后,将剔除了不可融合节点的各可并行分支组(即第三可并行分支组)中分别来自不同子分支的多个节点(指的是可以进行融合的节点,但还未融合的节点,即待融合的节点)进行融合,得到融合节点,该第二计算图中包括融合节点与第一计算图中的未融合节点(未融合节点是指没有被融合的节点),该第三可并行分支组中的多个节点中的每个节点所属的子分支都与该多个节点中的其他任何一个节点所属的子分支不同。
在本申请上述实施方式中,阐述了得到第二计算图的具体方式,即先从第一计算图中剔除掉不可融合节点,再对剩余的可融合节点进行融合,从而得到第二计算图,由于从第一计算图中事先剔除了不可融合节点,因此可提高融合效率。
在一种可能的设计中,为了尽可能将所有可融合节点进行融合,上述将第三可并行分支组中分别来自不同子分支的多个节点进行融合的过程是迭代进行的,也就是本申请实施例所述的深度学习框架2900还可以包括迭代模块2904,迭代模块2904用于触发融合模块2903重复执行将第三可并行分支组中的多个节点进行融合,以得到融合节点的步骤,直至该第三可并行分支组中未融合节点的数量少于2个。
在本申请上述实施方式中,迭代模块2904保证了能将可融合节点尽可能多的被融合为融合节点,提高了融合覆盖面。
在一种可能的设计中,融合模块2903,具体用于:从第三可并行分支组中的多个子分支中各自选择(如,可以是随机选择)一个未融合节点,得到n个未融合节点,该未融合节点为未被融合的节点,n≥2,之后基于选出的n个未融合节点生成m个节点组合,m个节点组合中的每个节点组合包括至少两个未融合节点,m≥1且2m≤n,并通过构建的算力评估模型对该m个节点组合各自所需的算力进行评估,以得到m个评估结果,该m个评估结果中的每个评估结果用于表征如下情形中的一种:该m个节点组合中每个节点组合所需耗费的算力资源、该m个节点组合中每个节点组合所节省的算力资源。在第一评估结果满足预设条件的情况下,将与该第一评估结果对应的第一节点组合进行融合,以得到第一融合节点,并将该第一节点组合中的各个未融合节点标记为已融合节点,该第一评估结果为这m个评估结果中的一个,该第一节点组合为该m个节点组合中的一个。
在本申请上述实施方式中,阐述了融合模块2903如何基于组合搜索的方式对节点进行融合,得到融合节点,并可以进一步通过构建的算力评估模型指导搜索融合节点,可保证被融合的节点组合可以带来符合预期的收益。
在一种可能的设计中,第一评估结果满足预设条件至少包括如下一种情形:在m个评估结果中的每个评估结果用于表征m个节点组合中每个节点组合所需耗费的算力资源的情况下,第一评估结果达到目标加速硬件上具体执行计算任务的模块(device)的算力要求、在m个评估结果中的每个评估结果用于表征m个节点组合中每个节点组合所节省的算力资源的情况下,第一评估结果在该m个评估结果中最优、在m个评估结果中的每个评估结果用于表征m个节点组合中每个节点组合所节省的算力资源的情况下,第一评估结果在x个评估结果中最优,该x个评估结果为该m个评估结果中的至少两个评估结果。
在本申请上述实施方式中,阐述了判断第一评估结果是否满足预设条件可以有多种具体的判断方式,用户可基于自身需求选择,具备灵活性。
在一种可能的设计中,搜索模块2902,具体用于:在第一计算图中搜索拥有同一父节点(可称为第一共同父节点或共同的第一父节点,为便于阐述,在本申请实施例中统称为第一共同父节点)的多个分支(可称为第一分支),并根据该多个第一分支得到一个可并行分支组;或,在第一计算图中搜索拥有同一子节点(可称为第一共同子节点或共同的第一子节点,为便于阐述,在本申请实施例中统称为第一共同子节点)的多个分支(可称为第二分支),并根据该多个第二分支得到一个可并行分支组。这种搜索方式也可称为多分叉结构单向搜索。
在本申请上述实施方式中,阐述了基于第一计算图中节点之间的连接关系从第一计算图中提取一个或多个可并行分支组的一种搜索方式,该搜索方式是基于是否有共同父节点或是否有共同子节点来进行的,而共同父节点或共同子节点广泛存在于神经网络转换而来的计算图中,这种搜索方式可搜索得到大量的可并行分支组。
在一种可能的设计中,搜索模块2902,具体还用于:以该第一父节点为起始点,分别向下搜索每个第一分支,直至在向下搜索过程中遇到共同的第二父节点或共同的第二子节点时停止,以得到该多个第一分支对应的可并行分支组,该可并行分支组包含该多个第一分支各自对应的第一子分支,每个该第一子分支包含的节点是在向下搜索每个该第一子分支过程中得到的节点。需要注意的是,多个第一子分支中的每个第一子分支不包括第一共同父节点和共同子节点,也就是说,每个第一子分支都不包括作为起始点的第一共同父节点,且在每个第一分支向下搜索过程中,若遇到其他共同父节点(即第二共同父节点),就将该第二共同父节点纳入到对应的第一子分支中,若遇到共同子节点,则在对应的第一子分支中排除该共同子节点。
在本申请上述实施方式中,具体阐述了深度学习框架如何根据多个第一分支得到一个可并行分支组,即从共同父节点出发进行向下搜索,这种从共同父节点出发提取可并行分支组的方法,可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于共同父节点出发向下搜索得到的子分支中不存在共同子节点的情形,从而可保证分支间节点的融合不会形成环结构。目前已有的计算图的节点融合方式中,都需要判断节点融合后是否成环,而判断是否成环是个很繁杂的操作,本申请实施例所述的搜索方式保证了不会出现环结构,因此无需额外对每次得到的组合进行成环判断,简化了操作流程。
在一种可能的设计中,搜索模块2902,具体还用于:以该第一子节点为起始点,分别向上搜索每个第二分支,直至在向上搜索过程中遇到共同的第三父节点或共同的第三子节点时停止,以得到该多个第二分支对应的可并行分支组,该可并行分支组包含该多个第二分支各自对应的第二子分支,每个该第二子分支包含的节点是在向上搜索每个该第二子分支过程中得到的节点。需要注意的是,多个第二子分支中的每个第二子分支不包括该第一共同子节点和共同父节点,也就是说,每个第二子分支都不包括作为起始点的第一共同子节点,且在每个第二分支向上搜索过程中,若遇到其他共同子节点,就将该其他共同子节点纳入到对应的第二子分支中,若遇到共同父节点,则在对应的第二子分支中排除该共同父节点。
在本申请上述实施方式中,具体阐述了深度学习框架如何根据多个第二分支得到一个可并行分支组,即从共同子节点出发进行向上搜索,这种从共同子节点出发提取可并行分支组的方法,可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于共同子节点出发向上搜索得到的子分支中不存在共同父节点的情形,从而可保证分支间节点的融合不会形成环结构。因此,本申请实施例所述的搜索方式同样保证了不会出现环结构,无需额外对每次得到的组合进行成环判断,简化了操作流程。
在一种可能的设计中,搜索模块2902,具体还用于:从第一计算图中搜索多个第三分支,并根据该多个第三分支得到一个可并行分支组,其中,每个第三分支的第一个节点没有父节点;和/或,从第一计算图中搜索多个第四分支,并根据该多个第四分支得到一个可并行分支组,其中,每个第四分支的最后一个节点没有子节点。
在本申请上述实施方式中,阐述了基于第一计算图中节点之间的连接关系从第一计算图中提取一个或多个可并行分支组的另一种搜索方式,该搜索方式是基于没有父节点或没有子节点的节点来进行的,而没有父节点或没有子节点的节点在神经网络转换而来的计算图中也广泛存在,因此,除了上述基于共同父节点或共同子节点的搜索方式外,这种搜索方式依然可搜索出大量的可并行分支组。这种搜索方式也可称为无汇聚结构单向搜索。
在一种可能的设计中,搜索模块2902,具体还用于:以每个该第三分支的第一个节点为起始点,分别向下搜索每个该第三分支,直至在向下搜索过程中遇到同一父节点或同一子节点时停止,以得到该多个第三分支对应的可并行分支组,该可并行分支组包含该多个第三分支各自对应的第三子分支,每个该第三子分支包含的节点是在向下搜索每个该第三子分支过程中得到的节点。需要注意的是,多个第三子分支中的每个第三子分支不包括向下搜索过程中遇到的共同子节点,也就是说,每个第三子分支可以包括作为起始点的节点,且在每个第三分支向下搜索过程中,若遇到共同父节点,可以将该共同父节点纳入到对应的第三子分支中,若遇到共同子节点,则在对应的第三子分支中排除该共同子节点。
在本申请上述实施方式中,具体阐述了深度学习框架如何根据多个第三分支得到一个可并行分支组,即从无父节点的节点出发进行向下搜索,这种从没有父节点的节点出发提取可并行分支组的方法,同样可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于没有父节点的节点出发向下搜索得到的子分支中不存在共同子节点的情形,从而可保证分支间节点的融合不会形成环结构。因此,本申请实施例所述的搜索方式同样可保证不会出现环结构,无需额外对每次得到的组合进行成环判断,简化了操作流程。
在一种可能的设计中,搜索模块2902,具体还用于:以每个该第四分支的最后一个节点为起始点,分别向上搜索每个该第四分支,直至在向上搜索过程中遇到同一父节点或同一子节点时停止,以得到该多个第四分支对应的可并行分支组,该可并行分支组包含该多个第四分支各自对应的第四子分支,每个该第四子分支包含的节点是在向上搜索每个该第四子分支过程中得到的节点。需要注意的是,多个第四子分支中的每个第四子分支不包括向上搜索过程中遇到的共同父节点,也就是说,每个第四子分支可以包括作为起始点的节点,且在每个第四分支向上搜索过程中,若遇到共同子节点,可以将该共同子节点纳入到对应的第四子分支中,若遇到共同父节点,则在对应的第四子分支中排除该共同父节点。
在本申请上述实施方式中,具体阐述了深度学习框架如何根据多个第四分支得到一个可并行分支组,即从无子节点的节点出发进行向上搜索,这种从没有子节点的节点出发提取可并行分支组的方法,同样可以保证属于同一可并行分支组中不同子分支的节点不存在不一致的依赖行为,例如,基于没有子节点的节点出发向上搜索得到的子分支中不存在共同父节点的情形,从而可保证分支间节点的融合不会形成环结构。因此,本申请实施例所述的搜索方式同样可保证不会出现环结构,无需额外对每次得到的组合进行成环判断,简化了操作流程。
在一种可能的设计中,搜索模块2902,具体还用于:在目标节点不属于不可融合节点的情况下,对该目标节点周围的局部结构进行化简,得到第五分支,该目标节点为该第一计算图中未被划分进任意一个可并行分支组的节点,并且,在该第五分支为多个的情况下,根据该多个第五分支得到一个可并行分支组。
在本申请上述实施方式中,阐述了基于第一计算图中节点之间的连接关系从第一计算图中提取一个或多个可并行分支组的另一种搜索方式,该搜索方式是基于散落结构来进行的,作为上述两种方式的补充搜索方式,可以最大程度地从第一计算图中搜索到可融合节点,使得搜索范围更加广泛。
在一种可能的设计中,该深度学习框架2900还包括:编译模块2905,用于对第二计算图中的融合节点进行编译,得到与该融合节点对应的算子核(kernel)。
在本申请上述实施方式中,阐述了编译模块2905对第二计算图的编译过程还包括对融合节点的编译过程,具备可实现性。
在一种可能的设计中,编译模块2905,还用于:在该融合节点由p个节点融合得到的情况下,分别调度该p个节点,以得到与该p个节点分别对应的p个子中间表示(IR);之后,对该p个子IR进行融合,得到一个总IR;最后,对该总IR进行编译,得到与该融合节点对应的算子核(kernel)。
在一种可能的设计中,该深度学习框架2900可以是多种具体表现形式的AI框架,例如,可以是mindspore、tensorflow、tensornetwork、pytorch、mxnet、caffe、theano等主流的深度学习框架,也可以是其他小众的深度学习框架,只要该深度学习框架能够对计算图进行优化、编译的处理过程,都可以认为是本申请实施例所述的深度学习框架2900,具体本申请对深度学习框架2900的表现形式不做限定。
在本申请上述实施方式中,对深度学习框架2900的几种常见的具体表现形式进行了说明,具备广泛性。
需要说明的是,图29提供的深度学习框架2900中各模块/单元之间的信息交互、执行过程等内容,与本申请中图9对应的方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例还提供了一种计算机设备,请参阅图30,图30是本申请实施例提供的计算机设备一种结构示意图,为便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该计算机设备3000上可以部署有图29对应实施例中所描述的模块,用于实现图29对应实施例中深度学习框架的功能,具体的,计算机设备3000由一个或多个服务器实现,计算机设备3000可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)3022和存储器3032,一个或一个以上存储应用程序3042或数据3044的存储介质3030(例如一个或一个以上海量存储设备)。其中,存储器3032和存储介质3030可以是短暂存储或持久存储。存储在存储介质3030的程序可以包括一个或一个以上模块(图示未标出),每个模块可以包括对计算机设备3000中的一系列指令操作。更进一步地,中央处理器3022可以设置为与存储介质3030通信,在计算机设备3000上执行存储介质3030中的一系列指令操作。
计算机设备3000还可以包括一个或一个以上电源3026,一个或一个以上有线或无线网络接口3050,一个或一个以上输入输出接口3058,和/或,一个或一个以上操作系统3041,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,中央处理器3022,用于执行图9对应实施例中的方法。例如,中央处理器3022可以用于:获取神经网络的网络结构,并将该神经网络转换成一张计算图,得到的该计算图可称为第一计算图。得到该第一计算图之后,基于该第一计算图中各个节点之间的依赖关系从第一计算图中提取到一个或多个可并行分支组,其中,每个可并行分支组都包括多个子分支(即至少2个子分支),子分支为不存在分叉的顺序串联结构,每个可并行分支组中的每个子分支包括一个或多个节点。该可并行分支组就指示该可并行分支组的多个子分支支持被并行执行。这里需要注意的是,属于同一个可并行分支组的子分支需要满足两个条件:一个是各子分支之间不存在依赖关系;二是属于不同子分支的任意两个或两个以上节点融合成一个节点后的计算图不存在环结构,即融合后的计算图中不存在环结构。在本申请实施例中,可并行分支组中包括的至少一个可并行分支组(可称为第一可并行分支组)满足以下条件中的至少一种:该第一可并行分支组内的所有子分支的输入来自于同一节点且该第一可并行分支组内的至少两个子分支的输出指向不同的节点、该第一可并行分支组内的所有子分支的输出指向同一节点且该第一可并行分支组内的至少两个子分支的输入来自不同的节点、该第一可并行分支组内的所有子分支的第一个节点没有父节点、该第一可并行分支组内的所有子分支的最后一个节点没有子节点。经过上述步骤后,可得到一个或多个可并行分支组,针对每一个可并行分支组,对每个可并行分支组(可称为第一可并行分支组)中分别来自不同子分支的多个节点进行融合,从而得到第二计算图。
需要说明的是,中央处理器3022还可以用于执行与本申请中图9对应的方法实施例中任意一个步骤,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述所示实施例描述中计算机设备所执行的步骤。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(digital video disc,DVD))、或者半导体介质(例如,固态硬盘(solid state disk,SSD))等。

Claims (36)

  1. 一种计算图的节点融合方法,应用于深度学习框架,其特征在于,包括:
    将第一神经网络转换为第一计算图;
    从所述第一计算图中提取一个或多个可并行分支组,所述可并行分支组指示属于一个可并行分支组的多个子分支支持被并行执行,所述可并行分支组包括第一可并行分支组,所述第一可并行分支组满足以下条件中的至少一种:所述第一可并行分支组内的所有子分支的输入来自于同一节点且所述第一可并行分支组内的至少两个子分支的输出指向不同的节点、所述第一可并行分支组内的所有子分支的输出指向同一节点且所述第一可并行分支组内的至少两个子分支的输入来自不同的节点、所述第一可并行分支组内的所有子分支的第一个节点没有父节点、所述第一可并行分支组内的所有子分支的最后一个节点没有子节点;
    对一个或多个所述可并行分支组中每个可并行分支组的多个节点进行融合,以基于所述第一计算图得到第二计算图,所述多个节点中的每个节点所属的子分支都与所述多个节点中的其他任何一个节点所属的子分支不同。
  2. 根据权利要求1所述的方法,其特征在于,在所述可并行分支组为多个的情况下,所述可并行分支组还包括第二可并行分支组,所述第二可并行分支组满足以下条件:
    所述第二可并行分支组内的所有子分支的输入来自于同一节点且所述第二可并行分支组内的至少两个子分支的输出指向同一节点,所述第二可并行分支组内的每个子分支包括至少两个节点。
  3. 根据权利要求1-2中任一项所述的方法,其特征在于,所述对一个或多个可并行分支组中的每个可并行分支组的多个节点进行融合,以基于所述第一计算图得到第二计算图包括:
    从所述每个可并行分支组的目标子分支中剔除不可融合节点,以得到第三可并行分支组,所述目标子分支为所述每个可并行分支组中的任意一个子分支,所述不可融合节点包括指示特定运算操作的节点,所述特定运算操作包括以下操作中的至少一种:矩阵乘操作和卷积操作;
    将所述第三可并行分支组中的多个节点进行融合,以得到融合节点,所述第二计算图中包括所述融合节点与所述第一计算图中的未融合节点,所述第三可并行分支组中的多个节点中的每个节点所属的子分支都与所述多个节点中的其他任何一个节点所属的子分支不同。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    重复执行所述将所述第三可并行分支组中的多个节点进行融合,以得到融合节点的步骤,直至所述第三可并行分支组中未融合节点的数量少于2个。
  5. 根据权利要求3-4中任一项所述的方法,其特征在于,所述将所述第三可并行分支组中的多个节点进行融合,以得到融合节点包括:
    基于所述第三可并行分支组中的n个节点,得到m个节点组合,所述n个节点分别来自构成所述第三可并行分支组的n个分支,且所述m个节点组合中的每一个节点组合都包括至少两个节点,m≥1,n≥2且2m≤n;
    通过算力评估模型对所述m个节点组合各自所需的算力进行评估,以得到m个评估结果,所述m个评估结果中的每个评估结果用于表征如下情形中的一种:所述m个节点组合中每个节点组合所需耗费的算力资源、所述m个节点组合中每个节点组合所节省的算力资源;
    在第一评估结果满足预设条件的情况下,将与所述第一评估结果对应的第一节点组合中的节点进行融合,以得到一个或多个第一融合节点,所述第一评估结果为所述m个评估结果中的一个,所述第一节点组合为所述m个节点组合中的一个。
  6. 根据权利要求5所述的方法,其特征在于,所述第一评估结果满足预设条件至少包括如下一种情形:
    在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所需耗费的算力资源的情况下,所述第一评估结果达到加速硬件上具体执行计算任务的模块(device)的算力要求;在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所节省的算力资源的情况下,所述第一评估结果在所述m个评估结果中最优;或者,在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所节省的算力资源的情况下,所述第一评估结果在x个评估结果中最优,所述x个评估结果为所述m个评估结果中的至少两个评估结果。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,所述从所述第一计算图中提取一个或多个可并行分支组包括:
    在所述第一计算图中搜索拥有共同的第一父节点的多个第一分支,并根据所述多个第一分支得到一个可并行分支组,所述第一父节点为所述第一计算图中的任意一个父节点;
    和/或,
    在所述第一计算图中搜索拥有共同的第一子节点的多个第二分支,并根据所述多个第二分支得到一个可并行分支组,所述第一子节点为所述第一计算图中的任意一个子节点。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述多个第一分支得到一个可并行分支组包括:
    以所述第一父节点为起始点,分别向下搜索每个第一分支,直至在向下搜索过程中遇到共同的第二父节点或共同的第二子节点时停止,以得到所述多个第一分支对应的可并行分支组,所述可并行分支组包含所述多个第一分支各自对应的第一子分支,每个所述第一子分支包含的节点是在向下搜索每个所述第一子分支过程中得到的节点。
  9. 根据权利要求7-8中任一项所述的方法,其特征在于,所述根据所述多个第二分支得到一个可并行分支组包括:
    以所述第一子节点为起始点,分别向上搜索每个第二分支,直至在向上搜索过程中遇到共同的第三父节点或共同的第三子节点时停止,以得到所述多个第二分支对应的可并行分支组,所述可并行分支组包含所述多个第二分支各自对应的第二子分支,每个所述第二子分支包含的节点是在向上搜索每个所述第二子分支过程中得到的节点。
  10. 根据权利要求1-9中任一项所述的方法,其特征在于,所述从所述第一计算图中提取一个或多个可并行分支组还包括:
    在所述第一计算图中搜索多个第三分支,并根据所述多个第三分支得到一个可并行分支组,所述多个第三分支中每个第三分支的第一个节点都没有父节点;
    和/或,
    在所述第一计算图中搜索多个第四分支,并根据所述多个第四分支得到一个可并行分支组,所述第四分支中每个第四分支的最后一个节点都没有子节点。
  11. 根据权利要求10所述的方法,其特征在于,所述根据所述多个第三分支得到一个可并行分支组包括:
    以每个所述第三分支的第一个节点为起始点,分别向下搜索每个所述第三分支,直至在向下搜索过程中遇到同一父节点或同一子节点时停止,以得到所述多个第三分支对应的可并行分支组,所述可并行分支组包含所述多个第三分支各自对应的第三子分支,每个所述第三子分支包含的节点是在向下搜索每个所述第三子分支过程中得到的节点。
  12. 根据权利要求10-11中任一项所述的方法,其特征在于,所述根据所述多个第四分支得到一个可并行分支组包括:
    以每个所述第四分支的最后一个节点为起始点,分别向上搜索每个所述第四分支,直至在向上搜索过程中遇到同一父节点或同一子节点时停止,以得到所述多个第四分支对应的可并行分支组,所述可并行分支组包含所述多个第四分支各自对应的第四子分支,每个所述第四子分支包含的节点是在向上搜索每个所述第四子分支过程中得到的节点。
  13. 根据权利要求1-12中任一项所述的方法,其特征在于,所述从所述第一计算图中提取一个或多个可并行分支组还包括:
    在目标节点不属于不可融合节点的情况下,对所述目标节点周围的局部结构进行化简,得到第五分支,所述目标节点为所述第一计算图中未被划分进任意一个可并行分支组的节点;
    在所述第五分支为多个的情况下,根据所述多个第五分支得到一个可并行分支组。
  14. 根据权利要求1-13中任一项所述的方法,其特征在于,所述方法还包括:
    对所述第二计算图中的融合节点进行编译,得到与所述融合节点对应的算子核(kernel)。
  15. 根据权利要求14所述的方法,其特征在于,所述融合节点由p个节点融合得到,所述对所述第二计算图中包括的融合节点进行编译,得到与所述融合节点对应的算子核(kernel)包括:
    分别调度所述p个节点,以得到与所述p个节点分别对应的p个子中间表示(IR);
    对所述p个子IR进行融合,得到一个总IR;
    对所述总IR进行编译,得到与所述融合节点对应的算子核(kernel)。
  16. 根据权利要求1-15中任一项所述的方法,其特征在于,所述深度学习框架为:
    mindspore、tensorflow、tensornetwork、pytorch、mxnet、caffe或theano。
  17. 一种深度学习框架,其特征在于,包括:
    转换模块,用于将第一神经网络转换为第一计算图;
    搜索模块,用于从所述第一计算图中提取一个或多个可并行分支组,所述可并行分支组指示属于一个可并行分支组的多个子分支支持被并行执行,所述可并行分支组包括第一可并行分支组,所述第一可并行分支组满足以下条件中的至少一种:所述第一可并行分支组内的所有子分支的输入来自于同一节点且所述第一可并行分支组内的至少两个子分支的输出指向不同的节点、所述第一可并行分支组内的所有子分支的输出指向同一节点且所述第一可并行分支组内的至少两个子分支的输入来自不同的节点、所述第一可并行分支组内的所有子分支的第一个节点没有父节点、所述第一可并行分支组内的所有子分支的最后一个节点没有子节点;
    融合模块,用于对一个或多个所述可并行分支组中每个可并行分支组的多个节点进行融合,以基于所述第一计算图得到第二计算图,所述多个节点中的每个节点所属的子分支都与所述多个节点中的其他任何一个节点所属的子分支不同。
  18. 根据权利要求17所述的框架,其特征在于,在所述可并行分支组为多个的情况下,所述可并行分支组还包括第二可并行分支组,所述第二可并行分支组满足以下条件:
    所述第二可并行分支组内的所有子分支的输入来自于同一节点且所述第二可并行分支组内的至少两个子分支的输出指向同一节点,所述第二可并行分支组内的每个子分支包括至少两个节点。
  19. 根据权利要求17-18中任一项所述的框架,其特征在于,所述融合模块,还用于:
    从所述每个可并行分支组的目标子分支中剔除不可融合节点,以得到第三可并行分支组,所述目标子分支为所述每个可并行分支组中的任意一个子分支,所述不可融合节点包括指示特定运算操作的节点,所述特定运算操作包括以下操作中的至少一种:矩阵乘操作和卷积操作;
    将所述第三可并行分支组中的多个节点进行融合,以得到融合节点,所述第二计算图中包括所述融合节点与所述第一计算图中的未融合节点,所述第三可并行分支组中的多个节点中的每个节点所属的子分支都与所述多个节点中的其他任何一个节点所属的子分支不同。
  20. 根据权利要求19所述的框架,其特征在于,所述框架还包括:
    迭代模块,用于触发所述融合模块重复执行将所述第三可并行分支组中的多个节点进行融合,以得到融合节点的步骤,直至所述第三可并行分支组中未融合节点的数量少于2个。
  21. 根据权利要求19-20中任一项所述的框架,其特征在于,所述融合模块,具体用于:
    基于所述第三可并行分支组中的n个节点,得到m个节点组合,所述n个节点分别来自构成所述第三可并行分支组的n个分支,且所述m个节点组合中的每一个节点组合都包括至少两个节点,m≥1,n≥2且2m≤n;
    通过算力评估模型对所述m个节点组合各自所需的算力进行评估,以得到m个评估结果,所述m个评估结果中的每个评估结果用于表征如下情形中的一种:所述m个节点组合中每个节点组合所需耗费的算力资源、所述m个节点组合中每个节点组合所节省的算力资源;
    在第一评估结果满足预设条件的情况下,将与所述第一评估结果对应的第一节点组合中的节点进行融合,以得到一个或多个第一融合节点,所述第一评估结果为所述m个评估结果中的一个,所述第一节点组合为所述m个节点组合中的一个。
  22. 根据权利要求21所述的框架,其特征在于,所述第一评估结果满足预设条件至少包括如下一种情形:
    在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所需耗费的算力资源的情况下,所述第一评估结果达到加速硬件上具体执行计算任务的模块(device)的算力要求;在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所节省的算力资源的情况下,所述第一评估结果在所述m个评估结果中最优;或者,在所述m个评估结果中的每个评估结果用于表征所述m个节点组合中每个节点组合所节省的算力资源的情况下,所述第一评估结果在x个评估结果中最优,所述x个评估结果为所述m个评估结果中的至少两个评估结果。
  23. 根据权利要求17-22中任一项所述的框架,其特征在于,所述搜索模块,具体用于:
    在所述第一计算图中搜索拥有共同的第一父节点的多个第一分支,并根据所述多个第一分支得到一个可并行分支组,所述第一父节点为所述第一计算图中的任意一个父节点;
    和/或,
    在所述第一计算图中搜索拥有共同的第一子节点的多个第二分支,并根据所述多个第二分支得到一个可并行分支组,所述第一子节点为所述第一计算图中的任意一个子节点。
  24. 根据权利要求23所述的框架,其特征在于,所述搜索模块,具体还用于:
    以所述第一父节点为起始点,分别向下搜索每个第一分支,直至在向下搜索过程中遇到共同的第二父节点或共同的第二子节点时停止,以得到所述多个第一分支对应的可并行分支组,所述可并行分支组包含所述多个第一分支各自对应的第一子分支,每个所述第一子分支包含的节点是在向下搜索每个所述第一子分支过程中得到的节点。
  25. 根据权利要求23-24中任一项所述的框架,其特征在于,所述搜索模块,具体还用于:
    以所述第一子节点为起始点,分别向上搜索每个第二分支,直至在向上搜索过程中遇到共同的第三父节点或共同的第三子节点时停止,以得到所述多个第二分支对应的可并行分支组,所述可并行分支组包含所述多个第二分支各自对应的第二子分支,每个所述第二子分支包含的节点是在向上搜索每个所述第二子分支过程中得到的节点。
  26. 根据权利要求17-25中任一项所述的框架,其特征在于,所述搜索模块,具体还用于:
    在所述第一计算图中搜索多个第三分支,并根据所述多个第三分支得到一个可并行分支组,所述多个第三分支中每个第三分支的第一个节点都没有父节点;
    和/或,
    在所述第一计算图中搜索多个第四分支,并根据所述多个第四分支得到一个可并行分支组,所述第四分支中每个第四分支的最后一个节点都没有子节点。
  27. 根据权利要求26所述的框架,其特征在于,所述搜索模块,具体还用于:
    以每个所述第三分支的第一个节点为起始点,分别向下搜索每个所述第三分支,直至在向下搜索过程中遇到同一父节点或同一子节点时停止,以得到所述多个第三分支对应的可并行分支组,所述可并行分支组包含所述多个第三分支各自对应的第三子分支,每个所述第三子分支包含的节点是在向下搜索每个所述第三子分支过程中得到的节点。
  28. 根据权利要求26-27中任一项所述的框架,其特征在于,所述搜索模块,具体还用于:
    以每个所述第四分支的最后一个节点为起始点,分别向上搜索每个所述第四分支,直至在向上搜索过程中遇到同一父节点或同一子节点时停止,以得到所述多个第四分支对应的可并行分支组,所述可并行分支组包含所述多个第四分支各自对应的第四子分支,每个所述第四子分支包含的节点是在向上搜索每个所述第四子分支过程中得到的节点。
  29. 根据权利要求17-28中任一项所述的框架,其特征在于,所述搜索模块,具体还用于:
    在目标节点不属于不可融合节点的情况下,对所述目标节点周围的局部结构进行化简,得到第五分支,所述目标节点为所述第一计算图中未被划分进任意一个可并行分支组的节点;
    在所述第五分支为多个的情况下,根据所述多个第五分支得到一个可并行分支组。
  30. 根据权利要求17-29中任一项所述的框架,其特征在于,所述框架还包括:
    编译模块,用于对所述第二计算图中的融合节点进行编译,得到与所述融合节点对应的算子核(kernel)。
  31. 根据权利要求30所述的框架,其特征在于,所述编译模块,还用于:
    在所述融合节点由p个节点融合得到的情况下,分别调度所述p个节点,以得到与所述p个节点分别对应的p个子中间表示(IR);
    对所述p个子IR进行融合,得到一个总IR;
    对所述总IR进行编译,得到与所述融合节点对应的算子核(kernel)。
  32. 根据权利要求17-31中任一项所述的框架,其特征在于,所述框架为:
    mindspore、tensorflow、tensornetwork、pytorch、mxnet、caffe或theano。
  33. 一种计算机设备,包括存储器和一个或多个处理器,所述一个和多个处理器与所述存储器耦合,其特征在于,
    所述存储器,用于存储程序;
    所述一个或多个处理器,用于执行所述存储器中的程序,使得所述计算机设备执行如权利要求1-16中任一项所述的方法。
  34. 一种计算机可读存储介质,包括计算机可读指令,其特征在于,当所述计算机可读指令在计算机上运行时,使得计算机执行如权利要求1-16中任一项所述的方法。
  35. 一种计算机程序产品,包括计算机可读指令,其特征在于,当所述计算机可读指令在计算机上运行时,使得计算机执行如权利要求1-16中任一项所述的方法。
  36. 一种芯片,其特征在于,所述芯片包括存储器和一个或多个处理器,所述芯片用于读取存储器中存储的计算机程序,使得所述一个或多个处理器执行如权利要求1-16任一项所述的方法。
PCT/CN2021/140906 2020-12-28 2021-12-23 一种计算图的节点融合方法及设备 WO2022143419A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21914156.1A EP4258175A4 (en) 2020-12-28 2021-12-23 NODE FUSION METHOD FOR COMPUTER GRAPH AND APPARATUS
US18/214,101 US20230334292A1 (en) 2020-12-28 2023-06-26 Node fusion method for computational graph and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011595030.3 2020-12-28
CN202011595030.3A CN114692860A (zh) 2020-12-28 2020-12-28 一种计算图的节点融合方法及设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/214,101 Continuation US20230334292A1 (en) 2020-12-28 2023-06-26 Node fusion method for computational graph and device

Publications (1)

Publication Number Publication Date
WO2022143419A1 true WO2022143419A1 (zh) 2022-07-07

Family

ID=82132090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140906 WO2022143419A1 (zh) 2020-12-28 2021-12-23 一种计算图的节点融合方法及设备

Country Status (4)

Country Link
US (1) US20230334292A1 (zh)
EP (1) EP4258175A4 (zh)
CN (1) CN114692860A (zh)
WO (1) WO2022143419A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115756478A (zh) * 2022-11-02 2023-03-07 中科寒武纪科技股份有限公司 计算图的算子自动融合方法及相关产品
CN116820524B (zh) * 2023-08-22 2023-11-28 腾讯科技(深圳)有限公司 模型更新方法、装置、计算机设备及存储介质
CN116894469B (zh) * 2023-09-11 2023-12-15 西南林业大学 端边云计算环境中的dnn协同推理加速方法、设备及介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543825A (zh) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 神经网络模型算法编译方法、装置及相关产品
CN109740751A (zh) * 2018-12-24 2019-05-10 北京中科寒武纪科技有限公司 神经网络模型的架构融合方法及相关装置
US20200249998A1 (en) * 2019-02-01 2020-08-06 Alibaba Group Holding Limited Scheduling computation graph heterogeneous computer system
CN111260019A (zh) * 2020-02-18 2020-06-09 深圳鲲云信息科技有限公司 神经网络模型的数据处理方法、装置、设备及存储介质
CN111935005A (zh) * 2020-08-07 2020-11-13 腾讯科技(深圳)有限公司 数据传输方法、装置、处理设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4258175A4

Also Published As

Publication number Publication date
CN114692860A (zh) 2022-07-01
EP4258175A1 (en) 2023-10-11
US20230334292A1 (en) 2023-10-19
EP4258175A4 (en) 2024-05-29

Similar Documents

Publication Publication Date Title
WO2022143419A1 (zh) 一种计算图的节点融合方法及设备
US12014257B2 (en) Domain specific language for generation of recurrent neural network architectures
Chatterjee et al. Measuring and synthesizing systems in probabilistic environments
CN107563512B (zh) 一种数据处理方法、装置以及存储介质
WO2021000970A1 (zh) 深度学习算法的编译方法、装置及相关产品
WO2021190597A1 (zh) 一种神经网络模型的处理方法以及相关设备
JP6763072B2 (ja) データ処理グラフのコンパイル
CN111160551A (zh) 计算图执行方法、计算机设备及存储介质
WO2024131097A1 (zh) 神经网络模型的编译方法、装置、电子设备和存储介质
CN110689116B (zh) 一种神经网络剪枝方法、装置、计算机设备及存储介质
CN111782637A (zh) 一种模型构建方法、装置及设备
Barchi et al. Exploration of convolutional neural network models for source code classification
WO2023093689A1 (zh) 一种计算图优化方法、装置及设备
US20240062116A1 (en) Model processing method and apparatus
WO2023160290A1 (zh) 神经网络推理加速方法、目标检测方法、设备及存储介质
CN117009038B (zh) 一种基于云原生技术的图计算平台
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
CN115860061A (zh) 图神经网络优化方法和图神经网络推理系统
Ali et al. Parallelizing user-defined functions in the ETL workflow using orchestration style sheets
Ahmed et al. Toward a novel engine for compiler optimization space exploration of big data workloads
Kimovski et al. Autotuning of exascale applications with anomalies detection
Bruce The Blind Software Engineer: Improving the Non-Functional Properties of Software by Means of Genetic Improvement
CN116755714B (zh) 深度神经网络模型的运行方法、装置、设备和存储介质
US11809849B1 (en) Global modulo allocation in neural network compilation
US20240211312A1 (en) Node symmetry in machine learning compiler optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914156

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021914156

Country of ref document: EP

Effective date: 20230703

NENP Non-entry into the national phase

Ref country code: DE