WO2024066847A1 - Multi-die-based computation method and related device

Multi-die-based computation method and related device

Info

Publication number
WO2024066847A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
axis
segmentation
die
operators
Prior art date
Application number
PCT/CN2023/115085
Other languages
French (fr)
Chinese (zh)
Inventor
刘锡明
朱思宇
林惠敏
葛根华
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024066847A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of computer technology, and in particular to a multi-die-based computing method and related equipment.
  • UMA uniform memory access
  • NUMA non-uniform memory access
  • HBM high-bandwidth memory
  • the NUMA structure is usually chosen for multi-die packaging.
  • data frequently needs to be accessed across dies, which greatly increases access latency and reduces overall computing efficiency. How to improve the computing efficiency of multi-die packaged chips is therefore an urgent problem to be solved.
  • the embodiments of the present application provide a multi-die based computing method and related equipment, which can effectively improve the computing efficiency of multi-die packaged chips.
  • the multi-die-based computing method provided in the embodiments of the present application can be executed by an electronic device, etc.
  • An electronic device refers to a device that can be abstracted as a computer system, wherein an electronic device based on the computing function of multiple dies can also be referred to as a computing device based on multiple dies.
  • the computing device based on multiple dies can be a complete machine of the electronic device, such as: a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a car computer or a server, etc.; it can also be a system/device composed of multiple complete machines; it can also be a part of the electronic device, such as: a chip related to the computing function based on multiple dies, such as a system on a chip (SoC), etc., which is not specifically limited in the embodiments of the present application. Among them, the system chip is also called a system on chip.
  • SoC system on a chip
  • an embodiment of the present application provides a multi-die-based computing method, the method comprising: obtaining a first computation graph, the first computation graph comprising M first operators; splitting the M first operators respectively to obtain splitting results of the M first operators; splitting the first computation graph based on the splitting results of the M first operators to obtain N corresponding second computation graphs, each of the N second computation graphs comprising a split first operator, where N and M are integers greater than or equal to 1; and allocating the N second computation graphs to N dies for execution, the N second computation graphs corresponding one-to-one to the N dies.
  • the present application can split each complete computation (e.g., a first operator) in the computation graph into multiple smaller sub-computations (e.g., the split first operator) based on operator splitting, and distribute these smaller sub-computations to the corresponding dies for parallel computation.
  • splitting the M first operators respectively to obtain the splitting results of the M first operators includes: determining the optimal splitting axis of the i-th first operator among the M first operators; splitting the i-th first operator based on the optimal splitting axis of the i-th first operator to obtain the splitting result of the i-th first operator; i is an integer greater than or equal to 1 and less than or equal to M.
  • each operator e.g., the first operator in the original computation graph may correspond to multiple split axes. Based on this, the embodiment of the present application first needs to determine an optimal split axis from the multiple split axes corresponding to each first operator, and then split the first operator based on the optimal split axis, thereby ensuring the computational efficiency of the multiple operators obtained after each first operator is split on multiple dies.
  • the i-th first operator includes K segmentation axes; determining the optimal segmentation axis of the i-th first operator among the M first operators includes: determining the computational benefits and communication time corresponding to each of the K segmentation axes included in the i-th first operator; determining the segmentation benefits corresponding to each of the K segmentation axes based on the difference between the computational benefits and the communication time corresponding to each of the K segmentation axes; wherein the segmentation axis with the largest segmentation benefit is the optimal segmentation axis of the i-th first operator; K is an integer greater than or equal to 1.
  • the actual slicing benefit that each slicing axis can bring can be calculated based on the computational benefit (i.e., the reduction in computational time) and communication time corresponding to each slicing axis in the current operator. Then, the slicing axis with the largest benefit is determined as the optimal slicing axis of the current operator to ensure the computational efficiency of each operator when it is distributed to multiple chips for execution after slicing.
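  • Stated compactly (the notation below is ours, not the application's), the selection rule described above is: the splitting benefit of the j-th axis is its computational benefit minus its communication time, and the optimal axis maximizes that benefit:

$$ \mathrm{gain}_j \;=\; \bigl(T_{\mathrm{single}} - T_{\mathrm{parallel},\,j}\bigr) \;-\; T_{\mathrm{comm},\,j}, \qquad j^{*} \;=\; \arg\max_{1 \le j \le K}\ \mathrm{gain}_j $$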
  • K may also be equal to 0, that is, an operator may not have a slicing axis. Obviously, in the case where the operator has no slicing axis, the operator will not be able to perform slicing processing.
  • in this case, the operator can be deployed on the N dies respectively so that the N dies perform the same computation.
  • when executing the operator, the N dies need to obtain the data required for the computation through cross-die communication.
  • the computational benefit corresponding to the j-th splitting axis of the i-th first operator is the difference between the first computational time and the second computational time; the first computational time is the time required for a single die to execute the i-th first operator, and the second computational time is the time required for multiple die to execute in parallel the i-th first operator after being split by the j-th splitting axis; j is an integer greater than or equal to 1 and less than or equal to K.
  • the reduction in the computation time of a complete data calculation (e.g., the first operator) after being split by the current split axis and distributed to multiple dies can be used as the computation benefit corresponding to the current split axis, thereby providing support for the subsequent determination of the optimal split axis and ensuring the computation efficiency after the operator is split.
  • the computation time of the i-th first operator on a single die is T
  • the computation time of the i-th first operator after being split by its j-th split axis and distributed to 4 dies can theoretically be T/4
  • the computation benefit of the j-th split axis is (T-T/4).
  • the communication time corresponding to the j-th splitting axis of the i-th first operator is the time required for the p-th die to obtain target data from other die;
  • the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th splitting axis;
  • the target data is the data required for the p-th die to execute the i-th first operator after being split by the j-th splitting axis;
  • the communication time is related to the amount of the target data and the memory layout of the target data;
  • p is an integer greater than or equal to 1 and less than or equal to N.
  • the communication time corresponding to each split axis can be determined based on the time consumed by the operator after being split by the split axis to obtain the data required for calculation from other bare chips when calculating on the corresponding bare chip.
  • by taking the communication time into account, the present application can ensure that the computation allocated to each die after splitting along the optimal splitting axis only needs, as far as possible, to access the storage within that die, avoiding the latency caused by accessing data across dies.
  • even under a relatively low inter-die interconnection bandwidth, the present application can still achieve high computing performance, so that as much of the multi-die packaged chip's area as possible can be devoted to computation, improving the computing power density of the multi-die packaged chip.
  • in the case where the previous operator has no splitting axis (that is, the previous operator has not been split), a die often needs to communicate across dies to obtain computing data from other dies when executing the split current operator.
  • the segmentation result of the i-th first operator includes: a list of output tensors of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator, the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis, and one or more bare chips allocated to the i-th first operator after being segmented by the optimal segmentation axis.
  • each operator may include a series of split results after being split (such as a list of output tensors of the operator, the input/output tensor shapes before the operator is split, the input/output tensor shapes after the operator is split, and which bare chips are assigned to after the operator is split, etc.).
  • split results can provide effective support for subsequent graph splitting of the original computational graph (such as the first computational graph), thereby quickly and efficiently constructing multiple second computational graphs corresponding to multiple bare chips one by one, thereby improving the computing efficiency of multiple bare chips.
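  • The application does not prescribe a concrete data structure for the split result; purely as an illustration (all field names and example values below are hypothetical), such a record might look like the following sketch:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative sketch only: field names are ours, not the application's.
@dataclass
class OperatorSplitResult:
    op_name: str                                  # name of the first operator
    output_tensors: List[str]                     # list of its output tensors
    original_shapes: Dict[str, Tuple[int, ...]]   # input/output shapes before splitting
    split_shapes: Dict[str, Tuple[int, ...]]      # shapes after splitting along the optimal axis
    split_axis: str                               # the optimal splitting axis, e.g. "M", "K" or "N"
    assigned_dies: List[int] = field(default_factory=list)  # dies the split pieces go to

# Example: the MatMul of Figure 3a split in half along the M axis onto two dies.
result = OperatorSplitResult(
    op_name="MatMul_A",
    output_tensors=["C"],
    original_shapes={"A": (1024, 512), "B": (512, 256), "C": (1024, 256)},
    split_shapes={"A": (512, 512), "B": (512, 256), "C": (512, 256)},
    split_axis="M",
    assigned_dies=[0, 1],
)
```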
  • the pth second computation graph among the N second computation graphs includes multiple second operators; the multiple second operators include the i-th first operator after being split by the optimal splitting axis; and the p-th second computation graph is a second computation graph allocated to be executed on the p-th die.
  • each second computation graph obtained after operator segmentation and graph segmentation can include the segmented first operator.
  • each complete computation is segmented and then distributed to multiple dies, and each die only needs to perform the computation of part of the data in the original first operator, effectively improving the computation efficiency of the multi-die packaged chip.
  • the plurality of second operators in the p-th second computation graph may further include one or more of a slice operator, a communication operator and a reduction operator; the slice operator is used to obtain the input tensor of the i-th first operator after it is split by the optimal splitting axis; the communication operator is used to obtain, from other dies, the input tensor of the split i-th first operator when the optimal splitting axis of the i-th first operator differs from that of the (i-1)-th first operator; the reduction operator is used to reduce the data on the corresponding multiple dies when the optimal splitting axis of the i-th first operator is a reduction axis.
  • an embodiment of the present application provides an electronic device, the electronic device comprising N bare chips, where N is an integer greater than or equal to 1.
  • the electronic device is used to implement the corresponding functions of any one of the multi-bare-chip-based computing methods provided in the first aspect above.
  • an embodiment of the present application provides an electronic device, the electronic device includes a processor, and the processor is configured to support the electronic device to perform the corresponding functions in any one of the methods provided in the first aspect.
  • the electronic device may also include a memory, the memory is used to couple with the processor, and the memory stores the necessary program instructions and data of the electronic device.
  • the electronic device may also include a communication interface for the electronic device to communicate with other devices or a communication network.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements any one of the multi-die-based computing method processes provided in the first aspect.
  • an embodiment of the present application provides a computer program, which includes instructions.
  • when the computer program is executed by a computer, the computer can execute any one of the multi-die-based computing method processes provided in the first aspect above.
  • an embodiment of the present application provides a chip, which includes a processor and a communication interface, and the processor is used to call and run instructions from the communication interface.
  • when the processor executes the instructions, the chip executes any one of the multi-die-based computing method processes provided in the first aspect above.
  • an embodiment of the present application provides a chip system, which includes the electronic device described in any one of the second aspect or the third aspect, and is used to implement the functions involved in any one of the multi-die-based computing method processes provided in the first aspect.
  • the chip system also includes a memory, which is used to store program instructions and data necessary for the multi-die-based computing method.
  • the chip system can be composed of chips, or it can include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of a system architecture based on a multi-die chip provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the structure of a graph compiler provided in an embodiment of the present application.
  • FIG. 3a is a schematic diagram of operator segmentation provided in an embodiment of the present application.
  • FIG. 3b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
  • FIG. 4 is a flow chart of a multi-die-based computing method provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of an operator segmentation method provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of calculating a segmentation benefit provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a method for calculating computation time provided in an embodiment of the present application.
  • FIG. 8 is a flow chart of a method for calculating communication time provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram showing that the segmentation axes of the preceding and succeeding operators are the same, provided in an embodiment of the present application.
  • FIG. 10 is a flow chart of a graph segmentation method provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • "At least one (item)" means one or more, and "more than one" means two or more.
  • "And/or" is used to describe an association relationship between associated objects, indicating that three relationships can exist.
  • For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
  • "At least one of the following" or a similar expression refers to any combination of these items, including any combination of single items or plural items.
  • "At least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a, b and c", where a, b and c can be singular or plural.
  • a component can be, but is not limited to, a process, a processor, an object, an executable file, an execution thread, a program and/or a computer running on a processor.
  • the application running on the processor and the processor can be a component.
  • One or more components may reside in a process and/or an execution thread, and the component may be located on a computer and/or distributed between two or more computers.
  • these components may be executed from various computer-readable media having various data structures stored thereon.
  • Components may, for example, communicate through local and/or remote processes according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems by way of the signal across a network such as the Internet).
  • Die refers to the crystal grain of the chip before it is packaged. It is a small piece cut from a silicon wafer by laser. Each die is an independent functional chip. Dies will be packaged as a unit to become common chips. In order to meet the computing power requirements of today's artificial intelligence chips, the industry has proposed a technical solution to package multiple dies in one chip to provide greater computing power.
  • An operator is a mapping from a function space to another function space.
  • operators can be extended to any space, such as inner product space.
  • any operation on any function can be considered an operator, such as matrix multiplication (matmul), even exponentiation and square root can be considered an operator.
  • Directed acyclic graph also known as computational graph.
  • DAG Directed acyclic graph
  • the technical problems that the present application specifically aims to solve will be further analyzed and proposed below.
  • when multi-die packaging technology is adopted, the NUMA architecture is usually adopted.
  • the NUMA architecture usually brings the problem of accessing data across dies, resulting in a significant increase in access latency and thereby reducing the overall computing efficiency.
  • MCM-GPU multi-chip module-graphics processing unit
  • MCM-GPU can include the following technical points:
  • a portion of the space in the L2 cache is allocated as an L1.5 cache, which is specifically responsible for caching data from remote dies and improving the performance of accessing data on remote dies.
  • compute thread arrays (CTA) are scheduled across dies in a distributed and batched manner.
  • MCM-GPU has the following main disadvantages:
  • MCM-GPU has high requirements for inter-die bandwidth, which has a great impact on computing performance.
  • its scalability is poor: as the number of dies continues to increase, it is difficult for the inter-die bandwidth to scale up accordingly.
  • the technical problems that this application actually aims to solve include the following aspects: Based on the existing hardware devices and NUMA architecture, by compiling the computational graph, each complete operator in the original computational graph is split into multiple ones, and distributed to the corresponding multiple Dies for calculation, so as to make full use of the computing resources of each Die, greatly improving the computing efficiency of multi-die packaged chips.
  • the data required for the computation allocated to each die after the operator is split is mostly in the local storage of that die, avoiding cross-die data access as much as possible. In this way, even under a relatively low inter-die interconnection bandwidth, the embodiments of the present application can effectively improve the computing efficiency of the entire multi-die chip.
  • Figure 1 is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application.
  • the embodiments of the present application can be mainly applied to the training and inference scenarios of AI models.
  • AI developers can use AI frameworks such as TensorFlow, PyTorch, MindSpore, etc. to develop training/inference scripts, and then trigger the execution of training/inference through the AI framework.
  • the system architecture may include an AI framework 101, a graph compiler 102, a memory management 103, an operation compiler 104, and a chip 105.
  • chip 105 is a multi-Die chip, which may include multiple bare chips such as Die 0, Die 1, Die 2, Die 3, etc. shown in Figure 1.
  • the AI framework 101 is used to construct the user's calculation into a DAG graph. Then, the AI framework 101 sends the DAG graph to the graph compiler 102 for compilation. It should be understood that in the DAG graph, the calculation is expressed by nodes, and the data transferred between calculations or the dependency between calculations are expressed by the edges between nodes. Each node in the DAG graph is an operator.
  • the graph compiler 102 is used to compile the operators in the DAG graph in a certain topological order, and finally compile all nodes on the DAG graph into a task list and send it to the memory management (runtime) 103.
  • the task list may include multiple computing tasks.
  • the memory management 103 is used to send the task list to all the dies in the chip 105 .
  • Chip 105 is used to execute the sent task list through multiple Dies therein.
  • for a single-die chip, the memory management 103 only needs to send the task list to the only die in chip 105.
  • for a multi-die chip, the memory management 103 needs to send the task lists to the multiple dies in chip 105.
  • graph compiler 102 can split each complete operator in the DAG graph, and based on the split result, split the original DAG graph into multiple subgraphs, which are then distributed to the corresponding multiple dies for execution through memory management 103, thereby greatly improving the computing efficiency of multi-die chips.
  • the landing product of this application can be a training/inference device in a data center, deployed on a training/inference server in the data center, or any other possible device with AI functions (such as face recognition) that performs calculations based on DAG graphs, for example an edge device such as a camera on a street lamp.
  • This application mainly realizes the deployment of calculation compilation to multiple Dies in a chip by improving the compilation process of the DAG graph.
  • the graph compiler 102 may include a graph sorting unit 21, an operator segmentation unit 22, a graph segmentation unit 23, a model compilation unit 24, and a model deployment unit 25.
  • the graph compiler 102 may also include an environment information library 26 and an operator information library 27.
  • the AI framework 101 can be connected to the graph sorting unit 21 in the graph compiler 102, and the model deployment unit 25 can be connected to the memory management 103.
  • the graph sorting unit 21 is used to convert the DAG into an operator list by topological sorting, wherein the operator list includes multiple operators.
  • the order of the operators in the operator list expresses the order in which the operators are executed.
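  • The graph sorting unit's algorithm is not detailed here; as a minimal sketch (our own illustrative code, not the application's), a DAG can be converted into an ordered operator list with a standard topological sort such as Kahn's algorithm:

```python
from collections import defaultdict, deque
from typing import Dict, List

def topological_order(edges: Dict[str, List[str]]) -> List[str]:
    """Return one topological order of a DAG given as {node: [successors]}."""
    indegree = defaultdict(int)
    nodes = set(edges)
    for src, dsts in edges.items():
        nodes.update(dsts)
        for dst in dsts:
            indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dst in edges.get(node, []):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical three-operator graph: MatMul -> Add -> ReLU.
print(topological_order({"MatMul": ["Add"], "Add": ["ReLU"]}))
```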
  • the operator segmentation unit 22 is used to perform operator segmentation processing on multiple operators in the DAG graph respectively, and generate segmentation results for each operator. Specifically, as shown in FIG2 , when the operator segmentation unit 22 performs operator segmentation, it can read the corresponding environment information and operator information through the environment information library 26 and the operator information library 27 respectively.
  • the environment information in the environment information library 26 may include: the number of dies in the chip, the memory specifications of each die, the interconnection topology and bandwidth between the dies, etc.
  • the operator information in the operator information library 27 may include: operator type, input/output list, and axis information of each input/output.
  • compared with a conventional environment information library, the environment information library 26 in the present application changes the environment information from describing a single die in a single chip to describing multiple dies in a single chip, and adds descriptions of the topological structure of the inter-die interconnection network, the interconnection bandwidth, and so on.
  • the operator information library 27 in the present application needs to add a description of the input/output axis information for each operator in the operator information.
  • the axis information is an expression of the segmentation relationship between different input/output tensors (Tensor) of the operator. All inputs/outputs of the same axis must be segmented using the same segmentation method (i.e., segmentation axis).
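  • The exact schema of the operator information library is not given; purely as an illustration (field names are ours), an entry for the MatMul operator discussed next might record the axis information like this:

```python
# Hypothetical operator-information entry for a MatMul; field names are illustrative only.
matmul_info = {
    "op_type": "MatMul",
    "inputs":  ["A", "B"],        # A is M x K, B is K x N
    "outputs": ["C"],             # C is M x N
    # Axis information: tensors that share an axis must be split the same way.
    "axes": {
        "M": {"type": "element-wise", "appears_in": {"A": 0, "C": 0}},
        "K": {"type": "reduction",    "appears_in": {"A": 1, "B": 0}},
        "N": {"type": "element-wise", "appears_in": {"B": 1, "C": 1}},
    },
}
```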
  • Figure 3a is a schematic diagram of an operator segmentation provided in an embodiment of the present application.
  • Operator A in Figure 3a is a matrix multiplication (MatMul) operator, and operator A includes two input matrices and an output matrix, wherein the left input matrix is an M×K matrix, the right input matrix is a K×N matrix, and the output matrix is an M×N matrix.
  • Operator A includes three segmentation axes, namely, the M axis, the K axis, and the N axis.
  • the left input matrix and the output matrix of operator A are segmented along the same segmentation axis (i.e., the M axis, as indicated by the dotted line in Figure 3a), thereby obtaining the segmented operators a1 and a2, wherein the left input matrices of operators a1 and a2 are (M/2)×K matrices, and the output matrices are (M/2)×N matrices.
  • the data required to be calculated by operators a1 and a2 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
  • FIG. 3b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
  • the right input matrix and the output matrix of operator A are segmented along the same segmentation axis (i.e., the N axis, as indicated by the dotted line in FIG. 3b), thereby obtaining operator a3 and operator a4, wherein the right input matrices of operator a3 and operator a4 are K×(N/2) matrices, and the output matrices are M×(N/2) matrices.
  • the data required to be calculated by operator a3 and operator a4 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
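  • The splits of Figures 3a and 3b can be checked numerically. The following sketch (ours, using NumPy with arbitrary small sizes) splits a MatMul along the M axis and along the N axis and verifies that concatenating the partial outputs reproduces the full result:

```python
import numpy as np

M, K, N = 8, 6, 4
A = np.random.rand(M, K)   # left input, M x K
B = np.random.rand(K, N)   # right input, K x N
C = A @ B                  # full operator A: M x N

# Figure 3a: split along the M axis -> operators a1, a2 each compute (M/2) x N.
c1 = A[: M // 2] @ B       # notionally on Die 0
c2 = A[M // 2 :] @ B       # notionally on Die 1
assert np.allclose(np.concatenate([c1, c2], axis=0), C)

# Figure 3b: split along the N axis -> operators a3, a4 each compute M x (N/2).
c3 = A @ B[:, : N // 2]    # notionally on Die 0
c4 = A @ B[:, N // 2 :]    # notionally on Die 1
assert np.allclose(np.concatenate([c3, c4], axis=1), C)
```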
  • the graph segmentation unit 23 is used to reorganize the segmented operators (such as operator a1 and operator a2 shown in FIG3a ) into a sub-graph (sub DAG) executed on multiple Dies based on the segmentation results of the operator segmentation, and each sub-graph includes multiple segmented operators.
  • the sub-graph 1 assigned to Die 0 for calculation may include the above operator a1
  • the sub-graph 2 assigned to Die 1 for calculation may include the above operator a2.
  • the operator segmentation unit 22 and the graph segmentation unit 23 are newly added units in the graph compiler of the present application.
  • the model compilation unit 24 is used to compile multiple sub DAGs into a model list that can be deployed.
  • the model list includes multiple models, and each model includes multiple computing tasks. It should be understood that compared with conventional model compilation, the model compilation unit 24 in the present application needs to increase the compilation support for multiple sub DAGs.
  • the model deployment unit 25 is used to deploy multiple model lists to corresponding multiple Dies. Finally, the multiple model lists are allocated to their corresponding Dies for execution through the memory management 103, making full use of the computing power of each Die in the chip and effectively improving the computing efficiency of the multi-Die chip.
  • Figure 4 is a flow chart of a multi-die-based calculation method provided in an embodiment of the present application.
  • the method is mainly aimed at a chip packaged by N dies, where N is an integer greater than or equal to 1.
  • the method can be applied to the system architecture described in Figure 1. Specifically, the method can be applied to the graph compiler 102 shown in Figure 2.
  • the method provided in an embodiment of the present application will be described in detail below in conjunction with the graph compiler 102 described in Figure 2. As shown in Figure 4, the method may include the following steps S501-S504.
  • Step S501 Obtain a first computation graph, where the first computation graph includes M first operators.
  • the graph compiler 102 obtains a first computation graph, which includes M first operators.
  • M is an integer greater than or equal to 1.
  • the first computation graph is a DAG graph before segmentation, and accordingly, the M first operators included in the first computation graph are all operators before segmentation, and each first operator corresponds to a calculation of a complete data.
  • the graph compiler 102 may obtain the first computation graph through the graph sorting unit 21 therein, and convert the first computation graph into an operator list through topological sorting, wherein the operator list includes the above-mentioned M first operators arranged in order.
  • Step S502 segmenting the M first operators respectively to obtain segmentation results of the M first operators.
  • the graph compiler 102 performs operator segmentation processing on the M first operators in sequence, thereby obtaining segmentation results for each of the M first operators.
  • the graph compiler 102 may first determine the optimal segmentation axis of each first operator, and then segment the first operator based on the optimal segmentation axis of each first operator, so as to obtain the most ideal segmentation result.
  • the optimal segmentation axis may be the segmentation axis that brings the greatest segmentation benefit.
  • the graph compiler 102 may first determine the optimal segmentation axis of the i-th first operator among the M first operators, and then segment the i-th first operator based on the optimal segmentation axis of the i-th first operator to obtain the segmentation result of the i-th first operator.
  • i is an integer greater than or equal to 1 and less than or equal to M.
  • the segmentation result of the i-th first operator includes: the output tensor list of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator (for example, M×N shown in Figure 3a above), the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis (for example, (M/2)×N shown in Figure 3a above), and the one or more dies correspondingly allocated to the i-th first operator after it is segmented by the optimal segmentation axis.
  • Figure 5 is a schematic flow chart of an operator segmentation method provided in an embodiment of the present application. As shown in Figure 5, the method includes the following steps S11 to S17.
  • Step S11 obtaining operator information.
  • the graph compiler 102 obtains the operator information of M first operators (i.e., all operators to be split) from the operator information library 27.
  • the operator information mainly includes the input tensor (Input Tensor) list, output tensor (Output Tensor) list and axis information of each first operator.
  • the axis information of each first operator may include the axis type, such as element-wise, reduction (Reduction), sliding window (SlidingWindow) and other types; different axis types express different computing characteristics of operators.
  • Step S12 whether there is an uncalculated segmentation axis, if yes, execute step S13, if not, execute step S17.
  • the graph compiler 102 checks whether the current first operator (for example, the ith first operator) still has a slicing axis whose slicing benefit has not been calculated. If the current ith first operator still has a slicing axis that has not been calculated, the graph compiler 102 may then select the next slicing axis and calculate the slicing benefit; if the slicing benefit calculation has been completed for all slicing axes included in the current ith first operator (for example, including K slicing axes, K is an integer greater than or equal to 1), the graph compiler 102 may record the slicing axis with the largest slicing benefit as the optimal slicing axis, and record the slicing result of the ith first operator under the optimal slicing axis in the tensor slicing table.
  • the tensor slicing table may be located in a database or in memory.
  • Step S13 segmentation axis selection.
  • the graph compiler 102 can then select the next split axis (for example, the j-th split axis among the K split axes) and calculate its split benefit. For example, as shown in FIG. 3a above, operator A includes three split axes; if the split benefit of the M axis has already been calculated at this point, the graph compiler 102 can then select the N axis or the K axis and calculate the corresponding split benefit.
  • Step S14 calculating the split benefit.
  • the graph compiler 102 calculates the splitting benefit of the i-th first operator under the current splitting axis.
  • Figure 6 is a schematic diagram of calculating a split benefit provided in an embodiment of the present application.
  • the split benefit is related to the calculation benefit and the communication time (or communication loss).
  • the split benefit corresponding to each split axis is specifically the difference between the calculation benefit and the communication time. The greater the calculation benefit and the smaller the communication time, the greater the split benefit.
  • the computational benefit is the reduction in computational time of the ith first operator after the split. Specifically, it is the difference between the time required for the ith first operator to be computed on one die before the split (e.g., the first computational time) and the time required for the ith first operator to be computed on multiple die in parallel after being split by the current jth split axis (e.g., the second computational time).
  • the computation time of the i-th first operator on a single die is T
  • the computation time of the i-th first operator after being split by its j-th splitting axis and distributed to 4 die for computation can theoretically be T/4
  • the computation benefit of the j-th splitting axis is (T-T/4).
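  • As a minimal sketch of the benefit calculation in step S14 (assuming, for simplicity, ideal linear scaling of the computation time, which the following paragraphs note is not exact in practice; the function name and values are ours):

```python
def split_benefit(t_single: float, num_dies: int, t_comm: float) -> float:
    """Split benefit of one candidate axis.

    t_single : time to execute the whole operator on one die
    num_dies : number of dies the split operator is distributed over
    t_comm   : cross-die communication time this split would incur
    Assumes ideal linear scaling (t_parallel = t_single / num_dies);
    a real implementation would use a cost model instead.
    """
    t_parallel = t_single / num_dies
    compute_benefit = t_single - t_parallel
    return compute_benefit - t_comm

# Example from the text: time T on one die, split over 4 dies, no communication needed.
T = 100.0
print(split_benefit(T, 4, 0.0))   # 75.0, i.e. T - T/4
```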
  • the amount of computational data and the computational time of the operator are not linearly related, and the computational time on each die is often related to many factors.
  • FIG. 7 is a schematic diagram of a computation time calculation method provided in an embodiment of the present application.
  • as shown in FIG. 7, the graph compiler 102 needs to consider factors such as the chip type (which determines the number of cores and the clock frequency of the various accelerated computing units, the sizes of the caches at all levels, etc.), the type of the input data (DType), the shape of the input data (Shape), and the like, and calculates, based on a cost model (Cost Model), the computation time required for each die to execute the i-th first operator (such as operator a1 shown in FIG. 3a above) after it is split by the current j-th splitting axis.
  • the communication time is the time consumed when the die needs to communicate with other die to obtain the calculation data stored on other die when executing the split operator after the i-th first operator is split.
  • the communication time corresponding to the j-th split axis of the i-th first operator is the time required for the p-th die to obtain the target data from other die.
  • the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th split axis
  • the target data is the data required when the p-th die executes the i-th first operator after the split.
  • the communication time is generally related to the number of target data (i.e., the amount of communication data) and the memory arrangement of the target data (i.e., the memory arrangement of the communication data), and can also be related to factors such as the interconnection topology between Dies and the connection bandwidth.
  • p is an integer greater than or equal to 1 and less than or equal to N.
  • Figure 8 is a flow chart of a method for calculating communication time consumption provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps S21 to S25.
  • Step S21 reading the communication topology and bandwidth.
  • the graph compiler 102 reads the communication topology between multiple Dies in a Chip and the bandwidth data of the communication links from the environment information library 26 .
  • Step S22 reading the segmentation result of the preceding operator.
  • the graph compiler 102 reads the segmentation result of the predecessor operator of the current i-th first operator from the tensor segmentation table, and determines the optimal segmentation axis of the predecessor operator, that is, determines the optimal segmentation axis of the i-1th first operator. If the optimal segmentation axis of the i-1th first operator is the same as the j-th segmentation axis of the current i-th first operator, there is no need for cross-die communication, that is, the communication time consumption is zero.
  • Figure 9 is a schematic diagram of the same segmentation axis of the previous and next operators provided in an embodiment of the present application.
  • operator A is, for example, the i-1th first operator
  • operator B is, for example, the i-th first operator
  • operator A is the predecessor operator of operator B.
  • the segmentation axis of operator A and the segmentation axis of operator B are both the M axis. After operator A is segmented, only (M/2)×N output data are present on the current die.
  • when the current die executes the segmented operator B, it can directly obtain the (M/2)×N input data for computation on the current die without cross-die communication; otherwise, if the segmentation axis of operator B were the N axis, the current die would need to obtain the other half of the M×(N/2) data from other dies, that is, cross-die communication would be required.
  • Step S23 calculating the communication data volume.
  • the graph compiler 102 needs to calculate the communication time. At this time, the graph compiler 102 can first calculate the amount of communication data that needs to be accessed across Dies.
  • the graph compiler 102 also needs to calculate the amount of data that needs to be exchanged with other Dies based on the type and shape of the operator input tensor.
  • Step S24 calculating the memory arrangement of communication data.
  • the embodiments of the present application additionally take the memory layout of the communication data into account.
  • if the data that needs to be exchanged across dies after the i-th first operator is split by the j-th splitting axis lies in non-contiguous memory, multiple transfer tasks are often required to complete the exchange, which brings additional task overhead and therefore a larger communication time.
  • if the memory layout is very scattered and the amount of data is not large, many frequent exchanges of small amounts of data are easily caused, which also brings a larger communication time.
  • Step S25 calculating the communication time consumption.
  • the graph compiler 102 can preliminarily calculate the communication time as the communication data volume divided by the inter-die bandwidth.
  • the communication time consumption can be comprehensively calculated based on the communication data volume and the memory layout information of the communication data.
  • the graph compiler 102 can specifically perform a simple evaluation of the communication time consumption based on some typical packet length test values, or use a more complex Cost Model to calculate the communication time consumption, etc., which is not specifically limited in the embodiments of the present application.
  • the communication time consumption is not only affected by the above-mentioned data volume, but also by other factors such as the size of the transmitted packet and the control signal during the communication process, which is not specifically limited in the embodiments of the present application.
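  • Putting steps S21 to S25 together, a rough estimator of the communication time could look like the following sketch (our own simplification; the chunk count and per-transfer overhead parameters are assumptions, and a real implementation would read the topology from the environment information library or use a full cost model):

```python
from typing import Optional

def estimate_comm_time(prev_axis: Optional[str],
                       cur_axis: str,
                       data_bytes: int,
                       bandwidth_bytes_per_s: float,
                       num_chunks: int = 1,
                       per_transfer_overhead_s: float = 1e-6) -> float:
    """Rough cross-die communication time for one candidate splitting axis.

    Returns 0 when the predecessor operator was split along the same axis,
    since the required data is then already local (step S22). Otherwise the
    time is the data volume divided by the inter-die bandwidth (step S25),
    plus a per-transfer overhead that grows when the data is scattered over
    many non-contiguous chunks (step S24).
    """
    if prev_axis is not None and prev_axis == cur_axis:
        return 0.0
    transfer_time = data_bytes / bandwidth_bytes_per_s
    layout_overhead = num_chunks * per_transfer_overhead_s
    return transfer_time + layout_overhead

# Example: 4 MB fetched from another die over a 100 GB/s link, in one contiguous chunk.
print(estimate_comm_time("M", "N", 4 * 2**20, 100e9))
```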
  • Step S15 whether it is the optimal segmentation axis.
  • the graph compiler 102 obtains the segmentation benefit corresponding to the current j-th segmentation axis based on the calculation of the computational benefit and the communication time consumption in the above-mentioned embodiments corresponding to Figures 7 and 8. Then, the graph compiler 102 compares the segmentation benefit of the current j-th segmentation axis with the segmentation benefits corresponding to the previous j-1 segmentation axes. If the segmentation benefit of the current j-th segmentation axis is greater than the segmentation benefits corresponding to the previous j-1 segmentation axes, the graph compiler 102 can determine that the current j-th segmentation axis is the optimal segmentation axis and execute step S16, otherwise execute step S12.
  • Step S16 recording the optimal segmentation axis.
  • the graph compiler 102 records the current j-th slicing axis as the optimal slicing axis for the current i-th first operator.
  • if a subsequent splitting axis (for example, the (j+1)-th splitting axis) brings a larger splitting benefit, the graph compiler 102 may update the optimal splitting axis, i.e., record the (j+1)-th splitting axis as the optimal splitting axis of the current i-th first operator.
  • Step S17 recording the operator segmentation result.
  • after the splitting benefits of all the splitting axes have been calculated, the optimal slicing axis of the current i-th first operator can be determined. The graph compiler 102 then slices the current i-th first operator along the optimal slicing axis, obtains the slicing result of the i-th first operator, and records the slicing result of the i-th first operator in the tensor slicing table.
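  • Steps S12 to S16 amount to the following selection loop (illustrative Python; the benefit function is a stand-in for the computational-benefit and communication-time calculations described above, and the example gain values are made up):

```python
from typing import Callable, List, Optional

def choose_optimal_axis(axes: List[str],
                        benefit_of: Callable[[str], float]) -> Optional[str]:
    """Return the splitting axis with the largest splitting benefit (steps S12-S16).

    `axes` are the candidate splitting axes of one operator; `benefit_of`
    computes the splitting benefit of one axis (computational benefit minus
    communication time). Returns None when the operator has no splitting axis.
    """
    best_axis, best_gain = None, float("-inf")
    for axis in axes:                             # step S13: pick the next axis
        gain = benefit_of(axis)                   # step S14: compute its benefit
        if gain > best_gain:                      # step S15: best so far?
            best_axis, best_gain = axis, gain     # step S16: record it
    return best_axis

# Example with made-up benefits for the three axes of the MatMul in Figure 3a.
gains = {"M": 70.0, "K": 40.0, "N": 65.0}
print(choose_optimal_axis(list(gains), gains.__getitem__))   # "M"
```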
  • all the method processes in the above step 502 can be specifically executed by the operator segmentation unit 22 in the graph compiler 102.
  • Step S503 Split the first computation graph based on the segmentation results of the M first operators to obtain corresponding N second computation graphs.
  • the graph compiler 102 divides the first computation graph based on the division results of the M first operators to obtain corresponding N second computation graphs.
  • the N second computation graphs correspond one-to-one to the N dies in the chip.
  • N is an integer greater than or equal to 1.
  • Figure 10 is a schematic diagram of a method flow of graph segmentation provided by an embodiment of the present application. As shown in Figure 10, the method may include the following steps S31 to S35.
  • Step S31 create sub DAG (subgraph).
  • the graph compiler 102 constructs a corresponding number of sub-DAGs according to the number of dies in the Chip. It should be understood that each sub-graph is an empty graph initially. Exemplarily, the graph compiler 102 creates N sub-graphs corresponding to N dies.
  • Step S32 traverse the first computation graph.
  • the graph compiler 102 traverses each first operator in the first computation graph, and obtains the segmentation result corresponding to each first operator.
  • Step S33 reading the operator segmentation result.
  • the graph compiler 102 reads the segmentation result of the current first operator. For example, the graph compiler 102 reads the segmentation result of the i-th first operator.
  • the multiple operators obtained after the i-th first operator is segmented are distributed to multiple dies among the N dies (for example, including the above-mentioned p-th die).
  • Step S34 whether it is necessary to insert a communication operator.
  • the graph compiler 102 determines whether it is necessary to insert a communication operator before the i-th first operator after the segmentation based on the segmentation result of the current i-th first operator. If not, step S35 is directly executed. If yes, step S36 is executed. As described above, if the segmentation axis of two adjacent operators changes, the data part required by the post-operator is on other dies. At this time, a communication operator needs to be inserted to obtain the data required on the current die from other dies.
  • the original Tensor is complete data, but the operator after segmentation often only needs to use part of the data for calculation. At this time, it is necessary to insert a slice operator to split the original Tensor into the data required by the operator after segmentation in the subgraph.
  • the current i-th first operator is splitting the Reduction axis, that is, when the i-th first operator is a Reduce calculation (such as ReduceSum/ReduceMax, etc.), it is necessary to reduce the data of multiple Dies, so it is necessary to insert AllReduce (Reduction) operator.
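  • When the optimal splitting axis is a reduction axis, the partial results on the dies must be reduced. The following NumPy sketch (ours, with arbitrary sizes) splits a MatMul along its K (reduction) axis and shows that summing the partial products, i.e. an AllReduce with a sum operation, recovers the full result:

```python
import numpy as np

M, K, N = 8, 6, 4
A = np.random.rand(M, K)
B = np.random.rand(K, N)
C = A @ B

# Splitting along the reduction axis K: each "die" holds half of K and
# computes only a partial product of the full M x N output.
partial_0 = A[:, : K // 2] @ B[: K // 2, :]   # notionally on Die 0
partial_1 = A[:, K // 2 :] @ B[K // 2 :, :]   # notionally on Die 1

# An AllReduce (here simply a sum) across the dies yields the full result.
assert np.allclose(partial_0 + partial_1, C)
```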
  • Step S35 constructing the segmented operator.
  • the graph compiler 102 constructs multiple i-th first operators after segmentation according to the segmentation result of the current i-th first operator.
  • for example, if the i-th first operator is operator A in Figure 3a, the multiple split operators may be operator a1 and operator a2 shown in Figure 3a.
  • after an operator is split, the shapes of its corresponding input/output tensors change, so a new split operator needs to be constructed.
  • the various attributes of the original operator can be copied, but the shape of the input/output tensor needs to be modified to the shape after segmentation.
  • Step S36 construct a communication operator.
  • as described in step S34, if the splitting axis changes between two adjacent operators, part of the data required by the later operator resides on other dies. Thus, if the optimal splitting axes of the (i-1)-th first operator and the i-th first operator are different, the graph compiler 102 can construct a corresponding communication operator before the split i-th first operator.
  • Step S37 add the constructed operator to sub DAG.
  • the graph compiler 102 adds the split operators constructed in step S35 (e.g., multiple second operators obtained after the i-th first operator is split by the optimal split axis) and the inserted operators constructed in step S36 (e.g., communication operators and split operators, etc.) to the corresponding sub DAG. Further, steps S33-S37 are looped until each first operator in the first computation graph is traversed, thereby obtaining N second computation graphs corresponding to N dies one by one. Each second computation graph includes multiple second operators, which include the split first operator, as well as the inserted split operator, communication operator, and reduction operator, etc.
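  • A highly simplified sketch of steps S31 to S37 (our own data model; the operator descriptions, field names and inserted-operator representation are hypothetical) might look like:

```python
from typing import Dict, List, Optional

def split_graph(op_list: List[dict], num_dies: int) -> List[List[dict]]:
    """Illustrative sketch of graph splitting (steps S31-S37).

    Each entry of `op_list` describes one first operator's split result, e.g.
    {"name": "MatMul_A", "axis": "M", "axis_is_reduction": False}.
    Returns one sub-graph (a list of second operators) per die.
    """
    sub_dags: List[List[dict]] = [[] for _ in range(num_dies)]   # S31: empty sub-DAGs
    prev_axis: Optional[str] = None
    for op in op_list:                                           # S32/S33: traverse split results
        for die, sub_dag in enumerate(sub_dags):
            if prev_axis is None:
                # Inputs are still whole tensors: insert a slice operator.
                sub_dag.append({"type": "Slice", "for": op["name"]})
            elif prev_axis != op["axis"]:
                # S34/S36: the splitting axis changed, fetch missing data from other dies.
                sub_dag.append({"type": "Communication", "for": op["name"]})
            # S35: the split operator itself, assigned to this die.
            sub_dag.append({"type": op["name"], "die": die})
            if op.get("axis_is_reduction"):
                # The reduction axis was split: partial results must be reduced.
                sub_dag.append({"type": "AllReduce", "for": op["name"]})
        prev_axis = op["axis"]
    return sub_dags

ops = [{"name": "MatMul_A", "axis": "M", "axis_is_reduction": False},
       {"name": "MatMul_B", "axis": "N", "axis_is_reduction": False}]
for die, graph in enumerate(split_graph(ops, num_dies=2)):
    print(f"Die {die}:", [node["type"] for node in graph])
```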
  • all the method processes in the above step 503 can be specifically executed by the graph segmentation unit 23 in the graph compiler 102.
  • Step S504 distribute the N second computation graphs to N dies for execution.
  • the graph segmentation unit 23 in the graph compiler 102 outputs a subgraph list containing the N second computation graphs to the model compilation unit 24.
  • Each second computation graph corresponds to a bare chip for execution.
  • the model compilation unit 24 outputs a corresponding model list based on the subgraph list to the model deployment unit 25.
  • the model deployment unit 25 deploys each model corresponding to the N second computation graphs to the corresponding multiple bare chips for execution.
  • each method flow in the multi-die-based computing method described in the embodiments of the present application can be implemented based on software, hardware, or a combination thereof.
  • the hardware implementation can include logic circuits, algorithm circuits, or analog circuits.
  • the software implementation can include program instructions, which can be regarded as a software product, stored in a memory, and can be executed by a processor to implement related functions.
  • the embodiments of the present application have improved the graph compiler and optimized the compilation deployment of the DAG graph in the multi-Die chip scenario.
  • the embodiments of the present application can divide each complete calculation in the DAG graph into multiple smaller sub-calculations based on operator segmentation, and divide the original DAG graph into multiple sub-DAG graphs, and then compile these sub-DAG graphs into multiple models, and finally deploy them on multiple Dies in a Chip of a NUMA architecture.
  • the embodiments of the present application select the optimal segmentation scheme for each operator in the DAG graph by comparing the segmentation benefits of different segmentation schemes (i.e., different segmentation axes), and divide each operator into multiple operators according to the optimal segmentation scheme for each operator. These operators are calculated on multiple Dies at the same time to make full use of the computing resources of multiple Dies in the Chip and improve computing efficiency.
  • the embodiments of the present application can bring the following beneficial effects.
  • This application calculates the slicing benefit and selects the optimal slicing axis so that the calculation on each NUMA node only needs to access the storage of this node.
  • the calculation performed by each die only needs to access the storage on this die, so that the implementation of the operator does not need to be aware of cross-die memory access, simplifying operator development.
  • the requirements for inter-die bandwidth and topology are reduced, and even at a lower inter-die bandwidth, higher computing performance can still be achieved. Users can use the chip area for computing as much as possible, thereby improving the computing power density of the chip.
  • the embodiment of the present application also provides an electronic device.
  • Figure 11 is a schematic diagram of the structure of an electronic device provided in the embodiment of the present application.
  • the electronic device 110 at least includes a processor 1101, an input device 1102, an output device 1103 and a memory 1104.
  • the electronic device may also include other common components, which are not described in detail here.
  • the processor 1101, the input device 1102, the output device 1103 and the memory 1104 in the electronic device can be connected via a bus or other means.
  • the electronic device 110 may be a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer or a server, etc., or may be a server cluster or a cloud computing service center composed of multiple servers.
  • the memory 1104 in the electronic device 110 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 1104 may exist independently and be connected to the processor 1101 through a bus.
  • the memory 1104 may also be integrated with the processor 1101.
  • the computer-readable storage medium may be stored in the memory 1104 of the electronic device 110, the computer-readable storage medium is used to store a computer program, the computer program includes program instructions, and the processor 1101 is used to execute the program instructions stored in the computer-readable storage medium.
  • the processor 1101 (or CPU (Central Processing Unit)) is the computing core and control core of the electronic device 110, which is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions to implement the corresponding method flow or corresponding function.
  • the processor 1101 described in the embodiment of the present application can be used to perform a series of processing based on a multi-die computing method, including: obtaining a first computing graph, the first computing graph includes M first operators; dividing the M first operators respectively to obtain the dividing results of the M first operators; dividing the first computing graph based on the dividing results of the M first operators to obtain corresponding N second computing graphs; each of the N second computing graphs includes the divided first operator; N and M are integers greater than or equal to 1; allocating the N second computing graphs to N dies for execution; the N second computing graphs correspond to the N dies one by one, etc.
  • An embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor can execute part or all of the steps of any one of the above method embodiments.
  • An embodiment of the present application also provides a computer program, which includes instructions.
  • when the computer program is executed by a multi-core processor, the processor can execute part or all of the steps of any one of the above method embodiments.
  • it should be understood that, in the several embodiments provided in the present application, the disclosed devices may be implemented in other ways.
  • the device embodiments described above are only illustrative. For example, the division of the above-mentioned units is only a logical function division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server or a network device, etc., and specifically may be a processor in a computer device) to perform all or part of the steps of the methods of each embodiment of the present application.
  • the aforementioned storage medium may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a double data rate synchronous dynamic random access memory (DDR), a flash memory, a random access memory (RAM), or other media that can store program code.


Abstract

Disclosed in embodiments of the present application are a multi-die-based computation method and a related device. The method comprises: acquiring a first computational graph, wherein the first computational graph comprises M first operators; respectively segmenting the M first operators to obtain segmentation results of the M first operators; segmenting the first computational graph on the basis of the segmentation results of the M first operators to obtain N corresponding second computational graphs, wherein each of the N second computational graphs comprises a segmented first operator, and N and M are integers greater than or equal to 1; and allocating the N second computational graphs to N dies for execution, wherein the N second computational graphs have one-to-one correspondence to the N dies. By means of the embodiments of the present application, the computation efficiency of multi-die package chips can be improved.

Description

一种基于多裸片的计算方法及相关设备A computing method based on multiple bare chips and related equipment
本申请要求于2022年09月29日提交中国专利局、申请号为202211198266.2、申请名称为“一种基于多裸片的计算方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the China Patent Office on September 29, 2022, with application number 202211198266.2 and application name “A computing method based on multiple bare chips and related equipment”, all contents of which are incorporated by reference in this application.
技术领域Technical Field
本申请涉及计算机技术领域,尤其涉及一种基于多裸片的计算方法及相关设备。The present application relates to the field of computer technology, and in particular to a multi-die-based computing method and related equipment.
背景技术Background technique
随着摩尔定律的放缓,单个裸片(Die)内的晶体管数量无法再持续快速的增长,但是人工智能(artificial intelligent,AI)芯片的算力需求仍然在高速发展。为解决这个问题,业内提出了多Die封装的技术,即在一个芯片(Chip)内封装多个AI Die,从而提供更大的算力。As Moore's Law slows down, the number of transistors in a single die can no longer continue to grow rapidly, but the computing power demand of artificial intelligence (AI) chips is still growing rapidly. To solve this problem, the industry has proposed a multi-die packaging technology, that is, packaging multiple AI dies in one chip to provide greater computing power.
采用多Die封装技术时,存在两种不同的网络体系结构:统一内存访问(uniform memory access,UMA)和非统一内存访问(non-uniform memory access,UMA)。在UMA架构下,Chip中的每个Die可以平等的使用Chip内所有的存储,基于此,UMA架构通常要求Die间的互连带宽能够和存储器带宽持平,以实现不同位置的存储访问的性能一致。然而,为了提供尽可能高的算力密度,Chip内会尽可能的将面积用于AI算力,使得Chip内的Die间互连可用的面积非常小,并且由于AI芯片通常会使用高宽带存储器(high bandwidth memory,HBM),即Chip内的存储器带宽非常高,这就导致Die间的互连带宽很难提高到和存储器带宽持平的水平。When using multi-die packaging technology, there are two different network architectures: uniform memory access (UMA) and non-uniform memory access (UMA). Under the UMA architecture, each die in the chip can use all the storage in the chip equally. Based on this, the UMA architecture usually requires that the interconnection bandwidth between dies be on par with the memory bandwidth to achieve consistent performance of storage access at different locations. However, in order to provide the highest possible computing power density, the area within the chip will be used for AI computing power as much as possible, making the area available for interconnection between dies within the chip very small, and because AI chips usually use high-bandwidth memory (HBM), that is, the memory bandwidth within the chip is very high, it is difficult to increase the interconnection bandwidth between dies to the same level as the memory bandwidth.
因此,在多Die封装时通常会选择采用NUMA结构。然而在NUMA结构下,容易出现跨Die访问数据的情况,使得访问时延大大增加,从而降低整体的计算效率。如何提升多Die封装芯片的计算效率是亟待解决的问题。Therefore, the NUMA structure is usually chosen for multi-die packaging. However, under the NUMA structure, it is easy to access data across dies, which greatly increases the access latency and reduces the overall computing efficiency. How to improve the computing efficiency of multi-die packaged chips is an urgent problem to be solved.
发明内容Summary of the invention
本申请实施例提供一种基于多裸片的计算方法及相关设备,可以有效提升多Die封装芯片的计算效率。The embodiments of the present application provide a multi-die based computing method and related equipment, which can effectively improve the computing efficiency of multi-die packaged chips.
本申请实施例提供的基于多裸片的计算方法可以由电子设备等执行。电子设备是指能够被抽象为计算机系统的设备,其中,基于多裸片的计算功能的电子设备也可称为基于多裸片的计算装置。基于多裸片的计算装置可以是该电子设备的整机,例如:智能可穿戴设备、智能手机、平板电脑、笔记本电脑、台式电脑、车载计算机或服务器,等等;也可以是由多个整机构成的系统/装置;还可以是该电子设备中的部分器件,例如:基于多裸片的计算功能相关的芯片,如系统芯片(system on a chip,SoC),等等,本申请实施例对此不作具体限定。其中,系统芯片也称为片上系统。The multi-die-based computing method provided in the embodiments of the present application can be executed by an electronic device, etc. An electronic device refers to a device that can be abstracted as a computer system, wherein an electronic device based on the computing function of multiple dies can also be referred to as a computing device based on multiple dies. The computing device based on multiple dies can be a complete machine of the electronic device, such as: a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a car computer or a server, etc.; it can also be a system/device composed of multiple complete machines; it can also be a part of the electronic device, such as: a chip related to the computing function based on multiple dies, such as a system on a chip (SoC), etc., which is not specifically limited in the embodiments of the present application. Among them, the system chip is also called a system on chip.
第一方面,本申请实施例提供了一种基于多裸片的计算方法,所述方法包括:获取第一计算图,所述第一计算图包括M个第一算子;对所述M个第一算子分别进行切分,得到所述M个第一算子的切分结果;基于所述M个第一算子的切分结果对所述第一计算图进行切分,得到对应的N个第二计算图;所述N个第二计算图中的每一个第二计算图包括被切分后的第一算子;N、M为大于或者等于1的整数;将所述N个第二计算图分配至N个裸片上执行;所述N个第二计算图与所述N个裸片一一对应。In a first aspect, an embodiment of the present application provides a computing method based on multiple bare chips, the method comprising: obtaining a first computing graph, the first computing graph comprising M first operators; dividing the M first operators respectively to obtain the dividing results of the M first operators; dividing the first computing graph based on the dividing results of the M first operators to obtain corresponding N second computing graphs; each of the N second computing graphs comprises the divided first operator; N and M are integers greater than or equal to 1; allocating the N second computing graphs to N bare chips for execution; the N second computing graphs correspond one-to-one to the N bare chips.
通过第一方面提供的方法,本申请可以基于算子切分,将计算图中的每一个完整的计算(例如第一算子)切分成多个较小的子计算(例如被切分后的第一算子),并将该多个较小的子计算分配到对应的多个裸片上并行计算。如此,可以充分利用每一个裸片的计算资源,大大提升了多裸片封装芯片的计算效率。Through the method provided in the first aspect, the present application can divide each complete calculation (e.g., the first operator) in the calculation graph into multiple smaller sub-calculations (e.g., the first operator after being divided) based on operator segmentation, and distribute the multiple smaller sub-calculations to the corresponding multiple bare chips for parallel calculation. In this way, the computing resources of each bare chip can be fully utilized, greatly improving the computing efficiency of the multi-bare chip package.
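For illustration only, the overall compile-time flow of the first aspect can be sketched in Python as follows. The names Operator, Graph, split_operator and compile_for_multi_die are assumptions introduced here (this application does not define an API), and the splitting strategy is simplified to an even split of the leading axis:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operator:
    name: str
    shape: tuple                      # output tensor shape, e.g. (M, N)

@dataclass
class Graph:
    operators: List[Operator] = field(default_factory=list)

def split_operator(op: Operator, num_dies: int) -> List[Operator]:
    """Placeholder split: divide the first (batch-like) axis evenly across dies."""
    m = op.shape[0] // num_dies
    return [Operator(f"{op.name}_shard{d}", (m,) + op.shape[1:])
            for d in range(num_dies)]

def compile_for_multi_die(first_graph: Graph, num_dies: int) -> List[Graph]:
    """Split every first operator, then assemble one second graph per die."""
    second_graphs = [Graph() for _ in range(num_dies)]
    for op in first_graph.operators:                         # the M first operators
        for die_id, shard in enumerate(split_operator(op, num_dies)):
            second_graphs[die_id].operators.append(shard)
    return second_graphs                                     # one second graph per die

# Example: a two-operator graph compiled for a 4-die package.
g = Graph([Operator("matmul_0", (1024, 512)), Operator("relu_0", (1024, 512))])
print(len(compile_for_multi_die(g, 4)))                      # -> 4
```

In an actual graph compiler, split_operator would be driven by the per-operator optimal split axis described below rather than a fixed leading-axis rule.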
在一种可能的实施方式中,所述对所述M个第一算子分别进行切分,得到所述M个第一算子的切分结果,包括:确定所述M个第一算子中的第i个第一算子的最优切分轴;基于所述第i个第一算子的最优切分轴,对所述第i个第一算子进行切分,获得所述第i个第一算子的切分结果;i为大于或者等于1,且小于或者等于M的整数。In a possible implementation, splitting the M first operators respectively to obtain the splitting results of the M first operators includes: determining the optimal splitting axis of the i-th first operator among the M first operators; splitting the i-th first operator based on the optimal splitting axis of the i-th first operator to obtain the splitting result of the i-th first operator; i is an integer greater than or equal to 1 and less than or equal to M.
在本申请实施例中,原始计算图中的每个算子(例如第一算子)可以对应有多个切分轴。基于此,本申请实施例针对每个第一算子,首先需要从其对应的多个切分轴中确定一个最优切分轴,然后再基于该最优切分轴对该第一算子进行切分,从而确保每个第一算子被切分后得到的多个算子在多个裸片上的计算效率。 In the embodiment of the present application, each operator (e.g., the first operator) in the original computation graph may correspond to multiple split axes. Based on this, the embodiment of the present application first needs to determine an optimal split axis from the multiple split axes corresponding to each first operator, and then split the first operator based on the optimal split axis, thereby ensuring the computational efficiency of the multiple operators obtained after each first operator is split on multiple dies.
在一种可能的实施方式中,所述第i个第一算子包括K个切分轴;所述确定所述M个第一算子中的第i个第一算子的最优切分轴,包括:确定所述第i个第一算子包括的所述K个切分轴各自对应的计算收益和通信耗时;基于所述K个切分轴各自对应的所述计算收益和所述通信耗时的差值,确定所述K个切分轴各自对应的切分收益;其中,所述切分收益最大的切分轴为所述第i个第一算子的最优切分轴;K为大于或者等于1的整数。In a possible implementation, the i-th first operator includes K segmentation axes; determining the optimal segmentation axis of the i-th first operator among the M first operators includes: determining the computational benefits and communication time corresponding to each of the K segmentation axes included in the i-th first operator; determining the segmentation benefits corresponding to each of the K segmentation axes based on the difference between the computational benefits and the communication time corresponding to each of the K segmentation axes; wherein the segmentation axis with the largest segmentation benefit is the optimal segmentation axis of the i-th first operator; K is an integer greater than or equal to 1.
在本申请实施例中,可以基于当前算子中每个切分轴对应的计算收益(即计算耗时的减少量)和通信耗时,计算出每个切分轴实际可以带来的切分收益。然后,从中确定收益最大的切分轴作为当前算子的最优切分轴,以保证每个算子切分后分配到多个裸片上执行时的计算效率。此外,在一些可能的实施例中,K也可以等于0,即一个算子可以没有切分轴。显然,在算子没有切分轴的情况下,该算子将无法进行切分处理,此时可以将该算子分别部署到N个裸片上,以使得N个裸片进行相同的计算。相应的,若该算子的前置算子已经做了切分处理,那么N个裸片在执行该算子时需要通过跨裸片的通信获取计算需要的数据。In an embodiment of the present application, the actual slicing benefit that each slicing axis can bring can be calculated based on the computational benefit (i.e., the reduction in computational time) and communication time corresponding to each slicing axis in the current operator. Then, the slicing axis with the largest benefit is determined as the optimal slicing axis of the current operator to ensure the computational efficiency of each operator when it is distributed to multiple chips for execution after slicing. In addition, in some possible embodiments, K may also be equal to 0, that is, an operator may not have a slicing axis. Obviously, in the case where the operator has no slicing axis, the operator will not be able to perform slicing processing. At this time, the operator can be deployed on N chips respectively so that the N chips perform the same calculation. Correspondingly, if the preceding operator of the operator has been slicing, then the N chips need to obtain the data required for calculation through cross-chip communication when executing the operator.
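A minimal, hedged sketch of this selection step is given below; the computation gain and communication time terms are elaborated in the following paragraphs, and the numbers in the example are invented for illustration:

```python
def compute_gain(t_single: float, num_dies: int) -> float:
    """Ideal reduction in compute time when one operator's work is spread over N dies."""
    return t_single - t_single / num_dies

def pick_optimal_axis(axes, t_single, num_dies, comm_cost):
    """axes: the K candidate split axes; comm_cost: axis name -> cross-die transfer time."""
    best_axis, best_benefit = None, float("-inf")
    for axis in axes:                              # K may also be 0 (operator cannot be split)
        benefit = compute_gain(t_single, num_dies) - comm_cost[axis]
        if benefit > best_benefit:
            best_axis, best_benefit = axis, benefit
    return best_axis, best_benefit

# Example: 8 ms on one die, 4 dies; splitting on M needs no cross-die data.
print(pick_optimal_axis(["M", "K", "N"], 8.0, 4, {"M": 0.0, "K": 3.0, "N": 1.5}))
# -> ('M', 6.0)
```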
在一种可能的实施方式中,所述第i个第一算子的第j个切分轴对应的计算收益为第一计算耗时与第二计算耗时的差值;所述第一计算耗时为单个裸片执行所述第i个第一算子所需的时间,所述第二计算耗时为多个裸片并行执行被所述第j个切分轴切分后的所述第i个第一算子所需的时间;j为大于或者等于1,且小于或者等于K的整数。In one possible implementation, the computational benefit corresponding to the j-th splitting axis of the i-th first operator is the difference between the first computational time and the second computational time; the first computational time is the time required for a single die to execute the i-th first operator, and the second computational time is the time required for multiple die to execute in parallel the i-th first operator after being split by the j-th splitting axis; j is an integer greater than or equal to 1 and less than or equal to K.
在本申请实施例中,可以将一个完整数据的计算(例如第一算子)被当前切分轴切分后分布到多个裸片上后计算耗时的减少量作为当前切分轴对应的计算收益,从而为后续最优切分轴的确定提供支持,保证算子切分后的计算效率。例如,第i个第一算子在单个裸片上的计算耗时为T,第i个第一算子被其第j个切分轴切分后分布到4个裸片上计算时的计算耗时理论上可以为T/4,那么该第j个切分轴的计算收益为(T-T/4)。In the embodiment of the present application, the reduction in the computation time of a complete data calculation (e.g., the first operator) after being split by the current split axis and distributed to multiple dies can be used as the computation benefit corresponding to the current split axis, thereby providing support for the subsequent determination of the optimal split axis and ensuring the computation efficiency after the operator is split. For example, the computation time of the i-th first operator on a single die is T, and the computation time of the i-th first operator after being split by its j-th split axis and distributed to 4 dies can theoretically be T/4, then the computation benefit of the j-th split axis is (T-T/4).
在一种可能的实施方式中,所述第i个第一算子的第j个切分轴对应的通信耗时为第p个裸片从其他裸片上获取目标数据所需的时间;所述第p个裸片为所述第i个第一算子被所述第j个切分轴切分后对应分配的多个裸片中的一个;所述目标数据为所述第p个裸片执行被所述第j个切分轴切分后的所述第i个第一算子时所需的数据;所述通信耗时与所述目标数据的数量以及所述目标数据的内存排布相关;p为大于或者等于1,且小于或者等于N的整数。In a possible implementation, the communication time corresponding to the j-th splitting axis of the i-th first operator is the time required for the p-th die to obtain target data from other die; the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th splitting axis; the target data is the data required for the p-th die to execute the i-th first operator after being split by the j-th splitting axis; the communication time is related to the amount of the target data and the memory layout of the target data; p is an integer greater than or equal to 1 and less than or equal to N.
在本申请实施例中,可以基于被切分轴切分后的算子在对应裸片上计算时,需要从其他裸片上获取计算所需数据所消耗的时间,确定每个切分轴对应的通信耗时。显然,切分轴对应的通信耗时越小或者为零,能够带来更大的切分收益。如此,本申请可以基于通信耗时这一参数,尽可能让最优切分轴切分后分配到每个裸片上的计算仅需要访问本裸片中的存储,避免了跨裸片访问数据带来的延时。基于此,即使在较低的Die间带宽下,本申请仍然可以达到较高的计算性能,从而可以将多裸片封装芯片的面积尽可能应用于计算,提升多裸片封装芯片的算力密度。需要说明的是,除了上一个算子没有切分轴(即上一个算子没有进行切分处理)的情况外,一般情况下,若当前算子的当前切分轴与上一个算子的最优切分轴不同,那么裸片在执行被切分后的当前算子时往往需要跨裸片通信以从其他裸片中获取计算数据。In an embodiment of the present application, the communication time corresponding to each split axis can be determined based on the time consumed by the operator after being split by the split axis to obtain the data required for calculation from other bare chips when calculating on the corresponding bare chip. Obviously, the smaller or zero the communication time corresponding to the split axis is, the greater the splitting benefit can be brought. In this way, the present application can make the calculation allocated to each bare chip after the optimal split axis is split based on the parameter of communication time, so that the calculation only needs to access the storage in the bare chip as much as possible, avoiding the delay caused by accessing data across bare chips. Based on this, even at a lower inter-die bandwidth, the present application can still achieve higher computing performance, so that the area of the multi-bare chip package chip can be applied to the calculation as much as possible, and the computing power density of the multi-bare chip package chip can be improved. It should be noted that, except for the case where the previous operator has no split axis (that is, the previous operator has not been split), in general, if the current split axis of the current operator is different from the optimal split axis of the previous operator, then the bare chip often needs to communicate across bare chips to obtain computing data from other bare chips when executing the current operator after being split.
在一种可能的实施方式中,所述第i个第一算子的切分结果包括:所述第i个第一算子的输出张量列表,所述第i个第一算子对应的一个或多个输入张量和输出张量的原始形状,所述第i个第一算子对应的一个或多个输入张量和/或输出张量被所述最优切分轴切分后的形状,以及所述第i个第一算子被所述最优切分轴切分后对应分配的一个或多个裸片。In one possible implementation, the segmentation result of the i-th first operator includes: a list of output tensors of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator, the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis, and one or more bare chips allocated to the i-th first operator after being segmented by the optimal segmentation axis.
在本申请实施例中,每个算子被切分后可以包括一系列切分结果(例如算子的输出张量列表、算子切分前的输入/输出张量形状,算子切分后的输入/输出张量形状,以及算子切分后分配给哪些裸片,等等),这些切分结果可以为后续对原始计算图(例如第一计算图)进行图切分提供有效支撑,从而快速、高效地构造出与多个裸片一一对应的多个第二计算图,提升多裸片的计算效率。In an embodiment of the present application, each operator may include a series of split results after being split (such as a list of output tensors of the operator, the input/output tensor shapes before the operator is split, the input/output tensor shapes after the operator is split, and which bare chips are assigned to after the operator is split, etc.). These split results can provide effective support for subsequent graph splitting of the original computational graph (such as the first computational graph), thereby quickly and efficiently constructing multiple second computational graphs corresponding to multiple bare chips one by one, thereby improving the computing efficiency of multiple bare chips.
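One possible, purely illustrative encoding of such a segmentation result is shown below; the field names are assumptions rather than a data structure defined by this application:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SplitResult:
    output_tensors: List[str]                      # output tensor list of the first operator
    original_shapes: Dict[str, Tuple[int, ...]]    # tensor name -> shape before splitting
    split_shapes: Dict[str, Tuple[int, ...]]       # tensor name -> shape after splitting
    assigned_dies: List[int]                       # dies the split operator is assigned to

# Example for the MatMul of Fig. 3a, split on the M axis across two dies.
result = SplitResult(
    output_tensors=["C"],
    original_shapes={"A": (1024, 256), "B": (256, 512), "C": (1024, 512)},
    split_shapes={"A": (512, 256), "C": (512, 512)},   # B is untouched by an M-axis split
    assigned_dies=[0, 1],
)
```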
在一种可能的实施方式中,所述N个第二计算图中的第p个第二计算图包括多个第二算子;所述多个第二算子包括被所述最优切分轴切分后的所述第i个第一算子;所述第p个第二计算图为分配到第p个裸片上执行的第二计算图。In one possible implementation, the pth second computation graph among the N second computation graphs includes multiple second operators; the multiple second operators include the i-th first operator after being split by the optimal splitting axis; and the p-th second computation graph is a second computation graph allocated to be executed on the p-th die.
在本申请实施例中,基于算子切分和图切分后得到的每个第二计算图中均可以包括被切分后的第一算子。如此,实现了将每一个完整的计算切分然后分布到多个裸片上,每个裸片只需执行原本第一算子中部分数据的计算,有效提升了多裸片封装芯片的计算效率。In the embodiment of the present application, each second computation graph obtained after operator segmentation and graph segmentation can include the segmented first operator. In this way, each complete computation is segmented and then distributed to multiple dies, and each die only needs to perform the computation of part of the data in the original first operator, effectively improving the computation efficiency of the multi-die packaged chip.
在一种可能的实施方式中,所述第p个第二计算图中的所述多个第二算子还包括切分算子、通信算子 和归约算子中的一个或多个;其中,所述切分算子,用于获取被所述最优切分轴切分后的所述第i个第一算子的输入张量;所述通信算子,用于在所述第i个第一算子的所述最优切分轴与第i-1个第一算子的所述最优切分轴不同时,从其他裸片上获取被所述最优切分轴切分后的所述第i个第一算子的输入张量;所述归约算子,用于在所述第i个第一算子的最优切分轴为归约轴时,将对应的多个裸片上的数据进行归约。In a possible implementation, the plurality of second operators in the pth second computation graph further include a segmentation operator, a communication operator and one or more of the reduction operators; wherein the slicing operator is used to obtain the input tensor of the i-th first operator after being sliced by the optimal slicing axis; the communication operator is used to obtain the input tensor of the i-th first operator after being sliced by the optimal slicing axis from other bare chips when the optimal slicing axis of the i-th first operator is different from the optimal slicing axis of the i-1-th first operator; the reduction operator is used to reduce the data on the corresponding multiple bare chips when the optimal slicing axis of the i-th first operator is the reduction axis.
在本申请实施例中,除了被切分后的第一算子,还需要基于不同的实际情况在第二计算图中构造其他相应的算子,例如用于跨裸片获取计算数据的通信算子等,从而保证被切分后的第一算子在裸片上的可靠执行。In an embodiment of the present application, in addition to the first operator after segmentation, other corresponding operators need to be constructed in the second computational graph based on different actual situations, such as a communication operator for obtaining computational data across bare chips, etc., so as to ensure the reliable execution of the first operator after segmentation on the bare chip.
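A hedged sketch of the rule set implied above follows, with placeholder operator names (SLICE, COMMUNICATE, REDUCE) that are assumptions for illustration only:

```python
def auxiliary_ops(prev_axis, cur_axis, cur_axis_is_reduction):
    """Return the helper operators a per-die second graph may need around one split operator."""
    ops = ["SLICE"]                          # take this die's shard of the input tensor
    if prev_axis is not None and prev_axis != cur_axis:
        ops.append("COMMUNICATE")            # fetch missing input shards from other dies
    if cur_axis_is_reduction:
        ops.append("REDUCE")                 # combine the partial results across dies
    return ops

# Previous operator was split on M, this one on the reduction axis K.
print(auxiliary_ops(prev_axis="M", cur_axis="K", cur_axis_is_reduction=True))
# -> ['SLICE', 'COMMUNICATE', 'REDUCE']
```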
第二方面,本申请实施例提供了一种电子设备,所述电子设备包括N个裸片,N为大于或者等于1的整数。所述电子设备用于实现上述第一方面提供的任意一种基于多裸片的计算方法中相应的功能。In a second aspect, an embodiment of the present application provides an electronic device, the electronic device comprising N bare chips, where N is an integer greater than or equal to 1. The electronic device is used to implement the corresponding functions of any one of the multi-bare-chip-based computing methods provided in the first aspect above.
第三方面,本申请实施例提供一种电子设备,该电子设备中包括处理器,处理器被配置为支持该电子设备执行第一方面提供的任意一种方法中相应的功能。该电子设备还可以包括存储器,存储器用于与处理器耦合,其保存该电子设备必要的程序指令和数据。该电子设备还可以包括通信接口,用于该电子设备与其他设备或通信网络通信。In a third aspect, an embodiment of the present application provides an electronic device, the electronic device includes a processor, and the processor is configured to support the electronic device to perform the corresponding functions in any one of the methods provided in the first aspect. The electronic device may also include a memory, the memory is used to couple with the processor, and the memory stores the necessary program instructions and data of the electronic device. The electronic device may also include a communication interface for the electronic device to communicate with other devices or a communication network.
第四方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述第一方面提供的任意一种基于多裸片的计算方法流程。In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements any one of the multi-die-based computing method processes provided in the first aspect.
第五方面,本申请实施例提供了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述第一方面提供的任意一种基于多裸片的计算方法流程。In a fifth aspect, an embodiment of the present application provides a computer program, which includes instructions. When the computer program is executed by a computer, the computer can execute any one of the multi-die-based computing method processes provided in the first aspect above.
第六方面,本申请实施例提供了一种芯片,该芯片包括处理器和通信接口,所述处理器用于从该通信接口调用并运行指令,当该处理器执行所述指令时,使得该芯片执行上述第一方面提供的任意一种基于多裸片的计算方法流程。In the sixth aspect, an embodiment of the present application provides a chip, which includes a processor and a communication interface, and the processor is used to call and run instructions from the communication interface. When the processor executes the instructions, the chip executes any one of the multi-bare-die based computing method processes provided in the first aspect above.
第七方面,本申请实施例提供了一种芯片系统,该芯片系统包括上述第二方面或第三方面中任意一项所述的电子设备,用于实现上述第一方面提供的任意一种基于多裸片的计算方法流程所涉及的功能。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存基于多裸片的计算方法必要的程序指令和数据。该芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。In a seventh aspect, an embodiment of the present application provides a chip system, which includes the electronic device described in any one of the second aspect or the third aspect, and is used to implement the functions involved in any one of the multi-die-based computing method processes provided in the first aspect. In a possible design, the chip system also includes a memory, which is used to store program instructions and data necessary for the multi-die-based computing method. The chip system can be composed of chips, or it can include chips and other discrete devices.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1是本申请实施例提供的一种基于多Die芯片的系统架构示意图。FIG1 is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application.
图2是本申请实施例提供的一种图编译器的结构示意图。FIG2 is a schematic diagram of the structure of a graph compiler provided in an embodiment of the present application.
图3a是本申请实施例提供的一种算子切分的示意图。FIG. 3a is a schematic diagram of operator segmentation provided in an embodiment of the present application.
图3b是本申请实施例提供的另一种算子切分的示意图。FIG3 b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
图4是本申请实施例提供的一种基于多裸片的计算方法的流程示意图。FIG4 is a flow chart of a multi-die-based computing method provided in an embodiment of the present application.
图5是本申请实施例提供的一种算子切分方法的流程示意图。FIG5 is a flow chart of an operator segmentation method provided in an embodiment of the present application.
图6为本申请实施例提供的一种切分收益的计算示意图。FIG6 is a schematic diagram of calculating a split profit provided in an embodiment of the present application.
图7是本申请实施例提供的一种计算耗时的计算方法示意图。FIG. 7 is a schematic diagram of a method for calculating time consumption provided in an embodiment of the present application.
图8是本申请实施例提供的一种通信耗时的计算方法流程示意图。FIG8 is a flow chart of a method for calculating communication time consumption provided in an embodiment of the present application.
图9是本申请实施例提供的一种前后算子的切分轴相同的示意图。FIG. 9 is a schematic diagram showing that the segmentation axes of the front and rear operators are the same, provided by an embodiment of the present application.
图10是本申请实施例提供的一种图切分的方法流程示意图。FIG. 10 is a flow chart of a graph segmentation method provided in an embodiment of the present application.
图11是本申请实施例提供的一种电子设备的结构示意图。FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例进行描述。The embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application.
本申请的说明书和权利要求书及所述附图中的术语“第一”和“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。需要说明的是,当一个元件被称作与另一个或多个元件“耦合”、“连接”时,它可以是一个元件直接连接到另一个或多个元件,也可以是间接连接至该另一个或多个元件。The terms "first" and "second" in the specification and claims of the present application and the drawings are used to distinguish different objects, rather than to describe a specific order. In addition, the terms "including" and "having" and any variation thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to these processes, methods, products or devices. It should be noted that when an element is referred to as being "coupled" or "connected" to another or more elements, it can be an element directly connected to another or more elements, or it can be indirectly connected to the other or more elements.
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”, 用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。It should be understood that in this application, "at least one (item)" means one or more, and "more than one" means two or more. "and/or", Used to describe the association relationship of associated objects, indicating that there can be three relationships. For example, "A and/or B" can mean: only A exists, only B exists, and A and B exist at the same time, where A and B can be singular or plural. The character "/" generally indicates that the previous and next associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can be single or plural.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本邻域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference to "embodiments" herein means that a particular feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various locations in the specification does not necessarily refer to the same embodiment, nor is it an independent or alternative embodiment that is mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在处理器上运行的应用和处理器都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。The terms "component", "module", "system", etc. used in this specification are used to represent computer-related entities, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to, a process, a processor, an object, an executable file, an execution thread, a program and/or a computer running on a processor. By way of illustration, both the application running on the processor and the processor can be a component. One or more components may reside in a process and/or an execution thread, and the component may be located on a computer and/or distributed between two or more computers. In addition, these components may be executed from various computer-readable media having various data structures stored thereon. Components may, for example, communicate through local and/or remote processes according to signals having one or more data packets (e.g., data from two components interacting with another component between a local system, a distributed system and/or a network, such as the Internet interacting with other systems through signals).
首先,对本申请中的部分用语进行解释说明,以便于本邻域技术人员理解。First, some of the terms in this application are explained to facilitate understanding by technicians in this field.
(1)裸片(Die),指的是芯片未封装前的晶粒,是从硅晶元(Wafer)上用激光切割而成的小片,每一个Die就是一个独立的功能芯片。Die后续将会被作为一个单位封装起来成为常见的芯片。为了满足现如今人工智能芯片的算力需求,业内提出了将多个Die封装在一个芯片的技术方案,从而提供更大的算力。(1) Die refers to the crystal grain of the chip before it is packaged. It is a small piece cut from a silicon wafer by laser. Each die is an independent functional chip. Dies will be packaged as a unit to become common chips. In order to meet the computing power requirements of today's artificial intelligence chips, the industry has proposed a technical solution to package multiple dies in one chip to provide greater computing power.
(2)算子,是一个函数空间到函数空间上的映射,广义上的算子可以推广到任何空间,如内积空间等。广义的讲,对任何函数进行某一项操作都可以认为是一个算子,例如矩阵相乘(matmul),甚至包括求幂次,开方都可以认为是一个算子。(2) An operator is a mapping from a function space to another function space. In a broad sense, operators can be extended to any space, such as inner product space. In a broad sense, any operation on any function can be considered an operator, such as matrix multiplication (matmul), even exponentiation and square root can be considered an operator.
(3)有向无环图(direct acyclic graph,DAG),即计算图。在图论中,如果一个有向图无法从某个节点出发经过若干条边回到该节点,则这个图是一个有向无环图。(3) Directed acyclic graph (DAG), also known as computational graph. In graph theory, if a directed graph cannot return to a node through several edges from a certain node, then the graph is a directed acyclic graph.
为了便于理解本申请实施例,下面将进一步分析并提出本申请所具体要解决的技术问题。如上所述,在采用多Die封装技术时,通常会采用UNMA架构。但是,UNMA架构通常会带来跨Die访问数据的问题,导致访问时延大大增加,从而降低整体的计算效率。在现有技术中,关于在UNMA架构下提升多Die封装芯片的计算效率的技术,包括多种方案。以下示例性的列举如下常用的多芯片模块-图形处理器(multi-chip module-graphics processing unit,MCM-GPU)。In order to facilitate the understanding of the embodiments of the present application, the technical problems that the present application specifically aims to solve will be further analyzed and proposed below. As mentioned above, when multi-die packaging technology is adopted, the UNMA architecture is usually adopted. However, the UNMA architecture usually brings problems of accessing data across dies, resulting in a significant increase in access latency, thereby reducing the overall computing efficiency. In the prior art, there are a variety of technologies for improving the computing efficiency of multi-die packaged chips under the UNMA architecture. The following exemplarily lists the following commonly used multi-chip module-graphics processing unit (MCM-GPU).
MCM-GPU可以包括以下几个技术点:MCM-GPU can include the following technical points:
(1)从二级缓存(L2Cahce)中划出一部分空间作为L1.5Cache,专门负责对远程(remote)Die的数据缓存,提升远程Die的数据访问性能。(1) A portion of the space in the L2 cache (L2Cahce) is allocated as L1.5 Cache, which is specifically responsible for caching data from remote Dies and improving the data access performance of remote Dies.
(2)分布式批处理(distributed and batched)计算线程数组(compute thread array,CTA),对CTA进行分组调度,同组CTA调度到同一个GPU模块(GPU Modules,GPM),提升cache命中率。(2) Distributed and batched compute thread array (CTA), group scheduling of CTAs, scheduling CTAs in the same group to the same GPU module (GPU Modules, GPM) to improve cache hit rate.
(3)First-Touch Mapping,页表首次被访问时,将其映射到访问它的GPM所在的物理内存。(3) First-Touch Mapping: When a page table is accessed for the first time, it is mapped to the physical memory where the GPM that accesses it is located.
相应的,MCM-GPU主要存在以下缺点:Accordingly, MCM-GPU has the following main disadvantages:
(1)降低了L2Cache的规格,对各GPM的cache命中率会产生影响。(1) The specifications of L2Cache are reduced, which will affect the cache hit rate of each GPM.
(2)MCM-GPU对Die间带宽要求较高,Die间带宽会对计算性能产生极大的影响。但是,由于其扩展性不好,随着Die数量的继续增加,很难实现Die间带宽的同步扩展。最终使得MCM-GPU的计算效率也难以有效提升。(2) MCM-GPU has high requirements for inter-die bandwidth, which has a great impact on computing performance. However, due to its poor scalability, it is difficult to achieve synchronous expansion of inter-die bandwidth as the number of dies continues to increase. Ultimately, it is difficult to effectively improve the computing efficiency of MCM-GPU.
因此,为了解决当前多Die芯片计算效率低的问题,本申请实际要解决的技术问题包括如下方面:基于现有的硬件设备和NUMA架构,通过对计算图的编译,将原本计算图中的每一个完整的算子切分成多个,并将其分布到对应的多个Die上进行计算,以充分利用每一个Die的计算资源,大大提升了多裸片封装芯片的计算效率。与此同时,通过最优切分轴的选择,使得算子切分后每个Die上分配到的计算所需的数据基本在该Die的本地存储中,尽可能的避免跨Die访问数据。如此,本申请实施例可以实现即使在较 低的Die间互连带宽下,可以有效提升整个多Die芯片的计算效率。Therefore, in order to solve the problem of low computing efficiency of current multi-Die chips, the technical problems that this application actually aims to solve include the following aspects: Based on the existing hardware devices and NUMA architecture, by compiling the computational graph, each complete operator in the original computational graph is split into multiple ones, and distributed to the corresponding multiple Dies for calculation, so as to make full use of the computing resources of each Die, greatly improving the computing efficiency of multi-die packaged chips. At the same time, by selecting the optimal splitting axis, the data required for the calculation allocated to each Die after the operator is split is basically in the local storage of the Die, avoiding cross-Die data access as much as possible. In this way, the embodiments of the present application can achieve even in a relatively large Under low inter-die interconnection bandwidth, the computing efficiency of the entire multi-die chip can be effectively improved.
请参阅图1,图1是本申请实施例提供的一种基于多Die芯片的系统架构示意图。如图1所示,本申请实施例中主要可以应用于AI模型的训练和推理场景。AI开发人员可以使用TensorFow、Pytorch、MindSpore等AI框架开发训练/推理脚本,然后通过AI框架触发训练/推理的执行。如图1所示,该系统架构可以包括AI框架101、图编译器102、内存管理103、操作编译器104和芯片105。其中,芯片105为多Die芯片,可以包括图1所示的Die 0、Die 1、Die 2、Die 3等多个裸片。Please refer to Figure 1, which is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application. As shown in Figure 1, the embodiment of the present application can be mainly applied to the training and reasoning scenarios of AI models. AI developers can use AI frameworks such as TensorFow, Pytorch, MindSpore, etc. to develop training/reasoning scripts, and then trigger the execution of training/reasoning through the AI framework. As shown in Figure 1, the system architecture may include an AI framework 101, a graph compiler 102, a memory management 103, an operation compiler 104, and a chip 105. Among them, chip 105 is a multi-Die chip, which may include multiple bare chips such as Die 0, Die 1, Die 2, Die 3, etc. shown in Figure 1.
AI框架101,用于将用户的计算构造为DAG图。然后,AI框架101将DAG图发送到图编译器102中进行编译。应理解,在DAG图中通过节点表达计算,通过节点与节点之间的边表达计算间传递的数据或计算间的依赖关系,DAG图中的每个节点即为一个算子。The AI framework 101 is used to construct the user's calculation into a DAG graph. Then, the AI framework 101 sends the DAG graph to the graph compiler 102 for compilation. It should be understood that in the DAG graph, the calculation is expressed by nodes, and the data transferred between calculations or the dependency between calculations are expressed by the edges between nodes. Each node in the DAG graph is an operator.
图编译器102,用于以一定的拓扑序对DAG图中的算子进行编译,最终将DAG图上的所有节点编译为任务(Task)列表下发给内存管理(runtime)103。任务列表中可以包括多个计算任务。The graph compiler 102 is used to compile the operators in the DAG graph in a certain topological order, and finally compile all nodes on the DAG graph into a task list and send it to the memory management (runtime) 103. The task list may include multiple computing tasks.
内存管理103,用于将任务列表下发到芯片105内的所有Die上。The memory management 103 is used to send the task list to all the dies in the chip 105 .
芯片105,用于通过其中的多个Die执行下发的任务列表。Chip 105 is used to execute the sent task list through multiple Dies therein.
显然,在单Die封装场景下(即芯片105中只包括一个Die),内存管理103只需要简单的将任务列表下发到芯片105内唯一的Die。但在多DIE封装场景下,内存管理103需要将任务列表下发到芯片105内的多个Die上。在本申请的一些实施例中,图编译器102可以将DAG图中每个完整的算子进行切分,基于该切分结果,将原本的DAG图切分为多个子图,然后再通过内存管理103分配到对应的多个Die上执行,从而大大提升多Die芯片的计算效率。Obviously, in a single-die packaging scenario (i.e., chip 105 includes only one die), memory management 103 only needs to simply send the task list to the only die in chip 105. However, in a multi-die packaging scenario, memory management 103 needs to send the task list to multiple dies in chip 105. In some embodiments of the present application, graph compiler 102 can split each complete operator in the DAG graph, and based on the split result, split the original DAG graph into multiple subgraphs, which are then distributed to the corresponding multiple dies for execution through memory management 103, thereby greatly improving the computing efficiency of multi-die chips.
需要说明的是,本申请的落地产品可以是数据中心的训练/推理设备,部署在数据中心的训练/推理服务器上,或者其他任何可能的设备,例如路灯上的摄像头等具备AI功能(例如人脸识别等),基于DAG图进行计算的边缘设备,其中可以包括。本申请主要通过对DAG图的编译过程的改进,实现将计算编译部署到芯片内的多Die上。It should be noted that the landing product of this application can be a training/inference device in a data center, deployed on a training/inference server in a data center, or any other possible device, such as a camera on a street lamp, etc., with AI functions (such as face recognition, etc.), edge devices that perform calculations based on DAG graphs, which may include. This application mainly realizes the deployment of calculation compilation to multiple Dies in a chip by improving the compilation process of the DAG graph.
进一步地,请参阅图2,图2是本申请实施例提供的一种图编译器的结构示意图。如图2所示,该图编译器102中可以包括图排序单元21、算子切分单元22、图切分单元23、模型编译单元24、模型部署单元25。可选地,该图编译器102还可以包括环境信息库26和算子信息库27。如图2所示,AI框架101可以与图编译器102中的图排序单元21连接,模型部署单元25可以与内存管理103连接。Further, please refer to Figure 2, which is a structural diagram of a graph compiler provided in an embodiment of the present application. As shown in Figure 2, the graph compiler 102 may include a graph sorting unit 21, an operator segmentation unit 22, a graph segmentation unit 23, a model compilation unit 24, and a model deployment unit 25. Optionally, the graph compiler 102 may also include an environment information library 26 and an operator information library 27. As shown in Figure 2, the AI framework 101 can be connected to the graph sorting unit 21 in the graph compiler 102, and the model deployment unit 25 can be connected to the memory management 103.
图排序单元21,用于通过拓扑排序将DAG转换为算子列表,算子列表中包括多个算子。算子列表中算子的顺序表达了算子执行的先后顺序。The graph sorting unit 21 is used to convert the DAG into an operator list by topological sorting, wherein the operator list includes multiple operators. The order of the operators in the operator list expresses the order in which the operators are executed.
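For reference, a minimal topological sort (Kahn's algorithm) illustrates how a DAG can be flattened into an ordered operator list; this is a generic sketch, not the graph sorting unit's actual implementation:

```python
from collections import deque

def topo_sort(nodes, edges):
    """nodes: list of node names; edges: dict mapping a node to its successor nodes."""
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for succ in edges.get(n, []):
            indegree[succ] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for succ in edges.get(n, []):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return order            # an order shorter than nodes would indicate a cycle

print(topo_sort(["A", "B", "C", "D"], {"A": ["B", "C"], "B": ["D"], "C": ["D"]}))
# -> ['A', 'B', 'C', 'D']
```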
算子切分单元22,用于对DAG图中的多个算子分别进行算子切分的处理,生成每个算子的切分结果。具体地,如图2所示,算子切分单元22在进行算子切分时,可以通过环境信息库26以及算子信息库27分别读取相应的环境信息和算子信息。其中,环境信息库26中的环境信息可以包括:芯片内的裸片数量、各个裸片的存储器规格、裸片间的互连拓扑和带宽等。其中,算子信息库27中的算子信息可以包括:算子类型、输入/输出列表,每个输入/输出的轴信息。The operator segmentation unit 22 is used to perform operator segmentation processing on multiple operators in the DAG graph respectively, and generate segmentation results for each operator. Specifically, as shown in FIG2 , when the operator segmentation unit 22 performs operator segmentation, it can read the corresponding environment information and operator information through the environment information library 26 and the operator information library 27 respectively. Among them, the environment information in the environment information library 26 may include: the number of dies in the chip, the memory specifications of each die, the interconnection topology and bandwidth between the dies, etc. Among them, the operator information in the operator information library 27 may include: operator type, input/output list, and axis information of each input/output.
如上所述,相较于常规的环境信息库,本申请中的环境信息库26在环境信息中需要将原单芯片内单Die,修改为单芯片多Die,并增加对Die间互连网络的拓扑结构以及互连带宽的描述等。此外,本申请中的算子信息库27在算子信息中需要对每一个算子增加输入/输出的轴信息的描述。应理解,轴信息是算子的不同输入/输出张量(Tensor)间的切分关系的表达。同一个轴的所有输入/输出,必须采用相同的切分方式(即切分轴)进行切分。As described above, compared to the conventional environment information library, the environment information library 26 in the present application needs to modify the original single die in a single chip to multiple dies in a single chip in the environment information, and add a description of the topological structure of the interconnection network between dies and the interconnection bandwidth, etc. In addition, the operator information library 27 in the present application needs to add a description of the input/output axis information for each operator in the operator information. It should be understood that the axis information is an expression of the segmentation relationship between different input/output tensors (Tensor) of the operator. All inputs/outputs of the same axis must be segmented using the same segmentation method (i.e., segmentation axis).
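A possible shape of such environment information, written as an illustrative sketch: the field names and the ring-style interconnect in the example are assumptions, not values prescribed by this application:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class DieSpec:
    memory_gib: float              # capacity of this die's local memory
    memory_bandwidth_gbs: float    # local memory (e.g. HBM) bandwidth

@dataclass
class EnvInfo:
    num_dies: int
    dies: Dict[int, DieSpec]
    interconnect_gbs: Dict[Tuple[int, int], float]   # (die_a, die_b) -> link bandwidth

env = EnvInfo(
    num_dies=4,
    dies={d: DieSpec(16.0, 1200.0) for d in range(4)},
    interconnect_gbs={(0, 1): 200.0, (1, 2): 200.0, (2, 3): 200.0, (3, 0): 200.0},
)
```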
可选地,请参阅图3a,图3a是本申请实施例提供的一种算子切分的示意图。图3a中的算子A为矩阵相乘(MatMul)算子,该算子A包括两个输入矩阵和一个输出矩阵,其中,左输入矩阵为M×K的矩阵,右输入矩阵为K×N的矩阵,输出矩阵为M×N的矩阵。算子A包括三个切分轴,分别为M轴、K轴和N轴。如图3a所示,该算子A的左输入矩阵和输出矩阵被相同的切分轴(即M轴,如图3a中的虚线所指)切分,从而得到切分后的算子a1和算子a2,其中,算子a1和算子a2的左输入矩阵为(M/2)×K的矩阵,输出矩阵为(M/2)×N的矩阵,显然,算子a1和算子a2所需计算的数据仅为原本算子A的一半。如此,通过算子切分单元22,可以将一个完整数据计算的算子,切分为两个部分数据计算的算子。Optionally, please refer to Figure 3a, which is a schematic diagram of an operator segmentation provided in an embodiment of the present application. Operator A in Figure 3a is a matrix multiplication (MatMul) operator, and the operator A includes two input matrices and an output matrix, wherein the left input matrix is an M×K matrix, the right input matrix is a K×N matrix, and the output matrix is an M×N matrix. Operator A includes three segmentation axes, namely, the M axis, the K axis, and the N axis. As shown in Figure 3a, the left input matrix and the output matrix of the operator A are segmented by the same segmentation axis (i.e., the M axis, as indicated by the dotted line in Figure 3a), thereby obtaining the segmented operators a1 and a2, wherein the left input matrices of operators a1 and a2 are (M/2)×K matrices, and the output matrices are (M/2)×N matrices. Obviously, the data required to be calculated by operators a1 and a2 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
可选地,请参阅图3b,图3b是本申请实施例提供的另一种算子切分的示意图。如图3b所示,该算子A的右输入矩阵和输出矩阵被相同的切分轴(即N轴,如图3b中的虚线所指)切分,从而得到切分后的 算子a3和算子a4,其中,算子a3和算子a4的右输入矩阵为K×(N/2)的矩阵,输出矩阵为M×(N/2)的矩阵,显然,算子a3和算子a4所需计算的数据仅为原本算子A的一半。如此,通过算子切分单元22,可以将一个完整数据计算的算子,切分为两个部分数据计算的算子。Optionally, please refer to FIG. 3b, which is a schematic diagram of another operator segmentation provided in an embodiment of the present application. As shown in FIG. 3b, the right input matrix and the output matrix of the operator A are segmented by the same segmentation axis (i.e., the N axis, as indicated by the dotted line in FIG. 3b), thereby obtaining Operator a3 and operator a4, wherein the right input matrix of operator a3 and operator a4 is a K×(N/2) matrix, and the output matrix is an M×(N/2) matrix. Obviously, the data required to be calculated by operator a3 and operator a4 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
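The two splits of Fig. 3a and Fig. 3b can be checked numerically. The NumPy sketch below assumes small illustrative shapes and shows that the concatenated shard outputs equal the original M×N result:

```python
import numpy as np

M, K, N = 4, 6, 8
A = np.random.rand(M, K)
B = np.random.rand(K, N)
full = A @ B

# Fig. 3a: split the M axis -> each shard computes an (M/2) x N block of the output.
a1, a2 = A[: M // 2] @ B, A[M // 2 :] @ B
assert np.allclose(np.vstack([a1, a2]), full)

# Fig. 3b: split the N axis -> each shard computes an M x (N/2) block of the output.
a3, a4 = A @ B[:, : N // 2], A @ B[:, N // 2 :]
assert np.allclose(np.hstack([a3, a4]), full)
```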
图切分单元23,用于基于算子切分的切分结果,将切分后的算子(例如图3a所示的算子a1和算子a2)重新组织为在多个Die上执行的子图(sub DAG)的形式,每个子图中包括多个被切分后的算子。例如,分配到Die 0上计算的子图1中可以包括上述算子a1,分配到Die 1上计算的子图1中可以包括上述算子a2。应理解,算子切分单元22和图切分单元23为本申请在图编译器中新增的单元。The graph segmentation unit 23 is used to reorganize the segmented operators (such as operator a1 and operator a2 shown in FIG3a ) into a sub-graph (sub DAG) executed on multiple Dies based on the segmentation results of the operator segmentation, and each sub-graph includes multiple segmented operators. For example, the sub-graph 1 assigned to Die 0 for calculation may include the above operator a1, and the sub-graph 1 assigned to Die 1 for calculation may include the above operator a2. It should be understood that the operator segmentation unit 22 and the graph segmentation unit 23 are newly added units in the graph compiler of the present application.
模型编译单元24,用于将多个sub DAG编译为可以部署的模型列表(model list)。模型列表中包括多个模型,每个模型包括多个计算任务。应理解,相较于常规的模型编译,本申请中的模型编译单元24需要增加对于多个sub DAG的编译支持。The model compilation unit 24 is used to compile multiple sub DAGs into a model list that can be deployed. The model list includes multiple models, and each model includes multiple computing tasks. It should be understood that compared with conventional model compilation, the model compilation unit 24 in the present application needs to increase the compilation support for multiple sub DAGs.
模型部署单元25,用于将多个模型列表,部署到对应的多个Die上。最后,通过内存管理103将多个模型列表分配到各自对应Die上执行,充分利用芯片中每个Die的算力,有效提升多Die芯片的计算效率。The model deployment unit 25 is used to deploy multiple model lists to corresponding multiple Dies. Finally, the multiple model lists are allocated to their corresponding Dies for execution through the memory management 103, making full use of the computing power of each Die in the chip and effectively improving the computing efficiency of the multi-Die chip.
请参阅图4,图4是本申请实施例提供的一种基于多裸片的计算方法的流程示意图。该方法主要针对由N个裸片封装成的芯片,N为大于或者等于1的整数。该方法可以应用于图1所述的系统架构中,具体地,该方法可以应用于图2所示的图编译器102中,下面将结合图2所述的图编译器102对本申请实施例提供的方法进行详细阐述。如图4所示,该方法可以包括以下步骤S501-步骤S504。Please refer to Figure 4, which is a flow chart of a multi-die-based calculation method provided in an embodiment of the present application. The method is mainly aimed at a chip packaged by N dies, where N is an integer greater than or equal to 1. The method can be applied to the system architecture described in Figure 1. Specifically, the method can be applied to the graph compiler 102 shown in Figure 2. The method provided in an embodiment of the present application will be described in detail below in conjunction with the graph compiler 102 described in Figure 2. As shown in Figure 4, the method may include the following steps S501-S504.
步骤S501,获取第一计算图,第一计算图包括M个第一算子。Step S501: Obtain a first computation graph, where the first computation graph includes M first operators.
具体地,图编译器102获取第一计算图,该第一计算图包括M个第一算子。M为大于或者等于1的整数。应理解,该第一计算图为切分前的DAG图,相应的,该第一计算图中包括的M个第一算子均为切分前的算子,每个第一算子对应了一个完整数据的计算。Specifically, the graph compiler 102 obtains a first computation graph, which includes M first operators. M is an integer greater than or equal to 1. It should be understood that the first computation graph is a DAG graph before segmentation, and accordingly, the M first operators included in the first computation graph are all operators before segmentation, and each first operator corresponds to a calculation of a complete data.
可选地,图编译器102可以通过其中的图排序单元21获取该第一计算图,并通过拓扑排序将该第一计算图转换为算子列表,该算子列表中包括按序排列的上述M个第一算子。Optionally, the graph compiler 102 may obtain the first computation graph through the graph sorting unit 21 therein, and convert the first computation graph into an operator list through topological sorting, wherein the operator list includes the above-mentioned M first operators arranged in order.
步骤S502,对M个第一算子分别进行切分,得到M个第一算子的切分结果。Step S502, segmenting the M first operators respectively to obtain segmentation results of the M first operators.
具体地,图编译器102对该M个第一算子依次进行算子切分的处理,从而得到该M个第一算子各自的切分结果。Specifically, the graph compiler 102 performs operator segmentation processing on the M first operators in sequence, thereby obtaining segmentation results for each of the M first operators.
可选地,由于每个第一算子可能对应有多个切分轴,图编译器102可以先确定每个第一算子的最优切分轴,然后再基于每个第一算子的最优切分轴对第一算子进行切分,从而得到最理想的切分结果。可选地,该最优切分轴可以为带来切分收益最大的切分轴。示例性的,图编译器102可以先确定M个第一算子中的第i个第一算子的最优切分轴,然后基于第i个第一算子的最优切分轴,对第i个第一算子进行切分,获得第i个第一算子的切分结果。i为大于或者等于1,且小于或者等于M的整数。其中,该第i个第一算子的切分结果包括:第i个第一算子的输出张量列表,第i个第一算子对应的一个或多个输入张量和输出张量的原始形状(例如上述图3a所示的M×N),第i个第一算子对应的一个或多个输入张量和/或输出张量被最优切分轴切分后的形状(例如上述图3a所示的(M/2)×N),以及第i个第一算子被最优切分轴切分后对应分配的一个或多个裸片。Optionally, since each first operator may correspond to multiple segmentation axes, the graph compiler 102 may first determine the optimal segmentation axis of each first operator, and then segment the first operator based on the optimal segmentation axis of each first operator, so as to obtain the most ideal segmentation result. Optionally, the optimal segmentation axis may be the segmentation axis that brings the greatest segmentation benefit. Exemplarily, the graph compiler 102 may first determine the optimal segmentation axis of the i-th first operator among the M first operators, and then segment the i-th first operator based on the optimal segmentation axis of the i-th first operator to obtain the segmentation result of the i-th first operator. i is an integer greater than or equal to 1 and less than or equal to M. Among them, the segmentation result of the i-th first operator includes: the output tensor list of the i-th first operator, the original shape of one or more input tensors and output tensors corresponding to the i-th first operator (for example, M×N shown in Figure 3a above), the shape of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis (for example, (M/2)×N shown in Figure 3a above), and one or more bare chips allocated corresponding to the i-th first operator after being segmented by the optimal segmentation axis.
请参阅图5,图5是本申请实施例提供的一种算子切分方法的流程示意图。如图5所示,该方法包括如下步骤S11-步骤S17。Please refer to Figure 5, which is a schematic flow chart of an operator segmentation method provided in an embodiment of the present application. As shown in Figure 5, the method includes the following steps S11 to S17.
步骤S11,获取算子信息。Step S11, obtaining operator information.
具体地,图编译器102从算子信息库27中获取M个第一算子(即所有待切分算子)的算子信息。该算子信息主要包括每个第一算子的输入张量(Input Tensor)列表、输出张量(Output Tensor)列表和轴信息。其中,每个第一算子的轴信息可以包括:(1)轴类型,例如可以是Element-wise、归约(Reduction)、滑动窗(SlidingWindow)等类型。不同的轴类型表达算子不同的计算特点。(2)每个轴涉及的输入张量,以及该输入张量的维度(dimension)。(3)每个轴涉及的输出张量,以及该输出张量的dimension。Specifically, the graph compiler 102 obtains the operator information of M first operators (i.e., all operators to be split) from the operator information library 27. The operator information mainly includes the input tensor (Input Tensor) list, output tensor (Output Tensor) list and axis information of each first operator. Among them, the axis information of each first operator may include: (1) axis type, such as Element-wise, reduction (Reduction), sliding window (SlidingWindow) and other types. Different axis types express different computing characteristics of operators. (2) The input tensor involved in each axis, and the dimension (dimension) of the input tensor. (3) The output tensor involved in each axis, and the dimension of the output tensor.
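An illustrative encoding of this per-axis information for MatMul is given below; the structure, the axis-type labels and the dimension assignments are assumptions made for the example:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AxisInfo:
    axis_type: str                                   # e.g. "elementwise", "reduction", "sliding_window"
    input_dims: Dict[str, int] = field(default_factory=dict)    # input tensor name -> dimension index
    output_dims: Dict[str, int] = field(default_factory=dict)   # output tensor name -> dimension index

# MatMul C[M, N] = A[M, K] x B[K, N]; all tensors sharing an axis must be split the same way.
matmul_axes = {
    "M": AxisInfo("elementwise", input_dims={"A": 0}, output_dims={"C": 0}),
    "N": AxisInfo("elementwise", input_dims={"B": 1}, output_dims={"C": 1}),
    "K": AxisInfo("reduction",   input_dims={"A": 1, "B": 0}),  # reduced away, absent from C
}
```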
步骤S12,是否存在未计算的切分轴,若是,则执行步骤S13,若否,则执行步骤S17。Step S12, whether there is an uncalculated segmentation axis, if yes, execute step S13, if not, execute step S17.
具体地,图编译器102查看当前第一算子(例如第i个第一算子)是否还存在未计算切分收益的切分轴。若当前第i个第一算子还存在未计算的切分轴,则图编译器102可以接着选择下一个切分轴,计算切分收益;若针对当前第i个第一算子包括的所有切分轴(例如包括K个切分轴,K为大于或者等于1的整数)均已经完成切分收益的计算,则图编译器102可以将其中切分收益最大的切分轴记录为最优切分轴,并将该第i个第一算子在最优切分轴下的切分结果记录在张量切分表中。可选地,该张量切分表可以位于 数据库或者内存中。Specifically, the graph compiler 102 checks whether the current first operator (for example, the ith first operator) still has a slicing axis whose slicing benefit has not been calculated. If the current ith first operator still has a slicing axis that has not been calculated, the graph compiler 102 may then select the next slicing axis and calculate the slicing benefit; if the slicing benefit calculation has been completed for all slicing axes included in the current ith first operator (for example, including K slicing axes, K is an integer greater than or equal to 1), the graph compiler 102 may record the slicing axis with the largest slicing benefit as the optimal slicing axis, and record the slicing result of the ith first operator under the optimal slicing axis in the tensor slicing table. Optionally, the tensor slicing table may be located at Database or in memory.
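A compact sketch of the bookkeeping in steps S12 and S17, assuming the per-axis benefits have already been computed (the tensor split table is modelled here as a plain dictionary for illustration):

```python
def record_best_axes(per_operator_axis_benefits):
    """per_operator_axis_benefits: {op_name: {axis_name: split_benefit}} (precomputed)."""
    tensor_split_table = {}
    for op_name, benefits in per_operator_axis_benefits.items():
        # Step S17: once every candidate axis has a benefit, keep the largest one.
        tensor_split_table[op_name] = max(benefits, key=benefits.get) if benefits else None
    return tensor_split_table

print(record_best_axes({"matmul_0": {"M": 6.0, "K": 3.0, "N": 4.5}, "cast_0": {}}))
# -> {'matmul_0': 'M', 'cast_0': None}
```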
步骤S13,切分轴选择。Step S13, segmentation axis selection.
具体地,若当前第i个第一算子还存在未计算的切分轴,则图编译器102可以接着选择下一个切分轴(例如为K个切分轴中的第j个切分轴)进行切分收益的计算。例如,如上图3a所示,算子A包括三个切分轴,若此时N轴的切分收益已经计算完成,接下来图编译器102可以选自N轴或者K轴进行相应切分收益的计算。Specifically, if there are still uncalculated split axes in the current i-th first operator, the graph compiler 102 can then select the next split axis (for example, the j-th split axis among the K split axes) to calculate the split benefit. For example, as shown in FIG3a above, operator A includes three split axes. If the split benefit of axis N has been calculated at this time, the graph compiler 102 can then select axis N or axis K to calculate the corresponding split benefit.
Step S14: compute the split benefit.
Specifically, the graph compiler 102 computes the split benefit of the i-th first operator under the current split axis.
Optionally, please refer to Figure 6, which is a schematic diagram of split-benefit calculation provided in an embodiment of the present application. As shown in Figure 6, the split benefit depends on the computation benefit and the communication time (in other words, the communication loss). The split benefit corresponding to each split axis is the difference between the computation benefit and the communication time: the larger the computation benefit and the smaller the communication time, the larger the split benefit.
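As a toy numeric illustration of the relation shown in Figure 6 (every number below is invented for illustration only and does not come from the present application):

```python
# Split benefit = computation benefit - communication time (Figure 6); toy numbers only.
compute_time_single_die = 8.0e-3       # operator executed on one die: 8 ms (assumed)
compute_time_after_split = 2.2e-3      # split across 4 dies: slightly worse than 8/4 ms (assumed)
computation_benefit = compute_time_single_die - compute_time_after_split   # 5.8 ms saved
communication_time = 1.5e-3            # fetching remote input slices: 1.5 ms (assumed)
split_benefit = computation_benefit - communication_time                   # 4.3 ms net gain
```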
The computation benefit is the reduction in compute time brought by splitting the i-th first operator. Specifically, it is the difference between the time the i-th first operator takes when computed on a single die before splitting (for example, the first compute time) and the time it takes when, after being split along the current j-th split axis, it is computed in parallel across multiple dies (for example, the second compute time).
For example, if the compute time of the i-th first operator on a single die is T, and its compute time after being split along its j-th split axis and distributed to four dies can theoretically be T/4, then the computation benefit of the j-th split axis is (T - T/4). It should be noted that, in practice, the amount of data an operator computes and its compute time are not linearly related; the compute time on each die often depends on many factors. Optionally, please refer to Figure 7, which is a schematic diagram of a compute-time calculation method provided in an embodiment of the present application. As shown in Figure 7, the graph compiler 102 needs to take into account factors such as the chip type (which determines the core counts and clock frequencies of the various accelerated compute units, the sizes of the caches at each level, and so on), the input data type (DType), and the input data shape (Shape), and, based on a cost model (Cost Model), calculate the compute time each die needs to execute the i-th first operator after it has been split along the current j-th split axis (for example, operator a1 shown in Figure 3a above).
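A minimal sketch of such a computation-benefit estimate is given below; the cost_model callable stands in for the Cost Model of Figure 7, and the dictionary-based shape handling is an assumption made for illustration rather than the graph compiler's actual interface:

```python
def compute_benefit(cost_model, input_shapes, split_dim_of, num_dies):
    """Estimate the compute time saved by splitting an operator across `num_dies` dies.

    cost_model(shapes) -> estimated execution time for the given tensor shapes (assumed callable)
    input_shapes       -> tensor name -> shape tuple, e.g. {"left": (M, K), "right": (K, N)}
    split_dim_of       -> tensor name -> dimension cut by the candidate split axis (absent if untouched)
    """
    t_single = cost_model(input_shapes)                        # first compute time: one die, full shapes
    split_shapes = {
        name: tuple(size // num_dies if dim == split_dim_of.get(name) else size
                    for dim, size in enumerate(shape))
        for name, shape in input_shapes.items()                # e.g. M x K -> (M/2) x K for two dies
    }
    t_parallel = cost_model(split_shapes)                      # second compute time: one die's share
    return t_single - t_parallel
```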
The communication time is the time consumed, after the i-th first operator has been split, when a die executing the split operator must communicate with other dies to fetch computation data stored on those dies. For example, the communication time corresponding to the j-th split axis of the i-th first operator is the time the p-th die needs to obtain target data from other dies, where the p-th die is one of the multiple dies to which the i-th first operator is assigned after being split along the j-th split axis, and the target data is the data the p-th die needs to execute the split i-th first operator. Optionally, the communication time is generally related to the amount of target data (that is, the communication data volume) and the memory layout of the target data (that is, the memory layout of the communication data), and may also be related to factors such as the inter-die interconnect topology and the link bandwidth. p is an integer greater than or equal to 1 and less than or equal to N.
Please refer to Figure 8, which is a schematic flowchart of a communication-time calculation method provided in an embodiment of the present application. As shown in Figure 8, the method includes the following steps S21 to S25.
Step S21: read the communication topology and bandwidth.
Specifically, the graph compiler 102 reads, from the environment information library 26, the communication topology among the multiple dies within the chip and the bandwidth data of the communication links.
Step S22: read the split result of the predecessor operator.
Specifically, the graph compiler 102 reads the split result of the predecessor operator of the current i-th first operator from the tensor split table and determines the optimal split axis of the predecessor operator, that is, the optimal split axis of the (i-1)-th first operator. If the optimal split axis of the (i-1)-th first operator is the same as the j-th split axis of the current i-th first operator, no cross-die communication is needed, that is, the communication time is zero. For example, please refer to Figure 9, which is a schematic diagram, provided in an embodiment of the present application, of a predecessor and a successor operator sharing the same split axis. As shown in Figure 9, operator A is, for example, the (i-1)-th first operator, operator B is, for example, the i-th first operator, and operator A is the predecessor of operator B. As shown in Figure 9, the split axes of operator A and operator B are both the M axis; after operator A is split, the current die holds only (M/2)×N of the output data. When the current die executes the split operator B, it can therefore fetch the (M/2)×N input data directly from the current die for computation, and no cross-die communication is needed; otherwise, if the split axis of operator B were the N axis, the current die would need to fetch the other half of the M×(N/2) data from other dies, that is, cross-die communication would be required.
Step S23: calculate the communication data volume.
Specifically, if the optimal split axis of the (i-1)-th first operator differs from the j-th split axis of the current i-th first operator, the graph compiler 102 needs to calculate the communication time. In this case, the graph compiler 102 may first calculate the amount of communication data that must be accessed across dies. Optionally, when the current j-th split axis is of an axis type that requires cross-die exchange (such as Reduction or SlidingWindow), the graph compiler 102 also needs to calculate, based on the type and shape of the operator's input tensors, the amount of data that must be exchanged with other dies.
Step S24: calculate the memory layout of the communication data.
Specifically, to ensure the accuracy of the communication-time calculation, an embodiment of the present application may additionally calculate the memory layout of the communication data on top of the communication data volume. When the data that must be exchanged across dies after the i-th first operator is split along the j-th split axis occupies non-contiguous memory, multiple transfer tasks are often needed to complete the exchange, which adds extra task overhead and therefore a larger communication time. In addition, when the memory layout is very scattered and the data volume is small, many frequent small transfers are easily triggered, which likewise increases the communication time.
Step S25: calculate the communication time.
Specifically, the graph compiler 102 may make a first-order estimate of the communication time as the communication data volume divided by the inter-die bandwidth; furthermore, by combining the communication data volume with the memory-layout information of the communication data, the overall communication time can be calculated. Optionally, the graph compiler 102 may make a simple estimate of the communication time based on measured values for some typical packet lengths, or may use a more sophisticated cost model to calculate it, and so on; this is not specifically limited in the embodiments of the present application. In addition, the communication time is affected not only by the data volume mentioned above but also by other factors such as the size of the transmitted packets and the control signals exchanged during communication, which are likewise not specifically limited in the embodiments of the present application.
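The first-order estimate described in steps S21 to S25 might be sketched as follows; the parameter names and the simple per-transfer penalty are illustrative assumptions, and a real implementation could substitute a fuller cost model as noted above:

```python
def communication_time(data_volume_bytes, num_memory_chunks, inter_die_bandwidth,
                       per_transfer_overhead, same_axis_as_predecessor, axis_needs_exchange):
    """Rough cross-die communication cost for one candidate split axis (steps S21-S25)."""
    # S22: predecessor split along the same axis and no exchange-type axis -> data is already local.
    if same_axis_as_predecessor and not axis_needs_exchange:
        return 0.0
    # S24: a scattered, non-contiguous layout needs several transfer tasks, each with fixed overhead.
    task_overhead = num_memory_chunks * per_transfer_overhead
    # S23 + S25: first-order estimate = communication data volume / inter-die bandwidth + task overhead.
    return data_volume_bytes / inter_die_bandwidth + task_overhead
```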
Step S15: determine whether the current split axis is the optimal split axis.
Specifically, based on the calculations of the computation benefit and the communication time in the embodiments corresponding to Figures 7 and 8 above, the graph compiler 102 obtains the split benefit corresponding to the current j-th split axis. The graph compiler 102 then compares the split benefit of the current j-th split axis with the split benefits of the preceding j-1 split axes. If the split benefit of the current j-th split axis is greater than the split benefits of the preceding j-1 split axes, the graph compiler 102 can determine that the current j-th split axis is the optimal split axis and execute step S16; otherwise it executes step S12.
Step S16: record the optimal split axis.
Specifically, if the split benefit of the current j-th split axis is greater than the split benefits of the preceding j-1 split axes, the graph compiler 102 records the current j-th split axis as the optimal split axis of the current i-th first operator. It should be understood that if the split benefit of the subsequent (j+1)-th split axis turns out to be greater than that of the j-th split axis, the graph compiler 102 may update the optimal split axis, that is, record the (j+1)-th split axis as the optimal split axis of the current i-th first operator.
Step S17: record the operator split result.
Specifically, after the graph compiler 102 has computed the split benefits of all split axes of the current i-th first operator, the optimal split axis of the current i-th first operator can be determined. The graph compiler 102 then splits the current i-th first operator along this optimal split axis, obtains the split result of the i-th first operator, and records that split result in the tensor split table.
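Steps S12 to S17 for a single operator can be summarized as a small selection loop. The sketch below assumes an OperatorInfo-like object with an axes list (as in the earlier sketch) and a benefit_of callable that returns the split benefit of one axis; both names are illustrative:

```python
def choose_optimal_split_axis(op, benefit_of):
    """Evaluate every candidate split axis of `op` and return the one with the largest split benefit."""
    best_axis, best_benefit = None, float("-inf")
    for axis in op.axes:                          # S12/S13: pick the next split axis not yet computed
        benefit = benefit_of(axis)                # S14: computation benefit minus communication time
        if benefit > best_benefit:                # S15/S16: keep the best axis found so far
            best_axis, best_benefit = axis, benefit
    return best_axis, best_benefit                # S17: the caller records the split result in the tensor split table
```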
Optionally, all of the method flows in step S502 above may specifically be executed by the operator split unit 22 in the graph compiler 102.
Step S503: split the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs.
Specifically, the graph compiler 102 splits the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs. The N second computation graphs correspond one-to-one to the N dies in the chip. N is an integer greater than or equal to 1.
Please refer to Figure 10, which is a schematic flowchart of a graph splitting method provided in an embodiment of the present application. As shown in Figure 10, the method may include the following steps S31 to S37.
Step S31: create the sub-DAGs (subgraphs).
Specifically, the graph compiler 102 constructs a corresponding number of sub-DAGs according to the number of dies in the chip. It should be understood that each subgraph is initially an empty graph. For example, the graph compiler 102 creates N subgraphs corresponding to the N dies.
Step S32: traverse the first computation graph.
Specifically, the graph compiler 102 traverses each first operator in the first computation graph and obtains the split result corresponding to each first operator.
Step S33: read the operator split result.
Specifically, the graph compiler 102 reads the split result of the current first operator. For example, the graph compiler 102 reads the split result of the i-th first operator. The multiple operators obtained by splitting the i-th first operator are distributed across multiple of the N dies (for example, including the p-th die mentioned above).
Step S34: determine whether a communication operator needs to be inserted.
Specifically, based on the split result of the current i-th first operator, the graph compiler 102 determines whether a communication operator needs to be inserted before the split i-th first operator; if not, step S35 is executed directly, and if so, step S36 is executed. As described above, if the split axis changes between two adjacent operators, part of the data needed by the later operator resides on other dies; in that case a communication operator must be inserted to fetch the data needed on the current die from the other dies.
Optionally, please also refer to the embodiments corresponding to Figures 3a and 3b. In general, the original tensor is the complete data, but after splitting, an operator often only needs part of that data for its computation; in that case a Slice operator needs to be inserted to cut the original tensor down to the data needed by the split operator in the subgraph.
In addition to the communication operator and the slice operator, if the current i-th first operator is split along a Reduction axis, that is, when the i-th first operator is a Reduce computation (such as ReduceSum or ReduceMax), the data on the multiple dies must be reduced, so an AllReduce (reduction) operator needs to be inserted.
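The need for the reduction operator can be seen in a small NumPy example (illustrative only; the reduce and AllReduce steps are performed here with plain array operations rather than the actual operators inserted by the graph compiler 102):

```python
import numpy as np

full = np.arange(12.0).reshape(3, 4)            # an M x K input tensor, M = 3, K = 4
die0, die1 = full[:, :2], full[:, 2:]           # the reduction axis K split across two dies
partial0 = die0.sum(axis=1)                     # per-die partial ReduceSum over K/2 elements
partial1 = die1.sum(axis=1)
combined = partial0 + partial1                  # what the inserted AllReduce(sum) produces
assert np.allclose(combined, full.sum(axis=1))  # matches the unsplit ReduceSum
```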
Step S35: construct the split operators.
Specifically, the graph compiler 102 constructs multiple split copies of the i-th first operator according to the split result of the current i-th first operator. For example, if the i-th first operator is operator A in Figure 3a, the multiple split copies of the i-th first operator may be operators a1 and a2 shown in Figure 3a. Clearly, after an operator in the original DAG (that is, a first operator in the first computation graph) is split along the optimal split axis, the shapes of its corresponding input/output tensors change, so new split operators need to be constructed. During construction, the attributes of the original operator can be copied, but the shapes of the input/output tensors need to be modified to the post-split shapes.
For example, still taking Figure 3a as an example, when constructing operators a1 and a2, the attributes of operator A (for example, the matrix-multiplication computation type) can be copied, the shape of the left input matrix is modified to (M/2)×K, the shape of the output matrix is modified to (M/2)×N, and the shape of the right input matrix remains unchanged.
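A minimal sketch of this construction step follows; the dictionary-based operator description and the dims_to_split mapping are assumptions made for illustration, not the graph compiler's actual data structures:

```python
import copy

def build_split_op(original_op, die_index, dims_to_split, num_dies):
    """Step S35 sketch: copy the original operator's attributes and shrink the split dimensions.

    original_op   -> dict with a "name", arbitrary attributes, and a "tensors" map of name -> list of dims
    dims_to_split -> tensor name -> dimension index cut by the chosen split axis
    """
    new_op = copy.deepcopy(original_op)                     # keeps the compute type and other attributes
    for tensor_name, dim in dims_to_split.items():
        new_op["tensors"][tensor_name][dim] //= num_dies    # e.g. left input M x K -> (M/2) x K
    new_op["name"] = f"{original_op['name']}_{die_index}"   # e.g. operator A -> a1, a2
    return new_op
```

For the Figure 3a example, calling this with dims_to_split = {"left_input": 0, "output": 0} and num_dies = 2 would shrink M×K to (M/2)×K and M×N to (M/2)×N while leaving the right input matrix untouched.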
Step S36: construct the communication operator.
Specifically, as described in step S34 above, if the split axis changes between two adjacent operators, part of the data needed by the later operator resides on other dies. Accordingly, if the optimal split axes of the (i-1)-th first operator and the i-th first operator differ, the graph compiler 102 can construct a corresponding communication operator before the split i-th first operator.
Step S37: add the constructed operators to the sub-DAGs.
Specifically, the graph compiler 102 adds the split operators constructed in step S35 (for example, the multiple second operators obtained after the i-th first operator is split along the optimal split axis) and the inserted operators constructed in step S36 (for example, the communication operators and slice operators) to the corresponding sub-DAGs. Steps S33 to S37 are then repeated until every first operator in the first computation graph has been traversed, yielding N second computation graphs in one-to-one correspondence with the N dies. Each second computation graph includes multiple second operators, which include the split first operators as well as the inserted slice, communication, and reduction operators.
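Putting steps S31 to S37 together, the per-die sub-DAG construction might look like the following sketch. The sub-DAGs are modeled here as plain operator lists, first_graph is assumed to be a topologically ordered list of operator dicts, build_split_op is the step S35 sketch above, and the make_* callables (which construct the inserted communication, slice, and reduction operators) are assumptions for illustration:

```python
def split_graph(first_graph, split_table, num_dies, make_comm_op, make_slice_op, make_reduce_op):
    """Build one sub-DAG per die from the recorded per-operator split results (steps S31-S37)."""
    sub_dags = [[] for _ in range(num_dies)]                    # S31: one initially empty subgraph per die
    prev_axis = None
    for op in first_graph:                                      # S32: traverse the first computation graph
        result = split_table[op["name"]]                        # S33: read this operator's split result
        for die, sub_dag in enumerate(sub_dags):
            if prev_axis is not None and prev_axis != result["axis"]:
                sub_dag.append(make_comm_op(op, die))           # S34/S36: fetch remote input slices
            sub_dag.append(make_slice_op(op, die))              # cut the full tensor down to the local part
            sub_dag.append(build_split_op(op, die, result["dims_to_split"], num_dies))  # S35
            if result["axis_type"] == "Reduction":
                sub_dag.append(make_reduce_op(op, die))         # AllReduce over the per-die partial results
        prev_axis = result["axis"]                              # S37: loop back to S33 for the next operator
    return sub_dags
```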
Optionally, all of the method flows in step S503 above may specifically be executed by the graph splitting unit 23 in the graph compiler 102.
Step S504: distribute the N second computation graphs to the N dies for execution.
Specifically, the graph splitting unit 23 in the graph compiler 102 outputs a subgraph list containing the N second computation graphs to the model compilation unit 24; each second computation graph is executed by a corresponding die. The model compilation unit 24 then outputs a corresponding model list based on the subgraph list to the model deployment unit 25. Finally, the model deployment unit 25 deploys the models corresponding to the N second computation graphs onto the corresponding dies for execution.
Optionally, each method flow in the multi-die-based computing method described in the embodiments of the present application may be implemented in software, in hardware, or in a combination of the two. A hardware implementation may include logic circuits, algorithm circuits, analog circuits, and the like. A software implementation may include program instructions, which can be regarded as a software product stored in a memory and executable by a processor to implement the related functions.
In summary, the embodiments of the present application improve the graph compiler and optimize the compilation and deployment of DAG graphs in multi-die chip scenarios. Based on operator splitting, the embodiments of the present application can split each complete computation in the DAG graph into multiple smaller sub-computations, split the original DAG graph into multiple sub-DAG graphs, compile these sub-DAG graphs into multiple models, and finally deploy them onto the multiple dies within a chip with a NUMA architecture. Furthermore, by comparing the split benefits of different split schemes (that is, different split axes), the embodiments of the present application select the optimal split scheme for each operator in the DAG graph and split each operator into multiple operators according to its optimal split scheme; these operators compute on multiple dies at the same time, making full use of the multi-die computing resources within the chip and improving computing efficiency.
As described above, based on the series of solutions provided in the embodiments of the present application, the embodiments of the present application can bring the following beneficial effects.
(1) The present application allows users to treat a multi-die chip as a single chip without needing to care about the multi-die topology within the chip, which simplifies user development.
(2) Through the calculation of the split benefit and the selection of the optimal split axis, the present application ensures that the computation on each NUMA node only needs to access that node's own storage. In other words, the computation executed by each die only needs to access the storage on that die, so operator implementations do not need to be aware of cross-die memory access, which simplifies operator development. Moreover, because the computation executed by each die only needs to access the storage on that die, the requirements on inter-die bandwidth and topology are reduced, and high computing performance can still be achieved even with relatively low inter-die bandwidth. Users can devote as much of the chip area as possible to computation, which increases the chip's compute density.
Based on the description of the above method embodiments, an embodiment of the present application further provides an electronic device. Please refer to Figure 11, which is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in Figure 11, the electronic device 110 includes at least a processor 1101, an input device 1102, an output device 1103, and a memory 1104; the electronic device may also include other general-purpose components, which are not described in detail here. The processor 1101, the input device 1102, the output device 1103, and the memory 1104 in the electronic device may be connected by a bus or in other ways. The electronic device 110 may be a smart wearable device, a smartphone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer, or a server, among others, or may be a server cluster or a cloud computing service center composed of multiple servers.
The memory 1104 in the electronic device 110 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1104 may exist independently and be connected to the processor 1101 through a bus, or the memory 1104 may be integrated with the processor 1101.
A computer-readable storage medium may be stored in the memory 1104 of the electronic device 110; the computer-readable storage medium is configured to store a computer program, the computer program includes program instructions, and the processor 1101 is configured to execute the program instructions stored in the computer-readable storage medium. The processor 1101 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device 110; it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions so as to implement the corresponding method flows or functions. In one embodiment, the processor 1101 described in the embodiments of the present application may be used to perform a series of processing of the multi-die-based computing method, including: obtaining a first computation graph, the first computation graph including M first operators; splitting the M first operators respectively to obtain the split results of the M first operators; splitting the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs, where each of the N second computation graphs includes split first operators, and N and M are integers greater than or equal to 1; and distributing the N second computation graphs to N dies for execution, the N second computation graphs corresponding one-to-one to the N dies; and so on. For details, please refer to the related descriptions in the embodiments corresponding to Figures 1 to 10 above, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor can perform some or all of the steps of any one of the above method embodiments.
An embodiment of the present application further provides a computer program, where the computer program includes instructions, and when the computer program is executed by a multi-core processor, the processor can perform some or all of the steps of any one of the above method embodiments.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. It should be noted that, for brevity, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments and that the actions and modules involved are not necessarily required by the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into the above units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in a computer device) to perform all or some of the steps of the above methods of the embodiments of the present application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a double data rate synchronous dynamic random access memory (DDR), a flash memory, or a random access memory (RAM).
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A multi-die-based computing method, wherein the method comprises:
    obtaining a first computation graph, the first computation graph comprising M first operators;
    splitting the M first operators respectively to obtain split results of the M first operators;
    splitting the first computation graph based on the split results of the M first operators to obtain corresponding N second computation graphs, wherein each of the N second computation graphs comprises a split first operator, and N and M are integers greater than or equal to 1; and
    distributing the N second computation graphs to N dies for execution, wherein the N second computation graphs correspond one-to-one to the N dies.
  2. The method according to claim 1, wherein the splitting the M first operators respectively to obtain the split results of the M first operators comprises:
    determining an optimal split axis of an i-th first operator among the M first operators; and
    splitting the i-th first operator based on the optimal split axis of the i-th first operator to obtain a split result of the i-th first operator, wherein i is an integer greater than or equal to 1 and less than or equal to M.
  3. The method according to claim 2, wherein the i-th first operator comprises K split axes, and the determining the optimal split axis of the i-th first operator among the M first operators comprises:
    determining a computation benefit and a communication time corresponding to each of the K split axes of the i-th first operator; and
    determining, based on the difference between the computation benefit and the communication time corresponding to each of the K split axes, a split benefit corresponding to each of the K split axes, wherein the split axis with the largest split benefit is the optimal split axis of the i-th first operator, and K is an integer greater than or equal to 1.
  4. The method according to claim 3, wherein the computation benefit corresponding to a j-th split axis of the i-th first operator is the difference between a first compute time and a second compute time, the first compute time being the time a single die needs to execute the i-th first operator, and the second compute time being the time multiple dies need to execute in parallel the i-th first operator after it has been split along the j-th split axis, wherein j is an integer greater than or equal to 1 and less than or equal to K.
  5. The method according to claim 4, wherein the communication time corresponding to the j-th split axis of the i-th first operator is the time a p-th die needs to obtain target data from other dies, the p-th die being one of the multiple dies to which the i-th first operator is assigned after being split along the j-th split axis, and the target data being the data the p-th die needs to execute the i-th first operator after it has been split along the j-th split axis, wherein the communication time is related to the amount of the target data and to the memory layout of the target data, and p is an integer greater than or equal to 1 and less than or equal to N.
  6. The method according to claim 5, wherein the split result of the i-th first operator comprises: an output tensor list of the i-th first operator; original shapes of one or more input tensors and output tensors corresponding to the i-th first operator; shapes of the one or more input tensors and/or output tensors corresponding to the i-th first operator after being split along the optimal split axis; and one or more dies to which the i-th first operator is assigned after being split along the optimal split axis.
  7. The method according to claim 6, wherein a p-th second computation graph among the N second computation graphs comprises multiple second operators, the multiple second operators comprising the i-th first operator after being split along the optimal split axis, and the p-th second computation graph being the second computation graph assigned to the p-th die for execution.
  8. The method according to claim 7, wherein the multiple second operators in the p-th second computation graph further comprise one or more of a slice operator, a communication operator, and a reduction operator, wherein:
    the slice operator is configured to obtain an input tensor of the i-th first operator after the i-th first operator has been split along the optimal split axis;
    the communication operator is configured to, when the optimal split axis of the i-th first operator differs from the optimal split axis of an (i-1)-th first operator, obtain, from other dies, an input tensor of the i-th first operator after the i-th first operator has been split along the optimal split axis; and
    the reduction operator is configured to, when the optimal split axis of the i-th first operator is a reduction axis, reduce the data on the corresponding multiple dies.
  9. An electronic device, comprising N dies, wherein the electronic device is configured to implement the method according to any one of claims 1 to 8, and N is an integer greater than or equal to 1.
  10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer or a processor, the method according to any one of claims 1 to 8 is implemented.
PCT/CN2023/115085 2022-09-29 2023-08-25 Multi-die-based computation method and related device WO2024066847A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211198266.2 2022-09-29
CN202211198266.2A CN117827419A (en) 2022-09-29 2022-09-29 Computing method based on multiple bare chips and related equipment

Publications (1)

Publication Number Publication Date
WO2024066847A1 true WO2024066847A1 (en) 2024-04-04

Family

ID=90476042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/115085 WO2024066847A1 (en) 2022-09-29 2023-08-25 Multi-die-based computation method and related device

Country Status (2)

Country Link
CN (1) CN117827419A (en)
WO (1) WO2024066847A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225100A1 (en) * 2017-02-03 2018-08-09 International Business Machines Corporation Splitting operators in a streaming application
CN113449857A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and data processing equipment
CN113449859A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and device
CN113994350A (en) * 2020-03-27 2022-01-28 华为技术有限公司 Generating parallel computing schemes for neural networks
CN114723014A (en) * 2022-04-20 2022-07-08 上海燧原科技有限公司 Tensor segmentation mode determination method and device, computer equipment and medium

Also Published As

Publication number Publication date
CN117827419A (en) 2024-04-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23870090

Country of ref document: EP

Kind code of ref document: A1