WO2024066847A1 - Multi-die-based computation method and related device

Multi-die-based computation method and related device

Info

Publication number
WO2024066847A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
axis
segmentation
die
operators
Prior art date
Application number
PCT/CN2023/115085
Other languages
French (fr)
Chinese (zh)
Inventor
刘锡明
朱思宇
林惠敏
葛根华
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024066847A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of computer technology, and in particular to a multi-die-based computing method and related equipment.
  • UMA uniform memory access
  • NUMA non-uniform memory access
  • HBM high-bandwidth memory
  • the NUMA structure is usually chosen for multi-die packaging.
  • data frequently needs to be accessed across dies, which greatly increases access latency and reduces overall computing efficiency. How to improve the computing efficiency of multi-die packaged chips is therefore an urgent problem to be solved.
  • the embodiments of the present application provide a multi-die based computing method and related equipment, which can effectively improve the computing efficiency of multi-die packaged chips.
  • the multi-die-based computing method provided in the embodiments of the present application can be executed by an electronic device, etc.
  • An electronic device refers to a device that can be abstracted as a computer system, wherein an electronic device based on the computing function of multiple dies can also be referred to as a computing device based on multiple dies.
  • the computing device based on multiple dies can be a complete machine of the electronic device, such as: a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a car computer or a server, etc.; it can also be a system/device composed of multiple complete machines; it can also be a part of the electronic device, such as: a chip related to the computing function based on multiple dies, such as a system on a chip (SoC), etc., which is not specifically limited in the embodiments of the present application. Among them, the system chip is also called a system on chip.
  • SoC system on a chip
  • an embodiment of the present application provides a multi-die-based computing method, the method comprising: obtaining a first computation graph, the first computation graph comprising M first operators; splitting the M first operators respectively to obtain splitting results of the M first operators; splitting the first computation graph based on the splitting results of the M first operators to obtain N corresponding second computation graphs, each of the N second computation graphs comprising a split first operator, where N and M are integers greater than or equal to 1; and allocating the N second computation graphs to N dies for execution, the N second computation graphs corresponding one-to-one to the N dies.
  • the present application can split each complete computation (e.g., a first operator) in the computation graph into multiple smaller sub-computations (e.g., the split first operator) based on operator splitting, and distribute these smaller sub-computations to the corresponding dies for parallel computation.
  • splitting the M first operators respectively to obtain the splitting results of the M first operators includes: determining the optimal splitting axis of the i-th first operator among the M first operators; splitting the i-th first operator based on the optimal splitting axis of the i-th first operator to obtain the splitting result of the i-th first operator; i is an integer greater than or equal to 1 and less than or equal to M.
  • each operator e.g., the first operator in the original computation graph may correspond to multiple split axes. Based on this, the embodiment of the present application first needs to determine an optimal split axis from the multiple split axes corresponding to each first operator, and then split the first operator based on the optimal split axis, thereby ensuring the computational efficiency of the multiple operators obtained after each first operator is split on multiple dies.
  • the i-th first operator includes K segmentation axes; determining the optimal segmentation axis of the i-th first operator among the M first operators includes: determining the computational benefits and communication time corresponding to each of the K segmentation axes included in the i-th first operator; determining the segmentation benefits corresponding to each of the K segmentation axes based on the difference between the computational benefits and the communication time corresponding to each of the K segmentation axes; wherein the segmentation axis with the largest segmentation benefit is the optimal segmentation axis of the i-th first operator; K is an integer greater than or equal to 1.
  • the actual slicing benefit that each slicing axis can bring can be calculated based on the computational benefit (i.e., the reduction in computational time) and communication time corresponding to each slicing axis in the current operator. Then, the slicing axis with the largest benefit is determined as the optimal slicing axis of the current operator to ensure the computational efficiency of each operator when it is distributed to multiple chips for execution after slicing.
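  • Stated compactly (the notation below is ours, not the application's), the selection rule described above is: the splitting benefit of the j-th axis is its computational benefit minus its communication time, and the optimal axis maximizes that benefit:

$$ \mathrm{gain}_j \;=\; \bigl(T_{\mathrm{single}} - T_{\mathrm{parallel},\,j}\bigr) \;-\; T_{\mathrm{comm},\,j}, \qquad j^{*} \;=\; \arg\max_{1 \le j \le K}\ \mathrm{gain}_j $$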
  • K may also be equal to 0, that is, an operator may not have a slicing axis. Obviously, in the case where the operator has no slicing axis, the operator will not be able to perform slicing processing.
  • in this case, the operator can be deployed on the N dies respectively so that the N dies perform the same computation.
  • when executing the operator, the N dies need to obtain the data required for the computation through cross-die communication.
  • the computational benefit corresponding to the j-th splitting axis of the i-th first operator is the difference between the first computational time and the second computational time; the first computational time is the time required for a single die to execute the i-th first operator, and the second computational time is the time required for multiple die to execute in parallel the i-th first operator after being split by the j-th splitting axis; j is an integer greater than or equal to 1 and less than or equal to K.
  • the reduction in the computation time of a complete data calculation (e.g., the first operator) after being split by the current split axis and distributed to multiple dies can be used as the computation benefit corresponding to the current split axis, thereby providing support for the subsequent determination of the optimal split axis and ensuring the computation efficiency after the operator is split.
  • the computation time of the i-th first operator on a single die is T
  • the computation time of the i-th first operator after being split by its j-th split axis and distributed to 4 dies can theoretically be T/4
  • the computation benefit of the j-th split axis is (T-T/4).
  • the communication time corresponding to the j-th splitting axis of the i-th first operator is the time required for the p-th die to obtain target data from other die;
  • the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th splitting axis;
  • the target data is the data required for the p-th die to execute the i-th first operator after being split by the j-th splitting axis;
  • the communication time is related to the amount of the target data and the memory layout of the target data;
  • p is an integer greater than or equal to 1 and less than or equal to N.
  • the communication time corresponding to each split axis can be determined based on the time consumed by the operator after being split by the split axis to obtain the data required for calculation from other bare chips when calculating on the corresponding bare chip.
  • by taking the communication time into account, the present application can ensure that the computation allocated to each die after splitting along the optimal splitting axis only needs, as far as possible, to access the storage within that die, avoiding the latency caused by accessing data across dies.
  • even under a relatively low inter-die interconnection bandwidth, the present application can still achieve high computing performance, so that as much of the multi-die packaged chip's area as possible can be devoted to computation, improving the computing power density of the multi-die packaged chip.
  • in the case where the previous operator has no splitting axis (that is, the previous operator has not been split), a die often needs to communicate across dies to obtain computing data from other dies when executing the split current operator.
  • the segmentation result of the i-th first operator includes: a list of output tensors of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator, the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis, and one or more bare chips allocated to the i-th first operator after being segmented by the optimal segmentation axis.
  • each operator may include a series of split results after being split (such as a list of output tensors of the operator, the input/output tensor shapes before the operator is split, the input/output tensor shapes after the operator is split, and which bare chips are assigned to after the operator is split, etc.).
  • split results can provide effective support for subsequent graph splitting of the original computational graph (such as the first computational graph), thereby quickly and efficiently constructing multiple second computational graphs corresponding to multiple bare chips one by one, thereby improving the computing efficiency of multiple bare chips.
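  • The application does not prescribe a concrete data structure for the split result; purely as an illustration (all field names and example values below are hypothetical), such a record might look like the following sketch:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative sketch only: field names are ours, not the application's.
@dataclass
class OperatorSplitResult:
    op_name: str                                  # name of the first operator
    output_tensors: List[str]                     # list of its output tensors
    original_shapes: Dict[str, Tuple[int, ...]]   # input/output shapes before splitting
    split_shapes: Dict[str, Tuple[int, ...]]      # shapes after splitting along the optimal axis
    split_axis: str                               # the optimal splitting axis, e.g. "M", "K" or "N"
    assigned_dies: List[int] = field(default_factory=list)  # dies the split pieces go to

# Example: the MatMul of Figure 3a split in half along the M axis onto two dies.
result = OperatorSplitResult(
    op_name="MatMul_A",
    output_tensors=["C"],
    original_shapes={"A": (1024, 512), "B": (512, 256), "C": (1024, 256)},
    split_shapes={"A": (512, 512), "B": (512, 256), "C": (512, 256)},
    split_axis="M",
    assigned_dies=[0, 1],
)
```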
  • the pth second computation graph among the N second computation graphs includes multiple second operators; the multiple second operators include the i-th first operator after being split by the optimal splitting axis; and the p-th second computation graph is a second computation graph allocated to be executed on the p-th die.
  • each second computation graph obtained after operator segmentation and graph segmentation can include the segmented first operator.
  • each complete computation is segmented and then distributed to multiple dies, and each die only needs to perform the computation of part of the data in the original first operator, effectively improving the computation efficiency of the multi-die packaged chip.
  • the plurality of second operators in the p-th second computation graph may further include one or more of a slice operator, a communication operator and a reduction operator; the slice operator is used to obtain the input tensor of the i-th first operator after it is split by the optimal splitting axis; the communication operator is used to obtain, from other dies, the input tensor of the split i-th first operator when the optimal splitting axis of the i-th first operator differs from that of the (i-1)-th first operator; the reduction operator is used to reduce the data on the corresponding multiple dies when the optimal splitting axis of the i-th first operator is a reduction axis.
  • an embodiment of the present application provides an electronic device, the electronic device comprising N bare chips, where N is an integer greater than or equal to 1.
  • the electronic device is used to implement the corresponding functions of any one of the multi-bare-chip-based computing methods provided in the first aspect above.
  • an embodiment of the present application provides an electronic device, the electronic device includes a processor, and the processor is configured to support the electronic device to perform the corresponding functions in any one of the methods provided in the first aspect.
  • the electronic device may also include a memory, the memory is used to couple with the processor, and the memory stores the necessary program instructions and data of the electronic device.
  • the electronic device may also include a communication interface for the electronic device to communicate with other devices or a communication network.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements any one of the multi-die-based computing method processes provided in the first aspect.
  • an embodiment of the present application provides a computer program, which includes instructions.
  • when the computer program is executed by a computer, the computer can execute any one of the multi-die-based computing method processes provided in the first aspect above.
  • an embodiment of the present application provides a chip, which includes a processor and a communication interface, and the processor is used to call and run instructions from the communication interface.
  • when the processor executes the instructions, the chip executes any one of the multi-die-based computing method processes provided in the first aspect above.
  • an embodiment of the present application provides a chip system, which includes the electronic device described in any one of the second aspect or the third aspect, and is used to implement the functions involved in any one of the multi-die-based computing method processes provided in the first aspect.
  • the chip system also includes a memory, which is used to store program instructions and data necessary for the multi-die-based computing method.
  • the chip system can be composed of chips, or it can include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of a system architecture based on a multi-die chip provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the structure of a graph compiler provided in an embodiment of the present application.
  • FIG. 3a is a schematic diagram of operator segmentation provided in an embodiment of the present application.
  • FIG. 3b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
  • FIG. 4 is a flow chart of a multi-die-based computing method provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of an operator segmentation method provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of calculating a segmentation benefit provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a method for calculating computation time provided in an embodiment of the present application.
  • FIG. 8 is a flow chart of a method for calculating communication time provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram showing that the segmentation axes of the preceding and succeeding operators are the same, provided in an embodiment of the present application.
  • FIG. 10 is a flow chart of a graph segmentation method provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • "At least one (item)" means one or more, and "more than one" means two or more.
  • "And/or" is used to describe an association relationship between associated objects, indicating that three relationships can exist.
  • For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
  • "At least one of the following" or a similar expression refers to any combination of these items, including any combination of single items or plural items.
  • "At least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a, b and c", where a, b and c can be singular or plural.
  • a component can be, but is not limited to, a process, a processor, an object, an executable file, an execution thread, a program and/or a computer running on a processor.
  • the application running on the processor and the processor can be a component.
  • One or more components may reside in a process and/or an execution thread, and the component may be located on a computer and/or distributed between two or more computers.
  • these components may be executed from various computer-readable media having various data structures stored thereon.
  • Components may, for example, communicate through local and/or remote processes according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems by way of the signal across a network such as the Internet).
  • Die refers to the crystal grain of the chip before it is packaged. It is a small piece cut from a silicon wafer by laser. Each die is an independent functional chip. Dies will be packaged as a unit to become common chips. In order to meet the computing power requirements of today's artificial intelligence chips, the industry has proposed a technical solution to package multiple dies in one chip to provide greater computing power.
  • An operator is a mapping from a function space to another function space.
  • operators can be extended to any space, such as inner product space.
  • any operation on any function can be considered an operator, such as matrix multiplication (matmul), even exponentiation and square root can be considered an operator.
  • Directed acyclic graph also known as computational graph.
  • DAG Directed acyclic graph
  • the technical problems that the present application specifically aims to solve will be further analyzed and proposed below.
  • when multi-die packaging technology is adopted, the NUMA architecture is usually adopted.
  • the NUMA architecture usually brings the problem of accessing data across dies, resulting in a significant increase in access latency and thereby reducing the overall computing efficiency.
  • MCM-GPU multi-chip module-graphics processing unit
  • MCM-GPU can include the following technical points:
  • a portion of the space in the L2 cache is allocated as an L1.5 cache, which is specifically responsible for caching data from remote dies and improving the performance of accessing data on remote dies.
  • compute thread arrays (CTA) are scheduled across dies in a distributed and batched manner.
  • MCM-GPU has the following main disadvantages:
  • MCM-GPU has high requirements for inter-die bandwidth, which has a great impact on computing performance.
  • its scalability is poor: as the number of dies continues to increase, it is difficult for the inter-die bandwidth to scale up accordingly.
  • the technical problems that this application actually aims to solve include the following aspects: Based on the existing hardware devices and NUMA architecture, by compiling the computational graph, each complete operator in the original computational graph is split into multiple ones, and distributed to the corresponding multiple Dies for calculation, so as to make full use of the computing resources of each Die, greatly improving the computing efficiency of multi-die packaged chips.
  • the data required for the computation allocated to each die after the operator is split is mostly in the local storage of that die, avoiding cross-die data access as much as possible. In this way, even under a relatively low inter-die interconnection bandwidth, the embodiments of the present application can effectively improve the computing efficiency of the entire multi-die chip.
  • Figure 1 is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application.
  • the embodiments of the present application can be mainly applied to the training and inference scenarios of AI models.
  • AI developers can use AI frameworks such as TensorFlow, PyTorch, MindSpore, etc. to develop training/inference scripts, and then trigger the execution of training/inference through the AI framework.
  • the system architecture may include an AI framework 101, a graph compiler 102, a memory management 103, an operation compiler 104, and a chip 105.
  • chip 105 is a multi-Die chip, which may include multiple bare chips such as Die 0, Die 1, Die 2, Die 3, etc. shown in Figure 1.
  • the AI framework 101 is used to construct the user's calculation into a DAG graph. Then, the AI framework 101 sends the DAG graph to the graph compiler 102 for compilation. It should be understood that in the DAG graph, the calculation is expressed by nodes, and the data transferred between calculations or the dependency between calculations are expressed by the edges between nodes. Each node in the DAG graph is an operator.
  • the graph compiler 102 is used to compile the operators in the DAG graph in a certain topological order, and finally compile all nodes on the DAG graph into a task list and send it to the memory management (runtime) 103.
  • the task list may include multiple computing tasks.
  • the memory management 103 is used to send the task list to all the dies in the chip 105 .
  • Chip 105 is used to execute the sent task list through multiple Dies therein.
  • for a single-die chip, the memory management 103 only needs to send the task list to the only die in chip 105.
  • for a multi-die chip, the memory management 103 needs to send the task lists to the multiple dies in chip 105.
  • graph compiler 102 can split each complete operator in the DAG graph, and based on the split result, split the original DAG graph into multiple subgraphs, which are then distributed to the corresponding multiple dies for execution through memory management 103, thereby greatly improving the computing efficiency of multi-die chips.
  • the landing product of this application can be a training/inference device in a data center, deployed on a training/inference server in the data center, or any other possible device with AI functions (such as face recognition) that performs calculations based on DAG graphs, for example an edge device such as a camera on a street lamp.
  • This application mainly realizes the deployment of calculation compilation to multiple Dies in a chip by improving the compilation process of the DAG graph.
  • the graph compiler 102 may include a graph sorting unit 21, an operator segmentation unit 22, a graph segmentation unit 23, a model compilation unit 24, and a model deployment unit 25.
  • the graph compiler 102 may also include an environment information library 26 and an operator information library 27.
  • the AI framework 101 can be connected to the graph sorting unit 21 in the graph compiler 102, and the model deployment unit 25 can be connected to the memory management 103.
  • the graph sorting unit 21 is used to convert the DAG into an operator list by topological sorting, wherein the operator list includes multiple operators.
  • the order of the operators in the operator list expresses the order in which the operators are executed.
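  • The graph sorting unit's algorithm is not detailed here; as a minimal sketch (our own illustrative code, not the application's), a DAG can be converted into an ordered operator list with a standard topological sort such as Kahn's algorithm:

```python
from collections import defaultdict, deque
from typing import Dict, List

def topological_order(edges: Dict[str, List[str]]) -> List[str]:
    """Return one topological order of a DAG given as {node: [successors]}."""
    indegree = defaultdict(int)
    nodes = set(edges)
    for src, dsts in edges.items():
        nodes.update(dsts)
        for dst in dsts:
            indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dst in edges.get(node, []):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical three-operator graph: MatMul -> Add -> ReLU.
print(topological_order({"MatMul": ["Add"], "Add": ["ReLU"]}))
```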
  • the operator segmentation unit 22 is used to perform operator segmentation processing on multiple operators in the DAG graph respectively, and generate segmentation results for each operator. Specifically, as shown in FIG2 , when the operator segmentation unit 22 performs operator segmentation, it can read the corresponding environment information and operator information through the environment information library 26 and the operator information library 27 respectively.
  • the environment information in the environment information library 26 may include: the number of dies in the chip, the memory specifications of each die, the interconnection topology and bandwidth between the dies, etc.
  • the operator information in the operator information library 27 may include: operator type, input/output list, and axis information of each input/output.
  • compared with a conventional environment information library, the environment information library 26 in the present application changes the environment information from describing a single die in a single chip to describing multiple dies in a single chip, and adds descriptions of the topological structure of the inter-die interconnection network, the interconnection bandwidth, and so on.
  • the operator information library 27 in the present application needs to add a description of the input/output axis information for each operator in the operator information.
  • the axis information is an expression of the segmentation relationship between different input/output tensors (Tensor) of the operator. All inputs/outputs of the same axis must be segmented using the same segmentation method (i.e., segmentation axis).
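  • The exact schema of the operator information library is not given; purely as an illustration (field names are ours), an entry for the MatMul operator discussed next might record the axis information like this:

```python
# Hypothetical operator-information entry for a MatMul; field names are illustrative only.
matmul_info = {
    "op_type": "MatMul",
    "inputs":  ["A", "B"],        # A is M x K, B is K x N
    "outputs": ["C"],             # C is M x N
    # Axis information: tensors that share an axis must be split the same way.
    "axes": {
        "M": {"type": "element-wise", "appears_in": {"A": 0, "C": 0}},
        "K": {"type": "reduction",    "appears_in": {"A": 1, "B": 0}},
        "N": {"type": "element-wise", "appears_in": {"B": 1, "C": 1}},
    },
}
```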
  • Figure 3a is a schematic diagram of an operator segmentation provided in an embodiment of the present application.
  • Operator A in Figure 3a is a matrix multiplication (MatMul) operator, and operator A includes two input matrices and an output matrix, wherein the left input matrix is an M×K matrix, the right input matrix is a K×N matrix, and the output matrix is an M×N matrix.
  • Operator A includes three segmentation axes, namely, the M axis, the K axis, and the N axis.
  • the left input matrix and the output matrix of operator A are segmented along the same segmentation axis (i.e., the M axis, as indicated by the dotted line in Figure 3a), thereby obtaining the segmented operators a1 and a2, wherein the left input matrices of operators a1 and a2 are (M/2)×K matrices, and the output matrices are (M/2)×N matrices.
  • the data required to be calculated by operators a1 and a2 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
  • FIG. 3b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
  • the right input matrix and the output matrix of operator A are segmented along the same segmentation axis (i.e., the N axis, as indicated by the dotted line in FIG. 3b), thereby obtaining operator a3 and operator a4, wherein the right input matrices of operator a3 and operator a4 are K×(N/2) matrices, and the output matrices are M×(N/2) matrices.
  • the data required to be calculated by operator a3 and operator a4 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
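  • The splits of Figures 3a and 3b can be checked numerically. The following sketch (ours, using NumPy with arbitrary small sizes) splits a MatMul along the M axis and along the N axis and verifies that concatenating the partial outputs reproduces the full result:

```python
import numpy as np

M, K, N = 8, 6, 4
A = np.random.rand(M, K)   # left input, M x K
B = np.random.rand(K, N)   # right input, K x N
C = A @ B                  # full operator A: M x N

# Figure 3a: split along the M axis -> operators a1, a2 each compute (M/2) x N.
c1 = A[: M // 2] @ B       # notionally on Die 0
c2 = A[M // 2 :] @ B       # notionally on Die 1
assert np.allclose(np.concatenate([c1, c2], axis=0), C)

# Figure 3b: split along the N axis -> operators a3, a4 each compute M x (N/2).
c3 = A @ B[:, : N // 2]    # notionally on Die 0
c4 = A @ B[:, N // 2 :]    # notionally on Die 1
assert np.allclose(np.concatenate([c3, c4], axis=1), C)
```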
  • the graph segmentation unit 23 is used to reorganize the segmented operators (such as operator a1 and operator a2 shown in FIG3a ) into a sub-graph (sub DAG) executed on multiple Dies based on the segmentation results of the operator segmentation, and each sub-graph includes multiple segmented operators.
  • the sub-graph 1 assigned to Die 0 for calculation may include the above operator a1
  • the sub-graph 2 assigned to Die 1 for calculation may include the above operator a2.
  • the operator segmentation unit 22 and the graph segmentation unit 23 are newly added units in the graph compiler of the present application.
  • the model compilation unit 24 is used to compile multiple sub DAGs into a model list that can be deployed.
  • the model list includes multiple models, and each model includes multiple computing tasks. It should be understood that compared with conventional model compilation, the model compilation unit 24 in the present application needs to increase the compilation support for multiple sub DAGs.
  • the model deployment unit 25 is used to deploy multiple model lists to corresponding multiple Dies. Finally, the multiple model lists are allocated to their corresponding Dies for execution through the memory management 103, making full use of the computing power of each Die in the chip and effectively improving the computing efficiency of the multi-Die chip.
  • Figure 4 is a flow chart of a multi-die-based calculation method provided in an embodiment of the present application.
  • the method is mainly aimed at a chip packaged by N dies, where N is an integer greater than or equal to 1.
  • the method can be applied to the system architecture described in Figure 1. Specifically, the method can be applied to the graph compiler 102 shown in Figure 2.
  • the method provided in an embodiment of the present application will be described in detail below in conjunction with the graph compiler 102 described in Figure 2. As shown in Figure 4, the method may include the following steps S501-S504.
  • Step S501 Obtain a first computation graph, where the first computation graph includes M first operators.
  • the graph compiler 102 obtains a first computation graph, which includes M first operators.
  • M is an integer greater than or equal to 1.
  • the first computation graph is a DAG graph before segmentation, and accordingly, the M first operators included in the first computation graph are all operators before segmentation, and each first operator corresponds to a calculation of a complete data.
  • the graph compiler 102 may obtain the first computation graph through the graph sorting unit 21 therein, and convert the first computation graph into an operator list through topological sorting, wherein the operator list includes the above-mentioned M first operators arranged in order.
  • Step S502 segmenting the M first operators respectively to obtain segmentation results of the M first operators.
  • the graph compiler 102 performs operator segmentation processing on the M first operators in sequence, thereby obtaining segmentation results for each of the M first operators.
  • the graph compiler 102 may first determine the optimal segmentation axis of each first operator, and then segment the first operator based on the optimal segmentation axis of each first operator, so as to obtain the most ideal segmentation result.
  • the optimal segmentation axis may be the segmentation axis that brings the greatest segmentation benefit.
  • the graph compiler 102 may first determine the optimal segmentation axis of the i-th first operator among the M first operators, and then segment the i-th first operator based on the optimal segmentation axis of the i-th first operator to obtain the segmentation result of the i-th first operator.
  • i is an integer greater than or equal to 1 and less than or equal to M.
  • the segmentation result of the i-th first operator includes: the output tensor list of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator (for example, M×N shown in Figure 3a above), the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis (for example, (M/2)×N shown in Figure 3a above), and the one or more dies correspondingly allocated to the i-th first operator after it is segmented by the optimal segmentation axis.
  • Figure 5 is a schematic flow chart of an operator segmentation method provided in an embodiment of the present application. As shown in Figure 5, the method includes the following steps S11 to S17.
  • Step S11 obtaining operator information.
  • the graph compiler 102 obtains the operator information of M first operators (i.e., all operators to be split) from the operator information library 27.
  • the operator information mainly includes the input tensor (Input Tensor) list, output tensor (Output Tensor) list and axis information of each first operator.
  • the axis information of each first operator may include the axis type, such as element-wise, reduction (Reduction), sliding window (SlidingWindow) and other types; different axis types express different computing characteristics of operators.
  • Step S12 whether there is an uncalculated segmentation axis, if yes, execute step S13, if not, execute step S17.
  • the graph compiler 102 checks whether the current first operator (for example, the ith first operator) still has a slicing axis whose slicing benefit has not been calculated. If the current ith first operator still has a slicing axis that has not been calculated, the graph compiler 102 may then select the next slicing axis and calculate the slicing benefit; if the slicing benefit calculation has been completed for all slicing axes included in the current ith first operator (for example, including K slicing axes, K is an integer greater than or equal to 1), the graph compiler 102 may record the slicing axis with the largest slicing benefit as the optimal slicing axis, and record the slicing result of the ith first operator under the optimal slicing axis in the tensor slicing table.
  • the tensor slicing table may be located in a database or in memory.
  • Step S13 segmentation axis selection.
  • the graph compiler 102 can then select the next split axis (for example, the j-th split axis among the K split axes) and calculate its split benefit. For example, as shown in FIG. 3a above, operator A includes three split axes; if the split benefit of the M axis has already been calculated at this point, the graph compiler 102 can then select the N axis or the K axis and calculate the corresponding split benefit.
  • Step S14 calculating the split benefit.
  • the graph compiler 102 calculates the splitting benefit of the i-th first operator under the current splitting axis.
  • Figure 6 is a schematic diagram of calculating a split benefit provided in an embodiment of the present application.
  • the split benefit is related to the calculation benefit and the communication time (or communication loss).
  • the split benefit corresponding to each split axis is specifically the difference between the calculation benefit and the communication time. The greater the calculation benefit and the smaller the communication time, the greater the split benefit.
  • the computational benefit is the reduction in computational time of the ith first operator after the split. Specifically, it is the difference between the time required for the ith first operator to be computed on one die before the split (e.g., the first computational time) and the time required for the ith first operator to be computed on multiple die in parallel after being split by the current jth split axis (e.g., the second computational time).
  • the computation time of the i-th first operator on a single die is T
  • the computation time of the i-th first operator after being split by its j-th splitting axis and distributed to 4 die for computation can theoretically be T/4
  • the computation benefit of the j-th splitting axis is (T-T/4).
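  • As a minimal sketch of the benefit calculation in step S14 (assuming, for simplicity, ideal linear scaling of the computation time, which the following paragraphs note is not exact in practice; the function name and values are ours):

```python
def split_benefit(t_single: float, num_dies: int, t_comm: float) -> float:
    """Split benefit of one candidate axis.

    t_single : time to execute the whole operator on one die
    num_dies : number of dies the split operator is distributed over
    t_comm   : cross-die communication time this split would incur
    Assumes ideal linear scaling (t_parallel = t_single / num_dies);
    a real implementation would use a cost model instead.
    """
    t_parallel = t_single / num_dies
    compute_benefit = t_single - t_parallel
    return compute_benefit - t_comm

# Example from the text: time T on one die, split over 4 dies, no communication needed.
T = 100.0
print(split_benefit(T, 4, 0.0))   # 75.0, i.e. T - T/4
```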
  • the amount of computational data and the computational time of the operator are not linearly related, and the computational time on each die is often related to many factors.
  • FIG. 7 is a schematic diagram of a computation time calculation method provided in an embodiment of the present application.
  • as shown in FIG. 7, the graph compiler 102 needs to consider factors such as the chip type (which determines the number of cores and the clock frequency of the various accelerated computing units, the sizes of the caches at all levels, etc.), the type of the input data (DType), the shape of the input data (Shape), and the like, and calculates, based on a cost model (Cost Model), the computation time required for each die to execute the i-th first operator (such as operator a1 shown in FIG. 3a above) after it is split by the current j-th splitting axis.
  • the communication time is the time consumed when the die needs to communicate with other die to obtain the calculation data stored on other die when executing the split operator after the i-th first operator is split.
  • the communication time corresponding to the j-th split axis of the i-th first operator is the time required for the p-th die to obtain the target data from other die.
  • the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th split axis
  • the target data is the data required when the p-th die executes the i-th first operator after the split.
  • the communication time is generally related to the number of target data (i.e., the amount of communication data) and the memory arrangement of the target data (i.e., the memory arrangement of the communication data), and can also be related to factors such as the interconnection topology between Dies and the connection bandwidth.
  • p is an integer greater than or equal to 1 and less than or equal to N.
  • Figure 8 is a flow chart of a method for calculating communication time consumption provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps S21 to S25.
  • Step S21 reading the communication topology and bandwidth.
  • the graph compiler 102 reads the communication topology between multiple Dies in a Chip and the bandwidth data of the communication links from the environment information library 26 .
  • Step S22 reading the segmentation result of the preceding operator.
  • the graph compiler 102 reads the segmentation result of the predecessor operator of the current i-th first operator from the tensor segmentation table, and determines the optimal segmentation axis of the predecessor operator, that is, determines the optimal segmentation axis of the i-1th first operator. If the optimal segmentation axis of the i-1th first operator is the same as the j-th segmentation axis of the current i-th first operator, there is no need for cross-die communication, that is, the communication time consumption is zero.
  • Figure 9 is a schematic diagram of the same segmentation axis of the previous and next operators provided in an embodiment of the present application.
  • operator A is, for example, the i-1th first operator
  • operator B is, for example, the i-th first operator
  • operator A is the predecessor operator of operator B.
  • the segmentation axis of operator A and the segmentation axis of operator B are both the M axis. After operator A is segmented, only (M/2)×N output data are present on the current die.
  • when the current die executes the segmented operator B, it can directly obtain the (M/2)×N input data for computation on the current die without cross-die communication; otherwise, if the segmentation axis of operator B were the N axis, the current die would need to obtain the other half of the M×(N/2) data from other dies, that is, cross-die communication would be required.
  • Step S23 calculating the communication data volume.
  • the graph compiler 102 needs to calculate the communication time. At this time, the graph compiler 102 can first calculate the amount of communication data that needs to be accessed across Dies.
  • the graph compiler 102 also needs to calculate the amount of data that needs to be exchanged with other Dies based on the type and shape of the operator input tensor.
  • Step S24 calculating the memory arrangement of communication data.
  • the embodiments of the present application additionally take the memory layout of the communication data into account.
  • if the data that needs to be exchanged across dies after the i-th first operator is split by the j-th splitting axis lies in non-contiguous memory, multiple transfer tasks are often required to complete the exchange, which brings additional task overhead and therefore a larger communication time.
  • if the memory layout is very scattered and the amount of data is not large, many frequent exchanges of small amounts of data are easily caused, which also brings a larger communication time.
  • Step S25 calculating the communication time consumption.
  • the graph compiler 102 can preliminarily calculate the communication time as the communication data volume divided by the inter-die bandwidth.
  • the communication time consumption can be comprehensively calculated based on the communication data volume and the memory layout information of the communication data.
  • the graph compiler 102 can specifically perform a simple evaluation of the communication time consumption based on some typical packet length test values, or use a more complex Cost Model to calculate the communication time consumption, etc., which is not specifically limited in the embodiments of the present application.
  • the communication time consumption is not only affected by the above-mentioned data volume, but also by other factors such as the size of the transmitted packet and the control signal during the communication process, which is not specifically limited in the embodiments of the present application.
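  • Putting steps S21 to S25 together, a rough estimator of the communication time could look like the following sketch (our own simplification; the chunk count and per-transfer overhead parameters are assumptions, and a real implementation would read the topology from the environment information library or use a full cost model):

```python
from typing import Optional

def estimate_comm_time(prev_axis: Optional[str],
                       cur_axis: str,
                       data_bytes: int,
                       bandwidth_bytes_per_s: float,
                       num_chunks: int = 1,
                       per_transfer_overhead_s: float = 1e-6) -> float:
    """Rough cross-die communication time for one candidate splitting axis.

    Returns 0 when the predecessor operator was split along the same axis,
    since the required data is then already local (step S22). Otherwise the
    time is the data volume divided by the inter-die bandwidth (step S25),
    plus a per-transfer overhead that grows when the data is scattered over
    many non-contiguous chunks (step S24).
    """
    if prev_axis is not None and prev_axis == cur_axis:
        return 0.0
    transfer_time = data_bytes / bandwidth_bytes_per_s
    layout_overhead = num_chunks * per_transfer_overhead_s
    return transfer_time + layout_overhead

# Example: 4 MB fetched from another die over a 100 GB/s link, in one contiguous chunk.
print(estimate_comm_time("M", "N", 4 * 2**20, 100e9))
```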
  • Step S15 whether it is the optimal segmentation axis.
  • the graph compiler 102 obtains the segmentation benefit corresponding to the current j-th segmentation axis based on the calculation of the computational benefit and the communication time consumption in the above-mentioned embodiments corresponding to Figures 7 and 8. Then, the graph compiler 102 compares the segmentation benefit of the current j-th segmentation axis with the segmentation benefits corresponding to the previous j-1 segmentation axes. If the segmentation benefit of the current j-th segmentation axis is greater than the segmentation benefits corresponding to the previous j-1 segmentation axes, the graph compiler 102 can determine that the current j-th segmentation axis is the optimal segmentation axis and execute step S16, otherwise execute step S12.
  • Step S16 recording the optimal segmentation axis.
  • the graph compiler 102 records the current j-th slicing axis as the optimal slicing axis for the current i-th first operator.
  • if a subsequent splitting axis (for example, the (j+1)-th splitting axis) brings a larger splitting benefit, the graph compiler 102 may update the optimal splitting axis, i.e., record the (j+1)-th splitting axis as the optimal splitting axis of the current i-th first operator.
  • Step S17 recording the operator segmentation result.
  • after the splitting benefits of all the splitting axes have been calculated, the optimal slicing axis of the current i-th first operator can be determined. The graph compiler 102 then slices the current i-th first operator along the optimal slicing axis, obtains the slicing result of the i-th first operator, and records the slicing result of the i-th first operator in the tensor slicing table.
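  • Steps S12 to S16 amount to the following selection loop (illustrative Python; the benefit function is a stand-in for the computational-benefit and communication-time calculations described above, and the example gain values are made up):

```python
from typing import Callable, List, Optional

def choose_optimal_axis(axes: List[str],
                        benefit_of: Callable[[str], float]) -> Optional[str]:
    """Return the splitting axis with the largest splitting benefit (steps S12-S16).

    `axes` are the candidate splitting axes of one operator; `benefit_of`
    computes the splitting benefit of one axis (computational benefit minus
    communication time). Returns None when the operator has no splitting axis.
    """
    best_axis, best_gain = None, float("-inf")
    for axis in axes:                             # step S13: pick the next axis
        gain = benefit_of(axis)                   # step S14: compute its benefit
        if gain > best_gain:                      # step S15: best so far?
            best_axis, best_gain = axis, gain     # step S16: record it
    return best_axis

# Example with made-up benefits for the three axes of the MatMul in Figure 3a.
gains = {"M": 70.0, "K": 40.0, "N": 65.0}
print(choose_optimal_axis(list(gains), gains.__getitem__))   # "M"
```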
  • all the method processes in the above step 502 can be specifically executed by the operator segmentation unit 22 in the graph compiler 102.
  • Step S503 Split the first computation graph based on the segmentation results of the M first operators to obtain corresponding N second computation graphs.
  • the graph compiler 102 divides the first computation graph based on the division results of the M first operators to obtain corresponding N second computation graphs.
  • the N second computation graphs correspond one-to-one to the N dies in the chip.
  • N is an integer greater than or equal to 1.
  • Figure 10 is a schematic diagram of a method flow of graph segmentation provided by an embodiment of the present application. As shown in Figure 10, the method may include the following steps S31 to S35.
  • Step S31 create sub DAG (subgraph).
  • the graph compiler 102 constructs a corresponding number of sub-DAGs according to the number of dies in the Chip. It should be understood that each sub-graph is an empty graph initially. Exemplarily, the graph compiler 102 creates N sub-graphs corresponding to N dies.
  • Step S32 traverse the first computation graph.
  • the graph compiler 102 traverses each first operator in the first computation graph, and obtains the segmentation result corresponding to each first operator.
  • Step S33 reading the operator segmentation result.
  • the graph compiler 102 reads the segmentation result of the current first operator. For example, the graph compiler 102 reads the segmentation result of the i-th first operator.
  • the multiple operators obtained after the i-th first operator is segmented are distributed to multiple dies among the N dies (for example, including the above-mentioned p-th die).
  • Step S34 whether it is necessary to insert a communication operator.
  • the graph compiler 102 determines whether it is necessary to insert a communication operator before the i-th first operator after the segmentation based on the segmentation result of the current i-th first operator. If not, step S35 is directly executed. If yes, step S36 is executed. As described above, if the segmentation axis of two adjacent operators changes, the data part required by the post-operator is on other dies. At this time, a communication operator needs to be inserted to obtain the data required on the current die from other dies.
  • the original Tensor is complete data, but the operator after segmentation often only needs to use part of the data for calculation. At this time, it is necessary to insert a slice operator to split the original Tensor into the data required by the operator after segmentation in the subgraph.
  • the current i-th first operator is splitting the Reduction axis, that is, when the i-th first operator is a Reduce calculation (such as ReduceSum/ReduceMax, etc.), it is necessary to reduce the data of multiple Dies, so it is necessary to insert AllReduce (Reduction) operator.
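  • When the optimal splitting axis is a reduction axis, the partial results on the dies must be reduced. The following NumPy sketch (ours, with arbitrary sizes) splits a MatMul along its K (reduction) axis and shows that summing the partial products, i.e. an AllReduce with a sum operation, recovers the full result:

```python
import numpy as np

M, K, N = 8, 6, 4
A = np.random.rand(M, K)
B = np.random.rand(K, N)
C = A @ B

# Splitting along the reduction axis K: each "die" holds half of K and
# computes only a partial product of the full M x N output.
partial_0 = A[:, : K // 2] @ B[: K // 2, :]   # notionally on Die 0
partial_1 = A[:, K // 2 :] @ B[K // 2 :, :]   # notionally on Die 1

# An AllReduce (here simply a sum) across the dies yields the full result.
assert np.allclose(partial_0 + partial_1, C)
```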
  • Step S35 constructing the segmented operator.
  • the graph compiler 102 constructs multiple i-th first operators after segmentation according to the segmentation result of the current i-th first operator.
  • for example, if the i-th first operator is operator A in Figure 3a, the multiple split operators may be operator a1 and operator a2 shown in Figure 3a.
  • after an operator is split, the shapes of its corresponding input/output tensors change, so a new split operator needs to be constructed.
  • the various attributes of the original operator can be copied, but the shape of the input/output tensor needs to be modified to the shape after segmentation.
  • Step S36 construct a communication operator.
  • as described in step S34, if the splitting axis changes between two adjacent operators, part of the data required by the later operator resides on other dies. Thus, if the optimal splitting axes of the (i-1)-th first operator and the i-th first operator are different, the graph compiler 102 can construct a corresponding communication operator before the split i-th first operator.
  • Step S37 add the constructed operator to sub DAG.
  • the graph compiler 102 adds the split operators constructed in step S35 (e.g., multiple second operators obtained after the i-th first operator is split by the optimal split axis) and the inserted operators constructed in step S36 (e.g., communication operators and split operators, etc.) to the corresponding sub DAG. Further, steps S33-S37 are looped until each first operator in the first computation graph is traversed, thereby obtaining N second computation graphs corresponding to N dies one by one. Each second computation graph includes multiple second operators, which include the split first operator, as well as the inserted split operator, communication operator, and reduction operator, etc.
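  • A highly simplified sketch of steps S31 to S37 (our own data model; the operator descriptions, field names and inserted-operator representation are hypothetical) might look like:

```python
from typing import Dict, List, Optional

def split_graph(op_list: List[dict], num_dies: int) -> List[List[dict]]:
    """Illustrative sketch of graph splitting (steps S31-S37).

    Each entry of `op_list` describes one first operator's split result, e.g.
    {"name": "MatMul_A", "axis": "M", "axis_is_reduction": False}.
    Returns one sub-graph (a list of second operators) per die.
    """
    sub_dags: List[List[dict]] = [[] for _ in range(num_dies)]   # S31: empty sub-DAGs
    prev_axis: Optional[str] = None
    for op in op_list:                                           # S32/S33: traverse split results
        for die, sub_dag in enumerate(sub_dags):
            if prev_axis is None:
                # Inputs are still whole tensors: insert a slice operator.
                sub_dag.append({"type": "Slice", "for": op["name"]})
            elif prev_axis != op["axis"]:
                # S34/S36: the splitting axis changed, fetch missing data from other dies.
                sub_dag.append({"type": "Communication", "for": op["name"]})
            # S35: the split operator itself, assigned to this die.
            sub_dag.append({"type": op["name"], "die": die})
            if op.get("axis_is_reduction"):
                # The reduction axis was split: partial results must be reduced.
                sub_dag.append({"type": "AllReduce", "for": op["name"]})
        prev_axis = op["axis"]
    return sub_dags

ops = [{"name": "MatMul_A", "axis": "M", "axis_is_reduction": False},
       {"name": "MatMul_B", "axis": "N", "axis_is_reduction": False}]
for die, graph in enumerate(split_graph(ops, num_dies=2)):
    print(f"Die {die}:", [node["type"] for node in graph])
```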
  • all the method processes in the above step 503 can be specifically executed by the graph segmentation unit 23 in the graph compiler 102.
  • Step S504 distribute the N second computation graphs to N dies for execution.
  • the graph segmentation unit 23 in the graph compiler 102 outputs a subgraph list containing the N second computation graphs to the model compilation unit 24.
  • Each second computation graph corresponds to a bare chip for execution.
  • the model compilation unit 24 outputs a corresponding model list based on the subgraph list to the model deployment unit 25.
  • the model deployment unit 25 deploys each model corresponding to the N second computation graphs to the corresponding multiple bare chips for execution.
  • each method flow in the multi-die-based computing method described in the embodiments of the present application can be implemented based on software, hardware, or a combination thereof.
  • the hardware implementation can include logic circuits, algorithm circuits, or analog circuits.
  • the software implementation can include program instructions, which can be regarded as a software product, stored in a memory, and can be executed by a processor to implement related functions.
  • the embodiments of the present application have improved the graph compiler and optimized the compilation deployment of the DAG graph in the multi-Die chip scenario.
  • the embodiments of the present application can divide each complete calculation in the DAG graph into multiple smaller sub-calculations based on operator segmentation, and divide the original DAG graph into multiple sub-DAG graphs, and then compile these sub-DAG graphs into multiple models, and finally deploy them on multiple Dies in a Chip of a NUMA architecture.
  • the embodiments of the present application select the optimal segmentation scheme for each operator in the DAG graph by comparing the segmentation benefits of different segmentation schemes (i.e., different segmentation axes), and divide each operator into multiple operators according to the optimal segmentation scheme for each operator. These operators are calculated on multiple Dies at the same time to make full use of the computing resources of multiple Dies in the Chip and improve computing efficiency.
  • the embodiments of the present application can bring the following beneficial effects.
  • This application calculates the slicing benefit and selects the optimal slicing axis so that the calculation on each NUMA node only needs to access the storage of this node.
  • the calculation performed by each die only needs to access the storage on this die, so that the implementation of the operator does not need to be aware of cross-die memory access, simplifying operator development.
  • the requirements for inter-die bandwidth and topology are reduced, and even at a lower inter-die bandwidth, higher computing performance can still be achieved. Users can use the chip area for computing as much as possible, thereby improving the computing power density of the chip.
  • the embodiment of the present application also provides an electronic device.
  • Figure 11 is a schematic diagram of the structure of an electronic device provided in the embodiment of the present application.
  • the electronic device 110 at least includes a processor 1101, an input device 1102, an output device 1103 and a memory 1104.
  • the electronic device may also include other common components, which are not described in detail here.
  • the processor 1101, the input device 1102, the output device 1103 and the memory 1104 in the electronic device can be connected via a bus or other means.
  • the electronic device 110 may be a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer or a server, etc., or may be a server cluster or a cloud computing service center composed of multiple servers.
  • the memory 1104 in the electronic device 110 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 1104 may exist independently and be connected to the processor 1101 through a bus.
  • the memory 1104 may also be integrated with the processor 1101.
  • the computer-readable storage medium may be stored in the memory 1104 of the electronic device 110, the computer-readable storage medium is used to store a computer program, the computer program includes program instructions, and the processor 1101 is used to execute the program instructions stored in the computer-readable storage medium.
  • the processor 1101 (or CPU (Central Processing Unit)) is the computing core and control core of the electronic device 110, which is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions to implement the corresponding method flow or corresponding function.
  • the processor 1101 described in the embodiment of the present application can be used to perform a series of processing based on a multi-die computing method, including: obtaining a first computing graph, the first computing graph includes M first operators; dividing the M first operators respectively to obtain the dividing results of the M first operators; dividing the first computing graph based on the dividing results of the M first operators to obtain corresponding N second computing graphs; each of the N second computing graphs includes the divided first operator; N and M are integers greater than or equal to 1; allocating the N second computing graphs to N dies for execution; the N second computing graphs correspond to the N dies one by one, etc.
  • An embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor can execute part or all of the steps of any one of the above method embodiments.
  • An embodiment of the present application also provides a computer program, which includes instructions.
  • when the computer program is executed by a multi-core processor, the processor can execute part or all of the steps of any one of the above method embodiments.
  • it should be understood that, in the several embodiments provided in the present application, the disclosed devices may be implemented in other ways.
  • the device embodiments described above are only illustrative. For example, the division of the above-mentioned units is only a logical function division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server or a network device, etc., and specifically may be a processor in a computer device) to perform all or part of the steps of the methods of each embodiment of the present application.
  • the aforementioned storage medium may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a double data rate synchronous dynamic random access memory (DDR), a flash memory, a random access memory (RAM), or other media that can store program code.


Abstract

Disclosed in embodiments of the present application are a multi-die-based computation method and a related device. The method comprises: acquiring a first computational graph, wherein the first computational graph comprises M first operators; respectively segmenting the M first operators to obtain segmentation results of the M first operators; segmenting the first computational graph on the basis of the segmentation results of the M first operators to obtain N corresponding second computational graphs, wherein each of the N second computational graphs comprises a segmented first operator, and N and M are integers greater than or equal to 1; and allocating the N second computational graphs to N dies for execution, wherein the N second computational graphs have one-to-one correspondence to the N dies. By means of the embodiments of the present application, the computation efficiency of multi-die package chips can be improved.

Description

一种基于多裸片的计算方法及相关设备A computing method based on multiple bare chips and related equipment
本申请要求于2022年09月29日提交中国专利局、申请号为202211198266.2、申请名称为“一种基于多裸片的计算方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the China Patent Office on September 29, 2022, with application number 202211198266.2 and application name “A computing method based on multiple bare chips and related equipment”, all contents of which are incorporated by reference in this application.
技术领域Technical Field
本申请涉及计算机技术领域,尤其涉及一种基于多裸片的计算方法及相关设备。The present application relates to the field of computer technology, and in particular to a multi-die-based computing method and related equipment.
背景技术Background technique
随着摩尔定律的放缓,单个裸片(Die)内的晶体管数量无法再持续快速的增长,但是人工智能(artificial intelligent,AI)芯片的算力需求仍然在高速发展。为解决这个问题,业内提出了多Die封装的技术,即在一个芯片(Chip)内封装多个AI Die,从而提供更大的算力。As Moore's Law slows down, the number of transistors in a single die can no longer continue to grow rapidly, but the computing power demand of artificial intelligence (AI) chips is still growing rapidly. To solve this problem, the industry has proposed a multi-die packaging technology, that is, packaging multiple AI dies in one chip to provide greater computing power.
采用多Die封装技术时,存在两种不同的网络体系结构:统一内存访问(uniform memory access,UMA)和非统一内存访问(non-uniform memory access,UMA)。在UMA架构下,Chip中的每个Die可以平等的使用Chip内所有的存储,基于此,UMA架构通常要求Die间的互连带宽能够和存储器带宽持平,以实现不同位置的存储访问的性能一致。然而,为了提供尽可能高的算力密度,Chip内会尽可能的将面积用于AI算力,使得Chip内的Die间互连可用的面积非常小,并且由于AI芯片通常会使用高宽带存储器(high bandwidth memory,HBM),即Chip内的存储器带宽非常高,这就导致Die间的互连带宽很难提高到和存储器带宽持平的水平。When using multi-die packaging technology, there are two different network architectures: uniform memory access (UMA) and non-uniform memory access (UMA). Under the UMA architecture, each die in the chip can use all the storage in the chip equally. Based on this, the UMA architecture usually requires that the interconnection bandwidth between dies be on par with the memory bandwidth to achieve consistent performance of storage access at different locations. However, in order to provide the highest possible computing power density, the area within the chip will be used for AI computing power as much as possible, making the area available for interconnection between dies within the chip very small, and because AI chips usually use high-bandwidth memory (HBM), that is, the memory bandwidth within the chip is very high, it is difficult to increase the interconnection bandwidth between dies to the same level as the memory bandwidth.
因此,在多Die封装时通常会选择采用NUMA结构。然而在NUMA结构下,容易出现跨Die访问数据的情况,使得访问时延大大增加,从而降低整体的计算效率。如何提升多Die封装芯片的计算效率是亟待解决的问题。Therefore, the NUMA structure is usually chosen for multi-die packaging. However, under the NUMA structure, it is easy to access data across dies, which greatly increases the access latency and reduces the overall computing efficiency. How to improve the computing efficiency of multi-die packaged chips is an urgent problem to be solved.
发明内容Summary of the invention
本申请实施例提供一种基于多裸片的计算方法及相关设备,可以有效提升多Die封装芯片的计算效率。The embodiments of the present application provide a multi-die based computing method and related equipment, which can effectively improve the computing efficiency of multi-die packaged chips.
本申请实施例提供的基于多裸片的计算方法可以由电子设备等执行。电子设备是指能够被抽象为计算机系统的设备,其中,基于多裸片的计算功能的电子设备也可称为基于多裸片的计算装置。基于多裸片的计算装置可以是该电子设备的整机,例如:智能可穿戴设备、智能手机、平板电脑、笔记本电脑、台式电脑、车载计算机或服务器,等等;也可以是由多个整机构成的系统/装置;还可以是该电子设备中的部分器件,例如:基于多裸片的计算功能相关的芯片,如系统芯片(system on a chip,SoC),等等,本申请实施例对此不作具体限定。其中,系统芯片也称为片上系统。The multi-die-based computing method provided in the embodiments of the present application can be executed by an electronic device, etc. An electronic device refers to a device that can be abstracted as a computer system, wherein an electronic device based on the computing function of multiple dies can also be referred to as a computing device based on multiple dies. The computing device based on multiple dies can be a complete machine of the electronic device, such as: a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a car computer or a server, etc.; it can also be a system/device composed of multiple complete machines; it can also be a part of the electronic device, such as: a chip related to the computing function based on multiple dies, such as a system on a chip (SoC), etc., which is not specifically limited in the embodiments of the present application. Among them, the system chip is also called a system on chip.
第一方面,本申请实施例提供了一种基于多裸片的计算方法,所述方法包括:获取第一计算图,所述第一计算图包括M个第一算子;对所述M个第一算子分别进行切分,得到所述M个第一算子的切分结果;基于所述M个第一算子的切分结果对所述第一计算图进行切分,得到对应的N个第二计算图;所述N个第二计算图中的每一个第二计算图包括被切分后的第一算子;N、M为大于或者等于1的整数;将所述N个第二计算图分配至N个裸片上执行;所述N个第二计算图与所述N个裸片一一对应。In a first aspect, an embodiment of the present application provides a computing method based on multiple bare chips, the method comprising: obtaining a first computing graph, the first computing graph comprising M first operators; dividing the M first operators respectively to obtain the dividing results of the M first operators; dividing the first computing graph based on the dividing results of the M first operators to obtain corresponding N second computing graphs; each of the N second computing graphs comprises the divided first operator; N and M are integers greater than or equal to 1; allocating the N second computing graphs to N bare chips for execution; the N second computing graphs correspond one-to-one to the N bare chips.
通过第一方面提供的方法,本申请可以基于算子切分,将计算图中的每一个完整的计算(例如第一算子)切分成多个较小的子计算(例如被切分后的第一算子),并将该多个较小的子计算分配到对应的多个裸片上并行计算。如此,可以充分利用每一个裸片的计算资源,大大提升了多裸片封装芯片的计算效率。Through the method provided in the first aspect, the present application can divide each complete calculation (e.g., the first operator) in the calculation graph into multiple smaller sub-calculations (e.g., the first operator after being divided) based on operator segmentation, and distribute the multiple smaller sub-calculations to the corresponding multiple bare chips for parallel calculation. In this way, the computing resources of each bare chip can be fully utilized, greatly improving the computing efficiency of the multi-bare chip package.
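For illustration only, the overall compile-time flow of the first aspect can be sketched in Python as follows. The names Operator, Graph, split_operator and compile_for_multi_die are assumptions introduced here (this application does not define an API), and the splitting strategy is simplified to an even split of the leading axis:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operator:
    name: str
    shape: tuple                      # output tensor shape, e.g. (M, N)

@dataclass
class Graph:
    operators: List[Operator] = field(default_factory=list)

def split_operator(op: Operator, num_dies: int) -> List[Operator]:
    """Placeholder split: divide the first (batch-like) axis evenly across dies."""
    m = op.shape[0] // num_dies
    return [Operator(f"{op.name}_shard{d}", (m,) + op.shape[1:])
            for d in range(num_dies)]

def compile_for_multi_die(first_graph: Graph, num_dies: int) -> List[Graph]:
    """Split every first operator, then assemble one second graph per die."""
    second_graphs = [Graph() for _ in range(num_dies)]
    for op in first_graph.operators:                         # the M first operators
        for die_id, shard in enumerate(split_operator(op, num_dies)):
            second_graphs[die_id].operators.append(shard)
    return second_graphs                                     # one second graph per die

# Example: a two-operator graph compiled for a 4-die package.
g = Graph([Operator("matmul_0", (1024, 512)), Operator("relu_0", (1024, 512))])
print(len(compile_for_multi_die(g, 4)))                      # -> 4
```

In an actual graph compiler, split_operator would be driven by the per-operator optimal split axis described below rather than a fixed leading-axis rule.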
在一种可能的实施方式中,所述对所述M个第一算子分别进行切分,得到所述M个第一算子的切分结果,包括:确定所述M个第一算子中的第i个第一算子的最优切分轴;基于所述第i个第一算子的最优切分轴,对所述第i个第一算子进行切分,获得所述第i个第一算子的切分结果;i为大于或者等于1,且小于或者等于M的整数。In a possible implementation, splitting the M first operators respectively to obtain the splitting results of the M first operators includes: determining the optimal splitting axis of the i-th first operator among the M first operators; splitting the i-th first operator based on the optimal splitting axis of the i-th first operator to obtain the splitting result of the i-th first operator; i is an integer greater than or equal to 1 and less than or equal to M.
在本申请实施例中,原始计算图中的每个算子(例如第一算子)可以对应有多个切分轴。基于此,本申请实施例针对每个第一算子,首先需要从其对应的多个切分轴中确定一个最优切分轴,然后再基于该最优切分轴对该第一算子进行切分,从而确保每个第一算子被切分后得到的多个算子在多个裸片上的计算效率。 In the embodiment of the present application, each operator (e.g., the first operator) in the original computation graph may correspond to multiple split axes. Based on this, the embodiment of the present application first needs to determine an optimal split axis from the multiple split axes corresponding to each first operator, and then split the first operator based on the optimal split axis, thereby ensuring the computational efficiency of the multiple operators obtained after each first operator is split on multiple dies.
在一种可能的实施方式中,所述第i个第一算子包括K个切分轴;所述确定所述M个第一算子中的第i个第一算子的最优切分轴,包括:确定所述第i个第一算子包括的所述K个切分轴各自对应的计算收益和通信耗时;基于所述K个切分轴各自对应的所述计算收益和所述通信耗时的差值,确定所述K个切分轴各自对应的切分收益;其中,所述切分收益最大的切分轴为所述第i个第一算子的最优切分轴;K为大于或者等于1的整数。In a possible implementation, the i-th first operator includes K segmentation axes; determining the optimal segmentation axis of the i-th first operator among the M first operators includes: determining the computational benefits and communication time corresponding to each of the K segmentation axes included in the i-th first operator; determining the segmentation benefits corresponding to each of the K segmentation axes based on the difference between the computational benefits and the communication time corresponding to each of the K segmentation axes; wherein the segmentation axis with the largest segmentation benefit is the optimal segmentation axis of the i-th first operator; K is an integer greater than or equal to 1.
在本申请实施例中,可以基于当前算子中每个切分轴对应的计算收益(即计算耗时的减少量)和通信耗时,计算出每个切分轴实际可以带来的切分收益。然后,从中确定收益最大的切分轴作为当前算子的最优切分轴,以保证每个算子切分后分配到多个裸片上执行时的计算效率。此外,在一些可能的实施例中,K也可以等于0,即一个算子可以没有切分轴。显然,在算子没有切分轴的情况下,该算子将无法进行切分处理,此时可以将该算子分别部署到N个裸片上,以使得N个裸片进行相同的计算。相应的,若该算子的前置算子已经做了切分处理,那么N个裸片在执行该算子时需要通过跨裸片的通信获取计算需要的数据。In an embodiment of the present application, the actual slicing benefit that each slicing axis can bring can be calculated based on the computational benefit (i.e., the reduction in computational time) and communication time corresponding to each slicing axis in the current operator. Then, the slicing axis with the largest benefit is determined as the optimal slicing axis of the current operator to ensure the computational efficiency of each operator when it is distributed to multiple chips for execution after slicing. In addition, in some possible embodiments, K may also be equal to 0, that is, an operator may not have a slicing axis. Obviously, in the case where the operator has no slicing axis, the operator will not be able to perform slicing processing. At this time, the operator can be deployed on N chips respectively so that the N chips perform the same calculation. Correspondingly, if the preceding operator of the operator has been slicing, then the N chips need to obtain the data required for calculation through cross-chip communication when executing the operator.
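A minimal, hedged sketch of this selection step is given below; the computation gain and communication time terms are elaborated in the following paragraphs, and the numbers in the example are invented for illustration:

```python
def compute_gain(t_single: float, num_dies: int) -> float:
    """Ideal reduction in compute time when one operator's work is spread over N dies."""
    return t_single - t_single / num_dies

def pick_optimal_axis(axes, t_single, num_dies, comm_cost):
    """axes: the K candidate split axes; comm_cost: axis name -> cross-die transfer time."""
    best_axis, best_benefit = None, float("-inf")
    for axis in axes:                              # K may also be 0 (operator cannot be split)
        benefit = compute_gain(t_single, num_dies) - comm_cost[axis]
        if benefit > best_benefit:
            best_axis, best_benefit = axis, benefit
    return best_axis, best_benefit

# Example: 8 ms on one die, 4 dies; splitting on M needs no cross-die data.
print(pick_optimal_axis(["M", "K", "N"], 8.0, 4, {"M": 0.0, "K": 3.0, "N": 1.5}))
# -> ('M', 6.0)
```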
在一种可能的实施方式中,所述第i个第一算子的第j个切分轴对应的计算收益为第一计算耗时与第二计算耗时的差值;所述第一计算耗时为单个裸片执行所述第i个第一算子所需的时间,所述第二计算耗时为多个裸片并行执行被所述第j个切分轴切分后的所述第i个第一算子所需的时间;j为大于或者等于1,且小于或者等于K的整数。In one possible implementation, the computational benefit corresponding to the j-th splitting axis of the i-th first operator is the difference between the first computational time and the second computational time; the first computational time is the time required for a single die to execute the i-th first operator, and the second computational time is the time required for multiple die to execute in parallel the i-th first operator after being split by the j-th splitting axis; j is an integer greater than or equal to 1 and less than or equal to K.
在本申请实施例中,可以将一个完整数据的计算(例如第一算子)被当前切分轴切分后分布到多个裸片上后计算耗时的减少量作为当前切分轴对应的计算收益,从而为后续最优切分轴的确定提供支持,保证算子切分后的计算效率。例如,第i个第一算子在单个裸片上的计算耗时为T,第i个第一算子被其第j个切分轴切分后分布到4个裸片上计算时的计算耗时理论上可以为T/4,那么该第j个切分轴的计算收益为(T-T/4)。In the embodiment of the present application, the reduction in the computation time of a complete data calculation (e.g., the first operator) after being split by the current split axis and distributed to multiple dies can be used as the computation benefit corresponding to the current split axis, thereby providing support for the subsequent determination of the optimal split axis and ensuring the computation efficiency after the operator is split. For example, the computation time of the i-th first operator on a single die is T, and the computation time of the i-th first operator after being split by its j-th split axis and distributed to 4 dies can theoretically be T/4, then the computation benefit of the j-th split axis is (T-T/4).
在一种可能的实施方式中,所述第i个第一算子的第j个切分轴对应的通信耗时为第p个裸片从其他裸片上获取目标数据所需的时间;所述第p个裸片为所述第i个第一算子被所述第j个切分轴切分后对应分配的多个裸片中的一个;所述目标数据为所述第p个裸片执行被所述第j个切分轴切分后的所述第i个第一算子时所需的数据;所述通信耗时与所述目标数据的数量以及所述目标数据的内存排布相关;p为大于或者等于1,且小于或者等于N的整数。In a possible implementation, the communication time corresponding to the j-th splitting axis of the i-th first operator is the time required for the p-th die to obtain target data from other die; the p-th die is one of the multiple die correspondingly allocated after the i-th first operator is split by the j-th splitting axis; the target data is the data required for the p-th die to execute the i-th first operator after being split by the j-th splitting axis; the communication time is related to the amount of the target data and the memory layout of the target data; p is an integer greater than or equal to 1 and less than or equal to N.
在本申请实施例中,可以基于被切分轴切分后的算子在对应裸片上计算时,需要从其他裸片上获取计算所需数据所消耗的时间,确定每个切分轴对应的通信耗时。显然,切分轴对应的通信耗时越小或者为零,能够带来更大的切分收益。如此,本申请可以基于通信耗时这一参数,尽可能让最优切分轴切分后分配到每个裸片上的计算仅需要访问本裸片中的存储,避免了跨裸片访问数据带来的延时。基于此,即使在较低的Die间带宽下,本申请仍然可以达到较高的计算性能,从而可以将多裸片封装芯片的面积尽可能应用于计算,提升多裸片封装芯片的算力密度。需要说明的是,除了上一个算子没有切分轴(即上一个算子没有进行切分处理)的情况外,一般情况下,若当前算子的当前切分轴与上一个算子的最优切分轴不同,那么裸片在执行被切分后的当前算子时往往需要跨裸片通信以从其他裸片中获取计算数据。In an embodiment of the present application, the communication time corresponding to each split axis can be determined based on the time consumed by the operator after being split by the split axis to obtain the data required for calculation from other bare chips when calculating on the corresponding bare chip. Obviously, the smaller or zero the communication time corresponding to the split axis is, the greater the splitting benefit can be brought. In this way, the present application can make the calculation allocated to each bare chip after the optimal split axis is split based on the parameter of communication time, so that the calculation only needs to access the storage in the bare chip as much as possible, avoiding the delay caused by accessing data across bare chips. Based on this, even at a lower inter-die bandwidth, the present application can still achieve higher computing performance, so that the area of the multi-bare chip package chip can be applied to the calculation as much as possible, and the computing power density of the multi-bare chip package chip can be improved. It should be noted that, except for the case where the previous operator has no split axis (that is, the previous operator has not been split), in general, if the current split axis of the current operator is different from the optimal split axis of the previous operator, then the bare chip often needs to communicate across bare chips to obtain computing data from other bare chips when executing the current operator after being split.
在一种可能的实施方式中,所述第i个第一算子的切分结果包括:所述第i个第一算子的输出张量列表,所述第i个第一算子对应的一个或多个输入张量和输出张量的原始形状,所述第i个第一算子对应的一个或多个输入张量和/或输出张量被所述最优切分轴切分后的形状,以及所述第i个第一算子被所述最优切分轴切分后对应分配的一个或多个裸片。In one possible implementation, the segmentation result of the i-th first operator includes: a list of output tensors of the i-th first operator, the original shapes of one or more input tensors and output tensors corresponding to the i-th first operator, the shapes of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis, and one or more bare chips allocated to the i-th first operator after being segmented by the optimal segmentation axis.
在本申请实施例中,每个算子被切分后可以包括一系列切分结果(例如算子的输出张量列表、算子切分前的输入/输出张量形状,算子切分后的输入/输出张量形状,以及算子切分后分配给哪些裸片,等等),这些切分结果可以为后续对原始计算图(例如第一计算图)进行图切分提供有效支撑,从而快速、高效地构造出与多个裸片一一对应的多个第二计算图,提升多裸片的计算效率。In an embodiment of the present application, each operator may include a series of split results after being split (such as a list of output tensors of the operator, the input/output tensor shapes before the operator is split, the input/output tensor shapes after the operator is split, and which bare chips are assigned to after the operator is split, etc.). These split results can provide effective support for subsequent graph splitting of the original computational graph (such as the first computational graph), thereby quickly and efficiently constructing multiple second computational graphs corresponding to multiple bare chips one by one, thereby improving the computing efficiency of multiple bare chips.
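One possible, purely illustrative encoding of such a segmentation result is shown below; the field names are assumptions rather than a data structure defined by this application:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SplitResult:
    output_tensors: List[str]                      # output tensor list of the first operator
    original_shapes: Dict[str, Tuple[int, ...]]    # tensor name -> shape before splitting
    split_shapes: Dict[str, Tuple[int, ...]]       # tensor name -> shape after splitting
    assigned_dies: List[int]                       # dies the split operator is assigned to

# Example for the MatMul of Fig. 3a, split on the M axis across two dies.
result = SplitResult(
    output_tensors=["C"],
    original_shapes={"A": (1024, 256), "B": (256, 512), "C": (1024, 512)},
    split_shapes={"A": (512, 256), "C": (512, 512)},   # B is untouched by an M-axis split
    assigned_dies=[0, 1],
)
```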
在一种可能的实施方式中,所述N个第二计算图中的第p个第二计算图包括多个第二算子;所述多个第二算子包括被所述最优切分轴切分后的所述第i个第一算子;所述第p个第二计算图为分配到第p个裸片上执行的第二计算图。In one possible implementation, the pth second computation graph among the N second computation graphs includes multiple second operators; the multiple second operators include the i-th first operator after being split by the optimal splitting axis; and the p-th second computation graph is a second computation graph allocated to be executed on the p-th die.
在本申请实施例中,基于算子切分和图切分后得到的每个第二计算图中均可以包括被切分后的第一算子。如此,实现了将每一个完整的计算切分然后分布到多个裸片上,每个裸片只需执行原本第一算子中部分数据的计算,有效提升了多裸片封装芯片的计算效率。In the embodiment of the present application, each second computation graph obtained after operator segmentation and graph segmentation can include the segmented first operator. In this way, each complete computation is segmented and then distributed to multiple dies, and each die only needs to perform the computation of part of the data in the original first operator, effectively improving the computation efficiency of the multi-die packaged chip.
在一种可能的实施方式中,所述第p个第二计算图中的所述多个第二算子还包括切分算子、通信算子 和归约算子中的一个或多个;其中,所述切分算子,用于获取被所述最优切分轴切分后的所述第i个第一算子的输入张量;所述通信算子,用于在所述第i个第一算子的所述最优切分轴与第i-1个第一算子的所述最优切分轴不同时,从其他裸片上获取被所述最优切分轴切分后的所述第i个第一算子的输入张量;所述归约算子,用于在所述第i个第一算子的最优切分轴为归约轴时,将对应的多个裸片上的数据进行归约。In a possible implementation, the plurality of second operators in the pth second computation graph further include a segmentation operator, a communication operator and one or more of the reduction operators; wherein the slicing operator is used to obtain the input tensor of the i-th first operator after being sliced by the optimal slicing axis; the communication operator is used to obtain the input tensor of the i-th first operator after being sliced by the optimal slicing axis from other bare chips when the optimal slicing axis of the i-th first operator is different from the optimal slicing axis of the i-1-th first operator; the reduction operator is used to reduce the data on the corresponding multiple bare chips when the optimal slicing axis of the i-th first operator is the reduction axis.
在本申请实施例中,除了被切分后的第一算子,还需要基于不同的实际情况在第二计算图中构造其他相应的算子,例如用于跨裸片获取计算数据的通信算子等,从而保证被切分后的第一算子在裸片上的可靠执行。In an embodiment of the present application, in addition to the first operator after segmentation, other corresponding operators need to be constructed in the second computational graph based on different actual situations, such as a communication operator for obtaining computational data across bare chips, etc., so as to ensure the reliable execution of the first operator after segmentation on the bare chip.
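A hedged sketch of the rule set implied above follows, with placeholder operator names (SLICE, COMMUNICATE, REDUCE) that are assumptions for illustration only:

```python
def auxiliary_ops(prev_axis, cur_axis, cur_axis_is_reduction):
    """Return the helper operators a per-die second graph may need around one split operator."""
    ops = ["SLICE"]                          # take this die's shard of the input tensor
    if prev_axis is not None and prev_axis != cur_axis:
        ops.append("COMMUNICATE")            # fetch missing input shards from other dies
    if cur_axis_is_reduction:
        ops.append("REDUCE")                 # combine the partial results across dies
    return ops

# Previous operator was split on M, this one on the reduction axis K.
print(auxiliary_ops(prev_axis="M", cur_axis="K", cur_axis_is_reduction=True))
# -> ['SLICE', 'COMMUNICATE', 'REDUCE']
```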
第二方面,本申请实施例提供了一种电子设备,所述电子设备包括N个裸片,N为大于或者等于1的整数。所述电子设备用于实现上述第一方面提供的任意一种基于多裸片的计算方法中相应的功能。In a second aspect, an embodiment of the present application provides an electronic device, the electronic device comprising N bare chips, where N is an integer greater than or equal to 1. The electronic device is used to implement the corresponding functions of any one of the multi-bare-chip-based computing methods provided in the first aspect above.
第三方面,本申请实施例提供一种电子设备,该电子设备中包括处理器,处理器被配置为支持该电子设备执行第一方面提供的任意一种方法中相应的功能。该电子设备还可以包括存储器,存储器用于与处理器耦合,其保存该电子设备必要的程序指令和数据。该电子设备还可以包括通信接口,用于该电子设备与其他设备或通信网络通信。In a third aspect, an embodiment of the present application provides an electronic device, the electronic device includes a processor, and the processor is configured to support the electronic device to perform the corresponding functions in any one of the methods provided in the first aspect. The electronic device may also include a memory, the memory is used to couple with the processor, and the memory stores the necessary program instructions and data of the electronic device. The electronic device may also include a communication interface for the electronic device to communicate with other devices or a communication network.
第四方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述第一方面提供的任意一种基于多裸片的计算方法流程。In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements any one of the multi-die-based computing method processes provided in the first aspect.
第五方面,本申请实施例提供了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述第一方面提供的任意一种基于多裸片的计算方法流程。In a fifth aspect, an embodiment of the present application provides a computer program, which includes instructions. When the computer program is executed by a computer, the computer can execute any one of the multi-die-based computing method processes provided in the first aspect above.
第六方面,本申请实施例提供了一种芯片,该芯片包括处理器和通信接口,所述处理器用于从该通信接口调用并运行指令,当该处理器执行所述指令时,使得该芯片执行上述第一方面提供的任意一种基于多裸片的计算方法流程。In the sixth aspect, an embodiment of the present application provides a chip, which includes a processor and a communication interface, and the processor is used to call and run instructions from the communication interface. When the processor executes the instructions, the chip executes any one of the multi-bare-die based computing method processes provided in the first aspect above.
第七方面,本申请实施例提供了一种芯片系统,该芯片系统包括上述第二方面或第三方面中任意一项所述的电子设备,用于实现上述第一方面提供的任意一种基于多裸片的计算方法流程所涉及的功能。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存基于多裸片的计算方法必要的程序指令和数据。该芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。In a seventh aspect, an embodiment of the present application provides a chip system, which includes the electronic device described in any one of the second aspect or the third aspect, and is used to implement the functions involved in any one of the multi-die-based computing method processes provided in the first aspect. In a possible design, the chip system also includes a memory, which is used to store program instructions and data necessary for the multi-die-based computing method. The chip system can be composed of chips, or it can include chips and other discrete devices.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1是本申请实施例提供的一种基于多Die芯片的系统架构示意图。FIG1 is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application.
图2是本申请实施例提供的一种图编译器的结构示意图。FIG2 is a schematic diagram of the structure of a graph compiler provided in an embodiment of the present application.
图3a是本申请实施例提供的一种算子切分的示意图。FIG. 3a is a schematic diagram of operator segmentation provided in an embodiment of the present application.
图3b是本申请实施例提供的另一种算子切分的示意图。FIG3 b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
图4是本申请实施例提供的一种基于多裸片的计算方法的流程示意图。FIG4 is a flow chart of a multi-die-based computing method provided in an embodiment of the present application.
图5是本申请实施例提供的一种算子切分方法的流程示意图。FIG5 is a flow chart of an operator segmentation method provided in an embodiment of the present application.
图6为本申请实施例提供的一种切分收益的计算示意图。FIG6 is a schematic diagram of calculating a split profit provided in an embodiment of the present application.
图7是本申请实施例提供的一种计算耗时的计算方法示意图。FIG. 7 is a schematic diagram of a method for calculating time consumption provided in an embodiment of the present application.
图8是本申请实施例提供的一种通信耗时的计算方法流程示意图。FIG8 is a flow chart of a method for calculating communication time consumption provided in an embodiment of the present application.
图9是本申请实施例提供的一种前后算子的切分轴相同的示意图。FIG. 9 is a schematic diagram showing that the segmentation axes of the front and rear operators are the same, provided by an embodiment of the present application.
图10是本申请实施例提供的一种图切分的方法流程示意图。FIG. 10 is a flow chart of a graph segmentation method provided in an embodiment of the present application.
图11是本申请实施例提供的一种电子设备的结构示意图。FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例进行描述。The embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application.
本申请的说明书和权利要求书及所述附图中的术语“第一”和“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。需要说明的是,当一个元件被称作与另一个或多个元件“耦合”、“连接”时,它可以是一个元件直接连接到另一个或多个元件,也可以是间接连接至该另一个或多个元件。The terms "first" and "second" in the specification and claims of the present application and the drawings are used to distinguish different objects, rather than to describe a specific order. In addition, the terms "including" and "having" and any variation thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to these processes, methods, products or devices. It should be noted that when an element is referred to as being "coupled" or "connected" to another or more elements, it can be an element directly connected to another or more elements, or it can be indirectly connected to the other or more elements.
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”, 用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。It should be understood that in this application, "at least one (item)" means one or more, and "more than one" means two or more. "and/or", Used to describe the association relationship of associated objects, indicating that there can be three relationships. For example, "A and/or B" can mean: only A exists, only B exists, and A and B exist at the same time, where A and B can be singular or plural. The character "/" generally indicates that the previous and next associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can be single or plural.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本邻域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference to "embodiments" herein means that a particular feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various locations in the specification does not necessarily refer to the same embodiment, nor is it an independent or alternative embodiment that is mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在处理器上运行的应用和处理器都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。The terms "component", "module", "system", etc. used in this specification are used to represent computer-related entities, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to, a process, a processor, an object, an executable file, an execution thread, a program and/or a computer running on a processor. By way of illustration, both the application running on the processor and the processor can be a component. One or more components may reside in a process and/or an execution thread, and the component may be located on a computer and/or distributed between two or more computers. In addition, these components may be executed from various computer-readable media having various data structures stored thereon. Components may, for example, communicate through local and/or remote processes according to signals having one or more data packets (e.g., data from two components interacting with another component between a local system, a distributed system and/or a network, such as the Internet interacting with other systems through signals).
首先,对本申请中的部分用语进行解释说明,以便于本邻域技术人员理解。First, some of the terms in this application are explained to facilitate understanding by technicians in this field.
(1)裸片(Die),指的是芯片未封装前的晶粒,是从硅晶元(Wafer)上用激光切割而成的小片,每一个Die就是一个独立的功能芯片。Die后续将会被作为一个单位封装起来成为常见的芯片。为了满足现如今人工智能芯片的算力需求,业内提出了将多个Die封装在一个芯片的技术方案,从而提供更大的算力。(1) Die refers to the crystal grain of the chip before it is packaged. It is a small piece cut from a silicon wafer by laser. Each die is an independent functional chip. Dies will be packaged as a unit to become common chips. In order to meet the computing power requirements of today's artificial intelligence chips, the industry has proposed a technical solution to package multiple dies in one chip to provide greater computing power.
(2)算子,是一个函数空间到函数空间上的映射,广义上的算子可以推广到任何空间,如内积空间等。广义的讲,对任何函数进行某一项操作都可以认为是一个算子,例如矩阵相乘(matmul),甚至包括求幂次,开方都可以认为是一个算子。(2) An operator is a mapping from a function space to another function space. In a broad sense, operators can be extended to any space, such as inner product space. In a broad sense, any operation on any function can be considered an operator, such as matrix multiplication (matmul), even exponentiation and square root can be considered an operator.
(3)有向无环图(direct acyclic graph,DAG),即计算图。在图论中,如果一个有向图无法从某个节点出发经过若干条边回到该节点,则这个图是一个有向无环图。(3) Directed acyclic graph (DAG), also known as computational graph. In graph theory, if a directed graph cannot return to a node through several edges from a certain node, then the graph is a directed acyclic graph.
为了便于理解本申请实施例,下面将进一步分析并提出本申请所具体要解决的技术问题。如上所述,在采用多Die封装技术时,通常会采用UNMA架构。但是,UNMA架构通常会带来跨Die访问数据的问题,导致访问时延大大增加,从而降低整体的计算效率。在现有技术中,关于在UNMA架构下提升多Die封装芯片的计算效率的技术,包括多种方案。以下示例性的列举如下常用的多芯片模块-图形处理器(multi-chip module-graphics processing unit,MCM-GPU)。In order to facilitate the understanding of the embodiments of the present application, the technical problems that the present application specifically aims to solve will be further analyzed and proposed below. As mentioned above, when multi-die packaging technology is adopted, the UNMA architecture is usually adopted. However, the UNMA architecture usually brings problems of accessing data across dies, resulting in a significant increase in access latency, thereby reducing the overall computing efficiency. In the prior art, there are a variety of technologies for improving the computing efficiency of multi-die packaged chips under the UNMA architecture. The following exemplarily lists the following commonly used multi-chip module-graphics processing unit (MCM-GPU).
MCM-GPU可以包括以下几个技术点:MCM-GPU can include the following technical points:
(1)从二级缓存(L2Cahce)中划出一部分空间作为L1.5Cache,专门负责对远程(remote)Die的数据缓存,提升远程Die的数据访问性能。(1) A portion of the space in the L2 cache (L2Cahce) is allocated as L1.5 Cache, which is specifically responsible for caching data from remote Dies and improving the data access performance of remote Dies.
(2)分布式批处理(distributed and batched)计算线程数组(compute thread array,CTA),对CTA进行分组调度,同组CTA调度到同一个GPU模块(GPU Modules,GPM),提升cache命中率。(2) Distributed and batched compute thread array (CTA), group scheduling of CTAs, scheduling CTAs in the same group to the same GPU module (GPU Modules, GPM) to improve cache hit rate.
(3)First-Touch Mapping,页表首次被访问时,将其映射到访问它的GPM所在的物理内存。(3) First-Touch Mapping: When a page table is accessed for the first time, it is mapped to the physical memory where the GPM that accesses it is located.
相应的,MCM-GPU主要存在以下缺点:Accordingly, MCM-GPU has the following main disadvantages:
(1)降低了L2Cache的规格,对各GPM的cache命中率会产生影响。(1) The specifications of L2Cache are reduced, which will affect the cache hit rate of each GPM.
(2)MCM-GPU对Die间带宽要求较高,Die间带宽会对计算性能产生极大的影响。但是,由于其扩展性不好,随着Die数量的继续增加,很难实现Die间带宽的同步扩展。最终使得MCM-GPU的计算效率也难以有效提升。(2) MCM-GPU has high requirements for inter-die bandwidth, which has a great impact on computing performance. However, due to its poor scalability, it is difficult to achieve synchronous expansion of inter-die bandwidth as the number of dies continues to increase. Ultimately, it is difficult to effectively improve the computing efficiency of MCM-GPU.
因此,为了解决当前多Die芯片计算效率低的问题,本申请实际要解决的技术问题包括如下方面:基于现有的硬件设备和NUMA架构,通过对计算图的编译,将原本计算图中的每一个完整的算子切分成多个,并将其分布到对应的多个Die上进行计算,以充分利用每一个Die的计算资源,大大提升了多裸片封装芯片的计算效率。与此同时,通过最优切分轴的选择,使得算子切分后每个Die上分配到的计算所需的数据基本在该Die的本地存储中,尽可能的避免跨Die访问数据。如此,本申请实施例可以实现即使在较 低的Die间互连带宽下,可以有效提升整个多Die芯片的计算效率。Therefore, in order to solve the problem of low computing efficiency of current multi-Die chips, the technical problems that this application actually aims to solve include the following aspects: Based on the existing hardware devices and NUMA architecture, by compiling the computational graph, each complete operator in the original computational graph is split into multiple ones, and distributed to the corresponding multiple Dies for calculation, so as to make full use of the computing resources of each Die, greatly improving the computing efficiency of multi-die packaged chips. At the same time, by selecting the optimal splitting axis, the data required for the calculation allocated to each Die after the operator is split is basically in the local storage of the Die, avoiding cross-Die data access as much as possible. In this way, the embodiments of the present application can achieve even in a relatively large Under low inter-die interconnection bandwidth, the computing efficiency of the entire multi-die chip can be effectively improved.
请参阅图1,图1是本申请实施例提供的一种基于多Die芯片的系统架构示意图。如图1所示,本申请实施例中主要可以应用于AI模型的训练和推理场景。AI开发人员可以使用TensorFow、Pytorch、MindSpore等AI框架开发训练/推理脚本,然后通过AI框架触发训练/推理的执行。如图1所示,该系统架构可以包括AI框架101、图编译器102、内存管理103、操作编译器104和芯片105。其中,芯片105为多Die芯片,可以包括图1所示的Die 0、Die 1、Die 2、Die 3等多个裸片。Please refer to Figure 1, which is a schematic diagram of a system architecture based on a multi-Die chip provided in an embodiment of the present application. As shown in Figure 1, the embodiment of the present application can be mainly applied to the training and reasoning scenarios of AI models. AI developers can use AI frameworks such as TensorFow, Pytorch, MindSpore, etc. to develop training/reasoning scripts, and then trigger the execution of training/reasoning through the AI framework. As shown in Figure 1, the system architecture may include an AI framework 101, a graph compiler 102, a memory management 103, an operation compiler 104, and a chip 105. Among them, chip 105 is a multi-Die chip, which may include multiple bare chips such as Die 0, Die 1, Die 2, Die 3, etc. shown in Figure 1.
AI框架101,用于将用户的计算构造为DAG图。然后,AI框架101将DAG图发送到图编译器102中进行编译。应理解,在DAG图中通过节点表达计算,通过节点与节点之间的边表达计算间传递的数据或计算间的依赖关系,DAG图中的每个节点即为一个算子。The AI framework 101 is used to construct the user's calculation into a DAG graph. Then, the AI framework 101 sends the DAG graph to the graph compiler 102 for compilation. It should be understood that in the DAG graph, the calculation is expressed by nodes, and the data transferred between calculations or the dependency between calculations are expressed by the edges between nodes. Each node in the DAG graph is an operator.
图编译器102,用于以一定的拓扑序对DAG图中的算子进行编译,最终将DAG图上的所有节点编译为任务(Task)列表下发给内存管理(runtime)103。任务列表中可以包括多个计算任务。The graph compiler 102 is used to compile the operators in the DAG graph in a certain topological order, and finally compile all nodes on the DAG graph into a task list and send it to the memory management (runtime) 103. The task list may include multiple computing tasks.
内存管理103,用于将任务列表下发到芯片105内的所有Die上。The memory management 103 is used to send the task list to all the dies in the chip 105 .
芯片105,用于通过其中的多个Die执行下发的任务列表。Chip 105 is used to execute the sent task list through multiple Dies therein.
显然,在单Die封装场景下(即芯片105中只包括一个Die),内存管理103只需要简单的将任务列表下发到芯片105内唯一的Die。但在多DIE封装场景下,内存管理103需要将任务列表下发到芯片105内的多个Die上。在本申请的一些实施例中,图编译器102可以将DAG图中每个完整的算子进行切分,基于该切分结果,将原本的DAG图切分为多个子图,然后再通过内存管理103分配到对应的多个Die上执行,从而大大提升多Die芯片的计算效率。Obviously, in a single-die packaging scenario (i.e., chip 105 includes only one die), memory management 103 only needs to simply send the task list to the only die in chip 105. However, in a multi-die packaging scenario, memory management 103 needs to send the task list to multiple dies in chip 105. In some embodiments of the present application, graph compiler 102 can split each complete operator in the DAG graph, and based on the split result, split the original DAG graph into multiple subgraphs, which are then distributed to the corresponding multiple dies for execution through memory management 103, thereby greatly improving the computing efficiency of multi-die chips.
需要说明的是,本申请的落地产品可以是数据中心的训练/推理设备,部署在数据中心的训练/推理服务器上,或者其他任何可能的设备,例如路灯上的摄像头等具备AI功能(例如人脸识别等),基于DAG图进行计算的边缘设备,其中可以包括。本申请主要通过对DAG图的编译过程的改进,实现将计算编译部署到芯片内的多Die上。It should be noted that the landing product of this application can be a training/inference device in a data center, deployed on a training/inference server in a data center, or any other possible device, such as a camera on a street lamp, etc., with AI functions (such as face recognition, etc.), edge devices that perform calculations based on DAG graphs, which may include. This application mainly realizes the deployment of calculation compilation to multiple Dies in a chip by improving the compilation process of the DAG graph.
进一步地,请参阅图2,图2是本申请实施例提供的一种图编译器的结构示意图。如图2所示,该图编译器102中可以包括图排序单元21、算子切分单元22、图切分单元23、模型编译单元24、模型部署单元25。可选地,该图编译器102还可以包括环境信息库26和算子信息库27。如图2所示,AI框架101可以与图编译器102中的图排序单元21连接,模型部署单元25可以与内存管理103连接。Further, please refer to Figure 2, which is a structural diagram of a graph compiler provided in an embodiment of the present application. As shown in Figure 2, the graph compiler 102 may include a graph sorting unit 21, an operator segmentation unit 22, a graph segmentation unit 23, a model compilation unit 24, and a model deployment unit 25. Optionally, the graph compiler 102 may also include an environment information library 26 and an operator information library 27. As shown in Figure 2, the AI framework 101 can be connected to the graph sorting unit 21 in the graph compiler 102, and the model deployment unit 25 can be connected to the memory management 103.
图排序单元21,用于通过拓扑排序将DAG转换为算子列表,算子列表中包括多个算子。算子列表中算子的顺序表达了算子执行的先后顺序。The graph sorting unit 21 is used to convert the DAG into an operator list by topological sorting, wherein the operator list includes multiple operators. The order of the operators in the operator list expresses the order in which the operators are executed.
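For reference, a minimal topological sort (Kahn's algorithm) illustrates how a DAG can be flattened into an ordered operator list; this is a generic sketch, not the graph sorting unit's actual implementation:

```python
from collections import deque

def topo_sort(nodes, edges):
    """nodes: list of node names; edges: dict mapping a node to its successor nodes."""
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for succ in edges.get(n, []):
            indegree[succ] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for succ in edges.get(n, []):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return order            # an order shorter than nodes would indicate a cycle

print(topo_sort(["A", "B", "C", "D"], {"A": ["B", "C"], "B": ["D"], "C": ["D"]}))
# -> ['A', 'B', 'C', 'D']
```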
算子切分单元22,用于对DAG图中的多个算子分别进行算子切分的处理,生成每个算子的切分结果。具体地,如图2所示,算子切分单元22在进行算子切分时,可以通过环境信息库26以及算子信息库27分别读取相应的环境信息和算子信息。其中,环境信息库26中的环境信息可以包括:芯片内的裸片数量、各个裸片的存储器规格、裸片间的互连拓扑和带宽等。其中,算子信息库27中的算子信息可以包括:算子类型、输入/输出列表,每个输入/输出的轴信息。The operator segmentation unit 22 is used to perform operator segmentation processing on multiple operators in the DAG graph respectively, and generate segmentation results for each operator. Specifically, as shown in FIG2 , when the operator segmentation unit 22 performs operator segmentation, it can read the corresponding environment information and operator information through the environment information library 26 and the operator information library 27 respectively. Among them, the environment information in the environment information library 26 may include: the number of dies in the chip, the memory specifications of each die, the interconnection topology and bandwidth between the dies, etc. Among them, the operator information in the operator information library 27 may include: operator type, input/output list, and axis information of each input/output.
如上所述,相较于常规的环境信息库,本申请中的环境信息库26在环境信息中需要将原单芯片内单Die,修改为单芯片多Die,并增加对Die间互连网络的拓扑结构以及互连带宽的描述等。此外,本申请中的算子信息库27在算子信息中需要对每一个算子增加输入/输出的轴信息的描述。应理解,轴信息是算子的不同输入/输出张量(Tensor)间的切分关系的表达。同一个轴的所有输入/输出,必须采用相同的切分方式(即切分轴)进行切分。As described above, compared to the conventional environment information library, the environment information library 26 in the present application needs to modify the original single die in a single chip to multiple dies in a single chip in the environment information, and add a description of the topological structure of the interconnection network between dies and the interconnection bandwidth, etc. In addition, the operator information library 27 in the present application needs to add a description of the input/output axis information for each operator in the operator information. It should be understood that the axis information is an expression of the segmentation relationship between different input/output tensors (Tensor) of the operator. All inputs/outputs of the same axis must be segmented using the same segmentation method (i.e., segmentation axis).
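A possible shape of such environment information, written as an illustrative sketch: the field names and the ring-style interconnect in the example are assumptions, not values prescribed by this application:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class DieSpec:
    memory_gib: float              # capacity of this die's local memory
    memory_bandwidth_gbs: float    # local memory (e.g. HBM) bandwidth

@dataclass
class EnvInfo:
    num_dies: int
    dies: Dict[int, DieSpec]
    interconnect_gbs: Dict[Tuple[int, int], float]   # (die_a, die_b) -> link bandwidth

env = EnvInfo(
    num_dies=4,
    dies={d: DieSpec(16.0, 1200.0) for d in range(4)},
    interconnect_gbs={(0, 1): 200.0, (1, 2): 200.0, (2, 3): 200.0, (3, 0): 200.0},
)
```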
可选地,请参阅图3a,图3a是本申请实施例提供的一种算子切分的示意图。图3a中的算子A为矩阵相乘(MatMul)算子,该算子A包括两个输入矩阵和一个输出矩阵,其中,左输入矩阵为M×K的矩阵,右输入矩阵为K×N的矩阵,输出矩阵为M×N的矩阵。算子A包括三个切分轴,分别为M轴、K轴和N轴。如图3a所示,该算子A的左输入矩阵和输出矩阵被相同的切分轴(即M轴,如图3a中的虚线所指)切分,从而得到切分后的算子a1和算子a2,其中,算子a1和算子a2的左输入矩阵为(M/2)×K的矩阵,输出矩阵为(M/2)×N的矩阵,显然,算子a1和算子a2所需计算的数据仅为原本算子A的一半。如此,通过算子切分单元22,可以将一个完整数据计算的算子,切分为两个部分数据计算的算子。Optionally, please refer to Figure 3a, which is a schematic diagram of an operator segmentation provided in an embodiment of the present application. Operator A in Figure 3a is a matrix multiplication (MatMul) operator, and the operator A includes two input matrices and an output matrix, wherein the left input matrix is an M×K matrix, the right input matrix is a K×N matrix, and the output matrix is an M×N matrix. Operator A includes three segmentation axes, namely, the M axis, the K axis, and the N axis. As shown in Figure 3a, the left input matrix and the output matrix of the operator A are segmented by the same segmentation axis (i.e., the M axis, as indicated by the dotted line in Figure 3a), thereby obtaining the segmented operators a1 and a2, wherein the left input matrices of operators a1 and a2 are (M/2)×K matrices, and the output matrices are (M/2)×N matrices. Obviously, the data required to be calculated by operators a1 and a2 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
可选地,请参阅图3b,图3b是本申请实施例提供的另一种算子切分的示意图。如图3b所示,该算子A的右输入矩阵和输出矩阵被相同的切分轴(即N轴,如图3b中的虚线所指)切分,从而得到切分后的 算子a3和算子a4,其中,算子a3和算子a4的右输入矩阵为K×(N/2)的矩阵,输出矩阵为M×(N/2)的矩阵,显然,算子a3和算子a4所需计算的数据仅为原本算子A的一半。如此,通过算子切分单元22,可以将一个完整数据计算的算子,切分为两个部分数据计算的算子。Optionally, please refer to FIG. 3b, which is a schematic diagram of another operator segmentation provided in an embodiment of the present application. As shown in FIG. 3b, the right input matrix and the output matrix of the operator A are segmented by the same segmentation axis (i.e., the N axis, as indicated by the dotted line in FIG. 3b), thereby obtaining Operator a3 and operator a4, wherein the right input matrix of operator a3 and operator a4 is a K×(N/2) matrix, and the output matrix is an M×(N/2) matrix. Obviously, the data required to be calculated by operator a3 and operator a4 is only half of the original operator A. In this way, through the operator splitting unit 22, an operator for complete data calculation can be split into two operators for partial data calculation.
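The two splits of Fig. 3a and Fig. 3b can be checked numerically. The NumPy sketch below assumes small illustrative shapes and shows that the concatenated shard outputs equal the original M×N result:

```python
import numpy as np

M, K, N = 4, 6, 8
A = np.random.rand(M, K)
B = np.random.rand(K, N)
full = A @ B

# Fig. 3a: split the M axis -> each shard computes an (M/2) x N block of the output.
a1, a2 = A[: M // 2] @ B, A[M // 2 :] @ B
assert np.allclose(np.vstack([a1, a2]), full)

# Fig. 3b: split the N axis -> each shard computes an M x (N/2) block of the output.
a3, a4 = A @ B[:, : N // 2], A @ B[:, N // 2 :]
assert np.allclose(np.hstack([a3, a4]), full)
```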
图切分单元23,用于基于算子切分的切分结果,将切分后的算子(例如图3a所示的算子a1和算子a2)重新组织为在多个Die上执行的子图(sub DAG)的形式,每个子图中包括多个被切分后的算子。例如,分配到Die 0上计算的子图1中可以包括上述算子a1,分配到Die 1上计算的子图1中可以包括上述算子a2。应理解,算子切分单元22和图切分单元23为本申请在图编译器中新增的单元。The graph segmentation unit 23 is used to reorganize the segmented operators (such as operator a1 and operator a2 shown in FIG3a ) into a sub-graph (sub DAG) executed on multiple Dies based on the segmentation results of the operator segmentation, and each sub-graph includes multiple segmented operators. For example, the sub-graph 1 assigned to Die 0 for calculation may include the above operator a1, and the sub-graph 1 assigned to Die 1 for calculation may include the above operator a2. It should be understood that the operator segmentation unit 22 and the graph segmentation unit 23 are newly added units in the graph compiler of the present application.
模型编译单元24,用于将多个sub DAG编译为可以部署的模型列表(model list)。模型列表中包括多个模型,每个模型包括多个计算任务。应理解,相较于常规的模型编译,本申请中的模型编译单元24需要增加对于多个sub DAG的编译支持。The model compilation unit 24 is used to compile multiple sub DAGs into a model list that can be deployed. The model list includes multiple models, and each model includes multiple computing tasks. It should be understood that compared with conventional model compilation, the model compilation unit 24 in the present application needs to increase the compilation support for multiple sub DAGs.
模型部署单元25,用于将多个模型列表,部署到对应的多个Die上。最后,通过内存管理103将多个模型列表分配到各自对应Die上执行,充分利用芯片中每个Die的算力,有效提升多Die芯片的计算效率。The model deployment unit 25 is used to deploy multiple model lists to corresponding multiple Dies. Finally, the multiple model lists are allocated to their corresponding Dies for execution through the memory management 103, making full use of the computing power of each Die in the chip and effectively improving the computing efficiency of the multi-Die chip.
请参阅图4,图4是本申请实施例提供的一种基于多裸片的计算方法的流程示意图。该方法主要针对由N个裸片封装成的芯片,N为大于或者等于1的整数。该方法可以应用于图1所述的系统架构中,具体地,该方法可以应用于图2所示的图编译器102中,下面将结合图2所述的图编译器102对本申请实施例提供的方法进行详细阐述。如图4所示,该方法可以包括以下步骤S501-步骤S504。Please refer to Figure 4, which is a flow chart of a multi-die-based calculation method provided in an embodiment of the present application. The method is mainly aimed at a chip packaged by N dies, where N is an integer greater than or equal to 1. The method can be applied to the system architecture described in Figure 1. Specifically, the method can be applied to the graph compiler 102 shown in Figure 2. The method provided in an embodiment of the present application will be described in detail below in conjunction with the graph compiler 102 described in Figure 2. As shown in Figure 4, the method may include the following steps S501-S504.
步骤S501,获取第一计算图,第一计算图包括M个第一算子。Step S501: Obtain a first computation graph, where the first computation graph includes M first operators.
具体地,图编译器102获取第一计算图,该第一计算图包括M个第一算子。M为大于或者等于1的整数。应理解,该第一计算图为切分前的DAG图,相应的,该第一计算图中包括的M个第一算子均为切分前的算子,每个第一算子对应了一个完整数据的计算。Specifically, the graph compiler 102 obtains a first computation graph, which includes M first operators. M is an integer greater than or equal to 1. It should be understood that the first computation graph is a DAG graph before segmentation, and accordingly, the M first operators included in the first computation graph are all operators before segmentation, and each first operator corresponds to a calculation of a complete data.
可选地,图编译器102可以通过其中的图排序单元21获取该第一计算图,并通过拓扑排序将该第一计算图转换为算子列表,该算子列表中包括按序排列的上述M个第一算子。Optionally, the graph compiler 102 may obtain the first computation graph through the graph sorting unit 21 therein, and convert the first computation graph into an operator list through topological sorting, wherein the operator list includes the above-mentioned M first operators arranged in order.
步骤S502,对M个第一算子分别进行切分,得到M个第一算子的切分结果。Step S502, segmenting the M first operators respectively to obtain segmentation results of the M first operators.
具体地,图编译器102对该M个第一算子依次进行算子切分的处理,从而得到该M个第一算子各自的切分结果。Specifically, the graph compiler 102 performs operator segmentation processing on the M first operators in sequence, thereby obtaining segmentation results for each of the M first operators.
可选地,由于每个第一算子可能对应有多个切分轴,图编译器102可以先确定每个第一算子的最优切分轴,然后再基于每个第一算子的最优切分轴对第一算子进行切分,从而得到最理想的切分结果。可选地,该最优切分轴可以为带来切分收益最大的切分轴。示例性的,图编译器102可以先确定M个第一算子中的第i个第一算子的最优切分轴,然后基于第i个第一算子的最优切分轴,对第i个第一算子进行切分,获得第i个第一算子的切分结果。i为大于或者等于1,且小于或者等于M的整数。其中,该第i个第一算子的切分结果包括:第i个第一算子的输出张量列表,第i个第一算子对应的一个或多个输入张量和输出张量的原始形状(例如上述图3a所示的M×N),第i个第一算子对应的一个或多个输入张量和/或输出张量被最优切分轴切分后的形状(例如上述图3a所示的(M/2)×N),以及第i个第一算子被最优切分轴切分后对应分配的一个或多个裸片。Optionally, since each first operator may correspond to multiple segmentation axes, the graph compiler 102 may first determine the optimal segmentation axis of each first operator, and then segment the first operator based on the optimal segmentation axis of each first operator, so as to obtain the most ideal segmentation result. Optionally, the optimal segmentation axis may be the segmentation axis that brings the greatest segmentation benefit. Exemplarily, the graph compiler 102 may first determine the optimal segmentation axis of the i-th first operator among the M first operators, and then segment the i-th first operator based on the optimal segmentation axis of the i-th first operator to obtain the segmentation result of the i-th first operator. i is an integer greater than or equal to 1 and less than or equal to M. Among them, the segmentation result of the i-th first operator includes: the output tensor list of the i-th first operator, the original shape of one or more input tensors and output tensors corresponding to the i-th first operator (for example, M×N shown in Figure 3a above), the shape of one or more input tensors and/or output tensors corresponding to the i-th first operator after being segmented by the optimal segmentation axis (for example, (M/2)×N shown in Figure 3a above), and one or more bare chips allocated corresponding to the i-th first operator after being segmented by the optimal segmentation axis.
请参阅图5,图5是本申请实施例提供的一种算子切分方法的流程示意图。如图5所示,该方法包括如下步骤S11-步骤S17。Please refer to Figure 5, which is a schematic flow chart of an operator segmentation method provided in an embodiment of the present application. As shown in Figure 5, the method includes the following steps S11 to S17.
步骤S11,获取算子信息。Step S11, obtaining operator information.
具体地,图编译器102从算子信息库27中获取M个第一算子(即所有待切分算子)的算子信息。该算子信息主要包括每个第一算子的输入张量(Input Tensor)列表、输出张量(Output Tensor)列表和轴信息。其中,每个第一算子的轴信息可以包括:(1)轴类型,例如可以是Element-wise、归约(Reduction)、滑动窗(SlidingWindow)等类型。不同的轴类型表达算子不同的计算特点。(2)每个轴涉及的输入张量,以及该输入张量的维度(dimension)。(3)每个轴涉及的输出张量,以及该输出张量的dimension。Specifically, the graph compiler 102 obtains the operator information of M first operators (i.e., all operators to be split) from the operator information library 27. The operator information mainly includes the input tensor (Input Tensor) list, output tensor (Output Tensor) list and axis information of each first operator. Among them, the axis information of each first operator may include: (1) axis type, such as Element-wise, reduction (Reduction), sliding window (SlidingWindow) and other types. Different axis types express different computing characteristics of operators. (2) The input tensor involved in each axis, and the dimension (dimension) of the input tensor. (3) The output tensor involved in each axis, and the dimension of the output tensor.
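An illustrative encoding of this per-axis information for MatMul is given below; the structure, the axis-type labels and the dimension assignments are assumptions made for the example:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AxisInfo:
    axis_type: str                                   # e.g. "elementwise", "reduction", "sliding_window"
    input_dims: Dict[str, int] = field(default_factory=dict)    # input tensor name -> dimension index
    output_dims: Dict[str, int] = field(default_factory=dict)   # output tensor name -> dimension index

# MatMul C[M, N] = A[M, K] x B[K, N]; all tensors sharing an axis must be split the same way.
matmul_axes = {
    "M": AxisInfo("elementwise", input_dims={"A": 0}, output_dims={"C": 0}),
    "N": AxisInfo("elementwise", input_dims={"B": 1}, output_dims={"C": 1}),
    "K": AxisInfo("reduction",   input_dims={"A": 1, "B": 0}),  # reduced away, absent from C
}
```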
步骤S12,是否存在未计算的切分轴,若是,则执行步骤S13,若否,则执行步骤S17。Step S12, whether there is an uncalculated segmentation axis, if yes, execute step S13, if not, execute step S17.
具体地,图编译器102查看当前第一算子(例如第i个第一算子)是否还存在未计算切分收益的切分轴。若当前第i个第一算子还存在未计算的切分轴,则图编译器102可以接着选择下一个切分轴,计算切分收益;若针对当前第i个第一算子包括的所有切分轴(例如包括K个切分轴,K为大于或者等于1的整数)均已经完成切分收益的计算,则图编译器102可以将其中切分收益最大的切分轴记录为最优切分轴,并将该第i个第一算子在最优切分轴下的切分结果记录在张量切分表中。可选地,该张量切分表可以位于 数据库或者内存中。Specifically, the graph compiler 102 checks whether the current first operator (for example, the ith first operator) still has a slicing axis whose slicing benefit has not been calculated. If the current ith first operator still has a slicing axis that has not been calculated, the graph compiler 102 may then select the next slicing axis and calculate the slicing benefit; if the slicing benefit calculation has been completed for all slicing axes included in the current ith first operator (for example, including K slicing axes, K is an integer greater than or equal to 1), the graph compiler 102 may record the slicing axis with the largest slicing benefit as the optimal slicing axis, and record the slicing result of the ith first operator under the optimal slicing axis in the tensor slicing table. Optionally, the tensor slicing table may be located at Database or in memory.
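A compact sketch of the bookkeeping in steps S12 and S17, assuming the per-axis benefits have already been computed (the tensor split table is modelled here as a plain dictionary for illustration):

```python
def record_best_axes(per_operator_axis_benefits):
    """per_operator_axis_benefits: {op_name: {axis_name: split_benefit}} (precomputed)."""
    tensor_split_table = {}
    for op_name, benefits in per_operator_axis_benefits.items():
        # Step S17: once every candidate axis has a benefit, keep the largest one.
        tensor_split_table[op_name] = max(benefits, key=benefits.get) if benefits else None
    return tensor_split_table

print(record_best_axes({"matmul_0": {"M": 6.0, "K": 3.0, "N": 4.5}, "cast_0": {}}))
# -> {'matmul_0': 'M', 'cast_0': None}
```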
步骤S13,切分轴选择。Step S13, segmentation axis selection.
具体地,若当前第i个第一算子还存在未计算的切分轴,则图编译器102可以接着选择下一个切分轴(例如为K个切分轴中的第j个切分轴)进行切分收益的计算。例如,如上图3a所示,算子A包括三个切分轴,若此时N轴的切分收益已经计算完成,接下来图编译器102可以选自N轴或者K轴进行相应切分收益的计算。Specifically, if there are still uncalculated split axes in the current i-th first operator, the graph compiler 102 can then select the next split axis (for example, the j-th split axis among the K split axes) to calculate the split benefit. For example, as shown in FIG3a above, operator A includes three split axes. If the split benefit of axis N has been calculated at this time, the graph compiler 102 can then select axis N or axis K to calculate the corresponding split benefit.
Step S14: compute the split benefit.
Specifically, the graph compiler 102 computes the split benefit of the i-th first operator under the current split axis.
Optionally, please refer to Figure 6, which is a schematic diagram of split-benefit calculation provided in an embodiment of the present application. As shown in Figure 6, the split benefit depends on the computation benefit and the communication time (in other words, the communication loss). The split benefit corresponding to each split axis is the difference between the computation benefit and the communication time: the larger the computation benefit and the smaller the communication time, the larger the split benefit.
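As a toy numeric illustration of the relation shown in Figure 6 (every number below is invented for illustration only and does not come from the present application):

```python
# Split benefit = computation benefit - communication time (Figure 6); toy numbers only.
compute_time_single_die = 8.0e-3       # operator executed on one die: 8 ms (assumed)
compute_time_after_split = 2.2e-3      # split across 4 dies: slightly worse than 8/4 ms (assumed)
computation_benefit = compute_time_single_die - compute_time_after_split   # 5.8 ms saved
communication_time = 1.5e-3            # fetching remote input slices: 1.5 ms (assumed)
split_benefit = computation_benefit - communication_time                   # 4.3 ms net gain
```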
The computation benefit is the reduction in compute time brought by splitting the i-th first operator. Specifically, it is the difference between the time the i-th first operator takes when computed on a single die before splitting (for example, the first compute time) and the time it takes when, after being split along the current j-th split axis, it is computed in parallel across multiple dies (for example, the second compute time).
For example, if the compute time of the i-th first operator on a single die is T, and its compute time after being split along its j-th split axis and distributed to four dies can theoretically be T/4, then the computation benefit of the j-th split axis is (T - T/4). It should be noted that, in practice, the amount of data an operator computes and its compute time are not linearly related; the compute time on each die often depends on many factors. Optionally, please refer to Figure 7, which is a schematic diagram of a compute-time calculation method provided in an embodiment of the present application. As shown in Figure 7, the graph compiler 102 needs to take into account factors such as the chip type (which determines the core counts and clock frequencies of the various accelerated compute units, the sizes of the caches at each level, and so on), the input data type (DType), and the input data shape (Shape), and, based on a cost model (Cost Model), calculate the compute time each die needs to execute the i-th first operator after it has been split along the current j-th split axis (for example, operator a1 shown in Figure 3a above).
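A minimal sketch of such a computation-benefit estimate is given below; the cost_model callable stands in for the Cost Model of Figure 7, and the dictionary-based shape handling is an assumption made for illustration rather than the graph compiler's actual interface:

```python
def compute_benefit(cost_model, input_shapes, split_dim_of, num_dies):
    """Estimate the compute time saved by splitting an operator across `num_dies` dies.

    cost_model(shapes) -> estimated execution time for the given tensor shapes (assumed callable)
    input_shapes       -> tensor name -> shape tuple, e.g. {"left": (M, K), "right": (K, N)}
    split_dim_of       -> tensor name -> dimension cut by the candidate split axis (absent if untouched)
    """
    t_single = cost_model(input_shapes)                        # first compute time: one die, full shapes
    split_shapes = {
        name: tuple(size // num_dies if dim == split_dim_of.get(name) else size
                    for dim, size in enumerate(shape))
        for name, shape in input_shapes.items()                # e.g. M x K -> (M/2) x K for two dies
    }
    t_parallel = cost_model(split_shapes)                      # second compute time: one die's share
    return t_single - t_parallel
```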
The communication time is the time consumed, after the i-th first operator has been split, when a die executing the split operator must communicate with other dies to fetch computation data stored on those dies. For example, the communication time corresponding to the j-th split axis of the i-th first operator is the time the p-th die needs to obtain target data from other dies, where the p-th die is one of the multiple dies to which the i-th first operator is assigned after being split along the j-th split axis, and the target data is the data the p-th die needs to execute the split i-th first operator. Optionally, the communication time is generally related to the amount of target data (that is, the communication data volume) and the memory layout of the target data (that is, the memory layout of the communication data), and may also be related to factors such as the inter-die interconnect topology and the link bandwidth. p is an integer greater than or equal to 1 and less than or equal to N.
Please refer to Figure 8, which is a schematic flowchart of a communication-time calculation method provided in an embodiment of the present application. As shown in Figure 8, the method includes the following steps S21 to S25.
Step S21: read the communication topology and bandwidth.
Specifically, the graph compiler 102 reads, from the environment information library 26, the communication topology among the multiple dies within the chip and the bandwidth data of the communication links.
Step S22: read the split result of the predecessor operator.
Specifically, the graph compiler 102 reads the split result of the predecessor operator of the current i-th first operator from the tensor split table and determines the optimal split axis of the predecessor operator, that is, the optimal split axis of the (i-1)-th first operator. If the optimal split axis of the (i-1)-th first operator is the same as the j-th split axis of the current i-th first operator, no cross-die communication is needed, that is, the communication time is zero. For example, please refer to Figure 9, which is a schematic diagram, provided in an embodiment of the present application, of a predecessor and a successor operator sharing the same split axis. As shown in Figure 9, operator A is, for example, the (i-1)-th first operator, operator B is, for example, the i-th first operator, and operator A is the predecessor of operator B. As shown in Figure 9, the split axes of operator A and operator B are both the M axis; after operator A is split, the current die holds only (M/2)×N of the output data. When the current die executes the split operator B, it can therefore fetch the (M/2)×N input data directly from the current die for computation, and no cross-die communication is needed; otherwise, if the split axis of operator B were the N axis, the current die would need to fetch the other half of the M×(N/2) data from other dies, that is, cross-die communication would be required.
Step S23: calculate the communication data volume.
Specifically, if the optimal split axis of the (i-1)-th first operator differs from the j-th split axis of the current i-th first operator, the graph compiler 102 needs to calculate the communication time. In this case, the graph compiler 102 may first calculate the amount of communication data that must be accessed across dies. Optionally, when the current j-th split axis is of an axis type that requires cross-die exchange (such as Reduction or SlidingWindow), the graph compiler 102 also needs to calculate, based on the type and shape of the operator's input tensors, the amount of data that must be exchanged with other dies.
Step S24: calculate the memory layout of the communication data.
Specifically, to ensure the accuracy of the communication-time calculation, an embodiment of the present application may additionally calculate the memory layout of the communication data on top of the communication data volume. When the data that must be exchanged across dies after the i-th first operator is split along the j-th split axis occupies non-contiguous memory, multiple transfer tasks are often needed to complete the exchange, which adds extra task overhead and therefore a larger communication time. In addition, when the memory layout is very scattered and the data volume is small, many frequent small transfers are easily triggered, which likewise increases the communication time.
Step S25: calculate the communication time.
Specifically, the graph compiler 102 may make a first-order estimate of the communication time as the communication data volume divided by the inter-die bandwidth; furthermore, by combining the communication data volume with the memory-layout information of the communication data, the overall communication time can be calculated. Optionally, the graph compiler 102 may make a simple estimate of the communication time based on measured values for some typical packet lengths, or may use a more sophisticated cost model to calculate it, and so on; this is not specifically limited in the embodiments of the present application. In addition, the communication time is affected not only by the data volume mentioned above but also by other factors such as the size of the transmitted packets and the control signals exchanged during communication, which are likewise not specifically limited in the embodiments of the present application.
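The first-order estimate described in steps S21 to S25 might be sketched as follows; the parameter names and the simple per-transfer penalty are illustrative assumptions, and a real implementation could substitute a fuller cost model as noted above:

```python
def communication_time(data_volume_bytes, num_memory_chunks, inter_die_bandwidth,
                       per_transfer_overhead, same_axis_as_predecessor, axis_needs_exchange):
    """Rough cross-die communication cost for one candidate split axis (steps S21-S25)."""
    # S22: predecessor split along the same axis and no exchange-type axis -> data is already local.
    if same_axis_as_predecessor and not axis_needs_exchange:
        return 0.0
    # S24: a scattered, non-contiguous layout needs several transfer tasks, each with fixed overhead.
    task_overhead = num_memory_chunks * per_transfer_overhead
    # S23 + S25: first-order estimate = communication data volume / inter-die bandwidth + task overhead.
    return data_volume_bytes / inter_die_bandwidth + task_overhead
```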
Step S15: determine whether the current split axis is the optimal split axis.
Specifically, based on the calculations of the computation benefit and the communication time in the embodiments corresponding to Figures 7 and 8 above, the graph compiler 102 obtains the split benefit corresponding to the current j-th split axis. The graph compiler 102 then compares the split benefit of the current j-th split axis with the split benefits of the preceding j-1 split axes. If the split benefit of the current j-th split axis is greater than the split benefits of the preceding j-1 split axes, the graph compiler 102 can determine that the current j-th split axis is the optimal split axis and execute step S16; otherwise it executes step S12.
Step S16: record the optimal split axis.
Specifically, if the split benefit of the current j-th split axis is greater than the split benefits of the preceding j-1 split axes, the graph compiler 102 records the current j-th split axis as the optimal split axis of the current i-th first operator. It should be understood that if the split benefit of the subsequent (j+1)-th split axis turns out to be greater than that of the j-th split axis, the graph compiler 102 may update the optimal split axis, that is, record the (j+1)-th split axis as the optimal split axis of the current i-th first operator.
Step S17: record the operator split result.
Specifically, after the graph compiler 102 has computed the split benefits of all split axes of the current i-th first operator, the optimal split axis of the current i-th first operator can be determined. The graph compiler 102 then splits the current i-th first operator along this optimal split axis, obtains the split result of the i-th first operator, and records that split result in the tensor split table.
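Steps S12 to S17 for a single operator can be summarized as a small selection loop. The sketch below assumes an OperatorInfo-like object with an axes list (as in the earlier sketch) and a benefit_of callable that returns the split benefit of one axis; both names are illustrative:

```python
def choose_optimal_split_axis(op, benefit_of):
    """Evaluate every candidate split axis of `op` and return the one with the largest split benefit."""
    best_axis, best_benefit = None, float("-inf")
    for axis in op.axes:                          # S12/S13: pick the next split axis not yet computed
        benefit = benefit_of(axis)                # S14: computation benefit minus communication time
        if benefit > best_benefit:                # S15/S16: keep the best axis found so far
            best_axis, best_benefit = axis, benefit
    return best_axis, best_benefit                # S17: the caller records the split result in the tensor split table
```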
Optionally, all of the method flows in step S502 above may specifically be executed by the operator split unit 22 in the graph compiler 102.
Step S503: split the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs.
Specifically, the graph compiler 102 splits the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs. The N second computation graphs correspond one-to-one to the N dies in the chip. N is an integer greater than or equal to 1.
Please refer to Figure 10, which is a schematic flowchart of a graph splitting method provided in an embodiment of the present application. As shown in Figure 10, the method may include the following steps S31 to S37.
Step S31: create the sub-DAGs (subgraphs).
Specifically, the graph compiler 102 constructs a corresponding number of sub-DAGs according to the number of dies in the chip. It should be understood that each subgraph is initially an empty graph. For example, the graph compiler 102 creates N subgraphs corresponding to the N dies.
Step S32: traverse the first computation graph.
Specifically, the graph compiler 102 traverses each first operator in the first computation graph and obtains the split result corresponding to each first operator.
Step S33: read the operator split result.
Specifically, the graph compiler 102 reads the split result of the current first operator. For example, the graph compiler 102 reads the split result of the i-th first operator. The multiple operators obtained by splitting the i-th first operator are distributed across multiple of the N dies (for example, including the p-th die mentioned above).
Step S34: determine whether a communication operator needs to be inserted.
Specifically, based on the split result of the current i-th first operator, the graph compiler 102 determines whether a communication operator needs to be inserted before the split i-th first operator; if not, step S35 is executed directly, and if so, step S36 is executed. As described above, if the split axis changes between two adjacent operators, part of the data needed by the later operator resides on other dies; in that case a communication operator must be inserted to fetch the data needed on the current die from the other dies.
Optionally, please also refer to the embodiments corresponding to Figures 3a and 3b. In general, the original tensor is the complete data, but after splitting, an operator often only needs part of that data for its computation; in that case a Slice operator needs to be inserted to cut the original tensor down to the data needed by the split operator in the subgraph.
In addition to the communication operator and the slice operator, if the current i-th first operator is split along a Reduction axis, that is, when the i-th first operator is a Reduce computation (such as ReduceSum or ReduceMax), the data on the multiple dies must be reduced, so an AllReduce (reduction) operator needs to be inserted.
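The need for the reduction operator can be seen in a small NumPy example (illustrative only; the reduce and AllReduce steps are performed here with plain array operations rather than the actual operators inserted by the graph compiler 102):

```python
import numpy as np

full = np.arange(12.0).reshape(3, 4)            # an M x K input tensor, M = 3, K = 4
die0, die1 = full[:, :2], full[:, 2:]           # the reduction axis K split across two dies
partial0 = die0.sum(axis=1)                     # per-die partial ReduceSum over K/2 elements
partial1 = die1.sum(axis=1)
combined = partial0 + partial1                  # what the inserted AllReduce(sum) produces
assert np.allclose(combined, full.sum(axis=1))  # matches the unsplit ReduceSum
```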
Step S35: construct the split operators.
Specifically, the graph compiler 102 constructs multiple split copies of the i-th first operator according to the split result of the current i-th first operator. For example, if the i-th first operator is operator A in Figure 3a, the multiple split copies of the i-th first operator may be operators a1 and a2 shown in Figure 3a. Clearly, after an operator in the original DAG (that is, a first operator in the first computation graph) is split along the optimal split axis, the shapes of its corresponding input/output tensors change, so new split operators need to be constructed. During construction, the attributes of the original operator can be copied, but the shapes of the input/output tensors need to be modified to the post-split shapes.
For example, still taking Figure 3a as an example, when constructing operators a1 and a2, the attributes of operator A (for example, the matrix-multiplication computation type) can be copied, the shape of the left input matrix is modified to (M/2)×K, the shape of the output matrix is modified to (M/2)×N, and the shape of the right input matrix remains unchanged.
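A minimal sketch of this construction step follows; the dictionary-based operator description and the dims_to_split mapping are assumptions made for illustration, not the graph compiler's actual data structures:

```python
import copy

def build_split_op(original_op, die_index, dims_to_split, num_dies):
    """Step S35 sketch: copy the original operator's attributes and shrink the split dimensions.

    original_op   -> dict with a "name", arbitrary attributes, and a "tensors" map of name -> list of dims
    dims_to_split -> tensor name -> dimension index cut by the chosen split axis
    """
    new_op = copy.deepcopy(original_op)                     # keeps the compute type and other attributes
    for tensor_name, dim in dims_to_split.items():
        new_op["tensors"][tensor_name][dim] //= num_dies    # e.g. left input M x K -> (M/2) x K
    new_op["name"] = f"{original_op['name']}_{die_index}"   # e.g. operator A -> a1, a2
    return new_op
```

For the Figure 3a example, calling this with dims_to_split = {"left_input": 0, "output": 0} and num_dies = 2 would shrink M×K to (M/2)×K and M×N to (M/2)×N while leaving the right input matrix untouched.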
Step S36: construct the communication operator.
Specifically, as described in step S34 above, if the split axis changes between two adjacent operators, part of the data needed by the later operator resides on other dies. Accordingly, if the optimal split axes of the (i-1)-th first operator and the i-th first operator differ, the graph compiler 102 can construct a corresponding communication operator before the split i-th first operator.
Step S37: add the constructed operators to the sub-DAGs.
Specifically, the graph compiler 102 adds the split operators constructed in step S35 (for example, the multiple second operators obtained after the i-th first operator is split along the optimal split axis) and the inserted operators constructed in step S36 (for example, the communication operators and slice operators) to the corresponding sub-DAGs. Steps S33 to S37 are then repeated until every first operator in the first computation graph has been traversed, yielding N second computation graphs in one-to-one correspondence with the N dies. Each second computation graph includes multiple second operators, which include the split first operators as well as the inserted slice, communication, and reduction operators.
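Putting steps S31 to S37 together, the per-die sub-DAG construction might look like the following sketch. The sub-DAGs are modeled here as plain operator lists, first_graph is assumed to be a topologically ordered list of operator dicts, build_split_op is the step S35 sketch above, and the make_* callables (which construct the inserted communication, slice, and reduction operators) are assumptions for illustration:

```python
def split_graph(first_graph, split_table, num_dies, make_comm_op, make_slice_op, make_reduce_op):
    """Build one sub-DAG per die from the recorded per-operator split results (steps S31-S37)."""
    sub_dags = [[] for _ in range(num_dies)]                    # S31: one initially empty subgraph per die
    prev_axis = None
    for op in first_graph:                                      # S32: traverse the first computation graph
        result = split_table[op["name"]]                        # S33: read this operator's split result
        for die, sub_dag in enumerate(sub_dags):
            if prev_axis is not None and prev_axis != result["axis"]:
                sub_dag.append(make_comm_op(op, die))           # S34/S36: fetch remote input slices
            sub_dag.append(make_slice_op(op, die))              # cut the full tensor down to the local part
            sub_dag.append(build_split_op(op, die, result["dims_to_split"], num_dies))  # S35
            if result["axis_type"] == "Reduction":
                sub_dag.append(make_reduce_op(op, die))         # AllReduce over the per-die partial results
        prev_axis = result["axis"]                              # S37: loop back to S33 for the next operator
    return sub_dags
```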
Optionally, all of the method flows in step S503 above may specifically be executed by the graph splitting unit 23 in the graph compiler 102.
Step S504: distribute the N second computation graphs to the N dies for execution.
Specifically, the graph splitting unit 23 in the graph compiler 102 outputs a subgraph list containing the N second computation graphs to the model compilation unit 24; each second computation graph is executed by a corresponding die. The model compilation unit 24 then outputs a corresponding model list based on the subgraph list to the model deployment unit 25. Finally, the model deployment unit 25 deploys the models corresponding to the N second computation graphs onto the corresponding dies for execution.
Optionally, each method flow in the multi-die-based computing method described in the embodiments of the present application may be implemented in software, in hardware, or in a combination of the two. A hardware implementation may include logic circuits, algorithm circuits, analog circuits, and the like. A software implementation may include program instructions, which can be regarded as a software product stored in a memory and executable by a processor to implement the related functions.
In summary, the embodiments of the present application improve the graph compiler and optimize the compilation and deployment of DAG graphs in multi-die chip scenarios. Based on operator splitting, the embodiments of the present application can split each complete computation in the DAG graph into multiple smaller sub-computations, split the original DAG graph into multiple sub-DAG graphs, compile these sub-DAG graphs into multiple models, and finally deploy them onto the multiple dies within a chip with a NUMA architecture. Furthermore, by comparing the split benefits of different split schemes (that is, different split axes), the embodiments of the present application select the optimal split scheme for each operator in the DAG graph and split each operator into multiple operators according to its optimal split scheme; these operators compute on multiple dies at the same time, making full use of the multi-die computing resources within the chip and improving computing efficiency.
As described above, based on the series of solutions provided in the embodiments of the present application, the embodiments of the present application can bring the following beneficial effects.
(1) The present application allows users to treat a multi-die chip as a single chip without needing to care about the multi-die topology within the chip, which simplifies user development.
(2) Through the calculation of the split benefit and the selection of the optimal split axis, the present application ensures that the computation on each NUMA node only needs to access that node's own storage. In other words, the computation executed by each die only needs to access the storage on that die, so operator implementations do not need to be aware of cross-die memory access, which simplifies operator development. Moreover, because the computation executed by each die only needs to access the storage on that die, the requirements on inter-die bandwidth and topology are reduced, and high computing performance can still be achieved even with relatively low inter-die bandwidth. Users can devote as much of the chip area as possible to computation, which increases the chip's compute density.
Based on the description of the above method embodiments, an embodiment of the present application further provides an electronic device. Please refer to Figure 11, which is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in Figure 11, the electronic device 110 includes at least a processor 1101, an input device 1102, an output device 1103, and a memory 1104; the electronic device may also include other general-purpose components, which are not described in detail here. The processor 1101, the input device 1102, the output device 1103, and the memory 1104 in the electronic device may be connected by a bus or in other ways. The electronic device 110 may be a smart wearable device, a smartphone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer, or a server, among others, or may be a server cluster or a cloud computing service center composed of multiple servers.
The memory 1104 in the electronic device 110 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1104 may exist independently and be connected to the processor 1101 through a bus, or the memory 1104 may be integrated with the processor 1101.
A computer-readable storage medium may be stored in the memory 1104 of the electronic device 110; the computer-readable storage medium is configured to store a computer program, the computer program includes program instructions, and the processor 1101 is configured to execute the program instructions stored in the computer-readable storage medium. The processor 1101 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device 110; it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions so as to implement the corresponding method flows or functions. In one embodiment, the processor 1101 described in the embodiments of the present application may be used to perform a series of processing of the multi-die-based computing method, including: obtaining a first computation graph, the first computation graph including M first operators; splitting the M first operators respectively to obtain the split results of the M first operators; splitting the first computation graph based on the split results of the M first operators to obtain the corresponding N second computation graphs, where each of the N second computation graphs includes split first operators, and N and M are integers greater than or equal to 1; and distributing the N second computation graphs to N dies for execution, the N second computation graphs corresponding one-to-one to the N dies; and so on. For details, please refer to the related descriptions in the embodiments corresponding to Figures 1 to 10 above, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor can perform some or all of the steps of any one of the above method embodiments.
An embodiment of the present application further provides a computer program, where the computer program includes instructions, and when the computer program is executed by a multi-core processor, the processor can perform some or all of the steps of any one of the above method embodiments.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. It should be noted that, for brevity, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments and that the actions and modules involved are not necessarily required by the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into the above units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in a computer device) to perform all or some of the steps of the above methods of the embodiments of the present application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a double data rate synchronous dynamic random access memory (DDR), a flash memory, or a random access memory (RAM).
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A multi-die-based computing method, wherein the method comprises:
    obtaining a first computation graph, the first computation graph comprising M first operators;
    splitting the M first operators respectively to obtain split results of the M first operators;
    splitting the first computation graph based on the split results of the M first operators to obtain corresponding N second computation graphs, wherein each of the N second computation graphs comprises a split first operator, and N and M are integers greater than or equal to 1; and
    distributing the N second computation graphs to N dies for execution, wherein the N second computation graphs correspond one-to-one to the N dies.
  2. The method according to claim 1, wherein the splitting the M first operators respectively to obtain the split results of the M first operators comprises:
    determining an optimal split axis of an i-th first operator among the M first operators; and
    splitting the i-th first operator based on the optimal split axis of the i-th first operator to obtain a split result of the i-th first operator, wherein i is an integer greater than or equal to 1 and less than or equal to M.
  3. The method according to claim 2, wherein the i-th first operator comprises K split axes, and the determining the optimal split axis of the i-th first operator among the M first operators comprises:
    determining a computation benefit and a communication time corresponding to each of the K split axes of the i-th first operator; and
    determining, based on the difference between the computation benefit and the communication time corresponding to each of the K split axes, a split benefit corresponding to each of the K split axes, wherein the split axis with the largest split benefit is the optimal split axis of the i-th first operator, and K is an integer greater than or equal to 1.
  4. The method according to claim 3, wherein the computation benefit corresponding to a j-th split axis of the i-th first operator is the difference between a first compute time and a second compute time, the first compute time being the time a single die needs to execute the i-th first operator, and the second compute time being the time multiple dies need to execute in parallel the i-th first operator after it has been split along the j-th split axis, wherein j is an integer greater than or equal to 1 and less than or equal to K.
  5. The method according to claim 4, wherein the communication time corresponding to the j-th split axis of the i-th first operator is the time a p-th die needs to obtain target data from other dies, the p-th die being one of the multiple dies to which the i-th first operator is assigned after being split along the j-th split axis, and the target data being the data the p-th die needs to execute the i-th first operator after it has been split along the j-th split axis, wherein the communication time is related to the amount of the target data and to the memory layout of the target data, and p is an integer greater than or equal to 1 and less than or equal to N.
  6. The method according to claim 5, wherein the split result of the i-th first operator comprises: an output tensor list of the i-th first operator; original shapes of one or more input tensors and output tensors corresponding to the i-th first operator; shapes of the one or more input tensors and/or output tensors corresponding to the i-th first operator after being split along the optimal split axis; and one or more dies to which the i-th first operator is assigned after being split along the optimal split axis.
  7. The method according to claim 6, wherein a p-th second computation graph among the N second computation graphs comprises multiple second operators, the multiple second operators comprising the i-th first operator after being split along the optimal split axis, and the p-th second computation graph being the second computation graph assigned to the p-th die for execution.
  8. The method according to claim 7, wherein the multiple second operators in the p-th second computation graph further comprise one or more of a slice operator, a communication operator, and a reduction operator, wherein:
    the slice operator is configured to obtain an input tensor of the i-th first operator after the i-th first operator has been split along the optimal split axis;
    the communication operator is configured to, when the optimal split axis of the i-th first operator differs from the optimal split axis of an (i-1)-th first operator, obtain, from other dies, an input tensor of the i-th first operator after the i-th first operator has been split along the optimal split axis; and
    the reduction operator is configured to, when the optimal split axis of the i-th first operator is a reduction axis, reduce the data on the corresponding multiple dies.
  9. An electronic device, comprising N dies, wherein the electronic device is configured to implement the method according to any one of claims 1 to 8, and N is an integer greater than or equal to 1.
  10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer or a processor, the method according to any one of claims 1 to 8 is implemented.
PCT/CN2023/115085 2022-09-29 2023-08-25 Multi-die-based computation method and related device WO2024066847A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211198266.2 2022-09-29
CN202211198266.2A CN117827419A (en) 2022-09-29 2022-09-29 Computing method based on multiple bare chips and related equipment

Publications (1)

Publication Number Publication Date
WO2024066847A1 true WO2024066847A1 (en) 2024-04-04

Family

ID=90476042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/115085 WO2024066847A1 (en) 2022-09-29 2023-08-25 Multi-die-based computation method and related device

Country Status (2)

Country Link
CN (1) CN117827419A (en)
WO (1) WO2024066847A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225100A1 (en) * 2017-02-03 2018-08-09 International Business Machines Corporation Splitting operators in a streaming application
CN113449857A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and data processing equipment
CN113449859A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and device
CN113994350A (en) * 2020-03-27 2022-01-28 华为技术有限公司 Generating parallel computing schemes for neural networks
CN114723014A (en) * 2022-04-20 2022-07-08 上海燧原科技有限公司 Tensor segmentation mode determination method and device, computer equipment and medium

Also Published As

Publication number Publication date
CN117827419A (en) 2024-04-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23870090

Country of ref document: EP

Kind code of ref document: A1