CN117827419A - Computing method based on multiple bare chips and related equipment


Info

Publication number
CN117827419A
CN117827419A (application CN202211198266.2A)
Authority
CN
China
Prior art keywords
operator
segmentation
ith
axis
die
Prior art date
Legal status
Pending
Application number
CN202211198266.2A
Other languages
Chinese (zh)
Inventor
刘锡明
朱思宇
林惠敏
葛根华
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211198266.2A
Priority to PCT/CN2023/115085 (published as WO2024066847A1)
Publication of CN117827419A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

The embodiments of the present application disclose a multi-die-based computing method and a related device. The method comprises: acquiring a first computation graph, where the first computation graph comprises M first operators; splitting the M first operators respectively to obtain splitting results of the M first operators; splitting the first computation graph based on the splitting results of the M first operators to obtain N corresponding second computation graphs, where each of the N second computation graphs includes a split first operator, and N and M are each integers greater than or equal to 1; and distributing the N second computation graphs to N dies for execution, where the N second computation graphs are in one-to-one correspondence with the N dies. By adopting the embodiments of the present application, the computing efficiency of a multi-die packaged chip can be improved.

Description

Computing method based on multiple bare chips and related equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a computing method based on multiple dies and related devices.
Background
With the slowing of Moore's law, the number of transistors within a single die (Die) is no longer growing rapidly, but the computing-power requirements of artificial intelligence (AI) chips are still increasing at a high rate. To address this problem, multi-Die packaging technology has been proposed, in which a plurality of AI Dies are packaged in one chip (Chip), thereby providing greater computing power.
With multi-Die packaging technology, there are two different memory access architectures: uniform memory access (UMA) and non-uniform memory access (NUMA). Under the UMA architecture, each Die in the Chip can use all the memory in the Chip equally; this generally requires the inter-Die interconnect bandwidth to be comparable to the memory bandwidth, so that memory accesses to different locations have consistent performance. However, in order to provide the highest possible computing-power density, as much of the Chip area as possible is devoted to AI compute, leaving very little area for inter-Die interconnects. Moreover, since AI chips typically use high-bandwidth memory (HBM), i.e., the memory bandwidth within the Chip is very high, it is difficult to raise the inter-Die interconnect bandwidth to a level comparable to the memory bandwidth.
Therefore, the NUMA structure is usually chosen for multi-Die packaging. Under the NUMA structure, however, cross-Die data accesses occur easily, which greatly increases access latency and reduces overall computing efficiency. How to improve the computing efficiency of a multi-Die packaged chip is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a computing method based on multiple bare chips and related equipment, which can effectively improve the computing efficiency of a multi-Die packaged chip.
The multi-die-based computing method provided by the embodiments of the present application may be executed by an electronic device or the like. An electronic device is a device that can be abstracted as a computer system, and an electronic device supporting the multi-die-based computing function may also be referred to as a multi-die-based computing apparatus. The multi-die-based computing apparatus may be a complete electronic device, for example a smart wearable device, a smartphone, a tablet computer, a notebook computer, a desktop computer, an in-vehicle computer, or a server; it may be a system/device composed of multiple complete machines; or it may be a component of the electronic device, for example a chip related to the multi-die-based computing function, such as a system-on-a-chip (SoC), where the system chip is also referred to as a system on chip. The embodiments of the present application are not specifically limited in this respect.
In a first aspect, the embodiments of the present application provide a multi-die-based computing method, the method comprising: acquiring a first computation graph, where the first computation graph comprises M first operators; splitting the M first operators respectively to obtain splitting results of the M first operators; splitting the first computation graph based on the splitting results of the M first operators to obtain N corresponding second computation graphs, where each of the N second computation graphs includes a split first operator, and N and M are each integers greater than or equal to 1; and distributing the N second computation graphs to N dies for execution, where the N second computation graphs are in one-to-one correspondence with the N dies.
With the method provided in the first aspect, the present application can split each complete computation (e.g., a first operator) in the computation graph into a plurality of smaller sub-computations (e.g., the split first operators) based on operator splitting, and allocate these sub-computations to a corresponding plurality of dies for parallel computation. The computing resources of each die can thus be fully utilized, greatly improving the computing efficiency of a chip packaged with multiple dies.
In a possible implementation, splitting the M first operators respectively to obtain the splitting results of the M first operators includes: determining an optimal splitting axis of an ith first operator among the M first operators; and splitting the ith first operator based on the optimal splitting axis of the ith first operator to obtain a splitting result of the ith first operator, where i is an integer greater than or equal to 1 and less than or equal to M.
In embodiments of the present application, each operator (e.g., first operator) in the original computational graph may correspond to a plurality of split axes. Based on this, for each first operator, the embodiment of the present application first needs to determine an optimal splitting axis from a plurality of splitting axes corresponding to the first operator, and then split the first operator based on the optimal splitting axis, so as to ensure the calculation efficiency of a plurality of operators obtained after each first operator is split on a plurality of dies.
In a possible implementation, the ith first operator includes K splitting axes, and determining the optimal splitting axis of the ith first operator among the M first operators includes: determining the computation gain and the communication time consumption corresponding to each of the K splitting axes of the ith first operator; and determining the splitting benefit of each of the K splitting axes based on the difference between its computation gain and its communication time consumption, where the splitting axis with the largest splitting benefit is the optimal splitting axis of the ith first operator, and K is an integer greater than or equal to 1.
In this embodiment of the present application, the splitting benefit that each splitting axis can actually bring may be calculated based on the computation gain (i.e., the reduction in computation time) and the communication time consumption corresponding to that splitting axis of the current operator. The splitting axis with the largest benefit is then determined as the optimal splitting axis of the current operator, so as to ensure the computing efficiency when each operator is split and distributed to a plurality of dies for execution. Furthermore, in some possible embodiments K may also be equal to 0, i.e., an operator may have no splitting axis. Obviously, an operator without a splitting axis cannot be split; in that case the operator may be deployed on each of the N dies, so that the N dies perform the same computation. Accordingly, if the preceding operator of this operator has already been split, the N dies need to obtain the data required for the computation through cross-die communication when executing this operator.
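For illustration only (the SplitAxis fields, helper names, and numeric values below are assumptions, not the claimed implementation), a minimal sketch of this axis selection, choosing the axis that maximizes the difference between computation gain and communication time consumption, might look as follows:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SplitAxis:
    name: str            # e.g. "M", "K", "N" for a MatMul operator
    compute_gain: float  # reduction in computation time if split along this axis (s)
    comm_cost: float     # cross-die communication time this split would cause (s)

def select_optimal_axis(axes: List[SplitAxis]) -> Optional[SplitAxis]:
    """Return the axis with the largest splitting benefit (gain - cost), or None
    when K == 0, i.e. the operator has no splitting axis and runs whole on every die."""
    best = None
    for axis in axes:
        benefit = axis.compute_gain - axis.comm_cost
        if best is None or benefit > best[0]:
            best = (benefit, axis)
    return best[1] if best else None

# Made-up numbers: all three axes give the same compute gain, but only the M axis
# needs no cross-die communication, so it yields the largest benefit.
axes = [SplitAxis("M", compute_gain=0.75, comm_cost=0.0),
        SplitAxis("K", compute_gain=0.75, comm_cost=0.40),
        SplitAxis("N", compute_gain=0.75, comm_cost=0.10)]
print(select_optimal_axis(axes).name)  # -> "M"
```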
In a possible implementation, the computation gain corresponding to the jth splitting axis of the ith first operator is the difference between a first computation time and a second computation time, where the first computation time is the time required by a single die to execute the ith first operator, and the second computation time is the time required by a plurality of dies to execute, in parallel, the ith first operator after it is split by the jth splitting axis; j is an integer greater than or equal to 1 and less than or equal to K.
In this embodiment of the present application, the reduction in computation time after one complete computation (for example, a first operator) is split by the current splitting axis and distributed over a plurality of dies is used as the computation gain of that splitting axis, which provides support for the subsequent determination of the optimal splitting axis and ensures the computing efficiency of the split operator. For example, if the computation time of the ith first operator on a single die is T, the computation time when the ith first operator is split by its jth splitting axis and distributed to 4 dies is theoretically T/4, so the computation gain of the jth splitting axis is (T - T/4).
In a possible implementation, the communication time consumption corresponding to the jth splitting axis of the ith first operator is the time required by a pth die to acquire target data from other dies, where the pth die is one of the plurality of dies to which the ith first operator is allocated after being split by the jth splitting axis, and the target data is the data required by the pth die when executing the ith first operator after it is split by the jth splitting axis. The communication time consumption is related to the amount of the target data and the memory arrangement of the target data; p is an integer greater than or equal to 1 and less than or equal to N.
In this embodiment of the present application, the communication time consumption of each splitting axis can be determined from the time spent acquiring, from other dies, the data required when the operator split by that axis is computed on the corresponding die. Obviously, the smaller (or zero) the communication time consumption of a splitting axis, the greater the splitting benefit it can bring. By taking communication time consumption into account, the present application therefore makes the computation distributed on each die after splitting by the optimal splitting axis access, as far as possible, only the storage within that die, avoiding the latency caused by cross-die data access. On this basis, high computing performance can be achieved even when the inter-Die bandwidth is relatively low, so that the area of the multi-Die packaged chip can be devoted as much as possible to computation, increasing the computing-power density of the multi-Die packaged chip. It should be noted that, except for the case where the preceding operator has no splitting axis (i.e., the preceding operator is not split), in general, if the current splitting axis of the current operator differs from the optimal splitting axis of the preceding operator, a die usually needs to communicate across dies to obtain computation data from other dies when executing the split current operator.
In a possible implementation, the splitting result of the ith first operator includes: the output tensor list of the ith first operator; the original shapes of the one or more input tensors and the output tensor of the ith first operator; the shapes of the one or more input tensors and/or output tensors of the ith first operator after being split by the optimal splitting axis; and the one or more dies to which the ith first operator is allocated after being split by the optimal splitting axis.
In this embodiment of the present application, each split operator may have a set of splitting results (for example, the operator's output tensor list, the input/output tensor shapes before splitting, the input/output tensor shapes after splitting, and the dies to which the split operator is allocated). These splitting results provide effective support for the subsequent graph splitting of the original computation graph (for example, the first computation graph), so that the plurality of second computation graphs corresponding one-to-one to the plurality of dies can be constructed quickly and efficiently, improving the computing efficiency of the multiple dies.
In one possible implementation, a p-th second computational graph of the N second computational graphs includes a plurality of second operators; the plurality of second operators comprise the ith first operator after being split by the optimal splitting axis; the p-th second computational graph is a second computational graph assigned to be executed on a p-th die.
In this embodiment of the present application, each second computation graph obtained through operator splitting and graph splitting may include the split first operator. Each complete computation is thus split and distributed to a plurality of dies, and each die only needs to execute the computation of part of the data of the original first operator, which effectively improves the computing efficiency of the multi-die packaged chip.
In one possible implementation, the plurality of second operators in the pth second computational graph further include one or more of a slicing operator, a communication operator, and a reduction operator; the slicing operator is used to obtain the input tensor of the ith first operator after it is split by the optimal splitting axis; the communication operator is used to obtain, from other dies, the input tensor of the ith first operator after it is split by the optimal splitting axis when the optimal splitting axis of the ith first operator differs from the optimal splitting axis of the (i-1)th first operator; and the reduction operator is used to reduce the data on the corresponding plurality of dies when the optimal splitting axis of the ith first operator is a reduction axis.
In this embodiment of the present application, in addition to the split first operator, other corresponding operators, for example a communication operator for acquiring computation data across dies, need to be constructed in the second computation graph according to the actual situation, so as to ensure that the split first operator executes reliably on the die.
In a second aspect, an embodiment of the present application provides an electronic device including N dies, where N is an integer greater than or equal to 1. The electronic device is configured to implement the corresponding function in any of the multi-die-based computing methods provided in the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor configured to support the electronic device to perform a corresponding function in any one of the methods provided in the first aspect. The electronic device may also include a memory for coupling with the processor that holds the program instructions and data necessary for the electronic device. The electronic device may also include a communication interface for the electronic device to communicate with other devices or communication networks.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program that, when executed by a processor, implements any of the multi-die based computing method flows provided in the first aspect above.
In a fifth aspect, embodiments of the present application provide a computer program comprising instructions that, when executed by a computer, cause the computer to perform any of the multi-die based computing method flows provided in the first aspect above.
In a sixth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the processor is configured to invoke and execute instructions from the communication interface, and when the processor executes the instructions, cause the chip to perform any of the multi-die-based computing method flows provided in the first aspect above.
In a seventh aspect, an embodiment of the present application provides a chip system, where the chip system includes the electronic device of any one of the second aspect or the third aspect, and is configured to implement a function related to any one of the multi-die-based computing method flows provided in the first aspect. In one possible design, the chip system further includes a memory to hold program instructions and data necessary for the multi-die based computing method. The chip system may be formed of a chip or may include a chip and other discrete devices.
Drawings
Fig. 1 is a schematic diagram of a system architecture based on a multi-Die chip according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a graph compiler according to an embodiment of the present application.
Fig. 3a is a schematic diagram of operator segmentation according to an embodiment of the present application.
Fig. 3b is a schematic diagram of another operator segmentation provided in an embodiment of the present application.
Fig. 4 is a flowchart of a multi-die-based computing method according to an embodiment of the present application.
Fig. 5 is a flow chart of an operator segmentation method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of splitting-benefit calculation according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a computation time-consumption calculation method according to an embodiment of the present application.
Fig. 8 is a flowchart of a communication time-consumption calculation method according to an embodiment of the present application.
Fig. 9 is a schematic diagram in which the splitting axes of the preceding and succeeding operators are the same, according to an embodiment of the present application.
Fig. 10 is a flowchart of a method for graph cut according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
The terms "first", "second", and the like in the description, claims, and drawings of the present application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprising", "including", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not listed or inherent to such a process, method, article, or apparatus. It will be understood that when an element is referred to as being "coupled to" or "connected to" another element, it can be directly or indirectly connected to the other element.
It should be understood that in this application, "at least one" means one or more, and "a plurality of" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of" the following items or similar expressions refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a processor and the processor can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between 2 or more computers. Furthermore, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with one another in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
First, some terms in this application will be explained to facilitate understanding by those skilled in the art.
(1) Die (bare chip): the die of a chip before packaging, cut from a silicon wafer (Wafer) by laser; each Die is an independent functional chip. A Die is subsequently packaged as a unit to form an ordinary chip. To meet the computing-power demands of today's artificial-intelligence chips, the industry has proposed packaging a plurality of Dies in one chip, thereby providing greater computing power.
(2) An operator is a mapping from one function space to another; in a broad sense, an operator can be generalized to any space, such as an inner product space. Broadly, any function that performs an operation can be regarded as an operator, such as matrix multiplication (MatMul); even exponentiation and square root can be regarded as operators.
(3) Directed acyclic graph (DAG), i.e., a computation graph. In graph theory, a directed graph is a directed acyclic graph if it is impossible to start from any node and return to that node by traversing several edges.
In order to facilitate understanding of the embodiments of the present application, the technical problems specifically addressed by the present application are further analyzed below. As described above, when multi-Die packaging technology is adopted, the NUMA architecture is generally used. However, the NUMA architecture typically has the problem of accessing data across Dies, which greatly increases access latency and thereby reduces overall computing efficiency. The prior art includes various schemes for improving the computing efficiency of a multi-Die packaged chip under the NUMA architecture; a commonly used example, the multi-chip-module GPU (MCM-GPU), is described below.
The MCM-GPU may include the following technical points:
(1) Part of the level-2 cache (L2 Cache) is set aside as an L1.5 Cache dedicated to caching the data of remote Dies, so as to improve remote-Die data-access performance.
(2) Distributed and batched scheduling of compute thread arrays (CTAs): CTAs are scheduled in groups, and the CTAs of the same group are scheduled to the same GPU module (GPM), so as to improve the cache hit rate.
(3) First-Touch Mapping: when a page is first accessed, it is mapped to the physical memory of the GPM that accesses it.
Accordingly, MCM-GPUs suffer from the following shortcomings:
(1) The capacity of the L2 Cache is reduced, which affects the cache hit rate of each GPM.
(2) The MCM-GPU places high requirements on inter-Die bandwidth, which has a great impact on computing performance. However, scalability is poor: as the number of Dies continues to increase, it is difficult to scale the inter-Die bandwidth accordingly, so the computing efficiency of the MCM-GPU is ultimately difficult to improve effectively.
Therefore, in order to solve the problem of the low computing efficiency of current multi-Die chips, the technical problems to be solved by the present application include the following aspects: based on existing hardware and the NUMA architecture, each complete operator in the original computational graph is split, through computational-graph compilation, into a plurality of operators, which are distributed to a corresponding plurality of Dies for computation, so that the computing resources of each Die are fully utilized and the computing efficiency of the multi-Die packaged chip is greatly improved. Meanwhile, through the selection of an optimal splitting axis, the data required by the computation distributed on each Die after operator splitting is essentially kept in that Die's local storage, and cross-Die data access is avoided as much as possible. In this way, the embodiments of the present application can effectively improve the computing efficiency of the whole multi-Die chip even when the inter-Die interconnect bandwidth is relatively low.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture based on a multi-Die chip according to an embodiment of the present application. As shown in fig. 1, the embodiment of the present application is mainly applicable to training and inference scenarios of AI models. AI developers can develop training/inference scripts using an AI framework such as TensorFlow, PyTorch, or MindSpore, and then trigger the execution of training/inference through the AI framework. As shown in fig. 1, the system architecture may include an AI framework 101, a graph compiler 102, a memory management 103, an operator compiler 104, and a chip 105. The chip 105 is a multi-Die chip and may include a plurality of Dies such as Die 0, Die 1, Die 2, and Die 3 shown in fig. 1.
The AI framework 101 is used to construct the user's computation into a DAG graph. The AI framework 101 then sends the DAG graph to the graph compiler 102 for compilation. It should be understood that, in the DAG graph, computations are expressed by nodes, and the data transferred between computations or the dependency relationships between computations are expressed by the edges between nodes; each node in the DAG graph is an operator.
The graph compiler 102 is configured to compile the operators in the DAG graph in a certain topological order, finally compiling all the nodes of the DAG graph into a task (Task) list, and to issue the task list to the memory management (runtime) 103. The task list may include a plurality of computing tasks.
Memory management 103 for issuing a task list onto all Die within chip 105.
A chip 105 for executing the issued task list through a plurality of Die therein.
Obviously, in a single-Die package scenario (i.e., the chip 105 includes only one Die), the memory management 103 only needs to issue the task list to that single Die within the chip 105. In a multi-Die package scenario, however, the memory management 103 needs to issue task lists onto the multiple Dies within the chip 105. In some embodiments of the present application, the graph compiler 102 may split each complete operator in the DAG graph, split the original DAG graph into multiple subgraphs based on the splitting results, and then allocate the split DAG graphs to the corresponding multiple Dies for execution through the memory management 103, thereby greatly improving the computing efficiency of the multi-Die chip.
It should be noted that the product in which the present application is implemented may be a training/inference apparatus of a data center, deployed on a training/inference server of the data center, or any other possible device, for example an edge device that performs computation based on a DAG graph, such as a camera on a street lamp with AI functions (for example, face recognition). The present application mainly deploys the compiled computation onto the multiple Dies within a chip by improving the compilation process of the DAG graph.
Further, referring to fig. 2, fig. 2 is a schematic structural diagram of a graph compiler according to an embodiment of the present application. As shown in fig. 2, the graph compiler 102 may include a graph sorting unit 21, an operator segmentation unit 22, a graph segmentation unit 23, a model compiling unit 24, and a model deployment unit 25. Optionally, the graph compiler 102 may also include an environment information base 26 and an operator information base 27. As shown in fig. 2, AI framework 101 may be coupled to graph ordering unit 21 in graph compiler 102 and model deployment unit 25 may be coupled to memory management 103.
The graph ordering unit 21 is configured to convert the DAG into an operator list through topological ordering, where the operator list includes a plurality of operators. The order of operators in the operator list expresses the order in which the operators are executed.
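As a rough sketch only (the function and variable names are illustrative and are not taken from the patent), the topological sorting performed by the graph ordering unit 21 could be implemented, for example, with Kahn's algorithm:

```python
from collections import defaultdict, deque

def topological_order(edges):
    """Kahn's algorithm: edges is a list of (producer, consumer) operator pairs.
    Returns the operators in an order in which each operator appears after all of
    its producers, mirroring the operator list produced by the graph sorting unit."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return order

# Example DAG: A -> B -> C and A -> C
print(topological_order([("A", "B"), ("B", "C"), ("A", "C")]))  # ['A', 'B', 'C']
```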
The operator segmentation unit 22 is configured to perform operator segmentation processing on a plurality of operators in the DAG graph, and generate a segmentation result of each operator. Specifically, as shown in fig. 2, when the operator slicing unit 22 performs operator slicing, the corresponding environmental information and operator information may be read through the environmental information base 26 and the operator information base 27, respectively. The environmental information in the environmental information base 26 may include: the number of die within a chip, the memory specifications of the individual die, the interconnect topology and bandwidth between the die, etc. Wherein the operator information in the operator information base 27 may include: operator type, input/output list, axis information for each input/output.
As described above, compared with a conventional environment information base, the environment information base 26 in the present application changes the original description of a single Die per chip to multiple Dies per chip in the environment information, and adds descriptions of the topology and bandwidth of the inter-Die interconnect network. Further, the operator information base 27 in the present application adds, for each operator, a description of its input/output axis information. It should be understood that axis information expresses the splitting relationships among the different input/output tensors (Tensors) of an operator: all inputs/outputs that share the same axis must be split in the same manner (i.e., along the same splitting axis).
Optionally, referring to fig. 3a, fig. 3a is a schematic diagram of operator segmentation according to an embodiment of the present application. The operator A in fig. 3a is a matrix multiplication (MatMul) operator comprising two input matrices and one output matrix, where the left input matrix is an M×K matrix, the right input matrix is a K×N matrix, and the output matrix is an M×N matrix. Operator A includes three splitting axes, namely the M axis, the K axis, and the N axis. As shown in fig. 3a, the left input matrix and the output matrix of operator A are split along the same splitting axis (i.e., the M axis, as indicated by the dashed line in fig. 3a), obtaining the split operator a1 and operator a2, where the left input matrix of each of operator a1 and operator a2 is an (M/2)×K matrix and the output matrix is an (M/2)×N matrix; obviously, the data each of operator a1 and operator a2 needs to compute is only half that of the original operator A. Thus, through the operator splitting unit 22, one operator computing complete data can be split into two operators each computing part of the data.
Optionally, referring to fig. 3b, fig. 3b is a schematic diagram of another operator segmentation according to an embodiment of the present application. As shown in fig. 3b, the right input matrix and the output matrix of operator A are split along the same splitting axis (i.e., the N axis, as indicated by the dashed line in fig. 3b), obtaining the split operator a3 and operator a4, where the right input matrix of each of operator a3 and operator a4 is a K×(N/2) matrix and the output matrix is an M×(N/2) matrix; obviously, the data each of operator a3 and operator a4 needs to compute is only half that of the original operator A. Thus, through the operator splitting unit 22, one operator computing complete data can be split into two operators each computing part of the data.
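Purely as a numerical illustration of figs. 3a and 3b (assuming two dies and using NumPy, which is not part of the described system), the following sketch verifies that concatenating the two half-computations reproduces the result of the original operator A:

```python
import numpy as np

M, K, N = 4, 6, 8
lhs = np.random.rand(M, K)   # left input matrix of operator A
rhs = np.random.rand(K, N)   # right input matrix of operator A
full = lhs @ rhs             # operator A on a single die: M x N output

# Split along the M axis (Fig. 3a): each half produces an (M/2) x N output.
out_a1 = lhs[: M // 2] @ rhs
out_a2 = lhs[M // 2 :] @ rhs
assert np.allclose(np.concatenate([out_a1, out_a2], axis=0), full)

# Split along the N axis (Fig. 3b): each half produces an M x (N/2) output.
out_a3 = lhs @ rhs[:, : N // 2]
out_a4 = lhs @ rhs[:, N // 2 :]
assert np.allclose(np.concatenate([out_a3, out_a4], axis=1), full)
```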
The graph splitting unit 23 is configured to reorganize the split operators (for example, operator a1 and operator a2 shown in fig. 3a) into subgraphs (sub-DAGs) executed on multiple Dies based on the splitting results of the operator splitting, where each subgraph includes multiple split operators. For example, operator a1 above may be included in subgraph 1, which is allocated for computation on Die 0, and operator a2 above may be included in subgraph 2, which is allocated for computation on Die 1. It should be understood that the operator splitting unit 22 and the graph splitting unit 23 are newly added units in the graph compiler of the present application.
A model compiling unit 24 for compiling a plurality of sub DAGs into a model list (model list) that can be deployed. The model list includes a plurality of models, each model including a plurality of computing tasks. It should be appreciated that the model compilation unit 24 in the present application requires increased compilation support for multiple sub DAGs as compared to conventional model compilation.
The model deployment unit 25 is configured to deploy the plurality of model lists to a corresponding plurality of Die. Finally, the memory management 103 distributes the model lists to the corresponding Dies for execution, so that the computing power of each Die in the chip is fully utilized, and the computing efficiency of the multi-Die chip is effectively improved.
Referring to fig. 4, fig. 4 is a flowchart of a multi-die-based computing method according to an embodiment of the present application. The method is mainly aimed at a chip packaged by N bare chips, wherein N is an integer greater than or equal to 1. The method may be applied to the system architecture described in fig. 1, and in particular, the method may be applied to the graph compiler 102 shown in fig. 2, and the method provided by the embodiment of the present application will be described in detail below with reference to the graph compiler 102 described in fig. 2. As shown in fig. 4, the method may include the following steps S501 to S504.
In step S501, a first computation graph is obtained, where the first computation graph includes M first operators.
Specifically, the graph compiler 102 obtains a first computational graph that includes M first operators. M is an integer greater than or equal to 1. It should be understood that the first computation graph is a DAG graph before segmentation, and accordingly, M first operators included in the first computation graph are operators before segmentation, and each first operator corresponds to computation of one complete data.
Alternatively, the graph compiler 102 may obtain the first computation graph through the graph sorting unit 21 therein, and convert the first computation graph into an operator list through topological sorting, where the operator list includes the M first operators in a sequential arrangement.
Step S502, the M first operators are respectively segmented, and segmentation results of the M first operators are obtained.
Specifically, the graph compiler 102 sequentially performs operator segmentation processing on the M first operators, so as to obtain respective segmentation results of the M first operators.
Optionally, since each first operator may have multiple splitting axes, the graph compiler 102 may determine the optimal splitting axis of each first operator, and then split the first operator based on its optimal splitting axis, so as to obtain the optimal splitting result. Optionally, the optimal splitting axis may be the splitting axis that yields the greatest splitting benefit. For example, the graph compiler 102 may determine the optimal splitting axis of the ith first operator among the M first operators, and then split the ith first operator based on its optimal splitting axis to obtain the splitting result of the ith first operator, where i is an integer greater than or equal to 1 and less than or equal to M. The splitting result of the ith first operator includes: the output tensor list of the ith first operator; the original shapes of the one or more input tensors and the output tensor of the ith first operator (e.g., M×N for the output matrix shown in fig. 3a); the shapes of the one or more input tensors and/or output tensors of the ith first operator after being split by the optimal splitting axis (e.g., (M/2)×N); and the one or more dies to which the ith first operator is allocated after being split by the optimal splitting axis.
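As an illustrative sketch only (the record layout and field names below are assumptions rather than the actual format of the tensor segmentation table), a splitting-result entry of this kind could be represented as follows:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class OperatorCutRecord:
    """One hypothetical tensor-segmentation-table entry for a split operator."""
    op_name: str
    output_tensors: List[str]                    # output tensor list of the operator
    original_shapes: Dict[str, Tuple[int, ...]]  # input/output shapes before splitting
    cut_shapes: Dict[str, Tuple[int, ...]]       # input/output shapes after the optimal-axis split
    optimal_axis: str                            # e.g. "M"
    assigned_dies: List[int] = field(default_factory=list)

# Example entry for the MatMul of Fig. 3a split along the M axis onto two dies.
record = OperatorCutRecord(
    op_name="MatMul_A",
    output_tensors=["A_out"],
    original_shapes={"A_lhs": (1024, 512), "A_rhs": (512, 256), "A_out": (1024, 256)},
    cut_shapes={"A_lhs": (512, 512), "A_rhs": (512, 256), "A_out": (512, 256)},
    optimal_axis="M",
    assigned_dies=[0, 1],
)
print(record.optimal_axis, record.assigned_dies)
```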
Referring to fig. 5, fig. 5 is a flow chart of an operator segmentation method according to an embodiment of the present application. As shown in fig. 5, the method includes the following steps S11 to S17.
Step S11, operator information is acquired.
Specifically, the graph compiler 102 acquires the operator information of the M first operators (i.e., all operators to be split) from the operator information base 27. The operator information mainly includes the input tensor (Input Tensor) list, the output tensor (Output Tensor) list, and the axis information of each first operator. The axis information of each first operator may include: (1) the axis type, which may be, for example, element-wise, reduction (Reduction), or sliding window (SlidingWindow); different axis types express different computational characteristics of the operator; (2) the input tensors that each axis involves, and the corresponding dimension (dimension) of each such input tensor; (3) the output tensors that each axis involves, and the corresponding dimension of each such output tensor.
Step S12, if there is an uncomputed split axis, executing step S13, and if not, executing step S17.
Specifically, the graph compiler 102 checks whether the current first operator (e.g., the ith first operator) has any splitting axis whose splitting benefit has not yet been calculated. If such a splitting axis exists for the current ith first operator, the graph compiler 102 selects the next splitting axis and calculates its splitting benefit; if the splitting benefit has been calculated for all of the splitting axes of the current ith first operator (for example, K splitting axes, where K is an integer greater than or equal to 1), the graph compiler 102 records the splitting axis with the largest splitting benefit as the optimal splitting axis, and records the splitting result of the ith first operator under the optimal splitting axis in the tensor segmentation table. Optionally, the tensor segmentation table may be located in a database or a memory.
Step S13, selecting a split axis.
Specifically, if the current ith first operator has a splitting axis whose benefit has not been calculated, the graph compiler 102 may select the next splitting axis (e.g., the jth of the K splitting axes) and calculate its splitting benefit. For example, as shown in fig. 3a above, operator A includes three splitting axes; if the splitting benefit of the N axis has already been calculated at this point, the graph compiler 102 may select the M axis or the K axis to perform the corresponding splitting-benefit calculation.
Step S14, calculating segmentation profits.
Specifically, the graph compiler 102 calculates the segmentation yield of the ith first operator under the current segmentation axis.
Optionally, referring to fig. 6, fig. 6 is a schematic diagram of splitting-benefit calculation according to an embodiment of the present application. As shown in fig. 6, the splitting benefit relates to the computation gain and the communication time consumption (or communication loss). Specifically, the splitting benefit corresponding to each splitting axis is the difference between the computation gain and the communication time consumption: the larger the computation gain and the smaller the communication time consumption, the larger the splitting benefit.
The computation gain is the reduction in computation time brought about by splitting the ith first operator. Specifically, it is the difference between the time consumed when the ith first operator is executed on one die before splitting (for example, the first computation time) and the time consumed when the ith first operator, after being split by the current jth splitting axis, is distributed to a plurality of dies and computed in parallel (for example, the second computation time).
Illustratively, if the computation time of the ith first operator on a single die is T, the computation time when the ith first operator, split by its jth splitting axis, is distributed to 4 dies is theoretically T/4, so the computation gain of the jth splitting axis is (T - T/4). It should be noted that, in practice, the amount of data computed by an operator and the computation time are not linearly related, and the computation time on each die is often related to many factors. Optionally, referring to fig. 7, fig. 7 is a schematic diagram of a computation time-consumption calculation method according to an embodiment of the present application. As shown in fig. 7, the graph compiler 102 needs to consider factors such as the chip type (which determines the number of cores and clock frequency of each type of acceleration computing unit, the size of each cache level, and so on), the input data type (DType), and the input data shape (Shape), and calculates, based on a cost model (Cost Model), the computation time required for each die to execute the ith first operator after it is split by the current jth splitting axis (e.g., operator a1 shown in fig. 3a).
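As a hedged illustration (a simple roofline-style model chosen here for explanation, not the cost model actually used by the graph compiler 102; all hardware numbers are assumptions), a per-die computation-time estimate could look like this:

```python
def compute_time_cost(flops: float, mem_bytes: float, peak_flops: float,
                      mem_bandwidth: float, efficiency: float = 0.6) -> float:
    """Toy roofline-style cost model: execution time is bounded either by compute
    throughput or by memory traffic, both derated by an empirical efficiency factor."""
    compute_bound = flops / (peak_flops * efficiency)
    memory_bound = mem_bytes / (mem_bandwidth * efficiency)
    return max(compute_bound, memory_bound)

# Time for one die to run the (M/2) x K x N MatMul slice of Fig. 3a,
# with entirely assumed hardware numbers (FP16, 100 TFLOPS peak, 1.2 TB/s HBM).
M_half, K, N, dtype_bytes = 512, 512, 256, 2
flops = 2 * M_half * K * N
mem_bytes = dtype_bytes * (M_half * K + K * N + M_half * N)
print(compute_time_cost(flops, mem_bytes, peak_flops=100e12, mem_bandwidth=1.2e12))
```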
The communication time consumption is the time consumed when a die executing the split ith first operator needs to communicate with other dies to acquire the computation data stored on those dies. For example, the communication time consumption corresponding to the jth splitting axis of the ith first operator is the time required by the pth die to acquire the target data from other dies, where the pth die is one of the plurality of dies to which the ith first operator is allocated after being split by the jth splitting axis, and the target data is the data required by the pth die when executing the split ith first operator. Optionally, the communication time consumption is generally related to the amount of the target data (i.e., the communication data volume) and the memory arrangement of the target data (i.e., the memory arrangement of the communication data), and may also be related to factors such as the inter-Die interconnect topology and link bandwidth. p is an integer greater than or equal to 1 and less than or equal to N.
Referring to fig. 8, fig. 8 is a flowchart of a communication time-consumption calculation method according to an embodiment of the present application. As shown in fig. 8, the method includes the following steps S21 to S25.
Step S21, reading the communication topology and bandwidth.
Specifically, the graph compiler 102 reads the communication topology between the multi-Die within the Chip, as well as the bandwidth data of the communication link, from the environmental information base 26.
Step S22, reading the segmentation result of the pre-operator.
Specifically, the graph compiler 102 reads, from the tensor segmentation table, the splitting result of the preceding operator of the current ith first operator, and determines the optimal splitting axis of that preceding operator, i.e., the optimal splitting axis of the (i-1)th first operator. If the optimal splitting axis of the (i-1)th first operator is the same as the jth splitting axis of the current ith first operator, no cross-die communication is needed, i.e., the communication time consumption is zero. Referring to fig. 9, fig. 9 is a schematic diagram in which the splitting axes of the preceding and succeeding operators are the same, according to an embodiment of the present application. As shown in fig. 9, operator A is, for example, the (i-1)th first operator, operator B is, for example, the ith first operator, and operator A is the preceding operator of operator B. The splitting axes of operator A and operator B are both the M axis, and after operator A is split, only (M/2)×N of its output data resides on the current die. When executing the split operator B, the current die can directly use the (M/2)×N input data on the current die for its computation, without cross-die communication; otherwise, if the splitting axis of operator B were the N axis, the current die would need to acquire the other half of the M×(N/2) data from the other die, i.e., cross-die communication would be required.
Step S23, the communication data amount is calculated.
Specifically, if the optimal splitting axis of the (i-1)th first operator is different from the jth splitting axis of the current ith first operator, the graph compiler 102 needs to calculate the communication time; at this point, the graph compiler 102 may first calculate the amount of data that needs to be accessed across Dies. Optionally, when the current jth splitting axis is of an axis type that requires cross-Die data exchange (e.g., an axis type such as Reduction or SlidingWindow), the graph compiler 102 also needs to calculate, based on the type and shape of the operator's input tensors, the amount of data that needs to be exchanged with other Dies.
Step S24, calculating the internal memory arrangement of the communication data.
Specifically, in order to ensure the accuracy of the communication-time calculation, this embodiment of the present application also calculates the memory arrangement of the communication data in addition to the communication data volume. When the data that the ith first operator, split by the jth splitting axis, needs to exchange across dies occupies non-contiguous memory, multiple exchange tasks are often needed to complete the exchange, which brings additional task overhead and thus more communication time. In addition, when the memory arrangement is very scattered and the data volume is small, frequent small exchanges easily occur, which likewise consumes more communication time.
In step S25, the communication time is calculated.
Specifically, the graph compiler 102 may primarily calculate the communication time as the communication data volume divided by the inter-Die bandwidth, and may further refine this estimate by combining the communication data volume with the memory-arrangement information of the communication data. Alternatively, the graph compiler 102 may evaluate the communication time based on test values for typical packet lengths, or calculate it using a more complex cost model, which is not specifically limited in the embodiments of the present application. In addition, the communication time is affected not only by the data volume but also by factors such as the size of the transmitted packets and the control signaling during communication, which are likewise not specifically limited in the embodiments of the present application.
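The following is a minimal sketch of such an estimate (the overhead term, chunk handling, and all numbers are assumptions, not the patented calculation), combining the volume/bandwidth term with a penalty for non-contiguous memory chunks and the same-axis shortcut described in step S22:

```python
def comm_time_cost(prev_axis: str, cur_axis: str, data_bytes: float,
                   inter_die_bandwidth: float, num_chunks: int,
                   per_transfer_overhead: float = 5e-6) -> float:
    """Sketch of a communication-cost estimate: zero when the preceding operator was
    split along the same axis (the needed data is already local), otherwise
    volume / bandwidth plus a fixed per-transfer overhead for every non-contiguous
    memory chunk that must be exchanged separately."""
    if prev_axis == cur_axis:          # same splitting axis: no cross-die exchange
        return 0.0
    transfer = data_bytes / inter_die_bandwidth
    return transfer + num_chunks * per_transfer_overhead

# Example: 32 MB exchanged over an assumed 100 GB/s inter-die link, in 64 chunks.
print(comm_time_cost("M", "N", data_bytes=32e6,
                     inter_die_bandwidth=100e9, num_chunks=64))
```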
Step S15, determining whether the current splitting axis is the optimal splitting axis.
Specifically, the graph compiler 102 obtains the segmentation benefit corresponding to the current jth segmentation axis based on the computation of the computation benefit and the communication time consumption in the corresponding embodiments of fig. 7 and 8. The graph compiler 102 then compares the split benefit of the current j-th split axis with the split benefits of each of the previous j-1 split axes. If the segmentation yield of the current jth segmentation axis is greater than the segmentation yields corresponding to the previous j-1 segmentation axes, the graph compiler 102 may determine the current jth segmentation axis as the optimal segmentation axis, and execute step S16, otherwise execute step S12.
Step S16, recording the optimal split axis.
Specifically, if the splitting gain of the current jth splitting axis is greater than the splitting gain corresponding to each of the previous j-1 splitting axes, the graph compiler 102 records the current jth splitting axis as the optimal splitting axis of the current ith first operator. It should be appreciated that if the cut gain of the subsequent j+1th cut axis is greater than the cut gain corresponding to the j-th cut axis, the graph compiler 102 may update the optimal cut axis, i.e., record the j+1th cut axis as the optimal cut axis of the current i-th first operator.
Step S17, recording the operator segmentation result.
Specifically, after the graph compiler 102 has calculated the segmentation yields for all segmentation axes of the current ith first operator, the optimal segmentation axis for the current ith first operator may be determined. Then, the graph compiler 102 performs segmentation on the current ith first operator through the optimal segmentation axis to obtain a segmentation result of the ith first operator, and records the segmentation result of the ith first operator into a tensor segmentation table.
Alternatively, all the method flows in step 502 described above may be specifically performed by the operator segmentation unit 22 in the graph compiler 102.
Step S503, the first computation graph is segmented based on the segmentation results of the M first operators to obtain N second computation graphs.
Specifically, the graph compiler 102 segments the first computation graph based on the segmentation results of the M first operators, to obtain N second computation graphs. The N second computation graphs are in one-to-one correspondence with the N bare chips in the chip. N is an integer greater than or equal to 1.
Referring to fig. 10, fig. 10 is a flowchart of a method for graph cut according to an embodiment of the present application. As shown in fig. 10, the method may include the following steps S31 to S35.
Step S31, creating a sub DAG (sub graph).
Specifically, the graph compiler 102 constructs a corresponding number of sub-DAGs according to the number of dies within the Chip. It should be appreciated that each sub-graph is initially a blank graph. Illustratively, the graph compiler 102 creates N subgraphs corresponding to N dies.
Step S32, traversing the first computation graph.
Specifically, the graph compiler 102 traverses each first operator in the first computation graph, and obtains a segmentation result corresponding to each first operator.
Step S33, reading the operator segmentation result.
Specifically, the graph compiler 102 reads the segmentation result of the current first operator. For example, the graph compiler 102 reads the segmentation result of the ith first operator. The plurality of operators obtained by segmenting the ith first operator are allocated to a plurality of the N dies (including, for example, the ith die).
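For illustration, one entry of the tensor segmentation table read here might hold the fields described for the segmentation result (see also claim 6); the field names below are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class OperatorSegmentationResult:
    """Illustrative record of the segmentation result of the ith first operator."""
    op_name: str                                   # operator that was segmented
    optimal_axis: str                              # optimal segmentation axis, e.g. "M"
    axis_is_reduction: bool                        # True if the optimal axis is a reduction axis
    original_shapes: List[Tuple[int, ...]]         # original shapes of the input/output tensors
    segmented_shapes: List[Tuple[int, ...]]        # shapes after segmentation along the optimal axis
    needs_slice: bool = False                      # True if a Slice operator must carve the input
    assigned_dies: List[int] = field(default_factory=list)  # dies that execute the segmented pieces
```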
Step S34, determining whether a communication operator needs to be inserted.
Specifically, the graph compiler 102 determines, based on the segmentation result of the current ith first operator, whether a communication operator needs to be inserted before the segmented ith first operator; if not, step S35 is executed directly, and if so, step S36 is executed. As described above, if the segmentation axis changes between two adjacent operators, part of the data required by the subsequent operator resides on another die, and a communication operator needs to be inserted to fetch the data required on this die from the other die.
Optionally, referring also to the embodiments corresponding to fig. 3a and fig. 3b, the original tensor is generally the complete data, whereas a segmented operator often needs only part of that data for its computation; a slice (Slice) operator therefore needs to be inserted to slice, out of the original tensor, the data required by the segmented operator in the sub-graph.
In addition to the communication operator and the slice operator, if the current ith first operator is segmented along a reduction axis, that is, when the ith first operator performs a reduction computation (such as ReduceSum/ReduceMax, etc.), the partial results on the multiple dies need to be reduced, so an AllReduce operator needs to be inserted.
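The decision made in steps S34 and S36, together with the optional slice and AllReduce insertions, can be summarised by the following sketch. The attribute names on the segmentation-result object are assumptions for illustration, not the patented implementation.

```python
def operators_to_insert(prev_result, cur_result):
    """Decide which extra operators must accompany the segmented ith first operator (illustrative)."""
    inserts = []
    # Segmentation axis changed between adjacent operators: part of the data needed by the
    # current operator lives on another die, so a communication operator is required.
    if prev_result is not None and prev_result.optimal_axis != cur_result.optimal_axis:
        inserts.append("communication")
    # The original tensor is complete data but the segmented operator consumes only part of
    # it, so a Slice operator carves out the portion needed on this die.
    if cur_result.needs_slice:
        inserts.append("slice")
    # Segmenting along a reduction axis (e.g. ReduceSum/ReduceMax) leaves partial results on
    # every die, so an AllReduce operator combines them afterwards.
    if cur_result.axis_is_reduction:
        inserts.append("allreduce")
    return inserts
```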
Step S35, constructing the segmented operators.
Specifically, the graph compiler 102 constructs a plurality of segmented ith first operators according to the segmentation result of the current ith first operator. For example, if the ith first operator is operator a in fig. 3a, the plurality of segmented ith first operators may be operator a1 and operator a2 shown in fig. 3a. After an operator in the original DAG (i.e., a first operator in the first computation graph) is segmented along its optimal segmentation axis, the shapes of its input/output tensors change, so new segmented operators need to be constructed. During construction, each attribute of the original operator can be copied, but the shapes of the input/output tensors must be modified to the shapes after segmentation.
Illustratively, still taking fig. 3a as an example, when constructing operator a1 and operator a2, each attribute of operator a (e.g., the matrix-multiplication computation type) may be copied, the shape of the left input matrix is modified to (M/2) x K, the shape of the output matrix is modified to (M/2) x N, and the shape of the right input matrix remains unchanged.
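A minimal sketch of this construction for the matrix-multiplication example, assuming operators are represented as plain dictionaries (the dictionary layout and field names are assumptions for illustration):

```python
def build_segmented_matmul(op_a, num_dies=2):
    """Segment matmul operator a, (M x K) x (K x N), along the M axis (illustrative)."""
    m, k = op_a["left_shape"]
    _, n = op_a["right_shape"]
    pieces = []
    for d in range(num_dies):
        piece = dict(op_a)                        # copy every attribute of operator a ...
        piece["name"] = f'{op_a["name"]}{d + 1}'  # ... renamed to a1, a2, ...
        piece["left_shape"] = (m // num_dies, k)  # left input becomes (M/2) x K
        piece["right_shape"] = (k, n)             # right input is unchanged
        piece["out_shape"] = (m // num_dies, n)   # output becomes (M/2) x N
        pieces.append(piece)
    return pieces

# Example: a 1024 x 512 by 512 x 256 matmul is segmented into two 512 x 512 by 512 x 256 matmuls.
op_a = {"name": "a", "type": "matmul", "left_shape": (1024, 512), "right_shape": (512, 256)}
a1, a2 = build_segmented_matmul(op_a)
```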
Step S36, constructing a communication operator.
Specifically, as described in step S34 above, if the segmentation axis changes between two adjacent operators, part of the data required by the subsequent operator resides on another die. Therefore, if the optimal segmentation axis of the (i-1)th first operator differs from that of the ith first operator, the graph compiler 102 may construct a corresponding communication operator before the segmented ith first operator.
Step S37, adding the constructed operators to the sub-DAGs.
Specifically, the graph compiler 102 adds the segmented operators constructed in step S35 (for example, the plurality of second operators obtained by segmenting the ith first operator along the optimal segmentation axis) and the inserted operators constructed in step S36 (for example, the communication operator, the slice operator, etc.) to the corresponding sub-DAGs. Steps S33 to S37 are then repeated until every first operator in the first computation graph has been traversed, thereby obtaining N second computation graphs in one-to-one correspondence with the N dies. Each second computation graph includes a plurality of second operators, namely the segmented first operators together with any inserted slice operators, communication operators, reduction operators, and the like.
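Putting steps S31 to S37 together, the construction of the per-die sub-DAGs could look roughly like the sketch below; SubDAG, build_segmented_ops, and build_insert_ops stand in for the construction steps described above and are not names from the original disclosure.

```python
def split_first_graph(first_graph, tensor_split_table, num_dies):
    """Turn the first computation graph into one second computation graph per die (illustrative)."""
    sub_dags = [SubDAG(die_id=d) for d in range(num_dies)]       # step S31: one empty sub-graph per die
    prev_result = None
    for op in first_graph.topological_order():                   # step S32: traverse the first graph
        result = tensor_split_table[op.name]                     # step S33: read the segmentation result
        segmented_ops = build_segmented_ops(op, result)          # step S35: construct segmented operators
        for die_id, seg_op in zip(result.assigned_dies, segmented_ops):
            for extra in build_insert_ops(prev_result, result):  # steps S34/S36: communication/slice/allreduce
                sub_dags[die_id].add(extra)
            sub_dags[die_id].add(seg_op)                         # step S37: add operators to the sub-DAG
        prev_result = result
    return sub_dags                                              # N second computation graphs, one per die
```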
Optionally, all of the method flow in step S503 described above may be performed by the graph cut unit 23 in the graph compiler 102.
Step S504, distributing the N second computation graphs to the N dies for execution.
Specifically, the graph cut unit 23 in the graph compiler 102 outputs a sub-graph list including the N second computation graphs to the model compiling unit 24, where each second computation graph is to be executed on one die. The model compiling unit 24 then outputs a corresponding model list to the model deployment unit 25 based on the sub-graph list. Finally, the model deployment unit 25 deploys the models corresponding to the N second computation graphs onto the corresponding dies for execution.
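A condensed view of this hand-off, under the same illustrative assumptions as the sketches above (the compile and deploy interfaces are placeholders, not the actual units' APIs):

```python
def compile_and_deploy(sub_dags, model_compiling_unit, model_deployment_unit):
    """Compile each second computation graph into a model and deploy it to its die (illustrative)."""
    models = [model_compiling_unit.compile(dag) for dag in sub_dags]   # sub-graph list -> model list
    for dag, model in zip(sub_dags, models):
        model_deployment_unit.deploy(model, die_id=dag.die_id)         # one model per die, run in parallel
```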
Optionally, each of the method flows in the multi-die based computing method described in the embodiments of the present application may be implemented based on software, hardware, or a combination thereof. The hardware implementation may include logic circuits, algorithm circuits, analog circuits, or the like. The software implementation may include program instructions, which may be regarded as a software product stored in a memory and executable by a processor to perform the related functions.
In summary, the embodiments of the present application improve the graph compiler and optimize the compilation and deployment of DAG graphs in a multi-Die chip scenario. According to the embodiments of the present application, each complete computation in the DAG graph can be segmented into a plurality of smaller sub-computations based on operator segmentation; the original DAG graph is split into a plurality of sub-DAG graphs, the sub-DAG graphs are compiled into a plurality of models, and the models are finally deployed onto a plurality of Dies in a Chip with a NUMA architecture. In addition, the embodiments of the present application select the optimal segmentation scheme for each operator in the DAG graph by comparing the segmentation benefits of different segmentation schemes (i.e., different segmentation axes), and segment each operator into a plurality of operators according to its optimal segmentation scheme; these operators are computed on a plurality of Dies simultaneously, so as to fully utilize the computing resources of the plurality of Dies in the Chip and improve computing efficiency.
As described above, the schemes provided by the embodiments of the present application may bring about the following advantageous effects.
(1) The method and the device enable a user to use a multi-Die Chip as a single chip, without needing to care about the multi-Die topology inside the Chip, which simplifies user development.
(2) Through the computation of the segmentation benefit and the selection of the optimal segmentation axis, the method and the device ensure that the computation on each NUMA node accesses only the storage of that node. In other words, the computation performed by each Die only needs to access the storage on that Die, so operator implementations do not need to handle cross-Die memory access, which simplifies operator development. Moreover, because the computation performed by each Die only needs to access the memory on that Die, the requirements on inter-Die bandwidth and topology are reduced, and high computing performance can be achieved even with low inter-Die bandwidth. More of the chip area can thus be devoted to computation, improving the computational density of the chip.
Based on the description of the foregoing method embodiments, an embodiment of the present application further provides an electronic device. Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic device 110 includes at least a processor 1101, an input device 1102, an output device 1103, and a memory 1104, and may include other general components, which are not described in detail here. The processor 1101, the input device 1102, the output device 1103, and the memory 1104 in the electronic device may be connected by a bus or in other ways. The electronic device 110 may be a smart wearable device, a smart phone, a tablet computer, a notebook computer, a desktop computer, an in-vehicle computer, or a server, or may be a server cluster or a cloud computing service center formed by a plurality of servers.
The memory 1104 in the electronic device 110 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1104 may be independent and connected to the processor 1101 via a bus, or may be integrated with the processor 1101.
The memory 1104 of the electronic device 110 may store a computer-readable storage medium, the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the processor 1101 is configured to execute the program instructions stored in the computer-readable storage medium. The processor 1101 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device 110 and is adapted to implement one or more instructions, in particular to load and execute one or more instructions so as to implement the corresponding method flow or function. In one embodiment, the processor 1101 described in the embodiments of the present application may be used to perform a series of processes of the multi-die based computing method, including: acquiring a first computation graph, wherein the first computation graph includes M first operators; segmenting the M first operators respectively to obtain segmentation results of the M first operators; segmenting the first computation graph based on the segmentation results of the M first operators to obtain N corresponding second computation graphs, wherein each of the N second computation graphs includes a segmented first operator, and N and M are integers greater than or equal to 1; and distributing the N second computation graphs to the N dies for execution, the N second computation graphs being in one-to-one correspondence with the N dies; and so on. For details, reference may be made to the above descriptions of the embodiments corresponding to fig. 1 to fig. 10, which are not repeated here.
The embodiment of the present application also provides a computer readable storage medium, where the computer readable storage medium may store a program, where the program when executed by a processor causes the processor to perform some or all of the steps described in any one of the above method embodiments.
The embodiments of the present application also provide a computer program, which includes instructions that, when executed by a multi-core processor, cause the processor to perform part or all of the steps of any one of the method embodiments described above.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the order of the actions described, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc., in particular may be a processor in the computer device) to perform all or part of the steps of the above-described method of the various embodiments of the present application. Wherein the aforementioned storage medium may comprise: a U-disk, a removable hard disk, a magnetic disk, a compact disk, a read-only memory (ROM), a Double Data Rate (DDR), a flash memory (flash), or a random access memory (random access memory, RAM) or the like.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A multi-die based computing method, the method comprising:
acquiring a first computation graph, wherein the first computation graph comprises M first operators;
respectively segmenting the M first operators to obtain segmentation results of the M first operators;
segmenting the first computation graph based on the segmentation results of the M first operators to obtain N corresponding second computation graphs; each of the N second computation graphs comprises a segmented first operator; N and M are integers greater than or equal to 1; and
distributing the N second computation graphs to N dies for execution; the N second computation graphs are in one-to-one correspondence with the N dies.
2. The method according to claim 1, wherein the respectively segmenting the M first operators to obtain segmentation results of the M first operators comprises:
determining an optimal segmentation axis of an ith first operator in the M first operators;
based on the optimal segmentation axis of the ith first operator, segmenting the ith first operator to obtain a segmentation result of the ith first operator; i is an integer greater than or equal to 1 and less than or equal to M.
3. The method of claim 2, wherein the ith first operator comprises K segmentation axes, and the determining the optimal segmentation axis of the ith first operator in the M first operators comprises:
determining a computational gain and a communication time corresponding to each of the K segmentation axes of the ith first operator;
determining a segmentation benefit corresponding to each of the K segmentation axes based on a difference between the computational gain and the communication time corresponding to that segmentation axis; the segmentation axis with the largest segmentation benefit is the optimal segmentation axis of the ith first operator; K is an integer greater than or equal to 1.
4. The method of claim 3, wherein the computational gain corresponding to the jth segmentation axis of the ith first operator is a difference between a first computation time and a second computation time; the first computation time is the time required by a single die to execute the ith first operator, and the second computation time is the time required by a plurality of dies to execute, in parallel, the ith first operator segmented along the jth segmentation axis; j is an integer greater than or equal to 1 and less than or equal to K.
5. The method of claim 4, wherein the communication time corresponding to the jth segmentation axis of the ith first operator is the time required by a pth die to acquire target data from another die; the pth die is one of a plurality of dies correspondingly allocated after the ith first operator is segmented along the jth segmentation axis; the target data is the data required by the pth die when executing the ith first operator segmented along the jth segmentation axis; the communication time is related to the amount of the target data and the memory arrangement of the target data; p is an integer greater than or equal to 1 and less than or equal to N.
6. The method of claim 5, wherein the segmentation result of the ith first operator comprises: an output tensor list of the ith first operator; original shapes of one or more input tensors and an output tensor corresponding to the ith first operator; shapes of the one or more input tensors and/or the output tensor of the ith first operator after segmentation along the optimal segmentation axis; and one or more dies correspondingly allocated after the ith first operator is segmented along the optimal segmentation axis.
7. The method of claim 6, wherein a pth second computation graph of the N second computation graphs comprises a plurality of second operators; the plurality of second operators comprise the ith first operator segmented along the optimal segmentation axis; and the pth second computation graph is the second computation graph allocated to be executed on the pth die.
8. The method of claim 7, wherein the plurality of second operators in the pth second computation graph further comprise one or more of a slice operator, a communication operator, and a reduction operator; wherein
the slice operator is used for acquiring an input tensor of the ith first operator segmented along the optimal segmentation axis;
the communication operator is used for acquiring, from other dies, an input tensor of the ith first operator segmented along the optimal segmentation axis when the optimal segmentation axis of the ith first operator differs from the optimal segmentation axis of the (i-1)th first operator; and
the reduction operator is used for reducing the data on the corresponding plurality of dies when the optimal segmentation axis of the ith first operator is a reduction axis.
9. An electronic device, comprising N dies configured to implement the method of any one of claims 1 to 8; N is an integer greater than or equal to 1.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a computer or a processor, implements the method of any one of claims 1 to 8.