WO2024098793A1 - Computational graph processing method, apparatus, device, and storage medium - Google Patents

Computational graph processing method, apparatus, device, and storage medium

Info

Publication number
WO2024098793A1
Authority
WO
WIPO (PCT)
Prior art keywords: computing, computation, graph, target, node
Prior art date
Application number
PCT/CN2023/103982
Other languages
English (en)
French (fr)
Inventor
孙楚旻
王天祺
周李
孙杰
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024098793A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G06N 3/10 - Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/01 - Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • the present application relates to the field of computer technology, and in particular to a method, device, equipment and storage medium for processing a computational graph.
  • a computational graph is a general method for representing computational processes. It is a directed acyclic graph used to describe functions and is widely used on various data processing platforms.
  • a computational graph includes multiple nodes and directed edges. Taking the field of machine learning as an example, computational graphs are used to represent the computational logic involved in neural networks.
  • Each node in the computational graph represents a computational task of a neural network (e.g., the add node represents an addition computational task).
  • Directed edges connect the previous node (which can be called the previous node or parent node) to the next node (which can be called the next node or child node), indicating that the output of the parent node is used as the input of the child node.
  • in the related art, the computational graph is usually processed as follows: the code file of the neural network is compiled to obtain the computational graph of the neural network, the computing tasks indicated by the ordered nodes are loaded onto the hardware one by one according to the topological sorting result of the nodes in the computational graph, and each computing task is executed by the hardware resources on that hardware that specifically execute computing tasks.
  • as the computational scale and network complexity of neural networks continue to increase, the computing power resources consumed to process the computational graph also grow rapidly. Therefore, there is an urgent need for a computational graph processing method that can effectively save computing power resources and improve resource utilization.
  • the embodiment of the present application provides a method, device, equipment and storage medium for processing a computational graph, which can effectively save computing resources and improve resource utilization.
  • the technical solution is as follows:
  • a method for processing a computational graph comprising:
  • the computation graph of the target program is segmented to obtain multiple sub-computation graphs of the target program, the computation graph including multiple computing nodes and directed edges, the computing nodes indicating the computing tasks of the target program, and the directed edges indicating the data flow between the computing tasks indicated by the computing nodes;
  • based on communication reference information between the multiple hardware resources and the computing tasks of the multiple sub-computation graphs, the computing power deployment result of the computation graph is obtained, wherein the communication reference information indicates the communication resources consumed by data transmission between the hardware resources, and the computing power deployment result indicates the computing tasks of the multiple sub-computation graphs executed by the multiple hardware resources.
  • in the above method, the computation graph of the target program is divided into multiple sub-computation graphs based on the number of the multiple hardware resources in the target hardware, so that, according to the communication reference information between the hardware resources, the computing tasks of the multiple sub-computation graphs are respectively deployed to the multiple hardware resources for execution, and the computing power deployment result of the computation graph is obtained.
  • since the communication reference information involved in the computing power deployment process indicates the communication resources consumed by data transmission between hardware resources, the final computing power deployment result can effectively save computing power resources and improve resource utilization.
  • the computation graph of the target program is segmented to obtain multiple sub-computation graphs of the target program, including:
  • based on the number of the multiple hardware resources, the node weights of the computation graph, and the directed edge weights of the computation graph, the computation graph is segmented to obtain multiple sub-computation graphs, so that the number of the multiple sub-computation graphs after segmentation is equal to the number of the multiple hardware resources, and the importance of the computing nodes and directed edges in the sub-computation graphs meets the target conditions.
  • the method further comprises:
  • a first topological sorting is performed on the multiple computing nodes with the parent computing node in the computation graph as the starting point to obtain a first sorting result indicating the first level to which each computing node belongs; a second topological sorting is performed on the multiple computing nodes with the child computing node in the computation graph as the starting point to obtain a second sorting result indicating the second level to which each computing node belongs; and based on the first sorting result and the second sorting result, the node weights of the computation graph and the directed edge weights of the computation graph are determined.
  • determining the node weights of the computation graph and the directed edge weights of the computation graph based on the first sorting result and the second sorting result includes:
  • based on the difference between the first level and the second level to which the target computing node belongs, the node slack of the target computing node is determined, and based on the node slack, the data processing amount of the computing task indicated by the target computing node, and the hardware performance reference value, the node weight of the target computing node is determined; the node slack indicates the importance of the target computing node in the computation graph, and the target computing node is any computing node;
  • based on the difference between the first level to which the starting computing node connected to the target directed edge belongs and the second level to which the terminating computing node belongs, the directed edge slack of the target directed edge is determined, and based on the directed edge slack and the data transmission amount indicated by the target directed edge, the directed edge weight of the target directed edge is determined.
  • the directed edge slack indicates the importance of the target directed edge in the computation graph.
  • the target directed edge is any directed edge.
  • in this way, the multiple computing nodes in the computation graph are topologically sorted twice to obtain the node slack of each computing node and the directed edge slack of each directed edge, and thus the node weight of each computing node and the directed edge weight of each directed edge. This ensures that, when the computation graph is split, the data processing volume of each sub-computation graph is balanced and the data transmission volume between different sub-computation graphs is reduced, providing technical support for splitting the computation graph.
  • obtaining the computing power deployment result of the computation graph includes:
  • based on the multiple hardware resources and the multiple sub-computation graphs, an intermediate computing power deployment result of the computation graph is obtained; based on the communication reference information between the multiple hardware resources, the data transmission volume between the multiple sub-computation graphs, and the intermediate computing power deployment result, the communication cost of the intermediate computing power deployment result is obtained; and based on the communication cost, the intermediate computing power deployment result is updated to obtain the computing power deployment result.
  • this application defines a communication cost, which can effectively save computing resources and improve resource utilization by minimizing the communication cost of the computing power deployment result.
  • the method further comprises:
  • based on the bandwidth information, latency information, route information and data transfer information between the multiple hardware resources, the communication reference information between the multiple hardware resources is obtained.
  • the method further comprises:
  • the code file of the target program is compiled to obtain a data file and a task file of the target program, and the computation graph is generated based on the data file and the task file.
  • the method further comprises:
  • the computation graph further includes a plurality of transport nodes
  • the plurality of transport nodes in the computation graph are deleted based on the task file, and the transport nodes indicate data transport tasks of the target program.
  • the code files of the target program are compiled and converted into modelable data files and task files, thereby generating a computational graph based on the computing task, providing technical support for the subsequent process of segmenting the computational graph to achieve computing power deployment.
  • the method further comprises:
  • a simulation scheduling tool is called to perform simulation scheduling on the computing power deployment result to obtain a simulation scheduling result, wherein the simulation scheduling result includes the simulation scheduling time and the resource utilization of the multiple hardware resources executing the computing tasks of the multiple sub-computation graphs;
  • the simulation scheduling tool is called to simulate the computing power deployment results in order to quickly evaluate the performance of the computing power deployment results.
  • the computing power deployment results can be further adjusted to further improve resource utilization.
  • an embodiment of the present application provides a device for processing a computational graph, the device comprising at least one functional module for executing a method for processing a computational graph provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a computing device, which includes a processor and a memory; the memory is used to store at least one piece of program code, and the at least one piece of program code is loaded by the processor and executes a method for processing a computational graph as provided in the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, which is used to store at least one program code, and the at least one program code is used to implement the processing method of the calculation graph provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • the storage medium includes but is not limited to a volatile memory, such as a random access memory, a non-volatile memory, such as a flash memory, a hard disk drive (HDD), and a solid state drive (SSD).
  • an embodiment of the present application provides a computer program product, which, when executed on a computing device, enables the computing device to implement the method for processing a computational graph provided in the aforementioned first aspect or any possible implementation of the first aspect.
  • the computer program product may be a software installation package, and when the aforementioned method for processing a computational graph needs to be implemented, the computer program product may be downloaded and executed on a computing device.
  • FIG1 is a schematic diagram of an implementation environment of a method for processing a computational graph provided in an embodiment of the present application
  • FIG2 is a schematic diagram of the hardware structure of a computing device provided in an embodiment of the present application.
  • FIG3 is a flowchart of a method for processing a computational graph provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a compilation process of a code file provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a calculation graph provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of generating a calculation graph provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of a topological sorting provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of node relaxation and directed edge relaxation provided by an embodiment of the present application.
  • FIG9 is a schematic diagram of obtaining a calculation graph weight provided in an embodiment of the present application.
  • FIG10 is a schematic diagram of a hardware resource provided in an embodiment of the present application.
  • FIG11 is a schematic diagram of a hardware resource modeling process provided in an embodiment of the present application.
  • FIG12 is a schematic diagram of segmenting a computation graph provided in an embodiment of the present application.
  • FIG13 is a schematic diagram of a deployment process of a computing graph provided in an embodiment of the present application.
  • FIG14 is a schematic diagram of a simulation scheduling process provided in an embodiment of the present application.
  • FIG15 is a schematic diagram of a method for processing a computational graph provided in an embodiment of the present application.
  • FIG16 is a schematic diagram of a method for processing a computational graph provided in an embodiment of the present application.
  • FIG17 is a schematic diagram of a computational graph of a neural network provided in an embodiment of the present application.
  • FIG18 is a schematic diagram of a manually deployed computation graph provided in an embodiment of the present application.
  • FIG19 is a schematic diagram comparing a manual deployment provided in an embodiment of the present application and the solution of the present application;
  • Figure 20 is a structural diagram of a computational graph processing device provided in an embodiment of the present application.
  • it should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the code files of the target programs involved are obtained with full authorization.
  • a computational graph is a general method for representing a computational process. It is a directed acyclic graph used to describe a function. It is widely used on various data processing platforms.
  • a computational graph includes multiple nodes and directed edges. Taking the field of machine learning as an example, a computational graph is used to represent the computational logic involved in a neural network, where each node in the computational graph represents a computational task of the neural network (e.g., an add node represents a computational task of an addition operation), and a directed edge connects the previous node (which may be called the previous node or parent node) to the next node (which may be called the next node or child node), indicating that the output of the parent node is used as the input of the child node.
  • the node indicating the computational task in the computational graph is called a computational node.
  • the nodes in the computational graph can also indicate data transfer tasks (e.g., transferring data from the chip where the computational task 1 is located to the shared memory, etc.), and these nodes are called transfer nodes.
  • AI model is a type of mathematical algorithm model that uses machine learning ideas to solve practical problems.
  • the AI model includes a large number of parameters and calculation formulas (or calculation rules).
  • Neural network is a type of mathematical algorithm AI model that imitates the structure and function of biological neural network (central nervous system of animals).
  • a neural network can include multiple neural network layers with different functions, each layer includes parameters and calculation formulas.
  • Topological sorting is a linear sequence of all nodes in a directed acyclic graph (DAG).
  • A heuristic algorithm is proposed in contrast to an exact optimization algorithm.
  • the optimal algorithm for a problem finds the optimal solution for each instance of the problem.
  • Heuristic algorithm can be defined as follows: an algorithm based on intuition or experience, which gives a feasible solution for each instance of the combinatorial optimization problem to be solved at an acceptable cost (computational time and space), and the degree of deviation between the feasible solution and the optimal solution cannot generally be predicted.
  • heuristic algorithms include ant colony algorithm, simulated annealing mapping (SAM), neural network, etc.
  • High performance computing is a technology that uses the power of supercomputers or computer clusters to solve complex problems that require large amounts of computing, such as some data-intensive computing tasks, including simulation, modeling, and rendering.
  • High performance LINPACK (HPL) is a benchmark for testing the floating point performance of high performance computer systems.
  • LINPACK is the abbreviation of linear system package.
  • the LINPACK benchmark tests and evaluates the floating point performance of high performance computer systems by solving dense linear algebraic equations.
  • the test standard includes tests at three different data scales: 100×100, 1000×1000 and n×n.
  • the benchmark programs used in the first two tests can be downloaded from the relevant website. After being compiled and run, this program can provide the performance of the corresponding machine, and this test does not allow any modification of the test program.
  • the test requirement of n ⁇ n data scale is the most relaxed in the LINPACK test standard.
  • HPL-AI refers to the high performance LINPACK for accelerator introspection benchmark.
  • The message passing interface (MPI) is a standardized and portable message passing system (function library).
  • the standard defines the syntax and semantics of library functions used when writing portable programs with message passing.
  • MPI is a software tool for providing communication between branches of parallel applications.
  • Open multi-processing is a multi-threaded programming scheme for shared memory parallel systems, supporting programming languages including C, C++, and Fortran.
  • a high-level programming language is a machine-independent, process-oriented or object-oriented language.
  • a high-level language is a language designed with reference to mathematical languages and similar to daily conversations.
  • high-level languages include BASIC, JAVA, C, C++, Python, etc., without limitation.
  • the embodiment of the present application provides a method for processing a computational graph, which can be applied to computing power deployment scenarios for computational graphs, such as AI neural network training scenarios and HPL scenarios.
  • the code file of the neural network is compiled to generate a computational graph of the neural network, and the computational tasks indicated by the computational graph are deployed to the corresponding hardware resources for execution, thereby realizing the training process of the neural network.
  • FIG1 is a schematic diagram of an implementation environment of a method for processing a computational graph provided in an embodiment of the present application.
  • the implementation environment includes a terminal 101 and a server 102, and the terminal 101 is directly or indirectly connected to the server 102 via a wireless network or a wired network.
  • the terminal 101 may be at least one of a smart phone, a desktop computer, an augmented reality terminal, a tablet computer, an e-book reader, and a laptop computer.
  • the terminal 101 has installed and runs applications, such as client applications, browser applications, etc., which are not limited.
  • an application that supports training neural networks is running on the terminal 101.
  • the user can use this application to input the code file of the defined neural network, triggering the terminal 101 to send a training request for the neural network to the server 102, so that the server 102 compiles the code file of the neural network, generates a calculation graph, and deploys the calculation tasks indicated by the calculation graph to the corresponding hardware resources for execution, thereby realizing the training process of the neural network.
  • Server 102 is an independent physical server, or a server cluster or distributed file system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
  • Server 102 is used to provide background services for applications running on terminal 101.
  • Server 102 can also be understood as having a computing power deployment function of a computational graph, which can generate a corresponding computational graph for a target program (such as a neural network, HPL test program, etc.), and deploy the computing tasks indicated by the computational graph to the corresponding hardware resources for execution.
  • Terminal 101 may generally refer to one of multiple terminals, or a collection of multiple terminals; server 102 may be a separate computing device, a computing device cluster, a virtual machine or a container engine, etc.
  • the embodiment of the present application does not limit the number and device type of each device in the implementation environment.
  • the wireless network or wired network described above uses standard communication technologies and/or protocols.
  • the network includes, but is not limited to, a data center network, a storage area network (SAN), a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired or wireless network, a dedicated network or any combination of a virtual private network.
  • technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML), etc. are used to represent data exchanged through the network.
  • conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private network (VPN), and Internet protocol security (IPsec) can also be used to encrypt all or part of the links.
  • customized and/or dedicated data communication technologies can be used to replace or supplement the above-mentioned data communication technologies.
  • the hardware structure of the server 102 is introduced below.
  • the embodiment of the present application provides a computing device that can be configured as the server 102 in the above implementation environment.
  • Figure 2 is a schematic diagram of the hardware structure of a computing device provided by an embodiment of the present application.
  • the computing device 200 includes a memory 201, a processor 202, a communication interface 203 and a bus 204.
  • the memory 201, the processor 202, and the communication interface 203 are connected to each other through the bus 204.
  • the memory 201 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 201 is used to store at least one section of program code. When the program code stored in the memory 201 is executed by the processor 202, the processor 202 and the communication interface 203 are used to execute the processing method of the calculation graph shown in the following embodiment.
  • the processor 202 may be an intelligent chip (also referred to as an AI chip), that is, the hardware resources involved in the embodiments of the present application, such as a network processor (network processor unit, NPU), a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a field programmable gate array (field programmable gate array, FPGA), an application-specific integrated circuit (application-specific integrated circuit, ASIC) or an integrated circuit used to control the execution of the program of the present application.
  • the processor 202 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. The number of the processors 202 may be one or more, and there is no limitation on this.
  • the communication interface 203 uses a transceiver module such as a transceiver to implement communication between the computing device 200 and other devices or communication networks. For example, data can be obtained through the communication interface 203.
  • the memory 201 and the processor 202 may be separately provided or integrated together.
  • Bus 204 may include a path for transmitting information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).
  • the computational graph processing method provided in the present application may also be implemented through a plurality of computing devices distributedly deployed in different environments, which is not limited in the embodiments of the present application.
  • Fig. 3 is a flow chart of a method for processing a computational graph provided in an embodiment of the present application. As shown in Fig. 3, the method for processing a computational graph is applied to a server in the above implementation environment, and includes the following steps 301 to 307.
  • the server generates a computation graph of the target program based on the code file of the target program.
  • the target program refers to any program, such as a neural network, an HPL test program, etc., which is not limited.
  • the code file of the target program refers to a code file written in a high-level language, and the embodiment of the present application does not limit the specific form of the high-level language.
  • the calculation graph of the target program includes a plurality of computing nodes and directed edges, and the computing node indicates the computing task of the target program, and the directed edge indicates the data flow between the computing tasks indicated by the computing node, which can also be understood as the dependency between the computing tasks.
  • Step A1 compile the code file of the target program to obtain the data file and task file of the target program.
  • the server obtains the code file of the target program, compiles the code file, and converts it into a modelable data file and task file.
  • the data file includes the data features of the target program.
  • the data features of the data include: data name (data name), data identifier (data ID), data processing volume (data KByteSize) (also called data size), consumer init, and data region (data region) (such as a chip storing data, also called hardware resources), etc., without limitation.
  • the task file includes the task features of the target program.
  • the task features of the task include: task name, task ID, task type (such as computing task, data handling task, etc.), task region (such as the chip specified by the computing task), the predecessor task ID list of the task (predecessor taskIDlist), the successor task ID list of the task (successor taskIDlist), the input data ID list of the task (input dataID) (the length is the same as the length of the predecessor task ID list) and the output data ID list of the task (output dataID), etc., without limitation.
  • the data features and task features of the target program can be stored in different files or in the same file, and the embodiments of the present application do not limit this.
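  • For illustration only, the following Python snippet sketches what entries of a modelable data file and task file might look like, using the data features and task features listed above; all concrete names and values are hypothetical and are not taken from the embodiments.

        # Hypothetical entries of a modelable data file and task file.
        # Field names follow the features listed above; the values are invented.
        data_file = [
            {"data_name": "A_tile_0", "data_id": 0, "data_KByteSize": 4096,
             "consumer_init": [], "data_region": "chip_0"},
        ]
        task_file = [
            {"task_name": "matmul_0", "task_id": 7, "task_type": "compute",
             "task_region": "chip_0",
             "predecessor_taskIDlist": [3, 5], "successor_taskIDlist": [9],
             "input_dataID": [0, 1], "output_dataID": [2]},
        ]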
  • Figure 4 is a schematic diagram of a compilation process of a code file provided in an embodiment of the present application.
  • for example, the code file of the target program (matrix decomposition, LU-decomposition), written in the high-level language Python, is shown in Figure 4 (a);
  • the data file and the task file are obtained by compiling the code file, wherein the data file is shown in Figure 4 (b), and the task file is shown in Figure 4 (c).
  • Figure 4 is only a schematic illustration, and the specific contents of the data file and the task file can be added or deleted according to actual needs, and there is no limitation on this.
  • Step A2 Generate a computation graph of the target program based on the data file and the task file.
  • based on the data file and the task file, the server generates a directed acyclic graph organized around the computing tasks, that is, the computation graph of the target program. It should be understood that the computing tasks in the computation graph will be deployed to the target hardware for execution and will consume communication resources and computing resources; therefore, generating a directed acyclic graph of computing tasks facilitates the subsequent segmentation of the computation graph and improves segmentation efficiency.
  • the server uses a recursive algorithm to generate a computational graph.
  • the server obtains the target computational task and the predecessor dependent task ID list of the target computational task. If the predecessor dependent task ID list contains the task ID of a transport task, the predecessor dependent task ID list of that transport task is traversed in turn until a set of computational tasks is located, and each computational task in the set is used as a predecessor of the target computational task. All computational tasks are traversed in this way to finally generate a directed acyclic graph based on the computational tasks.
  • the server can generate a complete computational graph of the target program based on the data file and the task file, that is, the computational graph includes multiple computing nodes, multiple transport nodes and directed edges. In this case, the server deletes multiple transport nodes in the computational graph based on the task file to obtain a directed acyclic graph based on the computing task.
  • the embodiments of the present application are not limited to this.
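  • For illustration only, the following Python sketch follows the recursive idea described above for generating a compute-only directed acyclic graph: transport-task predecessors are traversed until computing tasks are reached, and those computing tasks become the predecessors of the target computing task. The task-file structure reuses the hypothetical fields shown earlier and is an assumption, not the patented implementation.

        # Minimal sketch: build a compute-only DAG from a task file by resolving
        # transport-task predecessors back to computing tasks.
        def build_compute_dag(task_file):
            tasks = {t["task_id"]: t for t in task_file}

            def compute_predecessors(task_id, seen=None):
                seen = set() if seen is None else seen
                preds = set()
                for pid in tasks[task_id]["predecessor_taskIDlist"]:
                    if pid in seen:
                        continue
                    seen.add(pid)
                    if tasks[pid]["task_type"] == "compute":
                        preds.add(pid)                 # direct computing predecessor
                    else:                              # transport task: keep walking back
                        preds |= compute_predecessors(pid, seen)
                return preds

            nodes = [tid for tid, t in tasks.items() if t["task_type"] == "compute"]
            edges = set()
            for tid in nodes:
                for pid in compute_predecessors(tid):
                    edges.add((pid, tid))              # directed edge: predecessor -> task
            return nodes, sorted(edges)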
  • FIG. 5 is a schematic diagram of a computational graph provided by an embodiment of the present application.
  • the nodes indicated by circles are computational nodes
  • the nodes indicated by rounded rectangles are transport nodes
  • the arrows are directed edges, which are used to indicate the data flow between the tasks indicated by any two nodes.
  • (a) is a computational graph based on a computational task, which includes multiple computational nodes and directed edges.
  • (b) is a computational graph including input and output transport tasks, which includes a plurality of computational nodes, a plurality of transport nodes, and directed edges.
  • (c) is a computational graph containing all transport tasks, which includes a plurality of computational nodes, a plurality of transport nodes, and directed edges.
  • the data flow between computational tasks corresponds to three transport tasks, that is, taking the data flow indicated by any directed edge as an example, data is transported from the chip where the starting computational task is located to the starting shared memory, from the starting shared memory to the target shared memory, and from the target shared memory to the chip where the terminating computational task is located.
  • the server compiles the code file of the target program and converts it into a modelable data file and task file, thereby generating a calculation graph based on the calculation task, namely, a DAG graph.
  • This process can be referred to in Figure 6, which is a schematic diagram of generating a calculation graph provided in an embodiment of the present application. Through this process, technical support is provided for the subsequent process of segmenting the calculation graph to realize computing power deployment.
  • the server obtains the node weights of the computation graph and the directed edge weights of the computation graph.
  • the node weight indicates the importance of the computing task indicated by the computing node in the target program
  • the directed edge weight indicates the importance of the data flow indicated by the directed edge in the target program.
  • Step B1 Taking the parent computing node in the computing graph as the starting point, a first topological sort is performed on multiple computing nodes to obtain a first sorting result, where the first sorting result indicates the first level to which each computing node belongs.
  • the server takes the parent computing node in the computing graph as the starting point, and based on the directed edge connected to the parent computing node, forward traverses the multiple computing nodes, and layers the multiple computing nodes according to the hierarchy to obtain the first level to which each computing node belongs.
  • Figure 7 is a schematic diagram of a topological sorting provided in an embodiment of the present application.
  • the server takes computing node 1 as the starting point, traverses the remaining computing nodes 2 to 9, and obtains the first sorting result: computing node 1 belongs to layer 0, computing nodes 2 and 3 belong to layer 1, computing nodes 4, 5, and 6 belong to layer 2, computing nodes 7 and 8 belong to layer 3, and computing node 9 belongs to layer 4.
  • this topological sorting method is called topological sorting based on the as-soon-as-possible (ASAP) principle.
  • Step B2 Taking the sub-computing nodes in the computation graph as the starting point, a second topological sorting is performed on the plurality of computation nodes to obtain a second sorting result, wherein the second sorting result indicates the second level to which each computation node belongs.
  • the server takes the sub-computing node in the computational graph as the starting point, and based on the directed edge connected to the sub-computing node, traverses the multiple computing nodes in reverse, and layers the multiple computing nodes according to the hierarchy to obtain the second level to which each computing node belongs. For example, referring to FIG. 7 , as shown in FIG. 7 (b), the server takes computing nodes 5, 8, and 9 as the starting point, traverses the remaining computing nodes, and obtains the second sorting result: computing node 1 belongs to the 0th level, computing node 2 belongs to the 1st level, computing nodes 3 and 4 belong to the 2nd level, computing nodes 7 and 6 belong to the 3rd level, and computing nodes 5, 8, and 9 belong to the 4th level.
  • this topological sorting method is called topological sorting based on the as-late-as-possible (ALAP) principle.
  • the server uses two topological sorting methods to topologically sort multiple computing nodes in the calculation graph and obtains corresponding topological sorting results, that is, the importance of the computing nodes and directed edges in the calculation graph is perceived from different angles, so that the server can determine the node weight of each computing node and the directed edge weight of each directed edge based on the following step B3.
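  • For illustration only, the following Python sketch shows one way to obtain the two sorting results described in steps B1 and B2: ASAP levels by a forward traversal from the parent (source) computing nodes and ALAP levels by a backward traversal from the child (sink) computing nodes. The graph representation (node list plus directed-edge list) is an assumption for the example.

        from collections import defaultdict, deque

        def topo_order(nodes, edges):
            # Kahn's algorithm for a topological ordering of the computing nodes.
            indeg = {n: 0 for n in nodes}
            succ = defaultdict(list)
            for u, v in edges:
                indeg[v] += 1
                succ[u].append(v)
            queue = deque(n for n in nodes if indeg[n] == 0)
            order = []
            while queue:
                n = queue.popleft()
                order.append(n)
                for s in succ[n]:
                    indeg[s] -= 1
                    if indeg[s] == 0:
                        queue.append(s)
            return order

        def asap_alap_levels(nodes, edges):
            succ, pred = defaultdict(list), defaultdict(list)
            for u, v in edges:
                succ[u].append(v)
                pred[v].append(u)
            order = topo_order(nodes, edges)
            # ASAP: a node sits one level below its deepest predecessor.
            asap = {}
            for n in order:
                asap[n] = max((asap[p] + 1 for p in pred[n]), default=0)
            # ALAP: traverse in reverse order; sink nodes take the deepest level.
            depth = max(asap.values(), default=0)
            alap = {}
            for n in reversed(order):
                alap[n] = min((alap[s] - 1 for s in succ[n]), default=depth)
            return asap, alap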
  • Step B3 Based on the first sorting result and the second sorting result, determine the node weights of the calculation graph and the directed edge weights of the calculation graph.
  • the following takes any computing node as an example (hereinafter referred to as the target computing node) to introduce the method of determining the node weight.
  • the server determines the node slack of the target computing node based on the difference between the first level and the second level to which the target computing node belongs, and determines the node weight of the target computing node based on the node slack, the data processing amount of the computing task indicated by the target computing node, and the hardware performance reference value.
  • the node slack indicates the importance of the target computing node in the computing graph. Generally, the smaller the node slack, the more important the target computing node is in the computing graph.
  • the server obtains the data processing amount from the task file based on the task identifier of the computing task indicated by the target computing node.
  • the hardware performance reference value can be a default value, or it may be an average performance reference value of the multiple hardware resources in the target hardware (i.e., the hardware used to execute the target program), which is not limited in this embodiment.
  • the process of determining the node slack of the target computing node is represented by the following formula (1):
  • Slack(node) = ALAP(node) - ASAP(node), Slack ∈ [0, lv]   (1)
  • where Slack(node) represents the node slack; ALAP(node) represents the second level to which the target computing node belongs; ASAP(node) represents the first level to which the target computing node belongs; and lv is an integer, short for level.
  • Figure 8 is a schematic diagram of node slack and directed edge slack provided in an embodiment of the present application. As shown in Figure 8 (a), based on the topological sorting results shown in Figure 7 above, the node slack of each computing node is calculated by the above formula (1).
  • in formula (2), W_node(i) represents the node weight of the target computing node; i is a positive integer, indicating the i-th computing node, that is, the target computing node; GFlop(i) represents the data processing amount of the computing task indicated by the target computing node; and perf represents the hardware performance reference value.
  • the following takes any directed edge as an example (hereinafter referred to as the target directed edge) to introduce the method of determining the weight of the directed edge.
  • the server determines the directed edge slack of the target directed edge based on the difference between the first level to which the starting computing node connected to the target directed edge belongs and the second level to which the terminating computing node belongs, and determines the directed edge weight of the target directed edge based on the directed edge slack and the data transmission volume indicated by the target directed edge.
  • the directed edge slack indicates the importance of the target directed edge in the calculation graph. Generally, the smaller the directed edge slack, the higher the importance of the target directed edge in the calculation graph.
  • the server determines the data identifier of the data transmitted based on the target directed edge based on the data flow indicated by the target directed edge, and obtains the data transmission volume from the data file based on the data identifier.
  • the process of determining the directed edge weight of the above-mentioned target directed edge is represented by the following formulas (3) and (4).
  • EdgeSlack(src→dst) = ALAP(dst) - ASAP(src), EdgeSlack ∈ [1, lv]   (3)
  • EdgeSlack(src→dst) represents the slack of the directed edge, where src (source) represents the starting computing node connected to the directed edge, and dst (destination) represents the terminating computing node connected to the directed edge; ALAP(dst) represents the second level to which the terminating computing node belongs; ASAP(src) represents the first level to which the starting computing node belongs; lv is an integer, short for level.
  • W_link(i→j) = data_KByteSize / EdgeSlack(i→j)   (4)
  • W_link(i→j) represents the directed edge weight of the directed edge; i and j are positive integers, representing the i-th computing node and the j-th computing node, respectively, that is, the starting computing node and the terminating computing node connected by the directed edge; data_KByteSize represents the data transmission amount indicated by the directed edge.
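  • For illustration only, the following Python sketch computes the slacks and weights from the ASAP/ALAP levels. The slack and edge-weight expressions follow formulas (1), (3) and (4) above; the exact node-weight formula (2) is not reproduced in the text, so dividing the estimated compute time GFlop(i)/perf by the node slack is an assumption made by analogy with formula (4).

        # Slacks and weights derived from the ASAP/ALAP levels.
        def node_weight(asap_lv, alap_lv, gflop, perf):
            slack = max(alap_lv - asap_lv, 1)      # formula (1); clamped so the division is defined
            return (gflop / perf) / slack          # assumed form of formula (2)

        def edge_weight(asap_src_lv, alap_dst_lv, data_kbyte):
            edge_slack = max(alap_dst_lv - asap_src_lv, 1)   # formula (3)
            return data_kbyte / edge_slack                   # formula (4)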
  • when the server obtains the computation graph, it performs two topological sorts on the multiple computing nodes in the computation graph to obtain the node slack of each computing node and the directed edge slack of each directed edge, and thus the node weight of each computing node and the directed edge weight of each directed edge.
  • This process can be referred to Figure 9, which is a schematic diagram of obtaining the weight of a calculation graph provided in an embodiment of the present application. Through this process, the weights of the computing nodes and the directed edges are assigned to the calculation graph, which can ensure that the data processing volume of each sub-computation graph is balanced when the calculation graph is split, and the data transmission volume between different sub-computation graphs is reduced, providing technical support for the subsequent splitting of the calculation graph.
  • the server obtains the quantity of multiple hardware resources in the target hardware and communication reference information between the multiple hardware resources.
  • the target hardware refers to the hardware used to execute the target program, that is, the hardware used to execute each computing task in the calculation graph.
  • the target hardware includes multiple hardware resources, for example, the hardware resources are CPU, GPU, NPU, etc. It should be noted that the embodiment of the present application does not limit the granularity of the division of hardware resources.
  • the hardware resource can also be any core of the CPU, which can be divided according to actual needs.
  • the communication reference information between hardware resources indicates the communication resources consumed by the data transmission between the hardware resources, which can be calculated based on the parameter information of the hardware resources.
  • the communication reference information can also be understood as the communication distance between the hardware resources.
  • the following describes a method for determining communication reference information between hardware resources.
  • the server obtains communication reference information between multiple hardware resources based on the connection relationship, bandwidth information, latency information, route information and data transfer information between the multiple hardware resources.
  • connection relationship refers to the connectivity relationship between hardware resources. If there is a cascade path between any two hardware resources, it means that the two hardware resources are directly connected, otherwise they are indirectly connected. It should be understood that in principle there are no unconnected hardware resource pairs.
  • Bandwidth information refers to the amount of data transmitted per unit time between directly connected hardware resources. Generally, the larger the bandwidth, the more data can be transmitted per unit time.
  • Latency information refers to the time it takes to transmit data between directly connected hardware resources. It can be related to the bandwidth or a fixed value. Generally, the longer the latency, the greater the communication cost.
  • Route information refers to the number of routes between indirectly connected hardware resources.
  • Data transfer information refers to the data transfer overhead between indirectly connected hardware resources.
  • FIG10 is a schematic diagram of a hardware resource provided in an embodiment of the present application. Taking any two hardware resources as an example, the process of determining the communication reference information is represented by the following formula (5):
  • m and n are positive integers, representing the mth hardware resource and the nth hardware resource; D mn represents the communication reference information between the mth hardware resource and the nth hardware resource (the combination of the communication reference information between all hardware resources can be understood as a hardware resource matrix D); hop represents the number of hops. As shown in FIG10 , if the connection between hardware resources 1 and 4 is indirect, the number of hops between hardware resources 1 and 4 is 2; dly represents the delay; bw represents the bandwidth, and H(bw) represents the harmonic mean; waynum represents the number of routes; and tran represents the data transfer overhead. In some embodiments, if the bandwidth and delay between directly connected hardware resources are set to bw and dly, the above formula (5) can be simplified to the following formula (6):
  • D12 represents the communication reference information between hardware resources 1 and 2
  • D14 represents the communication reference information between hardware resources 1 and 4.
  • the communication reference information shown in the above formula (5) is only an illustrative description provided by the present application.
  • the communication reference information can be determined according to actual needs.
  • the server determines the communication reference information based on at least one of the connection relationship between multiple hardware resources, bandwidth information, latency information, route information, and data transfer information.
  • the communication reference information can also be a default reference value.
  • the communication reference information between directly connected CPUs is 10
  • the communication reference information between indirectly connected CPUs is 20, and so on. The present application embodiment does not limit this.
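  • For illustration only, the following Python sketch assembles a communication reference value from the factors named above (hop count, delay, bandwidth with a harmonic mean, route count and data transfer overhead). The exact combination used in formula (5) is not reproduced in the text, so the weighting below is an assumption; only the ingredients follow the description.

        def harmonic_mean(values):
            return len(values) / sum(1.0 / v for v in values)

        def comm_reference(hops, delays, bandwidths, waynum, tran):
            # More hops, higher delay and lower harmonic-mean bandwidth increase the
            # communication "distance"; more parallel routes decrease it; indirect
            # connections additionally pay a data transfer overhead.
            return hops * (sum(delays) + 1.0 / harmonic_mean(bandwidths)) / waynum + tran

        # Example in the spirit of Figure 10: a directly connected pair (hardware
        # resources 1 and 2) versus a two-hop pair (hardware resources 1 and 4).
        D12 = comm_reference(hops=1, delays=[5.0], bandwidths=[100.0], waynum=1, tran=0.0)
        D14 = comm_reference(hops=2, delays=[5.0, 5.0], bandwidths=[100.0, 100.0], waynum=2, tran=3.0)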
  • the server uses a clustering algorithm to divide the target hardware to obtain multiple hardware resources.
  • the division principle is large-category clustering: connections within a large category are relatively tight, with fewer indirect connections, while connections between large categories are relatively sparse, with more indirect connections.
  • the CPU and GPU computing power are calculated and modeled separately to obtain the CPU category and the GPU category.
  • the CPU category is divided to obtain multiple CPUs, and the communication reference information between the CPUs is calculated based on the multiple CPUs;
  • the GPU category is divided to obtain multiple GPUs, and the communication reference information between the GPUs is calculated based on the multiple GPUs.
  • the server obtains the number of multiple hardware resources in the target hardware and the communication reference information between the multiple hardware resources.
  • This process can also be understood as a process of hardware resource modeling. Please refer to Figure 11.
  • Figure 11 is a schematic diagram of a hardware resource modeling process provided by an embodiment of the present application. Through this process, the obtained number of multiple hardware resources can be used in the subsequent process of splitting the calculation graph, and the obtained communication reference information between the multiple hardware resources can be used in the subsequent process of deploying the split sub-calculation graph to improve resource utilization.
  • the embodiment of the present application does not limit the execution timing of the above-mentioned step 303.
  • the server can execute step 303 first, and then execute step 301 and step 302. It can also execute step 303 synchronously when executing step 301 and step 302.
  • the server can also obtain the number of multiple hardware resources when executing the following step 304, and obtain communication reference information between multiple hardware resources when executing the following step 305.
  • the server divides the computation graph based on the quantity of multiple hardware resources in the target hardware, the node weights of the computation graph, and the directed edge weights of the computation graph to obtain multiple sub-computation graphs.
  • the server divides the computational graph based on the number of multiple hardware resources, the node weights of the computational graph, and the directed edge weights of the computational graph to obtain multiple sub-computational graphs, so that the number of the multiple sub-computational graphs after division is equal to the number of multiple hardware resources, and the importance of the computing nodes and directed edges in the sub-computational graphs meets the target conditions.
  • the target condition means that the total weight of the cut edges between the sub-computation graphs is minimized.
  • the server calls a heuristic algorithm to segment the computation graph based on the number of multiple hardware resources, the node weights of the computation graph, and the directed edge weights of the computation graph to obtain multiple sub-computation graphs.
  • the heuristic algorithm includes an ant colony algorithm, a SAM algorithm, a neural network, etc., which are not limited to this.
  • the server divides the computation graph based on the number of multiple hardware resources in the target hardware to obtain multiple sub-computation graphs.
  • This process can be referred to Figure 12.
  • Figure 12 is a schematic diagram of dividing a computation graph provided in an embodiment of the present application. Through this process, the computation graph is divided into multiple sub-computation graphs, which facilitates the subsequent deployment of these sub-computation graphs on multiple hardware resources to improve resource utilization.
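  • For illustration only, the following Python sketch evaluates the objective that a heuristic partitioner (ant colony, SAM, etc.) would optimize over candidate splits: a balanced node-weight load per sub-computation graph and a minimal total weight of cut edges. The assignment mapping and data structures are assumptions for the example; the heuristic search itself is not shown.

        def evaluate_partition(assign, node_weights, edge_weights, num_parts):
            # Per-sub-graph data processing load (should be balanced across parts).
            part_load = [0.0] * num_parts
            for node, w in node_weights.items():
                part_load[assign[node]] += w
            # Total weight of the directed edges cut by the split (should be minimal).
            cut_weight = sum(w for (u, v), w in edge_weights.items()
                             if assign[u] != assign[v])
            imbalance = max(part_load) - min(part_load)
            return cut_weight, imbalance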
  • the server obtains the computing power deployment result of the computation graph based on the communication reference information between the multiple hardware resources and the computing tasks of the multiple sub-computation graphs.
  • the computing power deployment result indicates the computing tasks of multiple sub-computation graphs performed by multiple hardware resources.
  • the server deploys the multiple sub-computation graphs to the multiple hardware resources respectively, so that the multiple hardware resources respectively execute the computing tasks of the multiple sub-computation graphs.
  • the number of multiple hardware resources is equal to the number of multiple sub-computation graphs, that is, each sub-computation graph will be deployed to the corresponding hardware resources.
  • the computing power deployment result can also be understood as indicating the mapping relationship (or matching relationship) between multiple hardware resources and multiple sub-computation graphs.
  • Step C1 Based on multiple hardware resources and multiple sub-computation graphs, obtain the intermediate computing power deployment result of the calculation graph.
  • Step C2 Based on the communication reference information between the multiple hardware resources, the data transmission volume between the multiple sub-computation graphs, and the intermediate computing power deployment result, the communication cost of the intermediate computing power deployment result is obtained.
  • the amount of data transmission between the multiple sub-computation graphs is determined by the following formula (7):
  • x and y are positive integers, representing the x-th sub-computation graph and the y-th sub-computation graph;
  • C_xy represents the data transmission volume between the x-th sub-computation graph and the y-th sub-computation graph (the combination of the data transmission volumes between all sub-computation graphs can be understood as a data transmission volume matrix C);
  • k is a positive integer, representing any directed edge; part(src_k) represents the sub-computation graph where the starting computing node connected by directed edge k is located; part(dst_k) represents the sub-computation graph where the terminating computing node connected by directed edge k is located;
  • the δ function indicates whether directed edge k is a cut edge: if part(src_k) and part(dst_k) are inconsistent, directed edge k is a cut edge (that is, an edge that is "cut" by the segmentation) and δ takes the value 0; otherwise δ takes the value 1.
  • the communication cost of the intermediate computing power deployment result is determined by formula (8), in which S represents the communication cost;
  • P is the permutation matrix;
  • C is the data transmission volume matrix (refer to the above formula (7));
  • D is the hardware resource matrix (refer to the above formula (5)).
  • Step C3 based on the communication cost of the intermediate computing power deployment result, update the intermediate computing power deployment result to obtain the computing power deployment result.
  • the server iteratively updates the intermediate computing power deployment result based on the communication cost of the intermediate computing power deployment result until the iteration cutoff condition is met to obtain the computing power deployment result.
  • the iteration cutoff condition can be that the number of iterations reaches the target number of iterations or that the communication cost is less than the target threshold, which is not limited.
  • the server calls the SAM algorithm to execute the above steps C1 to C3 to obtain the computing power deployment result of the computing graph.
  • the SAM algorithm is an algorithm that approaches the optimal solution through iterative updates, and includes the following steps:
  • the first step is to randomly generate the initial computing power deployment results based on multiple hardware resources and multiple sub-computation graphs, set the initialization temperature to T, and the number of iterations to L.
  • the second step is to calculate the communication cost of the initial computing power deployment result based on the above formulas (7) and (8).
  • the third step is to randomly select two sub-computation graphs from the initial computing power deployment results and exchange their corresponding hardware resources to obtain new computing power deployment results, and recalculate the communication cost and increment ⁇ T.
  • Step 4 If ⁇ T ⁇ 0, accept the new computing power deployment result, otherwise accept the new computing power deployment result with a probability of exp(- ⁇ T/T), and repeat L times.
  • Step 5 Gradually reduce T and return to step 3 until T drops to the preset threshold and output the computing power deployment result.
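  • As a non-authoritative sketch only, the five SAM steps above could be organized as follows; the initial temperature, the number of iterations per temperature, the cooling factor, and the stopping threshold are hypothetical parameters, and the cost of each candidate deployment is evaluated directly with formula (8).

```python
import math
import random
import numpy as np

def anneal_deployment(C, D, T=1000.0, L=100, cooling=0.95, T_min=1e-3):
    """Simulated-annealing mapping of sub-computation graphs onto hardware resources
    (steps 1-5 above, sketched). P is a permutation matrix: P[i, j] = 1 means the
    i-th sub-computation graph is deployed on the j-th hardware resource."""
    n = C.shape[0]
    P = np.eye(n)[np.random.permutation(n)]           # step 1: random initial deployment
    cost = np.trace(P.T @ C @ P @ D.T)                 # step 2: communication cost, formula (8)
    while T > T_min:                                   # step 5: cool until the preset threshold
        for _ in range(L):                             # repeat L times at each temperature
            i, j = random.sample(range(n), 2)          # step 3: pick two sub-graphs
            P_new = P.copy()
            P_new[[i, j]] = P_new[[j, i]]              # exchange their hardware resources
            new_cost = np.trace(P_new.T @ C @ P_new @ D.T)
            delta = new_cost - cost                    # the increment ΔT in the text
            if delta < 0 or random.random() < math.exp(-delta / T):   # step 4
                P, cost = P_new, new_cost
        T *= cooling
    return P, cost
```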
  • the server deploys multiple sub-computation graphs to multiple hardware resources respectively based on the communication reference information between multiple hardware resources (hardware resource matrix D) and the computing tasks of multiple sub-computation graphs (data transmission volume matrix C), so that multiple hardware resources can respectively execute the computing tasks of multiple sub-computation graphs.
  • This process is illustrated in Figure 13, which is a schematic diagram of a deployment process of a computation graph provided in an embodiment of the present application. Through this process, automatic deployment of the computation graph is realized. Moreover, since the communication reference information between the hardware resources is taken into consideration, computing power resources can be effectively saved and resource utilization improved.
  • the server can also call a simulation scheduling tool to simulate and schedule the above computing power deployment results, so as to quickly evaluate the performance of the computing power deployment results, and further adjust the computing power deployment results to further improve resource utilization.
  • the server calls a simulation scheduling tool to perform simulation scheduling on the computing power deployment result to obtain a simulation scheduling result.
  • the simulation scheduling result includes the simulation scheduling time and resource utilization of multiple hardware resources executing the computing tasks of multiple sub-computation graphs.
  • the simulation scheduling tool is a distributed simulation scheduling framework based on a notify table and event-driven execution.
  • the simulation scheduling process is introduced below with reference to FIG14.
  • FIG14 is a schematic diagram of a simulation scheduling process provided by an embodiment of the present application. As shown in FIG14, during the simulation scheduling process, the scheduler maintains the following four lists:
  • the task graph list is used to store the tasks of the sub-computation graphs;
  • the release list (release-list) is used to store the identifiers of tasks whose dependencies have been released. It should be understood that the tasks in the computation graph are executed sequentially; when a task is completed, the dependencies between that task and other tasks are released, and those other tasks are the tasks whose dependencies have been released;
  • the execution list (ontheFly-list) is used to store the identifiers of the tasks being executed;
  • the commit list (commit-list) is used to store the identifiers of completed tasks.
  • the simulation scheduling process includes the following steps: 1. The commit list sends the estimated task completion time (EstTime) to the scheduler; 2. The commit list receives the notification message for the simulation scheduling time (wall-clock) returned by the scheduler; 3. The commit list executes the task and commits it; 4. A notification message for the completed task is sent to the scheduler; 5. The scheduler updates the task graph list based on the completed task; 6. After the task graph list is updated, the release list is updated; 7. The target number of tasks in the release list are assigned to the execution list; 8. The tasks in the execution list are updated; 9. The estimated task completion time is updated. It should be understood that the above process is executed through loop iteration until the last task has been committed, at which point the simulation scheduling time generated by the scheduler is the simulation scheduling time of the entire computation graph.
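  • A highly simplified sketch of such an event-driven simulation loop is given below; the four lists and the wall-clock advance follow the description above, while the task identifiers, dependency sets, and estimated durations are hypothetical inputs, and the real framework additionally exchanges notification messages between the lists and the scheduler.

```python
import heapq

def simulate_schedule(tasks, deps, durations, num_resources):
    """Event-driven scheduling simulation (sketch).

    tasks:     iterable of task identifiers (the task graph list).
    deps:      dict task -> set of predecessor tasks.
    durations: dict task -> estimated execution time (EstTime).
    Returns the simulated scheduling time of the whole computation graph.
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}   # task graph list
    release = [t for t, d in remaining.items() if not d]   # release-list: dependencies cleared
    on_the_fly = []                                        # execution list: (finish_time, task)
    commit = []                                            # commit-list: completed tasks
    wall_clock = 0.0

    while release or on_the_fly:
        # assign up to num_resources released tasks to the execution list
        while release and len(on_the_fly) < num_resources:
            t = release.pop()
            heapq.heappush(on_the_fly, (wall_clock + durations[t], t))
        # advance the wall-clock to the earliest estimated completion time
        wall_clock, done = heapq.heappop(on_the_fly)
        commit.append(done)                                # commit the finished task
        # update the task graph list, then the release list
        for t, d in remaining.items():
            if done in d:
                d.remove(done)
                if not d:
                    release.append(t)
    return wall_clock
```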
  • the server adjusts the computing power deployment result based on the simulation scheduling result.
  • the server adjusts the computing power deployment result based on the simulation scheduling result, re-performs simulation scheduling based on the adjusted computing power deployment result, and through this iterative adjustment obtains a computing power deployment result that meets the conditions, for example when the number of iterative adjustments reaches a preset number or when the simulation scheduling result of the adjusted computing power deployment result meets the requirements; this is not limited.
  • the scheduling performance can be quickly evaluated through event-driven distributed scheduling simulation, and the evaluation results are reliable, so as to further adjust the computing power deployment results to achieve the effect of further improving resource utilization.
  • the calculation graph of the target program can be actually deployed to multiple hardware resources of the target hardware for execution according to the computing power deployment result, thereby effectively saving computing power resources and improving resource utilization.
  • by contrast, in the related art, taking the MPI+OMP mode or the MPI+FF-Graph mode as an example (FF is the abbreviation of function flow), the processing of computation graphs is not based on modeling the complete computation graph, so many opportunities for parallelism among multiple hardware resources are given up; moreover, MPI imposes a high programming threshold on programmers, and the computation graph generation form and process are complex and inefficient, resulting in greater consumption of computing power resources.
  • the above steps 301 to 307 are offline processing processes.
  • in other embodiments, the server can also perform the above steps 301 to 307 online, and adjust the computing power deployment result online based on the occupancy information of the hardware resources during online operation, so as to improve resource utilization in real time.
  • the offline and online methods can also be mixed to generate a baseline for the computing power deployment results offline, and then fine-tune the computing power deployment results based on the hardware resource occupancy information during online operation.
  • the embodiments of the present application are not limited to this.
  • Figure 15 is a schematic diagram of a method for processing a computational graph provided by an embodiment of the present application.
  • the method for processing a computational graph provided by an embodiment of the present application is between high-level language and underlying task scheduling execution, and belongs to the category of resource allocation and deployment.
  • Its overall framework includes three subframeworks: a modeling framework 1501, a deployment framework 1502, and a simulation scheduling framework 1503, and the three subframeworks progressively build on one another.
  • the modeling framework 1501 includes: computational graph construction (i.e., the aforementioned step 301), computational graph structure labels (i.e., the aforementioned step 302), and hardware modeling (i.e., the aforementioned step 303).
  • the deployment framework 1502 includes: automatic segmentation (i.e., the aforementioned step 304) and automatic deployment (i.e., the aforementioned step 305). Based on the computing power modeling results of multiple hardware resources in the target hardware and the structural labels of the computing graph, the computing graph is segmented, and the multiple sub-computation graphs obtained after segmentation are deployed to multiple hardware resources, so that each sub-computation graph is adapted to different hardware resources and the communication cost is minimized, preparing for the simulation scheduling framework 1503.
  • the simulation scheduling framework 1503 includes distributed scheduling simulation (i.e., the aforementioned steps 306 and 307), which is an event-driven distributed scheduling method. By simulating the operating status of each hardware resource and synchronization messages, the simulation scheduling results are obtained, and the simulation scheduling results are used as the basis for the overall computing graph deployment performance. Finally, the computing power deployment results are imported into the task scheduling execution, and the scheduling performance verification in a real environment can be performed.
  • Figure 16 is a schematic diagram of a method for processing a computational graph provided by an embodiment of the present application. As shown in Figure 16, the process of the method is between the high-level language and scheduling. Hardware modeling is performed separately to provide the computing power and communication cost basis for segmentation and deployment.
  • Figure 17 is a schematic diagram of a calculation graph of a neural network provided in the embodiment of the present application.
  • the neural network is a Megatron neural network, and its function flow (FF) calculation graph includes forward propagation (FP) and backward propagation (BP), with a total of 72 encoders.
  • Figure 18 is a schematic diagram of a manually deployed calculation graph provided in the embodiment of the present application. In manual deployment, with the target hardware including 8 hardware resources, 9 encoders are deployed on each hardware resource, that is, the 72 encoders are deployed sequentially onto the 8 hardware resources for execution.
  • Figure 19 is a comparative diagram of a manual deployment provided by an embodiment of the present application and the scheme of the present application.
  • when the calculation graph shown in Figure 17 is deployed manually, the resource utilization rate of each hardware resource (such as a chip die) stabilizes at about 68% on average; with the automatic deployment of the present scheme, the resource utilization rate of each hardware resource is improved to varying degrees compared with the manual deployment method, with an overall improvement of 2.4%. It can be seen that adopting the scheme of the present application can effectively save computing power resources and improve resource utilization.
  • the calculation graph processing method provided in the embodiment of the present application can also be applied to the HPL scenario.
  • compared with the neural network training scenario, the calculation graph structure of HPL is usually more complex, and often needs to be presented in the form of folded graphs and dynamic graphs.
  • the calculation graph processing method provided in the embodiment of the present application can be applied to different segmentation scenarios and requirements such as folded graphs and dynamic graphs, and can achieve multi-level and highly responsive deployment, thereby improving resource utilization in the HPL scenario.
  • in summary, in the calculation graph processing method provided in the embodiments of the present application, for the calculation graph of the target program, the calculation graph is divided into multiple sub-calculation graphs based on the number of multiple hardware resources in the target hardware, so that the calculation tasks of the multiple sub-calculation graphs are respectively deployed to the multiple hardware resources for execution according to the communication reference information between the hardware resources, and the computing power deployment result of the calculation graph is obtained.
  • since the complete calculation graph is segmented, and the communication reference information involved in the computing power deployment process indicates the communication resources consumed for data transmission between hardware resources, the final computing power deployment result can effectively save computing power resources and improve resource utilization.
  • FIG20 is a schematic diagram of the structure of a computing graph processing device provided in an embodiment of the present application.
  • the computing graph processing device can implement the aforementioned computing graph processing method through software, hardware, or a combination of both.
  • the computing graph processing device includes a computing graph segmentation module 2001 and a computing power deployment module 2002.
  • a computation graph segmentation module 2001 is used to segment the computation graph of the target program based on the number of multiple hardware resources in the target hardware to obtain multiple sub-computation graphs of the target program, wherein the computation graph includes multiple computation nodes and directed edges, wherein the computation nodes indicate computation tasks of the target program, and the directed edges indicate data flows between the computation tasks indicated by the computation nodes;
  • the computing power deployment module 2002 is used to obtain the computing power deployment result of the computation graph based on the communication reference information between the multiple hardware resources and the computing tasks of the multiple sub-computation graphs, wherein the communication reference information indicates the communication resources consumed by data transmission between hardware resources, and the computing power deployment result indicates the computing tasks of the multiple sub-computation graphs executed by the multiple hardware resources.
  • the computation graph segmentation module 2001 is used to:
  • obtain the node weights of the computation graph and the directed edge weights of the computation graph, the node weights indicating the importance of the computation tasks indicated by the computation nodes in the target program, and the directed edge weights indicating the importance of the data flows indicated by the directed edges in the target program; and segment the computation graph based on the number of the multiple hardware resources, the node weights, and the directed edge weights to obtain the multiple sub-computation graphs, so that the number of the multiple sub-computation graphs after segmentation is equal to the number of the multiple hardware resources, and the importance of the computing nodes and directed edges in the sub-computation graphs meets the target conditions.
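  • Purely as a hypothetical illustration of balanced segmentation (the embodiments call a heuristic algorithm such as ant colony optimization or SAM for this step, and additionally minimize the weighted cut between sub-computation graphs, which this sketch ignores), computation nodes visited in topological order could be assigned to the currently lightest of the required number of parts using their node weights:

```python
def greedy_segment(topo_order, node_weight, num_parts):
    """Greedy balanced segmentation sketch: num_parts equals the number of hardware
    resources, and each computation node goes to the sub-computation graph whose
    total node weight is currently the smallest."""
    load = [0.0] * num_parts
    part = {}
    for node in topo_order:                                   # nodes in topological order
        p = min(range(num_parts), key=lambda i: load[i])      # lightest part so far
        part[node] = p
        load[p] += node_weight[node]
    return part                                               # node -> sub-computation graph index

# Hypothetical usage
part = greedy_segment(topo_order=[0, 1, 2, 3],
                      node_weight={0: 2.0, 1: 1.0, 2: 3.0, 3: 1.0},
                      num_parts=2)
```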
  • the apparatus further comprises a weight determination module, the weight determination module being configured to:
  • perform a first topological sorting of the multiple computation nodes starting from the parent computation nodes in the computation graph to obtain a first sorting result indicating the first level to which each computation node belongs; perform a second topological sorting of the multiple computation nodes starting from the child computation nodes in the computation graph to obtain a second sorting result indicating the second level to which each computation node belongs; and determine the node weights of the computation graph and the directed edge weights of the computation graph based on the first sorting result and the second sorting result.
  • the weight determination module is used to:
  • based on the difference between the first level and the second level to which the target computation node belongs, determine the node slack of the target computation node; based on the node slack, the data processing amount of the computation task indicated by the target computation node, and the hardware performance reference value, determine the node weight of the target computation node. The node slack indicates the importance of the target computation node in the computation graph, and the target computation node is any computation node.
  • based on the difference between the first level of the starting computation node connected by the target directed edge and the second level of the terminating computation node connected by the target directed edge, determine the directed edge slack of the target directed edge; based on the directed edge slack and the data transmission volume indicated by the target directed edge, determine the directed edge weight of the target directed edge. The directed edge slack indicates the importance of the target directed edge in the computation graph, and the target directed edge is any directed edge.
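  • The sketch below shows one way the two topological sorting results, the slacks, and the weights could be computed. The directed edge weight follows formula (4), data_KByteSize / EdgeSlack, and the slacks follow formulas (1) and (3); the exact form of the node weight in formula (2) is not reproduced in this text, so the expression used here only combines its stated inputs (GFlop, the slack, and the hardware performance reference value perf) and is an assumption.

```python
def levels(nodes, edges):
    """First (ASAP) and second (ALAP) topological levels of each computation node.
    edges: list of (src, dst) directed edges of a DAG."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for s, d in edges:
        preds[d].append(s)
        succs[s].append(d)

    def longest_level(nxt):
        level = {n: 0 for n in nodes}
        changed = True
        while changed:                      # simple fixed-point; fine for a small DAG
            changed = False
            for n in nodes:
                for m in nxt[n]:
                    if level[m] < level[n] + 1:
                        level[m] = level[n] + 1
                        changed = True
        return level

    asap = longest_level(succs)             # first sorting, from the parent computation nodes
    rev = longest_level(preds)              # reverse levels, from the child computation nodes
    max_lv = max(asap.values())
    alap = {n: max_lv - rev[n] for n in nodes}   # second sorting (ALAP levels)
    return asap, alap

def graph_weights(nodes, edges, gflop, kbytes, perf):
    asap, alap = levels(nodes, edges)
    node_slack = {n: alap[n] - asap[n] for n in nodes}              # formula (1)
    edge_slack = {(s, d): alap[d] - asap[s] for s, d in edges}      # formula (3), always >= 1
    # assumed node-weight form combining the stated inputs of formula (2)
    w_node = {n: (gflop[n] / perf) / (node_slack[n] + 1) for n in nodes}
    w_edge = {e: kbytes[e] / edge_slack[e] for e in edges}          # formula (4)
    return w_node, w_edge
```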
  • the computing power deployment module 2002 is used to:
  • obtain the intermediate computing power deployment result of the computation graph based on the multiple hardware resources and the multiple sub-computation graphs; obtain the communication cost of the intermediate computing power deployment result based on the communication reference information between the multiple hardware resources, the data transmission volume between the multiple sub-computation graphs, and the intermediate computing power deployment result; and update the intermediate computing power deployment result based on its communication cost to obtain the computing power deployment result.
  • the device further includes an acquisition module, the acquisition module being configured to:
  • obtain the communication reference information between the multiple hardware resources based on the connection relationships, bandwidth information, latency information, route information, and data transfer information between the multiple hardware resources.
  • the apparatus further includes a computational graph generation module, the computational graph generation module being configured to:
  • compile the code file of the target program to obtain a data file and a task file of the target program, the data file including the data features of the target program and the task file including the task features of the target program; and generate the computation graph based on the data file and the task file.
  • the computation graph generation module is further used to:
  • in a case where the computation graph further includes a plurality of transport nodes, delete the plurality of transport nodes in the computation graph based on the task file, the transport nodes indicating data transport tasks of the target program.
  • the apparatus further includes a simulation scheduling module, the simulation scheduling module being configured to:
  • call a simulation scheduling tool to perform simulation scheduling on the computing power deployment result to obtain a simulation scheduling result, wherein the simulation scheduling result includes the simulation scheduling time and resource utilization of the multiple hardware resources executing the computing tasks of the multiple sub-computation graphs; and adjust the computing power deployment result based on the simulation scheduling result.
  • in the computing graph processing device provided in the embodiment of the present application, for the computing graph of the target program, the computing graph is divided into multiple sub-computation graphs based on the number of multiple hardware resources in the target hardware, so that the computing tasks of the multiple sub-computation graphs are respectively deployed to the multiple hardware resources for execution according to the communication reference information between the hardware resources, and the computing power deployment result of the computing graph is obtained.
  • since the complete computing graph is segmented, and the communication reference information involved in the computing power deployment process indicates the communication resources consumed for data transmission between hardware resources, the computing power deployment result finally obtained can effectively save computing power resources and improve resource utilization.
  • the computing graph segmentation module 2001 and the computing power deployment module 2002 can be implemented by software or by hardware.
  • the implementation of these modules is introduced below by taking the computation graph segmentation module 2001 as an example.
  • the implementation of the computing power deployment module 2002 and other modules can refer to the implementation of the computing graph segmentation module 2001.
  • the computing graph splitting module 2001 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, the above-mentioned computing instance may be one or more.
  • the computing graph splitting module 2001 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region (region) or in different regions.
  • the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one data center or multiple data centers with similar geographical locations. Among them, usually a region may include multiple AZs.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs.
  • usually, a VPC is set up within one region. For communication between two VPCs within the same region, as well as cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and interconnection between the VPCs is achieved through the communication gateways.
  • the computation graph segmentation module 2001 may include at least one computing device.
  • the computation graph segmentation module 2001 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the multiple computing devices included in the computation graph segmentation module 2001 can be distributed in the same region or in different regions.
  • the multiple computing devices included in the computation graph segmentation module 2001 can be distributed in the same AZ or in different AZs.
  • the multiple computing devices included in the computation graph segmentation module 2001 can be distributed in the same VPC or in multiple VPCs.
  • the multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the computation graph segmentation module 2001 can be used to execute any step in the computation graph processing method, that is, the steps that the computation graph segmentation module 2001 and the computing power deployment module 2002 are responsible for implementing can be specified as needed, and the computation graph segmentation module 2001 and the computing power deployment module 2002 respectively implement different steps in the computation graph processing method to realize all functions of the computation graph processing device.
  • the computation graph processing device provided in the above embodiment and the computation graph processing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the words such as “first” and “second” are used to distinguish the same or similar items with basically the same effects and functions. It should be understood that there is no logical or temporal dependency between “first”, “second” and “nth”, nor is the quantity and execution order limited. It should also be understood that although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another.
  • the first sorting result can be referred to as the second sorting result, and similarly, the second sorting result can be referred to as the first sorting result.
  • the first sorting result and the second sorting result can both be sorting results, and in some cases, can be separate and different sorting results.
  • the term "at least one” means one or more, and the term “plurality” means two or more.
  • a plurality of sorting results means two or more sorting results.
  • all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • all or part of the embodiments may be implemented in the form of program structure information.
  • the program structure information includes one or more program instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

本申请公开了一种计算图的处理方法、装置、设备及存储介质,属于计算机技术领域。该方法包括:对于目标程序的计算图,基于目标硬件中多个硬件资源的数量,将该计算图切分为多个子计算图,从而根据各个硬件资源之间的通信参考信息,将多个子计算图的计算任务分别部署到多个硬件资源上去执行,得到计算图的算力部署结果。在这一过程中,由于对完整的计算图进行了切分,且算力部署过程中涉及的通信参考信息能够指示硬件资源之间进行数据传输所耗费的通信资源,因此最终得到的算力部署结果能够有效节约算力资源,提升资源利用率。

Description

计算图的处理方法、装置、设备及存储介质
本申请要求于2022年11月07日提交的申请号为202211387594.7、发明名称为“计算图的处理方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别涉及一种计算图的处理方法、装置、设备及存储介质。
背景技术
计算图(computational graph)是一种通用的计算过程表示方法,用于描述函数的有向无环图,普遍应用在各类数据处理平台上,一个计算图包括多个节点和有向边。以机器学习领域为例,计算图用于表示神经网络涉及的计算逻辑,其中,计算图中的每个节点表示神经网络的计算任务(如,add节点表示一个加法运算的计算任务),有向边将前一个节点(可称为前节点或父节点)连接至后一个节点(可称为后节点或子节点),表示父节点的输出作为子节点的输入。
相关技术中,继续以机器学习为例,对计算图的处理方式通常如下:对神经网络的代码文件进行编译处理,得到该神经网络的计算图,按照计算图中各节点的拓扑排序结果,将这些排好序的节点所指示的计算任务一一加载到硬件上,由硬件上具体执行计算任务的硬件资源来执行各个计算任务。然而,随着神经网络的计算规模和网络复杂度不断提升,对计算图进行处理所耗费的算力资源也越来越庞大,因此,亟需一种能够有效节约算力资源,提升资源利用率的计算图的处理方法。
发明内容
本申请实施例提供了一种计算图的处理方法、装置、设备及存储介质,能够有效节约算力资源,提升资源利用率。该技术方案如下:
第一方面,提供了一种计算图的处理方法,该方法包括:
基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到该目标程序的多个子计算图,该计算图包括多个计算节点和有向边,该计算节点指示该目标程序的计算任务,该有向边指示计算节点所指示的计算任务之间的数据流向;
基于多个该硬件资源之间的通信参考信息和多个该子计算图的计算任务,获取该计算图的算力部署结果,该通信参考信息指示硬件资源之间数据传输所耗费的通信资源,该算力部署结果指示多个该硬件资源所执行的多个该子计算图的计算任务。
在上述方法中,对于目标程序的计算图,基于目标硬件中多个硬件资源的数量,将该计算图切分为多个子计算图,从而根据各个硬件资源之间的通信参考信息,将多个子计算图的计算任务分别部署到多个硬件资源上去执行,得到计算图的算力部署结果。在这一过程中,由于对完整的计算图进行了切分,且算力部署过程中涉及的通信参考信息能够指示硬件资源之间进行数据传输所耗费的通信资源,因此最终得到的算力部署结果能够有效节约算力资源,提升资源利用率。
在一些实施例中,基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到该目标程序的多个子计算图,包括:
获取该计算图的节点权重和该计算图的有向边权重,该节点权重指示计算节点所指示的计算任务在该目标程序中的重要程度,该有向边权重指示有向边所指示的数据流向在该目标程序中的重要程度;
基于多个该硬件资源的数量、该计算图的节点权重以及该计算图的有向边权重,对该计算图进行切分,得到多个该子计算图,以使切分后的多个该子计算图的数量等于多个该硬件资源的数量,且该子计算图中计算节点和有向边的重要程度符合目标条件。
通过上述方法,能够确保各个子计算图中计算任务总量之间达到均衡,且各个子计算图中计算任务和数据流向的重要程度达到均衡,实现计算图的平衡最小切分,便于后续将这些子计算图分别部署至多个硬件资源上,以提升资源利用率。
在一些实施例中,该方法还包括:
以该计算图中的父计算节点为起点,对该多个计算节点进行第一拓扑排序,得到第一排序结果,该第一排序结果指示各个计算节点所属的第一层级;
以该计算图中的子计算节点为起点,对该多个计算节点进行第二拓扑排序,得到第二排序结果,该第二排序结果指示各个计算节点所属的第二层级;
基于该第一排序结果和该第二排序结果,确定该计算图的节点权重和该计算图的有向边权重。
在一些实施例中,基于该第一排序结果和该第二排序结果,确定该计算图的节点权重和该计算图的有向边权重,包括:
基于目标计算节点所属的第一层级与第二层级之间的差值,确定该目标计算节点的节点松弛度,基于该节点松弛度、该目标计算节点所指示的计算任务的数据处理量以及硬件性能参考值,确定该目标计算节点的节点权重,该节点松弛度指示该目标计算节点在该计算图中的重要程度,该目标计算节点为任意一个计算节点;
基于目标有向边所连接的起始计算节点所属的第一层级和终止计算节点所属的第二层级之间的差值,确定该目标有向边的有向边松弛度,基于该有向边松弛度和该目标有向边所指示的数据传输量,确定该目标有向边的有向边权重,该有向边松弛度指示该目标有向边在该计算图中的重要程度,该目标有向边为任意一条有向边。
通过上述方法,对计算图中的多个计算节点分别进行两次拓扑排序,以得到各个计算节点的节点松弛度和各条有向边的有向边松弛度,从而获取到各个计算节点的节点权重和有向边权重,能够确保在对计算图进行切分时各个子计算图的数据处理量均衡的情况下,减小不同子计算图之间的数据传输量,为计算图的切分提供技术支撑。
在一些实施例中,基于多个该硬件资源之间的通信参考信息和多个该子计算图的计算任务,获取该计算图的算力部署结果,包括:
基于多个该硬件资源和多个该子计算图,获取该计算图的中间算力部署结果;
基于多个该硬件资源之间的通信参考信息、多个该计算图之间的数据传输量以及该中间算力部署结果,获取该中间算力部署结果的通信代价;
基于该中间算力部署结果的通信代价,更新该中间算力部署结果,以得到该算力部署结果。
应理解,计算图部署的目标是通过合理分配硬件资源,让“距离远”的硬件资源之间的数据传输需求尽量少,而“距离近”的硬件资源之间数据传输需求尽量多(此处“距离”通过通信参考信息来体现)。因此,本申请定义了一种通信代价,通过最小化算力部署结果的通信代价来得到最终的算力部署结果,能够有效节约算力资源,提升资源利用率。
在一些实施例中,该方法还包括:
基于多个该硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息,获取多个该硬件资源之间的通信参考信息。
在一些实施例中,该方法还包括:
对该目标程序的代码文件进行编译处理,得到该目标程序的数据文件和任务文件,该数据文件包括该目标程序的数据特征,该任务文件包括该目标程序的任务特征;
基于该数据文件和该任务文件,生成该计算图。
在一些实施例中,方法还包括:
在该计算图还包括多个搬运节点的情况下,基于该任务文件,删除该计算图中的多个该搬运节点,该搬运节点指示该目标程序的数据搬运任务。
通过上述方法,对目标程序的代码文件进行编译处理,将其转化为可建模的数据文件和任务文件,从而生成基于计算任务的计算图,为后续对计算图进行切分以实现算力部署的过程提供了技术支撑。
在一些实施例中,该方法还包括:
调用仿真调度工具,对该算力部署结果进行仿真调度,得到仿真调度结果,该仿真调度结果包括多个该硬件资源执行多个该子计算图的计算任务的仿真调度时间和资源利用率;
基于该仿真调度结果,调整该算力部署结果。
通过上述方法,调用仿真调度工具,对算力部署结果进行仿真调度,以便快速评估算力部署结果的性 能,从而进一步调整算力部署结果,以达到进一步提升资源利用率的效果。
第二方面,本申请实施例提供了一种计算图的处理装置,该装置包括至少一个功能模块,用于执行前述第一方面或第一方面的任意一种可能的实现方式所提供的计算图的处理方法。
第三方面,本申请实施例提供了一种计算设备,该计算设备包括处理器和存储器;该存储器用于存储至少一段程序代码,该至少一段程序代码由处理器加载并执行如前述第一方面或第一方面的任意一种可能的实现方式所提供的计算图的处理方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质用于存储至少一段程序代码,该至少一段程序代码用于实现前述第一方面或第一方面的任意一种可能的实现方式所提供的计算图的处理方法。该存储介质包括但不限于易失性存储器,例如随机访问存储器,非易失性存储器,例如快闪存储器、硬盘(hard disk drive,HDD)、固态硬盘(solid state drive,SSD)。
第五方面,本申请实施例提供了一种计算机程序产品,当该计算机程序产品在计算设备上运行时,使得该计算设备实现前述第一方面或第一方面的任意一种可能的实现方式所提供的计算图的处理方法。该计算机程序产品可以为一个软件安装包,在需要实现前述计算图的处理方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。
附图说明
图1是本申请实施例提供的一种计算图的处理方法的实施环境示意图;
图2是本申请实施例提供的一种计算设备的硬件结构示意图;
图3是本申请实施例提供的一种计算图的处理方法的流程图;
图4是本申请实施例提供的一种代码文件的编译处理过程的示意图;
图5是本申请实施例提供的一种计算图的示意图;
图6是本申请实施例提供的一种生成计算图的示意图;
图7是本申请实施例提供的一种拓扑排序的示意图;
图8是本申请实施例提供的一种节点松弛度和有向边松弛度的示意图;
图9是本申请实施例提供的一种获取计算图权重的示意图;
图10是本申请实施例提供的一种硬件资源的示意图;
图11是本申请实施例提供的一种硬件资源建模过程的示意图;
图12是本申请实施例提供的一种计算图的切分示意图;
图13是本申请实施例提供的一种计算图的部署过程的示意图;
图14是本申请实施例提供的一种仿真调度流程的示意图;
图15是本申请实施例提供的一种计算图的处理方法的示意图;
图16是本申请实施例提供的一种计算图的处理方法的示意图;
图17是本申请实施例提供的一种神经网络的计算图的示意图;
图18是本申请实施例提供的一种人工部署计算图的示意图;
图19是本申请实施例提供的一种人工部署和本申请方案的对比示意图;
图20是本申请实施例提供的一种计算图的处理装置的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请中 涉及到的目标程序的代码文件等都是在充分授权的情况下获取的。
为了方便理解,下面先对本申请涉及的关键术语和关键概念进行说明。
计算图(computational graph),一种通用的计算过程表示方法,用于描述函数的有向无环图,普遍应用在各类数据处理平台上,一个计算图包括多个节点和有向边。以机器学习领域为例,计算图用于表示神经网络涉及的计算逻辑,其中,计算图中的每个节点表示神经网络的计算任务(如,add节点表示一个加法运算的计算任务),有向边将前一个节点(可称为前节点或父节点)连接至后一个节点(可称为后节点或子节点),表示父节点的输出作为子节点的输入。在一些实施例中,将计算图中指示计算任务的节点称为计算节点。另外,在一些实施例中,计算图中的节点还能够指示数据搬运任务(如,将数据从计算任务1所在的芯片搬运到共享内存等),将这些节点称为搬运节点。
人工智能(artificial intelligence,AI)模型,是一类用机器学习思想解决实际问题的数学算法模型,AI模型中包括大量的参数和计算公式(或计算规则)。
神经网络,是一类模仿生物神经网络(动物的中枢神经系统)的结构和功能的数学算法AI模型。一个神经网络可以包括多种不同功能的神经网络层,每层包括参数和计算公式。
拓扑排序(topological sorting),是一个有向无环图(directed acyclic graph,DAG)的所有节点的线性序列。
启发式算法(heuristic algorithm),是相对于最优化算法提出的。一个问题的最优算法求得该问题每个实例的最优解。启发式算法可以这样定义:一个基于直观或经验构造的算法,在可接受的花费(指计算时间和空间)下给出待解决组合优化问题每一个实例的一个可行解,该可行解与最优解的偏离程度一般不能被预计。目前,启发式算法包括蚁群算法、模拟退火算法(simulated annealing mapping,SAM)、神经网络等。
高性能计算(high performance computing),是一种利用超级计算机或计算机集群的能力来解决需要大量计算的复杂问题的技术,如一些数据密集型计算任务,包括仿真、建模和渲染等。
高性能LINPACK系统(high performance LINPACK),是一种用于测试高性能计算机系统浮点性能的基准标准检查程序(benchmark),其中,LINPACK是线性系统软件包(linear system package)的简称。LINPACK基准通过求解密集的线性代数方程组,来测试和评估高性能计算机系统的浮点性能。测试标准包括三种不同的信息尺度的测试:100×100、1000×1000和n×n,其中前两种测试所用的基准程序能够从相关网站上下载,经过编译运行的程序,这种程序可以提供相应机器的性能,并且这种测试不允许对测试程序进行任何修正。n×n数据规模的测试要求是LINPACK测试标准中最宽松的,用户可以对任意大小的问题规模,使用任意数量的中央处理器(central processing unit,CPU),使用基于高斯消元的各种优化方法来执行该测试程序,寻求最佳的性能测试结果。在一些实施例中,将用于加速器自检的高性能LINPACK系统(high performance LINPACK for accelerator introspection)简称为HPL-AI。
消息传递接口(message passing interface,MPI),是一个标准化和可移植的消息系统(函数库)。该标准定义了编写具有消息传递的可移植程序时使用的库函数的语法和语义。换言之,MPI是一种用于在并行应用程序的分支之间提供通信的软件工具。
开放式多处理(open multi-processing,OMP),是一种用于共享内存并行系统的多线程程序设计方案,支持的编程语言包括C、C++和Fortran等。
高级语言(high-level programming language),是一种独立于机器,面向过程或对象的语言。高级语言是参照数学语言而设计的近似于日常会话的语言。例如,高级语言包括BASIC、JAVA、C、C++、Python等,对此不作限定。
下面对本申请涉及的应用场景和实施环境进行介绍。
本申请实施例提供了一种计算图的处理方法,能够应用于针对计算图的算力部署场景,例如,AI神经网络训练场景和HPL场景等。示意性地,以AI神经网络训练场景为例,对于一个定义好的神经网络,通过对该神经网络的代码文件进行编译处理,生成该神经网络的计算图,将该计算图所指示的计算任务部署到相应硬件资源上去执行,实现神经网络的训练过程。
图1是本申请实施例提供的一种计算图的处理方法的实施环境示意图。如图1所示,该实施环境包括 终端101和服务器102,终端101通过无线网络或有线网络与服务器102直接或间接相连。
终端101可以是智能手机、台式计算机、增强现实终端、平板电脑、电子书阅读器和膝上型便携计算机中的至少一种。终端101安装和运行有应用程序,如客户端应用、浏览器应用等,对此不作限定。示意性地,以AI神经网络训练场景为例,终端101上运行有支持训练神经网络的应用程序,用户能够通过该应用程序,输入定义好的神经网络的代码文件,触发终端101向服务器102发送针对该神经网络的训练请求,以便服务器102对该神经网络的代码文件进行编译处理,生成计算图,并将计算图所指示的计算任务部署到相应硬件资源上去执行,实现神经网络的训练过程。
服务器102为独立的物理服务器,或者是多个物理服务器构成的服务器集群或者分布式文件系统,又或者是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(content delivery network,CDN)以及大数据和人工智能平台等基础云计算服务的云服务器。服务器102用于为终端101上运行的应用程序提供后台服务。服务器102也可以理解为具有计算图的算力部署功能,能够针对目标程序(如神经网络、HPL测试程序等)生成相应的计算图,并将计算图所指示的计算任务部署到相应硬件资源上去执行。
终端101可以泛指多个终端中的一个,或者多个终端组成的集合;服务器102可以是单独的计算设备、计算设备集群、虚拟机或容器引擎等等,本申请实施例对实施环境中每种设备的数量和设备类型不做限定。
在一些实施例中,上述的无线网络或有线网络使用标准通信技术和/或协议。网络包括但不限于数据中心网络(data center network)、存储区域网(storage area network,SAN)、局域网(local area network,LAN)、城域网(metropolitan area network,MAN)、广域网(wide area network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合。在一些实现方式中,使用包括超级文本标记语言(hyper text markup language,HTML)、可扩展标记语言(extensible markup language,XML)等的技术和/或格式来代表通过网络交换的数据。此外还能够使用诸如安全套接字层(secure sockets layer,SSL)、传输层安全(transport layer security,TLS)、虚拟专用网络(virtual private network,VPN)、网际协议安全(internet protocol security,IPsec)等常规加密技术来加密所有或者部分链路。在另一些实施例中,还能够使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。
下面对上述服务器102的硬件结构进行介绍。
本申请实施例提供了一种计算设备,能够配置为上述实施环境中的服务器102。示意性地,参考图2,图2是本申请实施例提供的一种计算设备的硬件结构示意图。如图2所示,该计算设备200包括存储器201、处理器202、通信接口203以及总线204。其中,存储器201、处理器202、通信接口203通过总线204实现彼此之间的通信连接。
存储器201可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。示意性地,存储器201用于存储至少一段程序代码,当存储器201中存储的程序代码被处理器202执行时,处理器202和通信接口203用于执行下述实施例所示的计算图的处理方法。
处理器202可以为智能芯片(也可称为AI芯片),也即是本申请实施例中涉及的硬件资源,例如是网络处理器(network processor unit,NPU)、中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、现场可编程逻辑门阵列(field programmable gate array,FPGA)、特定应用集成电路(application-specific integrated circuit,ASIC)或用于控制本申请方案程序执行的集成电路。该处理器202可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。该处理器202的数量可以是一个,也可以是多个,对此不作限定。
通信接口203使用例如收发器一类的收发模块,来实现计算设备200与其他设备或通信网络之间的通信。例如,可以通过通信接口203获取数据。
其中,存储器201和处理器202可以分离设置,也可以集成在一起。
总线204可包括在计算设备200各个部件(例如,存储器201、处理器202、通信接口203)之间传送信息的通路。
在一些实施例中,本申请提供的计算图的处理方法还可以通过分布式部署在不同的环境中的多个计算设备来实现,本申请实施例对此不作限定。
下面对本申请提供的计算图的处理方法进行介绍。
图3是本申请实施例提供的一种计算图的处理方法的流程图。如图3所示,该计算图的处理方法应用于上述实施环境中的服务器,包括下述步骤301至步骤307。
301、服务器基于目标程序的代码文件,生成该目标程序的计算图。
在本申请实施例中,目标程序是指任意程序,如神经网络、HPL测试程序等,对此不作限定。目标程序的代码文件是指以高级语言编写的代码文件,本申请实施例对于高级语言的具体形式不作限定。目标程序的计算图包括多个计算节点和有向边,该计算节点指示该目标程序的计算任务,该有向边指示计算节点所指示的计算任务之间的数据流向,也可以理解为计算任务之间的依赖关系。
下面结合图4和图5,对服务器生成目标程序的计算图的过程进行详细介绍,包括下述步骤A1和步骤A2。
步骤A1、对目标程序的代码文件进行编译处理,得到该目标程序的数据文件和任务文件。
其中,服务器获取目标程序的代码文件,对该代码文件进行编译处理,将其转化为可建模的数据文件和任务文件。示意性地,数据文件包括该目标程序的数据特征,例如,以目标程序的任意数据为例,该数据的数据特征包括:数据名称(data name)、数据标识(data ID)、数据处理量(data KByteSize)(也称为数据大小)、需求源初始化(consummer init)以及数据区域(data region)(如存储数据的芯片,也称为硬件资源)等等,对此不作限定。任务文件包括该目标程序的任务特征,例如,以目标程序的任意任务为例,该任务的任务特征包括:任务名称(task name)、任务标识(task ID)、任务类型(task type)(如计算任务、数据搬运任务等),任务区域(task region)(如计算任务指定的芯片),该任务的前序任务ID列表(predecessor taskIDlist)、该任务的后继任务ID列表(successor taskIDlist)、该任务的输入数据ID列表(input dataID)(长度与前序任务ID列表长度一致)以及该任务的输出数据ID列表(output dataID)等等,对此不作限定。应理解,目标程序的数据特征和任务特征可以分别存储在不同的文件中,也可以存储在同一个文件中,本申请实施例对此不作限定。
示意性地,参考图4,图4是本申请实施例提供的一种代码文件的编译处理过程的示意图。如图4所示,基于高级语言Python编写的目标程序(矩阵分解,LU-Decomposition)的代码文件如图4中(a)图所示,通过对该代码文件进行编译处理,得到数据文件和任务文件,其中,数据文件如图4中(b)图所示,任务文件如图4中(c)图所示。需要说明的是,图4所示仅为一种示意性说明,数据文件和任务文件的具体内容能够根据实际需求进行增加或删减,对此不作限定。
步骤A2、基于该数据文件和该任务文件,生成该目标程序的计算图。
其中,服务器基于该数据文件和该任务文件,生成基于计算任务的有向无环图,也即该目标程序的计算图。应理解,在计算图中,计算任务会被部署至目标硬件上去执行,且会耗费通信资源和算力资源,因此生成计算任务的有向无环图能够便于后续对计算图进行切分,提高计算图的切分效率。
在一些实施例中,服务器采用递归算法来生成计算图。示意性地,以任务文件中任一个计算任务为例(以下称为目标计算任务),服务器获取目标计算任务和该目标计算任务的前序依赖任务ID列表,若该前序依赖任务ID列表中存在搬运任务的任务ID,则遍历该搬运任务的前序依赖任务ID列表,直到定位到所有计算任务集合,将计算任务集合中的每一个计算任务作为该目标计算任务的前序,遍历所有计算任务,最终生成基于计算任务的有向无环图。
在另一些实施例中,服务器能够基于该数据文件和该任务文件,生成该目标程序的完整的计算图,即,该计算图包括多个计算节点、多个搬运节点和有向边,在这种情况下,服务器基于该任务文件,删除该计算图中的多个搬运节点,以得到基于计算任务的有向无环图,本申请实施例对此不作限定。
示意性地,参考图5,图5是本申请实施例提供的一种计算图的示意图。如图5所示,圆圈所示节点为计算节点,圆角矩形所示节点为搬运节点,箭头为有向边,用于表示任意两个节点所指示的任务之间的数据流向。其中,(a)图为基于计算任务的计算图,该计算图包括多个计算节点和有向边。(b)图为包含 有输入输出搬运任务的计算图,该计算图包括多个计算节点、多个搬运节点以及有向边。例如,搬运节点A指示将数据从存储单元-晶圆(storage unit-wafer,SU-W)搬运至存储单元-共享(storage unit-share,SU-S),搬运节点B指示将数据从芯片单元外部搬运到芯片单元。(c)图为包含有所有搬运任务的计算图,该计算图包括多个计算节点、多个搬运节点以及有向边。在一些实施例中,计算任务之间的数据流向对应3个搬运任务,即,以任一条有向边所指示的数据流向为例,将数据从起始计算任务所在的芯片搬运到起始共享内存上、从起始共享内存搬运到目标共享内存、从目标共享内存搬运到终止计算任务所在的芯片上。应理解,上述(a)、(b)、(c)图之间能够互相转化,即,(c)图可以理解为是目标程序的完整的计算图,(b)图可以理解为是简化后的计算图,(a)图可以理解为是抽象化的仅包含计算任务的计算图,在实际应用中,能够根据需求选择生成相应的计算图来实现针对计算图的算力部署,本申请实施例对此不作限定。
经过上述步骤301,服务器对目标程序的代码文件进行编译处理,将其转化为可建模的数据文件和任务文件,从而生成基于计算任务的计算图,即DAG图,这一过程可参考图6,图6是本申请实施例提供的一种生成计算图的示意图,通过这一过程,为后续对计算图进行切分以实现算力部署的过程提供了技术支撑。
302、服务器获取计算图的节点权重和计算图的有向边权重。
在本申请实施例中,节点权重指示计算节点所指示的计算任务在目标程序中的重要程度,有向边权重指示有向边所指示的数据流向在目标程序中的重要程度。通过获取计算图的节点权重和有向边权重,为后续对计算图的切分提供了技术支撑,使得计算图在切分时能够保留目标程序中重要程度较高的节点和有向边。这一过程也可以理解为给计算图的各个计算节点和各条有向边添加标签的过程,或者说为计算图添加结构标签的过程。
下面结合图7和图8,对服务器获取计算图的节点权重和有向边权重的过程进行介绍,包括下述步骤B1至步骤B3。
步骤B1、以计算图中的父计算节点为起点,对多个计算节点进行第一拓扑排序,得到第一排序结果,该第一排序结果指示各个计算节点所属的第一层级。
其中,服务器以计算图中的父计算节点为起点,基于与该父计算节点连接的有向边,正向遍历该多个计算节点,将多个计算节点按照层次分层,得到各个计算节点所属的第一层级。例如,参考图7,图7是本申请实施例提供的一种拓扑排序的示意图。如图7中(a)图所示,服务器以计算节点1为起点,遍历剩余的计算节点2至9,得到第一排序结果:计算节点1属于第0层,计算节点2和3属于第1层,计算节点4、5、6属于第2层,计算节点7和8属于第3层,计算节点9属于第4层。在一些实施例中,将这种拓扑排序方式称为基于最快原则的拓扑排序(as soon as possible,ASAP)。
步骤B2、以计算图中的子计算节点为起点,对该多个计算节点进行第二拓扑排序,得到第二排序结果,该第二排序结果指示各个计算节点所属的第二层级。
其中,服务器以计算图中的子计算节点为起点,基于与该子计算节点连接的有向边,反向遍历该多个计算节点,将多个计算节点按照层次分层,得到各个计算节点所属的第二层级。例如,继续参考图7,如图7中(b)图所示,服务器以计算节点5、8、9为起点,遍历剩余的计算节点,得到第二排序结果:计算节点1属于第0层,计算节点2属于第1层,计算节点3和4属于第2层,计算节点7和6属于第3层,计算节点5、8、9属于第4层。在一些实施例中,将这种拓扑排序方式称为基于最晚原则的拓扑排序(as late as possible,ALAP)。
经过上述步骤B1和B2,服务器采用两种拓扑排序方式,对计算图中的多个计算节点进行了拓扑排序,得到相应的拓扑排序结果,也即是,从不同角度来感知计算图中计算节点和有向边在整个计算图中的重要程度,从而服务器能够基于下述步骤B3来确定各个计算节点的节点权重和各条有向边的有向边权重。
步骤B3、基于第一排序结果和第二排序结果,确定计算图的节点权重和计算图的有向边权重。
下面以任意一个计算节点为例(以下称为目标计算节点),介绍节点权重的确定方式。
示意性地,服务器基于目标计算节点所属的第一层级与第二层级之间的差值,确定该目标计算节点的节点松弛度,基于该节点松弛度、该目标计算节点所指示的计算任务的数据处理量以及硬件性能参考值,确定该目标计算节点的节点权重。其中,节点松弛度指示该目标计算节点在该计算图中的重要程度,通常节点松弛度越小,表示该目标计算节点在该计算图中的重要程度越高。示意性地,服务器基于目标计算节点所指示的计算任务的任务标识,从任务文件中获取该数据处理量。另外,硬件性能参考值可以是默认值, 也可以是目标硬件(即用于执行目标程序的硬件)中多个硬件资源的平均性能参考值,本申请实施例对此不作限定。上述目标计算节点的节点权重的确定过程通过下述公式(1)和(2)来表示。
Slack(node)=ALAP(node)-ASAP(node),Slack∈[0,lv] (1)
在公式(1)中,Slack(node)表示节点松弛度;ALAP(node)表示目标计算节点所属的第二层级;ASAP(node)表示目标计算节点所属的第一层级;lv为整数,是层级(level)的简称。示意性地,参考图8,图8是本申请实施例提供的一种节点松弛度和有向边松弛度的示意图。如图8中(a)图所示,在上述图7所示的拓扑排序结果的基础上,通过上述公式(1)计算得到各个计算节点的节点松弛度。
在公式(2)中,Wnode(i)表示目标计算节点的节点权重;i为正整数,表示第i个计算节点,也即目标计算节点;GFlop(i)表示目标计算节点所指示的计算任务的数据处理量;perf表示硬件性能参考值。
下面以任意一条有向边为例(以下称为目标有向边),介绍有向边权重的确定方式。
示意性地,服务器基于目标有向边所连接的起始计算节点所属的第一层级和终止计算节点所属的第二层级之间的差值,确定该目标有向边的有向边松弛度,基于该有向边松弛度和该目标有向边所指示的数据传输量,确定该目标有向边的有向边权重。其中,有向边松弛度指示该目标有向边在该计算图中的重要程度,通常有向边松弛度越小,表示该目标有向边在该计算图中的重要程度越高。示意性地,服务器基于该目标有向边所指示的数据流向,确定基于该目标有向边进行传输的数据的数据标识,基于该数据标识,从数据文件中获取该数据传输量。上述目标有向边的有向边权重的确定过程通过下述公式(3)和(4)来表示。
EdgeSlack(src→dst)=ALAP(dst)-ASAP(src),EdgeSlack∈[1,lv] (3)
在公式(3)中,EdgeSlack(src→dst)表示有向边松弛度,其中src(起始source的简称)表示有向边所连接的起始计算节点,dst(终止destination的简称)表示有向边所连接的终止计算节点;ALAP(dst)表示终止计算节点所属的第二层级;ASAP(src)表示起始计算节点所属的第一层级;lv为整数,是层级(level)的简称。示意性地,继续参考图8,如图8中(b)图所示,在上述图7所示的拓扑排序结果的基础上,通过上述公式(3)计算得到各条有向边的有向边松弛度。
Wlink(i→j)=data_KByteSize/EdgeSlack(i→j) (4)
在公式(4)中,Wlink(i→j)表示有向边的有向边权重;i、j为正整数,分别表示第i个计算节点和第j个计算节点,也即有向边所连接的起始计算节点和终止计算节点;data_KByteSize表示数据处理量。
经过上述步骤302,服务器在获取到计算图的情况下,对计算图中的多个计算节点分别进行两次拓扑排序,以得到各个计算节点的节点松弛度和各条有向边的有向边松弛度,从而获取到各个计算节点的节点权重和有向边权重,这一过程可参考图9,图9是本申请实施例提供的一种获取计算图权重的示意图,通过这一过程,为计算图赋予计算节点与有向边的权重,能够确保在对计算图进行切分时各个子计算图的数据处理量均衡的情况下,减小不同子计算图之间的数据传输量,为后续计算图的切分提供技术支撑。
303、服务器获取目标硬件中多个硬件资源的数量和多个硬件资源之间的通信参考信息。
在本申请实施例中,目标硬件是指用于执行目标程序的硬件,也即是用于执行计算图中各个计算任务的硬件。目标硬件包括多个硬件资源,例如,硬件资源为CPU、GPU、NPU等。需要说明的是,本申请实施例对于硬件资源的划分粒度不作限定,例如,硬件资源还可以是CPU的任一个核,能够根据实际需求进行划分。硬件资源之间的通信参考信息指示硬件资源之间数据传输所耗费的通信资源,能够根据硬件资源的参数信息计算得到,该通信参考信息也可以理解为是硬件资源之间的通信距离。
下面对硬件资源之间的通信参考信息的确定方式进行介绍。
其中,服务器基于多个硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息,获取多个硬件资源之间的通信参考信息。
连接关系是指硬件资源之间的连通关系,若任意两个硬件资源之间存在级联通路,则表明这两个硬件资源之间属于直接连接,否则属于间接连接,应理解,原则上不存在不相连的硬件资源对。带宽信息是指 直接连接的硬件资源之间单位时间传输的数据量,通常带宽越大说明单位时间可以传输更多的数据。时延信息是指直接连接的硬件资源之间数据传输的时间,可以与带宽相关,也可以是固定值,通常时延越长说明通信代价越大。路线信息是指间接连接的硬件资源之间的线路条数。数据转运信息是指间接连接的硬件资源之间的数据转运开销。
示意性地,参考图10,图10是本申请实施例提供的一种硬件资源的示意图。以任意两个硬件资源为例,上述确定通信参考信息的过程通过下述公式(5)来表示:
在公式(5)中,m、n为正整数,表示第m个硬件资源和第n个硬件资源;Dmn表示第m个硬件资源和第n个硬件资源之间的通信参考信息(全部硬件资源之间的通信参考信息的组合可以理解为是一个硬件资源矩阵D);hop表示跳数,如图10所示,硬件资源1和4之间为间接连接,则硬件资源1和4之间的跳数为2;dly表示时延;bw表示带宽,H(bw)表示调和平均;waynum表示路线条数;tran表示数据转运开销。在一些实施例中,若将直接连接的硬件资源之间的带宽和时延设置为bw和dly,则上述公式(5)可以简化为下述公式(6):
上述公式(6)中,以图10为例,D12表示硬件资源1和2之间的通信参考信息,D14表示硬件资源1和4之间的通信参考信息。
应理解,上述公式(5)所示的通信参考信息的计算方式仅为本申请提供的一种示意性说明,在一些实施例中,能够根据实际需求来确定通信参考信息,例如,服务器基于多个硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息中的至少一项来确定通信参考信息。当然,通信参考信息还可以是默认参考值,例如,直接连接的CPU之间的通信参考信息为10,间接连接的CPU之间的通信参考信息为20,等等,本申请实施例对此不作限定。
需要说明的是,上述图10所示的硬件资源的形式仅为示意性说明,并不构成对本申请实施例中硬件资源的限定。在一些实施例中,服务器采用聚类算法,将目标硬件进行划分,得到多个硬件资源。其中,划分原则是大类聚类,大类内连接较为紧密,较少间接连接,大类之间连接较为稀疏,较多间接连接。例如,将CPU与GPU算力分开来计算和建模,得到CPU大类和GPU大类,接着,对CPU大类进行划分,得到多个CPU,基于该多个CPU来计算CPU之间的通信参考信息;对GPU大类进行划分,得到多个GPU,基于该多个GPU来计算GPU之间的通信参考信息。通过这种异构算力部分分开计算的方式,实现了有针对性地建模,能够更为有效地利用算力资源。
经过上述步骤303,服务器获取到目标硬件中多个硬件资源的数量和多个硬件资源之间的通信参考信息,这一过程也可以理解为硬件资源建模的过程,可参考图11,图11是本申请实施例提供的一种硬件资源建模过程的示意图,通过这一过程,获取到的多个硬件资源的数量能够用于后续对计算图进行切分的过程中,获取到的多个硬件资源之间的通信参考信息能够用于后续对切分后的子计算图进行部署的过程中,以提升资源利用率。
另外,本申请实施例对于上述步骤303的执行时机不作限定,服务器可以先执行步骤303,再执行步骤301和步骤302,也可以在执行步骤301和步骤302的情况下,同步执行步骤303,或者,服务器还可以在执行下述步骤304时,获取多个硬件资源的数量,在执行下述步骤305时,获取多个硬件资源之间的通信参考信息。
304、服务器基于目标硬件中多个硬件资源的数量、计算图的节点权重以及计算图的有向边权重,对计算图进行切分,得到多个子计算图。
在本申请实施例中,服务器基于多个硬件资源的数量、计算图的节点权重以及计算图的有向边权重,对计算图进行切分,得到多个子计算图,以使切分后的多个子计算图的数量等于多个硬件资源的数量,且子计算图中计算节点和有向边的重要程度符合目标条件。其中,目标条件是指子计算图之间的加权切边数最小,应理解,在对计算图进行切分时是通过删除有向边的方式来实现的,而有向边所指示的数据流向在目标程序中的重要程度能够通过有向边权重来体现,有向边所连接的计算节点所指示的计算任务在目标程序中的重要程度能够通过节点权重来体现,因此,通过最小化子计算图之间的加权切边数,能够确保各个 子计算图中计算任务总量之间达到均衡,且各个子计算图中计算任务和数据流向的重要程度达到均衡,实现计算图的平衡最小切分。
在一些实施例中,服务器调用启发式算法,基于多个硬件资源的数量、计算图的节点权重以及计算图的有向边权重,对计算图进行切分,得到多个子计算图。例如,启发式算法包括蚁群算法、SAM算法、神经网络等,对此不作限定。
经过上述步骤304,服务器基于目标硬件中多个硬件资源的数量,对计算图进行了切分,得到多个子计算图,这一过程可参考图12,图12是本申请实施例提供的一种计算图的切分示意图,通过这一过程,将计算图切分多个子计算图,便于后续将这些子计算图分别部署至多个硬件资源上,以提升资源利用率。
305、服务器基于多个硬件资源之间的通信参考信息和多个子计算图的计算任务,获取该计算图的算力部署结果。
在本申请实施例中,算力部署结果指示多个硬件资源所执行的多个子计算图的计算任务。服务器基于多个硬件资源之间的通信参考信息和多个子计算图的计算任务,将多个子计算图分别部署至多个硬件资源上,以使多个硬件资源分别执行多个子计算图的计算任务。基于前述内容可知,多个硬件资源的数量等于多个子计算图的数量,也即是,每个子计算图均会被部署至相应的硬件资源上,换言之,算力部署结果也可以理解为指示多个硬件资源与多个子计算图之间的映射关系(或称为匹配关系)。
下面对服务器获取算力部署结果的过程进行介绍,包括下述步骤C1至步骤C3:
步骤C1、基于多个硬件资源和多个子计算图,获取计算图的中间算力部署结果。
其中,将多个子计算图随机部署至多个硬件资源上,得到中间算力部署结果。
步骤C2、基于多个硬件资源之间的通信参考信息、多个计算图之间的数据传输量以及中间算力部署结果,获取中间算力部署结果的通信代价。
其中,多个计算图之间的数据传输量通过下前述公式(7)来确定:
在公式(7)中,x、y为正整数,表示第x个子计算图和第y个子计算图;Cxy表示第x个子计算图和第y个子计算图之间的数据传输量(全部子计算图之间的数据传输量的组合可以理解为是一个数据传输量矩阵C);k为正整数,表示任意一条有向边;part(srck)表示有向边所连接的起始计算节点所在的子计算图;part(dstk)表示有向边所连接的终止计算节点所在的子计算图;δ函数表示若part(srck)和part(dstk)不一致,则表明k为切边(也即是被“删除”的边),为0,否则为1。
应理解,计算图部署的目标是通过合理分配硬件资源,让“距离远”的硬件资源之间的数据传输需求尽量少,而“距离近”的硬件资源之间数据传输需求尽量多(此处“距离”通过通信参考信息来体现)。因此,本申请定义了一种通信代价,通过最小化算力部署结果的通信代价来得到最终的算力部署结果。示意性地,算力部署结果的通信代价如下述公式(8)所示:
S=tr(PTCPDT) (8)
在公式(8)中,S表示通信代价,P为置换矩阵,C为数据传输量矩阵,D为硬件资源矩阵(参考前述公式(5))。
步骤C3、基于中间算力部署结果的通信代价,更新中间算力部署结果,以得到算力部署结果。
其中,服务器基于中间算力部署结果的通信代价,迭代更新中间算力部署结果,直至满足迭代截止条件,得到算力部署结果。其中,该迭代截止条件可以是迭代次数达到目标次数,也可以是通信代价小于目标阈值,对此不作限定。
在一些实施例中,服务器调用SAM算法来执行上述步骤C1至步骤C3,得到计算图的算力部署结果。下面对这种方式进行介绍,应理解,SAM算法是一种通过迭代更新来达到最优解的算法,包括下述几个步骤:
第一步、基于多个硬件资源和多个子计算图,随机生成初始算力部署结果,设置初始化温度为T,迭代次数为L。
第二步、基于上述公式(7)和公式(8),计算初始算力部署结果的通信代价。
第三步、从初始算力部署结果中随机选择两个子计算图交换其对应的硬件资源,得到新的算力部署结果,重新计算通信代价和增量ΔT。
第四步、若ΔT<0,接受新的算力部署结果,否则以exp(-ΔT/T)的概率接受新的算力部署结果,重复L次。
第五步、逐渐降低T,返回第三步,直到T降低到预设阈值,输出算力部署结果。
需要说明的是,上述通过SAM算法获取算力部署结果的过程仅为本申请实施例提供的一种示意性说明,其他凡是通过最小化通信代价来得到最终的算力部署结果的方法均能应用于上述过程中,本申请对此不作限定。
经过上述步骤305,服务器基于多个硬件资源之间的通信参考信息(硬件资源矩阵D)和多个子计算图的计算任务(数据传输量矩阵C),将多个子计算图分别部署至多个硬件资源上,以使多个硬件资源分别执行多个子计算图的计算任务,这一过程可参考图13,图13是本申请实施例提供的一种计算图的部署过程的示意图,通过这一过程,实现了针对计算图的自动部署过程,而且,由于考虑到了各个硬件资源之间的通信参考信息,因此能够有效节约算力资源,提升资源利用率。
在一些实施例中,服务器还能够调用仿真调度工具,对上述算力部署结果进行仿真调度,以便快速评估算力部署结果的性能,从而进一步调整算力部署结果,以达到进一步提升资源利用率的效果。下面对这种可选的实施方式进行介绍。
306、服务器调用仿真调度工具,对该算力部署结果进行仿真调度,得到仿真调度结果。
其中,该仿真调度结果包括多个该硬件资源执行多个该子计算图的计算任务的仿真调度时间和资源利用率。
示意性地,该仿真调度工具为基于通知列表(notify table)和事件驱动(event-driven)的分布式仿真调度框架,下面参考图14,对该仿真调度流程进行介绍。图14是本申请实施例提供的一种仿真调度流程的示意图,如图14所示,在仿真调度过程中,调度器维护下述四个列表:
计算图(task graph)列表,用于存储子计算图的任务;
解除列表(release-list),用于存储已经解除依赖关系的任务的标识,应理解,计算图中的任务是依次执行的,随着某一任务的执行完毕,该任务与其他任务之间的依赖关系随即解除,这些其他任务也即是已经解除依赖关系的任务;
执行列表(ontheFly-list),用于存储正在执行的任务的标识;
提交列表(commit-list),用于存储已执行完毕的任务的标识。
在仿真调度过程中,包括下述几个步骤:1、提交列表向调度器发送预计任务完成时间(EstTime);2、接收调度器返回的针对仿真调度时间(wall-clock)的通知消息;3、提交列表执行任务并提交;4、向调度器发送针对执行完毕的任务的通知消息;5、调度器基于执行完毕的任务,更新计算图列表;6、在计算图列表更新后,更新解除列表;7、将解除列表中的目标数量个任务分配至执行列表中;8、更新执行列表中的任务;9、更新预计任务完成时间。应理解,上述过程通过循环迭代执行,直至最后一个任务提交完成后,调度器生成的仿真调度时间即为整个计算图的仿真调度时间。
307、服务器基于该仿真调度结果,调整该算力部署结果。
其中,服务器基于该仿真调度结果,调整算力部署结果,基于调整后的算力部署结果,重新进行仿真调度,通过这种迭代调整的方式,直至得到符合条件的算力部署结果。例如,迭代调整的次数达到预设次数,或者,调整后的算力部署结果的仿真调度结果符合要求,等等,对此不作限定。
经过上述步骤306和步骤307,通过基于事件驱动的分布式调度仿真,可以快速评估调度性能,评估结果可靠,从而进一步调整算力部署结果,以达到进一步提升资源利用率的效果。应理解,在得到最终的算力部署结果后,能够按照该算力部署结果,将目标程序的计算图真正部署到目标硬件的多个硬件资源上去执行,从而有效节约算力资源,提升了资源利用率。反观相关技术,以MPI+OMP模式或者MPI+FF-Graph模式为例(FF为函数流(function flow)的简称),这些计算图的处理方式并不是基于完整的计算图建模,放弃了很多多个硬件资源之间的并行机会,而且MPI对编程人员存在较高的编程门槛,计算图生成形式和过程复杂,效率较低,导致算力资源消耗较多。
需要说明的是,在一些实施例中,上述步骤301至步骤307为离线处理过程,在另一些实施例中,服务器还能够在线执行上述步骤301至步骤307,并基于在线运行过程中硬件资源的占用信息,在线调整算 力部署结果,以达到实时提升资源利用率的效果,当然,还能够将离线和在线方法混合,离线生成算力部署结果的基线,基于在线运行过程中硬件资源的占用信息,再进行算力部署结果的微调,本申请实施例对此不作限定。
下面结合图15和图16,对上述步骤301至步骤307所示的计算图的处理方法进行总结。
参考图15,图15是本申请实施例提供的一种计算图的处理方法的示意图。如图15所示,本申请实施例提供的计算图的处理方法介于高级语言与底层任务调度执行之间,属于资源分配与部署范畴。其总体框架包含三个子框架:建模框架1501、部署框架1502与仿真调度框架1503,三个子框架互为递进关系。其中,建模框架1501包括:计算图构建(也即前述步骤301)、计算图结构标签(也即前述步骤302)以及硬件建模(也即前述步骤303)。通过将目标程序的代码文件编译转化为数据文件和任务文件,构建得到目标程序的计算图,并对计算图的结构与硬件资源的算力进行建模,为部署框架1502提供算力部署依据。部署框架1502包括:自动化切分(也即前述步骤304)和自动化部署(也即前述步骤305),基于目标硬件中多个硬件资源的算力建模结果和计算图的结构标签,对计算图进行切分,将切分后得到的多个子计算图部署至多个硬件资源上,使得各个子计算图适配于不同硬件资源并且通信代价最小,为仿真调度框架1503做准备。仿真调度框架1503包括分布式调度仿真(也即前述步骤306和步骤307),这是基于事件驱动的分布式调度方法,通过仿真各个硬件资源的运行状态以及同步消息,得到仿真调度结果,并将仿真调度结果作为整体计算图部署性能的依据。最后,将算力部署结果导入任务调度执行,可进行真实环境下的调度性能验证。
接下来,参考图16,图16是本申请实施例提供的一种计算图的处理方法的示意图,如图16所示,该计算图的处理方法流程介于高级语言和调度之间。其中硬件建模单独建模,为切分和部署提供算力与通信代价依据。
基于前述介绍可知,本申请实施例提供的计算图的处理方法能够应用于神经网络的训练场景,示意性地,参考图17,图17是本申请实施例提供的一种神经网络的计算图的示意图。如图17所示,该神经网络为Megatron神经网络,其函数流(function flow,FF)计算图包括前向传播(forward propagation,FP)和反向传播(backward propagation,BP),共72个编码器。参考图18,图18是本申请实施例提供的一种人工部署计算图的示意图,如图18所示,人工部署时,在目标硬件包括8个硬件资源的情况下,在每个硬件资源上部署9个编码器,也即是将72个编码器顺序部署至8个硬件资源上去执行。示意性地,参考图19,图19是本申请实施例提供的一种人工部署和本申请方案的对比示意图,如图19所示,对图17所示的神经网络的计算图采用人工部署的方式时,各个硬件资源(如芯片die)的资源利用率平均稳定在68%,而采用本申请方案所示的自动部署的方式时,各个硬件资源的资源利用率相比人工部署方式均有不同程度的提升,综合提升了2.4%,可见采用本申请方案,能够有效节约算力资源,提升资源利用率。
需要说明的是，本申请实施例提供的计算图的处理方法还能够应用于HPL场景，相比神经网络的训练场景，HPL的计算图结构通常更为复杂，往往需要通过折叠图、动态图的方式展现，而本申请实施例提供的计算图的处理方法能够应用于折叠图、动态图等不同切分场景和需求，可做到多层级高响应的部署，从而提高HPL场景下的资源利用率。
综上,在本申请实施例提供的计算图的处理方法中,对于目标程序的计算图,基于目标硬件中多个硬件资源的数量,将该计算图切分为多个子计算图,从而根据各个硬件资源之间的通信参考信息,将多个子计算图的计算任务分别部署到多个硬件资源上去执行,得到计算图的算力部署结果。在这一过程中,由于对完整的计算图进行了切分,且算力部署过程中涉及的通信参考信息能够指示硬件资源之间进行数据传输所耗费的通信资源,因此最终得到的算力部署结果能够有效节约算力资源,提升资源利用率。
图20是本申请实施例提供的一种计算图的处理装置的结构示意图。该计算图的处理装置可以通过软件、硬件或者两者的结合实现前述计算图的处理方法。如图20所示,该计算图的处理装置包括计算图切分模块2001和算力部署模块2002。
计算图切分模块2001,用于基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到该目标程序的多个子计算图,该计算图包括多个计算节点和有向边,该计算节点指示该目标程序的计算任务,该有向边指示计算节点所指示的计算任务之间的数据流向;
算力部署模块2002,用于基于多个该硬件资源之间的通信参考信息和多个该子计算图的计算任务,获 取该计算图的算力部署结果,该通信参考信息指示硬件资源之间数据传输所耗费的通信资源,该算力部署结果指示多个该硬件资源所执行的多个该子计算图的计算任务。
在一些实施例中,计算图切分模块2001,用于:
获取该计算图的节点权重和该计算图的有向边权重,该节点权重指示计算节点所指示的计算任务在该目标程序中的重要程度,该有向边权重指示有向边所指示的数据流向在该目标程序中的重要程度;
基于多个该硬件资源的数量、该计算图的节点权重以及该计算图的有向边权重,对该计算图进行切分,得到多个该子计算图,以使切分后的多个该子计算图的数量等于多个该硬件资源的数量,且该子计算图中计算节点和有向边的重要程度符合目标条件。
在一些实施例中,该装置还包括权重确定模块,该权重确定模块用于:
以该计算图中的父计算节点为起点,对该多个计算节点进行第一拓扑排序,得到第一排序结果,该第一排序结果指示各个计算节点所属的第一层级;
以该计算图中的子计算节点为起点,对该多个计算节点进行第二拓扑排序,得到第二排序结果,该第二排序结果指示各个计算节点所属的第二层级;
基于该第一排序结果和该第二排序结果,确定该计算图的节点权重和该计算图的有向边权重。
在一些实施例中,该权重确定模块用于:
基于目标计算节点所属的第一层级与第二层级之间的差值,确定该目标计算节点的节点松弛度,基于该节点松弛度、该目标计算节点所指示的计算任务的数据处理量以及硬件性能参考值,确定该目标计算节点的节点权重,该节点松弛度指示该目标计算节点在该计算图中的重要程度,该目标计算节点为任意一个计算节点;
基于目标有向边所连接的起始计算节点所属的第一层级和终止计算节点所属的第二层级之间的差值,确定该目标有向边的有向边松弛度,基于该有向边松弛度和该目标有向边所指示的数据传输量,确定该目标有向边的有向边权重,该有向边松弛度指示该目标有向边在该计算图中的重要程度,该目标有向边为任意一条有向边。
在一些实施例中,该算力部署模块2002,用于:
基于多个该硬件资源和多个该子计算图,获取该计算图的中间算力部署结果;
基于多个该硬件资源之间的通信参考信息、多个该计算图之间的数据传输量以及该中间算力部署结果,获取该中间算力部署结果的通信代价;
基于该中间算力部署结果的通信代价,更新该中间算力部署结果,以得到该算力部署结果。
在一些实施例中,该装置还包括获取模块,该获取模块,用于:
基于多个该硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息,获取多个该硬件资源之间的通信参考信息。
在一些实施例中,该装置还包括计算图生成模块,该计算图生成模块用于:
对该目标程序的代码文件进行编译处理,得到该目标程序的数据文件和任务文件,该数据文件包括该目标程序的数据特征,该任务文件包括该目标程序的任务特征;
基于该数据文件和该任务文件,生成该计算图。
在一些实施例中,该计算图生成模块还用于:
在该计算图还包括多个搬运节点的情况下,基于该任务文件,删除该计算图中的多个该搬运节点,该搬运节点指示该目标程序的数据搬运任务。
在一些实施例中,该装置还包括仿真调度模块,该仿真调度模块用于:
调用仿真调度工具,对该算力部署结果进行仿真调度,得到仿真调度结果,该仿真调度结果包括多个该硬件资源执行多个该子计算图的计算任务的仿真调度时间和资源利用率;
基于该仿真调度结果,调整该算力部署结果。
在本申请实施例提供的计算图的处理装置中,对于目标程序的计算图,基于目标硬件中多个硬件资源的数量,将该计算图切分为多个子计算图,从而根据各个硬件资源之间的通信参考信息,将多个子计算图的计算任务分别部署到多个硬件资源上去执行,得到计算图的算力部署结果。在这一过程中,由于对完整的计算图进行了切分,且算力部署过程中涉及的通信参考信息能够指示硬件资源之间进行数据传输所耗费的通信资源,因此最终得到的算力部署结果能够有效节约算力资源,提升资源利用率。
另外,在上述计算图的处理装置中,计算图切分模块2001和算力部署模块2002均可以通过软件实现,或者可以通过硬件实现。示例性的,接下来以计算图切分模块2001为例,介绍计算图切分模块2001的实现方式。类似的,算力部署模块2002以及其他模块的实现方式可以参考计算图切分模块2001的实现方式。
模块作为软件功能单元的一种举例,计算图切分模块2001可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,计算图切分模块2001可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。
同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个区域(region)内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。
模块作为硬件功能单元的一种举例,计算图切分模块2001可以包括至少一个计算设备。或者,计算图切分模块2001也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
计算图切分模块2001包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。计算图切分模块2001包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,计算图切分模块2001包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,该多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。
需要说明的是,在其他实施例中,计算图切分模块2001可以用于执行计算图的处理方法中的任意步骤,即,计算图切分模块2001和算力部署模块2002负责实现的步骤可根据需要指定,通过计算图切分模块2001和算力部署模块2002分别实现计算图的处理方法中不同的步骤来实现计算图的处理装置的全部功能。另外,上述实施例提供的计算图的处理装置与计算图的处理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。还应理解,尽管以下描述使用术语第一、第二等来描述各种元素,但这些元素不应受术语的限制。这些术语只是用于将一元素与另一元素区别分开。例如,在不脱离各种所述示例的范围的情况下,第一排序结果可以被称为第二排序结果,并且类似地,第二排序结果可以被称为第一排序结果。第一排序结果和第二排序结果都可以是排序结果,并且在某些情况下,可以是单独且不同的排序结果。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个排序结果是指两个或两个以上的排序结果。
以上描述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以程序结构信息的形式实现。该程序结构信息包括一个或多个程序指令。在计算设备上加载和执行该程序指令时,全部或部分地产生按照本申请实施例中的流程或功能。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请 进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (21)

  1. 一种计算图的处理方法,其特征在于,所述方法包括:
    基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到所述目标程序的多个子计算图,所述计算图包括多个计算节点和有向边,所述计算节点指示所述目标程序的计算任务,所述有向边指示计算节点所指示的计算任务之间的数据流向;
    基于多个所述硬件资源之间的通信参考信息和多个所述子计算图的计算任务,获取所述计算图的算力部署结果,所述通信参考信息指示硬件资源之间数据传输所耗费的通信资源,所述算力部署结果指示多个所述硬件资源所执行的多个所述子计算图的计算任务。
  2. 根据权利要求1所述的方法,其特征在于,所述基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到所述目标程序的多个子计算图,包括:
    获取所述计算图的节点权重和所述计算图的有向边权重,所述节点权重指示计算节点所指示的计算任务在所述目标程序中的重要程度,所述有向边权重指示有向边所指示的数据流向在所述目标程序中的重要程度;
    基于多个所述硬件资源的数量、所述计算图的节点权重以及所述计算图的有向边权重,对所述计算图进行切分,得到多个所述子计算图,以使切分后的多个所述子计算图的数量等于多个所述硬件资源的数量,且所述子计算图中计算节点和有向边的重要程度符合目标条件。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    以所述计算图中的父计算节点为起点,对所述多个计算节点进行第一拓扑排序,得到第一排序结果,所述第一排序结果指示各个计算节点所属的第一层级;
    以所述计算图中的子计算节点为起点,对所述多个计算节点进行第二拓扑排序,得到第二排序结果,所述第二排序结果指示各个计算节点所属的第二层级;
    基于所述第一排序结果和所述第二排序结果,确定所述计算图的节点权重和所述计算图的有向边权重。
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述第一排序结果和所述第二排序结果,确定所述计算图的节点权重和所述计算图的有向边权重,包括:
    基于目标计算节点所属的第一层级与第二层级之间的差值,确定所述目标计算节点的节点松弛度,基于所述节点松弛度、所述目标计算节点所指示的计算任务的数据处理量以及硬件性能参考值,确定所述目标计算节点的节点权重,所述节点松弛度指示所述目标计算节点在所述计算图中的重要程度,所述目标计算节点为任意一个计算节点;
    基于目标有向边所连接的起始计算节点所属的第一层级和终止计算节点所属的第二层级之间的差值,确定所述目标有向边的有向边松弛度,基于所述有向边松弛度和所述目标有向边所指示的数据传输量,确定所述目标有向边的有向边权重,所述有向边松弛度指示所述目标有向边在所述计算图中的重要程度,所述目标有向边为任意一条有向边。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述基于多个所述硬件资源之间的通信参考信息和多个所述子计算图的计算任务,获取所述计算图的算力部署结果,包括:
    基于多个所述硬件资源和多个所述子计算图,获取所述计算图的中间算力部署结果;
    基于多个所述硬件资源之间的通信参考信息、多个所述计算图之间的数据传输量以及所述中间算力部署结果,获取所述中间算力部署结果的通信代价;
    基于所述中间算力部署结果的通信代价,更新所述中间算力部署结果,以得到所述算力部署结果。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述方法还包括:
    基于多个所述硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息,获取多个所述硬件资源之间的通信参考信息。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述方法还包括:
    对所述目标程序的代码文件进行编译处理,得到所述目标程序的数据文件和任务文件,所述数据文件包括所述目标程序的数据特征,所述任务文件包括所述目标程序的任务特征;
    基于所述数据文件和所述任务文件,生成所述计算图。
  8. 根据权利要求7所述的方法,其特征在于,所述方法还包括:
    在所述计算图还包括多个搬运节点的情况下,基于所述任务文件,删除所述计算图中的多个所述搬运节点,所述搬运节点指示所述目标程序的数据搬运任务。
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,所述方法还包括:
    调用仿真调度工具,对所述算力部署结果进行仿真调度,得到仿真调度结果,所述仿真调度结果包括多个所述硬件资源执行多个所述子计算图的计算任务的仿真调度时间和资源利用率;
    基于所述仿真调度结果,调整所述算力部署结果。
  10. 一种计算图的处理装置,其特征在于,所述装置包括:
    计算图切分模块,用于基于目标硬件中多个硬件资源的数量,对目标程序的计算图进行切分,得到所述目标程序的多个子计算图,所述计算图包括多个计算节点和有向边,所述计算节点指示所述目标程序的计算任务,所述有向边指示计算节点所指示的计算任务之间的数据流向;
    算力部署模块,用于基于多个所述硬件资源之间的通信参考信息和多个所述子计算图的计算任务,获取所述计算图的算力部署结果,所述通信参考信息指示硬件资源之间数据传输所耗费的通信资源,所述算力部署结果指示多个所述硬件资源所执行的多个所述子计算图的计算任务。
  11. 根据权利要求10所述的装置,其特征在于,所述计算图切分模块,用于:
    获取所述计算图的节点权重和所述计算图的有向边权重,所述节点权重指示计算节点所指示的计算任务在所述目标程序中的重要程度,所述有向边权重指示有向边所指示的数据流向在所述目标程序中的重要程度;
    基于多个所述硬件资源的数量、所述计算图的节点权重以及所述计算图的有向边权重,对所述计算图进行切分,得到多个所述子计算图,以使切分后的多个所述子计算图的数量等于多个所述硬件资源的数量,且所述子计算图中计算节点和有向边的重要程度符合目标条件。
  12. 根据权利要求11所述的装置,其特征在于,所述装置还包括权重确定模块,所述权重确定模块用于:
    以所述计算图中的父计算节点为起点,对所述多个计算节点进行第一拓扑排序,得到第一排序结果,所述第一排序结果指示各个计算节点所属的第一层级;
    以所述计算图中的子计算节点为起点,对所述多个计算节点进行第二拓扑排序,得到第二排序结果,所述第二排序结果指示各个计算节点所属的第二层级;
    基于所述第一排序结果和所述第二排序结果,确定所述计算图的节点权重和所述计算图的有向边权重。
  13. 根据权利要求12所述的装置,其特征在于,所述权重确定模块用于:
    基于目标计算节点所属的第一层级与第二层级之间的差值,确定所述目标计算节点的节点松弛度,基于所述节点松弛度、所述目标计算节点所指示的计算任务的数据处理量以及硬件性能参考值,确定所述目标计算节点的节点权重,所述节点松弛度指示所述目标计算节点在所述计算图中的重要程度,所述目标计算节点为任意一个计算节点;
    基于目标有向边所连接的起始计算节点所属的第一层级和终止计算节点所属的第二层级之间的差值,确定所述目标有向边的有向边松弛度,基于所述有向边松弛度和所述目标有向边所指示的数据传输量,确定所述目标有向边的有向边权重,所述有向边松弛度指示所述目标有向边在所述计算图中的重要程度,所述目标有向边为任意一条有向边。
  14. 根据权利要求10至13中任一项所述的装置,其特征在于,所述算力部署模块,用于:
    基于多个所述硬件资源和多个所述子计算图,获取所述计算图的中间算力部署结果;
    基于多个所述硬件资源之间的通信参考信息、多个所述计算图之间的数据传输量以及所述中间算力部署结果,获取所述中间算力部署结果的通信代价;
    基于所述中间算力部署结果的通信代价,更新所述中间算力部署结果,以得到所述算力部署结果。
  15. 根据权利要求10至14中任一项所述的装置,其特征在于,所述装置还包括获取模块,所述获取模块,用于:
    基于多个所述硬件资源之间的连接关系、带宽信息、时延信息、路线信息以及数据转运信息,获取多个所述硬件资源之间的通信参考信息。
  16. 根据权利要求10至15中任一项所述的装置,其特征在于,所述装置还包括计算图生成模块,所述计算图生成模块用于:
    对所述目标程序的代码文件进行编译处理,得到所述目标程序的数据文件和任务文件,所述数据文件包括所述目标程序的数据特征,所述任务文件包括所述目标程序的任务特征;
    基于所述数据文件和所述任务文件,生成所述计算图。
  17. 根据权利要求16所述的装置,其特征在于,所述计算图生成模块还用于:
    在所述计算图还包括多个搬运节点的情况下,基于所述任务文件,删除所述计算图中的多个所述搬运节点,所述搬运节点指示所述目标程序的数据搬运任务。
  18. 根据权利要求10至17中任一项所述的装置,其特征在于,所述装置还包括仿真调度模块,所述仿真调度模块用于:
    调用仿真调度工具,对所述算力部署结果进行仿真调度,得到仿真调度结果,所述仿真调度结果包括多个所述硬件资源执行多个所述子计算图的计算任务的仿真调度时间和资源利用率;
    基于所述仿真调度结果,调整所述算力部署结果。
  19. 一种计算设备,其特征在于,所述计算设备包括处理器和存储器,所述存储器用于存储至少一段程序代码,所述至少一段程序代码由所述处理器加载并执行如权利要求1至权利要求9中任一项所述的计算图的处理方法。
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质用于存储至少一段程序代码,所述至少一段程序代码用于执行如权利要求1至权利要求9中任一项所述的计算图的处理方法。
  21. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算设备上运行时,使得所述计算设备执行如权利要求1至权利要求9中任一项所述的计算图的处理方法。
PCT/CN2023/103982 2022-11-07 2023-06-29 计算图的处理方法、装置、设备及存储介质 WO2024098793A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211387594.7A CN117993456A (zh) 2022-11-07 2022-11-07 计算图的处理方法、装置、设备及存储介质
CN202211387594.7 2022-11-07

Publications (1)

Publication Number Publication Date
WO2024098793A1 true WO2024098793A1 (zh) 2024-05-16

Family

ID=90886190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103982 WO2024098793A1 (zh) 2022-11-07 2023-06-29 计算图的处理方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN117993456A (zh)
WO (1) WO2024098793A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220321A1 (en) * 2019-03-27 2019-07-18 Intel Corporation Automated resource provisioning using double-blinded hardware recommendations
CN114138284A (zh) * 2021-12-07 2022-03-04 北京奇艺世纪科技有限公司 模型部署处理方法、装置、电子设备及存储介质
CN114626552A (zh) * 2022-03-24 2022-06-14 阿里巴巴(深圳)技术有限公司 机器学习模型的切分方法和装置
CN115099399A (zh) * 2022-06-27 2022-09-23 清华大学 神经网络模型部署方法、装置、电子设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220321A1 (en) * 2019-03-27 2019-07-18 Intel Corporation Automated resource provisioning using double-blinded hardware recommendations
CN114138284A (zh) * 2021-12-07 2022-03-04 北京奇艺世纪科技有限公司 模型部署处理方法、装置、电子设备及存储介质
CN114626552A (zh) * 2022-03-24 2022-06-14 阿里巴巴(深圳)技术有限公司 机器学习模型的切分方法和装置
CN115099399A (zh) * 2022-06-27 2022-09-23 清华大学 神经网络模型部署方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN117993456A (zh) 2024-05-07

Similar Documents

Publication Publication Date Title
Ben-Nun et al. Groute: An asynchronous multi-GPU programming model for irregular computations
AU2019284011B2 (en) Data processing method and related products
Cui et al. A genetic algorithm based data replica placement strategy for scientific applications in clouds
US9589069B2 (en) Platform for continuous graph update and computation
US11016673B2 (en) Optimizing serverless computing using a distributed computing framework
WO2018019232A1 (zh) 流计算方法、装置及系统
US8856060B2 (en) Creating stream processing flows from sets of rules
Yu et al. Joint optimization of service request routing and instance placement in the microservice system
US20220198296A1 (en) User context migration based on computation graph in artificial intelligence application executing in edge computing environment
JP2014525640A (ja) 並列処理開発環境の拡張
KR20110057070A (ko) 이벤트 처리 네트워크
WO2022222834A1 (zh) 一种数据处理方法以及装置
Carnero et al. Managing and deploying distributed and deep neural models through Kafka-ML in the cloud-to-things continuum
Goudarzi et al. Design of a universal logic block for fault-tolerant realization of any logic operation in trapped-ion quantum circuits
Bin Khunayn et al. Exploiting data dependency to mitigate stragglers in distributed spatial simulation
CN111951112A (zh) 基于区块链的智能合约执行方法、终端设备和存储介质
Cheng et al. Identification of influential modules considering design change impacts based on parallel breadth-first search and bat algorithm
WO2024098793A1 (zh) 计算图的处理方法、装置、设备及存储介质
CN116775041A (zh) 基于流计算框架和rete算法的大数据实时决策引擎
Richthammer et al. Architecture decomposition in system synthesis of heterogeneous many-core systems
Zeliu et al. MapReduce rationality verification based on object Petri net
Luo et al. Going with the flow: Real-time max-flow on asynchronous dynamic graphs
Bengre et al. A learning-based scheduler for high volume processing in data warehouse using graph neural networks
Deng et al. DAG scheduling for heterogeneous systems using biogeography-based optimization
Uscumlic et al. Design space exploration with deterministic latency guarantees for crossbar mpsoc architectures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23887489

Country of ref document: EP

Kind code of ref document: A1