CN111666151B - Topological graph conversion system and method thereof - Google Patents

Topological graph conversion system and method thereof Download PDF

Info

Publication number
CN111666151B
CN111666151B CN202010403719.5A CN202010403719A CN111666151B CN 111666151 B CN111666151 B CN 111666151B CN 202010403719 A CN202010403719 A CN 202010403719A CN 111666151 B CN111666151 B CN 111666151B
Authority
CN
China
Prior art keywords
node
task
task node
host
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010403719.5A
Other languages
Chinese (zh)
Other versions
CN111666151A (en
Inventor
袁进辉
柳俊丞
牛冲
李新奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd filed Critical Beijing Oneflow Technology Co Ltd
Priority to CN202010403719.5A priority Critical patent/CN111666151B/en
Publication of CN111666151A publication Critical patent/CN111666151A/en
Application granted granted Critical
Publication of CN111666151B publication Critical patent/CN111666151B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications

Abstract

The application discloses a method for converting an operation logic node topological graph into a task node topological graph, which comprises the following steps: the method comprises the steps that through an operation task node deployment component, tasks of any operation logic node in an operation logic node topological graph are segmented to appointed calculation resources based on task configuration data in task description input by a user on the basis of given calculation resources, so that one or more operation task nodes corresponding to each operation logic node are generated, and position marks corresponding to the appointed calculation resources are given to each operation task node; and inserting one or more transport task nodes between the first operation task node and a second operation task node serving as an upstream operation task node of the transport task node by the transport task node insertion component when the first position mark of the first operation task node and the second position mark of the second operation task node are different, so that a complete task node topological graph with the transport task nodes is obtained.

Description

Topological graph conversion system and method thereof
The application is a divisional application of an application patent application named as a topology map conversion system and a method thereof, wherein the application date is 2020, 02, 13 and the national application number is 202010090334.8.
Technical Field
The present disclosure relates to a data processing technique. More particularly, the present disclosure relates to a conversion system and method for converting an operational logical node-support topology to a task node topology.
Background
With the popularity of distributed computing, large jobs may be split to deploy different portions of data to various computing devices of different distributed data processing systems for processing, such that, during processing of a particular job, computing intermediate parameters or results deployed on one computing device may become input data for computing tasks on another computing device, which may cause call overhead for data migration between computing devices in order to achieve data synchronization of the intermediate parameters. Network communication call is often a bottleneck, and then the performance of network performance communication is poor, which affects the acceleration ratio and expansibility of the multi-machine distributed data processing architecture.
As the computing functions of various single computing devices themselves become more and more powerful, they have been in an extremely serious state in terms of increasing the computing speed of the computing devices. In particular, as the computation speed increases, the speed of data recall has fallen behind the computation speed of data. Thus, the invocation or migration of data becomes a bottleneck that restricts the computing device from processing the data. In fact, most of the developers and users of dedicated AI chips typically focus only on the power consumption and efficiency of the computing portion, such as how to design an AI chip to perform matrix operations more efficiently, but much less on the demands of data migration, data forwarding and routing, which is significant from both power consumption and latency when large-scale tasks are performed cooperatively based on multiple chips.
Thus, in existing systems, migration of data migration between distributed devices costs as much time and cost as computation. How to reduce the communication overhead and "hide" the time during the system operation, so that the system can fully put the hardware resources into shortening the calculation time, is the key to improving the system efficiency. Furthermore, modifying the data routing pattern in flexible parallel patterns (data parallel, model parallel, even hybrid parallel) is actually very complex. The existing deep learning framework only realizes the data flow diagram calculation operation in the model, and does not perform the data migration operation in the data flow diagram of the model. The result of this is that the data flow graph does not have these operations encoded therein, which does not reveal the advantages of the data flow engine being automatically parallel, and thus the software programming effort is trapped in so-called callback traps at the time of synchronous programming.
Therefore, how to make data handling or data exchange look like data operation in a distributed data processing architecture is important, so that data handling or data exchange is regarded as an equal citizen like data processing and computing, so that data handling can be implemented in static deployment, data handling tasks are fixed in a specific handling execution body to be implemented, asynchronous communication in data exchange is implemented, so that the expenditure of time of two calls is reduced, so that data handling and routing can be implemented by a special chip as possible, so that the efficiency of the whole system can be maximized, which is an urgent problem to be solved in the field of large-scale data processing.
Disclosure of Invention
It is an object of the present disclosure to provide a solution to at least one of the above problems. Specifically, the present disclosure provides a method for converting an operational logical node topology to a task node topology, comprising: the method comprises the steps that through an operation task node deployment component, tasks of any operation logic node in an operation logic node topological graph are segmented to appointed calculation resources based on task configuration data in task description input by a user on the basis of given calculation resources, so that one or more operation task nodes corresponding to each operation logic node are generated, and position marks corresponding to the appointed calculation resources are given to each operation task node; and inserting one or more transport task nodes between the first operation task node and a second operation task node serving as an upstream operation task node of the transport task node by the transport task node insertion component when the first position mark of the first operation task node and the second position mark of the second operation task node are different, so that a complete task node topological graph with the transport task nodes is obtained.
The method for converting an operation logic node topological graph into a task node topological graph according to the present disclosure, wherein when a first position mark indicates a first computing device of a first host and a second position mark indicates the first host, the carrying task node inserting component inserts only one carrying task node between the first operation task node and the second operation task node and assigns the inserted carrying task node with the first position mark.
The method for converting an operational logical node topology into a task node topology according to the present disclosure, wherein when a first location indicator indicates a first host and a second location indicator indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first operational task node and the second operational task node and assigns the inserted transport task node a second location indicator.
According to the method for converting the operation logic node topological graph into the task node topological graph, when a first position mark is designated as a first host and a second position mark is designated as a second host, the carrying task node inserting component inserts only one carrying task node between the first operation task node and the second operation task node and endows the inserted carrying task node with the first position mark.
According to the method for converting the operation logic node topological graph into the task node topological graph, when a first position mark indicates as a first computing device of a first host and a second position mark indicates as a third computing device of the first host or a second host, the carrying task node inserting component inserts two carrying task nodes between the first operation task node and the second operation task node, and assigns the first position mark to the first carrying task node inserted next to the first operation task node and assigns the second position mark to the other inserted carrying task node.
According to the method for converting the operation logic node topological graph into the task node topological graph, when a first position mark indicates a first computing device of a first host and a second position mark indicates a fourth computing device of a second host, the carrying task node inserting component sequentially inserts a first carrying task node, a second carrying task node and a third carrying task node according to the sequence from the first operation task node to the second operation task node, and assigns the first position mark to the first carrying task node, assigns the position mark indicating the first host to the second carrying task node and assigns the second position mark to the third carrying task node.
The method for converting the operation logic node topological graph into the task node topological graph according to the present disclosure, wherein the method further comprises selecting, by a logic distributed signature selection component in the operation task node deployment component, a logic distributed signature composed of a distributed descriptor of an input tensor and a distributed descriptor of an output tensor, designated by a source operation logic node in the operation logic node topological graph based on the task configuration data, from a candidate logic distributed signature set of each downstream operation logic node of each source operation logic node, a logic distributed signature with the minimum data handling cost as a logic distributed signature of each downstream operation logic node before the task of any operation logic node in the operation logic node topological graph is fragmented to the designated computing resource by the operation task node deployment component.
According to another aspect of the present disclosure, there is also provided a conversion system for converting an operational logical node topology into a task node topology, including: the operation task node deployment component is used for fragmenting tasks of any operation logic node in the operation logic node topological graph to appointed calculation resources based on task configuration data in task description input by a user on the basis of given calculation resources, so that one or more operation task nodes corresponding to each operation logic node are generated, and position marks corresponding to the appointed calculation resources are given to each operation task node; and a transport task node insertion component that inserts one or more transport task nodes between a first operation task node and a second operation task node as an upstream operation task node thereof when there is a different position marker between the first operation task node and the second operation task node, thereby obtaining a full task node topology map having transport task nodes.
A conversion system according to the present disclosure converts an operational logical node topology into a task node topology, wherein the transport task node insertion component inserts only one transport task node between the first and second operational task nodes when a first location marker indicates a first computing device of a first host and a second location marker indicates the first host, and assigns the inserted transport task node a first location marker.
According to the conversion system for converting the operation logic node topological graph into the task node topological graph, when the first position mark is designated as a first host and the second position mark is designated as a second computing device of the first host, only one carrying task node is inserted between the first operation task node and the second operation task node, and the second position mark of the inserted carrying task node is given.
According to the conversion system for converting the operation logic node topological graph into the task node topological graph, when the first position mark is designated as a first host and the second position mark is designated as a second host, only one transportation task node is inserted between the first operation task node and the second operation task node, and the first position mark is given to the inserted transportation task node.
According to the conversion system for converting the operation logic node topological graph into the task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates a third computing device of the first host or a second host, two operation task nodes are inserted between the first operation task node and the second operation task node, and the first position mark is assigned to the first operation task node inserted next to the first operation task node, and the second position mark is assigned to the other inserted operation task node.
According to the conversion system for converting the operation logic node topological graph into the task node topological graph, when the first position mark indicates the first computing device of the first host and the second position mark indicates the fourth computing device of the second host, the first, the second and the third carrying task nodes are sequentially inserted in the sequence from the first operation task node to the second operation task node, the first position mark is given to the first carrying task node, the position mark indicating the first host is given to the second carrying task node, and the second position mark is given to the third carrying task node.
The conversion system for converting an operation logic node topological graph into a task node topological graph according to the present disclosure, wherein the operation task node deployment component comprises a logic distributed signature selection component, which selects, as a logic distributed signature of each downstream operation logic node, a logic distributed signature with the smallest data handling cost from a candidate logic distributed signature set of each downstream operation logic node of each source operation logic node, the logic distributed signature composed of a distributed descriptor of an input tensor and a distributed descriptor of an output tensor, which is designated for a source operation logic node in the operation logic node topological graph based on the task configuration data, before the task of any operation logic node in the operation logic node topological graph is fragmented to the designated computing resource.
According to the conversion system and the method for converting the operation logic node topological graph into the task node topological graph, the running path of data can be known in advance from the global angle, so that the data carrying task nodes are deployed in advance, the carrying of the data can be realized in a static deployment mode, the data carrying task is fixed in a specific carrying execution body to be realized, asynchronous communication in data exchange is realized, and the expenditure of time for two calls is reduced. In particular, by pre-deploying the data handling task nodes from a global perspective, the defect that data scheduling and data calculation overlap cannot be realized due to data processing waiting and delay caused by dynamic scheduling on-line decision data migration in the prior art (the overlapping of data handling and calculation cannot be realized in the prior art) is eliminated. Because the handling task nodes are inserted between the operation task nodes, the data handling path is planned in advance, so that the handling role of each data is fixed, the source and the destination of the data and the operation task node object served by the handling task node are predetermined, the overlapping of handling and calculation can be realized in the whole system, and the explosion situation caused by resource exhaustion or resource non-planning in the flow control is solved.
In addition, the carrying task nodes are inserted in advance, so that the waiting process of operation can be eliminated, the operation equipment corresponding to the operation task nodes is always in an operation state, and the operation utilization rate is improved.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is a schematic diagram of a conversion system for converting an operational logical node topology into a task node topology according to the present disclosure.
Fig. 2 is a partial schematic diagram of a fully tasked node topology according to the present disclosure.
Fig. 3 is a schematic diagram illustrating the structure of a logically distributed signature of a select operation logical node according to the present disclosure.
FIG. 4 is a schematic diagram illustrating SBP signatures for downstream computing logical nodes selected according to the present disclosure.
Fig. 5 illustrates a first schematic diagram of a traffic data volume estimation unit estimating data traffic volumes generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 6 illustrates a second schematic diagram of a traffic data volume estimation unit estimating data traffic volumes generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 7 illustrates a third schematic diagram of a traffic data amount estimation unit estimating data traffic generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 8 illustrates a fourth schematic diagram of a traffic data amount estimation unit estimating data traffic generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 9 illustrates a fifth schematic diagram of a traffic data amount estimation unit estimating data traffic generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 10 illustrates a sixth schematic diagram of a traffic data amount estimation unit estimating data traffic generated between tensors of different distributed descriptors according to the present disclosure.
Detailed Description
The present invention is described in further detail below with reference to examples and drawings to enable those skilled in the art to practice the same and to refer to the description.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, one of the two possible location markers may be referred to hereinafter as a first location marker or a second location marker, and similarly the other of the two possible location markers may be referred to as a second location marker or a first logical location marker, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.
In order that those skilled in the art will better understand the present disclosure, the present disclosure will be described in further detail below with reference to the accompanying drawings and detailed description.
Fig. 1 is a schematic diagram of a conversion system for converting an operational logical node topology into a task node topology according to the present disclosure. As shown in fig. 1, a conversion system for converting an operational logical node topology into a task node topology according to the present disclosure includes an operational task node deployment component 10 and a transport task node insertion component 20. When obtaining the topology map of the operation task nodes, the operation task node deployment component 10 segments the task of any operation logic node in the topology map of the operation logic nodes to a designated computing resource based on task configuration data in task description input by a user on the basis of a given computing resource, thereby generating one or more operation task nodes corresponding to each operation logic node, and giving a position mark corresponding to the designated computing resource to each operation task node.
In particular, in a distributed computing system, it is common to include one or more hosts, each of which is connected to a plurality of computing devices, such as GPUs, TPUs, etc., dedicated to large-scale simple computing devices. When data parallel computing is required, large-scale data blocks to be processed are typically fragmented into multiple computing devices for parallel processing. In the case of a relatively large model, the model may also be typically partitioned and distributed to different computing devices for processing. For this purpose, when two devices are available on one HOST (HOST), for example GPU0 and GPU1, the data may be fragmented into two parts along the 0 th dimension of the data, distributed over GPU0 and GPU1 for parallel processing, and if the HOST number is H1, the position markers H1-GPU0 are assigned to the operation task nodes on GPU0 of the HOST H1 for the fragmentation of the operation logic node, and likewise the position markers H1-GPU1 are assigned to the operation task nodes on GPU1 of the HOST H1 for the fragmentation of the operation logic node. As shown in fig. 1, the arithmetic logic node E itself is initially provided with the position markers H1-2G, since it will be allocated to both GPUs of H1. After processing by the compute task node deployment component 10, its two sliced compute task nodes are E1 and E2, which are assigned position markers H1-GPU0 and H1-GPU1, respectively. After the same processing of the operation logic node A by the operation task node deployment component 10, two operation task nodes which are segmented into A1 and A2 are respectively endowed with position marks H1-GPU0 and H1-GPU1. After the downstream operation logic node B as the operation logic nodes a and E is also processed by the operation task node deployment component 10, two operation task nodes which are fragmented are B1 and B2, to which position marks H1-GPU0 and H1-GPU1 are assigned, respectively. By analogy, the arithmetic logic nodes C, D, F are located on two GPU computing cards of the host H2, so that after being processed by the arithmetic task node deployment assembly 10, the position marks of the respective arithmetic task nodes C1 and C2, D1 and D2, and F1 and F2 are respectively H2-GPU0 and H2-GPU1. By combining the task configuration data, the operational task node topology 102 is obtained.
After determining the operational task node topology 102 in the manner described above, the transport task node insertion component 20 inserts one or more transport task nodes between a first operational task node and a second operational task node that is an upstream operational task node thereof when the first location marker of the first operational task node and the second location marker of the second operational task node have different location markers therebetween, thereby obtaining a full task node topology with transport task nodes. Specifically, as shown in FIG. 1, transport task nodes E1-H1 and H1-B2 are interposed between the operational task nodes E1 and B2, transport task nodes E2-H1 and H1-B1 are interposed between the operational task nodes E2 and B1, transport task nodes A1-H1 and H1-B2 are interposed between the operational task nodes A1 and B2, and transport task nodes A2-H1 and H1-B1 are interposed between the operational task nodes A2 and B1. The full task node topology of fig. 1 is ultimately formed. It should be noted, however, that in fig. 1, the diagram is limited to the drawing, and only a part of the full task node topology is shown, that is, the first part 103-1 including the operation task nodes E, A and B inserted by the handling task nodes and the other parts are omitted. It should be noted, however, that where a direct access protocol is provided between different computing devices (e.g., GPUs) connected to the same host, such migration of data between computing devices under the same host may not be interposed between the transport task nodes referred to in this disclosure.
Since the position of the operation task node K is marked as the host H1, only one carrying task node B1-H1 or B2-H1 is inserted between the operation task node B1 or B2 and the operation task node K, namely, part or all of data which is required by the operation task node K and distributed in G0/H1 or G1/H1 is carried by the carrying task node B1-H1 or B2-H1 to the host H1. It should be noted, however, that in the case where a direct access protocol is provided between the host H1 and the computing device (e.g., GPU) to which it is connected, such data migration between the host and the computing device may not be interposed with the transport task nodes mentioned in the present disclosure.
Fig. 2 is a schematic diagram illustrating a portion of a full task node topology after insertion of a transport task node according to the present disclosure. Since the operation logic node C is distributed on the two GPUs 0 and 1 of the host H1 as shown in FIG. 2, and the downstream operation logic node D is distributed on the two GPUs 0 and 1 of the host H2, the positions of the respective operation task nodes C1 and C2 are marked as G0/H1 or G1/H1 and the positions of the operation task nodes D1 and D2 are marked as G0/H2 or G1/H2 as shown in FIG. 1. Therefore, when the input data required for the task node D1 is required to be from the task node C1, it is necessary to insert the transport task nodes C1 to H1, H1 to H2 and H2 to D1 between the task node C1 and the task node D1 as shown in fig. 2. If the input data required by the calculation task node D1 is also required from the calculation task node C2, then the transport task nodes C2-H1, H1-H2 and H2-D1 are also required to be inserted between the calculation task node C2 and the calculation task node D1. Similarly, when the input data required for the task node D2 is required to be from the task node C1, the transport task nodes C1-H1, H1-H2 and H2-D2 are required to be interposed between the task node C1 and the task node D2 as shown in fig. 2. If the input data required by the calculation task node D2 is also required from the calculation task node C2, then the transport task nodes C2-H1, H1-H2 and H2-D2 are also required to be inserted between the calculation task node C2 and the calculation task node D2. Similarly, where a direct access protocol is provided between a host H1 or H2 and the computing device (e.g., GPU) to which it is connected, such data migration between the host and the computing device may not be interposed with the transport task nodes referred to in this disclosure. Thus, only one transport task node H1-H2 needs to be inserted between the operational task nodes C1 or C2 and D1 or D2, i.e., one transport task node H1-H2 can be shared between C1 and C2 and D1 and D2. Although the second portion 103-2 of the full task node topology shown in fig. 2 shows four transport task nodes H1-H2 inserted for visual understanding and ease of description, in practice, the four transport task nodes H1-H2 may be one transport task node even in the absence of a direct access protocol between the host H1 or H2 and the computing device (e.g., GPU) to which it is connected. According to the present disclosure, when there is data migration across hosts, only one transport task node needs to be inserted between a pair of arithmetic logic nodes between a pair of hosts.
The transport task node insertion unit 20 inserts the transport task node, marks the position of the inserted transport task node, and marks the source address and destination address of the transport data, that is, the transport direction of the transport data. The name of each handling node is the source address, the destination address and the handling direction of the handling task node.
It should be noted, however, that in order to simplify and optimize the insertion of the handling nodes and shorten the path of data handling, the computing task node deployment component 10 optionally further comprises a logically distributed signature selection component 11, each computing task node further selecting a determined logically distributed signature from its plurality of candidate logically distributed signatures based on its type of computing operation. Specifically, the logical distributed signature selection module 11 selects, as the logical distributed signature of each downstream operation logical node, a logical distributed signature having the smallest data handling cost from among the candidate logical distributed signature sets of each downstream operation logical node of each source operation logical node, a logical distributed signature composed of the distributed descriptor of the input tensor and the distributed descriptor of the output tensor specified for the source operation logical node in the operation logical node topology based on the task configuration data before the task of any operation logical node in the operation logical node topology is fragmented to the specified computing resource. Thereby obtaining an operational task node topology 102 with logically distributed signatures.
Specifically, in order to obtain a better insertion result of the handling task node, the operation logic nodes of the present disclosure all include candidate logic distributed signature sets for different operation operations. Fig. 3 is a schematic diagram illustrating the structure of a logically distributed signature of a select operation logical node according to the present disclosure. A simple initial operational logical node topology 104 is shown schematically in fig. 3, where nodes A, B, C, D, E, F, L and K are shown. Other omitted alternatives are not shown. In actual data processing, the initial operational logical node topology 104 may be more complex. The initial operational logical node topology 104 contains the basic logical operational nodes that implement the computational tasks described by the user. The manner in which such an initial operational logical node topology 104 is generated is conventional in the art and is therefore not described in detail herein.
The various initial operational logical nodes in the initial operational logical node topology 104 each contain a plurality of SBP signatures. As the source operational logical node that has been configured with the SBP signature by the user or the initial operational logical node that has determined the unique SBP signature based on the user's task description, for example, SBP-1 for initial operational logical node A, SBP-2 for initial operational logical node C, and SBP-3 for initial operational logical node E. In the event that a unique SBP signature is not determined, the initial operational logical node typically contains some of its inherent candidate SBP signatures. The initial operational logical node B, as in FIG. 1, has multiple candidate SBP signatures, e.g., three, including SBP-1, SBP-2, and SBP-3, as shown later in FIG. 3. Other initial operational logical nodes also each have a different candidate SBP signature, not listed here. Different initial operation logic nodes will have different fixed candidate SBP signatures depending on the operation they are performing specifically.
An SBP signature according to the present disclosure is a signature that is applied in a distributed data processing system. In a distributed data processing system, because there are often cases of data parallelism, model parallelism, mixed parallelism, stream parallelism, and the like, tasks of adjacent operation logic nodes are often deployed on different computing devices at the same time, so that in an actual data processing process, intermediate parameters are exchanged between the computing devices, which causes a great deal of handling overhead. Although the handling nodes according to the present disclosure may be arranged directly from the distribution of the operational task nodes. However, in order to reduce the data transfer overhead, it is necessary to further refine the operation logical node topology map based on the initial operation logical node topology map 104, and in particular, to reduce the transfer overhead between the upstream and downstream operation logical nodes, and to minimize the change in the data distribution manner of the upstream and downstream operation logical nodes or the transfer path. To this end, the present disclosure designates a logically distributed signature for each of the compute logical nodes in order to obtain a better downstream compute logical node. The logical distributed signature is a signature of an operation logic node by using a distributed descriptor of tensors, wherein the distributed descriptor of each tensor describes the distribution mode of each tensor in the whole computing system, and mainly comprises a Segmentation (SPLIT) tensor descriptor, a BROADCAST (BROADCAST) tensor descriptor and a PARTIAL VALUE (PARTIAL VALUE) tensor descriptor.
In particular, a SPLIT (SPLIT) tensor descriptor is a SPLIT way of describing a tensor, for example, a data block is SPLIT in a specified dimension according to a user's description, and distributed to different computing devices for a specified computing process. If a data block is a two-dimensional data block, the data block is cut in its 0 th dimension, the distributed descriptors of the data tensors of a batch of data formed by the data block are S (0), and each logical data block obtains such a data tensor at its input is S (0). Similarly, if a block is a two-dimensional block, then the block is cut in its 1 st dimension, then the distributed descriptors of the data tensors of the batch of data formed by the block are S (1), and each logical block obtains such a data tensor at its input as S (1). Similarly, if the dimension of the task data to be processed is more, there will be more distributed descriptors, e.g., S (2), S (3) …, etc. Such mentioned data may be processed data or a model. If the data itself is cut, then data parallel processing is formed on the distributed data processing system, and if the model is split, then model parallel processing is formed on the distributed data processing system. If the input of the operation logic node is such a SPLIT (SPLIT) tensor descriptor, in the actual data processing process, if the data size of one tensor is T, and the tensor is to be distributed to four computing cards for data parallel computation, the data amount distributed to each card is one-fourth of the data amount, and the data amount on the whole four cards is T.
BROADCAST (BROADCAST) tensor descriptor is used to describe the way a tensor is published in a distributed system in a BROADCAST manner. In general, for a data processing system that performs only data parallelism, model data is typically broadcast to the respective computing devices, and thus broadcast tensor descriptors are used for the broadcast data input to the arithmetic logic nodes. In the actual data processing process, the data block size of the broadcasted data on each actual computing card is the same.
The PARTIAL VALUE (PARTIAL VALUE) tensor descriptor indicates that an input or output tensor of one operation logical node is a PARTIAL VALUE of a plurality of similar tensors. These partial values include partial sums (Ps), partial products (Pm), partial and results, partial maxima, and partial minima. Since data is typically processed in parallel for data, processing of the data on different devices is processing of portions of the data. For example, if some tensors are S (0) or S (1), then the result tensors are obtained on some computing devices, and the result tensors on these partial computing devices are combined to form a partial value tensor. Combining the same kind of data on all devices is the final output result.
The above-described distributed descriptors of the various tensors represent the manner in which the tensors are distributed in the distributed computing system, and the respective manners in which the tensors are distributed, whether as inputs and outputs to the operation logic node, also describe the operation data distribution description of the operation logic node. For descriptive convenience, this disclosure refers to such a distributed description Fu Jian as an "SBP descriptor".
To this end, with the generation of the initial operational logic node topology 104, the initial operational logic nodes of the present disclosure, i.e., some of the operational nodes, are also provided with respective input and output distributed descriptors that form a signature of the operational logic nodes, i.e., the signature of the operational logic nodes with tensor distributed descriptors. For convenience of description, the english initials of these three distributed descriptors are used to refer to this signature as "SBP signature".
Such descriptors may include at least three of S (0), B, and P, depending on the user' S description of the computing task and the data parallelism requirements in each distributed computing system. If there are multiple partitioning modes for the data and model, then each partitioning mode is added, then a descriptor is added. For each operational logical node, its signature contains various combinations of these descriptors. Thus, in a distribution system according to the present disclosure, there are at least three, and typically four, distributed descriptors, such as the following four SBP descriptors, S (0), S (1), P, and B. There may be more distributed descriptors, depending on the number of tensor dimensions. In the case of four SBP descriptors, multiple SBP signatures may be formed in a permutation and combination of inputs and outputs. Examples of some SBP signatures are listed below: (S (0), B) →S (0), (S (1), B) →S (1), P→P, B→B, (S (0), S (1))→P, S (0) →P, S (0) →S (0), S (0) →S (1),
P→b, etc. All SBP signatures are a result of various SBP descriptor combinations. For a matrix multiplication logical node, if its input tensor is cut in the first dimension, its output result tensor is also cut in the first dimension. In summary, S, B, P is a descriptor for describing the distribution of data blocks in a data processing system, and SBP signatures describe task operations of an operational logic node using multiple SBP descriptors. Each data block may have a plurality of SBP descriptors, and each operation logic node may represent a plurality of SBP signature cases. For example, the SBP-1 shown in FIG. 1 may be in the form of a signature of (S (0), B) →S (0), while the SBP-2 may be in the form of a signature of (S (1), B) →S (1). In practical applications, different signature forms may have different numbers, where the numbers are given only for convenience of description, and do not mean that each signature needs to be given a number, and there may be no number at all, and different forms of signatures may be distinguished from each other without a number.
Each initial operational logical node may be given an SBP signature as described above based on the task description used. Typical arithmetic logic nodes are some arithmetic operation nodes that perform a particular arithmetic operation and thus have a particular candidate SBP signature. It should be noted that not every operation logic node has the same SBP signature, and that the operation logic node that normally performs a multiplication operation does not have its SBP signature input tensor containing part and tensor, and therefore its SBP descriptor does not contain a distributed descriptor P. The candidate SBP signatures for the arithmetic logic nodes performing the addition operation may then include any combination of the various SBP descriptors with each other or with themselves. For example, the arithmetic logic node performing matrix multiplication, in the case of data-only parallelism, will typically have candidate SBP signatures of (S (0), B) →s (0), (S (1), B) →s (1), (S (0), S (1))→p, etc., but not only these, as technology advances, some of the previous signatures unsuitable for matrix multiplication may also be applied to matrix multiplication, which is merely an example. Thus, each initial operational logical node is accompanied by a candidate logical distributed signature set based on the task configuration data. Each logical distributed signature in the candidate set of logical distributed signatures specifies a distributed descriptor for each input tensor and a distributed descriptor for each output tensor of the initial operational logical node to which it belongs.
Each logical node in the initial logical node topology 104 will use what kind of SBP signature to determine the tensor or which kind of distributed tensor to use and which kind of distributed tensor to input, requiring further determination. Thus, starting from the source operational logical node in the initial operational logical node topology 104, when the logical labels or SBP labels of all upstream operational logical nodes (e.g., operational logical nodes a and E) of the current operational logical node (e.g., operational logical node B) have been determined, the traffic data amount estimation unit 111 calculates, for each candidate logical distributed signature of the operational logical node B, the cost of the traffic data required to transform the distributed descriptor of the tensor of each upstream operational logical node output into the distributed descriptor of the tensor of one of the candidate logical distributed signatures of the corresponding input of the operational logical node B, based on the distributed descriptors of the outputs of all upstream operational logical nodes of the operational logical node B. As shown in FIG. 3, the logical node B is operated with a number of candidate SBP signatures, such as SBP-1, SBP-2, and SBP-3. For example, SBP-1 may be in the form of a signature of (S (1), B) →S (1) or (S (1), P) →S (1), SBP-5 may be in the form of a signature of (S (0), B) →S (0), and SBP-3 may be in the form of B→B or S (0) →P. In each signature form, the left side of the arrow is the distributed descriptor of the input tensor and the right side of the arrow is the distributed descriptor of the output tensor. For convenience of description, the "tensor of the distribution descriptor S (0)" will be referred to simply as "S (0) tensor", the "tensor of the distribution descriptor B" will be referred to simply as "B tensor", the "tensor of the distribution descriptor P" will be referred to simply as "P tensor", and so on.
As shown in fig. 4, if the tag SBP-3 of the operation logical node E in the initial operation logical node topology 104 is in the form of "S (0) →s (0)", its output tensor distribution descriptor is S (0), and thus its output tensor is S (0) tensor. If the signature SBP-3 of the logical node E is in the form of "B→B" or "P→P", the distribution descriptor of the tensor it outputs is B or P, and thus its output tensor is B or P tensor. If the candidate signature SBP-2 of the arithmetic logic node B, i.e. (S (0), S (1)) → P "), is selected as the determined signature, its distribution descriptor of the input tensor at the first input of the output of the corresponding node E must be S (1), i.e. the first input must obtain one S (1) tensor, while its distribution descriptor of the input tensor at the second input of the output of the node a must be S (0), i.e. the second input must obtain one S (0) tensor. As shown in fig. 4, for example, the output tensor of the operation logic node a is P tensor. It is clear that at this point the P of the distribution descriptor of the output tensor of node a does not coincide with the S (0) distribution descriptor of the input tensor of the second input of node B, so that to make the arithmetic logic node B perform the correct arithmetic operation, it is necessary to transform the tensor of the distribution descriptor P of node a output into the tensor of the distribution descriptor S (0). Also, if the distribution descriptor of the tensor output by the node E is S (0), it is inconsistent with the distribution descriptor S (1) of the tensor input sheet of the first input terminal of the node B, and therefore, in order for the arithmetic logic node B to perform a correct arithmetic operation, it is necessary to transform the tensor of the distribution descriptor S (0) output by the node E into the tensor of the distribution descriptor S (1).
In a distributed computing system, since the operational tasks of the individual compute logical nodes, and in particular the compute tasks, are cut and distributed over the individual computing devices (e.g. compute card CPU, GPU or TPU), in order to finally obtain the correct result, the intermediate parameters need to be synchronized constantly, which involves an exchange of intermediate parameters between the different computing devices. When the SBP descriptor of the output tensor contained in the SBP signature of the last operational logic node is inconsistent with the SBP descriptor of the corresponding input tensor of the SBP signature of the current operational logic node, the output conversion is typically performed during actual operation, and this conversion process typically requires the acquisition of a portion of the data located on another computing device to form, together with the locally available data, the data required at the input of the current operational logic node, so as to conform to the distributed descriptor of the data tensor at the input of the current operational logic node. This process of retrieving a portion of data from another device can create a relatively large data handling overhead or cost. Thus, selecting different signatures for the current operational logical node may result in different data handling overheads or costs. For this reason, the transfer data amount estimation unit 111 estimates the data transfer overhead that will be generated for each candidate signature for each operation logic node for which the signature is not determined. For example, for an operational logical node B, the data handling costs that the operational logical node B would incur if one of the SBP signatures were employed are estimated for its three candidate SBP signatures, respectively. For the operational logical node B, selecting any one of the candidate SBP signatures may accomplish its operational tasks. But with different SBP signatures, the data handling costs incurred by their operation are different. Therefore, in order to minimize the cost of data handling during data processing, it is necessary to select a signature with the smallest data handling volume from among the candidate signatures of the respective arithmetic logic nodes as a signature in its actual operation.
Between the operational logical node A and the operational logical node B in the initial operational logical node topology 104 in an upstream-downstream relationship, the operational logical node A may be a source node, its SBP signature may be generated by a user configuration, or may be generated naturally based on a user' S description of the task, or the SBP signature of the operational logical node A may have been determined by decision selection substantially in accordance with aspects of the present disclosure, e.g., the descriptor of the output tensor of the SBP signature of the operational logical node A is S (0). While as an operational logical node B in the initial operational logical node topology 104, it has many candidate SBP signatures, which may include (S (1), B) →s (1), b→p, S (1))→p, and p→b, etc., but from operational logical node a to operational logical node B, since the distribution descriptor of the output tensor of the operational logical node a is S (0), the corresponding input tensor distribution descriptor that the node B can select may be S (1), B, and P.
Thus, after the signatures of some previous operational logic nodes are determined, the SBP signatures of the operational logic nodes downstream thereof are also ultimately selected for determination based on the cost of data handling between the logical distributed descriptors (SBP descriptors) of the output tensors of the upstream operational logic nodes and the logical distributed descriptors (SBP descriptors) of the corresponding input tensors of the candidate logical distributed signatures of the downstream upstream operational logic nodes. In this way, once the candidate SBP signature of such a current operational logic node is selected for calculation, meaning that the respective SBP descriptors of the data blocks at the respective inputs and outputs of the operational logic node are also determined, the total cost of data handling for the current operational logic node is calculated or estimated, and the candidate distributed signature with the minimum total cost is taken as the distributed signature of the current operational logic node. It is noted that if the logical distributed descriptors of the inputs of which of the candidate signatures of the current operational logical node are consistent with the logical distributed descriptors of the output tensors of its upstream operational logical node, the candidate logical distributed signature containing that logical distributed descriptor may be preferentially selected unless the logical distributed descriptors of the other input tensors of that candidate logical distributed signature would result in a greater overall cost in the end.
FIG. 4 is a schematic diagram illustrating SBP signatures for downstream computing logical nodes selected according to the present disclosure. Fig. 4 is an enlarged schematic view of the relationship between nodes A, B and E in fig. 3. As shown in FIG. 4, assuming that the distribution descriptor of the output tensor of the determined SBP signature SBP-3 of the operation logic node E is S (0), the distribution descriptor of the output tensor of the determined SBP signature SBP-5 of the operation logic node A is P, and one SBP-2 of the candidate SBP signatures of the operation logic node B is (S (1), S (0)). Fwdarw.P. Thus, the SBP descriptor of the input tensor of the operation logic node B corresponding to the SBP descriptor S (0) of the output tensor of the operation logic node E is S (1), and the SBP descriptor of the input tensor of the operation logic node B corresponding to the SBP descriptor P of the output tensor of the operation logic node A is S (0). Thus, to meet the distribution requirement of the input logical data blocks of the candidate SBP signature of the operational logical node B, it is necessary to transform the tensor distribution of one of its inputs from the SBP descriptor S (0) of the output tensor of the operational logical node E to S (1) and the tensor distribution of the other of its inputs from the SBP descriptor P of the output tensor of the operational logical node A to S (0). Such a transformation will result in data exchange during the actual data processing.
FIG. 5 is a first schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors. The candidate SBP signature SBP-2 of the task node B shown in FIG. 4 is assumed to be (S(1), S(0))→P. For ease of description, the tasks of the input source task nodes A and E are distributed on the same device set as the receiving sink node B; for convenience, as shown in FIG. 5, they are distributed across the two computing cards GPU 0 and GPU 1 shown in FIG. 1. Although only two computing cards are shown here, in practice the source and sink task nodes may be distributed over more cards, or over different device sets. FIG. 5 shows the data exchange process in the case where the tensor with the S(0) descriptor of the task node E in FIG. 4 is distributed on the two computing cards, and the input of the task node B needs to obtain the tensor with the S(1) descriptor.
To obtain the S(1) distribution at the computation task node of the operation logical node B deployed on GPU 0, in addition to the half of the tensor directly available on GPU 0 under the S(0) descriptor of task node E (the acquisition of this data portion is shown with the solid arrow), the other half of the tensor, distributed on GPU 1 under the S(0) descriptor of task node E, must be supplemented (the acquisition of this data portion is shown with the dashed arrow). If the size of the logical data block is T1, the amount of data carried from the logical data block of task node E on GPU 1 to the task node of task node B deployed on GPU 0 is T1/2. Likewise, to obtain S(1) at the task node of task node B deployed on GPU 1, in addition to the half of the tensor directly available on GPU 1 under the S(0) descriptor of task node E, the other half of the tensor distributed on GPU 0 under the S(0) descriptor of task node E must be supplemented. If the size of the logical data block is T1, the amount of data carried from the logical data block of task node E on GPU 0 to the task node of task node B deployed on GPU 1 is T1/2. Thus, transforming the S(0) descriptor tensor of task node E into the S(1) descriptor tensor to be obtained at the input of task node B incurs a total data handling cost of T1 = (T1/2 + T1/2), where T1 is the size of the logical data block distributed on the source node. In FIG. 5, the logical data block is distributed with descriptor S(0), and the size of the data block in the shaded portion of each card is one half of the total tensor. In the case of a device set with 3, 4, or 5 data cards, the handling cost is also T1.
FIG. 6 is a second schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors. Similarly, the candidate SBP signature SBP-2 of the task node B shown in FIG. 4 is assumed to be (S(1), S(0))→P. For ease of description, the tasks of the input source task nodes A and E and the sink node B are all distributed on the same device set, as shown in FIG. 6, on the computing cards GPU 0, GPU 1, and GPU 2. Although three computing cards are shown here, this is for example only; it may also be two cards as shown in FIG. 5. In practice, the source and sink task nodes may be distributed over more cards or over different device sets. FIG. 6 shows the data exchange process in the case where the tensor with the P descriptor of the task node A in FIG. 4 is distributed over the three computing cards, and the input of the task node B needs to obtain the tensor with the S(0) descriptor.
To obtain S(0) at the task node of task node B deployed on GPU 0, in addition to the one third of the tensor directly available on GPU 0 under the P descriptor of task node A (the acquisition of this data portion is shown with the solid arrow), it is necessary to supplement one third of the tensor distributed on GPU 1 under the P descriptor of task node A and one third of the tensor distributed on GPU 2 under the P descriptor of task node A. Here a partial-value tensor P is distributed on each of the three cards, denoted Ps, taking the partial-sum tensor as a descriptive example. If the size of the logical data block of task node A distributed on each GPU card is T2, then for the task node of B deployed on GPU 0 to obtain the S(0) tensor, an amount T2/3 must be carried from the logical data block of task node A on GPU 1 and an amount T2/3 from the logical data block of task node A on GPU 2 to the task node of B on GPU 0. Likewise, to obtain the S(0) tensor, the task node of B deployed on GPU 1 must be supplemented with T2/3 carried from the logical data block of task node A on GPU 0 and T2/3 carried from the logical data block of task node A on GPU 2. Similarly, to obtain the S(0) tensor, the task node of B deployed on GPU 2 must be supplemented with T2/3 carried from the logical data block of task node A on GPU 0 and T2/3 carried from the logical data block of task node A on GPU 1. Therefore, the amount of data moved in the actual data processing of transforming the P-distributed tensor into the S(0)-distributed tensor shown in FIG. 6 is 2T2 = (T2/3 + T2/3 + T2/3 + T2/3 + T2/3 + T2/3). Alternatively, if the number of computing cards over which the task nodes are distributed is 2, the amount of data carried is T2 = (T2/2 + T2/2). By analogy, when the source node and the sink node share the same device set with k cards, the amount of data carried is (k-1)·T2.
Obviously, as described above, for the operation logical node B to perform its operation, the data handling cost required when selecting the signature SBP-2 (i.e., the signature (S(1), S(0))→P) is the sum of the handling costs of its two inputs. Combining FIG. 5 and FIG. 6 (with two computing cards in the case of FIG. 6), the total amount of data the task node needs to handle under the candidate signature SBP-2 is T1 + T2. Therefore, the handling cost estimated by the traffic data amount estimation unit 111 for the candidate signature SBP-2 of the operation logical node B must include the handling costs for both inputs of that candidate signature.
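As a back-of-the-envelope check of these figures (arbitrary block sizes; two cards per node, as in FIG. 5 and the two-card variant of FIG. 6):

```python
# Worked check of the SBP-2 = (S(1), S(0)) -> P example on two cards.
T1 = 1024  # size of E's logical data block (S(0) over GPU 0/1), arbitrary units
T2 = 2048  # size of A's per-card partial tensor (P over GPU 0/1), arbitrary units

# S(0) -> S(1): each of the two B-side cards fetches the missing half
# of the tensor from the other card.
cost_S0_to_S1 = T1 / 2 + T1 / 2           # = T1

# P -> S(0) on k = 2 cards: each B-side card fetches its slice of the
# partial tensor held on the other card, (k - 1) * T2 in total.
k = 2
cost_P_to_S0 = (k - 1) * T2               # = T2

total_cost_SBP2 = cost_S0_to_S1 + cost_P_to_S0   # = T1 + T2
```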
For the case where the device sets of the source and sink task nodes are exactly the same, the amount of data exchange between the various SBP descriptors can be summarized in the following calculation table, Table 1:
TABLE 1 (the distribution device sets of the source and sink nodes are identical; the number of cards is K)

Conversion mode    Data volume of source task node distribution tensor    Data exchange volume    Remarks
S(i)→S(j)          T1                                                     0                       i = j
S(i)→S(j)          T1                                                     T1                      i ≠ j
S→B                T2                                                     (K-1)·T2
S→P                T3                                                     T3
B→S                T4                                                     0
B→B                T5                                                     0
B→P                T6                                                     0
P→S                T7                                                     (K-1)·T7
P→B                T8                                                     2(K-1)·T8
P→P                T9                                                     0
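Table 1 can be read as a simple lookup. The sketch below is an illustrative rendering only; the function name and the tuple encoding of SBP descriptors, e.g. ("S", 0), ("B",), ("P",), are hypothetical. T is the data volume of the source node's distributed tensor and K is the number of cards in the shared device set.

```python
def same_device_set_cost(src, dst, T, K):
    """Estimated data exchange volume per Table 1 (source and sink share
    the same device set of K cards)."""
    if src[0] == "S" and dst[0] == "S":
        return 0 if src[1] == dst[1] else T   # S(i)->S(j): free when i == j
    if src[0] == "S" and dst[0] == "B":
        return (K - 1) * T                    # each other card needs the full tensor
    if src[0] == "S" and dst[0] == "P":
        return T
    if src[0] == "B":
        return 0                              # B->S, B->B, B->P: data already everywhere
    if src[0] == "P" and dst[0] == "S":
        return (K - 1) * T
    if src[0] == "P" and dst[0] == "B":
        return 2 * (K - 1) * T
    return 0                                  # P->P
```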
FIG. 7 is a third schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors, in which the device set of the source node is completely different from the device set of the sink node: the source task node E is distributed over GPU 0 and GPU 1, while the sink task node B is distributed over the computing cards GPU 2 and GPU 3. If the size of the logical data block distributed on each computing card is T3, the amount of data to be carried is 2T3.
FIG. 8 is a fourth schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors, in which the device set of the source node is completely different from the device set of the sink node: the source task node A is distributed over GPU 0, GPU 1, and GPU 2, while the sink task node B is distributed over the computing cards GPU 3, GPU 4, and GPU 5. As an example, a partial-value tensor P is distributed on each of the three source cards, denoted Ps, taking the partial-sum tensor as a descriptive example. If the size of the logical data block distributed on each computing card of the source task node is T4, the amount of data to be carried is 9 × T4/3, i.e., 3T4. If the number of computing cards over which the source task node is distributed is 2, the amount of data to be carried is 2T4. In general, if the number of computing cards over which the source task node A is distributed is Ks, the amount of data carried is Ks·T4.
For the case where the device sets of the source and sink task nodes are completely different, the amount of data exchange between the various SBP descriptors can be summarized in the following calculation table, Table 2:
TABLE 2 (the distribution device sets of the source task node (Ks cards) and the sink task node (Kd cards) are completely different)

Conversion mode    Data volume of source task node distribution tensor    Data exchange volume    Remarks
S(i)→S(j)          T1                                                     T1                      i ≠ j
S→B                T2                                                     Kd·T2
S→P                T3                                                     T3
B→S                T4                                                     T4
B→B                T5                                                     Kd·T5
B→P                T6                                                     T6
P→S                T7                                                     Ks·T7
P→B                T8                                                     Ks·Kd·T8
P→P                T9                                                     Ks·T9
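As with Table 1, Table 2 can be read as a lookup; the sketch below is illustrative only (hypothetical function name and descriptor encoding). T is the data volume of the source node's distributed tensor; the B→S entry is read here as T, i.e., the broadcast tensor re-sliced onto the sink cards, which is an assumption where the table text is unclear.

```python
def disjoint_device_set_cost(src, dst, T, Ks, Kd):
    """Estimated data exchange volume per Table 2 (the source set of Ks
    cards and the sink set of Kd cards share no device)."""
    if src[0] == "S" and dst[0] == "S":
        return T                      # S(i)->S(j), i != j
    if src[0] == "S" and dst[0] == "B":
        return Kd * T                 # every sink card needs the full tensor
    if src[0] == "S" and dst[0] == "P":
        return T
    if src[0] == "B" and dst[0] == "S":
        return T                      # re-slice one broadcast copy onto the sink cards
    if src[0] == "B" and dst[0] == "B":
        return Kd * T
    if src[0] == "B" and dst[0] == "P":
        return T
    if src[0] == "P" and dst[0] == "S":
        return Ks * T
    if src[0] == "P" and dst[0] == "B":
        return Ks * Kd * T
    return Ks * T                     # P->P
```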
FIG. 9 is a fifth schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors, in which the device set of the source node is not exactly the same as the device set of the sink node: the source task node E is distributed on GPU 0 and GPU 1, while the sink task node B is distributed on the computing cards GPU 1 and GPU 2. If the size of the logical data block distributed on each computing card of the source task node is T5, the amount of data to be carried is 3/2·T5 = (1/2·T5 + 1/2·T5 + 1/2·T5). In this case there is no fixed calculation rule; the amount must be worked out from the specific composition of the actual device sets and the intersection between them.
FIG. 10 is a sixth schematic diagram of the traffic data amount estimation unit 111 according to the present disclosure estimating the data traffic generated between tensors with different distributed descriptors, in which the device set of the source node is not exactly the same as the device set of the sink node: the source task node A is distributed on GPU 0, GPU 1, and GPU 2, while the sink task node B is distributed on the computing cards GPU 1, GPU 2, and GPU 3. As an example, a partial-value tensor P is distributed on each of the three source cards, denoted Ps, taking the partial-sum tensor as a descriptive example. If the size of the logical data block distributed on each computing card of the source task node is T6, the amount of data to be carried is 7 × T6/3, i.e., 7/3·T6. In this case, too, there is no fixed calculation rule, and the amount must be worked out from the specific composition of the actual device sets and the intersection between them.
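One simplified way to accumulate the amount for partially overlapping device sets is sketched below. It assumes, purely for illustration, that every source card holds a block of size per_card_block, that each sink card needs an even 1/len(dst_cards) share from every source card, and that co-located data needs no handling; under these assumptions it reproduces the 3/2·T5 and 7/3·T6 figures above, but it is not presented as the general rule.

```python
def overlapping_set_cost(src_cards, dst_cards, per_card_block):
    """Rough estimate for partially overlapping device sets: every sink
    card pulls, from each source card other than itself, the share of
    the source data it cannot read locally."""
    total = 0.0
    for dst in dst_cards:
        for src in src_cards:
            if src != dst:            # co-located data needs no handling
                total += per_card_block / len(dst_cards)
    return total

# FIG. 9: E on {GPU0, GPU1}, B on {GPU1, GPU2} -> 3/2 * T5
# overlapping_set_cost([0, 1], [1, 2], T5)
# FIG. 10: A on {GPU0, GPU1, GPU2}, B on {GPU1, GPU2, GPU3} -> 7/3 * T6
# overlapping_set_cost([0, 1, 2], [1, 2, 3], T6)
```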
As described above, the traffic data amount estimation unit 111 traverses all the candidate signatures SBP-1, SBP-2, and SBP-3 of the operation logical node B in the above-described manner and obtains the handling cost of each signature. The handling cost comparison unit 112 then compares the handling costs under each candidate signature and obtains the minimum handling cost for the operation logical node to be determined, e.g., the operation logical node B. Finally, the SBP signature determination unit 113 determines the candidate SBP signature corresponding to the minimum handling cost as the final SBP signature of the operation logical node B.
The final operation logical node topology output component 12 then outputs the final operation logical node topology 101 based on the SBP signature determined for each operation logical node by the SBP signature determination unit 113. Each operation logical node constituting the operation logical node topology 101 is attached with only one SBP signature; in other words, each operation logical node explicitly specifies the distribution manner or distribution descriptor of each of its input tensors and uniquely determines the distribution manner or distribution descriptor of its output tensor.
While the general procedure for determining the final SBP signature among several candidate SBP signatures has been given above, in some specific cases certain operation logical nodes carry only a user-designated SBP signature, either through a special configuration by the user or by explicit user designation; the operation logical nodes downstream of them will then determine their SBP signatures based on this special designation of the upstream operation logical node.
The above estimation of the transmission cost considers only the amount of data. It should be noted, however, that the length of the data handling path, i.e., the complexity of the data handling, is also part of the transmission cost to be considered. After a weight value is assigned to the length of the transmission path, the final transmission cost of each candidate SBP signature can be calculated by multiplying it with the calculated data amount. Selecting the candidate SBP signature based on this corrected transmission cost, which takes the transmission path into account, leads to a better insertion of the transport task nodes.
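One possible reading of this correction is sketched below; the names are hypothetical and the description does not fix the exact weighting scheme.

```python
def corrected_transmission_cost(data_volume, path_length, length_weight):
    """Weight the handling-path length and multiply it with the estimated
    data volume to obtain the final transmission cost of a candidate
    SBP signature."""
    return data_volume * (length_weight * path_length)

# Example: the same data volume routed over a longer device-to-host-to-host
# path scores worse than a direct device-to-device path.
# corrected_transmission_cost(T, path_length=3, length_weight=0.5)
```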
Although FIGS. 1 and 2 show a portion of a full task node topology after transport task nodes have been inserted, the manner of inserting such transport task nodes there is merely exemplary. Based on the basic principles described above, the insertion manner may vary with different computing device resources.
Although the above describes the handling of data between a host and a computing device, when the computation tasks of some operation task nodes are deployed directly on hosts, data also migrates between operation task nodes on different hosts. Thus, when a first location flag indicates a first host and a second location flag indicates a second host, the transport task node insertion component inserts only one transport task node between the first and second operation task nodes and assigns the inserted transport task node the first location flag; in other words, when data is handled across hosts, the transport task node is deployed only on the host that receives the data. On the other hand, in the case where the first location flag indicates the first computing device of the first host and the second location flag indicates the first host, and direct access between the host and the computing device is not possible, the transport task node insertion component inserts only one transport task node between the first and second operation task nodes and assigns the inserted transport task node the first location flag. Alternatively, when the first location flag indicates the first host and the second location flag indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first and second operation task nodes and assigns the inserted transport task node the second location flag, the transport direction of which is labeled G-H.
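These placement rules (together with the further cases listed in the claims) can be summarized in the following sketch, in which a location flag is modeled, purely for illustration, as ("H", host) for a host or ("G", host, device) for a computing device of that host; the actual insertion component operates on the task node topology rather than on such tuples.

```python
def plan_transport_nodes(first_loc, second_loc):
    """Return the location flags of the transport task nodes to insert,
    ordered from the downstream (first) operation task node toward its
    upstream (second) producer."""
    if first_loc == second_loc:
        return []                                   # same placement: nothing to insert
    # host <- another host: one node, placed on the receiving host
    if first_loc[0] == "H" and second_loc[0] == "H":
        return [first_loc]
    # device <- its own host (no direct access): one node with the device's flag
    if first_loc[0] == "G" and second_loc == ("H", first_loc[1]):
        return [first_loc]
    # host <- a device of the same host: one node with the device's flag (G-H copy)
    if second_loc[0] == "G" and first_loc == ("H", second_loc[1]):
        return [second_loc]
    # device <- another device of the same host: two nodes, one per side
    if first_loc[0] == "G" and second_loc[0] == "G" and first_loc[1] == second_loc[1]:
        return [first_loc, second_loc]
    # device <- a device on another host: three nodes
    # (receiving device, receiving host, source device)
    if first_loc[0] == "G" and second_loc[0] == "G":
        return [first_loc, ("H", first_loc[1]), second_loc]
    return [first_loc]                              # cases not spelled out above
```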
With the conversion system and method for converting an operation logical node topology into a task node topology according to the present disclosure, the running path of the data can be known in advance from a global perspective, so that the data handling task nodes are deployed in advance. Data handling can therefore be realized in a statically deployed manner, with each data handling task fixed in a specific handling executor, which enables asynchronous communication in data exchange and reduces the time overhead of repeated calls. In particular, by pre-deploying the data handling task nodes from a global perspective, the defect of the prior art is eliminated in which dynamic scheduling and on-line decisions about data migration cause data processing waits and delays, so that the overlap of data scheduling and data computation cannot be achieved (in the prior art, data handling and computation cannot overlap). Because the handling task nodes are inserted between the operation task nodes and the data handling paths are planned in advance, the handling role of each piece of data is fixed: the source and destination of the data and the operation task node served by each handling task node are predetermined. The overlap of handling and computation can thus be realized across the whole system, and the explosion caused by resource exhaustion or unplanned resource use in flow control is resolved.
In addition, inserting the handling task nodes in advance eliminates waiting during computation, so that the computing device corresponding to each operation task node remains in an operating state, which improves computation utilization.
While the basic principles of the present disclosure have been described above in connection with specific embodiments, it should be noted that all or any steps or components of the methods and apparatus of the present disclosure can be implemented in hardware, firmware, software, or combinations thereof in any computing device (including processors, storage media, etc.) or network of computing devices, as would be apparent to one of ordinary skill in the art upon reading the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or set of programs on any computing device. The computing device may be a well-known general purpose device. Thus, the objects of the present disclosure may also be achieved by simply providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is apparent that the storage medium may be any known storage medium or any storage medium developed in the future.
It should also be noted that in the apparatus and methods of the present disclosure, it is apparent that the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure. The steps of executing the series of processes may naturally be executed in chronological order in the order described, but are not necessarily executed in chronological order. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A topology map conversion method for converting an operation logical node topology map into a task node topology map, comprising:
the method comprises the steps that, through an operation task node deployment component, tasks of any operation logic node in an operation logic node topological graph are segmented to appointed calculation resources based on task configuration data in a task description input by a user on the basis of given calculation resources, so that one or more operation task nodes corresponding to each operation logic node are generated, and each operation task node is given a position mark corresponding to the appointed calculation resources, wherein, before the tasks of any operation logic node in the operation logic node topological graph are segmented to the appointed calculation resources through the operation task node deployment component, a logic distributed signature selection component in the operation task node deployment component selects, from the candidate logic distributed signature set of each downstream operation logic node of each source operation logic node, the logic distributed signature with the minimum data carrying cost as the logic distributed signature of that downstream operation logic node, wherein each logic distributed signature is formed by a distributed descriptor of an input tensor and a distributed descriptor of an output tensor of the source operation logic node;
inserting, through a transport task node insertion component, one or more transport task nodes between a first operation task node and a second operation task node serving as an upstream operation task node of the first operation task node when the first position mark of the first operation task node and the second position mark of the second operation task node are different, so that a complete task node topological graph with the transport task nodes is obtained; and
when the first position mark indicates a first computing device of the first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the inserted transport task node the first position mark.
2. The topology map conversion method of claim 1, further comprising:
when the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the inserted transport task node the second position mark.
3. The topology map conversion method of claim 1, further comprising:
when the first position mark indicates the first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the inserted transport task node the first position mark.
4. The topology map conversion method of claim 1, further comprising:
when the first position mark indicates a first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, the transport task node insertion component inserts two transport task nodes between the first operation task node and the second operation task node, and assigns the first position mark to the transport task node inserted immediately adjacent to the first operation task node and the second position mark to the other inserted transport task node.
5. The topology map conversion method of claim 1, further comprising:
when the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, the transport task node insertion component inserts first, second, and third transport task nodes in sequence from the first operation task node to the second operation task node, and assigns the first position mark to the first transport task node, the position mark indicating the first host to the second transport task node, and the second position mark to the third transport task node.
6. A topology map conversion system for converting an operational logical node topology map to a task node topology map, comprising:
an operation task node deployment component for fragmenting tasks of any operation logic node in an operation logic node topological graph to appointed calculation resources based on task configuration data in task description input by a user on the basis of the given calculation resources, thereby generating one or more operation task nodes corresponding to each operation logic node and giving each operation task node a position mark corresponding to the appointed calculation resources, wherein before fragmenting the tasks of any operation logic node in the operation logic node topological graph to the appointed calculation resources, a logic distributed signature with the minimum data carrying cost is selected as a logic distributed signature of each downstream operation logic node from candidate logic distributed signature sets of each downstream operation logic node of each source operation logic node by a logic distributed signature selection component in the operation task node deployment component, wherein the logic distributed signature is formed by a distributed descriptor of input tensor and a distributed descriptor of output tensor of the source operation logic node appointed by the operation logic node in the operation logic node topological graph based on the task configuration data; and
a transport task node insertion component that, when the first position mark of a first operation task node and the second position mark of a second operation task node serving as an upstream operation task node thereof are different, inserts one or more transport task nodes between the first operation task node and the second operation task node, thereby obtaining a complete task node topological graph with transport task nodes, and that, when the first position mark indicates a first computing device of the first host and the second position mark indicates the first host, inserts only one transport task node between the first operation task node and the second operation task node and assigns the inserted transport task node the first position mark.
7. The topology map conversion system of claim 6, wherein:
the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node when the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, and assigns the inserted transport task node the second position mark.
8. The topology map conversion system of claim 6, wherein:
the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node when the first position mark indicates the first host and the second position mark indicates a second host, and assigns the inserted transport task node the first position mark.
9. The topology map conversion system of claim 6, wherein:
the transport task node insertion component inserts two transport task nodes between the first operation task node and the second operation task node when the first position mark indicates a first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, and assigns the first position mark to the transport task node inserted immediately adjacent to the first operation task node and the second position mark to the other inserted transport task node.
10. The topology map conversion system of claim 6, wherein:
the transport task node insertion component inserts first, second, and third transport task nodes in sequence from the first operation task node to the second operation task node when the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, and assigns the first position mark to the first transport task node, the position mark indicating the first host to the second transport task node, and the second position mark to the third transport task node.
GR01 Patent grant