CN110928697B - Topological graph conversion system and method - Google Patents

Publication number
CN110928697B
Authority
CN
China
Prior art keywords
node
task
logic
data
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010090334.8A
Other languages
Chinese (zh)
Other versions
CN110928697A (en)
Inventor
袁进辉
柳俊丞
牛冲
李新奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd
Priority to CN202010090334.8A (CN110928697B)
Priority to CN202010403719.5A (CN111666151B)
Publication of CN110928697A
Application granted
Publication of CN110928697B
Priority to PCT/CN2021/072789 (WO2021159929A1)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/542 Event management; Broadcasting; Multicasting; Notifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for converting an operation logic node topological graph into a task node topological graph, comprising: fragmenting, by an operation task node deployment component, the task of any operation logic node in the operation logic node topological graph onto specified computing resources, based on task configuration data in a task description that a user inputs for the given computing resources, thereby generating one or more operation task nodes for each operation logic node and assigning each operation task node a position mark corresponding to its specified computing resource; and inserting, by a transport task node insertion component, one or more transport task nodes between a first operation task node and a second operation task node that is its upstream operation task node whenever the first position mark of the first operation task node differs from the second position mark of the second operation task node, thereby obtaining a complete task node topological graph that contains the transport task nodes.

Description

Topological graph conversion system and method
Technical Field
The present disclosure relates to a data processing technology. More particularly, the present disclosure relates to a conversion system for converting an operation logical node topology into a task node topology and a method thereof.
Background
With the spread of distributed computing, a large job is divided so that different parts of its data are deployed to the computing devices of a distributed data processing system. While a specific job is processed, intermediate parameters or results computed on one computing device become the input data of a computing task on another computing device, and synchronizing these intermediate parameters incurs the call overhead of migrating data between computing devices. Network communication is usually the bottleneck, and poor communication performance limits the speedup ratio and scalability of a multi-machine distributed data processing architecture.
As the computing capability of individual computing devices grows, raising their computing speed further is approaching its limit. In particular, as computation speeds up, the speed of data access has fallen behind the speed of computation, so data access and migration become the bottleneck that limits how fast a computing device can process data. In practice, most developers and users of dedicated AI chips attend only to the power consumption and efficiency of the computation portion, for example how to design an AI chip to perform matrix operations more efficiently, and pay less attention to data migration, data forwarding, and routing requirements. When a large-scale task is performed cooperatively by multiple chips, however, data migration is significant in both power consumption and latency.
Thus, in existing systems, migrating data between distributed devices takes almost as much time as computing. Reducing the communication overhead, or hiding it behind computation while the system runs, so that the system can devote its hardware resources to shortening computation time, is the key to improving system efficiency. Furthermore, modifying data routing patterns under flexible parallel patterns (data parallel, model parallel, and even hybrid parallel) is very complex. Existing deep learning frameworks encode only the computation operations of a model in the dataflow graph and do not encode the data migration operations. As a result, the automatic parallelization offered by dataflow engines cannot be exploited for these operations, and the software programming effort falls into the so-called callback trap of synchronous programming.
Therefore, an urgent problem in large-scale data processing is how to treat data transport or data exchange in a distributed data processing architecture as being as important as data computation, that is, as a first-class citizen alongside data processing and calculation. Doing so would allow data transport to be deployed statically and each data transport task to be fixed in a specific transport executor, which enables asynchronous communication during data exchange, reduces the time overhead of the two calls, makes it possible to implement data transport and routing on a dedicated chip, and maximizes the efficiency of the whole system.
Disclosure of Invention
It is an object of the present disclosure to provide a solution to at least one of the above problems. Specifically, the present disclosure provides a method for converting an operation logic node topological graph into a task node topological graph, including: fragmenting, by an operation task node deployment component, the task of any operation logic node in the operation logic node topological graph onto specified computing resources, based on task configuration data in a task description that a user inputs for the given computing resources, thereby generating one or more operation task nodes for each operation logic node and assigning each operation task node a position mark corresponding to its specified computing resource; and inserting, by a transport task node insertion component, one or more transport task nodes between a first operation task node and a second operation task node that is its upstream operation task node whenever the first position mark of the first operation task node differs from the second position mark of the second operation task node, thereby obtaining a complete task node topological graph that contains the transport task nodes.
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the first position mark to the inserted transport task node.
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the second position mark to the inserted transport task node.
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the first position mark to the inserted transport task node.
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates a third computing device of the first host or a second host, the transport task node insertion component inserts two transport task nodes between the first operation task node and the second operation task node, assigns the first position mark to the transport task node inserted next to the first operation task node, and assigns the second position mark to the other inserted transport task node.
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates a fourth computing device of a second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first operation task node to the second operation task node, assigns the first position mark to the first transport task node, assigns a position mark indicating the first host to the second transport task node, and assigns the second position mark to the third transport task node.
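The five insertion cases above can be summarized as a small lookup on the two position marks. The following Python sketch is illustrative only: the `Mark` type, the function name `transport_marks`, and the choice to raise an error for combinations the text does not enumerate are assumptions, not part of the disclosure.

```python
from typing import List, NamedTuple, Optional


class Mark(NamedTuple):
    """Hypothetical position mark: a host plus an optional device.

    device=None means the mark indicates the host itself.
    """
    host: str
    device: Optional[str] = None


def transport_marks(first: Mark, second: Mark) -> List[Mark]:
    """Return the position marks of the transport task nodes to insert,
    ordered from the first operation task node to the second."""
    if first == second:
        return []  # identical position marks: nothing to insert
    same_host = first.host == second.host
    first_on_dev = first.device is not None
    second_on_dev = second.device is not None

    if same_host and first_on_dev and not second_on_dev:
        return [first]                            # device -> its own host
    if same_host and not first_on_dev and second_on_dev:
        return [second]                           # host -> one of its devices
    if not same_host and not first_on_dev and not second_on_dev:
        return [first]                            # host -> another host
    if same_host and first_on_dev and second_on_dev:
        return [first, second]                    # device -> another device, same host
    if not same_host and first_on_dev and not second_on_dev:
        return [first, second]                    # device -> another host
    if not same_host and first_on_dev and second_on_dev:
        return [first, Mark(first.host), second]  # device -> device on another host
    # Assumption: the text does not enumerate host -> device on another host.
    raise ValueError("combination not enumerated in the text")


if __name__ == "__main__":
    for m in transport_marks(Mark("H1", "GPU0"), Mark("H2", "GPU3")):
        print(m)
```

Returning only the marks keeps the sketch independent of any particular graph data structure; a real insertion component would also create the nodes and rewire the edges.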
According to the method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, the method further comprises, before the task of any operation logic node in the operation logic node topological graph is fragmented to the specified computing resource by the operation task node deployment component: selecting, by a logic distributed signature selection component in the operation task node deployment component, for each downstream operation logic node of each source operation logic node, the logic distributed signature with the minimum data transport cost from that downstream node's candidate logic distributed signature set as its logic distributed signature, the selection being based on the logic distributed signature that the task configuration data specifies for the source operation logic node in the operation logic node topological graph, where a logic distributed signature consists of the distributed descriptors of the input tensor and the distributed descriptors of the output tensor of an operation logic node.
According to another aspect of the present disclosure, there is also provided a conversion system for converting an operation logic node topological graph into a task node topological graph, including: an operation task node deployment component, which fragments the task of any operation logic node in the operation logic node topological graph onto specified computing resources, based on task configuration data in a task description that a user inputs for the given computing resources, thereby generating one or more operation task nodes for each operation logic node and assigning each operation task node a position mark corresponding to its specified computing resource; and a transport task node insertion component, which inserts one or more transport task nodes between a first operation task node and a second operation task node that is its upstream operation task node whenever the first position mark of the first operation task node differs from the second position mark of the second operation task node, thereby obtaining a complete task node topological graph that contains the transport task nodes.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the first position mark to the inserted transport task node.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the second position mark to the inserted transport task node.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and assigns the first position mark to the inserted transport task node.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates a third computing device of the first host or a second host, the transport task node insertion component inserts two transport task nodes between the first operation task node and the second operation task node, assigns the first position mark to the transport task node inserted next to the first operation task node, and assigns the second position mark to the other inserted transport task node.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates a fourth computing device of a second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first operation task node to the second operation task node, assigns the first position mark to the first transport task node, assigns a position mark indicating the first host to the second transport task node, and assigns the second position mark to the third transport task node.
According to the conversion system of the present disclosure for converting an operation logic node topological graph into a task node topological graph, the operation task node deployment component comprises a logic distributed signature selection component which, before the task of any operation logic node in the operation logic node topological graph is fragmented to the specified computing resource, selects for each downstream operation logic node of each source operation logic node the logic distributed signature with the minimum data transport cost from that downstream node's candidate logic distributed signature set as its logic distributed signature, the selection being based on the logic distributed signature that the task configuration data specifies for the source operation logic node in the operation logic node topological graph, where a logic distributed signature consists of the distributed descriptor of the input tensor and the distributed descriptor of the output tensor of an operation logic node.
With the conversion system and method of the present disclosure for converting an operation logic node topological graph into a task node topological graph, the path that the data will travel can be obtained in advance from a global perspective, so that the data transport task nodes are deployed in advance. Data transport can thus be deployed statically, each data transport task is fixed in a specific transport executor, asynchronous communication is achieved in data exchange, and the time overhead of the two calls is reduced. In particular, deploying the data transport task nodes in advance from a global perspective overcomes a defect of the prior art, in which dynamic scheduling decides data migration online and the resulting waiting and delay in data processing prevent data transport from overlapping with computation. Because the transport task nodes are inserted between the operation task nodes, the data transport path is planned in advance: the transport role of each piece of data is fixed, and the source and destination of the data and the operation task nodes served by each transport task node are predetermined. Transport and computation can therefore overlap throughout the system, and the blow-up caused by resource exhaustion or unplanned resources in flow control is resolved.
Moreover, because the transport task nodes are inserted in advance, waiting during operation can be eliminated, so the computing device corresponding to each operation task node stays in an operating state and computing utilization improves.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a schematic diagram illustrating a conversion system for converting an arithmetic logic node topology into a task node topology according to the present disclosure.
FIG. 2 is a partial schematic diagram of a full task node topology according to the present disclosure.
FIG. 3 is a schematic diagram illustrating a structure of a logical distributed signature for selecting operational logical nodes according to the present disclosure.
FIG. 4 is a schematic diagram illustrating selection of an SBP signature of a downstream operational logical node according to the present disclosure.
Fig. 5 illustrates a first schematic diagram of a transport data amount estimation unit estimating data transport amounts generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 6 illustrates a second schematic diagram of the transport data amount estimation unit estimating the data transport amount generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 7 illustrates a third schematic diagram of the transport data amount estimation unit estimating the data transport amount generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 8 illustrates a fourth schematic diagram of the transport data amount estimation unit estimating the data transport amount generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 9 illustrates a fifth schematic diagram of the transport data amount estimation unit estimating the data transport amount generated between tensors of different distributed descriptors according to the present disclosure.
Fig. 10 illustrates a sixth schematic diagram of the transport data amount estimation unit estimating the data transport amount generated between tensors of different distributed descriptors according to the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to the following examples and the accompanying drawings so that those skilled in the art can practice the invention with reference to the description.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, one of two possible position marks may be referred to hereinafter as a first position mark and may also be referred to as a second position mark, and similarly, the other of the two possible position marks may be referred to as a second position mark and may also be referred to as a first position mark, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at the time of ...", "when ...", or "in response to a determination", depending on the context.
For a better understanding of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram illustrating a conversion system for converting an operation logic node topological graph into a task node topological graph according to the present disclosure. As shown in Fig. 1, the conversion system includes an operation task node deployment component 10 and a transport task node insertion component 20. When the operation task node deployment component 10 obtains the operation logic node topological graph, it fragments the task of any operation logic node in that graph onto specified computing resources, based on task configuration data in the task description that a user inputs for the given computing resources, thereby generating one or more operation task nodes for each operation logic node and assigning each operation task node a position mark corresponding to its specified computing resource.
Specifically, a distributed computing system typically includes one or more hosts, each connected to multiple computing devices, such as GPUs or TPUs, that are dedicated to large-scale simple operations. When data-parallel computing is required, a large data block to be processed is generally divided and fragmented onto multiple computing devices for parallel processing. When the model is relatively large, the model itself may be divided and distributed to different computing devices. For example, when two devices, GPU0 and GPU1, are available on one host, the data may be divided into two parts along dimension 0 and distributed to GPU0 and GPU1 for parallel processing. If the host is numbered H1, the position mark H1-GPU0 is assigned to the operation task node of the operation logic node that is fragmented onto GPU0 of host H1, and likewise the position mark H1-GPU1 is assigned to the operation task node fragmented onto GPU1 of host H1. As shown in Fig. 1, the operation logic node E initially carries the position mark H1-2G because it will be allocated to the two GPUs of H1. After processing by the operation task node deployment component 10, it is fragmented into two operation task nodes, E1 and E2, which are assigned the position marks H1-GPU0 and H1-GPU1, respectively. Similarly, after the operation logic node A is processed by the operation task node deployment component 10, it is fragmented into two operation task nodes, A1 and A2, which are assigned the position marks H1-GPU0 and H1-GPU1, respectively.
The operation logic node B, which is downstream of the operation logic nodes A and E, is likewise processed by the operation task node deployment component 10 and fragmented into two operation task nodes, B1 and B2, which are assigned the position marks H1-GPU0 and H1-GPU1, respectively. By analogy, the operation logic nodes C, D, and F are all located on the two GPU computing cards of host H2, so after processing by the operation task node deployment component 10, their respective operation task nodes C1 and C2, D1 and D2, and F1 and F2 carry the position marks H2-GPU0 and H2-GPU1, respectively. Combining this with the task configuration data yields the operation task node topological graph 102.
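The fragmentation and marking step for nodes A, E, and B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `deploy` and the shape of the `placement` dictionary are hypothetical, and only the naming pattern (E1, E2 with marks H1-GPU0, H1-GPU1) is taken from the text.

```python
from typing import Dict, List, Tuple


def deploy(placement: Dict[str, Tuple[str, int]]) -> Dict[str, List[Tuple[str, str]]]:
    """Fragment each operation logic node into operation task nodes.

    placement maps a logic node name to (host, number of GPUs on that host).
    The result maps each logic node name to a list of
    (task node name, position mark) pairs, one per GPU.
    """
    tasks: Dict[str, List[Tuple[str, str]]] = {}
    for node, (host, num_gpus) in placement.items():
        tasks[node] = [
            (f"{node}{i + 1}", f"{host}-GPU{i}")  # e.g. ("E1", "H1-GPU0")
            for i in range(num_gpus)
        ]
    return tasks


if __name__ == "__main__":
    # Nodes A, E and B are each placed on the two GPUs of host H1.
    graph = deploy({"A": ("H1", 2), "E": ("H1", 2), "B": ("H1", 2)})
    for node, task_nodes in graph.items():
        print(node, task_nodes)
```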
After the operation task node topological graph 102 is determined in this way, the transport task node insertion component 20 inserts one or more transport task nodes between a first operation task node and a second operation task node that is its upstream operation task node whenever the first position mark of the first operation task node differs from the second position mark of the second operation task node, thereby obtaining a complete task node topological graph with transport task nodes. Specifically, as shown in Fig. 1, the transport task nodes E1-H1 and H1-B2 are inserted between the operation task nodes E1 and B2, the transport task nodes E2-H1 and H1-B1 between E2 and B1, the transport task nodes A1-H1 and H1-B2 between A1 and B2, and the transport task nodes A2-H1 and H1-B1 between A2 and B1. This forms the complete task node topological graph in Fig. 1. Because of the limited space of the drawing sheet, however, Fig. 1 shows only part of the complete task node topological graph, namely the first part 103-1, which contains the operation task nodes of E, A, and B with the transport task nodes inserted between them; the rest is omitted. It should also be noted that, where a direct access protocol exists between different computing devices (e.g., GPUs) connected to the same host, data migration between those computing devices may not require inserting the transport task nodes described in the present disclosure.
Since the position mark of the operation task node K is host H1, only one transport task node, B1-H1 or B2-H1, is inserted between the operation task node B1 or B2 and the operation task node K; that is, some or all of the data distributed on G0/H1 or G1/H1 that the operation task node K requires is transported to host H1 by the transport task node B1-H1 or B2-H1. It should be noted, however, that where a direct access protocol exists between host H1 and a computing device (e.g., a GPU) connected to it, such data migration between the host and the computing device may be performed without inserting the transport task nodes described in the present disclosure.
Fig. 2 is a schematic diagram illustrating a portion of a full task node topology after insertion of transport task nodes according to the present disclosure. As shown in fig. 2, since the operation logic node C is distributed on the two GPUs 0 and 1 of the host H1 and the downstream operation logic node D is distributed on the two GPUs 0 and 1 of the host H2, the positions of their respective operation task nodes C1 and C2 are labeled G0/H1 and G1/H1, and the positions of the operation task nodes D1 and D2 are labeled G0/H2 and G1/H2, as shown in fig. 1. Therefore, when the input data required by the operation task node D1 needs to come from the operation task node C1, the transport task nodes C1-H1, H1-H2, and H2-D1 need to be inserted between the operation task node C1 and the operation task node D1, as shown in fig. 2. If the input data required by the operation task node D1 also needs to come from the operation task node C2, the transport task nodes C2-H1, H1-H2, and H2-D1 also need to be inserted between the operation task node C2 and the operation task node D1. Similarly, when the input data required by the operation task node D2 needs to come from the operation task node C1, the transport task nodes C1-H1, H1-H2, and H2-D2 need to be inserted between the operation task node C1 and the operation task node D2, as shown in fig. 2. If the input data required by the operation task node D2 also needs to come from the operation task node C2, the transport task nodes C2-H1, H1-H2, and H2-D2 also need to be inserted between the operation task node C2 and the operation task node D2. Similarly, in the case of a direct access protocol between a host H1 or H2 and a computing device (e.g., GPU) to which it is connected, such data migration between the host and the computing device may be performed without inserting a transport task node as mentioned in this disclosure.
In that case, only one transport task node H1-H2 needs to be inserted between the operation task node C1 or C2 and the operation task node D1 or D2, that is, one transport task node H1-H2 can be shared by C1 and C2 on one side and D1 and D2 on the other. Although the second portion 103-2 of the complete task node topology shown in fig. 2, which is drawn this way for ease of understanding, shows four separately inserted transport task nodes H1-H2, in practice these four transport task nodes H1-H2 may be a single transport task node even in the absence of a direct access protocol between the host H1 or H2 and the computing device (e.g., GPU) to which it is connected. According to the present disclosure, when data migrates across hosts, only one transport task node needs to be inserted between the paired hosts for a pair of operation logic nodes.
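The insertion rule described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the function name `transport_chain`, the `(task, device, host)` tuple layout, and the convention that a host-resident task has device `None` are assumptions made for the example. Collecting the generated names in a set shows how a single H1-H2 node can be shared.

```python
def transport_chain(src, dst):
    """Return the names of the transport task nodes between two operation task nodes.

    src and dst are hypothetical (task_name, device, host) tuples, for example
    ("C1", "G0", "H1"). A device of None means the task runs on the host itself.
    """
    src_name, src_dev, src_host = src
    dst_name, dst_dev, dst_host = dst
    if (src_dev, src_host) == (dst_dev, dst_host):
        return []  # same location: nothing to move (assumption of this sketch)
    chain = []
    if src_dev is not None:
        chain.append(f"{src_name}-{src_host}")  # computing device -> its host
    if src_host != dst_host:
        chain.append(f"{src_host}-{dst_host}")  # host -> host
    if dst_dev is not None:
        chain.append(f"{dst_host}-{dst_name}")  # host -> computing device
    return chain


# All four producer/consumer pairs of Fig. 2; a set keeps one node per name.
producers = [("C1", "G0", "H1"), ("C2", "G1", "H1")]
consumers = [("D1", "G0", "H2"), ("D2", "G1", "H2")]
shared_nodes = {name
                for p in producers
                for c in consumers
                for name in transport_chain(p, c)}
```

Here `shared_nodes` contains the single cross-host node `H1-H2` alongside the four device-to-host and host-to-device nodes, which mirrors the sharing described in the text.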
When the transport task node insertion component 20 inserts a transport task node, it also marks the position of the inserted transport task node and, in addition, the source address and destination address of the transported data, that is, the transport direction of the data. The name of each of the above-mentioned transport task nodes thus gives its source address, its destination address, and its transport direction.
It should be noted, however, that in order to simplify and optimize the insertion of transport task nodes and shorten the data transport path, the operation task node deployment component 10 optionally further includes a logical distributed signature selection component 11, and each operation task node further selects a determined logical distributed signature from its plurality of candidate logical distributed signatures based on its operation type. Specifically, before the task of any computation logic node in the computation logic node topology map is sharded onto the specified computing resources, the logical distributed signature selection component 11 starts from the logical distributed signature that the task configuration data specifies for each source computation logic node in the topology map, which is composed of the distributed descriptor of the input tensor and the distributed descriptor of the output tensor of that node. Based on this signature, the component selects, from the candidate logical distributed signature set of each downstream computation logic node of each source computation logic node, the logical distributed signature with the smallest data transport cost as the logical distributed signature of that downstream node. The computation task node topology graph 102 with logical distributed signatures is thereby obtained.
Specifically, in order to obtain better insertion results for the transport task nodes, the computation logic nodes of the present disclosure all include candidate logical distributed signature sets for their different computation operations. FIG. 3 is a schematic diagram illustrating the selection of logical distributed signatures for operational logic nodes according to the present disclosure. Fig. 3 shows only schematically a simple initial operational logic node topology 104, in which nodes A, B, C, D, E, F, L and K are shown. Nodes that are not shown are indicated by ellipses. In actual data processing, the initial operational logic node topology 104 would be more complex. The initial operational logic node topology 104 contains the basic operational logic nodes that implement the computational tasks described by the user. The manner of generating the initial operational logic node topology 104 is conventional in the art and is therefore not described here.
The individual initial operational logic nodes in the initial operational logic node topology 104 each contain a plurality of SBP signatures. Some initial operational logic nodes have already been configured with an SBP signature by the user, or have had a unique SBP signature determined from the user's task description, for example, SBP-5 of initial operational logic node A, SBP-2 of initial operational logic node C, and SBP-3 of initial operational logic node E. Where a unique SBP signature has not been determined, the initial operational logic node typically contains a number of candidate SBP signatures inherent to it. The initial operational logic node B in FIG. 1, as shown later in FIG. 3, has a plurality of candidate SBP signatures, for example three: SBP-1, SBP-2, and SBP-3. The other initial operational logic nodes also have their own candidate SBP signatures, which are not listed here. Different initial operational logic nodes have different fixed candidate SBP signatures according to the operations they perform.
An SBP signature according to the present disclosure is a signature applied in a distributed data processing system. In a distributed data processing system, because data parallelism, model parallelism, mixed parallelism, streaming parallelism, and the like are common, the tasks of adjacent arithmetic logic nodes are often deployed to different computing devices at the same time. In the actual data processing process, intermediate parameters are therefore exchanged among the computing devices, which results in a large transport overhead. The transport task nodes according to the present disclosure may be arranged directly according to the distribution of the operation task nodes. However, in order to reduce the data transport overhead, the logical node topology map needs to be further refined on the basis of the initial logical node topology map 104. In particular, to reduce the transport overhead between upstream and downstream logical nodes, the change in data distribution pattern between the upstream and downstream logical nodes, or the transport path, needs to be minimized. To this end, the present disclosure assigns a logical distributed signature to each operational logic node in order to obtain a better downstream operational logic node. The logical distributed signature is a signature of an operational logic node expressed with the distributed descriptors of tensors, wherein the distributed descriptor of each tensor describes how that tensor is distributed over the whole computing system. The descriptors mainly comprise the SPLIT tensor descriptor, the BROADCAST tensor descriptor, and the PARTIAL VALUE tensor descriptor.
Specifically, the SPLIT tensor descriptor describes the manner in which a tensor is split, for example, a data block is split in a specified dimension according to the user's description and distributed to different computing devices for the specified computing processing. If a data block is two-dimensional and is cut in its 0th dimension, the distributed descriptor of the data tensor of the batch of data formed by the data block is S(0), and the distributed descriptor of the data tensor obtained at the input end of each logical data block is S(0). Similarly, if a two-dimensional data block is cut in its 1st dimension, the distributed descriptor of the data tensor of the batch of data formed by the data block is S(1), and the distributed descriptor of the data tensor obtained at the input end of each logical data block is S(1). Similarly, if the task data to be processed has more dimensions, there will be more distributed descriptors, e.g., S(2), S(3), and so on. The data mentioned here may be the data being processed or the model. If the data itself is split, data parallel processing is performed on the distributed data processing system, and if the model is split, model parallel processing is performed on the distributed data processing system. If the input of an operational logic node carries the SPLIT tensor descriptor, then in the actual data processing process, for a tensor of data size T that is to be distributed to four computing cards for data parallel computation, the amount of data distributed to each card is one fourth of the data, and the amount of data on all four cards together is T.
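As a small illustration of the split descriptors, the sketch below cuts a two-dimensional block along dimension 0 or dimension 1 and checks the per-card data volume. The helper names `split` and `size` are hypothetical, the block is a plain list of rows, and the block dimensions are assumed to be evenly divisible by the number of cards.

```python
def split(block, axis, num_devices):
    """Split a 2-D block (a list of rows) evenly along `axis` into one shard per device."""
    rows, cols = len(block), len(block[0])
    if axis == 0:  # S(0): each device receives a band of rows
        step = rows // num_devices
        return [block[i * step:(i + 1) * step] for i in range(num_devices)]
    step = cols // num_devices  # S(1): each device receives a band of columns
    return [[row[i * step:(i + 1) * step] for row in block]
            for i in range(num_devices)]


def size(block):
    """Number of elements in a 2-D block."""
    return sum(len(row) for row in block)


block = [[r * 4 + c for c in range(4)] for r in range(4)]  # T = 16 elements
s0_shards = split(block, 0, 4)  # distributed as S(0) over four cards
s1_shards = split(block, 1, 4)  # distributed as S(1) over four cards
```

Each shard holds one fourth of the block, and the four shards together hold T elements, as stated in the text.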
BROADCAST (BROADCAST) tensor descriptors are used to describe the way a tensor is published in a BROADCAST fashion in a distributed system. In general, for a data processing system that performs only data parallelism, model data is generally broadcast to each computing device, and thus broadcast data input to an arithmetic logic node is described using a broadcast tensor descriptor. In the actual data processing process, the data block size of the data to be broadcast is the same on each actual computing card.
The PARTIAL VALUE tensor descriptor indicates that the input or output tensor of an arithmetic logic node is a partial value among a plurality of tensors of the same kind. These partial values include the partial sum (Ps), partial product (Pm), partial AND result, partial maximum, and partial minimum. Since data is usually processed in parallel, the processing performed on each device is processing of partial data. For example, if some tensors are S(0) or S(1), the result tensor obtained on each computing device is computed from only part of the data, and the result tensors on these computing devices together form a partial value tensor. The final output result is obtained by combining the data of the same kind on all the devices.
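A partial sum can be illustrated with a matrix product whose inner dimension is cut across two devices. The sketch below is only an example under assumptions: the matrices, the helper names `matmul` and `add`, and the two-device layout are made up for illustration and are not taken from the disclosure.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 x 4
w = [[1, 0], [0, 1], [1, 1], [2, 0]]    # 4 x 2

# Device 0 holds the first two columns of x and the first two rows of w,
# device 1 holds the remaining columns and rows.
x0, x1 = [row[:2] for row in x], [row[2:] for row in x]
w0, w1 = w[:2], w[2:]

p0 = matmul(x0, w0)  # partial sum held on device 0
p1 = matmul(x1, w1)  # partial sum held on device 1
full = matmul(x, w)  # result computed without any splitting
```

Each of `p0` and `p1` has the shape of the final result but only part of its value; combining them with `add(p0, p1)` gives `full`.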
The distributed descriptors of the various tensors represent how those tensors are distributed in the distributed computing system, and the distributions of the tensors that serve as the input and output of an operational logic node also describe how that node distributes the data it operates on. For convenience of description, this disclosure will simply refer to such a distributed descriptor as an "SBP descriptor".
For this reason, as the initial operation logic node topology map 104 is generated, the initial operation logic nodes, that is, some operation nodes, also have data distributed descriptors of respective inputs and outputs, and these input and output distributed descriptors form a signature of the operation logic nodes, that is, the signature of the operation logic nodes by using tensor distributed descriptors. For convenience of expression, the english initials of the three distributed descriptors are used to refer to this signature as an "SBP signature" for short.
Depending on the user's description of the computational tasks and the data parallelism requirements, such descriptors include at least the three descriptors S(0), B, and P in each distributed computing system. If there are multiple ways of partitioning the data and the model, one descriptor is added for each additional way of partitioning. For each operational logic node, the signature contains various combinations of these descriptors. Thus, in a distributed system according to the present disclosure, there are at least three distributed descriptors and typically four, such as the following four SBP descriptors: S(0), S(1), P, and B. Depending on the number of tensor dimensions, there may be more distributed descriptors. If there are four types of SBP descriptor, various SBP signatures can be formed by permuting and combining them over the inputs and outputs. Some examples of SBP signatures are listed below: (S(0), B) → S(0), (S(1), B) → S(1), P → P, B → B, (S(0), S(1)) → P, S(0) → P, S(0) → S(0), S(0) → S(1), P → B, and the like. All SBP signatures are the result of combining SBP descriptors. For a matrix multiplication logic node, if its input tensor is cut on the first dimension, its output tensor is also cut on the first dimension. In summary, S, B, and P are descriptors that describe the distribution of data blocks in a data processing system, and an SBP signature describes the task operation of an operational logic node using multiple SBP descriptors. Each data block can have various SBP descriptors, and the operation represented by each operational logic node can have various SBP signatures. For example, SBP-1 shown in FIG. 1 may have the signature form (S(0), B) → S(0), and SBP-2 may have the signature form (S(1), B) → S(1).
In practical applications, different signature forms may have different numbers, and the numbers given herein are for descriptive convenience only and do not mean that each signature needs to be given a number, and may not have any number at all, and the different forms of signatures may be distinguished from each other without requiring numbers.
Each initial operational logic node may be given an SBP signature as described above based on the task description used. Typical operational logic nodes are operation nodes that each perform a particular operation and therefore have particular candidate SBP signatures. Note that the SBP signatures of the operational logic nodes are not all the same. The input tensors of the SBP signature of an operational logic node that performs a multiplication operation normally do not include a partial sum tensor, and therefore the SBP descriptors of its input tensors do not include the distributed descriptor P. The candidate SBP signatures of an operational logic node performing an addition operation, in contrast, may include any combination of the various SBP descriptors with each other or with themselves. For example, for an operational logic node performing matrix multiplication in the data-parallel-only case, the candidate SBP signatures are usually (S(0), B) → S(0), (S(1), B) → S(1), (S(0), S(1)) → P, and so on. These are not the only ones: as technology develops, signatures that were previously unsuitable for matrix multiplication may also be applied to it, and the list here is merely an example. Thus, each initial operational logic node is accompanied by a set of candidate logical distributed signatures based on the task configuration data. Each logical distributed signature in the set specifies a distributed descriptor for each input tensor and a distributed descriptor for each output tensor of the initial operational logic node to which it belongs.
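A candidate signature set can be represented as simple data. The sketch below is one possible representation, not the one used by the disclosed system: the class `SbpSignature`, the constant `MATMUL_CANDIDATES`, and the helper `has_partial_input` are hypothetical names, and the three candidates are the matrix multiplication examples quoted in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SbpSignature:
    """Hypothetical container: input descriptors on the left, output descriptor on the right."""
    inputs: Tuple[str, ...]
    output: str

    def __str__(self) -> str:
        if len(self.inputs) == 1:
            left = self.inputs[0]
        else:
            left = "(" + ", ".join(self.inputs) + ")"
        return f"{left} -> {self.output}"


# Candidate set for a matrix multiplication node, using the examples from the text.
MATMUL_CANDIDATES = (
    SbpSignature(("S(0)", "B"), "S(0)"),
    SbpSignature(("S(1)", "B"), "S(1)"),
    SbpSignature(("S(0)", "S(1)"), "P"),
)


def has_partial_input(signature: SbpSignature) -> bool:
    """True if any input tensor of the signature carries the P descriptor."""
    return "P" in signature.inputs
```

None of the multiplication candidates listed here has a P input, which matches the remark that the inputs of a multiplication node normally exclude partial sum tensors.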
It remains to be determined which SBP signature each operational logic node in the initial operational logic node topology 104 uses, that is, which distributed tensor it outputs and which distributed tensor it takes as input. Therefore, starting from the source operational logic nodes in the initial operational logic node topology map 104, when the logical distributed signatures or SBP signatures of all upstream operational logic nodes (e.g., operational logic nodes A and E) of the current operational logic node (e.g., operational logic node B) have been determined, the carried data amount estimation unit 111 considers the distributed descriptors of the outputs of all upstream operational logic nodes that correspond to the inputs of operational logic node B. For each candidate logical distributed signature of operational logic node B, it calculates the cost of the data that must be carried to transform the distributed descriptor of the tensor output by each upstream operational logic node into the distributed descriptor that the candidate signature specifies for the corresponding input of operational logic node B. As shown in FIG. 3, operational logic node B has many candidate SBP signatures, such as SBP-1, SBP-2, and SBP-3. For example, SBP-1 may have the signature form (S(1), B) → S(1) or (S(1), P) → S(1), the signature SBP-5 of the initial operational logic node A may have the signature form (S(0), B) → S(0), and the signature SBP-3 of the initial operational logic node E may have the signature form B → B or S(0) → P. In each signature form, the left side of the arrow gives the distributed descriptors of the input tensors, and the right side of the arrow gives the distributed descriptor of the output tensor.
For convenience of description, the "tensor whose distribution descriptor is S (0)" is simply referred to as "S (0) tensor", the "tensor whose distribution descriptor is B" is simply referred to as "B tensor", the "tensor whose distribution descriptor is P" is simply referred to as "P tensor", and so on.
As shown in fig. 4, if the signature SBP-3 of the operational logic node E in the initial operational logic node topology 104 has the form "S(0) → S(0)", the distribution descriptor of its output tensor is S(0), and thus its output tensor is an S(0) tensor. If the signature SBP-3 of the operational logic node E has the form "B → B" or "P → P", the distribution descriptor of its output tensor is B or P, and thus its output tensor is a B tensor or a P tensor. If the candidate signature SBP-2 of the operational logic node B (i.e., "(S(1), S(0)) → P") is selected as the determined signature, the distribution descriptor of the input tensor at the first input terminal, which corresponds to the output of node E, must be S(1), i.e., the first input terminal must obtain an S(1) tensor, and the distribution descriptor of the input tensor at the second input terminal, which corresponds to the output of node A, must be S(0), i.e., the second input terminal must obtain an S(0) tensor. As shown in fig. 4, for example, the output tensor of the operational logic node A is a P tensor. Clearly, the distribution descriptor P of the output tensor of node A does not match the distribution descriptor S(0) of the input tensor at the second input terminal of node B. Therefore, for the operational logic node B to perform a correct operation, the tensor with distribution descriptor P output by node A must be converted into a tensor with distribution descriptor S(0).
Similarly, if the distribution descriptor of the tensor output by node E is S(0), it does not match the distribution descriptor S(1) of the input tensor at the first input terminal of node B. Therefore, for the operational logic node B to perform a correct operation, the tensor with distribution descriptor S(0) output by node E must be converted into a tensor with distribution descriptor S(1).
In a distributed computing system, since the operation tasks of the respective operational logic nodes are split and distributed to the respective computing devices (e.g., computing cards such as a CPU, GPU, or TPU), the intermediate parameters need to be synchronized continuously in order to obtain correct results, which involves exchanging intermediate parameters between different computing devices. When the SBP descriptor of the output tensor in the SBP signature of the preceding operational logic node does not match the SBP descriptor of the corresponding input tensor in the SBP signature of the current node, the output is usually converted during actual operation. This conversion usually requires fetching part of the data located on another computing device so that, together with the locally available data, it makes up the data required at the input end of the current operational logic node and thereby conforms to the distributed descriptor of the data tensor at that input end. Fetching partial data from another device incurs a relatively large data transport overhead or cost. Selecting different signatures for the current operational logic node therefore results in different data transport overheads or costs. For this reason, the carried data amount estimation unit 111 estimates, for each operational logic node whose signature has not been determined, the data transport overhead that each candidate signature would generate. For example, for the operational logic node B, the data transport cost that the node would incur under each of its three candidate SBP signatures is estimated. Selecting any one of the candidate SBP signatures allows the operational logic node B to accomplish its operation task.
But the data handling cost generated by the operation of the system is different under the condition that different SBP signatures are adopted. Therefore, in order to minimize the data transfer cost during data processing, it is necessary to select a signature with the minimum data transfer amount from the candidate signatures of the respective arithmetic logic nodes as the signature during actual operation thereof.
Between the operational logic node a and the operational logic node B in the initial operational logic node topology 104, which are in the upstream and downstream relationship, the operational logic node a may be a source node, whose SBP signature may be configured and generated by a user, or may be generated naturally based on a description of a task by the user, or the SBP signature of the operational logic node a has been determined by decision selection basically according to the scheme of the present disclosure, for example, a descriptor of an output tensor of the SBP signature of the operational logic node a is S (0). While the operational logical node B in the initial operational logical node topology 104 has many candidate SBP signatures, which may include (S (1), B) → S (1), B → P, S (1)) → P, and P → B, etc., from the operational logical node a to the operational logical node B, since the distribution descriptor of the output tensor of the operational logical node a is S (0), the corresponding input tensor distribution descriptor that the node B can select may be S (1), B, and P.
Therefore, once the signatures of some preceding operational logic nodes have been determined, the SBP signature of each operational logic node downstream of them is finally selected based on the cost of data transport between the logical distributed descriptor (SBP descriptor) of the output tensor of the upstream operational logic node and the logical distributed descriptor (SBP descriptor) of the corresponding input tensor in each candidate logical distributed signature of the downstream operational logic node. In this way, once a candidate SBP signature of an operational logic node is chosen for evaluation, the SBP descriptors of the data blocks at the input and output ends of that node are also determined, so that the total data transport cost of the current operational logic node can be calculated or estimated, and the candidate logical distributed signature with the smallest total cost is used as the logical distributed signature of the current operational logic node. It should be noted that if, among the candidate signatures of the current operational logic node, there is one whose logical distributed descriptor at an input end matches the logical distributed descriptor of the output tensor of the upstream operational logic node, that candidate logical distributed signature may be selected preferentially, unless the logical distributed descriptors of its other input tensors make the final total cost larger.
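The selection step described above amounts to a minimum over the candidates of a summed per-input cost. The following sketch makes that explicit under assumptions: `select_signature` and `toy_cost` are hypothetical names, the candidate list and block sizes are invented for the example, and the cost factors in `toy_cost` correspond to a two-card device set shared by source and sink.

```python
def select_signature(upstream_outputs, candidates, cost):
    """Return the candidate signature with the smallest total transport cost.

    upstream_outputs: one (descriptor, block_size) pair per input of the node
    candidates: list of (input_descriptors, output_descriptor) tuples
    cost: callable(src_descriptor, dst_descriptor, block_size) -> amount moved
    """
    def total(signature):
        input_descriptors, _ = signature
        return sum(cost(src, dst, block_size)
                   for (src, block_size), dst in zip(upstream_outputs, input_descriptors))
    return min(candidates, key=total)


def toy_cost(src, dst, block_size):
    """Illustrative cost factors for two cards; only the pairs used below are listed."""
    factor = {("S(0)", "S(0)"): 0, ("S(0)", "S(1)"): 1,
              ("P", "S(0)"): 1, ("P", "B"): 2}
    return factor[(src, dst)] * block_size


# Node E outputs an S(0) tensor (block size 2), node A outputs a P tensor (block size 4).
upstream = [("S(0)", 2), ("P", 4)]
candidates = [
    (("S(1)", "S(0)"), "P"),     # total 2 + 4 = 6
    (("S(0)", "B"), "S(0)"),     # total 0 + 8 = 8
    (("S(1)", "B"), "S(1)"),     # total 2 + 8 = 10
]
best = select_signature(upstream, candidates, toy_cost)
```

With these numbers the first candidate wins, even though the second one matches the descriptor of node E exactly; this is the situation the last sentence of the paragraph warns about, where a matching input does not guarantee the lowest total cost.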
FIG. 4 is a schematic diagram illustrating selection of an SBP signature of a downstream operational logical node according to the present disclosure. Fig. 4 is an enlarged schematic view of the relationship between nodes A, B and E in fig. 3. As shown in fig. 4, it is assumed that the distribution descriptor of the output tensor of the SBP signature SBP-3 already determined by the operational logic node E is S (0), the distribution descriptor of the output tensor of the SBP signature SBP-5 already determined by the operational logic node a is P, and the distribution descriptor of the candidate SBP signature SBP-2 of the operational logic node B is (S (1), S (0)) → P. Therefore, the SBP descriptor of the input tensor of the operational logic node B corresponding to the SBP descriptor S (0) of the output tensor of the operational logic node E is S (1), and the SBP descriptor of the input tensor of the operational logic node B corresponding to the SBP descriptor P of the output tensor of the operational logic node a is S (0). Therefore, in order to meet the requirement of the input logical block distribution of the SBP candidate signature of the operational logical node B, it is necessary to convert the tensor distribution of one input thereof from the SBP descriptor S (0) of the output tensor of the operational logical node E to S (1) and convert the tensor distribution of the other input thereof from the SBP descriptor P of the output tensor of the operational logical node a to S (0). This transformation will result in data exchange during the actual data processing.
Fig. 5 illustrates a first schematic diagram of the transport data amount estimation unit 111 according to the present disclosure estimating the data transport amount generated between tensors of different distributed descriptors. The candidate SBP signature SBP-2 of the task node B shown in fig. 4 is assumed to be (S(1), S(0)) → P. For ease of description, the tasks of the source task nodes A and E that supply the inputs and of the receiving sink node B are distributed on the same device set. For convenience, fig. 5 shows the data distributed over the computing cards GPU0 and GPU1 as in fig. 1. Although only two computing cards are shown, in practice the source and sink task nodes may be distributed across more cards or across different device sets. Fig. 5 shows the data exchange process in the case where the S(0) descriptor tensor of the task of task node E in fig. 4 is distributed over two computing cards and a tensor with the S(1) descriptor is to be obtained at the input of task node B.
For the operation task node of operational logic node B distributed on GPU0 to obtain S(1), in addition to the half of the tensor described by the S(0) descriptor of task node E that is distributed on GPU0 (the acquisition of this data portion is shown with a solid arrow), it must be supplemented with the other half of the tensor described by the S(0) descriptor of task node E that is distributed on GPU1 (the acquisition of this data portion is shown with a dashed arrow). If the size of the logical data block is T1, the amount of data carried from the logical data block of task node E on GPU1 to the task node of B distributed on GPU0 is T1/2. Meanwhile, for the task node of B distributed on GPU1 to obtain S(1), in addition to the half of the tensor described by the S(0) descriptor of task node E that is distributed on GPU1 (the acquisition of this data portion is shown with a dashed arrow), it must be supplemented with the half of the tensor described by the S(0) descriptor of task node E that is distributed on GPU0 (the acquisition of this data portion is shown with a solid arrow). If the size of the logical data block is T1, the amount of data carried from the logical data block of task node E on GPU0 to the task node of B distributed on GPU1 is T1/2. Therefore, converting the S(0) descriptor tensor of task node E into the S(1) descriptor tensor to be obtained at the input end of task node B has a total data transport cost of T1 = T1/2 + T1/2. T1 is the size of the logical data block distributed on the source node. In fig. 4, the size of the logical data block for S(0) is the size of the hatched data block distributed on each card, that is, one half of the entire tensor. When the number of cards in the device set is 3, 4, or 5, the transport cost is also T1.
Fig. 6 illustrates a second schematic diagram of the transport data amount estimation unit 111 according to the present disclosure estimating the data transport amount generated between tensors of different distributed descriptors. Similarly, the candidate SBP signature SBP-2 of the task node B shown in fig. 4 is assumed to be (S(1), S(0)) → P. For convenience of description, the tasks of the source task nodes A and E that supply the inputs and of the receiving sink node B are distributed on the same device set, which, as shown in fig. 6, consists of the computing cards GPU0, GPU1, and GPU2. Although three computing cards are shown here, this is only an example; there may also be two cards as shown in fig. 5. In fact, the source task node and the sink task node can be distributed over more cards or over different device sets. Fig. 6 shows the data exchange process in the case where the P descriptor tensor of the task of task node A in fig. 4 is distributed over three computing cards and a tensor with the S(0) descriptor is to be obtained at the input of task node B.
For the task node of task node B distributed on GPU0 to obtain S(0), in addition to one third of the tensor described by the P descriptor of task node A that is distributed on GPU0 (the acquisition of this data portion is shown by a dashed arrow), it must be supplemented with one third of the tensor described by the P descriptor of task node A that is distributed on GPU1 (the acquisition of this data portion is shown by a solid arrow) and one third of the tensor described by the P descriptor of task node A that is distributed on GPU2. For this purpose, a partial value tensor P is distributed on each of the three cards; here the partial sum tensor Ps is used as the example for description. If the size of the logical data block that task node A distributes on each GPU card is T2, then for the task node of B distributed on GPU0 to obtain the S(0) tensor, it must be supplemented with a data volume of T2/3 carried from the logical data block of task node A on GPU1 and a data volume of T2/3 carried from the logical data block of task node A on GPU2. Similarly, for the task node of B distributed on GPU1 to obtain the S(0) tensor, it must be supplemented with a data volume of T2/3 carried from the logical data block of task node A on GPU0 and a data volume of T2/3 carried from the logical data block of task node A on GPU2. Similarly, for the task node of B distributed on GPU2 to obtain the S(0) tensor, it must be supplemented with a data volume of T2/3 carried from the logical data block of task node A on GPU1 and a data volume of T2/3 carried from the logical data block of task node A on GPU0.
Therefore, the data transport amount in the actual data processing process for converting the P distributed tensor into the S(0) distributed tensor shown in fig. 6 is 2T2 = T2/3 + T2/3 + T2/3 + T2/3 + T2/3 + T2/3. Alternatively, if the task node is distributed over 2 computing cards, the data transport amount is T2 = T2/2 + T2/2. By analogy, when the source node and the sink node have the same device set and the number of cards in the device set is k, the data transport amount is (k-1) · T2.
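The count behind the (k-1) · T2 figure can be reproduced by listing the individual transfers. The sketch below is an illustration only: the function names are hypothetical, cards are numbered from 0, and `Fraction` is used so that thirds add up exactly.

```python
from fractions import Fraction


def partial_to_split_transfers(num_cards, block_size):
    """List (src_card, dst_card, amount) for converting P into S(0) on one shared device set.

    Every sink card fetches 1/num_cards of the block held by each other card;
    the matching slice of its own block is already local and is not counted.
    """
    piece = Fraction(block_size, num_cards)
    return [(src, dst, piece)
            for dst in range(num_cards)
            for src in range(num_cards)
            if src != dst]


def total_amount(transfers):
    """Sum of the data volume over all listed transfers."""
    return sum(amount for _, _, amount in transfers)


three_cards = partial_to_split_transfers(3, 1)  # block size T2 taken as 1
two_cards = partial_to_split_transfers(2, 1)
```

For three cards there are six transfers of T2/3 each, giving 2T2, and for two cards there are two transfers of T2/2 each, giving T2, in line with the text.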
It is clear that, for the operational logic node B as a whole, the data transport cost required when the signature SBP-2 is selected is, as described above, the sum of the transport costs of its two inputs. Combining FIG. 5 and FIG. 6 (assuming two computing cards in FIG. 6), the total data volume that the task node needs to carry in the case of the candidate signature SBP-2 is T1 + T2. For this reason, the transport cost estimated by the transport data amount estimation unit 111 for the candidate signature SBP-2 of the operational logic node B needs to include the transport costs of both input terminals for that candidate signature.
For the case in which the device sets of the source task node and the sink task node are identical, the data exchange amounts between the various SBP descriptors can be summarized in a calculation table, as shown in Table 1 below:
Table 1 (the device sets of the source task node and the sink task node are identical; the number of cards is K)
Conversion mode | Data volume of the source task node's distributed tensor | Data exchange amount | Remarks
S(i)→S(j) | T1 | 0 | i=j
S(i)→S(j) | T1 | T1 | i≠j
S→B | T2 | (K-1)·T2 |
S→P | T3 | T3 |
B→S | T4 | 0 |
B→B | T5 | 0 |
B→P | T6 | 0 |
P→S | T7 | (K-1)·T7 |
P→B | T8 | 2(K-1)·T8 |
P→P | T9 | 0 |
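The entries of Table 1 can be encoded as a small lookup, which may be convenient when estimating costs programmatically. The following is a minimal sketch, not the patented implementation; the function name `same_set_exchange` and the `same_axis` flag (standing for the i=j remark) are hypothetical names chosen for illustration.

```python
def same_set_exchange(src, dst, T, K, same_axis=True):
    """Data exchange amount per Table 1 (identical device sets, K cards).

    src, dst: 'S', 'B' or 'P'. T: data volume of the source tensor.
    same_axis is only consulted for S -> S and stands for the i = j remark.
    """
    if src == "S" and dst == "S":
        return 0 if same_axis else T
    table = {
        ("S", "B"): (K - 1) * T,
        ("S", "P"): T,
        ("B", "S"): 0,
        ("B", "B"): 0,
        ("B", "P"): 0,
        ("P", "S"): (K - 1) * T,
        ("P", "B"): 2 * (K - 1) * T,
        ("P", "P"): 0,
    }
    return table[(src, dst)]
```

For instance, `same_set_exchange("P", "S", T, 3)` evaluates to `2 * T`, which agrees with the three-card case of Fig. 6.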
Fig. 7 is a third schematic diagram of how the transport data amount estimation unit 111 according to the present disclosure estimates the data transport amount generated between tensors with different distributed descriptors. Here the device set of the source node is completely different from that of the sink node: source task node E is distributed across GPU0 and GPU1, and sink task node B is distributed across computing cards GPU2 and GPU3. If the size of the logical data block distributed on each computing card is T3, the amount of data to be transported is 2T3.
Fig. 8 is a fourth schematic diagram of how the transport data amount estimation unit 111 according to the present disclosure estimates the data transport amount generated between tensors with different distributed descriptors. Here the device set of the source node is completely different from that of the sink node: source task node A is distributed across GPU0, GPU1, and GPU2, and sink task node B is distributed across computing cards GPU3, GPU4, and GPU5. For example, a partial-value tensor P is distributed on each of the three source cards; here a partial-sum tensor, denoted Ps, is used as the example. If the size of the logical data block distributed on each computing card of the source task node is T4, the amount of data to be transported is 9 × (1/3)·T4, i.e. 3T4. If the device set over which the source task node is distributed contains 2 computing cards, the amount of data to be transported is 2T4. If the device set over which source task node A is distributed contains Ks computing cards, the data transport amount is Ks·T4.
For the case in which the device sets of the source task node and the sink task node are completely different, the data exchange amounts between the various SBP descriptors can be summarized in a calculation table, as shown in Table 2 below:
Table 2 (the device sets of the source task node (Ks cards) and the sink task node (Kd cards) are completely different)
Conversion mode | Data volume of the source task node's distributed tensor | Data exchange amount | Remarks
S(i)→S(j) | T1 | T1 | i≠j
S→B | T2 | Kd·T2 |
S→P | T3 | T3 |
B→S | T4 | T4 |
B→B | T5 | Kd·T5 |
B→P | T6 | T6 |
P→S | T7 | Ks·T7 |
P→B | T8 | Ks·Kd·T8 |
P→P | T9 | Ks·T9 |
Fig. 9 is a fifth schematic diagram of how the transport data amount estimation unit 111 according to the present disclosure estimates the data transport amount generated between tensors with different distributed descriptors. Here the device set of the source node is not identical to that of the sink node: source task node E is distributed across GPU0 and GPU1, and sink task node B is distributed across computing cards GPU1 and GPU2. If the size of the logical data block distributed on each computing card of the source task node is T5, the amount of data to be transported is (3/2)·T5 = (1/2)·T5 + (1/2)·T5 + (1/2)·T5. In this case the calculation follows no fixed rule and must be carried out according to the specific composition of the actual device sets and the intersection between the source and sink device sets.
Fig. 10 is a sixth schematic diagram of how the transport data amount estimation unit 111 according to the present disclosure estimates the data transport amount generated between tensors with different distributed descriptors. Here the device set of the source node is not identical to that of the sink node: source task node A is distributed across GPU0, GPU1, and GPU2, and sink task node B is distributed across computing cards GPU1, GPU2, and GPU3. For example, a partial-value tensor P is distributed on each of the three source cards; here a partial-sum tensor, denoted Ps, is used as the example. If the size of the logical data block distributed on each computing card of the source task node is T6, the amount of data to be transported is 7 × (1/3)·T6, i.e. (7/3)·T6. In this case the calculation follows no fixed rule and must be carried out according to the specific composition of the actual device sets and the intersection between the source and sink device sets.
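Because the partially overlapping case has no closed-form rule, the volume can instead be counted directly from the two device sets. The sketch below assumes a P-to-S conversion, as in Figs. 6, 8, and 10, in which every sink card needs an equal slice of the partial tensor held on every source card, and a slice counts as transported only when the two cards differ. The function name and its parameters are hypothetical, and the per-card block size defaults to 1 so that results read as multiples of T.

```python
from fractions import Fraction


def partial_to_split_volume(src_devices, dst_devices, tensor_size=Fraction(1)):
    """Volume moved when a P tensor on src_devices becomes an S tensor on dst_devices.

    Each sink card needs a 1/len(dst_devices) slice from every source card;
    slices that already reside on the sink card itself cost nothing.
    """
    slice_size = Fraction(tensor_size) / len(dst_devices)
    # Count (sink, source) pairs that sit on different cards.
    remote_moves = sum(1 for d in dst_devices for s in src_devices if s != d)
    return remote_moves * slice_size


# Device sets taken from the figures, identified by GPU index.
identical = partial_to_split_volume([0, 1, 2], [0, 1, 2])   # Fig. 6 layout
disjoint = partial_to_split_volume([0, 1, 2], [3, 4, 5])    # Fig. 8 layout
overlapping = partial_to_split_volume([0, 1, 2], [1, 2, 3])  # Fig. 10 layout
```

Under these assumptions the three layouts give 2, 3, and 7/3 times the per-card block size, matching the amounts stated for Figs. 6, 8, and 10. Applying the same counting to the device sets of Fig. 9 gives 3/2, although the text does not name the descriptor pair used in that figure.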
In the manner described above, the transport data amount estimation unit 111 traverses all candidate signatures SBP-1, SBP-2, and SBP-3 of operational logic node B and obtains the transport cost under each signature. The total data handling comparing unit 112 then compares the transport costs under the candidate signatures and obtains the minimum transport cost for the operational logic node whose signature is to be determined, for example operational logic node B. Finally, the SBP signature determining unit 113 determines the candidate SBP signature corresponding to the minimum transport cost as the final SBP signature of operational logic node B.
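The estimate, compare, and determine steps can be sketched as a sum over inputs followed by a minimum. This is an illustrative sketch only: the function name `pick_signature` and the cost figures in the example are hypothetical and are not taken from the disclosure.

```python
def pick_signature(candidate_costs):
    """Return (signature, total_cost) for the cheapest candidate.

    candidate_costs maps a candidate signature name to the list of
    transport costs of its inputs; a candidate's cost is their sum.
    """
    totals = {name: sum(costs) for name, costs in candidate_costs.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]


# Hypothetical per-input costs for a node with two inputs.
example = {"SBP-1": [0, 4], "SBP-2": [1, 2], "SBP-3": [3, 3]}
chosen = pick_signature(example)
```

With the hypothetical figures above, `chosen` is `("SBP-2", 3)`, since 1 + 2 is the smallest total.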
The final operational logic node topology output component 12 outputs the final operational logic node topology 101 based on the SBP signature that the SBP signature determination unit 113 has determined for each operational logic node. Each operational logic node in topology 101 carries exactly one SBP signature; that is, each operational logic node explicitly specifies the distribution mode or distributed descriptor of each of its input tensors and uniquely determines the distribution mode or distributed descriptor of its output tensor.
The above describes the general case of determining the final SBP signature from several candidate SBP signatures. In some specific cases, however, a user may specially configure or specify certain operational logic nodes so that they have only the user-specified SBP signature. The operational logic nodes downstream of them then determine their SBP signatures based on these specially specified upstream operational logic nodes.
The above estimation of the transmission cost considers only the data amount. It should be noted, however, that the length of the data transport path, i.e. the complexity of the data transport, is also part of the transmission cost to be considered. Once a weight value has been assigned to the length of the transmission path, the final transmission cost of each candidate SBP signature can be calculated by multiplying that weight by the calculated data amount. Selecting among the candidate SBP signatures on the basis of this path-corrected transmission cost yields a better transport task node insertion result.
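The path-length correction can be sketched as a per-candidate multiplication. The text does not say how the weight is derived from the path length, so in this sketch the weights are simply supplied by the caller; the function name and all figures are hypothetical.

```python
def corrected_costs(data_amounts, path_weights):
    """Multiply each candidate's data amount by the weight of its transport path."""
    return {name: path_weights[name] * amount
            for name, amount in data_amounts.items()}


# Hypothetical figures: SBP-2 moves less data but over a longer path.
amounts = {"SBP-1": 4, "SBP-2": 3}
weights = {"SBP-1": 1.0, "SBP-2": 2.0}
costs = corrected_costs(amounts, weights)
cheapest = min(costs, key=costs.get)
```

In this hypothetical case the correction changes the choice: SBP-2 has the smaller raw data amount, but SBP-1 has the smaller corrected cost.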
Although a portion of a complete task node topology after insertion of a transport task node is shown in fig. 1 and 2, this manner of insertion of a transport task node is merely exemplary. The insertion may also vary based on the basic principles described above with different computing device resources.
Although the above describes data transport between a host and a computing device, data also migrates between operation task nodes on different hosts when the operation tasks of some operation task nodes are deployed directly on the host. Therefore, when the first position mark indicates a first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first operation task node and the second operation task node and gives the inserted transport task node the first position mark. In other words, for data transport across hosts, the transport task node is deployed only on the host that receives the data. On the other hand, when the first position mark indicates a first computing device of the first host and the second position mark indicates the first host, and direct access cannot be performed between the host and the computing device, the transport task node insertion component inserts only one transport task node between the first and second operation task nodes and gives the inserted transport task node the first position mark. Optionally, when the first position mark indicates a first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first and second operation task nodes and gives the inserted transport task node the second position mark, with the transport direction mark G-H.
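The three insertion cases in this paragraph can be sketched as a small rule function. This is a minimal illustration, not the disclosed component: the types `PositionMark` and `TransportNode` and the function `insert_transport_nodes` are hypothetical names, the first node is taken to be the downstream (receiving) node and the second the upstream node, and the condition about direct host-device access is not modelled. Cases that this paragraph does not describe are left unimplemented rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PositionMark:
    host: int
    device: Optional[int] = None  # None: the task runs on the host itself


@dataclass(frozen=True)
class TransportNode:
    mark: PositionMark
    direction: Optional[str] = None  # set only where the text names a direction


def insert_transport_nodes(first: PositionMark, second: PositionMark) -> List[TransportNode]:
    """Transport nodes to insert between an upstream (second) and downstream (first) node."""
    if first == second:
        return []  # same position mark: nothing to transport
    if first.device is None and second.device is None:
        # Host to host: one node, deployed on the receiving host.
        return [TransportNode(first)]
    if first.host == second.host:
        if first.device is not None and second.device is None:
            # Host to one of its own devices: one node with the first mark.
            return [TransportNode(first)]
        if first.device is None and second.device is not None:
            # Device to its own host: one node with the second mark, direction G-H.
            return [TransportNode(second, "G-H")]
    raise NotImplementedError("case not covered by this passage")
```

For example, `insert_transport_nodes(PositionMark(0), PositionMark(0, device=1))` yields a single node carrying the device's position mark and the direction mark G-H.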
With the conversion system and method for converting an operational logic node topological graph into a task node topological graph, the path that data will travel can be obtained in advance from a global perspective, so that data transport task nodes are deployed in advance. Data transport can therefore be deployed statically, with each transport task fixed in a specific transport executor, which enables asynchronous communication during data exchange and reduces the overhead of two invocations. In particular, deploying data transport task nodes in advance from a global perspective overcomes a defect of the prior art, in which dynamic scheduling decides data migration online, so that data processing waits and is delayed and data scheduling cannot overlap with data computation (the prior art cannot overlap data transport with computation). Because transport task nodes are inserted between operation task nodes, the data transport path is planned in advance: the transport role of each piece of data is fixed, and the source and destination of the data and the operation task node served by each transport task node are predetermined. Transport and computation can thus overlap throughout the system, and the blow-ups caused in flow control by resource exhaustion or unplanned resource use are resolved.
And because the transport task nodes are inserted in advance, the waiting process of operation can be eliminated, so that the operation equipment corresponding to the operation task nodes is always in an operation state, and the operation utilization rate is improved.
The basic principles of the present disclosure have been described in connection with specific embodiments, but it should be noted that it will be understood by those skilled in the art that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or a combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (4)

1. A topological graph conversion method is used for converting an operation logic node topological graph into a task node topological graph, and comprises the following steps:
partitioning a task of any operation logic node in an operation logic node topological graph to a specified computing resource through an operation task node deployment component based on task configuration data in task description input by a user on the basis of the given computing resource, so as to generate one or more operation task nodes corresponding to each operation logic node and endow each operation task node with a position mark corresponding to the specified computing resource; and
by the transport task node inserting component, when the first position mark of a first operation task node differs from the second position mark of a second operation task node serving as its upstream operation task node, one or more transport task nodes are inserted between the first operation task node and the second operation task node, so that a complete task node topological graph with the transport task nodes is obtained.
2. The topology graph conversion method of claim 1, wherein the method further comprises, prior to fragmenting, by a compute task node deployment component, a task of any compute logic node in a compute logic node topology graph to the specified computing resources:
and selecting a logic distributed signature with the minimum data carrying cost from a candidate logic distributed signature set of each downstream operation logic node of each source operation logic node as the logic distributed signature of each downstream operation logic node based on a logic distributed signature which is specified by the task configuration data for the source operation logic node in the operation logic node topological graph and consists of the distributed descriptors of the input tensor and the distributed descriptors of the output tensor of the operation logic node through a logic distributed signature selection component in the operation task node deployment component.
3. A topology graph conversion system for converting an arithmetic logic node topology graph into a task node topology graph, comprising:
the operation task node deployment component is used for fragmenting a task of any operation logic node in the operation logic node topological graph to a specified computing resource based on task configuration data in task description input by a user on the basis of the given computing resource, so that one or more operation task nodes corresponding to each operation logic node are generated, and a position mark corresponding to the specified computing resource is given to each operation task node; and
and the transport task node inserting component inserts one or more transport task nodes between the first operation task node and the second operation task node when different position marks exist between the first position mark of the first operation task node and the second position mark of the second operation task node as an upstream operation task node, so that a complete task node topological graph with the transport task nodes is obtained.
4. The topology map conversion system according to claim 3, wherein the operation task node deployment component includes a logical distributed signature selection component that selects, as the logical distributed signature of each downstream operation logic node, a logical distributed signature with the smallest data handling cost from the candidate logical distributed signature set of each downstream operation logic node of each source operation logic node based on a logical distributed signature composed of distributed descriptors of input tensors and distributed descriptors of output tensors of operation logic nodes specified for the source operation logic node in the operation logic node topology map by the task configuration data before fragmenting a task of any operation logic node in the operation logic node topology map to the specified computing resource.
CN202010090334.8A 2020-02-13 2020-02-13 Topological graph conversion system and method Active CN110928697B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010090334.8A CN110928697B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method
CN202010403719.5A CN111666151B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method thereof
PCT/CN2021/072789 WO2021159929A1 (en) 2020-02-13 2021-01-20 Topology diagram conversion system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010090334.8A CN110928697B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202010403719.5A Division CN111666151B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method thereof

Publications (2)

Publication Number Publication Date
CN110928697A CN110928697A (en) 2020-03-27
CN110928697B true CN110928697B (en) 2020-05-22

Family

ID=69854859

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010090334.8A Active CN110928697B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method
CN202010403719.5A Active CN111666151B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010403719.5A Active CN111666151B (en) 2020-02-13 2020-02-13 Topological graph conversion system and method thereof

Country Status (2)

Country Link
CN (2) CN110928697B (en)
WO (1) WO2021159929A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110928697B (en) * 2020-02-13 2020-05-22 北京一流科技有限公司 Topological graph conversion system and method
CN111930519B (en) * 2020-09-22 2020-12-15 北京一流科技有限公司 Parallel decision system and method for distributed data processing
CN112764940B (en) * 2021-04-12 2021-07-30 北京一流科技有限公司 Multi-stage distributed data processing and deploying system and method thereof
CN114035968B (en) * 2022-01-10 2022-03-18 北京一流科技有限公司 Conflict processing system and method for multi-stream parallelism
CN114911976A (en) * 2022-03-25 2022-08-16 浙江大华技术股份有限公司 Node layout method and device, electronic equipment and computer readable storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103516733A (en) * 2012-06-19 2014-01-15 华为技术有限公司 Method and apparatus for processing virtual private cloud
US10102039B2 (en) * 2013-05-17 2018-10-16 Entit Software Llc Converting a hybrid flow
US10693743B2 (en) * 2015-09-21 2020-06-23 Splunk Inc. Displaying interactive topology maps of cloud computing resources
US10649808B2 (en) * 2016-09-16 2020-05-12 Oracle International Corporation Outcome-based job rescheduling in software configuration automation
CN106648859A (en) * 2016-12-01 2017-05-10 北京奇虎科技有限公司 Task scheduling method and device
CN107122244B (en) * 2017-04-25 2020-02-14 华中科技大学 Multi-GPU-based graph data processing system and method
CN107483541A (en) * 2017-07-17 2017-12-15 广东工业大学 A kind of online task immigration method based on rolling time horizon
CN110018817A (en) * 2018-01-05 2019-07-16 中兴通讯股份有限公司 The distributed operation method and device of data, storage medium and processor
CN108388474A (en) * 2018-02-06 2018-08-10 北京易沃特科技有限公司 Intelligent distributed management of computing system and method based on DAG
CN108600321A (en) * 2018-03-26 2018-09-28 中国科学院计算技术研究所 A kind of diagram data storage method and system based on distributed memory cloud
CN109144695B (en) * 2018-08-30 2021-08-10 百度在线网络技术(北京)有限公司 Method, device, equipment and medium for processing task topological relation
CN110262995A (en) * 2019-07-15 2019-09-20 北京一流科技有限公司 It executes body creation system and executes body creation method
CN110222005A (en) * 2019-07-15 2019-09-10 北京一流科技有限公司 Data processing system and its method for isomery framework
CN110928697B (en) * 2020-02-13 2020-05-22 北京一流科技有限公司 Topological graph conversion system and method

Also Published As

Publication number Publication date
WO2021159929A1 (en) 2021-08-19
CN111666151B (en) 2023-11-03
CN110928697A (en) 2020-03-27
CN111666151A (en) 2020-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant