WO2021159929A1 - Topology graph conversion system and method thereof - Google Patents

Topology graph conversion system and method thereof

Info

Publication number
WO2021159929A1
WO2021159929A1 · PCT/CN2021/072789 · CN2021072789W
Authority
WO
WIPO (PCT)
Prior art keywords
node
computing
task
task node
host
Prior art date
Application number
PCT/CN2021/072789
Other languages
English (en)
French (fr)
Inventor
袁进辉
柳俊丞
牛冲
李新奇
Original Assignee
北京一流科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京一流科技有限公司 filed Critical 北京一流科技有限公司
Publication of WO2021159929A1 publication Critical patent/WO2021159929A1/zh


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 - Partitioning or combining of resources
    • G06F 9/5072 - Grid computing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/54 - Interprogram communication
    • G06F 9/542 - Event management; Broadcasting; Multicasting; Notifications

Definitions

  • The present disclosure relates to data processing technology, and more specifically to a conversion system and method for converting a computing logic node topology graph into a task node topology graph.
  • The present disclosure provides a method for converting a computing logic node topology graph into a task node topology graph. Based on the task configuration data in a task description entered by a user for given computing resources, a computing task node deployment component slices the task of every computing logic node in the computing logic node topology graph onto designated computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource.
  • A transport task node insertion component inserts one or more transport task nodes between a first computing task node and a second computing task node, which is its upstream computing task node, when the first position mark of the first computing task node differs from the second position mark of the second computing task node, thereby obtaining a complete task node topology graph containing transport task nodes.
  • According to the method of the present disclosure, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
  • When the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the second position mark.
  • When the first position mark indicates the first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
  • When the first position mark indicates the first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, the transport task node insertion component inserts two transport task nodes between the first computing task node and the second computing task node; the transport task node inserted next to the first computing task node is assigned the first position mark, and the other inserted transport task node is assigned the second position mark.
  • When the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first computing task node to the second computing task node; the first transport task node is assigned the first position mark, the second transport task node is assigned a position mark indicating the first host, and the third transport task node is assigned the second position mark.
  • The method further comprises, before the computing task node deployment component slices the task of any computing logic node onto the designated computing resources, selecting logical distributed signatures through a logical distributed signature selection component within the computing task node deployment component. A logical distributed signature is composed of the distributed descriptors of a computing logic node's input tensors and the distributed descriptor of its output tensor. Based on the task configuration data and the signature specified for each source computing logic node in the computing logic node topology graph, the component selects, from the candidate logical distributed signature set of each downstream computing logic node of each source computing logic node, the logical distributed signature with the smallest data transport cost as that downstream node's logical distributed signature.
  • According to another aspect of the present disclosure, a conversion system for converting a computing logic node topology graph into a task node topology graph comprises: a computing task node deployment component that, based on the task configuration data in a task description entered by a user for given computing resources, slices the task of every computing logic node in the computing logic node topology graph onto designated computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource; and a transport task node insertion component that, when the first position mark of a first computing task node differs from the second position mark of a second computing task node that is its upstream computing task node, inserts one or more transport task nodes between the first computing task node and the second computing task node, thereby obtaining a complete task node topology graph containing transport task nodes.
  • In the conversion system, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
  • When the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the second position mark.
  • When the first position mark indicates the first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
  • When the first position mark indicates the first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, the transport task node insertion component inserts two transport task nodes between the first computing task node and the second computing task node; the transport task node inserted next to the first computing task node is assigned the first position mark, and the other inserted transport task node is assigned the second position mark.
  • When the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first computing task node to the second computing task node; the first transport task node is assigned the first position mark, the second transport task node is assigned a position mark indicating the first host, and the third transport task node is assigned the second position mark.
  • In the conversion system, the computing task node deployment component includes a logical distributed signature selection component. Before the task of any computing logic node in the computing logic node topology graph is sliced onto the designated computing resources, this component, based on the task configuration data and the logical distributed signature specified for each source computing logic node (composed of the distributed descriptors of the input tensors and the output tensor of the computing logic node), selects from the candidate logical distributed signature set of each downstream computing logic node the logical distributed signature with the smallest data transport cost as that downstream node's logical distributed signature.
  • With the conversion system and method of the present disclosure, the run-time path of the data can be known in advance from a global perspective, so that data transport task nodes are deployed ahead of time and data transport can be deployed statically: each data transport task is fixed to a specific transport executor, enabling asynchronous communication during data exchange and reducing the time overhead of the two calls involved.
  • In particular, deploying data transport task nodes in advance from a global perspective eliminates the waiting and delay caused by dynamically scheduled, on-line decisions about data migration in the prior art, which prevented data transport from overlapping with computation.
  • Because the present disclosure inserts transport task nodes between computing task nodes, the data transport paths are planned in advance, the transport role of each piece of data is fixed, and the source and destination of the data as well as the computing task node served by each transport task node are determined ahead of time. Transport and computation can therefore overlap throughout the system, avoiding the resource-exhaustion or unplanned-resource explosions that otherwise arise in flow control.
  • Moreover, because the transport task nodes are inserted in advance, the waiting phase of computation is eliminated, the computing device corresponding to each computing task node remains in the computing state, and compute utilization is improved.
  • FIG. 1 is a schematic diagram of the principle of a conversion system for converting a computing logic node topology graph into a task node topology graph according to the present disclosure.
  • FIG. 2 is a partial schematic diagram of a complete task node topology graph according to the present disclosure.
  • FIG. 3 is a schematic diagram of the structure used to select the logical distributed signature of a computing logic node according to the present disclosure.
  • FIG. 4 is a schematic diagram of selecting the SBP signature of a downstream computing logic node according to the present disclosure.
  • FIG. 5 is a first schematic diagram illustrating how the transport data amount estimation unit according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • FIG. 6 is a second schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
  • FIG. 7 is a third schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
  • FIG. 8 is a fourth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
  • FIG. 9 is a fifth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
  • FIG. 10 is a sixth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of the present disclosure, one of two possible position marks may be referred to as the first position mark or the second position mark, and likewise the other of the two may be referred to as the second position mark or the first position mark.
  • Depending on context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • FIG. 1 is a schematic diagram of the principle of a conversion system for converting a computing logic node topology graph into a task node topology graph according to the present disclosure.
  • As shown in FIG. 1, the conversion system according to the present disclosure includes a computing task node deployment component 10 and a transport task node insertion component 20.
  • When the computing task node deployment component 10 obtains the computing logic node topology graph, it slices the task of every computing logic node in that graph onto designated computing resources, based on the task configuration data in the task description entered by the user for the given computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource.
  • Specifically, a distributed computing system usually includes one or more hosts, and each host is connected to multiple computing devices, such as GPUs, TPUs, and other devices dedicated to large-scale simple operations.
  • When data-parallel computation is needed, the large data blocks to be processed are usually split across multiple computing devices for parallel processing; when the model is large, the model can likewise be partitioned and distributed to different computing devices for processing.
  • For example, when one host (HOST) has two available devices, e.g. GPU0 and GPU1, the data can be split into two parts along its 0th dimension and distributed to GPU0 and GPU1 for parallel processing.
  • If the host is numbered H1, the computing task node that holds a computing logic node's slice on GPU0 of host H1 is given the position mark H1-GPU0, and likewise the computing task node that holds the slice on GPU1 of host H1 is given the position mark H1-GPU1.
  • As shown in FIG. 1, computing logic node E itself will be allocated to the two GPUs of H1 and therefore initially carries the position mark H1-2G.
  • After processing by the computing task node deployment component 10, its two sliced computing task nodes are E1 and E2, which are assigned the position marks H1-GPU0 and H1-GPU1, respectively.
  • Likewise, after computing logic node A is processed by the computing task node deployment component 10, its two sliced computing task nodes A1 and A2 are assigned the position marks H1-GPU0 and H1-GPU1, respectively.
  • Computing logic node B, which is downstream of computing logic nodes A and E, is also processed by the computing task node deployment component 10; its two sliced computing task nodes B1 and B2 are assigned the position marks H1-GPU0 and H1-GPU1, respectively.
  • By analogy, computing logic nodes C, D, and F are all located on the two GPU cards of host H2, so after processing by the computing task node deployment component 10 their respective computing task nodes C1 and C2, D1 and D2, and F1 and F2 are given the position marks H2-GPU0 and H2-GPU1, respectively.
  • By combining the task configuration data in this way, the computing task node topology graph 102 is obtained.
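  • Purely as an illustration (not part of the patent text), the following is a minimal Python sketch of the slicing and position-marking behaviour just described; the class and function names (LogicalNode, TaskNode, deploy_logical_node) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogicalNode:
    name: str                # e.g. "E"
    placement: List[str]     # position marks of the designated devices, e.g. ["H1-GPU0", "H1-GPU1"]

@dataclass
class TaskNode:
    name: str                # e.g. "E1"
    position_mark: str       # e.g. "H1-GPU0"

def deploy_logical_node(node: LogicalNode) -> List[TaskNode]:
    """Slice one computing logic node into one computing task node per designated device."""
    return [TaskNode(f"{node.name}{i + 1}", mark) for i, mark in enumerate(node.placement)]

# Mirrors FIG. 1: logic node E is placed on both GPUs of host H1.
print(deploy_logical_node(LogicalNode("E", ["H1-GPU0", "H1-GPU1"])))
# [TaskNode(name='E1', position_mark='H1-GPU0'), TaskNode(name='E2', position_mark='H1-GPU1')]
```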
  • After the computing task node topology graph 102 has been determined in this way, the transport task node insertion component 20 inserts one or more transport task nodes between a first computing task node and a second computing task node, which is its upstream computing task node, whenever the first position mark of the first computing task node differs from the second position mark of the second computing task node, thereby obtaining a complete task node topology graph containing transport task nodes.
  • Specifically, as shown in FIG. 1, transport task nodes E1-H1 and H1-B2 are inserted between computing task nodes E1 and B2, transport task nodes E2-H1 and H1-B1 are inserted between computing task nodes E2 and B1, and similarly transport task nodes A1-H1 and H1-B2 are inserted between A1 and B2 and transport task nodes A2-H1 and H1-B1 between A2 and B1.
  • FIG. 2 is a schematic diagram of part of the complete task node topology graph after transport task nodes have been inserted according to the present disclosure.
  • As shown in FIG. 2, computing logic node C is distributed on GPU0 and GPU1 of host H1 and its downstream computing logic node D is distributed on GPU0 and GPU1 of host H2, so, as in FIG. 1, their computing task nodes C1 and C2 carry the position marks G0/H1 and G1/H1 and computing task nodes D1 and D2 carry the position marks G0/H2 and G1/H2.
  • Therefore, when the input data required by computing task node D2 must come from computing task node C1, the transport task nodes C1-H1, H1-H2, and H2-D2 need to be inserted between C1 and D2, as shown in FIG. 2; if the input data required by D2 must also come from computing task node C2, the transport task nodes C2-H1, H1-H2, and H2-D2 likewise need to be inserted between C2 and D2.
  • When a direct-access protocol exists between a host and the computing devices (e.g. GPUs) connected to it, such host-device data migration does not require inserting the transport task nodes described in this disclosure; in that case only one transport task node H1-H2 needs to be inserted between C1 or C2 and D1 or D2, that is, a single transport task node H1-H2 can be shared among C1, C2, D1, and D2.
  • Although, for ease of understanding, this part of the graph shows four separate transport task nodes H1-H2, in practice these four transport task nodes can be a single transport task node even when no direct-access protocol exists between host H1 or H2 and the computing devices (e.g. GPUs) connected to them.
  • When the transport task node insertion component 20 inserts a transport task node, it also assigns the inserted node its position mark and records the source address and destination address of the transported data, i.e. it marks the transport direction of the data.
  • The name of each transport node given above thus encodes the source address, the destination address, and the transport direction of that transport task node.
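  • As an illustrative sketch only (not the patent's implementation), the following Python function applies the five insertion cases described above to a pair of position marks; "first" denotes the downstream computing task node and "second" its upstream node, following the wording of the claims, and all names are hypothetical.

```python
from typing import List, Optional, Tuple

def parse_mark(mark: str) -> Tuple[str, Optional[str]]:
    """Split a position mark such as 'H1-GPU0' into (host, device); 'H1' -> ('H1', None)."""
    host, _, device = mark.partition("-")
    return host, device or None

def transport_marks(first_mark: str, second_mark: str) -> List[str]:
    """Position marks of the transport task nodes to insert between a downstream (first)
    computing task node and its upstream (second) computing task node."""
    if first_mark == second_mark:
        return []                                  # same placement: nothing to insert
    h1, d1 = parse_mark(first_mark)
    h2, d2 = parse_mark(second_mark)
    if d1 and d2 is None and h1 == h2:             # device of a host <- that same host
        return [first_mark]
    if d1 is None and d2 and h1 == h2:             # host <- device of the same host
        return [second_mark]
    if d1 is None and d2 is None:                  # host <- a different host
        return [first_mark]
    if d1 and ((h1 == h2 and d2) or (h1 != h2 and d2 is None)):
        return [first_mark, second_mark]           # two nodes; the one next to the
                                                   # first task node gets its mark
    if d1 and d2 and h1 != h2:                     # device <- device on another host
        return [first_mark, h1, second_mark]       # three nodes, in order from first to second
    raise ValueError("combination not covered by the described cases")

# Example mirroring FIG. 2: D2 on H2-GPU1 consumes data produced by C1 on H1-GPU0.
print(transport_marks("H2-GPU1", "H1-GPU0"))       # ['H2-GPU1', 'H2', 'H1-GPU0']
```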
  • It should be pointed out, however, that in order to simplify and optimize the insertion of transport nodes and to shorten the data transport paths, the computing task node deployment component 10 optionally further includes a logical distributed signature selection component 11, and each computing task node additionally selects, based on its operation type, one definite logical distributed signature from its multiple candidate logical distributed signatures.
  • Specifically, before the task of any computing logic node in the computing logic node topology graph is sliced onto the designated computing resources, the logical distributed signature selection component 11, based on the task configuration data and the logical distributed signature specified for each source computing logic node in the computing logic node topology graph (composed of the distributed descriptors of the input tensors and the output tensor of the computing logic node), selects from the candidate logical distributed signature set of each downstream computing logic node of each source computing logic node the logical distributed signature with the smallest data transport cost as that downstream node's logical distributed signature.
  • In this way, the computing task node topology graph 102 with logical distributed signatures is obtained.
  • FIG. 3 is a schematic diagram of the structure used to select the logical distributed signature of a computing logic node according to the present disclosure.
  • FIG. 3 only schematically shows a simple initial computing logic node topology graph 104, in which nodes A, B, C, D, E, F, L, and K are shown; other nodes are omitted.
  • In real data processing, the initial computing logic node topology graph 104 will be more complicated.
  • The initial computing logic node topology graph 104 contains the basic logical computing nodes that implement the computing task described by the user; the way this graph is generated is conventional in the field and is therefore not repeated here.
  • Each initial computing logic node in the initial computing logic node topology graph 104 contains multiple candidate SBP signatures.
  • As operation nodes, the initial computing logic nodes usually contain some inherent candidate SBP signatures.
  • For example, the initial computing logic node B in FIG. 1 has multiple candidate SBP signatures as shown in FIG. 3, for example three: SBP-1, SBP-2, and SBP-3.
  • Other initial computing logic nodes likewise have their own candidate SBP signatures, which are not listed here.
  • Different initial computing logic nodes have different fixed candidate SBP signatures according to their specific operations.
  • The SBP signature according to the present disclosure is a signature applied in a distributed data processing system.
  • Because distributed data processing systems frequently use data parallelism, model parallelism, hybrid parallelism, and stream parallelism, the tasks of adjacent computing logic nodes are often deployed onto different computing devices at the same time, so in actual data processing the intermediate parameters exchanged between computing devices cause a large amount of transport overhead.
  • Transport nodes according to the present disclosure can then be inserted directly according to the distribution of the computing task nodes.
  • In order to obtain downstream computing logic nodes whose data distribution requires the least change, or whose transport path is shortest, the present disclosure specifies a logical distributed signature for each computing logic node.
  • The logical distributed signature is a signature of a computing logic node expressed with the distributed descriptors of its tensors.
  • The distributed descriptor of each tensor describes how that tensor is distributed across the entire computing system, and mainly includes the split (SPLIT) tensor descriptor, the broadcast (BROADCAST) tensor descriptor, and the partial-value (PARTIAL VALUE) tensor descriptor.
  • Specifically, the split (SPLIT) tensor descriptor describes how a tensor is partitioned: for example, a data block is divided along a specified dimension according to the user's description and distributed to different computing devices for the specified computation.
  • If a data block is two-dimensional and is cut along its 0th dimension, the distributed descriptor of the data tensor of the batch of data formed by the block is S(0), and each logical data block obtains this data tensor at its input with the distributed descriptor S(0).
  • Likewise, if a two-dimensional data block is cut along its 1st dimension, the distributed descriptor of the data tensor formed by the block is S(1), and each logical data block obtains this data tensor at its input with the distributed descriptor S(1).
  • If the task data to be processed has more dimensions, there will be more distributed descriptors, such as S(2), S(3), and so on.
  • The data mentioned here can be the data to be processed or the model: if the data itself is split, data parallelism is formed on the distributed data processing system, and if the model is split, model parallelism is formed.
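  • As an illustration only (not part of the patent text), the following sketch shows how S(0) and S(1) split a two-dimensional block across two devices; numpy is used purely for demonstration.

```python
import numpy as np

x = np.arange(24).reshape(4, 6)          # a 4x6 two-dimensional data block

# S(0): split along dimension 0 -> each of two devices holds 2 rows
s0_parts = np.split(x, 2, axis=0)

# S(1): split along dimension 1 -> each of two devices holds 3 columns
s1_parts = np.split(x, 2, axis=1)

print([p.shape for p in s0_parts])       # [(2, 6), (2, 6)]
print([p.shape for p in s1_parts])       # [(4, 3), (4, 3)]
```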
  • In actual data processing, if the data size of a tensor is T and the tensor is distributed to four computing cards for data-parallel computation, the amount of data allocated to each card is one quarter of the data and the total amount of data on the four cards is T.
  • The BROADCAST tensor descriptor describes the way a tensor is published in the distributed system by broadcasting.
  • For a data processing system performing data parallelism, model data is usually broadcast to the various computing devices, so broadcast data input to a computing logic node is described with the broadcast tensor descriptor, and the data block size of the broadcast data is the same on every actual computing card.
  • The partial-value (PARTIAL VALUE) tensor descriptor indicates that the input or output tensor of a computing logic node is a partial value of multiple homogeneous tensors; such partial values include the partial sum (Ps), the partial product (Pm), the partial "and" result, the partial maximum, and the partial minimum. Because data is usually processed in parallel, the processing of data on different devices is the processing of partial data: for example, when some tensors are S(0) or S(1), the result tensors obtained on the individual computing devices are partial results that together form a partial-value tensor, and combining the corresponding data on all devices gives the final output.
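  • Purely as an illustration (not from the patent), a partial-sum (Ps) tensor can be thought of as per-device addends whose elementwise sum is the full result:

```python
import numpy as np

# Each device holds a partial sum (Ps) of the same shape.
partial_on_gpu0 = np.array([[1., 2.], [3., 4.]])
partial_on_gpu1 = np.array([[10., 20.], [30., 40.]])

# Combining the homogeneous partial values on all devices gives the final output.
full_result = partial_on_gpu0 + partial_on_gpu1
print(full_result)        # [[11. 22.] [33. 44.]]
```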
  • The distributed descriptors of the tensors described above represent how those tensors are distributed in the distributed computing system, and, whether the tensors serve as inputs or outputs of a computing logic node, their respective distribution modes also describe how the node's operating data is distributed.
  • For convenience of description, this disclosure refers to such a distributed descriptor as an "SBP descriptor" for short.
  • As the initial computing logic node topology graph 104 is generated, the initial computing logic nodes of the present disclosure are likewise given distributed descriptors for their various inputs and outputs; these input and output distributed descriptors form a kind of signature of the computing logic node, namely a signature of the node expressed with the distributed descriptors of its tensors. For convenience, the English initials of the three descriptors are used to abbreviate this signature as the "SBP signature".
  • According to the requirements of users' descriptions in distributed computing systems, these descriptors include at least the three types S(0), B, and P; if data and models can be split in multiple ways, each additional split direction adds one more descriptor. For each computing logic node, its signature comprises various combinations of these descriptors, so in the distributed system according to the present disclosure there are at least three kinds of distributed descriptors, and usually four, for example the SBP descriptors S(0), S(1), P, and B; depending on the number of tensor dimensions there can be more.
  • SBP signatures are formed from the permutations and combinations of input and output descriptors. Some examples of SBP signatures are: (S(0), B) → S(0), (S(1), B) → S(1), P → P, B → B, (S(0), S(1)) → P, S(0) → P, S(0) → S(0), S(0) → S(1), P → B, and so on. Every SBP signature is a combination of SBP descriptors. For a matrix-multiplication logic node, if its input tensor is cut along the first dimension, its output result tensor is also cut along the first dimension.
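  • To make the notation concrete, here is a small illustrative Python data model for SBP descriptors and signatures; the class names and the sample candidate list are assumptions for demonstration, not the patent's data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SbpDescriptor:
    """One SBP descriptor: S(axis), B, or P."""
    kind: str                      # "S", "B" or "P"
    axis: Optional[int] = None     # split axis, only meaningful for kind == "S"

    def __str__(self) -> str:
        return f"S({self.axis})" if self.kind == "S" else self.kind

def S(axis: int) -> SbpDescriptor:
    return SbpDescriptor("S", axis)

B = SbpDescriptor("B")
P = SbpDescriptor("P")

@dataclass(frozen=True)
class SbpSignature:
    """An SBP signature: descriptors of every input tensor and of the output tensor."""
    inputs: Tuple[SbpDescriptor, ...]
    output: SbpDescriptor

    def __str__(self) -> str:
        return f"({', '.join(map(str, self.inputs))}) -> {self.output}"

# Example candidate signatures mirroring those listed above (illustrative only):
candidates = [
    SbpSignature((S(0), B), S(0)),
    SbpSignature((S(1), B), S(1)),
    SbpSignature((S(0), S(1)), P),     # both operands split -> partial-value output
]
for sig in candidates:
    print(sig)
```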
  • In summary, S, B, and P are descriptors used to describe the distribution of data blocks in the data processing system, and an SBP signature uses multiple SBP descriptors to describe the task operation of a computing logic node.
  • Each data block can have multiple SBP descriptors, and the operation represented by each computing logic node can have multiple SBP-signature configurations.
  • For example, the signature SBP-1 shown in FIG. 1 can be (S(0), B) → S(0), and SBP-2 can be (S(1), B) → S(1).
  • In practical applications, different signature forms can be given different numbers; the numbers used here are only for convenience of description and do not mean that each signature must be assigned a number. There can be no numbers at all, since the different signature forms are distinct from one another and can be distinguished without numbering.
  • The SBP signatures described above can be given to each initial computing logic node based on the task description used.
  • The usual computing logic nodes are operation nodes that perform specific operations and therefore have specific candidate SBP signatures; it should be pointed out that the candidate SBP signatures of different computing logic nodes are not all the same.
  • For example, the input tensors of the SBP signatures of a computing logic node that performs multiplication do not include partial-sum tensors, so the SBP descriptors of its input tensors do not include the distributed descriptor P.
  • By contrast, the candidate SBP signatures of a computing logic node that performs addition can include any combination of the various SBP descriptors with one another or with themselves; its candidate SBP signatures usually include (S(0), B) → S(0), (S(1), B) → S(1), (S(0), S(1)) → P, and so on, but are not limited to these.
  • As the initial computing logic node topology graph 104 is generated, each initial computing logic node is given a candidate logical distributed signature set based on the task configuration data.
  • Each logical distributed signature in the candidate set specifies the distributed descriptor of every input tensor and the distributed descriptor of every output tensor of the initial computing logic node it belongs to.
  • In actual data processing, however, which tensor determined by which SBP signature each computing logic node will actually use, i.e. which distributed tensor it outputs and which distributed tensors it takes as input, still needs to be determined. Therefore, starting from the source computing logic nodes of the initial computing logic node topology graph 104, once the logical labels, i.e. SBP labels, of all upstream computing logic nodes (for example computing logic nodes A and E) of a current computing logic node (for example computing logic node B) have been determined, the transport data amount estimation unit 111 calculates, for each candidate logical distributed signature of node B and based on the distributed descriptors of the upstream outputs that feed the corresponding inputs of node B, the cost of the data that must be moved to transform the distributed descriptor of the tensor at the output of each upstream logic node into the distributed descriptor of the corresponding input of node B under that candidate signature.
  • As shown in FIG. 3, computing logic node B has several candidate SBP signatures, such as SBP-1, SBP-2, and SBP-3.
  • For example, a possible form of SBP-1 is the signature (S(1), B) → S(1) or (S(1), P) → S(1); the determined signature of the initial computing logic node A is SBP-5, whose possible form is, for example, (S(0), B) → S(0); and a possible form of the signature SBP-3 of the initial computing logic node E is, for example, B → B or S(0) → P.
  • In each signature, the left side of the arrow gives the distributed descriptors of the input tensors and the right side gives the distributed descriptor of the output tensor.
  • For convenience of description, a tensor whose distribution descriptor is S(0) is hereinafter called an "S(0) tensor", a tensor whose distribution descriptor is B a "B tensor", a tensor whose distribution descriptor is P a "P tensor", and so on.
  • Suppose the candidate signature selected for node B requires the distribution descriptor of the input tensor corresponding to the output of node E to be S(1), that is, the first input must obtain an S(1) tensor, and requires the distribution descriptor of the input tensor corresponding to the output of node A to be S(0), that is, the second input must obtain an S(0) tensor, while the output tensor of computing logic node A is in fact a P tensor. Clearly, the output-tensor distribution descriptor P of node A does not match the input-tensor distribution descriptor S(0) of the second input of node B, so a conversion of the output must be performed for computing logic node B to carry out the correct operation.
  • This conversion is usually performed during the actual run, and the conversion process usually has to obtain part of the data located on another computing device so that, together with the locally available data, it forms the data required at the input of the current computing logic node and conforms to the distributed descriptor of the data tensor at that input.
  • This process of obtaining part of the data from another device produces comparatively large data transport overhead or transport cost; consequently, choosing different signatures for the current computing logic node produces different data transport overheads or costs.
  • For this reason, the transport data amount estimation unit 111 estimates, for each computing logic node whose signature is not yet determined, the data transport overhead that each candidate signature would generate. For example, for computing logic node B, the data transport cost that node B would incur under each of its three candidate SBP signatures is estimated. Any candidate SBP signature allows node B to accomplish its operation task, but different SBP signatures lead to different data transport costs during its operation; therefore, to minimize the data transport cost during data processing, the signature with the smallest transport data amount is selected from the candidate signatures of each computing logic node as the signature used in the actual run.
  • As shown in FIG. 3, computing logic node A may be a source node, and its SBP signature may be generated by user configuration, derived naturally from the user's description of the task, or already essentially determined according to the scheme of the present disclosure. For convenience of description, assume here that the descriptor of the output tensor of the SBP signature of computing logic node A is S(0).
  • Computing logic node B in the initial computing logic node topology graph 104 has many candidate SBP signatures, which may include (S(1), B) → S(1), B → P, (S(0), S(1)) → P, P → B, and so on.
  • For the distribution descriptor S(0) of the output tensor of computing logic node A, the distribution descriptors that node B can select for the corresponding input tensor may be S(1), B, or P.
  • In other words, the SBP signature of a downstream computing logic node is ultimately selected and determined on the basis of the data transport cost between the logical distributed descriptor (SBP descriptor) of the output tensor of its upstream computing logic node and the logical distributed descriptor (SBP descriptor) of the corresponding input tensor of the downstream node's candidate logical distributed signatures.
  • Once a candidate SBP signature of a computing logic node is selected for computation, the SBP descriptors of the data blocks at each of its inputs and outputs are also determined, so the total data transport cost of the current computing logic node can be calculated or estimated, and the candidate logical distributed signature with the smallest total cost is used as the node's logical distributed signature. It should be pointed out that if, among the candidate signatures of the current computing logic node, the logical distributed descriptor at some input is identical to the logical distributed descriptor of the output tensor of the corresponding upstream node, the candidate logical distributed signature containing that descriptor may be selected preferentially, unless the logical distributed descriptors of the other input tensors of that candidate signature would make the final total cost larger.
  • FIG. 4 is a schematic diagram of selecting the SBP signature of a downstream computing logic node according to the present disclosure; it is an enlarged view of the relationship between nodes A, B, and E in FIG. 3.
  • In FIG. 4, the distribution descriptor of the output tensor of the determined SBP signature SBP-3 of computing logic node E is S(0), while the distribution descriptor of the output tensor of computing logic node A is P.
  • One of the candidate SBP signatures of computing logic node B, SBP-2, is (S(1), S(0)) → P; under this signature the SBP descriptor of the input tensor of node B corresponding to the S(0) output tensor of node E is S(1), and the SBP descriptor of the input tensor of node B corresponding to the output of node A is S(0).
  • FIG. 5 is a first schematic diagram illustrating how the transport data amount estimation unit 111 according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • Take the signature SBP-2 of task node B shown in FIG. 4, assumed to be (S(1), S(0)) → P, as an example, and assume that the tasks of the input source task nodes A and E and of the receiving sink node B are all distributed on the same device set, i.e. on computing cards GPU 0 and GPU 1, as shown in FIG. 5. Although only two computing cards are shown here, in practice the source and sink task nodes can be distributed on more cards or on different device sets.
  • FIG. 5 shows the data-exchange process when the S(0)-descriptor tensor of the task of task node E in FIG. 4 is distributed on two computing cards and the corresponding input of task node B requires an S(1) tensor.
  • When the task node of B distributed on GPU 0 wants to obtain its S(1) slice, it needs to fetch directly the half of the S(0)-descriptor tensor of task node E that resides on GPU 1, and likewise the task node of B on GPU 1 needs to fetch half of E's tensor from GPU 0; the solid arrows in FIG. 5 show this data-acquisition process.
  • The total amount of data transported is therefore T1 (T1/2 + T1/2), where T1 is the size of the logical data block distributed on each card of the source node, whose distribution descriptor is S(0), and the data block in the shaded part of each card is one half of the entire tensor; the transport cost is thus T1.
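  • Purely as an illustrative check of the arithmetic above (the function and variable names are not from the patent), a minimal sketch of the FIG. 5 case:

```python
def cost_s0_to_s1_two_cards(t1: float) -> float:
    """FIG. 5 case: S(0) -> S(1) on the same two-card device set.
    Each of the two sink cards fetches half of the remote source block."""
    return t1 / 2 + t1 / 2    # = t1

T1 = 1.0
print(cost_s0_to_s1_two_cards(T1))   # 1.0, i.e. the transport cost is T1
```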
  • FIG. 6 is a second schematic diagram illustrating how the transport data amount estimation unit 111 according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • Again take the signature SBP-2 of task node B shown in FIG. 4, assumed to be (S(1), S(0)) → P, as an example, and assume that the tasks of the input source task nodes A and E and of the receiving sink node B are all distributed on the same device set, here the computing cards GPU 0, GPU 1, and GPU 2, as shown in FIG. 6. Although three computing cards are shown, this is only an example; there could also be two cards as in FIG. 5, and in practice the source and sink task nodes can be distributed on more cards or on different device sets.
  • FIG. 6 shows the data-exchange process when the P-descriptor tensor of the task of task node A in FIG. 4 is distributed on three computing cards and the corresponding input of task node B requires an S(0) tensor. Each of the three cards holds a partial-value tensor P; the partial sum Ps is used here as the illustrating example.
  • When the task node of B distributed on GPU 0 needs to obtain its S(0) slice, it must additionally fetch the amount T2/3 from the logical data block of task node A on GPU 1 and the amount T2/3 from the logical data block of task node A on GPU 2.
  • When the task node of B distributed on GPU 1 needs to obtain its S(0) slice, it must likewise fetch the amount T2/3 from the logical data block of task node A on GPU 0 and the amount T2/3 from the logical data block of task node A on GPU 2.
  • When the task node of B distributed on GPU 2 needs to obtain its S(0) slice, it must fetch the corresponding amounts from the logical data blocks of task node A on GPU 0 and GPU 1.
  • In general, if each source card holds a logical data block of size T2 and there are k cards, the amount of data transported is (k-1) × T2.
  • Because the candidate signature SBP-2 (for example the signature (S(1), S(0)) → P) has two inputs, the data transport cost of selecting it is the sum of the transport costs of the two inputs; for example, the total amount of data the task node needs to transport is T1 + T2.
  • In other words, the transport cost estimated by the transport data amount estimation unit 111 for the candidate signature SBP-2 of computing logic node B must include the transport costs for both inputs of the candidate signature.
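  • Again only as an illustration under the assumptions above (names invented for this sketch), the FIG. 6 cost and the per-signature sum over inputs can be written as:

```python
def cost_p_to_s0_same_devices(t2: float, k: int) -> float:
    """FIG. 6 case: P -> S(0) on the same k-card device set.
    Every sink card fetches a 1/k slice from each of the other k - 1 cards."""
    return k * (k - 1) * (t2 / k)             # = (k - 1) * t2

T1, T2 = 1.0, 1.0
cost_first_input = T1                          # FIG. 5 case: S(0) -> S(1) on two cards
print(cost_p_to_s0_same_devices(T2, k=3))      # 2.0, i.e. (k - 1) * T2 for k = 3

# The cost of candidate signature SBP-2 = (S(1), S(0)) -> P sums over its inputs;
# with two cards for the second input this reproduces the T1 + T2 total in the text.
print(cost_first_input + cost_p_to_s0_same_devices(T2, k=2))   # 2.0 = T1 + T2
```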
  • FIG. 7 is a third schematic diagram illustrating how the transport data amount estimation unit 111 according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • In FIG. 7 the device set of the source node is completely different from the device set of the sink node: the source task node E is distributed on GPU 0 and GPU 1, while the sink task node B is distributed on computing cards GPU 2 and GPU 3. If the size of the logical data block distributed on each computing card is T3, the amount of data that must be transported is 2T3.
  • FIG. 8 is a fourth schematic diagram illustrating how the transport data amount estimation unit 111 according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • In FIG. 8 the device set of the source node is also completely different from the device set of the sink node: the source task node A is distributed on GPU 0, GPU 1, and GPU 2, while the sink task node B is distributed on computing cards GPU 3, GPU 4, and GPU 5. Each of the three source cards holds a partial-value tensor P; the partial sum Ps is used as the illustrating example.
  • If the size of the logical data block on each source card is T4, the amount of data to be transported is 9 × T4/3, i.e. 3T4. If the number of computing cards over which the source task node is distributed were 2, the amount of data to be transported would be 2T4; in general, if the number of computing cards over which source task node A is distributed is Ks, the amount of data transported is Ks × T4.
  • FIG. 9 is a fifth, and FIG. 10 a sixth, schematic diagram illustrating how the transport data amount estimation unit 111 according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
  • Here the device set of the source node and the device set of the sink node overlap but are not identical: the source task node A is distributed on GPU 0, GPU 1, and GPU 2, while the sink task node B is distributed on computing cards GPU 1, GPU 2, and GPU 3. Each of the three source cards holds a partial-value tensor P; the partial sum Ps is used as the illustrating example.
  • If the size of the logical data block distributed on each computing card of the source task node is T6, the amount of data that must be transported is 7 × T6/3, i.e. 7/3 T6.
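  • The following sketch reproduces the numbers in FIGS. 6, 8, 9, and 10 for the P-to-S(0) conversion; the general counting rule is our reading of those figures rather than a formula stated verbatim in the patent, and all names are illustrative.

```python
from typing import Sequence

def cost_p_to_s0(src_cards: Sequence[str], dst_cards: Sequence[str], block: float) -> float:
    """Transport cost for turning per-card partial-value (P) blocks of size `block`
    into an S(0) tensor split over `dst_cards`: every destination card needs its
    1/len(dst_cards) slice from every source card, and a slice that already resides
    on the same card costs nothing."""
    slice_size = block / len(dst_cards)
    transfers = sum(1 for d in dst_cards for s in src_cards if d != s)
    return transfers * slice_size

T = 1.0
print(cost_p_to_s0(["g0", "g1", "g2"], ["g0", "g1", "g2"], T))  # 2.0   -> (k-1)*T, FIG. 6
print(cost_p_to_s0(["g0", "g1", "g2"], ["g3", "g4", "g5"], T))  # 3.0   -> Ks*T,    FIG. 8
print(cost_p_to_s0(["g0", "g1", "g2"], ["g1", "g2", "g3"], T))  # 2.33  -> 7/3*T,   FIGS. 9/10
```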
  • In the manner described above, the transport data amount estimation unit 111 traverses all candidate signatures SBP-1, SBP-2, and SBP-3 of computing logic node B and obtains the transport cost of each signature; the total transport data comparison unit 112 then compares the transport costs under the candidate signatures and obtains the minimum transport cost of the computing logic node to be determined, e.g. node B; finally, the SBP signature determination unit 113 determines the candidate SBP signature corresponding to the minimum transport cost as the final SBP signature of computing logic node B.
  • Finally, the computing logic node topology graph output component 12 outputs the final computing logic node topology graph 101 on the basis of the SBP signature determined by the SBP signature determination unit 113 for each computing logic node; each computing logic node constituting the graph 101 has exactly one SBP signature attached, that is, each computing logic node clearly specifies the distribution mode or distribution descriptor of each of its input tensors and uniquely determines the distribution mode or distribution descriptor of its output tensor.
  • The above estimate of the transmission cost considers only the amount of data, but it should be pointed out that the length of the data transport path, i.e. the complexity of the data transport, is also part of the transmission cost that needs to be considered.
  • By taking the transport path into account as well, the final transmission cost of each candidate SBP signature can be calculated; if the candidate SBP signature is selected on the basis of this corrected transmission cost, a more optimized transport task node insertion result is obtained.
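  • A minimal sketch of the selection flow just described, with the estimation unit (111) pricing every candidate, the comparison unit (112) finding the minimum, and the determination unit (113) fixing that candidate; the function names, the dict-based interface, and the example cost numbers are all illustrative assumptions, not the patent's API.

```python
from typing import Callable, Dict, Sequence

def choose_sbp_signature(candidates: Sequence[str], cost_of: Callable[[str], float]) -> str:
    costs: Dict[str, float] = {sig: cost_of(sig) for sig in candidates}   # unit 111
    best = min(costs, key=costs.get)                                      # unit 112
    return best                                                           # unit 113

# Illustrative costs for node B's three candidates (numbers are made up).
example_costs = {"SBP-1": 2.0, "SBP-2": 1.0, "SBP-3": 2.5}
print(choose_sbp_signature(list(example_costs), example_costs.get))       # SBP-2
```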
  • Although FIGS. 1 and 2 show part of the complete task node topology graph after the transport task nodes have been inserted, this way of inserting transport task nodes is only an example; with different computing device resources, the insertion pattern changes according to the basic principles described above.
  • The description above concerns the transfer of data between hosts and computing devices; when the computing tasks of some computing task nodes are deployed directly on a host, the transport task node only needs to be deployed on the host that receives the data.
  • Specifically, when direct access between the host and the computing device is not possible, a transport task node is still inserted: in one case the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and gives the inserted transport task node the first position mark; in the other case it likewise inserts only one transport task node between the two computing task nodes, gives the inserted transport task node the second position mark, and marks its transport direction as GH.
  • In addition, the purpose of the present disclosure can also be realized by running a program or a set of programs on any computing device, which may be a well-known general-purpose device.
  • Therefore, the purpose of the present disclosure can also be achieved merely by providing a program product containing program code that implements the method or apparatus; that is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure.
  • Obviously, the storage medium may be any well-known storage medium or any storage medium developed in the future.
  • It should also be pointed out that in the apparatus and method of the present disclosure, each component or each step can be decomposed and/or recombined; these decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.
  • In addition, the steps of the above series of processing can naturally be executed chronologically in the order described, but they do not necessarily need to be; some steps can be performed in parallel or independently of one another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention discloses a method for converting a computing logic node topology graph into a task node topology graph, comprising: slicing, by a computing task node deployment component and based on the task configuration data in a task description entered by a user for given computing resources, the task of every computing logic node in the computing logic node topology graph onto designated computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource; and inserting, by a transport task node insertion component, one or more transport task nodes between a first computing task node and a second computing task node, which is its upstream computing task node, when the first position mark of the first computing task node and the second position mark of the second computing task node differ, thereby obtaining a complete task node topology graph containing transport task nodes.

Description

Topology graph conversion system and method thereof   Technical Field
The present disclosure relates to data processing technology. More specifically, the present disclosure relates to a conversion system and method for converting a computing logic node topology graph into a task node topology graph.
Background Art
With the spread of distributed computing, large jobs are split so that different portions of the data are deployed onto the various computing devices of a distributed data processing system for processing. In the course of processing a specific job, intermediate parameters or results computed on one computing device then become input data for a computing task on another device, and synchronizing these intermediate parameters incurs the call overhead of data migration between computing devices. Network communication calls are often a bottleneck, and poor network communication performance degrades the speed-up ratio and scalability of a multi-machine distributed data processing architecture.
As the computing capability of individual computing devices grows ever stronger, raising the raw computing speed of a device has nearly reached its limit. In particular, as computing speed increases, the speed at which data can be fetched has fallen behind the speed at which it is computed, so data fetching or migration has become the bottleneck that constrains how fast computing devices can process data. In fact, most developers and users of dedicated AI chips focus only on the power consumption and efficiency of the computation itself, for example how to design an AI chip so that it executes matrix operations more efficiently, and pay far less attention to the needs of data migration, forwarding, and routing; yet when multiple chips cooperate on large-scale tasks, data migration is very significant in terms of both power consumption and latency.
Consequently, in existing systems, migrating data between distributed devices costs roughly as much time as the computation itself. How to reduce communication overhead and "hide" this time while the system is computing, so that the system can devote its hardware resources fully to shortening computation time, is the key to improving system efficiency. Moreover, modifying the data routing pattern under flexible parallelism modes (data parallelism, model parallelism, and even hybrid parallelism) is extremely complicated. Existing deep learning frameworks implement only the dataflow-graph computation operations of the model and do not perform data migration operations within the model's dataflow graph. As a result, because these operations are not encoded in the dataflow graph, the advantage of automatic parallelization by a dataflow engine cannot be realized, and software programming work falls into the so-called callback trap during synchronous programming.
Therefore, how to give data transport or data exchange the same weight as data computation in a distributed data processing architecture, so that transport or exchange is treated as a first-class citizen just like data processing and computation; how to make data transport statically deployable by fixing each data transport task to a specific transport executor, thereby achieving asynchronous communication during data exchange and reducing the time overhead of the two calls; and how to make it possible for data transport and routing to be implemented by dedicated chips so that the efficiency of the whole system is maximized: these are problems that urgently need to be solved in the field of large-scale data processing.
Technical Solution
The purpose of the present disclosure is to provide a technical solution that solves at least one of the above problems. Specifically, the present disclosure provides a method for converting a computing logic node topology graph into a task node topology graph, comprising: slicing, by a computing task node deployment component and based on the task configuration data in a task description entered by a user for given computing resources, the task of every computing logic node in the computing logic node topology graph onto designated computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource; and inserting, by a transport task node insertion component, one or more transport task nodes between a first computing task node and a second computing task node, which is its upstream computing task node, when the first position mark of the first computing task node and the second position mark of the second computing task node differ, thereby obtaining a complete task node topology graph containing transport task nodes.
According to the method of the present disclosure for converting a computing logic node topology graph into a task node topology graph, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
According to the method of the present disclosure, when the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the second position mark.
According to the method of the present disclosure, when the first position mark indicates the first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
According to the method of the present disclosure, when the first position mark indicates the first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, the transport task node insertion component inserts two transport task nodes between the first computing task node and the second computing task node, assigns the first position mark to the first transport task node, which is inserted next to the first computing task node, and assigns the second position mark to the other inserted transport task node.
According to the method of the present disclosure, when the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first computing task node to the second computing task node, assigns the first position mark to the first transport task node, assigns a position mark indicating the first host to the second transport task node, and assigns the second position mark to the third transport task node.
According to the method of the present disclosure, the method further comprises, before the computing task node deployment component slices the task of any computing logic node in the computing logic node topology graph onto the designated computing resources, selecting, by a logical distributed signature selection component in the computing task node deployment component and based on the task configuration data and the logical distributed signature specified for each source computing logic node in the computing logic node topology graph (a signature composed of the distributed descriptors of the input tensors and the distributed descriptor of the output tensor of the computing logic node), the logical distributed signature with the smallest data transport cost from the candidate logical distributed signature set of each downstream computing logic node of each source computing logic node, as the logical distributed signature of that downstream computing logic node.
According to another aspect of the present disclosure, there is also provided a conversion system for converting a computing logic node topology graph into a task node topology graph, comprising: a computing task node deployment component that, based on the task configuration data in a task description entered by a user for given computing resources, slices the task of every computing logic node in the computing logic node topology graph onto designated computing resources, thereby generating one or more computing task nodes for each computing logic node and assigning each computing task node a position mark corresponding to the designated computing resource; and a transport task node insertion component that, when the first position mark of a first computing task node differs from the second position mark of a second computing task node that is its upstream computing task node, inserts one or more transport task nodes between the first computing task node and the second computing task node, thereby obtaining a complete task node topology graph containing transport task nodes.
According to the conversion system of the present disclosure, when the first position mark indicates a first computing device of a first host and the second position mark indicates the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
According to the conversion system of the present disclosure, when the first position mark indicates the first host and the second position mark indicates a second computing device of the first host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the second position mark.
According to the conversion system of the present disclosure, when the first position mark indicates the first host and the second position mark indicates a second host, the transport task node insertion component inserts only one transport task node between the first computing task node and the second computing task node and assigns the inserted transport task node the first position mark.
According to the conversion system of the present disclosure, when the first position mark indicates the first computing device of the first host and the second position mark indicates a third computing device of the first host or the second host, the transport task node insertion component inserts two transport task nodes between the first computing task node and the second computing task node, assigns the first position mark to the first transport task node, which is inserted next to the first computing task node, and assigns the second position mark to the other inserted transport task node.
According to the conversion system of the present disclosure, when the first position mark indicates the first computing device of the first host and the second position mark indicates a fourth computing device of the second host, the transport task node insertion component inserts a first, a second, and a third transport task node in order from the first computing task node to the second computing task node, assigns the first position mark to the first transport task node, assigns a position mark indicating the first host to the second transport task node, and assigns the second position mark to the third transport task node.
According to the conversion system of the present disclosure, the computing task node deployment component includes a logical distributed signature selection component that, before the task of any computing logic node in the computing logic node topology graph is sliced onto the designated computing resources, selects, based on the task configuration data and the logical distributed signature specified for each source computing logic node in the computing logic node topology graph (a signature composed of the distributed descriptors of the input tensors and the distributed descriptor of the output tensor of the computing logic node), the logical distributed signature with the smallest data transport cost from the candidate logical distributed signature set of each downstream computing logic node of each source computing logic node, as the logical distributed signature of that downstream computing logic node.
With the conversion system and method of the present disclosure for converting a computing logic node topology graph into a task node topology graph, the run-time path of the data can be known in advance from a global perspective, so that data transport task nodes are deployed ahead of time and data transport can be deployed statically: each data transport task is fixed to a specific transport executor, which enables asynchronous communication during data exchange and reduces the time overhead of the two calls. In particular, deploying the data transport task nodes in advance from a global perspective eliminates the defect of the prior art in which the waiting and delay caused by dynamically scheduled, on-line decisions about data migration prevent data scheduling from overlapping with data computation (the prior art cannot overlap data transport and computation). Precisely because the present disclosure inserts transport task nodes between computing task nodes, the data transport paths are planned in advance, the transport role of each piece of data is fixed, and the source and destination of the data as well as the computing task node served by each transport task node are determined ahead of time, so that transport and computation can overlap throughout the system and the explosions caused by resource exhaustion or unplanned resources in flow control are resolved.
Moreover, because the transport task nodes are inserted in advance, the waiting phase of computation is eliminated, so that the computing device corresponding to each computing task node stays in the computing state and compute utilization is improved.
Other advantages, objectives, and features of the present invention will become apparent in part from the following description, and in part will be understood by those skilled in the art through study and practice of the present invention.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the principle of a conversion system for converting a computing logic node topology graph into a task node topology graph according to the present disclosure.
FIG. 2 is a partial schematic diagram of a complete task node topology graph according to the present disclosure.
FIG. 3 is a schematic diagram of the structure used to select the logical distributed signature of a computing logic node according to the present disclosure.
FIG. 4 is a schematic diagram of selecting the SBP signature of a downstream computing logic node according to the present disclosure.
FIG. 5 is a first schematic diagram illustrating how the transport data amount estimation unit according to the present disclosure estimates the amount of data transported between tensors with different distributed descriptors.
FIG. 6 is a second schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
FIG. 7 is a third schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
FIG. 8 is a fourth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
FIG. 9 is a fifth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
FIG. 10 is a sixth schematic diagram illustrating how the transport data amount estimation unit estimates the amount of data transported between tensors with different distributed descriptors.
本发明的实施方式
下面结合实施例和附图对本发明做进一步的详细说明,以令本领域技术人员参照说明书文字能够据以实施。
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本公开中可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开范围的情况下,在下文中,两个可能位置标记之一可以被称为第一位置标记也可以被称为第二位置标记,类似地,两个可能位置标记的另一个可以被称为第二位置标记也可以被称为第一位置标记。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
为了使本领域技术人员更好地理解本公开,下面结合附图和具体实施方式对本公开作进一步详细说明。
图1所示的是根据本公开的运算逻辑节点拓扑图转换为任务节点拓扑图的转换系统的原理示意图。如图1所示,根据本公开的将运算逻辑节点拓扑图转换为任务节点拓扑图的转换系统包括运算任务节点部署组件10和搬运任务节点插入组件20。所述运算任务节点部署组件10在获得运算逻辑节点拓扑图时,基于用户在给定计算资源的基础上输入的任务描述中的任务配置数据,将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到指定计算资源,从而生成每个运算逻辑节点对应一个或多个运算任务节点,并赋予每个运算任务节点与所述指定计算资源对应的位置标记。
具体而言,在分布式计算系统中,通常会包括一个或多个主机,每个主机上会连接多个运算设备,例如GPU、TPU等专用于大规模简单运算的计算设备。当需要进行数据并行计算时,所需要处理的大规模数据块通常会被分割分片到多个计算设备上进行并行处理。在模型比较大的情况下,通常也可以将模型进行分割而分布到不同计算设备上进行处理。为此,当可利用的一个主机(HOST)上的设备为两个,例如为GPU0和GPU1时,可以沿着数据的第0维度,将数据分片为两部分,分布到GPU0和GPU1上进行并行处理,如果主机编号为H1, 则为运算逻辑节点的分片到该主机H1的GPU0上的运算任务节点赋予位置标记H1-GPU0,同样,为运算逻辑节点的分片到该主机H1的GPU1上的运算任务节点赋予位置标记H1-GPU1。如图1所示,运算逻辑节点E本身由于其将被分配到H1的两个GPU上,因此其初始具备位置标记H1-2G。在经过运算任务节点部署组件10处理后,其被分片的两个运算任务节点为E1和E2,其分别被赋予位置标记H1-GPU0和H1-GPU1。同样运算逻辑节点A经过运算任务节点部署组件10处理后,其被分片的两个运算任务节点为A1和A2,其分别被赋予位置标记H1-GPU0和H1-GPU1。作为运算逻辑节点A和E的下游运算逻辑节点B也经过运算任务节点部署组件10处理后,其被分片的两个运算任务节点为B1和B2,其分别被赋予位置标记H1-GPU0和H1-GPU1。以此类推,运算逻辑节点C、D、F都位于主机H2的两个GPU计算卡上,因此经过运算任务节点部署组件10处理后,其各自的运算任务节点C1和C2、D1和D2、F1和F2的位置标记分别为记H2-GPU0和H2-GPU1。通过结合任务配置数据,获得了运算任务节点拓扑图102。
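为便于直观理解上述将运算逻辑节点分片为运算任务节点并赋予位置标记的过程,下面给出一段示意性的Python草图。其中的类名、字段与函数名(如LogicalNode、TaskNode、deploy)均为本文说明而作的假设,并非本公开的实际实现:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogicalNode:
    name: str                 # 运算逻辑节点名称,例如 "E"
    placement: List[str]      # 任务配置数据指定的计算资源,例如 ["H1-GPU0", "H1-GPU1"]

@dataclass
class TaskNode:
    name: str                 # 运算任务节点名称,例如 "E1"
    location: str             # 位置标记,例如 "H1-GPU0"

def deploy(node: LogicalNode) -> List[TaskNode]:
    """把一个运算逻辑节点按其指定计算资源分片为若干运算任务节点,并赋予位置标记。"""
    return [TaskNode(f"{node.name}{i + 1}", loc) for i, loc in enumerate(node.placement)]

# 用法示例:运算逻辑节点E被分片到主机H1的两张GPU上
e = LogicalNode("E", ["H1-GPU0", "H1-GPU1"])
print(deploy(e))   # [TaskNode(name='E1', location='H1-GPU0'), TaskNode(name='E2', location='H1-GPU1')]
```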
在通过上述方式确定了运算任务节点拓扑图102之后,搬运任务节点插入组件20在第一运算任务节点的第一位置标记和作为其上游运算任务节点的第二运算任务节点的第二位置标记之间具有不同的位置标记时在所述第一运算任务节点和第二运算任务节点之间插入一个或多个搬运任务节点,从而获得具有搬运任务节点的完全任务节点拓扑图。具体而言,如图1所示,在运算任务节点E1和B2之间插入搬运任务节点E1-H1和H1-B2,在运算任务节点E2和B1之间插入搬运任务节点E2-H1和H1-B1,在运算任务节点A1和B2之间插入搬运任务节点A1-H1和H1-B2,以及在运算任务节点A2和B1之间插入搬运任务节点A2-H1和H1-B1。最终形成图1中的完全任务节点拓扑图。不过需要指出的是,在图1中,局限于附图的图幅,仅仅显示了完全任务节点拓扑图的一部分,即包括经过搬运任务节点插入后的包含运算任务节点E、A以及B彼此之间的完全任务节点拓扑图的第一部分103-1,其他部分被省略。不过需要指出的是,当连接在同一主机上的不同运算设备(例如GPU)之间具备直接访问协议的情况下,这种同一主机下的运算设备之间的数据迁移可以不用插入本公开所提及的搬运任务节点。
由于运算任务节点K的位置标记为主机H1,在运算任务节点B1或B2与运算任务节点K之间仅仅插入一个搬运任务节点B1-H1或B2-H1,即运算任务节点K所需要的分布在G0/H1或G1/H1的部分或全部数据将由搬运任务节点B1-H1或B2-H1搬运到主机H1上。不过需要指出的是,在主机H1与其所连接的运算设备(例如GPU)之间具备直接访问协议的情况下,这种主机和运算设备之间的数据迁移可以不用插入本公开所提及的搬运任务节点。
图2所示的是根据本公开的插入搬运任务节点后的完全任务节点拓扑图的一部分的示意图。如图2所示,由于运算逻辑节点C分布在主机H1的两个GPU0和GPU1上,其下游运算逻辑节点D分布在主机H2的两个GPU0和GPU1上,因此如图1所示其各自的运算任务节点C1和C2的位置标记为G0/H1或G1/H1以及运算任务节点D1和D2的位置标记为G0/H2或G1/H2。因此,当运算任务节点D1所需的输入数据需要来自于运算任务节点C1时,则如图2所示,需要在运算任务节点C1和运算任务节点D1之间插入搬运任务节点C1-H1、H1-H2和H2-D1。如果运算任务节点D1所需的输入数据同时还需要来自于运算任务节点C2,则还需要在运算任务节点C2和运算任务节点D1之间插入搬运任务节点C2-H1、H1-H2和H2-D1。同样,当运算任务节点D2所需的输入数据需要来自于运算任务节点C1时,则如图2所示,需要在运算任务节点C1和运算任务节点D2之间插入搬运任务节点C1-H1、H1-H2和H2-D2。如果运算任务节点D2所需的输入数据同时还需要来自于运算任务节点C2,则还需要在运算任务节点C2和运算任务节点D2之间插入搬运任务节点C2-H1、H1-H2和H2-D2。类似地,在主机H1或H2与其所连接的运算设备(例如GPU)之间具备直接访问协议的情况下,这种主机和运算设备之间的数据迁移可以不用插入本公开所提及的搬运任务节点。因此,在运算任务节点C1或C2和D1或D2之间只需要插入一个搬运任务节点H1-H2,也就是说,在C1与C2和D1与D2之间可以共享一个搬运任务节点H1-H2。尽管在图2所示的完全任务节点拓扑图的第二部分103-2中,为了直观理解和方便描述,显示为分别插入了四个搬运任务节点H1-H2,但是实际上,即使在主机H1或H2与其所连接的运算设备(例如GPU)之间不具备直接访问协议的情况下,这四个搬运任务节点H1-H2也可以合并为一个搬运任务节点。根据本公开,在存在跨主机的数据迁移时,对于成对的主机之间的一对运算逻辑节点,只需要插入一个跨主机的搬运任务节点。
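结合上文以及发明内容部分所列举的几种位置组合,下面用一段示意性的Python草图概括搬运任务节点的插入规则。其中的Location、transport_marks等名称均为说明性假设,且未对主机与设备之间存在直接访问协议时可以省略搬运任务节点的情形进行建模:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Location:
    host: str                      # 主机编号,例如 "H1"
    device: Optional[str] = None   # 计算设备编号,例如 "GPU0";None 表示任务直接部署在主机上

    def __str__(self) -> str:
        return self.host if self.device is None else f"{self.host}-{self.device}"

def transport_marks(first: Location, second: Location) -> List[str]:
    """返回在第一(下游)运算任务节点与第二(上游)运算任务节点之间依次插入的
    搬运任务节点的位置标记,顺序为从第一运算任务节点到第二运算任务节点;
    仅覆盖正文中列举的几种位置组合。"""
    if str(first) == str(second):
        return []                                        # 位置标记相同,无需插入搬运任务节点
    same_host = first.host == second.host
    if same_host and first.device is not None and second.device is None:
        return [str(first)]                              # 上游在主机、下游在其某一设备:一个搬运节点,赋予第一位置标记
    if same_host and first.device is None and second.device is not None:
        return [str(second)]                             # 上游在设备、下游在主机:一个搬运节点,赋予第二位置标记
    if not same_host and first.device is None and second.device is None:
        return [str(first)]                              # 跨主机、两端都在主机:一个搬运节点,部署在接收数据的主机
    if (same_host and first.device is not None and second.device is not None) or \
       (not same_host and first.device is not None and second.device is None):
        return [str(first), str(second)]                 # 两个搬运节点:紧邻第一运算任务节点者取第一位置标记
    if not same_host and first.device is not None and second.device is not None:
        return [str(first), first.host, str(second)]     # 跨主机且两端都在设备:三个搬运节点
    raise NotImplementedError("正文未列举的位置组合")

# 用法示例:对应正文中跨主机设备之间插入三个搬运任务节点的情形
print(transport_marks(Location("H2", "GPU0"), Location("H1", "GPU0")))
# 输出:['H2-GPU0', 'H2', 'H1-GPU0']
```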
在搬运任务节点插入组件20插入搬运任务节点的同时,也标记了所插入的搬运任务节点的位置标记,此外也标记了搬运数据的源地址和目的地地址,也就是标记数据的搬运方向。上述每个搬运节点的名称即是搬运任务节点的源地址和目的地地址以及搬运方向。
但是需要指出的是,为了简化和优化搬运任务节点的插入、缩短数据搬运的路径,可选择地,所述运算任务节点部署组件10还包括逻辑分布式签名选择组件11,使得每个运算逻辑节点还基于其运算操作类型,从其多个候选逻辑分布式签名中选择一个确定的逻辑分布式签名。具体而言,逻辑分布式签名选择组件11在将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到所述指定计算资源之前,基于所述任务配置数据为运算逻辑节点拓扑图中的源运算逻辑节点指定的由运算逻辑节点的输入张量的分布式描述符以及输出张量的分布式描述符构成的逻辑分布式签名,从每个源运算逻辑节点的每个下游运算逻辑节点的候选逻辑分布式签名集合中选择数据搬运代价最小的逻辑分布式签名作为每个下游运算逻辑节点的逻辑分布式签名,从而获得具有逻辑分布式签名的运算任务节点拓扑图102。
具体而言,为了获得更好的搬运任务节点插入结果,本公开的运算逻辑节点都包含有针对不同运算操作的候选逻辑分布式签名集合。图3所示的是根据本公开选择运算逻辑节点的逻辑分布式签名的结构示意图。图3中仅仅示意性地给出了一个简单的初始运算逻辑节点拓扑图104,其中显示了节点A、B、C、D、E、F、L以及K。其他未显示的采用省略方式替代。在实际的数据处理中,初始运算逻辑节点拓扑图104会更复杂。初始运算逻辑节点拓扑图104包含实现用户所描述的计算任务的基本逻辑运算节点。这种初始运算逻辑节点拓扑图104的生成方式属于本领域常规技术,因此不在此赘述。
初始运算逻辑节点拓扑图104中的各个初始运算逻辑节点每个包含多个候选SBP签名。其中一些初始运算逻辑节点是已经由用户配置了SBP签名的源运算逻辑节点,或是基于用户的任务描述而确定了唯一SBP签名的初始运算逻辑节点,例如初始运算逻辑节点A的SBP-1、初始运算逻辑节点C的SBP-2以及初始运算逻辑节点E的SBP-3。在未确定唯一SBP签名的情况下,初始运算逻辑节点通常包含有其固有的一些候选SBP签名。如图1中的初始运算逻辑节点B,如后面图3所示,其具有多个候选SBP签名,例如三个,包括SBP-1、SBP-2以及SBP-3。其他初始运算逻辑节点也各自具有不同的候选SBP签名,在此不一一列出。不同的初始运算逻辑节点根据其具体执行的运算操作不同,会有不同的固定的候选SBP签名。
根据本公开的SBP签名是应用在一种分布式数据处理系统中的签名。在分布式数据处理系统中,由于经常存在数据并行、模型并行、混合并行以及流式并行等情形,相邻的运算逻辑节点的任务经常会被同时部署到不同的计算设备上,因此在实际数据处理过程中,各个计算设备之间会对中间参数进行交换,从而导致大量的搬运开销。尽管根据本公开的搬运任务节点可以直接根据运算任务节点的分布进行布置,但是,为了减少数据搬运开销,需要在初始运算逻辑节点拓扑图104的基础上进一步完善运算逻辑节点拓扑图,尤其是为了减少上下游运算逻辑节点之间的搬运开销,需要使得上下游运算逻辑节点的数据分布方式所带来的变化最小或搬运的路径最短。为此,本公开为了给下游运算逻辑节点获得比较好的签名,针对每个运算逻辑节点指定了逻辑分布式签名。所述逻辑分布式签名是采用张量的分布式描述符对运算逻辑节点进行的签名,每个张量的分布式描述符描述了该张量在整个计算系统中的分布方式,主要包括分割(SPLIT)张量描述符、广播(BROADCAST)张量描述符以及部分值(PARTIAL VALUE)张量描述符。
具体而言,分割(SPLIT)张量描述符描述一个张量的分割方式,例如将一个数据块根据用户的描述在指定的维度上进行分割,并分布到不同的计算设备上进行指定的计算处理。如果一个数据块为二维数据块,当该数据块在其第0维被切割时,该数据块所形成的一批数据的数据张量的分布式描述符为S(0),每个逻辑数据块在其输入端获得的这种数据张量的分布式描述符都为S(0)。同样,如果一个数据块为二维数据块,当该数据块在其第1维被切割时,该数据块所形成的一批数据的数据张量的分布式描述符为S(1),每个逻辑数据块在其输入端获得的这种数据张量的分布式描述符都为S(1)。类似地,如果待处理的任务数据的维度为更多维度,则会有更多的分布式描述符,例如S(2)、S(3)……等等。这里所提到的数据可以是被处理的数据或模型。如果数据本身被切割,则在分布式数据处理系统上形成数据并行处理;如果模型被分割,则在分布式数据处理系统上会形成模型并行处理。如果运算逻辑节点的输入为这种分割(SPLIT)张量描述符所描述的张量,则在实际数据处理过程中,如果一个张量的数据大小为T,而该张量将被分布到四张计算卡上进行数据并行计算,则每张卡上分配到的数据量为四分之一的数据,整个四张卡上的数据量则为T。
广播(BROADCAST)张量描述符是用来描述一个张量以广播方式在分布式系统中进行发布的方式。通常,对于仅仅进行数据并行的数据处理系统,模型数据通常被广播到各个计算设备,因此对于被输入到运算逻辑节点的广播数据采用广播张量描述符进行描述。在实际数据处理过程中,被广播的数据,在每张实际计算卡上的数据块大小都是相同的。
部分值(PARTIAL VALUE)张量描述符表示一个运算逻辑节点的输入或输出张量为多个同类张量的部分值。这些部分值包括部分和(Ps)、部分积(Pm)、部分“与”结果、部分最大值以及部分最小值。由于通常会对数据进行数据并行处理,因此在不同设备上对数据的处理是对部分数据的处理。例如,有些输入张量为S(0)或S(1),则在各个计算设备上获得的结果张量只是基于部分数据得到的结果,这些分布在部分计算设备上的结果张量合并起来就构成部分值张量。将所有设备上的同类数据合并起来才是最后的输出结果。
上述各种张量的分布式描述符代表了这些张量在分布式计算系统中的分布方式,而这些张量无论是作为运算逻辑节点的输入还是输出,其各自的分布方式也描述了运算逻辑节点对操作数据的分布要求。为了描述方便,本公开将这种分布式描述符简称为“SBP描述符”。
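下面用一段示意性的Python草图给出S、B、P三类SBP描述符的一种极简表示方式。其中的Split、Broadcast、PartialValue等类型仅为帮助理解而作的假设性数据结构,并非本公开的实际定义:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:
    axis: int               # 切割维度,S(0)表示沿第0维切割,S(1)表示沿第1维切割
    def __str__(self): return f"S({self.axis})"

@dataclass(frozen=True)
class Broadcast:            # B:整块张量被广播到每个计算设备
    def __str__(self): return "B"

@dataclass(frozen=True)
class PartialValue:
    kind: str = "sum"       # P:部分值,例如部分和Ps、部分积Pm等
    def __str__(self): return "P"

# 一个SBP签名即为“各输入张量的描述符 -> 各输出张量的描述符”的组合,
# 例如矩阵乘法的一个候选签名 (S(0), B) -> S(0):
matmul_signature = ((Split(0), Broadcast()), (Split(0),))
print([str(d) for d in matmul_signature[0]], "->", [str(d) for d in matmul_signature[1]])
```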
为此,随着初始运算逻辑节点拓扑图104的生成,本公开的初始运算逻辑节点,也就是一些运算节点也具备了各个输入和输出的数据分布式描述符,这些输入和输出分布式描述符形成了对运算逻辑节点的一种签名,即采用张量的分布式描述符对运算逻辑节点的签名。为了方便表述,采用这三种分布式描述符的英文首字母来简称这种签名为“SBP签名”。
根据每个分布式计算系统中用户对计算任务的描述和数据并行的要求,这种描述符会包括至少三种:S(0)、B以及P。如果对数据和模型存在多种分割方式,则每增加一种分割方式,就增加一种描述符。针对每个运算逻辑节点,其签名都包含了这些描述符的各种组合方式。因此,在根据本公开的分布式系统中,至少有三种分布式描述符,通常有四种分布式描述符,例如如下四种SBP描述符:S(0)、S(1)、P以及B。根据张量维度数量不同,可以有更多分布式描述符。如果为四种SBP描述符,则可以按照输入输出的排列组合方式形成多种SBP签名。下面列出了一些SBP签名的实例:(S(0), B)→S(0),(S(1), B)→S(1),P→P,B→B,(S(0), S(1))→P,S(0)→P,S(0)→S(0),S(0)→S(1),P→B等等。所有SBP签名都是各种SBP描述符的组合结果。对于矩阵乘法运算逻辑节点,如果其输入张量是在第一维上切割的,则其输出结果张量也是在第一维上切割的。综上所述,S、B、P是用于描述数据块在数据处理系统中的分布的描述符,而SBP签名利用多个SBP描述符描述运算逻辑节点的任务操作。每个数据块可以有多种SBP描述符,而每个运算逻辑节点所代表的运算方式可以有多种SBP签名。例如,图1所示的SBP-1可以是(S(0), B)→S(0)这种签名形式,而SBP-2可以是(S(1), B)→S(1)这种签名形式。实际应用中,不同签名形式可以具有不同的编号,这里给出的编号仅仅是为了描述的方便,并不意味着需要对每个签名都赋予一个编号,可以完全没有编号,签名的不同形式彼此之间不需要编号就可以彼此区分。
可以基于用户的任务描述赋予每个初始运算逻辑节点如上所述的SBP签名。通常的运算逻辑节点是一些运算操作节点,其执行特定的运算操作,因此其具有特定的候选SBP签名。需要指出的是,并不是每个运算逻辑节点所具备的SBP签名都一样,通常进行乘法操作的运算逻辑节点,其SBP签名的输入张量不包含部分和张量,因此其输入张量的SBP描述符不包含分布式描述符P。对于执行加法操作的运算逻辑节点,其候选SBP签名则可以包括各种SBP描述符彼此之间或自身之间的任意组合。例如执行矩阵乘法的运算逻辑节点,在仅有数据并行的情况下,其候选SBP签名通常为(S(0), B)→S(0)、(S(1), B)→S(1)、(S(0), S(1))→P等,但不限于这些,随着技术的发展,以前一些不适合矩阵乘法的签名也可以应用到矩阵乘法,此处仅仅是举例。因此,每个初始运算逻辑节点基于所述任务配置数据附有候选逻辑分布式签名集合。所述候选逻辑分布式签名集合中的每个逻辑分布式签名指定了其所属的初始运算逻辑节点的每个输入张量的分布式描述符以及每个输出张量的分布式描述符。
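承接上文,下面以一个示意性的字典给出“运算操作类型→候选SBP签名集合”的一种组织方式。其中矩阵乘法的签名取自上文的例子,加法的签名仅为按照“可为各种组合”这一说明所作的示例性假设:

```python
# 候选签名以字符串形式表示,仅用于示意
CANDIDATE_SIGNATURES = {
    "matmul": [
        "(S(0), B) -> S(0)",      # 数据并行:数据沿第0维切割,模型广播
        "(S(1), B) -> S(1)",
        "(S(0), S(1)) -> P",      # 输出为部分值张量,需要进一步合并
    ],
    "add": [                      # 加法的候选签名为示例性假设
        "(S(0), S(0)) -> S(0)",
        "(B, B) -> B",
        "(P, P) -> P",
    ],
}

def candidates(op_type: str):
    """返回某一运算操作类型固有的候选SBP签名集合。"""
    return CANDIDATE_SIGNATURES.get(op_type, [])

print(candidates("matmul"))
```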
初始运算逻辑节点拓扑图104中的每个运算逻辑节点将使用哪种SBP签名,或者说其将使用和输出何种分布方式的张量,需要进一步确定。因此,从初始运算逻辑节点拓扑图104中的源运算逻辑节点开始,在当前运算逻辑节点(例如运算逻辑节点B)的所有上游运算逻辑节点(例如运算逻辑节点A和E)的逻辑分布式签名或SBP签名已经被确定时,搬运数据量估算单元111基于所述运算逻辑节点B的所有上游运算逻辑节点的与运算逻辑节点B的输入端对应的输出端的分布式描述符,针对运算逻辑节点B的每一个候选逻辑分布式签名,计算将每个上游运算逻辑节点输出端的张量的分布式描述符变换为运算逻辑节点B的对应输入端在该候选逻辑分布式签名下的张量的分布式描述符所需搬运的数据的代价。如图3所示,运算逻辑节点B具有多个候选SBP签名,例如SBP-1、SBP-2以及SBP-3。举例而言,SBP-1的可能形式为(S(1), B)→S(1)或(S(1), P)→S(1),初始运算逻辑节点A的签名SBP-5的可能形式举例而言为(S(0), B)→S(0),初始运算逻辑节点E的签名SBP-3的可能形式举例而言为B→B或S(0)→P。每个签名形式中,箭头左侧为输入张量的分布式描述符,箭头右侧为输出张量的分布式描述符。为了描述方便,下面将“分布描述符为S(0)的张量”简称为“S(0)张量”,将“分布描述符为B的张量”简称为“B张量”,将“分布描述符为P的张量”简称为“P张量”,以此类推。
如图4所示,如果初始运算逻辑节点拓扑图104中运算逻辑节点E的签名SBP-3的形式为“S(0)→S(0)”,则其输出张量的分布描述符为S(0),因此其输出张量为S(0)张量。如果运算逻辑节点E的签名SBP-3的形式为“B→B”或“P→P”,则其输出张量的分布描述符为B或P,因此其输出张量为B张量或P张量。如果运算逻辑节点B的候选签名SBP-2(即“(S(1), S(0))→P”)被选择为确定的签名,则其对应于节点E的输出端的第一输入端的输入张量的分布描述符必须是S(1),即第一输入端必须获得一个S(1)张量,而其对应于节点A的输出端的第二输入端的输入张量的分布描述符必须是S(0),即第二输入端必须获得一个S(0)张量。如图4所示,举例而言,运算逻辑节点A的输出张量为P张量。很显然,此时节点A的输出张量的分布描述符P与节点B的第二输入端的输入张量的分布描述符S(0)不符,因此,要使得运算逻辑节点B执行正确的运算操作,就需要将节点A输出的分布描述符为P的张量变换为分布描述符为S(0)的张量。同样,如果节点E输出的张量的分布描述符为S(0),则与节点B的第一输入端的输入张量的分布描述符S(1)不一致,因此,要使得运算逻辑节点B执行正确的运算操作,就需要将节点E输出的分布描述符为S(0)的张量变换为分布描述符为S(1)的张量。
在分布式计算系统中,由于各个运算逻辑节点的操作任务尤其是运算任务被切割分布到各个计算设备(例如计算卡CPU、GPU或TPU)上,因此为了最终获得正确的结果,需要不断对中间参数进行同步,这就会涉及到不同计算设备之间的中间参数的交换。当上一运算逻辑节点的SBP签名所含有的输出张量的SBP描述符与当前节点的SBP签名的对应输入张量的SBP描述符不一致时,通常在实际运行过程中进行输出转换,而这个转换过程通常需要获取位于另一个计算设备上的部分数据,以便与本地能够获得的数据一起构成当前运算逻辑节点输入端所需的数据,从而符合当前运算逻辑节点的输入端的数据张量的分布式描述符。这种从另一个设备上获取部分数据的过程将会产生比较大的数据搬运开销或搬运代价。因此,为当前运算逻辑节点选择不同的签名会产生不同的数据搬运开销或代价。为此,搬运数据量估算单元111会对每个未确定签名的运算逻辑节点估算每个候选签名将会产生的数据搬运开销。例如,针对运算逻辑节点B,针对其三个候选SBP签名分别估算运算逻辑节点B在采用其中一个SBP签名的情况下会产生的数据搬运代价。对于运算逻辑节点B而言,选择任意一个候选SBP签名都可以实现其操作任务。但是其采用不同的SBP签名情况下,其运行所产生的数据搬运代价不同。因此,为了在数据处理过程中使得数据搬运代价最小化,需要从各个运算逻辑节点的候选签名中选择数据搬运量最小的签名作为其实际运行过程中的签名。
在初始运算逻辑节点拓扑图104中处于上下游关系的运算逻辑节点A和运算逻辑节点B之间,运算逻辑节点A可能是源节点,其SBP签名可以由用户配置生成,也可以基于用户对任务的描述自然生成,或者运算逻辑节点A的SBP签名已经基本按照本公开的方案进行了决策选择确定,例如运算逻辑节点A的SBP签名的输出张量的描述符为S(0)。而作为初始运算逻辑节点拓扑图104中的运算逻辑节点B,其具有很多候选SBP签名,可能包括(S(1), B)→S(1)、B→P、(S(1), S(0))→P以及P→B等等。但是,从运算逻辑节点A到运算逻辑节点B,由于运算逻辑节点A的输出张量的分布描述符为S(0),节点B可以选择的对应输入张量的分布描述符可以为S(1)、B以及P。
因此,当前面一些运算逻辑节点的签名被确定下来以后,其下游的运算逻辑节点的SBP签名也基于上游运算逻辑节点的输出张量的逻辑分布式描述符(SBP描述符)和下游运算逻辑节点的候选逻辑分布式签名的对应输入张量的逻辑分布式描述符(SBP描述符)之间的数据搬运的代价而最终被选择确定。通过这种方式,当一个运算逻辑节点的某个候选SBP签名被选定用于计算时,意味着该运算逻辑节点的各个输入端和输出端的数据块各自的SBP描述符也确定下来,从而可以计算或估算出当前运算逻辑节点的数据搬运的总代价,并将总代价最小的候选逻辑分布式签名作为该当前运算逻辑节点的逻辑分布式签名。需要指出的是,如果当前运算逻辑节点的候选签名中有签名的输入端的逻辑分布式描述符与其上游运算逻辑节点的输出张量的逻辑分布式描述符一致,则可以优先选择含有该逻辑分布式描述符的候选逻辑分布式签名,除非该候选逻辑分布式签名的其他输入端张量的逻辑分布式描述符会导致最后的总代价更大。
图4所示的是根据本公开选择下游运算逻辑节点的SBP签名的示意图。图4是对图3中节点A、B以及E之间的关系的放大示意图。如图4所示,假设运算逻辑节点E的已经确定的SBP签名SBP-3的输出张量的分布描述符为S(0),运算逻辑节点A的已经确定的SBP签名SBP-5的输出张量的分布描述符为P,运算逻辑节点B的候选SBP签名之一SBP-2为(S(1), S(0))→P。因此运算逻辑节点B的与运算逻辑节点E的输出张量的SBP描述符S(0)对应的输入张量的SBP描述符为S(1),而运算逻辑节点B的与运算逻辑节点A的输出张量的SBP描述符P对应的输入张量的SBP描述符为S(0)。因此,要符合运算逻辑节点B的该候选SBP签名的输入逻辑数据块分布要求,则需要使得其一个输入的张量分布从运算逻辑节点E的输出张量的SBP描述符S(0)变换为S(1),以及使得其另一个输入的张量分布从运算逻辑节点A的输出张量的SBP描述符P变换为S(0)。这种变换将会在实际数据处理过程中产生数据交换。
图5图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第一示意图。针对图4所示的任务节点B的候选SBP签名SBP-2,假设其为(S(1), S(0))→P。为了描述方便,输入源任务节点A和E与接收的汇节点B的任务都分布在同一设备集上,如图5所示,与图1所示一样都分布在计算卡GPU 0和GPU 1上。尽管这里只显示了两张计算卡,实际上源任务节点和汇任务节点可以分布到更多张卡上,也可以分布到不同设备集上。图5显示了在图4中的任务节点E的S(0)描述符张量分布在两张计算卡上、而任务节点B的输入端要获得S(1)描述符的张量的情况下的数据交换过程。
运算逻辑节点B的分布在GPU 0上的运算任务节点要获得S(1)张量,则除了需要直接获得任务节点E的S(0)描述符所描述的分布在GPU 0上的张量的一半外(采用实线箭头显示了这种数据部分的获取过程),还需要补充从任务节点E的S(0)描述符所描述的分布在GPU 1上的张量的另外一半(采用虚线箭头显示了这种数据部分的获取过程)。如果逻辑数据块的大小为T1,则从GPU 1上的任务节点E的逻辑数据块搬运到任务节点B的分布在GPU 0上的任务节点的数据量为T1/2。与此同时,任务节点B的分布在GPU 1上的任务节点要获得S(1)张量,则除了需要直接获得任务节点E的S(0)描述符所描述的分布在GPU 1上的张量的一半外(采用实线箭头显示了这种数据部分的获取过程),还需要补充从任务节点E的S(0)描述符所描述的分布在GPU 0上的张量的另外一半(采用虚线箭头显示了这种数据部分的获取过程)。如果逻辑数据块的大小为T1,则从GPU 0上的任务节点E的逻辑数据块搬运到任务节点B的分布在GPU 1上的任务节点的数据量为T1/2。因此,将任务节点E的S(0)描述符张量变换为任务节点B的输入端所要获得的S(1)描述符张量,总的数据搬运代价为T1=(T1/2+T1/2)。T1是源节点上所分布的逻辑数据块的大小。在图5中,逻辑数据块的大小即S(0)张量分布在每张卡上的阴影部分数据块的大小,也就是整个张量的二分之一。在设备集的计算卡数量为3、4或5的情况下,其搬运代价也还是T1。
图6图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第二示意图。同样,针对图4所示的任务节点B的候选SBP签名SBP-2,假设其为(S(1), S(0))→P。为了描述方便,输入源任务节点A和E与接收的汇节点B的任务都分布在同一设备集上,如图6所示,都分布在计算卡GPU 0、GPU 1以及GPU 2上。尽管这里显示了三张计算卡,但这仅仅是为了举例,其也可以如图5所示的那样是两张卡。实际上源任务节点和汇任务节点可以分布到更多张卡上,也可以分布到不同设备集上。图6显示了在图4中的任务节点A的P描述符张量分布在三张计算卡上、而任务节点B的输入端要获得S(0)描述符的张量的情况下的数据交换过程。
任务节点B的分布在GPU 0上的任务节点要获得S(0)张量,则除了需要直接获得任务节点A的P描述符所描述的分布在GPU 0上的张量的三分之一外(采用实线箭头显示了这种数据部分的获取过程),还需要补充任务节点A的P描述符所描述的分布在GPU 1上的张量的三分之一(采用虚线箭头显示了这种数据部分的获取过程)以及任务节点A的P描述符所描述的分布在GPU 2上的张量的三分之一。为此,在三张卡上各分布有部分值张量P,举例而言,这里采用Ps来表示部分和张量,作为一种描述实例。如果任务节点A分布在每张GPU卡上的逻辑数据块的大小为T2,则分布在GPU 0上的任务节点B要获得S(0)张量,还需要从GPU 1上的任务节点A的逻辑数据块向任务节点B的分布在GPU 0上的任务节点搬运数据量T2/3,以及从GPU 2上的任务节点A的逻辑数据块向任务节点B的分布在GPU 0上的任务节点搬运数据量T2/3。同样,分布在GPU 1上的任务节点B要获得S(0)张量,还需要从GPU 0上的任务节点A的逻辑数据块向任务节点B的分布在GPU 1上的任务节点搬运数据量T2/3,以及从GPU 2上的任务节点A的逻辑数据块向任务节点B的分布在GPU 1上的任务节点搬运数据量T2/3。类似地,分布在GPU 2上的任务节点B要获得S(0)张量,还需要从GPU 1上的任务节点A的逻辑数据块向任务节点B的分布在GPU 2上的任务节点搬运数据量T2/3,以及从GPU 0上的任务节点A的逻辑数据块向任务节点B的分布在GPU 2上的任务节点搬运数据量T2/3。因此,图6所示的从P分布式张量变换成S(0)分布式张量在实际数据处理过程中的数据搬运量为2T2=(T2/3+T2/3+T2/3+T2/3+T2/3+T2/3)。可选择地,如果任务节点所分布的计算卡的数量为2,则数据的搬运量为T2=(T2/2+T2/2)。以此类推,在源节点和汇节点具有相同的设备集的情况下,如果设备集中的卡数量为k,则数据的搬运量为(k-1)·T2。
很显然,如上所述,对于运算逻辑节点B要执行其运算操作而言,选择签名SBP-2(例如签名(S(1), S(0))→P)所需要的数据搬运代价是两个输入端的搬运代价的总和。综合图5和图6(如果图6中为两张计算卡的情况),任务节点B在候选签名SBP-2的情况下,其需要搬运的总的数据量为T1+T2。为此,搬运数据量估算单元111针对运算逻辑节点B的候选签名SBP-2所估算的搬运代价需要包含针对该候选签名的两个输入端的搬运代价。
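结合图5和图6的分析,下面给出一个代价估算函数的示意性Python草图。它只实现正文明确给出的两种变换情形(源、汇任务节点分布在完全相同的k张卡上时的S(i)→S(j)与P→S(j)),其余组合见表1;函数名与参数名均为假设:

```python
def same_placement_cost(src_desc: str, dst_desc: str, block_size: float, k: int) -> float:
    """源、汇任务节点分布在完全相同的k张计算卡上时,单个输入端的搬运数据量估算。
    block_size为源任务节点在每张卡上分布的逻辑数据块大小。"""
    if src_desc == dst_desc:
        return 0.0                         # 描述符一致,无需搬运
    if src_desc.startswith("S") and dst_desc.startswith("S"):
        return block_size                  # 例如S(0)变换为S(1):总搬运量约为T(见图5的分析)
    if src_desc == "P" and dst_desc.startswith("S"):
        return (k - 1) * block_size        # P变换为S(0):总搬运量为(k-1)·T(见图6的分析)
    raise NotImplementedError("其余描述符组合见表1")

def signature_cost(inputs):
    """候选签名的总代价为其各输入端搬运代价之和,例如图5与图6两个输入端的T1+T2。"""
    return sum(same_placement_cost(s, d, t, k) for (s, d, t, k) in inputs)

# 用法:节点B的候选签名SBP-2 (S(1), S(0)) -> P,两张卡
print(signature_cost([("S(0)", "S(1)", 16.0, 2),    # 来自节点E的输入端,T1 = 16
                      ("P",    "S(0)", 16.0, 2)]))  # 来自节点A的输入端,T2 = 16,总计32.0
```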
针对源任务节点和汇任务节点的设备集完全相同的情况,可以归纳总结各种SBP描述符彼此之间的数据交换量的计算表,如下表1:
表1(源任务节点和汇任务节点的分布设备集完全相同,卡数为K)
(表1的具体内容在原文中以图片形式给出,此处未能以文本形式还原。)
图7图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第三示意图。其中源节点的设备集与汇节点的设备集完全不同,即源任务节点E分布在GPU 0和GPU 1上,汇任务节点B分布在计算卡GPU 2和GPU 3上。如果分布在各个计算卡上的逻辑数据块大小为T3,则需要搬运的数据量为2T3。
图8图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第四示意图。其中源节点的设备集与汇节点的设备集完全不同,即源任务节点A分布在GPU 0、GPU 1和GPU 2上,汇任务节点B分布在计算卡GPU 3、GPU 4和GPU 5上。举例而言,在三张卡上各分布有部分值张量P,这里采用Ps来表示部分和张量,作为一种描述实例。如果分布在源任务节点的各个计算卡上的逻辑数据块大小为T4,则需要搬运的数据量为9个1/3·T4,即3T4。如果源任务节点所分布的设备集的计算卡的数量为2,则需要搬运的数据量为2T4。如果源任务节点A所分布的设备集的计算卡的数量为Ks,则数据的搬运量为Ks·T4。
针对源任务节点和汇任务节点的设备集完全不同的情况,可以归纳总结各种SBP描述符彼此之间的数据交换量的计算表,如下表2:
表2(源任务节点(卡数为Ks)和汇任务节点(卡数为Kd)各自的分布设备集完全不同)
(表2的具体内容在原文中以图片形式给出,此处未能以文本形式还原。)
图9图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第五示意图。其中源节点的设备集与汇节点的设备集不完全相同,即源任务节点E分布在GPU 0和GPU 1上,汇任务节点B分布在计算卡GPU 1和GPU 2上。如果分布在源任务节点所分布的各个计算卡上的逻辑数据块大小为T5,则需要搬运的数据量为3/2·T5=(1/2·T5+1/2·T5+1/2·T5)。这种情况下,计算没有固定规律,需要根据实际设备集的具体构成以及彼此之间的交集情况进行计算。
图10图示了根据本公开的搬运数据量估算单元111估算不同分布式描述符的张量之间产生的数据搬运量的第六示意图。其中源节点的设备集与汇节点的设备集不完全相同,即源任务节点A分布在GPU 0、GPU 1和GPU 2上,汇任务节点B分布在计算卡GPU 1、GPU 2和GPU 3上。举例而言,在三张卡上各分布有部分值张量P,这里采用Ps来表示部分和张量,作为一种描述实例。如果分布在源任务节点的各个计算卡上的逻辑数据块大小为T6,则需要搬运的数据量为7个1/3·T6,即7/3·T6。这种情况下,计算没有固定规律,需要根据实际设备集的具体构成以及彼此之间的交集情况进行计算。
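针对设备集不完全相同的情况,下面给出一个按“源卡×汇卡”逐对累加跨卡搬运量的示意性Python草图。这是根据图5至图10的数值归纳出的一种示意性统一写法,仅覆盖S(i)→S(j)(i≠j)与P→S(j)两类变换,并非本公开给出的公式;函数名与参数名均为假设:

```python
from itertools import product

def transfer_volume(src_cards, dst_cards, src_desc, dst_desc, block_size):
    """按“源卡×汇卡”逐对累加跨卡搬运量;block_size为源任务节点在每张卡上的逻辑数据块大小。"""
    if not ((src_desc.startswith("S") or src_desc == "P")
            and dst_desc.startswith("S") and src_desc != dst_desc):
        raise NotImplementedError("其余描述符组合见表1、表2")
    piece = block_size / len(dst_cards)          # 每个汇卡从每个源卡取走的数据片大小
    cross = sum(1 for s, d in product(src_cards, dst_cards) if s != d)
    return piece * cross                         # 同一张卡内部的数据不需要搬运

# 与正文各图核对(block_size取1,结果即为相对于逻辑数据块大小的倍数):
print(transfer_volume([0, 1], [1, 2], "S(0)", "S(1)", 1.0))       # 图9:3/2·T5
print(transfer_volume([0, 1, 2], [1, 2, 3], "P", "S(0)", 1.0))    # 图10:7/3·T6
print(transfer_volume([0, 1, 2], [3, 4, 5], "P", "S(0)", 1.0))    # 图8:3·T4
```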
如上所述,搬运数据量估算单元111按照上述方式遍历运算逻辑节点B的所有候选签名SBP-1、SBP-2以及SBP-3,并针对每个签名获取其搬运代价。随后,搬运数据总量比较单元112会比较每个候选签名下的搬运代价,并获取待确定运算逻辑节点(例如运算逻辑节点B)的最小搬运代价。最后由SBP签名确定单元113将最小搬运代价所对应的候选SBP签名确定为该运算逻辑节点B的最终SBP签名。
最后,运算逻辑节点拓扑图输出组件12基于SBP签名确定单元113针对每个运算逻辑节点确定的SBP签名,输出最终的运算逻辑节点拓扑图101,构成该运算逻辑节点拓扑图101的每个运算逻辑节点都只附带有一个SBP签名,或者说每个运算逻辑节点都明确指定了其每个输入张量的分布方式或分布描述符,并且唯一地确定了其每个输出张量的分布方式或分布描述符。
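搬运数据量估算单元111、搬运数据总量比较单元112与SBP签名确定单元113的上述处理流程,可以用如下示意性的Python草图概括。其中候选签名以(输入描述符元组, 输出描述符)的形式表示,estimate_cost为外部给定的代价估算函数,均为说明性假设:

```python
def choose_signature(candidate_signatures, upstream_out_descs, estimate_cost):
    """遍历候选签名,取各输入端搬运代价之和最小者。"""
    best_sig, best_cost = None, float("inf")
    for sig in candidate_signatures:                      # sig为(输入描述符元组, 输出描述符)
        in_descs, _out_desc = sig
        cost = sum(estimate_cost(src, dst)                # 对每个输入端的搬运代价求和
                   for src, dst in zip(upstream_out_descs, in_descs))
        if cost < best_cost:
            best_sig, best_cost = sig, cost
    return best_sig, best_cost

# 用法示例:假设的代价函数——描述符一致代价为0,否则为1
toy_cost = lambda src, dst: 0 if src == dst else 1
candidates = [(("S(0)", "B"), "S(0)"), (("S(1)", "B"), "S(1)"), (("S(0)", "S(1)"), "P")]
print(choose_signature(candidates, ("S(0)", "B"), toy_cost))   # 选中 (('S(0)', 'B'), 'S(0)'),代价0
```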
尽管上面给出了如何在一些候选SBP签名中确定最终SBP签名的常规情况,但是在一些特定的情况下,对于某些运算逻辑节点,在用户有特殊配置或用户明确指定的情况下,这些运算逻辑节点只有用户指定的SBP签名,因此其下游的运算逻辑节点将基于这种被特别指定的上游运算逻辑节点进行SBP签名的确定。
上面传输代价的估算只是针对数据量进行,但是需要指出的是,数据搬运路径的长短,即数据搬运的复杂度也是传输代价需要考虑的部分。在赋予传输路径的长度一定的权重值之后,与上述计算的数据量相乘,可以计算出每个候选SBP签名的最后传输代价。基于经过考虑传输路径之后的校正传输代价来选择候选SBP签名,会获得更为优化的搬运任务节点插入结果。
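按照上述思路,校正后的传输代价可以示意性地表示为数据搬运量与加权路径长度的乘积,例如如下草图(其中的函数名与权重取值均为假设):

```python
def corrected_cost(data_volume: float, path_length: int, weight: float = 1.0) -> float:
    """将搬运数据量与加权后的搬运路径长度相乘,得到校正后的传输代价。"""
    return data_volume * (weight * path_length)

# 例如:同样的数据量,经过3跳搬运路径的候选签名比经过1跳路径的候选签名代价更高
print(corrected_cost(16.0, 3), corrected_cost(16.0, 1))   # 48.0 16.0
```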
尽管在图1和2中给出了一种插入搬运任务节点之后的完全任务节点拓扑图的一部分,但是这种搬运任务节点的插入方式仅仅是一种示例性的。在不同的计算设备资源下,其插入方式也会基于上述基本原则进行变化。
尽管上面描述的是主机和计算设备之间的数据搬运,但是在有些运算任务节点的运算任务直接部署在主机上时,也存在不同主机上的运算任务节点之间的数据迁移。因此,在第一位置标记指明为第一主机而第二位置标记指明为第二主机时,所述搬运任务节点插入组件在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。具体而言,在主机之间执行跨主机的数据搬运时,搬运任务节点将要部署的位置只是在接收数据的主机上。另一方面,对于第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第一主机的情形,在主机与计算设备之间不能执行直接访问时,所述搬运任务节点插入组件在所述第一运算任务节点和第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。可选择地,当第一位置标记指明为第一主机而第二位置标记指明为第一主机的第二计算设备时,所述搬运任务节点插入组件在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第二位置标记,其搬运方向标记为G-H。
通过根据本公开的将运算逻辑节点拓扑图转换为任务节点拓扑图的转换系统和方法,能够从全局角度提前获知数据的运行路径,从而预先部署数据搬运任务节点,使得数据的搬运可以实现静态部署,将数据搬运任务固定在特定的搬运执行体中来实现,从而实现数据交换中的异步通信,以减少两次调用之间的时间开销。尤其是,通过从全局角度预先部署数据搬运任务节点,消除了现有技术中动态调度在线决策数据迁移导致数据处理等待和延时、从而无法实现数据调度和数据计算重叠的缺陷(现有技术无法实现数据搬运和计算的重叠)。正是由于本公开将搬运任务节点插入运算任务节点之间,因此数据搬运路径被提前规划,使得每个数据的搬运角色固定,预先确定数据的来源与目的以及搬运任务节点所服务的运算任务节点对象,从而能够在整个系统中实现搬运与计算的重叠,避免了流控中因资源耗尽或资源无规划而导致的爆炸情形的出现。
而且由于预先插入搬运任务节点,因此能够消除运算的等待过程,使得运算任务节点对应的运算设备一直处于运算状态,提高运算利用率。
以上结合具体实施例描述了本公开的基本原理,但是,需要指出的是,对本领域的普通技术人员而言,能够理解本公开的方法和装置的全部或者任何步骤或者部件,可以在任何计算装置(包括处理器、存储介质等)或者计算装置的网络中,以硬件、固件、软件或者它们的组合加以实现,这是本领域普通技术人员在阅读了本公开的说明的情况下运用他们的基本编程技能就能实现的。
因此,本公开的目的还可以通过在任何计算装置上运行一个程序或者一组程序来实现。所述计算装置可以是公知的通用装置。因此,本公开的目的也可以仅仅通过提供包含实现所述方法或者装置的程序代码的程序产品来实现。也就是说,这样的程序产品也构成本公开,并且存储有这样的程序产品的存储介质也构成本公开。显然,所述存储介质可以是任何公知的存储介质或者将来所开发出来的任何存储介质。
还需要指出的是,在本公开的装置和方法中,显然,各部件或各步骤是可以分解和/或重新组合的。这些分解和/或重新组合应视为本公开的等效方案。并且,执行上述系列处理的步骤可以自然地按照说明的顺序按时间顺序执行,但是并不需要一定按照时间顺序执行。某些步骤可以并行或彼此独立地执行。
上述具体实施方式,并不构成对本公开保护范围的限制。本领域技术人员应该明白的是,取决于设计要求和其他因素,可以发生各种各样的修改、组合、子组合和替代。任何在本公开的精神和原则之内所作的修改、等同替换和改进等,均应包含在本公开保护范围之内。

Claims (14)

  1. 一种拓扑图转换方法,用于将运算逻辑节点拓扑图转换为任务节点拓扑图,包括:
    通过运算任务节点部署组件,基于用户在给定计算资源的基础上输入的任务描述中的任务配置数据,将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到指定计算资源,从而生成每个运算逻辑节点对应一个或多个运算任务节点,并赋予每个运算任务节点与所述指定计算资源对应的位置标记;以及
    通过搬运任务节点插入组件,在第一运算任务节点的第一位置标记和作为其上游运算任务节点的第二运算任务节点的第二位置标记之间具有不同的位置标记时在所述第一运算任务节点和第二运算任务节点之间插入一个或多个搬运任务节点,从而获得具有搬运任务节点的完全任务节点拓扑图。
  2. 根据权利要求1所述的拓扑图转换方法,其中:
    当第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第一主机时,所述搬运任务节点插入组件在所述第一运算任务节点和第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。
  3. 根据权利要求1所述的拓扑图转换方法,其中:
    当第一位置标记指明为第一主机而第二位置标记指明为第一主机的第二计算设备时,所述搬运任务节点插入组件在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第二位置标记。
  4. 根据权利要求1所述的拓扑图转换方法,其中:
    当第一位置标记指明为第一主机而第二位置标记指明为第二主机时,所述搬运任务节点插入组件在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。
  5. 根据权利要求1所述的拓扑图转换方法,其中:
    当第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第一主机的第三计算设备或第二主机时,所述搬运任务节点插入组件在所述第一运算任务节点和第二运算任务节点之间插入两个搬运任务节点,并为紧邻第一运算任务节点插入的第一搬运任务节点赋予第一位置标记,而为另一插入的搬运任务节点赋予第二位置标记。
  6. 根据权利要求1所述的拓扑图转换方法,其中:
    当第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第二主机的第四计算设备时,所述搬运任务节点插入组件按照从所述第一运算任务节点到第二运算任务节点之间的顺序依次插入第一、第二以及第三搬运任务节点,并为第一搬运任务节点赋予第一位置标记,为第二搬运任务节点赋予指明第一主机的位置标记以及第三搬运任务节点赋予第二位置标记。
  7. 根据权利要求1-6之一所述的拓扑图转换方法,其中所述方法还包括在通过运算任务节点部署组件将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到所述指定计算资源之前:
    通过运算任务节点部署组件中的逻辑分布式签名选择组件,基于所述任务配置数据为运算逻辑节点拓扑图中的源运算逻辑节点指定的由运算逻辑节点的输入张量的分布式描述符以及输出张量的分布式描述符构成的逻辑分布式签名,从每个源运算逻辑节点的每个下游运算逻辑节点的候选逻辑分布式签名集合中选择数据搬运代价最小的逻辑分布式签名作为每个下游运算逻辑节点的逻辑分布式签名。
  8. 一种拓扑图转换系统,用于将运算逻辑节点拓扑图转换为任务节点拓扑图,包括:
    运算任务节点部署组件,基于用户在给定计算资源的基础上输入的任务描述中的任务配置数据,将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到指定计算资源,从而生成每个运算逻辑节点对应一个或多个运算任务节点,并赋予每个运算任务节点与所述指定计算资源对应的位置标记;以及
    搬运任务节点插入组件,在第一运算任务节点的第一位置标记和作为其上游运算任务节点的第二运算任务节点的第二位置标记之间具有不同的位置标记时在所述第一运算任务节点和第二运算任务节点之间插入一个或多个搬运任务节点,从而获得具有搬运任务节点的完全任务节点拓扑图。
  9. 根据权利要求8所述的拓扑图转换系统,其中:
    所述搬运任务节点插入组件当第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第一主机时在所述第一运算任务节点和第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。
  10. 根据权利要求8所述的拓扑图转换系统,其中:
    所述搬运任务节点插入组件在第一位置标记指明为第一主机而第二位置标记指明为第一主机的第二计算设备时,在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第二位置标记。
  11. 根据权利要求8所述的拓扑图转换系统,其中:
    所述搬运任务节点插入组件在第一位置标记指明为第一主机而第二位置标记指明为第二主机时,在所述第一运算任务节点和所述第二运算任务节点之间只插入一个搬运任务节点,并赋予所插入的搬运任务节点第一位置标记。
  12. 根据权利要求8所述的拓扑图转换系统,其中:
    所述搬运任务节点插入组件在第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第一主机的第三计算设备或第二主机时,在所述第一运算任务节点和第二运算任务节点之间插入两个搬运任务节点,并为紧邻第一运算任务节点插入的第一搬运任务节点赋予第一位置标记,而为另一插入的搬运任务节点赋予第二位置标记。
  13. 根据权利要求8所述的拓扑图转换系统,其中:
    所述搬运任务节点插入组件在第一位置标记指明为第一主机的第一计算设备而第二位置标记指明为第二主机的第四计算设备时,按照从所述第一运算任务节点到第二运算任务节点之间的顺序依次插入第一、第二以及第三搬运任务节点,并为第一搬运任务节点赋予第一位置标记,为第二搬运任务节点赋予指明第一主机的位置标记以及第三搬运任务节点赋予第二位置标记。
  14. 根据权利要求8-13之一所述的拓扑图转换系统,其中所述运算任务节点部署组件包括逻辑分布式签名选择组件,其在将运算逻辑节点拓扑图中的任意运算逻辑节点的任务分片到所述指定计算资源之前,基于所述任务配置数据为运算逻辑节点拓扑图中的源运算逻辑节点指定的由运算逻辑节点的输入张量的分布式描述符以及输出张量的分布式描述符构成的逻辑分布式签名,从每个源运算逻辑节点的每个下游运算逻辑节点的候选逻辑分布式签名集合中选择数据搬运代价最小的逻辑分布式签名作为每个下游运算逻辑节点的逻辑分布式签名。