CN112764940B - Multi-stage distributed data processing and deploying system and method thereof - Google Patents

Info

Publication number
CN112764940B
CN112764940B (application CN202110386635.XA, granted as CN 112764940 B)
Authority
CN
China
Prior art keywords
logic
node
sbp
data processing
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110386635.XA
Other languages
Chinese (zh)
Other versions
CN112764940A (en)
Inventor
李新奇
柳俊丞
郭冉
李一鹏
袁进辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd filed Critical Beijing Oneflow Technology Co Ltd
Priority to CN202110386635.XA priority Critical patent/CN112764940B/en
Publication of CN112764940A publication Critical patent/CN112764940A/en
Application granted granted Critical
Publication of CN112764940B publication Critical patent/CN112764940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/465 - Distributed object oriented systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/466 - Transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-level distributed data processing deployment system and method. The system comprises: a device hierarchy setting component that arranges the plurality of logical data processing devices into at least two levels of parallel logical data processing devices, thereby determining the number of dimensions of the SBP distributed signature; a position mark acquisition component that acquires the position marks of all logical data processing devices; an initial logical node topology generation component that generates an initial logical node topology graph for the multi-level distributed data processing system based on task configuration data input by the user; a transmission cost query component that, for each current initial logical node, queries a transmission cost conversion table to obtain the transmission cost between the multi-dimensional SBP distribution descriptor at the output of the upstream logical node and the multi-dimensional SBP distribution descriptor at the corresponding input of the current initial logical node; and a result logical node topology generation component that, based on the query results of the transmission cost query component, selects the candidate multi-dimensional SBP distributed signature with the minimum total transmission cost, thereby obtaining the current result logical node with its multi-dimensional SBP distributed signature determined.

Description

Multi-stage distributed data processing and deploying system and method thereof
Technical Field
The present disclosure relates to a data processing technology. More particularly, the present disclosure relates to a multi-level distributed data processing deployment system and a method thereof based on multi-dimensional SBP distributed signatures.
Background
With the popularization of distributed computing, a large job or a large logical tensor can be partitioned so that different parts of its data are deployed to different data processing devices of a distributed data processing system, and intermediate parameters must be exchanged during the computation of each part. Thus, during the processing of a specific job, intermediate parameters or results produced on one data processing device may be required as input data for a computation task on another data processing device, which incurs data transfer overhead between the devices. When the job data is large, this transfer overhead between different data processing devices imposes a significant computational burden on the distributed data processing system.
However, as models grow larger and the data to be processed grows with them, a single machine can no longer handle the model. On one hand, the memory of a data processing device (such as a GPU card) can be increased to accommodate a larger model, but one 16 GB GPU card usually costs more than twice as much as two 8 GB GPU cards, so increasing the memory resources of a single device is not cost-effective. On the other hand, in some scenarios the model is too large for the communication overhead of data parallelism to be acceptable, or the model exceeds the GPU memory capacity; in these cases the model must be partitioned so that each device performs only the part of the computation corresponding to its slice of the model. This is called model parallelism. A large model is therefore usually accommodated by two or more GPU cards with smaller memory operating in model-parallel mode, i.e., the data processing requirement is met by performing model parallelism. Model parallelism does not require synchronizing the model between devices, but it does require synchronizing data between devices. Currently, most deep learning frameworks either do not support model parallelism or support it only weakly; efficient execution requires very subtle tuning and repeated manual debugging, and even then the results are often unsatisfactory. Model parallelism is an industry-recognized challenge. Beyond its intrinsic complexity, combining model parallelism with other parallel modes is also very complex, requiring careful management of data routing between upstream and downstream devices. Moreover, in many cases the communication overhead and synchronization cost of model parallelism exceed those of data parallelism, so its speedup ratio is lower.
Nevertheless, for a large model that cannot fit in the memory of a single machine, model parallelism is a good choice. On the other hand, when the data to be processed is also very large, data parallelism is needed as well. However, since many deep learning frameworks cannot automatically implement hybrid parallelism combining model parallelism and data parallelism, practitioners still resort to large-capacity GPU cards, and even with such cards they usually select either pure data parallelism or pure model parallelism in order to reduce manual effort.
When both large-scale data and a large-scale model must be considered, adopting hybrid parallelism is even more difficult. Taking two adjacent layers of a neural network as an example, if the first layer uses data parallelism and the second layer uses model parallelism, then in the forward computation the results of the data-parallel part must be copied (Copy) and concatenated (Concat), and these two routing steps gather the data onto the two model-parallel devices; if the two layers execute on different data processing devices, cross-machine communication is also needed. If these complex data routes had to be managed manually by the user, they would on the one hand be too complex (imagine the various combinations of data and model parallelism) and on the other hand very error-prone. Ideally, these complexities should be handled by the deep learning platform, but unfortunately no existing open-source deep learning platform supports this functionality.
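The Copy/Concat routing between a data-parallel layer and a model-parallel layer described above can be sketched in a few lines of pure Python. The functions and shard layout below are illustrative assumptions (not taken from the patent), with plain lists standing in for tensors.

```python
# Hypothetical sketch: route the outputs of a data-parallel layer 1
# to a model-parallel layer 2 via Concat followed by Copy.

def concat(shards):
    """Concat: join per-device data-parallel shards along the batch axis."""
    out = []
    for s in shards:
        out.extend(s)
    return out

def copy_to(tensor, n_devices):
    """Copy: broadcast the full tensor to every model-parallel device."""
    return [list(tensor) for _ in range(n_devices)]

# Layer 1 is data-parallel: each of two devices produced half of the batch.
layer1_shards = [[1, 2], [3, 4]]

# Routing to layer 2 (model-parallel on two devices): first gather the
# shards, then copy the full activation to both devices, each of which
# holds only part of the model.
full = concat(layer1_shards)
layer2_inputs = copy_to(full, n_devices=2)
```

Even in this toy form, the routing depends on both layers' parallel modes, which is why the patent argues it should be derived automatically by the platform rather than written by the user.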
Therefore, it is desirable to obtain a technical scheme that implements large-scale model and data processing on distributed computing resources built from small-capacity GPU cards, so that model parallelism can be realized while achieving the same data processing effect as performing data parallelism concurrently with it, and so that the parallel deployment is carried out automatically. At present, no deployment system exists that realizes such hybrid-parallel data processing, let alone one that realizes automatic deployment.
Disclosure of Invention
To this end, the inventors of the present application propose a multi-level distributed data processing deployment system, comprising: a device hierarchy setting component that arranges a plurality of logical data processing devices into at least two levels of parallel logical data processing devices and specifies the logical hierarchical relationship between them, thereby determining the number of dimensions of the SBP distributed signature, wherein each upper-level logical data processing device contains or is connected to the same number of lower-level data processing devices; a position mark acquisition component that acquires the position marks of all logical data processing devices; an initial logical node topology generation component that generates an initial logical node topology graph for the multi-level distributed data processing system based on task configuration data received from the user, wherein each initial logical node is annotated with one or more candidate multi-dimensional SBP distributed signatures and position marks, the descriptor of each dimension of the multi-dimensional SBP distribution descriptor at each input and output of each candidate signature describes how a logical tensor is distributed over the logical data processing devices of the corresponding level, and each position mark indicates the logical data processing device on which the logical tensor is deployed; a transmission cost query component that, for each current initial logical node, queries a transmission cost conversion table based on the multi-dimensional SBP distribution descriptor at the output of each upstream logical node whose signature has been determined and the multi-dimensional SBP distribution descriptor at the corresponding input of each candidate signature of the current initial logical node, so as to obtain the transmission cost between the descriptor at the output of the upstream logical node and the descriptor at the corresponding input of the current initial logical node; and a result logical node topology generation component that, based on the query results of the transmission cost query component, obtains for each candidate signature of the current initial logical node the total transmission cost required over all of its inputs, and selects the candidate multi-dimensional SBP distributed signature with the minimum total transmission cost as the determined signature of the current initial logical node, thereby obtaining the current result logical node with its multi-dimensional SBP distributed signature determined.
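A candidate multi-dimensional SBP distributed signature can be pictured as a tuple of one-dimensional SBP descriptors, one per device level. The following sketch enumerates such descriptors under that assumption; the atom names S0/S1/B/P follow the usual split/broadcast/partial-sum convention, but the encoding itself is an illustration of mine, not the patent's data structure.

```python
# Illustrative encoding of multi-dimensional SBP distribution descriptors,
# assuming a two-level device hierarchy (so each descriptor has 2 dimensions).
from itertools import product

ATOMS = ("S0", "S1", "B", "P")  # 1-D SBP atoms: split axis 0/1, broadcast, partial-sum

def candidate_2d_descriptors():
    """All 2-D SBP distribution descriptors for one tensor: one atom per
    hierarchy level, e.g. ("S0", "B") means split across first-level
    devices and broadcast across the second-level devices inside each."""
    return list(product(ATOMS, repeat=2))

descs = candidate_2d_descriptors()
```

With 4 atoms and 2 levels this yields 16 two-dimensional descriptors per tensor, which is why the system prunes candidates by transmission cost rather than enumerating full signatures by hand.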
The multi-level distributed data processing deployment system according to the present disclosure further comprises: a computation graph generation component that generates a task logic computation graph based on the result logical node topology graph composed of the result logical nodes whose multi-dimensional SBP distributed signatures have been determined. When the multi-dimensional SBP distribution descriptor at an input of the current result logical node differs from the multi-dimensional SBP distribution descriptor at the output of the corresponding upstream logical node, a transform computation node is inserted between the input of each computation node corresponding to the current result logical node and the output of each computation node corresponding to the upstream logical node, so as to transform the logical tensor output by the upstream computation node and described by its output descriptor into the logical tensor to be consumed at the corresponding input of the current computation node and described by its input descriptor.
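The transform-node insertion rule just described can be sketched minimally as follows; the string-encoded descriptors, tuple edges, and node names are hypothetical illustrations, not the patent's representation.

```python
# Sketch: splice a transform node wherever a consumer input's descriptor
# differs from the producer output's descriptor.

def build_edges(edge_specs):
    """edge_specs: list of (src, out_desc, dst, in_desc) logical edges.
    Returns physical edges, inserting a transform node on each mismatch."""
    edges = []
    for src, out_desc, dst, in_desc in edge_specs:
        if out_desc == in_desc:
            edges.append((src, dst))          # descriptors agree: direct edge
        else:
            t = "transform[{}->{}]".format(out_desc, in_desc)
            edges.append((src, t))            # producer -> transform node
            edges.append((t, dst))            # transform node -> consumer
    return edges

g = build_edges([("op1", "S0|B", "op2", "S0|B"),   # match: no insertion
                 ("op1", "S0|B", "op3", "B|B")])   # mismatch: transform inserted
```

The matched edge passes through unchanged, while the mismatched one gains an intermediate node that performs the re-distribution.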
According to the multi-level distributed data processing deployment system of the present disclosure, when the actual computing resources of a logical data processing device are smaller than the computing resources required by the logical tensor at its input and its resulting logical tensor, the device hierarchy setting component, based on the task configuration data input by the user, adds an upper-level logical data processing device as a time-dimension logical data processing device above the logical level to which that device belongs. In each candidate multi-dimensional SBP distributed signature, the SBP distribution descriptor of the dimension corresponding to the time-dimension logical data processing device is a time-dimension distribution descriptor. The time-dimension distribution descriptors belonging to the same candidate signature comprise a split logical tensor descriptor and a broadcast logical tensor distribution descriptor, where the number of times the logical tensor is split under the split descriptor is the same as the number of times the logical tensor is broadcast under the broadcast descriptor.
According to the multi-level distributed data processing deployment system of the present disclosure, when a time-dimension distribution descriptor is contained in the multi-dimensional SBP distributed signature, the computation graph generation component inserts a split computation node before the corresponding input of the computation node corresponding to the current logical node based on the split logical tensor descriptor, inserts a repeat-broadcast computation node before that input based on the broadcast logical tensor distribution descriptor, and inserts an aggregation computation node after the output of the computation node corresponding to the current logical node.
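The split/aggregate rewrite for a time-dimension descriptor can be pictured as sequential micro-steps on one device: the input is split into chunks, the operation runs once per chunk, and an aggregation node reassembles the output. This pure-Python sketch is my own illustration under that reading, not the patent's implementation.

```python
# Hypothetical sketch of the time-dimension rewrite: split -> per-step
# compute -> aggregate, all on the same (resource-limited) device.

def time_split(tensor, n):
    """Split node: cut the tensor into n sequential micro-step chunks."""
    k = len(tensor) // n
    return [tensor[i * k:(i + 1) * k] for i in range(n)]

def aggregate(chunks):
    """Aggregation node: reassemble per-step results in order."""
    out = []
    for c in chunks:
        out.extend(c)
    return out

x = [1, 2, 3, 4, 5, 6]
steps = time_split(x, 3)                        # 3 sequential micro-steps
results = [[v * 2 for v in s] for s in steps]   # the op runs once per step
y = aggregate(results)                          # output gathered afterwards
```

The number of splits matches the number of times any broadcast operand must be re-supplied, which is the equality constraint stated in the preceding paragraph.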
According to another aspect of the present disclosure, there is provided a multi-level distributed data processing deployment method, comprising: arranging, by a device hierarchy setting component, a plurality of logical data processing devices into at least two levels of parallel logical data processing devices and specifying the logical hierarchical relationship between them, thereby determining the number of dimensions of the SBP distributed signature, wherein each upper-level logical data processing device contains or is connected to the same number of lower-level data processing devices; acquiring, by a position mark acquisition component, the position marks of all logical data processing devices; generating, by an initial logical node topology generation component, an initial logical node topology graph for the multi-level distributed data processing system based on task configuration data input by the user, wherein each initial logical node is annotated with one or more candidate multi-dimensional SBP distributed signatures and position marks, the descriptor of each dimension of the multi-dimensional SBP distribution descriptor at each input and output of each candidate signature describes how a logical tensor is distributed over the logical data processing devices of the corresponding level, and each position mark indicates the logical data processing device on which the logical tensor is deployed; querying, by a transmission cost query component, for each current initial logical node, the transmission cost between the multi-dimensional SBP distribution descriptor at the output of the upstream logical node and the multi-dimensional SBP distribution descriptor at the corresponding input of the current initial logical node, based on the output descriptor of each upstream logical node whose signature has been determined and the corresponding input descriptor of each candidate signature of the current initial logical node; and obtaining, by a result logical node topology generation component based on the query results of the transmission cost query component, the total transmission cost required over all inputs of each candidate signature of the current initial logical node, so as to select the candidate multi-dimensional SBP distributed signature with the minimum total transmission cost as the determined signature of the current initial logical node, thereby obtaining the current result logical node with its multi-dimensional SBP distributed signature determined.
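The per-node selection of the minimum-total-cost candidate signature can be sketched as a lookup-and-minimize step. The cost table values and one-dimensional descriptors below are invented for illustration, standing in for the patent's transmission cost conversion table.

```python
# Hypothetical transfer-cost table:
# (producer output descriptor, consumer input descriptor) -> relative cost.
COST = {
    ("S0", "S0"): 0, ("S0", "B"): 4, ("S0", "P"): 2,
    ("B",  "S0"): 1, ("B",  "B"): 0, ("B",  "P"): 2,
}

def pick_signature(upstream_out_descs, candidate_sigs):
    """candidate_sigs: tuples of input descriptors, one per input.
    Returns the candidate whose summed cost against the fixed upstream
    output descriptors is minimal (the 'determined' signature)."""
    def total(sig):
        return sum(COST[(o, i)] for o, i in zip(upstream_out_descs, sig))
    return min(candidate_sigs, key=total)

# Upstream outputs are fixed as ("S0", "B"); choose among three candidates.
best = pick_signature(("S0", "B"), [("B", "B"), ("S0", "S0"), ("S0", "B")])
```

Here the candidate matching the upstream descriptors exactly incurs zero cost and is selected, which mirrors the greedy per-node rule described above.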
The multi-level distributed data processing deployment method according to the present disclosure further comprises: generating, by a computation graph generation component, a task logic computation graph based on the result logical node topology graph composed of the result logical nodes whose multi-dimensional SBP distributed signatures have been determined, and, when the multi-dimensional SBP distribution descriptor at an input of the current result logical node differs from that at the output of the corresponding upstream logical node, inserting a transform computation node between the input of each computation node corresponding to the current result logical node and the output of each computation node corresponding to the upstream logical node, so as to transform the logical tensor output by the upstream computation node and described by its output descriptor into the logical tensor to be consumed at the corresponding input of the current computation node and described by its input descriptor.
According to the multi-level distributed data processing deployment method of the present disclosure, when the actual computing resources of a logical data processing device are smaller than the computing resources required by the logical tensor at its input and its resulting logical tensor, the device hierarchy setting component, based on the task configuration data input by the user, adds an upper-level logical data processing device as a time-dimension logical data processing device above the logical level to which that device belongs. In each candidate multi-dimensional SBP distributed signature, the SBP distribution descriptor of the dimension corresponding to the time-dimension logical data processing device is a time-dimension distribution descriptor. The time-dimension distribution descriptors belonging to the same candidate signature comprise a split logical tensor descriptor and a broadcast logical tensor distribution descriptor, where the number of times the logical tensor is split under the split descriptor is the same as the number of times the logical tensor is broadcast under the broadcast descriptor.
According to the multi-level distributed data processing deployment method of the present disclosure, when a time-dimension distribution descriptor is contained in the multi-dimensional SBP distributed signature, the computation graph generation component inserts a split computation node before the corresponding input of the computation node corresponding to the current logical node based on the split logical tensor descriptor, inserts a repeat-broadcast computation node before that input based on the broadcast logical tensor distribution descriptor, and inserts an aggregation computation node after the output of the computation node corresponding to the current logical node.
With the multi-level distributed data processing deployment system and method described above, the amount of data exchanged between different data processing devices during data processing is minimized from a global perspective, reducing the overhead generated during data interaction and effectively reducing the adverse effect of data exchange on actual computation. Moreover, under the demands of large-scale models and large-scale data processing, the per-card computing resources required of each data processing device can be reduced, lowering the hardware cost; and parallel deployment can be carried out automatically, in particular achieving automatically the same data processing effect as a hybrid parallel deployment that would otherwise require manual intervention.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a schematic diagram illustrating a multi-level distributed data processing system employing a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure.
FIG. 2 is a schematic diagram illustrating a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure.
FIG. 3 is a partial schematic diagram illustrating an initial logical node topology of a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure.
Fig. 4 is a first schematic diagram illustrating how the transmission cost calculation component of the multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure calculates the amount of data transmission generated between logical tensors with different two-dimensional SBP distribution descriptors.
Fig. 5 is a second schematic diagram illustrating how that transmission cost calculation component calculates the amount of data transmission generated between logical tensors with different two-dimensional SBP distribution descriptors.
FIG. 6 is a schematic diagram illustrating another embodiment of a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure.
FIG. 7 illustrates an example of the transformation of a logical node topology graph into a computational graph using the multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure illustrated in FIG. 6.
Detailed Description
The present invention will be described in further detail below with reference to examples and the accompanying drawings, so that those skilled in the art can practice it.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one item of the same type from another. For example, without departing from the scope of the present disclosure, one of two possible logical distributed signatures may be referred to as the first logical distributed signature or as the second logical distributed signature, and similarly the other may be referred to as the second or as the first. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
For a better understanding of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram illustrating a multi-level distributed data processing system employing the multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure. A distributed data processing system includes multiple levels of data processing devices. As shown in Fig. 1, the multi-level distributed data processing system includes at least two levels of parallel data processing devices, with a first level of devices D1, D2, … Dn. Each first-level device comprises a plurality of parallel second-level devices; for example, first-level device D1 has connected to it the parallel second-level devices D11, D12, … D1m, and first-level device Dn comprises the parallel second-level devices Dn1, Dn2, … Dnm. Optionally, each second-level device comprises a plurality of parallel third-level devices; for example, second-level device Dn2 comprises the parallel third-level devices Dn21, Dn22, and so on. Although three levels are shown in Fig. 1, in practice two levels are possible, as are more. Mapped onto an actual data processing system, the first-level device may be, for example, a server, the second-level device a host on that server, and the third-level device a GPU or other coprocessor (such as a TPU) on that host. Alternatively, from an Internet perspective, the first-level device may be a machine room, the second-level devices the servers in that room, the third-level devices the hosts on those servers, and the fourth-level devices the GPUs or other coprocessors (such as TPUs) on those hosts.
Although Fig. 1 shows the first-level data processing devices as physical devices, logically a first-level device may be purely a logical device, representing nothing more than the collective designation of its next-level devices. In other words, the first-level device D1 need not physically exist: the set of second-level devices D11, D12, … D1m is collectively referred to as first-level device D1, and likewise Dn is the collective name for the parallel second-level devices Dn1, Dn2, … Dnm. Similarly, each second-level device may be the logical collective name of the third-level devices under it. By analogy, all actual data processing devices distributed in parallel can be logically divided into multiple levels according to actual needs, with each level containing a predetermined number of physical devices, so that all physical devices form a logical topological relationship. Although Fig. 1 shows the same number of second-level logical devices under every first-level logical device, these numbers may differ; when they differ, the actual data can be partitioned according to the device counts and predetermined weights. Typically, however, the number of logical devices under each device of the same level is the same.
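The logical grouping just described, in which a first-level device is merely the collective name for a fixed number of second-level devices, can be sketched as follows (the device names and group size are hypothetical):

```python
# Sketch: build first-level logical devices as named groups of
# second-level physical devices.

def build_hierarchy(devices, group_size):
    """Group a flat list of physical devices into first-level logical
    devices containing `group_size` second-level devices each."""
    return {f"D{g + 1}": devices[g * group_size:(g + 1) * group_size]
            for g in range(len(devices) // group_size)}

hier = build_hierarchy(["gpu0", "gpu1", "gpu2", "gpu3"], group_size=2)
```

Here "D1" and "D2" exist only as names for their member GPUs, exactly the sense in which the first-level devices in Fig. 1 need not physically exist.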
FIG. 2 is a schematic diagram illustrating a multi-level distributed data processing deployment system based on multi-dimensional SBP distributed signatures according to the present disclosure. As shown in FIG. 2, the multi-level distributed data processing deployment system 100 includes a device hierarchy setting component 110, a location marker obtaining component 120, an initial logical node generating component 130, and a transmission cost calculating component 150. Specifically, the device hierarchy setting component 110 obtains all the data processing devices in the distributed data processing system, groups them into a particular hierarchy, and specifies the logical hierarchical relationships among them. For example, if the system contains compute cards GPU0-7, GPU0-3 can form one group and GPU4-7 another, so that GPU0-3 as a whole is designated a first-level data processing device while GPU0, GPU1, GPU2, and GPU3 are second-level data processing devices; similarly, GPU4-7 as a whole is designated a first-level data processing device while GPU4, GPU5, GPU6, and GPU7 are second-level data processing devices. In this way the logical hierarchical relationships between levels of data processing devices and between peer devices are specified, forming a logical topology graph of the entire distributed data processing system; this graph may or may not coincide with the actual physical topology, although it typically does. The location marker obtaining component 120 obtains location markers for all data processing devices, which may include their physical locations, logical hierarchical relationships, and the like. For this reason, each specific data processing apparatus is referred to herein as a logical data processing apparatus. As shown in FIG. 1, the first-level logical data processing device D1 may be an actual physical device, that is, a plurality of actual data processing devices D11, D12, … D1m plugged simultaneously into an actual first data processing device D1 (e.g., a server), or it may be merely a collective term for the set of actual data processing devices D11, D12, … D1m, in which case no actual first data processing device D1 exists.
The initial logical node generation component 130 receives task configuration data entered by the user and generates an initial logical node topology graph 140 for the multi-level distributed data processing system. After a job is input, the system automatically decomposes it, based on the user's job description, into many small operation tasks composed of various operation components; these operation components serve as logical nodes connected one after another to form a preliminary logic tensor processing neural network topology graph. Each such neural network includes a plurality of logical nodes, and adjacent neural networks are connected to each other to guide the placement (PLACEMENT) of the executors that perform the actual job processing in the distributed data processing system. FIG. 2 shows only a simple initial logical node topology graph 140 schematically, in which nodes A, B, C, D, E, F, L, and K appear and the others are replaced by ellipses. In actual data processing, the initial logical node topology graph 140 would be far more complex. It contains the basic compute nodes that implement the computing tasks described by the user. The manner of generating the initial logical node topology graph 140 is conventional in the art and is therefore not described here.
Each initial logical node of the initial logical node topology graph 140 carries a plurality of candidate multi-dimensional SBP distributed signatures. Initial logical nodes whose multi-dimensional SBP distributed signature has already been configured by the user, or that serve as source logical nodes, or whose signature follows from the user's task description, such as initial logical nodes A, C, and E, have a unique multi-dimensional SBP distributed signature: SBP-1 for initial logical node A, SBP-2 for initial logical node C, and SBP-3 for initial logical node E. The other initial logical nodes each carry certain inherent candidate multi-dimensional SBP distributed signatures; for example, initial logical node B in FIG. 2 has several candidates, e.g., three: SBP-1, SBP-2, and SBP-3. The remaining initial logical nodes likewise each have their own candidate multi-dimensional SBP distributed signatures, not listed here. Different initial logical nodes have different fixed candidate multi-dimensional SBP distributed signatures according to the operations they perform.
A multi-dimensional SBP distributed signature according to the present disclosure is a signature applied in a multi-level distributed data processing system. In such a system, data parallelism, model parallelism, mixed parallelism, pipeline parallelism, and the like often coexist, so the tasks of adjacent logical nodes are frequently deployed simultaneously to different next-level data processing devices belonging to different first-level data processing devices. As a result, intermediate parameters are exchanged between data processing devices of the same level during actual data processing, which incurs a large amount of data transmission overhead between those devices. To reduce this overhead, more logical nodes need to be generated on the basis of the initial logical node topology graph 140 so as to complete the logical node topology graph; in particular, to reduce transmission overhead between upstream and downstream logical nodes, the change caused by the data distribution modes of upstream and downstream logical nodes must be minimized. To obtain a good transmission cost, the present disclosure therefore needs to assign an appropriate logical distributed signature to each logical node. The logical distributed signature signs a logical node using the distribution descriptors of its logic tensors; the distribution descriptor of each logic tensor describes how that tensor is distributed across the entire multi-level distributed computing system, and mainly comprises a SPLIT logic tensor descriptor, a BROADCAST logic tensor descriptor, and a PARTIAL VALUE logic tensor descriptor.
Accordingly, the multi-dimensional logical distributed signature applied in the present disclosure is referred to as a multi-dimensional SBP distributed signature, and the multi-dimensional logical distribution descriptor of each logic tensor is referred to as a multi-dimensional SBP distribution descriptor. Specifically, in a multi-level distributed data processing system, a multi-dimensional distribution descriptor is required to describe how a logic tensor is distributed to each level of data processing devices, and the number of dimensions of the multi-dimensional distribution descriptor equals or exceeds the number of levels of the multi-level distributed data processing system. When a logical node does not need to be deployed down to the last level of data processing devices, its multi-dimensional distribution descriptor actually contains only as many dimensions as the level of the data processing devices on which the node is deployed, and the sub-descriptors of the remaining dimensions are left blank by default.
Specifically, for an SBP distributed signature, in each dimension of the distribution descriptor, the SPLIT logic tensor descriptor describes a way of slicing a logic tensor: for example, a logic tensor is cut along a specified dimension according to the user's description and distributed to different data processing devices for the specified computation. If a logic tensor is two-dimensional and is cut along its 0th dimension, the distribution descriptor of the resulting data logic tensor of a batch of data is S(0); each logical node obtains at its input a data logic tensor that is a slice of the original logic tensor along the 0th dimension, and this slicing deployment is described as S(0). Similarly, if the two-dimensional logic tensor is cut along its 1st dimension, the distribution descriptor is S(1); each logical node obtains at its input a data logic tensor that is a slice of the original tensor along the 1st dimension, described as S(1). If the task data to be processed has more dimensions, there are correspondingly more distribution descriptors, such as S(2), S(3), and so on. The logic tensors mentioned here may be data tensors or model tensors: slicing the data tensor yields data parallelism on the distributed data processing system, while slicing the model tensor yields model parallelism. The parallelism referred to here may be spatial or temporal.
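The SPLIT descriptors S(0) and S(1) described above can be modeled with a minimal numpy sketch (illustrative only; the function name and shapes are assumptions, not the patent's implementation):

```python
import numpy as np

def split(tensor, axis, num_devices):
    """Model the SPLIT descriptor S(axis): cut a logical tensor along
    `axis` and hand one shard to each of the parallel devices."""
    return np.array_split(tensor, num_devices, axis=axis)

# A two-dimensional logical tensor of shape (8, 4).
logical = np.arange(32).reshape(8, 4)

# S(0): each of 4 devices receives a (2, 4) slice along dimension 0.
shards_s0 = split(logical, axis=0, num_devices=4)

# S(1): each of 4 devices receives an (8, 1) slice along dimension 1.
shards_s1 = split(logical, axis=1, num_devices=4)
```

Concatenating the S(0) shards along dimension 0 recovers the original logic tensor, which is the sense in which the descriptor only records a deployment mode, not a change of content.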
If the input of a logical node carries a SPLIT logic tensor descriptor, then in actual data processing, if one logic tensor has data size T and is to be distributed to four first-level data processing devices (e.g., four compute cards) for data-parallel computation, the amount of data on each card is one quarter of T, and the total across the four cards is T. If a logic tensor is first split along its 0th dimension across a plurality of parallel first-level data processing devices, and each resulting slice is then further split along its 1st dimension across the plurality of (for example, four) second-level data processing devices contained in parallel within each first-level device, the distribution descriptor of the logic tensor is the two-dimensional SBP distribution descriptor (S(0), S(1)), and the data tensor actually residing on each second-level device is 1/16 the size of the original logic tensor. If the original logic tensor is split along the 0th dimension across the parallel first-level devices, and the slice on each first-level device is then split again along the 0th dimension across the plurality of parallel second-level devices connected to it, the distribution descriptor of the logic tensor is the two-dimensional descriptor (S(0), S(0)). And so on; the distribution descriptor may also be three-dimensional or more.
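The hierarchical split just described, producing the two-dimensional descriptor (S(0), S(1)), can be sketched as follows (a toy simulation with assumed device counts, not the deployment system itself):

```python
import numpy as np

def split_2d(tensor, n_level1=4, n_level2=4):
    """Model the two-dimensional descriptor (S(0), S(1)): split along
    dimension 0 across the first-level devices, then split each slice
    along dimension 1 across the second-level devices under each one."""
    level1 = np.array_split(tensor, n_level1, axis=0)              # S(0)
    return [np.array_split(s, n_level2, axis=1) for s in level1]   # then S(1)

logical = np.ones((16, 16))
shards = split_2d(logical)

# Each second-level device holds a (4, 4) shard: 1/16 of the original.
assert shards[0][0].size == logical.size // 16
```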
The BROADCAST logic tensor descriptor describes the way a logic tensor is published in broadcast fashion in a distributed system. In general, for a data processing system that performs only data parallelism, model data is broadcast to the respective data processing devices, so broadcast logic tensor descriptors describe broadcast data input to logical nodes. In actual data processing, the broadcast logic tensor has the same size on every compute card. If a logic tensor is first broadcast to each first-level data processing device, and the broadcast copy on each first-level device is then split along the 0th dimension and distributed in parallel to the plurality of second-level devices connected to that first-level device, the distribution descriptor of the logic tensor is the two-dimensional descriptor (B, S(0)), and the actual logic tensor on each second-level device is 1/4 the size of the original (if each first-level device contains or connects to four second-level devices). Conversely, if a logic tensor is first split along the 0th dimension across the plurality of parallel first-level devices, and the slice on each first-level device is then broadcast to the plurality of second-level devices contained in that first-level device, the distribution descriptor of the logic tensor is the two-dimensional descriptor (S(0), B). And so on.
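The (B, S(0)) case above, broadcast first and split second, can be sketched in the same toy style (device counts assumed):

```python
import numpy as np

def broadcast_then_split(tensor, n_level1=4, n_level2=4):
    """(B, S(0)): broadcast the full tensor to every first-level device,
    then split each copy along dimension 0 across its second-level devices."""
    copies = [tensor for _ in range(n_level1)]                      # B
    return [np.array_split(c, n_level2, axis=0) for c in copies]    # then S(0)

logical = np.ones((8, 8))
shards = broadcast_then_split(logical)

# Every second-level shard is 1/4 of the original, since the tensor
# was split only once (the broadcast step does not reduce its size).
assert shards[0][0].size == logical.size // 4
```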
The PARTIAL VALUE logic tensor descriptor P indicates that the input or output logic tensor of a logical node is the partial value of a plurality of logic tensors of the same kind. These partial values include partial sums, partial products, partial AND results, partial maxima, and partial minima. Since data is usually processed in parallel, the processing on different devices covers only part of the data. For example, if some of the logic tensors are S(0) or S(1), the result tensor obtained on a given data processing device is computed from only part of the data, so each such result is a partial logic tensor P; the final output is obtained by combining the same kind of data from all devices.
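A classic way the P descriptor arises, sketched below under assumed shapes: when a matrix product is computed with the left operand split S(1) and the right operand split S(0), each device produces a full-shaped but partial-sum result, and summing the partial tensors over all devices recovers the true product.

```python
import numpy as np

X = np.arange(12.0).reshape(3, 4)
W = np.arange(8.0).reshape(4, 2)

X_shards = np.array_split(X, 2, axis=1)   # S(1): columns split
W_shards = np.array_split(W, 2, axis=0)   # S(0): rows split

# Each device multiplies its shard pair; every product has the final
# output shape (3, 2) but holds only a partial sum, i.e. descriptor P.
partials = [x @ w for x, w in zip(X_shards, W_shards)]

# Combining the same-kind data from all devices yields the final output.
assert np.allclose(sum(partials), X @ W)
```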
The distribution descriptors of the various logic tensors represent how those tensors are distributed in the distributed computing system, and the respective descriptors of the tensors serving as inputs and outputs of a logical node likewise describe how that node distributes its operation data. For convenience, the present disclosure refers to such a multi-dimensional distribution descriptor simply as a "multi-dimensional SBP distribution descriptor" or "multi-dimensional distribution descriptor".
Thus, once the initial logical node topology graph 140 is generated, the initial logical nodes used for distributed data processing deployment, that is, the operation logical nodes, have data distribution descriptors for their respective inputs and outputs, and these descriptors form the signature of the logical node: a signature of the operation logical node expressed in terms of the distribution descriptors of its logic tensors. For convenience of expression, using the English initials of the three distribution descriptors, this signature is referred to as a "multi-dimensional SBP distributed signature" for short.
Depending on the user's description of the computing task and the data parallelism requirements of each distributed computing system, these descriptors include at least the three basic forms S, B, and P. Each additional way of partitioning the data or model adds a descriptor. If a logic tensor is split successively or simultaneously along two different dimensions, its distribution descriptor is a two-dimensional distribution descriptor as described above; if it is distributed in two distribution modes, its descriptor can likewise be two-dimensional; and if it is first split along one dimension and then split again along the same dimension, the descriptor is also two-dimensional. By analogy, distribution descriptors can be three-dimensional or more. For each logical node, its signature contains various combinations of these descriptors. Thus, in a distribution system according to the present disclosure there are at least three basic distribution descriptors, and typically four; for a one-dimensional SBP descriptor these are, for example, S(0), S(1), P, and B. Depending on the number of logic tensor dimensions, there may be more. With four SBP descriptors, many multi-dimensional SBP distributed signatures can be formed by permutation and combination of inputs and outputs. Some examples of one-dimensional SBP distributed signatures are: (S(0), B) → S(0), (S(1), B) → S(1), P → P, B → B, (S(0), S(1)) → P, S(0) → P, S(0) → S(0), S(0) → S(1), and P → B.
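The one-dimensional signature (S(0), B) → S(0) listed above can be checked with a small data-parallel matmul sketch (shapes and device count are illustrative assumptions): the data tensor is split S(0), the model tensor is broadcast B, and each device's local product is exactly the S(0) shard of the result.

```python
import numpy as np

X = np.random.rand(8, 4)   # data tensor, to be split S(0)
W = np.random.rand(4, 2)   # model tensor, broadcast B

# Each of 4 devices receives one row-slice of X and a full copy of W.
X_shards = np.array_split(X, 4, axis=0)
outputs = [x @ W for x in X_shards]        # local computation per device

# The outputs, stitched along dimension 0, equal the full product,
# so the output descriptor is S(0) with no cross-device exchange.
assert np.allclose(np.concatenate(outputs, axis=0), X @ W)
```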
A two-dimensional SBP distributed signature is composed of two-dimensional distribution descriptors, each a combination of one-dimensional descriptors, such as (S(0), S(0)), (S(1), S(1)), (S(0), B), (S(1), B), (B, B), (P, S(0)), (B, P), (P, P), and so on. Two-dimensional SBP distributed signatures include, for example: [(S(0), S(0)) (B, B) → (S(0), S(0))], [(S(1), S(1)) (B, B) → (S(1), S(1))], [(S(0), B) (S(1), S(1)) → (P, S(1))], [(S(0), B) (B, S(1)) → (S(0), S(1))], and the like. In each signature form, the left side of the arrow gives the distribution descriptors of the input logic tensors and the right side gives the distribution descriptor of the output logic tensor. For convenience of description, hereinafter a logic tensor whose distribution descriptor is S(0) is simply called an "S(0) logic tensor", one whose descriptor is B a "B logic tensor", one whose descriptor is P a "P logic tensor"; similarly, one whose descriptor is (S(0), B) is called an "(S(0), B) logic tensor", one whose descriptor is (B, S(1)) a "(B, S(1)) logic tensor", one whose descriptor is (P, S(1)) a "(P, S(1)) logic tensor", one whose descriptor is (S(0), B, S(1)) an "(S(0), B, S(1)) logic tensor", one whose descriptor is (P, S(1), S(0)) a "(P, S(1), S(0)) logic tensor", and so on.
The multi-dimensional SBP distributed signature may have still more dimensions, such as three or four or more, as the case requires. All multi-dimensional SBP distributed signatures result from combinations of the various SBP descriptors. Within any multi-dimensional SBP distributed signature, the SBP descriptors of the input logic tensors and of the output logic tensor have corresponding numbers of dimensions, and at each dimension position the input and output descriptors obey the conversion logic between the input and output logic tensors of a one-dimensional SBP distributed signature.
For example, the two-dimensional SBP distributed signature [(S(1), S(1)) (B, B) → (S(1), S(1))] indicates that on the first-level logical data processing devices the SBP distributed signature of the first-dimension logic tensors is (S(1), B) → S(1), and on the plurality of second-level logical data processing devices contained in each first-level device the SBP distributed signature of the corresponding second-dimension logic tensors is likewise (S(1), B) → S(1). As a result, the signature carried by each logical node deployed on the second-level logical data processing devices is precisely the two-dimensional SBP distributed signature [(S(1), S(1)) (B, B) → (S(1), S(1))]. In other words, the input logic tensor at the first input of the logical node is obtained by first splitting it once along the 1st dimension, i.e., distributing it across, for example, four first-level data processing devices, and then splitting each slice again along the 1st dimension across, for example, the four second-level devices under each first-level device. The logic tensor finally deployed on each second-level device is therefore a 1/16 slice of the original along the 1st dimension. The other logic tensor, obtained at the second input of the logical node, reaches every device as a full copy of the original via two broadcasts. The result logic tensor produced by the operation is likewise divided into 16 parts along the 1st dimension.
Similarly, for the SBP distributed signature [(S(0), B) (B, S(1)) → (S(0), S(1))], the first-level deployment is S(0), B → S(0) and the second-level deployment is B, S(1) → S(1). Put plainly, the logical node processes data as follows: the first logic tensor is split along the 0th dimension into S(0) slices deployed across the plurality of first-level logical data processing devices, while the second logic tensor is broadcast as a B logic tensor to those same first-level devices; then the S(0) slice on each first-level device is broadcast to the plurality of second-level logical data processing devices connected to or contained in that first-level device, while the broadcast copy of the second tensor on each first-level device is split along the 1st dimension and distributed across those same second-level devices. The logical result tensor obtained by the logical node on each second-level device is therefore a slice of the final result tensor split along both the 0th and the 1st dimensions, i.e., an (S(0), S(1)) logic tensor, each slice being 1/16 of the final result (if the number of splits at each level is 4).
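The two-level deployment just described can be simulated end to end (a sketch with two first-level groups of two second-level devices each, so the shard counts differ from the 4x4 example; all names are illustrative): X is split S(0) across groups and broadcast within each group, W is broadcast across groups and split S(1) within each group, and each device's local product is an (S(0), S(1)) shard of X @ W.

```python
import numpy as np

X = np.random.rand(4, 3)   # first input: (S(0), B)
W = np.random.rand(3, 4)   # second input: (B, S(1))

result_rows = []
for x_slice in np.array_split(X, 2, axis=0):           # level 1: S(0) for X, B for W
    row = [x_slice @ w_slice                           # level 2: B for X, S(1) for W
           for w_slice in np.array_split(W, 2, axis=1)]
    result_rows.append(np.concatenate(row, axis=1))
reassembled = np.concatenate(result_rows, axis=0)

# Stitching the (S(0), S(1)) shards back together recovers X @ W.
assert np.allclose(reassembled, X @ W)
```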
Each logic tensor can have several SBP descriptors, and the operation represented by each logical node can therefore admit several SBP signatures. For example, SBP-1 shown in FIG. 2 may be a two-dimensional SBP distributed signature such as [(B, S(1)) (S(1), S(0)) → (S(0), P)]. In practical applications, different signature forms may be given different numbers; the numbers used here are for descriptive convenience only. Each signature does not need to be given a number, and signatures of different forms can be distinguished from one another without any numbering.
Each initial logical node may thus be given candidate multi-dimensional SBP distributed signatures based on the user's task description as above. Common task logical nodes are arithmetic operation nodes that perform a particular operation and therefore have particular candidate multi-dimensional SBP distributed signatures. Note that the candidate signatures of different task logical nodes are not the same: the input logic tensors of the signature of a task logical node performing multiplication cannot include partial-value tensors, so their SBP descriptors do not include the distribution descriptor P, whereas the candidate signatures of a task logical node performing addition may include any combination of the various SBP descriptors with one another or with themselves. Taking the two-dimensional SBP distributed signature [(S(0), B) (B, S(1)) → (S(0), S(1))] as an example, for a logical node carrying this signature, the descriptors of its two input logic tensors, namely (S(0), B) and (B, S(1)), and the descriptor of its output logic tensor, namely (S(0), S(1)), are all two-dimensional.
As described above, the descriptor (S(0), B) of the first input logic tensor means that the first tensor to be processed (of size T, say) is first split along its 0th dimension (the dimension here refers to the tensor's own dimensions), i.e., the first-dimension descriptor S(0), into a plurality of first slices (each of size T/4 if the number of splits is 4), and each slice is then broadcast spatially or output continuously in time (broadcast in the time dimension), i.e., the second-dimension descriptor B. The descriptor (B, S(1)) of the second input logic tensor means that the second tensor to be processed is first broadcast spatially, and each broadcast copy is then split along its 1st dimension, i.e., the second-dimension descriptor S(1), into a plurality of second slices. Finally, the distribution descriptor of the result logic tensor formed from the first and second tensors processed by the logical node is (S(0), S(1)). Each initial logical node is thus accompanied by a candidate set of logical distributed signatures based on the task configuration data, and each logical distributed signature in the set specifies the distribution descriptor of every input logic tensor and every output logic tensor of the initial logical node to which it belongs.
Although the initial logical node generation component 130 generates the initial logical node topology graph 140, it remains to be determined which multi-dimensional SBP distributed signature each node of the graph will actually use, that is, which distributed logic tensor each node will output and which distributed logic tensor it will take as input.
Thus, the transmission cost calculation component 150 of the deployment system 100 according to the present disclosure starts from the source logical nodes in the initial logical node topology graph 140. When the logical labels, or SBP labels, of all upstream logical nodes (e.g., logical nodes A and E) of the current logical node (e.g., logical node B) have been determined, then for each candidate multi-dimensional logical distributed signature of logical node B, the transmission cost calculation component 150 takes the distribution descriptors of the outputs of all upstream nodes corresponding to the inputs of logical node B and calculates the cost of the data transmission required to transform the logic tensor described by the multi-dimensional SBP distribution descriptor at each upstream node's output into the logic tensor described by the multi-dimensional SBP distribution descriptor at the corresponding input of logical node B. As shown in FIG. 2, logical node B has many candidate multi-dimensional SBP distributed signatures, such as SBP-1, SBP-2, and SBP-3. SBP-1, for example, may take signature forms such as [(B, S(1)) (S(1), S(0)) → (S(1), P)], [(S(0), B) (B, S(1)) → (S(0), S(1))], [(S(0), S(0)) (B, B) → (S(0), S(0))], or [(S(0), B) (S(1), S(1)) → (P, S(1))]. The signature SBP-5 of initial logical node A may take the form [(S(0), S(0)) (B, B) → (S(0), S(0))], and the signature SBP-3 of initial logical node E may take the form [(B, B) → (B, B)]. The SBP distributed signatures of logical nodes A and E, the upstream logical nodes of logical node B, are likewise two-dimensional SBP distributed signatures.
As shown in FIG. 2, if the label SBP-3 of logical node E in the initial logical node topology graph 140 has the form "[S(1), S(0)] → [S(1), S(0)]", its input logic tensor distribution descriptor is [S(1), S(0)], so its input is an [S(1), S(0)] logic tensor; likewise its output descriptor is [S(1), S(0)], so its output is an [S(1), S(0)] logic tensor. More specifically, whether for its input or its output descriptor, the first-dimension logic tensor distribution descriptor is S(1) and the second-dimension descriptor is S(0). Hence, for both input and output, the logic tensor described by the descriptor [S(1), S(0)] is 1/16 the size of the initial logic tensor if the number of splits in each dimension is 4.
If the candidate multi-dimensional SBP distributed signature SBP-2 of logical node B, i.e., [(S(1), S(1)) (S(0), B) → (P, S(1))], is selected as the determined signature, then the distribution descriptor of the input logic tensor at the first input, the one corresponding to the output of node E, must be (S(1), S(1)); that is, the first input must obtain an (S(1), S(1)) logic tensor whose first-dimension descriptor is S(1) and whose second-dimension descriptor is also S(1). Likewise, the descriptor of the input logic tensor at the second input, the one corresponding to the output of node A, must be (S(0), B); that is, the second input must obtain an (S(0), B) logic tensor whose first-dimension descriptor is S(0) and whose second-dimension descriptor is B. In a distributed computing system, since the operation tasks, especially the computation tasks, of the logical nodes are split and distributed to the various data processing devices (e.g., the CPU, GPU, or TPU of a compute card), the intermediate parameters must be continually synchronized to obtain the final correct result, which involves exchanging intermediate parameters between different devices. When the SBP descriptor of the output logic tensor in the upstream node's multi-dimensional SBP distributed signature is inconsistent with the SBP descriptor of the corresponding input logic tensor in the current node's signature, an output conversion is usually performed during actual operation. This conversion typically requires fetching part of the data residing on other data processing devices so that, together with the locally available data, it forms the data required at the current node's input, thereby conforming to the distribution descriptor of the data logic tensor at that input.
This process of obtaining part of the data from other devices incurs a relatively large data transmission overhead, or transmission cost. Selecting different signatures for the current logical node therefore results in different data transmission overheads or costs.
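To make the idea of per-edge transmission cost concrete, here is a hypothetical cost model; the function name, the descriptor-pair rules, and the byte estimates are illustrative assumptions modeled on common collective-communication patterns, not the patent's actual formulas. It estimates the bytes moved when a producer's output descriptor must be converted to a consumer's input descriptor for a tensor of `size` bytes spread over `n_devices` devices.

```python
def transfer_cost(produced, required, size, n_devices):
    """Illustrative per-edge cost estimate in bytes (assumed model)."""
    if produced == required:
        return 0                                       # layouts already match
    if produced.startswith("S") and required == "B":
        return size * (n_devices - 1)                  # each device gathers the rest
    if produced == "P" and required.startswith("S"):
        return size * (n_devices - 1) / n_devices      # reduce-scatter-like exchange
    if produced == "P" and required == "B":
        return 2 * size * (n_devices - 1) / n_devices  # all-reduce-like exchange
    return size                                        # generic reshuffle fallback

assert transfer_cost("S(0)", "S(0)", 1024, 4) == 0     # matching descriptors are free
assert transfer_cost("S(0)", "B", 1024, 4) == 3072     # gather costs (n-1) * size
```

A cost table of this shape is enough to compare candidate signatures edge by edge, which is all the decision step below requires.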
Obviously, at this point, the output logic tensor (P, S(1)) of logical node A does not match the distribution descriptor (S(0), B) of the input logic tensor at the second input end of logical node B. Therefore, for logical node B to perform a correct operation, the first-dimension distribution descriptor P of the tensor output by node A must be converted to S(0), and its second-dimension distribution descriptor S(1) must be converted to B, thereby obtaining the (S(0), B) logic tensor distribution required at the second input end of logical node B. Similarly, the distribution descriptor (S(0), S(0)) of the logic tensor output by logical node E does not coincide with the first-dimension distribution descriptor S(1) and the second-dimension distribution descriptor S(1) required for the input tensor at the first input end of node B. Therefore, for logical node B to perform a correct operation, the first-dimension distribution descriptor S(0) output by node E must be converted to S(1), and the second-dimension distribution descriptor S(0) must be converted to S(1), thereby obtaining the (S(1), S(1)) logic tensor distribution required at the first input end of logical node B. Although first-level and second-level conversions are described here, this does not mean the transformation is performed twice; it is performed directly in one pass. That is, the output logic tensor (P, S(1)) of logical node A is directly complemented by the missing data parts acquired through data transmission, so as to obtain the complete (S(0), B) logic tensor. The transformation for the first input end is similar. In other words, one or more logical nodes are inserted so that, through data transmission, the output logic tensor described by the distribution descriptor (P, S(1)) of logical node A is converted into a logic tensor described by (S(0), B), and the output logic tensor described by (S(0), S(0)) of logical node E is converted into a logic tensor described by (S(1), S(1)), thereby meeting the requirements of correct operation.
To do so, the transmission cost calculation component 150 calculates, for each current logical node whose multidimensional SBP distributed signature has not yet been determined, the data transmission cost that each candidate signature would incur. For example, for logical node B, the data transmission cost that each of its three candidate multidimensional SBP distributed signatures would generate is calculated. Selecting any one of the candidate multidimensional SBP distributed signatures allows logical node B to accomplish its operation task, but the data transmission cost incurred during operation differs depending on which signature is adopted. Therefore, to minimize the data transmission cost of the data processing process, the signature with the smallest data transmission amount must be selected from the candidate signatures of each logical node as the signature used in actual operation.
For example, consider logical nodes A and B in the initial logical node topology 140, which are in an upstream-downstream relationship. Logical node A may be a source node, and its multidimensional SBP distributed signature may be generated by user configuration, generated naturally from the user's description of the task, or already determined through decision selection according to the scheme of the present disclosure; for example, the descriptor of the output logic tensor of the multidimensional SBP distributed signature of logical node A is (P, S(1)). Logical node B in the initial logical node topology 140 has many candidate multidimensional SBP distributed signatures, which may include [(B, S(1)), (S(1), S(0))] → (S(1), P), [(S(0), B), (B, S(1))] → (S(0), S(1)), [(S(0), S(0)), (B, B)] → (S(0), S(0)), [(S(0), B), (S(1), S(1))] → (P, S(1)), and so on. From logical node A to logical node B, since the distribution descriptor of the output logic tensor of logical node A is (P, S(1)), the corresponding input logic tensor distribution descriptor that node B may select may be (S(0), B), (B, S(1)), (S(0), S(0)), (B, B), (S(1), S(1)), (P, S(1)), and the like.
Therefore, after the signatures of some preceding logical nodes have been determined, the multidimensional SBP distributed signature of a downstream logical node is finally selected and determined based on the cost of data transmission between the same-dimension logic distribution descriptor (SBP descriptor) of the output logic tensor of the upstream logical node and the same-dimension logic distribution descriptor (SBP descriptor) of the corresponding input logic tensor in each candidate logic distributed signature of that downstream logical node. In this way, once a candidate multidimensional SBP distributed signature of a logical node is selected for calculation, the SBP descriptors of each dimension of the logic tensor at every input end and output end of that node are also determined, so that the total data transmission cost of the current logical node in each dimension can be calculated, and the candidate logic distributed signature with the minimum total cost is taken as the logic distributed signature of the current logical node. It should be noted that if, among the candidate signatures of the current logical node, the logic distribution descriptor of a certain dimension at an input end of some signature is consistent with the logic distribution descriptor of the same dimension of the output logic tensor of the upstream logical node, that candidate signature may be preferentially selected, unless the logic distribution descriptors of the same dimension at its other input ends cause the final total cost to be larger.
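The selection rule just described can be sketched as a small greedy routine (a hedged illustration with stand-in names and a stand-in cost function, not the patent's code): for each candidate signature, sum the per-input conversion costs against the already-determined upstream output descriptors, and keep the cheapest candidate.

```python
# Stand-in cost function: zero when descriptors already agree, otherwise
# a unit cost proportional to the tensor size. A real system would use
# the cost formulas / lookup tables discussed in this document.
def transfer_cost(src_desc, dst_desc, size):
    return 0.0 if src_desc == dst_desc else size

def choose_signature(candidates, upstream_outputs, tensor_size):
    """candidates: list of (input_descriptors, output_descriptor) pairs;
    upstream_outputs: determined output descriptor of each producer,
    in the same order as the node's inputs."""
    def total(cand):
        ins, _out = cand
        return sum(transfer_cost(u, i, tensor_size)
                   for u, i in zip(upstream_outputs, ins))
    return min(candidates, key=total)

# Two illustrative candidates for a node B and illustrative upstream outputs:
candidates = [
    ((("B", "S1"), ("S1", "S0")), ("S1", "P")),
    ((("S0", "B"), ("S1", "S1")), ("P", "S1")),
]
upstream = [("S0", "B"), ("S1", "S1")]
best = choose_signature(candidates, upstream, tensor_size=1.0)
print(best[1])  # the second candidate matches both inputs, total cost 0
```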
FIG. 3 is a partial schematic diagram illustrating an initial logical node topology of a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure. Fig. 3 is an enlarged schematic diagram of the relationship between nodes A, B, and E in the initial logical node topology 140 of fig. 2. As shown in fig. 3, assume that the SBP distributed descriptor of the output logic tensor of the determined multidimensional SBP distributed signature SBP-3 of logical node E is (S(1), S(0)); its first-dimension distribution descriptor is S(1) and its second-dimension distribution descriptor is S(0). If each split divides the tensor into two parts, each portion of the output logic tensor of logical node E described by this SBP distributed descriptor is 1/4 of the full tensor. Assume that the SBP distributed descriptor of the output logic tensor of the determined multidimensional SBP distributed signature SBP-5 of logical node A is (S(0), B); its first-dimension distribution descriptor is S(0) and its second-dimension distribution descriptor is B.
For the candidate multidimensional SBP distributed signature [(B, S(1)), (S(1), S(0))] → (S(1), P) of logical node B, the SBP descriptor of the first input logic tensor required by logical node B is (B, S(1)), which is inconsistent with the SBP descriptor (S(1), S(0)) of the output logic tensor of logical node E, and the SBP descriptor of the second input logic tensor required by logical node B is (S(1), S(0)), which is inconsistent with the SBP descriptor (S(0), B) of the output logic tensor of logical node A. Therefore, to meet the input logic tensor distribution requirements of this candidate signature, the logic tensor distribution at one input end must be converted from the SBP descriptor (S(1), S(0)) of the output logic tensor of logical node E to (B, S(1)), and the logic tensor distribution at the other input end must be converted from the SBP descriptor (S(0), B) of the output logic tensor of logical node A to (S(1), S(0)). During actual data processing, this conversion generates data exchange among the multiple same-level data processing devices that correspond to the dimension in question and belong to the same previous-level data processing device.
Fig. 4 illustrates a first schematic diagram in which the transmission cost calculation component of the multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure calculates the amount of data transmission generated between logic tensors with different two-dimensional SBP distribution descriptors. For the candidate multi-dimensional SBP distributed signature SBP-2 of task node B shown in fig. 3, as shown in fig. 4, the multi-level distributed data processing system includes first-level logical data processing device 0 and first-level logical data processing device 1; first-level logical data processing device 0 is connected to or includes two compute cards GPU0 and GPU1, and first-level logical data processing device 1 is connected to or includes two compute cards GPU2 and GPU3. The two-dimensional SBP distribution descriptor at the output end of logical node E distributes the output tensor as (S(1), S(0)) across the four cards, as shown above the solid black line in fig. 4. The two-dimensional SBP distribution descriptor at the corresponding input end of logical node B distributes the tensor as (B, S(1)) across the four cards, as shown below the solid black line in fig. 4. The shaded blocks in fig. 4 represent the logic tensors, and the dashed lines between the logic tensor values represent the required transmission directions.
To obtain the distribution (B, S(1)) of logical node B at each task node on GPUs 0-3, in addition to the corresponding tensor parts that can be acquired directly from local memory (this acquisition is shown by the solid arrows at GPU0 and GPU3), some of the logic tensors described by the descriptor (S(1), S(0)) of task node E and distributed on the other GPUs among GPUs 0-3 must also be acquired (this acquisition is shown by the dashed arrows). If the size of the logic tensor is T1, then the size of all the data transfers shown by the dashed lines is (3/2)T1. Therefore, transforming the (S(1), S(0))-descriptor logic tensor of task node E into the (B, S(1))-descriptor logic tensor at the input end of task node B has a total data transmission cost of (3/2)T1.
Fig. 5 illustrates a second schematic diagram in which the transmission cost calculation component of the multi-dimensional SBP distributed signature based multi-stage distributed data processing deployment system according to the present disclosure calculates the amount of data transmission generated between logic tensors with different two-dimensional SBP distribution descriptors. For the candidate multi-dimensional SBP distributed signature SBP-2 of task node B shown in fig. 3, as shown in fig. 5, the multi-level distributed data processing system includes first-level logical data processing device 0 and first-level logical data processing device 1; first-level logical data processing device 0 is connected to or includes two compute cards GPU0 and GPU1, and first-level logical data processing device 1 is connected to or includes two compute cards GPU2 and GPU3. The two-dimensional SBP distribution descriptor at the output end of logical node A distributes the output tensor as (S(0), B) across the four cards, as shown above the solid black line in fig. 5. The two-dimensional SBP distribution descriptor at the corresponding input end of logical node B distributes the tensor as (S(1), S(0)) across the four cards, as shown below the solid black line in fig. 5. The shaded blocks in fig. 5 represent the logic tensors, and the dashed lines between the logic tensor values represent the required transmission directions. In the following description, for convenience, a GPU is treated as equivalent to a logical data processing apparatus; a data processing apparatus may also comprise devices other than a GPU, such as a machine room, a server, a CPU, and the like.
To obtain the distribution (S(1), S(0)) of logical node B at each task node on GPUs 0-3, in addition to the corresponding tensor parts acquired directly from local memory (this acquisition is shown by the solid arrows at GPU0 and GPU3), some of the logic tensors described by the descriptor (S(0), B) of task node A and distributed on the other GPUs among GPUs 0-3 must also be acquired (this acquisition is shown by the dashed arrows). If the size of the logic tensor is T2, then the size of all the data transfers shown by the dashed lines is (1/2)T2. Therefore, transforming the (S(0), B)-descriptor logic tensor of task node A into the (S(1), S(0))-descriptor logic tensor at the input end of task node B has a total data transmission cost of (1/2)T2.
By summing the data transmission costs calculated for the first input end and the second input end, the total transmission cost required for the overall transformation can be obtained. Those skilled in the art can derive the transmission cost required for converting between logic tensors described by various combinations of SBP distribution descriptors according to the cost calculation method illustrated in figs. 4 and 5. For convenience of explanation, the size of the original logic tensor is denoted by T; that is, it is assumed throughout that T1 = T2 = T, without distinction. In the actual calculation process, however, the logic tensors at the input ends may differ according to the task configuration. For this reason, the user can first initialize the transformation costs caused by the various combinations according to the actual job description, forming a transmission cost table. Assuming that the sizes of the various initial logic tensors are the same, the applicant provides the following transmission cost lookup tables. It should be noted that those skilled in the art can obtain a specific transmission cost lookup table directly, in combination with the actual job description, according to the teachings of the present disclosure.
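The figures' numbers can be reproduced with a small simulation of the overlap rule (a sketch under stated assumptions: 2 first-level devices with 2 GPUs each, descriptors limited to S(0)/S(1)/B, and P not modeled). The tensor is modeled as a 4x4 grid of unit cells; each GPU fetches exactly the cells it needs but does not hold locally:

```python
from itertools import product

def cells(desc, part, rows, cols):
    """Restrict a cell region according to one S/B descriptor."""
    if desc == "S0":                       # split along axis 0 (rows)
        half = len(rows) // 2
        rows = rows[:half] if part == 0 else rows[half:]
    elif desc == "S1":                     # split along axis 1 (columns)
        half = len(cols) // 2
        cols = cols[:half] if part == 0 else cols[half:]
    # "B": broadcast keeps the whole region
    return rows, cols

def region(desc2d, group, local):
    """Cells held/needed by GPU (group, local) under a 2-D descriptor."""
    rows, cols = tuple(range(4)), tuple(range(4))
    rows, cols = cells(desc2d[0], group, rows, cols)   # level-1 placement
    rows, cols = cells(desc2d[1], local, rows, cols)   # level-2 placement
    return {(r, c) for r in rows for c in cols}

def transfer_cost(src2d, dst2d, T=1.0):
    """Total volume fetched from other GPUs, as a fraction of T."""
    moved = 0
    for group, local in product(range(2), range(2)):
        needed = region(dst2d, group, local)
        held = region(src2d, group, local)
        moved += len(needed - held)        # cells not available locally
    return moved * T / 16                  # 16 unit cells in the 4x4 grid

print(transfer_cost(("S1", "S0"), ("B", "S1")))   # Fig. 4 case
print(transfer_cost(("S0", "B"), ("S1", "S0")))   # Fig. 5 case
```

With T = 1 this reproduces the (3/2)T cost of Fig. 4 and the (1/2)T cost of Fig. 5.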
Tables 1-6 below list the transmission costs for transforming from the two-dimensional distributed descriptor of each row to the two-dimensional distributed descriptor of each column. The data transmission cost required to convert between two logic tensors can be obtained by directly querying these tables.
Each lookup table corresponds to one calculation condition. Those skilled in the art can, according to the teachings of the present disclosure, adapt the tables to the physical topology formed by the multi-stage distributed data processing system, the actual job configuration, and the logical relationships of the actual computing devices, so as to initialize a correct lookup table for actual operation. In addition, the user can normalize the data blocks (BLOBs) according to the actual job, forming a normalized weight for each specific data block. That is, when the lookup table is initialized by calculating the transmission cost, the table can be used to generate a transmission coefficient, and the transmission cost is obtained by multiplying this coefficient by the data amount T, thereby simplifying the lookup table.
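The coefficient-based lookup described above might be organized as follows (a sketch; the two nonzero entries are the 2x2-hierarchy values worked out for Figs. 4 and 5, and the table and key layout are assumptions, not the patent's format):

```python
# Transmission-coefficient table keyed by (source, destination) 2-D SBP
# descriptor pairs; cost = coefficient * T. Entries here are illustrative
# values for the 2-device x 2-GPU hierarchy of Figs. 4 and 5, initialized
# once (e.g., at job setup) and then only queried.
COEFF_TABLE = {
    (("S1", "S0"), ("B", "S1")): 1.5,   # Fig. 4: (3/2)T
    (("S0", "B"), ("S1", "S0")): 0.5,   # Fig. 5: (1/2)T
    (("S0", "B"), ("S0", "B")): 0.0,    # identical descriptors: no transfer
}

def lookup_cost(src2d, dst2d, T):
    """Query the table; signal a missing entry instead of guessing."""
    coeff = COEFF_TABLE.get((src2d, dst2d))
    if coeff is None:
        raise KeyError(f"coefficient for {src2d}->{dst2d} not initialized")
    return coeff * T

print(lookup_cost(("S1", "S0"), ("B", "S1"), T=64.0))  # 1.5 * 64 = 96.0
```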
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
Conditions: i != j, k != l (i is not equal to j and k is not equal to l).
T is the total data amount of the data block; when calculating the transmission cost, the table generates a transmission coefficient, and the transmission cost is obtained by multiplying this coefficient by the data amount T.
The first split is across n first-level machines and the second split is across m second-level machines.
Sender and receiver are arranged on the same machine cluster.
Calculation method: when P is not involved, compute the overlapping part between sender and receiver (the overlap is symmetric in this case) and subtract it from the total data amount required by the receiver.
When P is involved: if P is at the receiver, P can always take the most similar part (when the sender is S) or a sub-part (when the sender is B) from the same device as the sender, and the transmission cost is minimized by assigning 0 elsewhere.
If P is on the sending side, P is transmitted through a ReduceScatter operation.
TABLE 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
Conditions are as follows: t is the total data size of the data block
The first time is divided into n first-stage machines
Second fractionation of m two-stage machines
Arranging on the same cluster of machines
TABLE 3
[Table 3 is provided as an image in the original publication and is not reproduced here.]
Conditions: i = j, k != l (i equals j and k does not equal l).
Grey cells indicate entries that are completely unchanged relative to Table 1.
T is the total data amount of the data block; when calculating the transmission cost, the table generates a transmission coefficient, and the transmission cost is obtained by multiplying this coefficient by the data amount T.
The first split is across n first-level machines and the second split is across m second-level machines.
Sender and receiver are arranged on the same machine cluster.
Calculation method: when P is not involved, compute the overlapping part between sender and receiver (the overlap is symmetric in this case) and subtract it from the total data amount required by the receiver.
When P is involved: if P is at the receiver, P can always take the most similar part (when the sender is S) or a sub-part (when the sender is B) from the same device as the sender, and the transmission cost is minimized by assigning 0 elsewhere.
If P is on the sending side, P is transmitted through a ReduceScatter operation.
TABLE 4
[Table 4 is provided as an image in the original publication and is not reproduced here.]
Conditions: i != j, k = l (i is not equal to j and k is equal to l).
Grey cells indicate entries that are completely unchanged relative to Table 1.
T is the total data amount of the data block; when calculating the transmission cost, the table generates a transmission coefficient, and the transmission cost is obtained by multiplying this coefficient by the data amount T.
The first split is across n first-level machines and the second split is across m second-level machines.
Sender and receiver are arranged on the same machine cluster.
Calculation method: when P is not involved, compute the overlapping part between sender and receiver (the overlap is symmetric in this case) and subtract it from the total data amount required by the receiver.
When P is involved: if P is at the receiver, P can always take the most similar part (when the sender is S) or a sub-part (when the sender is B) from the same device as the sender, and the transmission cost is minimized by assigning 0 elsewhere.
TABLE 5
[Table 5 is provided as an image in the original publication and is not reproduced here.]
Conditions: i = j, k = l (i equals j and k equals l).
Grey cells indicate entries that are completely unchanged relative to Table 1.
T is the total data amount of the data block; when calculating the transmission cost, the table generates a transmission coefficient, and the transmission cost is obtained by multiplying this coefficient by the data amount T.
The first split is across n first-level machines and the second split is across m second-level machines.
Sender and receiver are arranged on the same machine cluster.
Calculation method: when P is not involved, compute the overlapping part between sender and receiver (the overlap is symmetric in this case) and subtract it from the total data amount required by the receiver.
When P is involved: if P is at the receiver, P can always take the most similar part (when the sender is S) or a sub-part (when the sender is B) from the same device as the sender, and the transmission cost is minimized by assigning 0 elsewhere.
If P is on the sending side, P is transmitted through a ReduceScatter operation.
TABLE 6
[Table 6 is provided as an image in the original publication and is not reproduced here.]
Conditions are as follows: it is no longer important whether i is equal to j, k is equal to l
T is the total data size of the data block, when the transmission cost is calculated, a table can be used for generating a transmission coefficient, and the coefficient is multiplied by the data size T to obtain the transmission cost
On the conveying machine cluster, n primary machines are classified for the first time, and m secondary machines are classified for the second time
On the receiving machine cluster, N primary machines are classified for the first time, and M secondary machines are classified for the second time
Arranging on the same cluster of machines
The amount of transmission being equal to the amount of data to be accepted when the transmitting party does not have P
When the transmitting side has P, the ReduceScatter operation is firstly carried out locally, and then the P is transmitted to the receiving side.
Of course, the effect is equivalent to the corresponding data blocks being directly summed at the receiver (s (j), s (l)), and then additionally broadcast if necessary
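The sender-side ReduceScatter mentioned above can be illustrated with a toy NumPy sketch (an illustration only, not the patent's implementation): each of m local devices holds a partial-value (P) tensor; after reduce-scatter, device k holds the k-th shard of the element-wise sum, which is then what gets transmitted onward.

```python
import numpy as np

def reduce_scatter(partials):
    """partials: list of m equal-shape partial tensors (one per device).
    Returns m shards: device k gets the k-th row-block of the summed tensor.
    (A real ReduceScatter exchanges shards pairwise; here we simulate only
    the end state, not the communication pattern.)"""
    m = len(partials)
    total = np.sum(partials, axis=0)        # element-wise reduction over P
    return np.array_split(total, m, axis=0) # scatter shards across devices

rng = np.random.default_rng(0)
parts = [rng.standard_normal((4, 2)) for _ in range(2)]  # 2 local devices
shards = reduce_scatter(parts)
# Re-assembling the shards recovers the full sum of the partial tensors:
assert np.allclose(np.concatenate(shards, axis=0), parts[0] + parts[1])
```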
Therefore, in summary, when no lookup table exists, the transmission cost calculation component 150 directly calculates the transmission cost for the SBP distribution descriptors at the input ends of each candidate multidimensional SBP distributed signature. After a lookup table has been formed by performing the initialization calculation based on the job description through the transmission cost calculation component 150, jobs of the same type can directly query the table to obtain the multidimensional SBP distributed signature corresponding to the minimum transmission cost of the current logical node. In this sense, the transmission cost calculation component 150 comprises a transmission cost query component; the transmission cost query component is also referred to as the transmission cost calculation component 150, and references below to the transmission cost query component 150 likewise refer to the transmission cost calculation component 150.
The above is merely exemplified by a two-dimensional SBP distributed signature. One skilled in the art, based on the teachings of the present disclosure, may calculate the transmission cost for other multidimensional SBP distributed signatures. This is not repeated herein.
Subsequently, the result logical node topology generation component 160 obtains, based on the results queried by the transmission cost query component 150, the sum of the required transmission costs over all input ends of each candidate multidimensional SBP distributed signature of the current initial logical node, and selects the candidate with the smallest total transmission cost as the determined multidimensional SBP distributed signature of the current initial logical node, thereby obtaining a current result logical node with a determined multidimensional SBP distributed signature and, ultimately, the result logical node topology 170.
Optionally, to perform actual distributed data processing, the computation graph generating component 180 of the multi-level distributed data processing deployment system 100 generates the task logic computation graph based on the result logical node topology formed by the result logical nodes with determined multidimensional SBP distributed signatures. Whenever the multidimensional SBP distribution descriptor at an input end of a current result logical node differs from the multidimensional SBP distribution descriptor at the output end of the corresponding upstream logical node, a transform computation node is inserted between the input end of each computation node corresponding to the current result logical node and the output end of each computation node corresponding to that upstream logical node, so as to transform the logic tensor described by the multidimensional SBP distribution descriptor at the output end of the upstream node's computation nodes into the logic tensor described by the multidimensional SBP distribution descriptor required at the input end, to be fed to the corresponding input end of each computation node of the current result logical node. This finally completes the deployment of the multi-level distributed data processing onto the actual multi-level distributed data processing system.
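The insertion rule can be sketched as an edge-rewriting pass (hypothetical node and function names; a sketch, not the patent's API): every edge whose producer output descriptor differs from the consumer input descriptor gets a transform node spliced in.

```python
def insert_transform_nodes(edges):
    """edges: list of (producer, out_desc, consumer, in_desc) tuples.
    Returns a new edge list with transform nodes spliced in on mismatches."""
    result = []
    for prod, out_d, cons, in_d in edges:
        if out_d == in_d:
            result.append((prod, cons))          # descriptors agree: keep edge
        else:
            t = f"transform[{out_d}->{in_d}]"    # illustrative node name
            result.append((prod, t))             # producer -> transform
            result.append((t, cons))             # transform -> consumer
    return result

edges = [
    ("A", ("S0", "B"), "B", ("S0", "B")),    # match: no transform node
    ("E", ("S1", "S0"), "B", ("S1", "S1")),  # mismatch: transform inserted
]
rewritten = insert_transform_nodes(edges)
print(rewritten)
```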
FIG. 6 is a schematic diagram illustrating another embodiment of a multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system according to the present disclosure. The embodiment of fig. 6 differs from that of fig. 2 as follows: based on the task configuration data received from the user, when the actual computing resources of a logical data processing device are less than the computing resources required for the logic tensors at its input ends and the resulting logic tensor, an upper-level logical data processing device is arranged above the logical level to which that device belongs, serving as a time-dimension logical data processing device. In each candidate multidimensional SBP distributed signature, the SBP distribution descriptor of the dimension corresponding to the time-dimension logical data processing device is a time-dimension distribution descriptor. The time-dimension distribution descriptors belonging to the same candidate multidimensional SBP distributed signature include split logic tensor descriptors and broadcast logic tensor distribution descriptors, where the number of splits attached to a split logic tensor descriptor is the same as the number of broadcasts attached to a broadcast logic tensor distribution descriptor. By splitting a large logic tensor into multiple small logic tensors, this eliminates the need for computing devices with large memory. Since the time-dimension logic tensor descriptors carry the number of splits and the number of broadcasts, the time-dimension distribution descriptor in a multidimensional SBP distributed signature can be identified.
Based on this, the computation graph generating component 180 includes a time dimension identifying unit 181, which acquires the level or dimension position of the time-dimension distribution descriptor as well as the number of tensor splits and the number of broadcasts. Thus, when a multidimensional SBP distributed signature contains a time-dimension distribution descriptor, the computation graph generating component 180 inserts a split computation node before the corresponding input end of the computation node of the current logical node based on the split logic tensor descriptor, inserts a repeat-broadcast computation node before the input end of the computation node of the current logical node based on the broadcast logic tensor distribution descriptor, and inserts an aggregate computation node after the output end of the computation node of the current logical node.
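A minimal sketch of this time-dimension expansion (the names and the op-list representation are illustrative assumptions, not the patent's format): a nonzero attached count n turns each split-descriptor input into a split node, each broadcast-descriptor input into a repeat node, runs the compute node n times in sequence, and aggregates the outputs.

```python
def expand_time_dimension(node, input_descs, n_splits):
    """input_descs: per-input time-dimension descriptor ('S' or 'B');
    n_splits: attached count (0 means no time-dimension expansion).
    Returns the sequence of ops generated for this node."""
    if n_splits == 0:
        return [node]                          # no time dimension: node as-is
    ops = []
    for i, d in enumerate(input_descs):
        if d == "S":
            ops.append(f"split(in{i}, {n_splits})")   # slice into n pieces
        else:  # "B"
            ops.append(f"repeat(in{i}, {n_splits})")  # rebroadcast n times
    ops.append(f"{node} x{n_splits}")                 # n sequential runs
    ops.append(f"gather(out, {n_splits})")            # aggregate the results
    return ops

# Node N3 with a split first input, a broadcast second input, count 3:
print(expand_time_dimension("N3", ["S", "B"], 3))
```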
Illustrated in fig. 7 is an example of transforming a logical node topology into a computation graph using the multi-dimensional SBP distributed signature based multi-level distributed data processing deployment system of fig. 6. As shown on the left side of fig. 7, if the computing resources of one of the computing devices assigned to logical node N3 are sufficient for the first logic tensor 1, the second logic tensor 2, and the resulting logic tensor 3, then a low-dimensional candidate SBP distributed signature may be employed directly. If, however, the computing resources of that device do not satisfy the resources required for the first and second logic tensors 1 and 2 and the resulting logic tensor 3, the two-dimensional candidate SBP distributed signature [(S(1), S(1)), (S(0), B)] → (P, S(1)) reduces the demand on the computing device assigned to logical node N3 by further dividing one of the input logic tensors (e.g., the first logic tensor 1), so that the device's computing resources suffice for processing the further-divided logic tensor. To this end, the present disclosure provides a deployment system as shown in fig. 6. For example, referring to fig. 7, the device hierarchy setting component 110 may obtain the computing resources, such as memory resources, required for the first logic tensor 1, the second logic tensor 2, and the resulting logic tensor 3 when logical node N3 is processed, and at the same time obtain the actual computing resources that the computing device on which node N3 is to be deployed (e.g., a certain GPU, CPU, or server) is capable of providing. As previously described, if the deployed computing device is sufficient for the computing resources required to process all input and output logic tensors of the current logical node N3, no time-dimension-level logical data processing device is added. If the deployed computing devices are not sufficient for the computing resources required to process all the input logic tensors and output logic tensors (i.e., the resulting logic tensors) of the current logical node N3, the device hierarchy setting component 110 sets an upper-level logical data processing device, as a time-dimension logical data processing device, above the logical level to which the logical data processing device belongs, thereby generating a time-dimension distribution descriptor in the corresponding multidimensional SBP distributed signature at deployment time. For logical nodes that do not need time-dimension distribution, the attached number of splits or broadcasts in the time-dimension descriptors of their multidimensional SBP distributed signatures is zero, meaning that although such a signature carries time-dimension descriptors, no corresponding computation nodes need to be generated for them in the subsequent computation graph generation.
For convenience of description, the time-dimension descriptors of the multi-dimensional SBP distributed signatures of logical nodes that do not need time-dimension distribution are omitted in fig. 7, and the distribution descriptors of the output tensors of logical nodes N1 and N2, for example, are written in a one-dimensional manner. Written out in full, the descriptor S(1) of the output tensor of logical node N1 should be (S(1), S(1)^0) and the descriptor S(0) of the output tensor of logical node N2 should be (S(0), B^0), where the superscript 0 indicates that logical nodes N1 and N2 themselves need no split nodes or broadcast nodes formed before their corresponding computation nodes. Note that in the multi-dimensional SBP distributed signature of logical node N3, [(S(1), S(1)), (S(0), B)] → (P, S(1)), the distribution descriptors of the second dimension carry a division count and a broadcast count; specifically, the signature may be [(S(1), S(1)^3), (S(0), B^3)] → (P, S(1)^3), but the counts are not shown in fig. 7 for ease of illustration. In general, each SBP descriptor of the time dimension of an SBP distributed signature carries such a predetermined division or broadcast count. 
For example, each SBP descriptor S(1) of the descriptor pair (S(1), S(1)) of the first sliced logic tensor 1 carries the number of computing devices deployed in parallel or the predetermined number of slices the tensor is divided into. In the descriptor pair (S(0), B) of the second logic tensor 2, the first-dimension SBP descriptor S(0) carries the number of computing devices deployed in parallel or the predetermined number of slices, and the second-dimension SBP descriptor B carries the predetermined number of times the second logic tensor 2 is to be repeatedly broadcast. The predetermined count attached to the second-dimension SBP descriptor S(1) of the first logic tensor 1 is equal to the predetermined count attached to the SBP descriptor B of the descriptor pair (S(0), B) of the second logic tensor 2, and, correspondingly, in the descriptor pair (P, S(1)) of the resulting sliced logic tensor 3, the predetermined count attached to the second-dimension SBP descriptor S(1) is also equal to the predetermined count attached to the SBP descriptor B of the descriptor pair (S(0), B) of the second logic tensor 2.
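A minimal data model of such a multi-dimensional SBP descriptor, together with the count-matching invariant stated above, could look like the following. The class and field names are assumptions made for illustration; the disclosure does not prescribe this representation.

```python
# Hypothetical data model of a multi-dimensional SBP descriptor whose
# time-dimension entries carry a predetermined split/broadcast count, plus a
# check that all nonzero time-dimension counts in one signature agree.

from dataclasses import dataclass

@dataclass(frozen=True)
class Sbp:
    kind: str          # 'S' (split), 'B' (broadcast) or 'P' (partial)
    axis: int = -1     # split axis, meaningful only for 'S'
    count: int = 0     # device count / split count or broadcast repeat count

def time_counts_consistent(signature):
    """signature: per-tensor descriptor tuples whose last entry is the time
    dimension; a count of 0 means no time-dimension distribution."""
    counts = {d[-1].count for d in signature if d[-1].count > 0}
    return len(counts) <= 1

# The signature [(S(1), S(1)^3), (S(0), B^3)] -> (P, S(1)^3) from the text:
t1 = (Sbp('S', 1, 4), Sbp('S', 1, 3))
t2 = (Sbp('S', 0, 4), Sbp('B', count=3))
out = (Sbp('P', count=4), Sbp('S', 1, 3))
print(time_counts_consistent([t1, t2, out]))  # True: all time counts are 3
```

The invariant encodes the rule that the split count attached to S(1) of tensor 1 and the repeat count attached to B of tensor 2 must be equal, as must the count on the resulting tensor's time-dimension descriptor.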
It should be noted that, when selecting the SBP descriptors of the second (or last) dimension, it is usually preferable to select the split (parallel distribution) descriptor S for the largest of the input tensors and the broadcast descriptor B for the other input tensors. The input tensor selected for splitting may be a data tensor or a model tensor.
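The stated heuristic can be sketched in a few lines. This is an illustrative reading of the rule, with invented names and sizes: split the largest input, broadcast the rest.

```python
# Hedged sketch of the selection heuristic above: in the last (time)
# dimension, assign the split descriptor 'S' to the largest input tensor and
# the broadcast descriptor 'B' to all other inputs.

def choose_time_descriptors(input_sizes):
    """Return one 'S'/'B' descriptor kind per input tensor."""
    largest = max(range(len(input_sizes)), key=lambda i: input_sizes[i])
    return ['S' if i == largest else 'B' for i in range(len(input_sizes))]

print(choose_time_descriptors([900, 300]))  # ['S', 'B'] -> split tensor 1
```

Splitting the largest tensor gives the biggest reduction in per-step memory, which is why tensor 1 receives S(1) and tensor 2 receives B in the example of fig. 7.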
The computation graph generating component 180 of the multi-level distributed data processing deployment system 100 generates a computation graph including division nodes and broadcast nodes based on the recognition result of the time dimension recognition unit 181. Each task logic node forms a number of computation nodes corresponding to the number of distributed or parallel computing devices, and where the distribution descriptor of an input tensor of the computation node corresponding to the current logical node does not correspond to the distribution descriptor of the output tensor of the upstream computation node, a transformation computation node must be inserted. For example, as shown in the right part of fig. 7, computation node N4 is inserted between computation nodes N1 and N3, so as to divide the logic tensor 1, whose distribution descriptor as output by computation node N1 is S(1), into the first sliced logic tensors 1 with descriptor pair (S(1), S(1)); likewise, computation node N5 is inserted between computation nodes N2 and N3 of fig. 7, thereby transforming the tensor whose distribution descriptor as output by computation node N2 is P into the second logic tensor 2 with descriptor pair (P, B). 
Specifically, computation node N4 is a division computation node: when processing the logic tensor 1 with distribution descriptor S(1) output by computation node N1, it continues to perform division processing (UNPACK) on dimension 1 of the tensor, dividing logic tensor 1 into a predetermined number of first sliced logic tensors 1 that embody the distribution described by the descriptor pair (S(1), S(1)), and outputs these first sliced logic tensors one by one to computation node N3 (corresponding to logical node N3). Computation node N5 is a repeat broadcast output node: when processing the second logic tensor 2 output by computation node N2, it performs repeated broadcast output processing (REPEAT) on the second logic tensor 2, the predetermined number of repetitions being equal to the predetermined number of first sliced logic tensors 1. Therefore, when computation node N3 performs processing, the actual tensors processed at each step are a first sliced logic tensor 1 and the second logic tensor 2, and the output obtained is a resulting sliced logic tensor 3 rather than the full resulting logic tensor 3. As a result, the computing resources required by computation node N3 when actually processing the first sliced logic tensors 1 and the second logic tensor 2 and obtaining the resulting sliced logic tensors 3 are much smaller than those required to process the full first and second logic tensors 1 and 2 and obtain the full resulting logic tensor 3, so that the computing resources of the computing device on which computation node N3 is deployed can satisfy the computing resources required for the actual computation, reducing the need for a high-cost computing device.
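The cooperation of the division node, the repeat broadcast node, and the downstream aggregation node can be simulated as a toy pipeline. This is an assumption-laden sketch, not the patent's code: the operation performed by N3 is replaced here with a simple per-slice dot product so that the data flow (one slice plus the repeated tensor per step, partial results accumulated afterwards) is visible.

```python
# Illustrative simulation of the UNPACK / REPEAT / accumulate pipeline around
# compute node N3. Tensor 1 is unpacked into n slices, tensor 2 is re-emitted
# n times, N3 processes one (slice, tensor 2) pair per step, and the per-step
# results are accumulated downstream. The op at N3 is a made-up stand-in.

def unpack(tensor, n):                 # node N4: emit slices one by one
    step = len(tensor) // n
    for i in range(n):
        yield tensor[i * step:(i + 1) * step]

def repeat(tensor, n):                 # node N5: re-emit the same tensor
    for _ in range(n):
        yield tensor

def accumulate(partials):              # node N6: fold sliced results
    total = 0
    for p in partials:
        total += p
    return total

tensor1, tensor2, n = [1, 2, 3, 4, 5, 6], [10, 20], 3
# Stand-in for N3: dot product of one slice of tensor 1 with tensor 2.
partials = (sum(a * b for a, b in zip(s, t))
            for s, t in zip(unpack(tensor1, n), repeat(tensor2, n)))
print(accumulate(partials))  # 50 + 110 + 170 = 330
```

At any moment, N3 holds only one slice of tensor 1 plus tensor 2, rather than all of tensor 1, which is the memory saving the time dimension provides.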
In addition, after computation node N3, computation node N6 needs to be inserted. Computation node N6 is an aggregation computation node: it performs aggregation processing (accumulation) on the resulting sliced logic tensors 3 output by computation node N3, accumulating them one by one into the resulting logic tensor 3.
Although the above describes the general case of how to determine the final SBP distributed signature among several candidate SBP distributed signatures, in some specific cases, under a special configuration or explicit specification by the user, certain logical nodes have only the user-specified SBP distributed signature, so that the logical nodes downstream of them will make their SBP distributed signature determination based on such specially specified upstream logical nodes.
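The cost-based determination described above, in which each candidate signature is scored by summing the transmission costs of all its inputs against the upstream output descriptors and the cheapest candidate wins, can be sketched as follows. The cost-table values here are invented purely for illustration; the actual transmission cost conversion table of the disclosure is not reproduced.

```python
# Hedged sketch of minimum-transmission-cost signature selection. The cost
# table maps (upstream output descriptor, candidate input descriptor) pairs
# to a transmission cost; all numbers below are hypothetical.

COST = {('S(0)', 'S(0)'): 0, ('S(0)', 'B'): 8, ('S(0)', 'P'): 5,
        ('P', 'S(1)'): 4, ('P', 'B'): 9, ('S(1)', 'S(1)'): 0}

def pick_signature(upstream_outputs, candidates):
    """candidates: list of (input_descriptors, output_descriptor) pairs;
    return the candidate whose summed input transmission cost is smallest."""
    def total(cand):
        ins, _ = cand
        return sum(COST[(u, i)] for u, i in zip(upstream_outputs, ins))
    return min(candidates, key=total)

cands = [(('S(0)', 'S(1)'), 'P'), (('B', 'S(1)'), 'S(0)')]
best = pick_signature(('S(0)', 'P'), cands)
print(best)  # (('S(0)', 'S(1)'), 'P'): total cost 0 + 4 beats 8 + 4
```

A user-specified signature corresponds to the degenerate case of a single-element candidate list: downstream nodes still run the same selection, but against the fixed upstream descriptor.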
With the multi-stage distributed data processing deployment system based on the multi-dimensional SBP distributed signature, on the one hand, the amount of data exchanged between different data processing devices during data processing can be minimized from a global perspective, thereby reducing the overhead generated during data interaction, effectively reducing the adverse effect of data exchange on the actual computation, and reducing the waiting time of operations, thus accelerating data processing. More importantly, under large-model and large-scale data processing requirements, the demand on the single-card computing resources of the data processing equipment is reduced, and the required hardware cost is therefore reduced. On the other hand, parallel deployment can be carried out automatically; in particular, the same data processing effect can be achieved automatically in cases of hybrid parallel deployment that would otherwise require manual intervention. Moreover, when a large logic tensor must be processed locally and the data processing device on which it is deployed cannot provide the computing resources its processing requires, the multi-level distributed data processing deployment system based on the multi-dimensional SBP distributed signature eliminates the need to increase the computing resources of the data processing device on account of that local large logic tensor. If the computing resources of the data processing apparatus were increased for the processing of the local large logic tensor, the added resources would mostly sit idle, which also results in a waste of computing resources.
The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that those skilled in the art will understand that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or a combination thereof, using their basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (6)

1. A multi-level distributed data processing deployment system comprising:
a device hierarchy setting component that sets a plurality of logical data processing devices as at least two levels of parallel logical data processing devices and specifies a logical hierarchical relationship between each other, thereby determining the number of dimensions of the SBP distributed signature, wherein each upper-level logical data processing device contains or is connected with the same number of lower-level data processing devices;
a position mark acquisition component for acquiring position marks of all the logic data processing devices;
an initial logic node topological graph generating component, which generates an initial logic node topological graph for the multi-level distributed data processing system based on task configuration data received from a user, wherein each initial logic node is attached with one or more candidate multi-dimensional SBP distributed signatures and position marks, and the descriptor of each dimension of the multi-dimensional SBP distribution descriptor at the input end and the output end of each multi-dimensional SBP distributed signature describes the distribution mode of a logic tensor on the logic data processing equipment of a corresponding level and each position mark indicates the logic data processing equipment where the logic tensor is deployed;
a transmission cost query component, for each current initial logic node, based on the multidimensional SBP distribution descriptor of the output end of the upstream logic node of each determined multidimensional SBP distributed signature and the multidimensional SBP distribution descriptor of each candidate multidimensional SBP distributed signature of the current initial logic node corresponding to the output end of the upstream logic node, querying a transmission cost conversion table to obtain the transmission cost between the multidimensional SBP distribution descriptor of the output end of the upstream logic node and the multidimensional SBP distribution descriptor of the corresponding input end of the current initial logic node;
a result logic node topological graph generation component, which acquires the required transmission cost sum of all input ends of each candidate multidimensional SBP distributed signature of the current initial logic node based on the transmission cost query component query result, so as to select the candidate multidimensional SBP distributed signature with the minimum transmission cost sum as the determined multidimensional SBP distributed signature of the current initial logic node, thereby acquiring the current result logic node with the determined multidimensional SBP distributed signature;
wherein the device hierarchy setting component, based on task configuration data received from a user, when the actual computing resources of a logical data processing device are less than the computing resources required for the input logic tensors and the resulting logic tensor of the logical data processing device, sets an upper-level logical data processing device above the logical hierarchy to which the logical data processing device belongs as a time-dimension logical data processing device, wherein the SBP distribution descriptor of the dimension corresponding to the time-dimension logical data processing device in each candidate multi-dimensional SBP distributed signature is a time-dimension distribution descriptor, the time-dimension distribution descriptors belonging to the same candidate multi-dimensional SBP distributed signature comprise a split logic tensor descriptor and a broadcast logic tensor distribution descriptor, and the division count attached to the split logic tensor descriptor is the same as the broadcast count attached to the broadcast logic tensor distribution descriptor.
2. The multi-level distributed data processing deployment system of claim 1, further comprising:
a computation graph generation component for generating a task logic computation graph based on a result logic node topological graph composed of the result logic nodes with the determined multi-dimensional SBP distributed signatures, and, when the multi-dimensional SBP distribution descriptor of one input end of the current result logical node is different from the multi-dimensional SBP distribution descriptor of the output end of the corresponding upstream logical node, inserting a transformation computation node between the input end of each computation node corresponding to the current result logical node and the output end of each computation node corresponding to the corresponding upstream logical node, so as to transform the logic tensor output by the output end of each computation node corresponding to the upstream logical node, as described by the multi-dimensional SBP distribution descriptor of the output end, into the logic tensor to be input at the corresponding input end of each computation node corresponding to the current result logical node, as described by the multi-dimensional SBP distribution descriptor of the input end.
3. The multi-level distributed data processing deployment system of claim 2, wherein the computation graph generation component inserts split computation nodes before corresponding inputs of computation nodes corresponding to the current logical node based on split logic tensor descriptors, inserts repeat broadcast computation nodes before corresponding inputs of computation nodes corresponding to the current logical node based on broadcast logic tensor distribution descriptors, and inserts aggregate computation nodes after corresponding outputs of computation nodes of the current logical node when the time dimension distribution descriptors are included in the multi-dimensional SBP distributed signature.
4. A multi-level distributed data processing deployment method comprises the following steps:
setting a plurality of logical data processing devices as at least two levels of parallel logical data processing devices and specifying a logical hierarchical relationship between each other by a device hierarchy setting component, thereby determining the number of dimensions of the SBP distributed signature, wherein each upper-level logical data processing device contains or is connected with the same number of lower-level data processing devices;
acquiring the position marks of all the logic data processing equipment through a position mark acquisition component;
generating an initial logic node topological graph for the multi-level distributed data processing system based on task configuration data input by a user through an initial logic node topological graph generating component, wherein each initial logic node is attached with one or more candidate multi-dimensional SBP distributed signatures and position marks, and the descriptor of each dimension of the multi-dimensional SBP distributed descriptor at the input end and the output end of each multi-dimensional SBP distributed signature describes the distribution mode of a logic tensor on the logic data processing equipment of a corresponding hierarchy and each position mark indicates the logic data processing equipment where the logic tensor is deployed;
inquiring, by a transmission cost inquiry component, for each current initial logic node, a transmission cost between the multi-dimensional SBP distribution descriptor at the output end of the upstream logic node and the multi-dimensional SBP distribution descriptor at the corresponding input end of the current initial logic node based on the multi-dimensional SBP distribution descriptor at the output end of the upstream logic node of each determined multi-dimensional SBP distributed signature of the current initial logic node and the multi-dimensional SBP distribution descriptor at the corresponding input end of the output end of the upstream logic node of each candidate multi-dimensional SBP distributed signature of the current initial logic node;
acquiring the required transmission cost sum of all input ends of each candidate multidimensional SBP distributed signature of the current initial logic node based on the transmission cost query component query result through a result logic node topological graph generating component, so as to select the candidate multidimensional SBP distributed signature with the minimum transmission cost sum as the determined multidimensional SBP distributed signature of the current initial logic node, thereby acquiring the current result logic node with the determined multidimensional SBP distributed signature;
wherein the device hierarchy setting component, based on task configuration data received from a user, when the actual computing resources of a logical data processing device are less than the computing resources required for the input logic tensors and the resulting logic tensor of the logical data processing device, sets an upper-level logical data processing device above the logical hierarchy to which the logical data processing device belongs as a time-dimension logical data processing device, wherein the SBP distribution descriptor of the dimension corresponding to the time-dimension logical data processing device in each candidate multi-dimensional SBP distributed signature is a time-dimension distribution descriptor, the time-dimension distribution descriptors belonging to the same candidate multi-dimensional SBP distributed signature comprise a split logic tensor descriptor and a broadcast logic tensor distribution descriptor, and the division count attached to the split logic tensor descriptor is the same as the broadcast count attached to the broadcast logic tensor distribution descriptor.
5. The multi-level distributed data processing deployment method of claim 4, further comprising:
generating a task logic computation graph based on a result logic node topological graph formed by the result logic nodes with the determined multidimensional SBP distributed signature through a computation graph generation component, and when the multi-dimensional SBP distribution descriptor of one input of the current result logical node is different from the multi-dimensional SBP distribution descriptor of the output of the corresponding upstream logical node, inserting a transformed compute node between the input of each current compute node corresponding to the current result logical node and the output of each compute node corresponding to the corresponding upstream logical node, so as to transform the logic tensor output by the output end of each computation node corresponding to the upstream logic node and described by the multi-dimensional SBP distribution descriptor of the output end into the logic tensor to be input by the corresponding input end of each computation node corresponding to the current result logic node and described by the multi-dimensional SBP distribution descriptor of the input end.
6. The deployment method for multi-level distributed data processing according to claim 5, wherein the computation graph generation component inserts a split computation node before the corresponding input end of the computation node corresponding to the current logical node based on a split logic tensor descriptor, inserts a repeat broadcast computation node before the corresponding input end of the computation node corresponding to the current logical node based on a broadcast logic tensor distribution descriptor, and inserts an aggregate computation node after the corresponding output end of the computation node corresponding to the current logical node when the time dimension distribution descriptor is included in the multi-dimensional SBP distributed signature.
CN202110386635.XA 2021-04-12 2021-04-12 Multi-stage distributed data processing and deploying system and method thereof Active CN112764940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110386635.XA CN112764940B (en) 2021-04-12 2021-04-12 Multi-stage distributed data processing and deploying system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110386635.XA CN112764940B (en) 2021-04-12 2021-04-12 Multi-stage distributed data processing and deploying system and method thereof

Publications (2)

Publication Number Publication Date
CN112764940A CN112764940A (en) 2021-05-07
CN112764940B true CN112764940B (en) 2021-07-30

Family

ID=75691421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110386635.XA Active CN112764940B (en) 2021-04-12 2021-04-12 Multi-stage distributed data processing and deploying system and method thereof

Country Status (1)

Country Link
CN (1) CN112764940B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626652B (en) * 2021-10-11 2021-12-17 北京一流科技有限公司 Data processing network system, data processing network deployment system and method thereof
CN116166846B (en) * 2023-04-13 2023-08-01 广东广宇科技发展有限公司 Distributed multidimensional data processing method based on cloud computing
CN116227585B (en) * 2023-05-10 2023-07-25 之江实验室 Parallel execution method and device for cluster tasks, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532072A (en) * 2019-07-24 2019-12-03 中国科学院计算技术研究所 Distributive type data processing method and system based on Mach

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7493406B2 (en) * 2006-06-13 2009-02-17 International Business Machines Corporation Maximal flow scheduling for a stream processing system
EP3109778B1 (en) * 2015-06-25 2018-07-04 Commissariat A L'energie Atomique Et Aux Energies Alternatives Computer-implemented method of performing parallelized electronic-system level simulations
CN110928697B (en) * 2020-02-13 2020-05-22 北京一流科技有限公司 Topological graph conversion system and method
CN110955734B (en) * 2020-02-13 2020-08-21 北京一流科技有限公司 Distributed signature decision system and method for logic node
CN111708685B (en) * 2020-05-18 2022-08-09 福建天晴在线互动科技有限公司 Log acquisition monitoring method and system for distributed server

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532072A (en) * 2019-07-24 2019-12-03 中国科学院计算技术研究所 Distributive type data processing method and system based on Mach

Also Published As

Publication number Publication date
CN112764940A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112764940B (en) Multi-stage distributed data processing and deploying system and method thereof
CN110955734B (en) Distributed signature decision system and method for logic node
CN112464784A (en) Distributed training method based on hybrid parallel
CN111930519B (en) Parallel decision system and method for distributed data processing
CN111666151B (en) Topological graph conversion system and method thereof
US20120200580A1 (en) Synchronous parallel pixel processing for scalable color reproduction systems
CN111049900B (en) Internet of things flow calculation scheduling method and device and electronic equipment
CN111444309B (en) System for learning graph
KR101780534B1 (en) Method and system for extracting image feature based on map-reduce for searching image
CN111651507B (en) Big data processing method and system
CN117194047B (en) Distributed system based on data collaboration
JP6303032B2 (en) Method and apparatus for generating instances of technical indicators
CN115391170A (en) Parallel decision system and method for distributed data processing
CN115208954A (en) Parallel strategy presetting system and method for distributed data processing
Goldman et al. An efficient parallel algorithm for solving the knapsack problem on hypercubes
CN112948087A (en) Task scheduling method and system based on topological sorting
KR20170085396A (en) Feature Vector Clustering and Database Generating Method for Scanning Books Identification
Rim et al. An efficient dynamic load balancing using the dimension exchange method for balancing of quantized loads on hypercube multiprocessors
CN110018832B (en) Radar software component deployment strategy based on improved dynamic programming
Abubaker et al. Scalable unsupervised ml: Latency hiding in distributed sparse tensor decomposition
CN117544552A (en) Automatic construction system and method for data exchange communication path
Antola et al. Two-dimensional object recognition on parallel machines
CN118277040A (en) Deployment method of model training and electronic equipment
CN114491913A (en) Model processing method and related equipment
CN116433947A (en) Point cloud panorama segmentation method based on sparse instance proposal and aggregation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant